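Introduction to the Okapi BM25 Algorithm
Purpose
Given one or more search terms, such as "Intel, IBM, TSMC", compute a BM25 score for every document and pick the n most relevant documents from the collection. In other words, score everything, then take the Top n.
Background
A common strategy in text retrieval is to use a ranking function to sort all documents against the search terms and keep the top n, much as Google Search does.
The ranking function is the most important factor in retrieval quality. Okapi BM25 is a ranking function that performs well and is used by many search engines and search libraries, such as Lucene, Elasticsearch, and Solr.
Formula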
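For reference, the commonly cited textbook form of BM25 is sketched below. It is an assumption that this is the variant meant here. Q = q_1, ..., q_m is the query, f(q_i, D) is the number of times term q_i occurs in document D, |D| is the length of D in words, avgdl is the average document length in the collection, N is the number of documents, and n(q_i) is the number of documents containing q_i. k_1 and b are free parameters (values around k_1 = 1.2 to 2.0 and b = 0.75 are typical).

```latex
\[
\mathrm{score}(D, Q) = \sum_{i=1}^{m} \mathrm{IDF}(q_i)\,
\frac{f(q_i, D)\,(k_1 + 1)}
     {f(q_i, D) + k_1\left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)}
\]
\[
\mathrm{IDF}(q_i) = \ln\frac{N - n(q_i) + 0.5}{n(q_i) + 0.5}
\]
```

With this IDF, a term that appears in more than half of the documents gets a negative weight, so implementations differ in how they floor or adjust it.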
Version:
System version : Windows 10 64-bit
Python version : Python 3.6.0 :: Anaconda 4.3.1 (64-bit)
Code:
# Requires gensim < 4.0: the gensim.summarization module was removed in gensim 4.0
from gensim.summarization import bm25
from nltk.stem.porter import PorterStemmer
p_stemmer = PorterStemmer()
# The data below is taken from: https://www.businessinsider.com.au/bitcoin-futures-markets-unusual-behaviour-2018-2
article_row=[
'For the majority of bitcoin futures trading since their launch in December, the futures curves have been in steep contango, meaning that near-dated prices are below longer-dated prices, the analysts said',
'But according to Goldman, the recent price action in Bitcoin futures implied higher prices for longer-dated contracts — far in excess of the cost to borrow money.',
'Cboe futures contracts are also just based on one exchange — Gemini, run by the Winklevoss twins — whereas CME futures are based on a Bitcoin reference rate derived from an aggregate of major exchanges',
]
article_list =[]
for a in article_row:
    # Replace '?', '(' and ')' with spaces, then split on single spaces
    a_split = a.replace('?',' ').replace('(',' ').replace(')',' ').split(' ')
    # Stemming
    stemmed_tokens = [p_stemmer.stem(i) for i in a_split]
    article_list.append(stemmed_tokens)
query =['bitcoin','prices','futur','winklevoss']
query_stemmed = [p_stemmer.stem(i) for i in query]
print('query_stemmed :',query_stemmed )
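# BM25 model built from the tokenized, stemmed articles
bm25Model = bm25.BM25(article_list)
# Average inverse document frequency (IDF) over all terms in the corpus
average_idf = sum(map(lambda k: float(bm25Model.idf[k]), bm25Model.idf.keys())) / len(bm25Model.idf.keys())
# Score every article against the query; passing average_idf matches the gensim release this post was written against
scores = bm25Model.get_scores(query_stemmed,average_idf)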
print('scores :',scores)
Result:
query_stemmed : ['bitcoin', 'price', 'futur', 'winklevoss']
scores : [0.3029151141722034, 0.3029151141722034, 1.068730347708848]
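The third article scores highest, which fits the query: it is the only one that mentions the Winklevoss twins. The run above stops at the raw scores, so the "take the Top n" step from the Purpose section still has to be done by hand. The sketch below is one way to do it, together with a small from-scratch scorer for readers who cannot install an older gensim. Everything in it is an assumption for illustration: the names `bm25_scores` and `top_n`, the toy corpus, and the defaults k1 = 1.5 and b = 0.75 are hypothetical, and the IDF uses the "plus one inside the log" variant so scores never go negative. It is not meant to reproduce gensim's numbers.

```python
import math
from collections import Counter


def bm25_scores(corpus, query, k1=1.5, b=0.75):
    """Score every tokenized document in `corpus` against the tokenized `query`."""
    n_docs = len(corpus)
    avgdl = sum(len(doc) for doc in corpus) / n_docs
    # Document frequency: in how many documents each term appears
    df = Counter(term for doc in corpus for term in set(doc))
    scores = []
    for doc in corpus:
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # "+ 1" inside the log keeps the IDF non-negative (assumed variant)
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency, dampened by k1 and normalized by document length via b
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def top_n(scores, n):
    """Return the indices of the n highest scores, best first (ties keep document order)."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:n]


# Hypothetical toy corpus, already tokenized and stemmed
corpus = [
    ['bitcoin', 'futur', 'price'],
    ['bitcoin', 'price', 'contract'],
    ['futur', 'exchang', 'winklevoss', 'twin'],
]
query = ['winklevoss', 'futur']
print(top_n(bm25_scores(corpus, query), 2))

# Applied to the scores printed in the Result section above
result_scores = [0.3029151141722034, 0.3029151141722034, 1.068730347708848]
print(top_n(result_scores, 2))  # → [2, 0]
```

Because Python's sort is stable, the two tied articles keep their original order, so the first one is returned ahead of the second.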