Monday, February 12, 2018

Okapi BM25: Introduction and Python Implementation

Introduction to the Okapi BM25 Algorithm

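Purpose

Given one or more search terms, such as "Intel, IBM, TSMC", compute a BM25 score for every document and return the n most relevant ones. In other words, score every document and keep the top n.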
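
The "score, then take the top n" step can be sketched on its own, independent of how the scores are produced. The helper name `top_n` and the example scores below are hypothetical, chosen only for illustration.

```python
import heapq

def top_n(scores, n):
    """Return (document index, score) pairs for the n highest scores, best first."""
    # enumerate() pairs each score with its document index;
    # heapq.nlargest avoids sorting the whole list when n is small.
    return heapq.nlargest(n, enumerate(scores), key=lambda pair: pair[1])

# Hypothetical scores for five documents (illustrative values only).
example_scores = [0.12, 1.75, 0.0, 0.98, 0.40]
print(top_n(example_scores, 2))
```

Returning the index together with the score makes it easy to look the original document up again after ranking.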

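Background

A common strategy in text retrieval is to use a ranking function to order all documents by their relevance to the search terms and return the first n, much as Google Search does.
The ranking function largely determines retrieval quality. Okapi BM25 is a ranking function that performs well and is used by many search systems, such as Lucene, Elasticsearch, and Solr.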

Formula
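
For reference, the commonly cited form of the Okapi BM25 scoring function is sketched below. The notation is an assumption made here: $Q$ is the query with terms $q_1, \dots, q_n$, $D$ is a document, $f(q_i, D)$ is the frequency of $q_i$ in $D$, $|D|$ is the document length in tokens, $\text{avgdl}$ is the average document length in the corpus, $N$ is the number of documents, $n(q_i)$ is the number of documents containing $q_i$, and $k_1$ and $b$ are free parameters (values around $k_1 \in [1.2, 2.0]$ and $b = 0.75$ are typical).

```latex
\text{score}(D, Q) = \sum_{i=1}^{n} \text{IDF}(q_i) \cdot
  \frac{f(q_i, D)\,(k_1 + 1)}
       {f(q_i, D) + k_1 \left(1 - b + b\,\frac{|D|}{\text{avgdl}}\right)}

\text{IDF}(q_i) = \ln \frac{N - n(q_i) + 0.5}{n(q_i) + 0.5}
```

Note that this IDF becomes negative for a term that appears in more than half of the documents, so implementations usually adjust it in some way, for example by flooring the value or by adding 1 inside the logarithm.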

Version:

System version : Windows 10 64-bit
Python version : Python 3.6.0 :: Anaconda 4.3.1 (64-bit)

Code:

# NOTE: gensim.summarization (including bm25) was removed in gensim 4.0,
# so this listing needs a gensim 3.x release.
from gensim.summarization import bm25
from nltk.stem.porter import PorterStemmer

p_stemmer = PorterStemmer()

# Data source: https://www.businessinsider.com.au/bitcoin-futures-markets-unusual-behaviour-2018-2
article_row=[
    'For the majority of bitcoin futures trading since their launch in December, the futures curves have been in steep contango, meaning that near-dated prices are below longer-dated prices,” the analysts said',
    'But according to Goldman, the recent price action in Bitcoin futures implied higher prices for longer-dated contracts — far in excess of the cost to borrow money.',
    'Cboe futures contracts are also just based on one exchange — Gemini, run by the Winklevoss twins — whereas CME futures are based on a Bitcoin reference rate derived from an aggregate of major exchanges',
]

article_list =[]
for a in article_row:
    a_split = a.replace('?',' ').replace('(',' ').replace(')',' ').split(' ')
    # Stem each token with the Porter stemmer
    stemmed_tokens = [p_stemmer.stem(i) for i in a_split]  
    article_list.append(stemmed_tokens)

query =['bitcoin','prices','futur','winklevoss']
query_stemmed = [p_stemmer.stem(i) for i in query]  
print('query_stemmed :',query_stemmed )

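# Build the BM25 model from the tokenized, stemmed documents
bm25Model = bm25.BM25(article_list)
# Average inverse document frequency (IDF) over all terms in the corpus
average_idf = sum(float(v) for v in bm25Model.idf.values()) / len(bm25Model.idf)
# NOTE: passing average_idf matches the gensim release used here;
# the get_scores signature may differ in other gensim 3.x releases.
scores = bm25Model.get_scores(query_stemmed, average_idf)
print('scores :', scores)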

Result:


query_stemmed : ['bitcoin', 'price', 'futur', 'winklevoss']

scores : [0.3029151141722034, 0.3029151141722034, 1.068730347708848]
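
For readers who cannot install a gensim release that still ships `gensim.summarization`, the scoring step can be approximated with a dependency-free sketch. This is not gensim's implementation and is not expected to reproduce the numbers above: the function name `bm25_scores`, the toy corpus, the defaults `k1=1.5` and `b=0.75`, and the "+1" inside the IDF logarithm (a variant that keeps term weights non-negative) are all assumptions made for illustration.

```python
import math
from collections import Counter

def bm25_scores(corpus, query, k1=1.5, b=0.75):
    """Score every tokenized document in `corpus` against the `query` tokens."""
    n_docs = len(corpus)
    avgdl = sum(len(doc) for doc in corpus) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in corpus for term in set(doc))
    scores = []
    for doc in corpus:
        tf = Counter(doc)
        # Length normalisation: longer-than-average documents are penalised.
        length_norm = k1 * (1 - b + b * len(doc) / avgdl)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Assumed IDF variant: the "+1" keeps the weight non-negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores

# Hypothetical, already-stemmed toy corpus (not the article data above).
toy_corpus = [
    ["bitcoin", "futur", "price"],
    ["bitcoin", "price", "rise"],
    ["winklevoss", "twin", "run", "gemini"],
]
toy_scores = bm25_scores(toy_corpus, ["winklevoss", "price"])
print(toy_scores)
```

In this toy corpus the third document ranks first because "winklevoss" occurs in only one document and therefore carries a higher IDF weight than "price", which occurs in two.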
