Monday, February 12, 2018

Python - Removing Half-width and Full-width Punctuation From List Items

Version information:

System version : Windows 10 64-bit
Python version : Python 3.6.0 :: Anaconda 4.3.1 (64-bit)

Code:

tokens = ['abc',',.aaa','=','』','abc']
full_punctuation = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~' + '→↓△▿⋄•!??〞#$%&』()*+,-╱︰;<=>@〔╲〕 _ˋ{∣}∼、〃》「」『』【】﹝﹞【】〝〞–—『』「」…﹏'

# Method 1: drop a token only when the whole token occurs inside full_punctuation
punctuation_tokens_1 = [ele for ele in tokens if ele not in full_punctuation]
print('punctuation_tokens_1 :',punctuation_tokens_1)

# Method 2: delete every punctuation character inside each token
translator = str.maketrans('', '', full_punctuation)
punctuation_tokens_2 = [s.translate(translator) for s in tokens]
print('punctuation_tokens_2 :',punctuation_tokens_2)

# Method 3: additionally drop the tokens that method 2 left empty
punctuation_tokens_3 = [ele for ele in punctuation_tokens_2 if ele]
print('punctuation_tokens_3 :',punctuation_tokens_3)

Result:

punctuation_tokens_1 : ['abc', ',.aaa', 'abc']
punctuation_tokens_2 : ['abc', 'aaa', '', '', 'abc']
punctuation_tokens_3 : ['abc', 'aaa', 'abc']
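
The hand-written punctuation string above has to list every character explicitly. As an alternative sketch (not the method used in this post), the standard-library `unicodedata` module can classify characters instead. The helper name `strip_punctuation` is hypothetical, and treating both the punctuation (P*) and symbol (S*) categories as removable is an assumption, made because characters such as '=' are classified as symbols rather than punctuation.

```python
import unicodedata

def strip_punctuation(token):
    """Remove every character whose Unicode category is punctuation (P*) or symbol (S*)."""
    return ''.join(ch for ch in token
                   if unicodedata.category(ch)[0] not in ('P', 'S'))

tokens = ['abc', ',.aaa', '=', '』', 'abc']
# Strip each token, then keep only the tokens that are still non-empty
cleaned = [t for t in (strip_punctuation(tok) for tok in tokens) if t]
print(cleaned)  # ['abc', 'aaa', 'abc']
```

This handles half-width and full-width marks alike, but it also removes symbols such as currency signs, which may or may not be wanted.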

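Okapi BM25: Introduction and Python Implementation

Introduction to the Okapi BM25 Algorithm

Purpose

Given one or more search terms, such as "Intel, IBM, TSMC", compute a BM25 score for every article and pick the n most relevant documents; in other words, score all documents and take the top n.

Background

A common strategy in text retrieval is to use a ranking function to sort all documents against the search terms and return the first n, much as Google Search does.
The ranking function is the most important factor in retrieval quality. Okapi BM25 is a ranking function that performs well and is used by many search engines, such as Lucene, Elasticsearch, and Solr.

Formula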
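
The formula is not reproduced in the text, so what follows is a minimal sketch of the commonly cited BM25 scoring function, not the code used in this post. For each query term q that occurs in a document D, it adds IDF(q) * f(q, D) * (k1 + 1) / (f(q, D) + k1 * (1 - b + b * |D| / avgdl)), where f(q, D) is the term frequency, |D| is the document length in tokens, and avgdl is the average document length. The function name `bm25_scores`, the defaults k1 = 1.5 and b = 0.75, the non-negative IDF variant, and the toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every tokenized document in docs against the tokenized query."""
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term appears
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue  # a term absent from the document contributes nothing
            # Assumed IDF variant: log(1 + ...) keeps the value non-negative
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: longer-than-average documents are penalized
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    ['intel', 'chip'],
    ['ibm', 'cloud', 'server'],
    ['tsmc', 'chip', 'fab', 'chip'],
]
scores = bm25_scores(['chip'], docs)
print(scores)
```

The second document does not contain the query term, so its score is 0; the third scores highest because the term occurs twice in it.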

Version:

System version : Windows 10 64-bit
Python version : Python 3.6.0 :: Anaconda 4.3.1 (64-bit)

Code:

# Note: gensim.summarization was removed in gensim 4.0, so this code needs gensim 3.x
from gensim.summarization import bm25
from nltk.stem.porter import PorterStemmer
p_stemmer = PorterStemmer()

# Data taken from: https://www.businessinsider.com.au/bitcoin-futures-markets-unusual-behaviour-2018-2
article_row=[
    '"For the majority of bitcoin futures trading since their launch in December, the futures curves have been in steep contango, meaning that near-dated prices are below longer-dated prices," the analysts said',
    'But according to Goldman, the recent price action in Bitcoin futures implied higher prices for longer-dated contracts — far in excess of the cost to borrow money.',
    'Cboe futures contracts are also just based on one exchange — Gemini, run by the Winklevoss twins — whereas CME futures are based on a Bitcoin reference rate derived from an aggregate of major exchanges',
]

article_list = []
for a in article_row:
    # Replace '?', '(' and ')' with spaces, then split on single spaces
    a_split = a.replace('?',' ').replace('(',' ').replace(')',' ').split(' ')
    # Stemming
    stemmed_tokens = [p_stemmer.stem(i) for i in a_split]
    article_list.append(stemmed_tokens)

query = ['bitcoin','prices','futur','winklevoss']
query_stemmed = [p_stemmer.stem(i) for i in query]
print('query_stemmed :',query_stemmed)

# Build the BM25 model
bm25Model = bm25.BM25(article_list)
# Average inverse document frequency (IDF) over all terms
average_idf = sum(float(v) for v in bm25Model.idf.values()) / len(bm25Model.idf)
scores = bm25Model.get_scores(query_stemmed, average_idf)
print('scores :',scores)

Result:


query_stemmed : ['bitcoin', 'price', 'futur', 'winklevoss']

scores : [0.3029151141722034, 0.3029151141722034, 1.068730347708848]

Monday, February 5, 2018

Python - How to Loop With Indexes in Python

Version information:

System version : Windows 10 64-bit
Python version : Python 3.6.0 :: Anaconda 4.3.1 (64-bit)

Code:

list1 =['a','b','c']
for idx, val in enumerate(list1):
    print('index :',idx,' , value :', val)

Result:

index : 0 , value : a
index : 1 , value : b
index : 2 , value : c
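
Two common variations on the loop above, shown as a short sketch: `enumerate` accepts a `start` argument, and it can be combined with `zip` to walk two lists in parallel. The second list `list2` is made up for illustration.

```python
list1 = ['a', 'b', 'c']
list2 = ['x', 'y', 'z']  # hypothetical second list

# start=1 makes the counter begin at 1 instead of 0
numbered = list(enumerate(list1, start=1))
print(numbered)  # [(1, 'a'), (2, 'b'), (3, 'c')]

# enumerate combined with zip yields an index plus one value from each list
for idx, (left, right) in enumerate(zip(list1, list2)):
    print('index :', idx, ' , values :', left, right)
```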

Friday, February 2, 2018

Python - How to Create a GUID/UUID in Python

Version information:

System version : Windows 10 64-bit
Python version : Python 3.6.0 :: Anaconda 4.3.1 (64-bit)

Code:

import uuid
uuid1 = uuid.uuid4()
uuid2_str = str(uuid.uuid4())
print(uuid1)
print(type(uuid1))
print('-'*38)
print(uuid2_str)
print(type(uuid2_str))

Result:

562e06e3-4d0b-405d-bc51-d4011a3f0c52
<class 'uuid.UUID'>
--------------------------------------
2f46634d-69ef-45b7-833a-3bb528c2ae95
<class 'str'>
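
A short follow-up sketch showing the reverse direction and the hyphen-free form, using only the standard `uuid` module. The variable names are illustrative.

```python
import uuid

original = uuid.uuid4()
as_text = str(original)   # 36 characters, with hyphens
as_hex = original.hex     # 32 characters, no hyphens

# Parse the string back into a UUID object
parsed = uuid.UUID(as_text)

print(parsed == original)         # True
print(parsed.version)             # 4
print(len(as_text), len(as_hex))  # 36 32
```

Storing the string and parsing it back with `uuid.UUID` gives an object that compares equal to the original.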

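Visual Studio Code: Changing the UI Language - Traditional Chinese to English & English to Traditional Chinese

Version information:

System version : Windows 10 64-bit
Visual Studio Code version : 1.19.3

Traditional Chinese to English

Press Ctrl + Shift + P to open the command input box
In the input box, type 設定語言 (Configure Language) and press Enter
Change "locale":"zh-tw" to "locale":"en"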
Save the file and restart Visual Studio Code

English to Traditional Chinese

Press Ctrl + Shift + P to open the command input box
In the input box, type Configure Language and press Enter
Change "locale":"en" to "locale":"zh-tw"
Save the file and restart Visual Studio Code

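Thursday, February 1, 2018

Python - Information Retrieval Evaluation: Ranking Quality Evaluation (NDCG)

Information Retrieval Evaluation: Ranking Quality Evaluation (NDCG)

Version information:

System version : Windows 10 64-bit
Python version : Python 3.6.0 :: Anaconda 4.3.1 (64-bit)

Content:

NDCG (Normalized Discounted Cumulative Gain)
When the data is labeled with multiple relevance grades, NDCG can be used to evaluate a ranking.
The closer the computed NDCG is to 1, the better the ranking.
The NDCG formula and a simple example are given below:
i is the position in the new ranking
rel_i is the true relevance grade of the item at position i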
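
The formula is not reproduced in the text. Based on the variant this post's code adds (method 1, sum (2^rel_i - 1) / log2(i + 1)), it can be written as below; the notation IDCG_k for the DCG of the ideal ordering (grades sorted from highest to lowest) is an assumption about how the original figure was labeled.

```latex
\mathrm{DCG}_k = \sum_{i=1}^{k} \frac{2^{rel_i} - 1}{\log_2(i + 1)}
\qquad
\mathrm{NDCG}_k = \frac{\mathrm{DCG}_k}{\mathrm{IDCG}_k}
```

Method 0 in the code uses the same discount with the plain grade rel_i as the numerator instead of 2^{rel_i} - 1.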



Python code:

"""
Code adapted from:
https://gist.github.com/bwhite/3726239
https://gist.github.com/gumption/b54278ec9bab2c0e0472816d1d7663be
Difference: adds a version that uses "sum (2^rel_i - 1) / log2(i + 1)"
Author: Jie Dondon
Version: ndcg_dondon_20180201_v2
"""

import numpy as np

def dcg_at_k(r, k, method=0):
    """Score is discounted cumulative gain (dcg)
    Relevance is positive real values.  Can use binary as the previous methods.

    There is a typographical error on the formula referenced in the original definition of this function:
    http://www.stanford.edu/class/cs276/handouts/EvaluationNew-handout-6-per.pdf
    log2(i) should be log2(i+1)

    The formulas here are derived from
    https://en.wikipedia.org/wiki/Discounted_cumulative_gain#Discounted_Cumulative_Gain

    The formulas return the same results when r contains only binary values
    >>> r = [2,3,2,3,1,1]
    >>> dcg_at_k(r,6,1)
    12.674304175856518
    >>> r = [3,3,2,2,1,1]
    >>> dcg_at_k(r,6,1)
    14.951597943562946


    Args:
        r: Relevance scores (list or numpy array) in rank order
            (first element is the most relevant item)
        k: Number of results to consider
        method: If 0 then sum rel_i / log2(i + 1) [not log2(i)]
                If 1 then sum (2^rel_i - 1) / log2(i + 1)
    Returns:
        Discounted cumulative gain
    """
    # np.asfarray was removed in NumPy 2.0; np.asarray with a float dtype is equivalent
    r = np.asarray(r, dtype=float)[:k]
    if r.size:
        if method == 0:
            return np.sum(r / np.log2(np.arange(2, r.size + 2)))
        elif method == 1:
            return np.sum(np.subtract(np.power(2, r), 1) / np.log2(np.arange(2, r.size + 2)))
        else:
            raise ValueError('method must be 0 or 1')
    return 0.

def ndcg_at_k(r, k, method=0):
    """Score is normalized discounted cumulative gain (ndcg)
    Relevance is positive real values.  Can use binary
    as the previous methods.
    Example from

    2013-Introduction to Information Retrieval Evaluation p.3
    (http://www.stanford.edu/class/cs276/handouts/EvaluationNew-handout-6-per.pdf)
    ↑ The formula in this document is wrong, so the results it gives are wrong too
    2012 - Microsoft Research Asia - Wei Wu (武威) - Fundamentals of Machine Learning and Learning to Rank, p.36

    >>> r = [2, 1, 2, 0]
    >>> ndcg_at_k(r,4,0)
    0.96519546960144276
    >>> r = [2,3,2,3,1,1]
    >>> ndcg_at_k(r,6,1)
    0.84768893757694552

    Args:
        r: Relevance scores (list or numpy array) in rank order
            (first element is the most relevant item)
        k: Number of results to consider
        method: If 0 then sum rel_i / log2(i + 1) [not log2(i)]
                If 1 then sum (2^rel_i - 1) / log2(i + 1)
    Returns:
        Normalized discounted cumulative gain
    """
    dcg_max = dcg_at_k(sorted(r, reverse=True), k, method)
    if not dcg_max:
        return 0.
    return dcg_at_k(r, k, method) / dcg_max

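Usage:

r is a list whose order is the new ranking produced by the method being evaluated; each number is the item's true relevance grade, and a higher number means a higher grade.
k is the number of results to consider.
method selects the formula. NDCG has several formulations; the original author provided only two, and this modified code adds one more, numbered method 1, whose formula is the one in the example figure at the start of this post.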
r = [2, 1, 2, 0]
ndcg_at_k(r,4,0)
0.96519546960144276
r = [2,3,2,3,1,1]
ndcg_at_k(r,6,1)
0.84768893757694552
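
To make the arithmetic behind these two results visible, here is a dependency-free sketch of the same computation. The names `dcg` and `ndcg` are hypothetical stand-ins, not the functions defined above, and the sketch assumes the same two formulas (method 0 and method 1).

```python
import math

def dcg(rels, method=0):
    """Sum gain / log2(i + 1) over positions i = 1, 2, ..."""
    total = 0.0
    for i, rel in enumerate(rels, start=1):
        # method 0 uses the grade itself, method 1 uses 2^rel - 1
        gain = rel if method == 0 else 2 ** rel - 1
        total += gain / math.log2(i + 1)
    return total

def ndcg(rels, k, method=0):
    """DCG of the first k results divided by the DCG of the ideal ordering."""
    ideal = dcg(sorted(rels, reverse=True)[:k], method)
    return dcg(rels[:k], method) / ideal if ideal else 0.0

# r = [2, 1, 2, 0], method 0:
#   DCG  = 2/log2(2) + 1/log2(3) + 2/log2(4) + 0/log2(5)
#   IDCG = same sum over the ideal ordering [2, 2, 1, 0]
first = ndcg([2, 1, 2, 0], 4, 0)
second = ndcg([2, 3, 2, 3, 1, 1], 6, 1)
print(first, second)
```

A ranking that is already in ideal order, such as [3, 2, 1], gives an NDCG of exactly 1.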