Showing posts with the label pandas.

Friday, April 20, 2018

Python - Creating a new DataFrame column with apply and a custom function, commonly used for grouping - Applying function with multiple arguments to create a new pandas column

Information:

System version : Windows 10 64-bit
Python version : Python 3.6.0 :: Anaconda 4.3.1 (64-bit)

Code:

import pandas as pd
import numpy as np

df1 = pd.DataFrame()
number = [1,2,3,4,5]
sex = ['male','male','female','female','female']
df1['number'] = number
df1['sex'] = sex
df1['income'] = [500,1500,1300,2500,500]
print(df1)
print('-'*45)

# With axis=1, apply passes each row to the function as a Series
def income_group(row):
    if row['income']>2000:
        return 'high'
    elif row['income']>1000:
        return 'medium'
    else:
        return 'low'

df1['group'] = df1.apply(income_group, axis=1)
print(df1)
print('-'*45)

Result:

   number     sex  income
0       1    male     500
1       2    male    1500
2       3  female    1300
3       4  female    2500
4       5  female     500
---------------------------------------------
   number     sex  income   group
0       1    male     500     low
1       2    male    1500  medium
2       3  female    1300  medium
3       4  female    2500    high
4       5  female     500     low
---------------------------------------------
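For larger tables, a row-by-row apply can be slow. The following is a minimal sketch of the same income grouping done in one vectorized step with pd.cut. The bin edges and the names bins and labels are assumptions chosen to reproduce the thresholds used above (above 2000 is high, above 1000 is medium, otherwise low).

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'income': [500, 1500, 1300, 2500, 500]})

# Intervals are closed on the right by default:
# (-inf, 1000] -> low, (1000, 2000] -> medium, (2000, inf) -> high
bins = [-np.inf, 1000, 2000, np.inf]
labels = ['low', 'medium', 'high']

# pd.cut returns a categorical column; astype(str) turns it into plain strings
df1['group'] = pd.cut(df1['income'], bins=bins, labels=labels).astype(str)
print(df1)
```

This only covers the case where the group depends on a single numeric column; a custom function with apply remains the more flexible option when several columns are involved.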

Python - Adding a column based on conditions, commonly used to group data - Pandas conditional creation of a dataframe column

Information:

System version : Windows 10 64-bit
Python version : Python 3.6.0 :: Anaconda 4.3.1 (64-bit)

Code:

import pandas as pd
import numpy as np

df1 = pd.DataFrame()
number = [1,2,3,4,5]
sex = ['male','male','female','female','female']
df1['number'] = number
df1['sex'] = sex
df1['income'] = [500,2000,500,2000,500]
print(df1)
print('='*45)

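# One boolean mask per group; np.select uses the first mask that matches each row
conditions = [
    (df1['sex']=='male') & (df1['income']>1000),
    (df1['sex']=='male') & (df1['income']<1000),
    (df1['sex']=='female') & (df1['income']>1000)
]

# choices[i] is the label assigned where conditions[i] is True
choices = ['male-high_income','male-low_income', 'female-high_income']
# Rows that match no condition (here, the low-income females) get the default label
df1['group'] = np.select(conditions, choices, default='no_group')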
print(df1)

Result:

   number     sex  income
0       1    male     500
1       2    male    2000
2       3  female     500
3       4  female    2000
4       5  female     500
=============================================
   number     sex  income               group
0       1    male     500     male-low_income
1       2    male    2000    male-high_income
2       3  female     500            no_group
3       4  female    2000  female-high_income
4       5  female     500            no_group
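When conditions overlap, np.select assigns the choice of the first condition that is True, so the order of the list matters. The sketch below illustrates this with a small made-up income Series; the variable names are only for illustration.

```python
import numpy as np
import pandas as pd

income = pd.Series([500, 1500, 2500])

# Broad condition first: 2500 already matches "> 1000", so 'high' is never reached
loose_order = np.select([income > 1000, income > 2000],
                        ['medium', 'high'], default='low')

# Most specific condition first: each row gets the intended label
strict_order = np.select([income > 2000, income > 1000],
                         ['high', 'medium'], default='low')

print(loose_order.tolist())
print(strict_order.tolist())
```

Listing the narrowest condition first, or making the conditions mutually exclusive as in the post above, avoids this kind of silent mislabeling.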

Thursday, April 19, 2018

Python - Preserving the column order after appending pandas DataFrames - How to control column order when appending to a DataFrame

Information:

System version : Windows 10 64-bit
Python version : Python 3.6.0 :: Anaconda 4.3.1 (64-bit)

Code:

import pandas as pd

df1 = pd.DataFrame()
number = [1,2,3,4,5]
sex = ['male','female','female','female','male']
df1['number'] = number
df1['sex'] = sex
df1['b'] = [1,2,3,4,5]

df2 = pd.DataFrame()
number = [1,2,3,4,5]
sex = ['male','female','female','female','male']
df2['number'] = number
df2['sex'] = sex
df2['a'] = [1,2,3,4,5]
print('df1')
print(df1)
print('-'*45)
print('df2')
print(df2)
print('-'*45)

print('Column names not handled: the columns are re-sorted')
# Note: DataFrame.append was removed in pandas 2.0; on current versions use
# pd.concat([df1, df2], ignore_index=True) instead.
df3 = df1.append(df2,ignore_index =True)
print(df3)
print('-'*45)

print('After handling: columns follow df1, and new columns are added in order')

# Start from the columns of df1, then add the columns that only exist in df2
columns_all = list(df1.columns)
columns_new = list(df2.columns)

for ele in columns_new:
    if ele not in columns_all:
        columns_all.append(ele)

df3 = df1.append(df2,ignore_index =True)
df3 = df3[columns_all]
print(df3)
print('-'*45)

Result:

df1
   number     sex  b
0       1    male  1
1       2  female  2
2       3  female  3
3       4  female  4
4       5    male  5
---------------------------------------------
df2
   number     sex  a
0       1    male  1
1       2  female  2
2       3  female  3
3       4  female  4
4       5    male  5
---------------------------------------------
Column names not handled: the columns are re-sorted
     a    b  number     sex
0  NaN  1.0       1    male
1  NaN  2.0       2  female
2  NaN  3.0       3  female
3  NaN  4.0       4  female
4  NaN  5.0       5    male
5  1.0  NaN       1    male
6  2.0  NaN       2  female
7  3.0  NaN       3  female
8  4.0  NaN       4  female
9  5.0  NaN       5    male
---------------------------------------------
After handling: columns follow df1, and new columns are added in order
   number     sex    b    a
0       1    male  1.0  NaN
1       2  female  2.0  NaN
2       3  female  3.0  NaN
3       4  female  4.0  NaN
4       5    male  5.0  NaN
5       1    male  NaN  1.0
6       2  female  NaN  2.0
7       3  female  NaN  3.0
8       4  female  NaN  4.0
9       5    male  NaN  5.0
---------------------------------------------
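On recent pandas versions the same result can be reached with pd.concat. The following is a minimal sketch, assuming pandas 0.23 or later (where pd.concat accepts the sort parameter); the two small frames are made-up data with the same column layout as above.

```python
import pandas as pd

df1 = pd.DataFrame({'number': [1, 2], 'sex': ['male', 'female'], 'b': [1, 2]})
df2 = pd.DataFrame({'number': [3, 4], 'sex': ['female', 'male'], 'a': [3, 4]})

# Columns of df1 first, then the columns that only exist in df2
columns_all = list(df1.columns) + [c for c in df2.columns if c not in df1.columns]

# sort=False asks concat not to sort the column names;
# reindex then pins the column order explicitly
df3 = pd.concat([df1, df2], ignore_index=True, sort=False).reindex(columns=columns_all)
print(df3)
```

Pinning the order with reindex keeps the output stable even if the default sorting behavior differs between pandas versions.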

Monday, April 16, 2018

Python - Removing duplicate rows in a pandas DataFrame - How to drop duplicate rows in Python Pandas DataFrame - Stack Overflow

Information:

System version : Windows 10 64-bit
Python version : Python 3.6.0 :: Anaconda 4.3.1 (64-bit)

First, create the data

Code:

import pandas as pd
number = [1,2,3,4,5]
sex = ['male','female','female','female','male']
df_new = pd.DataFrame()
df_new['number'] = number
df_new['sex'] = sex
print(df_new)

Result:

   number     sex
0       1    male
1       2  female
2       3  female
3       4  female
4       5    male

Drop the duplicates (by default, the first row for each value of sex is kept)

Code:

df_new.drop_duplicates(['sex'])

Result:

   number     sex
0       1    male
1       2  female
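Which of the duplicated rows survives is controlled by the keep parameter of drop_duplicates. A short sketch of the three options on the same data follows; the result variable names are only for illustration.

```python
import pandas as pd

df_new = pd.DataFrame({'number': [1, 2, 3, 4, 5],
                       'sex': ['male', 'female', 'female', 'female', 'male']})

# keep='first' (the default): keep the first row of each 'sex' value
kept_first = df_new.drop_duplicates(subset=['sex'])

# keep='last': keep the last row of each 'sex' value
kept_last = df_new.drop_duplicates(subset=['sex'], keep='last')

# keep=False: drop every row whose 'sex' value occurs more than once
kept_none = df_new.drop_duplicates(subset=['sex'], keep=False)

print(kept_first)
print(kept_last)
print(kept_none)
```

drop_duplicates returns a new DataFrame, so the result has to be assigned to a variable if it is needed later.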

Monday, April 9, 2018

Python - pandasql - How to use SQL syntax on a DataFrame in Python - Querying a dataframe with SQL

Information:

System version : Windows 10 64-bit
Python version : Python 3.6.0 :: Anaconda 4.3.1 (64-bit)

First, create the data

Code:

import pandas as pd
number = [1,2,3,4,5]
sex = ['male','female','female','female','male']
df_new = pd.DataFrame()
df_new['number'] = number
df_new['sex'] = sex
print(df_new)

Result:

   number     sex
0       1    male
1       2  female
2       3  female
3       4  female
4       5    male

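Use SQL syntax to query the DataFrame

Code:

from pandasql import sqldf
# SQL string literals use single quotes; the DataFrame name is used as the table name
q = "SELECT * FROM df_new WHERE sex = 'male';"
df_one= sqldf(q)
print(df_one)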

Result:

   number   sex
0       1  male
1       5  male
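For a simple filter like this one, pandas itself offers equivalents that do not need pandasql. The sketch below shows two of them; reset_index(drop=True) is added only so the row labels match the 0, 1 numbering of the SQL result.

```python
import pandas as pd

df_new = pd.DataFrame({'number': [1, 2, 3, 4, 5],
                       'sex': ['male', 'female', 'female', 'female', 'male']})

# Boolean mask: keep the rows where sex equals 'male'
df_mask = df_new[df_new['sex'] == 'male'].reset_index(drop=True)

# DataFrame.query: the same filter written as an expression string
df_query = df_new.query("sex == 'male'").reset_index(drop=True)

print(df_mask)
print(df_query)
```

pandasql is still convenient when a query is easier to express in SQL, for example with joins or grouped aggregates.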

Friday, March 23, 2018

Python Pandas Dataframe - How to combine multiple rows into one row by a given column - How to combine multiple rows to one row

Version information:
System version : Windows 10 64-bit
Python version : Python 3.6.0 :: Anaconda 4.3.1 (64-bit)
Pandas version : 0.21.1
How do I get from

  device_id  temp_a  temp_b  temp_c
0         0     0.2     0.8     0.6
1         0     0.1     0.9     0.4
2         1     0.3     0.7     0.2
3         2     0.5     0.5     0.1
4         2     0.1     0.9     0.4
5         2     0.7     0.3     0.9
to

  device_id  temp_a  temp_b  temp_c  temp_a_1  temp_b_1  temp_c_1  temp_a_2  \
0         0     0.2     0.8     0.6       0.1       0.9       0.4       NaN   
1         1     0.3     0.7     0.2       NaN       NaN       NaN       NaN   
2         2     0.5     0.5     0.1       0.1       0.9       0.4       0.7   

   temp_b_2  temp_c_2  
0       NaN       NaN  
1       NaN       NaN  
2       0.3       0.9
Code:
import pandas as pd

df_raw = pd.DataFrame({'device_id' : ['0','0','1','2','2','2'],
                   'temp_a'    : [0.2,0.1,0.3,0.5,0.1,0.7],
                   'temp_b'    : [0.8,0.9,0.7,0.5,0.9,0.3],
                   'temp_c'    : [0.6,0.4,0.2,0.1,0.4,0.9],
              })

print(df_raw)

df_except = pd.DataFrame({'device_id' : ['0','1','2'],
                          'temp_a':[0.2,0.3,0.5],
                          'temp_b':[0.8,0.7,0.5],
                          'temp_c':[0.6,0.2,0.1],
                          'temp_a_1':[0.1,None,0.1],
                          'temp_b_1':[0.9,None,0.9],
                          'temp_c_1':[0.4,None,0.4],
                          'temp_a_2':[None,None,0.7],
                          'temp_b_2':[None,None,0.3],
                          'temp_c_2':[None,None,0.9],

              })
df_except2 = df_except[['device_id','temp_a','temp_b','temp_c','temp_a_1','temp_b_1','temp_c_1','temp_a_2','temp_b_2','temp_c_2']]
print(df_except2)

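# Method1
# Number the rows within each device_id: 0 for the first row, 1 for the second, ...
g = df_raw.groupby('device_id').cumcount()
# Move that counter into the columns, then order the columns by counter
df = df_raw.set_index(['device_id', g]).unstack().sort_index(axis=1, level=1)
# Flatten the (name, counter) column pairs; counter 0 keeps the plain name
df.columns = ['{}_{}'.format(i,j) if j != 0 else '{}'.format(i) for i, j in df.columns]
df = df.reset_index()
print (df)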

# Method2
import numpy as np
device_id_list = df_raw['device_id'].tolist()
device_id_list = list(np.unique(device_id_list))

append_df = pd.DataFrame()
for device_id in device_id_list:
    tmp_df = df_raw.query('device_id=="%s"'%(device_id))

    if len(tmp_df)>1:
        # Split into single-row DataFrames and store them in a list
        one_raw_list=[]
        for i in range(0,len(tmp_df)):
            one_raw_df = tmp_df.iloc[i:i+1]
            one_raw_list.append(one_raw_df)

        # Temporary holder for the concatenated data
        tmp_combine_df = pd.DataFrame()
        for i in range(0,len(one_raw_list)-1):
            # Take the next row and prepare it
            next_raw = one_raw_list[i+1].drop(columns=['device_id']).reset_index(drop=True)
            new_name_list=[]
            for old_name in list(next_raw.columns):
                new_name_list.append(old_name+'_'+str(i+1))
            next_raw.columns = new_name_list

            if i==0:
                # Prepare the first row
                current_raw = one_raw_list[i].reset_index(drop=True)
                tmp_combine_df = pd.concat([current_raw, next_raw], axis=1)
            else:
                # Concatenate with the rows prepared so far
                tmp_combine_df = pd.concat([tmp_combine_df, next_raw], axis=1)
        tmp_df = tmp_combine_df
    tmp_df_columns = tmp_df.columns
    append_df_columns = append_df.columns
    append_df = pd.concat([append_df,tmp_df],ignore_index =True)
    if  len(tmp_df_columns) > len(append_df_columns):
        append_df = append_df[tmp_df_columns]
    else:
        append_df = append_df[append_df_columns]
print(append_df)
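Method1 is compact, so it can help to look at its intermediate values. The sketch below repeats the first two steps on the same data and prints what cumcount and unstack produce; the name wide is only for illustration.

```python
import pandas as pd

df_raw = pd.DataFrame({'device_id': ['0', '0', '1', '2', '2', '2'],
                       'temp_a': [0.2, 0.1, 0.3, 0.5, 0.1, 0.7],
                       'temp_b': [0.8, 0.9, 0.7, 0.5, 0.9, 0.3],
                       'temp_c': [0.6, 0.4, 0.2, 0.1, 0.4, 0.9]})

# Position of each row inside its device_id group
g = df_raw.groupby('device_id').cumcount()
print(g.tolist())

# One row per device_id; the columns become (name, position) pairs,
# and positions a device does not have are filled with NaN
wide = df_raw.set_index(['device_id', g]).unstack()
print(wide.columns.tolist())
print(wide)
```

The remaining steps of Method1 only reorder these columns by position and flatten the pairs into names such as temp_a_1.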

Wednesday, August 2, 2017

Python - how to widen output display to see more columns and rows in pandas dataframe? - Changing the number of rows and columns pandas displays for a DataFrame

Version

Python Version: Python 3.6.0 :: Anaconda 4.3.1 (64-bit)
System Version: Windows 10

Code:

import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
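pd.set_option changes the setting for the rest of the session. If the wide display is only needed for one print, pd.option_context can limit the change to a with block. A minimal sketch, using a made-up 30-column frame:

```python
import pandas as pd

# A frame wide enough that pandas would normally truncate it
df = pd.DataFrame({'col_%d' % i: range(3) for i in range(30)})

# Inside the block, None means "no limit" for both options
with pd.option_context('display.max_columns', None, 'display.max_rows', None):
    inside = pd.get_option('display.max_columns')
    print(df)

# After the block, the previous value is restored
after = pd.get_option('display.max_columns')
print(inside, after)
```

pd.reset_option('display.max_columns') is another way to go back to the default after a set_option call.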

Wednesday, July 26, 2017

Python - Looking up the list of sheets in an excel file - Finding the sheet names of a given Excel file

Version

Python version: Python 3.6.0 :: Anaconda 4.3.1 (64-bit)
Pandas version: 0.19.2
System version: Windows 10

Codes

import pandas as pd
file_path=r'E:\download\tmp\test.xlsx'
sheet_name_list = pd.ExcelFile(file_path).sheet_names
sheet_name_list

Result:

['sheet1', 'sheet2', 'sheet3']

Monday, July 10, 2017

Python - How to append new rows from pandas dataframe to existing excel - How to append new data to an existing Excel file

Python version: Python 3.6.0 :: Anaconda 4.3.1 (64-bit)
Pandas version: 0.19.2
System version: Windows 10
import pandas as pd
import os
from datetime import datetime
# def
# Create the directory if it does not already exist
def mkdir(path):
    if not os.path.exists(path):
        os.makedirs(path)
        print('mkdir:',path)
    else:
        print('dir already exist:',path)
# Create sample data
number = [1,2,3,4,5]
sex = ['male','female','female','female','male']
df_new = pd.DataFrame()
df_new['number'] = number
df_new['sex'] = sex
df_new

# Folder where the Excel file is stored
file_dir='E:/download/tmp/python/%s/' % (datetime.now().strftime("%Y%m%d"))
# Create the folder if it does not exist
mkdir(file_dir)
file_out=file_dir+'test.xlsx'
# If the file does not exist, create it
if not os.path.exists(file_out):
    df_new.to_excel(file_out,index=False)
# If the file exists, read it, add the new rows, and write it back
else:
    df_old = pd.read_excel(file_out)
    # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
    df_combine = pd.concat([df_old, df_new])
    df_combine.to_excel(file_out,index=False)
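The directory check can also be left to the standard library. The sketch below uses os.makedirs with exist_ok=True (available since Python 3.2); ensure_dir is a hypothetical helper name, and a temporary directory stands in for the E:/download path used above.

```python
import os
import tempfile


def ensure_dir(path):
    """Hypothetical helper: create path and its parents; do nothing if it exists."""
    os.makedirs(path, exist_ok=True)
    return path


with tempfile.TemporaryDirectory() as base_dir:
    target = os.path.join(base_dir, 'python', '20170710')
    ensure_dir(target)
    ensure_dir(target)  # calling it again on an existing folder raises no error
    created = os.path.isdir(target)

print(created)
```

Compared with checking os.path.exists first, this avoids a gap between the check and the creation in which another process could create the folder.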

Thursday, July 6, 2017

Python - pandas dataframe and csv read / write - Reading and writing CSV files with a pandas DataFrame

Version information:

Python version: Python 3.6.0 :: Anaconda 4.3.1 (64-bit)
Pandas version: 0.19.2
System version: Windows 10

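Write a DataFrame to a CSV file

import pandas as pd
# Windows-style path
file_path=r'E:\download\tmp\test.csv'
# Relative path (this overrides the line above; keep whichever one you need)
file_path='./tmp/test.csv'
# data is an existing DataFrame
data.to_csv(file_path,sep=',',index=False)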

Read a CSV file into a DataFrame

import pandas as pd
# Windows-style path
file_path=r'E:\download\tmp\test.csv'
# Relative path (this overrides the line above; keep whichever one you need)
file_path='./tmp/test.csv'
df_name= pd.read_csv(file_path,sep=',')
If reading fails with the following error:
OSError: Initializing from file failed
change the engine from C to python:
import pandas as pd
file_path=r'E:\download\tmp\test.csv'
df_name= pd.read_csv(file_path ,engine='python')
To change the output encoding:
import pandas as pd
# utf-8
file_path=r'E:\download\tmp\test.csv'
df_name.to_csv(file_path, encoding ='utf-8')
# Big5
file_path=r'E:\download\tmp\test.csv'
df_name.to_csv(file_path, encoding ='Big5')
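A file written with a non-default encoding has to be read back with the same encoding. The following is a small round-trip sketch; the sample rows and the temporary folder are assumptions made for the example, not part of the original post.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'number': [1, 2], 'name': ['王', '李']})

with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = os.path.join(tmp_dir, 'test.csv')
    # Write the file as Big5 without the index column
    df.to_csv(file_path, index=False, encoding='big5')
    # Read it back with the same encoding
    df_back = pd.read_csv(file_path, encoding='big5')

print(df_back)
```

Reading a Big5 file with the default utf-8 decoding would typically fail with a UnicodeDecodeError or produce garbled text, which is why the encoding is passed on both sides.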

Wednesday, July 5, 2017

Python - dataframe apply - Using conditional to generate new column in pandas dataframe - Creating a new DataFrame column conditioned on the values of existing columns

Python version: Python 3.6.0 :: Anaconda 4.3.1 (64-bit)
Pandas version: 0.19.2
System version: Windows 10
In data analysis, you often need to convert between categorical and numerical data:
numerical variable to categorical variable, or categorical variable to numerical variable.
import pandas as pd
# create dataframe
number = [1,2,3,4,5]
sex = ['male','female','female','female','male']
df_new = pd.DataFrame()
df_new['number'] = number
df_new['sex'] = sex
df_new.head()

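# create def for category to number 0/1
# (with axis=1, apply passes each row to the function as a Series;
#  any value other than 'male' or 'female' returns None)
def tran_cat_to_num(df):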
    if df['sex'] == 'male':
        return 1
    elif df['sex'] == 'female':
        return 0
# create sex_new 
df_new['sex_new']=df_new.apply(tran_cat_to_num,axis=1)
df_new
Result: