[Pandas Tutorial] Essential Pandas DataFrame Methods for Handling Two-Dimensional Data

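Many companies and businesses now analyze the user data they collect to uncover trends and business opportunities. As data analysis grows in importance, knowing how to use data analysis tools becomes essential.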

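The previous article, "[Pandas Tutorial] Essential Pandas Series Methods for Handling One-Dimensional Data", covered practical ways to use the Pandas Series data structure for one-dimensional datasets. This article introduces another very important data structure in the Pandas package: the DataFrame.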

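A DataFrame is mainly used for two-dimensional data, that is, tabular datasets with rows and columns. It is therefore commonly used to read CSV files, web page tables, or database tables for analysis and processing. This article covers several basic Pandas DataFrame concepts, including: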

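What is a Pandas DataFrame
Creating a Pandas DataFrame
Getting data from a Pandas DataFrame
Adding data to a Pandas DataFrame
Modifying data in a Pandas DataFrame
Deleting data from a Pandas DataFrame
Filtering data in a Pandas DataFrame
Sorting data in a Pandas DataFrame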

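1. What Is a Pandas DataFrame

Whereas a Pandas Series handles one-dimensional, single-column data, a Pandas DataFrame handles two-dimensional, multi-column data. It resembles an Excel table, with a data index labeling the rows and column headers labeling the columns.

Before starting the exercises in this article, install the Pandas package with the following command: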

$ pip install pandas
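
Once pandas is installed, a quick way to see the row index and column headers described above is to build a tiny DataFrame and inspect it. The following is a minimal sketch; the two-student sample data is made up for illustration.

```python
import pandas as pd

# A small table: each dictionary key becomes a column
df = pd.DataFrame({
    "name": ["Mike", "Sherry"],
    "math": [80, 75],
})

print(df.shape)          # (number of rows, number of columns)
print(list(df.index))    # row labels (default integer index)
print(list(df.columns))  # column labels
```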

2. Creating a Pandas DataFrame

To store two-dimensional data in a Pandas DataFrame, first create a DataFrame object. The syntax is as follows:

my_dataframe = pandas.DataFrame(dictionary_or_list_data)

Example

import pandas as pd


grades = {
    "name": ["Mike", "Sherry", "Cindy", "John"],
    "math": [80, 75, 93, 86],
    "chinese": [63, 90, 85, 70]
}

df = pd.DataFrame(grades)

print("Create df from a dictionary:")
print(df)

print("=====================")

grades = [
    ["Mike", 80, 63],
    ["Sherry", 75, 90],
    ["Cindy", 93, 85],
    ["John", 86, 70]
]

new_df = pd.DataFrame(grades)

print("Create df from a nested list:")
print(new_df)

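In the output, the same data produces different column labels. When the data is given as a Python dictionary, each key becomes a column name of the Pandas DataFrame and each value holds that column's data. When the data is given as a nested list, each inner list is simply one row of data, and the columns receive the default integer labels 0, 1, and 2.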
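
When the data arrives as a nested list, the column names can also be supplied at creation time through the columns parameter of the DataFrame constructor. The following is a minimal sketch with a shortened version of the sample data:

```python
import pandas as pd

rows = [
    ["Mike", 80, 63],
    ["Sherry", 75, 90],
]

# Name the columns while building the DataFrame from a nested list
df = pd.DataFrame(rows, columns=["name", "math", "chinese"])
print(df)
```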

To customize the row index and column names of a Pandas DataFrame, assign new values to its index and columns attributes, as in the following example:

import pandas as pd


grades = {
    "name": ["Mike", "Sherry", "Cindy", "John"],
    "math": [80, 75, 93, 86],
    "chinese": [63, 90, 85, 70]
}

df = pd.DataFrame(grades)
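df.index = ["s1", "s2", "s3", "s4"]  # custom index labels
df.columns = ["student_name", "math_score", "chinese_score"]  # custom column names
print(df)

Output: the rows are labeled s1 through s4, and the columns are named student_name, math_score, and chinese_score.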
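
Assigning to index and columns replaces every label at once. As an alternative sketch, the index can be passed to the constructor, and rename() can change only selected column names while returning a new DataFrame. The shortened data and the variable name renamed are illustrative.

```python
import pandas as pd

# Set the row labels when the DataFrame is created
df = pd.DataFrame(
    {"name": ["Mike", "Sherry"], "math": [80, 75]},
    index=["s1", "s2"],
)

# Rename just one column; the original df is left unchanged
renamed = df.rename(columns={"math": "math_score"})
print(renamed)
```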

3. Getting Data from a Pandas DataFrame

head(): returns the first n rows (n defaults to 5) as a new Pandas DataFrame, as in the following example:

import pandas as pd


grades = {
    "name": ["Mike", "Sherry", "Cindy", "John"],
    "math": [80, 75, 93, 86],
    "chinese": [63, 90, 85, 70]
}

df = pd.DataFrame(grades)
print("Original df")
print(df)

print("=================================")

new_df = df.head(2)
print("First two rows")
print(new_df)

Output: new_df contains only the first two rows, Mike and Sherry.

tail(): returns the last n rows (n defaults to 5) as a new Pandas DataFrame, as in the following example:

import pandas as pd


grades = {
    "name": ["Mike", "Sherry", "Cindy", "John"],
    "math": [80, 75, 93, 86],
    "chinese": [63, 90, 85, 70]
}

df = pd.DataFrame(grades)
print("Original df")
print(df)

print("=================================")

new_df = df.tail(3)
print("Last three rows")
print(new_df)

Output: new_df contains the last three rows: Sherry, Cindy, and John.

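Square brackets []: put a column name (or a list of column names) inside the brackets to select columns, or a slice to select rows by position, as in the following example: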

import pandas as pd


grades = {
    "name": ["Mike", "Sherry", "Cindy", "John"],
    "math": [80, 75, 93, 86],
    "chinese": [63, 90, 85, 70]
}

df = pd.DataFrame(grades)

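print("Single column (type is Series)")
print(df["name"])

print("=================================")

print("Single column (type is DataFrame)")
print(df[["name"]])

print("=================================")

print("Multiple columns (type is DataFrame)")
print(df[["name", "chinese"]])

print("=================================")

print("Rows at positions 0 to 2")
print(df[0:3])

Output: df["name"] prints a Series, df[["name"]] and df[["name", "chinese"]] print DataFrames, and df[0:3] prints the first three rows (Mike, Sherry, and Cindy).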

at[row label, column name]: uses a row's index label and a column name to get a single value, as in the following example:

import pandas as pd


grades = {
    "name": ["Mike", "Sherry", "Cindy", "John"],
    "math": [80, 75, 93, 86],
    "chinese": [63, 90, 85, 70]
}

df = pd.DataFrame(grades)

print("Original df")
print(df)

print("=================================")

print("Use at[] to get the math value of the row labeled 1")
print(df.at[1, "math"])

Output: the last line prints 75, Sherry's math score.

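iat[row position, column position]: uses zero-based integer positions for the row and the column to get a single value, as in the following example: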

import pandas as pd


grades = {
    "name": ["Mike", "Sherry", "Cindy", "John"],
    "math": [80, 75, 93, 86],
    "chinese": [63, 90, 85, 70]
}

df = pd.DataFrame(grades)

print("Original df")
print(df)

print("=================================")

print("Use iat[] to get the first column of the row at position 1")
print(df.iat[1, 0])

Output: the last line prints Sherry.

loc[row labels, column names]: uses index labels and column names to get a subset of the data, as in the following example:

import pandas as pd


grades = {
    "name": ["Mike", "Sherry", "Cindy", "John"],
    "math": [80, 75, 93, 86],
    "chinese": [63, 90, 85, 70]
}

df = pd.DataFrame(grades)

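print("Original df")
print(df)

print("=================================")

print("Rows labeled 1 and 3, columns name and chinese")
print(df.loc[[1, 3], ["name", "chinese"]])

Output: the result contains the name and chinese columns for Sherry (90) and John (70).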

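iloc[row positions, column positions]: uses zero-based integer positions for rows and columns to get a subset of the data, as in the following example: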

import pandas as pd


grades = {
    "name": ["Mike", "Sherry", "Cindy", "John"],
    "math": [80, 75, 93, 86],
    "chinese": [63, 90, 85, 70]
}

df = pd.DataFrame(grades)

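print("Original df")
print(df)

print("=================================")

print("Rows at positions 1 and 3, first and third columns")
print(df.iloc[[1, 3], [0, 2]])

Output: the result contains the name and chinese columns for Sherry (90) and John (70).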
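
With the default integer index, labels and positions happen to coincide, so the difference between the label-based accessors (at, loc) and the position-based ones (iat, iloc) is easy to miss. The following sketch uses a custom string index (s1 to s4, chosen for illustration) to make the distinction visible. Note that a label slice in loc includes its end label, while a position slice in iloc excludes its end position.

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["Mike", "Sherry", "Cindy", "John"], "math": [80, 75, 93, 86]},
    index=["s1", "s2", "s3", "s4"],
)

# Label-based access uses the index labels and column names
print(df.at["s2", "math"])
by_label = df.loc["s1":"s3", "name"]   # end label "s3" is included

# Position-based access uses zero-based integers
print(df.iat[1, 1])
by_position = df.iloc[0:3, 0]          # end position 3 is excluded

print(by_label)
print(by_position)
```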

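4. Adding Data to a Pandas DataFrame

insert(): adds a new column at the specified column position, modifying the DataFrame in place, as in the following example: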

import pandas as pd


grades = {
    "name": ["Mike", "Sherry", "Cindy", "John"],
    "math": [80, 75, 93, 86],
    "chinese": [63, 90, 85, 70]
}

df = pd.DataFrame(grades)
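print("Original df")
print(df)

print("=================================")

df.insert(2, column="english", value=[88, 72, 74, 98])
print("Insert a new column at the third column position")
print(df)

Output: the english column appears between the math and chinese columns.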
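
When the position of the new column does not matter, plain column assignment is a common alternative to insert(): it appends the column at the end. The sketch below also derives a column from existing ones; the total column is an illustrative addition, not part of the original example.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Mike", "Sherry", "Cindy", "John"],
    "math": [80, 75, 93, 86],
    "chinese": [63, 90, 85, 70],
})

# Assigning to a new column name appends the column at the end
df["english"] = [88, 72, 74, 98]

# A new column can also be computed from existing columns
df["total"] = df["math"] + df["chinese"] + df["english"]
print(df)
```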

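append(): adds a single row whose column values are given as a dictionary, and returns a new Pandas DataFrame. DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0, so the following example builds the same result with pd.concat(), which works on both older and current versions:

import pandas as pd


grades = {
    "name": ["Mike", "Sherry", "Cindy", "John"],
    "math": [80, 75, 93, 86],
    "chinese": [63, 90, 85, 70]
}

df = pd.DataFrame(grades)
print("Original df")
print(df)

print("=================================")

# The new row, with one value per column
new_row = {
    "name": "Henry",
    "math": 60,
    "chinese": 62
}

# Wrap the dictionary in a one-row DataFrame and concatenate it
new_df = pd.concat([df, pd.DataFrame([new_row])], ignore_index=True)

print("Add one row")
print(new_df)

Output: new_df has a fifth row, labeled 4, holding Henry's scores.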

concat(): adds data by concatenating multiple Pandas DataFrames and returns a new Pandas DataFrame, as in the following example:

import pandas as pd


grades = {
    "name": ["Mike", "Sherry", "Cindy", "John"],
    "math": [80, 75, 93, 86],
    "chinese": [63, 90, 85, 70]
}

df1 = pd.DataFrame(grades)
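print("Original df")
print(df1)

print("=================================")

df2 = pd.DataFrame({
    "name": ["Henry"],
    "math": [60],
    "chinese": [62]
})

new_df = pd.concat([df1, df2], ignore_index=True)
print("Concatenate DataFrames to add data")
print(new_df)

Output: new_df contains the four original rows followed by Henry's row, labeled 0 through 4.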
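
The ignore_index=True argument matters because concat() otherwise keeps each DataFrame's own row labels, which can leave duplicate labels in the result. A minimal sketch of the difference, using shortened sample data and illustrative variable names:

```python
import pandas as pd

df1 = pd.DataFrame({"name": ["Mike", "Sherry"], "math": [80, 75]})
df2 = pd.DataFrame({"name": ["Henry"], "math": [60]})

# Without ignore_index, the original row labels are kept as they are
kept = pd.concat([df1, df2])
print(list(kept.index))

# With ignore_index=True, the rows are renumbered from 0
renumbered = pd.concat([df1, df2], ignore_index=True)
print(list(renumbered.index))
```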

5. Modifying Data in a Pandas DataFrame

Use at[] or iat[] to select a single value in the Pandas DataFrame and assign a new value to it, as in the following example:

import pandas as pd


grades = {
    "name": ["Mike", "Sherry", "Cindy", "John"],
    "math": [80, 75, 93, 86],
    "chinese": [63, 90, 85, 70]
}

df = pd.DataFrame(grades)

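print("Original df")
print(df)

print("=================================")

df.at[1, "math"] = 100  # change the math value of the row labeled 1
df.iat[1, 0] = "Larry"  # change the first column of the row at position 1
print("Modified df")
print(df)

Output: the second row now reads Larry, 100, 90.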
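
at[] and iat[] change one value at a time. To change several values in one step, loc[] also accepts assignment, either with a list of row labels or with a Boolean condition. The following is a minimal sketch; the replacement scores are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Mike", "Sherry", "Cindy", "John"],
    "math": [80, 75, 93, 86],
    "chinese": [63, 90, 85, 70],
})

# Raise every math score below 80 up to 80
df.loc[df["math"] < 80, "math"] = 80

# Set the chinese scores of the rows labeled 0 and 3
df.loc[[0, 3], "chinese"] = [65, 72]
print(df)
```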

6. Deleting Data from a Pandas DataFrame

drop(list of column names, axis=1): drops the named columns and returns a new Pandas DataFrame, as in the following example:

import pandas as pd


grades = {
    "name": ["Mike", "Sherry", "Cindy", "John"],
    "math": [80, 75, 93, 86],
    "chinese": [63, 90, 85, 70]
}

df = pd.DataFrame(grades)
print("Original df")
print(df)

print("=================================")

new_df = df.drop(["math"], axis=1)
print("Drop the math column")
print(new_df)

Output: new_df keeps only the name and chinese columns.

drop(list of index labels, axis=0): drops the rows with the given index labels and returns a new Pandas DataFrame, as in the following example:

import pandas as pd


grades = {
    "name": ["Mike", "Sherry", "Cindy", "John"],
    "math": [80, 75, 93, 86],
    "chinese": [63, 90, 85, 70]
}

df = pd.DataFrame(grades)
print("Original df")
print(df)

print("=================================")

new_df = df.drop([0, 3], axis=0)  # drop the rows labeled 0 and 3
print("Drop the first and fourth rows")
print(new_df)

Output: only Sherry and Cindy remain, still labeled 1 and 2.
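
drop() also accepts the columns and index keyword arguments, which some readers find clearer than remembering which axis number means what. A minimal sketch equivalent to the two drop() examples above:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Mike", "Sherry", "Cindy", "John"],
    "math": [80, 75, 93, 86],
    "chinese": [63, 90, 85, 70],
})

# Same as df.drop(["math"], axis=1)
without_math = df.drop(columns=["math"])

# Same as df.drop([0, 3], axis=0)
middle_rows = df.drop(index=[0, 3])

print(without_math)
print(middle_rows)
```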

dropna(): drops rows that contain NaN or other missing values and returns a new Pandas DataFrame, which is very useful during data cleaning, as in the following example:

import pandas as pd
import numpy as np


grades = {
    # np.nan marks a missing value (the np.NaN alias was removed in NumPy 2.0)
    "name": ["Mike", "Sherry", np.nan, "John"],
    "city": ["Taipei", np.nan, "Kaohsiung", "Taichung"],
    "math": [80, 75, 93, 86],
    "chinese": [63, 90, 85, 70]
}

df = pd.DataFrame(grades)
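print("Original df")
print(df)

print("======================================")

new_df = df.dropna()
print("df after dropping missing values")
print(new_df)

Output: only the rows for Mike and John remain (labels 0 and 3), because the other two rows each contain a missing value.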
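
By default dropna() removes a row if any column is missing. When only certain columns are required, the subset parameter limits the check to those columns, and fillna() offers a way to keep the rows by filling the gaps instead. The following is a minimal sketch; the placeholder text "Unknown" is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["Mike", "Sherry", np.nan, "John"],
    "city": ["Taipei", np.nan, "Kaohsiung", "Taichung"],
    "math": [80, 75, 93, 86],
})

# Drop only the rows whose name is missing
named = df.dropna(subset=["name"])

# Keep every row and fill missing cities with a placeholder
filled = df.fillna({"city": "Unknown"})

print(named)
print(filled)
```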

drop_duplicates(): drops duplicate rows and returns a new Pandas DataFrame; it is also commonly used during data cleaning, as in the following example:

import pandas as pd


grades = {
    "name": ["Mike", "Mike", "Cindy", "John"],
    "city": ["Taipei", "Taipei", "Kaohsiung", "Taichung"],
    "math": [80, 80, 93, 86],
    "chinese": [80, 80, 93, 86]
}

df = pd.DataFrame(grades)
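print("Original df")
print(df)

print("======================================")

new_df = df.drop_duplicates()
print("df after dropping duplicates")
print(new_df)

Output: the second Mike row (label 1) is removed, leaving the rows labeled 0, 2, and 3.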
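
By default drop_duplicates() compares entire rows. Its subset parameter restricts the comparison to chosen columns, and keep controls which of the duplicates survives. The following sketch uses made-up data in which Mike appears twice with different scores:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Mike", "Mike", "Cindy"],
    "math": [80, 95, 93],
})

# Treat rows with the same name as duplicates; keep the first one
first = df.drop_duplicates(subset=["name"])

# Keep the last occurrence instead
last = df.drop_duplicates(subset=["name"], keep="last")

print(first)
print(last)
```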

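7. Filtering Data in a Pandas DataFrame

When working with large datasets, you will often need to filter rows by a condition. To do this, build a Boolean condition from a column and pass it inside square brackets [], as in the following example: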

import pandas as pd


grades = {
    "name": ["Mike", "Sherry", "Cindy", "John"],
    "math": [80, 75, 93, 86],
    "chinese": [63, 90, 85, 70]
}

df = pd.DataFrame(grades)

print("Original df")
print(df)

print("=================================")

print("Rows where math is greater than 80")
print(df[df["math"] > 80])

Output: the result contains the rows for Cindy (93) and John (86).

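Another common filtering scenario is finding the rows whose column value matches one of several specific values. The isin() method of a column (a Pandas Series) handles this, as in the following example: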

import pandas as pd


grades = {
    "name": ["Mike", "Sherry", "Cindy", "John"],
    "math": [80, 75, 93, 86],
    "chinese": [63, 90, 85, 70]
}

df = pd.DataFrame(grades)

print("Original df")
print(df)

print("=================================")

print("Rows whose name is John")
print(df[df["name"].isin(["John"])])

Output: the result contains only John's row.
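
Conditions can be combined with & (and), | (or), and ~ (not). Each condition needs its own parentheses, because these operators bind more tightly than comparisons such as >. A minimal sketch on the same sample data, with illustrative variable names:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Mike", "Sherry", "Cindy", "John"],
    "math": [80, 75, 93, 86],
    "chinese": [63, 90, 85, 70],
})

# Both scores above 80
both = df[(df["math"] > 80) & (df["chinese"] > 80)]

# Math above 90, or chinese at least 90
either = df[(df["math"] > 90) | (df["chinese"] >= 90)]

# Everyone whose name is NOT in the list
not_listed = df[~df["name"].isin(["Mike", "John"])]

print(both)
print(either)
print(not_listed)
```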

8. Sorting Data in a Pandas DataFrame

sort_index(): sorts by index label and returns a new Pandas DataFrame, as in the following example:

import pandas as pd


grades = {
    "name": ["Mike", "Sherry", "Cindy", "John"],
    "math": [80, 75, 93, 86],
    "chinese": [63, 90, 85, 70]
}

df = pd.DataFrame(grades)
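df.index = ["s3", "s1", "s4", "s2"]  # custom index labels

print("Original df")
print(df)

print("============================")

new_df = df.sort_index(ascending=True)
print("Ascending order")
print(new_df)

print("============================")

new_df = df.sort_index(ascending=False)
print("Descending order")
print(new_df)

Output: the ascending result lists the rows in the order s1, s2, s3, s4, and the descending result lists them in the order s4, s3, s2, s1.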

sort_values(): sorts by the values of one or more columns and returns a new Pandas DataFrame. The following example sorts by the math column:

import pandas as pd


grades = {
    "name": ["Mike", "Sherry", "Cindy", "John"],
    "math": [80, 75, 93, 86],
    "chinese": [63, 90, 85, 70]
}

df = pd.DataFrame(grades)

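print("Original df")
print(df)

print("============================")

new_df = df.sort_values(["math"], ascending=True)
print("Ascending order")
print(new_df)

print("============================")

new_df = df.sort_values(["math"], ascending=False)
print("Descending order")
print(new_df)

Output: the ascending result lists Sherry (75), Mike (80), John (86), and Cindy (93); the descending result reverses that order.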
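
sort_values() accepts several columns, with one ascending flag per column, so ties in the first column are broken by the second. The sorted result keeps the original row labels; chaining reset_index(drop=True) renumbers them from 0. The following is a minimal sketch in which Cindy's math score is changed to 80 (an illustrative tweak) to create a tie with Mike:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Mike", "Sherry", "Cindy", "John"],
    "math": [80, 75, 80, 86],
    "chinese": [63, 90, 85, 70],
})

# Sort by math descending, then by chinese descending to break ties
ranked = (
    df.sort_values(["math", "chinese"], ascending=[False, False])
      .reset_index(drop=True)  # renumber the rows from 0
)
print(ranked)
```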