10 Pandas One-Liners for Data Cleaning

Cleaning data doesn't have to be complicated. Mastering Python one-liners for data cleaning can dramatically speed up your workflow and keep your code tidy. This post focuses on the most useful Python one-liners for data cleaning, helping you handle missing values, duplicate records, formatting problems, and more in a single line of code. We'll walk through Pandas one-liner examples suited to both beginners and experts, and you'll also get to know the essential Python data-cleaning libraries that make preprocessing efficient and intuitive. Ready to clean data smarter, not harder? Let's dive into these compact but powerful one-liners!

Why Does Data Cleaning Matter?

Before diving into the cleaning process, it's essential to understand why data cleaning is critical for accurate analysis and machine learning. Raw datasets are often messy: missing values, duplicates, and inconsistent formatting can distort results. Proper data cleaning lays a reliable foundation for analysis and improves algorithm performance and the quality of insights.

The one-liners we'll explore solve common data problems with minimal code, making preprocessing faster and more efficient. Let's now walk through the steps to turn a dataset into a clean, analysis-ready form.

One-Liner Solutions for Data Cleaning

1. Handle Missing Data with dropna()

Real-world datasets are rarely spotless. One of the most common problems you'll face is missing values, whether caused by errors in data collection, dataset merges, or manual entry. Fortunately, Pandas provides a simple yet powerful method to handle this: dropna().

dropna() accepts several parameters. Let's explore how to get the most out of it.

  1. axis

Specifies whether to drop rows or columns:

  • axis=0: drop rows (default)
  • axis=1: drop columns

Code:

df.dropna(axis=0)  # Drops rows
df.dropna(axis=1)  # Drops columns
  2. how

Defines the condition for dropping:

  • how='any': drop if any value is missing (default)
  • how='all': drop only if all values are missing

Code:

df.dropna(how='any')   # Drop if at least one NaN
df.dropna(how='all')   # Drop only if all values are NaN
  3. thresh

Specifies the minimum number of non-NaN values required to keep a row/column.

Code:

df.dropna(thresh=3)  # Keep rows with at least 3 non-NaN values

Note: how and thresh cannot be used at the same time.

  4. subset

Applies the condition only to specific columns (or rows, if axis=1).

Code:

df.dropna(subset=['col1', 'col2'])  # Drop rows if NaN in col1 or col2
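Putting the four parameters together, here is a minimal, runnable sketch on a toy DataFrame (the column names and values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy data: row 1 is entirely NaN, the others are partially filled
df = pd.DataFrame({
    'col1': [1.0, np.nan, 3.0, np.nan],
    'col2': [np.nan, np.nan, 30.0, 40.0],
    'col3': [100.0, np.nan, 300.0, 400.0],
})

no_nan = df.dropna()                       # only rows with no NaN at all survive
not_all_nan = df.dropna(how='all')         # drops only the row that is entirely NaN
enough_values = df.dropna(thresh=2)        # keeps rows with at least 2 non-NaN values
col1_present = df.dropna(subset=['col1'])  # drops rows where col1 is NaN
```

Each call returns a new DataFrame; pass inplace=True if you want to modify df directly.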

2. Handle Missing Data with fillna()

Instead of dropping missing data, you can fill it in with Pandas' fillna() method. This is especially useful when you want to impute values rather than lose data.

Let's look at how to use fillna() with its different parameters.

  1. value

Specifies a scalar, dictionary, Series, or computed value (such as the mean, median, or mode) to fill in missing data.

Code:

df.fillna(0)  # Fill all NaNs with 0
df.fillna({'col1': 0, 'col2': 99})  # Fill col1 with 0, col2 with 99
# Fill with mean, median, or mode of a column
df['col1'].fillna(df['col1'].mean(), inplace=True)
df['col2'].fillna(df['col2'].median(), inplace=True)
df['col3'].fillna(df['col3'].mode()[0], inplace=True)  # Mode returns a Series
  2. method

Propagates non-null values forward or backward:

  • 'ffill' or 'pad': forward fill
  • 'bfill' or 'backfill': backward fill

Code:

df.fillna(method='ffill')  # Fill forward
df.fillna(method='bfill')  # Fill backward
  3. axis

Chooses the fill direction:

  • axis=0: fill down (row-wise, default)
  • axis=1: fill across (column-wise)

Code:

df.fillna(method='ffill', axis=0)  # Fill down
df.fillna(method='bfill', axis=1)  # Fill across
  4. limit

The maximum number of consecutive NaNs to fill during a forward/backward fill.

Code:

df.fillna(method='ffill', limit=1)  # Fill at most 1 NaN in a row/column
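As a quick, hedged illustration on toy data, here is how value-based and forward filling behave; note that recent pandas versions prefer df.ffill() over fillna(method='ffill'):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [np.nan, 2.0, np.nan]})

filled = df.fillna(0)                    # every NaN becomes 0
forward = df.ffill()                     # propagate the last valid value downward
mean_a = df['a'].fillna(df['a'].mean())  # impute column 'a' with its mean (2.0 here)
```

A leading NaN (like the first value of 'b') has nothing to fill forward from, so it stays NaN after ffill().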

3. Remove Duplicates with drop_duplicates()

Remove duplicate rows from a dataset effortlessly with the drop_duplicates() function, ensuring clean and unique data in a single line of code.

Let's explore how to use drop_duplicates() with its different parameters.

  1. subset

Specifies particular columns in which to look for duplicates.

  • Default: checks all columns
  • Accepts a single column or a list of columns

Code:

df.drop_duplicates(subset='col1')            # Check duplicates only in 'col1'
df.drop_duplicates(subset=['col1', 'col2'])  # Check based on multiple columns
  2. keep

Decides which duplicate to keep:

  • 'first' (default): keep the first occurrence
  • 'last': keep the last occurrence
  • False: drop all duplicates

Code:

df.drop_duplicates(keep='first')  # Keep first duplicate
df.drop_duplicates(keep='last')   # Keep last duplicate
df.drop_duplicates(keep=False)    # Drop all duplicates
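A small sketch (made-up values) showing how subset and keep interact:

```python
import pandas as pd

df = pd.DataFrame({'col1': ['a', 'a', 'b'], 'col2': [1, 2, 2]})

first = df.drop_duplicates(subset='col1')              # keeps row 0 ('a') and row 2 ('b')
last = df.drop_duplicates(subset='col1', keep='last')  # keeps row 1 ('a') and row 2 ('b')
unique_only = df.drop_duplicates(subset='col1', keep=False)  # only 'b' is never duplicated
```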

4. Replace Specific Values with replace()

You can use replace() to substitute specific values in a DataFrame or Series.

Code:

import numpy as np

# Replace a single value
df.replace(0, np.nan)
# Replace multiple values
df.replace([0, -1], np.nan)
# Replace with a dictionary
df.replace({'A': {'old': 'new'}, 'B': {1: 100}})
# Replace in place
df.replace('missing', np.nan, inplace=True)

5. Change Data Types with astype()

Changing a column's data type helps ensure correct operations and memory efficiency.

Code:

df['Age'] = df['Age'].astype(int)        # Convert to integer
df['Price'] = df['Price'].astype(float)  # Convert to float
df['Date'] = pd.to_datetime(df['Date'])  # Convert to datetime
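A minimal demonstration with strings parsed into integers and dates (hypothetical values):

```python
import pandas as pd

df = pd.DataFrame({'Age': ['25', '40'], 'Date': ['2024-01-15', '2024-02-20']})

df['Age'] = df['Age'].astype(int)        # '25' -> 25
df['Date'] = pd.to_datetime(df['Date'])  # strings -> datetime64, enabling .dt accessors
```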

6. Remove Whitespace from Strings with str.strip()

Unwanted leading or trailing whitespace in string values can cause problems with sorting, comparison, and grouping. The str.strip() method removes these spaces efficiently.

Code:

df['col'].str.lstrip()  # Removes leading spaces
df['col'].str.rstrip()  # Removes trailing spaces
df['col'].str.strip()   # Removes both leading & trailing
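On a toy Series, the three variants differ only in which side they trim:

```python
import pandas as pd

s = pd.Series(['  hello ', ' world  '])

stripped = s.str.strip()   # trims both sides
left = s.str.lstrip()      # trims the left side only; trailing spaces remain
```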

7. Clean and Extract Column Values

You can clean column values by removing unwanted characters or extracting specific patterns with regular expressions.

Code:

# Remove punctuation
df['col'] = df['col'].str.replace(r'[^\w\s]', '', regex=True)
# Extract the username part before '@' in an email address
df['email_user'] = df['email'].str.extract(r'(^[^@]+)')
# Extract the 4-digit year from a date string
df['year'] = df['date'].str.extract(r'(\d{4})')
# Extract the first hashtag from a tweet
df['hashtag'] = df['tweet'].str.extract(r'#(\w+)')
# Extract phone numbers in the format 123-456-7890
df['phone'] = df['contact'].str.extract(r'(\d{3}-\d{3}-\d{4})')
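For example, str.extract returns the first regex capture group as a new column; here is a sketch on toy Series (values invented for illustration):

```python
import pandas as pd

emails = pd.Series(['alice@example.com', 'bob@test.org'])
users = emails.str.extract(r'(^[^@]+)')[0]  # everything before the '@'

# Year extraction keeps the digits as strings, not integers
years = pd.Series(['born 1990', 'since 2024']).str.extract(r'(\d{4})')[0]
```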

8. Map and Replace Values

You can map or replace specific values in a column to standardize or transform data.

Code:

df['Gender'] = df['Gender'].map({'M': 'Male', 'F': 'Female'})
df['Rating'] = df['Rating'].map({1: 'Bad', 2: 'Okay', 3: 'Good'})
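One caveat worth demonstrating: map() turns any value missing from the dictionary into NaN (toy data):

```python
import pandas as pd

gender = pd.Series(['M', 'F', 'X']).map({'M': 'Male', 'F': 'Female'})
# 'X' has no mapping, so it becomes NaN rather than staying 'X'
```

If you want unmapped values left untouched, use replace() with a dictionary instead.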

9. Handle Outliers

Outliers can distort statistical analysis and model performance. Here are common ways to handle them:

  1. Z-score method

Code:

import numpy as np
from scipy import stats

# Keep only numeric columns, remove rows where any z-score > 3
df = df[(np.abs(stats.zscore(df.select_dtypes(include=[np.number]))) < 3).all(axis=1)]
  2. Clipping outliers (capping to a range)

Code:

df['col'].clip(lower=df['col'].quantile(0.05), upper=df['col'].quantile(0.95))
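Here is a small, pandas-only sketch of clipping in action; the extreme value 100 is capped at the 95th percentile rather than dropped (toy numbers):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 100])
clipped = s.clip(lower=s.quantile(0.05), upper=s.quantile(0.95))
# values inside the 5th-95th percentile range are untouched; only the tails are capped
```

Clipping keeps every row, which is useful when you cannot afford to lose observations, while the z-score method removes the offending rows entirely.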

10. Apply Functions with Lambda

Lambda functions, used with apply(), let you quickly transform or manipulate data in a column. The lambda performs the transformation, while apply() applies it across the entire column.

Code:

df['col'] = df['col'].apply(lambda x: x.strip().lower())  # Removes extra spaces and converts text to lowercase
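A minimal runnable version of the same idea (toy Series; note that this lambda assumes every value is a string, so guard with isinstance if the column may contain NaN):

```python
import pandas as pd

s = pd.Series(['  Hello ', ' WORLD'])
cleaned = s.apply(lambda x: x.strip().lower())  # trim whitespace, then lowercase
```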

Problem Statement

Now that you've seen these Python one-liners, let's look at a problem statement and try to solve it. You've been given a customer dataset from an online retail platform. The data has the following issues:

  • Missing values in columns such as Email, Age, Tweet, and Phone.
  • Duplicate entries (e.g., identical name and email).
  • Inconsistent formatting (e.g., whitespace in names, the string "missing").
  • Data type problems (e.g., invalid Join_Date values).
  • Outliers in Age and Purchase_Amount.
  • Text data that needs regex cleaning and extraction (e.g., extracting hashtags from tweets and usernames from emails).

Your task is to demonstrate how to clean this dataset.

Solution

For the complete solution, refer to this Google Colab notebook. It walks you through every step needed to clean the dataset effectively with Python and pandas.

Follow the steps below to clean the dataset:

  1. Drop rows where all values are missing

df.dropna(how='all', inplace=True)
  2. Standardize placeholder text such as 'missing' or 'not available' to NaN

df.replace(['missing', 'not available', 'NaN'], np.nan, inplace=True)
  3. Fill in missing values

df['Age'] = df['Age'].fillna(df['Age'].median())
df['Email'] = df['Email'].fillna('unknown@example.com')
df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])
df['Purchase_Amount'] = df['Purchase_Amount'].fillna(df['Purchase_Amount'].median())
df['Join_Date'] = df['Join_Date'].fillna(method='ffill')
df['Tweet'] = df['Tweet'].fillna('No tweet')
df['Phone'] = df['Phone'].fillna('000-000-0000')
  4. Remove duplicates

df.drop_duplicates(inplace=True)
  5. Strip whitespace and normalize text fields

df['Name'] = df['Name'].apply(lambda x: x.strip().lower() if isinstance(x, str) else x)
df['Feedback'] = df['Feedback'].str.replace(r'[^\w\s]', '', regex=True)
  6. Convert data types

df['Age'] = df['Age'].astype(int)
df['Purchase_Amount'] = df['Purchase_Amount'].astype(float)
df['Join_Date'] = pd.to_datetime(df['Join_Date'], errors='coerce')
  7. Fix invalid values

df = df[df['Age'].between(10, 100)]  # realistic age
df = df[df['Purchase_Amount'] > 0]   # remove negative or zero purchases
  8. Remove outliers using z-scores

numeric_cols = df[['Age', 'Purchase_Amount']]
z_scores = np.abs(stats.zscore(numeric_cols))
df = df[(z_scores < 3).all(axis=1)]
  9. Regex extraction

df['Email_Username'] = df['Email'].str.extract(r'^([^@]+)')
df['Join_Year'] = df['Join_Date'].astype(str).str.extract(r'(\d{4})')
df['Formatted_Phone'] = df['Phone'].str.extract(r'(\d{3}-\d{3}-\d{4})')
  10. Final cleanup of 'Name'

df['Name'] = df['Name'].apply(lambda x: x if isinstance(x, str) else 'unknown')

Dataset before cleaning

Dataset after cleaning

Summary

Data cleaning is a critical step in any data analysis or machine learning project. By mastering these powerful Python one-liners for data cleaning, you can streamline your preprocessing workflow and ensure your data is accurate, consistent, and ready for analysis. From handling missing values and duplicates to removing outliers and fixing formatting issues, these one-liners let you clean data efficiently without writing lengthy code. By harnessing the power of Pandas and regular expressions, you can keep your code clean, concise, and easy to maintain. Whether you're a beginner or an expert, these methods will help you clean data smarter and faster.
