資料清洗的10個Pandas單行程式碼技巧

清理資料並不複雜。掌握用於資料清理的 Python 單行程式可以大大加快工作流程，並保持程式碼的整潔。本部落格將重點介紹用於資料清理的最有用的 Python 單行程式碼，幫助您在一行程式碼中處理缺失值、重複資料、格式化問題等。我們將探討適合初學者和專家的 Pandas 資料清理單行示例。您還將發現基本的 Python 資料清理庫，讓預處理變得高效而直觀。準備好以更智慧而非更困難的方式清理資料了嗎？讓我們深入瞭解緊湊而強大的單行程式碼！

資料清理為何重要？

在深入瞭解資料清理流程之前，瞭解資料清理為何是準確分析和機器學習的關鍵至關重要。原始資料集通常比較混亂，其中的缺失值、重複資料和不一致的格式可能會扭曲結果。正確的資料清理可確保為分析奠定可靠的基礎，提高演算法效能和洞察力。

我們將探討的單行程式碼能以最少的程式碼解決常見的資料問題，使資料預處理更快、更高效。現在，讓我們來看看清理資料集的步驟，輕鬆地將資料集轉換為乾淨、分析就緒的形式。

資料清理的單行解決方案

1. 使用dropna()處理缺失資料

現實世界中的資料集很少是完美無缺的。你會面臨的最常見問題之一就是缺失值，無論是由於資料收集、資料集合並中的錯誤，還是手動輸入造成的。幸運的是，Pandas 提供了一個簡單而強大的方法來處理這個問題：dropna()。

但是 dropna() 可以使用多個引數。讓我們來探索如何充分利用它。

axis

指定刪除行還是列：

axis=0: 丟棄行（預設值）
axis=1: 刪除列

程式碼：

df.dropna(axis=0) # Drops rows

df.dropna(axis=1) # Drops columns

df.dropna(axis=0) # Drops rows df.dropna(axis=1) # Drops columns

df.dropna(axis=0)  # Drops rows
df.dropna(axis=1)  # Drops columns

定義要刪除的條件：

how=’any’: 如果缺少任何值，則刪除（預設）
how=’all’: 僅當所有值都缺失時才丟棄

程式碼：

df.dropna(how='any') # Drop if at least one NaN

df.dropna(how='all') # Drop only if all values are NaN

df.dropna(how='any') # Drop if at least one NaN df.dropna(how='all') # Drop only if all values are NaN

df.dropna(how='any')   # Drop if at least one NaN
df.dropna(how='all')   # Drop only if all values are NaN

thresh

指定保留行/列所需的最小非 NAN 值個數。

程式碼：

df.dropna(thresh=3) # Keep rows with at least 3 non-NaN values

df.dropna(thresh=3)  # Keep rows with at least 3 non-NaN values

注意：不能同時使用 how 和 thresh。

subset

僅將條件應用於特定列（或行，如果軸=1）。

程式碼：

df.dropna(subset=['col1', 'col2']) # Drop rows if NaN in col1 or col2#import csv

df.dropna(subset=['col1', 'col2'])  # Drop rows if NaN in col1 or col2#import csv

2. 使用fillna()處理缺失資料

您可以使用 Pandas 的 fillna() 方法填補缺失資料，而不是丟棄缺失資料。當你想估算值而不是丟失資料時，這個方法尤其有用。

讓我們來探討一下如何使用不同引數的 fillna() 方法。

subset

指定一個標量、字典、數列或計算值（如平均值、中位數或模式）來填補缺失資料。

程式碼：

df.fillna(0) # Fill all NaNs with 0

df.fillna({'col1': 0, 'col2': 99}) # Fill col1 with 0, col2 with 99

# Fill with mean, median, or mode of a column

df['col1'].fillna(df['col1'].mean(), inplace=True)

df['col2'].fillna(df['col2'].median(), inplace=True)

df['col3'].fillna(df['col3'].mode()[0], inplace=True) # Mode returns a Series

df.fillna(0) # Fill all NaNs with 0 df.fillna({'col1': 0, 'col2': 99}) # Fill col1 with 0, col2 with 99 # Fill with mean, median, or mode of a column df['col1'].fillna(df['col1'].mean(), inplace=True) df['col2'].fillna(df['col2'].median(), inplace=True) df['col3'].fillna(df['col3'].mode()[0], inplace=True) # Mode returns a Series

df.fillna(0)  # Fill all NaNs with 0
df.fillna({'col1': 0, 'col2': 99})  # Fill col1 with 0, col2 with 99
# Fill with mean, median, or mode of a column
df['col1'].fillna(df['col1'].mean(), inplace=True)
df['col2'].fillna(df['col2'].median(), inplace=True)
df['col3'].fillna(df['col3'].mode()[0], inplace=True)  # Mode returns a Series

method

用於向前或向後傳播非空值：

‘ffill’ 或 ‘pad’：正向填充
‘bfill’ 或 ‘backfill’：向後填充

程式碼：

df.fillna(method='ffill') # Fill forward

df.fillna(method='bfill') # Fill backward

df.fillna(method='ffill') # Fill forward df.fillna(method='bfill') # Fill backward

df.fillna(method='ffill')  # Fill forward
df.fillna(method='bfill')  # Fill backward

axis

選擇填充方向：

axis=0：向下填充（行方向，預設值）
axis=1：橫向填充（列向填充）

程式碼：

df.fillna(method='ffill', axis=0) # Fill down

df.fillna(method='bfill', axis=1) # Fill across

df.fillna(method='ffill', axis=0) # Fill down df.fillna(method='bfill', axis=1) # Fill across

df.fillna(method='ffill', axis=0)  # Fill down
df.fillna(method='bfill', axis=1)  # Fill across

limit

前向/後向填充中最多可填充的 NaN 個數。

程式碼：

df.fillna(method='ffill', limit=1) # Fill at most 1 NaN in a row/column#import csv

df.fillna(method='ffill', limit=1)  # Fill at most 1 NaN in a row/column#import csv

3. 使用drop_duplicates()刪除重複值

使用 drop_duplicates() 函式輕鬆刪除資料集中的重複行，只需一行程式碼就能確保資料的乾淨和唯一性。

讓我們探討如何使用不同的引數來使用 drop_dupliucates

subset

指定查詢重複資料的特定列。

預設值：檢查所有列
使用單列或列列表

程式碼 ：

df.drop_duplicates(subset='col1') # Check duplicates only in 'col1'

df.drop_duplicates(subset=['col1', 'col2']) # Check based on multiple columns

df.drop_duplicates(subset='col1') # Check duplicates only in 'col1' df.drop_duplicates(subset=['col1', 'col2']) # Check based on multiple columns

df.drop_duplicates(subset='col1')         # Check duplicates only in 'col1'
df.drop_duplicates(subset=['col1', 'col2'])  # Check based on multiple columns

keep

決定保留哪個副本：

‘first’ (default)：保留第一次出現的副本
‘last’：保留最後一次重複
False：刪除所有重複

程式碼：

df.drop_duplicates(keep='first') # Keep first duplicate

df.drop_duplicates(keep='last') # Keep last duplicate

df.drop_duplicates(keep=False) # Drop all duplicates

df.drop_duplicates(keep='first') # Keep first duplicate df.drop_duplicates(keep='last') # Keep last duplicate df.drop_duplicates(keep=False) # Drop all duplicates

df.drop_duplicates(keep='first')  # Keep first duplicate
df.drop_duplicates(keep='last')   # Keep last duplicate
df.drop_duplicates(keep=False)    # Drop all duplicates

4. 使用replace()替換特定值

您可以使用 replace() 替換 DataFrame 或 Series 中的特定值。

程式碼：

# Replace a single value

df.replace(0, np.nan)

# Replace multiple values

df.replace([0, -1], np.nan)

# Replace with dictionary

df.replace({'A': {'old': 'new'}, 'B': {1: 100}})

# Replace in-place

df.replace('missing', np.nan, inplace=True)#import csv

# Replace a single value df.replace(0, np.nan) # Replace multiple values df.replace([0, -1], np.nan) # Replace with dictionary df.replace({'A': {'old': 'new'}, 'B': {1: 100}}) # Replace in-place df.replace('missing', np.nan, inplace=True)#import csv

# Replace a single value
df.replace(0, np.nan)
# Replace multiple values
df.replace([0, -1], np.nan)
# Replace with dictionary
df.replace({'A': {'old': 'new'}, 'B': {1: 100}})
# Replace in-place
df.replace('missing', np.nan, inplace=True)#import csv

5. 使用astype()更改資料型別

更改列的資料型別有助於確保正確操作和記憶體效率。

程式碼：

df['Age'] = df['Age'].astype(int) # Convert to integer

df['Price'] = df['Price'].astype(float) # Convert to float

df['Date'] = pd.to_datetime(df['Date']) # Convert to datetime

df['Age'] = df['Age'].astype(int) # Convert to integer df['Price'] = df['Price'].astype(float) # Convert to float df['Date'] = pd.to_datetime(df['Date']) # Convert to datetime

df['Age'] = df['Age'].astype(int)         # Convert to integer
df['Price'] = df['Price'].astype(float)   # Convert to float
df['Date'] = pd.to_datetime(df['Date'])   # Convert to datetime

6. 使用str.strip()刪除字串中的空格

在資料集中，字串值中不需要的前導空格或尾部空格會導致排序、比較或分組問題。str.strip() 方法可以有效地刪除這些空格。

程式碼 ：

df['col'].str.lstrip() # Removes leading spaces

df['col'].str.rstrip() # Removes trailing spaces

df['col'].str.strip() # Removes both leading & trailing

df['col'].str.lstrip() # Removes leading spaces df['col'].str.rstrip() # Removes trailing spaces df['col'].str.strip() # Removes both leading & trailing

df['col'].str.lstrip()   # Removes leading spaces
df['col'].str.rstrip()   # Removes trailing spaces
df['col'].str.strip()    # Removes both leading & trailing

7.清理和提取列值

透過刪除不需要的字元或使用正規表示式提取特定模式，可以清理列值。

程式碼：

# Remove punctuation

df['col'] = df['col'].str.replace(r'[^\w\s]', '', regex=True)

# Extract the username part before '@' in an email address

df['email_user'] = df['email'].str.extract(r'(^[^@]+)')

# Extract the 4-digit year from a date string

df['year'] = df['date'].str.extract(r'(\d{4})')

# Extract the first hashtag from a tweet

df['hashtag'] = df['tweet'].str.extract(r'#(\w+)')

# Extract phone numbers in the format 123-456-7890

df['phone'] = df['contact'].str.extract(r'(\d{3}-\d{3}-\d{4})')

# Remove punctuation df['col'] = df['col'].str.replace(r'[^\w\s]', '', regex=True) # Extract the username part before '@' in an email address df['email_user'] = df['email'].str.extract(r'(^[^@]+)') # Extract the 4-digit year from a date string df['year'] = df['date'].str.extract(r'(\d{4})') # Extract the first hashtag from a tweet df['hashtag'] = df['tweet'].str.extract(r'#(\w+)') # Extract phone numbers in the format 123-456-7890 df['phone'] = df['contact'].str.extract(r'(\d{3}-\d{3}-\d{4})')

 # Remove punctuation
df['col'] = df['col'].str.replace(r'[^\w\s]', '', regex=True) 
# Extract the username part before '@' in an email address
df['email_user'] = df['email'].str.extract(r'(^[^@]+)')
# Extract the 4-digit year from a date string
df['year'] = df['date'].str.extract(r'(\d{4})')
# Extract the first hashtag from a tweet
df['hashtag'] = df['tweet'].str.extract(r'#(\w+)')
# Extract phone numbers in the format 123-456-7890
df['phone'] = df['contact'].str.extract(r'(\d{3}-\d{3}-\d{4})')

8. 對映和替換值

您可以對映或替換列中的特定值，以規範或轉換資料。

程式碼 ：

df['Gender'] = df['Gender'].map({'M': 'Male', 'F': 'Female'})

df['Rating'] = df['Rating'].map({1: 'Bad', 2: 'Okay', 3: 'Good'})

df['Gender'] = df['Gender'].map({'M': 'Male', 'F': 'Female'}) df['Rating'] = df['Rating'].map({1: 'Bad', 2: 'Okay', 3: 'Good'})

df['Gender'] = df['Gender'].map({'M': 'Male', 'F': 'Female'})
df['Rating'] = df['Rating'].map({1: 'Bad', 2: 'Okay', 3: 'Good'})

9.處理異常值

異常值會扭曲統計分析和模型效能。以下是處理異常值的常用方法：

Z-score 法

程式碼：

# Keep only numeric columns, remove rows where any z-score > 3

df = df[(np.abs(stats.zscore(df.select_dtypes(include=[np.number]))) < 3).all(axis=1)]

# Keep only numeric columns, remove rows where any z-score > 3 df = df[(np.abs(stats.zscore(df.select_dtypes(include=[np.number]))) < 3).all(axis=1)]

# Keep only numeric columns, remove rows where any z-score > 3
df = df[(np.abs(stats.zscore(df.select_dtypes(include=[np.number]))) < 3).all(axis=1)]

剪下異常值（封頂至一定範圍）

程式碼：

df['col'].clip(lower=df['col'].quantile(0.05),upper=df['col'].quantile(0.95))

df['col'].clip(lower=df['col'].quantile(0.05),upper=df['col'].quantile(0.95))

10. 使用Lambda應用函式

Lambda 函式與 apply() 配合使用，可快速轉換或操作列中的資料。lambda 函式起轉換作用，而 apply() 則將其應用於整個列。

程式碼：

df['col'] = df['col'].apply(lambda x: x.strip().lower()) # Removes extra spaces and converts text to lowercase

df['col'] = df['col'].apply(lambda x: x.strip().lower())   # Removes extra spaces and converts text to lowercase

問題陳述

現在，您已經瞭解了這些 Python 單行程式碼，讓我們來看看問題陳述並嘗試解決它。您從一個線上零售平臺獲得了一個客戶資料集。資料存在以下問題

電子郵件、年齡、推特和電話等列中存在缺失值。
重複條目（如相同的姓名和電子郵件）。
格式不一致（如姓名中的空白、字串中的“missing”）。
資料型別問題（例如，Join_Date 值無效）。
Age 和 Purchase_Amount 中的異常值。
需要使用 regex 清理和提取的文字資料（例如，從 Tweet 中提取標籤，從電子郵件中提取使用者名稱）。

您的任務是演示如何清理該資料集。

解決方案

有關完整的解決方案，請參閱此Google Colab notebook。它將引導您完成使用 Python 和 pandas 有效清理資料集所需的每個步驟。

按照以下說明清理資料集

刪除所有值缺失的行

df.dropna(how='all', inplace=True)

df.dropna(how='all', inplace=True)

將 ‘missing’ 或 ‘not available’ 等佔位符文字標準化為 NaN

df.replace(['missing', 'not available', 'NaN'], np.nan, inplace=True)

df.replace(['missing', 'not available', 'NaN'], np.nan, inplace=True)

填補缺失值

df['Age'] = df['Age'].fillna(df['Age'].median())

df['Email'] = df['Email'].fillna('unknown@example.com')

df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])

df['Purchase_Amount'] = df['Purchase_Amount'].fillna(df['Purchase_Amount'].median())

df['Join_Date'] = df['Join_Date'].fillna(method='ffill')

df['Tweet'] = df['Tweet'].fillna('No tweet')

df['Phone'] = df['Phone'].fillna('000-000-0000')

df['Age'] = df['Age'].fillna(df['Age'].median()) df['Email'] = df['Email'].fillna('unknown@example.com') df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0]) df['Purchase_Amount'] = df['Purchase_Amount'].fillna(df['Purchase_Amount'].median()) df['Join_Date'] = df['Join_Date'].fillna(method='ffill') df['Tweet'] = df['Tweet'].fillna('No tweet') df['Phone'] = df['Phone'].fillna('000-000-0000')

df['Age'] = df['Age'].fillna(df['Age'].median())
df['Email'] = df['Email'].fillna('unknown@example.com')
df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])
df['Purchase_Amount'] = df['Purchase_Amount'].fillna(df['Purchase_Amount'].median())
df['Join_Date'] = df['Join_Date'].fillna(method='ffill')
df['Tweet'] = df['Tweet'].fillna('No tweet')
df['Phone'] = df['Phone'].fillna('000-000-0000')

刪除重複內容

df.drop_duplicates(inplace=True)

df.drop_duplicates(inplace=True)

刪除空格，規範文字欄位

df['Name'] = df['Name'].apply(lambda x: x.strip().lower() if isinstance(x, str) else x)

df['Feedback'] = df['Feedback'].str.replace(r'[^\w\s]', '', regex=True)

df['Name'] = df['Name'].apply(lambda x: x.strip().lower() if isinstance(x, str) else x) df['Feedback'] = df['Feedback'].str.replace(r'[^\w\s]', '', regex=True)

df['Name'] = df['Name'].apply(lambda x: x.strip().lower() if isinstance(x, str) else x)
df['Feedback'] = df['Feedback'].str.replace(r'[^\w\s]', '', regex=True)

轉換資料型別

df['Age'] = df['Age'].astype(int)

df['Purchase_Amount'] = df['Purchase_Amount'].astype(float)

df['Join_Date'] = pd.to_datetime(df['Join_Date'], errors='coerce')

df['Age'] = df['Age'].astype(int) df['Purchase_Amount'] = df['Purchase_Amount'].astype(float) df['Join_Date'] = pd.to_datetime(df['Join_Date'], errors='coerce')

df['Age'] = df['Age'].astype(int)
df['Purchase_Amount'] = df['Purchase_Amount'].astype(float)
df['Join_Date'] = pd.to_datetime(df['Join_Date'], errors='coerce')

修復無效值

df = df[df['Age'].between(10, 100)] # realistic age

df = df[df['Purchase_Amount'] > 0] # remove negative or zero purchases

df = df[df['Age'].between(10, 100)] # realistic age df = df[df['Purchase_Amount'] > 0] # remove negative or zero purchases

df = df[df['Age'].between(10, 100)]  # realistic age
df = df[df['Purchase_Amount'] > 0]   # remove negative or zero purchases

利用 Z 值去除離群值

numeric_cols = df[['Age', 'Purchase_Amount']]

z_scores = np.abs(stats.zscore(numeric_cols))

df = df[(z_scores < 3).all(axis=1)]

numeric_cols = df[['Age', 'Purchase_Amount']] z_scores = np.abs(stats.zscore(numeric_cols)) df = df[(z_scores < 3).all(axis=1)]

numeric_cols = df[['Age', 'Purchase_Amount']]
z_scores = np.abs(stats.zscore(numeric_cols))
df = df[(z_scores < 3).all(axis=1)]

Regex 提取

df['Email_Username'] = df['Email'].str.extract(r'^([^@]+)')

df['Join_Year'] = df['Join_Date'].astype(str).str.extract(r'(\d{4})')

df['Formatted_Phone'] = df['Phone'].str.extract(r'(\d{3}-\d{3}-\d{4})')

df['Email_Username'] = df['Email'].str.extract(r'^([^@]+)') df['Join_Year'] = df['Join_Date'].astype(str).str.extract(r'(\d{4})') df['Formatted_Phone'] = df['Phone'].str.extract(r'(\d{3}-\d{3}-\d{4})')

df['Email_Username'] = df['Email'].str.extract(r'^([^@]+)')
df['Join_Year'] = df['Join_Date'].astype(str).str.extract(r'(\d{4})')
df['Formatted_Phone'] = df['Phone'].str.extract(r'(\d{3}-\d{3}-\d{4})')

‘Name’ 的最後清理

df['Name'] = df['Name'].apply(lambda x: x if isinstance(x, str) else 'unknown')

df['Name'] = df['Name'].apply(lambda x: x if isinstance(x, str) else 'unknown')

清理前的資料集

清理後的資料集

小結

清理資料是任何資料分析或機器學習專案的關鍵步驟。透過掌握這些功能強大的 Python 資料清理單行程式，您可以簡化資料預處理工作流程，確保資料準確、一致，併為分析做好準備。從處理缺失值和重複值到刪除異常值和格式化問題，這些單行程式碼讓您無需編寫冗長的程式碼就能高效地清理資料。利用 Pandas 和正規表示式的強大功能，您可以保持程式碼乾淨、簡潔並易於維護。無論您是初學者還是專家，這些方法都能幫助您更智慧、更快速地清理資料。

Pandas python 正規表示式資料清理

資料清洗的10個Pandas單行程式碼技巧

資料清理為何重要？

資料清理的單行解決方案

1. 使用dropna()處理缺失資料

2. 使用fillna()處理缺失資料

3. 使用drop_duplicates()刪除重複值

4. 使用replace()替換特定值

5. 使用astype()更改資料型別

6. 使用str.strip()刪除字串中的空格

7.清理和提取列值

8. 對映和替換值

9.處理異常值

10. 使用Lambda應用函式

問題陳述

解決方案

清理前的資料集

清理後的資料集

小結

評論留言

取消回覆

文章目录

資料清洗的10個Pandas單行程式碼技巧

資料清理為何重要？

資料清理的單行解決方案

1. 使用dropna()處理缺失資料

2. 使用fillna()處理缺失資料

3. 使用drop_duplicates()刪除重複值

4. 使用replace()替換特定值

5. 使用astype()更改資料型別

6. 使用str.strip()刪除字串中的空格

7.清理和提取列值

8. 對映和替換值

9.處理異常值

10. 使用Lambda應用函式

問題陳述

解決方案

清理前的資料集

清理後的資料集

小結

相關文章

評論留言

取消回覆

文章目录