Data Manipulation
Data Cleaning
Handle missing values, outliers, and data quality issues that plague real-world datasets.
The Dirty Reality of Data
Real-world data is almost always messy. Data scientists commonly report spending 60-80% of their time on data cleaning, also called data wrangling or data munging.
Common Issues
- Missing values: NaN, null, empty strings
- Outliers: Extreme values that may be errors or genuine
- Duplicates: Same record entered multiple times
- Inconsistent formatting: "New York", "new york", "NY"
- Wrong data types: Numbers stored as strings
- Incorrect values: Age = -5, 200% discount
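Most of these issues can be surfaced with a quick first-pass audit before any cleaning. A minimal sketch (the DataFrame and its values are illustrative, not from any real dataset):

```python
import pandas as pd

# Hypothetical data exhibiting several of the issues above
df = pd.DataFrame({
    'age': ['25', '-5', '30', '25'],                     # numbers stored as strings; impossible value
    'city': ['New York', 'new york', 'NY', 'New York'],  # inconsistent formatting
})

print(df.dtypes)  # age shows up as object, not a numeric dtype

# Coerce to numeric; anything unparseable becomes NaN
df['age'] = pd.to_numeric(df['age'], errors='coerce')

print((df['age'] < 0).sum())  # count impossible values
print(df.duplicated().sum())  # count exact duplicate rows
print(df['city'].nunique())   # three spellings for what may be one city
```

`df.describe()` and `df['city'].value_counts()` are also worth a look at this stage; they make out-of-range numbers and inconsistent labels obvious.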
Strategies for Missing Values
- Remove: Drop rows when few are affected (< 5%) and the values are missing at random
- Mean/Median imputation: Simple, but ignores relationships
- Mode imputation: For categorical variables
- Prediction-based: Use other columns to predict missing values
- Forward/backward fill: For time series data
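For time series, forward fill carries the last observation forward and backward fill pulls the next one back. A minimal sketch with a hypothetical daily series:

```python
import pandas as pd

# Hypothetical daily readings with gaps
s = pd.Series([1.0, None, None, 4.0, None],
              index=pd.date_range('2024-01-01', periods=5))

filled_fwd = s.ffill()  # carry the last observation forward
filled_bwd = s.bfill()  # pull the next observation backward

print(filled_fwd.tolist())  # [1.0, 1.0, 1.0, 4.0, 4.0]
print(filled_bwd.tolist())  # the trailing gap stays NaN: nothing comes after it
```

Note the asymmetry at the edges: forward fill leaves leading gaps, backward fill leaves trailing ones, so the two are sometimes chained.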
Example
```python
import pandas as pd
import numpy as np

# Create messy dataset
df = pd.DataFrame({
    'age': [25, None, 35, -1, 120, 28, None, 32],
    'salary': [50000, 90000, None, 85000, 75000, None, 60000, 95000],
    'city': ['New York', 'new york', 'NYC', 'Los Angeles', 'LA', 'Chicago', None, 'chicago'],
    'email': ['alice@test.com', 'bob@test.com', 'bob@test.com', 'carol@test.com',
              '', 'dave@test.com', None, 'eve@test.com']
})

# Inspect missing values
print(df.isnull().sum())
print(df.isnull().mean() * 100)  # % missing

# Fix impossible values: keep ages in [0, 100], set the rest to NaN
df['age'] = df['age'].where((df['age'] >= 0) & (df['age'] <= 100))

# Standardize categorical text
df['city'] = df['city'].str.lower().str.strip()
city_mapping = {'nyc': 'new york', 'la': 'los angeles'}
df['city'] = df['city'].replace(city_mapping)

# Treat empty strings as missing
df['email'] = df['email'].replace('', np.nan)

# Remove duplicate rows
print(f"Duplicates: {df.duplicated().sum()}")
df = df.drop_duplicates()

# Impute missing numeric values: median for age (robust to extremes),
# mean for salary
df['age'] = df['age'].fillna(df['age'].median())
df['salary'] = df['salary'].fillna(df['salary'].mean())

# For categorical columns, fill with the most frequent value (mode)
df['city'] = df['city'].fillna(df['city'].mode()[0])

# Detect outliers using the IQR rule: flag values more than
# 1.5 * IQR below Q1 or above Q3
def remove_outliers_iqr(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    return df[(df[column] >= lower) & (df[column] <= upper)]

df_clean = remove_outliers_iqr(df, 'salary')
print(f"Rows removed: {len(df) - len(df_clean)}")
```
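The mean and median fills above ignore relationships between columns. As a sketch of the prediction-based strategy, scikit-learn's `KNNImputer` fills each gap from the most similar rows instead (the data below is illustrative; this assumes scikit-learn is installed):

```python
import pandas as pd
from sklearn.impute import KNNImputer

# Illustrative numeric data with gaps
df = pd.DataFrame({
    'age':    [25, None, 35, 28, 32],
    'salary': [50000, 90000, None, 60000, 95000],
})

# Each missing value is imputed from the 2 nearest rows,
# with distances computed on the columns that are present
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(imputed)
```

In practice, scale the features first (e.g. with `StandardScaler`) so that large-magnitude columns like salary do not dominate the distance calculation.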