Data Manipulation
Data Analysis with Pandas
Use pandas DataFrames for data loading, cleaning, transformation, and analysis.
Pandas DataFrames
A DataFrame is a 2D labeled data structure — think of it as a spreadsheet or SQL table in Python.
Key Operations
- Loading data: CSV, Excel, JSON, SQL, APIs
- Inspection: .head(), .info(), .describe(), .dtypes
- Selection: Columns, rows, conditions
- Cleaning: Handle null values, fix types, remove duplicates
- Transformation: Apply functions, create new columns
- Aggregation: Group by, pivot tables
- Merging: Join DataFrames like SQL
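The loading and cleaning steps in the list above can be sketched roughly as follows. The CSV text, column names, and values are made up for illustration, and `io.StringIO` stands in for a real file path.

```python
import io
import pandas as pd

# Hypothetical CSV content; in practice this would be a file path or URL.
csv_text = """order_id,customer,amount,order_date
1,alice,19.99,2024-01-05
2,bob,5.00,2024-01-06
2,bob,5.00,2024-01-06
3,carol,,2024-01-07
"""

# Loading: read_csv accepts a path or any file-like object.
orders = pd.read_csv(io.StringIO(csv_text))

# Cleaning: remove exact duplicate rows, fix types, handle nulls.
orders = orders.drop_duplicates()
orders["order_date"] = pd.to_datetime(orders["order_date"])
orders["amount"] = orders["amount"].fillna(0.0)

print(orders.dtypes)
print(orders)
```

Filling the missing amount with 0.0 is only one option; depending on the data, a median fill or dropping the row may be more appropriate.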
Series vs DataFrame
- Series: 1D labeled array (a single column)
- DataFrame: 2D labeled array (multiple columns)
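The difference shows up in how a column is selected. In this small sketch (the two-row frame is illustrative only), single brackets return a Series, while a list of column names returns a DataFrame.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Alice", "Bob"], "age": [25, 30]})

ages = df["age"]        # single brackets: a Series (1D)
ages_df = df[["age"]]   # list of columns: a DataFrame (2D)

print(type(ages).__name__, ages.shape)        # Series (2,)
print(type(ages_df).__name__, ages_df.shape)  # DataFrame (2, 1)
```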
Example
python
import pandas as pd
import numpy as np
# Create DataFrame
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'age': [25, 30, 35, 28, 32],
    'salary': [75000, 90000, 85000, None, 120000],
    'department': ['Engineering', 'Marketing', 'Engineering', 'HR', 'Engineering'],
    'years': [2, 5, 10, 3, 8]
})
# Inspect
print(df.head())
df.info() # prints its summary directly and returns None
print(df.describe())
# Selection
print(df['name']) # single column (Series)
print(df[['name', 'salary']]) # multiple columns (DataFrame)
print(df[df['age'] > 28]) # filter rows
print(df.loc[0, 'name']) # label-based
print(df.iloc[0, 1]) # position-based
# Handling missing values
df['salary'] = df['salary'].fillna(df['salary'].median()) # assign back; inplace on a selected column is unreliable
df = df.dropna(subset=['name'])
# Create new columns
df['salary_per_year'] = df['salary'] / df['years']
df['is_senior'] = df['years'] >= 5
# Apply functions
df['name_upper'] = df['name'].str.upper()
df['salary_k'] = df['salary'].apply(lambda x: f"{x/1000:.0f}k")
# Group by and aggregate
dept_stats = df.groupby('department').agg(
    avg_salary=('salary', 'mean'),
    headcount=('name', 'count'),
    avg_years=('years', 'mean')
).round(2)
print(dept_stats)
# Sort: top 3 earners (nlargest returns rows in descending order of salary)
top_earners = df.nlargest(3, 'salary')[['name', 'salary', 'department']]
print(top_earners)
# Pivot table
pivot = df.pivot_table(
    values='salary',
    index='department',
    aggfunc=['mean', 'count', 'max']
)
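Merging, listed under Key Operations, is not covered by the example above. A minimal sketch of a SQL-style left join follows; the `locations` table and its `office` values are hypothetical and exist only to give the join something to match against.

```python
import pandas as pd

employees = pd.DataFrame({
    "name": ["Alice", "Bob", "Diana"],
    "department": ["Engineering", "Marketing", "HR"],
})

# Hypothetical lookup table; HR is deliberately missing.
locations = pd.DataFrame({
    "department": ["Engineering", "Marketing"],
    "office": ["Berlin", "Lisbon"],
})

# Left join: keep every employee, fill unmatched departments with NaN.
merged = employees.merge(locations, on="department", how="left")
print(merged)
```

Switching `how` to `"inner"` would drop Diana's row instead, since HR has no match in `locations`.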