Data Manipulation

Data Analysis with Pandas

Use pandas DataFrames for data loading, cleaning, transformation, and analysis.

Pandas DataFrames

A DataFrame is a 2D labeled data structure with named columns and an indexed set of rows, much like a spreadsheet or a SQL table in Python.

Key Operations

  • Loading data: CSV, Excel, JSON, SQL, APIs
  • Inspection: .head(), .info(), .describe(), .dtypes
  • Selection: Columns, rows, conditions
  • Cleaning: Handle null values, fix types, remove duplicates
  • Transformation: Apply functions, create new columns
  • Aggregation: Group by, pivot tables
  • Merging: Join DataFrames like SQL
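The loading and cleaning steps in this list can be sketched as follows. This is a minimal illustration, not a prescribed workflow: the CSV text is made-up sample data held in memory (a real script would pass a file path to `pd.read_csv`), and the choice of the nullable `Int64` dtype is an assumption about how missing salaries should be stored.

```python
import io

import pandas as pd

# Hypothetical CSV content; in practice this would be a file path.
csv_text = """name,age,salary
Alice,25,75000
Bob,30,90000
Bob,30,90000
Diana,28,
"""

# Loading: read_csv accepts a path or any file-like object.
raw = pd.read_csv(io.StringIO(csv_text))

# Cleaning: drop exact duplicate rows and renumber the index.
clean = raw.drop_duplicates().reset_index(drop=True)

# Fixing types: the missing salary forces the column to float64.
# The nullable Int64 dtype keeps whole numbers and still allows gaps.
clean["salary"] = clean["salary"].astype("Int64")

print(clean)
print(clean.dtypes)
```

Whether to drop, fill, or keep missing values depends on the analysis; here the gap is kept so that later steps can decide.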

Series vs DataFrame

  • Series: 1D labeled array (a single column)
  • DataFrame: 2D labeled array (multiple columns)
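The difference shows up as soon as a column is selected. The short sketch below uses a small made-up table; single brackets return a Series, while a list of column names returns a DataFrame, even when the list has one entry.

```python
import pandas as pd

# Small illustrative table (sample data, not from a real dataset).
df = pd.DataFrame({"name": ["Alice", "Bob"], "age": [25, 30]})

ages = df["age"]        # single brackets: a Series
ages_df = df[["age"]]   # list of column names: a DataFrame

print(type(ages).__name__, ages.shape)        # Series (2,)
print(type(ages_df).__name__, ages_df.shape)  # DataFrame (2, 1)

# Both objects carry the same row labels as the original DataFrame.
print(ages.index.equals(df.index))            # True
```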

Example

python
import pandas as pd

# Create DataFrame
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'age': [25, 30, 35, 28, 32],
    'salary': [75000, 90000, 85000, None, 120000],
    'department': ['Engineering', 'Marketing', 'Engineering', 'HR', 'Engineering'],
    'years': [2, 5, 10, 3, 8]
})

# Inspect
print(df.head())
df.info()                                  # prints its own summary and returns None
print(df.describe())

# Selection
print(df['name'])                          # single column (Series)
print(df[['name', 'salary']])              # multiple columns (DataFrame)
print(df[df['age'] > 28])                  # filter rows
print(df.loc[0, 'name'])                   # label-based
print(df.iloc[0, 1])                       # position-based

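# Handling missing values
# Assign the result back: calling fillna(inplace=True) on a selected column
# is deprecated and may not update the original DataFrame.
df['salary'] = df['salary'].fillna(df['salary'].median())
df = df.dropna(subset=['name'])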

# Create new columns
df['salary_per_year'] = df['salary'] / df['years']
df['is_senior'] = df['years'] >= 5

# Apply functions
df['name_upper'] = df['name'].str.upper()
df['salary_k'] = df['salary'].apply(lambda x: f"{x/1000:.0f}k")

# Group by and aggregate
dept_stats = df.groupby('department').agg(
    avg_salary=('salary', 'mean'),
    headcount=('name', 'count'),
    avg_years=('years', 'mean')
).round(2)
print(dept_stats)

# Top 3 earners by salary (nlargest returns rows in descending order)
top_earners = df.nlargest(3, 'salary')[['name', 'salary', 'department']]
print(top_earners)

# Pivot table
pivot = df.pivot_table(
    values='salary',
    index='department',
    aggfunc=['mean', 'count', 'max']
)
print(pivot)
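Merging is listed under Key Operations but is not exercised in the example. A minimal sketch of a SQL-style left join follows; the `locations` table and its `office` values are hypothetical and exist only for illustration.

```python
import pandas as pd

employees = pd.DataFrame({
    "name": ["Alice", "Bob", "Diana"],
    "department": ["Engineering", "Marketing", "HR"],
})

# Hypothetical lookup table; HR is deliberately left out.
locations = pd.DataFrame({
    "department": ["Engineering", "Marketing"],
    "office": ["Berlin", "Lisbon"],
})

# Left join: keep every employee, attach office where a match exists.
merged = employees.merge(locations, on="department", how="left")
print(merged)
```

With `how="left"`, Diana's row is kept and her `office` is missing; `how="inner"` would drop the row instead.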