Data Science Introduction
Explore the data science workflow and the Python tools that power modern data analysis.
What is Data Science?
Data Science is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data.
The Data Science Process
- Problem Definition: What question are you trying to answer?
- Data Collection: Gather relevant data sources
- Data Cleaning: Handle missing values, outliers, and inconsistencies
- Exploratory Data Analysis (EDA): Visualize and understand the data
- Feature Engineering: Create or transform features for modeling
- Modeling: Apply statistical or ML models
- Evaluation: Measure model performance
- Communication: Present findings clearly
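The steps above can be sketched end to end on a small synthetic dataset. This is only an illustration, not a recommended pipeline: the problem (predicting a price from size and age), the column names, and the generated numbers are all made up for the example, and a plain linear regression stands in for whatever model a real project would need.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# 1. Problem definition (hypothetical): can size and age predict price?

# 2. Data collection: synthetic data stands in for a real source.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "size": rng.uniform(50, 150, n),
    "age": rng.uniform(0, 40, n),
})
df["price"] = 3.0 * df["size"] - 1.5 * df["age"] + rng.normal(0, 10, n)
df.loc[df.index % 25 == 0, "size"] = np.nan  # simulate missing values

# 3. Data cleaning: fill missing sizes with the column median.
df["size"] = df["size"].fillna(df["size"].median())

# 4. Exploratory data analysis: summary statistics for every column.
summary = df.describe()

# 5. Feature engineering: standardize the features (fit on training data only).
X = df[["size", "age"]]
y = df["price"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 6. Modeling: fit a simple linear regression.
model = LinearRegression().fit(X_train_scaled, y_train)

# 7. Evaluation: R^2 on the held-out test set.
r2 = r2_score(y_test, model.predict(X_test_scaled))

# 8. Communication: report the result in plain terms.
print(f"Size and age explain about {r2:.0%} of the variation in price.")
```

Fitting the scaler on the training split only keeps information from the test set out of the model, which makes the evaluation step more trustworthy.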
Key Python Libraries
- pandas: Data manipulation and analysis
- NumPy: Numerical computing
- Matplotlib/Seaborn: Data visualization
- scikit-learn: Machine learning
- Jupyter: Interactive notebooks
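As a quick taste of how pandas and NumPy work together, here is a minimal sketch. The table of cities and temperatures is invented for illustration.

```python
import numpy as np
import pandas as pd

# A small, made-up table of temperature readings.
df = pd.DataFrame({
    "city": ["Oslo", "Oslo", "Lima", "Lima"],
    "temp_c": [4.0, 6.0, 18.0, 22.0],
})

# pandas: group rows by city and average each group.
mean_by_city = df.groupby("city")["temp_c"].mean()
print(mean_by_city)

# NumPy: vectorized arithmetic on the whole column at once (Celsius to Fahrenheit).
temp_f = df["temp_c"].to_numpy() * 9 / 5 + 32
print(temp_f)
```

Both operations run over the whole column without an explicit Python loop, which is the usual style with these libraries.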
Example
python
# Data Science toolkit setup
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
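# NumPy - numerical computing
np.random.seed(42)  # fix the seed so the random values are reproducible
arr = np.array([1, 2, 3, 4, 5])
matrix = np.random.randn(5, 3)  # 5x3 matrix of standard normal samples
print("Array mean:", arr.mean())
print("Array std:", arr.std())  # population standard deviation (ddof=0)
print("Matrix shape:", matrix.shape)
# Statistical operations on 1,000 samples drawn from a normal distribution
# with mean 50 and standard deviation 15
data = np.random.normal(loc=50, scale=15, size=1000)
print(f"Sample mean: {data.mean():.2f}")
print(f"Median: {np.median(data):.2f}")
print(f"25th percentile: {np.percentile(data, 25):.2f}")
print(f"75th percentile: {np.percentile(data, 75):.2f}")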
# Simple visualization
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.hist(data, bins=30, edgecolor='black')
plt.title('Distribution')
plt.xlabel('Value')
plt.subplot(1, 2, 2)
plt.boxplot(data)
plt.title('Box Plot')
plt.tight_layout()
plt.show()
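The example imports pandas and `StandardScaler` without using them. A minimal sketch of how they might fit in, assuming the same kind of normally distributed sample, is shown here. The column names `value` and `value_scaled` are chosen for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Draw a reproducible sample similar to the one in the example above.
rng = np.random.default_rng(42)
data = rng.normal(loc=50, scale=15, size=1000)

# pandas: wrap the sample in a DataFrame and summarize it.
df = pd.DataFrame({"value": data})
print(df["value"].describe())

# scikit-learn: rescale the column to zero mean and unit variance.
# StandardScaler expects 2-D input, hence the double brackets.
scaler = StandardScaler()
df["value_scaled"] = scaler.fit_transform(df[["value"]]).ravel()

print(f"Scaled mean: {df['value_scaled'].mean():.2f}")
print(f"Scaled std: {df['value_scaled'].std(ddof=0):.2f}")
```

`StandardScaler` divides by the population standard deviation, so the check uses `ddof=0` to match it. The pandas default, `ddof=1`, would give a slightly different value.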