Statistical Analysis
Apply hypothesis testing, correlation analysis, and statistical thinking to your data.
Statistics in Data Science
Statistical thinking is fundamental to data science. Key concepts:
- Descriptive statistics: Summarize data (mean, median, standard deviation)
- Inferential statistics: Draw conclusions about a population from a sample
- Probability distributions: Model how values in the data are distributed
- Hypothesis testing: Decide whether an observed effect could plausibly be due to chance alone
- Correlation: Measure the strength and direction of the relationship between variables
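As a small illustration of the descriptive statistics mentioned above, the sketch below summarizes a made-up sample with NumPy. The values are hypothetical and chosen only to show how a single outlier pulls the mean away from the median.

```python
import numpy as np

# Hypothetical sample with one outlier (values are for illustration only)
values = np.array([12, 15, 14, 13, 16, 15, 90])

mean = values.mean()
median = np.median(values)
std = values.std(ddof=1)  # ddof=1 gives the sample standard deviation

print(f"mean={mean:.2f}, median={median:.2f}, std={std:.2f}")
```

The median stays close to the bulk of the data while the mean is dragged upward by the outlier, which is why it is usually worth reporting both.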
Hypothesis Testing
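- Define the null hypothesis (H₀) and the alternative hypothesis (H₁)
- Choose a significance level (α, commonly 0.05) before looking at the data
- Compute the test statistic and its p-value
- If p-value < α, reject H₀; otherwise, fail to reject H₀ (which does not prove that H₀ is true)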
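The steps above can be sketched by hand for a one-sample t-test. The measurements and the hypothesized mean of 50 are made up for illustration; the manual result is cross-checked against `scipy.stats.ttest_1samp`.

```python
import numpy as np
from scipy import stats

# Step 1: H0: the population mean equals mu0; H1: it does not.
# The sample and mu0 are hypothetical values used only for illustration.
sample = np.array([52.1, 48.3, 55.0, 51.2, 49.8, 53.4, 50.9, 54.2])
mu0 = 50.0

# Step 2: choose the significance level
alpha = 0.05

# Step 3: test statistic t = (mean - mu0) / (s / sqrt(n))
n = sample.size
t_manual = (sample.mean() - mu0) / (sample.std(ddof=1) / np.sqrt(n))
# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# Cross-check against SciPy's built-in one-sample t-test
t_scipy, p_scipy = stats.ttest_1samp(sample, popmean=mu0)

# Step 4: decision
reject_h0 = p_manual < alpha
print(f"t={t_manual:.3f}, p={p_manual:.4f}, reject H0: {reject_h0}")
```

In practice you would call the SciPy function directly; computing the statistic by hand is only meant to show where the p-value comes from.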
Common tests:
- t-test: Compare the means of two groups
- ANOVA: Compare the means of three or more groups
- Chi-square: Test independence of two categorical variables
- Pearson/Spearman: Measure linear (Pearson) or monotonic, rank-based (Spearman) correlation
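Of the tests listed above, one-way ANOVA can be sketched with `scipy.stats.f_oneway`. The three groups here are simulated, and the group means (50, 50, 60) are assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats

# Simulated data: three groups, the third with a higher true mean (illustrative values)
rng = np.random.default_rng(0)
g1 = rng.normal(loc=50, scale=10, size=40)
g2 = rng.normal(loc=50, scale=10, size=40)
g3 = rng.normal(loc=60, scale=10, size=40)

# H0: all group means are equal
f_stat, p_anova = stats.f_oneway(g1, g2, g3)
print(f"F-statistic: {f_stat:.3f}, p-value: {p_anova:.4f}")
```

A small p-value only says that at least one group mean differs; it does not say which one. A follow-up (post-hoc) comparison is needed for that.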
Example
python
import pandas as pd
import numpy as np
from scipy import stats
# Generate data for two groups
np.random.seed(42)
group_a = np.random.normal(loc=50, scale=10, size=100)
group_b = np.random.normal(loc=55, scale=10, size=100)
# Descriptive statistics
print("Group A mean:", group_a.mean().round(2))
print("Group B mean:", group_b.mean().round(2))
# T-test: are the means significantly different?
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t-statistic: {t_stat:.3f}, p-value: {p_value:.4f}")
if p_value < 0.05:
    print("Significant difference! (reject H0)")
else:
    print("No significant difference (fail to reject H0)")
# Correlation analysis
x = np.random.rand(100)
y = 0.7 * x + 0.3 * np.random.rand(100)
pearson_r, pearson_p = stats.pearsonr(x, y)
spearman_r, spearman_p = stats.spearmanr(x, y)
print(f"Pearson r: {pearson_r:.3f} (p={pearson_p:.4f})")
print(f"Spearman r: {spearman_r:.3f} (p={spearman_p:.4f})")
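# Chi-square test for categorical independence
# Wrap the lists in Series: crosstab treats a plain list as a list of separate arrays.
# Toy data: with only 5 observations the expected counts are too small
# for the chi-square approximation to be reliable.
contingency = pd.crosstab(
    pd.Series(['M', 'F', 'M', 'F', 'M']),
    pd.Series(['A', 'B', 'B', 'A', 'A'])
)
chi2, p, dof, expected = stats.chi2_contingency(contingency)
print(f"Chi2: {chi2:.3f}, p-value: {p:.4f}")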
# Normality test
stat, p_norm = stats.shapiro(group_a)
print(f"Shapiro-Wilk: statistic={stat:.3f}, p={p_norm:.4f}")
# A large p-value means no evidence against normality; it does not prove normality
if p_norm > 0.05:
    print("Data looks normally distributed")
# Confidence interval
confidence = 0.95
n = len(group_a)
se = stats.sem(group_a)
ci = stats.t.interval(confidence, df=n-1, loc=group_a.mean(), scale=se)
print(f"95% CI for Group A: ({ci[0]:.2f}, {ci[1]:.2f})")
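The correlation step in the example uses data that is roughly linear, so Pearson and Spearman give similar values. The sketch below, using a made-up exponential relationship, shows a case where they differ: the relationship is perfectly monotonic but not linear.

```python
import numpy as np
from scipy import stats

# Illustrative data: y increases strictly with x, but not linearly
x = np.linspace(1, 10, 50)
y = np.exp(x)

r_linear, _ = stats.pearsonr(x, y)   # measures linear association
r_rank, _ = stats.spearmanr(x, y)    # measures monotonic (rank) association

print(f"Pearson r: {r_linear:.3f}")
print(f"Spearman r: {r_rank:.3f}")
```

Spearman works on ranks, so any strictly increasing relationship scores 1, while Pearson is lower because the points do not fall on a straight line.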