Statistics

Statistical Analysis

Apply hypothesis testing, correlation analysis, and statistical thinking to your data.

Statistics in Data Science

Statistical thinking is fundamental to data science. Key concepts:

Descriptive statistics: Summarize data (mean, median, standard deviation)
Inferential statistics: Draw conclusions about a population from samples
Probability distributions: Model data patterns
Hypothesis testing: Assess whether an observed effect could plausibly be due to chance
Correlation: Measure the strength and direction of relationships between variables
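
As a small illustration of descriptive statistics, the sketch below summarizes a made-up sample. The scores are hypothetical and chosen so that one outlier pulls the mean away from the median.

```python
import numpy as np

# Hypothetical sample: ten exam scores, one of them an outlier
scores = np.array([52, 55, 58, 60, 61, 63, 65, 67, 70, 149])

mean = scores.mean()
median = np.median(scores)
std = scores.std(ddof=1)  # ddof=1 gives the sample standard deviation

print(f"mean={mean:.1f}, median={median:.1f}, std={std:.1f}")
```

The mean (70.0) sits above the median (62.0) because the single outlier pulls it upward, which is why the median is often the more robust summary for skewed data.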

Hypothesis Testing

Define null hypothesis (H₀) and alternative hypothesis (H₁)
Choose significance level (α, usually 0.05)
Compute test statistic and p-value
If p-value < α, reject H₀; otherwise, fail to reject H₀
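
The steps above can be sketched by computing a two-sample t statistic by hand and comparing it with SciPy. This is a minimal sketch, not a definitive recipe: the measurements are hypothetical, and the pooled-variance formula assumes both groups share the same variance.

```python
import numpy as np
from scipy import stats

# Step 1: H0 says the two group means are equal; H1 says they differ.
# Hypothetical measurements for two small groups (made up for illustration)
a = np.array([5.1, 4.9, 5.6, 5.8, 5.3, 5.0])
b = np.array([5.9, 6.1, 5.7, 6.4, 6.0, 5.8])

# Step 2: choose the significance level
alpha = 0.05

# Step 3: pooled-variance t statistic (assumes equal variances in both groups)
n_a, n_b = len(a), len(b)
df = n_a + n_b - 2
pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / df
t_manual = (a.mean() - b.mean()) / np.sqrt(pooled_var * (1 / n_a + 1 / n_b))
p_manual = 2 * stats.t.sf(abs(t_manual), df)  # two-sided p-value

# Cross-check against SciPy's implementation
t_scipy, p_scipy = stats.ttest_ind(a, b)

# Step 4: decision
reject_h0 = p_manual < alpha
print(f"t={t_manual:.3f}, p={p_manual:.4f}, reject H0: {reject_h0}")
```

Working through the formula once makes it clearer what the p-value measures: how likely a t statistic at least this extreme would be if H₀ were true.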

Common tests:

t-test: Compare the means of two groups
ANOVA: Compare the means of three or more groups
Chi-square: Test independence of two categorical variables
Pearson/Spearman: Measure linear (Pearson) or monotonic, rank-based (Spearman) correlation
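
ANOVA is the one test in this list that is easy to overlook in practice, so here is a minimal sketch using `scipy.stats.f_oneway`. The three groups are simulated and hypothetical, with the third given a shifted mean so that there is a real difference to detect.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Three hypothetical groups; the third has a shifted mean
g1 = rng.normal(loc=50, scale=5, size=30)
g2 = rng.normal(loc=50, scale=5, size=30)
g3 = rng.normal(loc=60, scale=5, size=30)

# One-way ANOVA: H0 says all group means are equal
f_stat, p_anova = stats.f_oneway(g1, g2, g3)
print(f"F={f_stat:.3f}, p={p_anova:.4g}")
```

A small p-value here only says that at least one group mean differs. It does not say which one, so a follow-up comparison is usually needed.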

Example

python

import pandas as pd
import numpy as np
from scipy import stats

# Generate data for two groups
np.random.seed(42)
group_a = np.random.normal(loc=50, scale=10, size=100)
group_b = np.random.normal(loc=55, scale=10, size=100)

# Descriptive statistics
print("Group A mean:", group_a.mean().round(2))
print("Group B mean:", group_b.mean().round(2))

# Two-sample t-test: are the means significantly different?
# ttest_ind assumes equal variances by default; pass equal_var=False for Welch's t-test
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t-statistic: {t_stat:.3f}, p-value: {p_value:.4f}")
if p_value < 0.05:
    print("Significant difference! (reject H0)")
else:
    print("No significant difference (fail to reject H0)")

# Correlation analysis
x = np.random.rand(100)
y = 0.7 * x + 0.3 * np.random.rand(100)

pearson_r, pearson_p = stats.pearsonr(x, y)
spearman_r, spearman_p = stats.spearmanr(x, y)
print(f"Pearson r: {pearson_r:.3f} (p={pearson_p:.4f})")
print(f"Spearman rho: {spearman_r:.3f} (p={spearman_p:.4f})")

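# Chi-square test for categorical independence
# (toy data: with only 5 observations the expected counts are far below 5,
# so the chi-square approximation is unreliable; use larger samples in practice)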
contingency = pd.crosstab(
    ['M', 'F', 'M', 'F', 'M'],
    ['A', 'B', 'B', 'A', 'A']
)
chi2, p, dof, expected = stats.chi2_contingency(contingency)
print(f"Chi2: {chi2:.3f}, p-value: {p:.4f}")

# Normality test
stat, p_norm = stats.shapiro(group_a)
print(f"Shapiro-Wilk: statistic={stat:.3f}, p={p_norm:.4f}")
if p_norm > 0.05:
    print("No evidence against normality (fail to reject H0)")

# Confidence interval
confidence = 0.95
n = len(group_a)
se = stats.sem(group_a)
ci = stats.t.interval(confidence, df=n-1, loc=group_a.mean(), scale=se)
print(f"95% CI for Group A mean: ({ci[0]:.2f}, {ci[1]:.2f})")
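
The chi-square call in the example runs on only five observations, which is too few for the test to be meaningful. The sketch below repeats it on a hypothetical 2x2 table of counts large enough that every expected count is well above 5. The counts are made up for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical 2x2 table of counts (rows: group, columns: outcome)
observed = np.array([[30, 20],
                     [15, 35]])

# For 2x2 tables, chi2_contingency applies Yates' continuity correction by default
chi2, p, dof, expected = stats.chi2_contingency(observed)
print(f"Chi2: {chi2:.3f}, p-value: {p:.4f}, dof: {dof}")
print("Smallest expected count:", expected.min())
```

Checking the `expected` array before trusting the p-value is a cheap safeguard. A common rule of thumb is that all expected counts should be at least 5.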
