# ML Examples

Practical examples for the ML utilities module.
## Overview

The ML module provides practical utilities that complement sklearn/pandas rather than replace them:

- **Random & Reproducibility**: Unified seed management, synthetic data generation
- **ID Generation**: UUID, ULID, and hash-based IDs for datasets
- **Data Splitting**: Train/val/test splits with no leakage (stratified, time-series, group)
- **Statistical Utilities**: Correlation, hypothesis tests, bootstrap, outlier detection
- **Feature Scaling**: Standard, min-max, and robust scaling with state management
- **Categorical Encoding**: Label, one-hot, ordinal, and frequency encoding with state management
## Examples

### 1. ML Pipeline

**File:** `examples/ml/01_ml_pipeline.py`

Demonstrates a complete end-to-end ML pipeline.

Topics:

- Setting up reproducibility with seed management
- Generating synthetic datasets
- Splitting data (train/val/test, stratified, time-series, group)
- Encoding categorical features (label, one-hot)
- Scaling numeric features (standard, min-max)
- Saving artifacts for production deployment
- Loading and using models in production
Run: `python examples/ml/01_ml_pipeline.py`
Key Patterns:

```python
# Reproducibility
SeedManager.set_global_seed(42)

# Data splitting (no leakage)
X_train, X_val, X_test, y_train, y_val, y_test = DataSplitter.train_val_test_split(
    X, y, test_size=0.2, val_size=0.1, stratify=y
)

# Feature scaling
scaler = Scaler(method="standard")
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
scaler.save_to_file("scaler.json")

# Categorical encoding
encoder = Encoder(method="label")
y_encoded = encoder.fit_transform(y_categorical)
encoder.save_to_file("encoder.json")
```
### 2. Reproducibility & Statistics

**File:** `examples/ml/02_reproducibility_and_stats.py`

Shows reproducible experiments and statistical analysis.

Topics:

- Reproducible experiments with seed contexts
- Generating synthetic data (classification, regression, time series)
- Correlation analysis (Pearson, Spearman, Kendall)
- Hypothesis testing (t-test)
- Bootstrap confidence intervals
- A/B testing with uplift calculation
- Outlier detection (IQR, Z-score)
- Descriptive statistics
Run: `python examples/ml/02_reproducibility_and_stats.py`
Key Patterns:

```python
# Reproducibility
with SeedManager.seed_context(123):
    data = make_classification_data(n_samples=100, random_state=123)

# Correlation
corr = Stats.correlation(x, y, method="pearson")

# Bootstrap CI
estimate, lower, upper = Stats.bootstrap_ci(
    data, n_bootstrap=1000, confidence_level=0.95
)

# A/B testing
result = Stats.ab_test_uplift(group_a, group_b, n_bootstrap=1000)

# Outlier detection
outliers = Stats.detect_outliers(data, method="iqr")
```
## Common Patterns

### Pattern 1: Reproducible ML Pipeline

```python
from dspu.ml import SeedManager, DataSplitter, Scaler, make_classification_data

# Set seed for reproducibility
SeedManager.set_global_seed(42)

# Generate synthetic data
X, y = make_classification_data(n_samples=1000, n_features=10)

# Split data
X_train, X_test, y_train, y_test = DataSplitter.train_test_split(
    X, y, test_size=0.2, stratify=y
)

# Scale features
scaler = Scaler(method="standard")
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Save for production
scaler.save_to_file("scaler.json")
```
### Pattern 2: Stratified Cross-Validation

```python
from dspu.ml import DataSplitter

# Perform stratified 5-fold CV (preserves class distribution)
folds = DataSplitter.stratified_kfold(y, n_splits=5)
for train_idx, val_idx in folds:
    X_train = [X[i] for i in train_idx]
    X_val = [X[i] for i in val_idx]
    # Train model on X_train, validate on X_val
    ...
```
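The mechanics behind a stratified split are easy to sketch in pure Python: group indices by class, then deal each class's indices round-robin across folds so every fold preserves the class proportions. This is an illustration of the technique only, not the `DataSplitter.stratified_kfold` implementation.

```python
from collections import defaultdict

def stratified_kfold_indices(y, n_splits):
    """Yield (train_idx, val_idx) pairs whose folds preserve class proportions.
    Pure-Python sketch of stratification, not the dspu.ml code."""
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)
    folds = [[] for _ in range(n_splits)]
    # Deal each class's indices round-robin across the folds
    for indices in by_class.values():
        for pos, i in enumerate(indices):
            folds[pos % n_splits].append(i)
    for k in range(n_splits):
        val_idx = sorted(folds[k])
        train_idx = sorted(i for j, f in enumerate(folds) if j != k for i in f)
        yield train_idx, val_idx
```

With `y = [0]*6 + [1]*4` and two splits, each validation fold contains three samples of class 0 and two of class 1, matching the 60/40 distribution of the full data.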
### Pattern 3: Time Series Validation

```python
from dspu.ml import DataSplitter

# Sequential splits (no future leakage)
splits = DataSplitter.time_series_split(X, n_splits=5)
for train_idx, val_idx in splits:
    # Training data is always before validation data
    assert max(train_idx) < min(val_idx)
    X_train = [X[i] for i in train_idx]
    X_val = [X[i] for i in val_idx]
    ...
```
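The "no future leakage" guarantee comes from expanding-window splits: each validation block sits strictly after all of its training data, and the training window grows with each split. A minimal sketch of that scheme (an illustration in the style of sklearn's `TimeSeriesSplit`, not the `dspu.ml` implementation):

```python
def time_series_split_indices(n_samples, n_splits):
    """Expanding-window splits: every validation block follows its training data."""
    fold = n_samples // (n_splits + 1)
    for k in range(1, n_splits + 1):
        train_idx = list(range(0, k * fold))          # everything seen so far
        val_idx = list(range(k * fold, (k + 1) * fold))  # the next block in time
        yield train_idx, val_idx
```

With 12 samples and 5 splits the first pair is `([0, 1], [2, 3])` and the last is `([0..9], [10, 11])`, so `max(train_idx) < min(val_idx)` holds for every split.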
### Pattern 4: Group Split (No Leakage)

```python
from dspu.ml import DataSplitter

# Keep all samples from the same group together
X_train, X_test, y_train, y_test = DataSplitter.group_split(
    X, groups=patient_ids, y=y, test_size=0.25
)
# No patient appears in both train and test
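A group split works by assigning whole groups, not individual samples, to one side of the split. The sketch below illustrates that idea in pure Python; the function name and signature are hypothetical and this is not the `dspu.ml` implementation.

```python
import random

def group_split_sketch(X, y, groups, test_size=0.25, seed=0):
    """Assign whole groups to train or test so no group straddles the split.
    Illustrative only; not DataSplitter.group_split."""
    unique = sorted(set(groups))
    rng = random.Random(seed)
    rng.shuffle(unique)
    n_test = max(1, round(len(unique) * test_size))
    test_groups = set(unique[:n_test])
    train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
    test_idx = [i for i, g in enumerate(groups) if g in test_groups]
    return ([X[i] for i in train_idx], [X[i] for i in test_idx],
            [y[i] for i in train_idx], [y[i] for i in test_idx])
```

Because membership is decided per group, every group's samples land entirely in train or entirely in test.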
### Pattern 5: Production Deployment

```python
from dspu.ml import Scaler, Encoder

# Training phase
scaler = Scaler(method="standard")
X_train_scaled = scaler.fit_transform(X_train)
scaler.save_to_file("artifacts/scaler.json")

encoder = Encoder(method="label")
y_encoded = encoder.fit_transform(y_categorical)
encoder.save_to_file("artifacts/encoder.json")

# Production phase
scaler = Scaler.load_from_file("artifacts/scaler.json")
encoder = Encoder.load_from_file("artifacts/encoder.json")

X_new_scaled = scaler.transform(X_new)
predictions_encoded = model.predict(X_new_scaled)
predictions = encoder.inverse_transform(predictions_encoded)
```
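What `save_to_file`/`load_from_file` round-trip is just the fitted parameters, which is why production can transform new data without refitting. A minimal stand-in (the `TinyStandardScaler` class here is hypothetical, not the `dspu.ml` `Scaler`) makes the state explicit:

```python
import json

class TinyStandardScaler:
    """Illustrative stand-in for Scaler(method="standard"); not the dspu.ml class."""

    def __init__(self, mean=None, std=None):
        self.mean, self.std = mean, std

    def fit_transform(self, xs):
        self.mean = sum(xs) / len(xs)
        self.std = (sum((x - self.mean) ** 2 for x in xs) / len(xs)) ** 0.5 or 1.0
        return self.transform(xs)

    def transform(self, xs):
        return [(x - self.mean) / self.std for x in xs]

    def save_to_file(self, path):
        with open(path, "w") as f:
            json.dump({"mean": self.mean, "std": self.std}, f)

    @classmethod
    def load_from_file(cls, path):
        with open(path) as f:
            return cls(**json.load(f))
```

A loaded instance transforms new data identically to the original fitted one, because the only state it needs is the saved mean and standard deviation.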
### Pattern 6: A/B Testing

```python
from dspu.ml import Stats

# Analyze A/B test
result = Stats.ab_test_uplift(
    group_a=control_metrics,
    group_b=treatment_metrics,
    n_bootstrap=1000,
    confidence_level=0.95,
)

print(f"Uplift: {result['relative_uplift'] * 100:.1f}%")
print(f"95% CI: [{result['uplift_ci_lower']:.3f}, {result['uplift_ci_upper']:.3f}]")

if result['uplift_ci_lower'] > 0:
    print("Statistically significant!")
```
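The uplift calculation itself is simple: relative uplift is `(mean_b - mean_a) / mean_a`, and the CI comes from bootstrapping that quantity. The sketch below shows the technique with the stdlib; the dict keys mirror the pattern above, but the function body is an assumption, not the `Stats.ab_test_uplift` source.

```python
import random

def ab_uplift_sketch(group_a, group_b, n_bootstrap=1000, seed=0):
    """Relative uplift of B over A with a percentile-bootstrap 95% CI.
    Sketch of the idea behind Stats.ab_test_uplift, not its implementation."""
    rng = random.Random(seed)
    mean = lambda xs: sum(xs) / len(xs)
    uplifts = []
    for _ in range(n_bootstrap):
        a = [rng.choice(group_a) for _ in group_a]  # resample with replacement
        b = [rng.choice(group_b) for _ in group_b]
        uplifts.append((mean(b) - mean(a)) / mean(a))
    uplifts.sort()
    return {
        "relative_uplift": (mean(group_b) - mean(group_a)) / mean(group_a),
        "uplift_ci_lower": uplifts[int(0.025 * n_bootstrap)],
        "uplift_ci_upper": uplifts[int(0.975 * n_bootstrap)],
    }
```

If the whole CI sits above zero, the treatment's improvement is unlikely to be a resampling artifact, which is exactly the significance check in the pattern above.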
## Best Practices

✅ **DO:**

- Set a seed for reproducibility in experiments
- Use stratified splits for imbalanced data
- Use time-series splits for temporal data
- Use group splits to prevent data leakage
- Save scalers/encoders for production deployment
- Fit on training data only, transform on test data

❌ **DON'T:**

- Don't fit scalers on test data (data leakage!)
- Don't use random splits for time series
- Don't ignore group structure in data (patients, users, etc.)
- Don't forget to handle unknown categories in production
- Don't skip reproducibility setup for experiments
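The first DON'T deserves a concrete illustration: when scaling statistics are computed over train and test together, test-set values shift the "training" statistics, and the model indirectly sees data it will later be evaluated on.

```python
# One extreme test point is enough to move the fitted mean:
train, test = [1.0, 2.0, 3.0], [10.0]

train_only_mean = sum(train) / len(train)           # correct: train data only
leaky_mean = sum(train + test) / len(train + test)  # wrong: test value leaks in

assert train_only_mean == 2.0
assert leaky_mean == 4.0  # the test point doubled the scaling statistic
```

The same reasoning applies to encoders and any other fitted preprocessing state: fit on train, then apply the frozen state to test.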
## Module Reference

### Random & Reproducibility

- `SeedManager.set_global_seed(seed)` - Set seed for all RNGs
- `SeedManager.seed_context(seed)` - Temporary seed context
- `make_classification_data()` - Generate a synthetic classification dataset
- `make_regression_data()` - Generate a synthetic regression dataset
- `make_time_series()` - Generate a synthetic time series
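The seed-context pattern can be sketched with the stdlib alone: seed the RNG on entry and restore the previous state on exit, so the surrounding code's randomness is untouched. This is an illustration of the pattern, not the `SeedManager.seed_context` source.

```python
import random
from contextlib import contextmanager

@contextmanager
def seed_context(seed):
    """Temporarily seed the global RNG; restore the previous state on exit."""
    state = random.getstate()
    random.seed(seed)
    try:
        yield
    finally:
        random.setstate(state)

with seed_context(123):
    a = random.random()
with seed_context(123):
    b = random.random()
assert a == b  # same seed inside the context -> same draws
```

Because the prior state is restored in a `finally` block, the global RNG sequence continues exactly where it left off even if the body raises.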
### ID Generation

- `IDGenerator.uuid4()` - Generate a UUID v4
- `IDGenerator.ulid()` - Generate a sortable ULID
- `IDGenerator.hash_text()` - Hash-based deterministic IDs
- `IDGenerator.add_id_column()` - Add IDs to a table
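What these ID styles buy you can be sketched with the stdlib. ULIDs have no stdlib equivalent, but a millisecond-timestamp prefix shows why they sort chronologically; the helper names below are hypothetical and none of this is the `IDGenerator` code.

```python
import hashlib
import time
import uuid

# Random, collision-resistant ID (the uuid4 style)
random_id = str(uuid.uuid4())

# Deterministic ID: the same input text always hashes to the same ID,
# which makes dataset row IDs reproducible across runs
def hash_id(text: str) -> str:
    return hashlib.sha256(text.encode()).hexdigest()[:16]

# Sortable ID: a zero-padded timestamp prefix makes lexicographic
# order follow creation time (the property ULIDs provide)
def sortable_id() -> str:
    return f"{int(time.time() * 1000):013d}-{uuid.uuid4().hex[:8]}"
```

Deterministic hash IDs are the right choice when re-running a pipeline must assign the same ID to the same record; random UUIDs are the right choice when IDs must never collide across independent producers.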
### Data Splitting

- `DataSplitter.train_test_split()` - Train/test split
- `DataSplitter.train_val_test_split()` - Three-way split
- `DataSplitter.kfold_split()` - K-fold cross-validation
- `DataSplitter.stratified_kfold()` - Stratified K-fold
- `DataSplitter.time_series_split()` - Sequential splits (no future leakage)
- `DataSplitter.group_split()` - Split by groups (no leakage)
### Statistical Utilities

- `Stats.mean()`, `Stats.median()`, `Stats.std()` - Basic stats
- `Stats.correlation()` - Pearson, Spearman, Kendall
- `Stats.t_test_independent()` - Two-sample t-test
- `Stats.bootstrap_ci()` - Bootstrap confidence intervals
- `Stats.ab_test_uplift()` - A/B test analysis
- `Stats.detect_outliers()` - IQR or Z-score outlier detection
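The IQR rule behind `method="iqr"` flags any point outside `[Q1 - 1.5*IQR, Q3 + 1.5*IQR]`. A pure-Python sketch of that rule (illustrative only, not the `Stats.detect_outliers` implementation):

```python
def detect_outliers_iqr(data, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR].
    Sketch of the IQR rule, not the dspu.ml implementation."""
    xs = sorted(data)

    def quantile(q):
        # Linear interpolation between adjacent order statistics
        pos = q * (len(xs) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(xs) - 1)
        return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])

    q1, q3 = quantile(0.25), quantile(0.75)
    iqr = q3 - q1
    return [x for x in data if x < q1 - k * iqr or x > q3 + k * iqr]
```

Because quartiles are order statistics, the rule is robust: a single extreme value cannot inflate the bounds the way it inflates a mean and standard deviation, which is the weakness of the Z-score method.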
### Feature Scaling

- `Scaler(method="standard")` - Standard scaling (mean=0, std=1)
- `Scaler(method="minmax")` - Min-max scaling to [0, 1]
- `Scaler(method="robust")` - Robust scaling (median, IQR)
- `.fit_transform()`, `.transform()`, `.inverse_transform()`
- `.save_to_file()`, `.load_from_file()` - Persistence
### Categorical Encoding

- `Encoder(method="label")` - Label encoding (categories → integers)
- `Encoder(method="onehot")` - One-hot encoding (binary vectors)
- `Encoder(method="ordinal")` - Ordinal encoding (custom order)
- `Encoder(method="frequency")` - Frequency encoding
- `.fit_transform()`, `.transform()`, `.inverse_transform()`
- `.save_to_file()`, `.load_from_file()` - Persistence