ML Examples

Practical examples for the ML utilities module.

Overview

The ML module provides practical utilities that complement sklearn and pandas rather than replace them:

  • Random & Reproducibility: Unified seed management, synthetic data generation
  • ID Generation: UUID, ULID, hash-based IDs for datasets
  • Data Splitting: Train/val/test splits with no leakage (stratified, time-series, group)
  • Statistical Utilities: Correlation, hypothesis tests, bootstrap, outlier detection
  • Feature Scaling: Standard, min-max, robust scaling with state management
  • Categorical Encoding: Label, one-hot, ordinal, frequency encoding with state management

Examples

1. ML Pipeline

File: examples/ml/01_ml_pipeline.py

Demonstrates a complete end-to-end ML pipeline.

Topics:

  • Setting up reproducibility with seed management
  • Generating synthetic datasets
  • Splitting data (train/val/test, stratified, time-series, group)
  • Encoding categorical features (label, one-hot)
  • Scaling numeric features (standard, min-max)
  • Saving artifacts for production deployment
  • Loading and using models in production

Run:

python examples/ml/01_ml_pipeline.py

Key Patterns:

# Reproducibility
SeedManager.set_global_seed(42)

# Data splitting (no leakage)
X_train, X_val, X_test, y_train, y_val, y_test = DataSplitter.train_val_test_split(
    X, y, test_size=0.2, val_size=0.1, stratify=y
)

# Feature scaling
scaler = Scaler(method="standard")
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
scaler.save_to_file("scaler.json")

# Categorical encoding
encoder = Encoder(method="label")
y_encoded = encoder.fit_transform(y_categorical)
encoder.save_to_file("encoder.json")

2. Reproducibility & Statistics

File: examples/ml/02_reproducibility_and_stats.py

Demonstrates reproducible experimentation and statistical analysis.

Topics:

  • Reproducible experiments with seed contexts
  • Generating synthetic data (classification, regression, time series)
  • Correlation analysis (Pearson, Spearman, Kendall)
  • Hypothesis testing (t-test)
  • Bootstrap confidence intervals
  • A/B testing with uplift calculation
  • Outlier detection (IQR, Z-score)
  • Descriptive statistics

Run:

python examples/ml/02_reproducibility_and_stats.py

Key Patterns:

# Reproducibility
with SeedManager.seed_context(123):
    data = make_classification_data(n_samples=100, random_state=123)

# Correlation
corr = Stats.correlation(x, y, method="pearson")

# Bootstrap CI
estimate, lower, upper = Stats.bootstrap_ci(
    data, n_bootstrap=1000, confidence_level=0.95
)

# A/B testing
result = Stats.ab_test_uplift(group_a, group_b, n_bootstrap=1000)

# Outlier detection
outliers = Stats.detect_outliers(data, method="iqr")

Common Patterns

Pattern 1: Reproducible ML Pipeline

from dspu.ml import SeedManager, DataSplitter, Scaler, make_classification_data

# Set seed for reproducibility
SeedManager.set_global_seed(42)

# Generate synthetic data
X, y = make_classification_data(n_samples=1000, n_features=10)

# Split data
X_train, X_test, y_train, y_test = DataSplitter.train_test_split(
    X, y, test_size=0.2, stratify=y
)

# Scale features
scaler = Scaler(method="standard")
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Save for production
scaler.save_to_file("scaler.json")

Pattern 2: Stratified Cross-Validation

from dspu.ml import DataSplitter

# Perform stratified 5-fold CV (preserves class distribution)
folds = DataSplitter.stratified_kfold(y, n_splits=5)

for train_idx, val_idx in folds:
    X_train = [X[i] for i in train_idx]
    X_val = [X[i] for i in val_idx]

    # Train model on X_train, validate on X_val
    ...

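One way to preserve class distribution across folds is to deal each class's indices out round-robin. The sketch below is illustrative only; `stratified_kfold_indices` is a hypothetical helper, not the module's `DataSplitter.stratified_kfold` implementation:

```python
# Sketch: building stratified k-fold indices (illustrative only).
from collections import defaultdict

def stratified_kfold_indices(y, n_splits):
    """Distribute each class's indices round-robin across folds,
    so every fold keeps roughly the global class proportions."""
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        for j, idx in enumerate(indices):
            folds[j % n_splits].append(idx)
    for k in range(n_splits):
        val_idx = sorted(folds[k])
        train_idx = sorted(i for f in range(n_splits) if f != k for i in folds[f])
        yield train_idx, val_idx

y = [0] * 6 + [1] * 4  # imbalanced labels: 60% / 40%
for train_idx, val_idx in stratified_kfold_indices(y, n_splits=2):
    val_labels = [y[i] for i in val_idx]
    # each validation fold mirrors the 60/40 class balance
```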
Pattern 3: Time Series Validation

from dspu.ml import DataSplitter

# Sequential splits (no future leakage)
splits = DataSplitter.time_series_split(X, n_splits=5)

for train_idx, val_idx in splits:
    # Training data is always before validation data
    assert max(train_idx) < min(val_idx)

    X_train = [X[i] for i in train_idx]
    X_val = [X[i] for i in val_idx]
    ...

Pattern 4: Group Split (No Leakage)

from dspu.ml import DataSplitter

# Keep all samples from same group together
X_train, X_test, y_train, y_test = DataSplitter.group_split(
    X, groups=patient_ids, y=y, test_size=0.25
)

# No patient appears in both train and test
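The no-overlap guarantee comes from assigning whole groups, not individual rows, to one side of the split. This standalone sketch illustrates the principle; `group_split_indices` is a hypothetical helper, not the actual `DataSplitter.group_split` implementation:

```python
# Sketch: group-aware splitting keeps whole groups on one side.
import random

def group_split_indices(groups, test_size=0.25, seed=42):
    """Assign entire groups to test until ~test_size of samples is reached."""
    unique = sorted(set(groups))
    rng = random.Random(seed)
    rng.shuffle(unique)
    n_test_target = int(len(groups) * test_size)
    test_groups, n_test = set(), 0
    for g in unique:
        if n_test >= n_test_target:
            break
        test_groups.add(g)
        n_test += sum(1 for x in groups if x == g)
    test_idx = [i for i, g in enumerate(groups) if g in test_groups]
    train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
    return train_idx, test_idx

patient_ids = ["p1", "p1", "p2", "p2", "p3", "p3", "p4", "p4"]
train_idx, test_idx = group_split_indices(patient_ids, test_size=0.25)
# no patient id appears on both sides of the split
assert not {patient_ids[i] for i in train_idx} & {patient_ids[i] for i in test_idx}
```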

Pattern 5: Production Deployment

from dspu.ml import Scaler, Encoder

# Training phase
scaler = Scaler(method="standard")
X_train_scaled = scaler.fit_transform(X_train)
scaler.save_to_file("artifacts/scaler.json")

encoder = Encoder(method="label")
y_encoded = encoder.fit_transform(y_categorical)
encoder.save_to_file("artifacts/encoder.json")

# Production phase
scaler = Scaler.load_from_file("artifacts/scaler.json")
encoder = Encoder.load_from_file("artifacts/encoder.json")

X_new_scaled = scaler.transform(X_new)
predictions_encoded = model.predict(X_new_scaled)
predictions = encoder.inverse_transform(predictions_encoded)

Pattern 6: A/B Testing

from dspu.ml import Stats

# Analyze A/B test
result = Stats.ab_test_uplift(
    group_a=control_metrics,
    group_b=treatment_metrics,
    n_bootstrap=1000,
    confidence_level=0.95,
)

print(f"Uplift: {result['relative_uplift']*100:.1f}%")
print(f"95% CI: [{result['uplift_ci_lower']:.3f}, {result['uplift_ci_upper']:.3f}]")

if result['uplift_ci_lower'] > 0:
    print("Statistically significant!")

Best Practices

DO:

  • Set seed for reproducibility in experiments
  • Use stratified splits for imbalanced data
  • Use time-series splits for temporal data
  • Use group splits to prevent data leakage
  • Save scalers/encoders for production deployment
  • Fit on training data only, transform on test data

DON'T:

  • Don't fit scalers on test data (data leakage!)
  • Don't use random splits for time series
  • Don't ignore group structure in data (patients, users, etc.)
  • Don't forget to handle unknown categories in production
  • Don't skip reproducibility setup for experiments
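The first DON'T, fitting scalers on test data, can be shown numerically with plain Python (no library code involved): statistics computed from the full dataset leak test-set information into training.

```python
train = [1.0, 2.0, 3.0, 4.0]
test = [100.0]  # an extreme unseen value

# Correct: statistics come from the training data alone.
mean = sum(train) / len(train)
std = (sum((x - mean) ** 2 for x in train) / len(train)) ** 0.5
scaled_test = [(x - mean) / std for x in test]  # test stays a visible outlier

# Leaky: fitting on train + test shifts the statistics toward the
# test data, producing optimistic, non-reproducible results in production.
leaky = train + test
leaky_mean = sum(leaky) / len(leaky)  # pulled far upward by the test outlier
```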

Module Reference

Random & Reproducibility

  • SeedManager.set_global_seed(seed) - Set seed for all RNGs
  • SeedManager.seed_context(seed) - Temporary seed context
  • make_classification_data() - Generate synthetic classification dataset
  • make_regression_data() - Generate synthetic regression dataset
  • make_time_series() - Generate synthetic time series

ID Generation

  • IDGenerator.uuid4() - Generate UUID v4
  • IDGenerator.ulid() - Generate sortable ULID
  • IDGenerator.hash_text() - Hash-based deterministic IDs
  • IDGenerator.add_id_column() - Add IDs to table
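A hash-based deterministic ID can be sketched with the standard library alone; `hash_id` below is a hypothetical stand-in whose signature need not match `IDGenerator.hash_text()`:

```python
# Sketch: deterministic IDs from content hashing.
import hashlib

def hash_id(text: str, length: int = 12) -> str:
    """Same input -> same ID, so rows can be re-identified across runs."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:length]

a = hash_id("user@example.com")
b = hash_id("user@example.com")
assert a == b       # deterministic across runs and machines
assert len(a) == 12
```

Deterministic IDs are useful for deduplication and for joining datasets regenerated at different times, where random UUIDs would not line up.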

Data Splitting

  • DataSplitter.train_test_split() - Train/test split
  • DataSplitter.train_val_test_split() - Three-way split
  • DataSplitter.kfold_split() - K-fold cross-validation
  • DataSplitter.stratified_kfold() - Stratified K-fold
  • DataSplitter.time_series_split() - Sequential splits (no future leakage)
  • DataSplitter.group_split() - Split by groups (no leakage)

Statistical Utilities

  • Stats.mean(), Stats.median(), Stats.std() - Basic stats
  • Stats.correlation() - Pearson, Spearman, Kendall
  • Stats.t_test_independent() - Two-sample t-test
  • Stats.bootstrap_ci() - Bootstrap confidence intervals
  • Stats.ab_test_uplift() - A/B test analysis
  • Stats.detect_outliers() - IQR or Z-score outlier detection
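The percentile bootstrap behind confidence intervals can be sketched in plain Python; this standalone `bootstrap_ci` is illustrative and need not match `Stats.bootstrap_ci` in detail:

```python
# Sketch: percentile bootstrap CI for the mean.
import random

def bootstrap_ci(data, n_bootstrap=1000, confidence_level=0.95, seed=0):
    """Resample with replacement, then take percentiles of the resampled means."""
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(data, k=len(data))) / len(data)
        for _ in range(n_bootstrap)
    )
    alpha = 1 - confidence_level
    lower = means[int(alpha / 2 * n_bootstrap)]
    upper = means[int((1 - alpha / 2) * n_bootstrap) - 1]
    estimate = sum(data) / len(data)
    return estimate, lower, upper

estimate, lower, upper = bootstrap_ci([2.0, 3.0, 5.0, 7.0, 11.0])
assert lower <= estimate <= upper
```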

Feature Scaling

  • Scaler(method="standard") - Standard scaling (mean=0, std=1)
  • Scaler(method="minmax") - Min-max scaling to [0, 1]
  • Scaler(method="robust") - Robust scaling (median, IQR)
  • .fit_transform(), .transform(), .inverse_transform()
  • .save_to_file(), .load_from_file() - Persistence
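The three methods can be compared on data containing an outlier, assuming the conventional formulas the method names suggest (the module's exact definitions, e.g. population vs. sample std or quartile interpolation, may differ):

```python
data = [1.0, 2.0, 3.0, 4.0, 100.0]  # one extreme outlier

# standard: (x - mean) / std, sensitive to the outlier
mean = sum(data) / len(data)                 # 22.0, dragged up by 100
std = (sum((x - mean) ** 2 for x in data) / len(data)) ** 0.5
standard = [(x - mean) / std for x in data]

# minmax: (x - min) / (max - min), maps onto [0, 1]
lo, hi = min(data), max(data)
minmax = [(x - lo) / (hi - lo) for x in data]

# robust: (x - median) / IQR, resistant to the outlier
s = sorted(data)
median = s[len(s) // 2]                      # 3.0, unaffected by 100
q1, q3 = s[len(s) // 4], s[3 * len(s) // 4]  # crude quartiles: 2.0 and 4.0
robust = [(x - median) / (q3 - q1) for x in data]
```

Robust scaling is the usual choice when outliers like this would otherwise dominate the mean and standard deviation.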

Categorical Encoding

  • Encoder(method="label") - Label encoding (categories → integers)
  • Encoder(method="onehot") - One-hot encoding (binary vectors)
  • Encoder(method="ordinal") - Ordinal encoding (custom order)
  • Encoder(method="frequency") - Frequency encoding
  • .fit_transform(), .transform(), .inverse_transform()
  • .save_to_file(), .load_from_file() - Persistence
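Frequency encoding, the least standard of the four, can be sketched directly: each category is replaced by its relative frequency in the fitted data. `fit_frequency` is a hypothetical helper, not the `Encoder` API:

```python
# Sketch: frequency encoding maps category -> relative frequency.
from collections import Counter

def fit_frequency(values):
    counts = Counter(values)
    total = len(values)
    return {cat: n / total for cat, n in counts.items()}

mapping = fit_frequency(["red", "red", "blue", "green"])
encoded = [mapping[v] for v in ["red", "blue"]]
# "red" appears in 2 of 4 rows, "blue" in 1 of 4
```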

See Also