ML Utilities¶
Practical machine learning utilities for reproducibility, data splitting, scaling, encoding, and statistical analysis.
Overview¶
The ML module provides utilities that complement sklearn/pandas, not replace them:
- Reproducibility: Unified seed management across frameworks
- Data Splitting: Train/val/test splits with no leakage
- Feature Scaling: Stateful transformers with persistence
- Categorical Encoding: Label, one-hot, ordinal encoding
- Statistical Analysis: Correlation, hypothesis tests, bootstrap
- ID Generation: UUID, ULID, hash-based IDs
Reproducibility¶
Why Reproducibility?¶
Reproducible experiments are essential for:

- Debugging models
- Comparing experiments
- Sharing results
- Production deployment
Seed Management¶
Set seed for all RNGs at once:
```python
from dspu.ml import SeedManager

# Set global seed
SeedManager.set_global_seed(42)

# Now all random operations use this seed:
# - Python's random
# - NumPy's random
# - PyTorch (if available)
# - TensorFlow (if available)
```
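Under the hood, unified seeding amounts to feeding one value into every installed framework's RNG. The sketch below is an assumption about how this works, not `SeedManager`'s actual implementation; the guarded imports keep it usable when NumPy or PyTorch are absent:

```python
import random

def set_global_seed(seed: int) -> None:
    """Sketch: seed every RNG that is installed from a single value."""
    random.seed(seed)               # Python's built-in random
    try:
        import numpy as np
        np.random.seed(seed)        # NumPy's legacy global RNG
    except ImportError:
        pass                        # NumPy not installed; skip silently
    try:
        import torch
        torch.manual_seed(seed)     # PyTorch
    except ImportError:
        pass

set_global_seed(42)
first = [random.random() for _ in range(3)]
set_global_seed(42)
second = [random.random() for _ in range(3)]
assert first == second  # re-seeding reproduces the same sequence
```

Re-running an experiment with the same seed then replays the exact same stream of random numbers, which is what makes results comparable across runs.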
Seed Context¶
Temporary seed for specific operations:
```python
# Global seed remains unchanged
SeedManager.set_global_seed(42)

# Temporarily use a different seed
with SeedManager.seed_context(123):
    data = generate_data()  # Uses seed 123

# Back to seed 42
more_data = generate_data()
```
Synthetic Data¶
Generate reproducible synthetic datasets:
```python
from dspu.ml import make_classification_data, make_regression_data

# Classification data
X, y = make_classification_data(
    n_samples=1000,
    n_features=20,
    n_classes=3,
    random_state=42,
)

# Regression data
X, y = make_regression_data(
    n_samples=1000,
    n_features=10,
    noise=0.1,
    random_state=42,
)
```
Data Splitting¶
Why Proper Splitting?¶
Improper splitting causes data leakage:

- Information from the test set leaks into training
- Overly optimistic performance estimates
- Models that fail in production
Train/Test Split¶
Basic split with optional stratification:
```python
from dspu.ml import DataSplitter

# Simple split
X_train, X_test, y_train, y_test = DataSplitter.train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Stratified split (preserves class distribution)
X_train, X_test, y_train, y_test = DataSplitter.train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```
Three-Way Split¶
Train, validation, and test:
```python
X_train, X_val, X_test, y_train, y_val, y_test = DataSplitter.train_val_test_split(
    X, y,
    test_size=0.2,   # 20% for test
    val_size=0.1,    # 10% for validation
    stratify=y,      # preserve class distribution
    random_state=42,
)
```
Cross-Validation¶
K-Fold¶
```python
# 5-fold cross-validation
folds = DataSplitter.kfold_split(X, n_splits=5, random_state=42)

for train_idx, val_idx in folds:
    X_train = [X[i] for i in train_idx]
    X_val = [X[i] for i in val_idx]
    # Train and evaluate the model
    ...
```
Stratified K-Fold¶
Preserves class distribution in each fold:
```python
# Stratified 5-fold CV
folds = DataSplitter.stratified_kfold(y, n_splits=5, random_state=42)

for train_idx, val_idx in folds:
    # Each fold has the same class distribution as the original data
    ...
```
Time Series Split¶
For temporal data (no future leakage):
```python
# Training data always comes before validation data
splits = DataSplitter.time_series_split(X, n_splits=5)

for train_idx, val_idx in splits:
    # e.g. train_idx: [0, 1, 2, 3, 4]
    #      val_idx:   [5]
    # No shuffling; chronological order is preserved
    assert max(train_idx) < min(val_idx)
```
Use Case: Stock prices, sensor data, logs
Group Split¶
Prevent data leakage when samples are grouped:
```python
# Keep all samples from the same patient together
X_train, X_test, y_train, y_test = DataSplitter.group_split(
    X,
    groups=patient_ids,  # patient IDs
    y=y,
    test_size=0.25,
    random_state=42,
)
# No patient appears in both train and test
```
Use Case: Medical data (patients), user behavior (users), documents (authors)
Feature Scaling¶
Why Scaling?¶
Many ML algorithms require scaled features:

- Neural networks
- SVMs
- K-nearest neighbors
- Gradient-descent optimization
Standard Scaling¶
Mean=0, Std=1:
```python
from dspu.ml import Scaler

scaler = Scaler(method="standard")

# Fit on training data only
X_train_scaled = scaler.fit_transform(X_train)

# Transform test data (no refitting!)
X_test_scaled = scaler.transform(X_test)

# Inverse transform if needed
X_original = scaler.inverse_transform(X_train_scaled)
```
Min-Max Scaling¶
Scale to [0, 1]:
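The doc doesn't show the call here; by analogy with the standard scaler, `Scaler(method="minmax")` is the assumed spelling. The arithmetic itself is standard and can be sketched directly:

```python
def minmax_scale(values, feature_range=(0.0, 1.0)):
    """Linearly map values so the minimum hits the low end of the range and the maximum the high end."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # guard against constant columns
    a, b = feature_range
    return [a + (v - lo) / span * (b - a) for v in values]

print(minmax_scale([10, 20, 40]))  # [0.0, 0.333..., 1.0]
```

Note that min-max scaling is sensitive to outliers: a single extreme value compresses everything else toward one end of the range.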
Robust Scaling¶
Uses median and IQR (robust to outliers):
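Again no call is shown; `Scaler(method="robust")` is assumed from the naming pattern. The underlying computation centers on the median and divides by the interquartile range, so a few extreme values barely move the fit:

```python
import statistics

def robust_scale(values):
    """Center on the median and divide by the IQR; outliers barely affect either statistic."""
    med = statistics.median(values)
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = (q3 - q1) or 1.0  # guard against zero spread
    return [(v - med) / iqr for v in values]

scaled = robust_scale([1, 2, 3, 4, 5])
print(scaled)  # the median (3) maps to 0; IQR of 3 sets the scale
```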
Persistence¶
Save scaler for production:
```python
# Training
scaler = Scaler(method="standard")
X_train_scaled = scaler.fit_transform(X_train)

# Save scaler
scaler.save_to_file("artifacts/scaler.json")

# Production
scaler = Scaler.load_from_file("artifacts/scaler.json")
X_new_scaled = scaler.transform(X_new)
```
Categorical Encoding¶
Label Encoding¶
Convert categories to integers:
```python
from dspu.ml import Encoder

encoder = Encoder(method="label")

# Fit on training data
categories = ["cat", "dog", "bird"]
encoded = encoder.fit_transform(categories)  # [1, 2, 0]

# Transform new data
new_encoded = encoder.transform(["cat", "cat", "dog"])  # [1, 1, 2]

# Inverse transform
decoded = encoder.inverse_transform([1, 2, 0])  # ["cat", "dog", "bird"]
```
One-Hot Encoding¶
Binary vectors:
```python
encoder = Encoder(method="onehot")
encoded = encoder.fit_transform(["cat", "dog", "bird"])
# Result: [[0, 1, 0], [0, 0, 1], [1, 0, 0]]
# Features: bird, cat, dog

# Get feature names
names = encoder.get_feature_names("animal")  # ["animal_bird", "animal_cat", "animal_dog"]
```
Ordinal Encoding¶
With custom order:
```python
# Define order (low to high)
encoder = Encoder(
    method="ordinal",
    categories=["low", "medium", "high"],
)
encoded = encoder.fit_transform(["low", "high", "medium"])  # [0, 2, 1]
```
Frequency Encoding¶
Encode by frequency:
```python
encoder = Encoder(method="frequency")

# Categories encoded by their relative frequency
categories = ["cat", "dog", "cat", "bird", "cat"]
encoded = encoder.fit_transform(categories)  # [0.6, 0.2, 0.6, 0.2, 0.6]
```
Unknown Categories¶
Handle unseen categories in production:
```python
encoder = Encoder(
    method="label",
    handle_unknown="use_default",
    unknown_value=-1,
)
encoder.fit_transform(["cat", "dog"])

# Unknown categories get the default value
encoder.transform(["cat", "bird", "dog"])  # [0, -1, 1]
```
Persistence¶
Save encoder for production:
```python
# Training
encoder = Encoder(method="label")
y_encoded = encoder.fit_transform(y_train)

# Save encoder
encoder.save_to_file("artifacts/encoder.json")

# Production
encoder = Encoder.load_from_file("artifacts/encoder.json")
predictions_encoded = model.predict(X_new)
predictions = encoder.inverse_transform(predictions_encoded)
```
Statistical Analysis¶
Correlation¶
```python
from dspu.ml import Stats

# Pearson correlation
corr = Stats.correlation(feature_a, feature_b, method="pearson")

# Spearman (rank-based)
corr = Stats.correlation(feature_a, feature_b, method="spearman")

# Kendall's tau
corr = Stats.correlation(feature_a, feature_b, method="kendall")
```
Hypothesis Testing¶
```python
# Two-sample t-test
t_stat, p_value = Stats.t_test_independent(group_a, group_b)

if p_value < 0.05:
    print("Groups are significantly different")
```
Bootstrap Confidence Intervals¶
```python
# Bootstrap CI for the mean
estimate, lower, upper = Stats.bootstrap_ci(
    data,
    stat_fn=lambda x: sum(x) / len(x),  # mean
    n_bootstrap=1000,
    confidence_level=0.95,
    random_state=42,
)
print(f"Mean: {estimate:.2f} (95% CI: [{lower:.2f}, {upper:.2f}])")
```
A/B Testing¶
```python
# Test whether the treatment improves conversion
result = Stats.ab_test_uplift(
    group_a=control_conversions,
    group_b=treatment_conversions,
    n_bootstrap=1000,
    confidence_level=0.95,
    random_state=42,
)

print(f"Relative uplift: {result['relative_uplift'] * 100:.1f}%")
print(f"95% CI: [{result['uplift_ci_lower']:.3f}, {result['uplift_ci_upper']:.3f}]")

if result['uplift_ci_lower'] > 0:
    print("Statistically significant improvement!")
```
Outlier Detection¶
```python
# IQR method
outlier_indices = Stats.detect_outliers(data, method="iqr")

# Z-score method
outlier_indices = Stats.detect_outliers(data, method="zscore", threshold=3.0)

# Remove outliers
clean_data = [x for i, x in enumerate(data) if i not in outlier_indices]
```
ID Generation¶
UUID¶
Random universally unique identifiers:
```python
from dspu.ml import IDGenerator

# Generate a UUID v4 (avoid naming it `id`, which shadows the builtin)
new_id = IDGenerator.uuid4()  # "550e8400-e29b-41d4-a716-446655440000"
```
ULID¶
Lexicographically sortable IDs:
```python
# Generate a ULID
new_id = IDGenerator.ulid()  # "01ARZ3NDEKTSV4RRFFQ69G5FAV"

# ULIDs are sortable by creation time
ids = [IDGenerator.ulid() for _ in range(10)]
assert ids == sorted(ids)
```
Hash-based IDs¶
Deterministic IDs from text:
```python
# Generate a hash-based ID
new_id = IDGenerator.hash_text("user_email@example.com", algo="sha256", length=16)

# Same input always gives the same ID (deterministic)
id1 = IDGenerator.hash_text("text")
id2 = IDGenerator.hash_text("text")
assert id1 == id2
```
Add ID Column¶
Add IDs to tabular data:
```python
table = [
    {"name": "Alice", "age": 30},
    {"name": "Bob", "age": 25},
]

# Add UUID column
IDGenerator.add_id_column(table, id_col="_id", method="uuid4")

# Result:
# [
#     {"_id": "550e8400...", "name": "Alice", "age": 30},
#     {"_id": "6ba7b810...", "name": "Bob", "age": 25},
# ]
```
Common Patterns¶
Pattern 1: Complete ML Pipeline¶
```python
from dspu.ml import SeedManager, DataSplitter, Scaler, Encoder

# 1. Set seed for reproducibility
SeedManager.set_global_seed(42)

# 2. Load and split data
X_train, X_test, y_train, y_test = DataSplitter.train_test_split(
    X, y, test_size=0.2, stratify=y
)

# 3. Scale features
scaler = Scaler(method="standard")
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 4. Encode labels
encoder = Encoder(method="label")
y_train_encoded = encoder.fit_transform(y_train)
y_test_encoded = encoder.transform(y_test)

# 5. Train model
model.fit(X_train_scaled, y_train_encoded)

# 6. Save artifacts
scaler.save_to_file("scaler.json")
encoder.save_to_file("encoder.json")
```
Pattern 2: Production Deployment¶
```python
# Load artifacts
scaler = Scaler.load_from_file("artifacts/scaler.json")
encoder = Encoder.load_from_file("artifacts/encoder.json")

# Predict on new data
X_new_scaled = scaler.transform(X_new)
predictions_encoded = model.predict(X_new_scaled)
predictions = encoder.inverse_transform(predictions_encoded)
```
Pattern 3: Cross-Validation¶
```python
import statistics

from sklearn.metrics import accuracy_score

# Stratified 5-fold CV
scores = []
folds = DataSplitter.stratified_kfold(y, n_splits=5)

for train_idx, val_idx in folds:
    X_train = [X[i] for i in train_idx]
    X_val = [X[i] for i in val_idx]
    y_train = [y[i] for i in train_idx]
    y_val = [y[i] for i in val_idx]

    # Scale inside the fold so no validation data leaks into the fit
    scaler = Scaler(method="standard")
    X_train_scaled = scaler.fit_transform(X_train)
    X_val_scaled = scaler.transform(X_val)

    # Train and evaluate
    model.fit(X_train_scaled, y_train)
    predictions = model.predict(X_val_scaled)
    scores.append(accuracy_score(y_val, predictions))

print(f"CV Accuracy: {statistics.mean(scores):.3f} ± {statistics.stdev(scores):.3f}")
```
Best Practices¶
Reproducibility¶
✅ DO:

- Set a global seed at the start of experiments
- Use seed contexts for specific operations
- Document seeds in experiment logs
- Use fixed seeds for debugging

❌ DON'T:

- Skip seeding in experiments
- Use random seeds in production
- Forget framework-specific seeding
- Mix seeded and unseeded operations
Data Splitting¶
✅ DO:

- Use stratified splits for imbalanced data
- Use time-series splits for temporal data
- Use group splits to prevent leakage
- Always keep a separate test set

❌ DON'T:

- Fit scalers on test data
- Use random splits for time series
- Ignore group structure
- Skip the validation set
Feature Engineering¶
✅ DO:

- Fit transformers on training data only
- Save transformers for production
- Handle unknown categories in encoding
- Validate transformed data

❌ DON'T:

- Leak test information into training
- Skip saving transformers
- Ignore data leakage
- Forget inverse transforms