ML API Reference¶
Practical machine learning utilities that complement sklearn/pandas.
Overview¶
The dspu.ml module provides utilities for:
- Reproducibility: Unified seed management across libraries
- ID Generation: UUID, ULID, hash-based identifiers
- Data Splitting: Train/val/test splits with no leakage
- Statistics: Correlation, hypothesis tests, bootstrap, A/B testing
- Feature Scaling: Standard, min-max, robust scaling with state persistence
- Categorical Encoding: Label, one-hot, ordinal, frequency encoding
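These pieces compose into a single workflow; the sketch below, assuming only the top-level imports used in the Usage Examples at the end of this page, shows a typical flow:

```python
from dspu.ml import SeedManager, DataSplitter, Scaler, make_classification_data

SeedManager.set_global_seed(42)                 # reproducibility
X, y = make_classification_data(n_samples=200)  # synthetic data
X_train, X_test, y_train, y_test = DataSplitter.train_test_split(
    X, y, test_size=0.25, stratify=y            # stratified, leakage-free split
)
X_train_scaled = Scaler(method="standard").fit_transform(X_train)
```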
Random & Reproducibility¶
SeedManager¶
dspu.ml.random.SeedManager¶
Unified seed management for reproducibility.
Manages random seeds across multiple libraries (Python random, NumPy, PyTorch, TensorFlow) to ensure reproducible results.
Example

```python
>>> SeedManager.set_global_seed(42)
>>> # Now all random operations are seeded
>>> with SeedManager.seed_context(123):
...     pass  # Temporary seed for this block
```
Functions¶
set_global_seed (classmethod)¶

Set the seed for all available random number generators.

Seeds the following libraries if available:

- Python's random module
- NumPy
- PyTorch (CPU and CUDA)
- TensorFlow

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| seed | int | Random seed value (non-negative integer) | required |

Raises:

| Type | Description |
|---|---|
| RandomError | If seed is negative |

Example

```python
>>> SeedManager.set_global_seed(42)
>>> import random
>>> random.random()  # Will be deterministic
```
Source code in src/dspu/ml/random.py
seed_context (classmethod)¶

Context manager for temporary seeding.

Sets a seed for the duration of the context, then restores the previous seed state.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| seed | int | Temporary seed value | required |

Yields:

| Type | Description |
|---|---|
| None | None |

Example

```python
>>> SeedManager.set_global_seed(42)
>>> with SeedManager.seed_context(123):
...     data = random.random()  # Operations here use seed 123
>>> # Back to seed 42
```
Source code in src/dspu/ml/random.py
get_rng (classmethod)¶

Get a random number generator instance.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| seed | int \| None | Optional seed for the RNG. If None, uses the current global seed. | None |
| backend | str | RNG backend: "python" or "numpy" | 'python' |

Returns:

| Type | Description |
|---|---|
| Any | Random number generator instance |

Raises:

| Type | Description |
|---|---|
| RandomError | If backend is invalid or not available |

Example

```python
>>> rng = SeedManager.get_rng(seed=42, backend="python")
>>> rng.random()
```
Source code in src/dspu/ml/random.py
get_current_seed (classmethod)¶

Get the current global seed value.

Returns:

| Type | Description |
|---|---|
| int \| None | Current seed, or None if not set |
Synthetic Data Generation¶
dspu.ml.random.make_classification_data¶

```python
make_classification_data(
    n_samples: int = 100,
    n_features: int = 5,
    n_classes: int = 2,
    n_informative: int | None = None,
    noise: float = 0.1,
    random_state: int | None = None,
) -> tuple[list[list[float]], list[int]]
```

Generate a synthetic classification dataset.

Creates a simple classification dataset with optional noise. Features are drawn from a normal distribution, with informative features correlating with class labels.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| n_samples | int | Number of samples to generate | 100 |
| n_features | int | Number of features | 5 |
| n_classes | int | Number of classes | 2 |
| n_informative | int \| None | Number of informative features (default: n_features // 2) | None |
| noise | float | Standard deviation of Gaussian noise added to features | 0.1 |
| random_state | int \| None | Random seed for reproducibility | None |

Returns:

| Type | Description |
|---|---|
| tuple[list[list[float]], list[int]] | Tuple of (X, y) where X is features and y is labels |

Example

```python
>>> X, y = make_classification_data(n_samples=100, n_features=5)
>>> len(X), len(y)
(100, 100)
```
dspu.ml.random.make_regression_data¶

```python
make_regression_data(
    n_samples: int = 100,
    n_features: int = 5,
    n_informative: int | None = None,
    noise: float = 0.1,
    random_state: int | None = None,
) -> tuple[list[list[float]], list[float]]
```

Generate a synthetic regression dataset.

Creates a simple regression dataset with a linear relationship between informative features and target, plus Gaussian noise.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| n_samples | int | Number of samples to generate | 100 |
| n_features | int | Number of features | 5 |
| n_informative | int \| None | Number of informative features (default: n_features // 2) | None |
| noise | float | Standard deviation of Gaussian noise added to target | 0.1 |
| random_state | int \| None | Random seed for reproducibility | None |

Returns:

| Type | Description |
|---|---|
| tuple[list[list[float]], list[float]] | Tuple of (X, y) where X is features and y is continuous targets |

Example

```python
>>> X, y = make_regression_data(n_samples=100, n_features=5)
>>> len(X), len(y)
(100, 100)
```
dspu.ml.random.make_time_series¶

```python
make_time_series(
    length: int = 100,
    pattern: str = "trend+seasonality",
    trend_slope: float = 0.1,
    seasonality_period: int = 12,
    seasonality_amplitude: float = 1.0,
    noise: float = 0.1,
    random_state: int | None = None,
) -> list[float]
```

Generate synthetic time series data.

Creates a time series with configurable patterns:

- trend: Linear trend
- seasonality: Sinusoidal seasonality
- trend+seasonality: Both components
- random: Pure random walk

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| length | int | Number of time steps | 100 |
| pattern | str | Pattern type: "trend", "seasonality", "trend+seasonality", or "random" | 'trend+seasonality' |
| trend_slope | float | Slope of the linear trend | 0.1 |
| seasonality_period | int | Period of the seasonal component | 12 |
| seasonality_amplitude | float | Amplitude of the seasonal component | 1.0 |
| noise | float | Standard deviation of Gaussian noise | 0.1 |
| random_state | int \| None | Random seed for reproducibility | None |

Returns:

| Type | Description |
|---|---|
| list[float] | List of time series values |

Raises:

| Type | Description |
|---|---|
| RandomError | If pattern is invalid |

Example

```python
>>> ts = make_time_series(length=100, pattern="trend+seasonality")
>>> len(ts)
100
```
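Because random_state seeds the generator, identical calls produce identical series; a quick reproducibility sketch:

```python
from dspu.ml.random import make_time_series

a = make_time_series(length=50, pattern="seasonality", random_state=7)
b = make_time_series(length=50, pattern="seasonality", random_state=7)
assert a == b  # same seed, same series
```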
ID Generation¶
IDGenerator¶
dspu.ml.identifiers.IDGenerator¶

Utilities for generating and working with identifiers.

Supports multiple ID formats:

- UUID v4: Random, globally unique
- UUID v1: Time-based, globally unique
- ULID: Time-based, lexicographically sortable
- Hash-based: Deterministic from input data

Example

```python
>>> id1 = IDGenerator.uuid4()
>>> id2 = IDGenerator.ulid()
>>> id3 = IDGenerator.hash_text("data")
```
Functions¶
uuid4 (staticmethod)¶

Generate a random UUID v4.

UUID v4 is randomly generated and has an extremely low collision probability.

Returns:

| Type | Description |
|---|---|
| str | UUID v4 as a string (e.g., "550e8400-e29b-41d4-a716-446655440000") |

Example

```python
>>> id = IDGenerator.uuid4()
>>> len(id)
36
```
Source code in src/dspu/ml/identifiers.py
uuid1 (staticmethod)¶

Generate a time-based UUID v1.

UUID v1 includes a timestamp and MAC address. Useful when you need time-ordered IDs, though the resulting strings are not lexicographically sortable (use ulid for that).

Returns:

| Type | Description |
|---|---|
| str | UUID v1 as a string |

Example

```python
>>> id = IDGenerator.uuid1()
>>> len(id)
36
```
Source code in src/dspu/ml/identifiers.py
ulid (staticmethod)¶

Generate a ULID (Universally Unique Lexicographically Sortable Identifier).

ULIDs are:

- 26 characters (vs 36 for UUID)
- Lexicographically sortable by creation time
- Case-insensitive (base32 encoded)
- Timestamp + randomness

Returns:

| Type | Description |
|---|---|
| str | ULID as a string (e.g., "01ARZ3NDEKTSV4RRFFQ69G5FAV") |

Example

```python
>>> import time
>>> id1 = IDGenerator.ulid()
>>> time.sleep(0.001)
>>> id2 = IDGenerator.ulid()
>>> id1 < id2  # Sortable by time
True
```
Source code in src/dspu/ml/identifiers.py
hash_text (staticmethod)¶

```python
hash_text(
    text: str,
    algo: Literal["md5", "sha1", "sha256", "sha512"] = "sha256",
    length: int | None = None,
) -> str
```

Generate a hash-based ID from text.

Creates a deterministic ID by hashing the input text. Useful for deduplication or creating stable IDs from content.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| text | str | Input text to hash | required |
| algo | Literal['md5', 'sha1', 'sha256', 'sha512'] | Hash algorithm: "md5", "sha1", "sha256", or "sha512" | 'sha256' |
| length | int \| None | Optional truncation length (in characters) | None |

Returns:

| Type | Description |
|---|---|
| str | Hex-encoded hash string |

Raises:

| Type | Description |
|---|---|
| IDError | If the algorithm is invalid |

Example

```python
>>> id1 = IDGenerator.hash_text("hello")
>>> id2 = IDGenerator.hash_text("hello")
>>> id1 == id2  # Same input = same ID
True
```
Source code in src/dspu/ml/identifiers.py
hash_row (staticmethod)¶

```python
hash_row(
    row: dict[str, Any],
    columns: list[str] | None = None,
    algo: Literal["md5", "sha1", "sha256", "sha512"] = "sha256",
    length: int | None = None,
) -> str
```

Generate a stable hash-based ID for a row.

Creates a deterministic ID by hashing selected column values. Useful for composite keys or content-based deduplication.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| row | dict[str, Any] | Dictionary representing a row | required |
| columns | list[str] \| None | Columns to include in the hash (None = all columns, sorted) | None |
| algo | Literal['md5', 'sha1', 'sha256', 'sha512'] | Hash algorithm | 'sha256' |
| length | int \| None | Optional truncation length | None |

Returns:

| Type | Description |
|---|---|
| str | Hex-encoded hash string |

Example

```python
>>> row = {"name": "Alice", "age": 30}
>>> id = IDGenerator.hash_row(row, columns=["name"])
```
Source code in src/dspu/ml/identifiers.py
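Since hash_row is deterministic, it can serve directly as a dedup key; a minimal sketch (the seen/deduped names are illustrative, not part of the API):

```python
from dspu.ml.identifiers import IDGenerator

rows = [
    {"name": "Alice", "age": 30},
    {"name": "Bob", "age": 25},
    {"name": "Alice", "age": 30},  # duplicate content
]

seen: set[str] = set()
deduped = []
for row in rows:
    key = IDGenerator.hash_row(row)  # same content -> same hash
    if key not in seen:
        seen.add(key)
        deduped.append(row)

print(len(deduped))  # 2
```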
add_id_column (staticmethod)¶

```python
add_id_column(
    table: list[dict[str, Any]],
    id_col: str = "_id",
    method: Literal["uuid4", "uuid1", "ulid", "hash", "sequential"] = "uuid4",
    hash_columns: list[str] | None = None,
    start: int = 1,
) -> list[dict[str, Any]]
```

Add an ID column to a table (list of dicts).

Modifies the table in-place by adding an ID column to each row.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| table | list[dict[str, Any]] | List of dictionaries (rows) | required |
| id_col | str | Name of the ID column to add | '_id' |
| method | Literal['uuid4', 'uuid1', 'ulid', 'hash', 'sequential'] | ID generation method: "uuid4", "uuid1", "ulid", "hash", or "sequential" | 'uuid4' |
| hash_columns | list[str] \| None | Columns to use for hash-based IDs (for method="hash") | None |
| start | int | Starting value for sequential IDs | 1 |

Returns:

| Type | Description |
|---|---|
| list[dict[str, Any]] | Modified table (same object, modified in-place) |

Raises:

| Type | Description |
|---|---|
| IDError | If method is invalid, or hash_columns is not provided for the hash method |

Example

```python
>>> table = [{"name": "Alice"}, {"name": "Bob"}]
>>> IDGenerator.add_id_column(table, id_col="id", method="uuid4")
>>> "id" in table[0]
True
```
Source code in src/dspu/ml/identifiers.py
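For human-readable keys, the sequential method with a custom start is handy; a small sketch (assuming sequential IDs increment by one, which the docstring implies but does not state):

```python
table = [{"name": "Alice"}, {"name": "Bob"}]
IDGenerator.add_id_column(table, id_col="row_id", method="sequential", start=100)
print([r["row_id"] for r in table])  # expected: [100, 101]
```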
Data Splitting¶
DataSplitter¶
dspu.ml.splits.DataSplitter¶

Data splitting strategies for ML workflows.

Provides various train/test splitting methods with proper stratification and group handling to prevent data leakage.

Example

```python
>>> X = [[1], [2], [3], [4]]
>>> y = [0, 1, 0, 1]
>>> X_train, X_test, y_train, y_test = DataSplitter.train_test_split(
...     X, y, test_size=0.25
... )
```
Functions¶
train_test_split (staticmethod)¶

```python
train_test_split(
    X: list[Any],
    y: list[Any] | None = None,
    test_size: float = 0.25,
    stratify: list[Any] | None = None,
    random_state: int | None = None,
) -> tuple[list[Any], list[Any], list[Any] | None, list[Any] | None]
```

Split data into train and test sets.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| X | list[Any] | Input data (list of samples) | required |
| y | list[Any] \| None | Target labels (optional) | None |
| test_size | float | Fraction of data for the test set (0 to 1) | 0.25 |
| stratify | list[Any] \| None | If provided, perform a stratified split to maintain class distribution | None |
| random_state | int \| None | Random seed for reproducibility | None |

Returns:

| Type | Description |
|---|---|
| tuple[list[Any], list[Any], list[Any] \| None, list[Any] \| None] | Tuple of (X_train, X_test, y_train, y_test). If y is None, returns (X_train, X_test, None, None). |

Raises:

| Type | Description |
|---|---|
| SplitError | If test_size is invalid or data sizes don't match |

Example

```python
>>> X = [[1], [2], [3], [4]]
>>> y = [0, 1, 0, 1]
>>> X_train, X_test, y_train, y_test = DataSplitter.train_test_split(
...     X, y, test_size=0.25, random_state=42
... )
```
Source code in src/dspu/ml/splits.py
train_val_test_split (staticmethod)¶

```python
train_val_test_split(
    X: list[Any],
    y: list[Any] | None = None,
    test_size: float = 0.2,
    val_size: float = 0.1,
    stratify: list[Any] | None = None,
    random_state: int | None = None,
) -> tuple[
    list[Any],
    list[Any],
    list[Any],
    list[Any] | None,
    list[Any] | None,
    list[Any] | None,
]
```

Split data into train, validation, and test sets.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| X | list[Any] | Input data | required |
| y | list[Any] \| None | Target labels (optional) | None |
| test_size | float | Fraction for the test set | 0.2 |
| val_size | float | Fraction for the validation set | 0.1 |
| stratify | list[Any] \| None | If provided, perform a stratified split | None |
| random_state | int \| None | Random seed | None |

Returns:

| Type | Description |
|---|---|
| tuple[list[Any], list[Any], list[Any], list[Any] \| None, list[Any] \| None, list[Any] \| None] | Tuple of (X_train, X_val, X_test, y_train, y_val, y_test) |

Raises:

| Type | Description |
|---|---|
| SplitError | If sizes are invalid |

Example

```python
>>> X = list(range(100))
>>> y = [i % 2 for i in range(100)]
>>> splits = DataSplitter.train_val_test_split(X, y, test_size=0.2, val_size=0.1)
>>> X_train, X_val, X_test, y_train, y_val, y_test = splits
```
Source code in src/dspu/ml/splits.py
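A quick sanity check of the resulting sizes; this sketch assumes test_size and val_size are both fractions of the full dataset (the parameter table suggests this, but does not state it explicitly):

```python
X = list(range(100))
y = [i % 2 for i in range(100)]
X_train, X_val, X_test, y_train, y_val, y_test = DataSplitter.train_val_test_split(
    X, y, test_size=0.2, val_size=0.1, random_state=42
)
print(len(X_train), len(X_val), len(X_test))  # expected: 70 10 20
```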
kfold_split (staticmethod)¶

```python
kfold_split(
    X: list[Any],
    n_splits: int = 5,
    shuffle: bool = True,
    random_state: int | None = None,
) -> list[tuple[list[int], list[int]]]
```

K-fold cross-validation indices.

Splits data into K consecutive folds, yielding train/validation indices for each fold.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| X | list[Any] | Input data (only its length is used) | required |
| n_splits | int | Number of folds | 5 |
| shuffle | bool | Whether to shuffle data before splitting | True |
| random_state | int \| None | Random seed (only used if shuffle=True) | None |

Returns:

| Type | Description |
|---|---|
| list[tuple[list[int], list[int]]] | List of (train_indices, val_indices) tuples |

Raises:

| Type | Description |
|---|---|
| SplitError | If n_splits is invalid |

Example

```python
>>> X = list(range(10))
>>> folds = DataSplitter.kfold_split(X, n_splits=3)
>>> for train_idx, val_idx in folds:
...     print(len(train_idx), len(val_idx))
```
Source code in src/dspu/ml/splits.py
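The index folds plug directly into a manual cross-validation loop; a sketch (the model-fitting step is omitted):

```python
X = list(range(10))
y = [i % 2 for i in range(10)]

for fold, (train_idx, val_idx) in enumerate(
    DataSplitter.kfold_split(X, n_splits=5, random_state=0)
):
    X_tr, y_tr = [X[i] for i in train_idx], [y[i] for i in train_idx]
    X_va, y_va = [X[i] for i in val_idx], [y[i] for i in val_idx]
    # fit on (X_tr, y_tr), evaluate on (X_va, y_va)
    print(f"fold {fold}: {len(X_tr)} train / {len(X_va)} val")
```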
stratified_kfold (staticmethod)¶

```python
stratified_kfold(
    y: list[Any],
    n_splits: int = 5,
    random_state: int | None = None,
) -> list[tuple[list[int], list[int]]]
```

Stratified K-fold cross-validation indices.

Like K-fold, but preserves the class distribution in each fold. Useful for imbalanced datasets.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| y | list[Any] | Target labels | required |
| n_splits | int | Number of folds | 5 |
| random_state | int \| None | Random seed | None |

Returns:

| Type | Description |
|---|---|
| list[tuple[list[int], list[int]]] | List of (train_indices, val_indices) tuples |

Raises:

| Type | Description |
|---|---|
| SplitError | If any class has fewer samples than n_splits |

Example

```python
>>> y = [0, 0, 0, 1, 1, 1]
>>> folds = DataSplitter.stratified_kfold(y, n_splits=3)
```
Source code in src/dspu/ml/splits.py
time_series_split (staticmethod)¶

```python
time_series_split(
    X: list[Any],
    n_splits: int = 5,
    max_train_size: int | None = None,
) -> list[tuple[list[int], list[int]]]
```

Time-series cross-validation indices.

Sequential splits suitable for temporal data, where future data must not leak into training. Each fold uses progressively more historical data.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| X | list[Any] | Input data (only its length is used) | required |
| n_splits | int | Number of splits | 5 |
| max_train_size | int \| None | Maximum size of the training set (None = unlimited) | None |

Returns:

| Type | Description |
|---|---|
| list[tuple[list[int], list[int]]] | List of (train_indices, val_indices) tuples |

Raises:

| Type | Description |
|---|---|
| SplitError | If n_splits is too large |

Example

```python
>>> X = list(range(20))
>>> splits = DataSplitter.time_series_split(X, n_splits=4)
>>> # Split 1: train [0-4], test [5-9]
>>> # Split 2: train [0-9], test [10-14]
>>> # Split 3: train [0-14], test [15-19]
```
Source code in src/dspu/ml/splits.py
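This is the building block for walk-forward validation; a sketch mirroring the assertion used in the Usage Examples below:

```python
series = list(range(20))
for train_idx, val_idx in DataSplitter.time_series_split(series, n_splits=4):
    assert max(train_idx) < min(val_idx)  # train strictly precedes validation
    history = [series[i] for i in train_idx]
    future = [series[i] for i in val_idx]
    # fit a forecaster on history, score it on future (model code omitted)
```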
group_split (staticmethod)¶

```python
group_split(
    X: list[Any],
    groups: list[Any],
    y: list[Any] | None = None,
    test_size: float = 0.25,
    random_state: int | None = None,
) -> tuple[list[Any], list[Any], list[Any] | None, list[Any] | None]
```

Split data by groups to prevent leakage.

Ensures that samples from the same group appear in either the train set or the test set, never both. Critical for scenarios like:

- Patient data (keep all visits from the same patient together)
- Time series (keep all events from the same entity together)
- Hierarchical data (keep related samples together)

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| X | list[Any] | Input data | required |
| groups | list[Any] | Group labels for each sample | required |
| y | list[Any] \| None | Target labels (optional) | None |
| test_size | float | Approximate fraction for the test set | 0.25 |
| random_state | int \| None | Random seed | None |

Returns:

| Type | Description |
|---|---|
| tuple[list[Any], list[Any], list[Any] \| None, list[Any] \| None] | Tuple of (X_train, X_test, y_train, y_test) |

Raises:

| Type | Description |
|---|---|
| SplitError | If data sizes don't match |

Example

```python
>>> X = [[1], [2], [3], [4], [5], [6]]
>>> y = [0, 0, 1, 1, 0, 1]
>>> groups = ["A", "A", "B", "B", "C", "C"]  # A, B, C are patients
>>> X_train, X_test, y_train, y_test = DataSplitter.group_split(
...     X, groups, y, test_size=0.33
... )
>>> # All samples from the same patient stay together
```
Source code in src/dspu/ml/splits.py
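The no-leakage guarantee is easy to verify directly; a sketch (recovering each sample's group via X.index works here because the samples are unique):

```python
X = [[1], [2], [3], [4], [5], [6]]
y = [0, 0, 1, 1, 0, 1]
groups = ["A", "A", "B", "B", "C", "C"]

X_train, X_test, y_train, y_test = DataSplitter.group_split(
    X, groups, y, test_size=0.33, random_state=0
)
train_groups = {groups[X.index(x)] for x in X_train}
test_groups = {groups[X.index(x)] for x in X_test}
assert not (train_groups & test_groups)  # no group appears on both sides
```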
Statistical Utilities¶
Stats¶
dspu.ml.stats.Stats¶

Statistical utilities for data analysis and ML.

Provides common statistical functions without requiring scipy. For advanced statistical analysis, consider using scipy.stats.

Example

```python
>>> x = [1, 2, 3, 4, 5]
>>> y = [2, 4, 5, 4, 5]
>>> corr = Stats.correlation(x, y)
```
Functions¶
mean (staticmethod)¶

Calculate the arithmetic mean.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| data | list[float] | List of numbers | required |

Returns:

| Type | Description |
|---|---|
| float | Mean value |

Raises:

| Type | Description |
|---|---|
| StatsError | If data is empty |
Source code in src/dspu/ml/stats.py
median (staticmethod)¶

Calculate the median.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| data | list[float] | List of numbers | required |

Returns:

| Type | Description |
|---|---|
| float | Median value |

Raises:

| Type | Description |
|---|---|
| StatsError | If data is empty |
Source code in src/dspu/ml/stats.py
std (staticmethod)¶

Calculate the standard deviation.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| data | list[float] | List of numbers | required |
| ddof | int | Delta degrees of freedom (0 for population, 1 for sample) | 0 |

Returns:

| Type | Description |
|---|---|
| float | Standard deviation |

Raises:

| Type | Description |
|---|---|
| StatsError | If data is empty or has insufficient samples |
Source code in src/dspu/ml/stats.py
correlation (staticmethod)¶

```python
correlation(
    x: list[float],
    y: list[float],
    method: Literal["pearson", "spearman", "kendall"] = "pearson",
) -> float
```

Calculate the correlation between two variables.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| x | list[float] | First variable | required |
| y | list[float] | Second variable | required |
| method | Literal['pearson', 'spearman', 'kendall'] | Correlation method: "pearson", "spearman", or "kendall" | 'pearson' |

Returns:

| Type | Description |
|---|---|
| float | Correlation coefficient (between -1 and 1) |

Raises:

| Type | Description |
|---|---|
| StatsError | If lengths don't match or method is invalid |

Example

```python
>>> x = [1, 2, 3, 4, 5]
>>> y = [2, 4, 5, 4, 5]
>>> corr = Stats.correlation(x, y, method="pearson")
```
Source code in src/dspu/ml/stats.py
t_test_independent (staticmethod)¶

```python
t_test_independent(
    sample1: list[float],
    sample2: list[float],
    equal_var: bool = True,
) -> tuple[float, float]
```

Independent two-sample t-test.

Tests whether two samples have different means.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| sample1 | list[float] | First sample | required |
| sample2 | list[float] | Second sample | required |
| equal_var | bool | Assume equal variances (True) or not (False, Welch's t-test) | True |

Returns:

| Type | Description |
|---|---|
| tuple[float, float] | Tuple of (t_statistic, p_value_approx) |

Note: the p-value is approximate.

Raises:

| Type | Description |
|---|---|
| StatsError | If samples are too small |

Example

```python
>>> sample1 = [1, 2, 3, 4, 5]
>>> sample2 = [2, 3, 4, 5, 6]
>>> t_stat, p_val = Stats.t_test_independent(sample1, sample2)
```
Source code in src/dspu/ml/stats.py
bootstrap_ci (staticmethod)¶

```python
bootstrap_ci(
    data: list[float],
    stat_fn: Callable[[list[float]], float] | None = None,
    n_bootstrap: int = 1000,
    confidence_level: float = 0.95,
    random_state: int | None = None,
) -> tuple[float, float, float]
```

Bootstrap confidence interval for a statistic.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| data | list[float] | Input data | required |
| stat_fn | Callable[[list[float]], float] \| None | Statistic function (default: mean) | None |
| n_bootstrap | int | Number of bootstrap samples | 1000 |
| confidence_level | float | Confidence level (e.g., 0.95 for a 95% CI) | 0.95 |
| random_state | int \| None | Random seed | None |

Returns:

| Type | Description |
|---|---|
| tuple[float, float, float] | Tuple of (estimate, lower_bound, upper_bound) |

Example

```python
>>> data = [1, 2, 3, 4, 5]
>>> estimate, lower, upper = Stats.bootstrap_ci(data, n_bootstrap=1000)
```
Source code in src/dspu/ml/stats.py
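stat_fn accepts any callable from list[float] to float, so the CI can target statistics other than the mean; a sketch using Stats.median:

```python
data = [1.0, 2.0, 2.5, 3.0, 4.0, 10.0]
est, lo, hi = Stats.bootstrap_ci(
    data,
    stat_fn=Stats.median,  # any list[float] -> float callable
    n_bootstrap=2000,
    confidence_level=0.95,
    random_state=42,
)
print(f"median={est:.2f}, 95% CI=({lo:.2f}, {hi:.2f})")
```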
ab_test_uplift (staticmethod)¶

```python
ab_test_uplift(
    group_a: list[float],
    group_b: list[float],
    metric_fn: Callable[[list[float]], float] | None = None,
    n_bootstrap: int = 1000,
    confidence_level: float = 0.95,
    random_state: int | None = None,
) -> dict[str, float]
```

A/B test uplift calculation with a bootstrap confidence interval.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| group_a | list[float] | Metric values for group A (control) | required |
| group_b | list[float] | Metric values for group B (treatment) | required |
| metric_fn | Callable[[list[float]], float] \| None | Metric function (default: mean) | None |
| n_bootstrap | int | Number of bootstrap samples | 1000 |
| confidence_level | float | Confidence level | 0.95 |
| random_state | int \| None | Random seed | None |

Returns:

| Type | Description |
|---|---|
| dict[str, float] | A/B test results; keys include "relative_uplift", "uplift_ci_lower", and "uplift_ci_upper" (see the A/B Testing example below) |

Example

```python
>>> group_a = [10, 12, 11, 13, 12]  # Control
>>> group_b = [15, 16, 14, 17, 16]  # Treatment
>>> result = Stats.ab_test_uplift(group_a, group_b)
>>> print(result["relative_uplift"])  # % improvement
```
Source code in src/dspu/ml/stats.py
detect_outliers (staticmethod)¶

```python
detect_outliers(
    data: list[float],
    method: Literal["iqr", "zscore"] = "iqr",
    threshold: float | None = None,
) -> list[int]
```

Detect outliers in data.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| data | list[float] | Input data | required |
| method | Literal['iqr', 'zscore'] | Detection method: "iqr" (interquartile range) or "zscore" | 'iqr' |
| threshold | float \| None | Custom threshold (default: 1.5 for IQR, 3.0 for Z-score) | None |

Returns:

| Type | Description |
|---|---|
| list[int] | List of indices of outlier values |

Example

```python
>>> data = [1, 2, 3, 4, 5, 100]
>>> outlier_indices = Stats.detect_outliers(data, method="iqr")
>>> [data[i] for i in outlier_indices]
[100]
```
Source code in src/dspu/ml/stats.py
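The Z-score method with a custom threshold; a sketch (which values are flagged depends on the implementation's choice of population vs. sample standard deviation, so the output comment is indicative):

```python
data = [10.0, 11.0, 10.5, 9.8, 10.2, 25.0]
idx = Stats.detect_outliers(data, method="zscore", threshold=2.0)
print([data[i] for i in idx])  # the extreme value, here 25.0
```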
Feature Scaling¶
Scaler¶
dspu.ml.scaling.Scaler¶

```python
Scaler(
    method: Literal["standard", "minmax", "robust"] = "standard",
    feature_range: tuple[float, float] = (0.0, 1.0),
)
```

Feature scaling with a fit/transform pattern.

Supports multiple scaling methods:

- standard: (x - mean) / std
- minmax: (x - min) / (max - min)
- robust: (x - median) / IQR

Example

```python
>>> scaler = Scaler(method="standard")
>>> X = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
>>> X_scaled = scaler.fit_transform(X)
```

Initialize the scaler.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| method | Literal['standard', 'minmax', 'robust'] | Scaling method: "standard", "minmax", or "robust" | 'standard' |
| feature_range | tuple[float, float] | Target range for minmax scaling (default: (0, 1)) | (0.0, 1.0) |

Raises:

| Type | Description |
|---|---|
| ScalingError | If method is invalid |
Source code in src/dspu/ml/scaling.py
Functions¶
fit¶

```python
fit(X: list[list[float]]) -> Scaler
```

Fit the scaler to data by computing statistics.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| X | list[list[float]] | Training data (list of samples; each sample is a list of features) | required |

Returns:

| Type | Description |
|---|---|
| Scaler | self (for method chaining) |

Raises:

| Type | Description |
|---|---|
| ScalingError | If data is empty or invalid |

Example

```python
>>> scaler = Scaler(method="standard")
>>> X = [[1.0, 2.0], [2.0, 4.0]]
>>> scaler.fit(X)
```
transform¶

Transform data using the fitted parameters.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| X | list[list[float]] | Data to transform | required |

Returns:

| Type | Description |
|---|---|
| list[list[float]] | Scaled data |

Raises:

| Type | Description |
|---|---|
| ScalingError | If the scaler is not fitted or the data shape mismatches |

Example

```python
>>> scaler = Scaler(method="standard")
>>> X_train = [[1.0], [2.0], [3.0]]
>>> scaler.fit(X_train)
>>> X_test = [[1.5]]
>>> X_test_scaled = scaler.transform(X_test)
```
Source code in src/dspu/ml/scaling.py
fit_transform¶

Fit the scaler and transform the data in one step.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| X | list[list[float]] | Training data | required |

Returns:

| Type | Description |
|---|---|
| list[list[float]] | Scaled data |

Example

```python
>>> scaler = Scaler(method="standard")
>>> X = [[1.0], [2.0], [3.0]]
>>> X_scaled = scaler.fit_transform(X)
```
Source code in src/dspu/ml/scaling.py
inverse_transform¶

Reverse the scaling transformation.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| X_scaled | list[list[float]] | Scaled data | required |

Returns:

| Type | Description |
|---|---|
| list[list[float]] | Original-scale data |

Raises:

| Type | Description |
|---|---|
| ScalingError | If the scaler is not fitted or the inverse is not supported |

Example

```python
>>> scaler = Scaler(method="standard")
>>> X = [[1.0], [2.0], [3.0]]
>>> X_scaled = scaler.fit_transform(X)
>>> X_recovered = scaler.inverse_transform(X_scaled)
```
Source code in src/dspu/ml/scaling.py
save_state¶

Save the scaler state for serialization.

Returns:

| Type | Description |
|---|---|
| dict[str, Any] | Dictionary containing scaler configuration and statistics |

Raises:

| Type | Description |
|---|---|
| ScalingError | If the scaler is not fitted |

Example

```python
>>> scaler = Scaler(method="standard")
>>> scaler.fit([[1.0], [2.0]])
>>> state = scaler.save_state()
>>> # Save to file: json.dump(state, f)
```
Source code in src/dspu/ml/scaling.py
from_state (classmethod)¶

```python
from_state(state: dict[str, Any]) -> Scaler
```

Load a scaler from saved state.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| state | dict[str, Any] | Dictionary from save_state() | required |

Returns:

| Type | Description |
|---|---|
| Scaler | Configured scaler |

Example

```python
>>> state = {"method": "standard", "means": [0.0], ...}
>>> scaler = Scaler.from_state(state)
```
Source code in src/dspu/ml/scaling.py
save_to_file¶

Save the scaler state to a JSON file.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| filepath | str | Path to the output file | required |

Example

```python
>>> scaler.save_to_file("scaler.json")
```
Source code in src/dspu/ml/scaling.py
load_from_file (classmethod)¶

```python
load_from_file(filepath: str) -> Scaler
```

Load a scaler from a JSON file.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| filepath | str | Path to the state file | required |

Returns:

| Type | Description |
|---|---|
| Scaler | Configured scaler |

Example

```python
>>> scaler = Scaler.load_from_file("scaler.json")
```
Source code in src/dspu/ml/scaling.py
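The save/load pair supports the common train-time/serve-time split; a sketch (the 0.5 in the comment follows from the documented minmax formula with min=1 and max=9):

```python
# Training time: fit and persist
scaler = Scaler(method="minmax", feature_range=(0.0, 1.0))
scaler.fit([[1.0], [5.0], [9.0]])
scaler.save_to_file("scaler.json")

# Serving time (possibly another process): reload and transform
serving_scaler = Scaler.load_from_file("scaler.json")
print(serving_scaler.transform([[5.0]]))  # [[0.5]]
```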
Categorical Encoding¶
Encoder¶
dspu.ml.encoding.Encoder¶

```python
Encoder(
    method: Literal["label", "onehot", "ordinal", "frequency"] = "label",
    categories: list[str] | None = None,
    handle_unknown: Literal["error", "use_default", "ignore"] = "error",
    unknown_value: int | list[int] | None = None,
)
```

Categorical encoding with a fit/transform pattern.

Supports multiple encoding methods:

- label: Categories → integers (0, 1, 2, ...)
- onehot: Categories → binary vectors ([1,0,0], [0,1,0], ...)
- ordinal: Categories → ordered integers (custom order)
- frequency: Categories → occurrence frequency

Example

```python
>>> encoder = Encoder(method="label")
>>> categories = ["A", "B", "A", "C"]
>>> encoded = encoder.fit_transform(categories)
```

Initialize the encoder.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| method | Literal['label', 'onehot', 'ordinal', 'frequency'] | Encoding method | 'label' |
| categories | list[str] \| None | Pre-defined category order (for ordinal encoding) | None |
| handle_unknown | Literal['error', 'use_default', 'ignore'] | How to handle unknown categories: "error" raises an error (default); "use_default" uses unknown_value; "ignore" skips/removes unknown values | 'error' |
| unknown_value | int \| list[int] \| None | Value for unknown categories (for handle_unknown="use_default") | None |

Raises:

| Type | Description |
|---|---|
| EncodingError | If method is invalid |
Source code in src/dspu/ml/encoding.py
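The handle_unknown options matter once transform sees categories absent at fit time; a sketch of the use_default path (the exact integers assigned to known categories depend on the learned mapping):

```python
enc = Encoder(method="label", handle_unknown="use_default", unknown_value=-1)
enc.fit(["red", "green", "blue"])
out = enc.transform(["green", "purple"])
# "purple" was never seen during fit, so it maps to unknown_value
print(out[1])  # -1
```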
Functions¶
fit¶

```python
fit(categories: list[str]) -> Encoder
```

Fit the encoder by learning category mappings.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| categories | list[str] | Training categories | required |

Returns:

| Type | Description |
|---|---|
| Encoder | self (for method chaining) |

Raises:

| Type | Description |
|---|---|
| EncodingError | If data is invalid |

Example

```python
>>> encoder = Encoder(method="label")
>>> encoder.fit(["A", "B", "A", "C"])
```
Source code in src/dspu/ml/encoding.py
transform¶

Transform categories using the fitted encoding.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| categories | list[str] | Categories to encode | required |

Returns:

| Type | Description |
|---|---|
| list[Any] | Encoded values (type depends on the encoding method) |

Raises:

| Type | Description |
|---|---|
| EncodingError | If the encoder is not fitted or an unknown category is encountered |

Example

```python
>>> encoder = Encoder(method="label")
>>> encoder.fit(["A", "B", "C"])
>>> encoder.transform(["A", "C", "B"])
[0, 2, 1]
```
Source code in src/dspu/ml/encoding.py
fit_transform¶

Fit the encoder and transform the categories in one step.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| categories | list[str] | Training categories | required |

Returns:

| Type | Description |
|---|---|
| list[Any] | Encoded values |

Example

```python
>>> encoder = Encoder(method="label")
>>> encoded = encoder.fit_transform(["A", "B", "A"])
```
Source code in src/dspu/ml/encoding.py
inverse_transform¶

Reverse the encoding transformation.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| encoded | list[int] | Encoded integers | required |

Returns:

| Type | Description |
|---|---|
| list[str] | Original categories |

Raises:

| Type | Description |
|---|---|
| EncodingError | If the encoder is not fitted or the method doesn't support inverse |

Example

```python
>>> encoder = Encoder(method="label")
>>> encoder.fit(["A", "B", "C"])
>>> encoded = [0, 2, 1]
>>> encoder.inverse_transform(encoded)
['A', 'C', 'B']
```
Source code in src/dspu/ml/encoding.py
get_feature_names¶

Get feature names for one-hot encoding.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| input_feature | str | Name of the input feature | 'x' |

Returns:

| Type | Description |
|---|---|
| list[str] | List of feature names (one per category) |

Raises:

| Type | Description |
|---|---|
| EncodingError | If the method is not one-hot |

Example

```python
>>> encoder = Encoder(method="onehot")
>>> encoder.fit(["A", "B", "C"])
>>> encoder.get_feature_names("category")
['category_A', 'category_B', 'category_C']
```
Source code in src/dspu/ml/encoding.py
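Pairing one-hot vectors with their feature names; a sketch (the vector layout shown in the comment, e.g. [1, 0, 0] for "A", follows the binary-vector description in the class docstring):

```python
enc = Encoder(method="onehot")
vectors = enc.fit_transform(["A", "B", "A", "C"])
names = enc.get_feature_names("category")
print(names)       # ['category_A', 'category_B', 'category_C']
print(vectors[0])  # one-hot vector for "A", e.g. [1, 0, 0]
```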
save_state¶

Save the encoder state for serialization.

Returns:

| Type | Description |
|---|---|
| dict[str, Any] | Dictionary containing encoder configuration and mappings |

Raises:

| Type | Description |
|---|---|
| EncodingError | If the encoder is not fitted |

Example

```python
>>> encoder = Encoder(method="label")
>>> encoder.fit(["A", "B"])
>>> state = encoder.save_state()
```
Source code in src/dspu/ml/encoding.py
from_state (classmethod)¶

```python
from_state(state: dict[str, Any]) -> Encoder
```

Load an encoder from saved state.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| state | dict[str, Any] | Dictionary from save_state() | required |

Returns:

| Type | Description |
|---|---|
| Encoder | Configured encoder |

Example

```python
>>> state = {"method": "label", "categories": ["A", "B"], ...}
>>> encoder = Encoder.from_state(state)
```
Source code in src/dspu/ml/encoding.py
save_to_file¶

Save the encoder state to a JSON file.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| filepath | str | Path to the output file | required |

Example

```python
>>> encoder.save_to_file("encoder.json")
```
Source code in src/dspu/ml/encoding.py
load_from_file (classmethod)¶

```python
load_from_file(filepath: str) -> Encoder
```

Load an encoder from a JSON file.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| filepath | str | Path to the state file | required |

Returns:

| Type | Description |
|---|---|
| Encoder | Configured encoder |

Example

```python
>>> encoder = Encoder.load_from_file("encoder.json")
```
Source code in src/dspu/ml/encoding.py
Exceptions¶

Each submodule defines its own exception type, raised as documented in the Raises tables above: RandomError, IDError, SplitError, StatsError, ScalingError, and EncodingError.
Usage Examples¶
Reproducible ML Pipeline¶
```python
from dspu.ml import SeedManager, DataSplitter, Scaler, make_classification_data

# Set seed
SeedManager.set_global_seed(42)

# Generate data
X, y = make_classification_data(n_samples=1000, n_features=10)

# Split
X_train, X_test, y_train, y_test = DataSplitter.train_test_split(
    X, y, test_size=0.2, stratify=y
)

# Scale
scaler = Scaler(method="standard")
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Save for production
scaler.save_to_file("scaler.json")
```
Data Splitting Strategies¶
```python
from dspu.ml import DataSplitter

# X, y as in the pipeline above; patient_ids holds one group label per sample

# Stratified K-fold (preserves class distribution)
folds = DataSplitter.stratified_kfold(y, n_splits=5)
for train_idx, val_idx in folds:
    X_train = [X[i] for i in train_idx]
    X_val = [X[i] for i in val_idx]
    # Train and validate

# Time series split (no future leakage)
splits = DataSplitter.time_series_split(X, n_splits=5)
for train_idx, val_idx in splits:
    assert max(train_idx) < min(val_idx)  # Chronological order

# Group split (prevent leakage by group)
X_train, X_test, y_train, y_test = DataSplitter.group_split(
    X, groups=patient_ids, y=y, test_size=0.25
)
```
A/B Testing¶
```python
from dspu.ml import Stats

# Analyze an A/B test
result = Stats.ab_test_uplift(
    group_a=[0.10, 0.11, 0.09, 0.10, 0.12],  # Control conversion rates
    group_b=[0.15, 0.16, 0.14, 0.15, 0.17],  # Treatment conversion rates
    n_bootstrap=1000,
)

print(f"Uplift: {result['relative_uplift']*100:.1f}%")
print(f"95% CI: [{result['uplift_ci_lower']:.3f}, {result['uplift_ci_upper']:.3f}]")

if result['uplift_ci_lower'] > 0:
    print("Statistically significant improvement!")
```
See Also¶
- ML Examples - Comprehensive examples
- ML User Guide - Detailed guide
- Tutorial - Getting started