DSPU Documentation

Comprehensive data science utilities for Python

DSPU is a modern, production-ready library providing essential utilities for data science and machine learning workflows. It is built with type safety, async support, and developer experience in mind.

Features

🔧 Core Utilities

  • Protocols & Type System: Type-safe interfaces for common patterns
  • Plugin Registry: Extensible plugin system for custom integrations
  • Exception Hierarchy: Comprehensive error handling
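
To illustrate how a protocol-based interface and a plugin registry can fit together, here is a minimal sketch using only the standard library. The names `Loader`, `PluginRegistry`, and `JsonLoader` are hypothetical and are not taken from DSPU's API.

```python
import json
from typing import Callable, Dict, Protocol, runtime_checkable


@runtime_checkable
class Loader(Protocol):
    """Hypothetical interface: any object with a matching `load` method conforms."""

    def load(self, text: str) -> dict: ...


class PluginRegistry:
    """Hypothetical registry mapping a name to a factory that builds a plugin."""

    def __init__(self) -> None:
        self._factories: Dict[str, Callable[[], Loader]] = {}

    def register(self, name: str) -> Callable:
        # Intended for use as a class decorator: @registry.register("json")
        def decorator(factory):
            if name in self._factories:
                raise ValueError(f"plugin {name!r} is already registered")
            self._factories[name] = factory
            return factory

        return decorator

    def create(self, name: str) -> Loader:
        try:
            return self._factories[name]()
        except KeyError:
            raise KeyError(f"unknown plugin {name!r}") from None


registry = PluginRegistry()


@registry.register("json")
class JsonLoader:
    # No inheritance from Loader: structural typing is enough.
    def load(self, text: str) -> dict:
        return json.loads(text)


loader = registry.create("json")
```

Because `Loader` is a `Protocol`, third-party classes can act as plugins without importing or subclassing anything from the host library.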

⚙️ Configuration Management

  • Multi-Source Loading: YAML, TOML, JSON, HOCON, environment variables, HashiCorp Vault
  • File Watching: Auto-reload configuration changes
  • Pydantic Integration: Type-safe configuration with validation
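
Multi-source loading usually comes down to layering: defaults first, then a file, then environment variables, with later layers winning. The sketch below shows that idea with the standard library only. The `APP_` prefix, the double-underscore nesting convention, and the function names are assumptions made for illustration, not DSPU behaviour.

```python
import json
from typing import Any, Dict, Mapping


def deep_merge(base: Dict[str, Any], override: Mapping[str, Any]) -> Dict[str, Any]:
    """Return a new dict where `override` wins; nested dicts merge recursively."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, Mapping) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def env_layer(environ: Mapping[str, str], prefix: str = "APP_") -> Dict[str, Any]:
    """Turn APP_DB__PORT=5433 into {"db": {"port": 5433}} (hypothetical convention)."""
    layer: Dict[str, Any] = {}
    for name, raw in environ.items():
        if not name.startswith(prefix):
            continue
        path = name[len(prefix):].lower().split("__")
        try:
            value = json.loads(raw)  # parses numbers, booleans, null
        except json.JSONDecodeError:
            value = raw  # keep plain strings as-is
        node = layer
        for part in path[:-1]:
            node = node.setdefault(part, {})
        node[path[-1]] = value
    return layer


defaults = {"db": {"host": "localhost", "port": 5432}, "debug": False}
file_layer = json.loads('{"db": {"host": "db.internal"}}')  # stands in for a config file
fake_env = {"APP_DB__PORT": "5433", "APP_DEBUG": "true", "HOME": "/root"}

# Precedence: defaults < file < environment
config = deep_merge(deep_merge(defaults, file_layer), env_layer(fake_env))
```

A merged dictionary like `config` is the kind of input that a Pydantic model could then validate.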

📁 I/O Operations

  • Storage Abstraction: Unified API for local, S3, GCS, Azure storage
  • Multi-Format Support: JSON, YAML, TOML, HOCON, CSV, Markdown, and more
  • Streaming: Efficient handling of large files
  • Path Resolution: Secure path handling with traversal protection
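
Traversal protection typically means resolving a user-supplied path against a base directory and rejecting anything that escapes it. The following is a minimal sketch of that check with `pathlib` (Python 3.9+ for `is_relative_to`). The function name `resolve_inside` and the `/srv/data` directory are hypothetical.

```python
from pathlib import Path


def resolve_inside(base_dir: str, user_path: str) -> Path:
    """Resolve `user_path` under `base_dir`, refusing paths that escape it."""
    base = Path(base_dir).resolve()
    # Joining with an absolute `user_path` discards `base`, so the check
    # below also catches inputs such as "/etc/passwd".
    candidate = (base / user_path).resolve()
    if not candidate.is_relative_to(base):
        raise ValueError(f"{user_path!r} escapes {str(base)!r}")
    return candidate


safe = resolve_inside("/srv/data", "reports/2024/summary.csv")

try:
    resolve_inside("/srv/data", "../secrets/key.pem")
    escaped = True
except ValueError:
    escaped = False
```

Resolving before comparing matters: a plain string prefix check would accept `reports/../../secrets`.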

⚡ Async Utilities

  • Retry Logic: Exponential backoff, jitter, exception filtering
  • Rate Limiting: Token bucket algorithm with burst capacity
  • Circuit Breaker: Fault tolerance with state management
  • Concurrency Control: gather_with_limit, timeout utilities
  • Sync/Async Bridge: Convert between sync and async code
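
As a rough picture of two of these utilities, the sketch below implements retry with exponential backoff and full jitter, plus a concurrency-limited gather built on `asyncio.Semaphore`. The name `gather_with_limit` is borrowed from the list above, but its signature and body here are guesses, and `retry_async`, `flaky`, and `square` are hypothetical.

```python
import asyncio
import random
from typing import Awaitable, Callable, Iterable, List, Tuple, Type, TypeVar

T = TypeVar("T")


async def retry_async(
    func: Callable[[], Awaitable[T]],
    *,
    attempts: int = 4,
    base_delay: float = 0.01,
    max_delay: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError,),
) -> T:
    """Call `func` until it succeeds, sleeping with exponential backoff and jitter."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return await func()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            cap = min(max_delay, base_delay * 2 ** attempt)
            await asyncio.sleep(random.uniform(0, cap))  # "full jitter"
    raise RuntimeError("unreachable")


async def gather_with_limit(limit: int, coros: Iterable[Awaitable[T]]) -> List[T]:
    """Like asyncio.gather, but with at most `limit` awaitables running at once."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(coro: Awaitable[T]) -> T:
        async with semaphore:
            return await coro

    return await asyncio.gather(*(guarded(c) for c in coros))


calls = {"n": 0}


async def flaky() -> str:
    # Fails twice with a retryable error, then succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "ok"


async def square(i: int) -> int:
    await asyncio.sleep(0)
    return i * i


async def main():
    first = await retry_async(flaky)
    squares = await gather_with_limit(2, (square(i) for i in range(5)))
    return first, squares


result = asyncio.run(main())
```

Exceptions not listed in `retry_on` propagate immediately, which is the exception-filtering behaviour mentioned above.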

✅ Validation

  • Filter System: Composable data filters (strip, lowercase, slugify, etc.)
  • Pydantic Integration: Seamless validation with Pydantic models
  • Validator Decorators: Field-level and function-level validation
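
Composable filters can be modelled as plain `str -> str` functions chained left to right. The sketch below is one way to do that with the standard library. `compose`, `slugify`, and `clean_title` are hypothetical names, not DSPU's filter API.

```python
import re
import unicodedata
from functools import reduce
from typing import Callable

Filter = Callable[[str], str]


def compose(*filters: Filter) -> Filter:
    """Build one filter that applies `filters` in the order given."""

    def run(value: str) -> str:
        return reduce(lambda acc, f: f(acc), filters, value)

    return run


def slugify(value: str) -> str:
    # Fold accents to ASCII, then collapse runs of non-alphanumerics into "-".
    folded = unicodedata.normalize("NFKD", value).encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", "-", folded.lower()).strip("-")


clean_title = compose(str.strip, str.lower, slugify)
slug = clean_title("  Héllo, Wörld!  ")
```

Since each filter is an ordinary callable, the same composed function could also be used inside a Pydantic field validator.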

🔐 Security

  • Secret Management: Multiple backends (Vault, AWS Secrets Manager, env, file)
  • Token Rotation: Automatic refresh with expiry tracking
  • Authentication: OAuth2, JWT, static tokens
  • Encryption: Fernet, AES, password hashing, secure comparison
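
Password hashing and secure comparison can both be sketched with the standard library: `hashlib.pbkdf2_hmac` for salted key derivation and `hmac.compare_digest` for constant-time comparison. The storage format, the function names, and the iteration count below are illustrative assumptions and say nothing about how DSPU implements these features.

```python
import hashlib
import hmac
import secrets

ITERATIONS = 200_000  # illustrative value only; tune for your own hardware and policy


def hash_password(password: str, *, iterations: int = ITERATIONS) -> str:
    """Return a self-describing string: scheme$iterations$salt_hex$digest_hex."""
    salt = secrets.token_bytes(16)  # fresh random salt per password
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    return f"pbkdf2_sha256${iterations}${salt.hex()}${digest.hex()}"


def verify_password(password: str, stored: str) -> bool:
    """Recompute the digest with the stored salt and compare in constant time."""
    scheme, iterations, salt_hex, digest_hex = stored.split("$")
    if scheme != "pbkdf2_sha256":
        return False
    candidate = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), bytes.fromhex(salt_hex), int(iterations)
    )
    # compare_digest avoids leaking how many leading bytes matched.
    return hmac.compare_digest(candidate, bytes.fromhex(digest_hex))


stored = hash_password("correct horse battery staple")
```

Storing the scheme and iteration count next to the digest lets the parameters be raised later without invalidating existing hashes.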

📊 Observability

  • Structured Logging: Context-based logging with ContextVar
  • Rich Console Output: Colors, tables, trees, progress bars
  • Tracing Decorators: @timed, @traced, @logged_errors
  • Stream Capture: Redirect stdout/stderr to logger
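
Context-based logging and a timing decorator can be approximated with `contextvars` and the standard `logging` module, as in the sketch below. The `timed` decorator here is a stand-in written for illustration and may differ from DSPU's `@timed`. `request_id`, `ContextFilter`, and `ListHandler` are hypothetical names.

```python
import functools
import logging
import time
from contextvars import ContextVar

# Per-context value that is attached to every log record.
request_id: ContextVar[str] = ContextVar("request_id", default="-")


class ContextFilter(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = request_id.get()
        return True


class ListHandler(logging.Handler):
    """Collects formatted lines in memory so the example has no console output."""

    def __init__(self) -> None:
        super().__init__()
        self.lines = []

    def emit(self, record: logging.LogRecord) -> None:
        self.lines.append(self.format(record))


logger = logging.getLogger("timed_sketch")
logger.setLevel(logging.INFO)
logger.propagate = False
handler = ListHandler()
handler.addFilter(ContextFilter())
handler.setFormatter(logging.Formatter("[%(request_id)s] %(message)s"))
logger.addHandler(handler)


def timed(func):
    """Log how long each call took, even when the call raises."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            logger.info("%s took %.2f ms", func.__name__, elapsed_ms)

    return wrapper


@timed
def add(a: int, b: int) -> int:
    return a + b


token = request_id.set("req-42")
total = add(2, 3)
request_id.reset(token)  # restore the previous context value
```

A `ContextVar` is isolated per asyncio task, so concurrent requests do not overwrite each other's log context.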

🤖 ML Utilities

  • Reproducibility: Unified seed management across libraries
  • Data Splitting: Stratified, time-series, group splits (no leakage!)
  • Feature Scaling: Standard, min-max, robust with state persistence
  • Categorical Encoding: Label, one-hot, ordinal, frequency encoding
  • Statistics: Correlation, hypothesis tests, bootstrap, A/B testing
  • ID Generation: UUID, ULID, hash-based IDs
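
Two of these ideas are easy to show in a few lines of pure Python: a scaler whose fitted state survives a JSON round trip, and a group-aware split that keeps every group on one side to avoid leakage. `StandardScaler1D` and `group_split` are hypothetical helpers, not DSPU's `Scaler` or `DataSplitter`.

```python
import json
import random
import statistics
from typing import List, Sequence, Tuple


class StandardScaler1D:
    """Hypothetical single-column scaler: (x - mean) / std, with JSON state."""

    def __init__(self, mean: float = 0.0, std: float = 1.0) -> None:
        self.mean, self.std = mean, std

    def fit(self, values: Sequence[float]) -> "StandardScaler1D":
        self.mean = statistics.fmean(values)
        # Population std; fall back to 1.0 so a constant column does not divide by zero.
        self.std = statistics.pstdev(values) or 1.0
        return self

    def transform(self, values: Sequence[float]) -> List[float]:
        return [(v - self.mean) / self.std for v in values]

    def to_json(self) -> str:
        return json.dumps({"mean": self.mean, "std": self.std})

    @classmethod
    def from_json(cls, text: str) -> "StandardScaler1D":
        return cls(**json.loads(text))


def group_split(
    groups: Sequence[str], test_size: float = 0.25, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Return (train_idx, test_idx) with every group entirely on one side."""
    unique = sorted(set(groups))
    random.Random(seed).shuffle(unique)  # seeded for reproducibility
    n_test = max(1, round(len(unique) * test_size))
    test_groups = set(unique[:n_test])
    train = [i for i, g in enumerate(groups) if g not in test_groups]
    test = [i for i, g in enumerate(groups) if g in test_groups]
    return train, test


# Fit on training data only, persist the state, and restore it elsewhere.
scaler = StandardScaler1D().fit([1.0, 2.0, 3.0])
restored = StandardScaler1D.from_json(scaler.to_json())

groups = ["u1", "u1", "u2", "u3", "u3", "u4"]
train_idx, test_idx = group_split(groups, test_size=0.25, seed=0)
```

The restored scaler applies the training-set mean and standard deviation to new data, so production inputs are scaled exactly as they were during training.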

Quick Example

from dspu.ml import SeedManager, DataSplitter, Scaler, Encoder
from dspu.observability import configure_logging, get_logger
from dspu.config import Config

# Set up logging
configure_logging(level="INFO", format="rich")
logger = get_logger(__name__)

# Load configuration
config = Config.from_file("config.yaml")

# Reproducible ML pipeline
SeedManager.set_global_seed(42)

# Split data (no leakage)
# X and y are assumed to be an already-loaded feature matrix and label vector
X_train, X_test, y_train, y_test = DataSplitter.train_test_split(
    X, y, test_size=0.2, stratify=y
)

# Scale features
scaler = Scaler(method="standard")
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Save for production
scaler.save_to_file("artifacts/scaler.json")
logger.info("Pipeline complete", samples=len(X_train))

Installation

# Basic installation
pip install dspu

# With all extras (quoted so shells such as zsh do not expand the brackets)
pip install "dspu[all]"

# Specific extras
pip install "dspu[config,ml,observability]"

Key Principles

🎯 Practical Over Perfect

DSPU provides utilities that complement existing libraries (sklearn, pandas, etc.) rather than replacing them. We focus on solving common pain points in data science workflows.

🔒 Type-Safe by Default

Comprehensive type hints and protocol-based design ensure type safety throughout your codebase.

⚡ Async-First

DSPU is built with async/await support from the ground up and provides sync/async bridges for code that cannot be async.
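
A sync/async bridge can be as small as two helpers: one that runs a blocking function in a worker thread, and one that drives a coroutine from synchronous code. This sketch uses `asyncio.to_thread` (Python 3.9+) and `asyncio.run`. `to_async`, `run_sync`, and `blocking_double` are hypothetical names, not DSPU functions.

```python
import asyncio
import functools


def to_async(func):
    """Wrap a blocking function so it can be awaited without blocking the event loop."""

    @functools.wraps(func)
    async def wrapper(*args, **kwargs):
        return await asyncio.to_thread(func, *args, **kwargs)

    return wrapper


def run_sync(coro):
    """Run a coroutine to completion from synchronous code.

    asyncio.run raises RuntimeError if an event loop is already running
    in this thread, so this helper is for sync entry points only.
    """
    return asyncio.run(coro)


def blocking_double(x: int) -> int:
    return x * 2


async_double = to_async(blocking_double)


async def pipeline():
    # Both calls run in worker threads and are awaited concurrently.
    return await asyncio.gather(async_double(1), async_double(2))


doubled = run_sync(pipeline())
```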

🧪 Well-Tested

Over 450 comprehensive tests ensure reliability and prevent regressions.

📚 Well-Documented

Extensive documentation with practical examples and patterns for every module.

Module Overview

| Module | Purpose | Key Features |
| --- | --- | --- |
| core | Foundation | Protocols, exceptions, plugin registry |
| config | Configuration | Multi-source loading, validation, watching |
| io | I/O Operations | Storage abstraction, multi-format, streaming |
| aio | Async Utilities | Retry, rate limiting, circuit breaker |
| validation | Data Validation | Filters, Pydantic integration, decorators |
| security | Security | Secrets, auth, encryption, token rotation |
| observability | Logging & Monitoring | Structured logging, rich output, tracing |
| ml | ML Utilities | Reproducibility, splits, scaling, encoding |

Getting Started

  1. Installation Guide - Install DSPU
  2. Quick Start - Your first DSPU program
  3. Tutorial - Comprehensive walkthrough
  4. User Guide - In-depth module documentation
  5. API Reference - Complete API documentation
  6. Examples - Practical code examples

License

MIT License - see LICENSE for details.