DSPU Documentation
Comprehensive data science utilities for Python
DSPU is a modern, production-ready library providing essential utilities for data science and machine learning workflows. It is built with type safety, async support, and developer experience in mind.
Features
Core Utilities
- Protocols & Type System: Type-safe interfaces for common patterns
- Plugin Registry: Extensible plugin system for custom integrations
- Exception Hierarchy: Comprehensive error handling
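As a rough illustration of the plugin registry idea listed above, the following sketch maps names to callables through a decorator. The `PluginRegistry` class, its `register` and `get` methods, and the `loaders` instance are hypothetical names chosen for this example; DSPU's actual registry API may look different.

```python
from typing import Callable, Dict, TypeVar

F = TypeVar("F", bound=Callable[..., object])


class PluginRegistry:
    """Hypothetical name-to-callable registry (not DSPU's actual class)."""

    def __init__(self) -> None:
        self._plugins: Dict[str, Callable[..., object]] = {}

    def register(self, name: str) -> Callable[[F], F]:
        """Return a decorator that stores the decorated callable under `name`."""

        def decorator(func: F) -> F:
            if name in self._plugins:
                raise ValueError(f"plugin {name!r} is already registered")
            self._plugins[name] = func
            return func

        return decorator

    def get(self, name: str) -> Callable[..., object]:
        """Look up a plugin; a miss raises KeyError listing the known names."""
        try:
            return self._plugins[name]
        except KeyError:
            known = ", ".join(sorted(self._plugins)) or "<none>"
            raise KeyError(f"unknown plugin {name!r}; registered: {known}") from None


loaders = PluginRegistry()


@loaders.register("upper")
def upper_loader(text: str) -> str:
    return text.upper()


print(loaders.get("upper")("abc"))
```

Rejecting duplicate names at registration time is one possible design choice; it surfaces accidental collisions between integrations early instead of silently replacing a plugin.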
Configuration Management
- Multi-Source Loading: YAML, TOML, JSON, HOCON, environment variables, HashiCorp Vault
- File Watching: Auto-reload configuration changes
- Pydantic Integration: Type-safe configuration with validation
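Multi-source loading usually comes down to layering sources with a defined precedence. The sketch below uses only the standard library and assumes a convention in which environment variables with a given prefix override values from a file, with `__` separating nested keys. The helpers `deep_merge` and `env_overrides`, the `APP_` prefix, and the `__` separator are assumptions for illustration, not DSPU's documented behavior.

```python
import json
from collections.abc import Mapping
from typing import Any, Dict


def deep_merge(base: Mapping, override: Mapping) -> Dict[str, Any]:
    """Merge two nested mappings; values from `override` win on conflicts."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, Mapping) and isinstance(merged.get(key), Mapping):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def env_overrides(prefix: str, environ: Mapping) -> Dict[str, Any]:
    """Turn PREFIX_A__B=value into {"a": {"b": value}}.

    Values are JSON-decoded when possible, otherwise kept as strings.
    """
    result: Dict[str, Any] = {}
    for name, raw in environ.items():
        if not name.startswith(prefix):
            continue
        path = name[len(prefix):].lower().split("__")
        try:
            value: Any = json.loads(raw)
        except json.JSONDecodeError:
            value = raw
        node = result
        for part in path[:-1]:
            node = node.setdefault(part, {})
        node[path[-1]] = value
    return result


# Stand-ins for a parsed config file and for os.environ.
file_config = json.loads('{"db": {"host": "localhost", "port": 5432}, "debug": false}')
fake_environ = {"APP_DB__PORT": "6543", "APP_DEBUG": "true", "HOME": "/root"}

config = deep_merge(file_config, env_overrides("APP_", fake_environ))
print(config)
```

In real use, `os.environ` would be passed in place of `fake_environ`; taking the mapping as an argument keeps the function easy to test.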
I/O Operations
- Storage Abstraction: Unified API for local, S3, GCS, Azure storage
- Multi-Format Support: JSON, YAML, TOML, HOCON, CSV, Markdown, and more
- Streaming: Efficient handling of large files
- Path Resolution: Secure path handling with traversal protection
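Traversal protection, as mentioned in the last item above, generally means resolving a user-supplied path and checking that it still lies under an allowed base directory. Here is a minimal standard-library sketch of that check; the function name `resolve_inside` is hypothetical and is not taken from DSPU.

```python
from pathlib import Path


def resolve_inside(base_dir: str, user_path: str) -> Path:
    """Resolve `user_path` relative to `base_dir`, rejecting escapes.

    Both paths are fully resolved first, so `..` segments, absolute paths,
    and symlinks cannot be used to leave the base directory.
    """
    base = Path(base_dir).resolve()
    candidate = (base / user_path).resolve()
    if candidate != base and base not in candidate.parents:
        raise ValueError(f"{user_path!r} escapes {base}")
    return candidate


print(resolve_inside(".", "data/train.csv"))

try:
    resolve_inside(".", "../secrets.txt")
except ValueError as exc:
    print("rejected:", exc)
```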
Async Utilities
- Retry Logic: Exponential backoff, jitter, exception filtering
- Rate Limiting: Token bucket algorithm with burst capacity
- Circuit Breaker: Fault tolerance with state management
- Concurrency Control: `gather_with_limit`, timeout utilities
- Sync/Async Bridge: Convert between sync and async code
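To make the retry item above concrete, here is a minimal standard-library sketch of exponential backoff with full jitter and exception filtering. The function `retry_async` and its parameters (`attempts`, `base_delay`, `max_delay`, `retry_on`) are assumptions for illustration and do not describe DSPU's own retry API.

```python
import asyncio
import random
from typing import Awaitable, Callable, Tuple, Type, TypeVar

T = TypeVar("T")


async def retry_async(
    func: Callable[[], Awaitable[T]],
    *,
    attempts: int = 5,
    base_delay: float = 0.1,
    max_delay: float = 5.0,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
) -> T:
    """Call `func` until it succeeds or `attempts` calls have failed."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return await func()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Exponential backoff: the cap doubles each attempt, up to max_delay.
            cap = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: sleep a random time in [0, cap] to spread out retries.
            await asyncio.sleep(random.uniform(0, cap))
    raise AssertionError("unreachable")


state = {"calls": 0}


async def flaky() -> str:
    """Fail twice with a transient error, then succeed."""
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


result = asyncio.run(
    retry_async(flaky, base_delay=0.001, retry_on=(ConnectionError,))
)
print(result, state["calls"])
```

Exceptions outside `retry_on` propagate immediately, which is the "exception filtering" part: only errors believed to be transient are worth retrying.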
Validation
- Filter System: Composable data filters (strip, lowercase, slugify, etc.)
- Pydantic Integration: Seamless validation with Pydantic models
- Validator Decorators: Field-level and function-level validation
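The composable filters named above (strip, lowercase, slugify) can be pictured as plain string-to-string functions chained left to right. This is a standard-library sketch only; `compose`, `slugify`, and `clean_title` are hypothetical names and not DSPU's filter API.

```python
import re
import unicodedata
from functools import reduce
from typing import Callable

Filter = Callable[[str], str]


def slugify(text: str) -> str:
    """Drop accents, lowercase, and join alphanumeric runs with hyphens."""
    ascii_text = (
        unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    )
    return re.sub(r"[^a-z0-9]+", "-", ascii_text.lower()).strip("-")


def compose(*filters: Filter) -> Filter:
    """Build one filter that applies the given filters left to right."""
    return lambda value: reduce(lambda acc, f: f(acc), filters, value)


clean_title = compose(str.strip, str.lower, slugify)
print(clean_title("  Héllo, World!  "))
```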
Security
- Secret Management: Multiple backends (Vault, AWS Secrets Manager, env, file)
- Token Rotation: Automatic refresh with expiry tracking
- Authentication: OAuth2, JWT, static tokens
- Encryption: Fernet, AES, password hashing, secure comparison
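Password hashing and secure comparison, from the last item above, can be sketched with the standard library alone. The example assumes PBKDF2-HMAC-SHA256 with a random salt and an illustrative iteration count; the functions `hash_password` and `verify_password` and the `iterations$salt$digest` storage format are made up for this sketch, and DSPU may use different algorithms or formats.

```python
import hashlib
import hmac
import os


def hash_password(password: str, *, iterations: int = 600_000) -> str:
    """Derive a salted PBKDF2 digest, stored as 'iterations$salt$digest'."""
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    return f"{iterations}${salt.hex()}${digest.hex()}"


def verify_password(password: str, stored: str) -> bool:
    """Recompute the digest and compare it in constant time."""
    iterations_text, salt_hex, digest_hex = stored.split("$")
    candidate = hashlib.pbkdf2_hmac(
        "sha256",
        password.encode("utf-8"),
        bytes.fromhex(salt_hex),
        int(iterations_text),
    )
    # hmac.compare_digest avoids leaking how many leading bytes matched.
    return hmac.compare_digest(candidate, bytes.fromhex(digest_hex))


# A low iteration count keeps the demo fast; it is not a recommended setting.
stored = hash_password("s3cret", iterations=1_000)
print(verify_password("s3cret", stored), verify_password("wrong", stored))
```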
Observability
- Structured Logging: Context-based logging with ContextVar
- Rich Console Output: Colors, tables, trees, progress bars
- Tracing Decorators: `@timed`, `@traced`, `@logged_errors`
- Stream Capture: Redirect stdout/stderr to logger
ML Utilities
- Reproducibility: Unified seed management across libraries
- Data Splitting: Stratified, time-series, and group splits designed to avoid train/test leakage
- Feature Scaling: Standard, min-max, robust with state persistence
- Categorical Encoding: Label, one-hot, ordinal, frequency encoding
- Statistics: Correlation, hypothesis tests, bootstrap, A/B testing
- ID Generation: UUID, ULID, hash-based IDs
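To illustrate what a stratified, seeded split involves, the following pure-Python sketch splits indices per class so that class proportions are preserved and the result is reproducible. The function `stratified_split_indices` is a hypothetical helper written for this example and is not DSPU's `DataSplitter`.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple


def stratified_split_indices(
    labels: Sequence[Hashable], test_size: float = 0.2, seed: int = 42
) -> Tuple[List[int], List[int]]:
    """Split indices into train/test while keeping per-class proportions."""
    rng = random.Random(seed)  # local generator: no global seed side effects
    by_class: Dict[Hashable, List[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train: List[int] = []
    test: List[int] = []
    # Iterate classes in a stable order so the same seed gives the same split.
    for label in sorted(by_class, key=repr):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


labels = ["a"] * 80 + ["b"] * 20
train_idx, test_idx = stratified_split_indices(labels, test_size=0.2, seed=42)
print(len(train_idx), len(test_idx))
```

Each index lands in exactly one of the two lists, so no sample appears in both train and test.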
Quick Example
from dspu.ml import SeedManager, DataSplitter, Scaler, Encoder
from dspu.observability import configure_logging, get_logger
from dspu.config import Config
# Set up logging
configure_logging(level="INFO", format="rich")
logger = get_logger(__name__)
# Load configuration
config = Config.from_file("config.yaml")
# Reproducible ML pipeline
SeedManager.set_global_seed(42)
# Split data (no leakage); X and y are assumed to be an existing feature matrix and label vector
X_train, X_test, y_train, y_test = DataSplitter.train_test_split(
    X, y, test_size=0.2, stratify=y
)
# Scale features
scaler = Scaler(method="standard")
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Save for production
scaler.save_to_file("artifacts/scaler.json")
logger.info("Pipeline complete", samples=len(X_train))
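The example saves the fitted scaler so that production code can apply the same transformation the model was trained with. The idea behind that kind of state persistence can be sketched with the standard library: fit statistics on training data only, serialize them, and reuse them later. The helpers `fit_standard` and `transform` and the JSON layout are assumptions for illustration, not the format written by `scaler.save_to_file`.

```python
import json
import statistics
from typing import Dict, List


def fit_standard(column: List[float]) -> Dict[str, float]:
    """Compute mean and population standard deviation on training data."""
    std = statistics.pstdev(column)
    # Fall back to 1.0 for constant columns to avoid dividing by zero.
    return {"mean": statistics.fmean(column), "std": std or 1.0}


def transform(column: List[float], state: Dict[str, float]) -> List[float]:
    """Standardize values using previously fitted statistics."""
    return [(x - state["mean"]) / state["std"] for x in column]


train = [1.0, 2.0, 3.0, 4.0]
state = fit_standard(train)

# Round-trip through JSON, as a stand-in for saving to and loading from a file.
restored = json.loads(json.dumps(state))

# New data is scaled with the training statistics, never refitted.
print(transform([2.5, 5.0], restored))
```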
Installation
# Basic installation
pip install dspu
# With all extras (quoted so shells such as zsh do not expand the brackets)
pip install "dspu[all]"
# Specific extras
pip install "dspu[config,ml,observability]"
Key Principles
Practical Over Perfect
DSPU provides utilities that complement existing libraries (sklearn, pandas, etc.) rather than replacing them. We focus on solving common pain points in data science workflows.
Type-Safe by Default
Comprehensive type hints and protocol-based design ensure type safety throughout your codebase.
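Protocol-based design means that a function declares the methods it needs instead of requiring a specific base class. A minimal sketch using `typing.Protocol` follows; the `Transformer` protocol, the `Doubler` class, and the `apply` function are hypothetical and are not taken from DSPU's `core` module.

```python
from typing import List, Protocol, runtime_checkable


@runtime_checkable
class Transformer(Protocol):
    """Anything with a matching `transform` method satisfies this protocol."""

    def transform(self, values: List[float]) -> List[float]: ...


class Doubler:
    # No inheritance from Transformer: the match is purely structural.
    def transform(self, values: List[float]) -> List[float]:
        return [v * 2 for v in values]


def apply(step: Transformer, values: List[float]) -> List[float]:
    """Accept any object that a type checker sees as a Transformer."""
    return step.transform(values)


print(apply(Doubler(), [1.0, 2.0]))
print(isinstance(Doubler(), Transformer))
```

Note that the `isinstance` check enabled by `runtime_checkable` only verifies that the method exists, not its signature; the signature check is left to a static type checker.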
Async-First
Built with async/await support from the ground up, with sync/async bridges when needed.
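A sync/async bridge can work in both directions: blocking functions can be awaited from async code by running them in a worker thread, and a coroutine can be driven from sync code with an event loop. The sketch below shows both with the standard library (`asyncio.to_thread` requires Python 3.9+); it is an illustration of the general pattern, not DSPU's bridge API.

```python
import asyncio
import time
from typing import List


def blocking_io(delay: float) -> str:
    """A stand-in for a blocking call such as a legacy client library."""
    time.sleep(delay)
    return "done"


async def main() -> List[str]:
    # Sync to async: each blocking call runs in a worker thread,
    # so the two calls overlap instead of running back to back.
    return await asyncio.gather(
        asyncio.to_thread(blocking_io, 0.05),
        asyncio.to_thread(blocking_io, 0.05),
    )


# Async to sync: asyncio.run drives the coroutine from ordinary code.
results = asyncio.run(main())
print(results)
```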
Well-Tested
A suite of over 450 tests guards against regressions.
Well-Documented
Extensive documentation with practical examples and patterns for every module.
Module Overview
| Module | Purpose | Key Features |
|---|---|---|
| core | Foundation | Protocols, exceptions, plugin registry |
| config | Configuration | Multi-source loading, validation, watching |
| io | I/O Operations | Storage abstraction, multi-format, streaming |
| aio | Async Utilities | Retry, rate limiting, circuit breaker |
| validation | Data Validation | Filters, Pydantic integration, decorators |
| security | Security | Secrets, auth, encryption, token rotation |
| observability | Logging & Monitoring | Structured logging, rich output, tracing |
| ml | ML Utilities | Reproducibility, splits, scaling, encoding |
Getting Started
- Installation Guide - Install DSPU
- Quick Start - Your first DSPU program
- Tutorial - Comprehensive walkthrough
- User Guide - In-depth module documentation
- API Reference - Complete API documentation
- Examples - Practical code examples
Community & Support
- GitHub: github.com/yourorg/dspu
- Issues: Report bugs or request features
- PyPI: pypi.org/project/dspu
License
MIT License - see LICENSE for details.