Data Type Enforcement
mloda supports optional data type declarations on Features, enabling runtime validation that computed data matches declared types.
Declaring Feature Types
Use typed constructors to declare the expected data type:
from mloda.user import Feature
# Typed features - will be validated at runtime
feature_int = Feature.int32_of("user_count")
feature_double = Feature.double_of("price")
feature_str = Feature.str_of("name")
# Untyped feature - no validation
feature_any = Feature.not_typed("legacy_column")
Available typed constructors:
- int32_of(), int64_of() - Integer types
- float_of(), double_of() - Floating point types
- str_of() - String type
- boolean_of() - Boolean type
- date_of(), timestamp_millis_of(), timestamp_micros_of() - Date/time types
- decimal_of(), binary_of() - Other types
Validation Behavior
!!! note "Changed in 0.7.0"
    Type validation now runs on all bundled compute frameworks through a uniform
    _extract_column_data_type hook. In 0.6.x, type extraction relied on a single
    PyArrow-schema path, so typed features running on frameworks that did not expose that
    schema were silently left unvalidated. After upgrading, those features may raise a
    DataTypeMismatchError for the first time where validation previously did nothing.
Default (Lenient) Mode
By default, validation allows compatible type conversions within categories:
| Declared Type | Compatible Actual Types |
|---|---|
| INT64 | INT32, INT64 |
| DOUBLE | INT32, INT64, FLOAT, DOUBLE |
| TIMESTAMP_MICROS | TIMESTAMP_MILLIS, TIMESTAMP_MICROS |
Cross-category mismatches (e.g., STRING declared but INT64 returned) raise DataTypeMismatchError.
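The lenient rules in the table can be read as a lookup from declared type to the set of accepted actual types. The sketch below is a standalone illustration of that reading, not mloda's implementation: the `DataType` enum, `LENIENT_COMPATIBLE`, and `is_lenient_match` are hypothetical names, and treating declared types absent from the table as exact-match-only is an assumption.

```python
from enum import Enum


class DataType(Enum):
    """Hypothetical stand-in for the declared types named in the table."""

    INT32 = "int32"
    INT64 = "int64"
    FLOAT = "float"
    DOUBLE = "double"
    STRING = "string"
    TIMESTAMP_MILLIS = "timestamp_millis"
    TIMESTAMP_MICROS = "timestamp_micros"


# Rows copied from the compatibility table above.
LENIENT_COMPATIBLE = {
    DataType.INT64: {DataType.INT32, DataType.INT64},
    DataType.DOUBLE: {DataType.INT32, DataType.INT64, DataType.FLOAT, DataType.DOUBLE},
    DataType.TIMESTAMP_MICROS: {DataType.TIMESTAMP_MILLIS, DataType.TIMESTAMP_MICROS},
}


def is_lenient_match(declared: DataType, actual: DataType) -> bool:
    """Return True if `actual` is acceptable for `declared` under the table.

    Assumption: a declared type without a table row accepts only itself.
    """
    return actual in LENIENT_COMPATIBLE.get(declared, {declared})


print(is_lenient_match(DataType.INT64, DataType.INT32))   # same category
print(is_lenient_match(DataType.STRING, DataType.INT64))  # cross-category
```

In this sketch the cross-category case returns `False`, which corresponds to the situation where the library raises DataTypeMismatchError.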
Strict Mode
Enable strict validation per-feature via options:
feature = Feature.int32_of(
"exact_count",
options={"strict_type_enforcement": True}
)
In strict mode, only exact type matches or standard widening conversions are allowed.
Per-Framework Precision Support
Not every backend's native type system can distinguish every precision mloda declares. The table below shows which precisions each bundled framework can extract from data. For the rest, the framework's _extract_column_data_type returns the widest type in the family (still correct under lenient mode), and strict-mode tests for the affected precision are skipped with a clear reason.
| Framework | INT32 / INT64 | FLOAT / DOUBLE | TIMESTAMP_MILLIS / MICROS |
|---|---|---|---|
| Pandas | yes | yes | yes |
| Polars (eager / lazy) | yes | yes | yes |
| PyArrow | yes | yes | yes |
| DuckDB | yes | yes | yes |
| Spark | yes | yes | no (only TimestampType exists) |
| Iceberg | yes | yes | no (only TimestampType exists) |
| SQLite | no (INTEGER affinity) | no (REAL affinity) | no (stored as TEXT) |
| PythonDict | no (type.__name__ is "int") | no (Python float is 64-bit) | no (datetime.datetime is microsecond) |
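The SQLite and PythonDict rows can be checked with the standard library alone. The sketch below does not use mloda; it only shows what the underlying type systems report, which is why a precision-aware extraction cannot tell INT32 from INT64 on these backends. The table and column names are made up for the illustration.

```python
import sqlite3

# SQLite: column declarations INT and BIGINT both map to INTEGER affinity,
# and FLOAT and DOUBLE both map to REAL affinity.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (small INT, big BIGINT, f FLOAT, d DOUBLE)")
conn.execute("INSERT INTO t VALUES (?, ?, ?, ?)", (1, 2**40, 1.5, 2.5))
storage_classes = conn.execute(
    "SELECT typeof(small), typeof(big), typeof(f), typeof(d) FROM t"
).fetchone()
conn.close()

# Plain Python values: a small and a large integer share one type name.
python_type_names = (type(1).__name__, type(2**40).__name__, type(1.5).__name__)

print(storage_classes)
print(python_type_names)
```

Both integer columns report the same storage class and both Python integers report the same type name, so the width has to come from the declared feature type.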
Execution Plan Grouping
Features with different explicit data types are separated into different execution groups at plan time. This allows type-specific processing paths.
Untyped features (data_type=None) are "lenient" and can be grouped with any typed features, preserving compatibility with index columns and legacy code.
# These will be in DIFFERENT execution groups
Feature.int32_of("amount")
Feature.int64_of("amount")
# This can join ANY group (lenient)
Feature.not_typed("id")
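As a rough mental model of the grouping rule, typed features can be bucketed by their declared type while untyped features are kept aside as free to join any bucket. The sketch below is only that mental model: `split_by_declared_type` is a hypothetical helper, features are represented as plain `(name, declared_type)` tuples, and mloda's actual planner is not shown here.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

FeatureSpec = Tuple[str, Optional[str]]  # (feature name, declared type or None)


def split_by_declared_type(
    features: List[FeatureSpec],
) -> Tuple[Dict[str, List[str]], List[str]]:
    """Bucket typed features by declared type; collect untyped ones separately."""
    typed: Dict[str, List[str]] = defaultdict(list)
    untyped: List[str] = []
    for name, declared in features:
        if declared is None:
            untyped.append(name)  # lenient: may be placed in any group
        else:
            typed[declared].append(name)
    return dict(typed), untyped


typed_groups, joinable_anywhere = split_by_declared_type(
    [("amount", "INT32"), ("amount", "INT64"), ("id", None)]
)
print(typed_groups)
print(joinable_anywhere)
```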
Database Reader Type Awareness
When reading from databases (e.g., SQLite), declared types are used to build the PyArrow schema:
# Declared type is used for schema, not inferred from data
feature = Feature.int64_of(
"user_id",
options={"sqlite": "/path/to/db.sqlite"}
)
Error Handling
Type mismatches raise DataTypeMismatchError:
```python title="Error handling example"
from mloda.user import mloda
from mloda.user import Feature
from mloda.provider import DataTypeMismatchError

try:
    result = mloda.run_all([Feature.str_of("numeric_column")])
except DataTypeMismatchError as e:
    print(f"Feature '{e.feature_name}': declared {e.declared.name}, got {e.actual.name}")
```