Data Type Enforcement

mloda supports optional data type declarations on Features, enabling runtime validation that computed data matches declared types.

Declaring Feature Types

Use typed constructors to declare the expected data type:

from mloda.user import Feature

# Typed features - will be validated at runtime
feature_int = Feature.int32_of("user_count")
feature_double = Feature.double_of("price")
feature_str = Feature.str_of("name")

# Untyped feature - no validation
feature_any = Feature.not_typed("legacy_column")

Available typed constructors:

- int32_of(), int64_of() - Integer types
- float_of(), double_of() - Floating point types
- str_of() - String type
- boolean_of() - Boolean type
- date_of(), timestamp_millis_of(), timestamp_micros_of() - Date/time types
- decimal_of(), binary_of() - Other types

Validation Behavior

!!! note "Changed in 0.7.0"
    Type validation now runs on all bundled compute frameworks through a uniform _extract_column_data_type hook. In 0.6.x, type extraction relied on a single PyArrow-schema path, so typed features running on frameworks that did not expose that schema were silently left unvalidated. After upgrading, those features may raise a DataTypeMismatchError for the first time where validation previously did nothing.

Default (Lenient) Mode

By default, validation allows compatible type conversions within categories:

| Declared Type | Compatible Actual Types |
|---|---|
| INT64 | INT32, INT64 |
| DOUBLE | INT32, INT64, FLOAT, DOUBLE |
| TIMESTAMP_MICROS | TIMESTAMP_MILLIS, TIMESTAMP_MICROS |

Cross-category mismatches (e.g., STRING declared but INT64 returned) raise DataTypeMismatchError.
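
The lenient rule can be read as a simple lookup. The following is a minimal sketch of that idea, not mloda's implementation: the `DataType` enum, the `LENIENT_COMPATIBLE` mapping, and `is_lenient_match` are hypothetical names introduced for illustration. Only the three rows listed in the table are encoded; any other declared type is assumed here to accept an exact match only.

```python
from enum import Enum


class DataType(Enum):
    """Stand-in for the type names used in the table (not mloda's own enum)."""

    INT32 = "INT32"
    INT64 = "INT64"
    FLOAT = "FLOAT"
    DOUBLE = "DOUBLE"
    STRING = "STRING"
    TIMESTAMP_MILLIS = "TIMESTAMP_MILLIS"
    TIMESTAMP_MICROS = "TIMESTAMP_MICROS"


# Rows copied from the compatibility table. Declared types that are not
# listed are ASSUMED to accept only an exact match.
LENIENT_COMPATIBLE = {
    DataType.INT64: {DataType.INT32, DataType.INT64},
    DataType.DOUBLE: {
        DataType.INT32,
        DataType.INT64,
        DataType.FLOAT,
        DataType.DOUBLE,
    },
    DataType.TIMESTAMP_MICROS: {
        DataType.TIMESTAMP_MILLIS,
        DataType.TIMESTAMP_MICROS,
    },
}


def is_lenient_match(declared: DataType, actual: DataType) -> bool:
    """Return True if `actual` is acceptable for `declared` under the table."""
    return actual in LENIENT_COMPATIBLE.get(declared, {declared})
```

With this sketch, `is_lenient_match(DataType.INT64, DataType.INT32)` is accepted, while `is_lenient_match(DataType.STRING, DataType.INT64)` is rejected, which corresponds to the case where the real validation raises DataTypeMismatchError.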

Strict Mode

Enable strict validation per-feature via options:

feature = Feature.int32_of(
    "exact_count",
    options={"strict_type_enforcement": True}
)

In strict mode, only exact type matches are allowed, so a declared INT64 rejects an INT32 column that lenient mode would accept.

Per-Framework Precision Support

Not every backend's native type system can distinguish every precision mloda declares. The table below shows which precisions each bundled framework can extract from data. For the rest, the framework's _extract_column_data_type returns the widest type in the family (still correct under lenient mode), and strict-mode tests for the affected precision are skipped with a clear reason.

| Framework | INT32 / INT64 | FLOAT / DOUBLE | TIMESTAMP_MILLIS / MICROS |
|---|---|---|---|
| Pandas | yes | yes | yes |
| Polars (eager / lazy) | yes | yes | yes |
| PyArrow | yes | yes | yes |
| DuckDB | yes | yes | yes |
| Spark | yes | yes | no (only `TimestampType` exists) |
| Iceberg | yes | yes | no (only `TimestampType` exists) |
| SQLite | no (INTEGER affinity) | no (REAL affinity) | no (stored as TEXT) |
| PythonDict | no (`type.__name__` is "int") | no (Python float is 64-bit) | no (`datetime.datetime` is microsecond) |
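
The "no" entries for PythonDict and SQLite can be reproduced with the standard library alone. The sketch below is illustrative and is not the code behind _extract_column_data_type: `widest_type_name` and `sqlite_storage_classes` are hypothetical helpers. The first maps a plain Python value to the widest type in its family, since the value carries no narrower precision; the second shows SQLite reporting the same storage class regardless of the declared column width.

```python
import sqlite3
from datetime import datetime


def widest_type_name(value: object) -> str:
    """Hypothetical extractor: map a Python value to the widest type in its family."""
    # bool is a subclass of int, so it has to be checked before int.
    if isinstance(value, bool):
        return "BOOLEAN"
    if isinstance(value, int):
        # Python has one int type; INT32 and INT64 cannot be told apart.
        return "INT64"
    if isinstance(value, float):
        # Python floats are 64-bit, so FLOAT cannot be detected.
        return "DOUBLE"
    if isinstance(value, datetime):
        # datetime.datetime has microsecond resolution.
        return "TIMESTAMP_MICROS"
    raise TypeError(f"unsupported value type: {type(value).__name__}")


def sqlite_storage_classes() -> tuple:
    """Return SQLite's typeof() for columns declared with different widths."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE t (small INT, big BIGINT, f FLOAT, d DOUBLE)")
        conn.execute("INSERT INTO t VALUES (?, ?, ?, ?)", (1, 2**40, 1.5, 2.5))
        row = conn.execute(
            "SELECT typeof(small), typeof(big), typeof(f), typeof(d) FROM t"
        ).fetchone()
        return row
    finally:
        conn.close()
```

Both integer columns come back as `integer` and both floating point columns as `real`, which is why a SQLite-backed extractor has nothing narrower to report than the widest type in each family.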

Execution Plan Grouping

Features with different explicit data types are separated into different execution groups at plan time. This allows type-specific processing paths.

Untyped features (data_type=None) are "lenient" and can be grouped with any typed features, preserving compatibility with index columns and legacy code.

# These will be in DIFFERENT execution groups
Feature.int32_of("amount")
Feature.int64_of("amount")

# This can join ANY group (lenient)
Feature.not_typed("id")
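
The grouping rule above can be pictured with a small stand-alone sketch. This is not mloda's planner: `can_share_group` and `group_features` are hypothetical helpers, features are reduced to `(name, data_type)` pairs, and `None` stands for an untyped feature.

```python
from typing import Dict, List, Optional, Tuple


def can_share_group(a: Optional[str], b: Optional[str]) -> bool:
    """Return True if two declared data types may share an execution group."""
    # An untyped feature (None) is compatible with anything.
    return a is None or b is None or a == b


def group_features(
    features: List[Tuple[str, Optional[str]]],
) -> Tuple[Dict[str, List[str]], List[str]]:
    """Bucket features by declared type; untyped ones are returned separately
    because they may join any bucket."""
    typed: Dict[str, List[str]] = {}
    untyped: List[str] = []
    for name, data_type in features:
        if data_type is None:
            untyped.append(name)
        else:
            typed.setdefault(data_type, []).append(name)
    return typed, untyped
```

Calling `group_features([("amount", "INT32"), ("amount", "INT64"), ("id", None)])` places the two `amount` features in separate buckets and leaves `id` free to attach to either one.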

Database Reader Type Awareness

When reading from databases (e.g., SQLite), declared types are used to build the PyArrow schema:

# Declared type is used for schema, not inferred from data
feature = Feature.int64_of(
    "user_id",
    options={"sqlite": "/path/to/db.sqlite"}
)

Error Handling

Type mismatches raise DataTypeMismatchError:

```python title="Error handling example"
from mloda.user import mloda
from mloda.user import Feature
from mloda.provider import DataTypeMismatchError

try:
    result = mloda.run_all([Feature.str_of("numeric_column")])
except DataTypeMismatchError as e:
    print(f"Feature '{e.feature_name}': declared {e.declared.name}, got {e.actual.name}")
```