Named Data Access Handles
DataAccessCollection (DAC) is a registry of data resources keyed by stable string handles. Every consumer that needs to bind one resource uses a single resolution rule that raises on ambiguity rather than letting iteration order decide.
For background, see issue #443.
The four kinds of resources
A DAC holds resources of four kinds. Each kind has its own keyed dict:
DataAccessCollection(
    connections={"warehouse": warehouse_conn, "analytics": analytics_conn},
    files={"transactions": "/data/tx.parquet", "users": "/data/users.csv"},
    folders={"raw": "/data/raw/"},
    credentials={"pg-prod": {"host": "...", "user": "..."}, "snowflake-dev": {...}},
)
Handle names are arbitrary strings you choose. They are globally unique across kinds: you cannot register a connection and a file under the same name. Registration raises ValueError on duplicates.
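The uniqueness rule can be pictured as one shared namespace that maps each handle to its kind. The following is a minimal sketch under that assumption; HandleRegistry and register are hypothetical names used for illustration, not part of mloda.

```python
class HandleRegistry:
    """Hypothetical stand-in for the DAC's handle bookkeeping (not mloda code)."""

    def __init__(self) -> None:
        # One namespace shared by all four kinds: handle -> kind.
        self._kinds: dict[str, str] = {}

    def register(self, kind: str, handle: str) -> None:
        # A handle may appear only once, regardless of which kind claimed it first.
        if handle in self._kinds:
            raise ValueError(
                f"Handle {handle!r} is already registered under kind {self._kinds[handle]!r}"
            )
        self._kinds[handle] = kind


registry = HandleRegistry()
registry.register("connection", "warehouse")
registry.register("file", "transactions")

try:
    registry.register("file", "warehouse")  # same name, different kind
except ValueError as err:
    duplicate_error = str(err)
```

Keeping a single handle-to-kind map, rather than one set of names per kind, is what makes a cross-kind collision detectable at registration time.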
Mutators mirror the keyed-dict shape:
dac.add_connection("warehouse", warehouse_conn)
dac.add_file("transactions", "/data/tx.parquet")
dac.add_folder("raw", "/data/raw/")
dac.add_credentials("pg-prod", {"host": "..."})
A small runnable example, backed by sqlite3 so it has no external dependencies:
import sqlite3

from mloda.user import DataAccessCollection

primary = sqlite3.connect(":memory:")
secondary = sqlite3.connect(":memory:")

dac = DataAccessCollection(
    connections={"primary": primary, "secondary": secondary},
    files={"users": "/tmp/users.csv"},
)

assert dac.handles() == {
    "primary": "connection",
    "secondary": "connection",
    "users": "file",
}
Naming is optional
You only need to name resources when there are multiple sources of the same kind that the resolver cannot tell apart. In the simple single-source case, pass bare values:
DataAccessCollection(files={"/data/tx.parquet"}) # set or list
DataAccessCollection(connections={duckdb_conn}) # set
DataAccessCollection(folders={"/data/raw"}) # set
DataAccessCollection(credentials=[{"host": "h"}]) # list (dicts are unhashable)
The same shape applies to the mutators:
dac.add_file("/data/tx.parquet") # auto-named
dac.add_file("tx", "/data/tx.parquet") # named (when you need a handle)
Unnamed entries get internal auto-handles (_auto_file_0, _auto_connection_0, etc.) that you never need to reference. They exist only so the resolver has a unique key per entry. If you later hit ambiguity, the error message will tell you to switch to the named form and pick a data_access_handle.
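One way to picture the auto-naming is a normalization step that turns bare values into a keyed dict. This is only a sketch: normalize is a hypothetical helper, and the way mloda numbers entries (especially for unordered sets) is an assumption here. Lists are used so the numbering is deterministic.

```python
def normalize(kind, entries):
    """Return a {handle: value} dict for one kind.

    Hypothetical helper: named dicts pass through unchanged; bare
    lists or sets receive auto-handles of the form _auto_<kind>_<n>.
    """
    if isinstance(entries, dict):
        return dict(entries)
    return {f"_auto_{kind}_{i}": value for i, value in enumerate(entries)}


unnamed_files = normalize("file", ["/data/tx.parquet"])
named_files = normalize("file", {"tx": "/data/tx.parquet"})
unnamed_connections = normalize("connection", ["placeholder-connection"])
```

After normalization both forms have the same shape, so a resolver only ever needs to deal with keyed entries.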
Naming is only required when:
- Two or more resources of the same kind match the same consumer (then you must set data_access_handle to pick one), or
- You use column_to_file and prefer to reference files by name rather than path (both work, see below).
Resolution rule
When a feature group asks the DAC for a resource of a given kind, the resolver applies one rule:
- If data_access_handle is set on the feature's Options: look up that handle. If it does not exist, exists under a different kind, or fails the consumer's type predicate, raise ValueError with the actual situation and the available handles.
- Otherwise: filter the kind's registry by the consumer's predicate (for example, "this file matches my suffix and columns").
    - Zero matches: return None (the consumer typically continues looking elsewhere).
    - One match: bind it.
    - More than one match: raise ValueError listing the candidate handles and telling the user to set data_access_handle.
The error shape is the same for every kind. This is the same contract ComputeFramework.pick_connection_from_dac shipped in #442, generalized to every consumer.
import sqlite3

import pytest

from mloda.user import DataAccessCollection

dac = DataAccessCollection(
    connections={
        "primary": sqlite3.connect(":memory:"),
        "secondary": sqlite3.connect(":memory:"),
    },
)

assert dac.resolve("connection", hint="primary") is not None

with pytest.raises(ValueError) as excinfo:
    dac.resolve("connection")
assert "data_access_handle" in str(excinfo.value)
assert "'primary'" in str(excinfo.value) and "'secondary'" in str(excinfo.value)

with pytest.raises(ValueError) as excinfo:
    dac.resolve("connection", hint="missing")
assert "missing" in str(excinfo.value)
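The rule at the start of this section can also be sketched without mloda. The function below is a hypothetical illustration over a plain {kind: {handle: value}} mapping, not the library's implementation. The error texts follow the messages listed under Common errors, except the predicate-failure message, which is an assumption.

```python
from typing import Any, Callable, Optional


def resolve(
    registries: dict[str, dict[str, Any]],
    kind: str,
    predicate: Callable[[Any], bool] = lambda value: True,
    handle: Optional[str] = None,
) -> Optional[Any]:
    """Hypothetical resolver; a sketch of the rule, not mloda's code."""
    registry = registries.get(kind, {})

    if handle is not None:
        # Explicit handle: it must exist, be of the right kind, and pass the predicate.
        if handle not in registry:
            other_kinds = [k for k, reg in registries.items() if handle in reg]
            if other_kinds:
                raise ValueError(
                    f"Handle {handle!r} is registered under kind {other_kinds[0]!r}, "
                    f"but kind {kind!r} was requested"
                )
            raise ValueError(
                f"Handle {handle!r} not found for kind {kind!r}. "
                f"Available handles of kind {kind!r}: {sorted(registry)}"
            )
        if not predicate(registry[handle]):
            # Assumed wording; the source does not give this message.
            raise ValueError(f"Handle {handle!r} does not satisfy the consumer's predicate")
        return registry[handle]

    # No handle: filter by predicate, then apply the zero / one / many rule.
    matches = {h: v for h, v in registry.items() if predicate(v)}
    if not matches:
        return None
    if len(matches) > 1:
        raise ValueError(
            f"Ambiguous resolve for kind {kind!r}: {len(matches)} candidates "
            f"{sorted(matches)}; set 'data_access_handle' in Options to disambiguate"
        )
    return next(iter(matches.values()))


registries = {
    "file": {"transactions": "/data/tx.parquet", "users": "/data/users.csv"},
    "folder": {"raw": "/data/raw/"},
}

only_parquet = resolve(registries, "file", predicate=lambda p: p.endswith(".parquet"))
pinned = resolve(registries, "file", handle="users")
nothing = resolve(registries, "file", predicate=lambda p: p.endswith(".json"))

try:
    resolve(registries, "file")  # two files match, no handle given
except ValueError as err:
    ambiguous_error = str(err)
```

Note that the ambiguity check counts only the entries that pass the predicate, so two registered files do not conflict when the consumer can distinguish them by suffix.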
Per-feature disambiguation: data_access_handle
When more than one resource of the same kind matches a feature's requirements, you disambiguate by setting data_access_handle on the feature's Options:
from mloda.user import DataAccessCollection, Options, Feature, mloda

dac = DataAccessCollection(
    files={"transactions": "/data/tx.csv", "users": "/data/users.csv"},
)

features = [
    Feature("amount", options=Options(context={"data_access_handle": "transactions"})),
    Feature("email", options=Options(context={"data_access_handle": "users"})),
]

mloda.run_all(features, compute_frameworks=["PyArrowTable"], data_access_collection=dac)
The key works across read_file, read_document, and read_db on a per-feature basis, and across every compute framework (CFW) that consumes connections (DuckDB, SQLite, Spark, Iceberg today) on an engine-wide basis.
!!! note "Connections are resolved once per engine, not per feature"
    ComputeFramework.pick_connection_from_dac runs at engine setup, not on the per-request path. A single CFW therefore binds a single connection per session. Binding two features in the same session to different connections of the same CFW is currently out of scope (tracked separately). Until then, use a single data_access_handle on the DAC's matching connections, or split the run.
Introspection: handles()
DataAccessCollection.handles() returns a {handle: kind} map of everything registered. Use it for audits, logging, or to surface candidates in your own error messages:
dac.handles()
# {"warehouse": "connection", "analytics": "connection",
# "transactions": "file", "users": "file",
# "raw": "folder", "pg-prod": "credentials"}
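For audits or custom error messages it is often handier to group the map by kind. A small sketch, using a literal copy of the output above so it runs without mloda; candidates_by_kind is a hypothetical helper name.

```python
from collections import defaultdict

# Literal copy of the handles() output shown above.
handles = {
    "warehouse": "connection",
    "analytics": "connection",
    "transactions": "file",
    "users": "file",
    "raw": "folder",
    "pg-prod": "credentials",
}


def candidates_by_kind(handle_map):
    """Invert {handle: kind} into {kind: sorted list of handles}."""
    grouped = defaultdict(list)
    for handle, kind in handle_map.items():
        grouped[kind].append(handle)
    return {kind: sorted(names) for kind, names in grouped.items()}


by_kind = candidates_by_kind(handles)
message = f"Available connections: {by_kind.get('connection', [])}"
```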
Interaction with column_to_file
column_to_file is a file-specific override that takes precedence over the resolver. Its values may be either a file handle (key of the files dict) or a file path (value of the files dict); both are accepted and normalized to handles internally:
DataAccessCollection(
    files={"train": "application_train.csv", "bureau": "bureau.csv"},
    column_to_file={
        "SK_ID_CURR": "train",
        "TARGET": "train",
        "AMT_CREDIT_SUM": "bureau",
    },
)
For columns listed in the map, the resolver short-circuits to the pinned file. For columns not listed, the regular resolver rule applies (single match binds; multiple raises and asks for data_access_handle).
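The handle-or-path normalization might look like the sketch below. normalize_column_to_file is a hypothetical helper; checking handles before paths and raising on an unknown reference are assumptions, since the text above only states that both forms are accepted.

```python
def normalize_column_to_file(files, column_to_file):
    """Map each column to a file handle, accepting handles or paths as values."""
    path_to_handle = {path: handle for handle, path in files.items()}
    normalized = {}
    for column, ref in column_to_file.items():
        if ref in files:
            # Already a handle (a key of the files dict).
            normalized[column] = ref
        elif ref in path_to_handle:
            # A path (a value of the files dict): translate it to its handle.
            normalized[column] = path_to_handle[ref]
        else:
            # Assumed behaviour for unknown references.
            raise ValueError(
                f"column_to_file entry {column!r} -> {ref!r} matches no registered file"
            )
    return normalized


files = {"train": "application_train.csv", "bureau": "bureau.csv"}
pins = normalize_column_to_file(
    files,
    {"SK_ID_CURR": "train", "AMT_CREDIT_SUM": "bureau.csv"},
)
```

Both spellings end up as the same handle, so the rest of the lookup only has to deal with one form.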
See Disambiguating columns shared across multiple files for the full example.
Why this exists
Before #443, every DAC field was an unkeyed set. When two resources of the same kind were registered, Python's set iteration (PYTHONHASHSEED-dependent) decided which one the engine picked. The same pipeline could read the wrong table, wrong file, or wrong credential across processes, with no error and no signal at planning time.
Named handles collapse that bug class to one invariant: if more than one entry of the requested kind matches the consumer and the consumer did not pass a handle, raise and list the candidates. Same rule, same error shape, regardless of field.
Common errors
- "Handle 'X' is already registered under kind 'K'": you called add_* (or constructed the DAC) with a handle already used by another resource. Pick a different name.
- "Handle 'X' not found for kind 'K'. Available handles of kind 'K': [...]": you set data_access_handle='X' but the DAC has no resource of that kind named X. Check the spelling or register it.
- "Handle 'X' is registered under kind 'K1', but kind 'K2' was requested": you set data_access_handle='X', but the registry has X under a different kind. A connection consumer cannot bind to a file handle even if the names match.
- "Ambiguous resolve for kind 'K': N candidates [...]; set 'data_access_handle' in Options to disambiguate": more than one entry of the requested kind matched. Set data_access_handle on the feature's Options to pick one, or remove the extras from the DAC.
Related
- (Feature) data: overview of data access in mloda.
- Data Access Patterns: BaseInputData vs MatchData.
- Framework Connection Object: stateful connection lifecycle.