mloda demo: How can we make feature engineering shareable?
Define dummy data as plugin
import numpy as np
from mloda.provider import FeatureGroup, DataCreator


class DummyData(FeatureGroup):
    @classmethod
    def calculate_feature(cls, data, features):
        n_samples = features.get_options_key("n_samples") or 100
        return {
            "age": np.random.randint(18, 80, n_samples),
            "weight": np.random.normal(70, 15, n_samples),
            "state": np.random.choice(["CA", "NY", "TX", "FL"], n_samples),
            "gender": np.random.choice(["M", "F"], n_samples),
        }

    @classmethod
    def input_data(cls):
        return DataCreator({"age", "weight", "state", "gender"})
Request mlodaAPI to create features
# We load dependencies.
from mloda.user import mloda
# Load plugins into namespace
# from mloda.user import PluginLoader
# plugin_loader = PluginLoader.all()
result = mloda.run_all(["age", "weight", "state", "gender"], compute_frameworks=["PyArrowTable", "PandasDataFrame"])
print(result)
[pyarrow.Table
age: int64
weight: double
gender: string
state: string
----
age: [[59,79,51,53,54,...,19,44,49,48,61]]
weight: [[82.63466082550435,48.76135309655511,53.42935283513809,69.46679872093235,100.29825438843457,...,76.29799759158638,75.02026515813283,88.5564452526703,75.07448813986902,65.00412677965708]]
gender: [["M","M","F","M","M",...,"F","M","F","M","F"]]
state: [["TX","CA","TX","NY","FL",...,"CA","TX","NY","CA","CA"]]]
Chain features - automatic dependency resolution
# Load plugin into namespace again
result = mloda.run_all(
    ["age__sum_aggr"],
    compute_frameworks=["PolarsLazyDataFrame"],
)
print(result)
[shape: (100, 1)
┌───────────────┐
│ age__sum_aggr │
│ ---           │
│ i64           │
╞═══════════════╡
│ 4968          │
│ 4968          │
│ 4968          │
│ 4968          │
│ 4968          │
│ …             │
│ 4968          │
│ 4968          │
│ 4968          │
│ 4968          │
│ 4968          │
└───────────────┘]
As long as the plugins exist, we can run any data transformation.
What is behind the "age__sum_aggr" syntax?
from mloda.user import Feature, Options

feature = Feature(
    name="CustomConfiguration",
    options=Options(context={"aggregation_type": "sum", "in_features": Feature("age", options={"n_samples": 5})}),
)
result = mloda.run_all(
    [feature],
    compute_frameworks=["PolarsLazyDataFrame"],
)
print(result)
[shape: (5, 1)
┌─────────────────────┐
│ CustomConfiguration │
│ ---                 │
│ i64                 │
╞═════════════════════╡
│ 273                 │
│ 273                 │
│ 273                 │
│ 273                 │
│ 273                 │
└─────────────────────┘]
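One way to read the two requests above is that the string name and the explicit configuration carry the same information. The sketch below illustrates that reading in plain Python. It is not mloda's implementation: the helper `parse_chained_name` is hypothetical, and the assumed name pattern `<input>__<aggregation>_aggr` is inferred only from the `age__sum_aggr` example.

```python
def parse_chained_name(feature_name: str) -> dict:
    """Hypothetical helper: turn '<input>__<aggregation>_aggr' into a config dict."""
    suffix = "_aggr"
    if "__" not in feature_name or not feature_name.endswith(suffix):
        raise ValueError(f"not a chained aggregation name: {feature_name!r}")
    # rpartition splits on the last '__', so the input name may itself contain '__'.
    in_feature, _, operation = feature_name.rpartition("__")
    return {
        "aggregation_type": operation[: -len(suffix)],
        "in_features": in_feature,
    }


config = parse_chained_name("age__sum_aggr")
print(config)  # {'aggregation_type': 'sum', 'in_features': 'age'}
```

The resulting keys mirror the `aggregation_type` and `in_features` entries passed through `Options(context=...)` in the cell above.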
How the chaining essentially works
class FeatureGroup(ABC):
    def input_features(self, options: Options, feature_name: FeatureName) -> Optional[Set[Feature]]:
        # In principle, the resolver checks whether the feature group depends on another input feature.
        # If it does, that feature is added to the chain of features that need to be resolved.
        # Pseudocode:
        if feature_name contains "input_feature__sum_aggr":
            return input_feature

    # How does mloda know that a feature matches a feature group?
    # The matching is customizable, but it comes with some good guesses.
    @classmethod
    def match_feature_group_criteria(
        cls,
        feature_name: Union[FeatureName, str],
        options: Options,
        data_access_collection: Optional[DataAccessCollection] = None,
    ) -> bool:
        ...
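To make the interplay of matching and chaining concrete, here is a minimal, self-contained sketch of a resolver. It does not use mloda and does not claim to reproduce its resolver: the classes `RawColumns` and `SumAggr`, the `PLUGINS` list, and the `resolve` function are all hypothetical names chosen for illustration, and matching is reduced to simple string checks.

```python
from typing import List


class RawColumns:
    """Hypothetical root plugin: provides columns and has no dependencies."""

    COLUMNS = {"age", "weight", "state", "gender"}

    @classmethod
    def matches(cls, feature_name: str) -> bool:
        return feature_name in cls.COLUMNS

    @classmethod
    def input_features(cls, feature_name: str) -> List[str]:
        return []


class SumAggr:
    """Hypothetical plugin: matches '<input>__sum_aggr' and depends on '<input>'."""

    SUFFIX = "__sum_aggr"

    @classmethod
    def matches(cls, feature_name: str) -> bool:
        return feature_name.endswith(cls.SUFFIX) and len(feature_name) > len(cls.SUFFIX)

    @classmethod
    def input_features(cls, feature_name: str) -> List[str]:
        return [feature_name[: -len(cls.SUFFIX)]]


PLUGINS = [RawColumns, SumAggr]


def resolve(feature_name: str) -> List[str]:
    """Return the features in the order in which they must be computed."""
    for plugin in PLUGINS:
        if plugin.matches(feature_name):
            order: List[str] = []
            # Resolve every dependency first, then append the feature itself.
            for dependency in plugin.input_features(feature_name):
                order.extend(resolve(dependency))
            order.append(feature_name)
            return order
    raise LookupError(f"no plugin matches {feature_name!r}")


print(resolve("age__sum_aggr"))  # ['age', 'age__sum_aggr']
```

In this sketch the user only asks for `age__sum_aggr`; the dependency on `age` is discovered by the plugin that matched the name, which is the idea behind the automatic dependency resolution shown earlier.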
Now we have chaining and matching. Why do we do this?
class FeatureGroup(ABC):
    @classmethod
    def calculate_feature(cls, data: Any, features: FeatureSet) -> Any:
        """
        This function should be used to calculate the feature.
        """
        # data is the incoming data from other feature dependencies or data via mloda
        # features is the configuration
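The split between incoming data and configuration can be illustrated with a plain-Python stand-in for a plugin body. This is a sketch under stated assumptions, not mloda code: the function `calculate_aggregation`, the dict-of-lists data layout, the `name` key in the config, and the sample values are all made up for illustration.

```python
from typing import Any, Dict, List


def calculate_aggregation(data: Dict[str, List[float]], config: Dict[str, Any]) -> Dict[str, List[float]]:
    """Hypothetical generic plugin body: aggregate one column and broadcast the result."""
    aggregations = {"sum": sum, "min": min, "max": max}
    column = data[config["in_features"]]
    value = aggregations[config["aggregation_type"]](column)
    # Repeat the aggregate once per row, as in the outputs shown earlier.
    return {config["name"]: [value] * len(column)}


# The data and the configuration carry everything that is specific to the use case.
data = {"age": [30, 41, 25]}  # made-up sample values
config = {"name": "age__sum_aggr", "aggregation_type": "sum", "in_features": "age"}

print(calculate_aggregation(data, config))  # {'age__sum_aggr': [96, 96, 96]}
```

Nothing in the function body mentions ages, customers, or any other domain concept; swapping the data or the config changes the result without touching the transformation.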
Business knowledge is in the data and in the configuration, but not in the plugin definition.
Big idea
Separate business logic from transformation logic:
- Plugins = generic transformations (shareable across companies)
- Data + Config = your business knowledge (stays private)
→ Stop rewriting "sum of a column" at every company
→ Build a shared ecosystem of feature engineering plugins