mloda demo: How can we make feature engineering shareable?

Define dummy data as a plugin

In [1]:

import numpy as np
from mloda.provider import FeatureGroup, DataCreator


class DummyData(FeatureGroup):
    @classmethod
    def calculate_feature(cls, data, features):
        n_samples = features.get_options_key("n_samples") or 100
        return {
            "age": np.random.randint(18, 80, n_samples),
            "weight": np.random.normal(70, 15, n_samples),
            "state": np.random.choice(["CA", "NY", "TX", "FL"], n_samples),
            "gender": np.random.choice(["M", "F"], n_samples),
        }

    @classmethod
    def input_data(cls):
        return DataCreator({"age", "weight", "state", "gender"})

Request mlodaAPI to create features

In [2]:

# We load dependencies.
from mloda.user import mloda

# Load plugins into namespace

# from mloda.user import PluginLoader
# plugin_loader = PluginLoader.all()

result = mloda.run_all(["age", "weight", "state", "gender"], compute_frameworks=["PyArrowTable", "PandasDataFrame"])
print(result)

[pyarrow.Table
age: int64
weight: double
gender: string
state: string
----
age: [[59,79,51,53,54,...,19,44,49,48,61]]
weight: [[82.63466082550435,48.76135309655511,53.42935283513809,69.46679872093235,100.29825438843457,...,76.29799759158638,75.02026515813283,88.5564452526703,75.07448813986902,65.00412677965708]]
gender: [["M","M","F","M","M",...,"F","M","F","M","F"]]
state: [["TX","CA","TX","NY","FL",...,"CA","TX","NY","CA","CA"]]]

Alternative options to consume data

API data
Files
Databases
Streams
...
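
Whatever the source, the goal is the same columnar shape that DummyData returns. As a minimal sketch outside of mloda, assuming a hypothetical helper named `columns_from_csv`, a file-based source could be reduced to that shape with the standard library alone:

```python
import csv
import io


def columns_from_csv(text: str) -> dict[str, list[str]]:
    """Hypothetical helper: turn CSV text into a column-name -> values mapping."""
    reader = csv.DictReader(io.StringIO(text))
    columns: dict[str, list[str]] = {name: [] for name in (reader.fieldnames or [])}
    for row in reader:
        for name, value in row.items():
            columns[name].append(value)
    return columns


sample = "age,state\n59,TX\n79,CA\n"
print(columns_from_csv(sample))  # {'age': ['59', '79'], 'state': ['TX', 'CA']}
```

The values stay strings here; type conversion is left out to keep the sketch short.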

Data ingestion, however, is not the heart of mloda.

Chain features - automatic dependency resolution

In [3]:

# Load plugin into namespace again


result = mloda.run_all(
    ["age__sum_aggr"],
    compute_frameworks=["PolarsLazyDataFrame"],
)
print(result)

[shape: (100, 1)
┌───────────────┐
│ age__sum_aggr │
│ ---           │
│ i64           │
╞═══════════════╡
│ 4968          │
│ 4968          │
│ 4968          │
│ 4968          │
│ 4968          │
│ …             │
│ 4968          │
│ 4968          │
│ 4968          │
│ 4968          │
│ 4968          │
└───────────────┘]
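
The output above repeats one value on every row: the aggregate is broadcast back to the length of the input column. A plain-Python sketch of that behavior, not mloda's implementation, with a hypothetical function name:

```python
def broadcast_sum(values: list[int]) -> list[int]:
    """Return the column total repeated once per input row."""
    total = sum(values)
    return [total] * len(values)


ages = [59, 79, 51, 53, 54]
print(broadcast_sum(ages))  # [296, 296, 296, 296, 296]
```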

As long as the plugins exist, we can run any data transformation.

What is behind the "age__sum_aggr" syntax?

In [4]:

from mloda.user import Feature, Options

feature = Feature(
    name="CustomConfiguration",
    options=Options(context={"aggregation_type": "sum", "in_features": Feature("age", options={"n_samples": 5})}),
)

result = mloda.run_all(
    [feature],
    compute_frameworks=["PolarsLazyDataFrame"],
)
print(result)

[shape: (5, 1)
┌─────────────────────┐
│ CustomConfiguration │
│ ---                 │
│ i64                 │
╞═════════════════════╡
│ 273                 │
│ 273                 │
│ 273                 │
│ 273                 │
│ 273                 │
└─────────────────────┘]
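
The string form and the explicit form carry the same information. As an illustration only, assuming the naming convention `<input feature>__<aggregation>_aggr` suggested by `age__sum_aggr`, a hypothetical parser (`parse_aggregation_name` is not part of mloda) could recover the configuration from the name:

```python
from typing import Optional

SUFFIX = "_aggr"


def parse_aggregation_name(feature_name: str) -> Optional[dict[str, str]]:
    """Hypothetical helper: split "age__sum_aggr" into its configuration parts."""
    source, separator, operation = feature_name.rpartition("__")
    if not separator or not source or not operation.endswith(SUFFIX):
        return None  # the name does not follow the assumed convention
    aggregation_type = operation[: -len(SUFFIX)]
    if not aggregation_type:
        return None
    return {"aggregation_type": aggregation_type, "in_features": source}


print(parse_aggregation_name("age__sum_aggr"))  # {'aggregation_type': 'sum', 'in_features': 'age'}
print(parse_aggregation_name("age"))  # None
```

The dictionary keys mirror the `context` keys passed to `Options` in the cell above.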

How the chaining essentially works

class FeatureGroup(ABC):

    def input_features(self, options: Options, feature_name: FeatureName) -> Optional[Set[Feature]]:

        # In principle, the resolver checks whether the feature group depends on another input feature
        # and, if so, adds that feature to the chain of features that need to be resolved.
        # Simplified illustration (assumes str(feature_name) yields the plain name):
        # "age__sum_aggr" depends on "age".
        name = str(feature_name)
        if name.endswith("__sum_aggr"):
            return {Feature(name[: -len("__sum_aggr")])}
        return None

    # How does mloda know that a feature matches a feature group?
    # The matching is customizable, and the defaults cover common cases.
    @classmethod
    def match_feature_group_criteria(
        cls,
        feature_name: Union[FeatureName, str],
        options: Options,
        data_access_collection: Optional[DataAccessCollection] = None,
    ) -> bool:
        ...  # body omitted in this outline
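
To make the two hooks concrete, here is a toy resolver in plain Python. It is a sketch under simplifying assumptions, not mloda's resolver: the classes `ToyGroup`, `ToySource`, `ToySumAggregation` and the function `resolve` are hypothetical, and matching is done on names only.

```python
class ToyGroup:
    """Hypothetical stand-in for a feature group."""

    @classmethod
    def matches(cls, feature_name: str) -> bool:
        raise NotImplementedError

    @classmethod
    def input_features(cls, feature_name: str) -> set[str]:
        return set()  # no dependencies by default


class ToySource(ToyGroup):
    columns = {"age", "weight", "state", "gender"}

    @classmethod
    def matches(cls, feature_name: str) -> bool:
        return feature_name in cls.columns


class ToySumAggregation(ToyGroup):
    suffix = "__sum_aggr"

    @classmethod
    def matches(cls, feature_name: str) -> bool:
        return feature_name.endswith(cls.suffix) and len(feature_name) > len(cls.suffix)

    @classmethod
    def input_features(cls, feature_name: str) -> set[str]:
        # "age__sum_aggr" depends on "age"
        return {feature_name[: -len(cls.suffix)]}


GROUPS = [ToySource, ToySumAggregation]


def resolve(feature_name: str) -> list[str]:
    """Return feature names in the order they must be computed."""
    group = next((g for g in GROUPS if g.matches(feature_name)), None)
    if group is None:
        raise ValueError(f"no feature group matches {feature_name!r}")
    order: list[str] = []
    for dependency in sorted(group.input_features(feature_name)):
        for name in resolve(dependency):
            if name not in order:
                order.append(name)
    order.append(feature_name)
    return order


print(resolve("age__sum_aggr"))  # ['age', 'age__sum_aggr']
```

Because resolution is recursive, a name such as `age__sum_aggr__sum_aggr` resolves through two aggregation steps in this toy model without any extra code.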

Now we have chaining and matching. Why do we do this?

class FeatureGroup(ABC):

    @classmethod
    def calculate_feature(cls, data: Any, features: FeatureSet) -> Any:
        """
        This function should be used to calculate the feature.
        """
        
        # data is the incoming data from other feature dependencies or data via mloda

        # features is the configuration
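
The two arguments keep what is computed apart from what it is computed on. A plain-Python sketch of that split, outside of mloda and with hypothetical names (`AGGREGATIONS`, `calculate`): the function only knows how to aggregate, while the column to use and the aggregation to apply arrive as data and configuration.

```python
from statistics import mean

# Generic transformations: nothing here is specific to one company.
AGGREGATIONS = {"sum": sum, "min": min, "max": max, "mean": mean}


def calculate(data: dict[str, list[float]], config: dict[str, str]) -> float:
    """Apply the configured aggregation to the configured column."""
    column = data[config["in_features"]]
    aggregate = AGGREGATIONS[config["aggregation_type"]]
    return aggregate(column)


# Business knowledge: which data, which column, which aggregation.
data = {"age": [30, 40, 50]}
print(calculate(data, {"aggregation_type": "sum", "in_features": "age"}))  # 120
print(calculate(data, {"aggregation_type": "max", "in_features": "age"}))  # 50
```

Swapping the data or the configuration changes the result without touching `calculate`.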

Business knowledge is in the data and in the configuration, but not in the plugin definition.

Big idea

Separate business logic from transformation logic:

Plugins = generic transformations (shareable across companies)
Data + Config = your business knowledge (stays private)

→ Stop rewriting "sum of a column" at every company

→ Build a shared ecosystem of feature engineering plugins