mloda docs

mloda demo: How can we make feature engineering shareable?

Define dummy data as plugin

In [1]:
import numpy as np
from mloda.provider import FeatureGroup, DataCreator


class DummyData(FeatureGroup):
    @classmethod
    def calculate_feature(cls, data, features):
        n_samples = features.get_options_key("n_samples") or 100
        return {
            "age": np.random.randint(18, 80, n_samples),
            "weight": np.random.normal(70, 15, n_samples),
            "state": np.random.choice(["CA", "NY", "TX", "FL"], n_samples),
            "gender": np.random.choice(["M", "F"], n_samples),
        }

    @classmethod
    def input_data(cls):
        return DataCreator({"age", "weight", "state", "gender"})

Request mlodaAPI to create features

In [2]:
# We load dependencies.
from mloda.user import mloda

# Load plugins into namespace

# from mloda.user import PluginLoader
# plugin_loader = PluginLoader.all()

result = mloda.run_all(["age", "weight", "state", "gender"], compute_frameworks=["PyArrowTable", "PandasDataFrame"])
print(result)
[pyarrow.Table
age: int64
weight: double
gender: string
state: string
----
age: [[59,79,51,53,54,...,19,44,49,48,61]]
weight: [[82.63466082550435,48.76135309655511,53.42935283513809,69.46679872093235,100.29825438843457,...,76.29799759158638,75.02026515813283,88.5564452526703,75.07448813986902,65.00412677965708]]
gender: [["M","M","F","M","M",...,"F","M","F","M","F"]]
state: [["TX","CA","TX","NY","FL",...,"CA","TX","NY","CA","CA"]]]

Alternative options to consume data

  • API data
  • Files
  • Databases
  • Streams
  • ...

Where the data comes from is not the heart of mloda, though.

Chain features - automatic dependency resolution

In [3]:
# Load plugin into namespace again


result = mloda.run_all(
    ["age__sum_aggr"],
    compute_frameworks=["PolarsLazyDataFrame"],
)
print(result)
[shape: (100, 1)
┌───────────────┐
│ age__sum_aggr │
│ ---           │
│ i64           │
╞═══════════════╡
│ 4968          │
│ 4968          │
│ 4968          │
│ 4968          │
│ 4968          │
│ …             │
│ 4968          │
│ 4968          │
│ 4968          │
│ 4968          │
│ 4968          │
└───────────────┘]

As long as the plugins exist, we can run any data transformation.
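
The output above has 100 rows that all hold the same value, which suggests that the aggregate is computed once over the column and then written back to every row. A minimal plain-Python sketch of that behavior follows. It is not how mloda or Polars implement it, and `broadcast_sum` is a hypothetical helper introduced only for illustration:

```python
def broadcast_sum(values):
    """Return a column of the same length in which every entry is the column total."""
    total = sum(values)
    # Repeat the single aggregate once per input row.
    return [total] * len(values)


# Five hypothetical ages (the demo draws 100 random ones).
ages = [59, 79, 51, 53, 54]
age_sum_aggr = broadcast_sum(ages)
print(age_sum_aggr)
```

Keeping one value per input row means the aggregated column stays aligned with the rows it was derived from.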

What is behind the "age__sum_aggr" syntax?

In [4]:
from mloda.user import Feature, Options

feature = Feature(
    name="CustomConfiguration",
    options=Options(context={"aggregation_type": "sum", "in_features": Feature("age", options={"n_samples": 5})}),
)

result = mloda.run_all(
    [feature],
    compute_frameworks=["PolarsLazyDataFrame"],
)
print(result)
[shape: (5, 1)
┌─────────────────────┐
│ CustomConfiguration │
│ ---                 │
│ i64                 │
╞═════════════════════╡
│ 273                 │
│ 273                 │
│ 273                 │
│ 273                 │
│ 273                 │
└─────────────────────┘]
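
The two requests are two spellings of the same idea: the name "age__sum_aggr" encodes the input feature and the aggregation type, while the `Options` object states them explicitly. The sketch below shows how such a name could be unpacked into the explicit form. It assumes the pattern `<input>__<aggregation>_aggr`, inferred from the single example on this page; `parse_chained_name` is a hypothetical function and not mloda's own parser:

```python
def parse_chained_name(feature_name, suffix="_aggr", separator="__"):
    """Split '<input>__<aggregation>_aggr' into an explicit configuration dict.

    Returns None when the name does not follow the assumed pattern.
    """
    if separator not in feature_name or not feature_name.endswith(suffix):
        return None
    # Split on the last separator so the input feature may itself be a chained name.
    in_feature, operation = feature_name.rsplit(separator, 1)
    aggregation_type = operation[: -len(suffix)]
    if not in_feature or not aggregation_type:
        return None
    return {"aggregation_type": aggregation_type, "in_features": in_feature}


print(parse_chained_name("age__sum_aggr"))
print(parse_chained_name("age"))
```

The returned keys mirror the `context` keys used in the cell above, so the short name and the explicit configuration carry the same information.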

How the chaining essentially works

class FeatureGroup(ABC):

    def input_features(self, options: Options, feature_name: FeatureName) -> Optional[Set[Feature]]:

        # In principle, the resolver checks whether the feature group depends on another input feature.
        # If it does, that feature is added to the chain of features that need to be resolved.
        # Simplified illustration, not the actual implementation: "age__sum_aggr" depends on "age".
        # Assumes str(feature_name) yields the plain feature name.
        name = str(feature_name)
        if name.endswith("__sum_aggr"):
            return {Feature(name.removesuffix("__sum_aggr"))}
        return None

    # How does mloda know that a feature matches a feature group?
    # The matching is customizable, but the defaults make some good guesses.
    @classmethod
    def match_feature_group_criteria(
        cls,
        feature_name: Union[FeatureName, str],
        options: Options,
        data_access_collection: Optional[DataAccessCollection] = None,
    ) -> bool:
        ...
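
To make the interplay of matching and chaining concrete, here is a self-contained toy resolver. It is a sketch under stated assumptions, not mloda's resolver: the classes `ToyDummyData` and `ToySumAggregation`, their `match` and `input_features` methods, and the `resolve` function are all hypothetical stand-ins for the interfaces shown above.

```python
class ToyDummyData:
    """Root feature group: provides raw columns and has no dependencies."""

    columns = {"age", "weight", "state", "gender"}

    @classmethod
    def match(cls, feature_name):
        return feature_name in cls.columns

    @classmethod
    def input_features(cls, feature_name):
        return None


class ToySumAggregation:
    """Derived feature group: '<input>__sum_aggr' depends on '<input>'."""

    suffix = "__sum_aggr"

    @classmethod
    def match(cls, feature_name):
        return feature_name.endswith(cls.suffix) and len(feature_name) > len(cls.suffix)

    @classmethod
    def input_features(cls, feature_name):
        return {feature_name[: -len(cls.suffix)]}


def resolve(feature_name, feature_groups):
    """Return (feature, group name) pairs in execution order, dependencies first."""
    for group in feature_groups:
        if group.match(feature_name):
            plan = []
            # Resolve every dependency before the feature that needs it.
            for dependency in sorted(group.input_features(feature_name) or ()):
                plan.extend(resolve(dependency, feature_groups))
            plan.append((feature_name, group.__name__))
            return plan
    raise ValueError(f"No feature group matches {feature_name!r}")


plan = resolve("age__sum_aggr", [ToyDummyData, ToySumAggregation])
print(plan)
```

Requesting "age__sum_aggr" first matches the aggregation group, which reports "age" as its input; "age" in turn matches the data group, so the data is produced before the aggregation runs.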

Now we have chaining and matching. Why do we do this?

class FeatureGroup(ABC):

    @classmethod
    def calculate_feature(cls, data: Any, features: FeatureSet) -> Any:
        """
        This function should be used to calculate the feature.
        """
        
        # data is the incoming data, either from other feature dependencies or loaded via mloda

        # features carries the configuration, e.g. options such as n_samples
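
The split between `data` and configuration can be sketched with a toy function. This is an illustration under assumptions, not mloda code: the standalone `calculate_feature` below takes a plain dict as `config` instead of a `FeatureSet`, and the sample values are made up.

```python
def calculate_feature(data, config):
    """Generic aggregation: the column and the operation come from config, not from code."""
    operations = {"sum": sum, "min": min, "max": max}
    column = data[config["in_features"]]
    result = operations[config["aggregation_type"]](column)
    # Broadcast the aggregate so the output has one value per input row.
    return [result] * len(column)


# Business knowledge lives here: which data, which column, which aggregation.
data = {"age": [25, 40, 31], "weight": [70.5, 82.0, 64.5]}
age_sum = calculate_feature(data, {"aggregation_type": "sum", "in_features": "age"})
weight_max = calculate_feature(data, {"aggregation_type": "max", "in_features": "weight"})
print(age_sum)
print(weight_max)
```

The function body never mentions "age" or "weight", so the same code can serve any dataset and any configuration.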

Business knowledge is in the data and in the configuration, but not in the plugin definition.

Big idea

Separate business logic from transformation logic:

  • Plugins = generic transformations (shareable across companies)
  • Data + Config = your business knowledge (stays private)

→ Stop rewriting "sum of a column" at every company

→ Build a shared ecosystem of feature engineering plugins
