Multiple Result Columns

Overview

Feature groups in mloda can return multiple related columns using a naming convention pattern. This allows for more flexible and powerful feature engineering, especially when a single feature computation produces multiple related outputs.

Naming Convention

The naming convention for multiple result columns follows this pattern:

feature_name~column_suffix

Where: - feature_name is the base name of the feature - ~ is the separator character - column_suffix is a unique identifier for each related column

Example

A feature group that computes statistical properties might return multiple columns:

temperature~mean
temperature~min
temperature~max
temperature~std

Implementation

Returning Multiple Columns

When implementing a feature group that returns multiple columns:

class MultiColumnFeatureGroup(FeatureGroup):
    @classmethod
    def calculate_feature(cls, data: Any, features: FeatureSet) -> Any:
        feature_name = features.get_name_of_one_feature().name

        # Return multiple columns with the naming convention
        return {
            f"{feature_name}~mean": [1, 2, 3],
            f"{feature_name}~max": [4, 5, 6],
            f"{feature_name}~min": [0, 1, 2]
        }

Consuming Multiple Columns

Use the resolve_multi_column_feature() utility to automatically discover all columns matching the pattern:

class MultiColumnConsumer(FeatureGroup):
    def input_features(self, options: Options, feature_name: FeatureName) -> Optional[Set[Feature]]:
        return {Feature.not_typed("MultiColumnFeature")}

    @classmethod
    def calculate_feature(cls, data: Any, features: FeatureSet) -> Any:
        # Automatically discover all columns matching the pattern
        columns = cls.resolve_multi_column_feature(
            "MultiColumnFeature",
            set(data.columns)
        )
        # Returns: ["MultiColumnFeature~mean", "MultiColumnFeature~max", "MultiColumnFeature~min"]

        # Process all discovered columns
        result = sum(data[col] for col in columns)

        feature_name = features.get_name_of_one_feature().name
        return {feature_name: result}

Benefits: - No need to manually enumerate column names - Automatically adapts if number of columns changes - Cleaner, more maintainable code

Manual Column Access (Legacy)

For backwards compatibility, you can still access columns manually:

class MultiColumnConsumer(FeatureGroup):
    def input_features(self, options: Options, feature_name: FeatureName) -> Optional[Set[Feature]]:
        return {Feature.not_typed("MultiColumnFeature")}

    @classmethod
    def calculate_feature(cls, data: Any, features: FeatureSet) -> Any:
        # Manual access to specific columns
        mean_values = data["MultiColumnFeature~mean"]
        max_values = data["MultiColumnFeature~max"]

        # Perform calculations using these columns
        result = mean_values + max_values

        feature_name = features.get_name_of_one_feature().name
        return {feature_name: result}

Specific Sub-Column Dependency

You can declare a dependency on a specific sub-column directly, without needing all columns:

class SpecificSubColumnConsumer(FeatureGroup):
    def input_features(self, options: Options, feature_name: FeatureName) -> Optional[Set[Feature]]:
        # Depend on ONLY base_feature~1, not all columns
        return {Feature("base_feature~1")}

    @classmethod
    def calculate_feature(cls, data: Any, features: FeatureSet) -> Any:
        # Only base_feature~1 is available
        values = data["base_feature~1"]

        feature_name = features.get_name_of_one_feature().name
        return {feature_name: values * 10}

Benefits:

  • No need for pass-through FeatureGroups
  • Clearer dependency declarations
  • Only pulls the specific column needed

Behavior Summary:

  • Feature("base_feature") → Returns ALL columns: base_feature~0, base_feature~1, etc.
  • Feature("base_feature~1") → Returns ONLY base_feature~1 column

How It Works

The mloda framework automatically handles the selection of columns that follow this naming convention:

  1. When a feature group requests a feature by name, the framework identifies all columns that either:
  2. Match the feature name exactly, or
  3. Follow the pattern feature_name~suffix

  4. This is implemented in the identify_naming_convention method in the ComputeFramework class:

def identify_naming_convention(self, selected_feature_names: Set[FeatureName], column_names: Set[str]) -> Set[str]:
    feature_name_strings = {f.name for f in selected_feature_names}
    _selected_feature_names: Set[str] = set()

    for col in column_names:
        for feature_name in feature_name_strings:
            if col == feature_name:
                _selected_feature_names.add(col)
                continue

            if col.startswith(f"{feature_name}~"):
                _selected_feature_names.add(col)

    if not _selected_feature_names:
        raise ValueError(
            f"No columns found that match feature names {feature_name_strings} or follow the naming convention 'feature_name~column_name'"
        )

    return _selected_feature_names

Best Practices

  1. Use Automatic Discovery: Prefer resolve_multi_column_feature() over manual column enumeration when you need all columns
  2. Use Specific Sub-Column Dependencies: When you only need one sub-column, use Feature("base_feature~N") for clearer intent and efficiency
  3. Consistent Naming: Use consistent suffixes across related feature groups
  4. Use apply_naming_convention(): When producing multi-column outputs, use the utility for consistency
  5. Documentation: Document the meaning of each suffix in your feature group's docstring
  6. Validation: Consider validating that all expected columns are present when consuming multiple columns
  7. Error Handling: Handle cases where expected columns might be missing

Available Utilities

mloda provides several utilities for working with multi-column features:

Method Purpose Use Case
apply_naming_convention(result, feature_name) Create multi-column outputs Producer: Generate ~N suffixed columns from arrays
resolve_multi_column_feature(feature_name, columns) Discover multi-column inputs Consumer: Auto-find all ~N columns
expand_feature_columns(feature_name, num_columns) Generate column name list Producer: Pre-generate expected column names
get_column_base_feature(column_name) Strip suffix from column Both: Extract base feature from feature~N

Note: When declaring dependencies with Feature("base_feature~N"), the framework automatically resolves to the parent FeatureGroup that produces base_feature and extracts only the specified sub-column.