What is a Feature Store?
Ian Hellström | 27 January 2021 | 4 min read
What is all the fuss about feature stores in machine learning?
A feature store is a data management layer that saves features specifically designed for machine learning use cases. What is a feature? A feature is a measurable property of an entity, which is a model or representation of a domain object (e.g. customer, order, transaction, product, trip, sensor, assembly line, facility, vehicle, flight).
Data typically resides in tabular format inside databases, or as Avro/Parquet files in data lakes and object stores; before machine learning models can understand it, that data needs to be cast into features.
A feature store comprises:
- Transformations to support feature engineering
- A storage layer for both online and offline features
- A serving layer for online (e.g. REST API) and offline features (e.g. SDK)
- A feature registry that documents features, enables their discovery, and tracks lineage
- Monitoring (and alerting) of features over time to check for data drift or detect anomalies
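To make the components above concrete, here is a minimal in-memory sketch of how the storage, serving, and registry layers fit together. All names are hypothetical; a real feature store backs these layers with databases and low-latency key-value stores rather than Python dictionaries.

```python
from dataclasses import dataclass, field

@dataclass
class FeatureStore:
    # Hypothetical toy layers: registry for documentation/discovery,
    # an online store for the latest values, an offline append-only log.
    registry: dict = field(default_factory=dict)
    online: dict = field(default_factory=dict)
    offline: list = field(default_factory=list)

    def register(self, name, description):
        # Registry: document a feature so others can discover and reuse it.
        self.registry[name] = description

    def ingest(self, entity_id, features, timestamp):
        # A single write feeds both layers, keeping them consistent.
        self.online[entity_id] = features
        self.offline.append({"entity_id": entity_id, "ts": timestamp, **features})

    def get_online(self, entity_id):
        # Serving layer: the latest values, as a model sees them at prediction time.
        return self.online[entity_id]

    def get_offline(self):
        # Offline layer: the full history, used for training and backfills.
        return self.offline

store = FeatureStore()
store.register("order_count_7d", "Orders placed by the customer in the last 7 days")
store.ingest("customer-42", {"order_count_7d": 3}, timestamp=1)
store.ingest("customer-42", {"order_count_7d": 5}, timestamp=2)

print(store.get_online("customer-42"))  # latest values only
print(len(store.get_offline()))         # entire history
```

The key design point is that ingestion writes to the online and offline layers in one step, which is what lets the same feature definitions back both training and serving.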
Why is feature engineering so important to machine learning?
"Applied machine learning is basically feature engineering." (Andrew Ng)
The benefits of feature stores are varied:
- Feature stores can accelerate model development and increase productivity:
  - Features are documented and can be reused.
  - Features are pre-computed and shared across an organization.
- Data engineering (DataOps) is decoupled from data science: the feature store is the single source of truth for machine learning use cases.
- Data science is decoupled from machine learning engineering (MLOps), since the need to rewrite ingestion code is eliminated.
- Feature stores eliminate training/serving skew by applying the same transformations everywhere.
- Feature stores can monitor data drift to ensure the predictive qualities of the model are not degraded by slow or sudden shifts in data distributions.
- Feature stores often have built-in time travel capabilities thanks to data immutability and feature versioning, which means point-in-time backfills are possible.
Why should anyone care about time travel in a feature store?
Compliance and/or liability.
If a customer is denied a loan by an ML-powered application, and that person sues the bank for discrimination, it is in the bank’s interest to be able to recreate the exact data that went into the model as well as the model version that produced the result. The model code itself is (probably) versioned, so that is easy enough. But point-in-time recovery of data or even features may be extremely cumbersome. Without time travel capabilities, it may be nearly impossible to tell why the model denied the customer the loan, even with explainable machine learning algorithms.
In any kind of autonomous decision-making system, it is vital that the model, data, and code (with all dependencies) can be rolled back to the state at the time of the decision in question.
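The mechanism that makes such rollbacks possible is an "as of" query over immutable, timestamped feature rows. A minimal sketch (hypothetical data and function names):

```python
# Immutable, timestamped feature rows: updates append new versions
# rather than overwriting old ones, so history is never lost.
rows = [
    {"entity_id": "customer-42", "ts": 10, "credit_score": 640},
    {"entity_id": "customer-42", "ts": 20, "credit_score": 710},
    {"entity_id": "customer-42", "ts": 30, "credit_score": 580},
]

def as_of(rows, entity_id, ts):
    """Return the latest feature row for an entity at or before `ts`."""
    candidates = [r for r in rows if r["entity_id"] == entity_id and r["ts"] <= ts]
    return max(candidates, key=lambda r: r["ts"]) if candidates else None

# Reconstruct exactly what the model saw when a decision was made at ts=25,
# even though the customer's features have changed since.
print(as_of(rows, "customer-42", 25))
```

Because the query at `ts=25` returns the row written at `ts=20`, the bank in the example above can reproduce the exact inputs behind a past loan decision, regardless of what the current feature values are.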
Training/serving skew: “apples to oranges”.
Training/serving skew happens when transformations on online and offline features are not applied in an identical manner, which can easily happen with separate code bases for online and offline transformations, as in the case of a lambda architecture. In that case, the model is trained with one metric (apples), but at serving time it receives another (oranges).
Thanks to time travel, it is possible to reconstruct features at training and serving time, in case a model misbehaves and must be debugged. Without it, it may be impossible to know why model drift occurred at some point in time.
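The fix for training/serving skew is structural: define the transformation once and call it from both code paths. A small sketch with a hypothetical transformation:

```python
def normalize_amount(amount_cents):
    """Shared feature transformation: cents -> dollars, capped at 10k.
    Defined once, so offline and online paths cannot drift apart."""
    return min(amount_cents / 100.0, 10_000.0)

# Offline path: applied over the historical training set.
training_rows = [120_000, 250, 2_500_000]
training_features = [normalize_amount(a) for a in training_rows]

# Online path: the identical function is applied to the live request.
serving_feature = normalize_amount(250)

# Same input, same output -- no apples-to-oranges mismatch.
assert serving_feature == training_features[1]
```

With separate online and offline code bases (as in a lambda architecture), nothing enforces this equality; a feature store makes the shared definition the only definition.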
Data leakage: “apples to future apples (i.e. apple trees)”.
Data leakage occurs when outside information makes it into the model; the model is trained with data it cannot possibly have at serving. Point-in-time backfills allow data leakage issues to be fixed retroactively, if need be.
Let’s say you want to predict stock movements or next week’s temperatures with a machine learning model. If you shuffle the data before you split it for training and evaluation, data from the future may bleed into the training data set, which means the model learns to cheat. After all, it’s easy to predict the future if you already know it. Similarly, aggregate joins can leak future data, particularly when aggregates are pre-computed with no way of filtering out data that should not be included in the aggregate.
Essentially, you’re not comparing apples to oranges, but apples to apples from the future. These are more commonly known today as apple trees, as they can produce apples at a future date.
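The shuffle-before-split mistake is easy to demonstrate on toy time-series data. A sketch (a fixed interleave stands in for a random shuffle so the example stays deterministic):

```python
# Ten daily observations in chronological order (day 0 .. day 9).
data = [{"day": d, "temp": 20 + d} for d in range(10)]

# Leaky: reordering before the split mixes future days into the training set.
shuffled = data[1::2] + data[0::2]  # days 1,3,5,7,9,0,2,4,6,8
train_leaky, test_leaky = shuffled[:8], shuffled[8:]
# Leakage: some training day lies AFTER a day we are asked to predict.
leaks = any(tr["day"] > te["day"] for tr in train_leaky for te in test_leaky)

# Safe: a time-based split trains only on days strictly before the test window.
train_safe, test_safe = data[:8], data[8:]
safe = all(tr["day"] < te["day"] for tr in train_safe for te in test_safe)

print(leaks, safe)
```

In the leaky split, day 9 ends up in the training set while day 6 is in the test set: the model has effectively seen the future it is evaluated on.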
Whether 2021 will be the year of the feature store remains to be seen. It’s early days. There do appear to be more options now (in alphabetical order):
- Butterfree (open source)
- Feast (open source)
- Hopsworks (open source)
- Splice Machine
Note that Featuretools does not offer a storage layer and is therefore not a feature store.
Wait a Minute…
Feature stores are not silver bullets. There are a few situations where they are overkill:
- No dedicated data engineers
- Few data scientists (e.g. small team or silos)
- Few models in production (e.g. research division)
- No need for online serving (i.e. only offline batch predictions)
- Model retraining is rare or not scheduled automatically
- Few structured data sources (e.g. only images)