7 costly surprises of machine learning: part seven

Messy real-world input data can quickly wreak havoc on a machine learning (ML) system that isn’t adequately prepared. Last week’s post explored a trio of problems and solutions: data testing methods, covariate shift, and prediction shift.

In this post, I’m going to continue to explore the challenges that real-world input data poses to ML systems and propose some effective solutions.

This is the seventh in a series of seven posts dissecting a 2015 research paper, Hidden Technical Debt in Machine Learning Systems, and its implications when using ML to solve real-world problems. Any block quotes throughout this piece come from this paper.

How do you calibrate an ML model?

Classification ML models typically output a number from 0 to 1. For example, a marketing classifier might output 0 for a user who is unlikely to buy your product and 1 for a user who is very likely to buy it. When you ultimately need to make a decision, e.g., whether or not to show an ad, you need to set a threshold. For example, if the output is greater than .7, you show the ad; if it’s .7 or less, you don’t. This threshold is called the decision boundary.
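In code, applying a decision boundary is just a comparison against the threshold. A minimal sketch (the `show_ad` helper and the .7 value are illustrative):

```python
def show_ad(score: float, threshold: float = 0.7) -> bool:
    """Decide whether to show the ad based on the classifier's output score."""
    return score > threshold

show_ad(0.85)  # True: score is above the decision boundary
show_ad(0.40)  # False: score is below it
```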

A real-life decision boundary

In practice, the decision boundary is different for every single classifier. A decision boundary of .7 for one classifier may mean that it’s 80% certain of its prediction, whereas for another similar classifier a decision boundary of .7 could mean it’s only 20% certain. As models are retrained, the semantic meaning of their output score changes.

Ultimately, if you want to make a consistent decision across classifier updates, you need to re-tune the decision boundary every single time, a time-consuming and error-prone process.

Solution

Create properly calibrated classifiers automatically. Set aside a portion of your training data; this is called calibration data. Then use the calibration dataset to calibrate your classifier so that its output scores can be read directly as probabilities: an output score of .7 means the classifier predicts a 70% probability that the user will click the ad if you show it (and a 30% chance they won’t), and an output score of .2 means a predicted 20% chance of a click.

When you have a properly calibrated classifier, you no longer need to manually tune decision boundaries every time you retrain. It also becomes easier to compare the behavior of classifiers, because the meaning of their output scores is the same.
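Scikit-learn offers this kind of calibration out of the box. Below is a minimal sketch using `CalibratedClassifierCV` with Platt (sigmoid) scaling; the synthetic click data and train/evaluation split are illustrative, and here the wrapper carves out its own calibration folds internally via cross-validation rather than using an explicit held-out set:

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic marketing-style data: features -> did the user click?
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=2000) > 0).astype(int)

X_train, X_eval, y_train, y_eval = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Wrap the base model in a calibrator; "sigmoid" applies Platt scaling
# on internally held-out folds so output scores behave like probabilities.
calibrated = CalibratedClassifierCV(LogisticRegression(), method="sigmoid", cv=3)
calibrated.fit(X_train, y_train)

# Calibrated scores: a score of .7 now means roughly a 70% chance of a click.
scores = calibrated.predict_proba(X_eval)[:, 1]
```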

What are bounded predictions?

ML models have no natural limits on the values they predict; regression models, for example, can output any real number. Without constraints, an ML model can predict values that are nonsensical or that violate the natural constraints of the situation. These constraints may be theoretical (e.g., a distance traveled can’t be negative), practical (e.g., your trading algorithm shouldn’t spend more than $2M per day), or strategic (e.g., limits on the actions a spam-filtering model may take). Bounding predictions is particularly important for ML systems that take automated actions in high-stakes regimes, such as bidding, high-frequency trading, and medical diagnosis.

Solution

In these scenarios — and in any scenario where natural limits exist — bound the output of your classifier as a post-processing step.
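Post-processing bounds can be as simple as a clamp. A minimal sketch (the non-negative floor and $2M/day cap are the hypothetical limits from the trading example above):

```python
import numpy as np

def bound_predictions(raw, lower=0.0, upper=2_000_000.0):
    """Clamp raw model outputs to the system's natural limits
    (illustrative limits: non-negative spend, $2M/day cap)."""
    return np.clip(raw, lower, upper)

raw = np.array([-150.0, 42_000.0, 3_500_000.0])
bound_predictions(raw)  # -> [0.0, 42000.0, 2000000.0]
```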

The dashed lines represent the prediction limits. Only the values between the prediction limits (shaded pink) are accepted.

Producer monitoring and consumer alerting

There are many upstream systems that prepare data for ML models. These systems collect and prepare data so that it’s suitable for ML model ingestion. Such systems can start producing errors at any point in time. This will have unexpected and usually detrimental effects on the ML models in the system.

Solution

The power of ML models often makes it difficult to predict how a change of inputs into a model will affect the outputs. It becomes paramount to monitor and test upstream systems so that they remain consistent with the needs of the ML model.

You need to create a central hub that performs upstream testing and monitoring.

[Furthermore,] any up-stream alerts must be propagated to the control plane of an ML system to ensure its accuracy. Similarly, any failure of the ML system to meet established service level objectives must also be propagated down-stream to all consumers, and directly to their control planes if at all possible.

Example of a central monitoring/alerting system (pink), that monitors upstream producers (yellow), and alerts the ML system (black) and downstream consumers (blue) of aberrations.
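One simple form of upstream monitoring is validating each incoming batch against expected feature ranges. A minimal sketch of a central hub’s check (the feature names, bounds, and alert format are all illustrative; a real system would also propagate these alerts to the ML system’s control plane and to downstream consumers):

```python
# Expected bounds registered for each feature an upstream producer emits.
EXPECTED = {
    "age": (0, 120),
    "daily_spend_usd": (0.0, 2_000_000.0),
}

def check_batch(batch):
    """Return a list of alert messages for missing or out-of-bounds features."""
    alerts = []
    for record in batch:
        for feature, (lo, hi) in EXPECTED.items():
            value = record.get(feature)
            if value is None:
                alerts.append(f"missing feature: {feature}")
            elif not lo <= value <= hi:
                alerts.append(f"{feature}={value} outside [{lo}, {hi}]")
    return alerts
```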

Conclusion

Because external changes occur in real time, responses must occur in real time as well.

Real-time responses are essential for the following:

  • Violated data tests
  • Covariate or prediction shift
  • Violation of bounded predictions
  • Changes in upstream producers

Relying on manual human intervention is often too slow and unsustainable. Instead, it is usually a worthwhile investment to create a system that can handle real-time responses to critical events without the need for direct human intervention.
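Such a system can start as a simple dispatch table that routes each critical event to an automated handler, paging a human only as a fallback. A minimal sketch (event names and handler actions are illustrative):

```python
def rollback_model(event):
    """Illustrative automated response: revert to the last known-good model."""
    return f"rolled back model after {event['type']}"

def clamp_outputs(event):
    """Illustrative automated response: re-apply prediction bounds."""
    return f"re-applied prediction bounds after {event['type']}"

HANDLERS = {
    "data_test_violation": rollback_model,
    "prediction_shift": rollback_model,
    "bound_violation": clamp_outputs,
    "upstream_change": rollback_model,
}

def respond(event):
    """Route a critical event to its automated handler, if one exists."""
    handler = HANDLERS.get(event["type"])
    if handler is None:
        return "no automated handler; page a human"
    return handler(event)
```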

In the final post in this series, I’m going to explore two remaining forms of technical debt — configuration and R&D debt — and offer some final thoughts on the future of machine learning in the world of business. Stay tuned!
