
Machine Learning and Data Science: Academia vs. Industry


Machine Learning (ML) technologies are becoming increasingly popular and have a wide range of applications, from smartphones and computers to large-scale enterprise infrastructure serving billions of requests per day. Building ML tools, however, remains difficult because there are no industry-wide standardised approaches to development. Many engineering students studying ML and Data Science have to re-learn much of their craft once they begin their careers. In this article, I've compiled the top five problems that ML specialists only encounter on the job, highlighting the gap between university curricula and real-world practice.

So the key challenges for new ML specialists are:

  • The need to collect datasets;

  • Data leakage; 

  • Constant data changes;

  • Multi-version models;

  • ML bugs.

Data vs. Model

In academia, students who take Data Science, Deep Learning, and Machine Learning courses learn how to work with and preprocess data. The curriculum frequently includes the testing of various models and algorithms for making data-driven predictions. Homework assignments tend to focus on achieving the highest possible score on a holdout test set, with possible limitations on overall processing time.

When new ML specialists enter the industry, they often discover the need to collect data from unstructured sources. In contrast to academic settings where datasets are provided, ML engineers in production must collect, clean, preprocess, and validate data from various sources. This is important because data quality has a significant impact on model performance. 

Starting with a "fixed" model and iterating on the data usually results in faster experiment cycles. In practice, high-quality data provides a solid foundation for improving the product. While there are no universal rules governing the model adjustment process, a focus on data allows for faster and larger performance gains. Continuous data quality improvements tend to yield consistent gains in model performance, whereas tweaking the model alone may run into diminishing returns.

Data Leakage

Data leakage is a common ML deployment problem: a model is trained offline on information that will not be available to it in production. It usually leads to models that perform exceptionally well on training data but suffer significant performance drops in production.

Overfitting

This problem arises when a model is trained with knowledge that it would not have in production. Labels (the results or outcomes) may be present in the training data even though they would not be available from the start in a real-world scenario. In this case, the model may fit the training data too closely, picking up random details and noise, and then fail to generalise to previously unseen data at prediction time.

For example, a classification model can be trained on CT scans to detect an illness, but because the training data was collected from a small number of hospitals, each with slightly different equipment, the model may "overfit" to the data. The model might be able to guess which hospital the data came from by looking at specific quirks in how the data is compressed or the unique types of noise from the equipment used. It could then incorrectly assume that hospitals treating more serious cases of an illness have a higher incidence of that illness, skewing the results.

Source: Geeks for Geeks
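
To make the pitfall concrete, here is a minimal sketch of group-aware validation with scikit-learn (my own illustration on synthetic stand-in data; hospital_id is a hypothetical per-scan group label). Splitting by hospital keeps every site's scans in a single fold, so a model that merely recognises equipment artifacts stops looking good in validation.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 32))               # stand-in for features extracted from CT scans
y = rng.integers(0, 2, size=600)             # stand-in diagnosis labels
hospital_id = rng.integers(0, 5, size=600)   # which hospital each scan came from

model = RandomForestClassifier(n_estimators=100, random_state=0)

# GroupKFold keeps all scans from one hospital in the same fold, so cross-validation
# scores reflect generalisation to unseen hospitals rather than equipment quirks.
scores = cross_val_score(model, X, y, groups=hospital_id, cv=GroupKFold(n_splits=5))
print("per-fold accuracy:", scores.round(3))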

Data imbalance

A common mistake is to assume that the data available in production will look the same as the data used in offline training. Models trained under this assumption often fail to adapt, resulting in decreased effectiveness.

Preventing data leakage requires data hygiene, realistic production conditions during training, and constant monitoring.

There are ways to prevent data imbalances, which can be roughly divided into two categories:

  • Data methods

  • Algorithm methods

Data methods balance the dataset either by removing majority-class samples (Tomek links, cluster centroids, random removal) or by generating new synthetic minority-class samples (SMOTE, random copies). Scaling these methods to high-dimensional data poses challenges, however. Undersampling discards data, which can impair model performance, whereas oversampling can result in overfitting to the synthetic distribution.

Tomek links and SMOTE illustrated. Source: Resampling strategies for imbalanced datasets
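
As an illustration of the data methods above, the imbalanced-learn package implements both kinds of resampling behind a one-line API (my own sketch with synthetic data, not the article's code).

import numpy as np
from collections import Counter
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import TomekLinks

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = np.array([0] * 950 + [1] * 50)            # 95/5 class imbalance

# Oversampling: SMOTE synthesises new minority samples by interpolating between neighbours.
X_over, y_over = SMOTE(random_state=0).fit_resample(X, y)

# Undersampling: drop majority samples that form Tomek links with minority samples.
X_under, y_under = TomekLinks().fit_resample(X, y)

print(Counter(y), Counter(y_over), Counter(y_under))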

These drawbacks can be mitigated with more advanced techniques. For example, you could train your model on undersampled data and then fine-tune it on the original data to avoid undersampling-related performance issues.

Another way to avoid the drawbacks is to combine the approaches: start training your model on the original data and then, depending on how it performs in evaluation, dynamically oversample low-performing classes and undersample high-performing classes.

Both of the methods described above require additional computation during training or extra fine-tuning stages, making training a longer process; a simplified sketch of the first approach is given below.
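
For illustration, here is a minimal PyTorch sketch of that first idea with synthetic data (my own example, not the article's code); a class-balanced sampler stands in for the resampled first stage, followed by a short fine-tune on the original distribution.

import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Synthetic imbalanced dataset: 950 majority vs. 50 minority samples.
X = torch.randn(1000, 10)
y = torch.cat([torch.zeros(950), torch.ones(50)]).long()
dataset = TensorDataset(X, y)

# Stage-1 loader: sample classes with roughly equal probability (a stand-in for resampling).
class_counts = torch.bincount(y).float()
sample_weights = (1.0 / class_counts)[y]
balanced_loader = DataLoader(
    dataset, batch_size=64,
    sampler=WeightedRandomSampler(sample_weights, num_samples=len(y)))
# Stage-2 loader: the original, imbalanced distribution.
original_loader = DataLoader(dataset, batch_size=64, shuffle=True)

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
loss_fn = nn.CrossEntropyLoss()

def run_epochs(loader, epochs, lr):
    opt = optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for xb, yb in loader:
            opt.zero_grad()
            loss_fn(model(xb), yb).backward()
            opt.step()

run_epochs(balanced_loader, epochs=5, lr=1e-3)   # stage 1: train on the rebalanced view
run_epochs(original_loader, epochs=1, lr=1e-4)   # stage 2: brief fine-tune on the true distribution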

Source: Medium

Algorithm methods aim to increase the loss contribution of poorly performing classes. This can be done by weighting classes in the loss function based on static parameters (the number of samples, false-positive/false-negative cost values, or the number of clusters after clustering each class's samples). As with data methods, dynamic loss-function weights can be used to direct the model's attention to the classes that are currently hardest to classify.

Source: focal-loss
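
Here is a minimal sketch of both variants (my own illustration, assuming PyTorch): static inverse-frequency class weights in cross-entropy, plus a simple focal-loss implementation that down-weights examples the model already classifies confidently.

import torch
import torch.nn.functional as F

# Static weighting: rarer classes get larger weights (inverse class frequency here).
class_counts = torch.tensor([950.0, 50.0])
weights = class_counts.sum() / class_counts
weighted_ce = torch.nn.CrossEntropyLoss(weight=weights)

def focal_loss(logits, targets, gamma=2.0):
    # Scale cross-entropy by (1 - p_t)^gamma so confidently classified (mostly
    # majority-class) examples contribute less, focusing training on hard ones.
    ce = F.cross_entropy(logits, targets, reduction="none")
    p_t = torch.exp(-ce)                 # model's probability for the true class
    return ((1.0 - p_t) ** gamma * ce).mean()

logits = torch.randn(8, 2)
targets = torch.randint(0, 2, (8,))
print(weighted_ce(logits, targets).item(), focal_loss(logits, targets).item())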

Data and algorithm techniques can also be combined, but given the added complexity of dynamic techniques, and the even greater complexity of combining the two in production pipelines, it may be preferable to avoid complicated code in favor of simplicity and maintainability.

Data Drifts

Data in a production environment reflects real-world trends, including political developments, natural disasters, armed conflicts, and other events. The COVID-19 pandemic, for example, affected most countries and global data flows.

This is why ML models need the most up-to-date and relevant data possible to maintain optimal performance. This can be challenging, particularly for large language models that require more time and resources to retrain.

Source: ProjectPro

The core problem here is detecting data drifts, because some of them are not obvious. Unnatural data drifts, which are caused by errors in the input data, can be detected and corrected quickly. Natural data drifts, which reflect the evolution of data patterns over time, are much harder to detect in a timely manner.

Source: alibi-detect

There are numerous statistical methods available, each with advantages and disadvantages. Alibi-detect is a Python library that implements many of them and also supports the PyTorch and TensorFlow frameworks. However, because data drift can mean a variety of things, it is difficult to determine the best way to detect it. What should you monitor? Model outputs? Input features? Which specific features, or groups of features? How do you choose a reference window for comparing statistics?
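
As a starting point, here is a minimal sketch of per-feature drift detection with alibi-detect's Kolmogorov-Smirnov detector (my own example with synthetic data; treat the exact output keys as an assumption to verify against the version you install).

import numpy as np
from alibi_detect.cd import KSDrift

rng = np.random.default_rng(0)
x_ref = rng.normal(0.0, 1.0, size=(1000, 5))    # reference window, e.g. training-time data
x_prod = rng.normal(0.3, 1.0, size=(1000, 5))   # current production window, slightly shifted

detector = KSDrift(x_ref, p_val=0.05)           # per-feature K-S test with a p-value threshold
preds = detector.predict(x_prod)

print("drift detected:", bool(preds["data"]["is_drift"]))
print("per-feature p-values:", preds["data"]["p_val"])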

Incorrect window selection for drift detection leads to false alarms. Source: SEASONAL, OR PERIODIC, TIME SERIES

If your detection method triggers too frequently, it can require frequent retraining (and a low return on investment) or generate a high volume of alarms for engineers and data scientists, causing alarm fatigue. This problem is notoriously difficult to solve in data and ML-driven products, so many teams opt for scheduled retraining as a more straightforward and predictable solution.

Retraining the models on a regular basis is the answer to this problem. To find the most efficient model retraining schedule, we first need to measure how regular retraining impacts model accuracy and other business metrics and what monetary value it brings. After that, we can compare the retraining value with its cost and find the optimal model retraining frequency to maximise the return on investment.
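
To make that trade-off concrete, here is a back-of-the-envelope sketch; all numbers and the linear-decay assumption are hypothetical placeholders rather than measurements from any real system.

def monthly_cost(retrains_per_month, decay_points_per_month=2.0,
                 value_per_accuracy_point=10_000, cost_per_retrain=3_000):
    # Total monthly cost = business value lost to model staleness + retraining cost.
    # Assuming accuracy decays linearly between retrains, the average staleness
    # loss is decay / (2 * retraining frequency).
    staleness_loss = decay_points_per_month / (2 * retrains_per_month)
    return staleness_loss * value_per_accuracy_point + retrains_per_month * cost_per_retrain

for freq in (1, 2, 4, 8):
    print(f"{freq} retrains/month -> total monthly cost {monthly_cost(freq):,.0f}")

Under these toy assumptions the total cost is minimised at roughly two retrains per month; with real measurements, the same comparison yields the schedule that maximises return on investment.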

Production and Experimentation Models

ML engineers deal with both experimentation and production models in their practice. Both types need to be thoroughly tracked, but what should be prioritised: weights, code, or data? The more components are tracked, the more expensive tracking becomes. To find the right balance for maintaining different kinds of models, you need a clear understanding of their peculiarities. While experimentation models help test new hypotheses and algorithms, production models make real-time predictions that impact business performance.

This means production models have to be reliable, quick and effective. That’s why they need a version control system which can provide a working backup model in case of a model failure. Such an approach ensures that business can run without interruptions. Another critical issue is data drift monitoring for both the model inputs and outputs. It can help detect and fix production bugs before they affect product users.
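
As a simple illustration of the backup-model idea (my own sketch, not a description of any particular serving system):

class ModelRegistry:
    # Keep the last known-good model version alongside the current one and fall
    # back to it if the current version fails at prediction time.
    def __init__(self, current_model, backup_model):
        self.current = current_model   # newest deployed version
        self.backup = backup_model     # previously validated version

    def predict(self, features):
        try:
            return self.current.predict(features)
        except Exception:
            # In a real service this branch would also log the failure and alert engineers.
            return self.backup.predict(features)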

Production data pipelines must also be thoroughly verified, as ML services rely on them to preprocess and deliver data. For experimental models, separate environments can be deployed to test model performance without interfering with normal production; these environments may include their own data pipelines and other services that communicate with the ML infrastructure during end-to-end testing. This means that data pipelines can differ between testing and production environments. There is a trade-off between experimental iteration speed and test/production consistency, so we must exercise caution when deploying new models together with data pipeline changes. For example, an engineer may forget to update the production pipeline after changing the experimental pipeline, resulting in errors or silent bugs when the model is released.

Again, there is a trade-off between cost (resources and R&D hours) on the one hand and system efficiency and feature richness on the other. The more experimental features there are, the harder it is to maintain the system and ensure its reliability. Often, the choice is made in favor of simplicity, allowing production systems to be more predictable, while new experiments with features or models not yet supported by the platform are handled separately and more manually until there is enough evidence that incorporating the new functionality into production pipelines will pay off.

ML Bugs

Constant ML experiments, high iteration speeds, and the many changes that need to be reviewed quickly can all lead to a constant stream of model bugs. ML bugs fall into three groups: hard bugs, soft bugs, and drift bugs.

Hard bugs

These are errors that can crash the whole system: on the one hand, they are extremely dangerous, but, on the other, they can be easier to detect and fix because they affect the service immediately and in an obvious way. For example, a new model with a corrupted configuration is deployed, causing high memory and CPU usage, and the service becomes unavailable.

Soft bugs

These bugs don’t break the service right away but degrade it gradually. The danger of this type of bug is that it is easy to miss before it starts to impact the system. For example, a bug or a feature timeout in the production data pipeline causes the model to lose prediction quality while service availability metrics remain unaffected.

Drift bugs

These are caused by inevitable data drifts and affect the system only after some time (for example, a data drift occurs but the model isn’t retrained).

Debugging in ML differs from traditional software debugging in that ML bugs are difficult to classify effectively. To categorise and prioritise bugs, it is critical to understand when models are underperforming; and to detect performance drops, all pipeline bugs, along with their potential impact, must be identified and understood.

Breaking this cycle requires high debugging velocity: the errors should be detected and addressed promptly. To achieve this, the model evaluation process should be systematic. It must include continuous monitoring, routine performance evaluations against diverse datasets, and mechanisms for implementing improvements based on insights from ongoing assessments. Moreover, data pipelines must be monitored and tested, as many ML-related bugs can originate in data delivery systems. Both model and data verification are crucial to service performance, and updating data pipelines is just as important as updating model weights. 
