Challenges and Opportunities in Implementing MLOps


Norbert Fijałek, Wiktor Hawrylik


Machine learning models offer cost savings, increased efficiencies, and impressive innovations, provided that your enterprise can overcome the obstacles of building them. Machine learning operations (MLOps) can help address these obstacles, though it poses challenges of its own.

Imagine playing a game of chess with a friend. Given the rules of the game, each unique piece can move across the board in a particular manner. You and your opponent try to think many steps ahead in anticipation of the other’s moves, and you both adjust strategies in response to changing board states.

Within the game’s restrictions, the number of ways a game of chess can unfold is practically endless. While grandmasters can process far more permutations than the average human player, they still pale in comparison to a machine learning (ML) model that’s equipped with algorithms made especially for chess. This is why an AI player can analyze a human player’s moves, predict their intent with a high degree of accuracy, and counterattack effectively.

Now, imagine abstracting your business into its own set of rules, its own unique set of moving pieces, and its own dynamics with competitors, customers, and regulators. Just as an ML model for chess can optimize every chess move, ML models built especially for specific business use cases can process countless permutations to help you make the right business decisions. For enterprises, this ability is critical to remaining competitive. However, building an ML model is fraught with challenges.

What makes it difficult for organizations to build and implement their own ML models?

There are many things that make the steps in the machine learning project life cycle ineffective or inefficient:

Data management: Let’s say that you own an apple orchard, and you want a more efficient way of sorting out the damaged and worm-infested apples from the good apples. You can train a machine learning model to recognize visual and weight characteristics of bad apples and have conveyor belt robots chuck them to the discard pile. This idea is great, but pulling it off successfully is another matter because ML data is difficult to manage:
- Data collection: Machine learning models usually require large amounts of data to train. The model must have many samples of both good apples and bad for it to be accurate.
- Data exploration and documentation: Exploring what type of data represents the problem being solved by the model is time-consuming (e.g., skin discoloration, punctures, dents, scratches).
- Data validation: Validating the chosen representative data also consumes a lot of time. The model should be able to pull out as much of the bad fruit from the bunch as possible while keeping as much of the good fruit as possible away from the discard pile. To maximize accuracy, the robustness and quality of the data must be high.
- Data transformation: Each data transformation might significantly affect modeling. Let's take data format, for example. In our hypothetical orchard, the model must transform visual data into textual signifiers for fruit damage and/or quantitative data that numerically expresses the quality rating of an apple. To ensure that the data transformation is accurate (i.e., that a digital image of a particular blemish on the fruit’s skin represents rot or other damage to the fruit), data engineers must invest a lot of time to explore and validate the data in the ML use case.
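
As a minimal sketch of the validation and transformation steps above, the following Python snippet checks raw apple records and turns the usable ones into numeric features. The field names (`weight_g`, `blemish_ratio`, `label`), the plausibility ranges, and the helper functions are all assumptions made for illustration, not part of any real orchard dataset or tool.

```python
from dataclasses import dataclass


# Hypothetical raw record from the sorting line; all field names are assumptions.
@dataclass
class AppleRecord:
    weight_g: float        # weight in grams
    blemish_ratio: float   # share of skin area flagged as blemished, 0.0 to 1.0
    label: str             # "good" or "bad", assigned by a human inspector


def validate(record: AppleRecord) -> list[str]:
    """Return a list of validation problems; an empty list means the record is usable."""
    problems = []
    if not 50 <= record.weight_g <= 500:  # assumed plausible weight range
        problems.append("weight out of range")
    if not 0.0 <= record.blemish_ratio <= 1.0:
        problems.append("blemish ratio out of range")
    if record.label not in {"good", "bad"}:
        problems.append("unknown label")
    return problems


def to_features(record: AppleRecord) -> tuple[list[float], int]:
    """Transform a validated record into a numeric feature vector and a 0/1 target."""
    features = [record.weight_g / 500.0, record.blemish_ratio]
    target = 1 if record.label == "bad" else 0
    return features, target


records = [
    AppleRecord(180.0, 0.02, "good"),
    AppleRecord(165.0, 0.40, "bad"),
    AppleRecord(-5.0, 0.10, "good"),   # fails validation: negative weight
]
clean = [r for r in records if not validate(r)]
dataset = [to_features(r) for r in clean]
```

Keeping validation separate from transformation means that rejected records can be logged and reviewed instead of silently distorting the training data.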

Model selection and tuning: There are many different machine learning algorithms available, and each algorithm has its own strengths and weaknesses. It can be difficult to choose the right algorithm for a particular problem, and it can also be difficult to tune the hyperparameters of an algorithm to achieve optimal performance.
Model evaluation: It is important to assess the performance of machine learning models before they are deployed into production. This evaluation should be done using a holdout dataset that was not used to train the model.

When a model is poorly evaluated, various factors, such as biased datasets, erroneous training methods, and poor testing, can make ML projects fail in real-life applications. For instance, Amazon pulled the plug on its AI recruitment system for being biased against women. IBM’s ML system Watson for Oncology recommended treatments that were not useful for treating patients because the system was not trained on actual patient data. AI tools meant to diagnose COVID in patients or triage them faster were deemed unfit for clinical use because AI researchers committed errors in the way they trained and tested their tools.
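
To make the holdout idea concrete, here is a small sketch in Python. It sets aside 20% of a synthetic apple dataset, "trains" a toy blemish threshold on the rest, and reports precision and recall for the bad-apple class on the unseen holdout set. The data, the threshold classifier, and the function names are illustrative assumptions rather than a recommended modeling approach.

```python
import random


def split_holdout(samples, holdout_fraction=0.2, seed=42):
    """Shuffle a copy of the samples and set aside a holdout set the model never trains on."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) - round(len(shuffled) * holdout_fraction)
    return shuffled[:cut], shuffled[cut:]


def fit_threshold(train):
    """Toy 'training': pick the blemish threshold that maximizes training accuracy."""
    candidates = sorted({blemish for blemish, _ in train})

    def accuracy(threshold):
        return sum((blemish >= threshold) == is_bad for blemish, is_bad in train) / len(train)

    return max(candidates, key=accuracy)


def precision_recall(threshold, holdout):
    """Precision and recall for the 'bad apple' class on unseen data."""
    tp = sum(1 for b, bad in holdout if b >= threshold and bad)
    fp = sum(1 for b, bad in holdout if b >= threshold and not bad)
    fn = sum(1 for b, bad in holdout if b < threshold and bad)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Synthetic (blemish_ratio, is_bad) pairs, generated only for illustration.
rng = random.Random(0)
data = []
for _ in range(200):
    is_bad = rng.random() < 0.3
    blemish = rng.uniform(0.2, 0.9) if is_bad else rng.uniform(0.0, 0.35)
    data.append((blemish, is_bad))

train, holdout = split_holdout(data)
threshold = fit_threshold(train)
precision, recall = precision_recall(threshold, holdout)
print(f"threshold={threshold:.2f} precision={precision:.2f} recall={recall:.2f}")
```

In the orchard example, recall measures how much of the bad fruit is pulled out, while precision measures how much good fruit is kept away from the discard pile.
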
Model deployment: Once an ML model is trained, it needs to be deployed to production so that it can be used to make predictions. Otherwise, it is not creating tangible value for the enterprise. Model deployment can be a complex and time-consuming process since the model must be deployed in a secure and reliable way. A 2022 IDC report, for instance, noted that it takes an average of 290 days to fully deploy a model into production from end to end.
Model monitoring and retraining: Once an ML model is deployed and is found to work as intended in the real world, it is important to monitor its output and performance. This is because data from production is fed into the model so that the model can adapt to changing data. However, in a phenomenon known as “data drift,” production data gradually diverges from the statistical properties of the initial training data, which makes the model generate less accurate predictions or fail to perform as intended.

Also, the degree to which input data can be trusted to be accurate, complete, consistent, and timely — known as “fidelity” — can wane and therefore lead to false insights, inaccurate predictions, and lost business opportunities. Furthermore, other factors that would necessitate model adjustments may include internal ones like changes in the business model or external ones like the cloud provider’s system updates.

There are also ethical and legal aspects to monitoring and maintaining ML models. AI and ML tools must have human monitors and limiters. Otherwise, they might label individuals as “primates,” and chatbots that are geared to sound more emotional and engaging might encourage people to commit worrying or controversial acts.
Model maintenance costs: After accounting for ML model development costs, the business cost savings obtained by an AI or ML project can be easily gobbled up by its maintenance costs.

Given these and other roadblocks, such as the shortage of AI and ML skills needed to deliver projects, the success rate of these projects being deployed to, adopted by, and scaled to produce benefits for many parts of an organization is low. Many enterprises, for instance, are finding it difficult to hire and retain software, data, and machine learning engineers as well as AI data scientists.

To illustrate, a Gartner survey revealed that while 80% of executives see that AI-powered automation can be applied to any business decision, nearly half of enterprises still can’t move their AI projects into production. Moreover, in a survey by Deloitte, 22% of organizations that are fully deploying AI applications are not achieving meaningful outcomes, while 28% are still in the nascent stage of deploying AI projects and thus haven’t realized their benefits.

The low success rate can be attributed to the approach enterprises commonly take to build ML solutions, which is to create predictive models without factoring in their operational contexts. A new approach was created, where business operations are integral to the machine learning project life cycle. This is how machine learning operations (MLOps) came to be.

What is MLOps?

MLOps is a set of practices that combines software engineering, data science, and operations to manage the entire machine learning project life cycle. It aims to streamline the development and deployment of machine learning models in an efficient and consistent manner. By implementing MLOps, businesses can achieve better collaboration between teams, faster model deployment, and improved model performance. Additionally, MLOps helps organizations ensure compliance and governance while maintaining the integrity and accuracy of their machine learning models.

Here are the ways that MLOps optimizes the machine learning life cycle:

Data management: MLOps can help organizations to collect, explore, label, validate, and transform data more efficiently and effectively. This can be done by automating the data collection process and by providing tools for data cleaning and preprocessing.

Beyond helping data engineers turn structured, semistructured, and unstructured data into “features” (i.e., variables expressed in numeric values that an ML model can comprehend, such as apple weight and the apple quality rating in our earlier orchard example), MLOps also centralizes the storage, processing, and access to features in a data system called a “feature store.” Being able to access such variables from one place makes them reusable for other ML projects. Moreover, a feature store increases collaboration across all ML teams and speeds up ML-specific extraction of insights, formulation of data hypotheses, and documentation.
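
The reuse argument can be illustrated with a toy, in-memory stand-in for a feature store. The class name, method names, and feature names below are hypothetical. Production feature stores typically add persistence, access control, and versioning, none of which is modeled here.

```python
from collections import defaultdict


class InMemoryFeatureStore:
    """Toy stand-in for a feature store: one shared place to write and read features."""

    def __init__(self):
        # Maps feature name -> {entity id -> value}.
        self._features = defaultdict(dict)

    def write(self, feature_name, entity_id, value):
        self._features[feature_name][entity_id] = value

    def read(self, entity_id, feature_names):
        """Assemble a feature vector for one entity, in the requested order."""
        return [self._features[name][entity_id] for name in feature_names]


store = InMemoryFeatureStore()
store.write("weight_g", "apple-001", 182.0)
store.write("quality_rating", "apple-001", 0.91)

# Two hypothetical projects read the same stored features instead of recomputing them.
sorting_inputs = store.read("apple-001", ["weight_g", "quality_rating"])
pricing_inputs = store.read("apple-001", ["quality_rating"])
```
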
Model selection and tuning: MLOps can help data scientists by automating the model selection process and tuning machine learning models more effectively. This can be done by providing automated machine learning (AutoML) tools for model selection and hyperparameter tuning.
Model evaluation: MLOps introduces model registries that help track the results of model evaluations on data. These are crucial to performance tracking, benchmarking, modeling and analysis, research, and reporting.
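
As a rough sketch of what a model registry tracks, the snippet below records evaluation metrics per model version and looks up the best version for a chosen metric. The class design and the metric values are invented for illustration and do not mirror any specific registry product.

```python
from dataclasses import dataclass, field


@dataclass
class ModelVersion:
    name: str
    version: int
    metrics: dict


@dataclass
class ModelRegistry:
    """Toy registry that records evaluation metrics per model version."""
    _versions: list = field(default_factory=list)

    def register(self, name, metrics):
        # Version numbers start at 1 and increase per model name.
        version = 1 + sum(1 for v in self._versions if v.name == name)
        entry = ModelVersion(name, version, dict(metrics))
        self._versions.append(entry)
        return entry

    def best(self, name, metric):
        """Return the version of a model with the highest value for the given metric."""
        candidates = [v for v in self._versions if v.name == name]
        return max(candidates, key=lambda v: v.metrics[metric])


registry = ModelRegistry()
registry.register("apple-sorter", {"precision": 0.88, "recall": 0.81})  # illustrative values
registry.register("apple-sorter", {"precision": 0.91, "recall": 0.86})  # illustrative values
best = registry.best("apple-sorter", "recall")
```
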
Model deployment: MLOps can help organizations deploy machine learning models more efficiently and effectively. This can be done by automating the process of model deployment and implementing standard DevOps practices like testing, versioning, and continuous delivery.
Model monitoring: MLOps can help organizations monitor machine learning models more effectively. This can be done by providing tools for tracking metrics, parameters, code versions, and output files as well as analyzing them and visualizing the results. Moreover, model monitoring from the vantage point of operations makes underperforming models easier to identify, which leads to the models being retrained sooner.

Lastly, MLOps includes ways for human monitors to address data drift, concept drift, and other phenomena that contribute to model performance decay, such as the loss of prediction accuracy. As soon as the model’s performance decay threshold is breached, alarms are raised so that the model is retrained and brought back on track.
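
A threshold-based drift alarm can be sketched as follows. The check used here is deliberately simple: it measures how far the mean of a production feature has moved from the training mean, in units of the training standard deviation. The threshold of 2.0 and the sample weights are assumptions; real monitoring setups often rely on richer statistics, such as the Kolmogorov-Smirnov test or the Population Stability Index.

```python
import statistics


def drift_score(training_values, production_values):
    """Shift of the production mean from the training mean, in training standard deviations."""
    mean = statistics.fmean(training_values)
    stdev = statistics.stdev(training_values)
    return abs(statistics.fmean(production_values) - mean) / stdev


def check_drift(training_values, production_values, threshold=2.0):
    """Return True (raise the alarm) when the drift score breaches the assumed threshold."""
    return drift_score(training_values, production_values) > threshold


training_weights = [180, 175, 190, 185, 170, 178, 182, 188]   # grams, illustrative
same_season = [179, 183, 176, 186]                            # similar to training data
new_variety = [230, 225, 240, 235]                            # heavier apples arrive

print(check_drift(training_weights, same_season))   # False
print(check_drift(training_weights, new_variety))   # True
```
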
Model maintenance: This is a crucial aspect of AI solutions that ensures their long-term effectiveness and value. Unfortunately, model maintenance is expensive. An effective MLOps strategy can control costs by automating the tasks that keep AI models performing optimally, up to date, and aligned with changing business requirements.

By solving these challenges, MLOps can help organizations to get more value from their machine learning investments.

Further reading:

Analytics and Technology Trends and Predictions for 2023: Toward Resilience

One of the key tech predictions for 2023 is that organizations will increasingly harness AI and ML to innovate more quickly, respond more proactively, and act more meaningfully. MLOps will prove key to fulfilling these business goals.


Enabling Enterprises to Operationalize AI and Machine Learning

Lingaro presented a framework for addressing challenges in implementing AI and ML projects and went back to basics to structure a company’s MLOps practice.


What are the challenges that organizations face when implementing MLOps?

While MLOps helps address a lot of problems in building ML models, implementing it introduces new ones of its own:

Lack of talent: There is a shortage of skilled MLOps professionals, which can make it difficult to find the right people to build and maintain MLOps pipelines.
Need for a continuous feedback loop: By default, ML models aren’t monitored for the business value they bring. The amount of value they create could decrease over time and the enterprise wouldn’t be aware that they must act upon it.

With that said, MLOps-monitoring capabilities can be applied to make sure that this value is continuously monitored and maintained via a continuous feedback loop. This is done by first establishing the business requirements for the model, then mapping those requirements to measurable KPIs. The KPIs are monitored, reported, and sometimes acted upon automatically. To illustrate, if a model’s performance dips below a predefined threshold, model retraining can be automatically initiated.
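
One pass of such a feedback loop might look like the sketch below: business requirements are expressed as KPIs with minimum thresholds, observed values are compared against them, and a retraining callback fires on any breach. The KPI names, thresholds, observed values, and the `retrain` callback are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass
class Kpi:
    name: str
    threshold: float   # minimum acceptable value, agreed with the business


def evaluate_kpis(kpis, observed):
    """Return the names of KPIs whose observed value fell below the threshold."""
    return [k.name for k in kpis if observed[k.name] < k.threshold]


def feedback_step(kpis, observed, retrain):
    """One pass of the loop: find breaches and trigger retraining if there are any."""
    breached = evaluate_kpis(kpis, observed)
    if breached:
        retrain(breached)
    return breached


# A list stands in for a real retraining pipeline; it just records each request.
retraining_requests = []
kpis = [Kpi("bad_apple_recall", 0.85), Kpi("good_apple_retention", 0.95)]
observed = {"bad_apple_recall": 0.79, "good_apple_retention": 0.97}
breached = feedback_step(kpis, observed, retraining_requests.append)
```

In practice, a step like this would run on a schedule, and the callback would start a retraining job rather than append to a list.
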
Poor data quality: This is not exactly a challenge that MLOps introduces, but rather one that MLOps helps to identify. The quality of the data used to train ML models is critical to the success of those models. However, data can be dirty, incomplete, or biased, which can lead to inaccurate or unreliable models. In fact, data quality or precision remains the top obstacle among surveyed AI, ML, and data leaders and practitioners who are struggling to productionalize their ML operations.
Culture shock: MLOps requires a culture of collaboration and cooperation between different teams, such as data scientists, engineers, and operations staff. This can be difficult to achieve, especially in organizations that are not used to working in this way. This is not new. In a 2023 NewVantage Partners survey, 79.8% of respondents said that cultural impediments collectively form the biggest obstacle to organizations becoming data-driven.
High costs: MLOps can be a costly investment, both in terms of time and money. Organizations need to be prepared to invest in the right tools and resources to make MLOps a success. To illustrate, building an ML platform can take anywhere from a couple of months for a handful of ML engineers to up to two years for a dozen engineers.

Despite these challenges, MLOps is a valuable framework that can help organizations get more value from their data. By overcoming these challenges, organizations can reap the benefits of AI and ML, such as improved decision-making, increased productivity, and reduced costs.

In our next post, we’ll explore how Azure ML addresses these concerns.

Lingaro Group’s AI and machine learning practice helps enterprises determine their most critical business challenges, frame these as ML use cases, and develop, operationalize, and scale models accordingly. Lingaro provides end-to-end MLOps as a service — from strategic consulting, rapid productization of ML solutions, and model monitoring to implementation of responsible AI.

Lingaro maintains a dedicated center of excellence that brings together top talent, knowledge, and resources to holistically manage and scale AI projects and data life cycles. Our expertise and capabilities are complemented by industry-recognized advanced analytics practices built through XOps-inspired initiatives as well as DevOps- and Agile-based principles.

Meet Our Experts

Norbert Fijałek, AI Engineering Team Leader

Norbert is passionate about AI and Machine Learning Operations. He has spent more than 10 years exploring and implementing new technologies, designing and developing innovative products, and leading teams to deliver successful AI and product management solutions across multiple industries.

Norbert enjoys sharing his knowledge with others and has been a featured speaker at various conferences on AI and related topics. He also participates in successful accelerators and competitions.

Overall, he is a lifelong learner who loves being on the cutting edge of the latest technology trends and exploring new ways to apply them to real-world problems.

Wiktor Hawrylik, Principal Consultant

Wiktor not only excels at complex data science and DevOps work, but he also applies detailed algorithmic approaches to solving problems, handling performance-intensive and time-critical tasks, and transforming regulatory requirements into functioning code. He matches his expertise with an unwavering dedication to solving tasks, which has earned him a reputation for consistently leading teams to successful deliveries of machine learning solutions.