
Assembling MLOps practice - part 2


 Part I of this series, published in May, discussed the definition of MLOps and outlined the requirements for implementing this practice within an organisation. It also addressed some of the roles necessary within the team to support MLOps.

Lego Alike data assembly - Generated with Gemini

 

This time, we move forward by exploring part of the technical stack that could be an option for implementing MLOps. 

Before proceeding, here is a link to the first part of the article for reference.



Assembling an MLOps Practice - Part 1

ML components are key parts of the ecosystem, supporting the solutions provided to clients. As a result, DevOps and MLOps have become part of the "secret sauce" for success...


Components of your MLOps stack.

The MLOps stack optimises the machine learning life-cycle by fostering collaboration across teams and by delivering continuous integration and deployment (CI/CD), modular code structures, and automated workflows. This approach accelerates time-to-market while enhancing model reliability.

Components.

  • Version control.
    • Allows collaboration as the teams contribute to the ML code.
    • Essential to the CI/CD pipeline and its automation.
    • E.g. GitHub, Bitbucket, and others.
  • Model pipeline.
    • For model environments and libraries, you can leverage tools such as TensorFlow, Azure Machine Learning Studio, PyTorch, Cohere, and others.
    • For CI/CD and automation, you can leverage Jenkins, GitHub Actions, Apache Airflow, Luigi, or Prefect.
    • For deploying and monitoring your model performance on the cloud, you have options like Azure ML, AWS SageMaker, Google Vertex AI, Databricks, and MLflow.
  • Containerisation for your models. The usual suspects here: Docker, and Kubernetes (K8s) with AKS (Azure), EKS (AWS), or GKE (Google Cloud).
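These components come together in an automated pipeline: version control triggers CI, the pipeline trains and evaluates the model, and only candidates that clear a quality gate get deployed. Below is a minimal, library-free sketch of that gating logic; the function names, the toy "model", and the threshold are illustrative, not taken from any specific tool.

```python
# Minimal sketch of an ML pipeline's CI/CD quality gate, standard library
# only. A real pipeline would call TensorFlow/PyTorch here instead.

def train(data):
    """Stand-in for a real training step: the 'model' is just the mean."""
    return sum(data) / len(data)

def evaluate(model, holdout):
    """Stand-in metric: mean absolute error of the constant model."""
    return sum(abs(x - model) for x in holdout) / len(holdout)

def should_deploy(metric, threshold=1.0):
    """The CI/CD gate: only promote the model if it meets the bar."""
    return metric <= threshold

model = train([1.0, 2.0, 3.0])
mae = evaluate(model, [2.0, 2.5])
print(should_deploy(mae))  # the pipeline deploys only on True
```

The point of the gate is that deployment is a decision the pipeline makes from metrics, not a manual step; every tool combination listed above is ultimately automating this pattern.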

Approach for monitoring the ML pipeline.


You want to watch for certain things within the lifetime of your ML pipeline:

  • Model drift. When the model stops performing because of changes in the data, leading to inaccurate predictions.
  • Data drift. A change in the model's input data over time. When it occurs, you will start to notice the declining performance of the model.
  • Performance degradation (e.g. latency, throughput, resource usage).
  • Alert fatigue caused by overwhelming alerting/notifications.

To tackle the challenges above, leverage an ML framework and tools that allow for:

  • Proactive monitoring.
  • Automated remediation.
  • Continuous improvement. 
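To make the drift checks concrete: one common, simple drift statistic is the Population Stability Index (PSI), which compares the distribution of a feature at training time against live traffic. Here is a stdlib-only sketch; the 0.2 threshold is a widespread rule of thumb, not a standard.

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline sample and a live
    sample. PSI > 0.2 is a common rule-of-thumb signal of data drift."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0
    def histogram(values):
        counts = [0] * bins
        for v in values:
            i = min(max(int((v - lo) / width), 0), bins - 1)
            counts[i] += 1
        # Smooth empty buckets so the log term stays defined.
        return [max(c, 1e-6) / len(values) for c in counts]
    e, a = histogram(expected), histogram(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

baseline = [float(i % 10) for i in range(1000)]        # training-time feature
shifted = [float(i % 10) + 3.0 for i in range(1000)]   # drifted live feature
print(psi(baseline, baseline) < 0.1)  # True: no drift against itself
print(psi(baseline, shifted) > 0.2)   # True: drift detected
```

A monitoring job would compute this per feature on a schedule and raise an alert only when the index crosses the threshold, which also helps with the alert-fatigue point above.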

Here is a list of tools:

  • PyTorch.
  • TensorFlow.
  • MLflow.
  • AWS SageMaker.
  • Azure ML.

MLflow and demo.

We are going to focus on MLflow because (at least at the time of writing this article) it is one of the best options for MLOps activities, thanks to its versatility and ease of learning. It includes robust features for:

  1. Experiment tracking.
  2. Model management.
  3. Deployment.

In addition, it is open-source and has a great community behind it, with regular updates.

An additional advantage of this tool is its flexibility, which enables seamless integration with various other tools that users may prefer. Combined, these tools create a robust MLOps toolkit and provide an effective framework for project implementation. Below you will find a couple of suggestions:

  • MLflow and AWS SageMaker. MLflow can focus on tracking and deployment, while SageMaker provides the scalable infrastructure for model training.
  • MLflow with Prometheus and Grafana. This is one of the most popular combinations you will find on the internet these days. MLflow focuses on tracking, workflows, and deployments. Prometheus captures metrics and performs real-time monitoring around the ML models, which can be leveraged for diagnostics (performance, health, etc.). Grafana complements all of this with its visualisation of the data provided by Prometheus.
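To see what Prometheus actually consumes from a model service: it periodically scrapes a plain-text /metrics endpoint in the exposition format shown below. This is a stdlib-only sketch to illustrate the format; in practice you would use the official prometheus_client library, and the metric names here are illustrative.

```python
# Render model-serving metrics in the Prometheus text exposition format.
# A real service would serve this string over HTTP at /metrics.
def render_metrics(prediction_count, latency_seconds_sum):
    lines = [
        "# HELP model_predictions_total Total predictions served.",
        "# TYPE model_predictions_total counter",
        f"model_predictions_total {prediction_count}",
        "# HELP model_latency_seconds_sum Cumulative prediction latency.",
        "# TYPE model_latency_seconds_sum counter",
        f"model_latency_seconds_sum {latency_seconds_sum}",
    ]
    return "\n".join(lines) + "\n"

print(render_metrics(42, 1.5))
```

Grafana then queries Prometheus for these series to build the dashboards, so the model service itself only needs to keep counters up to date.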


Learnings from setting up MLflow on AWS.

Tech used:

  • MLflow,
  • Docker,
  • Microsoft VS Code as the IDE,
  • GitHub Copilot with the Gemini 2.5 Pro model,
  • Terraform,
  • PostgreSQL as the DB engine,
  • AWS:
    • ECS,
    • ECR,
    • Secrets Manager,
    • IAM (for user, role, and policy),
    • EC2 (load balancer),
    • Aurora and RDS,
    • CloudWatch.

Prerequisites.

Install the AWS CLI.  

  • Instructions: AWS CLI User Guide - getting started.
  • Once you install the CLI, get your access key ID and secret access key.
  • Use this command to configure the CLI: aws configure
  • Check your work by making sure you are running the right version and the configuration is correct. Handy commands:
    • aws --version
    • aws configure list
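Under the hood, aws configure simply writes INI files under ~/.aws/. If you want to sanity-check the result programmatically, mirroring aws configure list, a small stdlib-only script can read them back; the paths and the "default" profile name follow AWS CLI conventions.

```python
# Read back what `aws configure` wrote to ~/.aws/credentials and
# ~/.aws/config, using only the standard library.
import configparser
from pathlib import Path

def check_aws_config(aws_dir=Path.home() / ".aws", profile="default"):
    creds = configparser.ConfigParser()
    creds.read(aws_dir / "credentials")   # [default] aws_access_key_id=...
    config = configparser.ConfigParser()
    config.read(aws_dir / "config")       # [default] region=...
    has_keys = (creds.has_section(profile)
                and creds.has_option(profile, "aws_access_key_id")
                and creds.has_option(profile, "aws_secret_access_key"))
    region = config.get(profile, "region", fallback=None)
    return has_keys, region

ok, region = check_aws_config()
print(f"credentials configured: {ok}, default region: {region}")
```

This only confirms the files are in place; aws sts get-caller-identity is the definitive check that the keys are actually valid.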

Install Terraform. 

At the end of our internal lab practice, we ended up with a functional MLflow instance on AWS. See the image and code repo link below.

Note: It is important to call out that this was a lab exercise for the purposes of this article. To make it production ready, additional steps are required, including security checks and multiple test rounds.

 

MLflow running on Beolle - AWS instance

Key takeaways.

  1. Whether you are new to Terraform or quite experienced, when using GitHub Copilot as a coding assistant, don't simply accept its code without question. It is important to evaluate the quality and security, and to ensure the logic remains consistent. This last point is crucial because the assistant offers a range of possible code flows, which might diverge and affect the overall design of your code and the services you have chosen to implement.
  2. LLMs are good code assistants, and one was helpful in getting this done. However, do not fall into the trap of letting it do everything. Take the time to learn and understand what you are producing. Also keep in mind that for production readiness you need to follow your quality and security controls.
  3. One important lesson was that Terraform needs all the necessary privileges to run and set up the required AWS services. Keep this in mind, enjoy the process, and have some fun!

Public Github repo.

Note before you go to the repo: As we mentioned in the takeaways section, we had to tweak the AWS role and policy created for this so Terraform could run smoothly and the IaC would work correctly.

If you check out the repo, especially the main.tf file, you'll see our comment pointing out that the role and policy are fairly basic and are missing some recent updates needed for everything to run without a hitch. We didn't want to hold up publishing this article any longer, so we decided to go ahead with it.

We plan to update the repo soon. Meanwhile, feel free to ask a friend who knows AWS and Terraform for help, or use your favourite LLM to assist you with that part. Good luck!

beol-mlops-mlflow lab

 
