How to Become an MLOps Engineer as a DevOps Engineer

Thursday, Oct 8, 2020

MLOps Loop

Image by ML-Ops.org, released under the Creative Commons Attribution 4.0 International Public License.

Getting into Machine-Learning-Ops

This article is for people who are already DevOps engineers: people who work day and night on infrastructure automation, set up and maintain continuous integration and continuous delivery pipelines, and configure monitoring and alerting for applications and systems.

I assume you know the concepts and implementations of the "DevOps Roadmap", and now you are looking at a team of five PhDs doing something called "Machine Learning". You may be wondering: are they using automation? The answer is most likely no. Looking at Google Trends for MLOps vs. DevOps as of early October 2020, the search term MLOps is almost non-existent, but if you search for machine learning you can see how much it has grown.

What does this tell you? There may be a gap between the pace of machine learning development and the level of maturity needed to productionize the artifacts of machine learning projects.

What is the difference

The first thing you need to understand is the key difference between software engineering projects and machine learning projects.

Software engineering projects are built to produce a deterministic result for a given input, based on the business requirements, while machine learning projects give you a probability or prediction of a result for the input data. If that still sounds abstract, here is a simpler way to put it: software engineering projects are like multiple-choice questions, while machine learning projects are like essays. There is mostly a known answer to a software engineering problem, and mostly no known answer to a machine learning problem.

Practically, what does this mean?

It means software engineering projects have black-and-white answers, and they most likely have an "end state" where the software behaves as expected. You can write tests with great confidence that, as long as your infrastructure is up and running, the software will behave correctly.

However, that is not the case with machine learning projects. They do not really have an "end" because the result is a probability, and if a problem can be solved with 100% probability, you likely do not need machine learning. This means you can always improve something in a machine learning project: algorithms, parameters, training data …
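To make the contrast concrete, here is a minimal sketch in Python. Everything in it is made up for illustration: add_tax stands in for a deterministic business rule, toy_spam_score is a hypothetical stand-in for a trained model, and LABELED_EXAMPLES is a tiny invented data set. The point is only the shape of the tests: the first can assert an exact value, while the second can only assert that a metric stays above a threshold.

```python
def add_tax(price: float, rate: float = 0.1) -> float:
    """Deterministic business rule: the same input always gives the same output."""
    return round(price * (1 + rate), 2)


def toy_spam_score(message: str) -> float:
    """Hypothetical stand-in for a model: returns a probability-like score in [0, 1]."""
    spammy_words = {"free", "winner", "prize"}
    words = message.lower().split()
    if not words:
        return 0.0
    hits = sum(1 for word in words if word in spammy_words)
    return min(1.0, hits / len(words) * 2)


def accuracy(examples, threshold: float = 0.5) -> float:
    """Fraction of labeled examples where the thresholded score matches the label."""
    correct = sum((toy_spam_score(text) >= threshold) == label for text, label in examples)
    return correct / len(examples)


# Invented (text, is_spam) pairs; the last one is a case the toy scorer gets wrong.
LABELED_EXAMPLES = [
    ("free prize winner", True),
    ("meeting at noon", False),
    ("claim your free gift", True),
    ("free lunch on friday for the whole team", False),
    ("congratulations you have been selected", True),
]

# Software-style test: one exact, known answer.
assert add_tax(100.0) == 110.0

# ML-style test: no exact answer, only "good enough" on held-out examples.
assert accuracy(LABELED_EXAMPLES) >= 0.75
```

Note that the second assertion can keep passing while individual predictions are wrong, which is exactly why there is always something left to improve.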

So, what should I learn

Personally, I think these are the things you need to learn:

At the very beginning, you need to know the machine learning workflow.
Along the way, you need to learn the terminology.
You then explore the tools people use to do machine learning.
Understand how to do CI/CD for machine learning projects.
Understand how to do monitoring & alerting for projects with machine learning components.

Now let us dig a bit deeper into each section!

Machine Learning Workflow

Typically, in a software engineering project, you follow the process below:

Define the business requirement.
Translate the business requirement to technical requirements.
Design the technical solution.
Implement the technical solution.
Test the technical solution.
Deliver the technical solution.
Then you repeat the cycle, from step 1.

However, this is not what a typical machine learning project looks like. It is more like this:

Define the business requirement.
Gather the data that could be relevant to solving the business problem. This may include data cleaning, transformation, etc.
Perform experiments with the data. This may include feature engineering, picking the right algorithms and parameters, etc.
Build machine learning models. This may include selecting the features, algorithms and parameters, and then building the models.
Deliver the machine learning model and look at its performance.
Improve the model by repeating the cycle, from step 3.
Then when you are happy with the model, you repeat the outer cycle, from step 1.

As you may have noticed, the way machine learning projects work is not a simple build-and-deliver model. There are more experimental cycles before moving on to the next business requirement. This means MLOps engineers will work more closely with data engineers and data scientists, providing solutions that support the need for continuous experimentation and engineering.
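The inner cycle of the machine learning workflow (experiment, build, look at the performance, repeat) can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the "model" is a single threshold on one number, and the names Experiment, evaluate, run_experiments and TRAINING_DATA are all hypothetical. A real project would swap in real training code, but the loop and the recorded history are the parts an MLOps engineer ends up automating.

```python
from dataclasses import dataclass


@dataclass
class Experiment:
    """One pass through the inner cycle: the parameter tried and the score it got."""
    threshold: float
    accuracy: float


def evaluate(threshold: float, data) -> float:
    """Score a one-parameter 'model' that predicts True when x >= threshold."""
    correct = sum((x >= threshold) == label for x, label in data)
    return correct / len(data)


def run_experiments(data, candidates, target: float = 0.9):
    """Try candidates in order, record every run, stop early once the target is met."""
    history = []
    for threshold in candidates:
        result = Experiment(threshold, evaluate(threshold, data))
        history.append(result)
        if result.accuracy >= target:
            break  # happy with the model: go back to the outer, business-level cycle
    best = max(history, key=lambda run: run.accuracy)
    return best, history


# Invented (feature, label) pairs for illustration.
TRAINING_DATA = [(0.1, False), (0.3, False), (0.4, False), (0.6, True), (0.7, True), (0.9, True)]

best, history = run_experiments(TRAINING_DATA, candidates=[0.2, 0.8, 0.5])
print(f"best threshold: {best.threshold}, runs needed: {len(history)}")
```

Keeping the full history, and not just the winner, is a deliberate choice: it is what lets the team compare runs and reproduce a result later.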

Terminologies

There can be a lot of new terminology for you, which I consider the elephant in the room. At the end of the day, we are talking about data science, not software engineering. As scientists always make things sound more complicated, it is necessary that, as MLOps engineers, we understand the terminology they are using.

There are a lot of places where you can find great references about these terminologies:

Do not stress too much if you do not understand what something is. Personally, my suggestion is to pay more attention to the verbs than the nouns most of the time. For example, you may want to know what "Dimension Reduction" is, but probably do not need to know "Markov Decision Process".

Tooling & CI/CD

This will be similar to what you needed to learn when you started your DevOps journey, but focused on the set of tools you need for machine learning. Putting aside your usual tools like scripting languages and CI/CD platforms, you may want to learn some new tools that are designed to solve machine learning problems.

Some of the things you are familiar with may need to change, and you may need to find new ways of doing the things you know. For example, when you think about testing, what type of testing would you like to do, with what tools, and in what stages? You may find yourself unable to run unit tests in a matter of seconds or minutes, or you may not be able to do unit testing at all in some stages, like model training. Another example could be the artifacts you will be storing and serving. What are the artifacts in each stage, and how do you store, version and distribute them? There are some great resources out there, like https://neptune.ai/blog/best-mlops-tools.

The main complication here is that, as of today, the machine learning ecosystem is not as mature as software engineering. Therefore, you may find a lot of new tools that have not been fully tested and/or are still in beta. You may also need to talk to your data engineers and data scientists to convince them to use some tools rather than no tools at all. Even more excitingly (or worse, depending on whether you like it or not), you may need to skill up your team and help them follow the so-called "engineering fundamentals", because a considerable number of data scientists, despite being very talented and smart, have not been exposed to writing production-quality software.
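As one possible answer to the testing and versioning questions above, here is a small sketch using only the Python standard library. It is not how any particular MLOps tool does it. The function names artifact_version and passes_quality_gate, the 12-character version length and the 0.01 tolerance are all assumptions made for illustration. The idea is that a model artifact is versioned by everything that produced it, and that the pipeline promotes it only if its metric does not regress against the current baseline.

```python
import hashlib
import json


def artifact_version(model_bytes: bytes, params: dict, data_fingerprint: str) -> str:
    """Derive a short version id from the model file, its parameters and its data."""
    digest = hashlib.sha256()
    digest.update(model_bytes)
    # sort_keys keeps the JSON stable, so equal inputs always produce the same hash.
    metadata = json.dumps({"params": params, "data": data_fingerprint}, sort_keys=True)
    digest.update(metadata.encode("utf-8"))
    return digest.hexdigest()[:12]


def passes_quality_gate(candidate_accuracy: float, baseline_accuracy: float,
                        tolerance: float = 0.01) -> bool:
    """Allow promotion only if the candidate is not meaningfully worse than the baseline."""
    return candidate_accuracy >= baseline_accuracy - tolerance


# Hypothetical values a training job might hand to the pipeline.
version = artifact_version(b"serialized-model-weights", {"learning_rate": 0.1}, "dataset-2020-10")
if passes_quality_gate(candidate_accuracy=0.91, baseline_accuracy=0.90):
    print(f"promote model {version}")
else:
    print(f"reject model {version}")
```

Unlike a unit test, the gate does not check for a correct answer. It checks that the new model is no worse than the one already in production, which is often the most a CI/CD stage can assert about a model.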

Monitoring & Alerting

Monitoring and alerting in machine learning projects could mean the same or totally different things compared to software engineering projects. Of course, you want your server clusters to stay up, you do not want your data store to reach its storage or throughput limits, and you set up alerts on your cloud budget so you do not get a heart attack when seeing the bill. But on top of that, there are things specifically needed for machine learning projects.

For example, you may want to monitor the machine learning models to make sure that the model does not change unexpectedly. If your model is providing recommendations to an e-commerce website, you may want to see if it is still providing relevant recommendations, or any recommendations at all. Thus, your approach may need to change compared to traditional software engineering monitoring: it is no longer the case that if the request returns 200, the dashboard is green.

There could be more subtle things to monitor, and they can be more difficult to design and achieve. If you have a machine learning model that self-improves based on user behavior and interactions with your application, you may want to monitor the model to make sure that it does not do unwanted things. When you deploy your next Tay bot, you may want to make sure that it does not go down the old route again.
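For the recommendation example, a check on the content of the responses could look something like the sketch below. The function name recommendation_alerts and both thresholds are assumptions for illustration, and the input is simply a list of recommendation lists collected over some time window. Every one of these responses could have returned HTTP 200 and still trigger an alert here.

```python
def recommendation_alerts(responses, max_empty_rate: float = 0.05, min_avg_items: float = 3.0):
    """Return alert messages for a window of recommendation responses.

    Each response is the list of items the model recommended for one request.
    """
    if not responses:
        return ["no recommendation traffic observed"]

    empty_rate = sum(1 for items in responses if not items) / len(responses)
    avg_items = sum(len(items) for items in responses) / len(responses)

    alerts = []
    if empty_rate > max_empty_rate:
        alerts.append(f"empty recommendation rate {empty_rate:.0%} exceeds {max_empty_rate:.0%}")
    if avg_items < min_avg_items:
        alerts.append(f"average of {avg_items:.1f} items per response is below {min_avg_items:.1f}")
    return alerts


# Invented windows: one healthy, one where half of the requests got nothing back.
healthy_window = [["a", "b", "c", "d"]] * 20
degraded_window = [["a", "b", "c", "d"]] * 5 + [[]] * 5

for alert in recommendation_alerts(degraded_window):
    print("ALERT:", alert)
```

This only covers whether recommendations are returned at all. Whether they are still relevant usually needs a business signal, such as click-through on the recommended items, which is outside the scope of this sketch.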
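One way to notice that a model has started to behave differently is to compare the distribution of its recent output scores against a reference window. The sketch below uses the Population Stability Index (PSI) for that, which is my choice for illustration and not something the tools above prescribe. It assumes scores lie between 0 and 1, and the alert threshold of 0.2 is a commonly quoted rule of thumb that I treat here as an assumption to tune.

```python
import math


def population_stability_index(expected, actual, bins: int = 10, eps: float = 1e-6) -> float:
    """Compare two samples of scores in [0, 1]; larger values mean a bigger shift."""

    def proportions(values):
        counts = [0] * bins
        for value in values:
            # Clamp so that 1.0 (or anything slightly out of range) lands in a valid bin.
            index = min(max(int(value * bins), 0), bins - 1)
            counts[index] += 1
        # eps avoids log(0) and division by zero for empty bins.
        return [max(count / len(values), eps) for count in counts]

    expected_share = proportions(expected)
    actual_share = proportions(actual)
    return sum((a - e) * math.log(a / e) for a, e in zip(actual_share, expected_share))


# Invented samples: a spread-out reference window and a window squeezed into high scores.
reference_scores = [i / 100 for i in range(100)]
recent_scores = [0.8 + i / 500 for i in range(100)]

DRIFT_THRESHOLD = 0.2  # assumed rule-of-thumb threshold, not a universal constant
if population_stability_index(reference_scores, recent_scores) > DRIFT_THRESHOLD:
    print("ALERT: model output distribution has shifted")
```

A check like this does not tell you whether the new behavior is unwanted, only that it is different, so it works best as a trigger for a human to take a look.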

What is next

Congratulations! You have now opened up a new area of possibilities for your career, and a way to help more people! With the knowledge you gained while learning MLOps, you can then think about picking up DataOps, a similar but slightly different area of Ops that will help you get better at providing your team with a framework for data administration, collaboration and automation!