MLOps Continuous Delivery For ML On AWS
© 2020 Amazon Web Services, Inc. or its affiliates. All rights reserved.
Contents
Introduction
Continuous delivery for machine learning
The different process steps of CD4ML
The technical components of CD4ML
AWS Solutions
Alteryx
Data governance and curation
Machine learning experimentation
Productionized ML pipelines
Model serving and deployment
Model testing and quality
Continuous improvement
Alteryx: Your journey ahead
Dataiku DSS
Access, understand, and clean data
Build machine learning
Deploy machine learning
Dataiku DSS: Your journey ahead
Domino Data Lab
Introduction
Enterprise data science workflows
Domino: Your journey ahead
KNIME
KNIME Software: creating and productionizing data science
KNIME: Your journey ahead
AWS reference architecture
Model building
Productionize the model
Testing and quality
Deployment
Monitoring and observability and closing the feedback loop
High-level AI services
AWS: Your journey ahead
Conclusion
Contributors
Resources
Document Revisions
Abstract
Artificial intelligence (AI) is expanding into standard business processes, resulting in
increased revenue and reduced costs. As AI adoption grows, it becomes increasingly
important for AI and machine learning (ML) practices to focus on production quality
controls. Productionizing ML models introduces challenges that span organizations and
processes, involving the integration of new and incumbent technologies.
This whitepaper outlines the challenge of productionizing ML, explains some best
practices, and presents solutions. ThoughtWorks, a global software consultancy,
introduces the idea of MLOps as continuous delivery for machine learning. The rest of
the whitepaper details solutions from AWS, Alteryx, Dataiku, Domino Data Lab, and
KNIME.
Introduction
by Christoph Windheuser, Global Head of Artificial Intelligence, ThoughtWorks
Danilo Sato, Head of Data & AI Services UK, ThoughtWorks
After machine learning (ML) techniques showed that they can provide significant value, organizations got serious about using these new technologies and tried to deploy them to production. However, they soon realized that training and running a machine learning model on a laptop is completely different from running it in a production IT environment. A common problem is having models that only work in a lab environment and never leave the proof-of-concept phase. Nucleus Research published a 2019 report in which they analyzed 316 AI projects in companies ranging from 20-person startups to Fortune 100 global enterprises. They found that only 38% of AI projects made it to production. Further, projects that did make it to production were often deployed in a manual, ad hoc way, and then became stale and hard to update.
There are also organizational challenges. Different teams might own different parts of the process and have their own ways of working. Data engineers might be building pipelines to make data accessible, while data scientists research and explore better models. Machine learning engineers or developers then have to worry about how to integrate that model and release it to production. When these groups work in separate silos, there is a high risk of creating friction in the process and delivering suboptimal results.
MLOps extends DevOps into the machine learning space. It refers to a culture where people, regardless of their title or background, work together to imagine, develop, deploy, operate, and improve a machine learning system. To tackle these challenges in bringing ML to production, ThoughtWorks has developed continuous delivery for machine learning (CD4ML), an approach to realizing MLOps. In one of its first ML projects, ThoughtWorks built a price recommendation engine with CD4ML on AWS for AutoScout24, the largest online car marketplace in Europe. Today, CD4ML is the standard approach for ML projects at ThoughtWorks.
Continuous delivery applies to changes of all types, not just software code. With that in
mind, we can extend its definition to incorporate the new elements and challenges that
exist in real-world machine learning systems, an approach we are calling Continuous
Delivery for Machine Learning (CD4ML).
This definition includes the core principles to strive for. It highlights the importance of cross-functional teams with skill sets across different areas of specialization, such as data engineering, data science, and operations. It incorporates sources of change beyond code and configuration, such as datasets, models, and parameters. It calls for an incremental and reliable process that makes small changes frequently and safely, which reduces the risk of big releases. Finally, it requires a feedback loop: real-world data is continuously changing and the models in production are continuously monitored, leading to adaptations and improvements through retraining of the models and re-iteration of the whole process.
Model building
Once the need for a machine learning system is identified, data scientists research and experiment to develop the best model by trying different combinations of algorithms and tuning their parameters and hyperparameters. This produces models that can be evaluated to assess the quality of their predictions. The formal implementation of this model training process becomes the machine learning pipeline.
Having an automated and reproducible machine learning pipeline allows other data
scientists to collaborate on the same code base, but also allows it to be executed in
different environments, against different datasets. This provides great flexibility to scale
out and track the execution of multiple experiments and ensures their reproducibility.
Regardless of which pattern you choose, there will always be an implicit contract (API)
between the model and how it is consumed. If the contract changes, it will cause an
integration bug.
While you cannot write a deterministic test to assert the model score from a given
training run, the CD4ML process can automate the collection of such metrics and track
their trend over time. This allows you to introduce quality gates that fail when they cross
a configurable threshold and to ensure that models don’t degrade against known
performance baselines.
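For example, a quality gate of this kind can be implemented as a small script in the delivery pipeline. The following is a minimal, tool-agnostic sketch; the metric (AUC), file locations, and threshold are illustrative assumptions.

```python
# Minimal, tool-agnostic quality gate: compare a freshly computed evaluation
# metric against the last promoted baseline and fail the pipeline if the model
# degrades beyond a configurable tolerance. File names and metric are assumed.
import json
import sys

THRESHOLD = 0.02  # maximum allowed drop in AUC versus the baseline

with open("metrics/current_run.json") as f:    # written by the training step
    current = json.load(f)["auc"]
with open("metrics/baseline.json") as f:       # metric of the production model
    baseline = json.load(f)["auc"]

print(f"baseline AUC={baseline:.4f}, current AUC={current:.4f}")
if current < baseline - THRESHOLD:
    sys.exit("Quality gate failed: model degraded against the known baseline")
print("Quality gate passed")
```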
Deployment
Once a good candidate model is found, it must be deployed to production. There are
different approaches to do that with minimal disruption. You can have multiple models
performing the same task for different partitions of the problem. You can have a shadow
model deployed side by side with the current one to monitor its performance before
promoting it. You can have competing models being actively used by different segments
of the user base. Or you can have online learning models that are continuously
improving with the arrival of new data.
Elastic cloud infrastructure is a key enabler for implementing these different deployment scenarios while minimizing any potential downtime, allowing you to scale the infrastructure up and down on demand as new models are rolled out.
Let's look at the different components of the CD4ML infrastructure in more detail.
• The ability to use specialized hardware (such as GPUs) for training machine
learning models more efficiently
AWS Solutions
CD4ML is a process for bringing automation, quality, and discipline to the practice of
releasing ML software into production frequently and safely. It is not bound to a specific
infrastructure or toolset.
This whitepaper showcases MLOps solutions from AWS and the following AWS Partner
Network (APN) companies that can deliver on the previously mentioned requirements:
• Alteryx
• Dataiku
• Domino Data Lab
• KNIME
These solutions offer a broad spectrum of experiences that cater to builders and those
who desire no-to-low-code experiences. AWS and the APN provide you with choices
and the ability to tailor solutions that best fit your organization.
Alteryx
by Alex Sadovsky, Senior Director of Product Management, Alteryx
David Cooperberg, Senior Product Manager, Alteryx
• Alteryx Server – an analytical hub that allows users to scale their analytic
capabilities in the cloud or on premises on enterprise hardware
The remainder of this section discusses how you can use Alteryx in conjunction with
AWS to achieve a comprehensive CD4ML solution.
Connect makes it easy for users to discover and understand relevant data assets. Once
a data source is represented in Connect, users collaborate using social validation tools
like voting, commenting, and sharing to highlight the usefulness and freshness of the
data. Connect installs in a Windows Server environment running in Amazon EC2. Once
Connect is installed, one or more of the 25+ existing database metadata loaders are
used to add data sources. This includes loaders for Amazon Redshift and Amazon S3,
and loaders for Postgres and MySQL that can load metadata from Amazon Aurora. If a data source lacks a metadata loader, Alteryx offers intuitive SDKs in multiple languages, as well as REST APIs, that make it easy for developers to write new loaders. Connect offers a cross-platform experience, allowing desktop Designer users and Server users to explore and utilize data assets based upon shared metadata.
Alteryx also offers the ability to augment user data with datasets from industry data providers. When combined with proprietary data, Alteryx Datasets can provide valuable location and business insights; in the modeling realm, they are most often used to contribute demographic and geographic features to models.
Figure 6 – Alteryx Designer offers several options for modeling and experimentation based on a user's
level of experience
Once a data architecture is implemented and the appropriate data assets are identified,
analytics can begin. Designer, a code-free and code-friendly development environment,
enables analysts of all skill levels to create analytic workflows, including those that
require machine learning. Designer can be installed on a local Windows machine.
The first step in any analytic process is to import data. Alteryx is agnostic to where and
how data is stored, providing connectors to over 80 different data sources, including an
AWS Starter Kit that includes connectors for Amazon Athena, Amazon Aurora, Amazon
S3, and Amazon Redshift. Because Alteryx provides common ground for processing data from multiple sources, for high-performance workloads it is often a best practice to co-locate the data using preprocessing workflows. For example, to reduce future processing latency, you could move on-premises data to an AWS source. This can all be done with drag-and-drop, code-free data connector building blocks, avoiding the need to know any CLI/SQL intricacies of the underlying infrastructure, although the latter is possible as well.
Designer includes over 260 automation building blocks that enable the code-free
processing of data. This includes building blocks for data prep, cleansing, blending,
mapping, visualization, and modeling. Data cleansing, blending, and prep building
blocks are often used before machine learning experimentation to prepare training, test,
and validation datasets.
Much of the data preprocessing that occurs before modeling can also be accomplished using Alteryx’s In-Database functionality. This functionality pushes data processing tasks down to the database and delays data import until that processing has completed and an in-memory action on the local machine needs to be executed.
Alteryx Designer provides users with several choices for machine learning:
• The Alteryx Predictive Suite offers code-free functionality for many descriptive,
predictive, and prescriptive analytics tasks. Users can also customize the
underlying R code that powers these building blocks to address their specific use
cases.
• The Alteryx Intelligence Suite offers code-free functionality for building machine
learning pipelines and additional functionality for text analytics. The Intelligence
Suite also offers Assisted Modeling, an automated modeling product designed to
help business analysts learn machine learning while building validated models
that solve their specific business problems. Assisted Modeling is built on open-
source libraries and provides users with the option to export their drag-and-drop
or wizard-created models as Python scripts.
• Code-friendly building blocks that support R and Python allow users to write
machine learning code that is embedded in an otherwise code-free workflow.
Users can use these building blocks to work with their preferred frameworks and
libraries, and the built-in Jupyter notebook integration enables interactive data
experimentation.
Productionized ML pipelines
Figure 9 illustrates how Alteryx Server can be leveraged to operationalize workflows,
including those that are used for data governance. Server offers a componentized
installation experience that works natively in AWS.
Figure 9 – Alteryx Server can be installed easily in AWS to productionize machine learning and data
governance workflows
As experimentation begins to yield promising results, it’s often time to scale modeling to
support larger training data, hyperparameter tuning, and productionization.
Alteryx Server is a scalable environment that is used to manage and deploy analytic
assets. It’s easy to append CPU-optimized machines to a Server cluster that can be
specified for use by machine learning training pipelines. Executing long-running training
jobs in Server offers users the flexibility to continue designing analytic workflows in
Designer while the training job executes.
Server enables the scheduling and sequencing of analytic workflows. Each of these
features can be used as part of CI/CD pipelines that ensure the quality of models that
have been deployed to production. Using REST APIs, workflows can be
programmatically triggered and monitored for status to integrate into established
DevOps and CI/CD setups.
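As an illustration of that integration pattern, the following Python sketch triggers a published workflow and polls its status from a CI/CD stage. It is hypothetical: the endpoint paths, payload shape, and bearer-token authentication are placeholders, not the documented Alteryx Server Gallery API, so consult the Alteryx API documentation for the actual routes and signing requirements.

```python
# Hypothetical sketch only: endpoint paths, payloads, and auth are placeholders,
# not the documented Alteryx Server Gallery API.
import time
import requests

BASE_URL = "https://alteryx-server.example.com/gallery/api"  # placeholder URL
HEADERS = {"Authorization": "Bearer <token>"}                # placeholder auth

# Trigger a published workflow (the workflow ID is illustrative).
job = requests.post(f"{BASE_URL}/workflows/12345/jobs",
                    headers=HEADERS, json={"questions": []}).json()

# Poll the job until it finishes, then gate the CI/CD stage on its status.
while True:
    status = requests.get(f"{BASE_URL}/jobs/{job['id']}", headers=HEADERS).json()
    if status["state"] in ("Completed", "Error"):
        break
    time.sleep(30)

assert status["state"] == "Completed", "Workflow run failed; stop the pipeline"
```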
Alteryx Server can be installed in an on-premises data center or in the AWS Cloud, and
supports single and multi-node configurations. It’s offered as an Amazon Machine
Image (AMI) in the AWS Marketplace for easy one-click deployments. Customized
instances can also be deployed in a private subnet using Amazon Virtual Private Cloud.
Server offers many options for customization, one of which is the option to store Server
metadata in a user-managed MongoDB instance, for which AWS offers a Quick Start.
For detailed guidance, see Best Practices for Deploying Alteryx Server on AWS.
Alteryx Server offers built-in governance and version control of analytic assets, which
can be used in place of or in addition to other source control solutions.
Figure 10 – Alteryx Promote offers an MLOps solution providing model management and highly available, low-latency model serving
The Alteryx platform offers several options for model deployment. Promote is used
primarily for real-time deployments, common for models that interact with web
applications. Promote enables the rapid deployment of pre-trained machine learning
models through easy-to-use Python and R client libraries or in a code-free manner
using Alteryx Designer.
Models that have been deployed to a Promote cluster environment are packaged as Docker containers, replicated across nodes, and made accessible as highly available REST APIs that host in-memory inference methods. The number of replications of each model is configurable, as is the number of nodes available in the Promote cluster. An internal load balancer spreads requests across the available replications.
Like Server and Connect, Promote can be installed in an AWS Cloud environment or in
an on-premises data center. The recommended setup also includes an external load
balancer, such as Elastic Load Balancing, in order to distribute prediction requests
across every Promote node. Promote is ideal for inference cases in which throughput is known in advance or can be adjusted on demand. While automatic scaling is technically possible, it is beyond the intended use of the product.
Alteryx Server is the recommended solution for models that require batch inference on
known existing hardware. Batch models can be packaged for prediction in workflow or
analytic apps and can be scheduled to run in Server on compute-optimized nodes.
Server’s workflow management functionality can also be leveraged to ensure that
predictions are made only after up-to-date features have been generated through data
preprocessing.
Additionally, users often find they need a hybrid of Alteryx and AWS solutions to deploy
complex models at scale. One usage pattern we have observed is using our Assisted
Modeling tool on the desktop to prototype a model on sample data. Using Designer and
Server, clients prep/blend data from local sources and push the resulting data to S3.
Then, the model code from Assisted Modeling can be pushed to SageMaker, where the
model can be trained on the entire dataset resident in Amazon S3, and deployed as an
API in the SageMaker ecosystem to take advantage of containerization, scaling, and serverless capabilities. Because Alteryx focuses on approachable model building, this is often the best path for organizations that are light on data science but have strong DevOps or engineering resources.
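The following sketch illustrates the SageMaker half of that pattern with the SageMaker Python SDK: training a script on the full dataset in Amazon S3 and deploying it as a managed endpoint. The script name, S3 paths, and IAM role are assumptions, and the exported Alteryx model code would take the place of the placeholder training script.

```python
import sagemaker
from sagemaker.sklearn.estimator import SKLearn

role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"   # placeholder

estimator = SKLearn(
    entry_point="train.py",          # e.g., the script exported from Assisted Modeling
    framework_version="0.23-1",
    instance_type="ml.m5.xlarge",
    role=role,
)

# Train on the full dataset that Designer/Server pushed to Amazon S3.
estimator.fit({"train": "s3://my-bucket/prepared/training-data/"})

# Host the trained model behind a managed HTTPS endpoint.
predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m5.large")
print(predictor.endpoint_name)
```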
During model deployment, it’s easy to add test data to a Promote deployment script.
The testing step can be used to conditionally allow or disallow the deployment of that
model version. New Promote model versions are initially hosted in logical development
and staging environments, allowing users to run a new model in parallel with the
previously running production model. Testers can set up their systems to make predictions on both the production and staging model versions before deciding to replace the production model, which can be accomplished using an API. Promote also
records all request and response data, making it possible for users to develop custom
workflows that leverage that data to test for bias, fairness, and concept drift.
Continuous improvement
In addition to recording all incoming requests and their responses, Promote tracks
aggregated metrics in Amazon Elasticsearch Service so administrators can observe the
performance of the models they have deployed. Metrics for requests, errors, and
latency over the previous month inform whether the model needs to be replicated
further. Additional system utilization reporting helps administrators determine if
additional nodes must be added to the Promote cluster.
Finally, users can export the historical request data to be analyzed for concept or data
drift. These analyses can be performed in Alteryx Designer, scheduled to run in Server,
and can kick off the CD pipeline if drift is detected.
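As a tool-agnostic illustration, a drift check of this kind might compare recent request data against the training distribution, for example with a two-sample Kolmogorov-Smirnov test. The column names, file locations, and significance level below are assumptions.

```python
import pandas as pd
from scipy.stats import ks_2samp

train = pd.read_csv("training_features.csv")    # features the model was trained on
recent = pd.read_csv("recent_requests.csv")     # exported historical request data

drifted = []
for column in ["price", "mileage", "age_years"]:   # illustrative feature names
    stat, p_value = ks_2samp(train[column], recent[column])
    if p_value < 0.01:                             # distributions differ significantly
        drifted.append((column, round(stat, 3)))

if drifted:
    print(f"Data drift detected in {drifted}; trigger the retraining pipeline")
else:
    print("No significant drift detected")
```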
Get started with our Intelligence Suite Starter Kit or an interactive demo of Alteryx
Designer. Ready to scale? Learn about Best Practices for Deploying Alteryx Server on
AWS and deploy Alteryx Server from the AWS Marketplace.
Dataiku DSS
by Greg Willis, Director of Solutions Architecture, Dataiku
Dataiku is one of the world’s leading AI and machine learning platforms, supporting
agility in organizations’ data efforts via collaborative, elastic, and responsible AI, all at
enterprise scale. At its core, Dataiku believes that in order to stay relevant in today’s
changing world, companies need to harness Enterprise AI as a widespread
organizational asset instead of siloing it into a specific team or role.
To make this vision of Enterprise AI a reality, Dataiku provides a unified user interface
(UI) to orchestrate the entire machine learning lifecycle, from data connectivity,
preparation, and exploration to machine learning model building, deployment,
monitoring, and everything in between.
Dataiku was built from the ground up to support different user profiles collaborating in
every step of the process, from data scientist to cloud architect to analyst. Point-and-
click features allow those on the business side and other non-coders to explore data
and apply automated machine learning (AutoML) using a visual interface. At the same
time, robust coding features (including interactive Python, R, Spark, and SQL
notebooks) ensure that data scientists and other coders are first-class citizens as well.
Figure 12 – Machine learning flow within Dataiku DSS, enabling users to visually orchestrate the
ML lifecycle in collaboration with multiple AWS technologies
Figure 13 – Access, understand, and clean data, using Dataiku DSS in collaboration with AWS
services.
You can combine visual recipes and code recipes within the same flow to support real-
time collaboration between different users working on the same project. For example,
you might have a data engineer using an SQL recipe, a data scientist using an R recipe,
and a business analyst using a visual recipe, all simultaneously contributing to different
parts of the data pipeline. Integration with external code repositories and IDEs, such as
PyCharm, Sublime, and R Studio, as well as built-in Jupyter notebooks, enables coders
and developers to use the processes and tools they’re familiar with.
Figure 15 – Python code recipe in conjunction with a visual split recipe using Amazon S3 data
Data preparation for unstructured data, such as images, can be particularly challenging
with regards to supervised learning, and often requires manually labeling the data
before it can be used. Dataiku DSS includes an ML-assisted labeling tool that can help
optimize labeling of images, sound, and tabular data through active learning.
Figure 16 – Build machine learning, using Dataiku DSS in collaboration with AWS services
Business analysts and other non-technical users can use Visual and AutoML features to enhance project collaboration. Data scientists can also use these features to rapidly prototype simple models. Machine learning and deep learning models can be trained using various open source libraries, including Python (scikit-learn and XGBoost), Spark (MLlib), Keras, and TensorFlow. As a clear-box solution based on open-source algorithms, Dataiku DSS allows users to easily tune parameters and hyperparameters in order to optimize models trained using the Visual and AutoML features.
Experiment tracking
Because model training is an iterative process, Dataiku DSS keeps track of all previous
experiments, including model versions, feature handling options, the algorithms
selected, and the parameters and hyperparameters associated with them. Through the
use of either the built-in Git repository or integration with an external repository, Dataiku
DSS also manages code and project versions. This means that you can revert to a
previous version of an experiment, model, or whole project, at the click of a button.
Validation
Before deploying a new model into production, or updating an existing one, it’s
important to validate that both the data and model are consistent, performant, and
working as expected. Dataiku DSS provides built-in metrics and checks to validate data,
such as column statistics, row counts, file sizes, and the ability to create custom
metrics. An Evaluate Recipe produces metrics (such as Precision, Recall, and F1) that
can be used to test the performance of a trained model on unseen data.
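As an illustration of what such an evaluation step computes (not the internals of the Evaluate Recipe itself), the following sketch calculates the same metrics with scikit-learn on a small set of held-out labels and predictions.

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [0, 1, 1, 0, 1, 1, 0, 0, 1, 0]   # ground-truth labels on unseen data
y_pred = [0, 1, 0, 0, 1, 1, 0, 1, 1, 0]   # model predictions on the same data

print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("f1:       ", f1_score(y_true, y_pred))
```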
Scalable infrastructure
By providing access to self-service, scalable, and performant infrastructure, cloud services afford many advantages when used as part of the machine learning lifecycle. Dataiku DSS can create and manage dynamic clusters for both Amazon Elastic Kubernetes Service (Amazon EKS) and Amazon EMR. For example, the model training process can be transparently offloaded to auto scaling Amazon EKS clusters (CPU-based or GPU-based) or to SageMaker.
Amazon EKS can be automatically used in conjunction with Python, R, and Spark jobs for model training and data preprocessing tasks. Dataiku DSS creates the required container images based on your specific project requirements and pushes those to your container repository, to be used by the EKS cluster. For more information, see Reference architecture: managed compute on EKS with Glue and Athena.
• API Deployer and API Nodes, for deploying individual models as REST API
services
Whichever deployment method is used, Dataiku DSS keeps track of the artifact versions
to allow rollback.
The API Deployer manages the underlying infrastructure used for the API Nodes,
whether they are physical servers, virtual machines (for example, EC2 instances), or
containers managed by Kubernetes (for example, Amazon EKS). Dataiku DSS also
supports exporting models to portable formats, such as PMML, ONNX, and JAR, so they can be deployed using alternative infrastructure options.
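For example, a model exported to ONNX can be scored outside of Dataiku DSS with a generic runtime. This minimal sketch assumes an exported file name and input shape for illustration.

```python
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("exported_model.onnx")   # model exported from DSS
input_name = session.get_inputs()[0].name

# Score a single example; the feature values and input shape are placeholders.
features = np.array([[0.3, 1.2, 5.0, 0.0]], dtype=np.float32)
prediction = session.run(None, {input_name: features})
print(prediction)
```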
Figure 18 – Deploy machine learning using Dataiku DSS in collaboration with AWS services
Orchestration
Using a combination of the Automation Node, Scenarios, and the Public API, alongside native Git integration (GitHub, Bitbucket, etc.) and common CI/CD and configuration management techniques and tools, Dataiku DSS makes it simple to orchestrate the deployment of machine learning applications and the associated infrastructure and data preprocessing pipelines. For example, Dataiku DSS can dynamically create and manage an Amazon EKS cluster onto which it deploys your real-time machine learning REST API services, automatically retrain and redeploy a model based on key performance metrics, or rebuild an entire machine learning application.
Model monitoring
Advanced model monitoring capabilities such as model drift detection, metrics and
checks, and the Evaluate Recipe can be used to assess your model’s key performance
indicators against real data. Using visual recipes in Dataiku DSS, you can easily create
validation feedback loops to compute the true performance of a saved model against a
new validation dataset with the option to automatically retrain and redeploy models.
The feedback mechanism in Dataiku DSS makes it possible to determine whether existing models merely need to be retrained, or whether a challenger model with better performance should replace the existing production-deployed model instead. Alternatively, you can run A/B tests on your machine learning models to arrive at the best-performing model.
Further, dedicated dashboards for monitoring global data pipelines and model
performance in Dataiku DSS allow for greater transparency and easier collaboration on
MLOps.
Domino Data Lab
Introduction
Domino Data Lab was founded in 2013 to accelerate the work of code-first data
scientists and help organizations run core components of their business on the models
they develop. It pioneered the Data Science Platforms category, and today powers data
science research at over 20% of Fortune 100 companies.
● Openness – open source and proprietary IDEs and analytical software, all
containerized under one platform and deployed on premises or in the cloud
Domino does not provide any proprietary machine learning algorithms or IDEs, and
does not push any single type of distributed compute framework. Instead, Domino is
focused on providing an open platform for data science, where large data science
teams have access to the tools, languages, and packages that they want to use, plus
scalable access to compute, in one central place. Domino’s code-driven and notebook-centric environment, combined with AWS’s expertise in large-scale enterprise model operations, makes for a natural technical pairing when optimizing the productionization of ML.
• Customer hosted and managed, in which you bring your own Kubernetes cluster
• Customer hosted and Domino managed, in which you give Domino a dedicated
VPC in your AWS account for a Kubernetes deployment
When running Domino on Amazon Elastic Kubernetes Service (Amazon EKS), the architecture uses AWS resources to fulfill the Domino cluster requirements.
Core services for reproducibility, application services, and enterprise authentication run
in persistent pods. User execution and production workloads are brokered to the
compute grid in ephemeral pods. Execution resources scale automatically using the
Kubernetes Cluster Autoscaler and Amazon EC2 Auto Scaling groups.
2. Environment management
Domino allows admins to manage Docker images and control their visibility across the
organization, but image modification is not limited to platform administrators. All Domino
users can self-serve environment management to create and modify Docker
environments for their own use. Once created, these pre-configured environments are
available to data scientists throughout the enterprise, based on configurable permission
levels.
Figure 21 – Users can easily duplicate and modify an environment for their own use
3. Project management
Key, and often overlooked, aspects of ML productionization occur during the initial
stages of a project. Leveraging prior work on similar projects is one such aspect.
Because Domino automatically tracks all practitioner work, it offers a repository of past
projects, code, experiments, and models from research scattered across the
organization. Data scientists search this repository for prior art as they kick off
collaborative project work.
Figure 22 – Projects Portfolio details all research and the status toward established goals
4. Data connections
With environment management, reproducibility, and project best practices set,
collaborative data science research and productionization can proceed at a rapid pace.
This begins with connecting to, iterating on, and managing data.
Data sources
Domino users can bring the data and files stored outside of Domino to their work inside
of Domino. Domino is easily configured to connect to a wide variety of data sources,
including Amazon S3, Amazon Redshift, and an Amazon EMR cluster. This involves
loading the required client software and drivers for the external service into a Domino
environment (many come pre-loaded), and loading any credentials or connection details
into Domino environment variables.
You can optionally configure AWS credential propagation, which allows for Domino to
automatically assume temporary credentials for AWS roles that are based on roles
assigned to users in the upstream identity provider. Once a user has logged into the
system, their short-lived access token is associated with any workloads executed during
their session.
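For example, a notebook or job running in Domino can reach Amazon S3 through boto3, which picks up the standard AWS credential environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_SESSION_TOKEN) whether they were set as Domino environment variables or injected through credential propagation. The bucket and key below are placeholders.

```python
import boto3
import pandas as pd

# boto3 reads AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_SESSION_TOKEN
# from the environment; the bucket and key below are placeholders.
s3 = boto3.client("s3")
obj = s3.get_object(Bucket="my-data-lake", Key="curated/transactions.csv")
df = pd.read_csv(obj["Body"])
print(df.head())
```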
5. Computing frameworks
When it’s time to process data, practitioners have the freedom to select the compute
and environment that best meets their needs. Users can easily choose from Domino-managed Docker images preconfigured to work on AWS GPU and CPU tiers, including the latest NVIDIA configurations specific to data science workloads.
With compute and environments defined, data scientists have the freedom to do their
coding in open or proprietary IDEs, such as Jupyter Lab, RStudio, SAS, MATLAB,
Superset, VS Code, and more.
These IDEs are containerized on AWS to create a powerful workflow for organizations
with heterogeneous coding teams.
6. Model building
Data scientists write code to build models in Domino using their favorite statistical
computing software and IDE as outlined in the previous section. Code is written in an
interactive workspace and then executed in that same workspace, in a batch job, or in a
scheduled job. Typically, model training (and all the preparation that leads up to it) is
done via open-source packages in R and Python. SAS and MATLAB are also popular
choices for model building in Domino. Domino’s API allows for the integration of batch
jobs into production pipelines.
Both traditional ML and deep learning workflows are supported via Domino’s
preconfigured environments for GPUs and deep learning frameworks, such as
TensorFlow and PyTorch. Given the ease of building your own environment in Domino,
new frameworks can be quickly integrated.
As would be expected with any Docker-based interactive workspace, Domino users can also leverage Amazon SageMaker tools to build models. Users simply call tools such as Amazon SageMaker Autopilot, or launch training jobs, from a Domino workspace.
This can be done from any IDE that supports a language with extensions for
SageMaker. A common workflow is to kick off SageMaker calls via cells in a Jupyter
notebook in a Domino interactive workspace. Training then happens in SageMaker
exactly as it would when working directly with SageMaker.
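The following sketch shows what such a notebook cell might look like, launching an Amazon SageMaker Autopilot job with the SageMaker Python SDK. The S3 locations, target column, and IAM role are assumptions for illustration.

```python
import sagemaker
from sagemaker.automl.automl import AutoML

role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"   # placeholder

automl = AutoML(
    role=role,
    target_attribute_name="churned",                 # column Autopilot should predict
    output_path="s3://my-bucket/autopilot-output/",
    max_candidates=10,
    sagemaker_session=sagemaker.Session(),
)

# Launch the Autopilot job on a tabular dataset in S3; training runs in SageMaker,
# exactly as it would outside Domino.
automl.fit(inputs="s3://my-bucket/curated/customers.csv", job_name="churn-autopilot")
```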
Domino’s central and open platform allows data scientists to seamlessly switch between
tools while Domino keeps a constant track of the flow of research, down to the smallest
details.
Figure 24 – The Experiment Tracker is one of Domino’s tools for central tracking, collaboration,
and reproducibility of research
7. Productionizing models
Deploying in Domino
Data scientists can use Domino’s simplified deployment process to deploy and host
models in Domino on AWS infrastructure, which provides an easy, self-service method
for API endpoint deployment. When deploying in Domino, Domino manages the
containers hosted on AWS. Endpoints are not automatically registered with AWS. For
more details, see Video introduction to model publishing.
Deploying models with Domino’s deployment process has several advantages. First, the
publishing process is simple and straightforward, and designed for data scientists to
easily manage the process themselves. Second, Domino provides full model lineage
down to the exact version of all software used to create the function that calls the
model. Third, Domino can provide an overview of all assets and link those to individuals,
teams, and projects.
Domino allows you to export model images built in Domino to an external container
registry. These images include all the information needed to run the model, such as
model code, artifacts, environment, and project files. Domino exposes REST APIs to
programmatically build and export the model image, which can be called by your CI/CD
pipelines or integrated with workflow schedulers, such as Apache Airflow. By default,
the images are built in Model API format. These images can be easily deployed in an
environment that can run Docker containers.
You can also leverage Domino’s SageMaker Export feature. This API-based feature
builds a Docker image for a given version of a Model API in an AWS SageMaker-
compliant format and then exports it to Amazon ECR or any third-party container
registry outside Domino. As part of the API request, users need to provide credentials
for their registry to push an image to it. These credentials are not saved inside Domino
and can have a Time to live (TTL) attached.
8. Model monitoring
After models are deployed, it’s critical that they are monitored to ensure that the data
that’s presented for scoring is aligned with the data that was used to train the model.
Proactive model monitoring allows you to detect economic changes, shifts in customer
preferences, broken data pipelines, and any other factor that could cause your model to
degrade.
Domino Model Monitor (DMM) automatically tracks the health of all deployed models, whether they were deployed in Domino, SageMaker, or another platform. DMM provides a single pane of glass for anomaly detection against input data drift, prediction
drift, and ground truth-based accuracy drift. If thresholds are exceeded, retraining and
redeployment of models can be initiated via integration with the Domino API.
Figure 26 – Domino Model Monitor tracks the health of all models in production
Download a free trial today so you can see how Domino works with AWS to provide an
open data science platform in the cloud. Full details on how you can purchase Domino,
including pricing information and user reviews, can be found on the Domino page in the
AWS Marketplace.
KNIME
by Jim Falgout, VP of Operations, KNIME
This section explains how one software environment enables individuals and
organizations to tackle their data challenges.
KNIME Analytics Platform is where data engineers and data scientists get started with
KNIME. Usually run on a desktop or laptop, KNIME provides a great way to get started
with a no-code environment (although users who want to code can do so). It’s a visual development environment used to build workflows that perform various functions such as data discovery, data engineering, data preparation, feature engineering, parameter optimization, model learning, and model testing and validation. As an open platform, KNIME supports other open source technologies such as H2O, Keras, TensorFlow, R, and Python. KNIME's open approach ensures that the latest, trending technologies are quickly usable on the platform.
With these two software platforms, KNIME Software covers the entire data science cycle as seen in Figure 27, with no additional add-ons or plugins needed.
Figure 27 – End-to-End data science integrating the creation and operations processes
KNIME supports sourcing data from Amazon S3, Amazon Redshift, Amazon Relational
Database Service, and Amazon Athena. These services are all integrated into KNIME
as native nodes. With KNIME, it’s easy to mix and match data from AWS services with other data sources as needed. KNIME supports a broad range of services within AWS.
The workflow in Figure 28 demonstrates how easy it is to use Amazon Personalize with
KNIME. It shows the entire cycle of importing data into Amazon Personalize, creating a
model, deploying a campaign, and getting real-time recommendations. It also takes
advantage of KNIME’s ability to prepare data for AWS Machine Learning services and
post-process service results. Adding in KNIME Server to productionize integrated
workflows completes the lifecycle.
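As an illustration of the final step in that cycle, the boto3 call below requests real-time recommendations from a deployed Amazon Personalize campaign; the campaign ARN and user ID are placeholders. In the KNIME workflow, the equivalent step is performed by a native node rather than code.

```python
import boto3

personalize_runtime = boto3.client("personalize-runtime")

response = personalize_runtime.get_recommendations(
    campaignArn="arn:aws:personalize:us-east-1:123456789012:campaign/demo-campaign",
    userId="user-42",
    numResults=5,
)

for item in response["itemList"]:
    print(item["itemId"], item.get("score"))
```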
Experimenting during the early phases of a project is a strength of KNIME. The visual
development interface enables quick prototyping of different approaches and supports
collaboration with others. Combining ML algorithms from KNIME, Python, R, H2O, and
deep learning technology such as Keras and TensorFlow is easy. Users can mix and
match the technologies that make the most sense for the problem at hand. Or they can
use the technologies that they have experience working with.
Workflows created using a visual development environment are well suited for
collaboration. Understanding the logic behind a visual workflow is more easily
accomplished than browsing a large set of code, which makes KNIME workflows well
suited for a data science task that requires reproducible results.
KNIME Server supports working within a team. The workflow repository within KNIME
Server supports deploying workflows from the KNIME Analytics Platform. Deployed
workflows are versioned, and changes can be tracked over time. Changes to a workflow across versions can be visually inspected, allowing users to quickly see what changes team members made over time.
Figure 29 – Using the KNIME Analytics Platform to train a random forest model
KNIME also supports more complex modeling techniques such as feature engineering,
automated parameter optimization, model validation, and model interpretability.
Deploying models
Models are deployed within KNIME Server as workflows. The concept is that a model is not standalone but requires additional functionality, such as data pre-processing and post-processing. Wrapping a model in a KNIME workflow supports all those needs. Models wrapped in a KNIME workflow can be deployed to KNIME Server as an API endpoint, as a scheduled job, or for batch processing.
The process starts with creating a workflow to generate an optimal model. The
Integrated Deployment nodes allow data scientists to capture the parts of the workflow
needed for running in a production environment, plus data creation and preparation and
the model itself. These captured subsets are saved automatically as workflows with all
the relevant settings and transformations. There is no limitation in this identification
process. It can be as simple or as complex as required.
The workflow in Figure 30 shows the model training workflow used in Figure 29 with the
Integrated Deployment nodes added. The data preparation and model application parts
of the workflow are captured, stitched together, and persisted as a new workflow. In this
case, the generated workflow with the embedded model is automatically deployed to a
KNIME Server instance.
Integrated Deployment allows for combining model creation, including pre- and post-
processing, into a single workflow that is deployment ready. As the design of the model preparation and training changes, those changes are automatically captured in the workflow, regenerated, and redeployed to your chosen target. Closing the gap
between model readiness and production readiness as described here can increase the
productivity of data science teams drastically.
Generated workflows can be persisted to the KNIME Server, an Amazon S3 bucket, the
local KNIME environment, or anywhere KNIME can write data. In the preceding
example, we show deploying the generated workflow directly to a KNIME Server.
KNIME provides integration with external software that can be used to integrate with
CI/CD pipeline components, such as Jenkins and GitLab. For example, Figure 32
outlines the process of writing a generated workflow to Amazon S3 and using a Lambda
trigger to automatically deploy the workflow to KNIME Server. The Lambda function can
also be used to kick off a set of workflow tests within KNIME Server to validate the new
workflow as part of a CI/CD flow. Likewise, workflows can be used to integrate with
external systems.
All of these nodes are available as native nodes, ready to use in KNIME Analytics
Platform.
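A sketch of the Lambda function in that flow is shown below. The S3 event parsing follows the standard event shape, but the call that registers the workflow with KNIME Server is a placeholder; the actual KNIME Server REST API routes and authentication should be taken from the KNIME documentation.

```python
import json
import urllib.parse
import urllib.request

KNIME_SERVER_URL = "https://knime-server.example.com/rest"   # placeholder

def lambda_handler(event, context):
    # Standard S3 event shape: bucket name and object key of the uploaded workflow.
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
    print(f"New generated workflow uploaded: s3://{bucket}/{key}")

    # Placeholder call: deploy to KNIME Server and kick off validation workflows.
    payload = json.dumps({"bucket": bucket, "key": key}).encode("utf-8")
    req = urllib.request.Request(f"{KNIME_SERVER_URL}/deployments",   # hypothetical route
                                 data=payload,
                                 headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req)
    return {"status": "deployment requested"}
```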
KNIME also supports continuous delivery of models and model monitoring using a
concept we call the Model Process Factory. This provides a flexible and extensible
means to monitor model performance and automatically build new, challenger models
as needed. We’ve done all the work in putting it together, ready for you to use directly
from the KNIME Hub.
All of these nodes are available as native nodes, ready to use in KNIME Analytics
Platform.
Figure 34 – A workflow using the SHAP nodes to help interpret predictions of a model
AWS reference architecture
In this section, we present builders with the option to tailor their own AWS solution. To serve as guidance, we prescribe an architecture designed from production successes, such as this implementation by WordStream and Slalom. This solution leans on AWS managed services so that your team can minimize operational overhead and focus on developing differentiated experiences.
We’ll explain how the solution comes together by walking through the CD4ML process
from ML experimentation to production. We’ll present architecture diagrams organized
across three disciplines: data science, data engineering, and DevOps, so that you get a
sense for organizational ownership and collaboration.
Model building
As prescribed by CD4ML, the process starts with model building and the need for data
curation.
Many AWS services and partner technologies are designed to operate on this
architecture. Thus, it provides a future-proof design for an evolving ML toolchain.
Figure 35 – An AWS data lake for facilitating discoverable and accessible data
Your organization can get a data lake up and running quickly by using AWS Lake Formation. Lake Formation brings together core data services and centralizes data access controls to simplify data management.
Amazon S3 provides storage, AWS Glue provides data processing and a catalog, and
Amazon Athena enables analysis on your data in Amazon S3. These fully-managed
components provide core data governance and curation capabilities without the need to
manage infrastructure.
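For example, an analyst or data scientist can query curated tables in the data lake through Athena with a few lines of Python using the AWS SDK for pandas (awswrangler). The database, table, and result bucket names below are assumptions, and Lake Formation permissions still govern what the caller can read.

```python
import awswrangler as wr

df = wr.athena.read_sql_query(
    sql="SELECT customer_id, churned, tenure_months FROM customers LIMIT 1000",
    database="curated",                       # Glue Data Catalog database
    s3_output="s3://my-athena-results/",      # where Athena writes query results
)
print(df.head())
```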
Once in place, the data lake supports a wide range of data ingestion methods to
centralize your data sources. Your team can also augment your datasets through the
AWS Data Exchange. The exchange provides a diverse catalog of industry data
sources—some of which can be used for ML—ranging from financial services,
economics, retail, climate and beyond.
At some point, your data science teams will need support for data labeling activities.
Amazon SageMaker Ground Truth enables your team to scale and integrate labeling
activities into your data lake. It provides integrated workflows with labeling services, like
Amazon Mechanical Turk, to augment your private labeling workforce, automated
labeling using active learning to reduce cost, and annotation consolidation to maintain
label quality.
As your ML workloads increase and become more sophisticated, your organization will
need a better way to share and serve features—the data used for model training and
inference. You can augment your data lake with the Amazon SageMaker Feature Store,
which allows your organization to ingest, store, and share standardized features from
batch and streaming data sources.
Your team can use built-in data transformations from Amazon SageMaker Data Wrangler to quickly build feature engineering workflows that feed into the Feature Store. These features can be automatically cataloged and stored for offline use cases like training and batch inference. Online use cases that require low latency and high throughput also exist.
Imagine your team has to deliver a service to predict credit fraud on real-time
transactions. Your model might use transaction, location, and user history to make
predictions. Your client app doesn’t have access to user history, so you serve these
features from the online Feature Store.
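A minimal sketch of that pattern with the SageMaker Python SDK and boto3 follows: ingest engineered features into a feature group, then fetch a single record from the online store at inference time. The feature group name and record identifier are assumptions, and the feature group is assumed to already exist.

```python
import boto3
import sagemaker
from sagemaker.feature_store.feature_group import FeatureGroup

session = sagemaker.Session()

# Feature group assumed to have been created already with a matching schema.
transactions_fg = FeatureGroup(name="transaction-features", sagemaker_session=session)

# Batch/streaming ingestion of a prepared pandas DataFrame (df) would look like:
# transactions_fg.ingest(data_frame=df, max_workers=4, wait=True)

# Low-latency lookup from the online store, e.g. to fetch user-history features
# for a real-time fraud prediction.
featurestore_runtime = boto3.client("sagemaker-featurestore-runtime")
record = featurestore_runtime.get_record(
    FeatureGroupName="transaction-features",
    RecordIdentifierValueAsString="user-42",
)
print(record["Record"])
```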
Model experimentation
With your data lake in place, you’ve established a foundation for enabling ML initiatives.
Your data science teams can work with their business partners to identify opportunities
for ML.
You’ll need to provide your data scientists with a platform for experimentation. Figure 36
presents the components in this solution that facilitate experimentation to prove the art-
of-the-possible. SageMaker provides ML practitioners with many capabilities in this
area.
Amazon SageMaker Studio provides you with a fully-managed IDE and a unified user
experience across SageMaker. Your Studio environment is secured through AWS
Identity and Access Management (IAM) or AWS SSO. Thus, you can authenticate as an
AWS user or a corporate identity from a system like Active Directory.
Remember to apply security best practices at all times and across all environments.
Your team should become familiar with the security in SageMaker, as outlined in the
documentation. You should use IAM controls to govern minimal permissions and apply
data protection best practices. You should deploy notebooks in an Amazon Virtual
Private Cloud (Amazon VPC), and use Amazon VPC interface endpoints and a hybrid
network architecture to enable private network security controls. Internet-free model
training and serving options can also be used to support security sensitive use cases.
Within Studio, your ML practitioners launch elastic Studio notebooks. These notebooks
are fully-managed and include pre-built environments that you can customize. You can
select from a variety of popular ML libraries and frameworks, such as TensorFlow,
PyTorch, MXNet, Spark ML, and Python libraries like scikit-learn, among others. ML
problems come in many forms, as do the skill sets of your engineering resources.
Having an arsenal of tools provides your ML practice with flexibility.
Also, experimentation workloads tend to evolve and they come in many forms. An
environment that scales efficiently and seamlessly is important for rapid prototyping and
cost management. The notebook’s compute resources can be adjusted without
interrupting your work. For instance, for lightweight experimentation, you can run on
burstable CPU and use SageMaker remote training services to minimize cost. Later,
you might have a need to experiment with Deep Learning (DL) algorithms, and you
could seamlessly add GPU for rapid prototyping.
Throughout ML experimentation, you can share and version your notebooks and scripts
through SageMaker’s Git integration and collaboration features. This solution prescribes
AWS CodeCommit, which provides a fully-managed, secured Git-based repository.
For ML problems on tabular datasets, you can opt to use Amazon SageMaker Autopilot
to automate experimentation. Autopilot applies AutoML on data tables and delivers an
optimized ML model for regression or classification problems. It does this by automating
data analysis and generating candidate ML pipelines that explore a variety of algorithms and hyperparameter settings.
All training jobs are tracked by Amazon SageMaker Experiments. This includes tracking
datasets, scripts, configurations, model artifacts and metrics associated with your
training runs.
Now that you have an audit trail of all your organizations’ experiments, your data
scientists can reproduce results.
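As a brief illustration, recent versions of the SageMaker Python SDK expose a Run API for logging parameters and metrics to SageMaker Experiments (earlier versions offered an equivalent Tracker interface); the names and values below are illustrative.

```python
from sagemaker.experiments.run import Run

with Run(experiment_name="churn-model", run_name="xgboost-trial-7") as run:
    run.log_parameter("max_depth", 6)
    run.log_parameter("eta", 0.2)
    # ... launch or run training here; SageMaker jobs started inside this
    # context are associated with the run automatically ...
    run.log_metric(name="validation:auc", value=0.91)
```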
To scale and automate training jobs, your ML practitioners will use SageMaker’s zero-
setup training services to automatically build training clusters, perform distributed
training, and parallelize data ingestion on large datasets.
These training services provide options that range from code-free to full control. Typically, a training process is launched by providing a script or configuring a built-in algorithm. If your data scientists want full control, they can bring their own model or algorithm by managing a compatible container.
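As an illustrative sketch of a script-based training job (the training script, S3 paths, role, and hyperparameters below are placeholders), a framework estimator launches the managed cluster, runs your code, and releases the resources when the job completes:

from sagemaker.sklearn.estimator import SKLearn

role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # placeholder

# Script-mode training: SageMaker provisions the instances, runs train.py,
# and tears everything down when the job finishes.
estimator = SKLearn(
    entry_point="train.py",            # your training script (placeholder)
    framework_version="0.23-1",
    py_version="py3",
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    hyperparameters={"max_depth": 6},  # passed to the script as command-line arguments
)

# The channel name ("train") is exposed to the script via SM_CHANNEL_TRAIN.
estimator.fit({"train": "s3://my-bucket/churn/train/"})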
Your teams also need the ability to scale the volume of training jobs in addition to
scaling individual jobs. This is needed for hyperparameter optimization, a process that
involves exploring algorithm configurations (hyperparameters) to improve a model’s
effectiveness. The process requires training many models with varying hyperparameter
values. Your data scientists can choose to use Automatic Model Tuning to automate
and accelerate this process.
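A minimal sketch of Automatic Model Tuning on top of the script-mode estimator shown earlier (the metric name, regex, ranges, and S3 paths are placeholders, and the regex must match a line your training script actually prints):

from sagemaker.sklearn.estimator import SKLearn
from sagemaker.tuner import HyperparameterTuner, IntegerParameter, ContinuousParameter

role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # placeholder
estimator = SKLearn(entry_point="train.py", framework_version="0.23-1", py_version="py3",
                    role=role, instance_count=1, instance_type="ml.m5.xlarge")

# Explore two hyperparameters; the Regex must match a metric line printed by train.py.
tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name="validation:auc",
    metric_definitions=[{"Name": "validation:auc", "Regex": "validation-auc: ([0-9\\.]+)"}],
    hyperparameter_ranges={
        "max_depth": IntegerParameter(3, 10),
        "learning_rate": ContinuousParameter(0.01, 0.3),
    },
    objective_type="Maximize",
    max_jobs=20,          # total training jobs launched by the tuner
    max_parallel_jobs=4,  # jobs trained concurrently
)

tuner.fit({"train": "s3://my-bucket/churn/train/",
           "validation": "s3://my-bucket/churn/validation/"})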
From within Studio, your data scientists can craft and experiment with different ML
workflows using Amazon SageMaker Pipelines as shown in Figure 38.
Figure 39 illustrates the top-level pattern that your ML workflows will typically follow. In
each of these high-level stages, your data scientists—possibly in collaboration with data
and DevOps engineers—craft workflow steps as part of a SageMaker Pipeline.
During the Prepare Data stage, you’ll typically create steps to validate and apply feature
engineering transforms to your data. The data prep workflows that your team creates
with Amazon SageMaker Data Wrangler can be exported for use in your production pipelines. Data prep steps can make use of services like Athena and capabilities like SageMaker Processing. Data processing steps can run on a remote Spark cluster and are a natural point for data scientists and data engineers to collaborate.
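As a hedged sketch of such a workflow, the following defines a minimal two-step SageMaker Pipeline (data prep plus training). The script names, S3 paths, pipeline name, and role are placeholders, not part of this solution:

from sagemaker.sklearn.estimator import SKLearn
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.inputs import TrainingInput
from sagemaker.workflow.steps import ProcessingStep, TrainingStep
from sagemaker.workflow.pipeline import Pipeline

role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # placeholder

# Data prep: run preprocess.py (placeholder) on SageMaker Processing.
processor = SKLearnProcessor(framework_version="0.23-1", role=role,
                             instance_type="ml.m5.xlarge", instance_count=1)
prep_step = ProcessingStep(
    name="PrepareData",
    processor=processor,
    code="preprocess.py",
    inputs=[ProcessingInput(source="s3://my-bucket/churn/raw/",
                            destination="/opt/ml/processing/input")],
    outputs=[ProcessingOutput(output_name="train", source="/opt/ml/processing/train")],
)

# Training: a script-mode estimator consumes the processed data.
estimator = SKLearn(entry_point="train.py", framework_version="0.23-1", py_version="py3",
                    role=role, instance_count=1, instance_type="ml.m5.xlarge")
train_step = TrainingStep(
    name="TrainModel",
    estimator=estimator,
    inputs={"train": TrainingInput(
        s3_data=prep_step.properties.ProcessingOutputConfig.Outputs["train"].S3Output.S3Uri,
        content_type="text/csv")},
)

pipeline = Pipeline(name="ChurnPipeline", steps=[prep_step, train_step])
pipeline.upsert(role_arn=role)  # create or update the pipeline definition
execution = pipeline.start()    # run the workflow on demand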
After data prep, your workflow introduces one or more training steps to train models
using the aforementioned training services. Consider adding hooks into your training job
by using Amazon SageMaker Debugger. For instance, you can use built-in rules to stop
a training job if it detects a vanishing gradient problem to avoid resource waste. To learn
how Autodesk used the Debugger to optimize their training resources, see the blog
post, Autodesk optimizes visual similarity search model in Fusion 360 with Amazon
SageMaker Debugger.
Figure 40 – Real-time Profiling of Training Resource Utilization using the SageMaker Profiler
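A minimal sketch of attaching a built-in Debugger rule to a training job, assuming a PyTorch script-mode job (the script, role, data path, and instance type are placeholders):

from sagemaker.pytorch import PyTorch
from sagemaker.debugger import Rule, rule_configs

role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # placeholder

# Built-in rule that watches gradients during training and flags when they vanish.
rules = [Rule.sagemaker(rule_configs.vanishing_gradient())]

estimator = PyTorch(
    entry_point="train.py",          # placeholder training script
    framework_version="1.6.0",
    py_version="py3",
    role=role,
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    rules=rules,                     # Debugger evaluates the rule alongside the job
)

# When the rule reports an issue, its status is surfaced with the training job and can
# be wired (for example, through CloudWatch Events) to stop the job and avoid waste.
estimator.fit({"training": "s3://my-bucket/images/train/"})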
Consider adding branches in your workflow to recompile your model for different target
platforms using Amazon SageMaker Neo. Neo can reduce the size of your models and
inference latency.
These steps are followed by model evaluation, testing and deployment stages that we’ll
discuss later.
Figure 41 summarizes how the key systems at this stage come together. Your DevOps
team likely has a Continuous Integration/Continuous Delivery (CI/CD) pipeline in place
to support existing software development practices. If not, you have the option to roll out
the solution highlighted in Figure 41. SageMaker provides pipeline templates that can
be deployed through AWS Service Catalog. This includes an integrated CI/CD pipeline
built on AWS DevOps services.
Your team commits your ML scripts, which include a SageMaker Pipeline. These scripts
run through your standard CI process. Once your scripts pass your automated tests and
approval process, your new SageMaker Pipeline is activated. The workflow then runs on demand, on a schedule, or in response to event-based triggers.
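For example, an on-demand run can be kicked off with a single API call from a CI/CD stage, a scheduled job, or an event-driven Lambda function. This sketch assumes the pipeline name from the earlier example:

import boto3

sm = boto3.client("sagemaker")

# Start the approved pipeline, for example from a CI/CD stage, a scheduled job,
# or an event-driven Lambda function.
response = sm.start_pipeline_execution(
    PipelineName="ChurnPipeline",  # placeholder; matches the earlier pipeline sketch
    PipelineExecutionDisplayName="ci-triggered-run",
)
print(response["PipelineExecutionArn"])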
What if your pipeline requirements extend beyond the scope of SageMaker Pipelines?
For instance, you may need to manage pipelines for other AWS AI services. To
minimize time-to-value, evaluate these other options: AWS Step Functions and the Data
Science SDK, Apache Airflow, and Kubeflow Pipelines. These workflow managers have
integrations with SageMaker.
Following training in Figure 39 are steps for model evaluation and testing. During this
stage, you should create rules to enforce quality standards. For example, you could
create a rule that ensures candidate models exceed performance baselines on metrics
like accuracy. The probabilistic nature of ML models can introduce edge cases. Your
team should brainstorm risky scenarios and create tests accordingly. For example, for
some computer vision use cases, you should ensure your models act appropriately
when exposed to explicit images.
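For the accuracy-baseline rule mentioned above, a hedged sketch using a SageMaker Pipelines condition step could look like the following. The step name, property-file name, JSON path, and threshold are hypothetical, and exact import paths vary across SDK versions:

from sagemaker.workflow.conditions import ConditionGreaterThanOrEqualTo
from sagemaker.workflow.condition_step import ConditionStep
from sagemaker.workflow.functions import JsonGet

# Read the accuracy value that a hypothetical "EvaluateModel" processing step wrote
# into a registered property file, then gate the rest of the pipeline on it.
accuracy = JsonGet(
    step_name="EvaluateModel",          # hypothetical evaluation step name
    property_file="EvaluationReport",   # property file registered on that step
    json_path="metrics.accuracy.value",
)

gate = ConditionStep(
    name="CheckAccuracy",
    conditions=[ConditionGreaterThanOrEqualTo(left=accuracy, right=0.90)],
    if_steps=[],    # add your model registration / deployment steps here
    else_steps=[],  # stop here, or add a step that alerts the team
)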
Your team will also need to update your CI tests. Load tests should be in place to
measure the performance of your model server endpoints. Changes to your ML model
can affect latency and throughput. SageMaker provides Amazon CloudWatch metrics to
assist you in this area, and AWS CodePipeline provides a variety of QA automation
tools through partner integrations.
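As one example of using those CloudWatch metrics, a load-test stage could check p99 model latency after a test run. The endpoint name is a placeholder; AllTraffic is the default variant name:

import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client("cloudwatch")

# Pull p99 model latency for an endpoint variant over the last hour, for example
# as an assertion inside a load-test stage of your CI pipeline.
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/SageMaker",
    MetricName="ModelLatency",                                  # reported in microseconds
    Dimensions=[
        {"Name": "EndpointName", "Value": "churn-endpoint"},    # placeholder
        {"Name": "VariantName", "Value": "AllTraffic"},
    ],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,
    ExtendedStatistics=["p99"],
)
for point in stats["Datapoints"]:
    print(point["Timestamp"], point["ExtendedStatistics"]["p99"])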
Figure 42 illustrates where you can add bias detection into a SageMaker Pipeline.
These bias evaluation steps can be incorporated into pre-training, post-training and
online stages of your workflow.
You can create workflow steps to measure and mitigate bias in your data. For instance,
the Class Imbalance metric can help you discover an underrepresentation of a
disadvantaged demographic in your dataset that will cause bias in your model. In turn,
you can attempt to mitigate the issue with techniques like resampling.
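As a sketch of such a workflow step, for example with Amazon SageMaker Clarify (the dataset path, column names, and facet below are placeholders), a pre-training check for the Class Imbalance metric could be run like this:

from sagemaker import clarify

role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # placeholder

processor = clarify.SageMakerClarifyProcessor(
    role=role, instance_count=1, instance_type="ml.m5.xlarge")

data_config = clarify.DataConfig(
    s3_data_input_path="s3://my-bucket/churn/train.csv",    # placeholder
    s3_output_path="s3://my-bucket/clarify/pre-training/",
    label="churn",
    headers=["age", "gender", "tenure", "churn"],            # placeholder columns
    dataset_type="text/csv",
)

bias_config = clarify.BiasConfig(
    label_values_or_threshold=[1],
    facet_name="gender",                   # the attribute being checked for imbalance
    facet_values_or_threshold=["female"],
)

# Computes pre-training metrics such as Class Imbalance (CI) on the dataset itself.
processor.run_pre_training_bias(
    data_config=data_config,
    data_bias_config=bias_config,
    methods=["CI"],
)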
You can also create steps to identify prediction bias on your trained models. For
instance, Accuracy Difference (AD) measures prediction accuracy differences between
different classes in your models. A high AD is problematic in use cases such as facial
recognition because it indicates that a group experiences mistaken identity
disproportionately and this could have an adverse impact.
More generally, SHAP values can be included in granular error analysis reports to
attribute the impact of features on specific results. This helps your team understand the
cause of incorrect predictions and helps validate that correct predictions aren’t due to
flaws like leakage.
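As a small, framework-agnostic illustration using the open-source shap library (the model and features here are toy placeholders, not part of this solution), per-prediction SHAP values can be joined onto an error-analysis report:

import shap
import pandas as pd
from xgboost import XGBClassifier

# Placeholder model and data; in practice these come from your training pipeline.
X = pd.DataFrame({"age": [25, 47, 33], "tenure": [2, 60, 12], "balance": [100.0, 5000.0, 750.0]})
y = [1, 0, 0]
model = XGBClassifier().fit(X, y)

# Per-row feature attributions: which features pushed each prediction up or down.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

report = X.copy()
report["prediction"] = model.predict(X)
for i, col in enumerate(X.columns):
    report[f"shap_{col}"] = shap_values[:, i]
print(report)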
Deployment
Models that pass your quality controls should be added to your SageMaker catalog of
trusted models. Next, your team should create workflows to deploy these models into
production to support real-time and batch workloads.
Figure 43 illustrates how your system extends to support these scenarios. This phase in
the lifecycle will require collaboration across teams.
First, this is a good time to re-evaluate your SageMaker Pipelines and ensure your
setup is optimized for both performance and cost. Evaluate Amazon Elastic Inference
and AWS Inferentia to cost-optimize GPU inference workloads. Leverage multi-model
endpoints and automatic scaling to improve resource utilization, and use the provided
CloudWatch metrics to monitor the situation. You’re likely to have to re-train your model
throughout the lifespan of your application, so use managed spot training when it’s
suitable.
For high availability, configure your endpoint to run on two or more instances.
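A sketch that combines the automatic scaling recommendation above with the two-or-more-instance guidance (the endpoint name and capacity limits are placeholders; AllTraffic is the default variant name):

import boto3

autoscaling = boto3.client("application-autoscaling")

endpoint_name = "churn-endpoint"   # placeholder
resource_id = f"endpoint/{endpoint_name}/variant/AllTraffic"

# Register the endpoint variant so its instance count can scale between 2 and 6.
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=2,      # two or more instances also improves availability
    MaxCapacity=6,
)

# Target tracking: add or remove instances to hold roughly 1000 invocations per instance.
autoscaling.put_scaling_policy(
    PolicyName="churn-endpoint-invocations",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 1000.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
    },
)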
Next, you can choose to attach your model endpoints to Amazon API Gateway for API
management. Among the added benefits are: edge caching, resource throttling,
WebSocket support, and additional layers of security.
Alternatively, you can integrate your model endpoints with Amazon Aurora and Amazon
Athena. Doing so allows a data-centric application to blend predictions into queries using SQL and reduces data pipeline complexity.
Your team can also build a workflow for batch inference. For instance, you might create
a scoring pipeline that pre-calculates prediction probabilities for loan-default risk or
customer churn likelihood. Following that, these predictions are loaded into a database
to support predictive analytics.
You can create a batch inference step that uses batch transform. Batch transform is designed to scale out and support batch inference on large datasets. It’s optimized for throughput, and you’re billed only for the transient resources used while the job runs.
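A minimal batch transform sketch (the model name, S3 paths, and instance settings are placeholders; the model must already exist in SageMaker):

from sagemaker.transformer import Transformer

# A transient batch-scoring job over a large dataset in S3; resources are released
# when the job finishes, and you pay only for that usage.
transformer = Transformer(
    model_name="churn-model",                  # an existing SageMaker model (placeholder)
    instance_count=2,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/churn/scores/",
    strategy="MultiRecord",
    assemble_with="Line",
)

transformer.transform(
    data="s3://my-bucket/churn/to-score/",
    content_type="text/csv",
    split_type="Line",     # stream the files record by record for scalable throughput
)
transformer.wait()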
Model monitoring
As depicted in Figure 44, your SageMaker Pipelines should deploy a Model Monitor
alongside your model endpoints to provide drift detection. Your team chooses between
the default monitor or a custom monitoring job. When drift is detected, your monitors
should be set up to trigger workflows to enable an automated training loop or alert your
data scientists about the issue.
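An illustrative sketch of the default monitor (placeholder role, paths, and endpoint name; it assumes data capture is already enabled on the endpoint): suggest a baseline from the training data, then schedule hourly checks of captured traffic against it.

from sagemaker.model_monitor import DefaultModelMonitor, CronExpressionGenerator
from sagemaker.model_monitor.dataset_format import DatasetFormat

role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # placeholder

monitor = DefaultModelMonitor(
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    volume_size_in_gb=20,
    max_runtime_in_seconds=3600,
)

# Baseline statistics and constraints from the training data; drift is measured against these.
monitor.suggest_baseline(
    baseline_dataset="s3://my-bucket/churn/train.csv",       # placeholder
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri="s3://my-bucket/monitoring/baseline/",
)

# Hourly monitoring of the live endpoint's captured traffic against the baseline.
monitor.create_monitoring_schedule(
    monitor_schedule_name="churn-endpoint-monitor",
    endpoint_input="churn-endpoint",                          # placeholder endpoint name
    output_s3_uri="s3://my-bucket/monitoring/reports/",
    statistics=monitor.baseline_statistics(),
    constraints=monitor.suggested_constraints(),
    schedule_cron_expression=CronExpressionGenerator.hourly(),
)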
As a best practice, your team should keep a snapshot of the data used to train each of
your production model variants and create an Amazon S3 prefix that encodes version
tracking metadata. The lineage tracking system currently uses the Amazon S3 location
to track data artifacts. Thus, by following this best practice, your team has traceability
between model and dataset versions.
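A minimal sketch of that convention (the bucket names, keys, and version tag are placeholders): copy the training data to a snapshot prefix that encodes the model version and snapshot date.

import boto3
from datetime import datetime

s3 = boto3.client("s3")

# Encode version metadata into the S3 prefix of the training snapshot, so lineage
# tracking can tie each model variant back to the exact data it was trained on.
model_version = "churn-model/v1.3.0"                      # placeholder version tag
snapshot_date = datetime.utcnow().strftime("%Y-%m-%d")
prefix = f"training-snapshots/{model_version}/{snapshot_date}/train.csv"

s3.copy_object(
    Bucket="my-ml-artifacts",                              # placeholder bucket
    CopySource={"Bucket": "my-data-lake", "Key": "churn/train.csv"},
    Key=prefix,
)
print(f"Snapshot stored at s3://my-ml-artifacts/{prefix}")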
High-level AI services
In addition to SageMaker, AWS provides other AI services designed for building AI-
enabled apps without requiring ML expertise.
Nonetheless, custom models are often needed to achieve business objectives. Thus,
many AWS AI services have both turnkey and AutoML capabilities. Some services like
Amazon Fraud Detector and Amazon Forecast are designed around AutoML.
When you’re ready to deploy your AutoML solutions into production, your team should
deploy them as automated workflows. These workflows have a common pattern, as
shown in Figure 46.
Step Functions is a good choice for orchestrating these workflows to create serverless solutions. For human review workflows, Amazon Augmented AI (Amazon A2I) has turnkey support for Amazon Rekognition and Amazon Textract. The other AI services require customization.
For drift detection, you’ll need to implement a custom solution. You can look to APN Partners to help fill the gap. For instance, Digital Customer Experience partners like Amplitude and Segment integrate with Amazon Personalize to provide marketing and customer-experience personalization capabilities.
As your ML initiatives mature, your organization will adopt more ML tools and your team
will face the challenge of managing many pipelining tools. AWS provides an MLOps Framework that lays the foundation for unifying workflow technologies like
SageMaker Pipelines and Step Functions under a common management framework.
The framework is designed to accelerate outcomes by serving workflow blueprints
developed by AWS and our open source community.
Among the APN Partners are specialists in MLOps. Many of these consulting partners have achieved key competencies in DevOps, data and analytics, and machine learning. Information about these partners is consolidated in the Resources section. Use the provided links to learn more about the partners and how to contact them.
Conclusion
The APN Machine Learning community provides you with a choice of outstanding solutions for MLOps. The following links can help you take the next steps.
• AWS – Connect with vetted MLOps consultants and try Amazon SageMaker for
free.
• Alteryx – Get started with the Alteryx Intelligence Suite Starter Kit.
• Domino Data Lab – Launch a free trial of Domino Data Lab’s managed offering.
• KNIME – Launch a free trial of the KNIME Server from the AWS Marketplace.
Contributors
• Christoph Windheuser, Global Head of Artificial Intelligence, ThoughtWorks
Resources
The following AWS Machine Learning Competency Partners have MLOps expertise, listed with the regions they serve:
• Inawisdom (UK, Middle East, Benelux): Learn about Inawisdom’s DevOps practice for ML.
• Keepler (Spain, Portugal, Austria, Germany, Switzerland): Learn about Keepler’s ML expertise and offerings on AWS.
• Max Kelsen (Australia, New Zealand, Asia Pacific): Learn about MLOps Best Practices with Amazon SageMaker & Kubeflow and Max Kelsen’s expertise.
• 1Strategy (TEKsystems; North America): Learn about 1Strategy’s MLOps expertise and offerings.
• Peak.ai (UK, US): Learn about Peak’s MLOps expertise and offerings on AWS.
• Provectus (North America): Learn about Provectus’s MLOps expertise and offerings.
Document Revisions
December 2020: First publication