An enterprise design pattern for MLOps & DataOps on unified data and AI platforms

Christian Bernecker
IBM Data Science in Practice
6 min read · Nov 1, 2022


This article describes a reusable enterprise design pattern for MLOps / AIOps and DataOps tasks that helps teams simplify and automate how they collect, organize, analyze, and infuse data on a unified data and AI platform. The pattern targets IBM Cloud Pak for Data (CP4D) but can be reused and adapted for Azure, AWS, GCP, and other hyperscalers.

Who should read this? Solution Architects, Enterprise Architects, Platform Architects, Data Engineers, Data Scientists, and Machine Learning Engineers.

What is covered in this article? It translates the MLOps and DataOps patterns into ready-to-use cloud patterns, goes into each step of these well-known operations, and serves as a guide for any architect who needs to do the same.


First we look at the whole pattern; the details follow in later sections. You will find the following three patterns in this article:

  • Enterprise Design Pattern
  • MLOps Design Pattern
  • DataOps Design Pattern

Enterprise Design Pattern

A fully operable pattern that covers all steps of the machine learning lifecycle (MLOps): collecting, analyzing, training, deploying, and monitoring AI models. In addition, it covers the following DataOps steps: collecting, governing, organizing, and analyzing data.

Challenges: Organizing Assets, Data Governance, Reusable Data Assets, Standardized MLOps, Training Models, Monitoring Models, Continuous Training, Documentation


The pattern starts with building ETL pipelines for the data processing part, continues with collecting the data, and ends with the deployment of the trained AI model. All these steps have to be covered and organized within a reusable pattern. The next section explains step by step how projects can be organized on IBM's Cloud Pak for Data (CP4D):

  1. Use a dedicated ETL project with limited access. Only admins and data engineers should have access to the project. The ETL project contains all DataStage flows that produce the data assets needed by the business.
  2. The ETL project publishes all data assets to the Watson Knowledge Catalog (WKC). The WKC can be accessed by anyone in the organization and is the entry point for any data exploration.
  3. The data assets can be shared with any CP4D project, external Git repositories, or any other cloud such as Azure, GCP, or AWS via the Watson Data API (see the sketch after this list).
  4. Within a project, data exploration and modeling are done in code (a Jupyter Notebook or Federated Learning) or with any other IBM service that offers machine learning or AI capabilities.
  5. The deployment is made with the IBM Watson Machine Learning service. It provides a deployment space where models are stored, versioned, deployed, and monitored, and it offers a scalable REST API as well as batch processing.
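
To make step 3 concrete, here is a minimal sketch of pulling a data asset's metadata out of the WKC through the Watson Data API so it can be consumed outside CP4D. The host, user, API key, and the catalog and asset IDs are placeholders; the authorization endpoint shown is the one used by on-premises CP4D installations and may differ in your environment.

```python
import requests

# All of these values are placeholders -- substitute your own.
CPD_HOST = "https://<your-cp4d-host>"
USERNAME = "<user>"
API_KEY = "<api-key>"
CATALOG_ID = "<catalog-guid>"
ASSET_ID = "<asset-guid>"

# 1. Exchange the API key for a bearer token (CP4D authorization endpoint;
#    adjust if your deployment authenticates differently).
token_resp = requests.post(
    f"{CPD_HOST}/icp4d-api/v1/authorize",
    json={"username": USERNAME, "api_key": API_KEY},
)
token_resp.raise_for_status()
token = token_resp.json()["token"]

# 2. Fetch the asset's metadata from the catalog.
asset_resp = requests.get(
    f"{CPD_HOST}/v2/assets/{ASSET_ID}",
    params={"catalog_id": CATALOG_ID},
    headers={"Authorization": f"Bearer {token}"},
)
asset_resp.raise_for_status()
print(asset_resp.json()["metadata"]["name"])
```

The same API also lets you create and update assets, which is how the DataOps pattern below feeds the catalog.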

MLOps / Data Science Design Pattern

A data scientist (DS) faces different challenges. In some cases a DS gets a well-prepared dataset from previously used models or projects; in other cases they have to explore and investigate unknown data. This means that accessing data can become a challenge, and not only for data scientists. Exactly for that reason it makes sense to decouple the data science part from the data engineering part: the data engineers ensure that the data scientists have access to the data they need.

Challenges: Data Exploration, Fast Access, Simple Use of Different AI Frameworks, Easy Setup, GPUs, Easy Deployment, Monitoring

CP4D — Data Science Pattern — Created by Author
  1. For each application, a new project and a new deployment space (IBM Watson Machine Learning service) are created. This allows a clear separation of projects.
  2. Within the project, different tasks are performed, such as creating reports, analyzing data, and training models.
  3. The trained models are stored and documented in the Watson Knowledge Catalog (WKC). Each model contains a fact sheet that explains the model in detail (purpose, data, results, etc.). This fact sheet can be viewed in the WKC and is accessible and searchable by any member of the organization. In addition, models can be reused and shared via the WKC.
  4. The deployment space is part of the IBM Watson Machine Learning service. It offers the capability to deploy models as a REST API or as a batch job. A deployment space can hold multiple models and different versions of a model. It supports common AI frameworks such as TensorFlow, PyTorch, scikit-learn, and others, and it is also possible to define your own software specification. A sketch follows this list.
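
As an illustration of steps 3 and 4, the following sketch stores a trained scikit-learn pipeline in a deployment space and exposes it as an online (REST) deployment using the ibm-watson-machine-learning Python client. The credentials, the space ID, and the variable my_sklearn_pipeline are placeholders, and the software-specification and model-type names are version-dependent assumptions; check what is available on your cluster.

```python
from ibm_watson_machine_learning import APIClient

# Placeholder credentials for a CP4D cluster.
wml_credentials = {
    "url": "https://<your-cp4d-host>",
    "username": "<user>",
    "apikey": "<api-key>",
    "instance_id": "openshift",
    "version": "4.0",
}
client = APIClient(wml_credentials)
client.set.default_space("<deployment-space-id>")  # placeholder

# Pick a software specification that matches the training environment
# (names vary by release; list them with client.software_specifications.list()).
sw_spec_id = client.software_specifications.get_id_by_name("runtime-22.1-py3.9")

# Store the trained model (my_sklearn_pipeline is assumed to exist).
model_props = {
    client.repository.ModelMetaNames.NAME: "churn-classifier",
    client.repository.ModelMetaNames.TYPE: "scikit-learn_1.0",
    client.repository.ModelMetaNames.SOFTWARE_SPEC_UID: sw_spec_id,
}
stored_model = client.repository.store_model(
    model=my_sklearn_pipeline, meta_props=model_props
)
model_id = client.repository.get_model_id(stored_model)

# Deploy it as an online REST endpoint.
deployment = client.deployments.create(
    model_id,
    meta_props={
        client.deployments.ConfigurationMetaNames.NAME: "churn-classifier-online",
        client.deployments.ConfigurationMetaNames.ONLINE: {},
    },
)
```

Once deployed, the endpoint can be scored with client.deployments.score(), and a batch deployment works the same way with the BATCH configuration (plus a hardware specification) instead of ONLINE.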

If you want to learn more about deployments with IBM Watson Machine Learning, I highly recommend the following article:

Finally: I highly recommend standardizing the work process for data science teams. This makes them more flexible and increases productivity.

DataOps / Data Engineer Design Pattern

This pattern is used to decouple the data science work from the data engineering work. The data engineers ensure that the data they provide is governed and compliant with the rules of the organization. In addition, database connections are stored in a single place, which makes it easy to rotate credentials.

Challenges: Data Exploration, Fast Data Access, Repeatable Patterns, Easy and Fast Setup of New Data Streams, Governance, Data Inventory

CP4D — Data Engineer Pattern — Created by Author
  1. Use a dedicated ETL project to decouple the data engineering part from the data science part.
  2. Within the ETL project, defined schedules run the data flows to ensure continuous data collection. DataStage is our ETL tool of choice; it supports both extract, transform, load (ETL) and extract, load, transform (ELT) patterns.
  3. All data assets are published to the Watson Knowledge Catalog (WKC). The WKC is the data inventory and can be accessed by every person in the organization (a sketch follows this list).
  4. Within the WKC, the data assets can be added to any CP4D project, external Git projects, or any other cloud such as Azure, GCP, or AWS. That allows high flexibility and establishes CP4D as a truly hybrid cloud platform.
  5. The WKC allows you to curate and govern the produced data and to create access policies for it.
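
For step 3, here is a minimal sketch of registering a produced data asset in the WKC through the Watson Data API. It assumes a bearer token obtained as in the earlier sketch; the host and catalog ID are placeholders, and the exact payload fields may vary between CP4D versions.

```python
import requests

CPD_HOST = "https://<your-cp4d-host>"  # placeholder
TOKEN = "<bearer-token>"               # obtained as in the earlier sketch
CATALOG_ID = "<catalog-guid>"          # placeholder

# Describe the asset the ETL project has produced; names, tags, and the
# mime type here are invented for the example.
asset_payload = {
    "metadata": {
        "name": "customer_churn_curated",
        "description": "Curated churn table produced by the ETL project",
        "asset_type": "data_asset",
        "origin_country": "us",
        "tags": ["churn", "curated"],
    },
    "entity": {"data_asset": {"mime_type": "text/csv"}},
}

resp = requests.post(
    f"{CPD_HOST}/v2/assets",
    params={"catalog_id": CATALOG_ID},
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=asset_payload,
)
resp.raise_for_status()
print("asset id:", resp.json()["metadata"]["asset_id"])
```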

Finally: CP4D offers multiple options for creating ETL pipelines. This pattern focuses on DataStage, but it can be replaced with any other method or technology. In our pattern, DataStage is the tool of choice because we need robust, production-ready data flows, but you could just as well use Apache Spark or any other technology you prefer.
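
For illustration, here is what the same extract-transform-load step could look like in Apache Spark; the paths and column names are invented for the example.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("churn-etl").getOrCreate()

# Extract: read the raw landing-zone file (placeholder path).
raw = spark.read.option("header", True).csv("/data/landing/churn_raw.csv")

# Transform: normalize column names, drop incomplete rows, derive a feature.
curated = (
    raw.toDF(*[c.strip().lower().replace(" ", "_") for c in raw.columns])
       .dropna(subset=["customer_id"])
       .withColumn("tenure_years", F.col("tenure_months") / 12)
)

# Load: write the curated table for downstream data science projects.
curated.write.mode("overwrite").parquet("/data/curated/churn")
```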

If you are interested in the other options, I recommend reading the following article:

How it comes together

The individual patterns can be connected and used interoperably. They decouple the data engineering work from the data science work and allow a clearly defined enterprise project structure. The following illustration shows an example of what an enterprise design pattern can look like.

CP4D Project Pattern — Created by Author

Conclusion and Advice

  1. Decouple DataOps from MLOps.
  2. Define a work pattern: “Who is responsible for what?”
  3. Define ETL process patterns.
  4. Define a clear project structure.
  5. Create a service architecture pattern.

What you learned

You have learned design patterns for MLOps and DataOps within IBM's CP4D, how to decouple data engineering and data science work, and how to integrate both into an overall enterprise design pattern.

Congratulations, that's it!

Leave a comment if you have any questions or recommendations, or if something is unclear, and I'll try to answer as soon as possible.

Disclaimer: Opinions are my own and not the views of my employer.

