ETL Pipelines & Data Preparation for any skill level with Cloud Pak for Data

Christian Bernecker
IBM Data Science in Practice
6 min read · Aug 3, 2022


The Extract, Transform, and Load (ETL) pattern is a classic in data engineering, but it is still the most common and useful one. An ETL can be helpful for all sorts of users: business analysts, developers, data scientists, and data engineers. These roles have different requirements and different skill sets, but all of them need data, and tools to get that data from different sources, to do their daily work.

ETL Data Preparation

IBM Cloud Pak for Data (CP4D) allows you to create customized ETL pipelines specific to your role or purpose, anywhere from highly complex and powerful coded transformers to quick PoCs. You can choose to code the solution, use a canvas UI, or even a no-code drag-and-drop interface.

The challenge: Data is the new oil of our industry, and nearly every job needs a way to consume data from different sources. Based on that requirement, we need different options for different skill sets. This article provides a solution for every skill set: from no-code, to code expert, to professional data engineer.

Who should read this? Data Engineers, Data Scientists, Machine Learning Engineers, Business Analysts, Developers, and anyone who needs a way to extract data from different sources.

What is covered in this article?

I divided the different tasks into three categories:

  1. Low Code
  2. Coding
  3. Data Engineering

First, I will elaborate on the following picture, which shows the different paths you can follow. Then I will give a brief introduction to the services used in this article.

CP4D ETL Workflow

Create ETL with CP4D

In general, there are three options to build an ETL with CP4D. Low Code is for less technical people or those who need a quick PoC. Coding is for highly technical people who want to use code with third-party libraries or IBM's Apache Spark implementation to deal with big data. And last, Data Engineering is for data engineers who need a bulletproof and reliable tool to aggregate big data from multiple sources.

Prerequisites: IBM Cloud Account

Low Code: This path requires no coding knowledge and is designed for less technical people. It allows you to create a connection to a specific data source of your choice, prepare and preprocess the data with the IBM Data Refinery service, and finally analyze the data with an IBM Cognos Dashboard.

The first step is to create a connection to a data source: anything from relational databases like Db2 or MariaDB to flat files on an FTP server, nearly anything that holds your data of interest. You only need to know the credentials and the URL of the source. In the next step, the connection is used in the Data Refinery service for data preprocessing or basic analytics.
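You do not have to write any code for this step; purely for illustration, these are the kinds of details a connection form typically asks for. All values below are placeholders, not real credentials:

```python
# Illustrative only: the typical fields you fill into a CP4D connection form
# for a relational source such as Db2. Every value here is a placeholder.
connection_details = {
    "hostname": "db2.example.com",  # URL / host of the data source
    "port": 50000,                  # default Db2 port
    "database": "BLUDB",            # database name
    "username": "my_user",
    "password": "my_password",
    "ssl": True,                    # most cloud databases require SSL
}
```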

Data Visualization with Data Refinery CP4D

Tutorial Data Refinery: https://developer.ibm.com/tutorials/data-visualization-with-data-refinery/

The data in the Data Refinery service is stored as assets in Cloud Object Storage (COS) and is accessible via the data section of the CP4D project. These assets can be used for data analysis, to train models, and to create dashboards. For low code, I highly recommend Cognos Dashboards for creating dashboards, Data Refinery for data analysis, and AutoAI for training models.

Cognos Dashboard

Tutorial Cognos Dashboard: https://dataplatform.cloud.ibm.com/docs/content/wsj/analyze-data/analytics-dashboard.html?audience=wdp

Summary: If you prefer low code or no code, use the following services to build ETL pipelines (Connections, Data Refinery), analyze data (Data Refinery), train models (AutoAI), and build dashboards (Cognos Dashboard).


Coding: This path connects your own code to the data source of your choice. The predefined IBM connections can be used with one click in any Notebook.

Use predefined connections in a Jupyter Notebook.

A full list of connections can be found here.
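If you are curious what such a generated Notebook cell can look like, here is a rough sketch for a Db2 connection. It is only an approximation: the exact code CP4D inserts depends on the connection type, and the host, schema, and credential values below are placeholders that the "insert to code" feature would normally fill in for you.

```python
# Sketch of reading from a Db2 connection inside a CP4D notebook.
# All DSN values are placeholders.
import ibm_db
import ibm_db_dbi
import pandas as pd

dsn = (
    "DATABASE=BLUDB;"
    "HOSTNAME=db2.example.com;"
    "PORT=50000;"
    "PROTOCOL=TCPIP;"
    "UID=my_user;"
    "PWD=my_password;"
    "SECURITY=SSL;"
)

ibm_db_conn = ibm_db.connect(dsn, "", "")   # low-level driver connection
conn = ibm_db_dbi.Connection(ibm_db_conn)   # DB-API wrapper usable by pandas

# Pull a table into a DataFrame for further processing
df = pd.read_sql("SELECT * FROM MY_SCHEMA.SALES", conn)
df.head()
```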

Each Notebook can be used with a pre-configured Apache Spark environment that allows you to handle big data. Apache Spark provides additional features for connections, pre-processing and training. If you are familiar with Apache Spark, then this is your playground.

Spark Engine

Here is a good tutorial to start with: “Use Spark and batch deployment to predict customer churn”
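To give a flavour of what an ETL step in such a Spark notebook looks like, here is a minimal PySpark sketch. The file paths and column names are made up for illustration, and in a CP4D Spark environment a SparkSession is typically already provided for you:

```python
# Minimal PySpark ETL sketch: extract a CSV, transform it, load it as Parquet.
# Paths and column names are illustrative only.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-example").getOrCreate()

# Extract: read raw data (could just as well be a JDBC or COS source)
raw = spark.read.csv("sales_raw.csv", header=True, inferSchema=True)

# Transform: drop incomplete rows, fix types, aggregate per customer
sales = (
    raw.dropna(subset=["customer_id", "amount"])
       .withColumn("amount", F.col("amount").cast("double"))
       .groupBy("customer_id")
       .agg(F.sum("amount").alias("total_amount"))
)

# Load: write the result to a target the rest of the project can use
sales.write.mode("overwrite").parquet("sales_by_customer.parquet")
```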

The point is that Jupyter Notebooks are the Swiss Army knife: your first tool for any purpose. It is important to note that each connection or preprocessing step can be substituted by a Notebook. This means that if a connection is not available, or Data Refinery does not have the right capabilities for preprocessing, you can slot in a Notebook and code something customized for a perfect fit. Of course, within the Notebook you can always use your favorite frameworks too, such as Pandas or NumPy.
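As an illustration, a Notebook cell like the following could stand in for a Refinery step that is not available. It is only a sketch: the DataFrame df is assumed to come from a connection as in the snippet above, and the column names are invented.

```python
# Sketch of a custom preprocessing step done with pandas instead of Refinery.
# "df" is assumed to be loaded from a connection; column names are made up.
import numpy as np
import pandas as pd

df = df.drop_duplicates()
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
df["log_amount"] = np.log1p(df["amount"].fillna(0))
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")

# Save the cleaned result so it can be reused elsewhere in the project
df.to_csv("sales_clean.csv", index=False)
```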

If you are new to Notebooks and IBM, here is a demo of creating a Notebook and using an IBM connection:

Data Engineering: This path is for all users who need to transform and integrate data in projects. No coding knowledge is required, but a deep understanding of data architecture and data interaction is critical. DataStage is the tool of choice if you need robust, production-ready data flows.

DataStage Flow — Canvas UI

IBM DataStage delivers three basic components: Connectors, Stages, and Data Quality.

A Connector represents a node for data connectivity and metadata integration with external data sources. DataStage supports a wide range of sources, from messaging software and relational databases to public cloud storage. Here is a full list of all available connections. There is no need to set up database drivers or install third-party libraries; you can use them out of the box.

A flow consists of stages that are linked together and describe the movement of data from a data source to a data target. A stage describes a data source, a processing step, or a target system. The stage also defines the processing logic that moves the data from its input links to its output links. Here is a full list of all available stages.

Use the QualityStage stages in DataStage® to investigate, cleanse, and manage your data.

Congratulations, that’s it!

You have learned how to create an ETL and preprocess data, at any level of expertise, with IBM’s Cloud Pak for Data offering.

Disclaimer: Opinions are my own and not the views of my employer.
