Guide to SageMaker Pipelines

Bhujith Madav Velmurugan
Published in Level Up Coding · 15 min read · May 5, 2024


Photo by William Bout on Unsplash

In the previous article of the series, we learned how to deploy our models as a SageMaker Endpoint and expose that endpoint as a REST API. I also mentioned that we would explore SageMaker Pipelines next. So why do we need SageMaker Pipelines?

Introduction

Most machine learning models don’t make it to production. If your model reaches the deployment stage, congratulations. But is our job finished? Not yet. The statistical properties or distribution of the data our model serves in production might differ significantly from the distribution of the data it was originally trained on. This is termed data drift. Hence, it is important to continuously monitor and retrain our models after deployment. If retraining happens only once or twice a year, it is not an issue and we can do it manually. But if we need to retrain our models frequently, we can’t do it by hand each time. We want to automate this process of model building, training, and deployment. This is the reason for using SageMaker Pipelines.

This article continues upon the use case discussed in the previous article. I will demonstrate the steps to use SageMaker Pipelines with some necessary code snippets and screenshots. Please refer to GitHub for the complete code repositories.

Before we start, here is a flowchart that shows the overall flow of what we are going to implement. It might not make complete sense yet, but if you look at it again after reading the article, it will be clear.

SageMaker Pipeline — Flow Diagram

Let’s Start

The first step is to open SageMaker Studio and create a new project. Let me elaborate on the steps:

  1. Go to the AWS SageMaker homepage.
  2. Navigate to Domains on the left and create a new domain.
  3. You will be asked to create a user profile, and once created, launch SageMaker Studio.
  4. From the homepage, navigate to Projects under the Deployments section on the left.
SageMaker Studio Home page

Create a Project

Click the Create Project button in the Projects tab. SageMaker offers some default templates, each with a specific objective, which simplifies the task of creating a project. Since our objective is to build a model, deploy it, and retrain it automatically, select the project template Model building, training, and deployment. Give it a name and click Create Project. It might take some time to create the project.

SageMaker Project Templates

Once it’s completed, two repositories will appear. Click clone repo for each of them. (In the below image, I have created a sample project for demonstration purposes. The actual project I created has a proper name.) When you create the project, AWS creates the repositories with all the necessary files in AWS CodeCommit; if you go to AWS CodeCommit, you will find your repositories there. Within SageMaker Studio, you have just cloned those repositories. AWS CodeCommit is essentially the GitHub of AWS: it manages our code and handles versioning and other related tasks. I will also talk about CI/CD deployment later.

SageMaker Project
AWS CodeCommit

Once you clone the repositories, click on the File menu on the left side of the page. A folder with your project name will appear. Inside this folder, there will be two folders, one ending with the name model-build and the other ending with the name model-deploy. The former contains the files that will help us with the model building and training process, followed by saving and registering the model. The latter contains files that will help us deploy the model. Let’s dive deep into each of the repositories.

Build your Pipeline

We will start with the Model Build stage. Click on the model-build folder and you will have several files and folders. You can explore them one by one, but I will focus on the main files that will help us with our tasks. Click on the Pipelines folder, followed by the abalone folder, and you will find the following files:

  • pipeline.py: Contains the code to build the pipeline for model building, training, and registering.
  • preprocess.py: Contains the code to preprocess the dataset before building the model.
  • evaluate.py: Contains the code to evaluate our model.

SageMaker provides the complete code to solve the problem of predicting the age of abalone using physical measurements. Our task is to alter the files to match our use case of predicting CTR. The pipeline.py file is mandatory, while preprocess.py and evaluate.py can be used based on your needs. Let me explain the pipeline.py file briefly.

pipeline.py

This is where we define our pipeline. The pipeline is built from a series of steps or components, and we use four of them (a sketch of how they fit together follows the note below):

  • step_process: A processing job, built using FrameworkProcessor or any other processor, preprocesses our dataset. Here we specify the instance type and the script (preprocess.py) to run. When the pipeline executes, SageMaker creates a processing job on the specified instance type.
  • step_train: This step trains our XGBoost model. As mentioned in the previous article, we can use built-in algorithms as well as frameworks; in this case, I use the built-in algorithm. The SageMaker Estimator class specifies the framework (XGBoost), the instance type, the number of instances, and, most importantly, the file paths of the training and validation datasets. When the pipeline runs, SageMaker spins up a training job that trains the XGBoost model.
  • step_eval: This step uses a processing job with a custom script that fetches the model created in the previous step, makes predictions on the test dataset, evaluates the predictions using a metric (in this case, the F1 score), and stores the metric in a JSON file.
  • step_cond and step_register: These steps register our model. We check the value of the evaluation metric from the JSON file, and if it is above a certain threshold, we register (save) the model in the AWS Model Registry. This assists with model versioning: if we ever want an older version of the model, we can easily retrieve it.

Note: The above steps create a few files, such as model artifacts and the training, validation, and testing CSV files, which are stored in the container’s output directory. That is, each time a job runs, AWS spins up an instance for it, and the files are stored there. This might be a bit confusing; the above interpretation is based on my understanding, so please feel free to correct me if I am wrong. But as long as we know the paths of the files that are created, we don’t need to worry about where AWS stores them.
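To make the structure concrete, here is a minimal sketch of how these four components might be wired together, loosely following the template’s pipeline.py. The instance types, file paths, model package group name, threshold, and the JSON path into the evaluation report are my illustrative assumptions, not the template’s exact values:

```python
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.workflow.condition_step import ConditionStep
from sagemaker.workflow.conditions import ConditionGreaterThanOrEqualTo
from sagemaker.workflow.functions import JsonGet
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.properties import PropertyFile
from sagemaker.workflow.step_collections import RegisterModel
from sagemaker.workflow.steps import ProcessingStep, TrainingStep

session = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket = session.default_bucket()

# step_process: run preprocess.py inside a processing job
processor = SKLearnProcessor(
    framework_version="1.2-1", role=role,
    instance_type="ml.m5.xlarge", instance_count=1,
)
step_process = ProcessingStep(
    name="PreprocessCTRData",
    processor=processor,
    code="pipelines/ctr/preprocess.py",  # hypothetical path
    outputs=[
        ProcessingOutput(output_name="train", source="/opt/ml/processing/train"),
        ProcessingOutput(output_name="validation", source="/opt/ml/processing/validation"),
        ProcessingOutput(output_name="test", source="/opt/ml/processing/test"),
    ],
)

# step_train: built-in XGBoost trained on the processed splits
image_uri = sagemaker.image_uris.retrieve("xgboost", session.boto_region_name, "1.7-1")
estimator = Estimator(
    image_uri=image_uri, role=role,
    instance_type="ml.m5.xlarge", instance_count=1,
    output_path=f"s3://{bucket}/ctr/model",
)
estimator.set_hyperparameters(objective="binary:logistic", num_round=100)
step_train = TrainingStep(
    name="TrainCTRModel",
    estimator=estimator,
    inputs={
        "train": TrainingInput(
            step_process.properties.ProcessingOutputConfig.Outputs["train"].S3Output.S3Uri,
            content_type="text/csv"),
        "validation": TrainingInput(
            step_process.properties.ProcessingOutputConfig.Outputs["validation"].S3Output.S3Uri,
            content_type="text/csv"),
    },
)

# step_eval: evaluate.py writes the metric to evaluation.json
evaluation_report = PropertyFile(
    name="EvaluationReport", output_name="evaluation", path="evaluation.json",
)
step_eval = ProcessingStep(
    name="EvaluateCTRModel",
    processor=processor,
    code="pipelines/ctr/evaluate.py",  # hypothetical path
    inputs=[
        ProcessingInput(
            source=step_train.properties.ModelArtifacts.S3ModelArtifacts,
            destination="/opt/ml/processing/model"),
        ProcessingInput(
            source=step_process.properties.ProcessingOutputConfig.Outputs["test"].S3Output.S3Uri,
            destination="/opt/ml/processing/test"),
    ],
    outputs=[ProcessingOutput(output_name="evaluation", source="/opt/ml/processing/evaluation")],
    property_files=[evaluation_report],
)

# step_register wrapped in step_cond: register only if the metric clears the bar
step_register = RegisterModel(
    name="RegisterCTRModel",
    estimator=estimator,
    model_data=step_train.properties.ModelArtifacts.S3ModelArtifacts,
    content_types=["text/csv"], response_types=["text/csv"],
    inference_instances=["ml.m5.large"], transform_instances=["ml.m5.large"],
    model_package_group_name="CTRModelGroup",  # hypothetical name
    approval_status="PendingManualApproval",
)
step_cond = ConditionStep(
    name="CheckF1Score",
    conditions=[ConditionGreaterThanOrEqualTo(
        left=JsonGet(step_name=step_eval.name,
                     property_file=evaluation_report,
                     json_path="classification_metrics.f1.value"),  # assumed report layout
        right=0.7)],  # illustrative threshold
    if_steps=[step_register], else_steps=[],
)

pipeline = Pipeline(
    name="CTRModelBuildPipeline",
    steps=[step_process, step_train, step_eval, step_cond],
)
```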

preprocess.py

This file is used for preprocessing the dataset. We perform the same set of operations as shown in the previous article. Here, the datasets are downloaded from the S3 bucket, processed, and the training, testing, and validation files are stored in the output directory.

evaluate.py

This file, as the name suggests, contains the script used for evaluating the XGBoost model. Here, we fetch the model artifact created as part of the training step, evaluate it on the testing dataset (created in the preprocess step), and store the resulting evaluation metric in a JSON file, which would be used in the later condition step in the pipeline.

The input and output file paths can be specified as arguments to the Processor objects, and the outputs of one step can be accessed in another step if needed. For example, the training and validation datasets generated in the preprocess step are used in the subsequent training step (step_train), and the testing dataset is used in the evaluation step (step_eval).

Execute your Pipeline

We’ve altered the model build files to fit our use case. You can find the complete code on GitHub. Next, let’s initiate our pipeline execution. Navigate back to the model-build folder, where you’ll find a Jupyter notebook named sagemaker-pipelines-project.ipynb. From this notebook, we’ll manually start our pipeline execution. Later, I’ll explain how to automatically trigger pipeline execution. Run through the cells in the notebook, making sure to modify the pipeline_name and model_package_group_name accordingly.
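For reference, the key cells boil down to something like the following sketch, assuming the template’s get_pipeline() helper; the module path and the names below are placeholders you adapt to your project:

```python
import sagemaker
from pipelines.ctr.pipeline import get_pipeline  # hypothetical module path

role = sagemaker.get_execution_role()
session = sagemaker.Session()

pipeline = get_pipeline(
    region=session.boto_region_name,
    role=role,
    default_bucket=session.default_bucket(),
    pipeline_name="CTRModelBuildPipeline",         # modify accordingly
    model_package_group_name="CTRModelGroup",      # modify accordingly
)
pipeline.upsert(role_arn=role)  # create or update the pipeline definition
execution = pipeline.start()    # kick off a new execution
execution.wait()                # optionally block until it finishes
```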

Note: During the deployment of the pipeline in the build stage, you might encounter an error. If you check the logs, it might resemble the following: No approved ModelPackage found for ModelPackageGroup: If you face this error, simply set model_package_group_name in your Jupyter notebook to the package name mentioned in the error and rerun the pipeline. You shouldn’t encounter any errors afterward.

Once you run the pipeline.start() step, the execution begins. Click on the left tab in SageMaker Studio and select the Pipelines tab. Here, you’ll find the list of all pipelines created.

SageMaker Pipelines

Click on your pipeline name, and you will find the list of all executions, along with information on how long they have run and what the status of the execution is. The latest execution will be at the top of the list.

Pipeline Executions

Click on it, and you will find a beautiful flowchart depicting all stages of the pipeline execution. The names of the components will be the same as the names you had specified in the Pipeline steps. Click on a component, and you will find more information about the component, including the ability to check the log files. If any error has occurred at a particular stage, that component will be highlighted in red, and you can examine the log files for more details. The log files are stored in AWS CloudWatch Logs. Please ensure you have the necessary IAM permissions to access the logs.

Pipeline execution starts

Also, whenever you change the pipeline files, restart the kernel before starting the pipeline execution from the Jupyter notebook. I’m not sure why, but without restarting the kernel, the changes were not reflected. Starting executions from a Jupyter notebook is only for experimentation; once coding is completed, we can trigger the pipeline execution directly, or have it start automatically when we push the code to the repository.

If the execution was successful, all components will be highlighted in green. If the conditional check was true, meaning the performance of our model was above the threshold, the model is registered in the AWS Model Registry.

Pipeline execution completed

Why do we need the Model Registry?

AWS Model Registry keeps track of the different versions of the model created as a result of different executions. At any time, we can re-deploy an older version of the model.

Click on the left tab in SageMaker Studio, and under the Models section, you will find the Model Registry. Here, you will see the different model groups. Click on the model group you created, and you will find the different versions of the model produced by multiple pipeline executions.

Model Registry

There is a column named Status indicating the current status of each model; initially it will be marked as Pending Manual Approval. Click any version of the model and you will be navigated to a tab with more details about it. On the top right there is an Action button; under it, click Update Status and you will have the option to Approve or Reject the model. Approving the model triggers the Model Deploy pipeline. We can add comments while approving or rejecting, and there is also a column indicating who modified the status of the model. In a collaborative environment, this helps your team members know who decided to approve the model and why.

Add comments for model approval
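If you prefer to script it, the same approval can be done with boto3. This is a sketch; the model package ARN is a placeholder you would copy from your Model Registry:

```python
import boto3

sm = boto3.client("sagemaker")
sm.update_model_package(
    # placeholder ARN; copy the real one from the Model Registry
    ModelPackageArn="arn:aws:sagemaker:us-east-1:111122223333:model-package/CTRModelGroup/1",
    ModelApprovalStatus="Approved",
    ApprovalDescription="Metrics look good; promoting to staging",
)
```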

Note: If you want model deployment to start automatically after the model is registered, without the manual approval step, change the model_approval_status parameter in the pipeline.py file from its default value of PendingManualApproval to Approved, as shown below.
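The relevant parameter in pipeline.py looks like this (template default shown):

```python
from sagemaker.workflow.parameters import ParameterString

model_approval_status = ParameterString(
    name="ModelApprovalStatus",
    default_value="PendingManualApproval",  # change to "Approved" to skip the manual gate
)
```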

Model approved

Deploy your models

The code repository for the deployment stage of the pipeline can be found in the folder ending with the name model-deploy, which was created along with the model-build folder. Here, we also have different files. Though I’m not a DevOps person, I’ll give you an overview of what’s happening in some of the files.

Once we approve the model, the deploy pipeline is triggered; it picks the latest approved model from the model package group you created and deploys it to an endpoint. The buildspec.yml file contains the commands to start the execution, and the build.py file contains the code to fetch the latest approved model (sketched below).
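Under the hood, that lookup is roughly the following boto3 call. This is a sketch of the idea, not build.py verbatim; the model package group name is a placeholder:

```python
import boto3

sm = boto3.client("sagemaker")
# Ask for approved packages only, newest first
response = sm.list_model_packages(
    ModelPackageGroupName="CTRModelGroup",  # placeholder group name
    ModelApprovalStatus="Approved",
    SortBy="CreationTime",
    SortOrder="Descending",
    MaxResults=1,
)
latest_approved_arn = response["ModelPackageSummaryList"][0]["ModelPackageArn"]
print(latest_approved_arn)
```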

There are also two more files, prod-config.json and staging-config.json. These contain the parameters describing the instances on which we want to deploy our model: for example, the instance type and the number of instances. In these JSON files, the default instance type is ml.m5.large. We can change this to ml.m5.xlarge or whatever instance type we want. The catch is that we can’t edit these two files in the SageMaker Studio environment (at least, I didn’t find a way to). We have to edit them in AWS CodeCommit:

  1. Go to the AWS CodeCommit service, and you will find your model-build and model-deploy repositories.
  2. Click the model-deploy repository and open the file, staging-config.json. Click the edit button, modify the instance type to ml.m5.xlarge, add details regarding who made the change and a commit message similar to what we do in GitHub, and click the commit button.
  3. Your changes will be committed. Perform the same for the prod-config.json file.
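After the edit, staging-config.json should look roughly like the following; the exact field names can vary slightly between template versions:

```json
{
  "Parameters": {
    "StageName": "staging",
    "EndpointInstanceCount": "1",
    "EndpointInstanceType": "ml.m5.xlarge"
  }
}
```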

Also, the README.md file contains a description of the various files used in this repository.

Once you commit and return to the SageMaker Studio environment, click the Git button on the left tab. You will find a small yellow dot at the top indicating that you should pull the latest changes. Pull them, and they will be reflected in both files within the Studio environment. Also, make the corresponding change in the pipeline.py file to reflect the instance type modification: in the model_register step arguments, change the inference instance type accordingly. Once you commit the changes, the model-build pipeline is triggered, and you can view it in AWS CodePipeline. Once this execution completes, a new model is registered in the AWS Model Registry.

We haven’t approved any of our models yet. Let’s approve one, and once approved, I will show the deploy pipeline in action. After approving, go to SageMaker Pipelines and open your model-deploy pipeline.

Model Deploy Pipeline

If all stages executed successfully, a new endpoint should be created, which you can view under SageMaker Endpoints. I attached the screenshots for approving the model in the Model Registry section above.

The last stage of the execution (Approve Deployment) waits for approval. This approval is for creating a production endpoint. In a real-world scenario, we would experiment with our staging endpoint, and once everything is set, deploy the model to production. Once we approve this step, the production endpoint is created (this endpoint is just a duplicate of the staging endpoint, with the word ‘production’ in its name).

Staging endpoint

Let’s summarize the entire flow:

  1. Data Preparation:

The dataset is stored in an S3 bucket.

  2. Model Build Pipeline Execution:
  • Preprocessing Component: Fetches the CSV file, preprocesses it, and splits the dataset into training, validation, and testing datasets.
  • Training Component: Builds an XGBoost model, trains, and validates it using the training and validation datasets. The trained model is available as a pickle file.
  • Evaluation Component: Makes predictions on the testing dataset using the XGBoost model pickle file. Evaluates the model using any chosen metric (e.g., F1 score).
  • Conditional Check Component: Checks if the evaluation result is above a certain threshold. If true, the model gets stored in the Model Registry.

3. Model Approval:

Once we approve the model in the Model Registry, the Model Deploy pipeline gets triggered.

4. Model Deploy:

Deploys the model as an endpoint, typically a staging endpoint. Optionally, we can create a production endpoint if needed.

This flow provides a systematic approach to model building, evaluation, approval, and deployment, ensuring that only high-quality models are deployed for inference.

That’s it. We have almost reached the end.

Schedule the Pipelines

So far, we have been starting the pipeline execution manually. But we want this to happen periodically. There are two ways:

  1. AWS EventBridge Scheduler
  2. AWS Lambda

AWS EventBridge Scheduler

In the words of AWS, “Amazon EventBridge Scheduler is a serverless scheduler that allows you to create, run, and manage tasks from one central, managed service”. We will use this service to run our pipelines periodically: for example, every Friday at 7:00 PM, or any other time as per your requirements. In my case, I am creating a schedule to start the pipeline every Tuesday at 9:00 PM. The process is simple: go to the AWS EventBridge Scheduler service, click Create Schedule, and you will be guided through a series of simple steps. I have attached the screenshots below.

EventBridge Scheduler step 1

We are creating a recurring schedule that starts at 9:00 PM every Tuesday.

EventBridge Scheduler step 2
EventBridge Scheduler step 3
EventBridge completion
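If you’d rather script it than click through the console, the same schedule can be created with boto3. This is a sketch; the role ARN and pipeline name are placeholders, and the role must be allowed to call sagemaker:StartPipelineExecution:

```python
import json
import boto3

scheduler = boto3.client("scheduler")
scheduler.create_schedule(
    Name="ctr-pipeline-weekly",
    ScheduleExpression="cron(0 21 ? * TUE *)",  # 9:00 PM every Tuesday
    FlexibleTimeWindow={"Mode": "OFF"},
    Target={
        # EventBridge Scheduler "universal target" for StartPipelineExecution
        "Arn": "arn:aws:scheduler:::aws-sdk:sagemaker:startPipelineExecution",
        "RoleArn": "arn:aws:iam::111122223333:role/scheduler-sagemaker-role",  # placeholder
        "Input": json.dumps({"PipelineName": "CTRModelBuildPipeline"}),        # placeholder
    },
)
```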

Let’s return to our Pipeline executions tab in SageMaker Studio and check whether the execution has started or not. In the image below, pay attention to the time, and you’ll notice that the pipeline has started automatically according to our schedule.

EventBridge Pipeline Trigger

To stop the recurring executions, we can either disable or completely delete the EventBridge schedule.

AWS Lambda

I won’t elaborate on this method as much as the previous one. Here’s how it works: we can add triggers for our Lambda function to start execution automatically. By setting up a trigger, such as uploading a file (e.g., a CSV file) to a specific S3 bucket, the Lambda function is triggered. The Lambda function contains the code to trigger the SageMaker Pipeline.

The overall flow is this: once the dataset is uploaded to the S3 bucket, it triggers the Lambda function, which in turn triggers the SageMaker Pipeline. I didn’t elaborate on this methodology because it’s quite common, and there are many resources available that demonstrate how to set up triggers in Lambda functions. The additional step required is writing the code within the Lambda function to start the SageMaker Pipeline.
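A minimal Lambda handler for this flow might look like the sketch below; the pipeline name is a placeholder, and the function’s execution role needs sagemaker:StartPipelineExecution permission. You would attach an S3 ObjectCreated trigger to the function so that uploading a new dataset starts the pipeline:

```python
import boto3

sm = boto3.client("sagemaker")

def lambda_handler(event, context):
    # event carries the S3 object details; here we only need to start the run
    response = sm.start_pipeline_execution(
        PipelineName="CTRModelBuildPipeline",  # placeholder name
        PipelineExecutionDisplayName="triggered-by-s3-upload",
    )
    return {"pipelineExecutionArn": response["PipelineExecutionArn"]}
```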

Conclusion

In the last article, we focused on deploying our models, and this article demonstrated how to automate model training and deployment. I might not have been clear in some places, so I apologize if I missed anything. If you encounter any difficulties while deploying, please leave a comment, and I will do my best to resolve them.

Thanks for sticking around till the end. This article might look lengthy, but most of the space is occupied by images. Nonetheless, I hope it taught you something new.

Your feedback and questions are highly welcomed.

There is one more article in the series where we will explore Feature Stores.

Useful Resources

For the complete code, refer to https://github.com/Bhujith10/PredictingCTR-Sagemaker

https://youtu.be/Hvz2GGU3Z8g?si=imDnMfifXfCiX4eT
