Photo by NEOM on Unsplash

MLOps - Mastering MLflow: Unlocking Efficient Model Management and Experiment Tracking

Experiment Tracking, Model Registry, and Versioning

Introduction:

In the world of machine learning, managing experiments and tracking progress can be pretty challenging. That’s where MLflow comes into play. It’s a powerful tool designed to help anyone working with machine learning, from beginners to experts. MLflow makes it easier to keep track of your experiments, manage your models, and streamline the entire machine learning process. Whether you’re just starting out or you’re a seasoned professional, MLflow is here to make your journey in machine learning more organized and efficient. In this article, we’ll dive into how MLflow can transform your machine learning projects, making them more effective and manageable.


Challenges Faced by Data Scientists

Experiment Tracking and Model Management

  1. Inefficient Experiment Tracking:
  • Scattered and disorganized experiment logs.
  • Difficulty in comparing and reproducing results.
  • Lack of centralized tracking system.

2. Model Versioning and Lifecycle Management:

  • Challenges in managing multiple model versions.
  • Difficulty in transitioning models through lifecycle stages (development, staging, production).
  • Need for adequate documentation and annotation of model changes.

3. Collaboration and Governance:

  • Issues in sharing and collaborating on model development.
  • Lack of access control and oversight over model modifications.
  • Difficulty in maintaining standards and compliance.

4. Model Deployment and Scalability:

  • Complexities in deploying models to production.
  • Challenges in scaling models and managing resources.
  • Inconsistencies in deployment across different environments.

MLflow’s Solutions

1. Centralized Experiment Tracking with MLflow:

  • Unified interface for logging experiments, parameters, and results.
  • Easy comparison and reproduction of experiments.
  • Enhanced visibility and organization of experiment data.

2. Model Registry and Versioning in MLflow:

  • Systematic version control for models.
  • Staging and lifecycle management for smooth transitions to production.
  • Detailed annotations and descriptions for each model version.

3. Collaboration and Governance Features:

  • Facilitates team collaboration with shared model access.
  • Robust access control and governance capabilities.
  • Compliance and standardization in model development and deployment.

4. Scalable Model Deployment with MLflow:

  • Streamlined deployment processes to various environments.
  • Support for scalable model serving and resource management.
  • Integration with popular serving tools and cloud platforms.

Introduction to MLflow

  1. What is MLflow?
  • MLflow is an open-source platform created by Databricks.
  • It’s designed to manage the complete machine-learning lifecycle.
  • Includes tools for tracking experiments, packaging code, and sharing models.

2. Purpose of MLflow:

  • Use MLflow to simplify and streamline complex machine learning projects.
  • It helps in managing the workflow from data preparation to model deployment.
  • Aims to increase the efficiency and reproducibility of machine learning projects.

3. Components of MLflow:

MLflow encompasses four primary components:

  • MLflow Tracking: For logging parameters, metrics, and artifacts from ML experiments.
  • MLflow Projects: A packaging format for reproducible runs and sharing code.
  • MLflow Models: A standard format for packaging ML models.
  • MLflow Model Registry: A central repository for managing the lifecycle of ML models.

MLflow Tracking:

MLflow Tracking serves as a specialized API designed for logging in machine learning workflows. It is uniquely versatile, seamlessly integrating with various libraries and environments used in training. The core organizational structure of MLflow Tracking revolves around “runs”, which are essentially individual executions of data science or machine learning code. These runs are systematically grouped into “experiments”, allowing multiple runs to be part of a specific experiment.

An MLflow server has the capacity to manage numerous experiments, each of which can document diverse types of information:

  • Parameters: Inputs to the model, like the number of estimators in an ensemble model, represented as key-value pairs.
  • Metrics: Quantitative measures to evaluate model performance, such as Root Mean Squared Error (RMSE) or the Area Under the Receiver Operating Characteristics (ROC) Curve.
  • Artifacts: These are various outputs in different formats, ranging from images and serialized models to datasets.
  • Source Code: The actual code that executed the experiment.

MLflow’s flexibility allows for tracking experiments using a variety of programming languages, including Python, R, and Java. Additionally, it supports command-line interface (CLI) operations and REST API calls for tracking. This comprehensive approach makes MLflow a robust tool for managing and monitoring machine learning experiments.
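
To make the vocabulary above concrete, here is a minimal Python sketch (the values are purely illustrative) that logs one of each record type inside a single run:

import mlflow

with mlflow.start_run():
    mlflow.log_param("n_estimators", 100)      # parameter: an input to the model
    mlflow.log_metric("rmse", 0.27)            # metric: a quantitative evaluation result
    with open("notes.txt", "w") as f:          # artifact: any output file (plots, models, data)
        f.write("baseline run")
    mlflow.log_artifact("notes.txt")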

Install MLflow:

# Install MLflow (run this in your terminal or command prompt)
# pip install mlflow

# Import the library
import mlflow

# Start the MLflow Tracking Server / UI
# Run this command in your terminal or command prompt:
# mlflow ui
# The UI will be accessible at http://localhost:5000
+-----------+      +------------+      +------------+
| Notebooks |      | Local Apps |      | Cloud Jobs |
+-----+-----+      +-----+------+      +-----+------+
      |                  |                   |
      +---------+--------+---------+---------+
                         |
                         v
            +------------------------+
            | MLflow Tracking Server |
            +-----------+------------+
                        |
               +--------+--------+
               |                 |
               v                 v
           +------+          +------+
           |  UI  |          | API  |
           +------+          +------+

Basic Experiment:

  1. Starting an Experiment: Begin your machine learning experiment by initiating a run with MLflow. This is done through the mlflow.start_run() function, where you provide a unique name for your run. This marks the beginning of your model's experiment tracking.
  2. Model Training: Proceed to train your machine learning model using your preferred methodology and algorithms.
  3. Logging the Model: After training, use mlflow.sklearn.log_model() to log your model. This function is part of MLflow's sklearn module, designed to work seamlessly with scikit-learn models.
  4. Logging Metrics: Evaluate your model’s performance and log these metrics for comparison and analysis. For instance, you can use mlflow.log_metric() to log the model's error rate or any other relevant performance metric.
  5. Run ID Retrieval: Finally, access and print the run ID with run.info.run_id. This ID is a unique identifier for your experiment's run, allowing you to easily reference and retrieve it later in the MLflow tracking system.
import mlflow.sklearn
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Set the experiment name
mlflow.set_experiment("Random Forest")

with mlflow.start_run(run_name="My RF Run") as run:
    # Create model, train it, and create predictions
    rf = RandomForestRegressor(random_state=42)
    rf.fit(X_train, y_train)
    predictions = rf.predict(X_test)

    # Log model
    mlflow.sklearn.log_model(rf, "random_forest_model")

    # Log metrics
    mse = mean_squared_error(y_test, predictions)
    mlflow.log_metric("mse", mse)

    run_id = run.info.run_id
    experiment_id = run.info.experiment_id

print(f"Inside MLflow Run with run_id `{run_id}` and experiment_id `{experiment_id}`")

Log More: Parameters, Artifacts, etc.:

The code is below; after running it, open the MLflow UI and verify that the parameters, metrics, the CSV file, and the PNG file are all stored with the run.

import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import matplotlib.pyplot as plt
import mlflow
import mlflow.sklearn

# Load California Housing dataset
california = fetch_california_housing()
X = pd.DataFrame(california.data, columns=california.feature_names)
y = california.target

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# MLflow logging function
def log_rf(run_name, params, X_train, X_test, y_train, y_test):
    with mlflow.start_run(run_name=run_name) as run:
        # Create and train model
        rf = RandomForestRegressor(**params)
        rf.fit(X_train, y_train)
        predictions = rf.predict(X_test)

        # Log model
        mlflow.sklearn.log_model(rf, "random_forest_model")

        # Log parameters
        mlflow.log_params(params)

        # Log metrics
        mlflow.log_metrics({
            "mse": mean_squared_error(y_test, predictions),
            "mae": mean_absolute_error(y_test, predictions),
            "r2": r2_score(y_test, predictions)
        })

        # Log feature importance
        importance = pd.DataFrame(list(zip(X_train.columns, rf.feature_importances_)),
                                  columns=["Feature", "Importance"]).sort_values("Importance", ascending=False)
        importance_path = "importance.csv"
        importance.to_csv(importance_path, index=False)
        mlflow.log_artifact(importance_path, "feature-importance")

        # Log plot
        fig, ax = plt.subplots()
        importance.plot.bar(ax=ax)
        plt.title("Feature Importances")
        mlflow.log_figure(fig, "feature_importances.png")

        return run.info.run_id

# Example parameters
params = {
    "n_estimators": 100,
    "max_depth": 5,
    "random_state": 42
}

# Log the experiment
run_id = log_rf("California Housing RF Model", params, X_train, X_test, y_train, y_test)
print(f"Run ID: {run_id}")

Hyperparameter Tuning:

import mlflow
import mlflow.sklearn
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor

# Set the experiment name
experiment_name = "Hyperparameter Tuning Experiment - California Housing"
mlflow.set_experiment(experiment_name)

# Load data
california_housing = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(california_housing.data, california_housing.target, test_size=0.3, random_state=42)

# Hyperparameter grid
param_grid = {
    'n_estimators': [10, 50, 100],
    'max_depth': [None, 10, 20]
}

# Grid search with cross-validation
grid_search = GridSearchCV(RandomForestRegressor(random_state=42), param_grid, cv=3, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)

# Logging results for each parameter combination
for i, params in enumerate(grid_search.cv_results_['params']):
    with mlflow.start_run(run_name=f"Run_{i}"):
        mlflow.log_params(params)
        mlflow.log_metric("mean_test_score", grid_search.cv_results_['mean_test_score'][i])

# Log the best model in a separate run
with mlflow.start_run(run_name="Best Model"):
    mlflow.log_params(grid_search.best_params_)
    mlflow.sklearn.log_model(grid_search.best_estimator_, "model")

  1. Set Up MLflow Experiment:
  • Define and set an experiment name in MLflow.

2. Load Dataset:

  • Fetch the California housing dataset from sklearn.
  • Split the dataset into training and testing sets.

3. Define Hyperparameter Grid:

  • Create a grid of hyperparameters to search, including different values for n_estimators and max_depth.

4. Perform Grid Search with Cross-Validation:

  • Use GridSearchCV with a RandomForestRegressor to search the hyperparameter space.
  • Use cross-validation and scoring based on negative mean squared error.

5. Logging Each Parameter Combination:

  • Iterate through each set of parameters tested in the grid search.
  • For each combination, start an MLflow run, log the parameters, and log the mean test score.

6. Log the Best Model:

  • After completing the grid search, start a separate run in MLflow for the best model.
  • Log the best parameters and the best estimator as an MLflow model.

Running Nested Runs:

import mlflow
import mlflow.sklearn
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Set the experiment name
mlflow.set_experiment("MLFlow_Nested_Runs")

# Load data
data = fetch_california_housing()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Parent run
with mlflow.start_run(run_name="Parent Run"):
    mlflow.log_param("dataset", "California Housing")

    # Nested run 1
    with mlflow.start_run(run_name="Child Run 1", nested=True):
        model1 = RandomForestRegressor(n_estimators=10, random_state=42)
        model1.fit(X_train, y_train)
        mlflow.sklearn.log_model(model1, "model1")

    # Nested run 2
    with mlflow.start_run(run_name="Child Run 2", nested=True):
        model2 = RandomForestRegressor(n_estimators=100, random_state=42)
        model2.fit(X_train, y_train)
        mlflow.sklearn.log_model(model2, "model2")

  • Set Up Experiment: The code initiates an MLflow experiment titled “MLFlow_Nested_Runs”.
  • Load Data: It loads the California housing dataset and splits it into training and testing sets.
  • Parent Run: Starts a main run (“Parent Run”), within which two nested runs will be executed.
  • Nested Run 1:
  • Trains a RandomForestRegressor model with 10 trees.
  • Logs the model to MLflow under the name “model1”.
  • Nested Run 2:
  • Trains a RandomForestRegressor model with 100 trees.
  • Logs this model as “model2” in MLflow.
  • Logging: In each nested run, the model and its parameters are logged to the MLflow tracking server for comparison and later analysis.
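
One practical follow-up: MLflow tags each child run with its parent's run ID, so you can list the children of a given parent programmatically. A hedged sketch, assuming parent_run_id holds the ID of the "Parent Run" above (filter syntax may vary slightly by MLflow version):

import mlflow

children = mlflow.search_runs(
    experiment_names=["MLFlow_Nested_Runs"],
    filter_string=f"tags.`mlflow.parentRunId` = '{parent_run_id}'"
)
print(children[["run_id", "tags.mlflow.runName"]])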

Model Comparison Experiment:

import mlflow
import mlflow.sklearn
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Set the experiment name
experiment_name = "Model Comparison Experiment"
mlflow.set_experiment(experiment_name)

# Load California housing dataset
housing = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(housing.data, housing.target, test_size=0.2, random_state=42)

# Define models
models = {
    "Random Forest": RandomForestRegressor(),
    "Gradient Boosting": GradientBoostingRegressor(),
    "Linear Regression": LinearRegression()
}

# Iterate through models and log to MLflow
for model_name, model in models.items():
    with mlflow.start_run(run_name=model_name):
        # Train the model
        model.fit(X_train, y_train)
        predictions = model.predict(X_test)

        # Calculate metrics
        mse = mean_squared_error(y_test, predictions)
        r2 = r2_score(y_test, predictions)

        # Log model and metrics
        mlflow.sklearn.log_model(model, model_name)
        mlflow.log_metrics({"mse": mse, "r2": r2})

This code will create three separate runs within the same experiment, each corresponding to a different model. It logs the model and its evaluation metrics (MSE and R2 score) for comparison.
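
If you want to compare the runs programmatically rather than in the UI, a small hedged follow-up (assuming the experiment name used above) is to pull them into a pandas DataFrame and sort by MSE:

import mlflow

runs = mlflow.search_runs(experiment_names=["Model Comparison Experiment"])
print(runs[["tags.mlflow.runName", "metrics.mse", "metrics.r2"]].sort_values("metrics.mse"))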

Model Signature Experiment:

import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_california_housing
from sklearn.metrics import mean_squared_error
from mlflow.models.signature import infer_signature
import pandas as pd

# Fetch California housing data
data = fetch_california_housing()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Set the experiment name
experiment_name = "Model Signature Experiment"
mlflow.set_experiment(experiment_name)

# Start MLflow run
with mlflow.start_run(run_name="Signature and Input Example"):
    # Train model
    rf = RandomForestRegressor(random_state=42)
    rf_model = rf.fit(X_train, y_train)
    mse = mean_squared_error(y_test, rf_model.predict(X_test))

    # Log metrics
    mlflow.log_metric("mse", mse)

    # Log the model with signature and input example
    signature = infer_signature(X_train, rf_model.predict(X_train))
    input_example = X_train.head(3)
    mlflow.sklearn.log_model(rf_model, "rf_model", signature=signature, input_example=input_example)

This code does the following:

  1. Loads and preprocesses the California housing dataset.
  2. Splits the dataset into training and testing sets.
  3. Sets a new experiment in MLflow named “Model Signature Experiment”.
  4. Trains a RandomForestRegressor model.
  5. Logs the mean squared error as a metric to MLflow.
  6. Uses infer_signature to automatically determine the schema (signature) of inputs and outputs of the model.
  7. Selects a few examples from the training data (input_example).
  8. Logs the trained model to MLflow, including the inferred signature and input example, enabling schema validation and smoother integration with deployment tools.
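
As a quick check, you can reload the logged model and inspect the signature MLflow stored with it. A hedged sketch, assuming the run above has finished:

import mlflow

last_run = mlflow.last_active_run()
loaded = mlflow.pyfunc.load_model(f"runs:/{last_run.info.run_id}/rf_model")
print(loaded.metadata.signature)  # column names and types inferred by infer_signature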

Autologging Experiment:

import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_california_housing

# Fetch California housing data
data = fetch_california_housing()
X = data.data
y = data.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Set the experiment name
experiment_name = "Autologging Experiment"
mlflow.set_experiment(experiment_name)

# Enable autologging
mlflow.autolog()

# Training model with autologging
rf = RandomForestRegressor(random_state=42)
rf_model = rf.fit(X_train, y_train)

  1. We enable MLflow autologging which automatically logs parameters, metrics, and models.
  2. A RandomForestRegressor model is trained on the California housing dataset.
  3. The experiment is set to “Autologging Experiment”.
  4. Since autologging is enabled, there’s no need for explicit logging statements. Metrics, parameters, and the model will be logged automatically once training is complete.
  5. There’s no need to encapsulate the training code within mlflow.start_run(), as autologging handles this implicitly.
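
Because autologging creates and ends the run for you, a common question is how to get at what was logged. A minimal sketch using the run MLflow just completed:

import mlflow

last_run = mlflow.last_active_run()
print(last_run.info.run_id)
print(last_run.data.params)   # autologged hyperparameters (n_estimators, max_depth, ...)
print(last_run.data.metrics)  # autologged training metrics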

SHAP and Feature Importance Experiment:

import matplotlib.pyplot as plt
import pandas as pd
import mlflow
import shap
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Fetch and prepare the data
data = fetch_california_housing()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model
rf = RandomForestRegressor(random_state=42)
rf_model = rf.fit(X_train, y_train)

# Set the experiment name
experiment_name = "SHAP and Feature Importance Experiment"
mlflow.set_experiment(experiment_name)

# Start MLflow run
with mlflow.start_run(run_name="SHAP and Feature Importance Analysis"):
    # Generate and log SHAP values
    explainer = shap.TreeExplainer(rf_model)
    shap_values = explainer.shap_values(X_train[:5])
    shap.summary_plot(shap_values, X_train[:5], show=False)
    plt.savefig("shap_summary.png")
    mlflow.log_artifact("shap_summary.png", "shap_plots")

    # Generate and log feature importance plot
    feature_importances = pd.Series(rf_model.feature_importances_, index=X_train.columns)
    fig, ax = plt.subplots()
    feature_importances.plot.bar(ax=ax)
    ax.set_title("Feature importances using MDI")
    ax.set_ylabel("Mean decrease in impurity")
    mlflow.log_figure(fig, "feature_importance_rf.png")

  1. We start by fetching and preparing the California housing dataset.
  2. A RandomForestRegressor model is trained using the dataset.
  3. We define and set a new experiment named “SHAP and Feature Importance Experiment”.
  4. A feature importance plot is generated and logged into MLflow within a run named “SHAP and Feature Importance Analysis”.
  5. This script will log the feature importance plot to the MLflow UI under the specified experiment.
  6. We add SHAP analysis using shap.TreeExplainer and shap_values for the first five training samples.
  7. The SHAP summary plot is saved to a file and then logged as an artifact in MLflow.
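
Worth noting: MLflow also ships a built-in SHAP helper that computes and logs an explanation for you. A hedged alternative sketch, reusing the model and data above (behavior may vary with MLflow and SHAP versions):

import mlflow
import mlflow.shap

with mlflow.start_run(run_name="Built-in SHAP Explanation"):
    # Computes SHAP values for the given samples and logs the explanation as run artifacts
    mlflow.shap.log_explanation(rf_model.predict, X_train[:5])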

All the essential methods:

import mlflow
import os
import pandas as pd
import numpy as np
from PIL import Image
import matplotlib.pyplot as plt

# Start a new MLflow run
with mlflow.start_run():
    # Log parameters and metrics
    mlflow.log_param("batch_size", 32)
    mlflow.log_metrics({"accuracy": 0.95, "loss": 0.05})

    # Log a local file as an artifact
    with open("output.txt", "w") as f:
        f.write("Hello world!")
    mlflow.log_artifact("output.txt")

    # Log all files in a local directory as artifacts
    if not os.path.exists("outputs"):
        os.makedirs("outputs")
    with open("outputs/output.txt", "w") as f:
        f.write("Hello world!")
    mlflow.log_artifacts("outputs")

    # Log a dictionary as an artifact
    data = {"key": "value"}
    mlflow.log_dict(data, "data.json")

    # Log a matplotlib figure as an artifact
    fig, ax = plt.subplots()
    ax.plot([0, 1], [2, 3])
    mlflow.log_figure(fig, "figure.png")

    # Log an image file as an artifact
    image = np.random.rand(100, 100, 3) * 255
    image = Image.fromarray(image.astype('uint8')).convert('RGBA')
    mlflow.log_image(image, "image.png")

    # Set tags for the current active run
    mlflow.set_tags({"key1": "value1", "key2": "value2"})

# Set the experiment name
mlflow.set_experiment("My Experiment")

# Set a tag on the current experiment
mlflow.set_experiment_tag("key", "value")

# Fetch and print the current experiment details
experiment = mlflow.get_experiment_by_name("Default")
print(f"Experiment_id: {experiment.experiment_id}")

# Fetch and print the artifact URI
with mlflow.start_run():
    artifact_uri = mlflow.get_artifact_uri()
    print(f"Artifact uri: {artifact_uri}")

# Set the tracking server URI
mlflow.set_tracking_uri("http://my-tracking-server:5000")

MLflow Projects:

Why Packaging Matters in Machine Learning Projects

  1. Diverse Library Dependencies: Machine learning projects often rely on various libraries, making standardization crucial for consistency across different environments.
  2. Environment Replicability: MLflow facilitates the replication of the development environment, either through conda environments or docker containers, ensuring that projects are easily transferable and deployable across different setups.
  3. Ease of Sharing: With MLflow’s packaging capabilities, teams can effortlessly share and disseminate their machine learning solutions, enhancing collaboration and usability.
  4. Handling Project Complexity: As machine learning projects evolve, they encompass intricate processes like ETL operations, preprocessing through auxiliary models, and the core model training. Managing these multifaceted components becomes critical.
  5. Traceability and Debugging: Comprehensive lineage tracing of each element in a machine learning pipeline is essential. It provides clarity and aids in troubleshooting, especially when identifying the source of failures or inconsistencies.
  • MLflow Project: A set of conventions for organizing machine learning code.
  • MLproject File: A YAML file that acts as a blueprint for the project, detailing its structure and components.
name: My Project

python_env: python_env.yaml
# or
# conda_env: my_env.yaml
# or
# docker_env:
#   image: mlflow-docker-example

entry_points:
  main:
    parameters:
      data_file: path
      regularization: {type: float, default: 0.1}
    command: "python train.py -r {regularization} {data_file}"
  validate:
    parameters:
      data_file: path
    command: "python validate.py {data_file}"

  • Project Name: The project is titled “My Project”.
  • Environment Specification: The project uses a Python environment, which is specified in a file named python_env.yaml. Alternatively, a Conda environment or a Docker container environment can be specified with my_env.yaml or by referencing a Docker image, respectively.
  • Entry Points: Two main actions (entry points) are defined for this project:
  • Main: The main action involves training a model. It takes a data file path and a regularization parameter with a default value of 0.1. The corresponding command to run this action is a Python command that executes the train.py script with the regularization parameter and data file.
  • Validate: The second action is for validation. It requires a data file path and runs a Python command that executes the validate.py script with the data file.

This MLproject file acts as a guide for MLflow to understand how to run and manage the project's machine learning workflows.
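
For illustration, the same entry points can also be launched from Python rather than the CLI. A hedged sketch using the MLflow Projects API (the data file name here is hypothetical):

import mlflow.projects

submitted = mlflow.projects.run(
    uri=".",                         # directory containing the MLproject file
    entry_point="main",
    parameters={"data_file": "data.csv", "regularization": 0.2},
)
print(submitted.run_id)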

  • Encapsulation: Each machine learning process (like data preprocessing, training, etc.) can be wrapped into a project, allowing complex workflows.
  • Collaboration: The standard structure makes it easier for teams to work together on machine learning experiments.

Let's walk through a sample MLflow project setup:

  1. Prepare your data: Download the California housing dataset. You can use Scikit-learn to load the dataset.
from sklearn.datasets import fetch_california_housing

housing = fetch_california_housing(as_frame=True)
# housing.frame contains the features plus the MedHouseVal target column,
# which train.py below expects to find in the CSV
housing.frame.to_csv('california_housing.csv', index=False)

2. Create your MLflow Project files: In a new directory, create the following files:

  • MLproject: This file defines the structure of your project.
  • conda.yaml: Defines the conda environment needed to run the project.
  • train.py: Contains the training script for your RandomForestRegressor.

MLproject:

name: California-Housing-Prediction

conda_env: conda.yaml

entry_points:
  main:
    parameters:
      data_path: {type: str, default: "california_housing.csv"}
      n_estimators: {type: int, default: 100}
      max_depth: {type: int, default: 20}
      max_features: {type: str, default: "auto"}
    command: "python train.py --data_path {data_path} --n_estimators {n_estimators} --max_depth {max_depth} --max_features {max_features}"

conda.yaml:

name: california-housing-env
channels:
  - conda-forge
dependencies:
  - python=3.8
  - scikit-learn
  - pandas
  - numpy
  - mlflow
  - pip

train.py

import click
import mlflow
import mlflow.sklearn
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

@click.command()
@click.option("--data_path", default="california_housing.csv", type=str)
@click.option("--n_estimators", default=100, type=int)
@click.option("--max_depth", default=20, type=int)
@click.option("--max_features", default="auto", type=str)
def mlflow_rf(data_path, n_estimators, max_depth, max_features):

    with mlflow.start_run():
        # Import the data
        df = pd.read_csv(data_path)
        X = df.drop("MedHouseVal", axis=1)
        y = df["MedHouseVal"]
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

        # Create model, train it, and create predictions
        rf = RandomForestRegressor(n_estimators=n_estimators, max_depth=max_depth, max_features=max_features)
        rf.fit(X_train, y_train)
        predictions = rf.predict(X_test)

        # Log model
        mlflow.sklearn.log_model(rf, "random-forest-model")

        # Log params
        mlflow.log_params({
            "n_estimators": n_estimators,
            "max_depth": max_depth,
            "max_features": max_features
        })

        # Log metrics
        mlflow.log_metrics({
            "mse": mean_squared_error(y_test, predictions)
        })

if __name__ == "__main__":
    mlflow_rf()

3. Run your MLflow Project: Execute the project using the MLflow CLI from the same directory where your MLproject file is located:

mlflow run .

The train.py file should be in the same directory as your MLproject and conda.yaml files.

For more information, see the MLflow Projects documentation.


MLflow Model Management & Model Registry:

  • Centralized Repository: MLflow provides a central hub to store, version, and manage ML models.
  • Multiple Frameworks: It supports models from many ML frameworks, such as scikit-learn, TensorFlow, PyTorch, etc.
  • Model Versioning: You can track different versions of models, similar to code versioning with Git.
  • Stage Management: Models can be transitioned through different stages like Staging, Production, and Archived.
  • Model Packaging: MLflow packages models in a standard format, which can be used in a variety of environments without compatibility issues.
  • Model Serving: Packaged models can be deployed for inference in different environments, including cloud services or local servers.
  • Custom Models: You can define custom models with pre-processing and post-processing steps for predictions.
  • Consistent API: MLflow provides a consistent API to interact with different types of models, offering a unified way to make predictions.

MLflow Model Registry:

  • Central Hub for Models: It’s a central place to store and manage MLflow models, with unique names and details for each model.
  • Track Different Versions: Automatically keeps track of different versions of a model as updates are made.
  • Lifecycle Stages: Assign stages like “Staging” or “Production” to models, showing where they are in their lifecycle.
  • Record Changes: Keeps a log of updates, registrations, and who made changes, with added notes and tags.
  • Integration with Workflows: Fits into CI/CD pipelines, allowing for recording, reviewing, and approving changes as part of model development and deployment processes.

Let's walk through the steps:

  1. Set the training code:
import mlflow
import mlflow.sklearn
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_california_housing

# Fetch and split data
data = fetch_california_housing()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Set the experiment name
experiment_name = "Model_Registry_Experiment"
mlflow.set_experiment(experiment_name)

# Train a RandomForest model
with mlflow.start_run(run_name="My_Model_Run") as run:
    rf = RandomForestRegressor()
    rf.fit(X_train, y_train)
    mse = mean_squared_error(y_test, rf.predict(X_test))

    # Log model
    mlflow.sklearn.log_model(rf, "random_forest_model")
    run_id = run.info.run_id

2. Register the Model:

import uuid
from mlflow.tracking.client import MlflowClient

client = MlflowClient()
model_name = f"california_rf_model_{uuid.uuid4().hex[:10]}"
model_uri = f"runs:/{run_id}/random_forest_model"
model_details = mlflow.register_model(model_uri=model_uri, name=model_name)

3. Check and Update Model Status:

model_version_details = client.get_model_version(name=model_name, version=1)
print(model_version_details.status)

# Add descriptions
client.update_registered_model(name=model_details.name, description="Random Forest model for California housing dataset.")
client.update_model_version(name=model_details.name, version=1, description="Initial version with default parameters.")

4. Transition the Model to Staging:

import time
time.sleep(10) # Wait for registration to complete
client.transition_model_version_stage(name=model_details.name, version=1, stage="Staging")

5. Transition the Model to Production:

import time
time.sleep(10) # Wait for registration to complete
client.transition_model_version_stage(name=model_details.name, version=1, stage="Production")

6. Deploy the Model:

model_version_uri = f"models:/{model_name}/1"
model_version_1 = mlflow.pyfunc.load_model(model_version_uri)
predictions = model_version_1.predict(X_test)

7. Create and Deploy a New Model Version:

Repeat steps 1 and 2 with modified parameters or data preprocessing to create a new version. Then use client.transition_model_version_stage to move this new version through stages (Staging, Production).
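
A hedged sketch of this step, assuming new_run_id holds the run ID from the retrained model, and reusing the client and model_name from above:

import mlflow

new_model_uri = f"runs:/{new_run_id}/random_forest_model"
new_version = mlflow.register_model(model_uri=new_model_uri, name=model_name)

client.transition_model_version_stage(
    name=model_name,
    version=new_version.version,
    stage="Staging"
)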

8. Archiving and Deleting Old Versions:

client.transition_model_version_stage(name=model_name, version=1, stage="Archived")
client.delete_model_version(name=model_name, version=1)

Model Management:

Let's dive into model management with an example:

  1. Prepare the California housing dataset:
  • You can use sklearn.datasets.fetch_california_housing to get the dataset.
  • Split the dataset into training and testing sets.

2. Train a Random Forest model:

  • Use RandomForestRegressor from sklearn.ensemble.
  • Train the model on the training set.
  • Log the model and metrics (like MSE) using MLflow.

3. Train a Neural Network model:

  • Define a simple NN model using tensorflow.keras.
  • Train the model on the training set.
  • Enable auto-logging for TensorFlow with mlflow.tensorflow.autolog().
  • Log the NN model with MLflow.

4. Load and use the models:

  • Use mlflow.pyfunc.load_model() to load both the sklearn and Keras models as pyfunc models.
  • Use the predict method of the pyfunc models on the test set.

5. Create and log a custom Python model:

  • Define a class that extends mlflow.pyfunc.PythonModel.
  • Implement the predict function.
  • Log the model with MLflow.
  • Load the model using mlflow.pyfunc.load_model() and test its predict function.
  1. Prepare the California housing dataset:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

housing = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(housing.data, housing.target, test_size=0.2, random_state=42)

2. Train a Random Forest model:

import mlflow
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Set the experiment name
experiment_name = "Model_Management"
mlflow.set_experiment(experiment_name)

with mlflow.start_run():
    rf = RandomForestRegressor(n_estimators=100, max_depth=5)
    rf.fit(X_train, y_train)
    rf_mse = mean_squared_error(y_test, rf.predict(X_test))
    mlflow.sklearn.log_model(rf, "Model_01")
    mlflow.log_metric("mse", rf_mse)

3. Train a Neural Network model:

import mlflow.tensorflow
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

experiment_name = "Model_Management"
mlflow.set_experiment(experiment_name)

mlflow.tensorflow.autolog()

nn = Sequential([
    Dense(20, activation='relu', input_shape=(X_train.shape[1],)),
    Dense(10, activation='relu'),
    Dense(1)
])

nn.compile(optimizer='adam', loss='mean_squared_error')

with mlflow.start_run():
    nn.fit(X_train, y_train, validation_split=0.2, epochs=10)

4. Load and use the models:

rf_run_id = "53600fdcb3f548359c4aa9ba741af1c4"  # Your RF model run ID
rf_model_uri = f"runs:/{rf_run_id}/Model_01"
nn_run_id = "c7d1b11f95c248328ecec7e140386fe5" # Your NN model run ID
nn_model_uri = f"runs:/{nn_run_id}/model"


rf_pyfunc_model = mlflow.pyfunc.load_model(rf_model_uri)
nn_pyfunc_model = mlflow.pyfunc.load_model(nn_model_uri)

rf_predictions = rf_pyfunc_model.predict(X_test)
nn_predictions = nn_pyfunc_model.predict(X_test)

5. Create and log a custom Python model:

import mlflow.pyfunc

class AddN(mlflow.pyfunc.PythonModel):
    def __init__(self, n):
        self.n = n

    def predict(self, context, model_input):
        return model_input.apply(lambda column: column + self.n)

model_path = "add_n_model"
add5_model = AddN(n=5)

with mlflow.start_run() as run:
    mlflow.pyfunc.log_model(artifact_path=model_path, python_model=add5_model)
    model_uri = f"runs:/{run.info.run_id}/{model_path}"

import pandas as pd

loaded_model = mlflow.pyfunc.load_model(model_uri)
model_input = pd.DataFrame([range(10)])
model_output = loaded_model.predict(model_input)

MLflow Cheat Sheet:

# Active Run Management
mlflow.active_run() # Retrieve the current active run
mlflow.start_run() # Start a new MLflow run
mlflow.end_run() # End the current active run

# Autologging
mlflow.autolog() # Enable automatic logging of parameters, metrics, and models

# Experiments Management
mlflow.create_experiment() # Create a new experiment
mlflow.delete_experiment() # Delete an existing experiment

# Runs Management
mlflow.delete_run() # Delete a specified run
mlflow.get_run() # Retrieve details of a specified run
mlflow.search_runs() # Search for runs in a specified experiment

# Model Management
mlflow.evaluate() # Evaluate a model against specified data and targets
mlflow.get_artifact_uri() # Retrieve the URI of a specified artifact
mlflow.log_artifact() # Log a local file or directory as an artifact
mlflow.log_artifacts() # Log all files in a directory as artifacts
mlflow.log_dict() # Log a dictionary as an artifact in JSON/YAML format
mlflow.log_figure() # Log a matplotlib figure as an artifact
mlflow.log_image() # Log an image file as an artifact
mlflow.log_input() # Log input data (experimental)
mlflow.log_metric() # Log a single metric
mlflow.log_metrics() # Log multiple metrics
mlflow.log_param() # Log a single parameter
mlflow.log_params() # Log multiple parameters
mlflow.log_table() # Log a table as an artifact
mlflow.log_text() # Log text as an artifact
mlflow.pyfunc.log_model() # Log a PyFunc model
mlflow.pyfunc.save_model() # Save a PyFunc model to a directory
mlflow.register_model() # Register a model with MLflow

# Experiment Tag Management
mlflow.set_experiment() # Set the given experiment as the active one
mlflow.set_experiment_tag() # Set a tag on the current experiment
mlflow.set_experiment_tags() # Set multiple tags for the current experiment

# Other Useful Functions
mlflow.set_registry_uri() # Set the model registry server URI
mlflow.set_system_metrics_samples_before_logging() # Set the number of system metrics samples before logging (experimental)
mlflow.set_system_metrics_sampling_interval() # Set the system metrics sampling interval (experimental)
mlflow.set_tag() # Set a tag for the current active run
mlflow.set_tags() # Set multiple tags for the current active run
mlflow.set_tracking_uri() # Set the tracking server URI
mlflow.get_experiment() # Retrieve an experiment by its ID
mlflow.get_experiment_by_name() # Retrieve an experiment by its name
mlflow.get_parent_run() # Get the parent run for a given run ID
mlflow.get_registry_uri() # Get the current registry URI
mlflow.get_tracking_uri() # Get the current tracking URI
mlflow.is_tracking_uri_set() # Check if the tracking URI has been set
mlflow.last_active_run() # Get the most recent active run
mlflow.load_table() # Load a table from MLflow Tracking as a pandas.DataFrame

MLflow Code Snippets:

import mlflow
import os
import pandas as pd
import numpy as np
from PIL import Image
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_diabetes

# Starting an MLflow run
mlflow.start_run()
run = mlflow.active_run()
print(f"Active run_id: {run.info.run_id}")
mlflow.end_run()

# Autologging
mlflow.autolog()

# Creating and deleting an experiment
experiment_id = mlflow.create_experiment("New Experiment")
print(f"Experiment ID: {experiment_id}")
mlflow.delete_experiment(experiment_id)

# Deleting a run
with mlflow.start_run() as run:
    mlflow.log_param("p", 0)
    run_id = run.info.run_id
mlflow.delete_run(run_id)

# Deleting a tag
with mlflow.start_run() as run:
    mlflow.set_tag("info", "this run will have no info tag soon")
    mlflow.delete_tag("info")

# Ending a run
mlflow.start_run()
mlflow.log_param("my", "param")
mlflow.end_run()

# Evaluating a model (log it first, then evaluate the logged model URI)
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
model = LinearRegression()
model.fit(X_train, y_train)
with mlflow.start_run():
    model_info = mlflow.sklearn.log_model(model, "model")
    eval_data = pd.DataFrame(X_test, columns=load_diabetes().feature_names)
    eval_data["target"] = y_test
    mlflow.evaluate(model_info.model_uri, data=eval_data, targets="target", model_type="regressor")

# Getting artifact URI
with mlflow.start_run():
    artifact_uri = mlflow.get_artifact_uri()
    print(f"Artifact uri: {artifact_uri}")

# Getting experiment by ID and name
experiment = mlflow.get_experiment("0")
print(f"Name: {experiment.name}")
experiment = mlflow.get_experiment_by_name("Default")
print(f"Experiment_id: {experiment.experiment_id}")

# Getting parent run
with mlflow.start_run():
    with mlflow.start_run(nested=True) as child_run:
        child_run_id = child_run.info.run_id
parent_run = mlflow.get_parent_run(child_run_id)
print(f"Parent_run_id: {parent_run.info.run_id}")

# Setting and getting Registry URI
registry_uri = mlflow.get_registry_uri()
print(f"Registry URI: {registry_uri}")
mlflow.set_registry_uri("http://my-registry-server:5000")

# Getting a run
with mlflow.start_run() as run:
    mlflow.log_param("p", 0)
    run_id = run.info.run_id
run = mlflow.get_run(run_id)
print(f"Run ID: {run.info.run_id}")

# Setting and getting Tracking URI
tracking_uri = mlflow.get_tracking_uri()
print(f"Tracking URI: {tracking_uri}")
mlflow.set_tracking_uri("http://my-tracking-server:5000")

# Starting a new run
with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_metric("accuracy", 0.98)

# Logging artifacts
with mlflow.start_run():
    # Log a single file
    with open("output.txt", "w") as f:
        f.write("Hello world!")
    mlflow.log_artifact("output.txt")

    # Log all files in a directory
    if not os.path.exists("outputs"):
        os.makedirs("outputs")
    with open("outputs/output.txt", "w") as f:
        f.write("Hello world!")
    mlflow.log_artifacts("outputs")

# Logging a dictionary, figure, and image
with mlflow.start_run():
    data = {"key": "value"}
    mlflow.log_dict(data, "data.json")

    # Log a matplotlib figure
    fig, ax = plt.subplots()
    ax.plot([0, 1], [2, 3])
    mlflow.log_figure(fig, "figure.png")

    # Log an image
    image = np.random.rand(100, 100, 3) * 255
    image = Image.fromarray(image.astype('uint8')).convert('RGBA')
    mlflow.log_image(image, "image.png")

MLflow Wrapper Class:

You can also create a wrapper class like the one below; this is a basic wrapper.

import mlflow
import os

class MLflowWrapper:
    def __init__(self, tracking_uri, experiment_name, auto_log=False):
        self.tracking_uri = tracking_uri
        self.experiment_name = experiment_name
        self.auto_log = auto_log
        self._setup_mlflow()

    def _setup_mlflow(self):
        """Sets up the MLflow tracking URI and experiment."""
        mlflow.set_tracking_uri(self.tracking_uri)
        if not mlflow.get_experiment_by_name(self.experiment_name):
            mlflow.create_experiment(self.experiment_name)
        mlflow.set_experiment(self.experiment_name)

        if self.auto_log:
            mlflow.autolog()

    def start_run(self, run_name=None):
        """Starts an MLflow run."""
        return mlflow.start_run(run_name=run_name)

    def log_param(self, key, value):
        """Logs a parameter."""
        mlflow.log_param(key, value)

    def log_metric(self, key, value):
        """Logs a metric."""
        mlflow.log_metric(key, value)

    def end_run(self):
        """Ends an MLflow run."""
        mlflow.end_run()

# Example usage
tracking_uri = "http://localhost:5000"
experiment_name = "My_Experiment"

ml_wrapper = MLflowWrapper(tracking_uri, experiment_name, auto_log=True)

with ml_wrapper.start_run(run_name="test_run"):
    # Your ML code goes here
    ml_wrapper.log_param("param1", 5)
    ml_wrapper.log_metric("metric1", 0.87)

How This Works:

  1. Initialization: The MLflowWrapper class initializes with the MLflow tracking URI, experiment name, and an option to enable auto-logging.
  2. MLflow Setup: The _setup_mlflow method sets up the MLflow tracking URI and experiment. It also enables auto-logging if specified.
  3. Run Management: Methods like start_run, log_param, log_metric, and end_run are provided for managing MLflow runs, making it easier for data scientists to use these functionalities without dealing with the underlying MLflow setup details.

Usage:

  • Data scientists can use this wrapper class to interact with MLflow easily. They just need to provide the tracking URI and the experiment name.
  • The class handles the setup and provides simple methods for starting runs, logging parameters/metrics, and ending runs.
  • This approach abstracts away the MLflow setup complexity, allowing data scientists to focus more on the machine learning aspects of their projects.

MLflow 2.8: Enhancing Automated Evaluation with LLMs

MLflow 2.8 introduces a significant advancement in automated evaluation by leveraging the capabilities of Large Language Models (LLMs) such as GPT, MPT, and Llama2. This update brings about a revolutionary shift in model evaluation efficiency and cost-effectiveness.

Key Highlights:

  1. LLM-as-a-Judge Metrics: MLflow 2.8 supports automated evaluation using state-of-the-art LLMs. It allows for rapid, low-cost, and effective evaluation of unstructured outputs like chat-bot responses, approximating human-judged metrics with significant accuracy.
  2. Cost and Time Efficiency: In a case study with Databricks Documentation AI Assistant, LLM-as-a-judge metrics showed remarkable time reduction (from 2 weeks to 30 minutes) and cost savings (from $20 per task to $0.20 per task) while maintaining over 80% consistency with human scores.
  3. Custom GenAI Metrics: MLflow 2.8 allows the creation of custom GenAI metrics. For example, a Professionalism metric can be defined with a grading scale and evaluation examples, using GPT-4 as the default judge.
  4. Automated Evaluation for RAG Applications: The new version extends its utility to RAG applications, providing a framework for evaluating and tuning performance with the addition of data cleaning techniques.
  5. Data Cleaning for Enhanced Performance: Investigations revealed that data cleaning significantly improves the correctness of LLM-generated answers and reduces the number of tokens required, lowering costs and improving speed.

MLflow 2.8 UI & Analysis Tools:

  • The MLflow UI now includes an Evaluation tab for a visual comparison of GenAI metrics.
  • Evaluation results can be viewed in detail or analyzed further using tools like Pandas DataFrames.

See the MLflow documentation for more information on LLM evaluation.

MLflow Authentication:

MLflow 2.7 introduced an experimental feature for HTTP basic authentication, enhancing access control over experiments and registered models. Here’s a summary of the key points from the document:

Overview:

  • Purpose: Adds security to the MLflow Tracking Server by requiring users to log in.
  • Scope: Mainly for remote Tracking Servers accessed via REST APIs.
  • Implementation: Users and permissions are stored in a SQL database.

How It Works

  • Resources: Supports access control for Experiments and Registered Models.
  • User Management: Admin users can create and manage other users, including assigning admin status.
  • Permissions: Defines access levels (READ, EDIT, MANAGE, NO_PERMISSIONS) for experiments and models. The default is READ.

Authentication Process

  • MLflow UI: Users log in through a prompt in the browser.
  • Environment Variables: Can use MLFLOW_TRACKING_USERNAME and MLFLOW_TRACKING_PASSWORD for authentication.
  • Credentials File: Option to store credentials in ~/.mlflow/credentials.
  • REST API: Supports HTTP Authorization header for authentication.
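
A minimal sketch of the environment-variable route (the credentials here are placeholders for a user that already exists on an auth-enabled server):

import os
import mlflow

os.environ["MLFLOW_TRACKING_USERNAME"] = "my_user"       # hypothetical username
os.environ["MLFLOW_TRACKING_PASSWORD"] = "my_password"   # hypothetical password
mlflow.set_tracking_uri("http://my-tracking-server:5000")

# Subsequent tracking calls (set_experiment, start_run, log_metric, ...) authenticate
# with these credentials.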

User Creation and Management

  • Admin Required: Only admin users can create new users.
  • Interfaces: Users can be created via MLflow UI, REST API, or AuthServiceClient.

Configuration

  • File: mlflow/server/auth/basic_auth.ini.
  • Settings: Include default permissions, database URI, admin credentials, and custom authentication function.

Custom Authentication

  • Extensibility: Supports third-party plugins or custom plugins for advanced authentication methods.
  • Plugin Structure: Should be an installable Python package with an app factory and, optionally, a client to manage permissions.

Server Command

  • Enabling Authentication: Use mlflow server --app-name basic-auth.

Considerations

  • Experimental Feature: As an experimental feature, it may change in future releases.
  • Admin User: A built-in admin user is created when the feature is enabled for the first time. It’s recommended to change the default admin password immediately.
  • Permissions Database: Stored in basic_auth.db and persists users and permissions.

MLflow System Requirements:


Database Options for MLflow

  1. Supported Databases
  • SQLite: Ideal for lightweight and single-user scenarios, not suitable for high concurrency or large-scale deployments.
  • MySQL: A robust choice for medium to large-scale deployments. Offers good performance and scalability.
  • PostgreSQL: Known for its advanced features, reliability, and strong compliance with SQL standards. Suitable for large-scale deployments.
  • Microsoft SQL Server: A good option for enterprises already invested in Microsoft technologies. Offers high performance and integration with other Microsoft services.

2. Choosing a Database

  • Scalability: Consider if the database can handle increased workload and data growth.
  • Accessibility: Database should be easily accessible by the MLflow server and the team members.
  • Compatibility: Check compatibility with existing infrastructure and tools.
  • Security and Compliance: Ensure the database meets the organization’s security and compliance requirements.
  • Cost and Management: Evaluate the cost of running and managing the database, including licensing fees if applicable.

Artifact Storage in MLflow

  1. Storage Options
  • Local Filesystem: Simple and easy to set up, but not scalable or reliable for large-scale deployments.
  • Amazon S3: Highly scalable, reliable, and secure storage service. Good for high availability and large-scale storage needs.
  • Azure Blob Storage: Offers scalability, high availability, and security. Well-integrated with other Azure services.
  • Google Cloud Storage: Globally unified, scalable, and secure object storage for any size of data.
  • SFTP Server, NFS, and HDFS: Good for specific use cases such as integrating with existing infrastructure or handling large datasets.

2. Access and Security

  • Credential Management: Proper management of access credentials, using secrets management tools or environment variables.
  • Access Control: Configure access policies to control who can read/write artifacts.
  • Encryption: Ensure that data is encrypted in transit and at rest.

Tracking Server Setup

  1. Components
  • REST API Server: Handles HTTP requests from MLflow clients.
  • Backend Store: Database for storing metadata about experiments, runs, parameters, and metrics.
  • Artifact Store: Storage for artifacts like models, plots, and data files.
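
To see how the three components fit together from the client side, here is a minimal hedged sketch (the server URI is a placeholder, and the artifact store is whatever backend the server was configured with, such as S3):

import mlflow

mlflow.set_tracking_uri("http://mlflow.example.com:5000")   # REST API server
mlflow.set_experiment("remote-experiment")

with mlflow.start_run():
    mlflow.log_param("alpha", 0.5)                    # metadata goes to the backend store (database)
    mlflow.log_metric("rmse", 0.31)                   # metrics also go to the backend store
    mlflow.log_dict({"note": "demo"}, "notes.json")   # artifacts go to the artifact store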

2. Deployment Options

  • Local Machine: Suitable for individual use or small-scale testing.
  • Cloud Provider: Scalable and more secure. Options include AWS, Azure, GCP, and others.
  • Managed Services: Services like Databricks provide a managed MLflow with additional features and ease of use.

3. High Availability and Scalability

  • Load Balancing: Implement load balancers to distribute traffic and reduce the load on a single server.
  • Replication and Failover: Set up database replication and failover mechanisms for high availability.
  • Scalable Infrastructure: Use cloud services that can automatically scale resources based on demand.
  • Monitoring and Alerts: Implement monitoring tools to track system performance and set up alerts for potential issues.

Managed MLflow:

What is Managed MLflow?(From Databricks documentation)

Managed MLflow extends the functionality of MLflow, an open source platform developed by Databricks for machine learning lifecycle management, focusing on enterprise reliability, security and scalability. The latest update to MLflow introduces innovative LLMOps features that enhance its capability to manage and deploy large language models (LLMs). This expanded LLM support is achieved through new integrations with industry-standard LLM tools Hugging Face Transformers and OpenAI functions — as well as the MLflow AI Gateway. Additionally, MLflow’s integration with LangChain and Prompt Engineering UI enables simplified model development for creating generative AI applications for a variety of use cases, including chatbots, document summarization, text classification, sentiment analysis and beyond.


Conclusion:

In summary, MLflow stands out as an invaluable tool for managing the entire lifecycle of machine learning projects. Its ability to streamline the process of tracking experiments, managing models, and simplifying deployment makes it an essential asset for data scientists and machine learning engineers. With MLflow, teams can efficiently collaborate, maintaining a clear record of their work and ensuring reproducibility. Its flexibility to integrate with various tools and platforms further enhances its utility in diverse environments. Whether you are working on a small project or a large-scale enterprise solution, MLflow offers a robust, user-friendly framework to manage, track, and deploy your machine learning models with ease. In the ever-evolving field of machine learning, MLflow stands as a beacon, guiding teams toward more efficient and effective project management.

References:

  1. MLflow Official Documentation: The primary source of information for MLflow, offering comprehensive guidance on all aspects of MLflow, including setup, components, and usage.

2. MLflow GitHub Repository: The official GitHub repository is a valuable resource, providing access to the source code along with examples, issue tracking, and community contributions.


ML/DS - Certified GCP Professional Machine Learning Engineer, AWS Certified Machine Learning Specialty, Certified GCP Professional Data Engineer.