Programmable Data Infrastructure is Finally Within Reach

Everything as Code for Data Infrastructure

Anna Geller
Level Up Coding


Everything as Code (EaC) is a development approach that aims to express not only software but also its infrastructure and configuration as code. Changes to resources are made programmatically through a Git workflow and a code review process rather than deployed manually. This post examines how to apply the same development philosophy to data infrastructure.

Why Everything as Code for Data Infrastructure?

Configuring data pipelines and related resources from a UI might be convenient. However, manual deployments have several drawbacks and risks. EaC can help you avoid these downsides using proven engineering methods.

  1. EaC makes it easier to reproduce environments and keep them in sync.
  2. Making changes to resources becomes as simple as editing code rather than manually deleting each resource and reconfiguring it from scratch.
  3. You can store your resource configuration in a Version Control System and maintain a history for auditability and rollback. If you stumble upon an issue, you can troubleshoot by reading the commit log and recover simply by reverting the offending change.
  4. Maintaining all resources in code allows collaboration via a pull request to ensure a proper review and approval process in your team.
  5. Code can be easily formatted and validated, helping to detect issues early on. Simply run terraform fmt and terraform validate (see the example after this list) to ensure your resources are formatted and configured properly.
  6. The best documentation is one that you don’t need to write or read because the code already explains the state of your resources.
  7. Code can be reused. Instead of clicking through UI components hundreds of times, you can declare your resource once and reuse its configuration in code for other similar resources.
  8. Finally, defining resources via code can automate many manual processes and save you time.
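
For instance, a quick local check before committing your changes can be as simple as running these two commands from the directory containing your .tf files:

terraform fmt
terraform validate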

Airbyte Terraform provider

Airbyte is an open-source data integration platform that simplifies and standardizes the process of replicating data from source systems to desired destinations, such as a data warehouse or a data lake. It provides a large number of pre-built connectors for various source systems, such as databases, APIs, and files, as well as a framework for creating new custom connectors.

Airbyte can be self-hosted or used as a managed service. This post will focus on the latter — Airbyte Cloud. One feature of this managed service is the recently launched Terraform provider, making it easy to define your data sources, destinations, and connections in code.
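
To give a rough idea of what this looks like, a minimal provider setup could resemble the snippet below. The variable name and the authentication attribute are assumptions here (they depend on the provider version and on how variables.tf is written), so check the provider documentation and the examples repository for the exact configuration; the Kestra provider used later in the post is wired up in a similar way.

terraform {
  required_providers {
    airbyte = {
      source = "airbytehq/airbyte"
    }
  }
}

provider "airbyte" {
  # API key generated in the Airbyte developer portal;
  # the attribute name may differ between provider versions
  bearer_auth = var.api_key
}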

Airbyte ingestion, dbt & Python transformation, Kestra orchestration

This post will dive into the following:

  1. Using Airbyte’s Terraform provider to manage data ingestion
  2. Orchestrating multiple Airbyte syncs in parallel using Kestra
  3. Adding data transformations with dbt and Python
  4. Scheduling the workflow using Kestra’s triggers
  5. Managing changes and deployments of all resources using Terraform

The code for the demo is available in the examples repository.

Let’s get started.

Setup

Airbyte Cloud

First, you need to sign up for an Airbyte Cloud account. Once you have an account, save your workspace ID. It’s the UUID shown in the main URL:

Airbyte UI — image by the author

Then, navigate to Airbyte’s developer portal and generate your API key. Store both the Workspace ID and API key in a terraform.tfvars file.
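
A terraform.tfvars file is a plain list of key-value assignments. Assuming the variables are named workspace_id and api_key (the actual names are defined in variables.tf in the repository), it could look like this:

# terraform.tfvars — keep this file out of version control (add it to .gitignore)
workspace_id = "00000000-0000-0000-0000-000000000000"
api_key      = "your-airbyte-api-key"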

Secrets — image by the author

Kestra setup

Download Kestra’s docker-compose file, for example, using curl:

Download command — image by the author
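
The command in the screenshot boils down to fetching the Docker Compose file and saving it locally. Something along these lines should work, although the exact URL may have changed since publication, so check Kestra’s installation docs:

curl -o docker-compose.yml https://raw.githubusercontent.com/kestra-io/kestra/develop/docker-compose.yml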

Then, run docker compose up -d and navigate to the UI. You can start building your first flows using the integrated code editor in the UI. In this demo, we’ll do that using Terraform.

Terraform setup

Make sure that Terraform is installed in your environment. If not, check out the installation docs or the following video demo that guides you through the process step by step.

Clone the GitHub repository with the example code: kestra-io/examples and navigate to the airbyte directory. Then, run:

terraform init

This will download the Airbyte and Kestra Terraform provider plugins.

Add the previously created terraform.tfvars file (containing your API key and Workspace ID) into the same directory.

Finally, you can run the following command to validate your existing configuration.

terraform validate

Deploy all resources with Terraform

Now, you can watch the magic of the EaC approach. Run:

terraform apply -auto-approve

Airbyte sources, destinations, connections, and a scheduled workflow will get automatically provisioned. In the end, you’ll see a URL to a Kestra flow in the console. Following that URL, you can trigger a workflow that will orchestrate Airbyte syncs, along with dbt and Python transformations.

How does it all work?

Having seen the process in action, you can now dive deeper into the code to understand how this works behind the scenes. The repository includes the following files (a condensed sketch follows the list):

  • sources.tf - includes configuration of three example source systems: PokeAPI, Faker (sample data), and DockerHub
Terraform code with Airbyte sources — image by the author
  • destinations.tf - configures the destination for your synced data (for reproducibility, we use a Null destination that will not load data anywhere)
Terraform code with Airbyte destinations — image by the author
  • variables.tf - declares input variables such as the Airbyte API key and workspace ID, whose values are provided in terraform.tfvars
Terraform code with Variables — image by the author
  • outputs.tf - after you run terraform apply, the output defined here returns the URL to the flow page in the Kestra UI (from which you can run the workflow)
Terraform code with Outputs — image by the author
  • main.tf - the main Terraform configuration file specifying the required providers, the Airbyte connections, and the end-to-end scheduled workflow. Note how Terraform lets resources reference each other to avoid redundancy. This way, you can ensure that your orchestrator (here, Kestra) always uses the right connection ID and that your data ingestion jobs, data pipelines, IAM roles, database schemas, and cloud resources stay in sync.
The main Terraform code with Airbyte connections and Kestra flow — image by the author
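
To give a feel for how these files fit together, here is a heavily condensed, illustrative sketch. Resource names, attribute names, variable names, and the flow namespace are assumptions based on how the Airbyte and Kestra providers looked at the time of writing and may differ in newer versions; the destination resource from destinations.tf and the dbt and Python tasks are omitted. The complete, working configuration lives in the examples repository.

# variables.tf — values are supplied via terraform.tfvars
variable "workspace_id" {
  type = string
}

variable "api_key" {
  type      = string
  sensitive = true
}

# sources.tf — one of the three example sources
resource "airbyte_source_pokeapi" "pokeapi" {
  name         = "PokeAPI"
  workspace_id = var.workspace_id
  configuration = {
    pokemon_name = "ditto"
  }
}

# main.tf — the connection references the source and destination IDs directly,
# so Terraform keeps them in sync
resource "airbyte_connection" "pokeapi_to_null" {
  name           = "PokeAPI to Null"
  source_id      = airbyte_source_pokeapi.pokeapi.source_id
  destination_id = airbyte_destination_dev_null.dev_null.destination_id
}

# main.tf — the Kestra flow is templated with the connection ID, so the
# orchestrator always points at the connection Terraform just created
resource "kestra_flow" "ingestion" {
  namespace = "company.team"
  flow_id   = "airbyte_dbt_python"
  content   = <<-EOT
    id: airbyte_dbt_python
    namespace: company.team
    tasks:
      - id: sync
        type: io.kestra.plugin.airbyte.cloud.jobs.Sync
        # the Airbyte Cloud task also needs an API token property; omitted here
        connectionId: ${airbyte_connection.pokeapi_to_null.connection_id}
  EOT
}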

End-to-end orchestration

While you could schedule data ingestion jobs directly from Airbyte, integrating those within an end-to-end orchestration pipeline gives you several advantages:

  • An easy way of parallelizing your data ingestion tasks just by wrapping them in a Parallel task,
  • Highly customizable scheduling features, such as adding multiple schedules to the same flow, temporarily disabling schedules (without redeployments), and adding custom conditions or event triggers — you can schedule your flows to run as often as every minute,
  • Integrating data ingestion with subsequent data transformations, quality checks, and other processes — the DAG view shown below demonstrates a possible end-to-end workflow, and a simplified flow definition follows it.
DAG view in the Kestra UI — image by the author
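
For intuition, the YAML that ends up in the kestra_flow resource’s content could look roughly like the sketch below. The task and trigger type names reflect the Kestra plugins available at the time of writing and may have been renamed since, the connection IDs are injected by Terraform, and the dbt and Python tasks are left out.

id: airbyte_dbt_python
namespace: company.team

tasks:
  # run all Airbyte syncs concurrently
  - id: ingestion
    type: io.kestra.core.tasks.flows.Parallel
    tasks:
      - id: sync_pokeapi
        type: io.kestra.plugin.airbyte.cloud.jobs.Sync
        connectionId: "<injected-by-terraform>"
      - id: sync_dockerhub
        type: io.kestra.plugin.airbyte.cloud.jobs.Sync
        connectionId: "<injected-by-terraform>"
  # dbt and Python transformation tasks would follow here

triggers:
  # run the whole flow every day at 9:00; multiple triggers can coexist
  - id: daily
    type: io.kestra.core.models.triggers.types.Schedule
    cron: "0 9 * * *"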

Next steps

This post covered the benefits of applying the Everything as Code approach to data infrastructure. We used Terraform with the Airbyte and Kestra provider plugins to manage data ingestion, transformation, orchestration, and scheduling — all managed as code. By embracing the EaC philosophy, you can adopt software engineering and DevOps best practices to make your data operations more resilient. If you encounter anything unexpected while reproducing this demo, you can open a GitHub issue or ask in the community Slack.
