Photo by Rubén Bagüés on Unsplash

GitHub Actions, self-hosted runners on Amazon EKS & spot instances

How to spin up ephemeral runners in Kubernetes.

Jakub Jewloszewicz · Published in Level Up Coding · 10 min read · Nov 13, 2023


Imagine you have a medium-sized team of developers, and your project has gained traction, resulting in frequent code updates and deployments.

  • You have a GitHub Free org with two main repositories for your web application: one for the frontend and another for the backend.
  • Each repository has a comprehensive CI/CD pipeline declared as a GitHub Actions workflow covering tasks such as linting, unit testing, integration testing, and building Docker images.
  • Developers are actively pushing code changes multiple times a day.
  • Your application integrates with external APIs for payment processing and authentication, triggering additional workflows upon code changes.

How does it translate to GitHub minutes consumption?

Given this scenario, we can assume an average of 10 code pushes per repository per day, each triggering a workflow run. Each run takes approximately 15 minutes on average due to the parallel jobs and comprehensive testing, and there are two repositories with a single active workflow each.

Based on these estimates, the daily usage would be:

10 code pushes * 15 minutes per run * 2 repositories = 300 minutes per day

Over a month (assuming 22 workdays), the total usage would be:

300 minutes per day * 22 days = 6,600 minutes per month

The GitHub Free plan offers 2,000 free minutes per month, making it cost-effective for smaller projects and those with light CI/CD needs.

In our scenario, though, the team (on a GitHub Free org) would burn through the 2,000-minute allowance within the first seven workdays of the month.

With the GitHub Team plan and its 3,000 included minutes, the allowance would last about ten workdays, roughly the first half of the month.

This is just a simplified hypothetical scenario, and actual usage can vary based on specific workflows, code change frequency, and other factors.

Watch your GitHub Actions usage, especially in bigger teams where several workflows can run at once. Toss in deploying artifacts to different environments, wait times for approval in CD workflow steps, the rollback processes, and your free minutes disappear fast.

You’ve got a couple of options: either upgrade to a higher-tier GitHub plan (Team with 3,000 CI/CD minutes/month or Enterprise with 50,000 CI/CD minutes/month), or explore setting up self-hosted runners to meet your active development and deployment needs in a custom environment.

I’m planning to experiment with the latter.

Self-hosted GitHub Actions runners: a smart move or overkill?

Opting for self-hosted runners introduces more control over underlying hardware specifications, networking configurations, preferred operating systems, and software packages. Whether physical, virtual, containerized, on-premises, or in the cloud, self-hosted runners afford flexibility in tailoring your development environment.

I need a streamlined, robust, and cost-effective platform to manage a varying number of runners, so we’ll try out a container-based approach.

In the upcoming sections, I will walk you through my experimental setup of a GitHub Actions pipeline for a Python repo. This pipeline will be executed within the Amazon EKS cluster, with runners deployed on EC2 spot instances to lower total costs.

Keep in mind that this guide is intentionally simplified in a few places, as a production-grade setup would require more time and consideration.

Part 1: The workflow

First, let’s set up a basic GitHub Actions workflow for a Python project.

Imagine a straightforward integration pipeline that serves the purpose of catching errors early, maintaining code quality, and simplifying the delivery of software updates.

Assuming your team has agreed upon preferred developer tools for linting and testing a Python repository, we’ll keep things simple by using flake8 for code quality checks and pytest as the testing tool, both listed in the requirements.txt.
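Because the test step below reports coverage via pytest’s --cov option, the pytest-cov plugin belongs in the requirements as well. A minimal requirements.txt might look like this:

flake8
pytest
pytest-cov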

Place the following workflow file in the .github/workflows directory of your repository:
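A minimal sketch of such a workflow, assuming Python 3.11 and current versions of the official actions (adjust both to your project), could look like this:

name: CI

on: [push]

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - name: Check out the repository
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: "3.11"

      - name: Install dependencies
        run: |
          pip install virtualenv
          virtualenv venv
          . venv/bin/activate
          pip install -r requirements.txt

      - name: Lint with flake8
        run: |
          . venv/bin/activate
          flake8 --exclude=venv* --statistics

      - name: Run tests with coverage
        run: |
          . venv/bin/activate
          pytest -v --cov=my_project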

Workflow details

Once the code is checked out, we activate the actions/setup-python action, downloading the specified Python version into the runner environment.

Using virtualenv and pip, we ensure that project dependencies defined in requirements.txt are installed.

The subsequent step involves linting our codebase, where flake8 enforces style guidelines for the project.

flake8 --exclude=venv* --statistics

Another build step focuses on testing and test coverage. We limit the coverage report to include only our project’s code by using the --cov option with a specific directory or module path. This ensures the analysis and reporting of coverage for the specified files and directories.

pytest -v --cov=my_project

What happens when you trigger the workflow?

A successful build on a standard GitHub runner.

When you now run the workflow, the standard GitHub-hosted runner environment will be used and your GitHub Actions free minutes consumed.

This list will shortly be populated by self-hosted ephemeral runners.

Part 2: The cluster

We will bring an Amazon EKS cluster to life using eksctl, a CLI tool that will automate a lot of steps involved in creating EKS clusters.

We will not deep-dive into eksctl in this article. You will find all the configuration options detailed in the eksctl usage guide.

Let’s detail some parts of the cluster configuration.

Networking

We will let eksctl provision the VPC with CIDR 10.10.0.0/16 and subnets within 2 AZs. For the sake of this example, we provision a single NAT Gateway.

I’ve kept the public access endpoint to run kubectl commands without a VPN. In a real-life scenario, we’d probably want to restrict access to the cluster from the internet.

Also, our VPC doesn’t need any AWS endpoints linked to the subnets. eksctl can skip creating them by providing the option skipEndpointCreation.

Addons

Each newly created EKS cluster includes 3 add-ons by default (out of many available Amazon EKS add-ons):

  • vpc-cni implements the Kubernetes network model. It provides native VPC networking for the cluster.
  • coredns serves as the Kubernetes cluster DNS. CoreDNS pods provide name resolution for all Pods in the cluster.
  • kube-proxy maintains network rules on each Amazon EC2 node and enables network communication to your Pods.

We request the latest versions of the add-ons, which include the latest security patches and are validated by AWS to work with Amazon EKS.

Also, we need an IAM OpenID Connect (OIDC) provider so that the Amazon VPC CNI plugin can obtain its IAM permissions through a service account.

Managed node groups

The cluster add-ons mentioned above are essential for the cluster to work properly, so it matters where their pods end up running.

For instance, two replicas of the CoreDNS image are deployed by default, regardless of the number of nodes in your cluster. CoreDNS pods provide name resolution for all pods in the cluster, so workloads may experience disruptions, such as timeouts or errors, whenever CoreDNS pods are unavailable, until CoreDNS is fully operational again. To avoid this, we run multiple CoreDNS replicas across different nodes, which provides redundancy and reduces the impact of a single node restart.

To be cost efficient, however, we might decide to create a dedicated node group (mng-1-corestack) for core components, which would be backed by cheap instance types, and a separate node group (mng-2-gha-runners) for other workloads (runners).

We will use a single on-demand t3.small instance for core components and an auto-scaling group of spot instances (t3.small and t3a.small instance types) that will run GitHub workflows.

Note that we use labels to take advantage of node affinity, ensuring that runner pods are scheduled onto a specific node group.

Cluster Autoscaler will automatically adjust the number of nodes in the cluster: it adds nodes when pods cannot be scheduled due to insufficient resources and removes underutilized nodes once their pods can be rescheduled elsewhere. When inactive, the node group with runners will scale back to zero nodes.

Provisioning a cluster using a config file instead of flags

To sum it up, our cluster-config.yaml file should have the following contents:
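Here is a condensed sketch of that file; the availability zones, node group sizes, and the Cluster Autoscaler discovery tags are assumptions you’d tune for your own environment:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: gha-runners
  region: eu-central-1

availabilityZones: ["eu-central-1a", "eu-central-1b"]

vpc:
  cidr: 10.10.0.0/16
  nat:
    gateway: Single # a single NAT Gateway for this example
  clusterEndpoints:
    publicAccess: true # kubectl without a VPN; restrict in real life
    privateAccess: true

iam:
  withOIDC: true # IAM OIDC provider for service accounts (VPC CNI, Cluster Autoscaler)
  serviceAccounts:
    - metadata:
        name: cluster-autoscaler
        namespace: kube-system
      wellKnownPolicies:
        autoScaler: true # service account later referenced by the Helm chart

addons:
  - name: vpc-cni
    version: latest
  - name: coredns
    version: latest
  - name: kube-proxy
    version: latest

managedNodeGroups:
  - name: mng-1-corestack # on-demand node for core components
    instanceType: t3.small
    minSize: 1
    maxSize: 2
    desiredCapacity: 1
    labels:
      role: corestack
  - name: mng-2-gha-runners # spot capacity for ephemeral runner pods
    instanceTypes: ["t3.small", "t3a.small"]
    spot: true
    minSize: 0
    maxSize: 5
    desiredCapacity: 0
    labels:
      role: worker
    propagateASGTags: true
    tags:
      k8s.io/cluster-autoscaler/enabled: "true" # discovery tags for Cluster Autoscaler
      k8s.io/cluster-autoscaler/gha-runners: "owned"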

We will provision an Amazon EKS cluster and corresponding resources in one go:

eksctl create cluster -f cluster-config.yaml

eksctl does the heavy lifting, ensuring our EKS cluster is provisioned along with other AWS resources with ease.

The command will run for about 15 minutes.

Alright, now let’s get the Cluster Autoscaler up and running. We’ll use Helm and just toss these settings into a values.yaml file:

cloudProvider: aws
awsRegion: eu-central-1
fullnameOverride: cluster-autoscaler
nodeSelector:
  role: corestack # cluster-autoscaler pod should be placed on the corestack node
autoDiscovery:
  clusterName: gha-runners
rbac:
  serviceAccount:
    create: false # already created by eksctl
    name: cluster-autoscaler
podDisruptionBudget:
service:
  create: false
extraArgs:
  scale-down-unneeded-time: 2m

The only action left is to install the chart:

helm install cluster-autoscaler \
--namespace kube-system \
autoscaler/cluster-autoscaler \
-f values.yaml
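This assumes the autoscaler chart repository has already been registered with your Helm client; if not, add it first:

helm repo add autoscaler https://kubernetes.github.io/autoscaler
helm repo update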

In effect, we have the following pods in our EKS cluster, and they are all located on the node from the mng-1-corestack node group:

Part 3: The runners

Managing a fixed pool of individual containers can be a hassle. What if we had a pool of runners that dynamically scales in and out based on load?

Actions Runner Controller, a Kubernetes operator, does exactly that — it orchestrates and scales self-hosted runners for GitHub Actions. Head over to the Quickstart for Actions Runner Controller in the GitHub Docs to get started.

Alright, let’s install the Helm chart for Actions Runner Controller. The only customization we’re making is setting the nodeSelector to ensure pods are scheduled to the node in the mng-1-corestack node group:

helm install arc \
--create-namespace \
--namespace actions-runner-system \
--set nodeSelector.role=corestack \
oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set-controller

Moving on to the next step: installing a runner scale set.

Here, we want runner pods to run specifically on the mng-2-gha-runners node group. To achieve this, we use a nodeSelector expression again to override the default pod template. The provided nodeSelector ensures node affinity. Pretty neat, right?

GITHUB_CONFIG_URL="<your_github_repository>"
GITHUB_PAT="<your_github_access_token>"

helm install eks-runners \
--create-namespace \
--namespace actions-runners \
--set template.spec.nodeSelector.role=worker \
--set githubConfigUrl="${GITHUB_CONFIG_URL}" \
--set githubConfigSecret.github_token="${GITHUB_PAT}" \
oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set

Even before kicking off any workflows, let’s now take a peek at the current state of pods:
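A quick way to check is to list all pods together with the nodes they landed on:

kubectl get pods --all-namespaces -o wide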

All the pods so far run on the EC2 instance belonging to the mng-1-corestack node group.

What happens now when you trigger the workflow?

To observe our runners in action, we need to update the workflow to run on our managed EKS infrastructure. The runner environment is selected by pointing the runs-on keyword to eks-runners, the name of our Helm release:

jobs:
  build:
    runs-on: eks-runners # change to the Helm release name to use a self-hosted runner

Upon triggering the GitHub Actions workflow, we can reassess the pod list:

A new (spot) EC2 instance i-02b5d44b343c26cc9 has been spawned, and the eks-runners-gbknq-runner-tvjws pod has been scheduled to run on it.

The runner pod has been registered in the list of available self-hosted runners

Feel free to trigger the workflow multiple times concurrently. ARC will provision multiple ephemeral runner pods, one for each run.

Estimating costs of spot instances for self-hosted runners

Let’s revisit the calculations.

Why would one use spot instances as GitHub runner nodes inside an EKS cluster instead of simply paying for extra GitHub-hosted minutes once your GitHub Actions usage outgrows the free tier?

Pricing estimates from the AWS Pricing Calculator for a t3.small instance show that spot instances are significantly cheaper than on-demand instances.

According to the Spot Instance Advisor, that instance type has consistently provided savings of over 70% over the last 30 days.

Now, let’s keep things simple. Assume spot instances are hustling at 100% utilization, zero autoscaling delays, no idle instances, and no waiting times — a direct correlation between EC2 running time and total job execution time (6,600 minutes).

At a generous 72% off the on-demand price (0.024 USD per hour), spot capacity costs about 0.0067 USD per hour. Our 6,600 minutes translate to 110 hours, so you’re looking at roughly 0.74 USD per month for spot capacity!

However, we still need to account for the supporting infrastructure that runs 24/7:

  • the EKS cluster control plane: 73 USD per month
  • at least a single on-demand node in the mng-1-corestack node group: 17.52 USD per month

Sure, there are other variables to consider — storage volumes, data transfer, networking services, logs, and the list goes on. For limited GitHub Actions usage per month, self-hosted runners might not be the wisest financial move.

Once you breach the initial 2,000 minutes (GitHub Free plan), each subsequent minute on the smallest standard GitHub-hosted Linux runner (2 vCPUs) costs 0.008 USD, or 0.48 USD per hour.

If you double your usual usage, the monthly bill gets pretty close to the cost of maintaining an EKS cluster and its node groups: (13,200 - 2,000 free) minutes x 0.008 USD is about 90 USD, versus roughly 73 + 17.52 = 90.52 USD for the cluster baseline.

But here’s a thought: if you’re already running an EKS cluster for other tasks, why not hitch GitHub Actions runners to it? Leverage existing resources and configurations, and spot instances become a potential money-saver.

Conclusion: When to use self-hosted runners?

A hybrid approach using GitHub standard runners for lighter tasks and self-hosted runners for heavy-duty jobs makes a lot of sense.

Spot EC2 instances are a much cheaper alternative to GitHub-hosted standard runners, resulting in significant cost savings for long-running or resource-intensive CI/CD jobs.

If your project has light CI/CD needs and fits within the available free minutes, you’re good to go!

If simplicity is your priority, GitHub-hosted runners are the easier path: they are readily available and spare you the setup and maintenance hassle.
