Massively Parallel Serverless Computing in AWS

Yogesh Chinta · Published in Level Up Coding · Jan 26, 2021 · 12 min read


Summary

This article describes the design and challenges of a massively parallel scientific data processing system that ran about 5K containers concurrently to compute a hurricane wind footprint model of the US east coast in FORTRAN. In total, about 150K containers were run in a fault-tolerant manner over four days to compute the US Wind Footprint (USWFPT) model of the east coast, achieving a speed-up of roughly 10x.

Most scientific computations that require this kind of massive parallelism are usually run on supercomputers (OpenMP + FORTRAN), which few people have access to. There are multiple solutions to this problem (scaling FORTRAN code); this is one that has worked successfully at this scale. The main takeaway is the set of design principles you follow when architecting a massively parallel solution to this problem in the cloud.

Project Goals

  1. Use the existing code base written in FORTRAN.
  2. Make minimal modifications to the code base.
  3. Reduce the computation time significantly.
  4. Require zero infrastructure maintenance.
  5. Enable CI/CD (Continuous Integration and Continuous Delivery).

Why is this important?

US hurricanes are among the most expensive perils for the insurance industry globally. According to the Insurance Information Institute, hurricanes account for close to 40% of US insured catastrophe losses for 1997–2016, based on Property Claim Services data (https://www.iii.org/fact-statistic/facts-statistics-us-catastrophes). Computing the risk posed by landfalling hurricanes is therefore very important.

A bit about Hurricanes

Hurricanes (also called typhoons or tropical cyclones) are a class of storms that originate in the tropical regions worldwide and have an approximately axisymmetric, very powerful cyclonic circulation around a low-pressure, calm “eye” in the center.

Depending on the storm, the wind impact can extend anywhere from about ten miles from the storm center (Tropical Storm Marco, 2008, the smallest on record), to about 200 miles (Hurricane Irma, 2017), to over 500 miles from the storm center (Hurricane Sandy, 2012, the largest to affect the US).

This GOES-16 satellite image, taken Sunday, Sept. 1, 2019, at 17:00 UTC and provided by the National Oceanic and Atmospheric Administration (NOAA), shows Hurricane Dorian churning over the Atlantic Ocean. Hurricane Dorian struck the northern Bahamas on Sunday as a catastrophic Category 5 storm. Its 185 mph winds ripped off roofs and tore down power lines as hundreds hunkered down in schools, churches, and other shelters. (NOAA via AP)

Catastrophe models

Catastrophe models may be used to quantify the risk posed by landfalling hurricanes. These models calculate losses from a vast number of tropical cyclones (10K or more events). Damage resulting from a hurricane event is caused in three significant ways (sub-perils):

  1. Wind (structural damage to buildings)
  2. Storm surge (coastal flooding)
  3. Rainfall associated with the storm (inland flooding)

From a modeling point of view, each sub-peril is modeled separately. This example focuses only on the wind hazard. The wind-related damage estimation is based on the maximum wind gust (a 3s sustained gust at an elevation of 10m) that any location experiences during the passage of a hurricane. A complete set of footprints, one for each event in the event set, makes up the wind hazard portion of the catastrophe model.

The USWFPT model is currently written in FORTRAN and uses a set of software libraries and machine-independent data formats that support the creation, access, and sharing of array-oriented scientific data to calculate an event footprint.

Event Footprint

A collection of all the locations with significant wind gusts is called an event footprint.

To create an event footprint, one needs to know several physical parameters for the storm, such as:

  • The maximum wind around the storm center
  • The distance from the center at which the maximum wind occurs
  • The size of the storm
  • The storm track: a series of latitudes and longitudes corresponding to the storm's location during its life span
  • The coastal and inland locations for which the maximum wind gust will be calculated
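
To make these inputs concrete, here is a minimal sketch of how one event might be represented in Python; the field names and units are illustrative, not the model's actual input schema.

```python
# A sketch of the inputs for one event footprint; field names and units
# are illustrative assumptions, not the model's actual input schema.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class StormEvent:
    event_id: str
    max_wind_mph: float               # maximum wind around the storm center
    radius_max_wind_mi: float         # distance from center of the maximum wind
    storm_size_mi: float              # overall size of the storm
    track: List[Tuple[float, float]]  # (lat, lon) of the storm over its life span

# The coastal and inland grid points (~5 million in this project) for which
# the maximum wind gust is calculated.
grid_points: List[Tuple[float, float]] = []
```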

The Problem

Computing one event for a handful of locations is not particularly demanding computationally, although it should not be thought of as trivial either; there are many physical processes to be considered.

The challenge manifests itself when the calculation needs to be repeated for 150K events over a grid of locations covering the entire US eastern seaboard, on the order of 5 million grid points. At that scale, the FORTRAN computation becomes prohibitively time-consuming: computing 150K events takes over two months.

It all started with a single container.

A Journey of a thousand miles begins with a single step — Lao Tzu

In our case, it starts with a single container. Containers are a method of operating-system-level virtualization that makes it easy to test and experiment with software applications.

As mentioned in the project goals above, rewriting the code in a language other than FORTRAN was not a desirable option for various reasons, including but not limited to the deadline, developer productivity, and proprietary scientific data formats and libraries. Those of you who have worked with FORTRAN before will know that it is notoriously tricky when it comes to cross-version compatibility.

As a general rule, code compiled by a given version of the FORTRAN compiler (version 8 or above) can be linked with code compiled by a later version of the same compiler, as long as the newer version's language libraries and tools (such as the linker) are used. The converse is not true: object code compiled by a more recent version is not supported for linking with an older version's libraries and tools.

After overcoming the cross-version compatibility issues in the Docker container, the computation of a single event footprint ran successfully. Now all we needed to do was run 150K of these containers, each computing a different event footprint.

Before figuring out how and where to run thousands of these events concurrently, it helps significantly if we can make the computations embarrassingly parallel; by that, I mean the computation of an event footprint by a container housing the FORTRAN code has little or no dependency on other event footprint computations.

System Architecture

Why Cloud?

It is quite evident at this stage that we are dealing with a massively parallel computing problem and that we require a cluster. Traditional on-premises clusters force a one-size-fits-all approach to the cluster infrastructure. The cloud, however, offers a wide range of possibilities and allows for optimization of performance and cost.

You can define your entire workload as code (Infrastructure-as-code) and update it in the cloud. This enables you to automate repetitive processes or procedures. You benefit from being able to reproduce infrastructure and implement operational procedures consistently. This includes automating the job submission process and responses to events, such as job start, completion, or failure.

AWS, like any other major cloud provider, offers the capability to design the cluster around the application. With individual clusters per application, a one-size-fits-all model is no longer necessary. When running various applications on AWS, different architectures can be used to meet each application's demands, allowing the best performance while minimizing cost.

Cloud Architecture

What cloud architecture is best suited to this type of workload: a massively parallel run of 150K containers, each computing an event footprint?

Let us delve a bit deeper into this massively parallel workload by asking a series of questions and, from the answers, ascertaining an appropriate cloud architecture.

  1. Are our workloads embarrassingly parallel? Yes. As mentioned before, our workload is embarrassingly parallel: computations have little or no dependency on other computations, and the workload is not iterative.
  2. Do our workloads vary in storage requirements? No. Our workload does not differ in storage requirements; storage is driven by the desired performance and reliability for transferring, reading, and writing the data.
  3. Do our workloads vary in compute requirements? Yes. Some event footprints take a few seconds and others take a few hours, hence the need to choose an appropriate memory-to-compute ratio for the containers. You could optimize for a sweet-spot ratio (2GB RAM : 1 vCPU) across the entire workload or customize the ratio ([1GB RAM : 1 vCPU], [2GB RAM : 1 vCPU]) per workload; a sketch of per-event task sizing follows this list.
  4. Do our workloads require high network bandwidth or low latency? No. Because our workloads do not typically interact with each other, their feasibility and performance are not sensitive to the network bandwidth and latency between containers. Clustered placement groups are therefore unnecessary in our case; they would weaken resiliency without providing a performance gain.
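
For illustration, here is a minimal sketch of per-event task sizing. The cpu/memory pairs are a subset of the combinations Fargate supports (cpu expressed in units of 1024 per vCPU, memory in MiB); the runtime thresholds are assumptions.

```python
# A minimal sketch of choosing a Fargate task size per event. The cpu/memory
# pairs below are a subset of Fargate's supported combinations; the runtime
# thresholds are illustrative assumptions, not measured values.
VALID_SIZES = [
    {"cpu": "512",  "memory": "1024"},   # 0.5 vCPU : 1 GB
    {"cpu": "1024", "memory": "2048"},   # 1 vCPU   : 2 GB (the sweet spot above)
    {"cpu": "2048", "memory": "4096"},   # 2 vCPU   : 4 GB
]

def task_size_for(estimated_runtime_s: float) -> dict:
    """Give long-running events a larger allocation."""
    if estimated_runtime_s < 60:
        return VALID_SIZES[0]
    if estimated_runtime_s < 3600:
        return VALID_SIZES[1]
    return VALID_SIZES[2]
```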

After analyzing the answers to the questions above, the architecture that lends itself to such a design is a loosely coupled cloud architecture. Note that scalability is conspicuously absent from the list of questions; it is a given, since we are dealing with a massively parallel system.

Loosely coupled applications are found in many areas, including Monte Carlo simulations, image processing, genomics analysis, and Electronic Design Automation (EDA). The loss of one node or job in a loosely coupled workload usually doesn’t delay the entire calculation. The lost work can be picked up later or omitted altogether. The nodes involved in the calculation can vary in specification and power.

Serverless

The loosely coupled cloud journey often leads to an entirely serverless environment, meaning that you can concentrate on your applications and leave the server provisioning responsibility to managed services. You can run code without the need to provision or manage servers. You pay only for the compute time you consume — there is no charge when your code is not running. You upload your code or your container, and the system takes care of everything required to run and scale your code.

Scalability is another advantage of the serverless approach. Although each task may be modest in size — for example, a compute core with some memory — the architecture can spawn thousands of concurrent nodes, thus reaching a large compute throughput capacity.

AWS Fargate

Serverless in AWS is often treated as synonymous with Lambda functions, but Lambdas are not the only serverless compute engine in AWS, and they have their limitations. One of the main limitations is the maximum execution time (15 min). In our project, the compute time per task varies from a few seconds to several hours, hence the choice of Fargate, a serverless compute engine for containers. The other limitation is the set of programming languages Lambda supports: Lambda functions have no support for FORTRAN, we needed to scale FORTRAN code, and one of the best ways to do that is with containers.

In the above architecture, the user triggers these massively parallel workloads through an API call hosted by Amazon API Gateway and secured by Amazon Cognito. Before initiating a run through the API, the user must also ensure the input data is present in the appropriate S3 bucket (uswind/input).

API Gateway triggers a Lambda function that processes and validates the request and kicks off the “slicer” service.

As the name suggests, the “slicer” service, a Fargate task, is responsible for processing the input file containing approximately 150K events. These events are fed into an inbound SQS queue.
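
A minimal sketch of the slicer's core loop might look like the following; the queue name and the one-event-per-line input format are both assumptions.

```python
import boto3

sqs = boto3.client("sqs")
# "uswind-inbound" is an assumed queue name.
QUEUE_URL = sqs.get_queue_url(QueueName="uswind-inbound")["QueueUrl"]

def slice_events(input_lines):
    """Fan the ~150K event records out to the inbound SQS queue."""
    batch = []
    for line in input_lines:
        if not line.strip():
            continue
        batch.append({"Id": str(len(batch)), "MessageBody": line.strip()})
        if len(batch) == 10:  # SQS accepts at most 10 messages per batch
            sqs.send_message_batch(QueueUrl=QUEUE_URL, Entries=batch)
            batch = []
    if batch:
        sqs.send_message_batch(QueueUrl=QUEUE_URL, Entries=batch)
```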

When these events start streaming into the inbound SQS queue, the “Q-processor” Lambda function is triggered. It processes the individual events, which are then fed into an outbound SQS queue.

The “fargate-launcher” Lambda function, triggered by the outbound SQS queue, processes these requests and launches Fargate tasks in the Fargate cluster in AWS.
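
A minimal sketch of the launcher; the cluster, task definition, container, and subnet names are illustrative assumptions, not the project's actual identifiers.

```python
import boto3

ecs = boto3.client("ecs")

def handler(event, context):
    # One SQS record per event footprint; all resource names are assumed.
    for record in event["Records"]:
        ecs.run_task(
            cluster="uswind-cluster",
            taskDefinition="uswind-footprint",
            launchType="FARGATE",
            count=1,
            networkConfiguration={"awsvpcConfiguration": {
                "subnets": ["subnet-0example"],
                "assignPublicIp": "ENABLED",
            }},
            overrides={"containerOverrides": [{
                "name": "uswind",  # container name from the task definition
                "environment": [{"name": "EVENT_ID", "value": record["body"]}],
            }]},
        )
```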

We now have a stateless system that works for 10, 100, and even 1,000 events. What happens if you intend to launch 10K events? The system crashes, because AWS limits an account to 5K concurrently running Fargate tasks at any point in time.

You might also wonder why one has to go through two SQS queues and a few Lambda functions to run tasks in a Fargate cluster. The answer lies in the fact that AWS rate-limits Fargate task launches to 1 task per second, throttling RunTask API calls on a per-account, per-region basis. We plan to launch 150K events in total and ~5K concurrently. Hence, we need a retry mechanism with exponential backoff so as not to inundate the system with excessive requests.
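
A sketch of such a retry wrapper follows; ThrottlingException is the error code botocore surfaces for throttled calls, while the retry count and sleep cap are illustrative tuning values.

```python
import random
import time
from botocore.exceptions import ClientError

def with_backoff(launch_fn, max_retries=8):
    """Retry a throttled RunTask call with exponential backoff and jitter."""
    for attempt in range(max_retries):
        try:
            return launch_fn()
        except ClientError as err:
            if err.response["Error"]["Code"] != "ThrottlingException":
                raise
            # Sleep 2^attempt seconds (capped at 60s) plus random jitter.
            time.sleep(min(60, 2 ** attempt) + random.random())
    raise RuntimeError("RunTask still throttled after retries")
```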

This hard limit forces the system's design to undergo a paradigm shift from a stateless system to a stateful one. It would have been far easier if AWS did not have the 5K-parallel-task hard limit: we would have launched 150K tasks/containers in a controlled manner and been done. Instead, we must track exactly how many tasks are currently being processed and the status (success or failure) of each task in the Fargate cluster. Hence the introduction of DynamoDB into the architecture.

All Lambda functions and services (the slicer) now update DynamoDB to record the system's current state. With CloudWatch tracking tasks in the Fargate cluster and triggering an “update-status” Lambda function that processes each state change by updating DynamoDB, the metamorphosis from a stateless system to a stateful one is complete.
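
A sketch of what the “update-status” function might look like, assuming a CloudWatch Events rule matching “ECS Task State Change” events and a hypothetical DynamoDB table keyed on the task ARN.

```python
import boto3

# "uswind-tasks" is an assumed table name, keyed on taskArn.
table = boto3.resource("dynamodb").Table("uswind-tasks")

def handler(event, context):
    detail = event["detail"]  # ECS Task State Change payload
    if detail["lastStatus"] != "STOPPED":
        return                # only record terminal states
    exit_code = detail["containers"][0].get("exitCode")
    table.update_item(
        Key={"taskArn": detail["taskArn"]},
        UpdateExpression="SET #s = :s",
        ExpressionAttributeNames={"#s": "status"},  # "status" is reserved in DynamoDB
        ExpressionAttributeValues={":s": "SUCCESS" if exit_code == 0 else "FAILURE"},
    )
```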

Errors encountered in the Fargate tasks are captured by CloudWatch and collected in the DLQ (Dead Letter Queue) bucket. If a code change needs to be made, all a researcher/developer needs to do is push the FORTRAN code to the DevOps system (Azure DevOps), build the Docker image, and publish it to ECR. The latest image is referenced by the “fargate-launcher” Lambda function, which runs the container as tasks in the Fargate cluster.

Optimizing Fortran Code

There are two types of optimization in this section: one improves the algorithm in the code itself, and the other (the “-O3” flag) is a compiler optimization.

The gprof(1) command provides a detailed postmortem analysis of program timing at the subprogram level, including how many times a subprogram was called, who called it, whom it called, and how much time was spent in the routine and in the routines it called.

The baseline test result for one of the events was 12m 16s of actual run time. The initial gprof report suggested that about 4.5m of the 5m 4s of profiled time was being spent in the “apply_fric_full” subroutine.

The table above shows the gprof report and actual run time after each optimization is applied, starting with the original profiling report.

With the optimizations applied, the event's total run time was reduced from 12m 16s to 3m 49s, an ~70% reduction. One of the most effective optimizations was the “-O3” compiler flag, which alone reduced runtime by approximately 50%.

Cost Analysis

Here we analyze the cost of the most expensive part of the architecture, AWS Fargate; the other resources in the architecture are comparatively negligible in cost.

Pricing is based on requested vCPU and memory resources for the task. The two dimensions are independently configurable.

With Fargate, there are no upfront payments, and you only pay for the resources you use. You pay for the amount of vCPU and memory resources consumed by your containerized applications, rounded up to the nearest second.
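
As a back-of-the-envelope illustration, here is a sketch of the arithmetic; the rates below are the published us-east-1 Fargate prices around the time of writing, and the average runtime is an assumption, so substitute current pricing and your measured runtimes.

```python
# Assumed us-east-1 Fargate rates (check current pricing before relying on these).
VCPU_PER_HOUR = 0.04048   # USD per vCPU-hour
GB_PER_HOUR = 0.004445    # USD per GB-hour

def task_cost(vcpu: float, gb: float, runtime_hours: float) -> float:
    return runtime_hours * (vcpu * VCPU_PER_HOUR + gb * GB_PER_HOUR)

# e.g. 150K tasks at 1 vCPU / 2 GB, averaging an assumed ~15 minutes each:
total = 150_000 * task_cost(vcpu=1, gb=2, runtime_hours=0.25)
print(f"~${total:,.0f}")  # roughly $1,850 under these assumptions
```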

Conclusion

The massive parallelism achieved can be attributed to the three factors below:

  1. Containers: Using containers, we could run computations in any language, for any reasonable amount of time, and control the containers' memory-to-compute ratio. The choice of programming language is significant: FORTRAN has been in use for several decades, and there is a vast body of FORTRAN software in daily use throughout the scientific and engineering communities.
  2. Cloud: In the cloud, the network, storage type, compute type, and even the deployment method can be strategically chosen to optimize performance, cost, and usability for a particular workload, and the architecture can be dynamic, growing and shrinking to match the demands of the project.
  3. Architecture: Massively parallel workloads more or less fall into two categories: loosely coupled and tightly coupled. It is prudent to know which category suits you better and to design the architecture accordingly. This project is a loosely coupled, massively parallel workload, as shown by the analysis in the “Cloud Architecture” section above.

Workloads that took over two months on-premises now complete 150K events in approximately four days, a ~10x reduction in computation time. Achieving this required engineering a scalable, reliable, and maintainable system. Now we can confidently say we are indeed on “cloud nine.”
