Performance monitoring for AWS Lambda
Monitoring Lambda function performance might seem trivial, but as the dataset grows, it becomes increasingly hard to understand how your users experience the system.
As a developer, you usually care about the latency and cost of your system. The features of a good observability tool should be aligned with that while also enabling you to ask arbitrary questions about your system to figure out the scope and causes of problems.
Let’s detail how you should approach performance monitoring and figure out the root causes of Lambda function performance problems.
Performance monitoring for Lambda functions
Let’s start with what you should monitor in Lambda functions.
In general, there are two areas — user experience and the cost of the system. User experience usually comes down to availability, latency, and feature set of a service, while the cost of operating a service is important to ensure the profitability of the business.
In distributed architectures, the surface area of what to monitor becomes larger, and changes in performance and cost can often slip through unnoticed.
One of the contributing factors that make serverless applications harder to monitor is the setup overhead of analytics services. In most cases with serverless, there are a lot more units to monitor, the lifecycles are short, and monitoring agents directly contribute to latency and cost.
The good thing about serverless services is that, by default, they make themselves observable.
Observability does not mean that you have visibility; it means that the systems emit data that makes it possible to understand what is happening from the outside. This is the core principle we built Dashbird on.
Observing the cost of Lambda functions
Depending on the metric, it might make sense to observe it across all functions or individually per resource. For example, for the system’s total cost, keep an eye on it at the account level, and only if that metric experiences a significant change does it make sense to drill down to the function level.
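The account-first, drill-down-second approach can be sketched in a few lines of Python. This is a minimal illustration, not how any particular tool does it: the function name cost_drilldown, the per-function cost figures, and the 20% threshold are all hypothetical, and the input is assumed to be cost per function for two comparable periods.

```python
def cost_drilldown(previous, current, threshold=0.2):
    """Compare account-level cost between two periods.

    Returns (relative_change, ranked_deltas). ranked_deltas stays empty
    unless the account total moved by at least `threshold` (assumed 20%).
    """
    prev_total = sum(previous.values())
    curr_total = sum(current.values())
    if prev_total == 0:
        change = float("inf") if curr_total > 0 else 0.0
    else:
        change = (curr_total - prev_total) / prev_total

    # Account level looks stable: no need to look at individual functions.
    if abs(change) < threshold:
        return change, []

    # Significant change: rank functions by how much their cost moved.
    names = sorted(set(previous) | set(current))
    deltas = {n: current.get(n, 0.0) - previous.get(n, 0.0) for n in names}
    ranked = sorted(deltas.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return change, ranked


# Hypothetical daily cost in USD per function.
previous = {"checkout": 4.0, "thumbnails": 1.0, "auth": 5.0}
current = {"checkout": 4.0, "thumbnails": 4.0, "auth": 5.0}

change, ranked = cost_drilldown(previous, current)
print(f"account cost change: {change:.0%}")
print("largest mover:", ranked[0])
```

In this made-up data the account total rises by 30%, and the ranking points at a single function as the place to start digging.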
Monitoring latency of functions
Large datasets can skew the latency results, making it hard to notice when an important user-facing function has started to take longer to execute. A good way to keep an eye on latencies is to construct a custom dashboard of all mission-critical functions and observe for outliers. Once you detect a function that is taking longer than expected, you can drill down to detailed metrics.
Detailed statistics
In large datasets, averages usually hide outlying data points, making it impossible to detect that some users are experiencing significantly longer response times. As a developer, it’s not uncommon to face SLAs that require 99% of all requests to finish in under one second. A requirement like that is good because it’s actionable and easily measurable, and this is where detailed metrics such as percentiles come into play.
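To see why an average can hide what a percentile reveals, here is a small sketch. The durations are invented for illustration, the helper names percentile and meets_sla are hypothetical, and the percentile uses the simple nearest-rank definition (monitoring tools may interpolate differently).

```python
import math
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least
    pct% of all data points at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def meets_sla(durations_ms, pct=99, limit_ms=1000):
    """True if pct% of requests finished in under limit_ms."""
    return percentile(durations_ms, pct) < limit_ms


# Hypothetical sample: 97 fast requests and 3 very slow ones.
durations_ms = [120] * 97 + [4000] * 3

print("average:", mean(durations_ms))         # looks healthy
print("p99:", percentile(durations_ms, 99))   # exposes the slow tail
print("SLA met:", meets_sla(durations_ms))
```

The average of this sample stays well below one second even though 3% of requests take four seconds, so the 99%-under-one-second requirement is violated without the average ever showing it.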
Debugging Performance Issues
When you’ve detected a problem with your application, its cause might not be obvious.
Are the slow executions caused by cold starts? Does the function call a slow-responding service? Could you speed up the execution with a memory increase, or would it merely cost more money while having little impact? Let’s take it one question at a time.
Waiting for Cold Starts
You can graph cold starts over time and compare the latency of cold starts against warm invocations. If cold starts are the problem, they can be dealt with in different ways.
Cutting down the deployment package size can speed up cold starts because the Lambda service has to download the package before it starts your function from a cold state.
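If you want to do this comparison by hand, one option is to read the REPORT lines that Lambda writes to CloudWatch Logs at the end of each invocation. The sketch below assumes the usual REPORT layout, in which an "Init Duration" field is present only for cold starts, and it assumes that "Duration" does not include the init time. The sample lines and the helper name split_cold_warm are made up for illustration.

```python
import re
from statistics import median

# "Duration" without the "Billed " or "Init " prefix.
DURATION_RE = re.compile(r"(?<!Billed )(?<!Init )Duration: ([\d.]+) ms")
INIT_RE = re.compile(r"Init Duration: ([\d.]+) ms")


def split_cold_warm(log_lines):
    """Return (cold, warm) lists of total latencies in milliseconds."""
    cold, warm = [], []
    for line in log_lines:
        if not line.startswith("REPORT"):
            continue
        duration = DURATION_RE.search(line)
        if duration is None:
            continue
        init = INIT_RE.search(line)
        if init:
            # Cold start: add the init time to the handler duration.
            cold.append(float(duration.group(1)) + float(init.group(1)))
        else:
            warm.append(float(duration.group(1)))
    return cold, warm


# Hypothetical log lines in the REPORT format.
lines = [
    "REPORT RequestId: a1\tDuration: 120.50 ms\tBilled Duration: 121 ms\t"
    "Memory Size: 128 MB\tMax Memory Used: 60 MB\tInit Duration: 350.25 ms",
    "REPORT RequestId: a2\tDuration: 20.00 ms\tBilled Duration: 20 ms\t"
    "Memory Size: 128 MB\tMax Memory Used: 60 MB",
    "REPORT RequestId: a3\tDuration: 30.00 ms\tBilled Duration: 30 ms\t"
    "Memory Size: 128 MB\tMax Memory Used: 61 MB",
]

cold, warm = split_cold_warm(lines)
print("cold median ms:", median(cold))
print("warm median ms:", median(warm))
```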
If you know your access patterns from logging your users’ behavior, you can buy provisioned concurrency for Lambda functions with prolonged cold starts.
And finally, dialing up the memory of your function will also increase its CPU allocation, which, in turn, can also reduce cold start delays.
Waiting for Slow Services
Only a minority of Lambda functions do all the work on their own; most of the time, your function calls one or more other services and, in turn, has to wait for them. Is one of those service calls taking too long?
To break it down, enable X-Ray tracing for your functions, and Dashbird will connect requests with X-Ray traces, showing you exactly where the time is spent for each request. Alternatively, log an event before and after a particular service call; log entries include timestamps, so you can later measure the time between them. Dashbird takes about 5 minutes to set up, after which you get full visibility into your serverless applications and can start troubleshooting, searching logs, and receiving pre-configured alarms right away.
Once you’ve figured out which service calls are slow, you can investigate further.
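A minimal sketch of the logging approach: wrap the service call in a small context manager that logs an event before and after it and records the elapsed time. The name timed_call and the JSON field names are assumptions chosen for this example, and time.sleep stands in for a real service call.

```python
import json
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger(__name__)


@contextmanager
def timed_call(name):
    """Log before and after a block and record how long it took."""
    timing = {"call": name}
    logger.info(json.dumps({"event": "call_start", "call": name}))
    start = time.perf_counter()
    try:
        yield timing
    finally:
        # Runs even if the wrapped call raises, so failures are timed too.
        timing["elapsed_ms"] = round((time.perf_counter() - start) * 1000, 2)
        logger.info(json.dumps({"event": "call_end", **timing}))


# Usage: the sleep is a stand-in for a slow downstream request.
with timed_call("fetch_orders") as timing:
    time.sleep(0.05)

print(timing)
```

Logging the elapsed time as a structured field means you can filter and aggregate on it later instead of subtracting timestamps by hand.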
Is this just an expensive call that you can’t change? Then end the Lambda invocation after making the call and start a new one later, when the service has finished its work. For service calls that always take a long time to complete, waiting inside a Lambda invocation is bad practice because you pay for both the service call and the idle Lambda. Caching responses to slow service calls could also be a solution here.
Often, though, your service call simply does more than it needs to, and you can cut it down. Filter out the data you don’t need or try to batch multiple calls into one.
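As an illustration of the caching idea, here is a small time-to-live cache. It is a sketch under assumptions: the class name TTLCache, the 60-second lifetime, and the stand-in fetch_rates function are hypothetical. A cache object created at module level lives only as long as the warm execution environment, so it helps with repeated warm invocations but not across cold starts.

```python
import time


class TTLCache:
    """Cache values for a fixed number of seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so the cache is easy to test
        self._entries = {}   # key -> (expires_at, value)

    def get_or_call(self, key, fetch):
        """Return the cached value, or call fetch() and cache its result."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]
        value = fetch()
        self._entries[key] = (now + self._ttl, value)
        return value


# Created once per execution environment, reused by warm invocations.
rates_cache = TTLCache(ttl_seconds=60)


def fetch_rates():
    """Stand-in for a slow service call."""
    time.sleep(0.05)
    return {"EUR": 1.0, "USD": 1.1}


def handler(event, context):
    rates = rates_cache.get_or_call("rates", fetch_rates)
    return {"usd": rates["USD"]}


print(handler({}, None))  # slow: goes to the service
print(handler({}, None))  # fast: served from the cache
```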
Demanding Functions in General
If you find that your function is generally performance-hungry, you can increase its memory to speed up execution. Memory allocation in AWS Lambda also determines CPU allocation, so even if your function is CPU-bound rather than memory-bound, this configuration change could improve performance.
This is mostly a trial-and-error process, and there is usually a sweet spot beyond which adding more memory no longer makes the function faster. Use the Lambda Power Tuning tool to figure out what configuration works best for your function.
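The arithmetic behind the sweet spot is simple enough to sketch. This is not the Lambda Power Tuning tool itself, only the cost reasoning it automates: the compute charge scales with allocated memory multiplied by duration. The measurements below are invented, and the price per GB-second is an assumed value that you should replace with the current rate for your region and architecture.

```python
# Assumed price in USD per GB-second; check current AWS pricing.
PRICE_PER_GB_SECOND = 0.0000166667


def cost_per_invocation(memory_mb, duration_ms, price=PRICE_PER_GB_SECOND):
    """Compute charge for one invocation (request fee not included)."""
    return (memory_mb / 1024) * (duration_ms / 1000) * price


def summarize(measurements):
    """Return (cheapest, fastest) rows as (memory_mb, duration_ms, cost)."""
    rows = [(m, d, cost_per_invocation(m, d)) for m, d in measurements]
    cheapest = min(rows, key=lambda row: row[2])
    fastest = min(rows, key=lambda row: row[1])
    return cheapest, fastest


# Hypothetical average durations measured at each memory setting.
measurements = [
    (128, 2400),
    (256, 1150),
    (512, 560),
    (1024, 300),
    (2048, 290),
]

cheapest, fastest = summarize(measurements)
print("cheapest:", cheapest[0], "MB")
print("fastest:", fastest[0], "MB")
```

In this invented data set, going from 1024 MB to 2048 MB barely shortens the run while nearly doubling the cost, which is the plateau the text describes.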
Conclusions
Even though serverless introduced new challenges in monitoring and visibility, the right tooling and development practices can help you overcome operational and management issues. The need for agents keeps diminishing because so much information is already available from the data the services themselves emit.