Hunting memory leaks in a server side rendered React application

Gaspar Nagy · Level Up Coding · 12 min read · Sep 17, 2019

If you are using server-side rendering (SSR) for your React application, you have surely faced some issues along the way. Even if you choose a tool like Next.js, which is backed by a big company, there is always a specific situation where you run into an issue.

tl;dr

  1. Compile with Babel only, targeting the NodeJS environment
  2. Use metrics, e.g. Prometheus with Grafana, to determine whether a memory leak is happening
  3. If a problem appears, start inspecting your running NodeJS process with kill -USR1 PID
  4. Right after that, use Chrome's DevTools for profiling
  5. Our memory leak was caused by reselect and by bad usage of styled-components; both problems were found with Chrome DevTools

Backstory

We are working in an environment where one pod of our NodeJS server-side rendering application (call it Renderer) should handle 5 renders per second. With 26 pods of the Renderer, that makes 7,800 requests per minute.

In my case, the memory leak appeared when I finished refactoring a big monolithic application (call it APP), which also included our Renderer, and tested it with a production-like request load.

The refactoring included a build-process update, upgrades of outdated libraries, and the separation of some packages from the APP.

When I say library, I am talking about an npm package/library; when I say package, I am talking about one of our own packages from the monorepo.

Why the refactoring?

The APP used a yarn workspaces setup, which was intended to be a monorepo, but the APP's packages were so deeply coupled that it was actually a monolith. As a cherry on the cake, a lot of libraries were outdated, so that was more than enough reason to start the refactoring.

Our old Renderer

Our NodeJS application, the Renderer, is written in TypeScript. That's why there was a Webpack + Babel setup to generate bundles targeting the NodeJS and browser environments.

This means that after the bundles are created, we end up with two code bases: one containing the code for the NodeJS server plus the server-side-rendered client application, and a second one containing only the client application code that gets loaded in the browser.

The client React application is part of a bigger application written in ASP.NET (call it the Wrapper App), so the server-side-rendered code is processed by the Wrapper App. Also, our Renderer is used for rendering only specific React components, not the whole app. For this reason, we were using renderToString() from react-dom/server and getDataFromTree() from react-apollo. I should also mention that for code splitting with server-side rendering support, we are using react-loadable.

Monitoring tools

Apart from the following tools, we didn't have any tooling for memory leak detection.

In our Renderer, we are using Prometheus to collect monitoring data and Grafana to visualise it. There is also logging to Elasticsearch with visualisation in Kibana.

We noticed the memory leak with the help of Grafana where we visualise the following metrics:

  • garbage collector pause (nodejs_gc_pause_seconds_total)
  • garbage collector runs per second (nodejs_gc_runs_total)
  • garbage collector reclaimed bytes (nodejs_gc_reclaimed_bytes_total)
  • heap size (nodejs_heap_size_total_bytes)
  • old and new heap space size (nodejs_heap_space_size_total_bytes with space label old_space / new_space)
  • large object heap space size (nodejs_heap_space_size_total_bytes with space label large_object_space)
  • external memory usage (nodejs_external_memory_bytes)
  • process resident memory usage (process_resident_memory_bytes)
  • cpu usage (process_cpu_seconds_total)
  • total active requests (nodejs_active_requests_total)
  • total active handlers (nodejs_active_handles_total)
  • event loop lag duration (nodejs_eventloop_lag_seconds)
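
Most of the nodejs_* and process_* metrics above come from prom-client's default metrics collector; the nodejs_gc_* counters come from a separate GC stats collector. As a rough sketch (the prom-client and prometheus-gc-stats libraries are my assumption here; the article does not name the exact setup), the wiring could look like:

```js
// Sketch only: prom-client and prometheus-gc-stats are assumed
// dependencies, not something the article states explicitly.
const client = require('prom-client');
const gcStats = require('prometheus-gc-stats');

client.collectDefaultMetrics();   // registers nodejs_* and process_* metrics
gcStats(client.register)();       // registers nodejs_gc_* counters

// Expose them on /metrics with your HTTP server of choice, e.g.:
// res.end(await client.register.metrics());
```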

What can cause memory leaks in NodeJS application?

Global Variables

They stay in memory until the application stops, which is why it is dangerous to store a big amount of data in them.
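
As a minimal illustration (the cache and handler names are made up for this example), a module-level object that only ever grows keeps every entry reachable for the lifetime of the process:

```javascript
// A module-level (global) cache: nothing is ever evicted, so every
// stored payload stays reachable until the process exits.
const responseCache = {};

function handleRequest(url, payload) {
  responseCache[url] = payload; // grows without bound for unique URLs
  return responseCache[url];
}

handleRequest('/a', { big: 'data' });
handleRequest('/b', { big: 'data' });
// responseCache now holds both payloads for as long as the process runs
```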

Multiple References

This is also dangerous because you can end up with a lot of references to the same object, which then cannot be garbage collected, and your heap size will grow.

Closures

Closures have a similar problem to multiple references: you use a closure to keep references to some objects so you can use them later.
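
A small made-up sketch of that: the returned function only needs the url, but it keeps the whole request object alive for as long as the closure itself is referenced:

```javascript
// The returned closure captures requestData, so any long-lived reference
// to the closure (e.g. stored in an array) also retains the whole payload.
function makeLogger(requestData) {
  return function log() {
    return requestData.url; // requestData stays reachable via this scope
  };
}

const retained = [];
retained.push(makeLogger({ url: '/a', body: 'x'.repeat(1024) }));
// as long as `retained` lives, so does every requestData passed in
```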

The realisation

[Graph: Green: new heap space size; Yellow: old heap space size]
[Graph: 95th percentile of renderToString() duration]

Everything went just fine until the refactored application was tested with production-like request load.

As you can see on the graphs above, after a while the large object heap filled up, so the application restarted. The memory limit for a NodeJS app is 1.4 GB. On top of that, the duration of the renderToString() method from react-dom/server increased from tens/hundreds of milliseconds to seconds. The GC runs/sec also increased; normally it was under 5 runs/sec for Scavenge.

After this realisation, we started to think about possible solutions. In the past, we faced similar issues after library upgrades, so this was a starting point. Unfortunately, at that time there was no space to solve these issues, so we just downgraded those libraries.

In addition to library upgrades, we also added some new ones and started using React Context API for some scenarios.

Here is the list of important packages which got the upgrade:

  • graphql
  • react-apollo
  • apollo-client
  • apollo-cache-inmemory
  • styled-components
  • react
  • nodejs
  • webpack
  • babel
  • typescript

Also, we added reselect for memoizing the graphql() HOC. Reselect is a generic memoization library, so it seemed to be a good choice: every time some component higher in the component tree gets re-rendered, the graphql() HOC returns a new reference in props, so even with React.PureComponent we couldn't eliminate unnecessary re-rendering.

By the way, the question of using reselect with react-apollo also appeared among reselect's GitHub issues: https://github.com/reduxjs/reselect/issues/334.

Now the fun part :)

How to find the memory leak?

1. Simulating the request load

As a start, I collected some network traffic with tShark (https://www.wireshark.org/docs/man-pages/tshark.html) from production servers and saved it in CSV friendly format (comma-separated values).

tshark -T fields -E separator=, -e http.request.uri -e http.accept_language

TShark is a network protocol analyzer. It lets you capture packet data from a live network.

After a little editing, I ended up with the following format in tshark.csv:

http.request.uri,http.accept_language
/our/specific/url,en-GB
/our/other/specific/url,it-IT

Finally, I had all the data for simulating the production-like request load on our test environment. For this purpose, I used superBenchmarker (https://github.com/aliostad/SuperBenchmarker).

The setup for superBenchmarker was with sbtemplate.txt file:

Accept-Language: {{{http.accept_language}}}

and the final command:

sb -u "https://our-test-url-adress.tst{{{http.request.uri}}}" -c 50 -t sbtemplate.txt -f tshark1.csv -U -n 10000000 -y 10

With this, I started 50 concurrent requests, 10,000,000 in total, with a 10 ms delay, against https://our-test-url-adress.tst{{{http.request.uri}}}, where {{{http.request.uri}}} is the first column from our tshark.csv and the second column is used to send the Accept-Language header.

2. Remove Webpack

What does Webpack do in production mode? It bundles modules and minifies the code. If I want useful profiling data, that's not an option. That's why I had to compile with Babel only.

There were three issues I faced during this:

Yarn workspaces — symlinks in monorepo

After the different packages get compiled, the module imports pointing to our symlinked packages stay there.

These paths point to our uncompiled codebase (the symlinks in node_modules). Before, this was not a problem, because Webpack's loaders loaded those files and created a bundle from them, so no further setup was needed.

I wanted to solve this in the 'easiest' way, so I created a simple Babel plugin which changes the path of the imported module to an absolute path according to the input data.

Importing images

While Webpack has loaders for importing images or inlining them as base64, our default Babel setup unfortunately had nothing. There are several Babel plugins that help with this, but I was already into creating plugins, so I created another one. It changes image imports to constants with an empty string value, according to the input data, in this case the extensions:
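
Again as a hypothetical sketch (the extension list and plugin name are mine, not the original plugin), such a transform can work like this:

```javascript
// Hypothetical sketch: turn `import logo from './logo.png'` into
// `const logo = '';` for the configured image extensions.
function imageImportsToEmptyString({ types: t }) {
  const extensions = ['.png', '.jpg', '.jpeg', '.gif', '.svg'];
  return {
    name: 'image-imports-to-empty-string',
    visitor: {
      ImportDeclaration(importPath) {
        const source = importPath.node.source.value;
        if (!extensions.some(ext => source.endsWith(ext))) return;
        const name = importPath.node.specifiers[0].local.name;
        importPath.replaceWith(
          t.variableDeclaration('const', [
            t.variableDeclarator(t.identifier(name), t.stringLiteral('')),
          ])
        );
      },
    },
  };
}
// in the real plugin file: module.exports = imageImportsToEmptyString;
```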

Importing .graphql files

The last issue I faced was with importing .graphql files. I thought that I had already created enough Babel plugins, so I simply used the graphql-import-node library (https://www.npmjs.com/package/graphql-import-node).

Finalisation

For compiling, I created some scripts in package.json using @babel/cli (https://www.npmjs.com/package/@babel/cli).

rimraf babel-build/myPackage && babel ./packages/myPackage --out-dir ./babel-build/myPackage --config-file ./babel.js --copy-files

The --copy-files argument is used to copy all files that are not compiled (e.g. .json): https://babeljs.io/docs/en/babel-cli#copy-files.

Then I was able to start the server:

node babel-build/myPackage

3. Google Chrome DevTools

Our environment is on Kubernetes so I prepared some handy aliases for command line (bash).

kube () {
    if [ "$1" == "cc" ]; then
        kubectl config current-context "${@:2}"
    elif [ "$1" == "gc" ]; then
        kubectl config get-contexts "${@:2}"
    elif [ "$1" == "uc" ]; then
        kubectl config use-context "${@:2}"
    elif [ "$1" == "sn" ]; then
        kubectl config set-context --current --namespace="${@:2}"
    elif [ "$1" == "pf" ]; then
        kubectl port-forward "$2" 9229:9229
    elif [ "$1" == "bash" ]; then
        kubectl exec -i -t "$2" -- ash
    elif [ "$1" == "gp" ]; then
        kubectl get pods -o wide | grep "${@:2}"
    else
        kubectl "${@:1}"
    fi
}

You may notice that the bash function uses ash; that's because our main Docker image is built on top of Alpine Linux.

The kube bash function works as an alias for kubectl. If there is no match in any if clause, it executes kubectl with the given arguments.

The usage is then:

kube cc|gc|uc|sn|pf|bash|gp [...args]

How to inspect a running NodeJS process

Because I didn't want to change the Docker image or restart any pods in Kubernetes, what I could do was connect to a specific pod, find the PID of my running node process, and send a signal to it. In our case, the PID is 1.

List my pods using:

kube gp myPodsName

Now I have my specific pod name, let's say mySpecificPodName, so I can connect to it:

kube bash mySpecificPodName

Finally I can do the magic:

kill -s SIGUSR1 1

What have I done?

From the docs https://nodejs.org/en/docs/guides/debugging-getting-started/

Node.js will also start listening for debugging messages if it receives a SIGUSR1 signal. (SIGUSR1 is not available on Windows.) In Node.js 7 and earlier, this activates the legacy Debugger API. In Node.js 8 and later, it will activate the Inspector API.

So for Unix like environments, I can do the following:

kill -s SIGUSR1 PID
# or
kill -USR1 PID

For Windows users, you need to start another process in a new command line with the following:

node -e "process._debugProcess(PID)"

In both environments, this signals the NodeJS process to start the inspector.

How to connect the dev tools to a running NodeJS process

Currently, I have a running NodeJS process which is in inspect mode. What next?

From the NodeJS docs, I know that in inspect mode the NodeJS process listens for a debugging client on port 9229. So I need to forward this port to my local machine by using:

kube pf mySpecificPodName

Now I can open my Chrome on local machine and type chrome://inspect:

On this page, I need to click Open dedicated DevTools for Node:

By this, my DevTools for NodeJS should open.

Now I have everything which I need for detecting the memory leak.

What was the cause of our memory leak?

Reselect was keeping references in a global variable, so I removed the whole library. This leak was found with Allocation instrumentation on timeline under the Memory tab. Unfortunately, I didn't save the profile.

Anyway, the biggest problem we had was with styled-components v5 (beta 8 at the time of writing). Back then, I didn't know that it was caused by our bad usage of styled-components, not by the package itself.

The large object heap increased a lot so I used the Allocation instrumentation on the timeline. With this instrumentation I got the following:

After expanding the StyleSheet entry, I knew for sure that it was connected to styled-components.

You can see that the problem is too much memory being allocated by one ArrayBuffer used by a Uint32Array. So I dug into the styled-components source code and found the code for groupSizes. It is located in styled-components/src/sheet/GroupedTag.js, specifically in the insertRules method of the DefaultGroupedTag class.

The important part is the following:

Here we can see the usage of the Uint32Array. If you construct the new Uint32Array over the old one's underlying buffer (e.g. new Uint32Array(old.buffer)), the internal ArrayBuffer is shared between them; constructing from another typed array, by contrast, copies the values into a new buffer. You can find this info on the MDN web docs.
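
A quick way to see the difference between a shared view and a copy (this is plain NodeJS, not styled-components code):

```javascript
// A typed array constructed over an existing ArrayBuffer shares that
// buffer; constructing from another typed array copies the values.
const a = new Uint32Array(4);
const sharedView = new Uint32Array(a.buffer); // same underlying ArrayBuffer
const copy = new Uint32Array(a);              // fresh ArrayBuffer

a[0] = 42;
console.log(sharedView[0]); // 42: the view sees the write
console.log(copy[0]);       // 0: the copy is independent
console.log(sharedView.buffer === a.buffer); // true
console.log(copy.buffer === a.buffer);       // false
```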

I thought this was causing the problem: the insertRule() method is called many times during SSR, so under a high request load it allocates all that memory really fast.

So I proposed the following solution:

The magic here is just that I copy the elements from groupSizes into a new, empty typed array, so the internal ArrayBuffer of the original typed array is no longer shared. Great, I could move on and test my solution with the high request load.

Unfortunately, it wasn't enough. The problem with high allocation in the large object space disappeared, but the process RSS (resident set size) and external memory usage started to increase. To reveal the cause, I used the CPU profiler:

First, I recorded a CPU profile when the app started. As we can see, the toString method called inside ServerStyleSheet._emitSheetCSS takes 3.4 ms to finish.

After a while (maybe 15 minutes) under the production-like request load, I noticed that the same method took 1.97 seconds. It went up to circa 6 seconds before the pod died. What the hell?!

So I dug into the source code again. In the file styled-components/src/sheet/GroupIDAllocator.js, I found that they create the global variables groupIDRegister and reverseRegister:

They also have a reset function for these variables in the source:

The problem is that this function is used only in test utils. This means that it gets dropped from the final bundle, so we are not able to use it.

You may think that this could easily be fixed by re-exporting the function in the public API, or by using it in the ServerStyleSheet class to reset these global variables, but that would just cause another issue.

Imagine that you have a global variable and you reset it after every SSR pass. This is not a problem until you have concurrent requests.

With concurrent requests, you may reset those groupID global variables while another request is still using them. In that case, you cause partial style generation, so you end up without styles in the browser.

A possible solution is to ensure a new instance of those variables for every request. You can achieve that by wrapping the variables in a class and resetting them after SSR is done.
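
A hypothetical sketch of that idea (the class and method names are mine, not styled-components' API): each request gets its own registry instance, so a reset cannot race with another request's render:

```javascript
// Per-request group ID registry instead of module-level globals.
class GroupIDAllocator {
  constructor() {
    this.groupIDRegister = new Map(); // component name -> group id
    this.reverseRegister = new Map(); // group id -> component name
    this.nextGroupId = 0;
  }

  getGroupForId(name) {
    if (!this.groupIDRegister.has(name)) {
      const id = this.nextGroupId++;
      this.groupIDRegister.set(name, id);
      this.reverseRegister.set(id, name);
    }
    return this.groupIDRegister.get(name);
  }

  reset() {
    this.groupIDRegister.clear();
    this.reverseRegister.clear();
    this.nextGroupId = 0;
  }
}

// per SSR request:
// const allocator = new GroupIDAllocator();
// ...renderToString()...
// allocator.reset();
```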

After this theoretical solution, I created two GitHub issues to find out what was really going on and to ask for help:

The answer I got pointed out that the problem might be in our code, especially in the dynamic creation of styled components. In the next iteration, my teammates and I dug into our code to detect these use cases.

During the migration from version 3.2.3 to version 5 beta 8, our team members had added the css helper to almost every callback used inside styled components.

Like the following:
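
A hypothetical example of the pattern (the component name and styles are made up for illustration):

```js
import styled, { css } from 'styled-components';

// The misuse described above: the css helper wraps a partial that is
// just a plain string, which is unnecessary here.
const Button = styled.button`
  color: ${props => css`
    ${props.primary ? 'white' : 'black'}
  `};
`;

// The plain callback would have been enough:
// color: ${props => (props.primary ? 'white' : 'black')};
```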

It was because of the following sentence in styled-components documentation:

If you are composing your style rule as a partial, make sure to use the css helper.

Unfortunately, we missed the point that this instruction was connected to the new keyframes API. Oops :).

We also found out that in some places we were dynamically creating styled components inside a React component's render, as follows:
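
A hypothetical sketch of that anti-pattern (names are made up):

```js
import React from 'react';
import styled from 'styled-components';

// Anti-pattern: a brand new styled component is created on every render,
// so styled-components registers fresh styles for each one.
const Row = ({ width }) => {
  const Cell = styled.div`
    width: ${width}px;
  `;
  return <Cell />;
};

// Fix: define Cell once at module scope and pass width via props.
```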

Yes, so after we reworked our usage of styled components and downgraded back to the stable version (v4), all the issues disappeared.

https://imgflip.com/i/3bjm5c

Thanks a lot for reading!


Leading my own software development company https://techmates.io. We focus on long-term partnerships with our clients, where we take over part of the development.