100x Faster Data Processing in JavaScript

Cory Grinstead
Published in Level Up Coding
3 min read · Jan 4, 2022


Node.js has an extensive ecosystem with a tool for just about everything. However, its data processing tools and frameworks have always been a bit lackluster compared to those of other languages. Frameworks like Apache Spark and Pandas provide excellent interfaces for handling complex data transformations. For one reason or another, these kinds of frameworks have always been missing from the JS ecosystem.

In JavaScript & Node, there has never been a mature library with similar functionality.

Until now…

To fill this gap in the Node ecosystem, Polars has extended its officially supported languages to include Node.js.

NPM now has a Node.js library, aptly named nodejs-polars.

Polars is a blazingly fast DataFrame library implemented in Rust, using the Apache Arrow Columnar Format as its memory model.

  • Lazy | eager execution
  • Multi-threaded
  • SIMD
  • Query optimization
  • Powerful expression API

Polars consists of two main components.

Series

A Series is similar to an array, except it can only contain one data type.

DataFrame

A DataFrame contains multiple series.

What is a DataFrame anyways?

A DataFrame is a data structure that organizes data into a 2-dimensional table of rows and columns, much like a spreadsheet. DataFrames are one of the most common data structures used in modern data analytics because they are a flexible and intuitive way of storing and working with data.
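To make the Series and DataFrame idea concrete, here is a minimal plain-JavaScript sketch of a column-oriented table. It is not the nodejs-polars API: `makeFrame` and `rowAt` are hypothetical helpers, and the `price` and `quantity` columns are made-up sample data. Typed arrays stand in for a Series because they can only hold one data type.

```javascript
// Plain-JavaScript sketch of the column-oriented idea (not the nodejs-polars API).
// A "series" is one single-typed column; a "frame" is a set of equal-length
// columns addressed by name.

// A typed array can only hold one data type, much like a Series.
const price = Float64Array.from([9.5, 12.0, 3.25]);
const quantity = Int32Array.from([2, 1, 4]);

// Hypothetical helper: bundle equal-length columns into a table-like object.
function makeFrame(columns) {
  const lengths = Object.values(columns).map((col) => col.length);
  if (new Set(lengths).size > 1) {
    throw new Error("all columns must have the same length");
  }
  return { columns, height: lengths[0] ?? 0, width: lengths.length };
}

// Hypothetical helper: read one row back out of the columnar layout.
function rowAt(frame, i) {
  const row = {};
  for (const [name, col] of Object.entries(frame.columns)) row[name] = col[i];
  return row;
}

const frame = makeFrame({ price, quantity });
const firstRow = rowAt(frame, 0); // { price: 9.5, quantity: 2 }
```

Storing each column contiguously is what lets a library run one tight loop over a single data type, instead of hopping between objects row by row.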

What does any of this have to do with performance?

Polars DataFrame & Series offer unparalleled performance

To demonstrate how Polars can increase the performance of your code by 100x or more, a series of benchmarks was performed comparing Polars to other commonly used libraries from NPM.

Series Benchmark

Libraries tested: nodejs-polars, ramda, lodash
Dataset: 1 million records of randomly generated data

Figure: Series benchmarks

At its worst, Series outperformed the runner-up by ~2x.

At its best, Series outperformed by over 80x.

Overall, Series outperformed the others for every operation tested.

CSV Benchmark

Libraries tested: nodejs-polars, csv, fast-csv
Dataset: 10k-1M records of Ethereum transaction data

Figure: CSV benchmarks

Reading a CSV file with 16 columns and 500k rows into memory was 114x faster in Polars than with the native fs module, and over 170x faster than the slowest performer.

On the 10k dataset, read operations were ~40x faster. Filtering was ~15x faster than with the native fs module.

The fs filter test on 1M records was the only one to run out of memory. As a baseline, I used the fs.readFileSync method to load the data. This was likely the cause of the out-of-memory error, as it was the only test that did not use streaming or DataFrames.
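The memory difference between loading a whole file and reading it in pieces can be sketched with Node built-ins alone. This is an illustration under stated assumptions, not the benchmark code: `readLines` is a hypothetical helper, the sample file and its `hash` and `input` columns are made up, and the naive comma split ignores quoted fields.

```javascript
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");
const { StringDecoder } = require("node:string_decoder");

// Yield the lines of a file one at a time, reading fixed-size chunks so that
// only one chunk (plus one partial line) is held in memory at any moment.
function* readLines(filePath, chunkSize = 64 * 1024) {
  const fd = fs.openSync(filePath, "r");
  const buffer = Buffer.alloc(chunkSize);
  const decoder = new StringDecoder("utf8"); // copes with multi-byte chars split across chunks
  let leftover = "";
  try {
    let bytesRead;
    while ((bytesRead = fs.readSync(fd, buffer, 0, chunkSize, null)) > 0) {
      const lines = (leftover + decoder.write(buffer.subarray(0, bytesRead))).split("\n");
      leftover = lines.pop(); // the last piece may be an incomplete line
      yield* lines;
    }
    leftover += decoder.end();
    if (leftover.length > 0) yield leftover;
  } finally {
    fs.closeSync(fd);
  }
}

// Hypothetical sample file, written to a temp directory.
const samplePath = path.join(fs.mkdtempSync(path.join(os.tmpdir(), "csv-")), "tx.csv");
fs.writeFileSync(samplePath, "hash,input\na,0x\nb,0xdead\nc,0x\nd,0xbeef\n");

// Whole-file baseline: the entire file becomes one string, then one big array.
const allAtOnce = fs.readFileSync(samplePath, "utf8").trim().split("\n").slice(1)
  .filter((line) => line.split(",")[1] !== "0x").length;

// Chunked version: count matching rows without keeping the rows around.
let chunked = 0;
let isHeader = true;
for (const line of readLines(samplePath, 8)) { // tiny chunk size to exercise the boundary logic
  if (isHeader) { isHeader = false; continue; }
  if (line.split(",")[1] !== "0x") chunked += 1;
}
```

Both versions count the same rows, but the baseline's peak memory grows with the file size while the chunked version's stays roughly constant.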

Polars delivers this amazing performance by leveraging a multi-threaded Rust backend. If you want to learn more about the algorithms behind the library, give this article a read!

Using Polars

Along with incredible performance, DataFrame & Series offer expressive APIs that let you operate on your data without having to write boilerplate for common operations.

Polars has a wide range of built-in operations, ranging from statistical operations such as cumulative sum, rolling average, and standard deviation to string-based operations such as base64 decode, regex replace, and string contains

… and many more
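For a sense of the boilerplate this replaces, here is a rough plain-JavaScript sketch of two of the statistical operations mentioned above. `cumSum` and `rollingMean` are hypothetical helper names, not Polars functions, and returning `null` for positions without a full window is just one possible convention.

```javascript
// Cumulative sum: each output element is the total of all inputs up to it.
function cumSum(values) {
  let total = 0;
  return values.map((v) => (total += v));
}

// Rolling mean over a fixed window; positions without a full window yield null.
function rollingMean(values, windowSize) {
  const out = [];
  let windowTotal = 0;
  for (let i = 0; i < values.length; i++) {
    windowTotal += values[i];
    if (i >= windowSize) windowTotal -= values[i - windowSize]; // drop the value leaving the window
    out.push(i >= windowSize - 1 ? windowTotal / windowSize : null);
  }
  return out;
}

const values = [1, 2, 3, 4, 5];
const running = cumSum(values);          // [1, 3, 6, 10, 15]
const smoothed = rollingMean(values, 3); // [null, null, 2, 3, 4]
```

Each helper is short, but every project ends up writing and testing its own copy, edge cases included.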

Some Examples.

Filter records where input !== '0x' and return results as sorted list

The old way via streaming csv libraries
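A hand-written version of this filter-and-sort might look roughly like the sketch below. It uses only built-ins rather than a CSV library, the rows stand in for already-parsed records, and everything except the `input` field is an assumption: the `hash` field, the sample values, and the choice of `hash` as the sort key.

```javascript
// Hypothetical rows standing in for parsed CSV records; only `input` comes
// from the example description, the `hash` field is an assumption.
const records = [
  { hash: "0xc3", input: "0xa9059cbb" },
  { hash: "0xa1", input: "0x" },
  { hash: "0xb2", input: "0x095ea7b3" },
  { hash: "0xd4", input: "0x" },
];

// Hand-written pipeline: test each record, collect the matches, then sort.
function filterAndSort(rows) {
  const kept = [];
  for (const row of rows) {
    if (row.input !== "0x") kept.push(row);
  }
  // Assumed sort key: `hash`, ascending, compared as plain strings.
  kept.sort((a, b) => (a.hash < b.hash ? -1 : a.hash > b.hash ? 1 : 0));
  return kept;
}

const sorted = filterAndSort(records); // rows 0xb2 and 0xc3, in that order
```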

The Polars way


Extracting unique values from a single field and writing to JSON

A Streaming approach

The Old Way
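As a rough sketch of the record-at-a-time approach, the snippet below collects distinct values of one field in a `Set` and writes them out as JSON. The `from` field, the sample rows, the `uniqueValues` helper, and the output file name are all assumptions for illustration.

```javascript
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");

// Hypothetical records; the field name `from` is an assumption for illustration.
const transactions = [
  { from: "0xaa", value: 1 },
  { from: "0xbb", value: 2 },
  { from: "0xaa", value: 3 },
];

// Collect distinct values of one field as records arrive, one at a time.
function uniqueValues(rows, field) {
  const seen = new Set();
  for (const row of rows) seen.add(row[field]);
  return [...seen]; // a Set preserves first-seen insertion order
}

const uniqueSenders = uniqueValues(transactions, "from"); // ["0xaa", "0xbb"]

// Write the result to a JSON file in a temp directory.
const outPath = path.join(fs.mkdtempSync(path.join(os.tmpdir(), "unique-")), "senders.json");
fs.writeFileSync(outPath, JSON.stringify(uniqueSenders));
```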

The Polars Way


Joining on a key

A procedural approach
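One way to write such a join by hand is a hash join: index one side by the key, then probe that index once per row of the other side. The sketch below assumes an inner join, and the table contents, field names, and the `innerJoin` helper are all hypothetical.

```javascript
// Hypothetical tables; every name here is an assumption for illustration.
const transactions = [
  { txId: 1, address: "0xaa" },
  { txId: 2, address: "0xbb" },
  { txId: 3, address: "0xcc" },
];
const labels = [
  { address: "0xaa", label: "exchange" },
  { address: "0xbb", label: "wallet" },
];

// Inner join: index the right side by key, then probe it once per left row.
function innerJoin(left, right, key) {
  const index = new Map();
  for (const row of right) {
    if (!index.has(row[key])) index.set(row[key], []);
    index.get(row[key]).push(row);
  }
  const joined = [];
  for (const row of left) {
    // Left rows with no match on the right are dropped.
    for (const match of index.get(row[key]) ?? []) {
      joined.push({ ...row, ...match });
    }
  }
  return joined;
}

const joinedRows = innerJoin(transactions, labels, "address");
```

Here `joinedRows` holds the two transactions that have a label; the `0xcc` row has no match and is left out.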

The Polars Way

Wrapping Up.

Polars provides great performance, wrapped up in an easy-to-use package.

  • Polars can turbo boost simple operations like array sorting.
  • It can simultaneously simplify and speed up complicated streaming pipelines.
  • Polars can work as a high-performance replacement for libraries like Ramda, Underscore, and Lodash.

=========================

Benchmarking Hardware Specs

=========================

  • Processor: AMD Ryzen 7 Microsoft Surface (R) Edition 2.00 GHz
  • Installed RAM: 16.0 GB
  • System type: 64-bit operating system, x64-based processor
  • Linux version: Ubuntu 18.04
  • Polars version: 0.0.8
