100x Faster Data Processing in JavaScript
Node.js has a pretty extensive ecosystem with a tool for just about everything. However, its data processing tools and frameworks have always been a bit lackluster compared to those of other languages. Frameworks like Apache Spark and Pandas provide excellent interfaces for handling complex data transformations. For one reason or another, these kinds of frameworks have always been missing from the JS ecosystem.
In JavaScript & Node, there has never been a mature library with similar functionality.
Until now…
To fill this gap in the Node ecosystem, Polars has extended its officially supported languages to include Node.js.
NPM now has a Node.js library aptly named nodejs-polars.
Polars is a blazingly fast DataFrame library implemented in Rust, using the Apache Arrow Columnar Format as its memory model.
- Lazy | eager execution
- Multi-threaded
- SIMD
- Query optimization
- Powerful expression API
Polars consists of two main components.
Series
A Series is similar to an array, except it can only contain one data type.
DataFrame
A DataFrame contains multiple series.
What is a DataFrame anyways?
A DataFrame is a data structure that organizes data into a 2-dimensional table of rows and columns, much like a spreadsheet. DataFrames are one of the most common data structures used in modern data analytics because they are a flexible and intuitive way of storing and working with data.
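To make the Series and DataFrame idea a bit more concrete, here is a minimal sketch in plain JavaScript. It is not the nodejs-polars API; the typed arrays, the `toColumns` helper, and the sample records are all made up for illustration. The point is only that a single-type column can be stored as one contiguous array, and a table is a set of such columns.

```javascript
// Row-oriented: an array of records, the shape most JS code works with.
const rows = [
  { id: 1, price: 9.5 },
  { id: 2, price: 12.0 },
  { id: 3, price: 7.25 },
];

// Column-oriented: one single-type array per column. Each typed array plays
// the role of a Series; the object holding them plays the role of a DataFrame.
// (Plain JS for illustration only; this is not the nodejs-polars API.)
function toColumns(records) {
  return {
    id: Int32Array.from(records, (r) => r.id),
    price: Float64Array.from(records, (r) => r.price),
  };
}

const frame = toColumns(rows);

// A column-wise operation touches one contiguous array instead of every record.
const totalPrice = frame.price.reduce((sum, p) => sum + p, 0);
console.log(totalPrice); // 28.75
```

Because every value in a column has the same type, the whole column can sit in one block of memory, which is a large part of what makes columnar operations fast.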
What does any of this have to do with performance?
Polars DataFrame & Series offer unparalleled performance
To demonstrate how Polars can increase the performance of your code by 100x or more, a series of benchmarks was performed comparing Polars to other commonly used libraries from NPM.
Series Benchmark
Libraries tested: nodejs-polars, ramda, lodash
Dataset: 1 million records of randomly generated data
At its worst, Series outperformed the runner-up by ~2x.
At its best, Series outperformed by over 80x.
Overall, Series outperformed the others for every operation tested.
CSV Benchmark
Libraries tested: nodejs-polars, csv, fast-csv
Dataset: 10k-1M records of Ethereum transaction data
Reading a CSV file with 16 columns and 500k rows into memory was 114x faster in Polars than with the native fs module, and over 170x faster than the slowest performer.
On the 10k dataset, read operations were ~40x faster, and filtering was ~15x faster than with the native fs module.
The fs filter for 1M records was the only test to run out of memory. As a baseline I used the fs.readFileSync method for fetching the data. This was likely the cause of the out-of-memory error, as it was the only test that did not use streaming or DataFrames.
Polars delivers this amazing performance by leveraging a multi-threaded Rust backend. If you want to learn more about the algorithms behind the library, give this article a read!
Using Polars
Along with incredible performance, DataFrame & Series offer expressive APIs that allow you to operate on your data without having to write boilerplate for common operations.
Polars has a wide range of built-in operations, ranging from statistical operations such as cumulative sum, rolling average, and standard deviation to string-based operations such as base64 decode, regex replace, and string contains
… and many more
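For a sense of the boilerplate these built-ins save you from, here is roughly what two of the operations named above look like when written by hand in plain JavaScript. This is only a sketch: `cumulativeSum` and `rollingMean` are hypothetical helper names, and returning `null` for positions without a full window is a choice made here, not a description of how Polars behaves.

```javascript
// Running total: out[i] is the sum of values[0..i].
function cumulativeSum(values) {
  const out = new Array(values.length);
  let running = 0;
  for (let i = 0; i < values.length; i++) {
    running += values[i];
    out[i] = running;
  }
  return out;
}

// Rolling mean over a fixed window; positions without a full window are null.
function rollingMean(values, windowSize) {
  const out = new Array(values.length).fill(null);
  let windowSum = 0;
  for (let i = 0; i < values.length; i++) {
    windowSum += values[i];
    // Drop the value that just slid out of the window.
    if (i >= windowSize) windowSum -= values[i - windowSize];
    if (i >= windowSize - 1) out[i] = windowSum / windowSize;
  }
  return out;
}

console.log(cumulativeSum([1, 2, 3, 4])); // [ 1, 3, 6, 10 ]
console.log(rollingMean([1, 2, 3, 4], 2)); // [ null, 1.5, 2.5, 3.5 ]
```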
Some Examples.
Filter records where input !== '0x' and return the results as a sorted list
The old way via streaming csv libraries
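As a rough, dependency-free sketch of what this approach involves: assume the rows arrive one at a time as plain objects, the way a streaming CSV parser would emit them. The `filterAndSort` helper, its `sortKey` parameter, and the sample rows are all hypothetical; only the `input !== '0x'` condition comes from the task above.

```javascript
// Assumes `rows` is any iterable of plain objects, one per CSV record, as a
// streaming CSV parser would emit them. `sortKey` is a hypothetical parameter.
function filterAndSort(rows, sortKey) {
  const kept = [];
  for (const row of rows) {
    // Keep only records whose input field is not the empty '0x' value.
    if (row.input !== "0x") kept.push(row);
  }
  // Sorting needs every surviving row, so they must all be buffered first.
  kept.sort((a, b) =>
    a[sortKey] < b[sortKey] ? -1 : a[sortKey] > b[sortKey] ? 1 : 0
  );
  return kept;
}

// Tiny hand-made sample; the `hash` field and all values are made up.
function* sampleRows() {
  yield { hash: "0xc", input: "0xa9059cbb" };
  yield { hash: "0xa", input: "0x" };
  yield { hash: "0xb", input: "0x095ea7b3" };
}

const result = filterAndSort(sampleRows(), "hash");
console.log(result.map((r) => r.hash)); // [ '0xb', '0xc' ]
```

Even in this small form, the filtering, buffering, and comparator logic all have to be written and maintained by hand.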
The Polars Way
Extracting unique values from a single field and writing to JSON
A Streaming approach
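A minimal sketch of the idea, again assuming rows arrive one at a time as plain objects. The `uniqueValuesAsJson` helper and the `from` field in the sample data are made up for illustration, and the JSON is returned as a string rather than written to disk so the example stays self-contained.

```javascript
// Collect the distinct values of one field from an iterable of records and
// serialize them as a JSON array. A Set keeps values in first-seen order.
function uniqueValuesAsJson(rows, field) {
  const seen = new Set();
  for (const row of rows) {
    seen.add(row[field]);
  }
  return JSON.stringify([...seen]);
}

// Hand-made sample records; the field name `from` is an assumption.
const sample = [{ from: "0x1" }, { from: "0x2" }, { from: "0x1" }];

const json = uniqueValuesAsJson(sample, "from");
console.log(json); // ["0x1","0x2"]
```

In a real script, the returned string could be written out with Node's fs.writeFileSync.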
The Polars Way
Joining on a key
A procedural approach
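One common way to do this by hand is a hash join: index one side by the key, then look up each record from the other side. The sketch below assumes an inner join on a single shared key; the `innerJoin` helper, the field names, and the sample data are all hypothetical.

```javascript
// Inner join of two arrays of records on a shared key (a simple hash join).
function innerJoin(left, right, key) {
  // Index the right-hand records by key so each lookup is O(1).
  const index = new Map();
  for (const r of right) {
    if (!index.has(r[key])) index.set(r[key], []);
    index.get(r[key]).push(r);
  }

  const joined = [];
  for (const l of left) {
    // Left records with no match on the right are dropped (inner join).
    for (const r of index.get(l[key]) ?? []) {
      joined.push({ ...l, ...r });
    }
  }
  return joined;
}

// Hand-made sample data for illustration.
const transactions = [
  { address: "0x1", value: 5 },
  { address: "0x2", value: 7 },
  { address: "0x3", value: 1 },
];
const labels = [
  { address: "0x1", label: "exchange" },
  { address: "0x2", label: "wallet" },
];

const joinedRows = innerJoin(transactions, labels, "address");
console.log(joinedRows);
```

Here the record for address 0x3 has no matching label, so it does not appear in the joined output.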
The Polars Way
Wrapping Up.
Polars provides great performance, wrapped up in an easy-to-use package.
- Polars can turbo boost simple operations like array sorting.
- It can simultaneously simplify and speed up complicated streaming pipelines.
- Polars can work as a high-performance replacement for libraries like Ramda, Underscore, and Lodash.
=========================
Benchmarking Hardware Specs
=========================
- Processor: AMD Ryzen 7 Microsoft Surface (R) Edition 2.00 GHz
- Installed RAM: 16.0 GB
- System type: 64-bit operating system, x64-based processor
- Linux version: Ubuntu 18.04
- Polars version: 0.0.8