Near-realtime Data Ingestion in Snowflake

Building near-realtime ingestion pipelines in Snowflake

Joao Marques @ Data Beyond Ltd
Published in Level Up Coding · 8 min read · Jan 19, 2021

Executive Summary

  • External tables are great, but they will never come close to the performance of Snowflake internal tables
  • Snowpipe is extremely resilient and fast; just make sure you monitor it for errors (see the sketch after this list)
  • Snowpipe is still a fairly recent technology, so some annoyances and missing features require workarounds
  • How to easily implement a Hot/Cold/Frozen data storage approach for your staging tables
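
As a rough illustration of the Snowpipe points above, here is a minimal, hypothetical sketch (the pipe, stage, and table names are invented, and the cloud notification setup behind AUTO_INGEST is omitted):

    -- Hypothetical pipe that continuously loads new JSON files from an external stage
    CREATE PIPE IF NOT EXISTS raw.events_pipe
      AUTO_INGEST = TRUE
    AS
      COPY INTO raw.events_staging
      FROM @raw.events_stage
      FILE_FORMAT = (TYPE = 'JSON');

    -- Check that the pipe is running and whether any files are still pending
    SELECT SYSTEM$PIPE_STATUS('raw.events_pipe');

    -- Surface load errors from the last 24 hours
    SELECT *
    FROM TABLE(INFORMATION_SCHEMA.COPY_HISTORY(
      TABLE_NAME => 'raw.events_staging',
      START_TIME => DATEADD('hour', -24, CURRENT_TIMESTAMP())
    ))
    WHERE error_count > 0;

A query like the COPY_HISTORY one above can be scheduled (for example with a Snowflake task) so that failed loads do not go unnoticed.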

Introduction

ELT Pipeline

When building a data warehouse or data lake, the first foundational block you will need to tackle is replication, which covers two of the three ELT steps (E+L). This article focuses on the Load part, and more specifically on how to ingest billions of rows a day in a fast, reliable, and cost-efficient way using Snowflake.

Requirements

Before digging into the details, let’s lay out the attributes one would want a staging area to have in an ideal scenario:

  • Stable, always available and as up-to-date as possible
  • Without data limitations (have all the data ever produced available)
  • High performance
  • Seamlessly available to the Business (using the same toolset and language)
  • Cost efficient

Most of these are not hard to achieve with most MPP platforms on the market, but two of them are especially important and usually the trickiest to get right.

Making all data available

When you ingest billions of rows of data daily, scale starts to creep in. Typically you would keep this data in the data lake only and ingest just a small slice into the database to avoid a big storage bill. But your business may require more than one year of data to be available “sometimes”…
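
One common way to square this circle, and the basis of the Hot/Cold/Frozen approach mentioned in the summary, is to keep only recent data in internal tables and expose older data straight from the lake through an external table. Below is a minimal, hypothetical sketch assuming Parquet files in S3; all object names are invented and credentials/storage-integration details are omitted:

    -- Hypothetical stage pointing at the data lake location holding cold/frozen data
    CREATE STAGE IF NOT EXISTS raw.events_lake_stage
      URL = 's3://my-data-lake/events/'
      FILE_FORMAT = (TYPE = 'PARQUET');

    -- External table: queryable with plain SQL, but the files stay in the lake
    CREATE EXTERNAL TABLE IF NOT EXISTS raw.events_cold (
      event_ts TIMESTAMP AS (VALUE:event_ts::TIMESTAMP),
      payload  VARIANT   AS (VALUE:payload)
    )
    LOCATION = @raw.events_lake_stage
    FILE_FORMAT = (TYPE = 'PARQUET');

    -- A view can present hot (internal) and cold (external) data as one dataset
    CREATE OR REPLACE VIEW raw.events_all AS
      SELECT event_ts, payload FROM raw.events_hot
      UNION ALL
      SELECT event_ts, payload FROM raw.events_cold;

The trade-off is the one called out in the summary: the external table will never match the performance of an internal table, but it keeps all the data ever produced available to the business without an enormous Snowflake storage bill.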

Keeping it always up-to-date

