Near-realtime Data Ingestion in Snowflake

Building near-realtime ingestion pipelines in Snowflake

Joao Marques @ Data Beyond Ltd
Published in Level Up Coding · 8 min read · Jan 19, 2021

Executive Summary

  • External tables are great, but they will never come close to the performance of Snowflake internal tables
  • Snowpipe is extremely resilient and fast; just make sure you monitor it for errors (see the sketch after this list)
  • Snowpipe is still a fairly recent technology, so some annoyances and missing features require workarounds
  • How to easily implement a Hot/Cold/Frozen data storage approach for your staging tables
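
As a rough illustration of the Snowpipe points above, here is a minimal, hypothetical sketch (the pipe, stage, and table names are invented, and the cloud notification setup behind AUTO_INGEST is omitted):

    -- Hypothetical pipe that continuously loads new JSON files from an external stage
    CREATE PIPE IF NOT EXISTS raw.events_pipe
      AUTO_INGEST = TRUE
    AS
      COPY INTO raw.events_staging
      FROM @raw.events_stage
      FILE_FORMAT = (TYPE = 'JSON');

    -- Check that the pipe is running and whether any files are still pending
    SELECT SYSTEM$PIPE_STATUS('raw.events_pipe');

    -- Surface load errors from the last 24 hours
    SELECT *
    FROM TABLE(INFORMATION_SCHEMA.COPY_HISTORY(
      TABLE_NAME => 'raw.events_staging',
      START_TIME => DATEADD('hour', -24, CURRENT_TIMESTAMP())
    ))
    WHERE error_count > 0;

A query like the COPY_HISTORY one above can be scheduled (for example with a Snowflake task) so that failed loads do not go unnoticed.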

Introduction

ELT Pipeline

When building a data warehouse or data lake, the first foundational block you will need to tackle is replication, which covers two of the three ELT steps (E+L). This article focuses on the Load part, and more specifically on how to ingest billions of rows a day in a fast, reliable, and cost-efficient way using Snowflake.

Requirements

Before digging into the details, let’s lay out the attributes one would want a staging area to have in an ideal scenario:

  • Stable, always available and as up-to-date as possible
  • Without data limitations (have all the data ever produced available)
  • High performance
  • Seamlessly available to the Business (using the same toolset and language)
  • Cost efficient

Most of these are not hard to achieve with most MPP platforms on the market, but two of them are especially important and usually the trickiest to get right.

Making all data available

When you ingest billions of rows of data daily, scale starts to creep in. Typically you would keep this data in the data lake only and ingest just a small slice into the database to avoid a big storage bill. But your business may require more than one year of data to be available “sometimes”…
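
One common way to square this circle, and the basis of the Hot/Cold/Frozen approach mentioned in the summary, is to keep only recent data in internal tables and expose older data straight from the lake through an external table. Below is a minimal, hypothetical sketch assuming Parquet files in S3; all object names are invented and credentials/storage-integration details are omitted:

    -- Hypothetical stage pointing at the data lake location holding cold/frozen data
    CREATE STAGE IF NOT EXISTS raw.events_lake_stage
      URL = 's3://my-data-lake/events/'
      FILE_FORMAT = (TYPE = 'PARQUET');

    -- External table: queryable with plain SQL, but the files stay in the lake
    CREATE EXTERNAL TABLE IF NOT EXISTS raw.events_cold (
      event_ts TIMESTAMP AS (VALUE:event_ts::TIMESTAMP),
      payload  VARIANT   AS (VALUE:payload)
    )
    LOCATION = @raw.events_lake_stage
    FILE_FORMAT = (TYPE = 'PARQUET');

    -- A view can present hot (internal) and cold (external) data as one dataset
    CREATE OR REPLACE VIEW raw.events_all AS
      SELECT event_ts, payload FROM raw.events_hot
      UNION ALL
      SELECT event_ts, payload FROM raw.events_cold;

The trade-off is the one called out in the summary: the external table will never match the performance of an internal table, but it keeps all the data ever produced available to the business without an enormous Snowflake storage bill.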

Keeping it always up-to-date

