Analytics vs Transactional Processing
OLTP, OLAP, stars, and snowflakes
In the early days of business data processing, write operations to the database usually correspond to commercial transactions that occur: selling, placing orders with suppliers, paying employees’ salaries, etc. Transaction does not involve currency exchange, but the term transaction is stagnant, referring to a set of reads and writes that constitute a logical unit.
Even if the database started to be used for many different types of data annotations such as blog posts, actions in games, contacts in the address book, etc., the basic access pattern is still similar to processing business transactions. Applications usually use indexes to find a small number of records by certain keys. Insert or update records based on user input. Since these applications are interactive, the access mode is called online transaction processing (OLTP).
However, databases also started to be used more and more for data analysis, which has very different access patterns. Generally, analytical queries need to scan a large number of records, read only a few columns of each record, and then calculate summary statistics (such as count, sum, or average) instead of returning the raw data to the user. For example, if your data is a sales transaction table, the analytical query might be:
- What was the total revenue of each of our stores in February?
- How many more apples than usual did we sell during our latest promotion?
- Which brand of food is most often purchased together with mustard?
These queries are usually written by business analysts and fed into reports that help company management make better decisions (business intelligence). In order to distinguish this mode of using a database from transaction processing, it is called online analytical processing (OLAP)
At first, the same databases were used for both transaction processing and analytic queries. Facts have proved that SQL is very flexible in this regard: it is applicable to both OLTP-type queries and OLAP-type queries. However, in the late 1980s and early 1990s, there was a trend for companies to stop using their OLTP system for analysis and instead run analysis on a separate database. This separate database is called a data warehouse.
Data Warehousing
Companies may have dozens of different transaction processing systems: systems that power customer-facing websites, point-of-sale (checkout) systems in physical stores, inventory tracking in warehouses, route planning of vehicles, management of suppliers, employees, etc. Each of these systems is complex and needs a team of people to maintain it, so the systems end up operating mostly autonomously from each other.
These OLTP systems are generally expected to be highly available and able to process transactions with low latency, as they are often critical to business operations. Therefore, database administrators closely protect their OLTP databases. They are usually reluctant to let business analysts run ad hoc analytical queries on OLTP databases because these queries are usually expensive and scan most of the data set, which may harm the performance of concurrently executing transactions.
In contrast, a data warehouse is a separate database, and analysts can query its inner content without affecting OLTP operations. The data warehouse contains read-only copies of data in all the company’s various OLTP systems. Extract data from the OLTP database (using periodic data dumps or continuous update streams), transform it into an easy-to-analyze mode, clean it, and then load it into the data warehouse. The process of putting data into the warehouse is called extract-transform-load (ETL).
Now, almost all large enterprises have data warehouses, but in small enterprises, they are almost unheard of. This may be because most small companies do not have many different OLTP systems, and most small companies have a small amount of data. Small enough to be queried in a regular SQL database or even analyzed in a spreadsheet. In a large company, it takes a lot of heavy work to do something that is simple in a small company.
A big advantage of using a separate data warehouse instead of directly querying the OLTP system for analytics is that the data warehouse can be optimized for analytics access patterns.
The difference between OLTP database and data warehouse
The data model of a data warehouse is usually relational because SQL is usually very suitable for analytics queries. There are many graphical data analytics tools that can generate SQL queries, visualize the results and allow analysts to browse the data (through operations such as drill down, slice, and dice).
On the surface, data warehouses and relational OLTP databases look similar because they both have SQL query interfaces. However, the internals of the system may look completely different because they are optimized for very different query patterns. Many database vendors now focus on supporting transaction processing or analytical workloads, but they cannot support both at the same time.
Certain databases (such as Microsoft SQL Server and SAP HANA) support transaction processing and data warehousing in the same product. However, they are increasingly becoming two independent storage and query engines, which happen to be accessible through a public SQL interface.
Data warehouse vendors (such as Teradata, Vertica, SAP HANA, and ParAccel) usually sell their systems under expensive commercial licenses. Amazon Redshift is the hosted version of ParAccel. Recently, there have been many open-source SQL-based Hadoop projects. They are young but aim to compete with commercial data warehouse systems. These include Apache Hive, Spark SQL, Cloudera Impala, Facebook Presto, Apache Tajo, and Apache Drill. Some of them are based on the ideas of Google Dremel.
Stars and Snowflakes Schemas
Many data warehouses are used in a fairly formulaic style, known as a star schema (also known as dimensional modeling).
The example schema below was modeled using the start schema pattern. At the center of the schema is a so-called fact table. Each row of the fact table represents an event that occurred at a particular time.
Usually, the facts are captured as a single event, because this allows the greatest flexibility for later analysis. However, this means that the fact table can become very large. Large companies like Apple, Walmart, or eBay may have dozens of petabytes of transaction history in their data warehouse, most of which are actually in fact tables.
Some of the columns in the fact table are attributes, such as the units sold of the product and the revenue. The other columns in the fact table are foreign key references to other tables (called dimension tables). Since each row in the fact table represents an event, the dimensions represent the person, event, location, time, method, and reason of the event.
Even dates and times are usually represented by dimension tables because this can encode other information about dates (such as public holidays) so that the query can distinguish sales during holidays and non-holidays.
The name “star schema” comes from the fact that when visualizing table relationships, the fact table is in the middle and surrounded by its dimension tables; the connection with these tables is like the light of stars.
A variation of this template is called the snowflake pattern, where the size is further subdivided into multiple sub-dimensions. For example, there may be separate tables for brands and product categories, and each row in the dimension product table can reference the brand and category as foreign keys instead of storing them as strings in the table. Snowflake schemas are more normalized than star schemas, but star schemas are often preferred because they are simpler for analysts to work with.
In a typical data warehouse, tables are usually very wide: fact tables usually have more than 100 columns, sometimes hundreds. The dimension tables may also be very wide because they contain all metadata that may be relevant to the analysis.
Final thoughts
OLTP systems are usually user-oriented, which means they may see a large number of requests. To handle the load, the application usually touches only a few records in each query. The application uses a certain key to request records, and the storage engine uses an index to find the data of the requested key. Disk seek time is usually the bottleneck here.
Data warehouses and similar analytics systems are little known because they are mainly used by business analysts rather than end-users. Compared with OLTP systems, they handle much fewer queries, but the requirements for each query are usually high, requiring millions of records to be scanned in a short time. Disk bandwidth (rather than seek time) is usually the bottleneck here.