Server logs stream into BigQuery: yes, of course

Monitor your server traffic, detect bot behavior, and create charts with ease

Simeon Emanuilov
Level Up Coding


Abstract

In this tutorial, we will go through all the settings needed to stream log data from a server to Google BigQuery.
There are other tutorials on this topic on the Internet, but I found them either unclear or full of outdated information. Various changes are needed for everything to work as expected, and I will cover them in this article. I have tested this approach in production on a speech recognition platform, Voxpow (running on Ubuntu 18.04 and Nginx), and after several weeks and large amounts of data, the results are good enough.

But why do we need this?

One question comes up right away: why do we need this? Why not just download the logs and analyze them with Pandas or another tool like GoAccess (an awesome open-source web log analyzer)?

I wanted to be able to see the data in real time and to perform some sort of analysis of the traffic. Our voice platform also has an API, and I need more information about each user's hits.

That is why I searched for the best way to implement log streaming into Google BigQuery, where we can inspect every request, IP address, and visited page.

It is like Google Analytics for all server traffic, not only for visitors detected via JavaScript. Well, not quite, but we are close…

With this technique, we are achieving:

  • the ability to monitor hits on our server in near real time;
  • the possibility to see how search engine bots crawl our website;
  • a check for too many visits from specific IP addresses.

And all the data is stored in BigQuery, where we can run lightning-fast queries using SQL syntax. On top of that, we can integrate our data with Google Data Studio or custom data visualization solutions.

What is Google BigQuery

It is a serverless, highly scalable, and cost-effective cloud data warehouse designed for business agility. It allows us to perform blazing-fast queries and get the data we need with simple SQL — in a fraction of the time.
Also, BigQuery’s high-speed streaming insertion API provides a powerful foundation for real-time analytics, making our data immediately available for analysis.
That is perfect for our needs, since we normally want to handle large amounts of information, as you can imagine from the crawling rates of the regular bots.

And what is Fluentd

In order to send all this streaming information, we need some kind of connection from our server to BigQuery. Sending information from the logs with custom methods is hard to get right, because we would need to implement the logic for log rotation, server shutdowns, recovery from the last position, and more. Fluentd is the best tool I have found so far for these needs. It is an open-source data collector for a unified logging layer. It works with both Apache and Nginx, but in this tutorial we will focus on Nginx.

Installation on Ubuntu + Nginx

Let's start and go through the process of setting everything up. The setup is performed on Ubuntu 18.04 + Nginx 1.14.

  1. Install fluentd
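
The exact installation command is not reproduced here; as a sketch, on Ubuntu 18.04 (bionic) Fluentd is typically installed through the td-agent package, using the install script from the official Fluentd documentation (check the docs for the current URL and td-agent version):

$ curl -L https://toolbelt.treasuredata.com/sh/install-ubuntu-bionic-td-agent3.sh | sh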

2. Create a BigQuery dataset

First, you need an account in Google Cloud and have to go through the process of creating a project and setting up service account credentials. You can refer to the documentation if you face any issues.

We then have to choose between using the CLI or the visual interface. If you prefer the command line, you should check out bq. It is a Python-based command-line tool for BigQuery.
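
As a rough sketch with bq, assuming the project is called voxpow and using the dataset and table names from this tutorial (the schema file is defined in the next step):

$ bq mk --dataset voxpow:logs
$ bq mk --table voxpow:logs.ubuntu ./schema.json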

From the visual console, you can create a dataset like this:

Creating a dataset in the BigQuery console

After creating the dataset, you need to make a table and define its schema. For the naming, we use voxpow:logs.ubuntu (project:dataset.table).

3. Create a JSON schema to handle the web logs + server hostname + vhost name

An important step is to define the schema with all the fields; it should match the data.

JSON schema for BigQuery
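
A minimal sketch of such a schema, assuming the fields produced by Fluentd's built-in Nginx parser plus the extra hostname and vhost columns from the step title (adjust the names and types to your own log format):

[
  {"name": "time",     "type": "TIMESTAMP"},
  {"name": "remote",   "type": "STRING"},
  {"name": "host",     "type": "STRING"},
  {"name": "user",     "type": "STRING"},
  {"name": "method",   "type": "STRING"},
  {"name": "path",     "type": "STRING"},
  {"name": "code",     "type": "INTEGER"},
  {"name": "size",     "type": "INTEGER"},
  {"name": "referer",  "type": "STRING"},
  {"name": "agent",    "type": "STRING"},
  {"name": "http_x_forwarded_for", "type": "STRING"},
  {"name": "hostname", "type": "STRING"},
  {"name": "vhost",    "type": "STRING"}
]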

If you use an Apache server, this schema needs to be adapted, as the log formats are different.

4. Install the Fluentd Google BigQuery plugins

Install the additional Fluentd plugin for communication with BigQuery:
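
With the td-agent packaging, the plugin gem is installed through td-agent-gem; the plugin used here is fluent-plugin-bigquery:

$ sudo td-agent-gem install fluent-plugin-bigquery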

5. Configure fluentd to read the nginx access log for this vhost and upload to BigQuery

Location of the configuration file: /etc/td-agent/td-agent.conf

Configuration of td agent for Nginx
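
The full configuration used on Voxpow is not reproduced here; as a rough sketch, a setup along these lines should work with fluent-plugin-bigquery v2 (the project ID, key path, and schema path are placeholders you will have to adapt):

<source>
  @type tail
  path /var/log/nginx/access.log            # access log of the vhost to stream
  pos_file /var/log/td-agent/nginx-access.pos
  tag nginx.access
  <parse>
    @type nginx                             # built-in parser for the default Nginx log format
    types code:integer,size:integer
  </parse>
</source>

<match nginx.access>
  @type bigquery_insert                     # streaming inserts into BigQuery
  auth_method json_key
  json_key /home/voxpow/service-account.json   # placeholder path to the service account key
  project voxpow                            # placeholder project ID
  dataset logs
  table ubuntu
  schema_path /etc/td-agent/schema.json     # the JSON schema from step 3
  <inject>
    time_key time                           # store the event time in the "time" column
    time_type string
    time_format %Y-%m-%d %H:%M:%S           # adjust format/timezone to your needs
    hostname_key hostname                   # store the server hostname in the "hostname" column
  </inject>
</match>

The vhost column can be filled in with a record_transformer filter (or a separate source/match pair per vhost), which is also how the setup extends to more services.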

This is perhaps the hardest part, as you need to adapt the configuration to your needs. We also use vhosts, which adds some complexity but allows us to extend this functionality to more services.

If you need to read a different log format, you can change the parser settings in the corresponding section of the configuration.

All secrets for the connection to BigQuery are placed in a JSON file in a secure location that is not accessible via HTTP. That is this part:

json_key /home/voxpow/voxpow-aSdx3sf0spf.json

If something is not working as expected, or just for confirmation, check the Fluentd logs:
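
For example, by following the default td-agent log file:

$ tail -f /var/log/td-agent/td-agent.log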

6. After making sure that the user fluentd runs as (td-agent by default) has read access to your Nginx access logs, start (or restart) fluentd.

You can control and check the Fluentd daemon like this:

$ sudo systemctl status td-agent

You can also start and stop the service:

$ sudo systemctl start td-agent
$ sudo systemctl stop td-agent
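
And after changing td-agent.conf, a restart picks up the new configuration:

$ sudo systemctl restart td-agent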

If you face a permission issue, which can be identified from the logs, it is possible to use a simple bash script like the one below to give read permissions on the log directory:
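
The exact script is not shown here; a minimal sketch, assuming the default Ubuntu/Nginx log paths (adding the td-agent user to the adm group is a cleaner alternative):

#!/bin/bash
# Let non-owner users (including td-agent) traverse the Nginx log directory
# and read the access log. Paths are the Ubuntu/Nginx defaults.
chmod o+rx /var/log/nginx
chmod o+r /var/log/nginx/access.log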

Then set a cron job at an interval adapted to your website load. Please keep in mind that the log rotation service replaces the access file, and the permissions along with it.
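
For example, a crontab entry like the following (the interval and the script path are only placeholders) re-applies the permissions every 30 minutes:

# re-apply read permissions on the Nginx logs (hypothetical script path)
*/30 * * * * /home/voxpow/fix-log-permissions.sh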

7. Congratulations, you have just set up your web access logs to stream into BigQuery.

If there are any errors, you will find them in the log: /var/log/td-agent/td-agent.log

Server logs in BigQuery

If everything is fine with the settings, you should be able to see hits in the BigQuery console.

BigQuery table with populated fields

Example SQL queries

You can adapt and use some of these example queries, but keep in mind that if you have a lot of rows, you can be charged for the “big query”; the golden rule is to ask only for the information you really need.

Example SQL Queries to use on log data
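
As a rough sketch, assuming the schema from step 3 and standard SQL, queries like these show the top client IPs from the last day and the pages requested by Googlebot:

-- Top client IPs in the last 24 hours
SELECT remote, COUNT(*) AS hits
FROM `voxpow.logs.ubuntu`
WHERE time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
GROUP BY remote
ORDER BY hits DESC
LIMIT 20;

-- Pages crawled by Googlebot (user-agent match only; verify the IP range for strict checks)
SELECT path, COUNT(*) AS crawls
FROM `voxpow.logs.ubuntu`
WHERE agent LIKE '%Googlebot%'
GROUP BY path
ORDER BY crawls DESC;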

Charts

You can build various charts by exporting the data directly to Data Studio.

Export rows to Data Studio
Explore with Data Studio
Simple Data Studio Visualization

Resource usage

In the first few weeks, we had some high CPU peaks, but they turned out to be due to a misconfiguration in td-agent.conf. This article helped us find and resolve the problems.

Summary

In this article, we discussed an example configuration for streaming logs into BigQuery. There are a lot of server log analysis tools, but I found the following key points very hard to get with other solutions:

  • analysis of Googlebot crawling;
  • real-time statistics for API usage, available from different command-line interfaces;
  • the possibility to quickly get IP information and request frequency;
  • a data warehouse storing this information for further analysis.

Using this approach, it is easy to perform a detailed analysis of search engine crawlers and get insights into how they crawl your website, which factors affect the frequency of visits, and so on.

Thanks for reading; if you liked my content and want to support me, the best way is to —

  • Connect with me on LinkedIn and GitHub, where I keep sharing such free content to become more productive at building ML systems.
  • Follow me on X (Twitter) and Medium to get instant notifications for everything new.
  • Join my YouTube channel for upcoming insightful content.
