Extract Tables and Text from PDF Files in R
Introducing Tabulizer: The R library for processing PDF files.

Motivation
Suppose you have an idea for a data analysis, but when you search for the data, you realize it is not available as a CSV file or even a spreadsheet. The data is in a PDF file.
What do you do then? You will probably retype the data into a spreadsheet application by hand, or copy everything out of the PDF, paste it into a spreadsheet, and clean it up there.
Both approaches take a lot of time and, worse, they invite human error. For example, you could easily enter a wrong value.
Thankfully, there is an R library called tabulizer that can extract tables and text from PDF files automatically and in a short amount of time.
In this article, I will introduce you to the tabulizer library in R and show you what other features it offers. Without further ado, let's get started!
Implementation
Install and load the library
The first thing we will do is install the package with the install.packages function. After that, we can load it with the library function. The code looks like this:
install.packages('tabulizer')
library(tabulizer)
Please note: before you can use the library, make sure that Java is installed on your computer, because tabulizer is a binding to a Java library called tabula-java. If Java is not installed yet, download and install it first.
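Before installing, it can help to confirm that Java is visible from R. The following is a minimal sketch in base R. The name has_java is a hypothetical helper, not part of tabulizer, and it only looks at the PATH, so a Java installation that is reachable through JAVA_HOME alone would not be detected.

```r
# Hypothetical helper: TRUE if a `java` executable is found on the PATH.
has_java <- function() {
  unname(nzchar(Sys.which("java")))
}

if (has_java()) {
  message("Java found at: ", Sys.which("java"))
} else {
  message("Java not found on the PATH; install it before loading tabulizer.")
}
```

If the check reports that Java is missing even though it is installed, the PATH of the R session is the first thing to look at.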
The data
For this article, I will use a PDF file from Badan Pusat Statistik (BPS, Statistics Indonesia) called “Luas Panen dan Produksi Padi di Provinsi Riau 2019” (Harvested Area and Rice Production in Riau Province 2019).
The PDF file contains data on rice production in Riau Province in 2019, broken down by region and by month. You can access the PDF file here.
In the following sections, I will refer to the file as “file.pdf”.
Extract the table
Now let's play with the PDF file using the tabulizer library. The first thing we can do is extract a table from the PDF file. As an example, we will extract the table on page 60.
To extract the table, we can use the extract_tables function. The function returns a list that contains one or more tables. The code looks like this:
# Extract the tables on page 60
tabel <- extract_tables('file.pdf', pages = 60)
# View the first table in the returned list
View(tabel[[1]])
[Figure: preview of the result]
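An extracted table usually needs a little tidying before analysis. The sketch below assumes that the table comes back as a character matrix whose first row holds the column headers, which may not be true for every PDF. The matrix raw is a made-up stand-in for tabel[[1]], and matrix_to_df is a hypothetical helper name.

```r
# Stand-in for tabel[[1]]: a small character matrix whose first row is the header.
# The values are made up for illustration and are not taken from the BPS report.
raw <- matrix(
  c("Region", "Jan", "Feb",
    "A",      "10",  "12",
    "B",      "7",   "9"),
  ncol = 3, byrow = TRUE
)

# Hypothetical helper: promote the first row to column names and
# convert columns that contain only numbers to a numeric type.
matrix_to_df <- function(m) {
  df <- as.data.frame(m[-1, , drop = FALSE], stringsAsFactors = FALSE)
  names(df) <- m[1, ]
  df[] <- lapply(df, function(col) utils::type.convert(col, as.is = TRUE))
  df
}

df <- matrix_to_df(raw)
str(df)
```

Numbers written with thousands separators or decimal commas would stay as text after this step and would need their own cleaning.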
We can also extract a table interactively by selecting its area on the page. To do that, we can use the extract_areas function. The command looks like this:
extract_areas('file.pdf', pages = 60)
[Figure: the interactive selection process and the result]

Extract the text
Besides tables, we can also extract text from the PDF file. We can use the extract_text function to gather the text. The command looks like this:
cat(extract_text('file.pdf', pages = 6), sep = "\n")
[Figure: the PDF page on the left and the code result on the right]
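To work with the extracted text line by line rather than just printing it, the string can be split on line breaks. This is a minimal base R sketch. The string txt is a made-up stand-in for the value returned by extract_text, assumed here to be a single string with embedded line breaks.

```r
# Stand-in for the string returned by extract_text(); the content is made up.
txt <- "Title of the page\r\n\r\n  First line of text  \nSecond line of text\n"

# Split on line breaks (handling both \n and \r\n), trim surrounding
# whitespace, and drop empty lines.
lines <- trimws(strsplit(txt, "\r?\n")[[1]])
lines <- lines[nzchar(lines)]

print(lines)
```

The resulting character vector can then be filtered with functions such as grepl to pick out the lines of interest.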
Other Features
The library has other features as well. The first is getting the number of pages. You can use the get_n_pages function for that. The command looks like this:
get_n_pages("file.pdf")
[Figure: preview of the result]
The second feature is getting the page dimensions of the PDF file. You can use the get_page_dims function to retrieve them. The command looks like this:
get_page_dims("file.pdf", pages = 1)
[Figure: preview of the result]
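The dimensions are easier to interpret in physical units. The sketch below assumes that the function returns a list with one width and height pair per page and that the values are PDF points (72 points per inch). The list dims is a made-up stand-in that uses the usual size of an A4 page, and points_to_mm is a hypothetical helper name.

```r
# Stand-in for get_page_dims("file.pdf", pages = 1): one c(width, height) pair.
# 595.28 x 841.89 points is the usual size of an A4 page, used here as an example.
dims <- list(c(595.28, 841.89))

# Hypothetical helper: convert points to millimetres
# (72 points = 1 inch = 25.4 mm).
points_to_mm <- function(pt) pt / 72 * 25.4

dims_mm <- lapply(dims, points_to_mm)
print(round(dims_mm[[1]]))
```

For the A4 example this gives 210 by 297 millimetres.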
The last feature is converting a page of a PDF file to an image. You can use the make_thumbnails function for that. The command looks like this:
make_thumbnails("file.pdf", "<FILE_PATH>", pages = 60)
OUTPUT:
>> "<FILE_PATH>/file60.png"
[Figure: preview of the resulting image]
Final Remarks
Congratulations! You have learned how to use the tabulizer library to extract tables and text from PDF files. I hope it is useful to you and that you can apply it to other cases.
Thank you for reading my article!






