How to parse forms using Google Cloud Document AI

A step-by-step guide to extracting structured data from paper forms

Lak Lakshmanan
Level Up Coding

Many business processes, especially those that involve customers and partners, rely on paper forms. As consumers, we routinely fill out forms to apply for insurance, file insurance claims, specify healthcare preferences, apply for jobs, set up tax withholdings, and so on. The businesses on the other side of these transactions receive a form that they must parse, extract specific pieces of data from, and use to populate a database.

The input form

Google Cloud Document AI comes with form-parsing capability. Let’s use it to parse a campaign disclosure form. These are forms that every US political campaign needs to file with the Federal Election Commission. An example of such a form:

Upload to Cloud Storage

The form(s) must be uploaded to Cloud Storage so that Document AI can access them. The full code can be found in this notebook on GitHub.

gsutil cp scott_walker.pdf \
gs://{BUCKET}/formparsing/scott_walker.pdf

Create a request

To invoke Document AI, we need to create a request JSON structure that looks like this:

{
  "inputConfig": {
    "gcsSource": {
      "uri": "${PDF}"
    },
    "mimeType": "application/pdf"
  },
  "documentType": "general",
  "formExtractionParams": {
    "enabled": true
  }
}
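
If you would rather generate request.json from code than edit it by hand, a minimal Python sketch might look like this. The function names build_request and write_request are hypothetical; the field names are copied from the JSON above.

```python
import json


def build_request(gcs_uri):
    """Return the request body shown above for a PDF stored at gcs_uri."""
    return {
        "inputConfig": {
            "gcsSource": {"uri": gcs_uri},
            "mimeType": "application/pdf",
        },
        "documentType": "general",
        "formExtractionParams": {"enabled": True},
    }


def write_request(gcs_uri, path="request.json"):
    """Serialize the request body to a file that curl can send with -d @path."""
    with open(path, "w") as f:
        json.dump(build_request(gcs_uri), f, indent=2)
    return path
```

Calling write_request with the gs:// path of the uploaded PDF should produce a file equivalent to the one above.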

We then send it to the service:

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
  -H "Content-Type: application/json; charset=utf-8" \
  -d @request.json \
  "https://${REGION}-documentai.googleapis.com/v1beta2/projects/${PROJECT}/locations/us/documents:process"
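
The same call can be prepared from Python using only the standard library. This is a sketch, not the notebook's code: build_process_request and send are hypothetical helpers, the access token is assumed to have been obtained separately (for example from the gcloud command above), and the URL simply mirrors the curl command.

```python
import json
import urllib.request


def build_process_request(region, project, token, body):
    """Prepare (but do not send) the POST request issued by the curl command."""
    url = (
        f"https://{region}-documentai.googleapis.com/v1beta2/"
        f"projects/{project}/locations/us/documents:process"
    )
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json; charset=utf-8",
        },
        method="POST",
    )


def send(request):
    """Send the prepared request and return the parsed JSON response."""
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)
```

Keeping the request construction separate from the network call makes it possible to inspect the URL and headers before anything is sent.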

Parse the response

The response is a JSON document that we can parse to pull out the data that we want. The response has a text field that contains all the extracted text. You can get that by:

allText = response['text']

For example, the text includes the words “Cash on Hand” because they appear in the form:

Let’s say that we want to find the actual amount ($75,931.36) that the campaign currently has on hand. This information is on the second page of the form. Pages are indexed from zero, so we look at page=1 and examine either the text blocks or the extracted form fields.

The text blocks are the lower-level representation; the form fields are a higher-level abstraction. Let’s look at both.

Method 1: From blocks of text

This, for example, is how to get block=5 on page=1:

response['pages'][1]['blocks'][5]

The block itself is a JSON struct:

{'layout': {'textAnchor': {'textSegments': [{'startIndex': '1716',
                                             'endIndex': '1827'}]},
            'confidence': 1,
            'boundingPoly': {'normalizedVertices': [{'x': 0.068627454, 'y': 0.24873738},
                                                    {'x': 0.6764706, 'y': 0.24873738},
                                                    {'x': 0.6764706, 'y': 0.25757575},
                                                    {'x': 0.068627454, 'y': 0.25757575}]},
            'orientation': 'PAGE_UP'}}

We can parse it in turn to get the textSegment’s start and end index:

startIndex = int(response['pages'][1]['blocks'][5]['layout']['textAnchor']['textSegments'][0]['startIndex'])
endIndex = int(response['pages'][1]['blocks'][5]['layout']['textAnchor']['textSegments'][0]['endIndex'])

Using the start and end index into allText:

allText[startIndex:endIndex]

gives us:

'6. CASH ON HAND AT BEGINNING OF REPORTING PERIOD .............................................................\n'

Well, that was block=5. What is block=6?

Yup, $75,931.36.
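
Hard-coding block numbers is brittle, so here is a hedged sketch of looking up the value by its label instead. It assumes, as in this form, that the value sits in the block immediately after the label. The names block_text and text_after_label are hypothetical, and defaulting a missing startIndex to 0 is an assumption about how zero-valued fields are serialized in the JSON.

```python
def block_text(all_text, block):
    """Return the text covered by all of a block's text segments."""
    segments = block["layout"]["textAnchor"].get("textSegments", [])
    return "".join(
        # Indexes arrive as strings; startIndex is assumed to default to 0.
        all_text[int(seg.get("startIndex", 0)):int(seg["endIndex"])]
        for seg in segments
    )


def text_after_label(response, page_index, label):
    """Find the first block containing label and return the next block's text."""
    all_text = response["text"]
    blocks = response["pages"][page_index]["blocks"]
    for i, block in enumerate(blocks[:-1]):
        if label.lower() in block_text(all_text, block).lower():
            return block_text(all_text, blocks[i + 1]).strip()
    return None  # label not found, or it was in the last block
```

With the response above, text_after_label(response, 1, 'CASH ON HAND AT BEGINNING') would return the contents of block=6, provided no earlier block on that page contains the same phrase.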

Method 2: Form Fields

Document AI understands that this form consists of name-value pairs. So, we can parse the JSON response at that level. Let’s write a helper function first:

def extractText(allText, elem):
    startIndex = int(elem['textAnchor']['textSegments'][0]['startIndex'])
    endIndex = int(elem['textAnchor']['textSegments'][0]['endIndex'])
    return allText[startIndex:endIndex].strip()

To get the third form field on the second page, we’d do:

response['pages'][1]['formFields'][2]

This gives us the following structure:

{'fieldName': {'textAnchor': {'textSegments': [{'startIndex': '1719',
                                                'endIndex': '1765'}]},
               'confidence': 0.9962783,
               'boundingPoly': {'normalizedVertices': [{'x': 0.0922335, 'y': 0.24873738},
                                                       {'x': 0.4584429, 'y': 0.24873738},
                                                       {'x': 0.4584429, 'y': 0.2587827},
                                                       {'x': 0.0922335, 'y': 0.2587827}]},
               'orientation': 'PAGE_UP'},
 'fieldValue': {'textAnchor': {'textSegments': [{'startIndex': '1716',
                                                 'endIndex': '1842'}]},
                'confidence': 0.9962783,
                'boundingPoly': {'normalizedVertices': [{'x': 0.068627454, 'y': 0.24873738},
                                                        {'x': 0.90849674, 'y': 0.24873738},
                                                        {'x': 0.90849674, 'y': 0.26767677},
                                                        {'x': 0.068627454, 'y': 0.26767677}]},
                'orientation': 'PAGE_UP'}}

So, we can extract the field name and field value using:

fieldName = extractText(allText, response['pages'][1]['formFields'][2]['fieldName'])
fieldValue = extractText(allText, response['pages'][1]['formFields'][2]['fieldValue'])
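
To pull every name-value pair from a page at once, the same idea can be wrapped in a loop. This is a sketch under assumptions: extract_text is a variant of the helper above that joins all segments and tolerates a missing startIndex, form_fields_to_dict and min_confidence are hypothetical names, and if two fields share a name the later one wins.

```python
def extract_text(all_text, elem):
    """Concatenate the text segments referenced by a fieldName or fieldValue."""
    segments = elem.get("textAnchor", {}).get("textSegments", [])
    return "".join(
        all_text[int(seg.get("startIndex", 0)):int(seg["endIndex"])]
        for seg in segments
    ).strip()


def form_fields_to_dict(response, page_index, min_confidence=0.0):
    """Map field names to field values for one page of the response."""
    all_text = response["text"]
    fields = {}
    for field in response["pages"][page_index].get("formFields", []):
        # Skip values the service itself is unsure about.
        if field["fieldValue"].get("confidence", 0.0) < min_confidence:
            continue
        name = extract_text(all_text, field["fieldName"])
        fields[name] = extract_text(all_text, field["fieldValue"])
    return fields
```

For example, form_fields_to_dict(response, 1, min_confidence=0.9) would return a dictionary of the fields on the second page whose value confidence is at least 0.9.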

Enjoy!

Next steps:

  1. Try it out: https://github.com/GoogleCloudPlatform/training-data-analyst/blob/master/blogs/form_parser/formparsing.ipynb
  2. Read the docs: https://cloud.google.com/document-ai/docs/process-forms
  3. Read the reference about the returned structure: https://cloud.google.com/document-ai/docs/reference/rpc/google.cloud.documentai.v1beta2
