Serverless text-to-speech application with Amazon Polly, Step functions and WebSocket Api

Zied Ben Tahar
Published in Level Up Coding
8 min read · Jul 9, 2022

Photo by Volodymyr Hryshchenko on Unsplash

In this article we will build “Easy Reader”, a text-to-speech application that is fully serverless and fully hosted on AWS. This application reads online articles to you: all you provide is a URL.

On the backend side we will use Amazon Polly, a WebSocket Api (Api Gateway) and a Step function orchestrating Lambda functions (written in TypeScript and running on the Node.js 16 runtime).

On the frontend side we will build a small static React application hosted on S3 and served through a CloudFront distribution.

We will build and deploy this application with GitHub Actions and CloudFormation.

TL;DR

You will find the complete repo here 👉https://github.com/ziedbentahar/aws-easy-reader

Wait, why a WebSocket Api ?

In order to process a web article, we will perform two steps:

  • First, we extract a text-only, readable, clutter-free version of the web article and detect its language
  • Second, we generate the audio using Amazon Polly

Depending on the length of the article, text extraction and audio synthesis can take a long time. One important quota to take into account with the WebSocket Api (and AWS Api Gateway in general) is the maximum integration timeout of 29 seconds.

Two strategies can work around this limit:

  • Poll based: The client repeatedly sends requests to the Api Gateway at regular intervals to check whether the article has finished processing. This strategy is not resource friendly, as it leads to many requests between the client and the server.
  • Push based with WebSockets: With two-way communication between the client and the server, the browser is notified as soon as an article has finished processing. This makes efficient use of resources and reduces latency.

AWS API Gateway supports WebSockets and provides route integrations with AWS Lambda, HTTP endpoints or other AWS services. This is what we are going to use in this speech synthesis application.

Reference architecture for “Easy Reader” application

  • The WebSocket Api exposes a route integrated with “Start Task Lambda”. This route handles requests from the client (a payload containing an article URL) and passes the integrated lambda (Start Task) the connection Id of the connected client. This connection Id is used later to call back the client on success or failure.
  • Start Task Lambda triggers a step function and passes it the article URL to process along with the connection Id.
  • Extract Text has two responsibilities: first it tries to extract a readable version of the text, then it detects the language of the content. The size of the extracted text varies, and passing it through the step function would fail if the payload exceeds 256 KB. We instead save the content to an S3 bucket, the “Content Bucket”, where the next lambda in the state machine reads it.
  • Generate Audio reads the content from the Content Bucket and then invokes Polly to generate the audio from the article content.
  • Notify Success sends the connected client a success payload containing a presigned URL that gives access to the generated audio.
  • Handle Error sends an error notification to the connected client when an error occurs in the state machine.
  • The Step function orchestrates and connects the Lambda functions together.
  • FrontEnd bucket and CloudFront distribution serve the React application to the client.
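
To make the first two steps more concrete, here is a minimal sketch of how Start Task could turn an incoming WebSocket message into the input of the state machine. The message body shape (`{ articleUrl }`) and the names `buildExecutionInput` and `TaskInput` are assumptions made for illustration, not taken from the repo. In a real Lambda, the result would be serialized and passed as `input` to `StartExecutionCommand` from `@aws-sdk/client-sfn`.

```typescript
// Minimal slice of the API Gateway WebSocket event that this sketch needs.
interface WebSocketEvent {
  requestContext: { connectionId: string };
  body: string | null;
}

// Hypothetical input handed to the state machine.
interface TaskInput {
  connectionId: string;
  articleUrl: string;
}

function buildExecutionInput(event: WebSocketEvent): TaskInput {
  // Assumed message shape: { "articleUrl": "https://..." }
  const parsed = JSON.parse(event.body ?? "{}") as { articleUrl?: unknown } | null;
  const articleUrl = parsed?.articleUrl;

  // Reject anything that is not an http(s) URL before starting an execution.
  if (typeof articleUrl !== "string" || !/^https?:\/\//i.test(articleUrl)) {
    throw new Error("The message body must contain an http(s) articleUrl");
  }

  // The connection Id travels with the URL so that later states can call back the client.
  return { connectionId: event.requestContext.connectionId, articleUrl };
}
```

Carrying the connection Id in the execution input is what lets every later state notify the right client without any shared session store.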

Defining the state machine

Step Functions provides a great way to orchestrate the execution of multiple Lambda functions. The diagram below shows the logical flow of the tasks involved in the “Easy Reader” step function.

Here is the template definition of the state machine in ASL (Amazon States Language):
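
As a rough sketch of what such a definition might contain, based only on the flow described in this article, the state machine can be written as a TypeScript object that serializes to ASL. The state names and the `${...}` ARN placeholders are assumptions; the CloudFormation template in the repo remains the reference.

```typescript
// Every task catches all errors and routes them to the error handler.
// ResultPath keeps the original input (including the connection Id) next to the error.
const catchAll = [
  { ErrorEquals: ["States.ALL"], ResultPath: "$.error", Next: "HandleError" },
];

const definition = {
  Comment: "Easy Reader: extract text, generate audio, notify the client",
  StartAt: "ExtractText",
  States: {
    ExtractText: {
      Type: "Task",
      Resource: "${ExtractTextLambdaArn}", // placeholder, substituted at deploy time
      Catch: catchAll,
      Next: "GenerateAudio",
    },
    GenerateAudio: {
      Type: "Task",
      Resource: "${GenerateAudioLambdaArn}",
      Catch: catchAll,
      Next: "NotifySuccess",
    },
    NotifySuccess: {
      Type: "Task",
      Resource: "${NotifySuccessLambdaArn}",
      Catch: catchAll,
      End: true,
    },
    HandleError: {
      Type: "Task",
      Resource: "${HandleErrorLambdaArn}",
      End: true,
    },
  },
};

// The JSON form is what would be embedded in the state machine resource.
const asl = JSON.stringify(definition, null, 2);
```

Routing every `Catch` to a single state keeps the failure path in one place, which matches the single Handle Error lambda of the architecture.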

Follow this link for the complete CloudFormation definition of this step function

Building The Lambda functions, the relevant parts

1- Text Extractor Lambda function

This lambda receives the article URL and the connection Id as input and tries to extract the content from the website.

We will use “Readability” to extract the content of a web article. This library powers the Firefox reader view, so it gives pretty good results for main content extraction. We also try to detect the language with the “languagedetect” lib. This is what extractContentFromArticleUrl does:
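
The following is a simplified sketch of that extraction flow, not the repo's implementation. To keep it self-contained, the HTML download, the Readability parsing and the language detection are injected as functions; the `ExtractionDeps` interface, the `ExtractedContent` shape and the fallback to "english" are assumptions. The comments show how the real libraries could be plugged in.

```typescript
interface ReadableArticle {
  title: string;
  textContent: string;
}

// Hypothetical seams for the three external dependencies.
interface ExtractionDeps {
  // e.g. an HTTP GET returning the page HTML
  fetchHtml: (url: string) => Promise<string>;
  // e.g. (html, url) => new Readability(new JSDOM(html, { url }).window.document).parse()
  parseArticle: (html: string, url: string) => ReadableArticle | null;
  // e.g. (text) => new LanguageDetect().detect(text, 1)[0]?.[0]
  detectLanguage: (text: string) => string | undefined;
}

interface ExtractedContent {
  title: string;
  language: string;
  paragraphs: string[];
}

async function extractContentFromArticleUrl(
  articleUrl: string,
  deps: ExtractionDeps
): Promise<ExtractedContent> {
  const html = await deps.fetchHtml(articleUrl);

  // Readability returns null when it cannot find a main content block.
  const article = deps.parseArticle(html, articleUrl);
  if (article === null) {
    throw new Error(`No readable content found at ${articleUrl}`);
  }

  // Keep one entry per non-empty paragraph; this helps when chunking for Polly later.
  const paragraphs = article.textContent
    .split(/\n+/)
    .map((p) => p.trim())
    .filter((p) => p.length > 0);

  // Assumption: fall back to English when detection gives no answer.
  const language = deps.detectLanguage(article.textContent) ?? "english";

  return { title: article.title, language, paragraphs };
}
```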

Once the content is extracted, this lambda function notifies the connected client by publishing a notification over the WebSocket connection.

await postNotificationToConnection(connectionId, { type: "contentExtracted", articleUrl: articleUrl, contentUrl: articleContentUrl });

Under the hood, this function calls the API Gateway Management Api to post a message to a connection Id.
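
A minimal sketch of such a helper could look like this. The actual call to the Management Api is injected so the example stays self-contained; `createNotifier` and the `PostToConnection` type are hypothetical names. In a real Lambda, the injected function would wrap `client.send(new PostToConnectionCommand({ ConnectionId, Data }))` from `@aws-sdk/client-apigatewaymanagementapi`, with the client endpoint pointing at the deployed stage of the WebSocket Api.

```typescript
// Mirrors the two fields PostToConnectionCommand expects.
interface PostToConnectionInput {
  ConnectionId: string;
  Data: string;
}

// Hypothetical seam around the Management Api call.
type PostToConnection = (input: PostToConnectionInput) => Promise<void>;

// Every notification carries a type so the client can tell events apart.
type Notification = { type: string } & Record<string, unknown>;

function createNotifier(post: PostToConnection) {
  return async function postNotificationToConnection(
    connectionId: string,
    notification: Notification
  ): Promise<void> {
    // WebSocket frames carry text here, so the payload is serialized to JSON.
    await post({ ConnectionId: connectionId, Data: JSON.stringify(notification) });
  };
}
```

If the client has already disconnected, the Management Api rejects the call, so a production version would probably catch that case instead of failing the whole state.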

This lambda function must be granted the execute-api:ManageConnections permission in order to post data to a connection of the WebSocket Api.

You will find the full CloudFormation template of this lambda function here

2- Generate audio Lambda function

The process is pretty straightforward. First we get the extracted content from the Content bucket, then we generate the audio by calling the synthesize function, and last we save the generated audio to the Content bucket.

The synthesize function uses the Amazon Polly SDK to generate audio. The Polly speech synthesis api accepts plain text or SSML. In this example we use SSML so we can control the pause duration in the speech, especially for paragraphs that do not end with proper punctuation marks.

Note that we generate the audio in chunks of 3000 characters. This is the maximum input size that SynthesizeSpeech accepts per request.
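
One possible way to do the chunking is sketched below. It groups paragraphs into chunks of at most 3000 text characters and wraps each chunk in SSML with an explicit pause after every paragraph. The function names, the 500ms pause and the choice to measure the text before markup is added are assumptions, not the repo's code. Each returned string would then be sent to SynthesizeSpeech with `TextType` set to `ssml`.

```typescript
const MAX_CHUNK_LENGTH = 3000;

// Escape the characters that are reserved in SSML (which is XML).
function escapeSsml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Split a paragraph that is longer than the limit, on a space when possible.
function splitLongParagraph(paragraph: string, limit: number): string[] {
  const pieces: string[] = [];
  let rest = paragraph;
  while (rest.length > limit) {
    const cut = rest.lastIndexOf(" ", limit);
    const end = cut > 0 ? cut : limit; // no space found: hard cut at the limit
    pieces.push(rest.slice(0, end));
    rest = rest.slice(end).trim();
  }
  if (rest.length > 0) pieces.push(rest);
  return pieces;
}

function toSsmlChunks(paragraphs: string[], limit = MAX_CHUNK_LENGTH): string[] {
  const chunks: string[][] = [];
  let current: string[] = [];
  let currentLength = 0;

  for (const paragraph of paragraphs) {
    for (const piece of splitLongParagraph(paragraph, limit)) {
      // Start a new chunk when adding this piece would exceed the limit.
      if (current.length > 0 && currentLength + piece.length > limit) {
        chunks.push(current);
        current = [];
        currentLength = 0;
      }
      current.push(piece);
      currentLength += piece.length;
    }
  }
  if (current.length > 0) chunks.push(current);

  // The <break> forces a pause even when a paragraph lacks final punctuation.
  return chunks.map(
    (chunk) =>
      `<speak>${chunk.map((p) => `<p>${escapeSsml(p)}</p><break time="500ms"/>`).join("")}</speak>`
  );
}
```

The audio streams returned for each chunk can then be concatenated in order before the result is saved to the Content bucket.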

getLangConfigurationOrDefault provides the engine (neural or standard) and the voice Id to use for the detected language. This configuration is defined here
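
For illustration, such a lookup could be as simple as the sketch below. The mapping itself is hypothetical (the repo defines its own languages and voices); the keys follow the lowercase language names returned by “languagedetect”, and the voice Ids are existing Polly voices.

```typescript
type Engine = "neural" | "standard";

interface LangConfiguration {
  engine: Engine;
  voiceId: string;
  languageCode: string;
}

// Hypothetical default, used when the language is unknown or unsupported.
const defaultConfiguration: LangConfiguration = {
  engine: "neural",
  voiceId: "Joanna",
  languageCode: "en-US",
};

// Hypothetical mapping from detected language name to a Polly voice.
const configurations: Record<string, LangConfiguration> = {
  english: defaultConfiguration,
  french: { engine: "neural", voiceId: "Lea", languageCode: "fr-FR" },
  german: { engine: "neural", voiceId: "Vicki", languageCode: "de-DE" },
};

function getLangConfigurationOrDefault(language: string | undefined): LangConfiguration {
  const key = (language ?? "").toLowerCase();
  // hasOwnProperty avoids accidental matches on inherited keys such as "constructor".
  return Object.prototype.hasOwnProperty.call(configurations, key)
    ? configurations[key]
    : defaultConfiguration;
}
```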

This lambda function must be granted the polly:SynthesizeSpeech permission.

You will find the full CloudFormation template of this lambda here

3- Send Success/Failure Notification

The success notification function requests a presigned URL for the audio file stored in the Content bucket and then notifies the client with a payload containing a success state and the presigned audio URL.

The Handle Error lambda, on the other hand, only sends a message to the connected client when an issue occurs anywhere in the process.

4- Building and bundling lambda functions

In this application, we will build and bundle the Lambda functions with Webpack. Bundling brings some advantages:

  • Reduced package size and tree shaking: As we don’t want to copy the entire content of node_modules for each lambda, Webpack only bundles the dependencies that are imported by each lambda function handler
  • Each Lambda is packaged individually
  • Improved cold start time thanks to smaller deployment packages

In the webpack.config.js file, you will notice that we are ignoring the canvas module. This module is an optional dependency of jsdom (which we use together with Readability) and is not compatible with the Lambda runtime.
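
As an illustration, a configuration along these lines would produce one bundle per lambda and leave canvas out. The entry names and paths are hypothetical, and `resolve.alias` with `false` (a webpack 5 feature) is only one way to ignore a module; the repo's config may use a different mechanism.

```typescript
// Sketch of a webpack config object; in practice it would be the export of the config file.
const config = {
  mode: "production",
  target: "node16",
  // One entry per lambda, so each function gets its own bundle.
  entry: {
    "start-task": "./src/lambdas/start-task.ts",
    "extract-text": "./src/lambdas/extract-text.ts",
    "generate-audio": "./src/lambdas/generate-audio.ts",
  },
  output: {
    filename: "[name]/index.js",
    library: { type: "commonjs2" },
  },
  resolve: {
    extensions: [".ts", ".js"],
    // Setting an alias to false tells webpack 5 to ignore the module entirely.
    alias: { canvas: false },
  },
  module: {
    rules: [{ test: /\.ts$/, loader: "ts-loader", exclude: /node_modules/ }],
  },
};
```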

Configuring the WebSocket API

Defining a WebSocket Api with CloudFormation is not much different from a REST or HTTP Api.

Here we set the ProtocolType of the ApiGateway resource to WEBSOCKET and we create a route, processUrl, that targets the integration with StartTaskLambda. You will find the full CF template of this WebSocket Api here

Testing the backend deployment

Once all of these backend components are deployed, we can test the WebSocket Api. Postman makes it really easy.

Creating a new WebSocket request in Postman
Testing A WebSocket api with postman

In the previous sections we focused on building the backend side of “Easy Reader”. Let’s handle the frontend part now.

Reminder

You can find the complete repo with the GitHub workflow here 👉https://github.com/ziedbentahar/aws-easy-reader

Building the frontend

In this section we will mostly focus on querying “Easy Reader” WebSocket Api and handling the asynchronous responses.

You can find the complete React app ⚛️ here

Easy Reader Hook

useEasyReader accepts a URL, creates a WebSocket connection and sends a processArticle command with the URL as a payload. It then waits for two events:

  • contentExtracted event, with a payload containing the presigned URL of the S3 object that holds the readable article content.
  • audioGenerated event, with a payload containing the URL of the audio.

This hook returns an object providing the processing progress state, article content and audioUrl that will be used at the component level.
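
The state handling inside such a hook can be sketched as a small reducer over the incoming WebSocket messages. The `audioUrl` field name, the `error` event and the `ReaderState` shape are assumptions for illustration; only the `contentExtracted` payload fields come from the backend code shown earlier.

```typescript
// Events pushed by the backend over the WebSocket connection.
type ServerEvent =
  | { type: "contentExtracted"; articleUrl: string; contentUrl: string }
  | { type: "audioGenerated"; audioUrl: string } // field name assumed
  | { type: "error"; message: string }; // event shape assumed

interface ReaderState {
  status: "processing" | "contentReady" | "audioReady" | "failed";
  contentUrl?: string;
  audioUrl?: string;
  error?: string;
}

const initialState: ReaderState = { status: "processing" };

// Pure function: each incoming event produces the next state.
function reduce(state: ReaderState, event: ServerEvent): ReaderState {
  switch (event.type) {
    case "contentExtracted":
      return { ...state, status: "contentReady", contentUrl: event.contentUrl };
    case "audioGenerated":
      return { ...state, status: "audioReady", audioUrl: event.audioUrl };
    case "error":
      return { ...state, status: "failed", error: event.message };
    default:
      return state;
  }
}
```

In a React hook, this reducer could be passed to `useReducer`, with the socket's `onmessage` handler dispatching `JSON.parse(event.data)`.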

In turn, the component reacts to the progress events and handles each event type as it arrives.

Defining the infrastructure for the frontend

As presented on the architecture diagram above, in order to deploy the frontend part on AWS, we will create an S3 bucket that will store and serve the bundled react app. We will then attach it as an origin to a CloudFront distribution.

The S3 bucket only grants access to the Origin Access Identity (OAI), which is a special CloudFront user. This way CloudFront is the only principal allowed to read from the bucket, and everything else is denied.

A word on the build/deployment pipeline

Easy reader main Github Actions pipeline

In this article I don’t dive into the build and deployment pipeline with GitHub Actions. You can find the complete pipeline workflow by following this link. You can also find a step-by-step guide to setting up a GitHub Actions pipeline in this previous article.

🎉 Easy Reader in Action
