Amazon Sumerian Hosts: How to reduce your Amazon Polly cost

Rani Lian
Published in Level Up Coding
7 min read · Oct 19, 2020


Photo by Cris Fudge on ArtStation

This article is intended for users who are already familiar with Amazon Sumerian Hosts and are looking for a way to reduce their AWS bill (specifically for Amazon Polly) while maintaining their interactive and immersive experience.

Introduction

Amazon Sumerian Hosts is an open-source GitHub repository published by AWS that allows developers to integrate 3D virtual characters into their Web-based (ThreeJS or BabylonJS) interactive experiences. The virtual characters are referred to as Amazon Sumerian Hosts.

Technology

Amazon Sumerian Hosts are mostly powered by Amazon Polly (Amazon’s text-to-speech service). With Amazon Polly, you can make the Hosts speak 29 different languages (at the time of writing), with fine-grained control through SSML that lets the developer configure exactly how a sentence is spoken. To learn more about Amazon Sumerian Hosts, please visit the GitHub Repository.

Most interactive web applications are client-based, meaning that the browser renders the experience. As a result, the only cost associated with Amazon Sumerian Hosts, besides the cost of hosting and delivery, is the API calls made to Amazon Polly.

Amazon Polly is a serverless, pay-as-you-go service, meaning that you are billed monthly for the number of characters of text that you process.

Amazon Polly’s Standard voices are priced at $4.00 per 1 million characters for speech or Speech Marks requests (when outside the free tier). Amazon Polly’s Neural voices are priced at $16.00 per 1 million characters for speech or Speech Marks requests (when outside the free tier). Referenced from Amazon Polly’s Pricing page.

An Amazon Sumerian Host performs two API calls every time the developer requests it to speak a message.

  1. The first API call is to synthesize the text (transform the text to an audio file) and retrieve the MP3 Audio file to play.
  2. The second API call is to retrieve the Speech Marks for the text.
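As a rough illustration of the two calls listed above, the sketch below builds the parameter objects for two Polly SynthesizeSpeech requests. The helper name buildPollyRequests and the voice "Joanna" are assumptions made for this example, and the exact parameters the Hosts library sends may differ.

```javascript
// Hypothetical helper (not part of Amazon Sumerian Hosts): builds the
// parameter objects for the two SynthesizeSpeech requests behind one message.
function buildPollyRequests(text, voiceId) {
  // Fields shared by both requests.
  const common = { Text: text, TextType: "ssml", VoiceId: voiceId };
  return {
    // Request 1: the MP3 audio that the host plays.
    audio: { ...common, OutputFormat: "mp3" },
    // Request 2: the speech marks metadata used for lip sync.
    speechMarks: {
      ...common,
      OutputFormat: "json",
      SpeechMarkTypes: ["sentence", "ssml", "viseme", "word"],
    },
  };
}

const requests = buildPollyRequests("<speak>Hello!</speak>", "Joanna");
// Both requests carry the same text, so its characters are billed twice.
console.log(requests.audio.OutputFormat, requests.speechMarks.OutputFormat);
```

Each object could be passed to the SDK's synthesizeSpeech call; the point here is only that the same text appears in both requests.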

Speech marks are metadata that describe the speech that you synthesize, such as where a sentence or word starts and ends in the audio stream. When you request speech marks for your text, Amazon Polly returns this metadata instead of synthesized speech. By using speech marks in conjunction with the synthesized speech audio stream, you can provide your applications with an enhanced visual experience. Referenced from Amazon Polly’s Documentation.

Amazon Sumerian Hosts utilize the Speech Marks to provide Lip-Sync capabilities to the virtual characters.
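To make the metadata more concrete, here is a minimal sketch of reading speech marks, assuming the line-delimited JSON format that Polly returns (one JSON object per line). The sample values are illustrative rather than real Polly output, and the helper names parseSpeechMarks and visemesOnly are hypothetical.

```javascript
// Illustrative speech marks: one JSON object per line.
const rawSpeechMarks = [
  '{"time":0,"type":"sentence","start":0,"end":5,"value":"Mary."}',
  '{"time":6,"type":"word","start":0,"end":4,"value":"Mary"}',
  '{"time":6,"type":"viseme","value":"p"}',
  '{"time":73,"type":"viseme","value":"E"}',
].join("\n");

// Parse each non-empty line into a speech mark object.
function parseSpeechMarks(raw) {
  return raw
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line));
}

// Keep only the viseme marks, which describe mouth shapes over time.
function visemesOnly(marks) {
  return marks.filter((mark) => mark.type === "viseme");
}

const visemes = visemesOnly(parseSpeechMarks(rawSpeechMarks));
console.log(visemes.map((mark) => `${mark.time}ms: ${mark.value}`));
```

The time field is an offset in milliseconds from the start of the audio, which is what allows mouth shapes to be lined up with playback.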

So when estimating your AWS costs (specifically around Amazon Polly) you have to take into consideration the fact that every spoken message is going to cost you twice.

Cost Formula

The following would be the formula to use when estimating Amazon Polly costs when using Amazon Sumerian Hosts:

Amazon Polly Monthly Cost = # of users X # of spoken characters X price per character X num of API calls

The above formula is calculated as follows:

  1. The total number of estimated users of your application, multiplied by
  2. The total number of spoken characters, multiplied by
     (If you support multiple languages and allow your users to switch between them, you may want to calculate a worst-case estimate by adding up the characters of all languages. Only count the characters of unique messages, because Amazon Sumerian Hosts cache the response in memory in case the host speaks the same message multiple times.)
  3. Price per character ($0.000004 per character if you are using a Standard voice or $0.000016 per character if you are using a Neural voice), multiplied by
  4. Num of API calls is 2 (since each message needs to make two API calls to Amazon Polly, one for the Audio File and another for the Speech Marks)
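The formula can be sketched in a few lines of JavaScript. This is a minimal estimate that uses the prices quoted in this article; the function names are hypothetical, and String.length is only an approximation of the characters Polly actually bills.

```javascript
// Prices per 1 million characters, as quoted in this article.
const PRICE_PER_MILLION_USD = { standard: 4.0, neural: 16.0 };
// One call for the audio file and one for the speech marks.
const API_CALLS_PER_MESSAGE = 2;

// Adds up the characters of unique messages only, since repeated
// messages are served from the host's in-memory cache.
function countUniqueCharacters(messages) {
  let total = 0;
  for (const message of new Set(messages)) {
    total += message.length;
  }
  return total;
}

// Monthly cost = users x characters x price per character x API calls.
function estimateMonthlyPollyCost(users, characters, voiceType) {
  const price = PRICE_PER_MILLION_USD[voiceType];
  if (price === undefined) {
    throw new Error(`Unknown voice type: ${voiceType}`);
  }
  return (users * characters * API_CALLS_PER_MESSAGE * price) / 1e6;
}

console.log(estimateMonthlyPollyCost(10, 10000, "standard").toFixed(2)); // "0.80"
console.log(estimateMonthlyPollyCost(1000, 25000, "neural").toFixed(2)); // "800.00"
```

Working in price per million characters rather than price per character keeps the intermediate values as whole numbers, which avoids most floating-point noise in the result.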

Scenario 1

  • Number of Users: 10
  • Number of Characters: 10,000

Standard Monthly Total: 10 x 10,000 x 0.000004 x 2 = $0.80

Neural Monthly Total: 10 x 10,000 x 0.000016 x 2 = $3.20

Scenario 2

  • Number of Users: 100
  • Number of Characters: 25,000

Standard Monthly Total: 100 x 25,000 x 0.000004 x 2 = $20

Neural Monthly Total: 100 x 25,000 x 0.000016 x 2 = $80

Scenario 3

  • Number of Users: 1,000
  • Number of Characters: 25,000

Standard Monthly Total: 1,000 x 25,000 x 0.000004 x 2 = $200

Neural Monthly Total: 1,000 x 25,000 x 0.000016 x 2 = $800

As the scenarios above show, the cost of Amazon Polly grows linearly with both the number of users and the number of characters. If we preprocess the speech once and cache the result, every user is served the same files instead of triggering new Polly requests, so the cost is divided by the number of users and becomes much more manageable, as you can see below.

Scenario 3 (with preprocessing)

  • Number of Users: 1
  • Number of Characters: 25,000

Standard Monthly Total: 1 x 25,000 x 0.000004 x 2 = $0.20

Neural Monthly Total: 1 x 25,000 x 0.000016 x 2 = $0.80

Note: The above monthly total doesn’t include the minuscule S3/CloudFront storage and transfer fee.

Introducing Amazon Sumerian Polly Optimized

I’ve forked the existing Amazon Sumerian Hosts package and modified it a bit to accomplish the above by allowing the developer to provide two optional parameters to the TextToSpeechFeature.play function in order to directly use the provided Audio file and the Speech Marks.

The GitHub package can be found here.

How do I install it?

npm install amazon-sumerian-hosts-polly-optimized

How does it work?

Pass the two optional properties to the function call as follows:

/**
 * ... initialize host here ...
 */

/**
 * text: a string that is unique per audioURL/speechMarksJSON pair. You can
 *   keep the text you originally played, but note that it won't be
 *   synthesized; the preprocessed audio and speech marks are used instead.
 * speechMarksJSON: the preprocessed SpeechMarks JSON array.
 * audioURL: a Blob URL of the preprocessed audio file.
 */
host.TextToSpeechFeature.play(text, {
  SpeechMarksJSON: speechMarksJSON,
  AudioURL: audioURL
});

As an example, to have the host speak the following SSML:

<speak>Hello, I am a Sumerian Host powered using a preprocessed MP3 file and SpeechMarks.</speak>

I have already preprocessed the necessary Audio file and SpeechMarks and placed them under the examples/assets/preprocessed/ folder in the GitHub repository so you should be able to use them in your code.

// Specify local paths
// Make sure to update them to wherever you copy the files under your public/root folder
const speechMarksPath = './examples/assets/preprocessed/exampleSpeechMark.json';
const audioPath = './examples/assets/preprocessed/exampleAudio.mp3';

// Fetch resources
// Note: await must run inside an async function or an ES module
const speechMarksJSON = await (await fetch(speechMarksPath)).json();
const audioBlob = await (await fetch(audioPath)).blob();

// Create Audio Blob URL
const audioURL = URL.createObjectURL(audioBlob);

// Play speech with local assets
host.TextToSpeechFeature.play('Hello, I am a Sumerian Host powered using a preprocessed MP3 file and SpeechMarks', {
  SpeechMarksJSON: speechMarksJSON,
  AudioURL: audioURL
});

If the two properties, SpeechMarksJSON and AudioURL, aren't specified, the host works as it did before (by calling Polly). If you use static text often, though, I highly recommend preprocessing the Speech Marks and Audio files and caching them on S3/CloudFront for major cost savings.
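Once you have more than a handful of messages, you need some convention for finding the preprocessed files that belong to a given text. The sketch below shows one possible convention, in which the file names are derived from a short hash of the message. The naming scheme, the function names, and the base URL are all assumptions for illustration; the package itself does not prescribe any of them.

```javascript
// FNV-1a-style 32-bit hash over the string's UTF-16 code units,
// rendered as 8 hexadecimal digits.
function hashText(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16).padStart(8, "0");
}

// Hypothetical convention: <baseUrl>/<hash>.mp3 and <baseUrl>/<hash>.json.
function preprocessedAssetPaths(text, baseUrl) {
  const key = hashText(text);
  return {
    audioPath: `${baseUrl}/${key}.mp3`,
    speechMarksPath: `${baseUrl}/${key}.json`,
  };
}

const paths = preprocessedAssetPaths(
  "<speak>Hello, I am a Sumerian Host.</speak>",
  "./assets/preprocessed"
);
console.log(paths.audioPath, paths.speechMarksPath);
```

The same function can be used by the preprocessing script when writing the files and by the web application when fetching them, so the two sides always agree on the file names.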

How do I preprocess the necessary files?

I highly recommend preprocessing the files and either storing them on Amazon S3 (and serving them through CloudFront) or serving them as part of the web application.

To do so, there are two main files that need to be preprocessed.

  1. Audio file: The audio file will be played by the host and can be either preprocessed using the AWS SDK or by downloading the MP3 file directly from the Amazon Polly AWS console.
  2. SpeechMarks: The JSON array that defines the viseme SpeechMarks that shape the host’s mouth to match the sounds it is making. You can use the code that’s already in the Amazon Sumerian Hosts repo to preprocess them, either by running it in a Node.js AWS Lambda function or by running it locally in Node.js. The code itself is pasted below:
// Assumes the AWS SDK for JavaScript v2 (the .promise() call below is a v2 API),
// with credentials and region configured in the environment.
const AWS = require("aws-sdk");
const polly = new AWS.Polly();

const synthesizeSpeechmarks = (text, voiceId) => {
  console.log(`Synthesizing speechmarks for ${text} with voice: ${voiceId}`);
  const params = {
    OutputFormat: "json",
    SpeechMarkTypes: ["sentence", "ssml", "viseme", "word"],
    SampleRate: "22050",
    Text: text,
    TextType: "ssml",
    VoiceId: voiceId,
  };
  return polly
    .synthesizeSpeech(params)
    .promise()
    .then((result) => {
      // Convert charcodes to string
      const jsonString = JSON.stringify(result.AudioStream);
      const json = JSON.parse(jsonString);
      const dataStr = json.data.map((c) => String.fromCharCode(c)).join("");

      // All speechmarks seen so far, grouped by type
      const markTypes = {
        sentence: [],
        word: [],
        viseme: [],
        ssml: [],
      };
      // The most recent speechmark of each type
      const endMarkTypes = {
        sentence: null,
        word: null,
        viseme: null,
        ssml: null,
      };

      // Split by enclosing {} to create speechmark objects
      const speechMarks = [...dataStr.matchAll(/\{.*?\}(?=\n|$)/gm)].map(
        (match) => {
          const mark = JSON.parse(match[0]);

          // Set the duration of the last speechmark stored matching this one's type
          const numMarks = markTypes[mark.type].length;
          if (numMarks > 0) {
            const lastMark = markTypes[mark.type][numMarks - 1];
            lastMark.duration = mark.time - lastMark.time;
          }

          markTypes[mark.type].push(mark);
          endMarkTypes[mark.type] = mark;
          return mark;
        }
      );

      // Find the time of the latest speechmark
      const endTimes = [];
      if (endMarkTypes.sentence) {
        endTimes.push(endMarkTypes.sentence.time);
      }
      if (endMarkTypes.word) {
        endTimes.push(endMarkTypes.word.time);
      }
      if (endMarkTypes.viseme) {
        endTimes.push(endMarkTypes.viseme.time);
      }
      if (endMarkTypes.ssml) {
        endTimes.push(endMarkTypes.ssml.time);
      }
      const endTime = Math.max(...endTimes);

      // Calculate duration for the ending speechMarks of each type
      if (endMarkTypes.sentence) {
        endMarkTypes.sentence.duration = Math.max(
          0.05,
          endTime - endMarkTypes.sentence.time
        );
      }
      if (endMarkTypes.word) {
        endMarkTypes.word.duration = Math.max(
          0.05,
          endTime - endMarkTypes.word.time
        );
      }
      if (endMarkTypes.viseme) {
        endMarkTypes.viseme.duration = Math.max(
          0.05,
          endTime - endMarkTypes.viseme.time
        );
      }
      if (endMarkTypes.ssml) {
        endMarkTypes.ssml.duration = Math.max(
          0.05,
          endTime - endMarkTypes.ssml.time
        );
      }

      return speechMarks;
    });
};
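Before uploading a preprocessed SpeechMarks file, it may be worth running a quick sanity check on it. The sketch below is a hypothetical helper (it is not part of the Hosts repo) that counts the marks of each type and flags any mark without a numeric duration, on the assumption that the array was produced by the function above, which assigns a duration to every mark.

```javascript
// Hypothetical sanity check for a preprocessed SpeechMarks array.
function summarizeSpeechMarks(speechMarks) {
  const summary = { counts: {}, missingDuration: 0 };
  for (const mark of speechMarks) {
    // Count marks per type (sentence, word, viseme, ssml).
    summary.counts[mark.type] = (summary.counts[mark.type] || 0) + 1;
    // Flag marks that never received a duration.
    if (typeof mark.duration !== "number") {
      summary.missingDuration += 1;
    }
  }
  return summary;
}

// Illustrative input: the second mark is deliberately missing its duration.
const summary = summarizeSpeechMarks([
  { time: 0, type: "viseme", value: "p", duration: 73 },
  { time: 73, type: "viseme", value: "E" },
]);
console.log(summary.counts, summary.missingDuration);
```

A file with no viseme marks or with a non-zero missingDuration count is a sign that something went wrong during preprocessing.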

Conclusion

If you have large amounts of static text being processed by Amazon Polly, or many users interacting with your AWS-powered application, look for ways to cache as much data as possible to save time and money!

Thanks!


Full-Stack Engineer / Senior Technical Consultant. Passionate about reverse-engineering, design thinking, and building things that improve everyday tasks.