If Web Apps Could Talk — Intro to the Web Speech API
Build a sassy pronunciation checker that uses simple JavaScript to judge your language skills
What if it were easy to add speech-to-text and text-to-speech functionality to your web app? Have no fear, There’s an API for That™. The browser-native Web Speech API surfaces two interfaces for handling voice data with ease: SpeechSynthesis for turning text into sound, and SpeechRecognition for the opposite.
Dipping your toes into the Web Speech API, you will find plenty of tutorials on using SpeechSynthesis or SpeechRecognition, but this tutorial will have you killing two birds with one app using no more than HTML/CSS/JavaScript. Your end goal is this:

Try it out here. The user selects a language, enters the phrase to attempt to say and presses ‘Hear it’ to hear it spoken aloud. Next they press ‘Say it’ to activate the mic and record their best attempt. Finally, the app displays to the user what it heard and lets them know whether their utterance matches their goal.
Disclaimer!
Before you start scheming your million dollar app that runs entirely on voice commands à la Star Trek, note the caveats.
- The Web Speech API is experimental. Its specification is still a draft, not yet a finalized standard, and could change.
- Browser compatibility is limited. To sum up the facts, SpeechSynthesis is available to 90% of users and SpeechRecognition to a mere 68%. For simplicity’s sake we will design this app for Chrome browser. (Note: for mobile, SpeechRecognition is only compatible with Chrome for Android and Android browser.)
This one’s for fun!
Now that that’s out of the way, let’s dive in.
Getting started
Make an empty project directory called am-i-saying-it-right. Inside, add a file named index.html and paste the following code for your app structure.
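Something like the following works as a minimal sketch. The element ids here (speak-button, recognize-button, phrase-input, transcript-input, language-select) are my assumptions, carried through the rest of the snippets in this tutorial:

```html
<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="UTF-8" />
    <title>Am I Saying It Right?</title>
    <!-- Animate.CSS, used later to animate the mic button -->
    <link
      rel="stylesheet"
      href="https://cdnjs.cloudflare.com/ajax/libs/animate.css/3.7.2/animate.min.css"
    />
    <link rel="stylesheet" href="style.css" />
  </head>
  <body>
    <h1>Am I saying it right?</h1>
    <!-- The value attributes hold Web Speech API language codes -->
    <select id="language-select">
      <option value="en-US">English</option>
      <option value="fr-FR">French</option>
      <option value="de-DE">German</option>
      <option value="ja-JP">Japanese</option>
      <option value="zh-CN">Chinese</option>
    </select>
    <input id="phrase-input" type="text" placeholder="Enter a phrase to attempt" />
    <button id="speak-button">Hear it</button>
    <button id="recognize-button">Say it</button>
    <input id="transcript-input" type="text" placeholder="What the mic heard" readonly />
    <!-- Empty div for the success/failure message -->
    <div id="result-message"></div>
    <script type="module" src="main.js"></script>
  </body>
</html>
```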
Points of interest:
- Animate.CSS is imported in the <head> to animate our mic button later.
- The value attribute in our <option> tags for languages holds the language codes recognized by the Web Speech API (see an unofficial list here). We will use them later to tell the engine to “listen for Chinese” or “speak French”, for example.
- The empty #result-message <div> will be used to tell the user whether their pronunciation is good or sucks.
- main.js is imported with type="module" in the <script> tag so we can organize our scripts by concern and import them as modules in main.js.
Next, bootstrap your style by creating a file style.css and pasting in this code.
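The exact styling is up to you; a bare-bones placeholder along these lines will do:

```css
/* style.css: placeholder styling, purely illustrative */
body {
  font-family: sans-serif;
  max-width: 600px;
  margin: 2rem auto;
  text-align: center;
}

button,
input,
select {
  font-size: 1rem;
  padding: 0.5rem;
  margin: 0.5rem 0;
}

#result-message {
  min-height: 2rem;
  transition: background-color 0.3s;
}
```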
Your file tree should now look like this:

Loading index.html in Chrome browser should give you this:

You’ll notice that it does nothing yet. Let’s start making the app speak.
Let there be speech!
The basic syntax for SpeechSynthesis is quite simple:
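In essence, it goes like this (the phrase and language here are arbitrary examples):

```javascript
// Create an utterance, give it text and a language, and speak it
const utterance = new SpeechSynthesisUtterance();
utterance.text = 'Bonjour tout le monde';
utterance.lang = 'fr-FR'; // optional; defaults to the document's lang
speechSynthesis.speak(utterance);
```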
The speechSynthesis interface only “speaks” SpeechSynthesisUtterances. To get the browser to speak, first create a SpeechSynthesisUtterance and assign the string you want spoken to its text attribute. The language spoken will default to the app’s <html lang="..."> value unless a language code is specified with the utterance’s .lang attribute.
Based on this, let’s make a function called speakText() that speaks a given string aloud in a given language. In the root of the app, make a module file called textToSpeech.js and insert this code:
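A minimal sketch of that module (the alert wording is my own placeholder):

```javascript
// textToSpeech.js
function speakText(text, lang) {
  // Check for browser support and nudge the user toward Chrome if absent
  if (!('speechSynthesis' in window)) {
    alert('Your browser does not support speech synthesis. Try Chrome!');
    return;
  }
  // Build an utterance from the given text and language, then speak it
  const utterance = new SpeechSynthesisUtterance();
  utterance.text = text;
  utterance.lang = lang;
  speechSynthesis.speak(utterance);
}

export { speakText };
```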
The first few lines check whether the user’s browser supports speechSynthesis and nudge them to use Chrome if not.
The rest is familiar already — we create a new SpeechSynthesisUtterance and configure it with the given text (string) and lang (also a string) arguments. Finally, we export the module to use it in a new file called main.js.
Create the new file main.js. Maybe you’ve caught on that all of our files in the project will be in the root:

This is the code to put in main.js:
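A sketch, using the ids assumed in the markup earlier:

```javascript
// main.js
import { speakText } from './textToSpeech.js';

const speakButton = document.querySelector('#speak-button');
const phraseInput = document.querySelector('#phrase-input');
const languageSelect = document.querySelector('#language-select');

// Read the input text aloud in the selected language on click
speakButton.addEventListener('click', () => {
  speakText(phraseInput.value, languageSelect.value);
});
```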
At the top, you import the speakText() function you’ve just made and exported. Then you grab the ‘Hear it’ button, the text input field, and the language selector from the DOM. Lastly, you add a click event listener to the ‘Hear it’ button that triggers speakText() to read the input text aloud when clicked.

You’re excited to hear your browser speak French, but when you try ‘Hear it’ you might get an error like this in your console:

That’s my fault, because I made you use JavaScript ES6 modules, which for security reasons must be served over HTTP rather than opened straight from the file system. You can read about it here if you want.
If you aren’t already serving your code on localhost for testing, now is the time to launch your server. If you aren’t sure how to do that, a simple solution is to run Python’s SimpleHTTPServer in your project folder (python -m SimpleHTTPServer 8000, or python -m http.server 8000 on Python 3) following these quick instructions. Once your server is on, you can access your app from the URL http://localhost:<PORT NUMBER HERE>/ and hear it speak French to you. In my case, for example:

In fact, it will now speak any of the languages you select. Go ahead and listen to omelette au fromage in Japanese or German flavors.
Introducing SpeechRecognition
So far the app is already fun. Maybe you feel like calling it a day and using it to demystify the names of all those French wines in your cupboard, or to let your mother know she’s been pronouncing quinoa wrong all these years.
But we have yet to get to the best part — the judging factor!
SpeechSynthesis and SpeechRecognition are the meat and potatoes of this tutorial. We’ve done the meat — here’s the basic syntax for the potatoes:
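In essence (the language code is an arbitrary example):

```javascript
// Chrome exposes SpeechRecognition behind a webkit prefix
const recognition = new webkitSpeechRecognition();
recognition.lang = 'en-US';
recognition.start(); // fires up the mic and starts listening

// 'result' fires once a word or phrase is positively recognized
recognition.addEventListener('result', (event) => {
  const transcript = event.results[0][0].transcript;
  console.log(transcript);
});
```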
First, you create an instance of SpeechRecognition using the webkit prefix for Chrome. After configuring this recognizer’s language with the .lang attribute, you tell it to fire up the mic and start listening for speech.
There are a handful of events available to the recognizer, but the one we are mainly interested in is 'result', which fires once a word or phrase is positively recognized. The results property of this event returns a SpeechRecognitionResultList object, which we can drill through to get the transcript of the recognized text. Above, we log the transcript to the console once words are actually recognized. In case you’re curious what the SpeechRecognitionResultList object looks like, here’s a sample result of me saying “I have two eyebrows” in English:

Two things to note here:
- You need to serve your code through a web server for recognition to work in Chrome. No worries — we are already set up with localhost from earlier.
- The user will be prompted for mic access the first time in a session that .start() is called on a SpeechRecognition object. You can’t get around this. It is to prevent spying on people’s speech without their knowledge.

Adding speech recognition to our app
Make a module in the root of the app named speechToText.js. You will create a function called recognizeSpeech() that accepts a language, starts a recognizer in that language, and returns a Promise containing the recognizer result once it finishes listening.
Note that all this recognition magic happens in a black box in the cloud, making it an asynchronous operation. We will use the async and await keywords to tell JavaScript, “Hey, this is an asynchronous call — wait to hear back from the cloud with a fulfilled promise containing our transcript before doing something with it!”
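Here is a sketch of the module. It wraps the recognizer’s events in a Promise, so callers can wait for the transcript with async/await or a .then chain; the error messages are my own placeholders:

```javascript
// speechToText.js
function recognizeSpeech(lang) {
  return new Promise((resolve, reject) => {
    if (!('webkitSpeechRecognition' in window)) {
      reject(new Error('Speech recognition is not supported in this browser. Try Chrome!'));
      return;
    }
    const recognition = new webkitSpeechRecognition();
    recognition.lang = lang;
    recognition.start();

    // Resolve with the transcript once the recognizer hears something
    recognition.addEventListener('result', (event) => {
      resolve(event.results[0][0].transcript);
    });
    // Reject on recognition errors (no speech, mic blocked, etc.)
    recognition.addEventListener('error', (event) => {
      reject(event.error);
    });
  });
}

export { recognizeSpeech };
```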
Don’t forget to export this module so we can use it in main.js.
Now we need to trigger our recognizeSpeech() when the user clicks the ‘Say it’ button. Switch over to main.js and import the recognizeSpeech function you just made, right below your import of speakText. We will grab the recognizeButton and add an event listener to it, below the speakButton event listener (omitted):
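A sketch of those additions, again assuming the ids from the earlier markup (languageSelect was already grabbed above):

```javascript
// main.js (additions)
import { recognizeSpeech } from './speechToText.js';

const recognizeButton = document.querySelector('#recognize-button');
const transcriptInput = document.querySelector('#transcript-input');

recognizeButton.addEventListener('click', () => {
  recognizeSpeech(languageSelect.value)
    .then((transcript) => {
      // Show the user what the recognizer heard
      transcriptInput.value = transcript;
    })
    .catch((error) => console.error(error));
});
```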
What’s going on here? After getting the user’s selected language from the dropdown value, we pass it to the recognizeSpeech() function. This function returns a promise containing the recognition transcript once the user is done speaking. We take the result of this fulfilled promise and display it as the text value of the second input box so the user can see what they said. Lastly, the .catch will intercept any error that might occur in this Promise chain and display it to the console.
Let’s try it out! Make sure your server is on, click ‘Say it’, allow the mic, and you should see an icon on your tab indicating the mic is listening.

Give it a go:

You have now mastered both SpeechSynthesis and SpeechRecognition, in multiple languages! One problem though — where’s the sass when you pronounce something wrong?
Final touches
Judge the user’s pronunciation by displaying a message
Remember that empty div #result-message in the HTML? That’s where we will display a message to the user letting them know whether they successfully said the word. Make a new module file named compare.js and add this code.
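A sketch; the messages, colors, and matching rule (a case-insensitive string comparison) are my own placeholders:

```javascript
// compare.js
function compare(target, attempt) {
  const resultMessage = document.querySelector('#result-message');
  const success = attempt.trim().toLowerCase() === target.trim().toLowerCase();

  // Placeholder sass; season to taste
  resultMessage.textContent = success
    ? 'Nailed it!'
    : `That sounded like “${attempt}”. Try again!`;
  resultMessage.style.backgroundColor = success ? 'green' : 'red';

  // Reset to normal after two seconds
  setTimeout(() => {
    resultMessage.textContent = '';
    resultMessage.style.backgroundColor = '';
  }, 2000);
}

export { compare };
```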
This compare() function compares the target string and the user’s attempt, then displays a success or failure message in the div and changes the background color to green or red before resetting to normal after two seconds. Be sure to export this function, import it into main.js, then call it as an additional action in the chain of .then’s on the asynchronous recognizeSpeech(). Here is the complete main.js file reflecting these additions:
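Assembled from the sketches above:

```javascript
// main.js: complete, assuming the ids sketched in the markup earlier
import { speakText } from './textToSpeech.js';
import { recognizeSpeech } from './speechToText.js';
import { compare } from './compare.js';

const speakButton = document.querySelector('#speak-button');
const recognizeButton = document.querySelector('#recognize-button');
const phraseInput = document.querySelector('#phrase-input');
const transcriptInput = document.querySelector('#transcript-input');
const languageSelect = document.querySelector('#language-select');

// 'Hear it': speak the target phrase aloud
speakButton.addEventListener('click', () => {
  speakText(phraseInput.value, languageSelect.value);
});

// 'Say it': listen, display the transcript, then judge it
recognizeButton.addEventListener('click', () => {
  recognizeSpeech(languageSelect.value)
    .then((transcript) => {
      transcriptInput.value = transcript;
      compare(phraseInput.value, transcript);
    })
    .catch((error) => console.error(error));
});
```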

Animate the ‘Say it’ button when active
Make it more obvious to the user when the mic is active. I used Animate.CSS, which animates elements simply by adding class names. I made this helper function to toggle the animation classes, and then called it in speechToText.js on the ‘start’ and ‘end’ events of the recognizer.
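A sketch of such a helper. The file name and toggling details are my assumptions; 'animated', 'pulse', and 'infinite' are real Animate.CSS class names:

```javascript
// animateButton.js
function toggleMicAnimation() {
  // Toggle a pulsing, looping animation on the 'Say it' button
  const button = document.querySelector('#recognize-button');
  button.classList.toggle('animated');
  button.classList.toggle('pulse');
  button.classList.toggle('infinite');
}

export { toggleMicAnimation };
```

Then hook it to the recognizer inside recognizeSpeech() in speechToText.js:

```javascript
// At the top of speechToText.js:
import { toggleMicAnimation } from './animateButton.js';

// After creating the recognizer inside recognizeSpeech():
recognition.addEventListener('start', toggleMicAnimation);
recognition.addEventListener('end', toggleMicAnimation);
```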

Conclusion
Adding voice to your app with the Web Speech API can be a nice-to-have extra feature, given its limited browser compatibility. You can get your browser talking or listening with just 3–5 lines of vanilla JavaScript. You even know how to handle transcript results with Promises, giving you the power to navigate chains of events with voice. The vocal world is your oyster. Go nuts. But be sensible — you probably don’t want users yelling their passwords or social security numbers into forms on the train.
See the full code from this tutorial here, and try out the app here.