Accents in Latent Spaces

How AI Hears Accent Strength in English

We work with accents a lot at BoldVoice, the AI-powered accent coaching app for non-native English speakers. Accents are subtle patterns in speech—vowel shape, timing, pitch, and more. Usually, you need a linguist to make sense of these qualities. However, our goal at BoldVoice is to get machines to understand accents, and machines don’t think like linguists. So we ask: how does a machine learning model understand an accent, and in particular, how strong it is?

To begin this journey, we first introduce the “accent fingerprint,” an embedding generated by running an English speech recording through BoldVoice’s large-scale accented speech model.

torch.Size([1, 768, 12])
The accent fingerprint embedding dimensions
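For intuition, here is a minimal sketch of how a fingerprint tensor of this shape could be pooled into a single vector for distance comparisons. The layout of the tensor (768 feature dimensions by 12 layers or frames) and the mean-pooling strategy are our assumptions for illustration; the post does not describe BoldVoice’s actual pooling.

```python
import numpy as np

# Stand-in for a real accent fingerprint with shape [1, 768, 12].
# (Random values; only the shapes matter in this sketch.)
fingerprint = np.random.rand(1, 768, 12)

# Assumed pooling: average over the 12-way axis, drop the batch axis,
# leaving one 768-dim vector per recording.
pooled = fingerprint.mean(axis=2).squeeze(0)
print(pooled.shape)  # (768,)
```

A single vector per recording is what makes distance-based comparisons in the latent space straightforward.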

In this post we’ll show where the accent fingerprint lives in a latent space, how distances and directions in that space correspond to accent similarity and language background, and how we use it to coach our product management intern Victor, a non-native English speaker, toward the American English accent of our expert accent coach Eliza.

The Original Recordings

First off, here's how Victor sounds when speaking English:

Victor (original recording)

Now have a listen to Eliza reading the same passage. Eliza is demonstrating our “target” American accent.

Eliza's recording

Compared to Eliza, a native speaker of American English, Victor has a noticeably strong Chinese accent when speaking English.

The Latent Space

To make sense of how the machine learning model understands both of these recordings, we now populate a latent space with 1,000 speech recordings sourced from our internal data, representing varied levels of accent strength. Feel free to inspect the 2D visualization of the latent space1 and hover over the points to see details about each recording.

The full dimensional latent space contains information about speaker identity, accent, intelligibility, emotion, and other characteristics. This visualization has been pruned to show only the information relevant to “accent strength”, that is, “how strong is the speaker’s accent relative to native speakers of English?”

More specifically, we apply PLS regression to identify the latent space directions which correlate most with human accent strength ratings, and for the purpose of this visualization only, we apply 2D UMAP dimensionality reduction. The x-axis represents the first hidden dimension of accent strength, while the y-axis represents the second hidden dimension.2

The code below shows how the dimensions of the latent space are selected:

from sklearn.cross_decomposition import PLSRegression
from umap import UMAP

pls = PLSRegression().fit(train_accent_fingerprints, train_accent_strength_ratings)
accent_strength_features = pls.transform(test_accent_fingerprints)
visualization_features = UMAP(n_components=2).fit_transform(accent_strength_features)

1 For the sake of brevity, we will use the term latent space to refer to both the full dimensional space, as well as the pruned 2D visualization.

2 The dimensions are not readily interpretable, are not orthogonal, and are chosen solely to maximize their utility for discriminating between accent strength in L2 English (English as a second language).

Plotting Accents

Now, let’s visualize the accent fingerprints of Victor’s and Eliza’s recordings in this latent space. You can see a purple diamond in the bottom left representing Eliza’s recording and a yellow diamond towards the top right representing Victor’s recording.

From what we can see, the closer a recording sits to the lower left of the plot, the more “native sounding” and less strong its speaker’s accent is. Accordingly, we labeled the points as Native, Near Native, Advanced, Intermediate, and Beginner based on their distance from Eliza’s position in the latent space.
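The banding described above can be sketched as a simple distance threshold. The cutoff values below are invented for illustration; the real bands would be calibrated against human accent strength ratings.

```python
import numpy as np

# Illustrative sketch: label a recording by its Euclidean distance from the
# target (Eliza's) position in the latent space. Thresholds are made up.
BANDS = [(1.0, "Native"), (2.0, "Near Native"), (4.0, "Advanced"),
         (7.0, "Intermediate"), (float("inf"), "Beginner")]

def accent_band(point, target):
    dist = float(np.linalg.norm(np.asarray(point) - np.asarray(target)))
    return next(label for cutoff, label in BANDS if dist <= cutoff)

print(accent_band([0.5, 0.2], [0.0, 0.0]))  # "Native" (distance ≈ 0.54)
```

The same function works in the full-dimensional space or the 2D visualization; only the calibration of the thresholds would differ.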

Another immediate finding is that the latent space is not biased towards particular native languages: we see no clustering by the speaker’s native language, and a fairly uniform distribution of native languages across all proficiency levels.

Now, let’s look at some creative ways that we can use our in-house suite of speech models and tools to help Victor get closer to Eliza’s accent.

Cleaning the Background Noise

The first thing that jumps out is that Eliza’s recording is much cleaner than Victor’s. Perhaps it will be easier for him to focus on just the accent differences if we can get rid of the background noise in his recording?

Victor (cleaned recording)

Surprise! This didn’t change Victor’s position in the latent space much: the cleaned recording lands very close to Victor’s original at the top right of the latent space. This is a good sanity check that our latent space is working correctly—recording quality and the level of background noise are not relevant to accent strength.
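This kind of invariance check is easy to quantify: compare the fingerprint of the original recording against the denoised one and measure how far it moved relative to its own magnitude. The vectors below are synthetic stand-ins, not real fingerprints.

```python
import numpy as np

# Synthetic sanity check: if denoising only perturbs the fingerprint slightly,
# its latent-space position barely moves.
rng = np.random.default_rng(0)
original = rng.normal(size=768)                       # stand-in fingerprint
denoised = original + 0.01 * rng.normal(size=768)     # small acoustic-only change

shift = np.linalg.norm(original - denoised)
scale = np.linalg.norm(original)
print(f"relative shift: {shift / scale:.3f}")  # small relative to the vector itself
```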

Converting the Accent

Next, perhaps Victor finds it difficult to mimic Eliza’s accent because the register of his voice is so much lower than hers. So we’re going to use BoldVoice’s in-house accent conversion model to hear what Victor sounds like with Eliza's accent. (Yes, we can really do that—we'll share more about this in a future post.)

Victor's original recording
Victor (converted recording)

As you can see, the position of Victor with Eliza’s accent is right next to Eliza’s original position in the latent space. Phonetically speaking, there are still some differences in vowel shapes, emphasis, pitch, and timing, but even without expert knowledge, Victor will have a much easier time mimicking Eliza’s accent now that it’s in his own voice.

Practicing the Accent

We left Victor with this audio of his voice with Eliza's accent for about 10 minutes to give him time to practice mimicking it. Here’s what Victor sounds like after that practice:

Victor (after practice recording)
Compare to Eliza's original recording

Not bad—Victor matched her timing, intonation and stress pretty well, but some of the vowel shapes still aren’t quite the same. Let’s see how far he is from Eliza in the latent space now.

That’s quite an improvement! Victor’s new position in the latent space is right on the border of Intermediate and Advanced.

If Victor wanted to move beyond this point, the sound-by-sound phonetic analysis available in the BoldVoice app would allow him to understand the patterns in pronunciation and stress that contribute to Eliza’s accent and teach him how to apply them in his own speech.

What did we learn?

  • This machine learning model can clearly distinguish the strength of a speaker’s accent.
  • The model’s assessment of accent strength appears independent of the speaker's native language background.
  • The accent strength of a speaker is something that can be changed with practice.
  • Voice conversion technology can map a target accent onto a different voice, providing a useful tool for practice.
  • Changes to the acoustic environment, such as denoising, don’t result in a large change in measured accent strength.

Applications and Next Steps

The accent strength metric derived from this model has several promising applications.

  1. It offers a quantitative way to track an English learner’s accent journey over multiple recordings by measuring their distance from a target accent profile in the latent space.
  2. This same quantitative approach can be applied to rigorously evaluate automatic speech recognition (ASR) systems for performance variations across different accent strengths.
  3. It can similarly monitor text-to-speech (TTS) systems for unwanted changes in accent, often referred to as “accent drift.”
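The first application above might look like the following sketch: track the distance from a learner’s fingerprint to a target accent profile across successive recordings. All points and values here are invented for illustration.

```python
import numpy as np

# Illustrative progress tracker: distance to a target accent profile per session.
# The 2D points stand in for accent-strength features; values are made up.
target = np.array([0.0, 0.0])           # e.g. the coach's position
sessions = [np.array([8.0, 6.0]),       # first recording
            np.array([5.0, 4.0]),       # mid-practice
            np.array([3.0, 2.0])]       # after more practice

distances = [float(np.linalg.norm(s - target)) for s in sessions]
print(distances)  # strictly decreasing: the learner is converging on the target
```

A decreasing distance curve over time gives the learner a concrete, quantitative view of their accent journey.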

Stay tuned for more!

Do you have any questions or comments? Or a suggestion for what you would like us to cover in the future? Please reach out to us at [email protected]!

In our next post, we'll demonstrate how to explore accent fingerprints (embeddings) directly without engineering them for any particular task, and go on a tour of the world’s accents in English.