Deepfake Audio Text to Speech - An Introduction

by Andrea Hauser
on March 18, 2021
time to read: 8 minutes


This is how Audio Deepfakes work

  • In addition to visual deepfakes, there is also the possibility of creating deepfakes from audio
  • These rely on so-called text-to-speech methods
  • With the currently publicly available tools, a solid technical understanding is required to create audio deepfakes; the general public cannot yet easily create one
  • Nevertheless, companies should think about how they will deal with threats such as fake phone calls in the future

In 2018 and 2019, in a series of articles, deepfakes were analysed and our own deepfakes were created. At that time, only video and image deepfakes were considered. However, deepfakes are not limited to videos and images; they can also be created for audio recordings. This article gives an overview of deepfakes for voice recordings.

The procedure for creating audio deepfakes is similar to that for visual deepfakes, but differs in important ways. What is similar is that audio deepfakes are based on the same principles of computation with neural networks. The processing of the source material differs, however: to create an audio deepfake, clean recordings of a speaker are needed, preferably without interruptions, ambient or background noise. The more such material is available, the better the audio deepfake will be. Current tools take the approach of reading out a given text in the voice of a selected person.

How is an Audio Deepfake created?

In the first step, the model must be taught to read text in a specified language and to reproduce what it has read as speech. This is based on a generic voice for which a large amount of voice material, at least 24 hours of audio, has to be available. In addition, transcripts must be available for the recordings so that text can eventually be converted into voice sounds. The model is fed with the recordings and the transcripts. The text and audio segments given to the model should be no longer than 10 seconds each and should end at a word boundary. This means that considerable effort is needed to prepare the data for such a generic model: on the one hand, enough good recordings must be available; on the other hand, the recordings and texts must be brought into the form the model expects. After the material has been prepared, a lengthy computation is performed that allows the model to establish a correlation between the text and the audio. The result is a generic base model in the desired language, which can then be fine-tuned with the target voice in the next step.
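The data-preparation constraints described above can be sketched in a few lines of Python. This is a minimal illustration with hypothetical field names, not the metadata format of any specific toolkit (real tools such as Mozilla TTS or Tacotron2 typically expect LJSpeech-style metadata files):

```python
# Sketch of the data-preparation checks described above: keep only clips
# that fit the training constraints (clip length, transcript present),
# and verify that enough total material is available.
from dataclasses import dataclass

@dataclass
class Clip:
    audio_path: str   # path to the WAV file
    transcript: str   # text spoken in the clip
    duration: float   # clip length in seconds

MAX_CLIP_SECONDS = 10.0  # segments should not exceed ~10 seconds

def usable_clips(clips):
    """Keep only clips short enough for training and with a transcript."""
    return [
        c for c in clips
        if 0 < c.duration <= MAX_CLIP_SECONDS and c.transcript.strip()
    ]

def total_hours(clips):
    """Total amount of speech in hours, to check against the ~24 h target."""
    return sum(c.duration for c in clips) / 3600.0
```

In practice one would run `usable_clips` over the full corpus and only start training the generic base model once `total_hours` of the remaining material reaches roughly the 24-hour mark mentioned above.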

The fine-tuning takes roughly another 30% of the time needed to train the base model. About 2.5 to 3 hours of voice recordings of the desired speaker are necessary to achieve a good result. However, the results obtained from this fine-tuning still sound relatively metallic and robot-like. The reason is that only the most important frequencies are covered in this training, since the amount of computation and time required to train on all frequencies present would be far too great. To turn the metallic voice into a convincing, or even indistinguishable, imitation of the target voice, one last step is needed: the results are fed into a so-called neural vocoder, which fills in the gaps in the frequencies and thus gives the output a natural sound.
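The two-stage pipeline described above can be illustrated schematically. The functions and constants below are stand-ins chosen for illustration, not a real library API: the acoustic model predicts a coarse spectrogram with only a limited number of frequency bands, and the neural vocoder expands each frame into full-band waveform samples:

```python
# Schematic of the two-stage synthesis pipeline: text -> coarse
# spectrogram (acoustic model) -> waveform (neural vocoder).
# All values are illustrative placeholders, not real model output.

N_MEL_BANDS = 80        # coarse frequency representation the acoustic model predicts
FRAMES_PER_CHAR = 5     # crude duration model: spectrogram frames per input character
SAMPLES_PER_FRAME = 256 # hop size the vocoder expands each frame into

def acoustic_model(text):
    """Stand-in for a Tacotron2-style model: text -> list of mel-band frames."""
    n_frames = len(text) * FRAMES_PER_CHAR
    return [[0.0] * N_MEL_BANDS for _ in range(n_frames)]

def neural_vocoder(mel_frames):
    """Stand-in vocoder: fills in the missing frequency detail and
    returns raw waveform samples."""
    return [0.0] * (len(mel_frames) * SAMPLES_PER_FRAME)

def synthesize(text):
    """Full pipeline: text -> coarse spectrogram -> waveform."""
    return neural_vocoder(acoustic_model(text))
```

The key design point is the split itself: training the acoustic model on a coarse 80-band representation keeps the computation tractable, and the vocoder is trained separately to reconstruct the full frequency content from it.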

What Tools are currently available?

There are several publicly available tools. Two that look very promising are TTS from Mozilla and tacotron2 from NVIDIA. Both come with usage instructions, but it quickly becomes clear that the currently available tools require technical understanding as well as an understanding of how audio deepfakes work.

How good are these Tools?

On YouTube, many examples of audio deepfakes created with Tacotron2 can be found. For example, the Vocal Synthesis channel uses this approach. With the publicly available tools, however, the results usually sound audibly manipulated.

In this video of a Donald Trump impersonation, strange processing artifacts can still be heard clearly. However, if this clip were overlaid with loud background noise, such as a party or a bad-sounding connection, and the listener expected nothing malicious, it could very well be used to fool people.

There are also better examples, such as the clip from the channel Speaking of AI, in which an imitation of Homer Simpson can be heard.

The examples from this YouTube channel are created with a method that is not publicly available. Nevertheless, they show that amazing results can already be achieved today.

What are the Consequences?

In combination with visual deepfakes, this can potentially create a complete imitation of a person. This results in similar advantages and disadvantages as already discussed in the introduction to deepfakes.

In addition, however, other points arise. With the ability to imitate voices, convincing phone-phishing attacks can be carried out against companies. Furthermore, the mere existence of the technology allows people to dismiss video or audio evidence as fake, whether or not that evidence is genuine.

A positive point for the technology is that it can be used to recreate the voices of people who have lost their voice due to illness or other factors. This makes it possible to offer personalized computer voices as voice substitutes.

Dealing with Audio Deepfake Attacks over the Phone

At the very least, it makes sense to hang up when a fake call is suspected, then call the person back and confirm what has been discussed. The callback should not go to a number given during the call, but to the known number of the person who allegedly called.


The possibility to create realistic audio deepfakes already exists today, although the currently available public tools require a certain technical understanding. Nevertheless, companies should think about how they will deal with threats such as fake phone calls in the future, and how staff should be trained to handle them.

About the Author

Andrea Hauser

Andrea Hauser graduated with a Bachelor of Science FHO in information technology at the University of Applied Sciences Rapperswil. She is focusing her offensive work on web application security testing and the realization of social engineering campaigns. Her research focus is creating and analyzing deepfakes. (ORCID 0000-0002-5161-8658)

