This is how Audio Deepfakes work
Creating audio deepfakes is similar to creating visual deepfakes in that both rely on the same principles of computation with neural networks. The way the material is processed differs, however, because the starting point is voice rather than image data. To create an audio deepfake, clear recordings of a speaker are needed, preferably without interruptions or ambient and background noise. The more such material is available, the better the audio deepfake will be. Current tools take a text-to-speech approach: they read out a given text in the voice of a selected person.
In a first step, the model must be taught to read text in a specified language and to reproduce what it has read as speech. This is done with a generic voice for which a lot of voice material, at least 24 hours of audio, has to be available. In addition, transcripts of the recordings must be available so that text can eventually be converted to voice sounds. The model is fed with the recordings and the transcripts. For training, the text and audio segments given to the model should be no longer than 10 seconds each and should end at the end of a word. Preparing the data for such a generic model therefore takes a lot of effort: enough good recordings must be available, and the recordings and texts must be brought into the form the model expects. After the material has been prepared, a lengthy computation is performed so that the model can establish a correlation between text and audio. The result is a generic base model in the desired language, which is fine-tuned with the target voice in the next step.
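The segmentation rule described above (at most 10 seconds per segment, ending at the end of a word) can be sketched in a few lines of Python. This is only an illustration under assumptions: it presumes that word-level timestamps already exist, for example from a forced-alignment step, and the names Segment and split_at_word_boundaries are hypothetical and not part of Mozilla TTS or Tacotron2.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Limit taken from the article: segments should not exceed 10 seconds.
MAX_SEGMENT_SECONDS = 10.0

# One aligned word: (text, start time in seconds, end time in seconds).
# Assumption: these timestamps come from some earlier alignment step.
Word = Tuple[str, float, float]


@dataclass
class Segment:
    start: float  # start of the first word in the segment
    end: float    # end of the last word in the segment
    text: str     # transcript for exactly this audio span


def _close(words: List[Word]) -> Segment:
    """Turn a non-empty run of words into one training segment."""
    return Segment(
        start=words[0][1],
        end=words[-1][2],
        text=" ".join(w[0] for w in words),
    )


def split_at_word_boundaries(
    words: List[Word], max_seconds: float = MAX_SEGMENT_SECONDS
) -> List[Segment]:
    """Greedily group words so that no segment is longer than max_seconds.

    A segment is only ever closed after a complete word, so every segment
    ends at the end of a word.
    """
    segments: List[Segment] = []
    current: List[Word] = []
    for text, start, end in words:
        if end - start > max_seconds:
            raise ValueError(f"single word {text!r} exceeds {max_seconds} s")
        # Adding this word would exceed the limit: close the running segment.
        if current and end - current[0][1] > max_seconds:
            segments.append(_close(current))
            current = []
        current.append((text, start, end))
    if current:
        segments.append(_close(current))
    return segments


if __name__ == "__main__":
    demo = [("good", 0.0, 4.0), ("morning", 4.0, 8.0),
            ("dear", 8.0, 12.0), ("listeners", 12.0, 13.0)]
    for seg in split_at_word_boundaries(demo):
        print(seg)
```

Each resulting segment carries both the time span to cut from the recording and the matching piece of transcript, which is the pairing the model is fed with. A real preparation pipeline would also have to cut the audio files themselves and handle pauses, which this sketch leaves out.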
Fine-tuning takes roughly another 30% of the time needed to train the base model. About 2.5 to 3 hours of voice recordings of the desired speaker are necessary to achieve a good result. However, the results obtained from this fine-tuning still sound relatively metallic and robot-like. The reason is that only the most important frequencies are trained, since training correctly for all frequencies present would require far too much computation and time. One last step is needed to turn the metallic voice into a better-sounding imitation, or even one that can no longer be told apart from the real voice. The results are fed into a so-called neural vocoder, which fills in the gaps in the frequencies and thus gives the output a natural sound.
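The figures mentioned so far (at least 24 hours of generic audio, about 2.5 to 3 hours of the target voice, and fine-tuning taking roughly 30% of the base training time) can be turned into a rough planning check. The following Python sketch only restates that arithmetic. The function names are hypothetical, the 100-hour base training time in the example is a made-up placeholder, and the lower bound of 2.5 hours is used as the threshold by assumption.

```python
# Figures taken from the article.
MIN_BASE_AUDIO_HOURS = 24.0    # generic voice material for the base model
MIN_TARGET_AUDIO_HOURS = 2.5   # lower bound of the 2.5 to 3 hour range
FINE_TUNE_RATIO = 0.30         # fine-tuning time relative to base training


def total_training_hours(base_training_hours: float,
                         fine_tune_ratio: float = FINE_TUNE_RATIO) -> float:
    """Base training time plus the additional fine-tuning time."""
    return base_training_hours * (1.0 + fine_tune_ratio)


def has_enough_audio(base_audio_hours: float,
                     target_audio_hours: float) -> bool:
    """Check the recorded material against the article's rough minimums."""
    return (base_audio_hours >= MIN_BASE_AUDIO_HOURS
            and target_audio_hours >= MIN_TARGET_AUDIO_HOURS)


if __name__ == "__main__":
    # Placeholder value: assume the base model took 100 hours to train.
    print(total_training_hours(100.0))
    print(has_enough_audio(base_audio_hours=30.0, target_audio_hours=2.0))
```

The time needed for the neural vocoder is not included, because the article gives no figure for it.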
Several tools are publicly available. Two that look very promising are TTS from Mozilla and Tacotron2 from NVIDIA. Both come with instructions, but it quickly becomes clear that the currently available tools require technical skills as well as an understanding of how audio deepfakes work.
On YouTube you can find many examples of results of audio deepfakes with Tacotron2. For example, the Vocal Synthesis channel uses this approach. However, with the publicly available tools, the results are mostly audibly manipulated.
In this video of a Donald Trump impersonation, strange remnants of the processing can still be heard clearly. However, if the clip were overlaid with loud background noise, such as a party or a poor connection, and the listener expected nothing malicious, it could very well be used to fool people.
There are also better examples, such as the clip from the channel Speaking of AI, in which an imitation of Homer Simpson can be heard.
The examples from this YouTube channel, however, are created with a method that is not publicly available. Nevertheless, they show that amazing results can already be achieved today.
In combination with visual deepfakes, this can potentially create a complete imitation of a person. The advantages and disadvantages are similar to those already discussed in the introduction to deepfakes.
Audio deepfakes raise further points, however. With the ability to imitate voices, convincing phone phishing attacks can be carried out against companies. Furthermore, the mere existence of the technology allows people to dismiss video or audio evidence as fake, whether that evidence is genuine or not.
A positive point for the technology is that it can be used to recreate the voices of people who have lost their voice due to illness or other factors. This makes it possible to offer personalized computer voices as voice substitutes.
When a fake call is suspected, it makes sense at the very least to hang up, call the person back and confirm what was discussed. The callback should not go to a number given during the call, but to the known number of the person who allegedly called.
Realistic audio deepfakes can already be created today. However, the currently available public tools still require a certain technical understanding. Nevertheless, companies should think about how they will deal with, for example, fake phone calls in the future, and how staff should be trained to handle them.