This is how you create the perfect Deepfake
The source material is processed with a model trained on Donald Trump and Nicolas Cage. The tests are divided into three categories: amount of images, lighting and shadows, and side views.
Each category contains multiple test cases, and each test case yields two videos as results. The first video shows the result generated with the default values, and the second shows a manually tweaked version intended to give the best possible result.
The goal of this category is to determine the minimum number of images required to create a successful deepfake. The test cases are divided into 500, 2,000, and 5,000 images. We would also like to determine whether a large number of images from a single target video is sufficient or whether multiple target videos are needed. The basic target video of George W. Bush is a YouTube video with the title Bush’s Best Speech.
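The reduction of a video to a fixed number of images can be illustrated with a minimal sketch. The article does not state how the 500, 2,000, or 5,000 images were selected, so the even spacing used here and the function name `evenly_spaced_indices` are assumptions for illustration only.

```python
def evenly_spaced_indices(total_frames: int, wanted: int) -> list[int]:
    """Pick `wanted` frame indices spread evenly over a video.

    Hypothetical helper: the selection strategy is an assumption,
    not the procedure used in the tests described above.
    """
    if total_frames <= 0 or wanted <= 0:
        return []
    if wanted >= total_frames:
        # Not enough frames to thin out: keep every frame.
        return list(range(total_frames))
    step = total_frames / wanted  # step > 1, so indices never repeat
    return [int(i * step) for i in range(wanted)]


# Example: thin a 15,000-frame video down to the 500-image test case.
indices = evenly_spaced_indices(15_000, 500)
print(len(indices), indices[:3])
```

Even spacing keeps the selected images spread over the whole video, which avoids sampling only one pose or one lighting situation.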
George Clooney is represented by a mix of four different videos.
After 24 hours of computing the results are the ones shown below. You can clearly see that manual tweaking of parameters increases the quality of the result.
After another 24 hours of computing the result consists of the two videos shown below. Once again the manual tweaking of the merge parameters produces much better results.
After another 24 hours the following results can be shown. Once again the default video is of lesser quality.
After training on 5,000 to 5,000 images for a day, the resulting model was used in a one-minute-long conversion to the 7-second video of the other test cases. As usual, the default values generate less convincing results. However, it is not possible to determine any difference in quality between the 168-image version of Bush and the 5,000-image version of Bush. There is also no difference between 500 and 5,000 images of Clooney. This leads us to the conclusion that just 500 images are required to create a perfect deepfake.
We shall now determine how lighting and shadows influence the quality of deepfakes. We have chosen material of George Clooney in which his face is partially in shadow or in which the lighting of the source and destination videos is not the same. The source video of George W. Bush remains the same YouTube video as used before.
Both images have solid lighting in their respective videos. The face of George Clooney is a bit reddish. This difference in color can also be seen in the results. The video with the default values shows the same flickering as before. But in this case it was not possible to eliminate this effect with manual tweaking of the parameters, as normalizing the colors applies some of Clooney’s reddish color to Bush’s pale skin tone, resulting in an unconvincing image. However, this analysis shows what kind of facial areas were touched by the deepfake algorithm.
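Why color normalization carries a color cast from one face to the other can be shown with a small sketch. It assumes the simplest possible normalization, a per-channel mean shift on RGB pixels; the merge tool used in the tests may work differently, and the names `channel_means` and `transfer_mean_color` are hypothetical.

```python
from statistics import mean


def channel_means(pixels):
    """Average R, G and B values of a list of (r, g, b) pixels."""
    return tuple(mean(p[c] for p in pixels) for c in range(3))


def transfer_mean_color(face, reference):
    """Shift each channel of `face` so its mean matches `reference`.

    Assumed, simplified normalization; not the tool's actual algorithm.
    """
    face_mean = channel_means(face)
    ref_mean = channel_means(reference)
    shifted = []
    for p in face:
        # Move every pixel by the difference of the means, clamped to 0..255.
        shifted.append(tuple(
            min(255.0, max(0.0, p[c] - face_mean[c] + ref_mean[c]))
            for c in range(3)
        ))
    return shifted


pale_face = [(200, 200, 200), (220, 220, 220)]   # neutral, pale skin tone
reddish_ref = [(230, 150, 140), (210, 130, 120)]  # reddish reference face
print(transfer_mean_color(pale_face, reddish_ref))
```

After the shift, the pale pixels take on the red-heavy channel balance of the reference, which matches the effect described above: the normalization does exactly what it is asked to do, and the result still looks wrong.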
In the source material for George Clooney, one side of his face is in shadow and the other side is lit a little too brightly. This test case clearly shows that the lighting of the source video plays an important role in the selection of the source material. No good result can be achieved with distinctly different illuminations.
In this case, the usual source for George Bush was used, while for Clooney an interview with a black background was drawn upon. The color of the background does not lead to any significant issues when using default parameters, although the typical flicker is present. In the manually tweaked merge, however, the black background has a clear effect on the result: The produced face is too dark.
This produced a generally good result, although the resulting face looked blurry. More training would likely reduce this blur.
With this case, we investigated how many side views are necessary in the source material to convincingly fake a target video with side views. The same video from Amount of Images was used as the base video of George W. Bush.
While the result is generally good, this video also clearly shows areas where deepfake technology needs to improve. Focusing on the mouth, it becomes clear that the algorithm cannot handle teeth particularly well. They are either not shown at all or appear as a single white area, which even overlaps the lips in most cases.
For this category, a pre-trained model of Nicolas Cage was utilized, which led to an effect where the resulting face became a mixture of George W. Bush, George Clooney and Nicolas Cage in the side views.
Here, too, Nicolas Cage’s facial traits show in some side views. We can therefore conclude that more than 30% of the source recording needs to consist of side views to produce convincing side views.
The number of faces plays less of a role than expected. Much more important is the similarity of the material in terms of illumination and face angles, as high-quality deepfakes can only be produced with similar material.
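Whether a set of source images reaches the 30% mark can be checked with a short sketch. It assumes that a head yaw angle in degrees is already available for every extracted face, for example from a pose estimator. The 30-degree cut-off for what counts as a side view and the function name `side_view_share` are assumptions, not values from the tests.

```python
def side_view_share(yaw_degrees, side_threshold=30.0):
    """Fraction of faces whose absolute yaw reaches the threshold.

    `side_threshold` is an assumed cut-off for calling a face a
    side view; the article does not define one.
    """
    if not yaw_degrees:
        return 0.0
    side = sum(1 for yaw in yaw_degrees if abs(yaw) >= side_threshold)
    return side / len(yaw_degrees)


# Hypothetical yaw angles for five extracted faces.
share = side_view_share([0, 5, -40, 60, 10])
print(f"{share:.0%} side views, above 30%: {share > 0.30}")
```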