Analysis and implementation of deep learning algorithms for face to face translation based on audio-visual representations

Alonso de Apellániz, Patricia ORCID: https://orcid.org/0000-0002-8604-9758 (2020). Analysis and implementation of deep learning algorithms for face to face translation based on audio-visual representations. Master's Thesis, E.T.S.I. Telecomunicación (UPM).

Description

Title: Analysis and implementation of deep learning algorithms for face to face translation based on audio-visual representations
Author(s):
Supervisor(s):
Document Type: Master's Thesis
Master's Degree: Telecommunication Engineering
Date: 2020
Subjects:
SDGs:
Informal Keywords: Deep Learning, face transfer, image generation, synthesized frames, encoder, Convolutional Neural Networks (CNNs), autoencoder, Generative Adversarial Networks (GANs), Generator, Discriminator, data processing, dataset, Python, qualitative and quantitative evaluations.
School: E.T.S.I. Telecomunicación (UPM)
Department: Señales, Sistemas y Radiocomunicaciones
Creative Commons Licenses: Attribution - NoDerivatives - NonCommercial

Full text

PDF (7MB): TESIS_MASTER_PATRICIA_ALONSO_DE_APELLANIZ.pdf

Abstract

Generating synthesized images, and being able to animate or transform them in some way, has lately experienced a breathtaking evolution thanks, in part, to the use of neural networks. In particular, transferring facial gestures and audio to an existing image has attracted attention both in research and socially, due to its potential applications.
Throughout this Master's Thesis, a study of the state of the art in the different techniques that exist for this transfer of facial gestures, including even lip movement, between audiovisual media will be carried out. Specifically, it will focus on different existing methods and research works that generate talking faces based on several features extracted from the multimedia information used. From this study, the implementation, development, and evaluation of several systems will be done as follows.

First, given the importance of training deep neural networks on a large and well-processed dataset, VoxCeleb2 will be downloaded and will undergo a process of conditioning and adaptation, in which image and audio information is extracted from the original videos to be used as the input of the networks. These features, such as image key points and audio spectrograms, are widely used in the state of the art for tasks like the one mentioned.
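The preprocessing step described above can be sketched as follows. This is a minimal illustration using only NumPy, not the thesis's actual pipeline: the spectrogram is a plain framed-FFT log-spectrogram, and `detect_keypoints` is a hypothetical placeholder standing in for a real pretrained facial-landmark detector.

```python
import numpy as np

def audio_to_log_spectrogram(audio, frame_len=512, hop=128):
    """Log-magnitude spectrogram via a windowed short-time FFT."""
    window = np.hanning(frame_len)
    frames = [audio[i:i + frame_len] * window
              for i in range(0, len(audio) - frame_len + 1, hop)]
    # rfft of a real frame of length N gives N//2 + 1 frequency bins.
    power = np.abs(np.fft.rfft(np.stack(frames), axis=1)) ** 2
    return np.log(power + 1e-8)  # log compression tames the dynamic range

def detect_keypoints(frame):
    """Placeholder for a facial-landmark model (e.g. 68 2-D key points).
    A real pipeline would run a pretrained detector on the frame here."""
    return np.zeros((68, 2))

# One second of dummy 16 kHz audio and one dummy video frame.
audio = np.random.randn(16000)
frame = np.zeros((256, 256, 3), dtype=np.uint8)

features = {
    "spectrogram": audio_to_log_spectrogram(audio),
    "keypoints": detect_keypoints(frame),
}
print(features["spectrogram"].shape, features["keypoints"].shape)
```

In a real pipeline each VoxCeleb2 clip would yield one such feature pair per frame, aligned in time with the corresponding audio window.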

As the second part of this Thesis, three different convolutional networks, in particular Generative Adversarial Networks (GANs), will be implemented based on the implementation in [1], adding some new components such as the network that manages the audio features, and loss functions that depend on this new architecture and the networks' behavior. In other words, the first implementation will consist of the network from the paper mentioned; an encoder for audio features will then be added to this implementation; and, finally, training will be based on this last architecture but taking into account a loss computed for the audio learning.
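The combined objective described above can be illustrated with a small sketch. The specific terms and weights here are assumptions for illustration, not the thesis's actual losses: an adversarial term (binary cross-entropy against the discriminator), an L1 reconstruction term, and an audio-consistency term between a hypothetical audio embedding and a frame embedding.

```python
import numpy as np

def bce(pred, target, eps=1e-8):
    """Binary cross-entropy, the usual adversarial criterion."""
    return -np.mean(target * np.log(pred + eps)
                    + (1 - target) * np.log(1 - pred + eps))

def generator_loss(disc_on_fake, fake_frame, real_frame,
                   audio_embedding, frame_embedding,
                   w_adv=1.0, w_rec=10.0, w_audio=1.0):
    """Combined generator objective: fool the discriminator, stay close
    to the target frame, and keep the generated frame consistent with
    the audio encoder's output. Weights are illustrative only."""
    adv = bce(disc_on_fake, np.ones_like(disc_on_fake))  # G wants D(fake) = 1
    rec = np.mean(np.abs(fake_frame - real_frame))       # L1 reconstruction
    audio = np.mean((audio_embedding - frame_embedding) ** 2)  # audio term
    return w_adv * adv + w_rec * rec + w_audio * audio

# Dummy arrays standing in for network outputs.
rng = np.random.default_rng(0)
loss = generator_loss(
    disc_on_fake=rng.uniform(0.1, 0.9, size=(4, 1)),
    fake_frame=rng.uniform(size=(4, 64, 64, 3)),
    real_frame=rng.uniform(size=(4, 64, 64, 3)),
    audio_embedding=rng.normal(size=(4, 128)),
    frame_embedding=rng.normal(size=(4, 128)),
)
print(float(loss))
```

The three networks compared in the Thesis would then differ in which of these terms (and which encoders) are active during training.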

Finally, to compare and evaluate each network's results, both quantitative metrics and qualitative evaluations will be carried out. Since the final output of these systems is a clear and realistic video of a random face to which the gestures of another have been transferred, perceptual visual evaluation is key to solving this problem.
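The abstract does not name the quantitative metrics used; one common choice for comparing synthesized frames against reference frames is peak signal-to-noise ratio (PSNR). A minimal NumPy sketch, assuming 8-bit frames (the metric choice is an assumption, not necessarily the thesis's):

```python
import numpy as np

def psnr(reference, generated, max_val=255.0):
    """Peak signal-to-noise ratio between two frames, in dB."""
    diff = reference.astype(np.float64) - generated.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return 10.0 * np.log10(max_val ** 2 / mse)

rng = np.random.default_rng(0)
real = rng.integers(0, 256, size=(64, 64, 3))
noisy = np.clip(real + rng.normal(0, 5, size=real.shape), 0, 255)
print(psnr(real, noisy))  # higher is better
```

A per-frame metric like this complements, but does not replace, the perceptual evaluation the abstract emphasizes, since numerically close frames can still look implausible as faces.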

More information

Record ID: 62958
DC Identifier: https://oa.upm.es/62958/
OAI Identifier: oai:oa.upm.es:62958
Deposited by: Biblioteca ETSI Telecomunicación
Deposited on: 10 Jul 2020 08:13
Last Modified: 15 Jun 2023 08:32