Full text
PDF (Portable Document Format) - a PDF file viewer such as GSview, Xpdf or Adobe Acrobat Reader is required
Download (7MB) | Preview
ORCID: https://orcid.org/0000-0002-8604-9758
(2020). Analysis and implementation of deep learning algorithms for face to face translation based on audio-visual representations. Thesis (Master's), E.T.S.I. Telecomunicación (UPM).
| Title: | Analysis and implementation of deep learning algorithms for face to face translation based on audio-visual representations |
|---|---|
| Author(s): | |
| Supervisor(s): | |
| Document Type: | Thesis (Master's) |
| Master's Degree: | Ingeniería de Telecomunicación |
| Date: | 2020 |
| Subjects: | |
| SDGs: | |
| Informal Keywords: | Deep Learning, face transfer, image generation, synthesized frames, encoder, Convolutional Neural Networks (CNNs), autoencoder, Generative Adversarial Networks (GANs), Generator, Discriminator, data processing, dataset, Python, qualitative and quantitative evaluations |
| School: | E.T.S.I. Telecomunicación (UPM) |
| Department: | Señales, Sistemas y Radiocomunicaciones |
| Creative Commons License: | Attribution - NoDerivatives - NonCommercial |
Generating synthesized images, and being able to animate or transform them, has lately undergone a remarkable evolution thanks, in part, to the use of neural networks. In particular, transferring facial gestures and audio to an existing image has attracted attention both in research and socially, due to its potential applications.
Throughout this Master's Thesis, a study of the state of the art in the different techniques for transferring facial gestures, including lip movement, between audiovisual media will be carried out. Specifically, it will focus on existing methods and research that generate talking faces from features extracted from the multimedia information used. Building on this study, the implementation, development, and evaluation of several systems will proceed as follows.
First, given the importance of training deep neural networks on a large and well-processed dataset, VoxCeleb2 will be downloaded and will undergo a conditioning and adaptation process in which image and audio information is extracted from the original videos to be used as network input. These features, such as image key points and audio spectrograms, are widely used in the state of the art for tasks like the one addressed here.
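As an illustration of the audio features mentioned, a log-magnitude spectrogram can be computed with a framed FFT in NumPy. This is only a minimal sketch: the window size, hop length, and sampling rate below are arbitrary choices for the example, not the parameters actually used in the thesis pipeline.

```python
import numpy as np

def log_spectrogram(signal, n_fft=256, hop=128):
    """Log-magnitude spectrogram via a Hann-windowed framed FFT."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    spectrum = np.abs(np.fft.rfft(frames, axis=1))  # (n_frames, n_fft//2 + 1)
    return np.log1p(spectrum)                       # compress dynamic range

# 1 second of a 440 Hz tone sampled at 16 kHz
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)
spec = log_spectrogram(tone)
print(spec.shape)  # (124, 129): time frames x frequency bins
```

With 62.5 Hz per frequency bin (16000 / 256), the energy of the 440 Hz tone concentrates around bin 7, which is a quick sanity check that the feature extraction behaves as expected.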
As the second part of this Thesis, three different convolutional networks, in particular Generative Adversarial Networks (GANs), will be implemented based on [1]'s implementation, adding new components such as a network that handles the audio features, and loss functions adapted to this new architecture and the network's behavior. In other words, the first implementation will consist of the network described in the paper mentioned; to this implementation, an encoder for audio features will be added; and, finally, the training will be based on this last architecture while also taking into account a loss computed for the audio learning.
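The idea of extending the base generator with an audio encoder can be sketched as two encoders whose latent codes are concatenated before decoding into a frame. Everything here is illustrative: the layer sizes, the single-linear-layer encoders, and the random weights are assumptions for the sketch, not the architecture from [1] or the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, w):
    """Flatten the input and project it to a latent code."""
    return np.tanh(x.reshape(x.shape[0], -1) @ w)

def decode(z, w, shape):
    """Project the joint latent code back to image space."""
    return np.tanh(z @ w).reshape((z.shape[0],) + shape)

# Illustrative sizes: batch of 2, 64x64 grayscale frames, 128-bin audio slices
img = rng.normal(size=(2, 64, 64))
audio = rng.normal(size=(2, 128))

w_img = rng.normal(size=(64 * 64, 32)) * 0.01  # image encoder weights
w_aud = rng.normal(size=(128, 16)) * 0.01      # audio encoder weights
w_dec = rng.normal(size=(32 + 16, 64 * 64)) * 0.01

# Concatenate both latent codes so the decoder sees image + audio information
z = np.concatenate([encode(img, w_img), encode(audio, w_aud)], axis=1)
frame = decode(z, w_dec, (64, 64))
print(frame.shape)  # (2, 64, 64)
```

In the GAN setting described above, such a generator would additionally be trained against a discriminator, with an extra loss term on the audio branch; that training loop is omitted here.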
Finally, to compare and evaluate each network's results, both quantitative metrics and qualitative evaluations will be carried out. Since the final output of these systems is a clear and realistic video in which gestures from one face have been transferred to another, perceptual visual evaluation is key to solving this problem.
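One widely used quantitative metric for comparing synthesized frames against reference frames is PSNR (peak signal-to-noise ratio). A minimal NumPy version follows; note that the abstract does not specify which metrics the thesis actually uses, so this is simply a representative example.

```python
import numpy as np

def psnr(ref, test, max_val=1.0):
    """PSNR in dB between two images with pixel values in [0, max_val]."""
    mse = np.mean((ref - test) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10 * np.log10(max_val ** 2 / mse)

ref = np.zeros((8, 8))
noisy = np.full((8, 8), 0.1)   # constant error of 0.1 per pixel
print(round(psnr(ref, noisy), 1))  # 20.0
```

Higher PSNR indicates closer pixel-level agreement, but it correlates imperfectly with human judgment, which is why the perceptual (qualitative) evaluation above remains essential.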
| Record ID: | 62958 |
|---|---|
| DC Identifier: | https://oa.upm.es/62958/ |
| OAI Identifier: | oai:oa.upm.es:62958 |
| Deposited by: | Biblioteca ETSI Telecomunicación |
| Deposited on: | 10 Jul 2020 08:13 |
| Last Modified: | 15 Jun 2023 08:32 |