Abstract
Generative adversarial networks (GANs) (Goodfellow et al., 2014) have been extensively used to transform and create images or sounds within their own domains. However, transformation between different modalities remains a comparatively unexplored problem. This work proposes a method to generate sound from images, based on the Pix2Pix architecture (Isola et al., 2017), a conditional GAN designed for general-purpose image-to-image translation. A new implementation that creates music from images has been developed. The main goal is to create music that describes specific paintings and to answer the question: how does that image sound? Such an answer could be useful to blind people in several applications, such as museums. To this end, different theses positing an interaction between visual art and music have been taken into account, as well as several works studying synesthetic experiences. The process involves three steps: first, labelling and pairing images and sounds from different styles and points in time; second, extracting common features from the data by exploring multiple methods for music feature extraction; and third, introducing multimodal layers into the GAN. Finally, a method to create novel pieces of music from the generated sound features has been implemented. As presented in the state-of-the-art section, some advances in cross-modal generation have been achieved, but most of them focus on creating images from sound or from text; only a few explore image-to-sound transformations.
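As a rough illustration of the kind of image-to-sound mapping described above, the following minimal sketch shows a Pix2Pix-style encoder-decoder generator that takes a painting and predicts a spectrogram-like sound representation. All layer sizes, shapes, and the use of PyTorch are assumptions for illustration only, not the implementation reported in this work.

```python
# Minimal sketch (assumptions: PyTorch, 128x128 RGB paintings,
# 1x128x128 mel-spectrogram targets; layer sizes are illustrative).
import torch
import torch.nn as nn

class ImageToSoundGenerator(nn.Module):
    """Pix2Pix-style encoder-decoder mapping an image to a spectrogram-like output."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.BatchNorm2d(128), nn.LeakyReLU(0.2),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.ConvTranspose2d(64, 1, 4, stride=2, padding=1), nn.Tanh(),  # 1-channel "sound" map
        )

    def forward(self, image):
        return self.decoder(self.encoder(image))

# Usage: a batch of paintings -> predicted spectrograms, later inverted to audio.
paintings = torch.randn(8, 3, 128, 128)
spectrograms = ImageToSoundGenerator()(paintings)  # shape: (8, 1, 128, 128)
```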