A.I. and Sound design – Mirko Perri Sound Blog

For some years now, investment and research in computing have shifted significantly toward machine learning, artificial intelligence, augmented reality and speech recognition. We can already see the first commercial products: Tesla produces cars that drive themselves, Apple will build advanced machine learning algorithms into the next releases of its operating systems, and so will Google, Amazon and all the big IT companies. The tendency is to create tools that are not only useful for a specific purpose, but that actively “collaborate” to achieve that goal with methods we can only imagine today. We will move from a generic quantitative manipulation of numerical data to something with a strong qualitative character, customized for each individual user. How could this affect the field of sound design?

iZotope, for example, has applied these techniques to produce more advanced audio processing tools, bringing to market algorithms that not only operate on sound in a surgical manner, but are also capable of understanding which parts of the sound spectrum are useful and which are not. These are tools that, within the complex frequency content of a recording, can recognize and isolate a voice or remove unwanted elements, for example the rustle of clothing against the microphone. Put very simply, these algorithms make “choices” in complete autonomy, similar to the choices a human operator makes constantly while working. All this is possible because they have been “trained” to do so through deep learning.

The approach with which Andrew Owens of MIT explores the potential of this technology in the field of sound and image is very interesting. Using an artificial neural network, the software can autonomously predict which sound to synchronize to a silent video: it analyzes the movement to decide the exact moment at which the sound event should start, and it recognizes the type of material shown in the scene so that it can synthesize an appropriate sound. The algorithm is based on a dataset called “Greatest Hits”, composed of almost 1,000 videos and 46,000 sound effects. It behaves like a foley artist: it may produce sounds that are not necessarily generated by the object present in the scene, using others whose acoustic characteristics are suitable to mimic a plausible sound; just think of the classic coconuts used for the sound of horses’ hooves.
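The foley-like idea of picking a stand-in sound with similar acoustic characteristics can be pictured as a nearest-neighbor lookup. What follows is only a minimal sketch under assumptions, not Owens’s method: the sound names, the three feature values and the function `closest_sound` are all invented for illustration.

```python
import math

# Hypothetical sound library: name -> made-up acoustic features
# (brightness, decay time in seconds, loudness). None of these
# numbers come from the "Greatest Hits" dataset.
SOUND_LIBRARY = {
    "coconut_shells": (0.55, 0.12, 0.70),
    "wooden_stick": (0.60, 0.20, 0.65),
    "metal_pipe": (0.90, 0.80, 0.85),
    "pillow_hit": (0.15, 0.10, 0.30),
}


def closest_sound(target, library=SOUND_LIBRARY):
    """Return the name of the library sound nearest to `target`.

    Similarity is plain Euclidean distance between feature tuples.
    """
    return min(library, key=lambda name: math.dist(library[name], target))


# Features we imagine a model predicted for a hoof hitting the ground.
predicted_hoof = (0.50, 0.10, 0.75)
print(closest_sound(predicted_hoof))
```

With these invented numbers the lookup returns the coconut shells, not because they are a hoof, but because their features are the closest match.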

The video below shows the result of this experiment.

Here you can download the paper, which sets out the technical and methodological details.

Another study, carried out in Edinburgh by Eduardo Reck Miranda in 1995, aims to create an algorithm capable of revolutionizing sound synthesis by means of machine learning. Miranda proposes a system called ARTIST (Artificial intelligent sound tool), in which the user can literally hold a dialogue with the machine. Using a vocabulary of terms that describe the characteristics of a sound, the algorithm generates timbres with the required qualities. The interesting part is that the algorithm is designed to learn from the user’s requests, so that over time it develops behaviors of its own and suggests solutions to the user based on experience. According to Miranda, for a system to be considered intelligent it must satisfy the following points.

It must respond intuitively to user requests.
It must be able to be configured by the user.
It must propose alternative synthesis solutions.
It must be a system with a certain degree of uncertainty.
It must learn.
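The vocabulary-driven synthesis described above could be pictured roughly as follows. This is a toy sketch and not Miranda’s ARTIST: the terms, the two synthesis parameters and the functions `describe_to_patch` and `learn_term` are hypothetical, chosen only to show descriptive words being turned into parameters and a new word being taught by the user.

```python
# Hypothetical starting point for a very simple synthesizer patch.
DEFAULT_PATCH = {"cutoff_hz": 3000.0, "attack_s": 0.05}

# Hypothetical vocabulary: each descriptive term nudges some parameters.
VOCABULARY = {
    "bright": {"cutoff_hz": 2000.0},
    "dark": {"cutoff_hz": -1500.0},
    "percussive": {"attack_s": -0.04},
    "soft": {"attack_s": 0.10, "cutoff_hz": -500.0},
}


def describe_to_patch(terms, vocabulary):
    """Turn a list of descriptive terms into synthesis parameters."""
    patch = dict(DEFAULT_PATCH)
    for term in terms:
        for param, delta in vocabulary[term].items():
            patch[param] += delta
    # Keep the result inside a playable range.
    patch["cutoff_hz"] = max(20.0, patch["cutoff_hz"])
    patch["attack_s"] = max(0.001, patch["attack_s"])
    return patch


def learn_term(term, preferred_patch, vocabulary):
    """Store a user-supplied term as offsets from the default patch."""
    vocabulary[term] = {
        param: preferred_patch[param] - DEFAULT_PATCH[param]
        for param in DEFAULT_PATCH
    }


vocabulary = {term: dict(offsets) for term, offsets in VOCABULARY.items()}
print(describe_to_patch(["bright", "percussive"], vocabulary))

# The user teaches the system a word it did not know.
learn_term("glassy", {"cutoff_hz": 8000.0, "attack_s": 0.05}, vocabulary)
print(describe_to_patch(["glassy"], vocabulary))
```

A real system would need far richer descriptors and a genuine learning component; the sketch only touches the “configured by the user” and “must learn” points in the simplest possible way.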

For those wishing to learn more, here is a document written by Miranda that explains his idea in detail.

Miranda builds an interesting theoretical framework, and who knows whether the Google Brain team behind the Magenta project took it into account. They put a synthesis system based on artificial intelligence into practice with NSynth, an algorithm capable of analyzing and understanding the intrinsic qualities of a musical sound produced by any instrument. On the basis of this knowledge, the algorithm makes it possible to interpolate two or more sounds, generating completely new timbres that nevertheless clearly retain the characteristics of the original instruments. Here you can download the software in Max for Live format. Below is a video demonstrating how NSynth works.
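To give an idea of what interpolating two sounds can mean numerically, here is a minimal sketch that blends two feature vectors. It is not NSynth’s code: the vectors `flute` and `organ` are invented three-number stand-ins for whatever internal description a model learns, and `interpolate` is a hypothetical helper.

```python
def interpolate(embedding_a, embedding_b, mix):
    """Blend two equal-length vectors.

    mix = 0.0 returns the first sound's vector, mix = 1.0 the second's,
    and values in between give a weighted average of the two.
    """
    if len(embedding_a) != len(embedding_b):
        raise ValueError("embeddings must have the same length")
    if not 0.0 <= mix <= 1.0:
        raise ValueError("mix must be between 0.0 and 1.0")
    return [(1.0 - mix) * a + mix * b for a, b in zip(embedding_a, embedding_b)]


# Made-up descriptions of two instrument sounds.
flute = [0.0, 1.0, 4.0]
organ = [2.0, 3.0, 0.0]

halfway = interpolate(flute, organ, 0.5)
print(halfway)  # [1.0, 2.0, 2.0]
```

In a complete system the blended vector would then be handed to a decoder that turns it back into audio; that step is left out here.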

In the academic field there is a lot of excitement around these subjects. Numerous research projects are trying to teach software to listen to sound, isolate it and even recognize its semantic value, and they are fundamental to giving a sense of hearing to the super artificial intelligence with which we will talk and work in the future.

Simplifying, machine learning can be divided into the following broad categories:

Supervised learning: The computer is presented with example inputs and their desired outputs, given by a “teacher”, and the goal is to learn a general rule that maps inputs to outputs. As special cases, the input signal can be only partially available, or restricted to special feedback:
- Semi-supervised learning: the computer is given only an incomplete training signal: a training set with some (often many) of the target outputs missing.
- Active learning: the computer can only obtain training labels for a limited set of instances (based on a budget), and also has to optimize its choice of objects to acquire labels for. When used interactively, these can be presented to the user for labeling.
- Reinforcement learning: training data (in the form of rewards and punishments) is given only as feedback to the program’s actions in a dynamic environment, such as driving a vehicle or playing a game against an opponent.
Unsupervised learning: No labels are given to the learning algorithm, leaving it on its own to find structure in its input. Unsupervised learning can be a goal in itself (discovering hidden patterns in data) or a means towards an end (feature learning).
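The difference between the two main categories above can be shown on a tiny example. This is a sketch under assumptions: the sounds, their two features (spectral centroid in kHz, duration in seconds) and every number are invented, and the two algorithms used, a nearest-centroid classifier and a two-group k-means, are simple stand-ins chosen for brevity.

```python
import math


def train_nearest_centroid(examples):
    """Supervised: learn one average feature vector per label.

    `examples` is a list of (features, label) pairs from a "teacher".
    """
    sums, counts = {}, {}
    for features, label in examples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict(centroids, features):
    """Label a new sound with the class whose centroid is nearest."""
    return min(centroids, key=lambda label: math.dist(centroids[label], features))


def two_means(points, iterations=10):
    """Unsupervised: split unlabeled points into two groups (k-means, k=2)."""
    centers = [list(points[0]), list(points[-1])]
    groups = [[], []]
    for _ in range(iterations):
        groups = [[], []]
        for point in points:
            nearest = min((0, 1), key=lambda k: math.dist(centers[k], point))
            groups[nearest].append(point)
        for k, members in enumerate(groups):
            if members:  # keep the old center if a group ends up empty
                centers[k] = [sum(axis) / len(members) for axis in zip(*members)]
    return groups


labeled = [
    ((0.2, 1.5), "thunder"),
    ((0.3, 1.8), "thunder"),
    ((4.0, 0.2), "glass"),
    ((4.5, 0.3), "glass"),
]

# Supervised: the labels are used during training.
centroids = train_nearest_centroid(labeled)
print(predict(centroids, (3.8, 0.25)))

# Unsupervised: the same sounds, with the labels thrown away.
unlabeled = [features for features, _ in labeled]
print(two_means(unlabeled))
```

In the supervised case the program can name a new sound because it was told the names in advance; in the unsupervised case it can only report that the sounds fall into two groups, without knowing what to call them.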

In the cases described above the learning is supervised: the input data must be carefully labeled before the algorithm can process it. In the audio field, too, many people are busy building and growing datasets of sounds organized specifically for machine learning. One of the biggest is Google AudioSet which, with the sea of YouTube at its disposal, can easily collect an impressive number of hours of audio. It is not the only one doing this job, though; here are some audio datasets found on the net.

Urban sound dataset

Environmental sounds

Open source dataset in collaboration with Freesound.org

VidTimit audio video dataset

A list of specific datasets for music research

It would be interesting if all the audio libraries on the market took into account the storage characteristics this type of application requires; it would give the newborn artificial intelligence material to study in the future.
Of course, we are still very far from systems that can replace humans in sound design, partly because it strongly involves an emotional component, something that a machine will hardly be able to understand.

But if we want to do some plausible futurology, we just need to apply the “follow the money” philosophy. The trend of economic investment in these technologies offers some glimpses of the future, especially in the relationship between user and software. The professional audio-video market is very dynamic: it is growing 12% a year and is expected to reach over $180 billion in revenue in 2020. In such a dynamic environment it’s no wonder that machine learning technology will find fertile ground.

Maybe we will work in virtual environments, handling holographic interfaces, and hold a verbal dialogue with a workstation that suggests the creative solution best suited to our needs after comparing all similar cases, in all existing media, from the day cinema was born until that future day. We will have software that offers us the elements for building a soundscape based on the geographic coordinates, temperature and humidity deduced from the scene. It will be a machine calibrated to our tastes and our sensibility. We will no longer adapt to the workflow of this or that piece of software; the software will adapt to the way we work, and this is why we will teach it.

Here are some links to academic research in progress.
Cornell University Library
Arthur Juliani
SoundNet
Stanford University (PDF)
Universitat Pompeu Fabra – Barcelona
Deep learning methods for environmental sound
Loria