PRE2016 3 Groep2: Difference between revisions

From Control Systems Technology Group
Jump to navigation Jump to search
Line 170: Line 170:


One of the main problems with regular recurrent neural network structures is the ability to use information that is essential for the context of the present task but occured a long time ago. In other words, properly retrieving inputs from a distance that might be relevant for the present context is a difficulty.  
One of the main problems with regular recurrent neural network structures is the ability to use information that is essential for the context of the present task but occured a long time ago. In other words, properly retrieving inputs from a distance that might be relevant for the present context is a difficulty.  
[[File:RNN-longtermdependencies.png|600px|center]]
[[File:RNN-longtermdependencies.png|600px|center]]<ref name="understandinglstms" />
LSTMs were introduced by Hochreiter & Schmidhuber in 1997 and were explicitly created to deal with the long-term dependency issue. The main difference between a regular chain of recurrent neural network modules and an LSTM chain, is a cell state that holds and passes along information across the chain of LSTM network modules. Using this cell state, LSTMs can carefully decide what long term information to keep unchanged and what information to modify according to new inputs. These decisions are made using four neural network layers in each network module, unlike the one tanh layer in a typical RNN. The following diagrams depict the differences between a traditional RNN and an LSTM. Notice the persistent cell state towards the top and the 4 neural network layers in the LSTM.
LSTMs were introduced by Hochreiter & Schmidhuber in 1997 and were explicitly created to deal with the long-term dependency issue. The main difference between a regular chain of recurrent neural network modules and an LSTM chain, is a cell state that holds and passes along information across the chain of LSTM network modules. Using this cell state, LSTMs can carefully decide what long term information to keep unchanged and what information to modify according to new inputs. These decisions are made using four neural network layers in each network module, unlike the one tanh layer in a typical RNN. The following diagrams depict the differences between a traditional RNN and an LSTM. Notice the persistent cell state towards the top and the 4 neural network layers in the LSTM.



Revision as of 22:31, 12 April 2017

This is group 2's page.

Group 2 Composition

  • Jaimini Boenders
  • Rick Coonen
  • Herman Galioulline
  • Steven Ge
  • Bas van Geffen
  • Noud de Kroon
  • Rolf Morel

Abstract

Taking music user’s needs into account, we identified the desire for music users to have more control over their music experience. We explore the capabilities of machine learning to generate music in order to meet these needs. Out of the different machine learning architectures available, we focus on the GAN architecture to realize this goal.

Introduction

[version 1]

Driven by technological innovation, society and businesses are continuously evolving and adapting to the presence of new technological capabilities. Major instances of this happening in the past have included introducing new tools such as: the light bulb, vaccines, the automobile, and more recently: the internet, and mobile phones. Each providing a new service or capability to society that was previously not possible. And each instance, fundamentally changing the functionality of society and the capability of business. With the light bulb, working hours changed in all industries - clearly affecting business capabilities, but also the economic structure of society. With the internet, communication has never been so easy and so fast - providing a new platform for businesses to engage customers on, but also affecting how society communicates with itself.

And now to today, with an emerging automation technology called machine learning still in it’s infancy, we (arguably) are witnessing the development of a technology that has the capability to impact both society and enterprise in a similarly major way. With the novelty of this type of automation (as opposed to previous attempts) being the potential to mimic human creativity - a limit on current automation technology. And, with creativity being the one of the last prided capabilities exclusive to humans, the significance of such a change could be revolutionary - in terms of society's identity and capability of businesses to automate tasks.

With this possibility in mind, in this wiki we will focus on machine learning exclusively within the music industry (as music generation is an activity that requires creativity). Specifically, we will consider how taking user preferences in generating music via machine learning could create a unique and customized product for consumers of music. We will not only discuss how such an idea could impact users of music, the music industry and the music community, but also our attempt at implementing this idea.

[version 2]

Driven by technological innovation, society is continuously evolving and aiming to increase the quality of life for everyone. A key element that contributes to happiness for many people, is enjoying art. Art created by artists has been a prominent part of human life throughout all of history. This raises the question if art is something that could only be created by humans that use creativity. This question forms the basis of our project. We will not go into depth on what makes something art, but will rather focus on whether it is possible for a computer to create “art” that can be enjoyed by people. More specifically, we will focus answering the question: How can AI serve the user by generating new music tailored to the user’s needs?


Concept

The project consists of multiple aspects, which we will briefly introduce to give an idea of our work. In particular, we split up our project in “data” and “network”; these two parts form the technical aspect of our project.

The data team is responsible for finding appropriate datasets, which do not only contain music but also metadata, i.e. labels. There is a clear trade-off between music that is suitable for the network (e.g. homogenous data, labels that clearly affect the structure of the data, such as tempo) and music that is suitable for the application (e.g. many genres, labels that are not very formal but intuitive, e.g. mood). The data team does also look at existing solutions (mostly for scores/sheet music), and how we can use them to improve our product.

The network team is responsible for setting up a neural network program that is suitable for the task at hand. They work closely with the data team to ensure that the network architecture is appropriate for the given datasets. Furthermore, the network team focuses on studying Deep Learning, Artificial Neural Networks and state-of-the-art research/architectures/technologies in Machine Learning.

Another essential part of the project is the website. The website is built on top of a flexible and extensive PHP framework, to allow an complex application to be built. Furthermore, the UI is an important aspect and should look modern and be user-friendly. The website is an initial concept of the application which we explore. As such, it is intended to show how generating music can be used in a real application.

Finally, in our project we perform a survey and subsequent tests to verify whether we have met the customer wishes. The survey is mostly intended to explore what kind of functionality users consider as valuable when generating music. A concrete example is certain “features” of music such as genre, tempo, artist or key. In particular, we explore whether users prefer more theoretical/formal concepts (such as key, tempo) or intuitive ideas which seem to describe our preferences of music (style, genre, artist). The results will be incorporated in the selection of datasets, as well as the architecture of the network.



Applications for User, Society and Enterprise

We started this project with concerning ourselves about how an AI that can generate music on demand may have impact on Users, Society and Enterprise. As a first step, we looked at a variety of possible applications, which we will discuss briefly describe in the following subsections. After having gone over these possible applications we decided that we should go for the one where we believed we might have the most impact: the user-centered application.

User

Music, like any art form, means different things to different people. Because of this it is evident that listeners of music do so with different intents in mind. To illustrate this fact, we have compiled a list of a few different possible uses (or applications) of a machine learning algorithm able to generate music tailored to a user’s preference.

The first potential use for a music generator could be to mimic the style of a certain artist, and, in particular, mimic artists that have passed away. This would allow listeners to keep listening to new songs in the style of their favourite artist long after the artist’s death. Already there are a large number of well known artists with large repertoires (Bach, David Bowie, The Beatles, to just name a few) where this envisioned technology could be applied to. This could in fact turn out to become an emerging market to help classical music lovers relive the past. Another use could be in the movie industry. In the movie industry vast amount of money are spent on generating soundtracks for different scenes. However, over time certain sounds have become key identifiers for evoking certain types of emotions in the audience (clashing chords to create distress and discomfort, for example). So it happens that when a new score is developed for a movie, the music is different, but several elements in the music must remain the same in order to create the desired effect. Thus, this could potentially be done much more efficiently and cheaply by training a machine learning algorithm for each of the desired effects. For example, having an algorithm to just generate "scary" music trained on the music played during scary scenes in movies. The last example, but definitely not the last possible application, could be to use music generation as a tool for music creators. By generating music features like beats and chord progressions, a creator could get inspiration for new music as well as use these features directly in their music. This would be different than any previous type of generator because the user would have some control over what the product will be, but not entirely. That way the result would be random, yet relevant at the same time.


Even though in each of these applications the intent is different, they all use the same tool in the same way - to generate music based on their preferences. Now to focus on the users: a product that does not satisfy the user's needs is inherently deficient. So the most important question here is finding out if there really are users of music that would be interested in the above applications or if there is something more or different they want with music generation. Or perhaps they want nothing at all; since often music is more to people than just the aural pleasure of hearing music performed. That is, it is also about the artist behind the music and the meaning of the context in which it was created?

So to gauge interest in an AI product generating music and how such a product would be received we used surveys to find out about features that are actually desired by users and their impression of music generated by AI. Guided by the insights that the survey gives us we will steer the project so as to accommodate user needs.


Enterprise

Beyond music being a creative pursuit, it is also responsible for the existence of a multi-billion dollar industry. In addition to the actual artists we have producers, record labels and content delivery services, etc. On the productive side there is the possibility for producers to more efficiently obtain the kind of music they are after, be it a generated film score or a sample to be incorporated in a song, et cetera, by musical artificial intelligence. This could mean more creative control for content owners otherwise not available, but it also means less dependence on artists that need to be paid. This, of course, creates opportunities to increase profits.

Another aspect to consider is ownership of the produced music. Successful AI technologies as we know them today use training data to learn patterns. Often there will be no way around the fact that if you would like the AI to produce songs similar to existing songs, e.g. from a certain artist, then these songs (in the case of a particular artist: their discography) will have to be part of the training data. But in general these songs will be copyright protected. As the AI will not reproduce songs but only generate similar songs based on the patterns embedded in the original songs, a strong case could be made for that a generated song does not (directly) infringe on the copyright of the original songs. The AI distilled the underlying patterns of the music and ownership of these interpretive patterns is currently unclear, it is even the case that the patterns found are different based on the AI technology used.

We will name a few applications for AI music in industry in the section below, but a more thorough exploration is beyond the scope of our user centric project. The legal aspects of creating similar music through pattern recognition will become an important issue in the future when creative technology does become commercially viable. This issue will have to be considered by legal minds, but public opinion on this matter will also be a significant factor.

Society

A possible consequence for a very effective musical AI is that it will be competent and efficient to a degree where it will compete with humans. Up till this point in time humans have considered creative tasks, such as composing music, safe from automation. Recent advances in AI technology have shown that there are algorithms that produce artefacts that are very similar in quality to human produced exemplars. Even when these algorithms learn no more than patterns from a data set, there are jobs out there that ask for this (one example being the use of temporary music in the film industry).

Given these advances and the rate of research progress it could very well mean that that humans now performing these creative musical tasks will be replaced by AI technology. While out of the scope for this short duration project, a comprehensive effort is clearly needed so as to look into the impact of the intrusion of musical AI into the field of musicians and discuss how this influences employment in the sector.

Survey

To investigate people’s thoughts on the subject we conducted a survey. The actual survey is available here: File:SurveyArtificialArt.pdf. The survey consists of three parts:

  • Get some general information about the subject.
  • Investigate the subject’s thoughts on generating music.
  • See how interested the subject is in various suggested features for music generation.

For the last two parts, we asked the subjects to express their interest or level of agreement on a scale from 1 to 5. This enables us to process the survey results numerically, instead of having to draw conclusions from exact opinions.

As a demographic for our survey we initially decided to go with a group of random strangers. To survey these people we went on to the streets of Eindhoven and posed our questions to people there. Additionally, we created an online survey with the same questions, and had our friends fill out this survey. We realized that since this technology is relatively far from being fully realised, it would make sense anyway to choose a younger group as our demographic. Therefore, asking our friends did actually make sense in this regard.

An interesting observation from the survey results is that around half of the people seem to be open to paying for music generated by an AI:
Would people pay for generated music?
We have compiled the survey results in a bar chart that shows how the different features compare to each other when it comes to how much the subjects desired each of them.
What features do users desire?
From these results we can see that in general user are definitely the most interested in generating music based on their selected genre, tempo and their mood. From these three, we can see that especially generating music based on mood seems to be very desirable, since many people strongly desire this feature. As for genre and tempo, we can see that they have a higher number of people interested in them in total, but a bigger part of these people is only mildly interested. This is especially true for generating music based on tempo. To give an idea of the number of people interested in each feature, compared to people not interested, take a look at the following chart, that shows the distribution of people's desires for generating music based on genre.
Do people want to generate music based on genre?

Requirements

From the results of the survey together with our own ideas for the project, we have constructed a basic list of requirements for our product, which will be a website.

  • The user shall be able to select a genre if desired.
  • The user shall be able to select a tempo if desired.
  • The user shall be able to choose a name for the generated song.
  • The user shall be able to generate a song for any combination of genre and tempo.
  • The user shall be able to generate a song without selecting a genre or tempo.
  • The user shall be able to listen to the generated song.
  • The user shall be able to download the generated song.
  • The application shall give a brief explanation about how it works.
  • The application should run on Google Chrome.
  • The application should generate an audio file with synthesized sounds.
  • The application should not generate a score.

Research (Alternatives)

Algorithmic composition

Besides using neural networks, there are other methods to compose/generate music. Two popular methods for music composition, we looked at during our research are Mathematical models and Knowledge-based systems.

Mathematical models creates music by encoding music theory with mathematical equations and using random events. These random events can be created with stochastic processes. This creates unique, as it uses randomness, but harmonically sound, as it uses music theory, music.

Knowledge-based systems make use of a large database with pieces of music. The data is labeled, so that it knows what the characteristics of certain pieces are. It uses a set of rules to create “new” music, to match the characteristics given by a user. An example is a computer program called “Emily Howell”. It uses a database to create music and uses feedback from users, to create new music based on their taste.

None of these techniques generates sound on its own. It either creates a score or patches existing sound files together to create “new” music. We are also more curious about methods that have not been used yet for the generation of music. Therefore we shall not use these techniques.

Wavenet

To implement our Machine Learning idea to be able to mimic a certain artist we initially explored a Deep Learning algorithm called WaveNet. WaveNet is a deep neural network architecture created by Google Deepmind that takes raw audio as input. Already, it has been used successfully to train on certain dialects to create a state-of-the-art text-to-speech converter. Specifically, WaveNet has been shown to be capable of capturing a large variety of key speaker characteristics with equal fidelity. So instead of feeding it voice data, we intended to feed it musical input in order to train the network to generate our desired music. This has already been shown to work on piano music by the WaveNet team.

WaveNet was an attractive starting point for learning to work with neural nets because it is available as an open source project written in python using Tensorflow and is available on GitHub. This open source project was not created by the original DeepMind team, but it allowed us to easily train our first deep learning network using Tensorflow to get to know the technology and to ensure it is properly functioning on our hardware.

However, the results of WaveNet were not satisfactory, it was clearly not able to deal with the high level of heterogeneity of arbitrary music input. Furthermore, since this is already a proven technology made for a very specific goal, we thought it was less interesting to explore.

We trained WaveNet on the discography of David Bowie. An example of a sound file generated by the trained network in found here: https://soundcloud.com/noud-de-kroon/wavenet-sample

Deep Learning

In this section, we provide an overview of the prerequisite knowledges of our work with respect to Deep Learning (and Machine Learning in general). Apart from the cited papers, it is noteworthy that we use the book Deep Learning by Goodfellow et al. [1] both implicitly (e.g. for general models such as Artificial Neural Networks) and explicitly (e.g. for specific results) as a source throughout this section. Note that in the remaining sections we often use feature vector instead of metadata vector, this is in order to be consistent with terminology often used in literature.

Neural Network

Model of an Artificial Neural Network

Artificial Neural Networks (or simply Neural Networks) are large networks of simple building blocks (neurons or neural units) connected with many other neural units. A single neural unit is some summation function over its inputs, and can inhibit or excite connected neurons. Moreover, the signal produced by neurons are defined using an activation function, which generally sets a threshold value for producing a signal. Neural networks are inspired by the biological brain, although they are generally not designed to be realistic models of biological function [1]. Instead, ANNs are so useful because they do not need to be programmed explicitly, instead they are self-learning. The challenges of designing an ANN include defining appropriate representations of the data, activation function(s), cost function, and a learning algorithm. Nevertheless, it is very useful for problems in which defining or formalizing the solution or relationships in the data is very difficult. Moreover, ANNs have the potential to solve problems we do not understand ourselves. For example, music is very complicated to formalize, especially without restricting to scores (i.e. sheet music), specific genres or limited varieties of music.

Model of an Deep Neural Network

Using multiple neural networks as separate layers, one can construct a network with multiple hidden layers between the input and output layer: a Deep Neural Network. It is noteworthy that although Deep Neural Network are commonly used as an architecture in Deep Learning, it is not the only architecture used in the field of Deep Learning.

Generative Adversarial Networks (GANs)

A generative adversarial network (or GANs) is a generative modeling approach, introduced by Ian Goodfellow in 2014. The intuitive idea behind GANs is a game (theoretic) scenario in which the generator neural network competes against an discriminator network (the adversary). The generator is trained to produce samples from a particular data distribution of interest. Its adversary, the discriminator, attempts to accurately distinguish between “real” samples (i.e. from the training set) and samples generated/”fake” samples (i.e. produced by the generator). More concretely, for a given sample the discriminator produces a value, which indicates the probability that the given sample is a real training example - as opposed to a generated sample.

A simple analogy - to explain the general idea - is art forgery. The generator can be considered the forger, who tries to sell fake paintings to art collectors. The discriminator is the art specialist, which tries to establish the authenticity of given paintings, e.g. by comparing a new painting with known (authentic) paintings. Over time, both forgers and experts will learn new methods and become better at their respective tasks. The cat-and-mouse game that arises is the competition that drives a GAN.

More formally, the learning in GANs is a zero-sum game, in which the payoff of the generator is taken as the additive inverse (i.e. sign change) of the payoff of the discriminator. The goal of training is then to maximize the payoff of the generator, however, both players attempt to maximize its own payoff. In theory, GANs will not converge in general, which causes GANs to underfit. Nevertheless, recent research on GANs has led to many successes and progress on multiple problems. In particular, applying a divide-and-conquer technique to break up the generative process into multiple layers can be used to simplify learning. Mirza et al. [2] introduced a conditional version of a generative adversarial network, which conditioned both the generator and discriminator on class label(s). This idea was used by Denton et al.[3] to show that a cascade of conditional GANs can be used to generate higher resolution samples, by incrementally adding details to the sample. A recent example is the StackGAN architecture.

Generative adversarial networks have proven to be effective in practice, however, achieving stability is non-trivial. As such, we decided to build on top of an existing state-of-the-art architecture. Moreover, we attempt to follow good practice and successful strategies. For example, we normalize our input between -1 and 1, and use dense activation (e.g. Leaky ReLU).

StackGAN

StackGAN[4] uses two “stacked” generative adversarial networks to generate high-resolution (256x256) photorealistic images. This architecture utilizes two stages, corresponding to two separate GANs. The first network, stage I, generates relatively low-resolution images (64x64) conditioned on some textual description. This stage results in an image that clearly resembles the description, however, lacks vivid details and contains shape distortions. The second stage is conditioned on both the textual description and the low-resolution image (either generated by stage I; or for real images: downsampled). This results in convincing and detailed images of a higher quality than previously possible using a single level GAN.

blagh

The actual networks are not conditioned on the text description itself, but instead of an encoded text embedding. Furthermore, in the StackGAN architecture more conditioning variables are added by sampling from a Gaussian distribution, with mean and standard deviation defined as a function of the text embeddings. This has multiple (mostly technical) advantages; including the ability to sample/generate multiple different images with the same textual description. This is also used to evaluate how varied the outputs are; two vastly different images can be generated using the same description.

LSTM

(used as reference: http://colah.github.io/posts/2015-08-Understanding-LSTMs/) An LSTM network is a recurrent neural network. Before the discovery of recurrent neural networks, it was not possible to use a network's previous experience to influence the task it is performing currently. For example, it was unclear how to use traditional neural networks to classify an event happening during a certain point in a movie, based on what had happened before in the movie. Recurrent neural networks address this short-coming by passing in a previous output as an input parameter along with the actual input. Pictographically, this can be represented as a loop in a given piece of the network. The loop can also be represented as multiple copies of the same neural network, taking an old output as input for a new copy of the neural network.

RNN-unrolled.png

[5]

It is these types of networks that have been able to produce very interesting results with speech recognition, language modeling, translation and music composition. More specifically, the LSTM---a specific type of recurrent neural network---has played an essential role in producing these results.

One of the main problems with regular recurrent neural network structures is the ability to use information that is essential for the context of the present task but occured a long time ago. In other words, properly retrieving inputs from a distance that might be relevant for the present context is a difficulty.

RNN-longtermdependencies.png

[5]

LSTMs were introduced by Hochreiter & Schmidhuber in 1997 and were explicitly created to deal with the long-term dependency issue. The main difference between a regular chain of recurrent neural network modules and an LSTM chain, is a cell state that holds and passes along information across the chain of LSTM network modules. Using this cell state, LSTMs can carefully decide what long term information to keep unchanged and what information to modify according to new inputs. These decisions are made using four neural network layers in each network module, unlike the one tanh layer in a typical RNN. The following diagrams depict the differences between a traditional RNN and an LSTM. Notice the persistent cell state towards the top and the 4 neural network layers in the LSTM.

The first layer in the LSTM is responsible for deciding how much of each piece of information to keep in the cell state according to the output of the previous network module and the input to the current network module. The subsequent two layers decide what new information to add to the cell state. The last layer decides what the ouput of the current module is, using a filtered version of the current cell state that was modified by the previous three layers.

LSTMs have produced impressive results in a diverse set of fields. We were of course interested in its performance in music production. In 2002, Schmidhuber (one of the discoverers of LSTMs) along with some other researchers looked into finding temporal structure in music and found promising results. A very interesting part of the abstract: "Because recurrent neural networks (RNNs) can, in principle, learn the temporal structure of a signal, they are good candidates for such a task. Unfortunately, music composed by standard RNNs often lacks global coherence. The reason for this failure seems to be that RNNs cannot keep track of temporally distant events that indicate global music structure. Long short-term memory (LSTM) has succeeded in similar domains where other RNNs have failed, such as timing and counting and the learning of context sensitive languages. We show that LSTM is also a good mechanism for learning to compose music." (http://ieeexplore.ieee.org/abstract/document/1030094/) Again, this clearly shows the advantage of being able to learn long-term dependencies, something that LSTMs excel at. Another paper in 2008 by Douglas Eck (http://www-labs.iro.umontreal.ca/~pift6080/H08/documents/papers/lstm_music.pdf) presenst a music structure learner based on LSTMs. In their conclusion they note, "There is ample evidence that LSTM is good model for discovering and learning long-timescale dependencies in a time series. By providing LSTM with information about metrical structure in the form of time-delayed inputs, we have built a music structure learner able to use global music structure to learn a musical style.".

More recently, a Google brain project called Magenta (https://magenta.tensorflow.org/) has used LSTMs to create a bot that plays music with the user. We used their open-source framework to generate our own music samples. (Internal reference to Magenta section).

I think we should only cover the descriptions and general information about Deep Learning / Machine Learning architectures in this section, and have (a) different section(s) for what we actually did including motivations, dataset used (and why), results, etc. (Bas)

Overview Components

Now that we have discussed what user needs we need to satisfy and have explored some technologies that might be suited for this goal, we can start looking into the components necessary to build the music generating application.

There are three major components necessary for constructing our product: the music generating Neural Network, the datasets from which to learn, and the user facing interface, for which we use a website.

In overview the product works like this: A user of the product inputs their preference on the website where upon the backend of the website forwards the request for a piece of music (based on the submitted preferences) to the Neural Network component. The Neural Network is tasked with generating music, but to do so it must learn what music sounds like, in particular how certain features (e.g. tempo) of music express themselves. The network learns from existing music, labeled with information such as the tempo, artist, instruments etc. Given a request for a piece of music the network will have to be trained on a dataset containing sufficiently many examples of music expressing at least some of the requested features. In the best case scenario the network can be pre-trained on a single dataset, such that every possible request can be serviced. A more realistic situation is that multiple instances of the network will have to be trained, each on a separate dataset, such that one of the instances corresponds well to certain types of feature requests. Assuming that the Neural Network has been appropriately trained, the generating of the new piece of music is a straightforward process of feeding in the encoded features to the inputs of the network, propagating the inputs throughout the network, and finally converting the network outputs to a proper audio file. This output audio file becomes available in the user interface as the result for the user.

We will now discuss our choices regarding the datasets, the Neural Network setups that we deemed fitting, in GAN form, but in LSTM form as well, and finally we take a look at our user facing component.


Datasets

When looking for datasets to train the Neural Network on, we encountered 3 main sources: The One Million Song dataset, the Free Music Archive (FMA) data set and the MusicNet set. Each of these will be explained in detail in the sections below.

One Million Song dataset

The Million Song Dataset was the first one we found when looking for large free music datasets online.The Million Song Dataset is a collection of audio features and metadata for a million contemporary songs (donated by Echo Nest) and hosted by Columbia University (NYC) to encourage the research on algorithms that work on large datasets. It was by far the biggest we found (around 300gb of musical metadata for over a million songs), but unfortunately also one of the oldest and and most difficult of all: it actually contained no audio files in the dataset. According to the Million Song website, this wasn’t suppose to be a problem since they had scripts to easily scrape songs off of the 7digital site (a website similar to iTunes) using the 7digital API. But, the Million Songs’s pythons scripts were over 7 years old (with no maintenance done) and required old python libraries riddled with inconsistencies. After several failed attempts to get it working, we were forced to start from scratch in building our own script to extract songs. This underestimation of the difficulty of actually getting the audio files proved to be the biggest obstacle with this dataset. So in summary, this combination of old documentation, the sheer scale of the amount of data and inconvenience of not having the song files readily available made this one of the hardest datasets to work with.

Free Music Archive (FMA)

The next data set is the “Free Music Archive” (FMA) which we found after we realized how difficult the One Million Song set would be. The Free Music Archive (FMA) dataset was created by researchers at the École polytechnique fédérale de Lausanne (EPFL) with the intention of helping push forward machine learning on music. Even though this dataset was much smaller than the One Million Song set, it importantly contained the actual audio files (in mp3 format) within the set. Being that this set was only released recently (December 2016) only the small and medium versions of the set were available (sizes of 4,000 and 14,000 songs respectively) with the large and full size sets to be released sometime later. Conveniently, the metadata and features were stored in a JSON file, allowing for easy access using a Python library called “Panda”. This meant that all had to be done was write a script to format the data into a form that the neural network would accept (described in a later section). Furthermore, data in the metadata files included things like song ID, title, artist, genre and play counts. Additional, the audio provided was taken from center sections of the songs since the middle of songs was considered to be the most active/interesting by the researches providing the media. However, when going through some of the song clips, it was found that the quality wasn’t as high quality as expected (some recordings sounded as if they were live performances) and each genre wasn't as homogeneous as expected.

Link paper: https://arxiv.org/pdf/1612.01840.pdf Link data set: https://github.com/mdeff/fma

MusicNet

The last set we considered was the MusicNet set which was the smallest (330 songs) but the most homogenous (all classical music) and had the most comprehensive labeling of any of the previous sets. Note here that these are 330 full songs of about 6 minutes each, so comparably it is still a significant data set. This set included over 1 million labels over features such as the precise time of each note, the instrumenting playing each note, each note’s position in the structure of composition. Due to the superior labeling, this was the dataset we ended up training our neural network on.

Initial investigation showed that while the dataset had 11 instruments, by far the largest portion of music only contained 4 instruments, and for the other 7 instruments there was very little data. Therefore, we decided to only select songs with these 4 instruments. We generated samples in the following matter. We cut each song into samples of 4 seconds each. For each sample we checked if at least 4 notes are played within this sample, otherwise we reject the sample. This is to ensure that we do not have samples which are silent.

We generated a metadata vector for each sample in the following matter. We encoded data like which composer and composition was playing and the ensemble type using one-hot encoding. We then cut the sample into 16 pieces of equal length (thus each piece is 0.25 seconds). For each piece we encoded which instrument had an active note again using one-hot encoding. This to provide as much labeling to our neural net as possible, to increase the chance of success.

Representation

The scripts that extract data from the datasets create two vectors. One vector contains the information about the song (the raw song data) and the other vector contains the metadata. The vector containing the song data is created with the help of the librosa library. The library loads an audio file as a floating point time series. The length of the song vector is the duration of the song multiplied with the sample rate. The vector containing the metadata is created with the help of the pandas library. The metadata of the datasets are extracted from a JSON file in the case of the FMA dataset or from a CSV file in the case of the MusicNet dataset. The metadata vector contains categorical data and floating point data. Each category of categorical data is stored as separate entries in the vector with the value 1 in the entry, when the song belongs to the specified category or value 0 otherwise. For example, in the metadata vector for the FMA dataset, each genre is encoded with a separate entry in the metadata vector. If the song belongs to a certain genre, then that entry in the metadata vector is 1, otherwise that entry will be 0. The reason for this approach is to prevent the neural net from learning a linear relationship between the genres as such a relationship does not exist. You cannot say that rock is two times blues. The floating point data has exactly one entry in the metadata vector. The entry contains the value of that data. For example, for the tempo entry, the entry will contain the value of the beats per minute of the song. This type of data does contain a linear relationship. You can say that 200 beats per minute is twice as fast as 100 beats per minute.

MusicGAN

In this section we describe our work on MusicGAN. MusicGAN uses the implementation and architecture of StackGAN as a base. Three vastly different approaches are implemented, executed and evaluated (iteratively when problems/challenges were found). We first describe in detail how StackGAN works, why we use it, and what limitations it poses. We then describe the three different approaches and the underlying theory. Finally, we present the results of the three approaches.

StackGAN

The StackGAN[6] architecture uses two “stacked” generative adversarial networks to generate high-resolution (256x256) photorealistic images based on a text description. The stacked nature of the architecture refers to using two GAN setups conditioned on text input, named Stage I and Stage II. These two Networks correspond to phases in the learning process. The architecture needs a sufficient amount of properly labeled data (i.e. correct text description for real images of birds/flowers/etc.) to learn from.

The Stage I phase is concerned with learning how to produce low resolution (64x64 RGB pixel) images that are somewhat convincing for a given text description. Given the results of Stage I, the Stage II network takes the low resolution image produced for a text description by the Stage I network, adds the text description again, and takes these two as the input to the generating process. The output of Stage II are high resolution images (256x256 RGB pixels). The idea is that the second stage will be able to add details to the image, or even correct “mistakes” from the low resolution generated image, such that the final result is truly a convincing image given the text description. This method of stacking GAN networks was very successful applied by Zhang, Han, et al. to generate convincing images of birds. In the StackGAN paper the same method was equally successful in generating images for flowers from text descriptions.

Stage I

First the generator and discriminator networks of Stage I are trained. The generator network learns how to produce low resolution images, still only containing basic shapes and blurs of colors, that are convincing to the discriminator network, which gets incrementally better at detecting forgeries, while at the same time the generator becomes better at creating these fake low resolution images.

The output of the generator is in the form of 64x64 RGB images, represented as 64x63x3 tensors (a tensor can be thought of a multidimensional array, and can also be used as a mapping between multidimensional arrays), where the last dimension has three indices, due to the Red, Green and Blue channels of the image. The input neurons of the generator are an encoding of the text description, a topic we will not touch on, together with a noise vector, which, when adjusted for one text description, makes the image generation nondeterministic. A series of upsampling blocks (deconvolution) layers are applied, each time producing a 3x3 block of neurons for each neuron in the previous layer. After a number of layers the required output size is reached. The tanh activation function is used to make sure that the output RGB channels values are constrained to [-1,1]. The discriminator network works by taking an input image as a 64x63x3 tensor, along with the text encoding, and apply convolution layers (going from a square of 4x4 down to a 2x2 square) to go eventually to a single neuron whose activation is used to determine whether the discriminator believed the input to be fake or real. For every (de-)convolution layer ReLU activations functions are applied. A feature of the actual architecture that we will not discuss in detail is that the networks actually work by batching, which means that 64 images are generated/fed into the networks at the same time. This does not have significant consequences for the above description, as the most important change is that tensors gain an additional dimension with 64 indices for the separate inputs/outputs. For more details we refer to the StackGAN paper.

The training process occurs by feeding the discriminator with a combination of real and fake images. The weights of the network are adjusted by backpropagation according to whether the discriminator was successful in identifying realness of the input. Among the fake inputs are mismatched text descriptions and real images, but also, of course, the images generated by the generator network. The generator’s success is measured by whether it was able to fool the discriminator network. Initially both the discriminator and the generator know nothing, but iteratively they are able to improve due to improvements in their opponent.


Stage II

The Stage II generator network takes a text encoding along with a 64x64x3 tensor encoding of an image as input. By applying upsampling blocks the networks expands the input to a 256x256x3 output layer, which represents the final image. The discriminator for Stage II is largely the same as the Stage I discriminator except now it has even more downsampling layers.

The training for the discriminator is similar to Stage I, its network weights get adjusted based on whether it is able to detect fake images from real images. The generator now works a bit differently, namely that a text description will first be fed to the trained Stage I generator. The output from Stage I is then fed into the input of Stage II along with the same text description. Based on this input the generator produces its high resolution output.

From StackGAN to MusicGAN

In the literature review we conducted we were especially impressed by the successes that were being attained by Generative Adversarial Networks. In this search for an appropriate technique for generating music we were encouraged to opt for GANs by the following quote from an answer on Quora, by Ian Goodfellow, inventor of the GAN architecture: “In theory, GANs should be useful for generating music or speech. In practice, someone needs to run the experiment and see how well they can get the model to work.” Using the GAN approach to generating music is also an attractive proposition due to the to networks learning by example from a dataset, and does not require a deep and profound theoretical understanding of what makes an image look like a bird, or a piano song sound like a piano song. Also the approach of directly generating generating audio samples versus scores or the like is that we could capture the complexity of multiple instruments playing together, a common feature of music. This is in opposition to existing score generating approaches (for a single instrument) such as the Magenta project discussed in a later section.

The StackGAN architecture is one of the most impressive recent GAN approaches. The following features of StackGAN were particularly attractive for this project:

  • The resolution and level of detail that StackGAN is able to achieve. In particular important for the length of the music fragments we want to produce.
  • The publicly available source developed in one of the major Deep Learning libraries, TensorFlow, such that we can be build upon a proven product and enjoy the benefit of an active community.
  • The training sessions remain basically unchanged. It is possible to change out the data sets and only modify the code so as to deal with the change in data representation.
  • The goal of generating images. When we consider Fourier Transforms we also get a two dimensional representation, which means that for one approach we only need to do minimal work on the code.

There are also some limitations to chosen for the (Stack)GAN method:

  • The StackGAN project’s goal of producing images assumes that a two dimensional structure for the data is appropriate. While this might be obvious in the case of images, in the case of audio it might not be the best choice to have have this representation assumption embedded in your network structure.
  • While being able to work with the entirety of the codebase for the StackGAN project is definitely a plus in terms of being able to “stand on the shoulders of giants”, the complexity of the entire project is considerable.
  • The resolution for the images is not that high when compared to Fragment length? Stage 1: low frequencies only (up to 1 kHz), 4 second sample Stage 2: frequencies up to 16 kHz, 4 second sample

Before we look at the separate approaches we took in modifying the StackGAN project into a music generating MusicGAN, we need to mention that instead of the text encoding that was fed into the networks as a description of the image, we use the metadata vector obtained from the datasets as a description of the audio that is input.

Approach: Representation Learning

Many tasks can be either very easy or very difficult, depending on how the problem is represented. For example, consider the following imagine in which we show two different representations of the same dataset. The task at hand is to draw a straight line that divides the two kinds of data points (indicated by colour and shape) as well as possible. Example of different representation (by David Warde-Farley) Clearly, using polar coordinates as a representation of the same dataset makes the task at hand much easier. However, for real problems in machine learning, finding an appropriate representation (i.e. one that makes learning easier) is non-trivial and has several serious disadvantages. In particular, finding a good representation by hand is often a very lengthy process, on which researchers spend multiple months or even years. A different approach is Representation Learning (or Feature Learning), which are techniques used to learn an appropriate representation (automatically). This approach was inspired by Goodfellow, who states that learned representations are often effective in practise.

Learned representations often result in much better performance than can be obtained with hand-designed representations.
—I. Goodfellow et al., Deep Learning (2016)

In our case, we feed the network data in the .wav or WAVE (Waveform Audio) File Format. However, it is noteworthy that raw music data files in different formats should work without much work. The idea is that the network learns to transform this data into some representation with the shape of a square RGB-image (due to the architecture of StackGAN). Intuitively, we like to think of this representation as a 3-dimensional spectrogram.

By slightly extending both the generative and discriminative networks with an additional (de)convolutional layer, we transform the data from raw music to the dimensions of the images used in StackGAN. As such, the architecture of StackGAN remains almost unchanged. More specifically, in Stage 1 the music is transformed to a 64x64x3 tensor (64x64 RGB image), and in Stage 2 to a 128x128x3 tensor. For example, the stage 1 generative network outputs a 64x64x3 tensor. The add a final (deconvolutional) layer to the network which converts the image back to the dimensions of the original music. Furthermore, we change the original last layer to use a leaky ReLU activation function, instead of the tanh originally used. We do this because only the last layer should use the tanh activation function. In this added layer, we experimented with multiple kernel and stride dimensions, as long as they resulted in the dimensions of the original music. Finally, we reshape to ensure that the shape of the output is a vector of 1xn, where n is the number of samples in the music file. For the discriminative network, an initial layer is added. Note that the discriminative network is now given data in the format of the music samples (both the real samples and the adapted generator’s output). As such, we have a convolutional layer that changes the input to the dimensions of the original network (64x64x3 in stage 1). The layer uses Leaky ReLU for activation, as do other layers (except for the last) in the StackGAN’s layers.

Note that even with above description there are multiple choices we can make. In particular, how the samples are taken is important. One choice to make is related to the size of the samples. For example, if we give the Stage I network a sample with more sample points than 64x64x3, then the layers we added are forced to compress and decompress the music. An advantage is that we can use the sample samples for both the stage I and stage II networks. Furthermore, the network can “decide” which information to store and which to get rid of (i.e. unimportant on the output). A large downside is that real samples will not be compressed (and decompressed), which potentially makes it easier for discriminator to beat the generator. As such, we choose to simply sample at a sample rate such that a single sample has 64x64x3 sample points. Since we are evaluating whether this approach works, we decide to simple set a very low sample rate for the Stage I network. This is not very convenient when training Stage II, since we essentially provide the same samples to stage 2 except including more mid-high range sounds. However, since we are trying to establish whether this approach could work, this strategy works best as it allows us to evaluate the progress of the stage 1 network. In other words, if this strategy does not yield progress after training stage 1, it is senseless to train stage 2. Finally, note that selecting larger samples than 64x64x3 (thus requiring compression), essentially results in the added layers functioning as an autoencoder.

Intermezzo: Short Time Fourier Transforms

As an alternative to having the network learn the representation itself for the following two approaches we now consider picking a representation.

That any particular sound wave is composed of many interfering waves at distinct frequencies is a basic law of physics. Fourier Transformations are a way of decomposing a sound wave into the different frequencies, e.g. decomposing a chord into its constituent notes. The result of applying a Fourier Transform to an audio fragment is that you obtain a list of complex numbers, one for each frequency, such that the magnitude of the complex number of that frequency represents the amplitude of frequency within the audio fragment and the angle of the complex number represents the phase of the frequency. The Fourier inversion theorem says that given that we know all the frequency and phase information of a wave we can reconstruct the wave, meaning that we can obtain back our original sound fragment from our decomposition information. One result for Fourier Transforms is that one needs to analyse at least half the number of audio samples of frequencies to analyse for the inversion to be accurate.

An important technique for the analysis of music is to not take the Fourier Transform of the entire music fragment but to split up the music fragment into many short fragments and applying the Fourier Transform to each of these short fragments. This technique is called the Short Time Fourier Transform (STFT). What is gained from this type of analysis is that one can see the progression of the presence of certain tones throughout the entire music fragment by looking at the presence of frequencies in each short time window.


The above figure is called a spectrogram and is a plot of the amplitude for the presence of frequencies in short time window in a speech audio file. The vertical axis are the frequencies, the horizontal axis is the time, and the color dimension represents the amplitude for each frequency at the short time slice in the audio file (Note that the phase information from the STFT) is ignored.

Note that the splitting up of a audio file where the wave are only presented as intensities as a function of time into time windows with frequencies gives a representation of audio that is much closer to the two dimensional representation of images.

We used the librosa library to convert between audio wave representation and Short Time Fourier Transform representation.

Approach: Complex Images (STFT)

The second approach of encoding music as image uses the Short Time Fourier Transform discussed in the previous section. With these technique, we can generate a 2D matrix representation of the music sample. However the entries are complex numbers which is poorly supported in Tensorflow (technically complex valued neural networks are possible). Our initial goal is to represent these as regular 64 by 64 images with 3 channels so that we do not have to alter the architecture (since time was of the essence and this is the original input format of the StackGAN architecture).

To achieve this, we preprocessed the samples as follows. First we played with the parameters of short time fourier transform and the sample rate such that we generated a 64 by 96 amplitude matrix M. For each row of the matrix, we then split the complex numbers into its real and imaginary parts resulting in matrix M’. Let M(i) denote row i of matrix M. Then M’ is defined by M’(2i) = Real(M(i)) and M’(2i+1) = Imaginary(M(i)). This resulted in a 64 by 192 matrix M’. The next step was then to put the first 64 rows of M’ in channel 0 of tensor T, the next 64 rows in channel 1 of tensor T, and the next 64 rows in channel 2, such that T is a 64x64x3 tensor. This tensor was finally normalized such that values ranged from -1 to 1 (since the deep learning network can only output values between -1 and 1, because the last layer has a tanh activation function). The result was fed to StackGAN as the input for the first stage.


Approach: 2-D Spectrograms (STFT)

For this final approach we turn again to the Short Time Fourier Transform, but use the most straightforward mapping to the main idea behind the StackGAN network, namely that pixels get upsampled and downsampled.

For this approach we take the view of the STFT as being a two dimensional representation, with a time (window) dimension and a frequency dimension, where both of these dimension are discrete. At each discrete index in this representation we have a complex number representing the amplitude and phase which corresponds to the strength of the presence of the frequency at that time slice. As the complex number has the natural “2D” representation of amplitude and phase, one can view this as two channels of a pixel (where the spectrogram only used the amplitude as one channel for a pixel).

As such we modified the network by no longer taking 64x64x3 tensors (for RGB images with 3 channels per pixel) but instead to take 64x64x2 tensors. TBC

Results

Approach: Representation Learning

For this approach we altered the architecture of the network such that it has direct music as input and directly outputs music. We did this by adding one additional layer at the start of the discriminator network, and one layer at the end of the generator layer. The responsibility of this layer was to encode the music to and from a 64 by 64 by 3 image. Note that the input of the first stage of StackGAN had quite low frequencies. This is an example of an original music file: https://soundcloud.com/noud-de-kroon/target-representation Initially the neural network only output static. An example output is this: https://soundcloud.com/noud-de-kroon/initial-representation After 75 epochs (an epoch is a cycle in which all the training data is trained on once) the output was as follows: https://soundcloud.com/noud-de-kroon/epoch75-representation Possibly there might be some rhythm being learned by the network at this point (but this might also be caused by some unknown factor. After this unfortunately, the training quickly falls into a failure mode. The discriminator loss quickly approaches 0, which causes process to halt. This is a known problem that may be encountered when training GAN’s. At epoch 300 the output of the network is as follows: https://soundcloud.com/noud-de-kroon/final-representation At this point training was halted since it seems a lost cause.


Approach: Complex images (STFT)

This approach was the first architecture we were able to implement and run (apart from the original StackGAN architecture on images). The first trial was unsuccessful, and the resulting output did not improve while training. The initial output was as “close” to the groundtruth (and, by extension, to “random noise”) as the final output (after 600 epochs of training) was. The discriminator very rapidly took over (having a loss very close to 0), and the generator was unable to “learn”. We inspected the results, and noticed that the generated music was in a very small range, whereas the real music had a much larger range. We realized that the mistake we made was that the generated output has values in (-1, 1) by design (tanh activation function). However, the real data samples were not normalized appropriately (as normalization of images is vastly different). Furthermore, when storing the output, we should denormalize the generated music. In short, we normalize the (real) samples using the maximum and minimum value of all samples. Finally, we denormalize the generated samples using these same max and min values. With these modifications, the results were still not very successful unfortunately. Although there seemed to be small improvements, we were not convinced that the network was learning sufficiently regarding the internal structure of the music data. As such, we did not continue with stage 2.

Approach: 2-D Spectrograms (STFT)

For this approach we had to somewhat alter the inner structure of stackGAN. This to make it handle 64 by 64 by 2 tensors (instead of 64 by 64 by 3). While this was a relatively minor adjustment, it required understanding the internal structure of stackGAN to make the proper adjustments. Sound was sampled at 1 kHz. An example of a sound sample is given here: https://soundcloud.com/noud-de-kroon/target-approach-3 An example of initial network output is given here: https://soundcloud.com/noud-de-kroon/epoch-0-approach-3 After 75 epochs the output sounded as follows: https://soundcloud.com/noud-de-kroon/epoch75-approach-3 After 300 epochs the output sounded as follows: https://soundcloud.com/noud-de-kroon/epoch-300-approach-3 And finally when training completed at 600 epochs the output sounded as follows: https://soundcloud.com/noud-de-kroon/epoch-600-approach-3 Looking at the spectrograms, a generated sound file looks as follows:

While a fake sample looks as follows:


Challenges

This was one of the most technically challenging projects we have encountered so far during our bachelors. In this section we would like to discuss some of the problems we encountered while adapting stackGAN.

Firstly, the original source code has very high complexity and is poorly documented. StackGAN relies on the TensorFlow library, which is a very high level library. This means that a single function call can be quite hard to understand, and might have many side effects. This made effectively working together on the code very hard, since any change to the code should be very well thought through and required understanding of the entire project. Another difficulty related to this is that starting the software project took around 5 minutes. This meant that if there was a bug somewhere, we would have to try fix the problem, start the project, wait 5 minutes, check the result, and then start the process over again. This made working on the project very cumbersome.

The second challenge was that hardware intensity of training deep neural networks. We had two computers with hardware sufficient for training deep neural networks. During the final 2 weeks of the project, these computers were training 24/7. If we tried a new strategy, we would only see results a few days later. This made iteration very slow.

The first two challenges combined caused us not to experiment with running a stage 2 of the network, especially because stage 2 is even more hardware intensive than stage 1. The MusicNet dataset for stage 2 is around 40 GB, and training stage 2 would take approximately 5-7 days. Running stage 2 would only be beneficial if stage 1 actually provided good results, thus we thought it was more important to improve the result of stage 1 and we could iterate more often on stage 1. In the end we believe stage 1 still does not provide good enough output for stage 2 to have potential to generate good output.

We also attempted to use Amazon Web Servers and Google Cloud. Unfortunately, we needed a server with GPU. Both Amazon and Google requires extra administrative work to be allowed to use instances with GPU’s. Unfortunately, we were not able to get these approved on time.

Experimental LSTM

One of the biggest problems with relying on GANs throughout the entire project is that they have never before been used to create music. We could not be 100% certain that the output of that network would be successful. Therefore, as a backup plan, we researched a different type of network called the Long short-term memory (LSTM) network, which has already produced good quality results. One of the main difference between GANs and LSTMs ---also what makes GANs more impressive---is the way they take input. Our approach with GANs relies on inputting raw audio while our LSTM approach relies on ultimately taking text input.

Magenta

"Magenta is a Google Brain project to ask and answer the questions, “Can we use machine learning to create compelling art and music? If so, how? If not, why not?” Our work is done in TensorFlow, and we regularly release our models and tools in open source. These are accompanied by demos, tutorial blog postings and technical papers. To follow our progress, watch our GitHub and join our discussion group. Magenta encompasses two goals. It’s first a research project to advance the state-of-the art in music, video, image and text generation. So much has been done with machine learning to understand content—for example speech recognition and translation; in this project we explore content generation and creativity. Second, Magenta is building a community of artists, coders, and machine learning researchers. To facilitate that, the core Magenta team is building open-source infrastructure around TensorFlow for making art and music. This already includes tools for working with data formats like MIDI, and is expanding to platforms that help artists connect with machine learning models." - (https://magenta.tensorflow.org/) We used the LSTM models released in the open source Magenta project to experiment with this approach to music generation.

Data

We used midi files for the LSTM network. The data sets we planned on using was the Lakh MIDI Data set (http://colinraffel.com/projects/lmd/) and a data set created by a redditor called The Midi man (https://www.reddit.com/r/datasets/comments/3akhxy/the_largest_midi_collection_on_the_internet/). The data set created by The Midi man contains information about the artist and the genre of the midi files. The Lakh MIDI Data set consist of several versions. The version we are used is the LMD-matched version. This version is a subset of the full version of the data set. This subset had all its midi files matched with the metadata from the one million song database.

LSTM Model Configuration

The magenta framework allowed for three types of LSTM configurations which our results were based on.

Basic

This configuration acted as a baseline for melody generation with an LSTM model. It used basic one-hot encoding to represent extracted melodies as input to the LSTM.

Lookback

The lookback RNN introduced custom inputs and labels. The custom inputs allowed the model to more easily recognize patterns that occur across 1 and 2 bars. They also helped the model recognize patterns related to an event's position within the measure. The custom labels reduce the amount of information that the RNN’s cell state has to remember by allowing the model to more easily repeat events from 1 and 2 bars ago. This resulted in melodies that wander less and have a more musical structure.

Attention

In this configuration we introduced the use of attention. Attention allowed the model to more easily access past information without having to store that information in the RNN cell's state. This allowed the model to more easily learn longer term dependencies, and resulted in melodies having longer arching themes.

Results

We were able to produce some preliminary results using Magenta's pre-trained models. It is interesting to hear the differences between the various configurations (basic, lookback and attention) of the LSTM.

Basic Recurrent Neural Network Example 1

Basic Recurrent Neural Network Example 2

Lookback Recurrent Neural Network Example 1

Lookback Recurrent Neural Network Example 2

Attention Recurrent Neural Network Example 1

Attention Recurrent Neural Network Example 2

The next two samples were generated by priming the LSTM with the first four notes of "Twinkle Twinkle Little Star" using the attention recurrent neural network.

Twinkle Twinkle Little Star Example 1

Twinkle Twinkle Little Star Example 2

User Interface

With our target user type being an average music listener, we needed a simple interface that hides the technical details from the user. The small number of features, that we based on the survey results, reduces the freedom users have in customizing their music, but in turn greatly simplifies the use of the application. By doing this the user does not get overwhelmed by features for something that is already a new technology anyway. It must also be stressed that the current level of our music generation network is not high enough to provide a wide range of customizable input either. Therefore, this application is only the first step in providing this technology to the user, and by no means a finished music generation tool. The current version of the application only has the most important features, the we found in our survey.

Website

To create the interface described above we used a basic setup with Laravel as the backend framework and Semantic UI as the style framework. The interface we created is shown in the figure below.

MusicGANwebsite.PNG

The most recent version of the live website is available at MusicGan.eu.

User Feedback

To make sure our product actually satisfies potential users, we need to evaluate it by having them try it out and ask them questions. By doing this we want to find out the following things:

  • Is it clear for the user what the application is doing?
  • Is it clear for the user how to use the application?
  • Is the user satisfied with the available options?
    • If not, what options are missing?
  • What does the user think of the generated song?
    • Does it reflect the selected options?

By asking people around us (friends and family), we found that our application could be improved in a few ways:

  • Problem: It is hard for the average user to equate a certain number of beats per minute to an actual speed.
    • Solution: Change the tempo selection to a slider or multiple choice option, where the user can choose between a range from slow to fast, while not being aware of exact numbers.
  • Problem: Not all genres that users want are available.
    • Unfortunately we will always be limited by the genres in the dataset when we generate music based on a genre.
  • Problem: The limited options make the application seem somewhat useless, since one might as well search for existing music with the right filters.
    • Solution: Since this technology is still new we are very limited in the options we have, and therefore the user’s choice is also limited. As the technology develops further, the user shall be able to give a better specification of what they want. As we saw in the survey, users seem to be interested in generating music based on emotion. This is one example of how music generation might become more advanced in the future, and so become more valuable to users too.

Next to this critical feedback we mainly noticed that, because of its simplicity, users did not seem to have problems user the application. Since the quality of songs we have been able to generate so far have not been good enough to present to the users, we did not receive feedback on those. Overall, users seemed very curious about the possibilities and wished wanted to have more control over the generation process. Of course, the main reason why we have been unable to deliver that is that the current level of our generation software does not allow for that flexibility yet. Furthermore, we have put our focus on generating any music at all, since this problem was already hard enough to begin with.

Discussion

GAN

While the results were far from perfect, there seems to be some promise for using GAN’s to generate music directly. Looking at the output of the approach using 2-D spectrograms, there seems some potential in this approach. Further research is required to keep improving on this result (which is something a few of us plan to do during the summer holidays).

One thing to note is that StackGAN is from the ground up built to process images. While using StackGAN allowed us to get results within the short time span allotted to this project, we feel that in order to properly attack this problem, from the ground up music has to be kept in mind. This mostly shows itself in the fact that kernel sizes of convolutional layers of the layer are chosen such that features of images can be recognized. We need to rethink these kernel properties in order for the layers to be able to detect features relevant to music.

LSTM

It would have been interesting to explore the extent to which we could have molded Magenta's LSTM technology to the needs of our users. As is described in the Experimental LSTM section (internal reference), we achieved output of pre-trained models and found datasets to use. We did not pursue further because we decided that we did not want to dedicate many of our resources towards the backup option while the GAN approach still showed promise. The idea was that if all GAN approaches showed unimpressive results, we would shift our focus to existing technologies like LSTMs, and the limited resource (usually one person) would be able to quickly bring everyone up to speed on the backup approach. If this did happen, we would first work on training our own LSTM and tweak it towards what we found our users are looking for, like generating samples based on genre, existing artists and tempo. We were happy that we were not forced to follow this path; we were much more interested in discovering new ways of generating artificial music rather than using existing technologies. Nevertheless, it would have been a valid approach for the product we were trying to build for our users, and might have even produced better results.

UI

In general people were quite receptive to our ideas and our UI. Improvements from the user’s perspective are that users want more control, since only genre and tempo are very limited. Ideally we would find an intuitive way that would allow the user to control the generation process to create music that perfectly suits their needs.

Conclusion

In conclusion, we feel we learned a lot from this project and we are very happy with our results.According to our research there seems to be market interest in artificially generated music. We managed to get some actual results out of a new method for music generation, something that we often doubted would be possible halfway the project. We feel that there is definitely potential to pursue this project further. And finally, we learned a lot about dealing with uncertainty and morale in a R&D focussed project. Especially this last part was challenging, since at many points it was hard to keep believing that the final results would be worth all the effort put in. Reflecting back on the project, we think we chose a very difficult subject, but it was the thing we were most interested in exploring and in the end we managed to get actual results.

Planning

GANT chart

The GANT chart for this project.
The GANT chart of this project

This GANT chart is created to keep this project organized. It shows in which activities this project consist, when these activities take part and when an activity needs to be finished. It also contains a list of milestones. This list shows in which parts this project is divided. The major activities of the GANT chart (somewhat) correspond to these milestones. During the course, the percentage of completeness in this GANT chart will be updated.

week 1

  • Decided on music generation as topic
  • Clarify USE aspects of the project
  • Working on first presentation

week 2

  • First presentation done
  • Working on second presentation
  • Created a project plan
  • Created a task division
  • Start researching available datasets
  • Start researching neural nets

week 3

  • Second presentation done
  • Reworked project based on feedback
  • Create questionnaire for survey
  • Selected the datasets (FMA and MusicNet) to create scripts for
  • Selected the open-source neural net (StackGAN) to build the project upon

week 4 (Carnival Week)

  • Conduct survey
  • Start adapting StackGAN, to make it work with sound instead of text
  • Begin user interface (website)
  • Start creation data extraction script for the FMA dataset

week 5

  • Analyse results from survey
  • Something that’s done for the neural net (Further adaptations on StackGAN)
  • Script for FMA dataset done
  • Start creation data extraction script for the MusicNet dataset

week 6

  • Created 4 test setups for MusicGAN
  • Something that’s done for the neural net (Errors corrected)
  • Script for MusicNet dataset Done
  • Fleshing out user interface (based on survey results)
  • Create second survey for the user interface

week 7

  • Conduct second survey
  • Start training MusicGAN

week 8

  • Working on final presentation
  • Analyze results second survey
  • Something that’s done for the neural net

week 9

  • Final presentation done
  • User interface finished (Revised with second survey results)
  • Tidy up wiki
  • Something that’s done for the neural net

week 10

  • Finished the wiki



Task Division

Neural Network

  • Noud de Kroon
  • Rolf Morel
  • Bas van Geffen

Data Engineering

  • Noud de Kroon
  • Herman Galioulline
  • Rick Coonen
  • Jaimini Boender
  • Steven Ge

Website

  • Rick Coonen

Surveys/Public opinion

  • Rick Coonen
  • Herman Galioulline
  • Steven Ge

References

http://artsites.ucsc.edu/faculty/cope/Emily-howell.htm [algorithmic composition]

  1. 1.0 1.1 Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. Deep learning. MIT Press, 2016.
  2. Mirza, Mehdi, and Simon Osindero. "Conditional generative adversarial nets." arXiv preprint arXiv:1411.1784 (2014).
  3. Denton, Emily L., Soumith Chintala, and Rob Fergus. "Deep Generative Image Models using a Laplacian Pyramid of Adversarial Networks." Advances in neural information processing systems. 2015.
  4. Zhang, Han, et al. "StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks." arXiv preprint arXiv:1612.03242 (2016).
  5. 5.0 5.1 http://colah.github.io/posts/2015-08-Understanding-LSTMs/
  6. Zhang, Han, et al. "StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks." arXiv preprint arXiv:1612.03242 (2016).