The Music of Machines     



Janani Mukundan: Teaching a machine to compose music

Inside IBM Research | Where great science and social innovation meet

Listen to The Music of Machines on Soundcloud

IBM Watson, the cognitive computer that defeated the best Jeopardy! champions in the world, has moved on. Since 2011, Watson has revolutionized the way human beings can see patterns in “unstructured” data — research papers, blogs, reports, photos — and influence decisions in healthcare, finance, retail, travel and services.

In this episode of Inside IBM Research, Janani Mukundan (pictured), a machine learning researcher at the IBM Austin Lab, talks about how a stack of Restricted Boltzmann Machines is now building off Watson to “learn” how to compose music.

The cool thing about using music — using machine learning to learn music — is first thing is, it can be expressed mathematically. So, you can tell what pitch is being played, and what is the time signature, what is the key signature, etc. using the input files.

When we started out, the rules of music, we thought it was simpler than actual natural language, and the semantic and grammatical rules were simpler. So we thought it would be good to learn all of these things.

So, what we’re trying to learn here is, I guess pitch variations and rhythm variations in the song, and the input that we use basically is a MIDI file, which already has all this information. You have to give it a digital representation. You can’t just give it an mp3. You have to give it a MIDI.
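To illustrate why a MIDI file is a convenient "digital representation": MIDI stores each note as an integer from 0 to 127, and under the usual convention of 12-tone equal temperament with A4 = 440 Hz, that integer maps directly to a pitch name and a frequency. The helper names below are hypothetical and are not taken from the project described in the interview.

```python
# MIDI note numbers are integers 0-127; note 69 is A4 (440 Hz by convention).
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def midi_to_frequency(note: int) -> float:
    """Frequency in Hz under 12-tone equal temperament, A4 = 440 Hz."""
    return 440.0 * 2.0 ** ((note - 69) / 12)

def midi_to_name(note: int) -> str:
    """Note name with octave, using the convention that MIDI 60 is C4."""
    return f"{NOTE_NAMES[note % 12]}{note // 12 - 1}"

# Opening of "Mary Had a Little Lamb" in C major: E D C D E E E
melody = [64, 62, 60, 62, 64, 64, 64]
for n in melody:
    print(n, midi_to_name(n), round(midi_to_frequency(n), 2))
```

An mp3, by contrast, stores a compressed waveform, so the notes would first have to be estimated from the audio.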

So, we’re trying to extract features of the song and express it in a different way.

A Restricted Boltzmann Machine is basically a stochastic neural network. It has a layer of visible units or neurons. And it also has another layer of hidden units. And the idea behind a Restricted Boltzmann Machine, especially for our project is, we provide the information on the pitch, the note that’s being played at any given point, and the visible layer captures this information, and we train the model in a way that the hidden layer tries to capture or extract features of the visible layer by, you know, updating the weights and so on. And the hope is that once the hidden layer of neurons captures the essence of this visible layer, which is basically the music, we should be able to recreate this input with just the extracted features and the weights.
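The two-layer structure described here can be sketched in a few lines of NumPy. This is a minimal, untrained binary RBM for illustration only. The class name, layer sizes, and input vector are assumptions, and the interview does not say how the actual model was implemented.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class TinyRBM:
    """Binary RBM: one visible layer, one hidden layer, no links within a layer."""

    def __init__(self, n_visible, n_hidden):
        self.W = rng.normal(0.0, 0.1, size=(n_visible, n_hidden))
        self.b_visible = np.zeros(n_visible)
        self.b_hidden = np.zeros(n_hidden)

    def hidden_probs(self, v):
        # Probability that each hidden unit switches on, given the visible layer.
        return sigmoid(v @ self.W + self.b_hidden)

    def visible_probs(self, h):
        # Probability that each visible unit switches on, given the hidden layer.
        return sigmoid(h @ self.W.T + self.b_visible)

    def reconstruct(self, v):
        # The network is stochastic: hidden states are sampled, not computed exactly.
        p_h = self.hidden_probs(v)
        h = (rng.random(p_h.shape) < p_h).astype(float)
        return self.visible_probs(h)

rbm = TinyRBM(n_visible=8, n_hidden=4)
v = np.array([1, 0, 0, 1, 0, 0, 1, 0], dtype=float)  # hypothetical input pattern
print(rbm.reconstruct(v))
```

With random weights the reconstruction is close to noise. Training is what makes the hidden layer carry useful features of the input.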

So, in an ideal world, what would happen is, if we don’t perturb or disturb the model, we should be able to exactly reconstruct the input that was provided to it. But now we don’t want to do that. We want to be able to create new music.

So, in order to create new music, we bias the model, or we perturb the model. And we add creativity genes, or neurons, to this model, so that, when you try to extract the features, you’re not only extracting the essence of the actual song, but you’re also adding subtle nuances that were not existing in the original song. So, when you recreate the music, you get a song that is familiar, but is also yet different from what was given to the model.
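The interview does not say how the perturbation or the "creativity" neurons are implemented. Purely as an assumed stand-in, one simple way to perturb a model of this kind is to flip a few of the extracted hidden features at random before reconstructing. Every name below is hypothetical, and the weights are untrained placeholders.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def encode(v, W, b_hidden):
    """Deterministic hidden features: a unit is on if its probability > 0.5."""
    return (sigmoid(v @ W + b_hidden) > 0.5).astype(float)

def decode(h, W, b_visible):
    """Rebuild the visible layer from hidden features."""
    return (sigmoid(h @ W.T + b_visible) > 0.5).astype(float)

def perturb(h, flip_prob, rng):
    """Assumed perturbation: flip each hidden unit with probability flip_prob."""
    flips = rng.random(h.shape) < flip_prob
    return np.where(flips, 1.0 - h, h)

rng = np.random.default_rng(1)
W = rng.normal(0.0, 1.0, size=(8, 4))   # untrained, illustrative weights only
b_visible, b_hidden = np.zeros(8), np.zeros(4)
v = np.array([1, 0, 0, 1, 0, 0, 1, 0], dtype=float)

h = encode(v, W, b_hidden)
faithful = decode(perturb(h, 0.0, rng), W, b_visible)   # no perturbation
varied = decode(perturb(h, 0.25, rng), W, b_visible)    # some features may flip
print(faithful, varied)
```

With `flip_prob` at zero the output is the plain reconstruction. Raising it trades fidelity to the input for variation.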

You can actually mix multiple songs to come up with one song. So, mixing “Mary Had a Little Lamb” and “Oh Susannah” to come up with a new version of the song that sounds similar to both songs basically.


A neural network basically consists of a visible layer of neurons. And when I say “visible layer,” it means that the input is actually provided to those neurons. What is an input for us? An input is basically — given a particular instance of time, what pitch is being played?

So, I take an input song and I divide it into eighth notes, or quarter notes, or half notes and so on. And then every neuron represents a half note or a quarter note. And then the value of that neuron is basically the note of the pitch that is being played.
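One plausible reading of this encoding, sketched under assumptions: cut the song into equal time steps (eighth notes here), record the pitch sounding at each step, and turn each step into a one-hot row so that binary units can represent it. The function names, the pitch range, and the one-hot choice are illustrative and are not confirmed by the interview.

```python
import numpy as np

def to_grid(notes, step=0.5):
    """Expand (midi_pitch, beats) pairs into one pitch per time step.

    step=0.5 beats means eighth-note resolution when a quarter note is 1 beat.
    """
    grid = []
    for pitch, beats in notes:
        grid.extend([pitch] * int(round(beats / step)))
    return grid

def one_hot(grid, low=60, high=72):
    """One row per time step, one column per MIDI pitch in [low, high]."""
    encoded = np.zeros((len(grid), high - low + 1))
    encoded[np.arange(len(grid)), np.array(grid) - low] = 1.0
    return encoded

# "Mary Had a Little Lamb", first phrase: six quarter notes and a half note.
notes = [(64, 1), (62, 1), (60, 1), (62, 1), (64, 1), (64, 1), (64, 2)]
grid = to_grid(notes)
encoded = one_hot(grid)
print(grid)
print(encoded.shape)
```

This simple grid cannot tell a held note from a repeated one, which is the kind of detail the "additions" mentioned next would have to capture.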

And this is the most simplest [sic] song. We can play around with other things. Like, we can add dynamics to the neuron. We can say that, “I want this neuron to be a quarter note as opposed to a half note.” Those things are, like, additions that you can do it.

But the basic form is having a neuron and it represents a note or a pitch that is being played at any given point in time.

So that’s what basically those two layers mean.

And the algorithm that we use so that the hidden layer captures, or extracts, the features of the input is called contrastive divergence. It’s just an algorithm that’s been used for a really long time.
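For readers who have not seen it, here is a minimal sketch of one-step contrastive divergence (CD-1) for a binary RBM. It compares statistics gathered from the data with statistics gathered from the model's own reconstruction and nudges the weights toward the data. The toy patterns, the hyperparameters, and the function name are assumptions and are not the settings used in the project.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_cd1(data, n_hidden=6, epochs=500, lr=0.1, seed=0):
    """Train a binary RBM with CD-1; returns parameters and per-epoch error."""
    rng = np.random.default_rng(seed)
    n_samples, n_visible = data.shape
    W = rng.normal(0.0, 0.1, size=(n_visible, n_hidden))
    b_v = np.zeros(n_visible)
    b_h = np.zeros(n_hidden)
    errors = []
    for _ in range(epochs):
        # Positive phase: hidden activations driven by the data.
        p_h0 = sigmoid(data @ W + b_h)
        h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
        # Negative phase: one Gibbs step down to the visible layer and back up.
        p_v1 = sigmoid(h0 @ W.T + b_v)
        p_h1 = sigmoid(p_v1 @ W + b_h)
        # Update: data-driven correlations minus reconstruction-driven ones.
        W += lr * (data.T @ p_h0 - p_v1.T @ p_h1) / n_samples
        b_v += lr * (data - p_v1).mean(axis=0)
        b_h += lr * (p_h0 - p_h1).mean(axis=0)
        errors.append(float(np.mean((data - p_v1) ** 2)))
    return W, b_v, b_h, errors

# Toy stand-in for encoded music: three binary patterns over six visible units.
data = np.array([[1, 1, 0, 0, 0, 0],
                 [0, 0, 1, 1, 0, 0],
                 [0, 0, 0, 0, 1, 1]], dtype=float)
W, b_v, b_h, errors = train_cd1(data)
print(errors[0], errors[-1])
```

Reconstruction error is only a rough progress signal for an RBM, but it is enough to see whether the hidden layer is picking up the patterns.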

So, when I give an input to my model, this model has no knowledge of what good music sounds like. So, it’s completely unsupervised. It doesn’t know what sounds good, what sounds bad. So, you just give it a piece of music, and then it comes out with a new piece of music.

Now, we as human beings already know what sounds good and what sounds bad. We know certain notes go well with certain other notes. We know certain key signatures are better when played differently, and so on.

So, all of this information is provided to the WolframTones model. So, it’s kind of supervised, as in, you already know what sounds good and you want to extract something else from what sounds good to come up with new music. That’s pretty much what WolframTones does.

So, what I noticed from my training experiences: The harder the song, the better it learns. Classical music, it was able to learn really well, because classical music is really hard. Some pieces of classical music are easy, but most pieces are really hard. And it was able to learn it better. It was able to add more subtlety and more nuances to it to make it sound different.

When the music is really simple, like, you know, pop music, it was much simpler [sic] when compared to classical music. The output wasn’t very creative, in my opinion, because the input that was given to it was already not so complex. So the output that came out of it was not complex as well. So, that was one drawback. It couldn’t come up with something very creative when you gave it a simpler song as opposed to a more difficult piece.

Like, I trained De Angelus Gloria, which took about, I think, fifteen minutes on my laptop. I trained one of Adele’s songs, and that took about, like, five minutes to train because it was much simpler to train.

I actually tried to learn Dire Straits, one of my favorite bands. It was really hard to learn psychedelic rock, or any other kind of rock music because it didn’t have grammatical rules like classical music, or even like pop, which is very simple. So that was really hard to learn, and I’m still trying to learn it.


How can we make use of this? The possibilities are endless.

You can think of a cognitive music composer Pandora station. You’re tired of listening to all sorts of songs that you already listened to before. You let the computer create your own music for you.

And I can think of composers using it. You know, they want to tweak certain aspects of their song, make it sound different. They can just plug this piece of information into the model. It comes up with an endless number of alternatives that it can use.

I can also think of a cognitive cloud service offering. You have cloud and mobile platforms that you can have streaming, composition, licensing, etc. for music. All sorts of things can be done with it.

And the whole idea of basically being able to pick music from different genres and mix them together to come up with new music is something new that doesn’t exist right now.

So just like how we can listen to music and learn from it, this can be applied to natural language as well. The models are going to be much more complicated and the grammar’s going to be different. The semantic searches are going to be different. But in essence, music is a language, and, you know, natural language can also be applied to the same model. You feed pieces of information — books, or whatever it is — to the model. It tries to extract features out of it. You can classify books based on this. It’s pretty much the same idea for music as a language and natural languages for language as well.

Were you able to hear how the “perturbed model” resulted in one piece that sounds like a cross between “Mary Had a Little Lamb” and “Oh Susannah” and another that sounds like De Angelus Gloria?

You can hear the complete musical pieces that Janani's cognitive model was able to learn here:

Mary Had a Little Lamb & Oh Susannah

De Angelus Gloria

You’ve been listening to Inside IBM Research. I’m your host, Barbara Finkelstein. Our producer is Chris Nay at our IBM Austin Lab. Our music is “Happy Alley” by Kevin MacLeod (a human composer). Share this episode with colleagues and friends — and keep an ear out for our next episode.


Last updated on September 30, 2015