Generative Music - Ambient Poem Rhythm Game Proposal

Since the last blog post, we had a MusicVAE workshop using Jupyter notebooks. It is really lovely that the workshop fits my final project concept so well, so I can save some time on research and trial and error.

One thing I figured out was that I needed to simplify my concept, not just train models and generate new pieces. (Even testing pre-trained models took hours just to generate a few sentences!!)

I switched from prose to poems. This makes sense for the concept of listening to narration rhythmically: poems have rhythm and musical elements by nature.
I scraped the Poetry Foundation for poems, and they actually do have a Listen section as a way to appreciate poetry.

I was thinking about a very ambitious plan such as real-time generating narrative-music, but I guess I need to be satisfied with a small start.

The poem I am going to use is A Gift for You by Eileen Myles.

  • Generate the narration.
    WaveNet vocoder (Colab link/VocoderGithub/TacotronGithub)

    Tacotron2 predicts the mel spectrogram of a text, and WaveNet synthesizes a voice based on it.

  • NSynth interpolation.

    Since I wanted to give the sense of a musical piece, I chose NSynth to interpolate between the generated voice and ambient music.
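To make the pipeline above concrete: Tacotron2 maps text to a mel spectrogram, and WaveNet acts as a vocoder that turns that spectrogram into audio. As a rough sketch of what the mel-spectrogram representation itself is (pure NumPy, with made-up frame sizes; this is not Tacotron2's actual code):

```python
import numpy as np

def hz_to_mel(f):
    # Standard mel-scale conversion.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    # Triangular filters spaced evenly on the mel scale.
    mels = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fb[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fb[i - 1, k] = (r - k) / max(r - c, 1)
    return fb

def mel_spectrogram(y, sr=16000, n_fft=512, hop=128, n_mels=40):
    # Frame the signal, window it, take FFT magnitudes,
    # then project each frame onto the mel filters.
    frames = []
    for start in range(0, len(y) - n_fft + 1, hop):
        frame = y[start:start + n_fft] * np.hanning(n_fft)
        frames.append(np.abs(np.fft.rfft(frame)))
    spec = np.array(frames).T                  # (n_fft//2+1, n_frames)
    return mel_filterbank(n_mels, n_fft, sr) @ spec

# Toy stand-in for a narration clip: one second of a 440 Hz tone.
sr = 16000
t = np.arange(sr) / sr
mel = mel_spectrogram(np.sin(2 * np.pi * 440 * t), sr=sr)
print(mel.shape)
```

The result is a time-frequency image (mel bands x frames); Tacotron2 predicts exactly this kind of image from text, and WaveNet learns the inverse mapping back to a waveform.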


I first cut the audio exactly along the poem's line breaks, but it sounded too stiff and strong, so I re-cut the lines based on the waveform of the original narration.
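Re-cutting lines based on the waveform amounts to splitting at the pauses in the narration. A minimal sketch of that idea (NumPy only; the energy threshold and gap length are made-up values, not ones I tuned against the real recording):

```python
import numpy as np

def split_on_silence(y, sr, frame_ms=20, threshold=0.02, min_gap_ms=200):
    # Mark frames whose RMS energy falls below a threshold as silence,
    # then cut the waveform at silent gaps longer than min_gap_ms.
    frame = int(sr * frame_ms / 1000)
    n = len(y) // frame
    rms = np.sqrt(np.mean(y[:n * frame].reshape(n, frame) ** 2, axis=1))
    silent = rms < threshold
    min_gap = int(min_gap_ms / frame_ms)

    segments, start, gap = [], None, 0
    for i, s in enumerate(silent):
        if not s:
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                segments.append((start * frame, (i - gap + 1) * frame))
                start, gap = None, 0
    if start is not None:
        segments.append((start * frame, n * frame))
    return segments

# Toy example: tone, half-second pause, tone.
sr = 16000
t = np.arange(sr // 2) / sr
tone = 0.5 * np.sin(2 * np.pi * 220 * t)
y = np.concatenate([tone, np.zeros(sr // 2), tone])
print(split_on_silence(y, sr))  # -> [(0, 8000), (16000, 24000)]
```

Each returned pair is a (start, end) sample range for one "breath group," which tends to follow the spoken phrasing rather than the printed line breaks.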

Since it took so much time to generate each line, I ran the NSynth experiments at the same time.

NSynth 1 [Human Voice || Tibet Singing-bowl || Ibo Drum]

NSynth 2 [Generated Voice || Ocarina sound || Flute Sound]
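NSynth's interpolation works by encoding each sound into a latent vector, blending the latents, and decoding the blend back to audio. The blending step itself is just linear interpolation; a sketch with dummy latent vectors (random stand-ins, not the actual Magenta WaveNet-autoencoder codes):

```python
import numpy as np

def interpolate_latents(z_a, z_b, steps=5):
    # Linear interpolation between two latent codes: alpha=0 gives
    # sound A's code, alpha=1 gives sound B's, mixtures in between.
    alphas = np.linspace(0.0, 1.0, steps)
    return [(1 - a) * z_a + a * z_b for a in alphas]

# Dummy stand-ins for e.g. an encoded voice and an encoded singing bowl.
rng = np.random.default_rng(0)
z_voice, z_bowl = rng.normal(size=16), rng.normal(size=16)

blends = interpolate_latents(z_voice, z_bowl, steps=5)
print(np.allclose(blends[0], z_voice), np.allclose(blends[-1], z_bowl))
```

Because the blend happens in latent space, pitch, rhythm, and timbre all get mixed together, which is exactly the behavior I ran into below.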

In the first iteration, I thought low-pitched instrumental sounds would be really pleasant to hear, but the interpolation result with the human voice was a bit monstrous and unpleasant. I switched to high-pitched instruments for the second iteration.

Instrument sound samples below.

After these trials, it looks like NSynth might not be the model I was looking for. What I wanted was something like style transfer for music.

NSynth interpolates pitch and rhythm as well, but I wanted to keep the rhythm and pitch of the narration while changing only the timbre, toward an instrument or a certain music genre.

While looking for code, I found that WaveNet generates audio from a mel-spectrogram representation, so it could be interesting to generate instrumental sound from the mel spectrogram of the narration file. I actually tried to do that, but I was using a Jupyter notebook, so I couldn't access the output files.
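WaveNet here plays the role of a neural vocoder: reconstructing a waveform from a spectrogram. The classical, non-neural baseline for the same task is Griffin-Lim phase reconstruction, which iterates between time and frequency domains while holding the target magnitudes fixed. A compact NumPy sketch of that baseline (parameters are arbitrary, and this is far lower quality than WaveNet):

```python
import numpy as np

def stft(y, n_fft=256, hop=64):
    frames = [y[i:i + n_fft] * np.hanning(n_fft)
              for i in range(0, len(y) - n_fft + 1, hop)]
    return np.array([np.fft.rfft(f) for f in frames]).T

def istft(S, n_fft=256, hop=64):
    # Overlap-add inverse with window-squared normalization.
    n_frames = S.shape[1]
    y = np.zeros(n_fft + hop * (n_frames - 1))
    norm = np.zeros_like(y)
    win = np.hanning(n_fft)
    for i in range(n_frames):
        frame = np.fft.irfft(S[:, i], n=n_fft)
        y[i * hop:i * hop + n_fft] += frame * win
        norm[i * hop:i * hop + n_fft] += win ** 2
    return y / np.maximum(norm, 1e-8)

def griffin_lim(mag, n_iter=30, n_fft=256, hop=64):
    # Start from random phase and alternate domains, keeping the
    # target magnitude each round until the phase becomes consistent.
    rng = np.random.default_rng(0)
    phase = np.exp(1j * rng.uniform(0, 2 * np.pi, mag.shape))
    for _ in range(n_iter):
        y = istft(mag * phase, n_fft, hop)
        phase = np.exp(1j * np.angle(stft(y, n_fft, hop)))
    return istft(mag * phase, n_fft, hop)

# Target magnitudes from a half-second 330 Hz tone.
sr = 8000
t = np.arange(sr // 2) / sr
y = np.sin(2 * np.pi * 330 * t)
mag = np.abs(stft(y))
y_rec = griffin_lim(mag)
```

A neural vocoder replaces this whole loop with a learned model, which is why the WaveNet output sounds so much more natural.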

For the proof of concept, I tried putting the sounds into a rhythm-game Unity asset. I still need to assign lines to the sound manually, but it was really interesting that the text and sound were roughly synced when I treated them as four-quarter-measure music. Since the asset counts the beats, each line passed after 4 beats no matter how long or short it was, and it looked okay. This could be a new approach for the next step.
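The observation that every line occupies four beats suggests a simple scheduling rule: given a BPM, each line gets a fixed slot of 4 * 60 / bpm seconds, and its playback rate is whatever stretch factor makes the clip fill that slot. A sketch of the idea (the per-line durations are hypothetical, not measured from my narration files):

```python
def four_beat_schedule(line_durations, bpm=90):
    # Each poem line gets one 4/4 measure; "rate" is the playback-speed
    # factor that makes the clip fit the measure exactly.
    slot = 4 * 60.0 / bpm                      # seconds per measure
    schedule = []
    for i, dur in enumerate(line_durations):
        schedule.append({
            "line": i,
            "start": i * slot,                 # lines advance one measure each
            "rate": dur / slot,                # >1 compresses, <1 slows/pads
        })
    return schedule

# Hypothetical narration lengths (seconds) for three poem lines.
for s in four_beat_schedule([2.1, 3.4, 1.7], bpm=90):
    print(s)
```

This would let the Unity asset stay purely beat-driven while the audio side guarantees every line lands on the measure, instead of assigning lines by hand.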