Generative Music - exploration

google slide

My concept for the final project is an interactive reading experience with an intonation-like ambient background sound.

The idea came from my frustration with reading English text, compared to how free I feel when I read in Korean.

As a grad student, I get a bunch of texts to read every week for discussion, and I somehow manage to read them all. But what I found in class is that I had completely missed important cases in the text.

Honestly, I was more overwhelmed by the amount of text than focused on its contents.

Recently I had a new experience with audiobooks. Because producing audiobooks is very expensive, I had never experienced full human-voice narration for an ebook in Korea. I am not sure if I just missed them, but most audio services for ebooks there are very distracting TTS, and for me it was better not to listen at all. Using audiobook narration alongside an ebook was quite interesting: it helped me a lot to focus on the text, especially academic texts.

The most challenging part of reading in a second language is that you need to stop to look up the meaning of words. When you stop again and again, you lose your focus and it is hard to get back to the line. (There is research suggesting that outside interruptions are the most damaging disturbance to productivity.) It may sound ridiculous, but you can talk with people in a second language without understanding every single word, because there is a semantic understanding that comes from the context. Listening to an audiobook is quite like that.

Even though the audiobook was pretty helpful, I still find it hard to focus on reading along with it, for a couple of reasons:

  • Reading is commonly much faster than listening

  • Narration speed is hard to change because of the amount of information it carries

  • We don’t actually need to hear every single word. It is like how we don’t read every single letter of a word; we just see its general shape.

That is why I want to make narration-like ambient music: it doesn’t contain any distinguishable information, but you feel like someone is mumbling next to you.
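
One rough way to hear what "mumbling without information" might sound like is to low-pass filter a narration recording, so the consonants that carry the words are smeared while the rise and fall of the voice mostly stays. This is only a sketch of that idea, not the method I have settled on. The function names, the 400 Hz cutoff, and the test tones are all my own assumptions, and a real version would load actual narration audio instead of sine waves.

```python
import math

def tone(freq_hz, sample_rate, seconds=1.0):
    """Generate a sine wave as a stand-in for real narration audio."""
    n = int(sample_rate * seconds)
    return [math.sin(2.0 * math.pi * freq_hz * i / sample_rate) for i in range(n)]

def rms(samples):
    """Root-mean-square level, used here to compare loudness."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))

def mumble_filter(samples, sample_rate, cutoff_hz=400.0):
    """One-pole low-pass filter: keeps low frequencies, attenuates high ones.

    The cutoff value is an assumption picked for illustration only.
    """
    alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / sample_rate)
    out, y = [], 0.0
    for x in samples:
        y += alpha * (x - y)  # move the output a fraction of the way toward the input
        out.append(y)
    return out

if __name__ == "__main__":
    sr = 8000
    for freq in (100, 3000):
        before = tone(freq, sr)
        after = mumble_filter(before, sr)
        print(freq, "Hz keeps", round(rms(after) / rms(before), 2), "of its level")
```

A one-pole filter is very gentle, so real speech would probably need a steeper filter or a lower cutoff before the words actually become unintelligible.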

So how can I achieve this experience? I found a good blog about deep learning models and intonation (mostly about TTS and deep learning). It looks like several approaches are available.

  • Pair each sentence with a manipulated audio file from the narration -> generate mumbling music and show the text accordingly.
    = This is the most straightforward approach, but it also has a lot of constraints. How can I pair each word with its sound? Is there a tool that helps with preparing the data, or should I do it manually?

  • Get a score of the narration and pair each part of the score with the text.

    = This is a more music-piece-like approach, and it seems doable. I guess it will be like a rhythm game.

  • Existing models that seem appropriate for my project: SampleRNN, WaveNet, Lyrebird(?)
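
To get a feel for the first two approaches together, here is a minimal sketch that pairs each word with a time span and one note. It assumes I already have a pitch track of the narration (a list of Hz values, one per time step, with 0 for unvoiced frames), which a pitch tracker could provide. The word timing is a naive guess that splits the sentence duration by word length; a forced-alignment tool would give real timings. All function names and the sample numbers are hypothetical.

```python
import math

def naive_word_spans(sentence, duration):
    """Split `duration` seconds across the words in proportion to their length.

    This is a crude stand-in for real word-level alignment.
    """
    words = sentence.split()
    if not words:
        return []
    total_chars = sum(len(w) for w in words)
    spans, t = [], 0.0
    for w in words:
        d = duration * len(w) / total_chars
        spans.append((w, t, t + d))
        t += d
    return spans

def hz_to_midi(freq_hz):
    """Convert a frequency to the nearest MIDI note number (A4 = 440 Hz = 69)."""
    return int(round(69 + 12 * math.log2(freq_hz / 440.0)))

def score_for_sentence(sentence, duration, f0_track, hop):
    """Give each word one note: the median voiced pitch inside its time span.

    f0_track holds one pitch value in Hz every `hop` seconds (0 = unvoiced).
    """
    score = []
    for word, start, end in naive_word_spans(sentence, duration):
        i0 = int(start / hop)
        i1 = max(i0 + 1, int(end / hop))
        voiced = sorted(f for f in f0_track[i0:i1] if f > 0)
        note = hz_to_midi(voiced[len(voiced) // 2]) if voiced else None
        score.append({"word": word, "start": round(start, 3),
                      "end": round(end, 3), "midi": note})
    return score

if __name__ == "__main__":
    # Made-up pitch values, only to show the shape of the output.
    track = [220, 220, 440, 0, 440, 440]
    for entry in score_for_sentence("aa bbbb", 3.0, track, hop=0.5):
        print(entry)
```

Each entry could then drive both the display (highlight the word from `start` to `end`) and the sound (play the `midi` note), which is roughly the rhythm-game idea.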