    2022.09.16 R&D

    Jumping Into the Field of Art: Music AI Team

    In 2022, NC introduced the Music AI Team within the Speech AI Lab of its AI Center. The team is a group of deep learning and applied music specialists, experts from different fields who collaborate on music and AI research. Recently, a paper by the team was accepted at Interspeech 2022, a global conference on spoken language processing.

    This article presents an interview with the Music AI Team. We hear what kind of research they are pursuing in music, a subject with unlimited room for expansion; how they communicate and cooperate despite coming from different backgrounds; and how music AI technology may evolve in the future.

    Tae-woo Kim, Min-su Kang, Hyun-mu Lee, Kyung-hoon Lee, Yang-sun Lee, and Yun-jin Na (from left)

    Running a Three-legged Race

    Could you introduce the Music AI Team?

    Kyung-hoon   The Music AI Team studies AI related to music. Our major research fields include Singing Voice Synthesis (SVS)*, which generates a singing voice from lyrics and a score; Singing Voice Style Transfer (SST)**, which changes the rhythm or tone of a singing voice; Harmony Generation, which creates harmony to accompany a singing voice based on tonality and chord progression; and Melody Generation, which creates a melody from given inputs (images, chord progressions, etc.).

    *Singing Voice Synthesis (SVS): A model that creates a singing voice from lyrics and MIDI (data that includes pitch and duration)

    **Singing Voice Style Transfer (SST): A model that transforms an amateur singer’s voice style using the voices of professional singers
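
To make the footnote on SVS input concrete, here is a minimal Python sketch of how lyrics paired with MIDI pitch and duration could be represented and expanded into per-frame pitch targets. It is an illustration only, not the team's actual data format or model input: the `Note` class, the function names, and the frame rate of 100 frames per second are all assumptions introduced for this example. The pitch conversion uses the standard equal-temperament relation with MIDI note 69 = A4 = 440 Hz.

```python
from dataclasses import dataclass


@dataclass
class Note:
    """One score event (hypothetical structure, not the team's format)."""
    syllable: str          # lyric syllable sung on this note
    midi_pitch: int        # MIDI note number, where 69 is A4
    duration_beats: float  # note length in beats


def midi_to_hz(midi_pitch: int) -> float:
    """Convert a MIDI note number to frequency (A4 = 440 Hz, equal temperament)."""
    return 440.0 * 2.0 ** ((midi_pitch - 69) / 12)


def note_seconds(note: Note, tempo_bpm: float) -> float:
    """Length of a note in seconds at the given tempo."""
    return note.duration_beats * 60.0 / tempo_bpm


def to_frame_targets(notes, tempo_bpm, frame_rate=100):
    """Expand notes into a list of (syllable, f0_hz) pairs, one per frame.

    frame_rate is an assumed value for illustration.
    """
    frames = []
    for note in notes:
        n_frames = round(note_seconds(note, tempo_bpm) * frame_rate)
        frames.extend([(note.syllable, midi_to_hz(note.midi_pitch))] * n_frames)
    return frames


if __name__ == "__main__":
    score = [Note("la", 69, 1.0), Note("la", 71, 0.5)]
    targets = to_frame_targets(score, tempo_bpm=120)
    print(len(targets), targets[0])
```

A synthesis model would take something like these targets as conditioning; how the conditioning is actually encoded is not described in the interview.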

    You used to be called “Singing Voice TF”, but the team’s name officially changed to the Music AI Team this year. What kind of changes are you going through?

    Kyung-hoon   When we first started, we named our team “Singing Voice TF” because we wanted to study singing voices. We changed the name to “Music AI” because we aim to carry out AI research on a wider range of music, including singing voices, grounded in musical elements. As the scope implied by the name has expanded, so has the range of subjects we cover. At the moment, singing voice accounts for the larger part of our research, but we aim to gradually broaden the scope of our research subjects.

    Music and AI do not seem like a usual combination. How is the team composed?

    Kyung-hoon   Unlike teams that work on ordinary speech, the Music AI Team deals with musical elements. Our team therefore consists of three applied music specialists, Yang-sun Lee, Hyun-mu Lee, and Yun-jin Na, and three deep learning researchers: Tae-woo Kim, Min-su Kang, and me. Researchers mainly study and develop models, while musicians build and analyze the data. We carry out the whole process in cooperation.

    You could have received support from external parties if you had needed resources from other professional areas. Was there any special reason that you recruited applied music specialists as team members?

    Kyung-hoon   What we care about most in music AI research is ‘musical expertise’. To create an AI that performs well, we need high-quality data, and I have always thought about how to secure a high-quality dataset. Modeling speech and modeling singing voice are very different. Unlike speaking voices, singing voices must be synthesized so that they match the pitch and beat on the score. Also, because different vocal techniques apply depending on the singer’s own characteristics and the composer’s intention, musical knowledge is required. Beyond song selection, many steps call for musical expertise, including how to direct singers during recording and how to edit the audio files. If we outsourced this work, we might save time and money, but many errors could occur in transcription. That is why we brought three applied music specialists onto the team, so that we could focus more on research.

    What kind of differences has the ‘musical expertise’ made?

    Kyung-hoon   We can prepare the data more thoroughly, because musicians take musical factors into account when building and analyzing the DB. For instance, we used to select songs by a certain style so that the set would be stylistically unified. Now we not only select songs by style but can also dig into the different styles of sections such as verse, bridge, and chorus, depending on the song’s form.

    A Singing AI

    Please explain how the applied music and deep learning specialists generate ideas and develop models.

    [Figure 1. Workflow of Music AI Team]

    Kyung-hoon   The process mainly consists of research → recording planning and song preparation → recording → DB → development. First, we develop ideas together based on market trend analysis, literature review, and so on. Next, we choose a genre and prepare songs, carry out the recording, organize the recorded data, and develop models. Throughout, all team members exchange opinions, but musicians take the lead in building and analyzing the data, while researchers take the lead in model development.

    We were told that you review not only papers but also market trends as part of your research.

    Kyung-hoon   Papers mainly focus on improving performance, so it is difficult to decide on a direction for real-world application from papers alone. By monitoring market trends, we check whether a technology has been commercialized and set the target level we have to reach. We also use the monitoring results to generate ideas about where the Music AI Team can contribute to the “Digital Human Project” of the AI Center.

    Yun-jin   When we carry out research, it is important to consider in advance the service that users will experience. We therefore monitor market trends every day. When there are noteworthy items to share with the entire team, we summarize them separately, share them once a week, and hold an idea generation meeting. Looking at the market, various digital humans are being introduced, and in most cases actual human singing voices are used after limited alteration.

    However, since the Music AI Team studies singing voice synthesis, we are thinking about creating digital humans that can sing by themselves. Because this may open up many possibilities, it is important to keep monitoring market trends.

    Data establishment and analysis are very tricky. How do you prepare the data?

    Yang-sun   There are many things to do, indeed. (Laughter) The first step in data preparation is to select a genre. After selecting a genre, we build a list of songs that are considered to belong to it. Songs these days strongly reflect their creators’ individual styles and are therefore difficult to classify into a single genre. So we select a few songs as references and carry out an initial classification based on each song’s atmosphere. For instance, we divided them into ‘A. light atmosphere’ and ‘B. slightly dark atmosphere’. Once this classification is done, we agree as a team on a common standard for sorting songs into categories. Then we have the AI learn the common elements of the songs and develop distinct models that sing in different ways.

    We were told that the team focuses its research on enhancing naturalness. From the Music AI Team’s point of view, how would you define ‘naturalness’?

    Yang-sun   For us, naturalness means a singing voice that is close to a human’s and in which emotion is alive. In other words, you can feel the emotion when the AI sings, and the atmosphere of the song comes across fully. Of course, “atmosphere” is a subjective word, so we analyze songs against set criteria to keep the evaluation objective. We also focus on the recording and post-recording processes. We hire professional singers rather than amateurs, and we have raised the quality of post-recording work to the point where a song is ready for release, because the models have to learn from high-quality data. In short, I believe naturalness comes from the effort to improve data quality.

    Respect and Sympathy To Overcome Differences

    It seems that music and AI sit at opposite ends. Were there any difficulties in working together?

    Kyung-hoon   The workflow is established now, but in the beginning we often had different definitions for the terms we used. Some terms mean the same thing in the music industry and in music AI model research, while others mean different things in the two fields. Because AI research uses more simplified data than music does, we understood the same words differently. It was difficult to explain one’s own concepts to people from a different field.

    A difference in experience was another challenge. For instance, when a musician listens to a synthesized singing voice, they can intuitively feel what is awkward about it. However, it takes a lot of thought to explain that awkwardness clearly to researchers. Conversely, researchers had to work out how to turn what the musicians explained into logic that could be taught to the model.

    Is there an episode where you failed to understand each other?

    Tae-woo   As a researcher, I tend to compare a new result with that of the previous model whenever the model’s performance improves. So when the result is better, I feel satisfied and confidently explain it to the musicians. But whenever I ask for their opinions, they just shake their heads disapprovingly. (Laughter)

    Hyun-mu   Musicians evaluate singing voice synthesis intuitively. (Laughter) It is similar to the way people see a cat and intuitively perceive that it is a cat, rather than deciding that it is a cat after observing that it has four feet and whiskers. Our standards for naturalness have now largely converged, so we share the same sense of what a more realistic result is, even without explanation.

    Yang-sun   Whenever we listened to the results the researchers brought us, we used to say “This is not so natural…” (Laughter) We made a great effort to explain why results that we intuitively perceived as bad were actually bad. (Laughter) Since each field has terminology that it takes for granted, we made an effort to understand each other’s terms.

    It seems that there is a caring and respectful culture in place inside the team.

    Kyung-hoon   We decided not to use “but” inside the team. It was one of the few promises we made to each other when we thought about how to work together happily. If person B responds with “but…” immediately after person A gives an opinion, person A may feel that the opinion has been disrespected, and may find it difficult to speak up again. We wanted to respect each other in this regard.

    Yun-jin   It seems that we have made even more effort to respect each other’s opinions since making those promises.

    Let us know if there are other things that you consider important in terms of teamwork.

    Kyung-hoon   From the very beginning, we emphasized the value of ‘sharing’. Since we came from different backgrounds, we needed opportunities to share our knowledge proactively. We began sharing knowledge with a music theory study group led by one of the musicians on the team.

    Yang-sun   I took the lead in the study group because I believed we could carry out our research even more successfully if we understood each other’s professional areas. A few weeks ago, the researchers started an AI study group and have been leading it.

    Min-su   Beyond accumulating knowledge through the study group, I am getting to know the professional area itself. It seems that we are broadening our understanding, balancing our levels of knowledge, and making up for each other’s weaknesses.

    How Far Can We Go with AI and Music?

    What can we expect if the singing voice synthesis technology studied by the Music AI Team develops further and matures?

    Min-su   I think there is an entry barrier for ordinary people who want to start making music. For instance, most people believe that they need a wide range of musical knowledge and technique to be a good singer. I believe that if singing voice synthesis technology develops further, we can lower that barrier, because people will be able to create music with the help of technology even if they do not know how to vocalize or lack musical knowledge. And if a new type of music that breaks existing boundaries is created, I think it may in turn create a need for new technologies.

    Hyun-mu   Some people may fear technological development, but I think the number of new things we can try will also increase. Producers may create the teen band they want by synthesizing various types of singing voices and may generate profits by creating secondary content. I believe that if every detailed element of music is treated as data, even more can be expressed, and people will be able to feel a new kind of joy. For instance, a user who is very good at vibrato* may extract that part separately and sell it.

    *Vibrato: A technique in which the human voice or the sound of a musical instrument is made to vibrate

    Kyung-hoon   We can imagine a service that gives a digital human a voice so that it can communicate with people and sing to them. For instance, it might sing more passionately in front of people who react passionately, and sing comforting songs to those who are depressed or going through a painful break-up. Ultimately, we aim to create a singer-songwriter that can write and sing songs of various genres by combining music and AI. The Music AI Team’s goal is to create AI voices and music that sing with as much emotion as a real human being and that move people’s hearts with music that has never existed before.