A Glimpse into Three Research Papers from INTERSPEECH 2023

2023.08.25 R&D

The Speech AI Lab within NC’s AI Center has published research papers at INTERSPEECH, the world's largest and most comprehensive conference on the science and technology of spoken language processing, for four consecutive years. Over this period, the lab has focused on papers that enhance speech quality. While previous studies primarily addressed natural speech and AI singing voice generation, this year's papers cover a broader array of artificial intelligence models and systems. The results offer experiences that closely simulate interactions with real people, all aimed at creating “digital humans” that can engage effectively with people.

NC has a steadfast commitment to the creation of digital humans capable of meaningful conversations with users. To create digital humans tailored to the unique preferences of individual users, the team has been prioritizing the development of technology, focusing not only on natural language communication but also on the recognition of users' speech and gestures. The compelling research findings presented at the INTERSPEECH 2023 conference are poised to drive the evolution of personalized digital humans. These digital entities are designed to readily adopt nicknames, provide swift responses at any time and place, and even comprehend and empathize with users' emotions. Let's delve into the three research papers that offer a sneak peek into the forthcoming advancements in speech technology.

Digital Humans Are in Need of Unique Names

Title: “PhonMatchNet: Phoneme-Guided Zero-Shot Keyword Spotting for User-Defined Keywords”

Yonghyeok Lee, Namhyun Cho of Speech AI Lab

For more information on the research, check out the NC Research blog.

We live in a world where interacting with machines has become commonplace, so it is only natural that people now talk to them. People ask questions to gather information or give commands to various devices, including AI speakers, smartphones, and even automobiles. In these scenarios, the specific wake words used to summon these machines, such as “Hey Siri,” “Alexa,” and “Hey Google,” are known as “keywords.” Keyword Spotting (KWS), the technology that recognizes these wake words, is one of the core technologies in human-machine interaction. Lately, research has focused on customizing these keywords so that each individual can define their own.

This trend underscores users’ inclination to engage with machines in a manner that resonates with their individual preferences. This inclination is anticipated to gain further momentum as we enter the era of digital humans. In response, NC is actively delving into the realm of User-Defined Keyword Spotting (UDKWS) technology, designed to recognize a wide array of personalized keywords used by individuals.

Challenges in Developing User-Defined Keyword Spotting Technology

The strategy for developing this technology might appear straightforward at first glance: simply train a User-Defined Keyword Spotting model on the keywords that users want. However, several challenges complicate this process. Training a model that can distinguish keywords from non-keywords requires a substantial amount of speech data, and such data is hard to acquire. Moreover, non-technical complications arise when users must record their own speech each time they add or change a keyword. While this kind of customization may improve performance significantly, it also brings user inconvenience and privacy concerns. Furthermore, because users typically prioritize swift responses, models embedded within the device are more appealing than a server-client structure. On-device resources are limited, however, which makes it infeasible to keep a large, high-performance keyword spotting model running continuously in the background.

Digital Humans with Customizable Nicknames

NC has adopted a Zero-Shot Keyword Spotting model to address the diverse range of keywords essential for technologies such as customized digital humans. What sets this model apart from previous studies on customized keywords is its focus on pronunciation*. While preceding technologies mainly made a binary “same or different” decision for a pair of words (Word 1 vs. Word 2), the model proposed by NC goes further by comparing the pronunciation patterns of the two words. This allows the model to discern even subtle differences between words with similar pronunciations. In short, the model is trained on the agreement between speech and phoneme information from large general-purpose speech databases, which lets it assess whether a speech segment matches a given text. Based on this, PhonMatchNet was proposed, which can determine whether an arbitrary keyword has been spoken. With this approach, user-specific keywords can be incorporated seamlessly, without collecting additional speech data or retraining, which is what makes it a zero-shot KWS approach. This significantly reduces the time and cost otherwise required to learn each individual keyword. This research also has potential applications in assessing pronunciation proficiency in second language education. Given its high practicality, the model is on course to have a patent application submitted by the end of August.

* To be more precise, these pronunciation components are termed “phonemes.” A word is made up of a sequence of sounds, and the smallest unit of sound that distinguishes one word from another is referred to as a “phoneme.”
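
As a rough illustration of why a phoneme-level comparison is more informative than a binary word-level decision, the sketch below scores two keywords by the edit distance between their phoneme sequences. This is not PhonMatchNet, which learns to match speech against phonemes with a neural network. The `TOY_LEXICON` transcriptions, the function names, and the use of edit distance are all assumptions made for this example.

```python
# Toy sketch only: hand-written, illustrative phoneme transcriptions (hypothetical).
TOY_LEXICON = {
    "marvin": ["M", "AA", "R", "V", "IH", "N"],
    "martin": ["M", "AA", "R", "T", "IH", "N"],
    "alexa": ["AH", "L", "EH", "K", "S", "AH"],
}


def phoneme_edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, 1):
        curr = [i]
        for j, pb in enumerate(b, 1):
            cost = 0 if pa == pb else 1
            curr.append(min(prev[j] + 1,          # delete a phoneme from a
                            curr[j - 1] + 1,      # insert a phoneme into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def phoneme_similarity(a, b):
    """Graded similarity in [0, 1]; 1.0 means identical phoneme sequences."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - phoneme_edit_distance(a, b) / longest


def word_level_same(word_a, word_b):
    """Binary word-level decision, for contrast with the graded score."""
    return word_a == word_b
```

A word-level comparison labels both "marvin" vs. "martin" and "marvin" vs. "alexa" simply as different, whereas the phoneme-level score shows that the first pair differs in a single phoneme and the second pair shares none.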

Proposed Model Structure

Borderless Rapid Responsiveness of Digital Humans

Title: “Fast Enrollable Streaming Keyword Spotting System: Training and Inference using a Web Browser”

Namhyun Cho, Sunmin Kim, Yoseb Kang, Heeman Kim of Speech AI Lab

For more information on the research, check out the NC Research blog.

NC carried out further research to make these keyword spotting models more responsive and more user-friendly. To this end, a framework for training keyword models and spotting keywords was developed to run entirely within a web browser. A case study was then presented, showing how this framework overcomes the earlier limitation that deep learning models could only be executed on a server.

Creating models that respond rapidly on the device requires making them lightweight. To achieve this, the first step was to integrate a speech embedding model into the architecture of the keyword spotting system. This leverages prior knowledge about the characteristics of audio signals, so the keyword spotting model can be trained even when little training data is available. Furthermore, the model was converted to a streamable form: computation is performed only for the regions affected by newly arrived audio input, and the results of previous frame calculations are reused, which removes duplicate computation.
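
The idea of reusing previous frame calculations can be sketched as follows. This is a minimal illustration under stated assumptions, not the system described in the paper: `frame_feature` is a hypothetical stand-in (mean energy) for the per-frame work a real model would do, and `StreamingWindow` is a name invented for this example.

```python
from collections import deque


def frame_feature(frame):
    """Hypothetical per-frame computation: mean energy of the samples."""
    return sum(x * x for x in frame) / len(frame)


class StreamingWindow:
    """Caches per-frame features so each frame is processed exactly once."""

    def __init__(self, window_frames):
        # Only the most recent `window_frames` features are kept.
        self.features = deque(maxlen=window_frames)
        self.frames_computed = 0

    def push(self, frame):
        # Compute the feature for the new frame only; older ones are reused.
        self.features.append(frame_feature(frame))
        self.frames_computed += 1
        return sum(self.features) / len(self.features)


def non_streaming_score(frames, window_frames):
    """Baseline: recomputes every frame in the window on each call."""
    recent = frames[-window_frames:]
    feats = [frame_feature(f) for f in recent]
    return sum(feats) / len(feats)
```

Both versions return the same window score, but the streaming version calls `frame_feature` once per incoming frame, while the baseline repeats the work for every frame still inside the window.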

Even with this lightweight keyword model, however, running it posed a challenge, because the conventional approach required a server-based execution environment: deep learning models and libraries are written in a form that web browsers cannot execute directly. Yet digital humans, as defined by NC, must be able to run effortlessly on the most versatile platform. To address this issue, WebAssembly was used to make the deep learning code executable in currently deployed web browsers, enabling real-time inference in the browser. In short, users can create keyword models and run them independently through a web browser on any operating system or platform. And because the model runs directly in the browser without intermediary servers, users can expect markedly faster responses.

Proposed Model Structure

Essential Technologies for Personalized Digital Humans

The previously mentioned study, titled “PhonMatchNet: Phoneme-Guided Zero-Shot Keyword Spotting for User-Defined Keywords,” and the current study both center on the concept of a “keyword.” The “Personalized Digital Human” that NC aims to develop refers to a digital human tailored to each individual. Both studies revolve around the idea of creating digital humans that can swiftly learn individual names from minimal training data while maintaining solid performance at low cost. As a result, digital humans can now be addressed by the names users give them. Moreover, they are ready for near-real-time interaction on platforms that are more accessible and familiar to us, providing users with a more refined personalized experience.

In the future, NC plans to refine digital humans so that they operate faster and more efficiently through diverse optimization techniques. An added advantage is their potential for expansion, as they are not tied to any particular language, be it English, Korean, Japanese, Chinese, or Spanish. Building a model that is as robust as possible against both internal and external noise is also crucial. The introduced model demonstrates its robustness even in extreme environments, such as the noise that arises when video information is integrated with text and sound.

Digital Humans That Read and Empathize with Individuals

Title: “Focus-attention-enhanced Crossmodal Transformer with Metric Learning for Multimodal Speech Emotion Recognition”

Keulbit Kim, Namhyun Cho of Speech AI Lab

For more information on the research, check out the NC Research blog.

Until now, communication between humans and machines has been vertical and one-sided, which makes it hard to call genuine interaction. It has been confined to transactional exchanges in which humans pose queries or give directives and machines respond. The vision at NC, however, is to develop digital humans that can not only read human emotions but also sustain deeper and more meaningful interactions. This ambitious objective requires emotion recognition capabilities. Yet perceiving emotions is a challenge even for humans, as it entails more than simple voice recognition: it requires taking in various factors such as gestures, facial expressions, and context, and then making a comprehensive assessment. In other words, rather than considering modalities like voice and facial expressions individually, research on multimodal technologies* was necessary. On this basis, research on multimodal emotion recognition was initiated, focusing on two key modalities: text and speech.

* Technology that combines various types of data to understand and illustrate their interconnections.

Challenging the Most Difficult Task: Human Emotion Recognition

This research addressed three challenges. The first challenge in multimodal emotion recognition is the scarcity of paired data that covers multiple modalities rather than just one, and collecting and annotating such data is costly. To overcome this data scarcity, the approach uses speech and text Self-Supervised Learning (SSL) models pre-trained on large-scale unlabeled data.

Second, when emotions are assessed through two modalities, namely speech and text, it is difficult to identify the “key elements” in each. For instance, in the sentence “Argh, the stock market plummeted and it’s driving me insane,” the text modality picks up negative words like “plummeted” and “insane,” while the speech modality picks up the sigh at the beginning of the sentence and the frail intonation. To emulate this human-like cognitive mechanism, it was essential to design a network that focuses on different aspects of each modality when recognizing emotions. To achieve this, a fusion network with a new focus-attention mechanism was designed, which helps identify the parts of each modality that matter for emotion recognition. Finally, metric learning techniques were applied so that different emotions occupy distinct regions of the classification space while related emotions lie close together. Through these methods, the design for emotion classification was enhanced and the accuracy of emotion recognition improved (with the highest performance reaching 80 and the research paper performance at 78).
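
To make the metric learning idea more concrete, here is a minimal sketch of a triplet margin loss, one common metric learning objective. The article does not say which loss the paper uses, so the choice of triplet loss, the function names, and the toy 2-D embeddings in the usage note are assumptions for illustration only.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two embedding vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Triplet margin loss (assumed objective, not confirmed by the article).

    The loss is zero once the negative (a different emotion) is at least
    `margin` farther from the anchor than the positive (the same emotion).
    """
    gap = euclidean(anchor, positive) - euclidean(anchor, negative)
    return max(0.0, gap + margin)
```

For example, with a hypothetical anchor at `[0.0, 0.0]` and a same-emotion sample at `[0.0, 1.0]`, a different-emotion sample far away at `[3.0, 4.0]` incurs no loss, while one sitting close by at `[0.0, 1.5]` is penalized, which pushes embeddings of different emotions apart during training.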

Proposed Model Structure

The First Step in Multimodal Digital Human Research

One of the most important goals in the advancement of digital humans is to enable them to engage in emotionally attuned conversations akin to human interactions. This research paper introduces an empathetic interactive feature that detects users’ emotions and responds in a natural and fitting manner. This is achieved by imitating how humans perceive multimodal information. This feature goes beyond the limitations of goal-oriented conversations, enabling a more natural understanding and reflection of human emotions in conversations. Consequently, it empowers digital humans to establish a horizontal rapport with users, effectively evolving into true collaborative partners. The emotion recognition module based on multimodal inputs incorporates both speech and text information. Moving forward, NC aims to expand its research, covering existing fields such as Vision and NLP, to develop digital humans with advanced emotion recognition capabilities.

Continuous Challenges and Research Development for Deeply Interactive Digital Humans

Last year, the Speech AI Lab at NC aimed not only to develop AI speech synthesis technology, but also to establish competitive speech AI as a distinct core technology specific to NC by integrating it with diverse fields, including music. The main focus was on speech AI technology that goes beyond communicating in multiple languages to comprehensive emotion recognition and expression through voice tones, distinctive voices, and manipulation of voice elements. In contrast, this year’s research focused on enhancing the level of interaction between humans and digital humans.

NC aims to create personalized digital humans that can be tailored to individual users based on a high level of interaction. To achieve this, the team continues to collect real-world data covering a variety of emotions while working to improve the performance of these digital humans. This research will enrich the interaction between future AI and humans, contribute to better user experiences, and lay the groundwork for digital humans to grow into more natural companions in our day-to-day lives. NC remains committed to developing digital humans that mirror human-like attributes and interactions.

For more information on the research, check out the R&D story written by the presenters.

#AI
#AI CENTER
#DIGITAL HUMAN
#Interspeech 2023
#Speech AI Lab