From: Nexdata    Date: 2024-08-15
The rapid development of artificial intelligence is inseparable from the support of high-quality data. Data is not only the fuel that drives AI model learning, but also the core factor determining model performance, accuracy, and stability. Especially in automation and intelligent decision-making tasks, deep learning algorithms built on massive data have shown their potential. Well-structured, rich datasets have therefore become a top priority for engineers and developers seeking to ensure that AI systems perform well across a variety of scenarios.
With the rise of artificial intelligence, delivering a more natural and intelligent human-computer interaction experience has drawn sustained attention, making affective computing a research hotspot. As an important branch of affective computing, emotion recognition has developed rapidly in recent years and has broad prospects.
The main approaches to emotion recognition research include speech-based, image-based, and multimodal-fusion-based methods. Because the emotional information carried by a single voice or image modality is incomplete, it cannot fully meet expectations on its own. Multimodal fusion integrates the information from each modality so that the modalities complement one another and achieve better recognition performance.
Multimodal strategies are necessary in sentiment analysis tasks. First, it is often difficult to judge emotional state accurately from text or speech alone. An extreme example is irony: ironic utterances often combine neutral or positive textual content with an audio delivery that does not match it, producing an overall negative emotional expression. A single modality cannot fundamentally resolve such cases. Second, single-modal models are easily degraded by noise; for text transcribed by ASR, for example, errors in the upstream ASR often have a large impact on the downstream classification task. A stable, robust model in practical applications therefore requires a multimodal modeling approach, for instance a fusion design like the sketch below.
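As a rough illustration of this point, here is a minimal late-fusion sketch for bimodal (text + audio) emotion classification in PyTorch. The feature dimensions, emotion-class count, and module structure are illustrative assumptions, not a description of any specific Nexdata or production pipeline.

```python
# Minimal late-fusion sketch: assumes pre-extracted text and audio
# feature vectors; all dimensions and the class set are hypothetical.
import torch
import torch.nn as nn

class LateFusionEmotionClassifier(nn.Module):
    def __init__(self, text_dim=768, audio_dim=128, hidden_dim=256, num_emotions=7):
        super().__init__()
        # Per-modality encoders project features into a shared space.
        self.text_proj = nn.Sequential(nn.Linear(text_dim, hidden_dim), nn.ReLU())
        self.audio_proj = nn.Sequential(nn.Linear(audio_dim, hidden_dim), nn.ReLU())
        # The classifier sees the concatenation of both modalities, so cues
        # missing from one modality (e.g. ironic prosody) can still reach it
        # through the other.
        self.classifier = nn.Linear(hidden_dim * 2, num_emotions)

    def forward(self, text_feat, audio_feat):
        fused = torch.cat([self.text_proj(text_feat), self.audio_proj(audio_feat)], dim=-1)
        return self.classifier(fused)

# Usage with random stand-in features for a batch of 4 utterances.
model = LateFusionEmotionClassifier()
logits = model(torch.randn(4, 768), torch.randn(4, 128))
print(logits.shape)  # torch.Size([4, 7])
```

Because both projections feed the same classifier, prosodic cues that contradict the words (as in irony) can still influence the prediction; the same idea extends naturally to adding an image branch.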
When applied to practical problems, existing multimodal sentiment analysis methods face the following issues:
1. Model Cross-domain Problem
Models trained in one scenario (such as insurance customer service) often cannot be used directly in another (such as telecom-operator customer service). Emotional expression depends on the scene: the same expression can convey different emotional attitudes in different scenes, and the interpretation of an emotional attitude is also defined differently from scene to scene.
2. High Cost of Data Labeling
Multimodal models require data from different modalities to be provided in aligned pairs, and annotators must jointly consider the information expressed by each modality during labeling. The difficulty of obtaining such data and the cost of labeling it both limit the practical application of multimodal emotion models; a hypothetical record of this kind is sketched below.
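To make the pairing requirement concrete, the following is a hypothetical sentence-level annotation record in plain Python, loosely modeled on the fields mentioned for the video data later in this post (facial emotion, inner emotion, start and end time, text transcription). All field names, paths, and label values are invented for illustration.

```python
# Hypothetical paired, sentence-level annotation record. Every modality
# must be present and time-aligned for the same sentence, which is what
# makes multimodal labeling expensive.
sample = {
    "clip_id": "clip_00042",           # hypothetical identifier
    "start_time_s": 12.40,             # sentence start within the video
    "end_time_s": 15.85,               # sentence end within the video
    "transcript": "Sure, that's just great.",
    "audio_path": "clips/clip_00042.wav",
    "frames_path": "clips/clip_00042/frames/",
    "labels": {
        "facial_emotion": "neutral",   # what the face shows
        "inner_emotion": "sarcastic",  # what the speaker actually feels
    },
}

# A missing or misaligned modality makes the whole record unusable.
assert sample["end_time_s"] > sample["start_time_s"]
```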
To power the R&D of multimodal emotion recognition technology, Nexdata has developed speech and image emotion recognition datasets for multiple application scenarios. Nexdata strictly abides by relevant regulations, and all data is collected under proper authorization agreements.
American English Colloquial Video Speech Data
The dataset is collected from real websites and covers multiple domains. Attributes such as text content and speaker identity are annotated. The dataset can be used for voiceprint recognition model training, corpus construction for machine translation, and algorithm research.
English Emotional Speech Data by Microphone
English emotional audio data captured by microphone; American native speakers participated in the recording, with 2,100 sentences per person. The recorded script covers 10 emotions such as anger, happiness, and sadness, and the voice is recorded with a high-fidelity microphone, so the audio quality is high.
The data diversity includes multiple races, multiple indoor scenes, multiple age groups, multiple languages, and multiple emotions (11 types of facial emotions, 15 types of inner emotions). For each sentence in each video, the emotion types (both facial and inner emotions), start and end time, and text transcription are annotated.
Chinese Mandarin Synthesis Corpus-Female, Emotional
It is recorded by a Chinese native speaker reading emotional text, with balanced coverage of syllables, phonemes, and tones. Professional phoneticians participate in the annotation.
Multi-pose and Multi-expression Face Data
The data covers more than 1,500 Chinese subjects. For each subject, 62 multi-pose face images and 6 multi-expression face images were collected. The data diversity includes multiple angles, multiple poses, and multiple lighting conditions, with subjects of all ages.
Besides the off-the-shelf datasets, Nexdata also supports on-demand data collection and annotation services for multimodal emotion recognition, covering various modality combinations such as image+text, speech+text, and speech+image+text.
End
If you need data services, please feel free to contact us: info@nexdata.ai.
Facing the growing demand for data, companies and researchers need to keep exploring new data collection and annotation methods. Only by continuously improving data quality can AI technology keep pace with fast-changing market demands. With the accelerating development of data-driven intelligence, we have reason to look forward to a more efficient, intelligent, and secure future.