Nexdata Enhances Personalized Speech Synthesis for AI Conversations

From: Nexdata    Date: 2023-11-09

Text-to-Speech (TTS) technology has matured to the point where machines can communicate fluently with people by voice. It is widely used in voice assistants, intelligent customer service, and smart homes. One of the most exciting features in the latest ChatGPT update is voice conversation: users can choose from several synthesized voices and talk with the chatbot in real time, much like a phone call, and receive instant responses.


As this natural, intelligent form of human-machine interaction becomes part of everyday life, people increasingly expect emotional expressiveness and personalization from the machines they talk to. To support AI voice interaction in the era of large models, Nexdata has upgraded its personalized voice synthesis AI data services, helping clients improve voice authenticity and emotional expression in applications such as virtual assistants, voice readings, short videos, and intelligent customer service.


I. Upgrade in Multimodal Data Collection Capability


Multimodal voice synthesis adds a visual modality, obtained through facial capture, to the traditional audio modality. Drawing on years of experience in audio and visual data collection and annotation, together with an enhanced high-quality synthesis system, Nexdata has created a new dataset that fuses voice and visual data.


The dataset is collected from multiple participants and recorded synchronously on several devices, with pulse signals used to align the recordings precisely enough to meet high accuracy requirements. The participants convey rich emotions, so their facial expressions are more expressive. In addition, because the recordings reproduce ordinary natural dialogue, the synthesized voice sounds more natural and realistic.
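The article does not say how the pulse-based alignment is implemented. As a minimal sketch of the general idea, assuming every device records the same sync pulse and all tracks share one sample rate, the Python below locates the pulse onset in two tracks and trims both to a common starting point. The function names, the threshold value, and the sample data are hypothetical and for illustration only.

```python
from typing import List, Sequence, Tuple


def find_pulse_onset(samples: Sequence[float], threshold: float = 0.5) -> int:
    """Return the index of the first sample whose magnitude reaches the threshold."""
    for index, value in enumerate(samples):
        if abs(value) >= threshold:
            return index
    raise ValueError("no sync pulse found above threshold")


def align_to_pulse(
    track_a: Sequence[float], track_b: Sequence[float], threshold: float = 0.5
) -> Tuple[List[float], List[float], int]:
    """Trim both tracks so that their sync pulses start at index 0.

    Returns the two trimmed tracks (cut to equal length) and the offset,
    in samples, of track_b's pulse relative to track_a's pulse.
    """
    onset_a = find_pulse_onset(track_a, threshold)
    onset_b = find_pulse_onset(track_b, threshold)
    trimmed_a = list(track_a[onset_a:])
    trimmed_b = list(track_b[onset_b:])
    length = min(len(trimmed_a), len(trimmed_b))
    return trimmed_a[:length], trimmed_b[:length], onset_b - onset_a


# Hypothetical data: device B started recording two samples before device A,
# so the shared pulse appears two samples later in B's track.
device_a = [0.0, 0.0, 1.0, 0.2, 0.3, 0.1]
device_b = [0.0, 0.0, 0.0, 0.0, 1.0, 0.2, 0.3]
aligned_a, aligned_b, offset = align_to_pulse(device_a, device_b)
```

Aligning audio with video would additionally require converting sample offsets to frame times, which this sketch leaves out.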


II. Resource Reservoir Advantage


With years of experience in TTS AI annotation services, Nexdata has built up a large pool of professional actors and models. These professionals excel at script delivery and have strong vocal and facial expression skills, which results in higher data quality.


Professional Collection Equipment


Nexdata has introduced professional condenser microphones that support multi-channel, synchronous multimodal AI data collection at different distances and spatial positions. Collection covers a variety of scenarios and ages and dozens of shooting angles, ensuring excellent diversity.


Beyond departing from traditional TTS data production processes, Nexdata keeps pace with changing market demand. This helps clients comprehensively upgrade synthesis results and adapt their models to more personalized and expressive scenarios, achieving higher synthesis efficiency and a better listening experience.


III. Upgrade in Multi-Person Average Model Library


In addition to single-person voice library data, Nexdata has added a multi-person average model library. It extends voice coverage to more voice types and a higher degree of personalization, supporting clients across a range of tasks in voice synthesis training.


IV. Upgrade in Music Data Collection Annotation Capability


In traditional music data annotation formats, musical information is annotated in musical notation, which captures information at various levels of music theory. Language-related information must be annotated separately in text grids.


Nexdata's TTS processing capability has been comprehensively upgraded. We support unifying musical information and language information in a single format, extracting key information such as pitch and legato into text grids for unified annotation. This streamlines the process and greatly improves efficiency.
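Nexdata's actual annotation format is not described here. As a rough illustration of what carrying language and musical information in one time-aligned record could look like, the Python sketch below defines an interval that holds a lyric syllable together with its pitch and legato flag. The `Interval` fields, the use of MIDI note numbers, and the tab-separated output are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Interval:
    """One time-aligned segment, in the spirit of a text grid interval."""
    start: float                       # seconds
    end: float                         # seconds
    text: str                          # lyric syllable or phoneme (language information)
    midi_pitch: Optional[int] = None   # musical information; None marks a rest
    legato: bool = False               # True if the note is slurred into the next one


def to_rows(intervals: List[Interval]) -> List[str]:
    """Serialize intervals as tab-separated rows so both kinds of information share one format."""
    rows = []
    for item in intervals:
        if item.end <= item.start:
            raise ValueError(f"interval {item.text!r} has non-positive duration")
        pitch = "rest" if item.midi_pitch is None else str(item.midi_pitch)
        rows.append(
            f"{item.start:.3f}\t{item.end:.3f}\t{item.text}\t{pitch}\t{int(item.legato)}"
        )
    return rows


# Hypothetical two-note phrase: "la" slurred into "li".
phrase = [
    Interval(0.0, 0.5, "la", midi_pitch=60, legato=True),
    Interval(0.5, 1.0, "li", midi_pitch=62),
]
rows = to_rows(phrase)
```

Keeping both kinds of information on the same interval means an annotator edits one record per syllable instead of reconciling a score and a separate text grid.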


Nexdata has also added annotation dimensions such as singing style, allowing vocal data to be processed at a finer level of detail.


V. Upgrade in Personalized Collection Capability


To meet the growing demand for voice synthesis across many fields, Nexdata operates its own professional TTS recording studio and has built up mature collection capabilities and a vast library of ready-made data. The personalized voice library covers a wide range of tones, roles, and languages, such as an authoritative CEO voice, a boy-next-door voice, and a cool older-sister voice.


VI. Upgrade in High-Fidelity Scene Reproduction Collection Capability


Nexdata holds an extensive reserve of dialogue-based TTS data recorded by professional customer service staff and journalists. In Nexdata's own professional recording studio, which meets the NR15 acoustic standard, interview and customer service scenarios are re-enacted as they occur in real life, faithfully reproducing how people in each role actually work. Nexdata considers this the most natural dialogue collection method currently available.


VII. Specially Appointed Professional Listening Directors


Nexdata assigns professional listening personnel to every TTS project. They oversee recording quality throughout the process, ensuring that the delivered audio meets voice clarity requirements and that data quality control remains professional and rigorous.




In an era of rapidly developing large models, TTS technology is enabling a natural, realistic, and smooth user experience. Nexdata has a comprehensive system for managing the quality and security of TTS data. With professional equipment and environments, abundant voice samples, and years of accumulated TTS project experience, Nexdata can meet a wide range of demands for vocal image creation.