200,475 Sentences - Chinese Text Normalization Dataset for TTS & NLP

Chinese text normalization dataset

Mandarin TTS corpus

Text normalization for speech synthesis

Symbol-to-character annotation dataset

Mandarin text preprocessing data

This dataset comprises 200,475 Mandarin Chinese sentences annotated for text normalization, transforming special symbols and Arabic numerals into Chinese characters. It is ideal for training and evaluating Text-to-Speech (TTS) systems and Natural Language Processing (NLP) models.

This is a paid dataset licensed for commercial use, research, and other purposes. Licensed, ready-made datasets help jump-start AI projects.

Specifications

Data content: 200,475 sentences of text transcribed in Chinese characters
Data scale: 200,475 original texts with 457,832 annotations
Content source: Sentences extracted from various types of news, articles, novels, etc.
Language: Chinese
Annotation: Special symbols and Arabic numerals in the sentences are annotated as Chinese characters
Applications: TTS, text normalization
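
To illustrate the kind of transformation the annotation describes, here is a minimal rule-based sketch in Python that rewrites Arabic numerals and the percent sign as Chinese characters. It is not the dataset's annotation scheme or format, which the listing does not show. The function names (`read_section`, `read_cardinal`, `normalize`) and the three rules (percentages, four-digit years, plain cardinals) are assumptions chosen for illustration. Real text normalization is context-dependent, which is why annotated data of this kind is used for training and evaluation.

```python
import re

DIGITS = "零一二三四五六七八九"
UNITS = ["", "十", "百", "千"]


def read_section(n: int) -> str:
    """Read 0 < n < 10000 as a cardinal, e.g. 305 -> 三百零五."""
    out = []
    pending_zero = False
    for pos in range(3, -1, -1):
        d = (n // 10 ** pos) % 10
        if d == 0:
            # An interior zero is only spoken once, and only between digits.
            pending_zero = bool(out)
        else:
            if pending_zero:
                out.append("零")
                pending_zero = False
            out.append(DIGITS[d] + UNITS[pos])
    return "".join(out)


def read_cardinal(n: int) -> str:
    """Read 0 <= n < 100,000,000 as a Chinese cardinal number."""
    if not 0 <= n < 10 ** 8:
        raise ValueError("this sketch only covers 0 to 99,999,999")
    if n == 0:
        return "零"
    high, low = divmod(n, 10000)
    if high == 0:
        text = read_section(low)
    else:
        text = read_section(high) + "万"
        if low:
            if low < 1000:
                text += "零"  # e.g. 10005 -> 一万零五
            text += read_section(low)
    # 10 to 19 (and 十万 etc.) conventionally drop the leading 一.
    return text[1:] if text.startswith("一十") else text


def normalize(text: str) -> str:
    """Replace digit runs (optionally followed by % or 年) with characters."""
    def repl(m: re.Match) -> str:
        num, suffix = m.group(1), m.group(2) or ""
        digit_by_digit = "".join(DIGITS[int(c)] for c in num)
        if len(num) > 8:
            return digit_by_digit + suffix  # too long for read_cardinal
        if suffix == "%":
            return "百分之" + read_cardinal(int(num))
        if suffix == "年" and len(num) == 4:
            return digit_by_digit + "年"  # years are read digit by digit
        return read_cardinal(int(num)) + suffix

    return re.sub(r"([0-9]+)(%|年)?", repl, text)


print(normalize("共有305人"))   # 共有三百零五人
print(normalize("增长了25%"))   # 增长了百分之二十五
print(normalize("2023年"))      # 二零二三年
```

Hand-written rules like these break down quickly (phone numbers, dates, decimals, ranges, and units all read differently), so a sketch of this sort is better suited as a baseline to compare against annotated sentences than as a production front end.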

Recommended Dataset

319,977 Sentences - Mandarin Polyphone Dataset for Pinyin Disambiguation

This dataset contains 319,977 Mandarin Chinese sentences designed for polyphone disambiguation. It covers 603 common Mandarin pinyin pronunciations. The number of sentences per pronunciation differs according to the number of phrases for a single character. It is suited to Natural Language Processing (NLP) tasks, Text-to-Speech (TTS) systems, and linguistic research.

Mandarin polyphone corpus; Pinyin disambiguation dataset; Chinese polyphone dataset; Polyphonic character corpus; Pinyin pronunciation dataset

200,955 Sentences - Mandarin Prosodic Dataset for TTS Prosody Prediction

This dataset provides annotations at 4 prosodic hierarchy levels for 200,000 carefully selected Chinese texts, covering both news and colloquial language. Sentence lengths are appropriate and sentence patterns are diverse. It can be used as training data for TTS front-end prosody prediction.

Mandarin prosodic corpus; TTS prosody training data; Front-end prosody prediction corpus; Mandarin speech synthesis data; Prosodic hierarchy annotation; Chinese TTS front-end dataset; Sentence-level prosody corpus; Mandarin intonation dataset
