m.nexdata.datatang.com

633 Hours - Japanese Speech Dataset (Mobile Phone Recordings)

Japanese speech dataset
Japanese spontaneous dialogue dataset
Japanese ASR training data
Japanese audio dataset

This dataset contains 633 hours of spontaneous Japanese dialogue recorded on mobile phones, with each conversation based on a given topic. Recordings are transcribed and annotated with text content, timestamps, speaker ID, gender, and other attributes. The data was collected from a large, geographically diverse pool of 1,066 native speakers, which helps models perform better on real-world, complex tasks such as Automatic Speech Recognition (ASR), Text-to-Speech (TTS), and NLP research. Quality has been tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, safeguarding user privacy and legal rights throughout data collection, storage, and use; all of our datasets comply with GDPR, CCPA, and PIPL.

Format
16 kHz, 16-bit, uncompressed WAV, mono channel;
Recording Environment
quiet indoor environment, without echo;
Recording Content
dozens of topics are specified, and the speakers hold dialogues on those topics while being recorded;
Speaker
1,066 Japanese speakers, 46% male and 54% female;
Annotation
transcription text, speaker identification, and gender;
Recording device
Android smartphone, iPhone
Country
Japan (JPN)
Language (Region) Code
ja-JP
Language
Japanese
Accuracy rate
Sentence accuracy rate (SAR) of 95%
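
As a minimal sketch, a delivered audio file could be checked against the format listed above (16 kHz, 16-bit, uncompressed, mono WAV) with Python's standard `wave` module. The function name `check_wav_format` and the example file path are hypothetical and are not part of the dataset's documentation; the expected values are simply those stated under Format.

```python
import wave

# Expected values, taken from the Format entry above.
EXPECTED_RATE_HZ = 16000
EXPECTED_SAMPLE_WIDTH_BYTES = 2  # 16-bit samples
EXPECTED_CHANNELS = 1            # mono


def check_wav_format(path):
    """Return a list of mismatches between a WAV file and the listed format.

    An empty list means the file header matches the stated specification.
    (Hypothetical helper for illustration only.)
    """
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != EXPECTED_RATE_HZ:
            problems.append(f"sample rate is {wav.getframerate()} Hz")
        if wav.getsampwidth() != EXPECTED_SAMPLE_WIDTH_BYTES:
            problems.append(f"sample width is {wav.getsampwidth() * 8} bit")
        if wav.getnchannels() != EXPECTED_CHANNELS:
            problems.append(f"channel count is {wav.getnchannels()}")
        # The wave module reports "NONE" for uncompressed PCM data.
        if wav.getcomptype() != "NONE":
            problems.append(f"compression type is {wav.getcomptype()}")
    return problems
```

Calling `check_wav_format("sample.wav")` on a file (the path here is a placeholder) returns an empty list when the header matches, which makes it easy to scan a whole delivery for files that would need resampling or channel conversion before training.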
  • Audio

    え、じゃさあ、今までで、いっちばん、楽しかった旅行って、何、どこ

  • Audio

    うん、こ- ここでやったこれが一番楽しかった、忘れられなかった

  • Audio

    うん、一番自分がこう、いい印象に残っとるというか、あ、ここは行ってよかったなあと思ったり、後は

  • Audio

    ニュージーランド

  • Audio

    一番楽しかった?楽しい?楽しいってどういう、どういう意味?もう引っ括めて、一番楽しかった?
