Tamil Speech Dataset – 500 Hours Monologue Audio Corpus
Tamil speech dataset
Tamil audio dataset
Tamil language dataset
Tamil monologue dataset
Tamil voice corpus
Tamil ASR data
scripted speech in Tamil
smartphone Tamil dataset
speech recognition Tamil dataset
multilingual speech data
This dataset includes 500 hours of scripted Tamil monologue speech collected using smartphones. Each sample is accompanied by a text transcription and metadata such as speaker ID, gender, and age. The dataset features diverse speakers from various regions, making it representative of real-world Tamil language use and suitable for automatic speech recognition (ASR), text-to-speech (TTS), voice activity detection (VAD), and natural language processing (NLP) tasks. Validated by leading AI companies, the dataset is designed to enhance model robustness in multilingual environments and for low-resource languages. All data was collected in full compliance with global privacy regulations including GDPR, CCPA, and PIPL, ensuring ethical sourcing and responsible AI development.
This is a paid dataset for commercial use, research purposes, and more. Licensed, ready-made datasets help jump-start AI projects.
Specifications
Format: 16 kHz, 16-bit, uncompressed WAV, mono channel
Recording condition: quiet indoor environment, low background noise, no echo
Recording device: Android smartphone, iPhone
Speaker: about 500 people
Language: Tamil
Features of annotation: transcription text
Accuracy rate: Word Accuracy Rate (WAR) 95%
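The Format specification above can be checked on a delivered file before it enters a training pipeline. The following is a minimal sketch using Python's standard `wave` module. The function name `check_wav` and the constant names are illustrative assumptions, not part of the dataset's tooling.

```python
import wave

# Expected values taken from the Format line of the specifications.
EXPECTED_RATE_HZ = 16000          # 16 kHz
EXPECTED_SAMPLE_WIDTH_BYTES = 2   # 16 bit
EXPECTED_CHANNELS = 1             # mono


def check_wav(path):
    """Return (matches_spec, duration_seconds) for a PCM WAV file.

    Hypothetical helper: it only compares the WAV header against the
    published format and does not inspect the audio content itself.
    """
    with wave.open(path, "rb") as wav:
        matches = (
            wav.getframerate() == EXPECTED_RATE_HZ
            and wav.getsampwidth() == EXPECTED_SAMPLE_WIDTH_BYTES
            and wav.getnchannels() == EXPECTED_CHANNELS
            and wav.getcomptype() == "NONE"  # "NONE" means uncompressed PCM
        )
        # Duration follows from the frame count and the sampling rate.
        duration = wav.getnframes() / wav.getframerate()
    return matches, duration
```

Calling `check_wav("G00001S0001.wav")` on a locally downloaded sample would return whether the header matches the specification, together with the clip length in seconds. Summing the durations over a delivery is one way to compare it against the advertised 500 hours.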
Sample
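G00001S0001.wav: ஒவ்வொரு மாணவர்களின் வளர்ச்சிக்கும் பள்ளிக்கூடம் மிகவும் அவசியமானது. (School is very important for every student's development.)
G00001S0002.wav: எனது தமிழ் பாடப்புத்தகத்தில் சரியா அல்லது தவறா கேள்விகள் கேட்கப்பட்டுள்ளது. (My Tamil textbook contains true-or-false questions.)
G00001S0003.wav: சீன வாய்மொழி கற்றுக்கொள்ள ஆசை. (I would like to learn spoken Chinese.)
G00001S0004.wav: பாடத்திட்டத்தில் கணிதம் எனக்கு மிகவும் பிடிக்கும். (In the curriculum, I like mathematics very much.)
G00001S0005.wav: பாடத்திட்டத்தில் அந்நிய மொழிகளை தவிர்க்க வேண்டும். (Foreign languages should be avoided in the curriculum.)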
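The specifications quote a Word Accuracy Rate (WAR) of 95% but do not define the metric. The sketch below assumes the common definition, WAR = 1 - WER, where WER is the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name and the whitespace tokenization are illustrative assumptions, and the vendor's exact scoring procedure may differ.

```python
def word_accuracy_rate(reference, hypothesis):
    """Return 1 - WER for two whitespace-tokenized transcripts.

    Assumed definition: WER = word-level Levenshtein distance divided
    by the number of reference words. The result can be negative when
    the hypothesis contains many inserted words.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return 1.0 - prev[-1] / len(ref)


if __name__ == "__main__":
    # Reference: the fourth sample transcript (five words).
    reference = "பாடத்திட்டத்தில் கணிதம் எனக்கு மிகவும் பிடிக்கும்."
    # Hypothetical transcript with one word dropped.
    hypothesis = "பாடத்திட்டத்தில் கணிதம் எனக்கு பிடிக்கும்."
    print(round(word_accuracy_rate(reference, hypothesis), 2))
```

In this example one of the five reference words is missing, so the word accuracy for that single utterance is 4/5. A corpus-level figure such as the quoted 95% would normally be computed over the total errors and total reference words across many utterances, not by averaging per-utterance scores.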