31 Million Southeast Asian News Text Dataset – 4 Languages for AI

Southeast Asian dataset
multilingual news dataset
Indonesian text dataset
Malay news corpus
Thai language dataset
Vietnamese text dataset
AI training data Southeast Asia
multilingual text corpus
large-scale news dataset

The 31 Million Southeast Asian Language News Text Dataset contains news articles in four languages: Indonesian, Malay, Thai, and Vietnamese. It holds more than 31 million records in JSONL format, with each record stored on its own line so that files can be read and processed efficiently. The sources are broad and span a wide range of news topics, reflecting social developments, cultural trends, and economic conditions across Southeast Asia. The dataset supports multilingual AI training, cross-lingual model development, and cross-lingual research, and can help models acquire regional cultural knowledge for industry applications in Southeast Asia.
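Because each record sits on its own line, a file in this format can be streamed one record at a time rather than loaded whole. The following is a minimal sketch, not the vendor's tooling: the helper name `iter_records` and the two sample records are hypothetical, and the exact key names in the real files are assumed to match the field list published on this page.

```python
import io
import json
from typing import Iterator, TextIO


def iter_records(stream: TextIO) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of a JSONL stream."""
    for line in stream:
        line = line.strip()
        if line:  # tolerate blank lines between records
            yield json.loads(line)


# Hypothetical two-record sample; real records will differ.
sample = io.StringIO(
    '{"URL": "https://example.com/a", "title": "Contoh berita", "category": "ekonomi"}\n'
    '{"URL": "https://example.com/b", "title": "Tin mau", "category": "kinh te"}\n'
)

# Only one line is held in memory at a time while iterating.
titles = [record["title"] for record in iter_records(sample)]
```

For a file on disk, the same generator could be fed with `open(path, encoding="utf-8")`, which keeps memory use flat regardless of file size.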

Paid Datasets
This is a paid dataset for commercial use, research, and more. Licensed, ready-made datasets help jump-start AI projects.
Specifications
Languages
Indonesian, Malay, Thai, Vietnamese
Data volume
14,447,771 Indonesian; 1,239,420 Malay; 6,467,564 Thai; 8,942,813 Vietnamese; over 31 million records in total
Fields
URL, title, published_time, article_content, category
Format
JSONL
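The specifications above can be turned into a quick sanity check on delivered data. This sketch assumes the JSON keys are spelled exactly as in the field list (including the uppercase `URL`), which the page does not confirm; `missing_fields` and the sample record are hypothetical. The per-language counts are taken from the data volume line.

```python
# Field names as listed in the specifications (exact key casing is an assumption).
EXPECTED_FIELDS = {"URL", "title", "published_time", "article_content", "category"}

# Per-language record counts from the data volume line.
COUNTS = {
    "Indonesian": 14_447_771,
    "Malay": 1_239_420,
    "Thai": 6_467_564,
    "Vietnamese": 8_942_813,
}


def missing_fields(record: dict) -> set:
    """Return the expected field names that are absent from a record."""
    return EXPECTED_FIELDS - set(record)


# The four counts add up to 31,097,568, consistent with "over 31 million".
total = sum(COUNTS.values())

# Hypothetical incomplete record used only to exercise the check.
incomplete = {"URL": "https://example.com/a", "title": "Contoh berita"}
gaps = missing_fields(incomplete)
```

Running `missing_fields` over a sample of records before a full training run is a cheap way to catch schema drift early.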
Sample
  • 31 Million Southeast Asian News Text Dataset – 4 Languages for AI