en

Please fill in your name

Mobile phone format error

Please enter the telephone

Please enter your company name

Please enter your company email

Please enter the data requirement

Successful submission! Thank you for your support.

Format error, Please fill in again

Confirm

The data requirement cannot be less than 5 words and cannot be pure numbers

m.nexdata.datatang.com

Home > All Category Datasets > LLM Datasets > 31 Million Southeast Asian News Text Dataset – 4 Languages for AI

31 Million Southeast Asian News Text Dataset – 4 Languages for AI

Southeast Asian dataset

multilingual news dataset

Indonesian text dataset

Malay news corpus

Thai language dataset

Vietnamese text dataset

AI training data Southeast Asia

multilingual text corpu

large-scale news dataset

The 31 Million Southeast Asian Language News Text Dataset contains multilingual news articles across Indonesian, Malay, Thai, and Vietnamese. The total amount of data exceeds 31 million, stored in JSONL format, with each record running independently in a row for efficient reading and processing. The data sources are extensive, covering various news topics, and can comprehensively reflect the social dynamics, cultural hotspots, and economic trends in Southeast Asia. This dataset can help multilingual AI training, and cross-linguistic model development, enrich cultural knowledge, optimize performance, expand industry applications in Southeast Asia, and promote cross linguistic research.

This is a paid datasets for commercial use, research purpose and more. Licensed ready made datasets help jump-start AI projects.

Specifications

Specifications

Languages

Indonesian, Malay, Thai, Vietnamese

Data volume

14447771 Indonesian, 1239420 Malay, 6467564 Thai, 8942813 Vietnamese, with a total of over 31 million pieces

Field

URL,title,published_time,article_content,category

Format

JSONL

Sample

Sample

Tell Us Your Special Needs

Current Project Maturity

Early exploration (no concrete specs yet)

Defined goals, need professional guidance

Active development or optimization phase

Data & labeling experts with clear specifications

Full Name *

Contact Phone No.*

Company name *

Company Email *

Data Requirements *

By submitting, I agree to the Privacy Protection

Subscribe to our newsletter

Be the first to receive Nexdata latest product releases, data solutions and enterprise news.

Off-the-Shelf Datasets: All Category Datasets; LLM Datasets; Computer Vision Datasets; Speech Recognition Datasets; Speech Synthesis Datasets; OCR Datasets; Pronunciation Dictionary; NLU Datasets

Data Service: 3D Point Cloud Data; Street View Data; OCR Data; Behavior Recognition Data; Identity Recognition Data; Speech Recognition Data; Speech Synthesis Data; Multimodal Data

Industries: Embodied AI; Generative AI; Autonomous Vehicles; AR/VR; Conversational AI; Smart Home; Retail; Intelligent Healthcare

Company: About Us; News; Partners; Quality & Security; Event
Links: OPENMPD; DataPlus; Datarade

Platform: Platform
Competition: Competition
Resources: Sponsored Datasets

Sharpen Your AI with Better Data

+1(626)594-5598

[email protected]

nexdata_ai facebook

nexdata_ai twitter

nexdata_ai linkedin

nexdata_ai youtube

Copyright © 2023 NEXDATA TECHNOLOGY INC

Sitemap Terms and Conditions

We use cookies to enhance your browsing experience, serve personalized ads or content, and analyze our traffic. By clicking "Accept All", you consent to our use of cookies.

0320bf5d-9618-4298-9b69-65b25e4e9f5e

254c5a98-fe0e-4b5f-8081-db56657d35d4