Case Study: Indonesian Language Data Collection Project

From: Nexdata Date: 2025-10-14


To meet the client's requirement for Indonesian natural conversation data, this project collected 100 hours of data over a one-month period.

Project Overview

· Language: Indonesian

· Volume: 100 hours

· Type: Natural conversation data collection

· Duration: 1 month

Implementation Issues

Data Collection

· Background noise must not exceed 40 dB, and any noise during recording must be avoided. (Standard natural conversation recording allows background noise of up to 50 dB and tolerates sudden noise; online recording struggles to meet the stricter requirement.)

· Financial topics must account for at least one-third of the total content. (Standard natural conversation projects limit a single topic to no more than 30 minutes; meeting the client's financial topic ratio required a dedicated financial topic recording team.)
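The two collection requirements above can be expressed as a quick planning and acceptance check. The following is a minimal sketch, not the project's actual tooling: the `Recording` structure, the function names, and the reading of the 30-minute limit as a cap per financial recording session are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass
from fractions import Fraction

TOTAL_HOURS = 100                  # project volume from the overview
MIN_FINANCE_SHARE = Fraction(1, 3) # client requirement: at least one-third financial
MAX_NOISE_DB = 40.0                # client requirement for background noise
MAX_TOPIC_MINUTES = 30             # standard single-topic cap (assumed per session)


@dataclass
class Recording:
    """Hypothetical record of one recording session."""
    minutes: int
    topic: str
    noise_db: float


def min_finance_minutes(total_hours: int) -> int:
    """Smallest whole number of minutes that satisfies the financial share."""
    return math.ceil(total_hours * 60 * MIN_FINANCE_SHARE)


def min_finance_sessions(total_hours: int) -> int:
    """Fewest sessions needed if each one is capped at MAX_TOPIC_MINUTES."""
    return math.ceil(Fraction(min_finance_minutes(total_hours), MAX_TOPIC_MINUTES))


def finance_share(recordings: list[Recording]) -> Fraction:
    """Share of recorded minutes whose topic is 'finance'."""
    total = sum(r.minutes for r in recordings)
    finance = sum(r.minutes for r in recordings if r.topic == "finance")
    return Fraction(finance, total) if total else Fraction(0)


def meets_requirements(recordings: list[Recording]) -> bool:
    """True when every recording is quiet enough and the finance share is met."""
    quiet = all(r.noise_db <= MAX_NOISE_DB for r in recordings)
    return quiet and finance_share(recordings) >= MIN_FINANCE_SHARE


if __name__ == "__main__":
    print(min_finance_minutes(TOTAL_HOURS), "financial minutes needed")
    print(min_finance_sessions(TOTAL_HOURS), "capped sessions needed")
```

Under these assumptions, 100 hours implies at least 2,000 minutes of financial content, or 67 sessions at the 30-minute cap, which shows why ordinary topic rotation alone could not reach the ratio.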

Annotation

· Text: Indonesian contains a large amount of colloquial vocabulary; some colloquial and formal terms differ in both spelling and meaning, with no fixed rules.

· Labels: The client required multiple types of tags; due to the client's limited Indonesian proficiency, multiple tags (e.g., Arabic, Javanese) were added during the annotation process.

Problem Resolution

Data Collection

· Background noise: Collection shifted from online to offline to control noise; locations near schools were selected to keep collection efficient and prevent cost overruns.

· Collection subjects shifted from the general population to students; accounting and finance students were recruited specifically to record financial topics.

Annotation

· Based on the acceptance report, a coordination meeting was held with the client and third-party acceptance team to reconfirm the transcription rule for colloquial terminology: "transcribe as heard." For terms with minor pronunciation differences, both the colloquial and formal versions will be considered correct (unification will be carried out in post-processing if necessary). The standard for comma/period usage was clarified: both usages are acceptable as long as they do not affect the meaning of the sentence.

· Regarding subjective background noise labeling, quality inspectors attended meetings to understand the client’s judgment criteria and achieved a high degree of subjective consistency with the client’s standards.
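The reconfirmed transcription rule can be approximated in an automated acceptance check that treats colloquial and formal variants, and commas and periods, as equivalent. This is a minimal sketch under stated assumptions: the variant pairs below are illustrative examples only, not the list agreed with the client, and dropping commas and periods entirely is a simplification of "acceptable as long as they do not affect the meaning."

```python
import re

# Illustrative colloquial -> formal pairs (assumption); a real project would
# load the variant list agreed with the client and the acceptance team.
VARIANTS = {
    "udah": "sudah",
    "aja": "saja",
}


def normalize(text: str) -> list[str]:
    """Lowercase, drop commas and periods, and map colloquial forms to formal ones."""
    cleaned = re.sub(r"[.,]", " ", text.lower())
    return [VARIANTS.get(token, token) for token in cleaned.split()]


def transcripts_match(a: str, b: str) -> bool:
    """True when two transcripts differ only in accepted variants or comma/period use."""
    return normalize(a) == normalize(b)


if __name__ == "__main__":
    print(transcripts_match("Saya udah makan.", "saya sudah makan,"))
    print(transcripts_match("Saya udah makan", "Saya belum makan"))
```

A check like this only filters out differences the rule already accepts; any remaining mismatch still needs review by an annotator.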

 

Project Reflection

· Under tight schedules, project execution must still follow the process (a trial run first, with mass production only after approval) rather than simply chasing the deadline.

· During project execution, proactively identify the client's roles and responsibilities; clearly define the acceptance party and coordinate acceptance criteria through direct communication.

· For highly subjective issues, abandon "judgment by experience" and strictly adhere to the client's standards.


The core project experience includes: strictly implementing the pilot process to control quality risks; clarifying acceptance criteria and responsible parties in advance; and adhering to client standards for subjective issues.


If you have similar voice data collection/annotation needs (any language), please feel free to contact [email protected]. With over a decade of experience in professional data services, Nexdata can help you accelerate your AI journey.
