Nexdata Launches New Parallel Corpora for Machine Translation

From: Nexdata Date: 2024-04-03

As of the end of January 2021, China had signed 205 cooperation documents on the Belt and Road Initiative with 140 countries and 31 international organizations, involving 12 language families, 28 language groups, and about 132 languages. Language barriers arising from this linguistic diversity are considered one of the main challenges hindering in-depth exchanges between the countries and regions involved in the Belt and Road Initiative.

With the rapid development of artificial intelligence and natural language processing technology in recent years, the gap between machine translation and human translation has been narrowing. Machine translation plays an increasingly important role in politics, diplomacy, and cultural exchanges.

Machine translation, that is, the translation of text from one language into another by computer, has become one of the important ways to overcome language barriers.

Statistical machine translation is the current mainstream machine translation method. It learns translation knowledge from parallel corpus data and can be used to build efficient, high-performance translation systems. Large-scale, high-quality parallel corpus data plays an important role in improving the performance of statistical machine translation systems.
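
To make the idea of acquiring translation knowledge from parallel data more concrete, here is a minimal sketch of IBM Model 1, a classic word-alignment model from the statistical machine translation literature, trained with expectation-maximization on a three-sentence toy corpus. It is an illustration only, not a description of any Nexdata product or pipeline; the function names, the toy German-English sentences, and the omission of the NULL source word are all assumptions made for brevity.

```python
def train_ibm_model1(pairs, iterations=10):
    """Estimate t(target_word | source_word) with EM (IBM Model 1).

    `pairs` is a list of (source_tokens, target_tokens) tuples.
    Simplification (assumption): the NULL source word is omitted.
    """
    tgt_vocab = {w for _, tgt in pairs for w in tgt}

    # Uniform initialisation over every word pair that co-occurs in a sentence pair.
    t = {}
    for src, tgt in pairs:
        for tw in tgt:
            for sw in src:
                t[(tw, sw)] = 1.0 / len(tgt_vocab)

    for _ in range(iterations):
        count = {}  # expected count of (target word, source word) alignments
        total = {}  # expected count of alignments per source word
        # E-step: distribute each target word over the source words in its sentence.
        for src, tgt in pairs:
            for tw in tgt:
                norm = sum(t[(tw, sw)] for sw in src)
                for sw in src:
                    frac = t[(tw, sw)] / norm
                    count[(tw, sw)] = count.get((tw, sw), 0.0) + frac
                    total[sw] = total.get(sw, 0.0) + frac
        # M-step: re-normalise the expected counts into probabilities.
        for (tw, sw), c in count.items():
            t[(tw, sw)] = c / total[sw]
    return t


def best_translation(t, source_word):
    """Return the target word with the highest t(target | source_word)."""
    candidates = {tw: p for (tw, sw), p in t.items() if sw == source_word}
    return max(candidates, key=candidates.get)


if __name__ == "__main__":
    # Hypothetical toy corpus: (German tokens, English tokens).
    toy_pairs = [
        (["das", "haus"], ["the", "house"]),
        (["das", "buch"], ["the", "book"]),
        (["ein", "buch"], ["a", "book"]),
    ]
    table = train_ibm_model1(toy_pairs)
    for word in ["das", "haus", "buch", "ein"]:
        print(word, "->", best_translation(table, word))
```

Even with only three sentence pairs, the co-occurrence statistics are enough for the estimates to separate the words. With millions of pairs, the same principle yields far more reliable estimates, which is why corpus size and quality matter.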

Nexdata has recently released new parallel corpus data covering dozens of languages across written, spoken, and other categories. To date, Nexdata has accumulated more than 2 billion pieces of text data.

Chinese-English Parallel Corpus Data

3,060,000 sets of Chinese-English parallel translation data, stored in txt files. The corpus covers fields such as travel, medicine, daily life, and TV dramas. Data cleaning, desensitization, and quality inspection have been carried out. It can serve as a base corpus for text data analysis and for machine translation.

Japanese-English Parallel Corpus Data

A Japanese-English parallel corpus of 380,000 groups in total. Political, pornographic, personal-information, and other sensitive vocabulary has been excluded. It can serve as a base corpus for text-based data analysis in machine translation and other fields.

English-Korean Parallel Corpus Data

An English-Korean parallel corpus of 1,340,000 groups in total. Political, pornographic, personal-information, and other sensitive vocabulary has been excluded. It can serve as a base corpus for text-based data analysis in machine translation and other fields.

Chinese-Korean Parallel Corpus Data

5,280,000 sets of Chinese-Korean parallel translation data, stored in txt files. The corpus covers many fields, including travel, medicine, daily life, and TV dramas. Data cleaning, desensitization, and quality inspection have been carried out. It can serve as a base corpus for text data analysis and for machine translation.

Chinese-French Parallel Corpus Data

1 million Chinese-French parallel sentence pairs, stored in txt format. The corpus covers multiple fields such as tourism, medical treatment, daily life, and TV dramas. Data desensitization and quality checking have been carried out. It can serve as a base corpus for text data analysis in fields such as machine translation.

Chinese-Japanese Parallel Corpus Data

2 million Chinese-Japanese parallel sentence pairs, stored in txt format. The corpus covers multiple fields such as tourism, medical treatment, daily life, and TV dramas. Data desensitization and quality checking have been carried out. It can serve as a base corpus for text data analysis in fields such as machine translation.

English-Russian Parallel Corpus Data

An English-Russian parallel corpus of 1,080,000 groups in total. Political, pornographic, personal-information, and other sensitive vocabulary has been excluded. It can serve as a base corpus for text-based data analysis in machine translation and other fields.

Chinese-German Parallel Corpus Data

5.14 million Chinese-German parallel sentence pairs, stored in text format. The corpus covers multiple fields such as tourism, medical treatment, daily life, and news. Data desensitization and quality checking have been carried out. It can serve as a base corpus for text data analysis in fields such as machine translation.
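
The corpora above are described as being stored in txt files, but the exact file layout is not specified here. Purely as an illustration, the sketch below assumes a hypothetical layout of one sentence pair per line, with the source and target sentences separated by a tab; the function name, the separator, and the sample sentences are all assumptions and should be adapted to the actual delivery format.

```python
def read_parallel_pairs(lines, sep="\t"):
    """Parse an iterable of text lines into (source, target) sentence pairs.

    Assumed (hypothetical) layout: "<source sentence><TAB><target sentence>".
    Blank lines and lines that do not split into exactly two non-empty
    fields are skipped rather than raising an error.
    """
    pairs = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        parts = line.split(sep)
        if len(parts) != 2:
            continue  # malformed line under the assumed layout
        source, target = parts[0].strip(), parts[1].strip()
        if source and target:
            pairs.append((source, target))
    return pairs


if __name__ == "__main__":
    # Made-up sample lines standing in for the contents of a txt file.
    sample = [
        "你好\tHello\n",
        "\n",
        "a line without a separator\n",
        "谢谢\tThank you\n",
    ]
    for source, target in read_parallel_pairs(sample):
        print(source, "|", target)
```

Because the function accepts any iterable of lines, an open file object can be passed in directly, for example `with open(path, encoding="utf-8") as f: pairs = read_parallel_pairs(f)`, where `path` is a placeholder for a real file.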

End

If you need data services, please feel free to contact us at info@nexdata.ai.
