From: Nexdata Date: 2024-04-03
With the development of deep neural networks, the accuracy of machine translation has improved significantly, but the problem of cross-language communication between people has not been solved. For example, in high-precision tasks such as simultaneous interpretation, machine translation output still needs to be polished, and for the translation of novels, machine translation is not yet comparable to human translation.
The Challenges of Machine Translation
● Selection of Translations
Human language is broad and deep, and polysemy is very common. Take Chinese and English as an example: the Chinese verb 看 ("look") can be translated as look, see, watch, read, and so on. Machine translation therefore cannot be a simple word-for-word conversion. It has to choose the correct translation from context, such as the subject and predicate of the phrase or sentence in which the word appears.
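The idea of choosing a translation from context can be sketched in a few lines of Python. This is a toy illustration only, not how a neural translation system works: the verb 看 is used as an assumed example, and the sense table, cue words, and the function name select_translation are all hypothetical.

```python
# Toy illustration only: the sense table and cue words are hypothetical
# and are not taken from any real translation system.
SENSES = {
    "看": {
        "read": {"书", "报纸"},      # book, newspaper
        "watch": {"电视", "电影"},   # television, film
        "see": {"医生", "朋友"},     # doctor, friend
    }
}
DEFAULT = {"看": "look"}


def select_translation(word, context_words):
    """Pick the English rendering whose cue words overlap the context most."""
    context = set(context_words)
    best = DEFAULT.get(word, word)   # fall back to a default rendering
    best_overlap = 0
    for english, cues in SENSES.get(word, {}).items():
        overlap = len(cues & context)
        if overlap > best_overlap:
            best, best_overlap = english, overlap
    return best


print(select_translation("看", ["我", "看", "书"]))    # read
print(select_translation("看", ["我", "看", "电视"]))  # watch
print(select_translation("看", ["你", "看"]))          # look
```

A real system learns these associations from parallel data rather than from a hand-written table, which is one reason corpus size and coverage matter.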
● Word Order Adjustment
Under different linguistic and cultural backgrounds, people's habits of expression also differ. For example, Chinese speakers often use inverted sentences. A sentence expressed in subject, verb, object order in Chinese usually becomes subject, object, verb in Japanese. The longer the sentence, the more complicated the word order adjustment.
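As a minimal sketch of the reordering described above, the following Python snippet moves a clause from subject-verb-object order to subject-object-verb order. It assumes the clause has already been split into labeled roles, which is itself a hard problem in practice; the function name svo_to_sov and the dictionary layout are hypothetical.

```python
def svo_to_sov(clause):
    """Reorder a role-tagged clause from subject-verb-object to subject-object-verb."""
    target_order = ("subject", "object", "verb")
    # Roles missing from the clause are simply skipped.
    return [clause[role] for role in target_order if role in clause]


clause = {"subject": "I", "verb": "eat", "object": "an apple"}
print(svo_to_sov(clause))  # ['I', 'an apple', 'eat']
```

Long sentences contain nested clauses and modifiers, each of which may need its own reordering, so a single fixed rule like this one does not scale to real text.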
● Scarce Training Corpus
At present, there are more than 5,000 languages in the world, but machine translation, which relies on statistics drawn from large amounts of data, can only handle the most commonly used ones. The training data on the market is concentrated mainly in a few widely used languages, and other languages account for very little of it.
Given the scarcity of corpora for minority languages, professional data providers can help researchers collect language corpora faster. Nexdata has accumulated about 2 billion pieces of natural language processing (NLP) data, covering parallel corpora from more than 30 countries.
Chinese-English Parallel Corpus Data
3,060,000 sets of Chinese-English parallel translation corpus, stored in txt files. It covers fields such as travel, medicine, daily life, and TV plays. Data cleaning, desensitization, and quality inspection have been carried out. It can be used as a basic corpus for text data analysis as well as in machine translation.
Chinese-Korean Parallel Corpus Data
5,280,000 sets of Chinese-Korean parallel translation corpus, stored in txt files. It covers many fields, including travel, medicine, daily life, and TV plays. Data cleaning, desensitization, and quality inspection have been carried out. It can be used as a basic corpus for text data analysis as well as in machine translation.
Japanese-English Parallel Corpus Data
Japanese-English parallel corpus, 380,000 groups in total. Political, pornographic, personal-information, and other sensitive vocabulary has been excluded. It can serve as a base corpus for text-based data analysis and can be used in machine translation and other fields.
English-Korean Parallel Corpus Data
English-Korean parallel corpus, 1,340,000 groups in total. Political, pornographic, personal-information, and other sensitive vocabulary has been excluded. It can serve as a base corpus for text-based data analysis and can be used in machine translation and other fields.
Chinese-French Parallel Corpus Data
1 million sentence pairs of Chinese-French parallel corpus data, stored in txt format. It covers multiple fields such as tourism, medical treatment, daily life, and TV plays. Data desensitization and quality checking have been done. It can be used as a basic corpus for text data analysis in fields such as machine translation.
Chinese-Japanese Parallel Corpus Data
2 million sentence pairs of Chinese-Japanese parallel corpus data, stored in txt format. It covers multiple fields such as tourism, medical treatment, daily life, and TV plays. Data desensitization and quality checking have been done. It can be used as a basic corpus for text data analysis in fields such as machine translation.
English-Russian Parallel Corpus Data
English-Russian parallel corpus, 1,080,000 groups in total. Political, pornographic, personal-information, and other sensitive vocabulary has been excluded. It can serve as a base corpus for text-based data analysis and can be used in machine translation and other fields.
Chinese-German Parallel Corpus Data
5.14 million sentence pairs of Chinese-German parallel corpus data, stored in text format. It covers multiple fields such as tourism, medical treatment, daily life, and news. Data desensitization and quality checking have been done. It can be used as a basic corpus for text data analysis in fields such as machine translation.
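Several of the corpora above are described as stored in txt files. The following is a minimal Python sketch of reading sentence pairs from such a file. The layout is an assumption made for illustration (one pair per line, source and target separated by a tab); the actual layout of the Nexdata files is not stated here, and the function name read_parallel_pairs and the sample sentences are hypothetical.

```python
import io


def read_parallel_pairs(lines, sep="\t"):
    """Yield (source, target) pairs, skipping blank or malformed lines.

    Assumes one pair per line with the two sides separated by `sep`.
    This layout is an assumption, not a documented file format.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue                  # skip blank lines
        parts = line.split(sep)
        if len(parts) != 2:
            continue                  # skip lines without exactly two fields
        source, target = (part.strip() for part in parts)
        if source and target:
            yield source, target


# In-memory sample standing in for an open txt file.
sample = io.StringIO("你好\tHello\n\n谢谢\tThank you\nbroken line\n")
pairs = list(read_parallel_pairs(sample))
print(pairs)  # [('你好', 'Hello'), ('谢谢', 'Thank you')]
```

With a real file, the same function can be given a file object opened with an explicit encoding, for example open(path, encoding="utf-8"), where path is whatever location the corpus was saved to.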
If you need data services, please feel free to contact us: info@nexdata.com.