10 Open Source Datasets for Machine Learning

From: Nexdata Date: 2024-04-07

Machine learning research and implementation both depend on large amounts of data. Open source datasets let you train your own models and, by comparing your results with other algorithms on the same data, find the weaknesses in your approach. This article shares 10 open source datasets for computer vision, speech, and NLP to help your AI research.

Computer Vision

● Real-World Masked Face Dataset

The Real-World Masked Face Dataset (RMFD) is a face recognition dataset released in early March 2020 by the National Multimedia Software Technology Research Center of Wuhan University. It includes nearly 100,000 real masked and unmasked facial images, plus 500,000 simulated masked faces.

Link: https://github.com/X-zhangyang/Real-World-Masked-Face-Dataset

● Hypersim

For many fundamental scene understanding tasks, it is difficult or impossible to obtain a ground truth label for each pixel of a real image. Apple addresses this problem with Hypersim, a photorealistic synthetic dataset of indoor scenes. To create it, Apple used a large repository of synthetic scenes built by professional artists and generated 77,400 images of 461 indoor scenes, with detailed per-pixel labels and corresponding ground truth geometry.

Link: https://github.com/apple/ml-hypersim

● OASIS

This dataset covers 140,000 Internet images that were manually annotated with pixel-level 3D surface information. It can be used for depth estimation, 3D surface reconstruction, edge detection, instance segmentation, and related tasks.

Link: https://oasis.cs.princeton.edu/

● Visual Genome

Visual Genome is a very detailed computer vision database with captions for about 100,000 images. Compared with ImageNet, each image carries richer information, and the relationships between objects and their attributes are annotated.

● Audi Autonomous Driving Dataset

The dataset was released in 2020. Its annotation types include 3D bounding boxes, semantic segmentation, instance segmentation, and data extracted from the car. A total of 41,277 non-sequential frames carry semantic segmentation annotations and point cloud labels, and 12,497 of these frames also have 3D bounding boxes for objects in the field of view of the front-facing camera. In addition, the dataset includes 392,556 consecutive frames of unlabeled sensor data. All license plates and faces in the images are blurred.

Link: https://www.a2d2.audi/a2d2/en.html

Speech

● Common Voice

The Common Voice dataset covers 18 languages and has accumulated nearly 1,400 hours of voice data from more than 42,000 contributors.

● ainexdata_1505zh

The ainexdata_1505zh dataset contains 1,505 hours of audio and is part of Nexdata's Mandarin Chinese speech database. The recordings cover 34 provincial administrative regions across China and come from 6,408 participants, who recorded more than 300,000 colloquial sentences. Sentence annotation accuracy exceeds 98%.

Link: https://www.nexdata.ai/opensource

● CN-Celeb

The dataset contains 130,000 speech segments collected from 1,000 Chinese celebrities, totaling 274 hours.

Link: http://www.openslr.org/82/
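To compare the three speech datasets on a common footing, it can help to turn the headline figures into a per-speaker average. The sketch below uses only the numbers quoted in this article. Several of them are approximate ("nearly", "more than"), so the results are rough estimates, and the dictionary layout and function name are illustrative assumptions rather than part of any dataset's tooling.

```python
# Headline figures as quoted in this article (some are approximate).
SPEECH_DATASETS = {
    "Common Voice": {"hours": 1400, "speakers": 42000},
    "ainexdata_1505zh": {"hours": 1505, "speakers": 6408},
    "CN-Celeb": {"hours": 274, "speakers": 1000},
}


def minutes_per_speaker(hours, speakers):
    """Return the average recorded minutes per speaker."""
    return hours * 60 / speakers


for name, stats in SPEECH_DATASETS.items():
    average = minutes_per_speaker(stats["hours"], stats["speakers"])
    print(f"{name}: about {average:.1f} minutes per speaker")
```

A low average, as with Common Voice, suggests many speakers with little audio each, which tends to suit speaker-independent recognition. A higher average gives more material per voice.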

NLP

● WikiText

The WikiText Long Term Dependency Language Modeling Dataset is an English corpus of about 100 million words extracted from Wikipedia's high-quality articles. It comes in two versions, WikiText-2 and WikiText-103. WikiText-103 contains about 110 times as many words as the Penn Treebank (PTB).

● SQuAD

SQuAD is a reading comprehension dataset released by Stanford University. All of its articles are drawn from Wikipedia, and it is dozens of times larger than comparable earlier datasets. It contains a total of 107,785 questions on 536 supporting articles.

Link: https://rajpurkar.github.io/SQuAD-explorer/
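The question and article counts above can be checked against a downloaded file. The following is a minimal sketch, assuming the nested JSON layout used by the SQuAD v1.1 release (data, then paragraphs, then qas). The sample text is hand-written for illustration and is not taken from the dataset, and the function names are hypothetical.

```python
import json

# Tiny hand-written sample in the SQuAD v1.1 layout (not real dataset content).
SAMPLE = """
{
  "version": "1.1",
  "data": [
    {
      "title": "Example_Article",
      "paragraphs": [
        {
          "context": "Wuhan University is located in Wuhan.",
          "qas": [
            {
              "id": "q1",
              "question": "Where is Wuhan University located?",
              "answers": [{"text": "Wuhan", "answer_start": 31}]
            }
          ]
        }
      ]
    }
  ]
}
"""


def count_articles_and_questions(squad):
    """Return (number of articles, number of questions)."""
    articles = len(squad["data"])
    questions = sum(
        len(paragraph["qas"])
        for article in squad["data"]
        for paragraph in article["paragraphs"]
    )
    return articles, questions


def answer_spans_are_consistent(squad):
    """Check that every answer text appears in its context at answer_start."""
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                for answer in qa["answers"]:
                    start = answer["answer_start"]
                    end = start + len(answer["text"])
                    if context[start:end] != answer["text"]:
                        return False
    return True


squad = json.loads(SAMPLE)
print(count_articles_and_questions(squad))
print(answer_spans_are_consistent(squad))
```

To run the same checks on the real data, replace `json.loads(SAMPLE)` with `json.load` on a file downloaded from the link above.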

Besides the ten datasets above, Nexdata has offered its Open Source Research Datasets to universities and academic institutions around the world since 2020 to support artificial intelligence research. By filling in the relevant application materials, institutions can obtain AI datasets worth about US$100,000 for free.

● Multi-language OCR Data

The dataset covers conference-scene PPT slides in French, Korean, Japanese, Spanish, German, Italian, Portuguese, and Russian, as well as natural-scene posters, road signs, packaging instructions, menus, and similar material in Chinese and English. Natural scenes are labelled with row-level rectangular boxes, PPT scenes are labelled with quadrangular boxes, and the text content is transcribed.
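Because the two scene types use different box shapes, downstream code often normalizes them to one representation. The sketch below is an illustration only: it stores every region as four corner points, so a rectangle becomes a special case of a quadrilateral. The class, field, and function names are hypothetical and do not describe Nexdata's actual annotation format.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]


@dataclass
class TextRegion:
    """One transcribed text region; corners run clockwise from top-left."""
    corners: List[Point]
    text: str


def from_rectangle(x, y, width, height, text):
    """Convert an axis-aligned rectangle into a four-corner region."""
    corners = [(x, y), (x + width, y), (x + width, y + height), (x, y + height)]
    return TextRegion(corners=corners, text=text)


def from_quadrilateral(corners, text):
    """Wrap an arbitrary four-point box, as used for slide text."""
    if len(corners) != 4:
        raise ValueError("a quadrilateral needs exactly four corners")
    return TextRegion(corners=list(corners), text=text)


def bounding_rectangle(region):
    """Return the enclosing axis-aligned box as (x, y, width, height)."""
    xs = [point[0] for point in region.corners]
    ys = [point[1] for point in region.corners]
    return (min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys))


menu_line = from_rectangle(10, 20, 100, 30, "MENU")
slide_title = from_quadrilateral([(0, 0), (50, 5), (48, 25), (-2, 20)], "Agenda")
print(bounding_rectangle(menu_line))
print(bounding_rectangle(slide_title))
```

Keeping four corners for every region means a model that expects quadrilaterals can consume both scene types, while `bounding_rectangle` recovers a plain box when a detector only supports rectangles.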

● Multi-race Face Recognition Datasets

The data covers Asian, Caucasian, Indian, and Black subjects, with a 1:1 ratio of men to women. It was collected in indoor and outdoor scenes using mobile phones and cameras.

● Mandarin Chinese Conversational Speech Data by Mobile Phone

The data was recorded by 440 participants speaking naturally in casual conversation, with a balanced gender ratio. Recording took place in a relatively quiet indoor environment where ambient noise does not exceed 50 dB. The text, speaker, and start and end times of valid sentences are annotated, and sentence accuracy exceeds 97%.

End

If you want to know more about the datasets or how to acquire them, please feel free to contact us: info@nexdata.ai
