500K Multilingual OCR Dataset – Document & Scene Text
This dataset contains 500,000 high-resolution images with multilingual Optical Character Recognition (OCR) annotations, covering both natural scenes and a variety of document types. It spans 20 languages, including Traditional Chinese, Simplified Chinese, Japanese, Korean, Thai, Vietnamese, Indonesian, Malay, and Polish. The images cover a wide range of real-world content (natural scenes, printed documents, handwritten notes, signs, and posters) captured in multiple countries and environments, with varied backgrounds, lighting conditions, and camera angles. All images are annotated for OCR tasks, making the dataset suitable for training deep learning models for text detection, text recognition, and layout analysis in multilingual scenarios. The dataset complies with global data protection regulations (GDPR, CCPA, PIPL) and has been validated by leading AI enterprises for commercial and research applications.
multilingual OCR dataset, scene text dataset, document OCR images, Chinese OCR, Japanese OCR dataset, Thai text recognition, OCR dataset with annotation, multi-language OCR, document image dataset, OCR training data