89K Japanese-Arabic Image-Text Dataset for Multimodal LLM Training

image text dataset

multimodal dataset

vision language dataset

image caption dataset

vlm training data

multimodal llm dataset

The dataset comprises 89,007 samples, each consisting of an image and a JSON document. The JSON document may contain an image description, visual question-answering pairs, OCR results extracted from the image, or visual question-answering pairs based on the OCR results. The dataset covers Arabic and Japanese and spans six major domains: ① Business and Finance, ② Coding and Computer Science, ③ Law, Government, and Politics, ④ Science, Technology, Engineering, and Mathematics (STEM), ⑤ Society, Culture, Humanities, and Religion, ⑥ Sports, Lifestyle, and Leisure. Image domain classification accuracy (per-image) exceeds 95%; image-text matching accuracy is above 95%; OCR recognition accuracy (per-sentence) exceeds 95%. The dataset is suitable for multilingual OCR, multimodal Large Language Model (LLM) training, image captioning, and multilingual Visual Question Answering (VQA) tasks.

This is a paid dataset for commercial use, research, and more. Licensed, ready-made datasets help jump-start AI projects.

Specifications

Data Content

Each data sample consists of one image and one JSON document. The JSON document contains one of the following: OCR text recognition results for the image, a textual description (caption) of the image, visual question answering (VQA) based on the image, or visual question answering based on the OCR recognition results of the image. Each visual question answering annotation includes at least one round of Q&A.

Data Scale

89,007 samples in total, including 42,094 in Arabic and 46,913 in Japanese.

Category Distribution

The dataset includes two languages, Japanese and Arabic, and covers four task categories for each language: Image Captioning, Visual Question Answering, Optical Character Recognition, and OCR-based Visual Question Answering. Each category is further divided into six domains: ① Business and Finance, ② Coding and Computer Science, ③ Law, Government, and Politics, ④ Science, Technology, Engineering, and Mathematics, ⑤ Society, Culture, Humanities, and Religion, ⑥ Sports, Lifestyle, and Leisure.
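The category structure described above forms a grid of language, task, and domain. The following is a minimal Python sketch that enumerates that grid, for example to plan stratified sampling. The constant and function names are illustrative only and are not identifiers taken from the dataset.

```python
from itertools import product

# Labels copied from the description above; the Python names are illustrative.
LANGUAGES = ["Arabic", "Japanese"]
TASKS = [
    "Image Captioning",
    "Visual Question Answering",
    "Optical Character Recognition",
    "OCR-based Visual Question Answering",
]
DOMAINS = [
    "Business and Finance",
    "Coding and Computer Science",
    "Law, Government, and Politics",
    "Science, Technology, Engineering, and Mathematics",
    "Society, Culture, Humanities, and Religion",
    "Sports, Lifestyle, and Leisure",
]


def category_grid():
    """Return every (language, task, domain) combination."""
    return list(product(LANGUAGES, TASKS, DOMAINS))


grid = category_grid()
print(len(grid))  # 2 languages * 4 tasks * 6 domains = 48
```

The listing does not state how many samples fall into each cell, so this sketch only enumerates the cells.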

Data Format

Images in JPG or other common image formats; annotations in JSON format.
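Because each sample is one image plus one JSON document, a loader usually starts by pairing the two files. The sketch below assumes that an image and its annotation share a file stem in the same directory (for example, a hypothetical `0001.jpg` next to `0001.json`). The listing does not specify the directory layout or file naming, so this should be adjusted to the delivered data.

```python
from pathlib import Path

# Assumed set of image extensions; the listing only says "JPG or other common image formats".
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".webp", ".bmp"}


def pair_samples(root):
    """Return (image_path, json_path) pairs that share a file stem.

    Assumption: the annotation sits next to its image with a .json suffix.
    Images without a matching JSON file are skipped.
    """
    pairs = []
    for image in sorted(Path(root).rglob("*")):
        if image.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        annotation = image.with_suffix(".json")
        if annotation.is_file():
            pairs.append((image, annotation))
    return pairs
```

Each annotation can then be read with `json.loads(annotation.read_text(encoding="utf-8"))`. Stating the UTF-8 encoding explicitly is the safer choice for Arabic and Japanese text.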

Collection Accuracy

The accuracy of image domain classification (per-image accuracy) is above 95%.

Annotation Accuracy

The match rate between image and text description is greater than 95%; OCR recognition accuracy (per-sentence accuracy) exceeds 95%. Per-sentence accuracy is measured by segmenting the text at punctuation marks (such as commas, semicolons, and exclamation marks) or at titles/headings.
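One way to read the per-sentence measure is as the share of punctuation-delimited segments that match a reference transcription. The Python sketch below illustrates that reading under several assumptions: the boundary punctuation set is a guess that includes common Japanese and Arabic marks, segments are compared position by position with exact string equality, and title/heading boundaries are approximated by line breaks. It is not the vendor's scoring procedure.

```python
import re

# Assumed boundary set: ASCII, Japanese, and Arabic punctuation, plus line breaks
# as a stand-in for title/heading boundaries.
_BOUNDARY = re.compile(r"[,;!?.:、。，；！？：،؛؟\n]+")


def split_segments(text):
    """Split text at boundary punctuation and drop empty segments."""
    return [part.strip() for part in _BOUNDARY.split(text) if part.strip()]


def per_sentence_accuracy(reference, predicted):
    """Fraction of reference segments reproduced exactly at the same position."""
    ref = split_segments(reference)
    hyp = split_segments(predicted)
    if not ref:
        return 1.0 if not hyp else 0.0
    correct = sum(1 for r, h in zip(ref, hyp) if r == h)
    return correct / len(ref)
```

For example, comparing the reference `東京、大阪。京都！` with the prediction `東京、大坂。京都！` yields three segments, of which two match. Positional comparison is strict: a single missed boundary shifts all later segments, so an alignment step may be preferable for real evaluation.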

