PISA-Bench: The PISA Index as a Multilingual and Multimodal Metric for the Evaluation of Vision-Language Models

📄 PISA-Bench is a multilingual, multimodal benchmark designed to evaluate vision-language models on human-authored reasoning tasks derived from the OECD PISA assessments. Unlike many existing VLM datasets that rely on synthetic or English-only content, PISA-Bench provides 122 high-quality examples with images, questions, and answer options extracted from real PISA test material, translated and verified across six languages (EN, DE, ES, FR, IT, ZH).
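A minimal loading sketch, assuming the benchmark is distributed via the Hugging Face `datasets` library; the dataset ID, split name, and field names below are illustrative assumptions, not confirmed by this page.

```python
# Loading sketch. The dataset ID "pisa-bench/pisa-bench", the "test"
# split, and the field names are assumptions for illustration only.
from datasets import load_dataset

ds = load_dataset("pisa-bench/pisa-bench", split="test")

# Inspect a few examples: each item (assumed schema) pairs an image
# with per-language question text.
for example in ds.select(range(3)):
    print(example["language"], example["question"][:80])
```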

The benchmark focuses on genuine reasoning rather than pattern matching: each item was manually curated, quality-checked, and categorized into key reasoning types such as spatial & geometric reasoning, graph & pattern analysis, quantitative reasoning, and text & diagram understanding.

Initial results show that even strong VLMs struggle with these tasks, especially in non-English settings, highlighting persistent gaps in multilingual multimodal reasoning.

This leaderboard tracks model performance across languages, providing a transparent and standardized evaluation for future research on multilingual VLMs.


📊 Results

Task type: Multimodal reasoning (image + text)

Input:

  • Instruction (optional)
  • Image
  • Question
  • Answer options (multiple choice) or a free-form answer prompt (see the assembly sketch below)
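The sketch below shows one way these components could be assembled into a text prompt, with the image passed to the VLM separately; the field names ("instruction", "question", "options") are assumptions, not a documented schema.

```python
# Sketch: turning one PISA-Bench item into a text prompt for a VLM.
# Field names are assumed; the image is handled outside this function.
def build_prompt(example: dict) -> str:
    parts = []
    if example.get("instruction"):  # the instruction is optional
        parts.append(example["instruction"])
    parts.append(example["question"])
    options = example.get("options")
    if options:  # multiple-choice item
        letters = "ABCDEFGH"
        parts.extend(f"{letters[i]}. {opt}" for i, opt in enumerate(options))
        parts.append("Answer with the letter of the correct option.")
    else:  # free-form item
        parts.append("Answer concisely.")
    return "\n".join(parts)
```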

Output: Models must generate a concise textual answer. Evaluation is performed using an LLM-as-a-judge protocol comparing the model's answer to the gold reference. For multiple-choice questions, the generated answer must correspond to one of the provided options.
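A minimal sketch of this LLM-as-a-judge protocol; the judge prompt wording and the injected `judge` callable (a wrapper around any chat LLM) are assumptions about the setup, not the benchmark's exact grading prompt.

```python
# Sketch of LLM-as-a-judge grading. The prompt wording and the `judge`
# callable are assumptions; any chat-completion wrapper would do.
def score_answer(question: str, gold: str, predicted: str, judge) -> bool:
    prompt = (
        "You are grading a benchmark answer.\n"
        f"Question: {question}\n"
        f"Reference answer: {gold}\n"
        f"Model answer: {predicted}\n"
        "Reply with exactly CORRECT or INCORRECT."
    )
    verdict = judge(prompt)  # returns the judge model's reply as a string
    return verdict.strip().upper().startswith("CORRECT")

def accuracy(examples, predictions, judge) -> float:
    # Fraction of predictions the judge marks as correct.
    hits = sum(
        score_answer(ex["question"], ex["answer"], pred, judge)
        for ex, pred in zip(examples, predictions)
    )
    return hits / len(predictions)
```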

Overview of the dataset construction pipeline. We (1) collect tasks from the original OECD PISA tests, (2) decompose them into modular components (instruction, image, question, and answer options), (3) verify, augment, and, if necessary, correct the extracted content, and (4) translate them into five target languages (DE, ES, FR, IT, ZH), with all translations verified by native speakers.
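One plausible shape for the record that steps (1)–(3) produce and step (4) fills in per language is sketched below; the class and field names are assumptions based on the pipeline description, not the released schema.

```python
# Assumed record produced by the pipeline: modular components with the
# image shared across languages and per-language text fields.
from dataclasses import dataclass, field

LANGUAGES = ["en", "de", "es", "fr", "it", "zh"]

@dataclass
class PisaItem:
    item_id: str
    category: str        # e.g. "spatial & geometric reasoning"
    image_path: str      # the image is shared across all six languages
    instruction: dict[str, str] = field(default_factory=dict)  # lang -> text
    question: dict[str, str] = field(default_factory=dict)
    options: dict[str, list[str]] = field(default_factory=dict)
    answer: dict[str, str] = field(default_factory=dict)
```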