PISA-Bench: The PISA Index as a Multilingual and Multimodal Metric for the Evaluation of Vision-Language Models

📄 PISA-Bench is a multilingual, multimodal benchmark designed to evaluate vision-language models on human-authored reasoning tasks derived from the OECD PISA assessments. Unlike many existing VLM datasets that rely on synthetic or English-only content, PISA-Bench provides 122 high-quality examples with images, questions, and answer options extracted from real PISA test material, translated and verified across six languages (EN, DE, ES, FR, IT, ZH).
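A minimal loading sketch, assuming the benchmark is distributed via the Hugging Face `datasets` library; the dataset ID, split name, and field names below are illustrative assumptions, not confirmed by this page.

```python
# Loading sketch. The dataset ID "pisa-bench/pisa-bench", the "test"
# split, and the field names are assumptions for illustration only.
from datasets import load_dataset

ds = load_dataset("pisa-bench/pisa-bench", split="test")

# Inspect a few examples: each item (assumed schema) pairs an image
# with per-language question text.
for example in ds.select(range(3)):
    print(example["language"], example["question"][:80])
```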

The benchmark focuses on genuine reasoning rather than pattern matching: each item was manually curated, quality-checked, and categorized into key reasoning types such as spatial & geometric reasoning, graph & pattern analysis, quantitative reasoning, and text & diagram understanding.

Initial results show that even strong VLMs struggle with these tasks, especially in non-English settings, highlighting persistent gaps in multilingual multimodal reasoning.

This leaderboard tracks model performance across languages, providing a transparent and standardized evaluation for future research on multilingual VLMs.


📊 Results

Task type: Multimodal reasoning (image + text)

Input:

  • Instruction (optional)
  • Image
  • Question
  • Answer options (multiple choice) or a free-form answer prompt (see the assembly sketch below)
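The sketch below shows one way these components could be assembled into a text prompt, with the image passed to the VLM separately; the field names ("instruction", "question", "options") are assumptions, not a documented schema.

```python
# Sketch: turning one PISA-Bench item into a text prompt for a VLM.
# Field names are assumed; the image is handled outside this function.
def build_prompt(example: dict) -> str:
    parts = []
    if example.get("instruction"):  # the instruction is optional
        parts.append(example["instruction"])
    parts.append(example["question"])
    options = example.get("options")
    if options:  # multiple-choice item
        letters = "ABCDEFGH"
        parts.extend(f"{letters[i]}. {opt}" for i, opt in enumerate(options))
        parts.append("Answer with the letter of the correct option.")
    else:  # free-form item
        parts.append("Answer concisely.")
    return "\n".join(parts)
```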

Output: Models must generate a concise textual answer. Evaluation is performed using an LLM-as-a-judge protocol comparing the model's answer to the gold reference. For multiple-choice questions, the generated answer must correspond to one of the provided options.
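A minimal sketch of this LLM-as-a-judge protocol; the judge prompt wording and the injected `judge` callable (a wrapper around any chat LLM) are assumptions about the setup, not the benchmark's exact grading prompt.

```python
# Sketch of LLM-as-a-judge grading. The prompt wording and the `judge`
# callable are assumptions; any chat-completion wrapper would do.
def score_answer(question: str, gold: str, predicted: str, judge) -> bool:
    prompt = (
        "You are grading a benchmark answer.\n"
        f"Question: {question}\n"
        f"Reference answer: {gold}\n"
        f"Model answer: {predicted}\n"
        "Reply with exactly CORRECT or INCORRECT."
    )
    verdict = judge(prompt)  # returns the judge model's reply as a string
    return verdict.strip().upper().startswith("CORRECT")

def accuracy(examples, predictions, judge) -> float:
    # Fraction of predictions the judge marks as correct.
    hits = sum(
        score_answer(ex["question"], ex["answer"], pred, judge)
        for ex, pred in zip(examples, predictions)
    )
    return hits / len(predictions)
```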

Overview of the dataset construction pipeline. We (1) collect tasks from the original OECD PISA tests, (2) decompose them into modular components (instruction, image, question, and answer options), (3) verify, augment, and, if necessary, correct the extracted content, and (4) translate them into five target languages (DE, ES, FR, IT, ZH), with all translations verified by native speakers.
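One plausible shape for the record that steps (1)–(3) produce and step (4) fills in per language is sketched below; the class and field names are assumptions based on the pipeline description, not the released schema.

```python
# Assumed record produced by the pipeline: modular components with the
# image shared across languages and per-language text fields.
from dataclasses import dataclass, field

LANGUAGES = ["en", "de", "es", "fr", "it", "zh"]

@dataclass
class PisaItem:
    item_id: str
    category: str        # e.g. "spatial & geometric reasoning"
    image_path: str      # the image is shared across all six languages
    instruction: dict[str, str] = field(default_factory=dict)  # lang -> text
    question: dict[str, str] = field(default_factory=dict)
    options: dict[str, list[str]] = field(default_factory=dict)
    answer: dict[str, str] = field(default_factory=dict)
```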