This repository contains the Screen Question Answering datasets. They are meant to be used as benchmarks for screen content understanding via question answering.
The datasets are based on the screenshots from the public Rico dataset. The screenshots are represented by unique image ids, and those should be used to retrieve the corresponding images and accompanying data from Rico.
There are currently 3 datasets available here:
- ScreenQA (original).
- ScreenQA Short.
- ComplexQA.
See also a related Screen Annotation dataset.
The original ScreenQA dataset contains ~86K questions and answers for ~35K screenshots from the public Rico dataset. Only screenshots with the View Hierarchy in sync were used (see section 4.1 of the paper for more details). This data was produced by human raters.
All the screenshots are split randomly into a training set, a validation set, and a test set. This means that while a single screenshot can have multiple questions and answers, all questions and answers for the same screenshot are in the same split.
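The property described above (no screenshot appears in more than one split) can be checked with a few lines of Python. This is a minimal sketch: the helper names are hypothetical, and the toy entries are made-up stand-ins for the real split files, which would first be loaded with `json.load`.

```python
from itertools import combinations


def image_ids(entries):
    """Collect the set of screenshot ids referenced by a list of entries."""
    return {entry["image_id"] for entry in entries}


def overlapping_ids(splits):
    """Return image ids shared by any two splits, keyed by the pair of split names."""
    ids = {name: image_ids(entries) for name, entries in splits.items()}
    overlaps = {}
    for first, second in combinations(sorted(ids), 2):
        shared = ids[first] & ids[second]
        if shared:
            overlaps[(first, second)] = shared
    return overlaps


# Made-up entries standing in for the three split files.
toy_splits = {
    "train": [
        {"image_id": 1, "question": "q1"},
        {"image_id": 1, "question": "q2"},  # same screenshot, same split
    ],
    "validation": [{"image_id": 2, "question": "q3"}],
    "test": [{"image_id": 3, "question": "q4"}],
}

print(overlapping_ids(toy_splits))  # {} because no screenshot is shared
```

An empty result means every screenshot, together with all of its questions, lives in exactly one split.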
Train, validation and test splits contain 28,378 (~80%), 3,485 (~10%) and 3,489 (~10%) of all screenshots and 68,951 (~80%), 8,614 (~10%) and 8,419 (~10%) of all questions respectively.
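As a quick sanity check of the figures above, the totals and split shares can be recomputed from the stated counts. The sketch below uses only the numbers quoted in this section; the `shares` helper is a hypothetical name.

```python
# Counts copied from the split description above.
screenshots = {"train": 28378, "validation": 3485, "test": 3489}
questions = {"train": 68951, "validation": 8614, "test": 8419}


def shares(counts):
    """Return the total and each split's share of it, in percent."""
    total = sum(counts.values())
    return total, {name: 100 * n / total for name, n in counts.items()}


for label, counts in (("screenshots", screenshots), ("questions", questions)):
    total, percent = shares(counts)
    summary = ", ".join(f"{name}: {value:.1f}%" for name, value in percent.items())
    print(f"{label}: {total} total ({summary})")
```

The totals come out to 35,352 screenshots and 85,984 questions, consistent with the ~35K and ~86K figures, and each share lands close to the 80/10/10 split.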
You can find the dataset in the answers_and_bboxes directory. It contains the ScreenQA data as 3 JSON files, one per data split.
Each JSON file contains a list of question-answer pairs.
The available keys for each entry are:
- image_id: screenshot identifier in the Rico dataset (should be used to get image bytes and other information tied to this screenshot).
- image_width: width of the screenshot.
- image_height: height of the screenshot.
- question: question about the screen.
- ground_truth: list of information about the answer to the question, each element from a different human rater. Each element contains:
  - full_answer: answer to the question as a full sentence.
  - ui_elements: list of elements on the screenshot that contain the answer and, together with the question, are used to produce the answer as a full sentence. Each element contains:
    - text: text description of the element (usually the text inside the selected area, if available; for icons it is a description of the icon, e.g. “4.5 stars”, “home”, “on” for a selected checkbox/radiobutton and “off” for an unselected one).
    - bounds: an array of 4 integers representing the left, top, right and bottom pixel coordinates of the element.
    - vh_index: either -1 if the element was drawn by the rater manually, or a non-negative integer representing the index of the element in the View Hierarchy tree depth-first traversal (starting from 0) if the element is one of the View Hierarchy elements.
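The nested structure above can be walked as in the following sketch. The sample entry is made up for illustration, and all helper names are hypothetical. The traversal assumes a tree whose nodes keep their children under a `children` key, visits nodes in pre-order, and skips empty (`None`) nodes; whether this matches the exact indexing used to produce vh_index is an assumption to verify against the data.

```python
def split_elements(entry):
    """Separate rater-drawn boxes (vh_index == -1) from View Hierarchy elements."""
    drawn, from_vh = [], []
    for rater_answer in entry["ground_truth"]:
        for element in rater_answer["ui_elements"]:
            (drawn if element["vh_index"] == -1 else from_vh).append(element)
    return drawn, from_vh


def normalized_bounds(bounds, image_width, image_height):
    """Scale [left, top, right, bottom] pixel bounds to the 0..1 range."""
    left, top, right, bottom = bounds
    return (left / image_width, top / image_height,
            right / image_width, bottom / image_height)


def preorder_nodes(root):
    """Yield tree nodes depth-first (pre-order), so position 0 is the root.

    Assumption: children live under a "children" key and None nodes are skipped.
    """
    stack = [root]
    while stack:
        node = stack.pop()
        if node is None:
            continue
        yield node
        stack.extend(reversed(node.get("children") or []))


# Made-up entry following the key layout described above.
entry = {
    "image_id": 0,
    "image_width": 1080,
    "image_height": 1920,
    "question": "What is the rating?",
    "ground_truth": [
        {"full_answer": "The rating is 4.5 stars.",
         "ui_elements": [{"text": "4.5 stars", "bounds": [100, 200, 300, 260], "vh_index": 2}]},
        {"full_answer": "It is rated 4.5 stars.",
         "ui_elements": [{"text": "4.5 stars", "bounds": [98, 198, 302, 262], "vh_index": -1}]},
    ],
}

drawn, from_vh = split_elements(entry)
print(len(drawn), len(from_vh))  # 1 1
```

With a real View Hierarchy loaded from Rico, `list(preorder_nodes(root))[vh_index]` would be one way to look up the referenced node under the assumptions stated above.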
ScreenQA Short is a modification of the original ScreenQA dataset. It contains the same set of questions for the same screenshots in each of the train, validation and test splits. The answer data was produced automatically by a model based on the original data from human raters.
You can find the dataset in the short_answers directory. It contains 3 JSON files, one for each data split. Each JSON file contains a list of question-answer pairs.
The available keys for each entry are:
- image_id: screenshot identifier in the Rico dataset (should be used to get image bytes and other information tied to this screenshot).
- question: question about the screen.
- ground_truth: list of short answers to the question.
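Because ground_truth is a list, a prediction can reasonably be compared against every listed short answer. The sketch below shows one simple way to do that with a normalized exact match. This scoring rule and the helper names are assumptions chosen for illustration; they are not the metric used in the papers cited in this README, and the sample entries are made up.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(text.lower().split())


def matches_any(prediction, ground_truth):
    """True if the prediction equals at least one of the short answers."""
    target = normalize(prediction)
    return any(target == normalize(answer) for answer in ground_truth)


def exact_match_rate(entries, predictions):
    """Fraction of entries whose prediction matches any ground-truth answer.

    `predictions` holds one string per entry, in the same order as `entries`.
    """
    if not entries:
        return 0.0
    hits = sum(
        matches_any(prediction, entry["ground_truth"])
        for entry, prediction in zip(entries, predictions)
    )
    return hits / len(entries)


# Made-up entries following the key layout described above.
entries = [
    {"image_id": 1, "question": "What is the rating?", "ground_truth": ["4.5", "4.5 stars"]},
    {"image_id": 2, "question": "Is Wi-Fi on?", "ground_truth": ["on"]},
]
print(exact_match_rate(entries, ["4.5  Stars", "off"]))  # 0.5
```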
ComplexQA is an extension of, or an alternative to, the ScreenQA Short dataset. Its questions and answers mainly focus on counting, arithmetic, and comparison operations that require information from more than one part of the screen. It contains 11,781 question-answer pairs. The data was produced automatically by a model based on the screen information and validated by human raters.
You can find the dataset in the complex_qa directory. It contains a data.json JSON file with a list of question-answer pairs.
The available keys for each entry are:
- image_id: screenshot identifier in the Rico dataset (should be used to get image bytes and other information tied to this screenshot).
- question: question about the screen.
- ground_truth: list of short answers to the question (the current version contains only one answer, though).
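A minimal loading sketch for this file follows. It assumes the code runs from the repository root, so that the relative path `complex_qa/data.json` resolves; the function names are hypothetical. Since the current version has a single answer per question, the sketch takes the first element of ground_truth.

```python
import json
from pathlib import Path


def load_complex_qa(path="complex_qa/data.json"):
    """Read the list of question-answer entries from the JSON file."""
    with Path(path).open(encoding="utf-8") as handle:
        return json.load(handle)


def question_answer_pairs(entries):
    """Flatten entries into (image_id, question, answer) tuples.

    Uses the first ground-truth answer, which is the only one in the
    current version of the data.
    """
    return [
        (entry["image_id"], entry["question"], entry["ground_truth"][0])
        for entry in entries
    ]
```

For example, `pairs = question_answer_pairs(load_complex_qa())` should yield 11,781 tuples if the file matches the count stated above.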
This paper describes the original ScreenQA dataset.
If you use or discuss this dataset in your work, please cite our paper:
@misc{hsiao2024screenqa,
title={ScreenQA: Large-Scale Question-Answer Pairs over Mobile App Screenshots},
author={Yu-Chung Hsiao and Fedir Zubach and Maria Wang and Jindong Chen},
year={2024},
eprint={2209.08199},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
This paper describes 3 datasets:
- ScreenQA Short.
- ComplexQA.
- Screen Annotation (located in a different repository).
If you use or discuss any of those 3 datasets in your work, please cite our paper:
@misc{baechler2024screenai,
title={ScreenAI: A Vision-Language Model for UI and Infographics Understanding},
author={Gilles Baechler and Srinivas Sunkara and Maria Wang and Fedir Zubach and Hassan Mansoor and Vincent Etter and Victor Cărbune and Jason Lin and Jindong Chen and Abhanshu Sharma},
year={2024},
eprint={2402.04615},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
The dataset is licensed under CC BY 4.0.
If you have a technical question regarding the dataset or publication, please create an issue in this repository.