The Corpus of Linguistically Significant Entities (CLSE) is a dataset of named entities annotated by linguist experts. It covers 34 languages and 74 semantic types, supporting applications that range from airline ticketing to video games. The aim of the corpus is to facilitate the creation of more linguistically diverse NLG datasets.
For more details, see the docs/ directory and the paper.
The contents of this repository are licensed under CC-BY.
Make sure to cite the following paper when using this dataset:
@inproceedings{clse2022,
title={CLSE: Corpus of Linguistically Significant Entities},
author={Chuklin, Aleksandr and Zhao, Justin and Kale, Mihir},
booktitle={Proceedings of the 2nd Workshop on Natural Language Generation, Evaluation, and Metrics (GEM 2022) at EMNLP 2022},
year={2022}
}