All examples are under the directory egs, and each one is named after its dataset. Datasets whose names start with "mock" are used for testing only.
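The naming convention above can be checked with a short script. This is a minimal sketch, not part of the project: the helper name `split_datasets` is hypothetical, and it assumes that each dataset is a direct subdirectory of egs and that the "mock" prefix is lowercase.

```python
from pathlib import Path


def split_datasets(egs_dir):
    """Return (real, mock) lists of dataset directory names under egs_dir.

    Assumption: every dataset is a direct subdirectory of egs_dir, and
    test-only datasets are exactly those whose name starts with "mock".
    """
    # Only directories count as datasets; plain files are ignored.
    names = sorted(p.name for p in Path(egs_dir).iterdir() if p.is_dir())
    mock = [n for n in names if n.startswith("mock")]
    real = [n for n in names if not n.startswith("mock")]
    return real, mock
```

Calling `split_datasets("egs")` from the repository root would then return the real datasets and the test-only ones as two sorted lists.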

| Dataset | Supported Tasks | Description |
|---|---|---|
| ATIS | Sequence labeling / Text classification / NLU joint learning | Air Travel Information System (ATIS) pilot corpus. |
| CoNLL2003 | Sequence labeling | The CoNLL 2003 NER task consists of newswire text from the Reuters RCV1 corpus tagged with four different entity types (PER, LOC, ORG, MISC). |
| MSRA_NER | Sequence labeling | The MSRA NER dataset covers the news domain. |
| SNLI | Sentence Matching | The Stanford Natural Language Inference corpus is a freely available collection of labeled sentence pairs, written by humans doing a novel grounded task based on image captioning. |
| Quora_QP | Sentence Matching | Data collected from the Quora platform, a site for gaining and sharing knowledge on any topic. |
| Yahoo_Answer | Document Classification | The Yahoo Answers data is obtained from (Zhang et al., 2015). This is a topic classification task with 10 classes. Each document includes the question title, question context, and best answer. |
| Trec | Document Classification | This data collection contains all the data used in the learning question classification experiments, including the question class definitions. |

| Dataset | Supported Tasks | Description |
|---|---|---|
| hkust | ASR | HKUST Mandarin Telephone Speech. |
| voxceleb | Speaker Verification | VoxCeleb is an audio-visual dataset consisting of short clips of human speech, extracted from interview videos uploaded to YouTube. |
| iemocap | Emotion | The Interactive Emotional Dyadic Motion Capture (IEMOCAP) database is an acted, multimodal, and multispeaker database collected at the SAIL lab at USC. |