atis
atis2/nlu_joint/v1
cnn_dailymail/seq2seq/v1
conll2003
hkust/asr/v1
iemocap
mini_an4
mock_text_cls_data/text_cls/v1
mock_text_match_data/text_match/v1
mock_text_nlu_joint_data/nlu-joint/v1
mock_text_seq2seq_data/seq2seq/v1
mock_text_seq_label_data/seq-label/v1
msra_ner
quora_qp
snli
sre16/v1
trec
voxceleb
wmt14_en_de/nlp1
yahoo_answer

Examples

All examples live under the egs directory, and each one is named after its dataset. Datasets whose names start with "mock" are for testing only.
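The naming convention above can be checked with a few lines of Python. This is a minimal sketch, not part of the repository: the helper name `split_examples` is hypothetical, and it assumes only that each dataset is a direct subdirectory of egs and that test datasets are marked by the "mock" prefix.

```python
from pathlib import Path


def split_examples(egs_dir):
    """Return (real, mock) lists of dataset directory names under egs_dir.

    Hypothetical helper: relies only on the "mock" name prefix
    described in the README, nothing repository-specific.
    """
    # Only directories count as examples; plain files such as README.md are skipped.
    names = sorted(p.name for p in Path(egs_dir).iterdir() if p.is_dir())
    mock = [n for n in names if n.startswith("mock")]
    real = [n for n in names if not n.startswith("mock")]
    return real, mock


if __name__ == "__main__":
    # Runs only if an egs directory exists in the current working directory.
    if Path("egs").is_dir():
        real, mock = split_examples("egs")
        print("datasets:", real)
        print("test-only datasets:", mock)
```

Run from the repository root, this prints the dataset directories in one list and the mock (test-only) directories in another.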

DataSet	Supported Tasks	Description
ATIS	Sequence labeling / Text classification / NLU joint learning	Air Travel Information System (ATIS) pilot corpus.
CoNLL2003	Sequence labeling	The CoNLL 2003 NER task consists of newswire text from the Reuters RCV1 corpus tagged with four different entity types (PER, LOC, ORG, MISC).
MSRA_NER	Sequence labeling	The MSRA dataset is a named entity recognition (NER) corpus in the news domain.
SNLI	Sentence Matching	The Stanford Natural Language Inference (SNLI) corpus is a freely available collection of labeled sentence pairs, written by humans doing a novel grounded task based on image captioning.
Quora_QP	Sentence Matching	Question pairs collected from the Quora platform, a site for gaining and sharing knowledge about anything.
Yahoo_Answer	Document Classification	The Yahoo Answers data is taken from Zhang et al. (2015). This is a topic classification task with 10 classes. The documents we use include question titles, question content, and best answers.
Trec	Document Classification	The TREC question classification collection: all the data used in the original learning question classification experiments, together with the question class definitions.

DataSet	Supported Tasks	Description
hkust	ASR	HKUST Mandarin Telephone Speech
voxceleb	Speaker Verification	VoxCeleb is an audio-visual dataset consisting of short clips of human speech, extracted from interview videos uploaded to YouTube.
iemocap	Emotion	The Interactive Emotional Dyadic Motion Capture (IEMOCAP) database is an acted, multimodal, and multispeaker database collected at the SAIL lab at USC.