Tunisian Arabish Corpus (TArC)

This repository describes and contains the corpus mentioned in the following papers:

Gugliotta, E. & Dinarelli, M. (2020, May). TArC: Incrementally and Semi-Automatically Collecting a Tunisian Arabish Corpus. In Proceedings of The 12th LREC (pp. 6279-6286).
Gugliotta, E. & Dinarelli, M. (2020, June). TArC. Un corpus d'arabish tunisien. In Actes: JEP-TALN (RÉCITAL), Vol. 2 (pp. 232-240). ATALA.
Gugliotta, E. et al. (2020). Multi-Task Sequence Prediction For Tunisian Arabizi Multi-Level Annotation.

TArC has been designed as a flexible, multi-purpose open corpus that can support different types of analysis, both computational and linguistic, as well as the training of NLP tools.

Arabish, also known as Arabizi, is a spontaneous encoding of Arabic dialects in Latin characters and arithmographs (numbers used as letters). This code system was developed by Arabic-speaking social media users to make writing easier in informal settings such as Computer-Mediated Communication (CMC) and text messaging [2].
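As a small illustration of arithmographs, the sketch below flags digits used as letters inside a token. The mapping is a deliberately tiny, illustrative subset: 7 and 5 are attested in this README's own examples (sa7a / صحّة and m5abbi / مخبّي), and 3 for ع is a widely used Arabizi convention. The names `ARITHMOGRAPHS` and `find_arithmographs` are hypothetical, and this is not the CODA encoding procedure used for TArC.

```python
# Illustrative subset only, not a complete Arabizi transliteration table.
ARITHMOGRAPHS = {"3": "ع", "5": "خ", "7": "ح"}


def find_arithmographs(token):
    """Return (digit, Arabic letter) pairs for digits used as letters in a token."""
    return [(ch, ARITHMOGRAPHS[ch]) for ch in token if ch in ARITHMOGRAPHS]


print(find_arithmographs("sa7a"))    # [('7', 'ح')]
print(find_arithmographs("m5abbi"))  # [('5', 'خ')]
```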

Overview of TArC

In a nutshell, the data gathered in TArC represent Tunisian Arabish writing and its evolution over the last ten years.

TArC texts have been extracted from social media, for a total of 43 313 tokens. Each text has been extracted together with the user's metadata, when publicly shared. The metadata consist of:

The governorate of provenance
Age range: [-25],[25-35],[35-50],[50+]
Gender: M/F

The Tunisian Arabish texts collected in TArC are provided with several annotation levels, produced semi-automatically by a Multi-Task Sequence Prediction System:

Token classification into arabizi, foreign and emotag.
Encoding in Arabic Script of the tokens classified as arabizi (following the CODA convention [4]).
Tokenization of the CODAfied texts.
Part-of-Speech tagging for the arabizi tokens.

TArC numbers:

Genre (words)	Sentences	arabizi words	foreign words	emotag words
Forum (11,909)	755	6,024	5,873	12
Social (16,056)	3,162	11,835	3,623	598
Blog (6,671)	366	5,970	694	7
Rap (8,691)	514	7,681	1,009	1
Total (43,327)	4,797
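The figures in the table can be cross-checked with a few lines of arithmetic. The sketch below copies the per-genre counts from the table and recomputes the genre, class, and overall totals; the variable names are illustrative.

```python
# Word counts per genre, copied from the table: (arabizi, foreign, emotag)
WORDS = {
    "Forum": (6024, 5873, 12),
    "Social": (11835, 3623, 598),
    "Blog": (5970, 694, 7),
    "Rap": (7681, 1009, 1),
}
SENTENCES = {"Forum": 755, "Social": 3162, "Blog": 366, "Rap": 514}

# Sum the three classes within each genre, then across genres.
genre_totals = {genre: sum(counts) for genre, counts in WORDS.items()}
# Sum each class (column) across the four genres.
class_totals = [sum(column) for column in zip(*WORDS.values())]
total_words = sum(genre_totals.values())
total_sentences = sum(SENTENCES.values())

print(genre_totals)    # {'Forum': 11909, 'Social': 16056, 'Blog': 6671, 'Rap': 8691}
print(class_totals)    # [31510, 11199, 618]
print(total_words, total_sentences)  # 43327 4797
```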

Classification

The classification task consists in categorizing the text at the token level into three classes: arabizi, foreign and emotag. The first class is for Tunisian and Modern Standard Arabic tokens; the second is used for non-Arabic code-mixing or code-switching elements; the third is the label for elements such as smileys or emoticons. This operation is preparatory to the second level of annotation, namely the encoding in Arabic characters.

The token-level classification was carried out with a character-level RNN model pre-trained on:

Hussem Ben Belgacem's French dictionary, consisting of 336 351 tokens
A Tunisian Arabish dictionary of 100 936 tokens, resulting from merging the following datasets:

The Tunizi Sentiment Analysis Tunisian Arabic Dataset [1]
The TLD dataset of Arabish [6]

The emotag dictionary was built by extracting smileys and emoticons from the Arabizi dictionary (the second item above). Once the model had been pre-trained on these data, an iterative procedure for classifying the TArC texts could begin [3]. The model reached 97% accuracy.
Each token classification has been manually checked.

Tokenization and PoS tagging

The tokenization at the string level consists in splitting each string into its components, joined by the symbol +. Only tokens classified as arabizi have been tokenized.
E.g.: "be in a bad way" (lit.: on the wall), [ʕal'ħiːtˤ], ع+ال+حيط <= عالحيط
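Because the components are joined by +, a tokenized string can be split back into its segments with a plain string split. A minimal sketch using the example above (the function name `split_segments` is illustrative):

```python
def split_segments(tokenized):
    """Split a TArC tokenized string into its '+'-separated components."""
    return tokenized.split("+")


segments = split_segments("ع+ال+حيط")
print(len(segments))       # 3
print("".join(segments))   # عالحيط
```

Joining the segments without the separator gives back the CODA form of the word.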

Part-of-Speech tagging is the morphosyntactic annotation of strings. It has been carried out at two levels, morphological and functional. The first describes the morphological nature of each element of the string, while the second describes the grammatical function of the whole string. The PoS annotation style follows the guidelines of the Penn Arabic Treebank (PATB) [5].

arabish	CODA	tokenization	POS	gloss
sa7a	صحّة	صحّة	[NOUN-NSUFF_FEM_SG]INTERJ	lit:health
w	و	و	CONJ	and
bechfee	بالشفى	ب+ال+شفى	[PREP+DET+NOUN]INTERJ	lit:to your good health
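Judging from the rows above, a tag such as [PREP+DET+NOUN]INTERJ carries the morphological level inside the square brackets and the functional level after them. The sketch below separates the two under that reading. It assumes that a bare tag such as CONJ has no separate functional label and returns `None` for it; the name `parse_pos` and that convention are illustrative, not part of the corpus tooling.

```python
import re
from typing import List, Optional, Tuple

# "[morphological tags]FUNCTIONAL" as seen in the example rows.
POS_PATTERN = re.compile(r"^\[(?P<morph>[^\]]+)\](?P<func>.+)$")


def parse_pos(tag: str) -> Tuple[List[str], Optional[str]]:
    """Return (per-segment morphological tags, functional tag or None)."""
    match = POS_PATTERN.match(tag)
    if match is None:
        # Assumption: a bare tag is a single morphological tag.
        return [tag], None
    return match.group("morph").split("+"), match.group("func")


print(parse_pos("[PREP+DET+NOUN]INTERJ"))      # (['PREP', 'DET', 'NOUN'], 'INTERJ')
print(parse_pos("[NOUN-NSUFF_FEM_SG]INTERJ"))  # (['NOUN-NSUFF_FEM_SG'], 'INTERJ')
print(parse_pos("CONJ"))                       # (['CONJ'], None)
```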

How to use TArC

The TArC text files can be downloaded by genre:

Forum texts: forum.tab
Social Network texts: social.tab
Blog texts: blog.tab
Rap lyrics texts: rap.tab

In each file, sentences are delimited by a blank line, and each file contains a header like the following:

data	arabish	class	coda	token	pos	governorate	age
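A file in this layout can be read with the standard library alone. The sketch below is one possible reader, assuming tab-separated fields, one token per row, and the header on the first non-blank line. The function name `read_sentences` is illustrative, and the sample rows reuse tokens from the table above with placeholder date, class, and metadata values.

```python
import io


def read_sentences(lines):
    """Yield sentences as lists of {column: value} dicts, one dict per token."""
    header = None
    sentence = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        fields = line.split("\t")
        if header is None:
            header = fields  # first non-blank line is taken as the header
            continue
        sentence.append(dict(zip(header, fields)))
    if sentence:
        yield sentence


# Hypothetical sample: dates, classes and metadata are placeholders.
SAMPLE = (
    "data\tarabish\tclass\tcoda\ttoken\tpos\tgovernorate\tage\n"
    "20200101\tsa7a\tarabizi\tصحّة\tصحّة\t[NOUN-NSUFF_FEM_SG]INTERJ\t/\t/\n"
    "20200101\tw\tarabizi\tو\tو\tCONJ\t/\t/\n"
    "\n"
    "20200101\tbechfee\tarabizi\tبالشفى\tب+ال+شفى\t[PREP+DET+NOUN]INTERJ\t/\t/\n"
)

sentences = list(read_sentences(io.StringIO(SAMPLE)))
print(len(sentences))                         # 2
print([tok["arabish"] for tok in sentences[0]])  # ['sa7a', 'w']
```

For a real file, the same function can be given an open file handle, for example `open("forum.tab", encoding="utf-8")`.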

The publication date follows the format yyyymmdd.
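A yyyymmdd string can be turned into a date object with the standard library. In this sketch the sample value is hypothetical and the function name `parse_publication_date` is illustrative:

```python
from datetime import date, datetime


def parse_publication_date(value: str) -> date:
    """Parse a yyyymmdd string into a datetime.date."""
    return datetime.strptime(value, "%Y%m%d").date()


print(parse_publication_date("20200511"))  # 2020-05-11
```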

Not all sentences come with all the metadata (governorate/age/gender). A '/' in place of a metadata value means that the user did not publish this information.
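When loading the data, it can be convenient to turn the '/' placeholder into an explicit missing value. A minimal sketch, using a hypothetical row and an illustrative function name:

```python
from typing import Optional


def normalize_metadata(value: str) -> Optional[str]:
    """Map the '/' placeholder to None and leave other values unchanged."""
    value = value.strip()
    return None if value == "/" else value


row = {"governorate": "Sfax", "age": "/"}  # hypothetical row
cleaned = {key: normalize_metadata(val) for key, val in row.items()}
print(cleaned)  # {'governorate': 'Sfax', 'age': None}
```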

The governorate names are encoded following French orthographic conventions and are: Ariana, Béja, Sousse, Bizerte, Gabès, Nabeul, Jendouba, Kairouan, Zaghouan, Kebili, El Kef, Mahdia, Manouba, Medenine, Monastir, Gafsa, Sfax, Sidi Bouzid, Siliana, Ben Arous, Tataouine, Tozeur, Tunis, Kasserine.
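Since the governorate field is a closed list, values can be checked against it. The sketch below copies the 24 names listed above and normalizes Unicode before comparing, on the assumption that accented letters may be stored in either composed or decomposed form. The function name is illustrative.

```python
import unicodedata

GOVERNORATES = {
    "Ariana", "Béja", "Sousse", "Bizerte", "Gabès", "Nabeul", "Jendouba",
    "Kairouan", "Zaghouan", "Kebili", "El Kef", "Mahdia", "Manouba",
    "Medenine", "Monastir", "Gafsa", "Sfax", "Sidi Bouzid", "Siliana",
    "Ben Arous", "Tataouine", "Tozeur", "Tunis", "Kasserine",
}


def is_known_governorate(name: str) -> bool:
    """Check a governorate value against the list, after NFC normalization."""
    return unicodedata.normalize("NFC", name.strip()) in GOVERNORATES


print(len(GOVERNORATES))             # 24
print(is_known_governorate("Béja"))  # True
print(is_known_governorate("/"))     # False
```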

Any personal user information has been anonymized by replacing it with the token:

m5abbi+id number

The Tunisian word 'm5abbi' means 'covered'; it corresponds to the CODAfied token 'مخبّي' and carries the PoS tag 'NOUN_PROP'.
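Anonymized tokens can be picked out with a regular expression. The exact surface form of the placeholder is an assumption here: the pattern accepts 'm5abbi' followed by digits, with or without a '+' in between, and may need adjusting after inspecting the files. The names below are illustrative.

```python
import re

# Assumption: "m5abbi" followed by an id number, optionally joined by "+".
ANON_PATTERN = re.compile(r"^m5abbi\+?\d+$")


def is_anonymized(token: str) -> bool:
    """Return True if the token looks like an anonymization placeholder."""
    return ANON_PATTERN.match(token) is not None


print(is_anonymized("m5abbi12"))   # True
print(is_anonymized("m5abbi+12"))  # True
print(is_anonymized("m5abbi"))     # False
```

The bare word 'm5abbi' is deliberately not matched, since it is also an ordinary Tunisian word.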

If you want to use these data with the Multi-Task Sequence Prediction System, you should remove the header and extract the columns that you need, depending on the process you want to perform.
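The header removal and column extraction can be sketched as follows. The function drops the header, keeps only the requested columns, and preserves blank lines as sentence delimiters. Whether the Multi-Task Sequence Prediction System expects exactly this layout is an assumption to check against its documentation. The function name and the sample rows (placeholder date, class, and metadata values) are illustrative.

```python
def select_columns(lines, wanted):
    """Yield tab-separated rows with only the wanted columns, without the header."""
    header = None
    indices = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            yield ""  # keep sentence boundaries
            continue
        fields = line.split("\t")
        if header is None:
            header = fields
            indices = [header.index(name) for name in wanted]
            continue
        yield "\t".join(fields[i] for i in indices)


# Hypothetical sample: dates, classes and metadata are placeholders.
SAMPLE_ROWS = [
    "data\tarabish\tclass\tcoda\ttoken\tpos\tgovernorate\tage",
    "20200101\tsa7a\tarabizi\tصحّة\tصحّة\t[NOUN-NSUFF_FEM_SG]INTERJ\t/\t/",
    "20200101\tw\tarabizi\tو\tو\tCONJ\t/\t/",
    "",
    "20200101\tbechfee\tarabizi\tبالشفى\tب+ال+شفى\t[PREP+DET+NOUN]INTERJ\t/\t/",
]

extracted = list(select_columns(SAMPLE_ROWS, ["arabish", "coda"]))
for row in extracted:
    print(row)
```

Requesting a column name that is not in the header raises a `ValueError`, which makes typos in the column list visible early.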

The lemmatization layer is coming, as is a distribution of the data in Excel format.

License

Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)

Citation

Please cite this work as:

@inproceedings{gugliotta-etal-wanlp2020, 
    title={Multi-Task Sequence Prediction For Tunisian Arabizi Multi-Level Annotation}, 
    author={Gugliotta, Elisa and Dinarelli, Marco and Kraif, Olivier}, 
    booktitle={The Fifth Arabic Natural Language Processing Workshop (WANLP)}, 
    year={2020},
}

References

[1] Fourati, C. et al. (2020). TUNIZI: a Tunisian Arabizi sentiment analysis Dataset. arXiv preprint arXiv:2004.14303.

[2] Gugliotta, E. & Dinarelli, M. (2020, May). TArC: Incrementally and Semi-Automatically Collecting a Tunisian Arabish Corpus. In Proceedings of The 12th Language Resources and Evaluation Conference (pp. 6279-6286).

[3] Gugliotta, E. et al. (2020). Multi-Task Sequence Prediction For Tunisian Arabizi Multi-Level Annotation. Accepted for The Fifth Arabic Natural Language Processing Workshop (WANLP) 2020. Forthcoming.

[4] Habash, N. et al. (2012, May). Conventional Orthography for Dialectal Arabic. In Proceedings of The Language Resources and Evaluation Conference (pp. 711-718).

[5] Maamouri, M. et al. (2009). Penn Arabic Treebank guidelines v4.

[6] Younes, J. et al. (2015, June). Constructing linguistic resources for the Tunisian dialect using textual user-generated contents on the social web. In International Conference on Web Engineering (pp. 3-14). Springer, Cham.
