This work empirically demonstrates the ability of Text Graph Convolutional Network (Text GCN) to outperform traditional natural language processing benchmarks for the task of semi-supervised Swahili news classification. In particular, we focus our experimentation on the sparsely-labelled semi-supervised context which is representative of the practical constraints facing low-resourced African languages. We follow up on this result by introducing a variant of the Text GCN model which utilises a bag of words embedding rather than a naive one-hot encoding to reduce the memory footprint of Text GCN whilst demonstrating similar predictive performance.
This repo contains the code used to generate the experimental results presented in the following two papers:
- Alexandros Kastanos and Tyler Martin, Graph Convolutional Network for Swahili News Classification, European Chapter of the Association for Computational Linguistics, AfricaNLP, 2021
- Alexandros Kastanos and Tyler Martin, Graph Convolutional Network for Swahili News Classification, North American Chapter of the Association for Computational Linguistics, SIGTYP, 2021
This project is compatible with Python >= 3.6.9. Install the full list of dependencies in your virtual environment by running:
pip install -r requirements.txt
NOTE: This repo was developed (and will continue to be maintained) exclusively for Linux and Mac. The exact package versions are valid for Python 3.6.9.
Before you get started, open a Python prompt from your virtual environment and run the following to download the punkt tokenizer data:
>>> import nltk
>>> nltk.download('punkt')
The raw dataset can be downloaded, preprocessed, and organised into a DataFrame by running the following command:
python src/create_dataset.py <DOWNLOADED_DATA_DIR_NAME> --dataset-name <DATASET_NAME>
This script is set up to handle the Helsinki Corpus of Swahili 2.0 Not Annotated Version and the Zenodo Swahili : News Classification Dataset. You select which dataset to process using the --dataset-name CLI option. The dataset split is also generated by this script.
Note that to replicate the results in the papers listed above, you only need the Swahili news classification dataset.
We can apply preprocessing (stemming, error removal, etc.) using the script below:
python src/generate_stemming_map.py \
--results-dir <PREPROC_DATA_DIR_NAME> \
--input-data-dir <DOWNLOADED_DATA_DIR_NAME>
This gives us our cleaned stemming map and vocabulary counts as the main outputs.
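As a sketch of how the cleaned stemming map might be consumed downstream (the structure shown here, a flat JSON object mapping surface tokens to stems, is an assumption for illustration, not the repo's confirmed schema):

```python
import json

# Hypothetical stemming map: assumed to be a flat JSON object mapping each
# surface token to its stem (illustrative entries, not real repo output).
stemming_map = json.loads('{"habarini": "habari", "magazetini": "gazeti"}')

def stem_tokens(tokens, stemming_map):
    """Replace each token with its stem; unseen tokens pass through unchanged."""
    return [stemming_map.get(token, token) for token in tokens]

print(stem_tokens(["habarini", "magazetini", "leo"], stemming_map))
# → ['habari', 'gazeti', 'leo']
```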
A number of baseline models can be trained out of the box. They all use a feature generation stage followed by a logistic regression classifier. The excerpt below from our paper compares a set of baseline models to two Text GCN variants; for more details, see Table 2 in our paper.
To train any of these baseline models, simply use the following:
python src/train_baseline.py <BASELINE_RESULTS_DIR> \
--input-data-dir <DOWNLOADED_DATA_DIR_NAME> \
--stemmer-path <PREPROC_DATA_DIR_NAME>/stemming/stemming_cleaned.json \
--model <BASELINE_MODEL>
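The baseline recipe above (feature generation followed by logistic regression) can be sketched in a few lines of scikit-learn. This is a toy illustration, not the repo's train_baseline.py: the texts, labels, and the choice of CountVectorizer as the feature generator are all stand-ins.

```python
# Minimal sketch of the baseline pattern: features -> logistic regression.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy Swahili headlines with made-up category labels.
texts = ["timu ilishinda mechi", "bei ya mafuta imepanda",
         "mchezaji alifunga bao", "soko la hisa limeshuka"]
labels = ["michezo", "uchumi", "michezo", "uchumi"]

pipeline = make_pipeline(CountVectorizer(), LogisticRegression())
pipeline.fit(texts, labels)
print(pipeline.predict(["timu ilifunga bao"]))  # expected: ['michezo']
```

Swapping CountVectorizer for another feature generator (e.g. a TF-IDF vectorizer) mirrors how the different baseline models vary only in their feature stage.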
There are two Text GCN variants which can be trained: the vanilla Text GCN and Text GCN-t2v. Refer to the figure below for a comparison of these models against selected baseline models as the proportion of training labels is varied. See Figure 3 in our paper for full details.
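The memory argument from the abstract, replacing one-hot node features with a bag-of-words representation, can be sketched as follows. This is an illustrative construction only; the repo's exact feature-building code may differ:

```python
import numpy as np

# Toy tokenised documents standing in for document nodes in the text graph.
docs = [["habari", "za", "leo"], ["habari", "njema"]]
vocab = sorted({token for doc in docs for token in doc})

# One-hot scheme: each node's feature is an indicator, so the feature matrix
# is an identity over the nodes (it grows quadratically with node count).
one_hot = np.eye(len(docs))

# Bag-of-words scheme: each document node carries term counts over the
# vocabulary, giving a (num_docs x vocab_size) matrix instead.
bow = np.zeros((len(docs), len(vocab)))
for i, doc in enumerate(docs):
    for token in doc:
        bow[i, vocab.index(token)] += 1
```

In Text GCN the one-hot identity spans every node (documents and words), so for large graphs the bag-of-words matrix is far smaller, which is the footprint reduction the Text GCN-t2v variant targets.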
To train a GCN model, run:
python src/train_gnn.py <GNN_RESULTS_DIR> \
--input-data-dir <DOWNLOADED_DATA_DIR_NAME> \
--stemmer-path <PREPROC_DATA_DIR_NAME>/stemming/stemming_cleaned.json \
--model <GNN_MODEL> \
--input-features <NODE_REPRESENTATION> \
....
As with all the above scripts, you can use the --help flag to see the full set of CLI options.
To cite our work:
@misc{kastanos2021graph,
title={Graph Convolutional Network for Swahili News Classification},
author={Alexandros Kastanos and Tyler Martin},
year={2021},
eprint={2103.09325},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
Zenodo's Swahili News Dataset:
@misc{davis_david_2020_4300294,
author = {Davis David},
title = {Swahili : News Classification Dataset},
month = dec,
year = 2020,
publisher = {Zenodo},
version = {0.1},
doi = {10.5281/zenodo.4300294},
url = {https://doi.org/10.5281/zenodo.4300294}
}
Helsinki Corpus of Swahili 2.0:
@misc{hcs-na-v2_en,
author={Arvi Hurskainen and {Department of World Cultures, University of Helsinki}},
year={2016},
title={{Helsinki Corpus of Swahili 2.0 Not Annotated Version}},
publisher={Kielipankki},
type={text corpus},
url={http://urn.fi/urn:nbn:fi:lb-2016011302},
}