
End-to-End Text Classification via Image-based Embedding using Character-level Networks

CoRR preprint: arXiv:1810.03595 | Published version: IEEE Xplore

Authors: Shunsuke Kitada, Ryunosuke Kotani, Hitoshi Iyatomi

Figures: the proposed CE-CLCNN model [1] (left) and an example of data augmentation in the image domain with random erasing [2] (right, animated).
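The animation referenced above illustrates random erasing [2] applied to rendered character images. Below is a minimal, illustrative sketch of that augmentation, assuming grayscale character images as NumPy arrays; the parameter values are assumptions and not taken from this repository's implementation.

import random
import numpy as np

def random_erasing(img, p=0.5, area_range=(0.02, 0.4), aspect_range=(0.3, 3.3)):
    """Randomly erase a rectangular region of a (H, W) character image.

    Illustrative sketch of the idea from Zhong et al. [2]; the actual
    hyperparameters used in this repository may differ.
    """
    if random.random() > p:
        return img
    h, w = img.shape
    for _ in range(100):  # retry until a rectangle fits inside the image
        area = random.uniform(*area_range) * h * w
        aspect = random.uniform(*aspect_range)
        eh = int(round(np.sqrt(area * aspect)))
        ew = int(round(np.sqrt(area / aspect)))
        if 0 < eh < h and 0 < ew < w:
            top = random.randint(0, h - eh)
            left = random.randint(0, w - ew)
            out = img.copy()
            # fill the erased region with random pixel values
            out[top:top + eh, left:left + ew] = np.random.uniform(
                img.min(), img.max(), size=(eh, ew)
            )
            return out
    return img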

Abstract: For analysing and/or understanding languages that have no word boundaries, such as Japanese, Chinese, and Thai, it is desirable to perform appropriate word segmentation based on morphological analysis before computing word embeddings, but this is inherently difficult in these languages. In recent years, various language models based on deep learning have made remarkable progress, and some methodologies that utilize character-level features have successfully avoided this difficulty. However, when a model is fed character-level features of these languages, it often overfits because of the large number of character types. In this paper, we propose CE-CLCNN, a character-level convolutional neural network with a character encoder, to tackle these problems. The proposed CE-CLCNN is an end-to-end learning model with an image-based character encoder, i.e. CE-CLCNN handles each character in the target document as an image. Through various experiments, we found and confirmed that CE-CLCNN captures closely embedded features for visually and semantically similar characters and achieves state-of-the-art results on several open document classification tasks. In this paper we report the performance of CE-CLCNN on the Wikipedia title estimation task and analyse its internal behaviour.
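To make the architecture described above concrete, here is a minimal, self-contained PyTorch sketch of the two components: a CNN character encoder that embeds each character from its rendered glyph image, and a character-level CNN (CLCNN) classifier over the resulting sequence of embeddings. Layer sizes, the 36×36 glyph resolution, and the pooling scheme are illustrative assumptions, not the exact configuration used in the paper or in this repository.

import torch
import torch.nn as nn

class CharacterEncoder(nn.Module):
    """Embeds a single character from its rendered glyph image (assumed 36x36 grayscale)."""
    def __init__(self, embedding_dim: int = 128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.fc = nn.Linear(32 * 9 * 9, embedding_dim)

    def forward(self, char_images):  # (batch * seq_len, 1, 36, 36)
        h = self.conv(char_images)
        return self.fc(h.flatten(start_dim=1))

class CECLCNN(nn.Module):
    """Character-level CNN classifier over image-based character embeddings."""
    def __init__(self, num_classes: int, embedding_dim: int = 128):
        super().__init__()
        self.encoder = CharacterEncoder(embedding_dim)
        self.clcnn = nn.Sequential(
            nn.Conv1d(embedding_dim, 256, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),
        )
        self.classifier = nn.Linear(256, num_classes)

    def forward(self, char_images):  # (batch, seq_len, 1, 36, 36)
        b, s = char_images.shape[:2]
        # encode every character image independently, then restack per document
        emb = self.encoder(char_images.view(b * s, *char_images.shape[2:]))
        emb = emb.view(b, s, -1).transpose(1, 2)  # (batch, embedding_dim, seq_len)
        return self.classifier(self.clcnn(emb).squeeze(-1))

For example, a batch of 8 titles, each truncated or padded to 10 characters and rendered as 36×36 grayscale glyphs, would be passed as a tensor of shape (8, 10, 1, 36, 36); the forward pass returns class logits of shape (8, num_classes).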

We recommend that you also check out the following studies related to ours.

Install and Run the code

Python 3.8 · Code style: black · Powered by AllenNLP

Install the requirements

pip install -U pip poetry setuptools
poetry install

# If you want to use CUDA 11+, try the following command:
poetry run poe force-cuda11

Run model training

  • Our CE-CLCNN models
bash scripts/run_exps/run_ceclcnn.sh

# or dry-run the scripts
# DRY_RUN=1 bash scripts/run_exps/run_ceclcnn.sh

  • Visual model from Liu et al. [3] (baseline)
bash scripts/run_exps/run_liu_acl17.sh

# or dry-run the scripts
# DRY_RUN=1 bash scripts/run_exps/run_liu_acl17.sh

Inference on test data using a pre-trained model

  • Example of inference using the best model for the Japanese Wikipedia title dataset.
CUDA_VISIBLE_DEVICES=0 allennlp predict \
  output/CE-CLCNN/wiki_title/ja/with_RE_and_WT/model.tar.gz \
  https://github.com/frederick0329/Learning-Character-Level/raw/master/data/ja_test.txt \
  --cuda-device 0 \
  --use-dataset-reader \
  --dataset-reader-choice validation \
  --predictor wiki_title \
  --output-file output/CE-CLCNN/wiki_title/ja/with_RE_and_WT/prediction_result.jsonl \
  --silent

If you want to use this pre-trained model, download it first with the following commands and then execute the prediction command above.

mkdir -p output/CE-CLCNN/wiki_title/ja/with_RE_and_WT
wget https://github.com/IyatomiLab/CE-CLCNN/raw/master/pretrained_models/CE-CLCNN/wiki_title/ja/with_RE_and_WT/model.tar.gz -P output/CE-CLCNN/wiki_title/ja/with_RE_and_WT
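
After the prediction command finishes, the --output-file contains one JSON object per line. Below is a minimal sketch for inspecting that file; the exact keys depend on the wiki_title predictor's output, so the field names used here (label, probs) are assumptions to adjust as needed.

import json

# Path matching the prediction command above
path = "output/CE-CLCNN/wiki_title/ja/with_RE_and_WT/prediction_result.jsonl"

with open(path, encoding="utf-8") as f:
    for i, line in enumerate(f):
        prediction = json.loads(line)
        # NOTE: the available keys depend on the predictor's implementation;
        # print them once to see what the model actually returns.
        if i == 0:
            print("available keys:", sorted(prediction.keys()))
        # "label" and "probs" are assumed names for the predicted class and scores
        print(prediction.get("label"), prediction.get("probs"))
        if i >= 4:  # only show the first few predictions
            break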

Pre-trained models

| Dataset | Language | Model | Pre-trained Model |
| --- | --- | --- | --- |
| Wikipedia Title Dataset [3] | Chinese | CE-CLCNN (proposed) | [Download] |
| Wikipedia Title Dataset [3] | Chinese | CE-CLCNN w/ RE and WT (proposed) | [Download] |
| Wikipedia Title Dataset [3] | Chinese | Visual model [3] | [Download] |
| Wikipedia Title Dataset [3] | Japanese | CE-CLCNN (proposed) | [Download] |
| Wikipedia Title Dataset [3] | Japanese | CE-CLCNN w/ RE and WT (proposed) | [Download] |
| Wikipedia Title Dataset [3] | Japanese | Visual model [3] | [Download] |
| Wikipedia Title Dataset [3] | Korean | CE-CLCNN (proposed) | [Download] |
| Wikipedia Title Dataset [3] | Korean | CE-CLCNN w/ RE and WT (proposed) | [Download] |
| Wikipedia Title Dataset [3] | Korean | Visual model [3] | [Download] |
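
Each downloaded model.tar.gz can also be loaded programmatically with AllenNLP instead of through the CLI. The sketch below is illustrative only: the predictor name wiki_title comes from the CLI example above, but the input JSON key is an assumption and should be checked against this repository's predictor and dataset reader.

from allennlp.predictors.predictor import Predictor

# NOTE: this project's custom AllenNLP components must be importable/registered
# before loading the archive, e.g. by running this from the repository root
# with the package installed via poetry.

# Local path to one of the downloaded archives from the table above
archive_path = "output/CE-CLCNN/wiki_title/ja/with_RE_and_WT/model.tar.gz"

# predictor_name matches the --predictor flag in the CLI example above
predictor = Predictor.from_path(archive_path, predictor_name="wiki_title", cuda_device=-1)

# The expected input key depends on the predictor's _json_to_instance;
# "text" is an assumed placeholder, not a documented interface of this repository.
result = predictor.predict_json({"text": "機械学習"})
print(result)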

Citation

@inproceedings{kitada2018end,
  title={End-to-end text classification via image-based embedding using character-level networks},
  author={Kitada, Shunsuke and Kotani, Ryunosuke and Iyatomi, Hitoshi},
  booktitle={2018 IEEE Applied Imagery Pattern Recognition Workshop (AIPR)},
  pages={1--4},
  year={2018},
  organization={IEEE},
  doi={10.1109/AIPR.2018.8707407},
}

References

  • [1] S. Kitada, R. Kotani, and H. Iyatomi. "End-to-end Text Classification via Image-based Embedding using Character-level Networks." In Proceedings of IEEE Applied Imagery Pattern Recognition (AIPR) Workshop. IEEE, 2018. doi: https://doi.org/10.1109/AIPR.2018.8707407
  • [2] Z. Zhong et al. "Random erasing data augmentation." In Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 34. No. 07. 2020. doi: https://doi.org/10.1609/aaai.v34i07.7000
  • [3] F. Liu et al. "Learning Character-level Compositionality with Visual Features." In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2017. doi: http://dx.doi.org/10.18653/v1/P17-1188