This page contains instructions for how to index the 20 Newsgroups dataset.
There are many versions of the 20 Newsgroups dataset available on the web; we're specifically going to use this one (the "bydate" version).
We're going to use collections/20newsgroups/ as the working directory.
First, we need to download and extract the dataset:
mkdir -p collections/20newsgroups/
wget -nc http://qwone.com/~jason/20Newsgroups/20news-bydate.tar.gz -P collections/20newsgroups
tar -xvzf collections/20newsgroups/20news-bydate.tar.gz -C collections/20newsgroups
To confirm, 20news-bydate.tar.gz should have an MD5 checksum of d6e9e45cb8cb77ec5276dfa6dfc14318.
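The checksum comparison can also be scripted. The following is a minimal sketch, assuming the archive sits at the path used by the download command above; the helper name `md5_of_file` is hypothetical and not part of Anserini.

```python
import hashlib
from pathlib import Path

# Checksum quoted on this page for 20news-bydate.tar.gz.
EXPECTED_MD5 = "d6e9e45cb8cb77ec5276dfa6dfc14318"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    archive = Path("collections/20newsgroups/20news-bydate.tar.gz")
    if archive.exists():
        actual = md5_of_file(archive)
        print("OK" if actual == EXPECTED_MD5 else f"MISMATCH: {actual}")
    else:
        print(f"{archive} not found; download it first")
```

Reading in chunks keeps memory use flat regardless of archive size.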
After unpacking, you should see the following two folders:
ls collections/20newsgroups/20news-bydate-test
ls collections/20newsgroups/20news-bydate-train
There are docs with the same id in different categories.
For example, doc 123 exists in misc.forsale and sci.crypt, with different texts.
Since we assume unique docids when building an index, we need to clean the dataset first.
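To see the collisions for yourself before cleaning, a small script can list docids that occur under more than one category. This is only an inspection sketch, assuming the unpacked layout is `<split>/<category>/<docid>`; the function `find_duplicate_docids` is hypothetical and does not reproduce what prune_and_merge.py does.

```python
from collections import defaultdict
from pathlib import Path


def find_duplicate_docids(split_dirs):
    """Map each docid (file name) to the categories it appears in,
    keeping only docids found under more than one category.

    Assumes the layout <split_dir>/<category>/<docid>.
    """
    categories_by_docid = defaultdict(set)
    for split_dir in split_dirs:
        for doc in Path(split_dir).glob("*/*"):
            if doc.is_file():
                categories_by_docid[doc.name].add(doc.parent.name)
    return {
        docid: sorted(cats)
        for docid, cats in categories_by_docid.items()
        if len(cats) > 1
    }


if __name__ == "__main__":
    root = Path("collections/20newsgroups")
    dups = find_duplicate_docids(
        [root / "20news-bydate-train", root / "20news-bydate-test"]
    )
    print(f"{len(dups)} docids appear in more than one category")
```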
To prune and merge both train and test splits into one folder:
python src/main/python/20newsgroups/prune_and_merge.py \
--paths collections/20newsgroups/20news-bydate-test collections/20newsgroups/20news-bydate-train \
--out collections/20newsgroups/20news-bydate
Now you should see the train and test splits merged into one folder in collections/20newsgroups/20news-bydate/.
To index train and test together:
bin/run.sh io.anserini.index.IndexCollection \
-collection TwentyNewsgroupsCollection \
-input collections/20newsgroups/20news-bydate \
-index indexes/lucene-index.20newsgroups.all \
-generator DefaultLuceneDocumentGenerator -threads 2 \
-storePositions -storeDocvectors -storeRaw -optimize
To index just the train set:
bin/run.sh io.anserini.index.IndexCollection \
-collection TwentyNewsgroupsCollection \
-input collections/20newsgroups/20news-bydate-train \
-index indexes/lucene-index.20newsgroups.train \
-generator DefaultLuceneDocumentGenerator -threads 2 \
-storePositions -storeDocvectors -storeRaw -optimize
To index just the test set:
bin/run.sh io.anserini.index.IndexCollection \
-collection TwentyNewsgroupsCollection \
-input collections/20newsgroups/20news-bydate-test \
-index indexes/lucene-index.20newsgroups.test \
-generator DefaultLuceneDocumentGenerator -threads 2 \
-storePositions -storeDocvectors -storeRaw -optimize
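The three commands above differ only in the input folder and the index name. As a convenience sketch (not part of Anserini), the snippet below assembles the same argument lists from a small table; the names `SPLITS` and `index_command` are hypothetical, and every flag is copied verbatim from the commands above.

```python
import shlex

# Input folder for each variant, copied from the commands above.
SPLITS = {
    "all": "collections/20newsgroups/20news-bydate",
    "train": "collections/20newsgroups/20news-bydate-train",
    "test": "collections/20newsgroups/20news-bydate-test",
}


def index_command(split, threads=2):
    """Build the IndexCollection argument list for one split."""
    return [
        "bin/run.sh", "io.anserini.index.IndexCollection",
        "-collection", "TwentyNewsgroupsCollection",
        "-input", SPLITS[split],
        "-index", f"indexes/lucene-index.20newsgroups.{split}",
        "-generator", "DefaultLuceneDocumentGenerator",
        "-threads", str(threads),
        "-storePositions", "-storeDocvectors", "-storeRaw", "-optimize",
    ]


if __name__ == "__main__":
    for split in SPLITS:
        # Print only; pass the list to subprocess.run(..., check=True) to execute it.
        print(shlex.join(index_command(split)))
```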
Indexing should take just a few seconds.
You can check the document count (for train and test together, or train/test individually) with:
bin/run.sh io.anserini.index.IndexReaderUtils \
-index indexes/lucene-index.20newsgroups.all \
-stats
This should output:
Index statistics
----------------
documents: 18846
documents (non-empty): 18846
unique terms: 165633
total terms: 4219956
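If you want to check the counts automatically, the statistics text can be parsed into a dictionary. The sketch below assumes the output has exactly the `name: integer` shape shown above (real runs may include extra log lines, which are simply ignored); `parse_stats` is a hypothetical helper, and the expected counts are the ones listed on this page.

```python
import re

# Document counts listed on this page for each index.
EXPECTED_DOCS = {"train": 11314, "test": 7532, "all": 18846}

# Sample output copied from this page.
SAMPLE = """\
Index statistics
----------------
documents: 18846
documents (non-empty): 18846
unique terms: 165633
total terms: 4219956
"""


def parse_stats(text):
    """Parse 'name: integer' lines from the -stats output into a dict."""
    stats = {}
    for line in text.splitlines():
        match = re.fullmatch(r"\s*(.+?):\s*(\d+)\s*", line)
        if match:
            stats[match.group(1)] = int(match.group(2))
    return stats


if __name__ == "__main__":
    stats = parse_stats(SAMPLE)
    # The combined index should hold the train and test documents together.
    assert stats["documents"] == EXPECTED_DOCS["all"]
    print(stats)
```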
For reference, the number of docs indexed should be exactly as follows:
| | # of docs | pre-built index |
|---|---|---|
| Train | 11,314 | [download] |
| Test | 7,532 | [download] |
| Train + Test | 18,846 | [download] |
For convenience, we also provide pre-built indexes above.
Reproduction Log*
- Results reproduced by @stephaniewhoo on 2020-11-24 (commit b7f1f08)
- Results reproduced by @b8zhong on 2024-11-27 (commit a5e6771)