Update README and fix config typo
lgrz committed Sep 4, 2020
1 parent f8f9fef commit bc7855e
Showing 2 changed files with 27 additions and 13 deletions.
38 changes: 26 additions & 12 deletions README.md
@@ -10,16 +10,16 @@ Collections_ from the CIKM 2020 Resource Track.
 
 ## Prerequisites for Building the Dataset
 
-* Indri index of ClueWeb09B (example [config provided][clueindri])
+* Indri index of ClueWeb09B ([example config][clueindri])
 * ~350GiB RAM
 * ~300GiB disk space
-* Webgraph data: [ClueWeb09_WG_50m.graph-txt.gz][graph] and [ClueB-ID-DOCNO.txt.tar.gz][iddocno]
+* Webgraph data [ClueWeb09_WG_50m.graph-txt.gz][graph] and [ClueB-ID-DOCNO.txt.tar.gz][iddocno]. Once downloaded decompress the `ClueB-ID-DOCNO.txt.tar.gz`:
+- `ClueWeb09B_WG_50m.graph-txt.gz` leave this as is.
+- `ClueB-ID-DOCNO.txt.tar.gz` decompress to `ClueB-ID-DOCNO.txt`.
 * The [gradle][gradleversion] build system was used for the AlexaRank data
-* gcc 8 (not tested with clang)
-
-[clueindri]: config/clueweb09b.xml
-[graph]: http://boston.lti.cs.cmu.edu/clueweb09/WebGraph/ClueWeb09_WG_50m.graph-txt.gz
-[iddocno]: http://boston.lti.cs.cmu.edu/clueweb09/pagerank/ClueB-ID-DOCNO.txt.tar.gz
+* GCC 8.x (not tested with Clang)
+* Boost (tested with 1.65.1)
+* Cmake 3.x
 
 ## Environment Setup
 
@@ -41,17 +41,31 @@ pip install -r requirements.txt
 ## Build the Dataset
 
 1. Copy configuration template: `cp config/dataset.dist config/dataset`
-2. Edit `config/dataset`
+2. Edit `config/dataset` and configure the following variables:
+- `INDRI_INDEX_PATH` - path to existing ClueWeb09B Indri index ([example config][clueindri])
+- `FXT_INDEX_PATH` - path where the Fxt index will be created
+- `BOOST_INCLUDE_PATH` - path to Boost headers
+- `BOOST_LIBRARY_PATH` - path to Boost libraries
+- `INDRI_INCLUDE_PATH` - path to Indri headers
+- `INDRI_LIBRARY_PATH` - path to Indri libraries
+- `WEBGRAPH_PATH` - path to `ClueWeb09_WG_50m.graph-txt.gz` (gzipped).
+- `GRAPHPAIRS_PATH` - path to `ClueB-ID-DOCNO.txt` (decompressed).
 3. Run `./src/dataset/main.sh`
 4. Come back in a day or so...
+5. Dataset files `build/cikm20ltr`
 
+[clueindri]: config/clueweb09b.xml
+[graph]: http://boston.lti.cs.cmu.edu/clueweb09/WebGraph/ClueWeb09_WG_50m.graph-txt.gz
+[iddocno]: http://boston.lti.cs.cmu.edu/clueweb09/pagerank/ClueB-ID-DOCNO.txt.tar.gz
+
 ## Reproduce Experiments
 
-* Run `./src/experiment/main.sh`
-* Come back in ~10 minutes...
-* `cat` the results: `for i in build/result/wt??/test/eval/*.txt; do echo $i; cat $i; done`
+1. Run `./src/experiment/main.sh`
+2. Come back in ~10 minutes...
+3. `cat` the results: `for i in build/result/wt??/test/eval/*.txt; do echo $i; cat $i; done`
+4. TREC run files `build/result/wt??/test/run`
 
-## AlexaRank Data (Notes)
+## AlexaRank Notes
 
 The snapshot for the AlexaRank data is from [2010][alexarank].
 This was the temporally closest working snapshot to Jan-Feb 2009 for
2 changes: 1 addition & 1 deletion config/fxt-cw09.ini
@@ -317,7 +317,7 @@ f_bm25_bigram_u8 = 1
 f_bm25_tp_dist_w100 = 1
 
 ; Enable/disable feature f_sdm
-f_sdm = 1
+f_sdm = 0
 
 ; Enable/disable feature f_tag_title_qry_count
 f_tag_title_qry_count = 1
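
The change to `config/fxt-cw09.ini` flips a single `0`/`1` feature toggle, and a flip like that is easy to miss in a long config file. As a rough illustration only, the sketch below lists which toggles are disabled in a fragment shaped like the hunk above. The helper name `parse_feature_flags` is hypothetical and is not part of this repository or of Fxt. The sample text is copied from the hunk, and the parsing rules (`;` starts a comment, `[name]` section headers are skipped, values are `0` or `1`) are assumptions about the file format rather than a documented specification.

```python
# Sketch only: SAMPLE mirrors the lines visible in the hunk above (new side).
# The real config/fxt-cw09.ini may use syntax this parser does not handle.
SAMPLE = """\
f_bm25_tp_dist_w100 = 1

; Enable/disable feature f_sdm
f_sdm = 0

; Enable/disable feature f_tag_title_qry_count
f_tag_title_qry_count = 1
"""


def parse_feature_flags(text):
    """Return {feature_name: enabled} for every `f_* = 0|1` line in text.

    Hypothetical helper written for this example.
    """
    flags = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines, `;` comments, and (assumed) [section] headers.
        if not line or line.startswith(";") or line.startswith("["):
            continue
        key, sep, value = line.partition("=")
        key, value = key.strip(), value.strip()
        # Keep only well-formed feature toggles; ignore anything else.
        if sep and key.startswith("f_") and value in ("0", "1"):
            flags[key] = value == "1"
    return flags


flags = parse_feature_flags(SAMPLE)
disabled = sorted(name for name, enabled in flags.items() if not enabled)
print("disabled features:", disabled)
```

To check a real checkout, the file contents could be passed in place of `SAMPLE`, for example `parse_feature_flags(open("config/fxt-cw09.ini").read())`, assuming the file follows the same conventions throughout.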