This page contains instructions for running BM25 baselines on the MS MARCO passage ranking task. Note that there is a separate MS MARCO document ranking task. We also have a separate page describing document expansion experiments (doc2query) for this task.
Setup Note: If you're instantiating an Ubuntu VM on your own machine or in the cloud (e.g., AWS or GCP) for this task, be sure to provision enough resources, since some steps take a while to finish: more than 6 GB of RAM and roughly 100 GB of (SSD) storage. This will save you from having to go back and fix the machine configuration repeatedly.
If you're a Waterloo undergraduate going through this guide as the screening exercise for joining my research group, try to understand what you're actually doing, instead of simply cargo culting (i.e., blindly copying and pasting commands into a shell). In particular, you'll want to pay attention to the "What's going on here?" sections.
What's going on here?
As a really high level summary: in the MS MARCO passage ranking task, you're given a bunch of passages to search and a bunch of queries. The system's task is to return the best passages for each query (i.e., passages that are relevant).
Note that "the things you're searching" are called documents (in the generic sense), even though they're actually passages (extracted from web pages) in this case. You could be search web pages, PDFs, Excel spreadsheets, and even podcasts. Information retrieval researchers refer to these all as "documents".
We're going to use the repository's root directory as the working directory. First, we need to download and extract the MS MARCO passage dataset:
mkdir collections/msmarco-passage
wget https://msmarco.blob.core.windows.net/msmarcoranking/collectionandqueries.tar.gz -P collections/msmarco-passage
# Alternative mirror:
# wget https://www.dropbox.com/s/9f54jg2f71ray3b/collectionandqueries.tar.gz -P collections/msmarco-passage
tar xvfz collections/msmarco-passage/collectionandqueries.tar.gz -C collections/msmarco-passage
To confirm, `collectionandqueries.tar.gz` should have an MD5 checksum of `31644046b18952c1386cd4564ba2ae69`.
What's going on here?
If you peek inside the collection:
head collections/msmarco-passage/collection.tsv
You'll see that `collection.tsv` contains the passages that we're searching. Each line represents a passage: the first column contains a unique identifier for the passage (called the `docid`) and the second column contains the text of the passage itself.
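If you'd rather poke at the data in Python than with `head`, here's a minimal sketch that reads the first few lines and splits out the two columns (paths assume the layout above):

```python
# Peek at the first few passages: each line is "docid<TAB>passage text".
with open('collections/msmarco-passage/collection.tsv', encoding='utf-8') as f:
    for i, line in enumerate(f):
        docid, text = line.rstrip('\n').split('\t', 1)
        print(f'{docid}\t{text[:70]}...')
        if i == 4:
            break
```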
Next, we need to convert the MS MARCO tsv collection into Anserini's jsonl files (which have one json object per line):
python tools/scripts/msmarco/convert_collection_to_jsonl.py \
--collection-path collections/msmarco-passage/collection.tsv \
--output-folder collections/msmarco-passage/collection_jsonl
The above script should generate 9 jsonl files in `collections/msmarco-passage/collection_jsonl`, each with 1M lines (except for the last one, which should have 841,823 lines).
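Conceptually, the conversion is just a change of format: each TSV line becomes one JSON object with `id` and `contents` fields, which is the shape Anserini's `JsonCollection` expects. Here's a rough sketch of the idea (illustrative only; the actual script also shards the output into the nine files mentioned above, while this just prints the first converted object):

```python
import json

# Rough sketch of the TSV-to-jsonl conversion: one JSON object per line,
# with "id" and "contents" fields. (Illustrative; the repo's
# convert_collection_to_jsonl.py is the authoritative version.)
with open('collections/msmarco-passage/collection.tsv', encoding='utf-8') as tsv:
    for line in tsv:
        docid, text = line.rstrip('\n').split('\t', 1)
        print(json.dumps({'id': docid, 'contents': text}))
        break  # just show the first object
```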
We can now index these docs as a `JsonCollection` using Anserini:
sh target/appassembler/bin/IndexCollection -threads 9 -collection JsonCollection \
-generator DefaultLuceneDocumentGenerator -input collections/msmarco-passage/collection_jsonl \
-index indexes/msmarco-passage/lucene-index-msmarco -storePositions -storeDocvectors -storeRaw
Upon completion, we should have an index with 8,841,823 documents. The indexing speed may vary; on a modern desktop with an SSD, indexing takes a couple of minutes.
The development set contains over 100k queries, so retrieving results for all of them would take a long time. To speed things up, we use only the queries that appear in the qrels file:
python tools/scripts/msmarco/filter_queries.py \
--qrels collections/msmarco-passage/qrels.dev.small.tsv \
--queries collections/msmarco-passage/queries.dev.tsv \
--output collections/msmarco-passage/queries.dev.small.tsv
The output queries file should contain 6980 lines.
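The filtering itself is simple: keep only the queries whose `qid` appears in the qrels file. Here's a minimal sketch of the idea (illustrative; the repo's `filter_queries.py` is the authoritative version):

```python
# Keep only the queries that have relevance judgments, i.e., whose qid
# appears in the qrels file. (Illustrative sketch of the idea behind
# filter_queries.py, not the script itself.)
judged_qids = set()
with open('collections/msmarco-passage/qrels.dev.small.tsv', encoding='utf-8') as qrels:
    for line in qrels:
        judged_qids.add(line.split()[0])

kept = 0
with open('collections/msmarco-passage/queries.dev.tsv', encoding='utf-8') as queries:
    for line in queries:
        qid, query = line.rstrip('\n').split('\t')
        if qid in judged_qids:
            kept += 1

print(f'{kept} queries have judgments')  # should print 6980
```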
What's going on here?
Check out the contents of the queries file:
$ head collections/msmarco-passage/queries.dev.small.tsv
1048585 what is paula deen's brother
2 Androgen receptor define
524332 treating tension headaches without medication
1048642 what is paranoid sc
524447 treatment of varicose veins in legs
786674 what is prime rate in canada
1048876 who plays young dr mallard on ncis
1048917 what is operating system misconfiguration
786786 what is priority pass
524699 tricare service number
These are the queries we're going to feed to the search engine.
The first column is a unique identifier for the query (called the `qid`) and the second column is the query itself.
These queries are taken from Bing search logs, so they're "realistic" web queries in that they may be ambiguous, contain typos, etc.
We can now perform a retrieval run using this smaller set of queries:
sh target/appassembler/bin/SearchMsmarco -hits 1000 -threads 1 \
-index indexes/msmarco-passage/lucene-index-msmarco \
-queries collections/msmarco-passage/queries.dev.small.tsv \
-output runs/run.msmarco-passage.dev.small.tsv
Note that by default, the above script uses BM25 with tuned parameters `k1=0.82`, `b=0.68`. The option `-hits` specifies the number of documents per query to be retrieved.
Thus, the output file should have approximately 6980 × 1000 = 6.9M lines.
Retrieval speed will vary by machine: on a modern desktop with an SSD, we can get ~0.07 s/query, so the run should finish in under ten minutes. We can perform multi-threaded retrieval by changing the `-threads` argument.
What's going on here?
Congratulations, you've performed your first retrieval run!
You feed a search engine a bunch of queries, and the retrieval run is the output of the search engine. For each query, the search engine gives back a ranked list of results (i.e., a list of hits).
Let's take a look:
$ head runs/run.msmarco-passage.dev.small.tsv
1048585 7187158 1
1048585 7187157 2
1048585 7187163 3
1048585 7546327 4
1048585 7187160 5
1048585 8227279 6
1048585 7617404 7
1048585 7187156 8
1048585 2298838 9
1048585 7187155 10
The first column is the `qid` (corresponding to the query). From above, we can see that `qid` 1048585 is the query "what is paula deen's brother". The second column is the `docid` of the retrieved result (i.e., the hit), and the third column is the rank position. That is, in a search interface, `docid` 7187158 would be shown in the top position, `docid` 7187157 would be shown in the second position, etc.
You can grep through the collection to see what the actual passage is:
$ grep 7187158 collections/msmarco-passage/collection.tsv
7187158 Paula Deen and her brother Earl W. Bubba Hiers are being sued by a former general manager at Uncle Bubba'sâ�¦ Paula Deen and her brother Earl W. Bubba Hiers are being sued by a former general manager at Uncle Bubba'sâ�
In this case, the hit seems relevant. That is, it answers the query. So here, the search engine did well.
Note that this particular passage is a bit dirty (garbage characters, dups, etc.)... but that's pretty much a fact of life when you're dealing with the web.
Finally, we can evaluate the retrieved documents using the official MS MARCO evaluation script:
python tools/scripts/msmarco/msmarco_passage_eval.py \
collections/msmarco-passage/qrels.dev.small.tsv runs/run.msmarco-passage.dev.small.tsv
And the output should be like this:
#####################
MRR @10: 0.18741227770955546
QueriesRanked: 6980
#####################
What's going on here?
So how do we know if a search engine is any good? One method is manual examination, which is what we did above. That is, we actually looked at the results by hand.
Obviously, this isn't scalable if we want to evaluate lots of queries... If only someone told us which documents were relevant to which queries...
Well, someone has! (Specifically, human editors hired by Microsoft Bing in this case.) These are captured in what are known as relevance judgments. Take a look:
$ grep 1048585 collections/msmarco-passage/qrels.dev.tsv
1048585 0 7187158 1
This says that `docid` 7187158 is relevant to `qid` 1048585, which confirms our intuition above. The file is in what is known as the qrels format. You can ignore the second column. The fourth column, "1", says that the `docid` is relevant. In some cases (though not here), that column might say "0", i.e., that the `docid` is not relevant.
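In code, the qrels boil down to a mapping from each `qid` to its set of relevant `docid`s. Here's a minimal sketch that loads them (paths as above):

```python
from collections import defaultdict

# Load the qrels into a qid -> set-of-relevant-docids lookup.
# Columns: qid, (ignored), docid, judgment.
relevant = defaultdict(set)
with open('collections/msmarco-passage/qrels.dev.small.tsv', encoding='utf-8') as f:
    for line in f:
        qid, _, docid, judgment = line.split()
        if int(judgment) > 0:
            relevant[qid].add(docid)

print(len(relevant))         # number of judged queries; should match the 6980 above
print(relevant['1048585'])   # should include docid 7187158
```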
With relevance judgments (qrels), we can now automatically evaluate the search engine output (i.e., the run). The final ingredient we need is a metric (i.e., how to score).
Here, we're using a metric called MRR, or mean reciprocal rank.
The idea is quite simple:
We look at where the relevant `docid` appears.
If it appears at rank 1, the system gets a score of one.
If it appears at rank 2, the system gets a score of 1/2.
If it appears at rank 3, the system gets a score of 1/3.
And so on.
MRR@10 means that we only go down to rank 10.
If the relevant `docid` doesn't appear in the top 10, then the system gets a score of zero.
That's the score of a query. We take the average of the scores across all queries (6980 in this case), and we arrive at the score for the entire run.
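To convince yourself you understand the metric, here's a minimal Python sketch that recomputes MRR@10 directly from the qrels and the run file (the official script remains the reference implementation, but this should arrive at essentially the same number):

```python
from collections import defaultdict

# Minimal sketch that recomputes MRR@10 from the qrels and the run file.
# Illustrative only; msmarco_passage_eval.py is the official implementation.

# qid -> set of relevant docids
relevant = defaultdict(set)
with open('collections/msmarco-passage/qrels.dev.small.tsv', encoding='utf-8') as f:
    for line in f:
        qid, _, docid, judgment = line.split()
        if int(judgment) > 0:
            relevant[qid].add(docid)

# qid -> reciprocal rank of the best-ranked relevant hit within the top 10
reciprocal_rank = defaultdict(float)
with open('runs/run.msmarco-passage.dev.small.tsv', encoding='utf-8') as f:
    for line in f:
        qid, docid, rank = line.split()
        if int(rank) <= 10 and docid in relevant.get(qid, set()):
            reciprocal_rank[qid] = max(reciprocal_rank[qid], 1.0 / int(rank))

# Average over all judged queries; a query with no relevant hit in the
# top 10 contributes a score of zero.
print(f'MRR@10: {sum(reciprocal_rank[q] for q in relevant) / len(relevant):.4f}')
```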
You can find this run on the MS MARCO Passage Ranking Leaderboard as the entry "BM25 (Lucene8, tuned)", so you've just reproduced (part of) a leaderboard submission!
We can also use the official TREC evaluation tool, `trec_eval`, to compute metrics other than MRR@10. For that we first need to convert the run and qrels files to the TREC format:
python tools/scripts/msmarco/convert_msmarco_to_trec_run.py \
--input runs/run.msmarco-passage.dev.small.tsv \
--output runs/run.msmarco-passage.dev.small.trec
python tools/scripts/msmarco/convert_msmarco_to_trec_qrels.py \
--input collections/msmarco-passage/qrels.dev.small.tsv \
--output collections/msmarco-passage/qrels.dev.small.trec
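For reference, a TREC run file has six whitespace-delimited columns: `qid`, the literal `Q0`, `docid`, rank, score, and a run tag. Since the MS MARCO run format carries no scores, the conversion has to synthesize one; the sketch below (illustrative, not the actual conversion script) uses the reciprocal of the rank so that sorting by score preserves the ranking, and writes to a separate file so it doesn't clobber the real script's output:

```python
# Illustrative sketch of the run conversion to TREC format
# ("qid Q0 docid rank score tag"). The reciprocal-rank score and the
# "Anserini" run tag are placeholders, not what the repo's script emits.
with open('runs/run.msmarco-passage.dev.small.tsv', encoding='utf-8') as msmarco_run, \
     open('runs/run.msmarco-passage.dev.small.example.trec', 'w', encoding='utf-8') as trec_run:
    for line in msmarco_run:
        qid, docid, rank = line.split()
        trec_run.write(f'{qid} Q0 {docid} {rank} {1.0 / int(rank):.6f} Anserini\n')
```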
Then, run the `trec_eval` tool:
tools/eval/trec_eval.9.0.4/trec_eval -c -mrecall.1000 -mmap \
collections/msmarco-passage/qrels.dev.small.trec runs/run.msmarco-passage.dev.small.trec
The output should be:
map all 0.1957
recall_1000 all 0.8573
Average precision and recall@1000 are the two metrics we care about the most.
What's going on here?
Don't worry so much about the details here for now.
The tl;dr is that there are different formats for run files and lots of different metrics you can compute.
`trec_eval` is a standard tool used by information retrieval researchers.
In fact, researchers have been trying to answer the question "how do we know if a search result is good and how do we measure it" for over half a century... and the question still has not been fully resolved. In short, it's complicated.
Note that this figure differs slightly from the value reported in Document Expansion by Query Prediction, which uses the Anserini (system-wide) default of `k1=0.9`, `b=0.4`.
Tuning was accomplished with `tools/scripts/msmarco/tune_bm25.py`, using the queries found here; the basic approach is grid search of parameter values in tenth increments. There are five different sets of 10k samples (generated using the `shuf` command). We tuned on each individual set and then averaged parameter values across all five sets (this has the effect of regularization); a rough sketch of the procedure appears after the discussion below.
In separate trials, we optimized for:
- recall@1000, since Anserini output serves as input to downstream rerankers (e.g., based on BERT), and we want to maximize the number of relevant documents the rerankers have to work with;
- MRR@10, for the case where Anserini output is directly presented to users (i.e., no downstream reranking).
It turns out that optimizing for MRR@10 and MAP yields the same settings.
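For concreteness, here's a rough sketch of the tuning loop described above. The grid bounds and sample file names are placeholders, and `evaluate()` is a hypothetical stand-in for running `SearchMsmarco` with a given parameter setting on one query sample and scoring the run (e.g., recall@1000 via `trec_eval`):

```python
# Sketch of the tuning procedure described above: grid search over k1 and b
# in tenth increments on each of five query samples, then average the
# per-sample best settings. Grid bounds, sample file names, and evaluate()
# are placeholders, not the actual tune_bm25.py implementation.

def evaluate(k1, b, sample):
    # Hypothetical helper: run retrieval with (k1, b) on this query sample
    # and return the target metric (e.g., recall@1000 or MRR@10).
    return 0.0

samples = [f'queries.dev.sample{i}.tsv' for i in range(5)]

best_per_sample = []
for sample in samples:
    candidates = [(evaluate(k1 / 10, b / 10, sample), k1 / 10, b / 10)
                  for k1 in range(1, 21)   # k1 in 0.1, 0.2, ..., 2.0
                  for b in range(1, 11)]   # b in 0.1, 0.2, ..., 1.0
    _, best_k1, best_b = max(candidates)
    best_per_sample.append((best_k1, best_b))

# Averaging the best settings across the five samples acts as a mild regularizer.
avg_k1 = sum(k1 for k1, _ in best_per_sample) / len(best_per_sample)
avg_b = sum(b for _, b in best_per_sample) / len(best_per_sample)
print(f'averaged parameters: k1={avg_k1:.2f}, b={avg_b:.2f}')
```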
Here's the comparison between the Anserini default and optimized parameters:
Setting | MRR@10 | MAP | Recall@1000 |
---|---|---|---|
Default (`k1=0.9`, `b=0.4`) | 0.1840 | 0.1926 | 0.8526 |
Optimized for recall@1000 (`k1=0.82`, `b=0.68`) | 0.1874 | 0.1957 | 0.8573 |
Optimized for MRR@10/MAP (`k1=0.60`, `b=0.62`) | 0.1892 | 0.1972 | 0.8555 |
To reproduce these results, the `SearchMsmarco` class above takes `k1` and `b` parameters as command-line arguments, e.g., `-k1 0.60 -b 0.62` (note that the default setting is `k1=0.82` and `b=0.68`).
Reproduction Log*
- Results reproduced by @ronakice on 2019-08-12 (commit `5b29d16`)
- Results reproduced by @MathBunny on 2019-08-12 (commit `5b29d16`)
- Results reproduced by @JMMackenzie on 2020-01-08 (commit `f63cd22`)
- Results reproduced by @edwinzhng on 2020-01-08 (commit `5cc923d`)
- Results reproduced by @LuKuuu on 2020-01-15 (commit `f21137b`)
- Results reproduced by @kevinxyc1 on 2020-01-18 (commit `798cb3a`)
- Results reproduced by @nikhilro on 2020-01-21 (commit `631589e`)
- Results reproduced by @yuki617 on 2020-03-29 (commit `074723c`)
- Results reproduced by @weipang142857 on 2020-04-20 (commit `074723c`)
- Results reproduced by @HangCui0510 on 2020-04-23 (commit `0ae567d`)
- Results reproduced by @x65han on 2020-04-25 (commit `f5496b9`)
- Results reproduced by @y276lin on 2020-04-26 (commit `8f48f8e`)
- Results reproduced by @stephaniewhoo on 2020-04-26 (commit `8f48f8e`)
- Results reproduced by @eiston on 2020-05-04 (commit `dd84a5a`)
- Results reproduced by @rohilg on 2020-05-09 (commit `20ee950`)
- Results reproduced by @wongalvis14 on 2020-05-09 (commit `ebac5d6`)
- Results reproduced by @YimingDou on 2020-05-14 (commit `3b0a642`)
- Results reproduced by @richard3983 on 2020-05-14 (commit `a65646f`)
- Results reproduced by @MXueguang on 2020-05-20 (commit `3b2751e`)
- Results reproduced by @shaneding on 2020-05-23 (commit `b6e0367`)
- Results reproduced by @adamyy on 2020-05-28 (commit `94893f1`)
- Results reproduced by @kelvin-jiang on 2020-05-28 (commit `d55531a`)
- Results reproduced by @TianchengY on 2020-05-28 (commit `2947a16`)
- Results reproduced by @stariqmi on 2020-05-28 (commit `4914305`)
- Results reproduced by @justinborromeo on 2020-06-10 (commit `7954eab`)
- Results reproduced by @yxzhu16 on 2020-07-03 (commit `68ace26`)
- Results reproduced by @LizzyZhang-tutu on 2020-07-13 (commit `8c98d5b`)
- Results reproduced by @estella98 on 2020-07-29 (commit `99092a8`)
- Results reproduced by @tangsaidi on 2020-08-19 (commit `aba846`)
- Results reproduced by @qguo96 on 2020-09-07 (commit `e16b3c1`)
- Results reproduced by @yuxuan-ji on 2020-09-08 (commit `0f9a8ec`)
- Results reproduced by @wiltan-uw on 2020-09-09 (commit `93d913f`)
- Results reproduced by @JeffreyCA on 2020-09-13 (commit `bc2628b`)
- Results reproduced by @jhuang265 on 2020-10-15 (commit `66711b9`)
- Results reproduced by @rayyang29 on 2020-10-27 (commit `ad8cc5a`)
- Results reproduced by @Dahlia-Chehata on 2020-11-11 (commit `22c0ad3`)
- Results reproduced by @rakeeb123 on 2020-12-07 (commit `f50dcce`)
- Results reproduced by @jrzhang12 on 2021-01-02 (commit `be4e44d`)
- Results reproduced by @HEC2018 on 2021-01-04 (commit `4de21ec`)
- Results reproduced by @KaiSun314 on 2021-01-08 (commit `113f1c7`)
- Results reproduced by @yemiliey on 2021-01-18 (commit `179c242`)
- Results reproduced by @larryli1999 on 2021-01-22 (commit `3f9af5`)
- Results reproduced by @ArthurChen189 on 2021-04-08 (commit `45a5a21`)
- Results reproduced by @printfCalvin on 2021-04-11 (commit `d808d4a`)
- Results reproduced by @saileshnankani on 2021-04-26 (commit `5781c87`)
- Results reproduced by @andrewyguo on 2021-04-29 (commit `71f3ca6`)
- Results reproduced by @mayankanand007 on 2021-05-04 (commit `906ca50`)
- Results reproduced by @Albert-Ma on 2021-05-07 (commit `5bcbccd`)
- Results reproduced by @rootofallevii on 2021-05-14 (commit `626da95`)