
# Ingesting CORD-19 into Solr and Elasticsearch

This document describes how to ingest the COVID-19 Open Research Dataset (CORD-19) from the Allen Institute for AI into Solr and Elasticsearch. If you want to build or download Lucene indexes for CORD-19, see this guide.

## Getting the Data

Follow the instructions here to get access to the data. This guide has been verified to work with the release of 2020/07/16, which is the corpus used in round 5 of the TREC-COVID challenge.

Download the corpus using our script:

```bash
python src/main/python/trec-covid/index_cord19.py --date 2020-07-16 --download
```
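
If the download succeeded, the corpus should be sitting at `collections/cord19-2020-07-16`, the same path the indexing commands below expect. A quick sanity check (the exact contents of the release are an assumption here, but you should at least see the metadata file and document parses):

```bash
# Confirm the corpus landed where the indexing commands expect it.
ls collections/cord19-2020-07-16
```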

## Solr

From the Solr archives, download the Solr (non-src) version that matches Anserini's Lucene version into the anserini/ directory.

Extract the archive:

```bash
mkdir solrini && tar -zxvf solr*.tgz -C solrini --strip-components=1
```

Start Solr (adjust memory usage with -m as appropriate):

```bash
solrini/bin/solr start -c -m 8G
```
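
Before proceeding, you can confirm that Solr came up in cloud mode using the standard status command:

```bash
# Reports the running instance's port, memory, and cloud mode status.
solrini/bin/solr status
```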

Run the Solr bootstrap script to copy the Anserini JAR into Solr's classpath and upload the configsets to Solr's internal ZooKeeper:

```bash
pushd src/main/resources/solr && ./solr.sh ../../../../solrini localhost:9983 && popd
```

Solr should now be available at http://localhost:8983/ for browsing.

Next, create the collection:

```bash
solrini/bin/solr create -n anserini -c cord19
```

Adjust the schema (if there are errors, follow the instructions below):

```bash
curl -X POST -H 'Content-type:application/json' --data-binary @src/main/resources/solr/schemas/cord19.json \
 http://localhost:8983/solr/cord19/schema
```

Note: If there are errors from field conflicts, you'll need to reset the configset and recreate the collection (select [All] for the fields to replace):

```bash
solrini/bin/solr delete -c cord19
pushd src/main/resources/solr && ./solr.sh ../../../../solrini localhost:9983 && popd
solrini/bin/solr create -n anserini -c cord19
```
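
Either way, you can verify that the schema was applied by listing the collection's fields through Solr's Schema API (a read-only check):

```bash
# List the fields currently defined for the cord19 collection.
curl http://localhost:8983/solr/cord19/schema/fields
```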

We can now index into Solr:

```bash
sh target/appassembler/bin/IndexCollection -collection Cord19AbstractCollection -generator Cord19Generator \
 -threads 8 -input collections/cord19-2020-07-16 \
 -solr -solr.index cord19 -solr.zkUrl localhost:9983 \
 -storePositions -storeDocvectors -storeContents -storeRaw
```

Once indexing is complete, you can query in Solr at http://localhost:8983/solr/#/cord19/query.

You'll need to make sure your query is searching the `contents` field, so the query should look something like `contents:"incubation period"`.
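
The same query can also be issued from the command line through the select handler, with the quoted phrase URL-encoded:

```bash
# Phrase query against the contents field; returns the top 10 hits as JSON.
curl 'http://localhost:8983/solr/cord19/select?q=contents:%22incubation%20period%22'
```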

## Elasticsearch + Kibana

From the Elasticsearch archives, download the correct distribution for your platform to the anserini/ directory. The instructions below have been verified to work with version 7.10.0.

First, unpack and deploy Elasticsearch:

```bash
mkdir elastirini && tar -zxvf elasticsearch*.tar.gz -C elastirini --strip-components=1
elastirini/bin/elasticsearch
```
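
Elasticsearch takes a few moments to start; you can poll the cluster health endpoint to confirm it's up before moving on:

```bash
# A "yellow" or "green" status means the node is ready to accept requests.
curl 'http://localhost:9200/_cluster/health?pretty'
```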

Unpack and deploy Kibana:

```bash
tar -zxvf kibana*.tar.gz -C elastirini --strip-components=1
elastirini/bin/kibana
```

Elasticsearch has a built-in safeguard that disables indexing if you're running low on disk space. The error looks something like "flood stage disk watermark [95%] exceeded on ...", with indexes placed into read-only mode. Obviously, be careful, but if you're sure you won't run out of disk space, disable the safeguard as follows:

```bash
curl -XPUT -H "Content-Type: application/json" http://localhost:9200/_cluster/settings \
  -d '{ "transient": { "cluster.routing.allocation.disk.threshold_enabled": false } }'
```
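
Note that if an index has already been flipped to read-only by the watermark, disabling the threshold alone doesn't unblock it. Clearing the block explicitly should look something like this (the standard Elasticsearch index settings API):

```bash
# Remove the read-only-allow-delete block from all indexes.
curl -XPUT -H "Content-Type: application/json" http://localhost:9200/_all/_settings \
  -d '{ "index.blocks.read_only_allow_delete": null }'
```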

Set up the proper schema using this config:

```bash
cat src/main/resources/elasticsearch/index-config.cord19.json \
 | curl --user elastic:changeme -XPUT -H 'Content-Type: application/json' 'localhost:9200/cord19' -d @-
```
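
To double-check that the index was created with the intended settings and mappings:

```bash
# Dump the settings and mappings for the cord19 index.
curl --user elastic:changeme 'localhost:9200/cord19?pretty'
```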

Indexing abstracts:

```bash
sh target/appassembler/bin/IndexCollection -collection Cord19AbstractCollection -generator Cord19Generator \
 -es -es.index cord19 -threads 8 -input collections/cord19-2020-07-16 -storePositions -storeDocvectors -storeContents -storeRaw
```
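
Once indexing finishes, a quick way to spot-check the index from the command line is a match query against the same `contents` field used in the Solr section:

```bash
# Full-text match query; returns the top-scoring documents as JSON.
curl -XGET --user elastic:changeme -H 'Content-Type: application/json' \
  'localhost:9200/cord19/_search?pretty' \
  -d '{ "query": { "match": { "contents": "incubation period" } } }'
```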

We are now able to access interactive search and visualization capabilities from Kibana at http://localhost:5601/.

Here's an example: in the above webapp, create an "Index Pattern". Set the index pattern to `cord19` and use `publish_time` as the time filter. Then navigate to "Discover" in Kibana to run a search. If you're not getting any results, be sure to expand the date range next to the search bar.

## Reproduction Log*

+ Reproduced by @adamyy on 2020-05-29 (commit `2947a16`) on CORD-19 release of 2020/05/26.
+ Reproduced by @yxzhu16 on 2020-07-17 (commit `fad12be`) on CORD-19 release of 2020/06/19.
+ Reproduced by @LizzyZhang-tutu on 2020-07-26 (commit `fad12be`) on CORD-19 release of 2020/07/25.
+ Reproduced by @lintool on 2020-11-23 (commit `746447`) on CORD-19 release of 2020/07/16 with Solr v8.3.0 and ES/Kibana v7.10.0.