Vespa powered search of the CORD-19 dataset

Query API

By default, searches from the frontend use the weakAnd query type.

Deduping

The CORD-19 dataset contains many near-duplicates. For every search request, we dedup the top 100 results using document-to-document similarity over the SPECTER embeddings, with a similarity threshold. The dedup functionality is implemented in a Searcher.
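The dedup step described above can be illustrated with a minimal Python sketch. This is not the actual Searcher (which runs inside Vespa); the greedy strategy, the `embedding` key on each hit, and the threshold value of 0.95 are assumptions made for illustration only.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def dedup_hits(hits, threshold=0.95, top_k=100):
    """Greedily drop near-duplicates among the first top_k hits.

    A hit is kept only if its embedding is less similar than `threshold`
    to every hit already kept. Hits beyond top_k pass through unchanged.
    The 'embedding' key and the default threshold are hypothetical.
    """
    head, tail = hits[:top_k], hits[top_k:]
    kept = []
    for hit in head:
        if all(cosine(hit["embedding"], k["embedding"]) < threshold for k in kept):
            kept.append(hit)
    return kept + tail
```

Because the loop walks the hits in ranked order, the highest-ranked member of each near-duplicate cluster is the one that survives.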

API Access

Frontend: https://cord19.vespa.ai/
Full API access: https://api.cord19.vespa.ai/search/

For details on using the Vespa Search API, see the API documentation and the YQL Query Language reference. For the full document definition, see doc.sd.

High level field description

These are the most important fields in the dataset:

field	source in CORD-19	indexed/searchable	summary (returned with hit)	available for grouping	matching	Vespa type
default	title + abstract	yes	no	no	tokenized and stemmed (match:text)	fieldset
title	title from metadata	yes	yes with bolding	no	tokenized and stemmed (match:text)	string
abstract	abstract from metadata	yes	yes with bolding and dynamic summary	no	tokenized and stemmed (match:text)	string
journal	journal	yes	yes	yes	exact matching	string
source	source	yes	yes	yes	exact matching	string
doi	https:// + doi from metadata	no	yes	no	no	string
id	row id from metadata.csv	yes	yes	yes	yes	int
authors	authors in metadata or authors from sha json if found	yes using sameElement()	yes	yes	yes	array of struct
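As a sketch of how the table above translates into requests, the helpers below build one request body that filters on the exact-match `journal` field and one that groups results by `journal`. The helper names and the example journal value are hypothetical; the grouping expression uses Vespa's standard `all(group(...) each(output(count())))` form, but it has not been checked against this particular deployment.

```python
def build_journal_filter_request(query, journal, hits=5):
    """Request body restricting userQuery() matches to one journal (exact match)."""
    # Escape backslashes and double quotes so the value is a valid YQL string literal.
    escaped = journal.replace("\\", "\\\\").replace('"', '\\"')
    yql = (
        "select id, title, journal from sources * "
        f'where userQuery() and journal contains "{escaped}";'
    )
    return {"yql": yql, "query": query, "type": "all", "hits": hits, "ranking": "bm25"}

def build_journal_grouping_request(query, max_groups=10):
    """Request body counting matching documents per journal, returning no hits."""
    yql = (
        "select id from sources * where userQuery() "
        f"| all(group(journal) max({max_groups}) order(-count()) each(output(count())));"
    )
    return {"yql": yql, "query": query, "type": "all", "hits": 0}

# Example with a hypothetical journal value:
filter_request = build_journal_filter_request("coronavirus vaccine", "The Lancet")
```

Either dictionary can be sent as the JSON body of a POST request to the search endpoint.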

Ranking

See Vespa's Ranking documentation. There are two ranking profiles available:

Ranking	Description
bm25	Linear sum: bm25(title) + bm25(abstract)
colbert	Linear sum of colbert maxsim over title and abstract

See Vespa BM25 and ColBERT.
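To make the colbert row concrete, here is a minimal pure-Python sketch of the MaxSim operator and its linear sum over title and abstract. It assumes token embeddings are plain lists of floats and uses dot product as the token similarity; the actual ranking expression lives in doc.sd and may differ in detail.

```python
def maxsim(query_vecs, doc_vecs):
    """ColBERT-style MaxSim: for each query token vector, take the largest
    dot product against any document token vector, then sum over query tokens."""
    if not doc_vecs:
        return 0.0
    total = 0.0
    for q in query_vecs:
        total += max(sum(qi * di for qi, di in zip(q, d)) for d in doc_vecs)
    return total

def colbert_score(query_vecs, title_vecs, abstract_vecs):
    """Linear sum of MaxSim over the title and abstract token embeddings."""
    return maxsim(query_vecs, title_vecs) + maxsim(query_vecs, abstract_vecs)
```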

The ranking profiles are defined in the document definition (doc.sd).

Example API queries

For details on using the Vespa Search API, see the API documentation and the YQL Query Language reference. The examples below use Python with the requests library and the POST search API.

import requests 

#Search for documents matching all query terms (either in title or abstract)
search_request_all = {
  'yql': 'select id, title, abstract, doi from sources * where userQuery();',
  'hits': 5,
  'summary': 'short',
  'query': 'coronavirus temperature sensitivity',
  'type': 'all',
  'ranking': 'bm25'
}

#Search for documents matching any of query terms (either in title or abstract)
search_request_any = {
  'yql': 'select id, title, abstract, doi from sources * where userQuery();',
  'hits': 5,
  'summary': 'short',
  'query': 'coronavirus temperature sensitivity',
  'type': 'any',
  'ranking': 'colbert'
}

#Search for documents matching with weakAnd of query terms (either in title or abstract)
search_request_weakand = {
  'yql': 'select id, title, abstract, doi from sources * where userQuery();',
  'hits': 5,
  'summary': 'short',
  'query': 'coronavirus temperature sensitivity',
  'type': 'weakAnd',
  'ranking': 'colbert'
}

#Search the authors field, which is an array of struct, using the sameElement operator
search_request_authors = {
  'yql': 'select id,authors from sources * where authors contains sameElement(first contains "Keith", last contains "Mansfield");',
  'hits': 5,
  'summary': 'short'
}

#Sample request 
endpoint='https://api.cord19.vespa.ai/search/'
response = requests.post(endpoint, json=search_request_all)
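The response body follows Vespa's default JSON result format, in which hits are listed under `root.children` and each hit carries its summary fields under `fields`. A minimal sketch for flattening that structure is shown below; the helper name `extract_hits` is hypothetical.

```python
def extract_hits(result):
    """Flatten a Vespa JSON search result into a list of dicts.

    Each returned dict holds the hit's relevance score plus its summary
    fields. Children without a 'fields' entry (for example grouping
    results) are skipped.
    """
    children = result.get("root", {}).get("children", [])
    return [
        {"relevance": child.get("relevance"), **child.get("fields", {})}
        for child in children
        if "fields" in child
    ]
```

With the sample request above, `extract_hits(response.json())` would give one dictionary per returned document, assuming the request succeeded.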
