Search engine results underpin many consequential decision making tasks. Examples include people using search technologies to seek health advice online, and time-pressured clinicians relying on search results to decide upon the best treatment/diagnosis/test for a patient.
A key problem when using search engines to complete such decision making tasks is whether users are able to discern authoritative from unreliable information, and correct from incorrect information. This problem is further exacerbated when the search occurs within uncontrolled data collections, such as the web, where information can be unreliable, misleading, or too technical, and can lead to unfounded escalations (White & Horvitz, 2009). Information from search engine results can significantly influence decisions, and research shows that increasing the amount of incorrect information about a topic presented in a SERP can lead users to make incorrect decisions (Pogacar et al., 2017). As noted in the SWIRL III report (Culpepper et al., 2018), decision making with search engines is poorly understood, and evaluation measures for these search tasks need to be developed and improved.
This track aims to:
- Provide a venue for research on retrieval methods that promote better decision making with search engines, and
- Develop new online and offline evaluation methods to predict the decision making quality induced by search results.
The track is planned over multiple years, with data and resources created in one year flowing into the next year. We plan for the track to run for at least 3 years.
Participants devise search technologies that promote correct information over incorrect information, with the assumption that correct information can better lead people to make correct decisions.
Note: this task is more than simply a new definition of what is relevant. There are three types of results: correct and relevant, incorrect, and non-relevant. It is important that search results avoid incorrect results; ranking non-relevant results above incorrect ones is preferred. Where notions of correctness do not apply, the credibility of the information source is useful, and relevant, credible information is preferred.
Evaluation measures will consider relevance beyond topicality, including correctness of information and credibility.
Year 1 task summary: Given a data collection and a set of topics (phrased as questions, see topics for more information), your task is to return relevant and credible information that will help searchers make correct decisions. A dual goal is to return relevant and correct information.
Following the year 1 assessment, the organizers will recruit test subjects to perform a decision making task using a selection of the year 1 runs. That is, test subjects will be given a fixed result list (selected from the participating teams' submitted runs) and a decision task. We will collect user interaction data as well as the users' decisions.
In addition to the ranking task, the track will have two evaluation tasks:
- Given a query, a document ranking (results list) and interaction data from real users (collected after year 1), predict the decisions users will take at the end of the search process, along with their confidence in those decisions. This simulates an online evaluation process.
- Given a query, a document ranking (results list), and assessments, predict the decision the user will take at the end of the search process, along with the confidence expressed by the user with respect to their decision. This simulates an offline evaluation process.
The Track plans to focus on topics within the consumer health search domain (people seeking health advice online) to form user stories (search topics). Consumer health search represents an ideal prototypical example of the consequential decisions that we want search engines to correctly support.
Unlike in previous tracks, assessors will not be creating their own topic statements. Instead, assessors will be provided with the topic query and narrative. The topics will be provided as XML files using the following format:
<topics>
  <topic>
    <number>156</number>
    <query>exercise scoliosis</query>
    <cochranedoi>10.1002/14651858.CD007837.pub2</cochranedoi>
    <description>Can exercises treat scoliosis?</description>
    <narrative>Scoliosis is spinal deformity, which occurs as sideways curvature, that can reduce productivity, cause acute pain or breathing problems depending on its severity. It has been suggested that scoliosis specific exercises can reduce deformity and treat scoliosis symptoms. A relevant document discusses whether exercises can help to treat scoliosis or improve lives of people with scoliosis.</narrative>
  </topic>
  <topic>
    ...
  </topic>
</topics>
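For convenience, the sketch below shows one way to read the topics file with Python's standard library; the file name `topics.xml` is an assumption, and the element names follow the format above.

```python
import xml.etree.ElementTree as ET

# Parse the topics file released by NIST (the file name here is an assumption).
tree = ET.parse("topics.xml")

topics = []
for topic in tree.getroot().findall("topic"):
    topics.append({
        "number": topic.findtext("number"),
        "query": topic.findtext("query"),
        "cochranedoi": topic.findtext("cochranedoi"),
        "description": topic.findtext("description"),
        "narrative": topic.findtext("narrative"),
    })
```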
To see the full list of topics, please refer to the NIST website.
The collection used in TREC Decisions 2019 will be ClueWeb12-B13. Please refer to https://lemurproject.org/clueweb12/ for information on how to obtain the dataset.
The topics have a specified query to use. This query field replaces the traditional title field. Runs should be submitted using the query field as the query, as it would be given by a user to a search engine. Runs not using the query field, or using the description field, may also be submitted, but will need to be marked as "other" runs.
Submissions will follow the standard TREC run format. Each line of a ranked-results run file has the form:
qid Q0 docno rank score tag
where:
- qid: the topic number.
- Q0: unused; should always be Q0.
- docno: the official document id returned by your system for the topic qid.
- rank: the rank at which the document is retrieved.
- score: the score (integer or floating point) that generated the ranking. Scores must be in descending (non-increasing) order. The score is important for handling ties (trec_eval sorts documents by the score values, not by your rank values).
- tag: a tag that uniquely identifies your group AND the method you used to produce the run. Each run should have a different tag.
The fields should be separated by whitespace. The width of the columns is not important, but it is important to include all columns with some amount of whitespace between them.
An example run is shown below:
1 Q0 clueweb12-1018wb-57-17875 1 14.8928003311 myGroupNameMyMethodName
1 Q0 clueweb12-1311wb-18-06089 2 14.7590999603 myGroupNameMyMethodName
1 Q0 clueweb12-1113wb-13-13513 3 14.5707998276 myGroupNameMyMethodName
1 Q0 clueweb12-1200wb-47-01250 4 14.5642995834 myGroupNameMyMethodName
1 Q0 clueweb12-0205wb-37-09270 5 14.3723001481 myGroupNameMyMethodName
...
For each topic, please return 1,000 ranked documents.
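As an illustration of the expected output, here is a minimal sketch of writing a run file in this format. The `retrieve` function is a hypothetical placeholder for your own retrieval system, assumed to return (docno, score) pairs already sorted by descending score, and `topics` is a list of dicts as produced by the parsing sketch above.

```python
def write_run(run_path, topics, retrieve, tag, k=1000):
    # Write a TREC-format run file: qid Q0 docno rank score tag.
    # `retrieve(query, k)` is a hypothetical placeholder for your own system;
    # it is assumed to return (docno, score) pairs sorted by descending score.
    with open(run_path, "w") as out:
        for t in topics:
            for rank, (docno, score) in enumerate(retrieve(t["query"], k), start=1):
                out.write(f"{t['number']} Q0 {docno} {rank} {score} {tag}\n")
```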
Participating groups will be allowed to submit as many runs as they like, but need to ask permission before submitting more than 10 runs. Not all runs are likely to be used for pooling and groups will likely need to specify a preference ordering for pooling purposes.
Runs may be either automatic or manual. An automatic run is made without any tuning or manual intervention based on this year's topics. A manual run is anything that is not an automatic run. Manual runs commonly have some human input based on the topics, e.g., hand-crafted queries or relevance feedback. Best practice for automatic runs is to avoid using the topics, or even looking at them, until after all decisions have been made and the code to produce the run has been written.
NIST assessors will judge documents in three categories:
- Relevance: whether the document is relevant to the topic.
- Credibility: whether the document is considered credible by the assessor.
- Treatment Efficacy: whether the document contains correct information regarding the topic's treatment.
More information on the assessing guidelines is available here.
The submitted runs will be evaluated with respect to the following measures proposed by Lioma et al. (ICTIR'17, https://doi.org/10.1145/3121050.3121072):
- Normalised Local Rank Error (NLRE): compares the rank positions of documents pairwise and checks for errors, where an error is a misplacement of documents, i.e. a relevant or credible document placed after a non-relevant or non-credible document. The measure was originally defined to deal with relevance and credibility, but it has been extended to consider all three aspects: relevance, credibility and correctness.
- Normalised Weighted Cumulative Score (nWCS): is based on nDCG; it generates a single label out of the multiple aspects and computes standard nDCG.
- Convex Aggregating Measure (CAM): considers each aspect separately and computes either AP or nDCG with respect to the ranking scored against each single aspect. Finally, it computes the average AP or nDCG across the aspects (a minimal illustrative sketch follows this list).
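To make the aggregation concrete, below is a minimal sketch of CAM using nDCG as the per-aspect measure with equal aspect weights. The official evaluation will use the organizers' scripts; this function is only an illustration of the idea, and the aspect names and data layout are assumptions.

```python
import math

def ndcg(gains, all_gains, k=10):
    # Standard nDCG@k: `gains` are the graded labels of the ranked documents,
    # `all_gains` are the labels of all judged documents for the topic.
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(sorted(all_gains, reverse=True)[:k]))
    return dcg / idcg if idcg > 0 else 0.0

def cam(ranking, aspect_judgments, k=10):
    # `ranking` is a list of docnos in ranked order; `aspect_judgments` maps each
    # aspect (e.g. "relevance", "credibility", "correctness") to a dict of
    # docno -> graded label. Unjudged documents are assumed non-relevant (gain 0).
    scores = []
    for judgments in aspect_judgments.values():
        gains = [judgments.get(d, 0) for d in ranking]
        scores.append(ndcg(gains, list(judgments.values()), k))
    return sum(scores) / len(scores)
```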
We will also evaluate runs in terms of traditional relevance measures, e.g. nDCG and MAP, with a goal of comparing performance measures between the relevance only measures and the measures that combine relevance, credibility and correctness.
- (June 11) Final assessing guidelines available here
- (June 11, 2019) Topics are released [NIST website]
- (August 28, 2019; was August 1, 2019) Runs due.
  - Before 7am Aug 28 Eastern Time Zone (Gaithersburg, MD, USA).
  - Please check the TREC active participants page for the runs submission link.
- Results returned
- Notebook paper due
- (Nov 13-15, 2019) TREC Conference
- (Feb 6, 2020) Final report due.
Announcements and discussions will be posted in the Google Group.
For more information or to ask questions, join the Google Group.