TripClick

TripClick is a large-scale dataset of click logs in the health domain, obtained from user interactions with the Trip Database health web search engine. The click log dataset comprises approximately 5.2 million user interactions collected between 2013 and 2020. The dataset is accompanied by an IR evaluation benchmark and the files required to train deep learning IR models.

Paper: TripClick: The Log Files of a Large Health Web Search Engine

@inproceedings{rekabsaz2021tripclick,
    title={TripClick: The Log Files of a Large Health Web Search Engine},
    author={Rekabsaz, Navid and Lesota, Oleg and Schedl, Markus and Brassey, Jon and Eickhoff, Carsten},
    booktitle={Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'21), July 11–15, 2021, Virtual Event, Canada},
    doi={10.1145/3404835.3463242},
    year={2021},
    publisher={ACM}
}

Access Data

To gain access to one or more of the collection’s data packages, please fill out this form and send it to jon.brassey@tripdatabase.com, specifying the data packages you need and your intended use of the data.

Leaderboards

HEAD Queries - DCTR

Date | Description | Team | NDCG@10 (val) | RECALL@10 (val) | NDCG@10 (test) | RECALL@10 (test) | Paper | Code

HEAD Queries - RAW

Date | Description | Team | NDCG@10 (val) | RECALL@10 (val) | NDCG@10 (test) | RECALL@10 (test) | Paper | Code

TORSO Queries - RAW

Date | Description | Team | NDCG@10 (val) | RECALL@10 (val) | NDCG@10 (test) | RECALL@10 (test) | Paper | Code

TAIL Queries - RAW

Date | Description | Team | NDCG@10 (val) | RECALL@10 (val) | NDCG@10 (test) | RECALL@10 (test) | Paper | Code
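
The leaderboard metrics correspond to the trec_eval measures ndcg_cut_10 and recall_10. As a minimal sketch, they can be computed locally with the pytrec_eval library; the qrel and run dictionaries below are placeholders, loaded in practice from the benchmark files and your model's output.

import pytrec_eval

# Placeholder qrels and run; in practice these come from the benchmark's
# qrels files and your model's ranked results.
qrel = {"q1": {"d1": 1, "d2": 0, "d3": 2}}
run = {"q1": {"d1": 11.5, "d2": 9.8, "d3": 9.1}}

# 'ndcg_cut' and 'recall' expand to per-cutoff measures such as
# ndcg_cut_10 and recall_10, the metrics reported on the leaderboards.
evaluator = pytrec_eval.RelevanceEvaluator(qrel, {"ndcg_cut", "recall"})
for qid, measures in evaluator.evaluate(run).items():
    print(qid, measures["ndcg_cut_10"], measures["recall_10"])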

Submission Instruction

We look forward to your submissions, with the aim of fostering collaboration in the community and tracking progress on the benchmarks. To ensure the integrity of the official test results, the relevance information of the test set is not publicly available. You can submit TREC-formatted run files on the validation and test queries of any or all of the benchmarks. Please follow the instructions below for submitting run files.
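
As a minimal sketch of producing such a TREC-formatted run file in Python; the ranked results, output file name, and run tag below are placeholders:

# Ranked results per query, sorted by descending score -- placeholder data.
ranked = {"q1": [("d7", 12.3), ("d2", 11.8)]}

# One line per retrieved document: qid Q0 docid rank score runstring.
with open("run.trip.mymodel.head.test.txt", "w") as f:
    for qid, docs in ranked.items():
        for rank, (docid, score) in enumerate(docs, start=1):
            f.write(f"{qid} Q0 {docid} {rank} {score} my-run-tag\n")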

TripClick Data Description

Logs Dataset

The logs consist of the user interactions with the Trip search engine collected between January 2013 and October 2020. Approximately 5.2 million click log entries from around 1.6 million search sessions are available. The provided logs.tar.gz contains allarticles.txt, which lists the titles and URLs of all documents, and the <YYYY>-<MM>-<DD>.json files, which contain the log entries split by date, e.g. 2017-03-24.json. In the log files, each line is a single JSON-formatted log record.

File Name | Format | Description
allarticles.txt | tsv: id \t title \t url | article collection
<YYYY>-<MM>-<DD>.json | JSON | log records
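
As a minimal sketch, a daily log file can be read line by line, parsing each line as one JSON record (the record fields themselves are documented in the paper):

import json

# Each line of a daily log file is one JSON-formatted log record.
with open("2017-03-24.json") as f:
    for line in f:
        record = json.loads(line)
        print(sorted(record.keys()))  # inspect the available fields
        break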

IR Benchmark

The IR evaluation benchmark/collection is created from around 4 million click log entries that refer to documents indexed in the MEDLINE catalog. The collection has approximately 1.5 million documents and around 692,000 queries, split into three groups: HEAD, TORSO, and TAIL. The query-to-document relevance signals are derived using the RAW and Document Click-Through Rate (DCTR) click-through models. See the paper for more details. The code used to create the benchmark from the log files is available here.
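
As a rough illustration of the DCTR idea (not the benchmark's construction code linked above): relevance is estimated as the click-through rate per query-document pair. The record field names qid, shown_docs, and clicked_doc below are hypothetical.

from collections import Counter

def dctr_relevance(records):
    """Estimate relevance as the per-(query, document) click-through rate."""
    clicks, impressions = Counter(), Counter()
    for rec in records:
        # Hypothetical field names: qid, shown_docs, clicked_doc.
        for docid in rec["shown_docs"]:
            impressions[(rec["qid"], docid)] += 1
        if rec.get("clicked_doc") is not None:
            clicks[(rec["qid"], rec["clicked_doc"])] += 1
    return {qd: clicks[qd] / impressions[qd] for qd in impressions}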

To make the collection easier to use, we provide the benchmark in two formats, TREC-style and TSV. The contents of both formats are identical.

TREC format

File Name | Format | Description
documents/docs_grp_<[00-15]>.txt | TREC format | document collection split across 16 files
qrels/qrels.dctr.head.<[train, val]>.txt | qid, 0, docid, relevance | DCTR-based qrels in two files: (train, val)
qrels/qrels.raw.<[head, torso, tail]>.<[train, val]>.txt | qid, 0, docid, relevance | RAW-based qrels in six files: (train, val)*(head, torso, tail)
topics/topics.<[head, torso, tail]>.<[test, train, val]>.txt | TREC format | topics in nine files: (test, train, val)*(head, torso, tail)
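
As a minimal sketch, a qrels file in this layout can be loaded into a nested dictionary (using one of the DCTR files as an example):

from collections import defaultdict

# qrels[qid][docid] -> relevance, parsed from "qid 0 docid relevance" lines.
qrels = defaultdict(dict)
with open("qrels/qrels.dctr.head.train.txt") as f:
    for line in f:
        qid, _, docid, rel = line.split()
        qrels[qid][docid] = int(rel)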

TSV format

File Name | Format | Description
documents/docs.tsv | docid \t doctext | documents
qrels/qrels.dctr.head.<[train, val]>.tsv | qid \t 0 \t docid \t relevance | DCTR-based qrels in two files: (train, val)
qrels/qrels.raw.<[head, torso, tail]>.<[train, val]>.tsv | qid \t 0 \t docid \t relevance | RAW-based qrels in six files: (train, val)*(head, torso, tail)
topics/topics.<[head, torso, tail]>.<[test, train, val]>.tsv | qid \t qtext | topics in nine files: (test, train, val)*(head, torso, tail)
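
Likewise, a minimal sketch for loading one of the tab-separated topics files:

# topics files hold one "qid \t qtext" pair per line.
topics = {}
with open("topics/topics.head.val.tsv") as f:
    for line in f:
        qid, qtext = line.rstrip("\n").split("\t", 1)
        topics[qid] = qtext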

Training Package for Deep Learning Models

To facilitate the training of deep IR models, we also create and provide the required training files alongside the benchmark. The provided files follow a format similar to that of the MS MARCO collection.

File Name | Format | Description
run.trip.BM25.<[head, torso, tail]>.val.txt | TREC-like: qid, “Q0”, docid, rank, score, runstring | pre-ranking results, three files: (val)*(head, torso, tail)
runs_test/run.trip.BM25.<[head, torso, tail]>.test.txt | TREC-like: qid, “Q0”, docid, rank, score, runstring | pre-ranking results, three files: (test)*(head, torso, tail)
triples.train.tsv | tsv: query, pos. passage, neg. passage | plain-text training data (size: 86G)
tuples.<[head, torso, tail]>.<[test, val]>.top200.tsv | tsv: qid, pid, query, passage | test and validation sets, six files: (test, val)*(head, torso, tail)
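
Since triples.train.tsv is roughly 86 GB, it is best streamed rather than loaded into memory; a minimal sketch of a generator over the training triples:

def read_triples(path="triples.train.tsv"):
    """Stream (query, positive passage, negative passage) triples lazily."""
    with open(path) as f:
        for line in f:
            query, pos, neg = line.rstrip("\n").split("\t")
            yield query, pos, neg

# Iterate lazily instead of loading the ~86G file into memory.
for query, pos, neg in read_triples():
    break  # feed each triple into the training loop here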

Additional Resources

Terms and Conditions

The provided datasets are intended for non-commercial research purposes, to promote advancement in the fields of natural language processing, information retrieval, and related areas. They are made available free of charge, without extending any license or other intellectual property rights.

Upon violation of any of these terms, your rights to use the dataset end automatically. The datasets are provided “as is” without warranty. The party granting access to the datasets is not liable for any damages related to use of the dataset.

Team and Contacts

For any questions regarding obtaining the data and the terms of use, please contact Jon Brassey. For questions regarding technical aspects, send an email to tripclick@jku.at.


