TripClick

TripClick is a large-scale dataset of click logs in the health domain, obtained from user interactions with the Trip Database health web search engine. The click log dataset comprises approximately 5.2 million user interactions collected between 2013 and 2020. The dataset is accompanied by an IR evaluation benchmark and the files required to train deep learning IR models.

Paper: TripClick: The Log Files of a Large Health Web Search Engine

@inproceedings{rekabsaz2021fairnessir,
    title={TripClick: The Log Files of a Large Health Web Search Engine},
    author={Rekabsaz, Navid and Lesota, Oleg and Schedl, Markus and Brassey, Jon and Eickhoff, Carsten},
    booktitle={Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval},
    doi={10.1145/3404835.3463242},
    pages={2507--2513},
    year={2021},
    publisher={ACM}
}

TripClick dataset

How to access the dataset

To gain access to one or more of the collection’s data packages, please fill in this form and send it to jon.brassey@tripdatabase.com. In the form, please specify the data packages you need and your intended use of the data.

Logs Dataset

The logs consist of the user interactions of the Trip search engine collected between January 2013 and October 2020. Approximately 5.2 million click log entries from around 1.6 million search sessions are available. The provided logs.tar.gz contains allarticles.txt, which provides the titles and URLs of all documents, and <YYYY>-<MM>-<DD>.json files, which contain the log entries split by date, e.g. 2017-03-24.json. In the log files, each line represents a single JSON-formatted log record (see the parsing sketch after the table below).

| File Name | Format | Description |
|---|---|---|
| allarticles.txt | tsv: id, title, url | article collection |
| <YYYY>-<MM>-<DD>.json | JSON | log records |
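
As a rough illustration, the following Python sketch loads allarticles.txt and streams one daily log file. The exact JSON field names of the log records are not documented here, so the sketch only inspects the keys of the first record; the file names and paths used are assumptions.

```python
import json

# Load document titles and URLs from allarticles.txt (tsv: id, title, url).
articles = {}
with open("allarticles.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip("\n").split("\t")
        if len(parts) >= 3:
            doc_id, title, url = parts[0], parts[1], parts[2]
            articles[doc_id] = {"title": title, "url": url}

# Stream one daily log file; each line is a single JSON-formatted record.
with open("2017-03-24.json", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        print(sorted(record.keys()))  # inspect the available fields first
        break
```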

Information Retrieval Collection

The IR evaluation benchmark/collection is created from around 4 million click log entries that refer to documents indexed in the MEDLINE catalog. The collection contains approximately 1.5 million documents and around 692,000 queries, split into three groups: HEAD, TORSO, and TAIL. The query-to-document relevance signals are derived using the RAW and Document Click-Through Rate (DCTR) click-through models; see the paper for more details. The code used to create the benchmark from the log files is available here.
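
To give a sense of how a click-through model turns logs into relevance signals, the sketch below computes DCTR-style scores (clicks divided by impressions per query-document pair). It is a minimal illustration under a simplified interaction format, not the authors' benchmark-creation code; the actual qrels were produced by the code linked above.

```python
from collections import defaultdict

def dctr_signals(interactions):
    """Sketch of DCTR-style relevance signals.

    `interactions` is assumed to be an iterable of
    (query, shown_doc_ids, clicked_doc_ids) tuples -- a simplification
    of the real log schema.
    """
    shown = defaultdict(int)    # impressions per (query, doc) pair
    clicked = defaultdict(int)  # clicks per (query, doc) pair
    for query, shown_docs, clicked_docs in interactions:
        for doc in shown_docs:
            shown[(query, doc)] += 1
        for doc in clicked_docs:
            clicked[(query, doc)] += 1
    # DCTR: click-through rate of each document for each query.
    return {pair: clicked[pair] / count for pair, count in shown.items()}
```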

To make the collection easier to use, we provide the benchmark in two formats: TREC-style and TSV. The contents of both formats are exactly the same; a small loading sketch follows each table below.

TREC format

| File Name | Format | Description |
|---|---|---|
| documents/docs_grp_<[00-15]>.txt | TREC format | document collection split between 16 files |
| qrels/qrels.dctr.head.<[train, val]>.txt | qid, 0, docid, relevance | DCTR-based qrels in two files: (train, val) |
| qrels/qrels.raw.<[head, torso, tail]>.<[train, val, test]>.txt | qid, 0, docid, relevance | RAW-based qrels in six files: (train, val, test)*(head, torso, tail) |
| topics/topics.<[head, torso, tail]>.<[test, train, val]>.txt | TREC format | Topics in nine files: (test, train, val)*(all, head, torso, tail) |
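
The qrels columns above follow the usual TREC layout, so a loader can be as simple as the sketch below. Whitespace-separated columns and the example file path are assumptions.

```python
from collections import defaultdict

def load_qrels(path):
    """Load a TREC-style qrels file with columns: qid, 0, docid, relevance."""
    qrels = defaultdict(dict)
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) >= 4:
                qid, _, doc_id, rel = parts[:4]
                qrels[qid][doc_id] = int(rel)
    return qrels

# e.g. DCTR-based judgements for the HEAD validation queries:
# qrels = load_qrels("qrels/qrels.dctr.head.val.txt")
```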

TSV format

| File Name | Format | Description |
|---|---|---|
| documents/docs.tsv | docid \t doctext | documents |
| qrels/qrels.dctr.head.<[train, val]>.tsv | qid \t 0 \t docid \t relevance | DCTR-based qrels in two files: (train, val) |
| qrels/qrels.raw.<[head, torso, tail]>.<[train, val, test]>.tsv | qid \t 0 \t docid \t relevance | RAW-based qrels in six files: (train, val, test)*(head, torso, tail) |
| topics/topics.<[head, torso, tail]>.<[test, train, val]>.tsv | qid \t qtext | Topics in nine files: (test, train, val)*(all, head, torso, tail) |
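
For the TSV variant, the two-column documents and topics files can be read as plain tab-separated text. The helper below is a minimal sketch, and the example paths are assumptions.

```python
def load_tsv_pairs(path):
    """Load a two-column TSV file (docid \t doctext, or qid \t qtext)."""
    pairs = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) >= 2:
                pairs[parts[0]] = parts[1]
    return pairs

# docs = load_tsv_pairs("documents/docs.tsv")            # docid -> document text
# topics = load_tsv_pairs("topics/topics.head.val.tsv")  # qid -> query text
```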

Training package for deep learning models

To facilitate the training of deep IR models, we also provide the required training files alongside the benchmark. The provided files follow a format similar to that of the MS MARCO collection; a small streaming reader for the training triples is sketched after the table.

| File Name | Format | Description |
|---|---|---|
| run.trip.BM25.<[head, torso, tail]>.val.txt | TREC-like: qid, "Q0", docid, rank, score, runstring | Pre-ranking results, three files: (val)*(head, torso, tail) |
| runs_test/run.trip.BM25.<[head, torso, tail]>.test.txt | TREC-like: qid, "Q0", docid, rank, score, runstring | Pre-ranking results, three files: (test)*(head, torso, tail) |
| triples.train.tsv | tsv: query, pos. passage, neg. passage | Plain-text training data (size: 86 GB) |
| tuples.<[head, torso, tail]>.<[test, val]>.top200.tsv | tsv: qid, pid, query, passage | test and validation sets, six files: (test, val)*(head, torso, tail) |
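
Since triples.train.tsv is large, it is best consumed as a stream. The sketch below assumes the plain three-column TSV layout described above (query, positive passage, negative passage) and is only an illustration of how the file might be fed into a pairwise training loop.

```python
def training_triples(path):
    """Yield (query, positive_passage, negative_passage) tuples.

    The file is large (~86 GB), so it is streamed line by line rather
    than loaded into memory.
    """
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) >= 3:
                yield parts[0], parts[1], parts[2]

# for query, pos, neg in training_triples("triples.train.tsv"):
#     ...  # feed into a pairwise ranking loss
```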

Additional resources by collaborators

Team and Contact

For any questions regarding obtaining the data and the terms of use, please contact Jon Brassey. For questions regarding technical aspects, send an email to tripclick@jku.at.



Terms and conditions

The provided datasets are intended for non-commercial research purposes, to promote advancement in the fields of natural language processing, information retrieval, and related areas, and are made available free of charge without extending any license or other intellectual property rights. In particular:

Upon violation of any of these terms, my rights to use the dataset will end automatically. The datasets are provided “as is” without warranty. The party granting access to the datasets is not liable for any damages related to use of the dataset.
