TripClick

TripClick is a large-scale dataset of click logs in the health domain, obtained from user interactions with the Trip Database health web search engine. The click log dataset comprises approximately 5.2 million user interactions collected between 2013 and 2020. The dataset is accompanied by an IR evaluation benchmark and the files required to train deep learning IR models.

Paper: TripClick: The Log Files of a Large Health Web Search Engine

@inproceedings{rekabsaz2021fairnessir,
    title={TripClick: The Log Files of a Large Health Web Search Engine},
    author={Rekabsaz, Navid and Lesota, Oleg and Schedl, Markus and Brassey, Jon and Eickhoff, Carsten},
    booktitle={Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval},
    doi={10.1145/3404835.3463242},
    pages={2507--2513},
    year={2021},
    publisher=
}

TripClick dataset
Additional resources by collaborators
Team and contact

TripClick dataset

How to access the dataset

To gain access to one or more of the collection’s data packages, please fill out this form and send it to jon.brassey@tripdatabase.com. In the form, please specify the data packages you need and the intended use of the data.

Logs Dataset

The logs consist of the user interactions with the Trip search engine collected between January 2013 and October 2020. Approximately 5.2 million click log entries from around 1.6 million search sessions are available. The provided logs.tar.gz contains allarticles.txt, which provides the titles and URLs of all documents, and the <YYYY>-<MM>-<DD>.json files, which contain the log entries split by date (e.g., 2017-03-24.json). In the log files, each line is a single JSON-formatted log record.
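Because each log file stores one JSON record per line, the files can be streamed without loading a whole day into memory. The following is a minimal sketch, not an official loader: the directory argument, the function names, the UTF-8 encoding, and the absence of a header row in allarticles.txt are assumptions, and the record fields are deliberately left uninterpreted.

```python
import json
from pathlib import Path


def iter_log_records(log_dir):
    """Yield (date, record) pairs from every <YYYY>-<MM>-<DD>.json file.

    log_dir is a hypothetical path to the extracted logs.tar.gz contents.
    """
    pattern = "[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9].json"
    for path in sorted(Path(log_dir).glob(pattern)):
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:  # skip blank lines, if any
                    yield path.stem, json.loads(line)


def load_articles(path):
    """Map article id -> (title, url) from the tab-separated allarticles.txt.

    Assumes three tab-separated columns and no header row.
    """
    articles = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            parts = line.rstrip("\r\n").split("\t")
            if len(parts) >= 3:
                articles[parts[0]] = (parts[1], parts[2])
    return articles
```

Iterating over `iter_log_records("logs")` and printing the keys of the first record is a quick way to inspect the record schema before writing any analysis code.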

logs.tar.gz: size 871M, MD5 checksum 1d3a548685c2fbef9b2076b0b04ba44f
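The published MD5 checksum can be used to confirm that a download is complete before extracting it. The sketch below uses only Python's standard `hashlib`; the function names and the local file path are illustrative assumptions, and the same check applies to the other archives listed in this document.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path, expected_md5):
    """Return True if the file's MD5 digest matches the published checksum."""
    return md5_of_file(path) == expected_md5.strip().lower()


# Example call (assumes logs.tar.gz is in the current directory):
# verify_archive("logs.tar.gz", "1d3a548685c2fbef9b2076b0b04ba44f")
```

Reading in chunks keeps memory use flat, which matters for the larger archives.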

File Name	Format	Description
allarticles.txt	tsv: id title url	article collection
<YYYY>-<MM>-<DD>.json	JSON	log records

Information Retrieval Collection

The IR evaluation benchmark/collection is built from the roughly 4 million click log entries that refer to documents indexed in the MEDLINE catalog. The collection has approximately 1.5 million documents and around 692,000 queries, split into three groups: HEAD, TORSO, and TAIL. The query-to-document relevance signals are derived using two click-through models: RAW and Document Click-Through Rate (DCTR). See the paper for more details. The code used to create the benchmark from the log files is available here.
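To give an intuition for the two kinds of relevance signal, the toy sketch below computes a RAW-style label (a document counts as relevant for a query if it was clicked at least once) and a DCTR-style score (clicks divided by the number of times the document was shown for that query). This is an illustration of the general idea under simplifying assumptions, not the benchmark's actual code: the input tuple layout is hypothetical, and the mapping from DCTR scores to graded relevance labels is defined in the paper and not reproduced here.

```python
from collections import defaultdict


def click_signals(impressions):
    """Compute toy RAW and DCTR signals.

    impressions: iterable of (query, shown_doc_ids, clicked_doc_ids) tuples,
    a hypothetical simplification of the real log records.
    Returns (raw, dctr), both keyed by (query, doc_id).
    """
    shown = defaultdict(int)
    clicked = defaultdict(int)
    for query, shown_docs, clicked_docs in impressions:
        for doc in set(shown_docs):
            shown[(query, doc)] += 1
        for doc in set(clicked_docs):
            clicked[(query, doc)] += 1

    # RAW-style: any click makes the document relevant for the query.
    raw = {pair: 1 for pair in clicked}
    # DCTR-style: fraction of impressions that resulted in a click.
    dctr = {pair: clicked.get(pair, 0) / count for pair, count in shown.items()}
    return raw, dctr
```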

To make the collection easier to use, we provide the benchmark in two formats: TREC-style and TSV. Both formats have exactly the same contents.

TREC format

benchmark.tar.gz: size 930M, MD5 checksum 6e5d3deeba138750e9a148b538f30a8f

File Name	Format	Description
documents/docs_grp_<[00-15]>.txt	TREC format	document collection split between 16 files
qrels/qrels.dctr.head.<[train, val]>.txt	qid, 0, docid, relevance	DCTR-based qrels in two files: (train, val)
qrels/qrels.raw.<[head, torso, tail]>.<[train, val]>.txt	qid, 0, docid, relevance	RAW-based qrels in six files: (train, val)*(head, torso, tail)
topics/topics.<[head, torso, tail]>.<[test, train, val]>.txt	TREC format	Topics in nine files: (test, train, val)*(head, torso, tail)
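The qrels files use the four-column layout listed in the table (qid, 0, docid, relevance). A minimal sketch of a loader follows; the function name is an assumption, and the parser splits on any whitespace so it should also handle the tab-separated variant.

```python
from collections import defaultdict


def load_qrels(path):
    """Load a qrels file into {qid: {docid: relevance}}.

    Expects four whitespace-separated columns per line:
    qid, 0, docid, relevance. Lines with another shape are skipped.
    """
    qrels = defaultdict(dict)
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            fields = line.split()
            if len(fields) != 4:
                continue
            qid, _unused, docid, relevance = fields
            qrels[qid][docid] = int(relevance)
    return dict(qrels)
```

Query and document ids are kept as strings so that they compare equal to the ids read from the topics and run files.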

TSV format

benchmark_tsv.tar.gz: size 930M, MD5 checksum dff5f68eed8f9574eac432ea580275f7

File Name	Format	Description
documents/docs.tsv	docid \t doctext	documents
qrels/qrels.dctr.head.<[train, val]>.tsv	qid \t 0 \t docid \t relevance	DCTR-based qrels in two files: (train, val)
qrels/qrels.raw.<[head, torso, tail]>.<[train, val]>.tsv	qid \t 0 \t docid \t relevance	RAW-based qrels in six files: (train, val)*(head, torso, tail)
topics/topics.<[head, torso, tail]>.<[test, train, val]>.tsv	qid \t qtext	Topics in nine files: (test, train, val)*(head, torso, tail)
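The TSV topics and documents files share a two-column layout (an id, a tab, then text), so one small reader can serve both. The sketch below is an assumption-laden illustration: the function names are hypothetical, files are assumed to be UTF-8 without a header row, and any further tabs are treated as part of the text.

```python
def iter_tsv_pairs(path):
    """Yield (id, text) pairs from a two-column TSV file, one line at a time."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            key, sep, text = line.rstrip("\r\n").partition("\t")
            if sep:  # skip lines that contain no tab
                yield key, text


def load_topics(path):
    """Load a topics file (qid \t qtext) into a dict."""
    return dict(iter_tsv_pairs(path))
```

For documents/docs.tsv, iterating with `iter_tsv_pairs` rather than building a dict avoids holding roughly 1.5 million documents in memory at once.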

Training package for deep learning models

To facilitate the training of deep IR models, we also create and provide the required training files alongside the benchmark. The provided files follow a format similar to that of the MS MARCO collection.

dlfiles.tar.gz: size 29G, MD5 checksum 1f256c19466b414e365324d8ef21f09c
dlfiles_runs_test.tar.gz: size 35M, MD5 checksum 2b5e98c683a91e19630636b6f83e3b15

File Name	Format	Description
run.trip.BM25.<[head, torso, tail]>.val.txt	TREC-like: qid, “Q0”, docid, rank, score, runstring	Pre-ranking results, three files: (val)*(head, torso, tail)
runs_test/run.trip.BM25.<[head, torso, tail]>.test.txt	TREC-like: qid, “Q0”, docid, rank, score, runstring	Pre-ranking results, three files: (test)*(head, torso, tail)
triples.train.tsv	tsv: query, pos. passage, neg. passage	Plain-text training data (size: 86G)
tuples.<[head, torso, tail]>.<[test, val]>.top200.tsv	tsv: qid, pid, query, passage	test and validation sets, six files: (test, val)*(head, torso, tail)
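As an illustration of how the pre-ranking run files can be combined with qrels, the sketch below parses a TREC-like run file (qid, "Q0", docid, rank, score, runstring) and computes MRR@k. It is a simplified stand-in for a standard evaluation tool, not the evaluation protocol of the paper: the function names, the whitespace-separated layout, the relevance threshold, and averaging over the queries present in the run are all assumptions.

```python
from collections import defaultdict


def load_run(path):
    """Load a run file into {qid: [docid, ...]} ordered by ascending rank."""
    rows = defaultdict(list)
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            fields = line.split()
            if len(fields) != 6:
                continue
            qid, _q0, docid, rank, _score, _tag = fields
            rows[qid].append((int(rank), docid))
    return {qid: [doc for _, doc in sorted(items)] for qid, items in rows.items()}


def mrr_at_k(run, qrels, k=10, min_relevance=1):
    """Mean reciprocal rank of the first relevant document in the top k.

    qrels has the shape {qid: {docid: relevance}}. Queries in the run with
    no relevant document in the top k contribute 0.
    """
    if not run:
        return 0.0
    total = 0.0
    for qid, ranking in run.items():
        judged = qrels.get(qid, {})
        for position, docid in enumerate(ranking[:k], start=1):
            if judged.get(docid, 0) >= min_relevance:
                total += 1.0 / position
                break
    return total / len(run)
```

For reported results, an established tool such as trec_eval is the safer choice; this sketch is mainly useful as a quick sanity check that the files were read correctly.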

Additional resources by collaborators

Pyserini guideline for creating BM25 baselines: link
A new set of training triples (triples.train.tsv) provided by Hofstätter et al.: github, training triples

Team and Contact

For any questions about obtaining the data or the terms of use, please contact Jon Brassey. For questions about technical aspects, send an email to tripclick@jku.at.

Navid Rekab-saz
Johannes Kepler University Linz

Oleg Lesota
Johannes Kepler University Linz

Markus Schedl
Johannes Kepler University Linz

Jon Brassey
Trip Database

Carsten Eickhoff
Brown University

Terms and conditions

The provided datasets are intended for non-commercial research purposes to promote advancement in the fields of natural language processing, information retrieval, and related areas, and are made available free of charge without extending any license or other intellectual property rights. In particular:

No part of the datasets may be publicly shared or hosted (with the exception of aggregated findings and visualizations);
The datasets can only be used for non-commercial research purposes;
Statistical models or any further resources created from the datasets cannot be shared publicly without the permission of the data owners. This includes, for instance, the weights of deep learning models trained on the provided data.

Upon violation of any of these terms, the user's rights to use the datasets will end automatically. The datasets are provided “as is” without warranty. The party granting access to the datasets is not liable for any damages related to the use of the datasets.