TripClick is a large-scale dataset of click logs in the health domain, obtained from user interactions with the Trip Database health web search engine. The click log dataset comprises approximately 5.2 million user interactions, collected between 2013 and 2020. The dataset is accompanied by an IR evaluation benchmark and the files required to train deep learning IR models.
Paper: TripClick: The Log Files of a Large Health Web Search Engine
@inproceedings{rekabsaz2021fairnessir,
title={TripClick: The Log Files of a Large Health Web Search Engine},
author={Rekabsaz, Navid and Lesota, Oleg and Schedl, Markus and Brassey, Jon and Eickhoff, Carsten},
booktitle={Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval},
doi={10.1145/3404835.3463242},
pages={2507--2513},
year={2021},
publisher=
}
To gain access to one or more of the collection’s data packages, please fill in this form and send it to jon.brassey@tripdatabase.com. In the form, please specify the data packages you need and the intended use of the data.
The logs consist of the user interactions of the Trip search engine collected between January 2013 and October 2020. Approximately 5.2 million click log entries from around 1.6 million search sessions are available. The provided `logs.tar.gz` contains `allarticles.txt`, which provides the titles and URLs of all documents, and the `<YYYY>-<MM>-<DD>.json` files, which contain the log entries split by date, e.g. `2017-03-24.json`. In the log files, each line represents a single JSON-formatted log record.
`logs.tar.gz`: size 871M, MD5 checksum 1d3a548685c2fbef9b2076b0b04ba44f
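A downloaded archive can be checked against its published MD5 checksum before unpacking. The following is a minimal sketch using Python's standard `hashlib`; the function names `md5_of_file` and `matches_checksum` are illustrative and not part of the TripClick distribution.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: str, expected_md5: str) -> bool:
    """Compare a file's digest with a published checksum, ignoring case and whitespace."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

For example, `matches_checksum("logs.tar.gz", "1d3a548685c2fbef9b2076b0b04ba44f")` should return `True` for an intact download, assuming the archive sits in the working directory. The same helper applies to the other archives listed on this page.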
File Name | Format | Description |
---|---|---|
allarticles.txt | tsv: id title url | article collection |
<YYYY>-<MM>-<DD>.json | JSON | log records |
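The two file types above can be read with a few lines of Python. This is a sketch under stated assumptions: `allarticles.txt` is taken to be tab-separated with exactly three fields per line and no header row, and nothing is assumed about the field names inside a log record, so records are returned as plain dictionaries. The names `Article`, `iter_articles`, and `iter_log_records` are illustrative.

```python
import json
from typing import Iterator, NamedTuple


class Article(NamedTuple):
    article_id: str
    title: str
    url: str


def iter_articles(path: str) -> Iterator[Article]:
    """Yield (id, title, url) rows from allarticles.txt (assumed tab-separated, no header)."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) == 3:  # skip malformed or empty lines
                yield Article(*fields)


def iter_log_records(path: str) -> Iterator[dict]:
    """Yield one parsed JSON record per non-empty line of a daily log file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)
```

A call such as `sum(1 for _ in iter_log_records("2017-03-24.json"))` would then count the log entries for that day without loading the whole file into memory.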
The IR evaluation benchmark/collection is created from around 4 million click log entries which refer to those documents that are indexed in the MEDLINE catalog. The collection has approximately 1.5 million documents, and around 692,000 queries split into three groups: HEAD, TORSO, and TAIL. The query-to-document relevance signals are derived using RAW and Document Click-Through Rate (DCTR) click-through models. See the paper for more details. The code used to create the benchmark from log files is available here.
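To give an intuition for the two click-through models, here is a generic sketch, not the benchmark's actual code. It assumes that RAW treats any clicked query-document pair as relevant and that DCTR is the number of clicks divided by the number of times a document was shown for a query. The exact procedure, including how rates are mapped to graded relevance levels, is described in the paper and is not reproduced here. The `(query id, document id)` pair representation is hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Set, Tuple

Pair = Tuple[str, str]  # (query id, document id); hypothetical representation


def raw_relevant_pairs(clicks: Iterable[Pair]) -> Set[Pair]:
    """RAW (as assumed here): every clicked pair counts as relevant."""
    return set(clicks)


def document_ctr(impressions: Iterable[Pair], clicks: Iterable[Pair]) -> Dict[Pair, float]:
    """DCTR (as assumed here): clicks divided by impressions for each pair."""
    shown = Counter(impressions)
    clicked = Counter(clicks)
    # Pairs that were shown but never clicked get a rate of 0.0.
    return {pair: clicked[pair] / count for pair, count in shown.items()}
```

Under these assumptions, a document shown four times for a query and clicked once gets a rate of 0.25, while RAW would simply mark it as relevant.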
To make the use of the collection easier, we provide the benchmark in two formats: TREC-style and TSV format. The contents of both formats are exactly the same.
`benchmark.tar.gz`: size 930M, MD5 checksum 6e5d3deeba138750e9a148b538f30a8f
File Name | Format | Description |
---|---|---|
documents/docs_grp_<[00-15]>.txt | TREC format | document collection split between 16 files |
qrels/qrels.dctr.head.<[train, val]>.txt | qid, 0, docid, relevance | DCTR-based qrels in two files: (train, val) |
qrels/qrels.raw.<[head, torso, tail]>.<[train, val, test]>.txt | qid, 0, docid, relevance | RAW-based qrels in six files: (train, val, test)*(head, torso, tail) |
topics/topics.<[head, torso, tail]>.<[test, train, val]>.txt | TREC format | Topics in nine files: (test, train, val)*(head, torso, tail) |
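The qrels files listed above have four fields per line (`qid, 0, docid, relevance`), so they can be loaded into a nested dictionary. This is a minimal sketch that assumes whitespace-separated fields and integer relevance grades; the name `load_qrels` is illustrative. Because it splits on any whitespace, it also reads the tab-separated variant.

```python
from collections import defaultdict
from typing import Dict


def load_qrels(path: str) -> Dict[str, Dict[str, int]]:
    """Map each query id to a {docid: relevance} dictionary."""
    qrels: Dict[str, Dict[str, int]] = defaultdict(dict)
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            parts = line.split()
            if len(parts) != 4:  # skip blank or malformed lines
                continue
            qid, _unused, docid, relevance = parts
            qrels[qid][docid] = int(relevance)  # assumes integer grades
    return dict(qrels)
```

Keeping ids as strings avoids silently altering identifiers that happen to have leading zeros.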
`benchmark_tsv.tar.gz`: size 930M, MD5 checksum dff5f68eed8f9574eac432ea580275f7
File Name | Format | Description |
---|---|---|
documents/docs.tsv | docid \t doctext | documents |
qrels/qrels.dctr.head.<[train, val]>.tsv | qid \t 0 \t docid \t relevance | DCTR-based qrels in two files: (train, val) |
qrels/qrels.raw.<[head, torso, tail]>.<[train, val, test]>.tsv | qid \t 0 \t docid \t relevance | RAW-based qrels in six files: (train, val, test)*(head, torso, tail) |
topics/topics.<[head, torso, tail]>.<[test, train, val]>.tsv | qid \t qtext | Topics in nine files: (test, train, val)*(head, torso, tail) |
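The two-column TSV files (`docid \t doctext` and `qid \t qtext`) share the same shape, so one small reader can serve both. The sketch below assumes one record per line with the id before the first tab; the name `iter_tsv_pairs` is illustrative.

```python
from typing import Iterator, Tuple


def iter_tsv_pairs(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (id, text) pairs from a two-column TSV file, one record per line."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue
            # Split on the first tab only, so any later tabs stay in the text.
            key, _sep, text = line.partition("\t")
            yield key, text
```

Topics are small enough to hold in memory, e.g. `topics = dict(iter_tsv_pairs("topics/topics.head.val.tsv"))`, whereas `documents/docs.tsv` is better consumed lazily from the iterator.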
To facilitate the training of deep IR models, we also create and provide the required training files alongside the benchmark. The provided files follow a format similar to that of the MS MARCO collection.
`dlfiles.tar.gz`: size 29G, MD5 checksum 1f256c19466b414e365324d8ef21f09c
`dlfiles_runs_test.tar.gz`: size 35M, MD5 checksum 2b5e98c683a91e19630636b6f83e3b15
File Name | Format | Description |
---|---|---|
run.trip.BM25.<[head, torso, tail]>.val.txt | TREC-like: qid, “Q0”, docid, rank, score, runstring | Pre-ranking results, three files: (val)*(head, torso, tail) |
runs_test/run.trip.BM25.<[head, torso, tail]>.test.txt | TREC-like: qid, “Q0”, docid, rank, score, runstring | Pre-ranking results, three files: (test)*(head, torso, tail) |
triples.train.tsv | tsv: query, pos. passage, neg. passage | Plain-text training data (size: 86G) |
tuples.<[head, torso, tail]>.<[test, val]>.top200.tsv | tsv: qid, pid, query, passage | test and validation sets, six files: (test, val)*(head, torso, tail) |
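Since `triples.train.tsv` is far too large to load at once, it is usually streamed. The sketch below reads triples lazily, groups them into fixed-size batches, and also parses the six-field run files. It assumes tab-separated triples with exactly three fields and whitespace-separated run lines; the names `Triple`, `iter_triples`, `iter_batches`, and `load_run` are illustrative, and the walrus operator requires Python 3.8 or later.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List, NamedTuple, Tuple


class Triple(NamedTuple):
    query: str
    positive: str
    negative: str


def iter_triples(path: str) -> Iterator[Triple]:
    """Lazily yield (query, positive passage, negative passage) triples."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) == 3:  # skip malformed lines
                yield Triple(*fields)


def iter_batches(triples: Iterable[Triple], batch_size: int) -> Iterator[List[Triple]]:
    """Group a stream of triples into lists of at most batch_size items."""
    iterator = iter(triples)
    while batch := list(islice(iterator, batch_size)):
        yield batch


def load_run(path: str) -> Dict[str, List[Tuple[str, int, float]]]:
    """Map each query id to its (docid, rank, score) entries from a run file."""
    run: Dict[str, List[Tuple[str, int, float]]] = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            parts = line.split()
            if len(parts) != 6:
                continue
            qid, _q0, docid, rank, score, _runstring = parts
            run.setdefault(qid, []).append((docid, int(rank), float(score)))
    return run
```

A training loop could then iterate over `iter_batches(iter_triples("triples.train.tsv"), 32)`, where the batch size of 32 is an arbitrary choice for illustration.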
A version of the training triples (`triples.train.tsv`) is also provided by Hofstätter et al.: github, training triples.
For any questions regarding obtaining the data and the terms of use, please contact Jon Brassey. For questions regarding technical aspects, send an email to tripclick@jku.at.
The provided datasets are intended for non-commercial research purposes to promote advancement in the field of natural language processing, information retrieval and related areas, and are made available free of charge without extending any license or other intellectual property rights. In particular:
Upon violation of any of these terms, my rights to use the dataset will end automatically. The datasets are provided “as is” without warranty. The side granting access to the datasets is not liable for any damages related to use of the dataset.