Data
The Data component comprises the input data of the experiment. The metadata about the data should cover the test collection, the training data, and other resources. For the test collection, the data source as well as the locations of the qrels and topics files have to be reported. If available, the identifier used by the data catalog ir_datasets should be reported as well. The example below uses another test collection as training data; in general, this entry should be reported as a list, especially if more than one data source is used in the experiments. The third subcomponent, other, covers miscellaneous information related to the data, for instance, about word embeddings or stopwords.
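The recommendation to report training data as a list can be sketched in Python. This is only an illustration and not part of the metadata scheme: the helper name as_sequence is hypothetical, and the metadata is assumed to be already loaded into plain dicts and lists.

```python
def as_sequence(entry):
    """Return the entry as a list of mappings, wrapping a single mapping."""
    if entry is None:
        return []
    if isinstance(entry, dict):
        return [entry]
    return list(entry)


# Hypothetical input: one training data source written as a single mapping.
data = {
    "training data": {
        "name": "TREC disks 4 and 5",
        "source": "https://trec.nist.gov/data/cd45/index.html",
    }
}

# Normalize so that downstream code can always iterate over a list.
data["training data"] = as_sequence(data.get("training data"))
```

Normalizing once at load time keeps later processing uniform, whether an experiment used one training source or several.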
Checklist
data → test collection
Description: A test collection includes, but is not limited to, a name, source, qrels, topics, and an ir_datasets identifier.
Type: Collection of scalars

data → test collection → name
Description: Name of the test collection.
Type: Scalar
Encoding: UTF-8 encoded string of characters (RFC3629); !!str

data → test collection → source
Description: Official source of the collection.
Type: Scalar
Encoding: URI according to RFC2396; !!str

data → test collection → qrels
Description: Source of the qrels.
Type: Scalar
Encoding: URI according to RFC2396; !!str

data → test collection → topics
Description: Source of the topic file.
Type: Scalar
Encoding: URI according to RFC2396; !!str

data → test collection → ir_datasets
Description: Identifier in ir_datasets.
Type: Scalar
Encoding: UTF-8 encoded string of characters (RFC3629); !!str

data → training data
Description: List of the training data sources used in the experiments, represented as mappings. A single mapping usually has a name and a source.
Type: Sequence of mappings; !!seq [!!map, !!map, ...]

data → training data → name
Description: Name of the training data resource.
Type: Scalar
Encoding: UTF-8 encoded string of characters (RFC3629); !!str

data → training data → source
Description: Source location of the training data resource.
Type: Scalar
Encoding: URI according to RFC2396; !!str

data → other
Description: List of other data sources used in the experiments, for instance, external stopword lists, thesauri, or word embeddings. These resources are represented as mappings. A single mapping usually has a name and a source.
Type: Sequence of mappings; !!seq [!!map, !!map, ...]

data → other → name
Description: Name of the data resource.
Type: Scalar
Encoding: UTF-8 encoded string of characters (RFC3629); !!str

data → other → source
Description: Source location of the data resource.
Type: Scalar
Encoding: URI according to RFC2396; !!str
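The checklist can be turned into a simple consistency check. The following Python sketch is not an official validator. It assumes the Data component has already been parsed into plain dicts and lists, and the function names (is_uri, check_mapping, check_data) are hypothetical. The URI test is deliberately loose: it only requires a scheme and a host, which is narrower than what RFC2396 permits.

```python
from urllib.parse import urlparse

# Keys whose values the checklist describes as URIs.
URI_KEYS = {"source", "qrels", "topics"}


def is_uri(value):
    """Loose check (an assumption): a string with a scheme and a host part."""
    if not isinstance(value, str):
        return False
    parts = urlparse(value)
    return bool(parts.scheme) and bool(parts.netloc)


def check_mapping(mapping, path, problems):
    """Check the scalar fields of one mapping and record any problems."""
    if not isinstance(mapping, dict):
        problems.append(f"{path}: expected a mapping")
        return
    for key, value in mapping.items():
        if not isinstance(value, str):
            problems.append(f"{path} -> {key}: expected a string scalar")
        elif key in URI_KEYS and not is_uri(value):
            problems.append(f"{path} -> {key}: expected a URI")


def check_data(data):
    """Return a list of problems in the data component (empty if none found)."""
    problems = []
    if "test collection" in data:
        check_mapping(data["test collection"], "data -> test collection", problems)
    for section in ("training data", "other"):
        entries = data.get(section, [])
        if not isinstance(entries, list):
            problems.append(f"data -> {section}: expected a sequence of mappings")
            continue
        for index, entry in enumerate(entries):
            check_mapping(entry, f"data -> {section}[{index}]", problems)
    return problems


# Usage with a shortened version of the data component.
example = {
    "test collection": {
        "name": "The New York Times Annotated Corpus",
        "source": "https://catalog.ldc.upenn.edu/LDC2008T19",
    },
    "other": [
        {"name": "GloVe embeddings", "source": "https://nlp.stanford.edu/projects/glove/"},
    ],
}
print(check_data(example))
```

Collecting problems in a list instead of raising on the first one lets a single run report every entry that needs attention.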
Example
data:
  test collection:
    name: The New York Times Annotated Corpus
    source: https://catalog.ldc.upenn.edu/LDC2008T19
    qrels: https://trec.nist.gov/data/core/qrels.txt
    topics: https://trec.nist.gov/data/core/core_nist.txt
    ir_datasets: https://ir-datasets.com/nyt
  training data:
    - name: TREC disks 4 and 5
      source: https://trec.nist.gov/data/cd45/index.html
      qrels: https://trec.nist.gov/data/robust/qrels.robust2004.txt
      topics: https://trec.nist.gov/data/robust/04.testset.gz
      ir_datasets: https://ir-datasets.com/trec-robust04
  other:
    - name: GloVe embeddings
      source: https://nlp.stanford.edu/projects/glove/
    - name: Indri's stopword list
      source: https://sourceforge.net/projects/lemur/