SciRepEval: A Multi-Format Benchmark for Scientific Document Representations
ABSTRACT
1 INTRODUCTION
Learning representations of whole documents is critical for a variety of NLP tasks including classi-
fication, search, and recommendation (Cohan et al., 2020). Recent work has shown how pretrained
language models (e.g., (Devlin et al., 2019; Raffel et al., 2020; Brown et al., 2020)) can be tailored
to produce high-quality representations of documents with contrastive learning (Xu et al., 2021;
Gao et al., 2021; Neelakantan et al., 2022). In the scientific domain, training objectives based on
contrastive learning of cross-document links (e.g., citations) have shown further improvements in
document-level representation learning (Cohan et al., 2020; Ostendorff et al., 2022b; Mysore et al.,
2022). These methods are especially useful because the representations they produce can be indexed
and later efficiently consumed by lightweight downstream models without additional fine-tuning.
While there has been significant progress in evaluating generalizability of NLP models (Ye et al.,
2021; Sanh et al., 2021), evaluation of scientific document representation models has remained
limited. Existing benchmarks either focus on document similarity (Mysore et al., 2021; Voorhees
et al., 2021) or include tasks that are highly correlated and not diverse (Cohan et al., 2020).
We introduce SciRepEval, the first benchmark for comprehensive evaluation of document-
representation learning models in the scientific domain. Unlike prior work, SciRepEval is large
and includes a collection of highly diverse tasks, thus encouraging research on generalization (in-
cluding instance-level, cross-task and cross-domain generalization). It consists of 25 realistic tasks
that reflect practical use cases of scientific document representations across four formats: text classi-
fication, regression, proximity-based ranking (e.g., nearest-neighbor), and ad-hoc search. Eleven of
these tasks are new contributions. SciRepEval contains standard sets of both training and evaluation
datasets to simplify and standardize comparisons between methods evaluated on the benchmark.
We then use this new benchmark to investigate and improve the generalization ability of docu-
ment representation models. Following recent work (Cohan et al., 2020; Ostendorff et al., 2022b;
Mysore et al., 2022) we further pre-fine-tune a transformer language model (SciNCL; Ostendorff
et al., 2022b) to produce high-quality representations for downstream tasks. We hypothesize that
condensing all relevant information of the document into a single vector representation might not be
expressive enough for generalization across a wide range of tasks. Prior work addresses a similar
challenge in the context of document similarity by learning multiple finer-grained representations,
each associated with a different aspect of a paper (e.g., task, method, results, etc) (Mysore et al.,
2022; Ostendorff et al., 2022a). In contrast, we aim to learn effective representations for multiple
downstream task formats.
Following recent success in multi-task learning in NLP (Ye et al., 2021; Sanh et al., 2021), we
explore large-scale multi-task training in the context of scientific document representations, where
we apply suitable optimization objectives for the various task formats in SciRepEval, i.e., cross-entropy loss for classification, triplet loss for proximity and ad-hoc search, and mean square error loss
for regression. We explore two state-of-the-art techniques for generating format-specific document
representations: using control codes (Keskar et al., 2019; Raffel et al., 2020) as input indicating the
format, and parameter-efficient adapter methods (Houlsby et al., 2019; Pfeiffer et al., 2021; Stickland
& Murray, 2019), in which a separate network module is introduced for every task format.
Our experiments investigate: (i) if existing document representation methods have the ability to
generalize to a highly diverse set of tasks, (ii) if multi-task training on diverse data can improve
document representation models, and (iii) if task-format-specific representations can improve gen-
eralization. Through extensive analysis we find that existing state-of-the-art scientific document rep-
resentation models such as SPECTER (Cohan et al., 2020) and SciNCL (Ostendorff et al., 2022b)
struggle to generalize to all task types. Interestingly, we find that simple multi-task training on a large set of tasks is not able to significantly improve the results. However, we find that multiple task-format-specific representations can substantially improve generalization.
To summarize, our contributions are:
(i) SciRepEval, a new comprehensive benchmark of 25 highly diverse and practical tasks for
scientific document representation techniques across four different formats, of which 11 tasks
are made available for the first time, and six of the tasks are explicitly designed for training.
(ii) An extensive investigation on the generalization ability of state-of-the-art scientific document
representation models.
(iii) A set of new multi-task document representation models that, unlike existing methods, can
produce representations tailored to different task formats. The new methods show improved
generalization over previous work, outperforming prior methods by up to 1.5 points absolute.
We release the benchmark and associated code for training and evaluation to encourage further research in this area.1

1 https://github.com/allenai/scirepeval
2 BACKGROUND
Representing Scientific Documents Earlier work aimed at document embeddings used word vec-
tors (J et al., 2016; Le & Mikolov, 2014; Wu et al., 2018), convolutions (Liu et al., 2017; Zamani
et al., 2018), bi-encoder networks (Conneau et al., 2017) and BERT-based methods (Reimers &
Gurevych, 2019). Recent works have produced large scale language models pre-trained on scientific
corpora (Beltagy et al., 2019; Yasunaga et al., 2022; Trewartha et al., 2022). These tend to perform
better than general purpose models on scientific domain tasks, and serve as a foundation for learn-
ing dense embeddings of scientific documents. Cohan et al. (2020) and Ostendorff et al. (2022b)
fine-tune SciBERT (Beltagy et al., 2019) with a triplet loss that encourages papers citing each other
to have similar embeddings, using the title and abstract of research papers as the input.
Both Cohan et al. (2020) and Ostendorff et al. (2022b) are evaluated on the SciDocs benchmark.
However, 4 of the 7 tasks in SciDocs are overly-simplistic in that the goal is to distinguish 5 real
citations from 20 randomly chosen non-citations (further limitations of SciDocs are discussed in
section 3 and Appendix F). Hence, the existing techniques work reasonably well on SciDocs. In
contrast, SciRepEval provides a more challenging and diverse set of tasks, for both training and
evaluation to help motivate methods for producing scientific document representations that can do
well across multiple task formats. As a first step in this direction, we attempt to learn task-specific
embeddings of the documents by pre-fine-tuning on multiple objectives simultaneously. Related to
our approach, there are techniques for learning multiple embeddings per paper (Ostendorff et al., 2022a; Mysore et al., 2022). These methods are, however, orthogonal to ours in that they generate
an embedding per paper “facet”, while we focus on learning separate embeddings per task format.
In addition, these techniques focus only on finer-grained paper similarity, while our aim is producing
general embeddings applicable to a variety of task formats.
Multi-Task Learning Across Formats Multi-task learning (Caruana, 1993) with deep neural net-
works has been shown to improve performance over single-task training for related objectives (Liu
et al., 2015; 2019b). Though unrelated tasks can lead to negative transfer, recent work has shown
that simply increasing the number of tasks tends to yield better performance in multi-task learning
(Aghajanyan et al., 2021; Aribandi et al., 2022; Padmakumar et al., 2022). Aghajanyan et al. (2021)
pre-fine-tune pre-trained language models simultaneously on 46 tasks across 4 task types before
fine-tuning on the downstream task. Aribandi et al. (2022) pre-train T5 (Raffel et al., 2020) on a
combination of C4 span denoising and 107 other tasks across 8 task families. Ye et al. (2021) intro-
duce an ontology of 160 tasks for few shot multi-task training. Unlike these task families, which are
divided primarily by semantics (e.g., classifying sentiment vs classifying entailment), the training
tasks in SciRepEval consist of 8 large-scale scientific datasets across the four task formats. Since our
goal is to evaluate final document representations, rather than fine-tune on individual downstream
tasks like the above approaches, we follow SPECTER (Cohan et al., 2020) and directly apply the
representations as features to the tasks.
Adapters for Multiple Tasks Adapters were introduced by Houlsby et al. (2019) for parameter
efficient training of transformers (Vaswani et al., 2017). A small number of trainable parameters are
added to each layer, while freezing the base encoder. This strategy is similar to that of ELMo (Peters
et al., 2018), which learned task-specific weightings for the biLSTM layers. To apply adapters to
multi-task learning, Pfeiffer et al. (2021) define a two-step process they call Fusion. First, individual
adapter modules are trained for every task. The second step introduces task-specific fusion modules
at each layer which attend to (i.e. fuse) all the previously pre-fine-tuned adapters, keeping them
fixed. Similarly, Stickland & Murray (2019) introduced Projected Attention Layers (PALs) with
adapters and self-attention modules for every task, but the entire network is trained simultaneously.
Control Codes Control codes can be defined as token(s) prepended to the input to serve as addi-
tional signals to the model. Keskar et al. (2019) use control codes as prompts to govern style, con-
tent, and task-specific behavior for conditional text generation. Tay et al. (2022) use control codes to
switch between three de-noising modes during pre-training, and associate a downstream task with a
particular mode during fine-tuning. Zhang et al. (2022) apply control codes in the context of dense
retrieval to produce multiple representations covering different aspects of the same document, allow-
ing them to match queries written from multiple perspectives. In contrast to this past work, we use
control codes to indicate target task format for the embedding output by the model, and demonstrate
how this is effective for producing paper embeddings across different formats.
3 SCIREPEVAL
We introduce SciRepEval, a benchmark suite of 25 tasks across four formats for training and eval-
uating multi-task embeddings of scholarly papers. SciRepEval aims to enable comprehensive eval-
uation of paper embeddings by providing (1) a highly diverse set of tasks—spanning multiple task
formats such as classification, regression, proximity and ad-hoc search—to challenge the general-
purpose applicability of embeddings, (2) realistic tasks that reflect practical use cases of paper em-
beddings, and (3) a standard set of both training and evaluation datasets to simplify comparisons
between methods evaluated on the benchmark.
The previous scholarly paper embedding benchmark is SciDocs (Cohan et al., 2020), which includes
two classification tasks, four nearest neighbors tasks, and one recommendation task. SciRepEval
includes SciDocs as a subset, but addresses several key limitations:
(i) The four nearest neighbor tasks in SciDocs are constructed to distinguish a related document
from randomly selected negatives given a query document, which might be too easy and not
representative of real tasks in scholarly information retrieval. SciRepEval has more realistic
tasks such as search, author disambiguation, and paper-reviewer matching among others.
(ii) For the methods evaluated in Section 5, we found that the SciDocs recommendations task was noisy and had limited power to distinguish between different embeddings. The test set includes only 1000 clickthrough events, and the use of propensity weighting means that an even smaller number of examples dominate test-set performance. While SciRepEval includes SciDocs as a subset, we exclude the recommendations task.
(iii) The tasks in SciDocs were constructed to be used only for evaluation, and have too few samples for training on SciDocs to be practical (see Table 1). In SciRepEval, eight of the largest tasks across the four formats are used for training, while the remaining out-of-train tasks are reserved for evaluation. This enables the study of multi-task approaches, rather than relying solely on the citation signal. The training data in SciRepEval also covers multiple domains at scale, as discussed in Appendix D.
(iv) Four of the tasks in SciDocs have very high model-performance correlations with one another (greater than 0.99), indicating that the diversity of the tasks is limited. See Appendix F for more details.

Table 1: Summary of the SciRepEval benchmark tasks across the four formats: classification (CLF), regression (RGN), proximity (PRX), and ad-hoc search (SRCH). The models in Section 6 are first trained on the in-train tasks and then benchmarked on their held-out sets as well as the 17 test tasks. Information retrieval tasks have Q queries with P candidate pairs, and the S2AND task has X clusters with Y author-paper pairs. S: Silver, G: Gold. SciDocs is evaluated as per Cohan et al. (2020).
In-Train (PRX): Same Author Detection (train Q: 76,489, P: 673,170; eval Q: 13,585, P: 123,430; MAP; Subramanian et al., 2021); Highly Influential Citations (train Q: 65,982, P: 2,004,688; eval Q: 1,199, P: 54,255; MAP; this work); Citation Prediction Triplets (819,836 triplets; not used for eval; Cohan et al., 2020).
Out-of-Train: includes SciDocs.
The tasks in SciRepEval are summarized in Table 1. They are a mixture of existing and new datasets.
Datasets with at least 100,000 instances (triplets for proximity/ad-hoc search) are in-train datasets used for training, while the others are out-of-train, used only for evaluation. Although the SciDocs tasks are used as out-of-train evaluation tasks, we report their performance in a separate category.
Next, we briefly describe each of the task formats and their component tasks. Full details are pro-
vided in Appendix A. Except for Search, all the tasks use paper embeddings created from a combi-
nation of paper title and abstract as the input. Search requires additional metadata (subsection 4.1)
which is concatenated to the title and abstract before producing the paper representation.
Ad-Hoc Search In ad-hoc search tasks, we are given a textual query and the task is to rank a set
of candidate papers by relatedness to the query. Ad-hoc search is a critical mechanism for paper
discovery in practice, and we gather multiple real-world data sets for training and evaluation. One
evaluation dataset comes from previous work, TREC-CoVID (Voorhees et al., 2021), a biomedical
challenge task that ranks papers from CORD-19 (Wang et al., 2020b) in response to textual search
queries. Two other datasets are newly introduced in our work: a ‘feeds’ dataset taken from a schol-
arly paper recommendation system, where we treat the user-specified feed name as the topic query,
and the goal is to rank the papers the user has annotated as relevant to the feed above those annotated
as irrelevant. Finally, for training, we release a new large dataset of more than 700,000 clickthrough events from a scholarly search engine, which we term the Search task.
To evaluate an embedding set on ad-hoc search, we rank candidate papers by increasing Euclidean
distance between the query embedding and the candidate paper embeddings. Pytrec_eval
(Van Gysel & de Rijke, 2018) is used to calculate the ranking metrics. Normalized Discounted
Cumulative Gain (nDCG) is used for the Search and TREC-CoVID tasks, as the true relevance scores can be greater than 1. For the feeds tasks, which have binary labels, we use Mean Average Precision (MAP).
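To make this evaluation procedure concrete, the following is a minimal sketch using pytrec_eval; the query and candidate embeddings and the relevance judgements are synthetic stand-ins rather than benchmark data.

```python
import numpy as np
import pytrec_eval

rng = np.random.default_rng(0)
query_emb = rng.normal(size=768)                                  # embedding of one textual query
cand_embs = {f"p{i}": rng.normal(size=768) for i in range(20)}    # candidate paper embeddings
qrels = {"q1": {f"p{i}": int(i < 3) for i in range(20)}}          # relevance judgements

# Rank candidates by increasing Euclidean distance to the query embedding.
# pytrec_eval treats larger scores as better, so we negate the distance.
run = {"q1": {pid: float(-np.linalg.norm(query_emb - emb)) for pid, emb in cand_embs.items()}}

evaluator = pytrec_eval.RelevanceEvaluator(qrels, {"ndcg", "map"})
print(evaluator.evaluate(run))  # per-query nDCG and MAP
```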
Proximity Similar to ad-hoc search, proximity tasks involve ranking a set of candidate papers by
their relatedness to a query, except the query in this case is not textual but instead a paper. Proximity-
based tasks form a basis for paper-based retrieval and recommendation, and for estimating paper
similarity for use in applications like author disambiguation. We include a total of eleven proximity-
based tasks, including four evaluation tasks from SciDocs (predicting citations and co-citations, and
predicting co-viewed or co-read papers), and two others from previous work: the S2AND author
disambiguation task (Subramanian et al., 2021) with paper similarity features, and Paper-Reviewer
Matching, where candidate reviewers are ranked by expert annotators based on the similarity of
their papers to the query paper to be reviewed. The Paper-Reviewer Matching task combines three
existing datasets (Mimno & McCallum, 2007; Liu et al., 2014; Zhao et al., 2022) which we describe
in more detail in subsection A.2. We also introduce five new proximity tasks including two feeds
evaluation tasks from the recommender discussed above, where one or multiple relevant papers serve
as queries. For training, we include three large-scale datasets aimed at predicting same-authors,
citations (via triplets) (Cohan et al., 2020), and influential citations, which we define as four or more
citations of the same paper in the text of a single paper.
For evaluating embeddings in proximity tasks, we rank candidates by Euclidean embedding distance,
with MAP as the evaluation metric except for S2AND, which uses B3 F1 (Bagga & Baldwin, 1998),
and Peer Review Matching, which uses precision@5 and @10.
Classification Paper classification, in which the input is a paper and the output is a topical cate-
gory, is a foundational task for document organization and discovery. Apart from the two SciDocs
classification tasks (MAG and MeSH Diseases), we add four more classification tasks: a binary task to predict whether a paper is relevant to biomimicry (Shyam et al., 2019), two
biomedical classification tasks, namely DRSM from Burns (2022) and MeSH Descriptors classifi-
cation (Lipscomb, 2000), and a new large-scale field of study (FoS) multi-label training set of more
than 500K papers with silver FoS labels based on publication venue.
We evaluate embeddings on classification by scoring their performance as features within linear
support vector classifiers. Results for these tasks are evaluated using F1 score (which may be micro-
or macro-F1 depending on the dataset, indicated in Table 1). To better understand how embeddings
perform in data-scarce regimes, we also construct two few-shot versions each from both out-of-train
classification datasets and the FoS dataset subset for which we have manually annotated gold labels.
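For illustration, a minimal sketch of this evaluation with scikit-learn is shown below; the embeddings and labels are random stand-ins, and macro-F1 is used as a placeholder for whichever averaging the dataset calls for.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
# Stand-ins for frozen 768-d paper embeddings and topic labels.
X_train, y_train = rng.normal(size=(500, 768)), rng.integers(0, 5, size=500)
X_test, y_test = rng.normal(size=(100, 768)), rng.integers(0, 5, size=100)

clf = LinearSVC(C=1.0, max_iter=10000).fit(X_train, y_train)
print(f1_score(y_test, clf.predict(X_test), average="macro"))  # micro- or macro-F1 per dataset
```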
Regression We also consider a set of regression tasks where the goal is to predict a continuous
quantity for a given paper. For evaluation, we consider predicting three numeric attributes related
to prominence or quality: Tweet Mentions (Jain & Singh, 2021), and the peer review rating and
maximum h-index of authors for a collection of ICLR papers obtained from OpenReview2 (forming
two new datasets). For training, we introduce two additional datasets of more than 200K examples
each, predicting citation count and year of publication.
2 https://api.openreview.net
Figure 1: Generating multi-format embeddings. A task format is either associated with a task-specific control code supplied with the input document, or with adapter blocks attached to the model.
We evaluate embeddings on regression tasks by scoring their performance when used as features within linear support vector regression models. Results for these tasks are evaluated using the Kendall's τ rank correlation between the true and predicted labels.3

3 We found in our experiments that Pearson's ρ and Kendall's τ produced similar relative results between models. We did not use MSE because its values are unbounded and could skew the overall average across the datasets in the benchmark.
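Analogously to the classification setup, a minimal sketch of the regression evaluation is given below, with synthetic stand-ins for the embeddings and the numeric targets.

```python
import numpy as np
from scipy.stats import kendalltau
from sklearn.svm import LinearSVR

rng = np.random.default_rng(0)
# Stand-ins for frozen paper embeddings and a continuous target (e.g., citation count).
X_train, y_train = rng.normal(size=(500, 768)), rng.normal(size=500)
X_test, y_test = rng.normal(size=(100, 768)), rng.normal(size=100)

reg = LinearSVR(C=1.0, max_iter=10000).fit(X_train, y_train)
tau, _ = kendalltau(y_test, reg.predict(X_test))
print(tau)  # Kendall's tau between true and predicted values
```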
4.1 MODEL
We follow Cohan et al. (2020) in using a pretrained transformer encoder as our base model. A scientific document is given as input to the encoder as a concatenation of its title and abstract separated by the [SEP] token.4 Unlike Cohan et al. (2020), we use three different types of training objectives, suitable for each format, to train the model as described in subsection 4.2. We explore two methods to learn separate embeddings for each task format: control codes and adapters, as shown in Figure 1.

4 For the Search task, additional metadata like paper venue and year of publishing is also supplied.
Control Codes In the control code approach, we prepend a special per-format token (see Table 6
in the appendix) to the input and pass it to the transformer model, taking the final layer representation
corresponding to this token as the document embedding and feeding it as input to the task-specific
head (described in Section 4.2).
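A minimal sketch of the control-code approach with the HuggingFace transformers API is shown below; the checkpoint name and the per-format token strings are illustrative assumptions (the actual codes are listed in Table 6 in the appendix), and the released training code may handle tokenization differently.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("malteos/scincl")  # assumed SciNCL checkpoint
model = AutoModel.from_pretrained("malteos/scincl")

# One assumed special token per task format.
control_codes = ["[CLF]", "[RGN]", "[PRX]", "[QRY]"]
tokenizer.add_special_tokens({"additional_special_tokens": control_codes})
model.resize_token_embeddings(len(tokenizer))

def embed(title: str, abstract: str, fmt: str = "[PRX]") -> torch.Tensor:
    # Prepend the format's control code to "title [SEP] abstract".
    text = f"{fmt} {title} {tokenizer.sep_token} {abstract}"
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        out = model(**inputs)
    # Use the final-layer state of the control code token as the document embedding.
    pos = (inputs["input_ids"][0] == tokenizer.convert_tokens_to_ids(fmt)).nonzero()[0].item()
    return out.last_hidden_state[0, pos]
```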
Adapters We also experiment with adapters which have been shown to be effective for multi-task
learning. In particular, we explore Adapter Fusion (Pfeiffer et al., 2021) and PALs (Stickland &
Murray, 2019) methods, each of which introduces task-specific adapters and attention modules at
every transformer layer. Since our goal is to learn different embeddings for different task formats,
we create modules for each task format rather than each task, and the final hidden representation of
the [CLS] token output via the adapter is taken as the corresponding embedding of the document.
4.2 TRAINING
We train the model in a multi-task setup with task-heterogeneous batching (Aghajanyan et al., 2021).
For classification and regression tasks, we use a linear head atop the base transformer encoder.5 We train on both multi-class and multi-label tasks, using Cross Entropy loss for the former and Binary Cross Entropy (BCE) with sigmoid activation for the latter. For regression, we minimize the Mean Square Error (MSE) loss.

5 The linear heads are thrown away after training.
For proximity and ad-hoc search tasks we use the triplet loss as in Cohan et al. (2020). For these task
forms, given a query, a relevance score accompanies each candidate. The query can be a document
(for which we wish to find similar documents) or a raw textual query. Each training instance in this
setup is a triplet consisting of a paper or plain text query Q, a positive candidate paper P+ and a
negative candidate P−, where P+ has a higher score than P−. Then, we optimize the triplet loss:
L_triplet = max{ d(Q_E, P_E^+) − d(Q_E, P_E^−) + ε, 0 }    (1)

where d is the Euclidean distance used as a measure of similarity between the query embedding Q_E and the candidate embeddings P_E^+ and P_E^−, and ε is the margin hyperparameter, whose value is chosen as 1 based on preliminary experiments.
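As a concrete reference, a minimal PyTorch sketch of this objective is shown below; the batch layout and the mean reduction are simplifying assumptions rather than the exact training code.

```python
import torch
import torch.nn.functional as F

def triplet_loss(query_emb, pos_emb, neg_emb, margin=1.0):
    """Triplet margin loss over document embeddings with Euclidean distance (Eq. 1)."""
    d_pos = F.pairwise_distance(query_emb, pos_emb, p=2)  # d(Q_E, P_E^+)
    d_neg = F.pairwise_distance(query_emb, neg_emb, p=2)  # d(Q_E, P_E^-)
    # Equivalent to torch.nn.TripletMarginLoss(margin=margin, p=2) with mean reduction.
    return torch.clamp(d_pos - d_neg + margin, min=0.0).mean()
```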
5 EXPERIMENT SETUP
Training Data We train our multi-format models on the 8 large-scale in-train tasks detailed in Table 1. For the proximity and ad-hoc search tasks, we create up to 5 examples for each query by sampling positive and negative papers from its candidate pool. We limit the number of training samples from each task to at most 600K.6 The resulting training and validation sets consist of a total of 3.27M and 446K instances, respectively.

6 Performance with smaller dataset samples (at most 400K samples per task) was relatively poor.
Transformer Baselines As a first step, we evaluate the existing document representation meth-
ods on our benchmark. These include the transformer encoders SciBERT (Beltagy et al., 2019) –
a language model pre-trained on scientific corpora; and SPECTER (Cohan et al., 2020), SciNCL
(Ostendorff et al., 2022b) and ASPIRE (Mysore et al., 2022). ASPIRE produces representations for
aspect-based matching between query and candidate papers, a setting similar to our proximity tasks. Hence, we evaluate it only on that subset and report the results in Appendix C.
Next, for our multi-format baselines, we initialize with SciNCL which is the state of the art on Sci-
Docs, and then further train it in a multi-task setup on the in-train tasks both with (MTL CTRL)
and without the control codes (MTL CLS). Finally, to compare the control codes-based approach
with the adapter techniques, we experiment with the BERT PALs and Fusion architectures, keeping
SciNCL as the base model in both. Fusion, being a two-step process, first introduces task-format-specific adapters (Adapters) and then the fusion modules (Adapter Fusion). The MTL CTRL and adapter approaches produce multiple representations per document, while MTL CLS produces a single representation similar to existing methods. We use the PyTorch implementations of the models by HuggingFace.7 The specific training configurations are described in Appendix B.

7 https://huggingface.co/models

Table 2: Evaluation results on SciRepEval in multiple settings. MTL CLS generates a single embedding for all tasks; MTL CTRL (control codes) and the adapter variants (Adapters, PALs, and Adapter Fusion) produce an embedding per task format. We consider an ensemble approach that averages the MTL CTRL and Adapter embeddings. For models we trained, we report the mean and standard deviation (in parentheses) across 5 runs with different seeds. The best results are highlighted in bold. We conduct a one-way analysis of variance (ANOVA) with Tukey's test (Haynes, 2013) at α = 0.05 across the settings and underline results not statistically significantly different from the best.
6 RESULTS
Table 2 shows the evaluation of all our transformer baselines producing both single and multiple
representations per document on SciRepEval. Our benchmark includes diverse tasks with a variety
of different evaluation metrics, and following previous work (e.g., Wang et al., 2019) we report an
average of the individual task metrics (which each range from 0-100). The pre-fine-tuned multi-
format variants outperform the vanilla models on average, and we also find that all the approaches
that produce multiple representation types outperform, by up to 1.5 points, the MTL CLS model,
which learns only a single representation shared for all tasks. The adapter variants are better than
MTL CTRL overall, and result in an improvement of 0.6-1.3 points on the out-of-train tasks with
task-format specific adapters performing the best.
Further, as shown in Table 5, the control codes and adapters are the most efficient in terms of model size and computation. Hence, we try to improve upon each by combining representations from the Adapter model and the MTL CTRL model by averaging them,8 and we find that these combined embeddings outperform the individual ones consistently across the in-train, out-of-train, and SciDocs settings. All the models except SciBERT (not pre-trained with a citation objective) perform well on SciDocs, with vanilla SciNCL being the best. ASPIRE, as reported in Appendix C, performs well on SciDocs but not on other similar tasks in SciRepEval.

8 We also tried concatenating the embeddings in preliminary experiments, which yielded similar results but doubled the embedding size.
Alternative Base Models To confirm that our findings hold across multiple base models, we com-
pare MTL CLS, MTL CTRL and adapters with SPECTER and SciBERT as the base models. Table 3
shows that the MTL CTRL token and the adapters approaches still substantially outperform the MTL
CLS approach, suggesting that the efficacy of using an embedding per task format instead of a single
embedding per document holds across a range of base model types.
7 ANALYSES
Specialization of Control Code Embeddings Our hypothesis is that by training embedding
spaces on particular task formats, they will become more accurate for tasks of that format than
for others. We test this hypothesis by sampling one in-train and one out-of-train9 task of every for-
mat (for ease of computation) and applying all the control codes to them for evaluation. As shown in Table 4, the control codes trained on a task format perform best for tasks of that format, for both in-train and out-of-train tasks.

9 In-train: FoS, Citation Count, Same Author Detection, Search; Out-of-train: DRSM, Peer Review Score, Peer-Reviewer Matching, TREC-CoVID

Table 3: Results for multi-format training with SciBERT and SPECTER as base models. For brevity, we report only the single-adapter results due to their additional advantage of computational efficiency. The best results for each base model are underlined.

Table 4: Cross-task analysis for control codes. The best result for each task format across all control codes is underlined. These lie on the diagonal for both in-train and out-of-train tasks, suggesting that format-based partitioning in multi-task training produces effective document representations suited to the corresponding format.
As an extension to this experiment we also analyze how well the control code representations work
when the encoder is trained on tasks which are randomly grouped together as opposed to by task
format. We take the mean evaluation metrics produced from 5 random partition runs. On the out-
of-train tasks, the corresponding control codes for classification, regression, proximity and ad-hoc
search show a gain of +0.2, +3.9, +4.5 and +2.2 points respectively over random partitioning. Simi-
larly, for in-train tasks the control codes are better by +5.2, +3.8, +1.2 and +1.3 points respectively.
The results suggest that representations specific to each task format do lead to better results overall.
Finally, to study training affinity among the task formats, we pre-fine-tune on a maximum of two
formats at once. Appendix G reveals that combined multi-task training on similar task formats
like regression/classification and proximity/ad-hoc search results in performance gains, but only on
related tasks. Training on all the tasks yields better results on average across the task formats.
Efficiency While the variants producing representations based on task-format serve as strong base-
lines on the SciRepEval benchmark as shown in Table 2, efficiency is another important consider-
ation in practice. As shown in Table 5, the control code approach only requires one new control
code embedding per format, and does not affect training time. PALs, by contrast, introduces new
attention layers and trains the entire network, increasing training time, and Adapters adds and only
trains half as many parameters as PALs. Fusion layers introduce 10x as many parameters as PALs
leading to 2x more time on inference. Training and inference times are measured on runs with 1k and 10k samples, respectively.

Table 5: Parameter and (relative) runtime efficiency comparison of models. MTL CTRL and Adapters are similar in runtime, but the PALs and Fusion variants add significant computation costs.
8 CONCLUSION
We introduce SciRepEval, a benchmark for scientific document representation methods with 25
tasks across four task formats. On this benchmark, we show that learning a separate document
representation for each task format substantially improves task performance compared to learning a
single representation for all tasks. Future work could address limitations of our work by evaluating
partitioning schemes beyond task format, crafting higher-fidelity metrics to account for the diversity
of tasks in SciRepEval (which vary in sensitivity and in relevance to downstream applications), or
further exploring how accuracy varies with computational and storage cost.
ACKNOWLEDGEMENTS
We would like to thank Jonathan Bragg, Zhipeng Hou, and the anonymous reviewers for helpful
comments, suggestions and feedback. We would also like to acknowledge the support of NASA
AATT Project for funding the PeTaL research and contributing the biomimicry dataset. This work
was supported in part by NSF Grant OIA-2033558.
REFERENCES
Armen Aghajanyan, Anchit Gupta, Akshat Shrivastava, Xilun Chen, Luke Zettlemoyer, and Sonal
Gupta. Muppet: Massive multi-task representations with pre-finetuning. In Proceedings of the
2021 Conference on Empirical Methods in Natural Language Processing, pp. 5799–5811, Online
and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguis-
tics. doi: 10.18653/v1/2021.emnlp-main.468. URL https://aclanthology.org/2021.
emnlp-main.468.
Vamsi Aribandi, Yi Tay, Tal Schuster, Jinfeng Rao, Huaixiu Steven Zheng, Sanket Vaibhav Mehta,
Honglei Zhuang, Vinh Q. Tran, Dara Bahri, Jianmo Ni, Jai Gupta, Kai Hui, Sebastian Ruder, and
Donald Metzler. Ext5: Towards extreme multi-task scaling for transfer learning. In International
Conference on Learning Representations, 2022. URL https://openreview.net/forum?
id=Vzh1BFUCiIX.
Amit Bagga and Breck Baldwin. Algorithms for scoring coreference chains. In The first inter-
national conference on language resources and evaluation workshop on linguistics coreference,
1998.
Iz Beltagy, Kyle Lo, and Arman Cohan. SciBERT: A pretrained language model for scientific text.
In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing
and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP),
pp. 3615–3620, Hong Kong, China, November 2019. Association for Computational Linguistics.
doi: 10.18653/v1/D19-1371. URL https://aclanthology.org/D19-1371.
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhari-
wal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agar-
wal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh,
Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Ma-
teusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCan-
dlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot
learners. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (eds.), Ad-
vances in Neural Information Processing Systems, volume 33, pp. 1877–1901. Curran Asso-
ciates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/
1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
Gully Burns. Drsm-corpus v1, 2022. URL https://github.com/chanzuckerberg/
DRSM-corpus.
Rich Caruana. Multitask learning: A knowledge-based source of inductive bias. In Proceedings of
the Tenth International Conference on International Conference on Machine Learning, ICML’93,
pp. 41–48, San Francisco, CA, USA, 1993. Morgan Kaufmann Publishers Inc. ISBN 1558603077.
Zhao Chen, Vijay Badrinarayanan, Chen-Yu Lee, and Andrew Rabinovich. GradNorm: Gradient
normalization for adaptive loss balancing in deep multitask networks. In Jennifer Dy and Andreas
Krause (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80
of Proceedings of Machine Learning Research, pp. 794–803. PMLR, 10–15 Jul 2018. URL
https://proceedings.mlr.press/v80/chen18a.html.
Arman Cohan, Sergey Feldman, Iz Beltagy, Doug Downey, and Daniel Weld. SPECTER:
Document-level representation learning using citation-informed transformers. In Proceedings of
the 58th Annual Meeting of the Association for Computational Linguistics, pp. 2270–2282, On-
line, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.207.
URL https://aclanthology.org/2020.acl-main.207.
Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. Supervised
learning of universal sentence representations from natural language inference data. In Proceed-
ings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 670–
680, Copenhagen, Denmark, September 2017. Association for Computational Linguistics. doi:
10.18653/v1/D17-1070. URL https://aclanthology.org/D17-1070.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep
bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of
the North American Chapter of the Association for Computational Linguistics: Human Language
Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186, Minneapolis, Minnesota, June
2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1423. URL https:
//aclanthology.org/N19-1423.
Christopher Fifty, Ehsan Amid, Zhe Zhao, Tianhe Yu, Rohan Anil, and Chelsea Finn. Efficiently
identifying task groupings for multi-task learning. In NeurIPS, 2021.
Tianyu Gao, Xingcheng Yao, and Danqi Chen. SimCSE: Simple contrastive learning of sen-
tence embeddings. In Proceedings of the 2021 Conference on Empirical Methods in Natural
Language Processing, Online and Punta Cana, Dominican Republic, November 2021. Associ-
ation for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.552. URL https:
//aclanthology.org/2021.emnlp-main.552.
Winston Haynes. Tukey’s Test, pp. 2303–2304. Springer New York, New York, NY, 2013. ISBN
978-1-4419-9863-7. doi: 10.1007/978-1-4419-9863-7_1212. URL https://doi.org/10.
1007/978-1-4419-9863-7_1212.
Matthew Honnibal, Ines Montani, Sofie Van Landeghem, and Adriane Boyd. spacy: Industrial-
strength natural language processing in python. 2020. doi: 10.5281/zenodo.1212303.
Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, An-
drea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for
NLP. In Kamalika Chaudhuri and Ruslan Salakhutdinov (eds.), Proceedings of the 36th In-
ternational Conference on Machine Learning, volume 97 of Proceedings of Machine Learning
Research, pp. 2790–2799. PMLR, 09–15 Jun 2019. URL https://proceedings.mlr.
press/v97/houlsby19a.html.
Ganesh J, Manish Gupta, and Vasudeva Varma. Doc2sent2vec: A novel two-phase approach for
learning document representation. In Proceedings of the 39th International ACM SIGIR Con-
ference on Research and Development in Information Retrieval, SIGIR ’16, pp. 809–812, New
York, NY, USA, 2016. Association for Computing Machinery. ISBN 9781450340694. doi:
10.1145/2911451.2914717. URL https://doi.org/10.1145/2911451.2914717.
Naman Jain and Mayank Singh. Tweetpap: A dataset to study the social media discourse of scientific
papers. In 2021 ACM/IEEE Joint Conference on Digital Libraries (JCDL), pp. 328–329, 2021.
doi: 10.1109/JCDL52503.2021.00055.
Nitish Shirish Keskar, Bryan McCann, Lav R. Varshney, Caiming Xiong, and Richard Socher. Ctrl:
A conditional transformer language model for controllable generation. ArXiv, abs/1909.05858,
2019.
Quoc Le and Tomas Mikolov. Distributed representations of sentences and documents. In Proceed-
ings of the 31st International Conference on International Conference on Machine Learning -
Volume 32, ICML’14, pp. II–1188–II–1196. JMLR.org, 2014.
Carolyn E Lipscomb. Medical subject headings (mesh). Bulletin of the Medical Library Association,
88(3):265, 2000.
Chundi Liu, Shunan Zhao, and Maksims Volkovs. Unsupervised document embedding with cnns.
ArXiv, abs/1711.04168, 2017.
Shengchao Liu, Yingyu Liang, and Anthony Gitter. Loss-balanced task weighting to reduce negative
transfer in multi-task learning. Proceedings of the AAAI Conference on Artificial Intelligence, 33
(01):9977–9978, Jul. 2019a. doi: 10.1609/aaai.v33i01.33019977. URL https://ojs.aaai.
org/index.php/AAAI/article/view/5125.
Xiang Liu, Torsten Suel, and Nasir Memon. A robust model for paper reviewer assignment. In
Proceedings of the 8th ACM Conference on Recommender Systems, RecSys ’14, pp. 25–32, New
York, NY, USA, 2014. Association for Computing Machinery. ISBN 9781450326681. doi: 10.
1145/2645710.2645749. URL https://doi.org/10.1145/2645710.2645749.
Xiaodong Liu, Jianfeng Gao, Xiaodong He, Li Deng, Kevin Duh, and Ye-yi Wang. Representation
learning using multi-task deep neural networks for semantic classification and information re-
trieval. In Proceedings of the 2015 Conference of the North American Chapter of the Association
for Computational Linguistics: Human Language Technologies, pp. 912–921, Denver, Colorado,
May–June 2015. Association for Computational Linguistics. doi: 10.3115/v1/N15-1092. URL
https://aclanthology.org/N15-1092.
Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. Multi-task deep neural networks
for natural language understanding. In Proceedings of the 57th Annual Meeting of the Association
for Computational Linguistics, pp. 4487–4496, Florence, Italy, July 2019b. Association for Com-
putational Linguistics. doi: 10.18653/v1/P19-1441. URL https://aclanthology.org/
P19-1441.
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Confer-
ence on Learning Representations (ICLR), 2019.
David Mimno and Andrew McCallum. Expertise modeling for matching papers with reviewers.
In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining, KDD ’07, pp. 500–509, New York, NY, USA, 2007. Association for Com-
puting Machinery. ISBN 9781595936097. doi: 10.1145/1281192.1281247. URL https://doi.org/10.1145/1281192.1281247.
Sheshera Mysore, Tim O’Gorman, Andrew McCallum, and Hamed Zamani. Csfcube-a test collec-
tion of computer science research articles for faceted query by example. In Thirty-fifth Conference
on Neural Information Processing Systems (NeurIPS), 2021.
Sheshera Mysore, Arman Cohan, and Tom Hope. Multi-vector models with textual guidance
for fine-grained scientific document similarity. In Proceedings of the 2022 Conference of
the North American Chapter of the Association for Computational Linguistics: Human Lan-
guage Technologies, Seattle, United States, July 2022. Association for Computational Linguis-
tics. doi: 10.18653/v1/2022.naacl-main.331. URL https://aclanthology.org/2022.
naacl-main.331.
Arvind Neelakantan, Tao Xu, Raul Puri, Alec Radford, Jesse Michael Han, Jerry Tworek, Qiming
Yuan, Nikolas A. Tezak, Jong Wook Kim, Chris Hallacy, Johannes Heidecke, Pranav Shyam,
Boris Power, Tyna Eloundou Nekoul, Girish Sastry, Gretchen Krueger, David P. Schnurr, Fe-
lipe Petroski Such, Kenny Sai-Kin Hsu, Madeleine Thompson, Tabarak Khan, Toki Sherbakov,
Joanne Jang, Peter Welinder, and Lilian Weng. Text and code embeddings by contrastive pre-
training. ArXiv, abs/2201.10005, 2022.
Malte Ostendorff, Till Blume, Terry Ruas, Bela Gipp, and Georg Rehm. Specialized doc-
ument embeddings for aspect-based similarity of research papers. In Proceedings of the
22nd ACM/IEEE Joint Conference on Digital Libraries, JCDL ’22, New York, NY, USA,
2022a. Association for Computing Machinery. ISBN 9781450393454. doi: 10.1145/3529372.
3530912. URL https://doi.org/10.1145/3529372.3530912.
Malte Ostendorff, Nils Rethmeier, Isabelle Augenstein, Bela Gipp, and Georg Rehm. Neighborhood
Contrastive Learning for Scientific Document Representations with Citation Embeddings. In The
2022 Conference on Empirical Methods in Natural Language Processing (EMNLP 2022), Abu
Dhabi, December 2022b. Association for Computational Linguistics. doi: 10.48550/arXiv.2202.
06671. 7-11 December 2022. Accepted for publication.
Vishakh Padmakumar, Leonard Lausen, Miguel Ballesteros, Sheng Zha, He He, and George
Karypis. Exploring the role of task transferability in large-scale multi-task learning. In Pro-
ceedings of the 2022 Conference of the North American Chapter of the Association for Compu-
tational Linguistics: Human Language Technologies, pp. 2542–2550, Seattle, United States, July
2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.naacl-main.183. URL
https://aclanthology.org/2022.naacl-main.183.
Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee,
and Luke Zettlemoyer. Deep contextualized word representations. In Proceedings of the 2018
Conference of the North American Chapter of the Association for Computational Linguistics: Hu-
man Language Technologies, Volume 1 (Long Papers), pp. 2227–2237, New Orleans, Louisiana,
June 2018. Association for Computational Linguistics. doi: 10.18653/v1/N18-1202. URL
https://aclanthology.org/N18-1202.
Jonas Pfeiffer, Aishwarya Kamath, Andreas Rücklé, Kyunghyun Cho, and Iryna Gurevych. Adapter-
Fusion: Non-destructive task composition for transfer learning. In Proceedings of the 16th Con-
ference of the European Chapter of the Association for Computational Linguistics: Main Volume,
pp. 487–503, Online, April 2021. Association for Computational Linguistics. doi: 10.18653/v1/
2021.eacl-main.39. URL https://aclanthology.org/2021.eacl-main.39.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi
Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-
text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020. URL http:
//jmlr.org/papers/v21/20-074.html.
Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using Siamese BERT-
networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Lan-
guage Processing and the 9th International Joint Conference on Natural Language Processing
(EMNLP-IJCNLP), pp. 3982–3992, Hong Kong, China, November 2019. Association for Com-
putational Linguistics. doi: 10.18653/v1/D19-1410. URL https://aclanthology.org/
D19-1410.
Victor Sanh, Albert Webson, Colin Raffel, Stephen Bach, Lintang Sutawika, Zaid Alyafeai, Antoine
Chaffin, Arnaud Stiegler, Arun Raja, Manan Dey, et al. Multitask prompted training enables
zero-shot task generalization. In International Conference on Learning Representations, 2021.
Vikram Shyam, Lauren Friend, Brian Whiteaker, Nicholas Bense, Jonathan Dowdall, Bishoy Boktor,
Manju Johny, Isaias Reyes, Angeera Naser, Nikhitha Sakhamuri, Victoria Kravets, Alexandra
Calvin, Kaylee Gabus, Delonte Goodman, Herbert Schilling, Calvin Robinson, Robert Omar
Reid II, and Colleen Unsworth. Petal (periodic table of life) and physiomimetics. Designs, 3
(3), 2019. doi: 10.3390/designs3030043. URL https://www.mdpi.com/2411-9660/3/
3/43.
Asa Cooper Stickland and Iain Murray. BERT and PALs: Projected attention layers for ef-
ficient adaptation in multi-task learning. In Kamalika Chaudhuri and Ruslan Salakhutdinov
(eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of
Proceedings of Machine Learning Research, pp. 5986–5995. PMLR, 09–15 Jun 2019. URL
https://proceedings.mlr.press/v97/stickland19a.html.
Shivashankar Subramanian, Daniel King, Doug Downey, and Sergey Feldman. S2AND: A Bench-
mark and Evaluation System for Author Name Disambiguation. In JCDL ’21: Proceedings of
the ACM/IEEE Joint Conference on Digital Libraries in 2021, JCDL ’21, New York, NY, USA,
2021. Association for Computing Machinery.
Yi Tay, Mostafa Dehghani, Vinh Quang Tran, Xavier García, Dara Bahri, Tal Schuster, Huaixiu
Zheng, Neil Houlsby, and Donald Metzler. Unifying language learning paradigms. ArXiv,
abs/2205.05131, 2022.
Amalie Trewartha, Nicholas Walker, Haoyan Huo, Sanghoon Lee, Kevin Cruse, John Dagdelen,
Alexander Dunn, Kristin A. Persson, Gerbrand Ceder, and Anubhav Jain. Quantifying the ad-
vantage of domain-specific pre-training on named entity recognition tasks in materials science.
Patterns, 3(4):100488, 2022. doi: https://doi.org/10.1016/j.patter.2022.100488. URL https:
//www.sciencedirect.com/science/article/pii/S2666389922000733.
Marco Valenzuela, Vu A. Ha, and Oren Etzioni. Identifying meaningful citations. In AAAI Work-
shop: Scholarly Big Data, 2015.
Christophe Van Gysel and Maarten de Rijke. Pytrec eval: An extremely fast python interface to
trec eval. In The 41st International ACM SIGIR Conference on Research and Development in In-
formation Retrieval, SIGIR ’18, pp. 873–876, New York, NY, USA, 2018. Association for Com-
puting Machinery. ISBN 9781450356572. doi: 10.1145/3209978.3210065. URL https://doi.org/10.1145/3209978.3210065.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez,
Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Proceedings of the 31st In-
ternational Conference on Neural Information Processing Systems, NeurIPS’17, pp. 6000–6010,
Red Hook, NY, USA, 2017. Curran Associates Inc. ISBN 9781510860964.
Ellen Voorhees, Tasmeer Alam, Steven Bedrick, Dina Demner-Fushman, William R. Hersh, Kyle
Lo, Kirk Roberts, Ian Soboroff, and Lucy Lu Wang. Trec-covid: Constructing a pandemic in-
formation retrieval test collection. SIGIR Forum, 54(1), feb 2021. ISSN 0163-5840. doi: 10.
1145/3451964.3451965. URL https://doi.org/10.1145/3451964.3451965.
Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer
Levy, and Samuel Bowman. Superglue: A stickier benchmark for general-purpose language un-
derstanding systems. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and
R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 32. Curran As-
sociates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/file/
4496bf24afe7fab6f046bf4923da8de6-Paper.pdf.
Kuansan Wang, Iris Shen, Charles Huang, Chieh-Han Wu, Yuxiao Dong, and An-
shul Kanakia. Microsoft academic graph: when experts are not enough. Quanti-
tative Science Studies, 1(1):396–413, February 2020a. doi: 10.1162/qss_a_00021.
URL https://www.microsoft.com/en-us/research/publication/
microsoft-academic-graph-when-experts-are-not-enough/.
Lucy Lu Wang, Kyle Lo, Yoganand Chandrasekhar, Russell Reas, Jiangjiang Yang, Doug Burdick,
Darrin Eide, Kathryn Funk, Yannis Katsis, Rodney Michael Kinney, Yunyao Li, Ziyang Liu,
William Merrill, Paul Mooney, Dewey A. Murdick, Devvret Rishi, Jerry Sheehan, Zhihong Shen,
Brandon Stilson, Alex D. Wade, Kuansan Wang, Nancy Xin Ru Wang, Christopher Wilhelm,
Boya Xie, Douglas M. Raymond, Daniel S. Weld, Oren Etzioni, and Sebastian Kohlmeier. CORD-
19: The COVID-19 open research dataset. In Proceedings of the 1st Workshop on NLP for
COVID-19 at ACL 2020, Online, July 2020b. Association for Computational Linguistics. URL
https://aclanthology.org/2020.nlpcovid19-acl.1.
Lingfei Wu, Ian En-Hsu Yen, Kun Xu, Fangli Xu, Avinash Balakrishnan, Pin-Yu Chen, Pradeep
Ravikumar, and Michael J. Witbrock. Word mover’s embedding: From Word2Vec to document
embedding. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language
Processing, pp. 4524–4534, Brussels, Belgium, October-November 2018. Association for Com-
putational Linguistics. doi: 10.18653/v1/D18-1482. URL https://aclanthology.org/
D18-1482.
Peng Xu, Xinchi Chen, Xiaofei Ma, Zhiheng Huang, and Bing Xiang. Contrastive document rep-
resentation learning with graph attention networks. In Findings of the Association for Computa-
tional Linguistics: EMNLP 2021, pp. 3874–3884, Punta Cana, Dominican Republic, November
2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.findings-emnlp.327.
URL https://aclanthology.org/2021.findings-emnlp.327.
Michihiro Yasunaga, Jure Leskovec, and Percy Liang. LinkBERT: Pretraining language mod-
els with document links. In Proceedings of the 60th Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers), pp. 8003–8016, Dublin, Ireland, May
2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.551. URL
https://aclanthology.org/2022.acl-long.551.
Qinyuan Ye, Bill Yuchen Lin, and Xiang Ren. CrossFit: A few-shot learning challenge for cross-task
generalization in NLP. In Proceedings of the 2021 Conference on Empirical Methods in Natural
Language Processing, pp. 7163–7189, Online and Punta Cana, Dominican Republic, November
2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.572. URL
https://aclanthology.org/2021.emnlp-main.572.
Hamed Zamani, Mostafa Dehghani, W. Bruce Croft, Erik Learned-Miller, and Jaap Kamps. From
neural re-ranking to neural ranking: Learning a sparse representation for inverted indexing. In
Proceedings of the 27th ACM International Conference on Information and Knowledge Manage-
ment, CIKM ’18, pp. 497–506, New York, NY, USA, 2018. Association for Computing Machin-
ery. ISBN 9781450360142. doi: 10.1145/3269206.3271800. URL https://doi.org/10.
1145/3269206.3271800.
Shunyu Zhang, Yaobo Liang, Ming Gong, Daxin Jiang, and Nan Duan. Multi-view document repre-
sentation learning for open-domain dense retrieval. In Proceedings of the 60th Annual Meeting of
the Association for Computational Linguistics (Volume 1: Long Papers), pp. 5990–6000, Dublin,
Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.
414. URL https://aclanthology.org/2022.acl-long.414.
Yue Zhao, Ajay Anand, and Gaurav Sharma. Reviewer recommendations using document vector
embeddings and a publisher database: Implementation and evaluation. IEEE Access, 10:21798–
21811, 2022. doi: 10.1109/ACCESS.2022.3151640.
A SCIREPEVAL TASKS
A.1 AD-HOC SEARCH

Search We used clickthrough data from an academic search engine. Only search queries with at
least 10 results were included, and a set of heuristic rules were applied to exclude likely noise and
bots. We removed author queries when the query was judged to contain any person tokens by named
entity recognition (Honnibal et al., 2020).
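A sketch of this filtering step using spaCy is shown below; the pipeline name and the exclusion rule are illustrative assumptions, not the exact heuristics used to build the dataset.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed spaCy pipeline

def is_author_query(query: str) -> bool:
    """Flag queries containing a PERSON entity so they can be excluded as author queries."""
    return any(ent.label_ == "PERSON" for ent in nlp(query).ents)

queries = ["transformers for document ranking", "papers by Marie Curie"]
print([q for q in queries if not is_author_query(q)])  # keeps only non-author queries
```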
Feeds Research feeds help researchers maintain a library of papers they are currently reading and
also index them by topics. We use anonymized research feeds data from an academic search engine
that recommends papers based on a user’s library. This data includes information on whether users
found the recommendations relevant or not. The data contains 430 feeds which have more than five
positive and two negative paper annotations from users. We use this to create the Feeds-1, Feeds-M
and Feeds Title tasks. The first two are proximity tasks and are described in section 3.
Feeds Title: The title of a research feed provided by the user usually indicates the topic of scientific
articles in the feed. While the other two datasets have papers as query and belong to the proximity
family, this dataset is classified as ad-hoc search, as the query is a short text snippet rather than a
paper. We remove feeds with generic titles like ‘My field’ and ‘Final Project’; replace abbreviations
with their long forms where possible and filter out feeds with non-English titles.
TREC-CoVID TREC-CoVID (Voorhees et al., 2021) provides textual search queries for ranking papers from the CORD-19 corpus (Wang et al., 2020b), along with their relevance scores on a scale of 0-2. Each query consists of a short title, a question asking for the required information, and a narrative briefly describing exactly the type of information the results should contain. For our evaluation we combine these fields into a single text separated by the [SEP] token.
A.2 PROXIMITY
S2AND and Same Author Detection The S2AND dataset (Subramanian et al., 2021) contains
signatures (author-paper pairs) that are clustered according to which author mentions refer to the
same person. Due to the high resource requirements of running the original S2AND evaluation, we
create S2AND-mini, a version of S2AND with only 1000 blocks from each of S2AND’s dataset
sources and at most 500 signatures per block. Our evaluation of S2AND-mini follows the original
evaluation of S2AND; that is, our method’s document embeddings are used along with author and
paper metadata to create features for a clustering algorithm that consists of a pairwise scoring model
followed by greedy agglomerative clustering. We use B3 F1 (Bagga & Baldwin, 1998) as in the
original paper for evaluation.
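For reference, a minimal sketch of the B3 metric computation is given below; it is a standalone illustration of the metric (Bagga & Baldwin, 1998), not the S2AND evaluation pipeline itself, and the toy gold/predicted clusterings are placeholders.

```python
from collections import defaultdict

def b3_precision_recall_f1(gold, pred):
    """B^3 precision, recall, and F1 given gold and predicted cluster ids per item."""
    gold_clusters, pred_clusters = defaultdict(set), defaultdict(set)
    for item, c in gold.items():
        gold_clusters[c].add(item)
    for item, c in pred.items():
        pred_clusters[c].add(item)
    p = r = 0.0
    for item in gold:
        overlap = len(gold_clusters[gold[item]] & pred_clusters[pred[item]])
        p += overlap / len(pred_clusters[pred[item]])
        r += overlap / len(gold_clusters[gold[item]])
    p, r = p / len(gold), r / len(gold)
    return p, r, 2 * p * r / (p + r)

gold = {"a": 1, "b": 1, "c": 2, "d": 2}
pred = {"a": 1, "b": 2, "c": 2, "d": 2}
print(b3_precision_recall_f1(gold, pred))
```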
We also use S2AND to create the data for our same-author detection task. Unlike the original
S2AND evaluation, our same-author task uses only paper embeddings without any additional author
or paper metadata, which allows us to directly train the embedding model on the data. Same-author
detection is formulated as a triplet ranking task; given three papers of which two share an author,
the goal is to find the matching pair.
Feeds-1 We re-purpose the feeds dataset from section A.1 for this and the next task. The first
paper added to a feed chronologically serves as the query. The next 5 positive user annotations are
considered relevant and 5 negative candidates are sampled either from user annotations or randomly.
Feeds-M Given K positive papers annotated in a feed (assuming K > 5), we use the first M =
K − 5 as queries. For every query, the positive candidates are sampled from all the papers the
user positively annotated after the query paper was added to their feed, and negative candidates are
sampled from user annotations or randomly.
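The Feeds-M construction can be sketched as follows; the data structures and sampling details are illustrative assumptions, not the released curation code:

import random

def build_feeds_m_instances(positives_in_order, negatives_pool, n_pos=5, n_neg=5, seed=0):
    # positives_in_order: a feed's positively annotated papers in chronological order.
    rng = random.Random(seed)
    instances = []
    k = len(positives_in_order)
    for m in range(max(0, k - 5)):            # the first M = K - 5 papers act as queries
        query = positives_in_order[m]
        later_positives = positives_in_order[m + 1:]
        pos = rng.sample(later_positives, min(n_pos, len(later_positives)))
        neg = rng.sample(negatives_pool, min(n_neg, len(negatives_pool)))
        instances.append({"query": query, "pos": pos, "neg": neg})
    return instances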
Peer Reviewer Matching In this task the goal is to judge whether a given paper is relevant to a
potential reviewer. As data for this task is hard to obtain at scale due to the double-blind nature of
many conferences and journals, we combine multiple existing reviewer-paper matching datasets:
• Mimno & McCallum (2007), with 393 paper-reviewer relevance ratings from a corpus of 148
NeurIPS 2006 papers and 364 reviewers, annotated by nine human experts.
• Liu et al. (2014), an extension of Mimno & McCallum (2007) which adds 766 additional
paper-reviewer relevance annotations.
• Zhao et al. (2022), with 694 paper-reviewer relevance ratings from a corpus of 75 papers
and 1833 reviewers from the IEEE ICIP 2016 conference, annotated by 3 human experts.
All datasets have been annotated on the same 0-3 relevance rating scale. The candidate reviewers are
all researchers, and we embed all the papers written by them using our models. To obtain the model’s
score for each candidate reviewer, we compute the cosine similarity between the query paper and
each of the candidate’s papers, and take the mean of the top 3 similarities as the score. We consider
two ways to map the 0-3 relevance judgements to binary labels, a hard and a soft decision: under the
soft decision a score of 2 or 3 is considered relevant, while under the hard decision only a score of 3 is
considered relevant. Precision at 5 (P@5) and at 10 (P@10) are computed for both decision rules,
yielding four numbers that are averaged to produce the single score reported in our final results for this task.
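A minimal sketch of the reviewer scoring and label mapping described above (helper names are ours; the released evaluation code may differ in detail):

import numpy as np

def reviewer_score(query_emb, reviewer_paper_embs, top_k=3):
    # Mean of the top-k cosine similarities between the query paper and the
    # candidate reviewer's papers.
    q = query_emb / np.linalg.norm(query_emb)
    r = reviewer_paper_embs / np.linalg.norm(reviewer_paper_embs, axis=1, keepdims=True)
    sims = r @ q
    return float(np.sort(sims)[-top_k:].mean())

def binary_relevance(rating, hard):
    # Hard decision keeps only a rating of 3; soft decision keeps 2 and 3.
    return rating == 3 if hard else rating >= 2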
Highly Influential Citations In this task, given a paper A and paper B, we aim to predict whether
B is highly influenced by A. As measuring influence is subjective and human annotation is expen-
sive, we approximate influence by counting the number of times A is cited in the text of B. If A is
cited at least 4 times, we consider it to be highly influential (a positive example in our triplet-based
loss); otherwise, we consider it to be a negative example. During evaluation, we sample query
papers that have at least 5 positive candidates and use L2 distance for similarity ranking.
Note that our definition of ‘influential’ differs from that in Valenzuela et al. (2015).
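The evaluation-time ranking can be sketched as follows (our own helper, not the benchmark code):

import numpy as np

def rank_by_l2(query_emb, candidate_embs):
    # Sort candidates by ascending L2 distance to the query so that papers
    # predicted as highly influenced by the query come first.
    dists = np.linalg.norm(candidate_embs - query_emb, axis=1)
    return np.argsort(dists)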
Citation Prediction (SPECTER Pre-training Triplets) This is the task and dataset used for pre-
training in Cohan et al. (2020). It is based on citation links between scientific documents where each
instance is a triplet consisting of a query, a positive and a negative paper. Each query can have up to
five triplets, where the positives are sampled from papers directly cited by the query and negatives
are chosen either randomly (easy) or from citations of citations (hard). Three easy and two hard negatives
are chosen for each query. To evaluate the effectiveness of this pre-training we follow Cohan et al.
(2020) and use SciDocs for evaluation, excluding the recommendations task.
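The triplet objective applied to this data (and to the other proximity and ad-hoc search tasks) can be sketched as below; the margin value and the use of L2 distance are assumptions rather than the exact released hyperparameters:

import torch
import torch.nn.functional as F

def triplet_loss(query, positive, negative, margin=1.0):
    # Pull cited (positive) papers closer to the query than sampled negatives.
    d_pos = F.pairwise_distance(query, positive)   # distance to the positive paper
    d_neg = F.pairwise_distance(query, negative)   # distance to the negative paper
    return torch.clamp(d_pos - d_neg + margin, min=0).mean()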
A.3 CLASSIFICATION
MeSH Descriptors Medical Subject Headings (MeSH) (Lipscomb, 2000) indexes biomedical
publications into a categorical hierarchy consisting of descriptors and qualifiers, which refer to topic
headings and specific aspects related to a topic, respectively. The dataset is a collection of scientific
documents belonging to the 30 most frequently occurring top-level MeSH descriptors and having
exactly one qualifier; we filter out records that do not have an associated qualifier. The descriptors
then serve as the labels in the multi-class classification task.
Fields of Study (FoS) The FoS task is a multi-label classification problem where each scientific
document is assigned one or more classes out of 23 possible fields. For gold test data, we manually
labeled 471 papers into at most three fields-of-study. For silver training data, we assumed that a
paper within a venue generally falls within a narrow set of fields and manually assigned FoS labels
to publication venues. We then propagated the venue labels to the papers published therein.
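A sketch of this silver-label propagation (the data structures are hypothetical; the actual labeling pipeline may differ):

def propagate_venue_labels(papers, venue_to_fields):
    # Assign each paper the manually curated field-of-study labels of its
    # publication venue; papers whose venue was not labeled get no silver label.
    return {
        paper["id"]: venue_to_fields[paper["venue"]]
        for paper in papers
        if paper["venue"] in venue_to_fields
    }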
To evaluate different data sizes, we obtain the F1 score on the gold data in three settings: 5-shot,
10-shot, and the complete gold test set. The average of these scores is treated as the score for this
task when computing the overall average score for the benchmark.
Disease Research State Model (DRSM) DRSM (Burns, 2022) is a collection of Pubmed papers
that deal with six specific aspects of rare diseases. The gold data is annotated by in-house experts and
used for evaluation, while the silver data is generated by annotation service providers with medical
expertise.
Similar to FoS, we obtain the F1 score on 24-shot, 64-shot, and full data, then average the results
before computing the final benchmark score.
Biomimicry We sample tags for a set of papers in the PeTaL database (Shyam et al., 2019) to
create a binary classification dataset with labels indicating whether each paper is about biomimicry.
The data is unbalanced, with only 13% positive samples. We evaluate in the 16-shot, 64-shot, and
full-data setups and take the mean as the final score.
A.4 REGRESSION
Citation Count We sample a collection of scientific articles published in 2016 from the set of
papers in the search dataset described in A.1, so that a 5-year period has passed for them to accrue
citations. Each article has at least one citation, and the citation counts are converted to log scale.
Year of Publication The aim of this task is to determine research trends by predicting the year of
publication of a scientific article. We sample publications from the search dataset with a publication
date after the year 2005 and scale the years so that their values lie between 0 and 1. Further, since
this task is used for training alongside citation count prediction, the labels are additionally scaled by
the mean of the citation-count labels to align the two loss scales.
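A sketch of the label preprocessing for the two regression training tasks (our own code; the exact normalization in the released data pipeline may differ):

import numpy as np

def make_regression_labels(citation_counts, years):
    # Citation counts (each >= 1) are mapped to log scale; years are min-max
    # scaled to [0, 1] and then multiplied by the mean of the citation labels
    # so the two regression losses are on a similar scale.
    citation_labels = np.log(np.asarray(citation_counts, dtype=float))
    years = np.asarray(years, dtype=float)
    year_labels = (years - years.min()) / (years.max() - years.min())
    return citation_labels, year_labels * citation_labels.mean()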
Table 6: Assigned input formats and control codes for each task format. [CLF], [RGN], [PRX] and
[QRY] are special tokens, and doc is the input.
Peer Review Score We use the OpenReview API10 to collect paper metadata and corresponding
review scores for ICLR conferences from 2017 to 2022. Each reviewer in ICLR assigns a final rating
in the range [0-10], and we take the mean rating as the label for every paper.
h-Index of Authors In this task the goal is to predict the maximum h-Index of any of the authors
of a scientific publication. We re-use the peer review score dataset, obtain the h-Index of all the
authors of each paper using the Semantic Scholar API11, and pick the maximum as the label. The labels
are normalized to lie in [0, 1].
Tweet Mentions The goal of this task is to predict the combined number of a paper’s mentions
and retweets. We post-process the dataset created by Jain & Singh (2021), which contains tweets about
arXiv papers from 2010-19. The sum of the normalized counts of mentions and retweets is the score
to be predicted.
B IMPLEMENTATION DETAILS
During pre-training, all the tasks with the same format share their task-format-specific parameters.
The control-code paradigm introduces four new special tokens to the vocabulary, which are randomly
initialized. We also tried initializing these additional embeddings with the [CLS] token embedding
and with the [CLS] embedding plus noise, but the choice has little impact on the resulting model
performance, with random initialization being better on average. We further tried loss-weighting
strategies (Chen et al., 2018; Liu et al., 2019a), but our preliminary experiments produced better
results without any scaling, so we did not explore them further. All the base models are trained for two
epochs on two 48GB NVIDIA Quadro RTX 8000 GPUs with 16-bit precision, an effective batch size
of 256, and a maximum input length of 512 tokens. Each batch is sampled with an equal number
of examples from each task.12 We use AdamW (Loshchilov & Hutter, 2019) with ε = 1e-8. The
learning rate follows an inverse square root schedule with a linear warmup of 700 steps and a peak of
5e-5.
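A minimal sketch of this setup with the HuggingFace and PyTorch APIs (the checkpoint identifier and the exact schedule implementation are illustrative, not taken from the released training code):

import torch
from transformers import AutoModel, AutoTokenizer

# Load a base encoder and add the four control-code tokens; the new token
# embeddings created by resize_token_embeddings are randomly initialized.
tokenizer = AutoTokenizer.from_pretrained("malteos/scincl")
model = AutoModel.from_pretrained("malteos/scincl")
tokenizer.add_special_tokens({"additional_special_tokens": ["[CLF]", "[RGN]", "[PRX]", "[QRY]"]})
model.resize_token_embeddings(len(tokenizer))

# AdamW with eps = 1e-8, linear warmup over 700 steps to a peak of 5e-5,
# then inverse square-root decay, as described above.
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, eps=1e-8)

def inv_sqrt_with_warmup(step, warmup=700):
    if step < warmup:
        return (step + 1) / warmup
    return (warmup / (step + 1)) ** 0.5

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, inv_sqrt_with_warmup)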
The adapter approaches follow the two-step training process and learning-rate configurations de-
scribed in Pfeiffer et al. (2021). One adapter per task family is attached to the base model in both
the single-adapter and fusion stages and is trained for a maximum of 6 and 4 epochs, respectively. For
PALs, one layer is added per task format and the entire network is trained for 2 epochs, as in Stickland
& Murray (2019).
B.1 EVALUATION
For classification and regression, we train a linear SVM on each downstream task using the em-
beddings as input, and we tune the regularization parameter C via grid search. Multi-class and
multi-label classification are configured under the one-vs-all classifier setting.
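A minimal sketch of this linear probe with scikit-learn; the C grid and number of cross-validation folds are assumptions, and regression tasks would analogously use LinearSVR:

from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

def fit_linear_probe(train_embeddings, train_labels, one_vs_all=False):
    # Linear SVM over frozen document embeddings, with C tuned by grid search;
    # the one-vs-all wrapper handles multi-class and multi-label tasks.
    svm = GridSearchCV(LinearSVC(), {"C": [0.01, 0.1, 1, 10, 100]}, cv=3)
    clf = OneVsRestClassifier(svm) if one_vs_all else svm
    return clf.fit(train_embeddings, train_labels)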
Table 7: Comparison of our SciNCL multi-format methods with ASPIRE on proximity tasks.
The best results for each base model are underlined. TS: Text Supervision; OT: Optimal Transport.
C ASPIRE EVALUATION
ASPIRE (Mysore et al., 2022) produces representations for the dense retrieval of scientific docu-
ments based on matching multiple aspects between the query and candidate papers. To evaluate these
representations under the settings they are designed for, we report results only on the proximity
tasks in Table 7. We use the model implementations available on HuggingFace, which have been pre-
trained on documents from the Computer Science (CS) and Biomedical (Bio) domains. The model
variants can be further sub-categorized by whether retrieval is based on the best single aspect match
(TS ASPIRE) or on an Optimal Transport-based weighted sum of similarity scores across all aspects
(OT ASPIRE) between the query and candidates. Both of our multi-format approaches, with control
codes and with adapters, produce better results overall and on out-of-train tasks. Note, however, that
since ASPIRE models are trained on co-citations, they perform much better on average on the
citation-based tasks from SciDocs.
Table 8: Data domain distribution in SciRepEval for the training tasks and comparison with SciDocs.
We group the unique documents in both benchmarks by their MAG (Wang et al., 2020a) fields
of study and present the counts in columns 2 and 3 and the absolute increase per field in column 4.
10 https://api.openreview.net
11 https://api.semanticscholar.org/
12 We experimented with mixed and task-sequential batching as well, which did not yield good results.
Figure 2: Correlations of model performances between tasks in SciRepEval.
We study the domain diversity of SciRepEval and display the results in Table 8. To compare against
the training data for SciDocs, we consider the citation prediction triplets on which SPECTER is
trained, which are also a subset of the SciRepEval in-train tasks. Even though Medicine and Computer
Science papers still form the bulk of the data, SciRepEval has 105x more documents on average per
domain compared to the SPECTER triplets.
E SPECTER OBJECTIVE
Lastly, we perform an ablation study to better understand the importance of the unsupervised
citation-based training objective. We used SciBERT as the base model for this ablation since both
SPECTER and SciNCL were trained with the citation objective. Removing the citation objective and
its accompanying data from SciBERT + MTL CTRL, we find that the in-train performance drops
from 61.9 to 61.8, while out-of-train drops from 57.9 to 57.5, hinting that the citation objective may
be helpful for generalization to new tasks.
In Figure 2 we show Pearson correlations of model performance metrics between tasks in SciRepEval.
To compute the correlations, we include all of the individual task results of the model runs
shown in Table 2 and Table 3, excluding the ensembles. The correlations between tasks in SciDocs
(bottom right) are the highest, while correlations between tasks in the entirety of SciRepEval span a
larger range. Notably, DRSM-Complete and S2AND are uncorrelated with most other tasks. This
shows that the overall task diversity is larger in SciRepEval than in SciDocs.
Table 9: Task relatedness analysis for choosing a sub-group of tasks to train on so as to obtain optimal
performance. The base SciNCL model (Ostendorff et al., 2022b) is trained on one or more task
formats (rows) and then evaluated for comparison with SciNCL CTRL (last row). Both per-task-format
and overall average performance are reported (columns). The best training combination for
every task is highlighted in bold. The best single and combined training results for every evaluated
task format are underlined.