CL Honours Report
Spring 2022
Name: Suyash Vardhan Mathur
Roll no: 2019114006
Objectives
1. To gain a proper understanding of Natural Language Inference (NLI) and examine the points
where it can go wrong.
2. To understand the various datasets that have been created for the NLI task, including those
that use adversarial techniques to improve data quality, and to understand why older NLI
datasets fall short of the mark and what made them problematic.
3. To look at various approaches that have been used to address the shortcomings of NLI,
including the use of knowledge graphs, open-domain NLI, etc.
4. In order to gain a proper understanding of GNNs and knowledge graphs, I also worked
on a project on multi-hop question answering over knowledge graphs, which will help in
continuing the work on improving NLI using KGs.
Problem Background
Natural Language Inference Task
Natural language inference (NLI) is the task of determining whether a "hypothesis" is true
(entailment), false (contradiction), or undetermined (neutral) given a "premise". While the
exact label set may differ across datasets, the core task remains the same throughout NLI.
An example of the task is:
Premise: An older and younger man smiling.
Label: neutral
Hypothesis: Two men are smiling and laughing at the cats playing on the floor.

Premise: A soccer game with multiple males playing.
Label: entailment
Hypothesis: Some men are playing a sport.
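As a quick illustration of the task, the following is a minimal sketch (not part of the report's experiments) of running an off-the-shelf MNLI-finetuned model from the Hugging Face hub on one of the premise/hypothesis pairs above; the checkpoint name is just an assumed example.

```python
# Minimal sketch: classify a premise/hypothesis pair with an off-the-shelf NLI model.
# "roberta-large-mnli" is an assumed example checkpoint, not one used in this report.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "roberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

premise = "A soccer game with multiple males playing."
hypothesis = "Some men are playing a sport."

inputs = tokenizer(premise, hypothesis, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.softmax(logits, dim=-1)[0]
# id2label maps indices to contradiction/neutral/entailment for this checkpoint
print(model.config.id2label[int(probs.argmax())], probs.tolist())
```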
The semantic concepts of entailment and contradiction are central to all aspects of natural
language meaning from the lexicon to the content of entire texts. Thus, natural language
inference (NLI) — characterizing and using these relations in computational systems — is
essential in tasks ranging from information retrieval to semantic parsing to commonsense
reasoning. Approaches to NLI range from earlier symbolic and statistical methods to more
recent deep learning models. Benchmark datasets for NLI include SNLI, MultiNLI, and SciTail,
among others.
All of these datasets have certain shortcomings: models can exploit annotation artifacts and
syntactic cues rather than the actual meaning of the sentences. To combat such shortcomings,
other datasets have been created that use adversarial techniques to give the model training
instances that discourage reliance on such spurious patterns in the NLI task.
Studies/Experiments
NLI Task
I looked at the following papers to understand the various NLI datasets and the benchmarks
reported on them at the time of their publication:
● Stanford NLI dataset (SNLI)
● Adversarial NLI
● Dataset created using Adversarial Training for Cross-Lingual NLI
SNLI
The SNLI corpus was the first of its kind in terms of its large size. Earlier corpora, despite
being high-quality, hand-labeled datasets that stimulated innovative logical and statistical
models of natural language reasoning, were limited by their small size (fewer than a thousand
examples each), which restricted their utility as a testbed for learned distributed
representations. Even the SICK dataset, a jump in size with only 4,500 training sentences,
introduced some spurious patterns. Further, some datasets that were automatically created or
labelled contained a high amount of noise, which made them suitable only as supplementary
data, not as the primary data for the task. The existing corpora also suffered from
indeterminacies of event and entity coreference, leading to high uncertainty about the correct
semantic label. For example, consider the sentence pair "A boat sank in the Pacific Ocean" and
"A boat sank in the Atlantic Ocean". The pair could be labeled as a contradiction if one
assumes that the two sentences refer to the same event, but could also reasonably be labeled as
neutral if that assumption is not made.
The SNLI dataset tackles the problems above. At 570K pairs, it is two orders of magnitude
larger than all other resources of its type, and all of its sentences and labels were written
by humans in a grounded, naturalistic context. In the paper, the corpus is used to evaluate a
variety of models for natural language inference, including rule-based systems, simple linear
classifiers, and neural network-based models.
Two models are found to have comparable performance: a feature-rich classifier and a neural
network centred around an LSTM. By further evaluating the LSTM, it is shown that it can be
adapted to an existing NLI challenge task, yielding the best reported performance by a neural
network model. SNLI tries to surmount the coreference-indeterminacy issue described above as
follows:
● examples were grounded in specific scenarios, and the premise and hypothesis
sentences in each example were constrained to describe that scenario from the
same perspective
● the prompt gave participants the freedom to produce entirely novel sentences within the
task setting, leading to richer examples
● a subset of the resulting sentences were sent to a validation task aimed at providing
a highly reliable set of annotations
Amazon Mechanical Turk was used for data collection: each worker was presented with premise
scene descriptions from a pre-existing corpus and asked to supply hypotheses for each of the
three labels (entailment, neutral, and contradiction), forcing the data to be balanced among
these classes. Workers were given specific instructions, and a minimum sentence length was not
enforced. For the premises, captions from the Flickr30k corpus were used.
In order to measure the quality of the corpus, and to construct maximally useful testing and
development sets, an additional round of validation was performed for about 10% of the data.
Workers were presented with pairs of sentences in batches of five, and each pair was supplied
to four annotators, yielding five labels per pair including the label used by the original
author. If any one of the three labels was chosen by at least three of the five annotators, it
was chosen as the gold label. The overall rate of agreement was extremely high, suggesting
that the corpus is of sufficiently high quality. The validation also gave insights into the
levels of disagreement across the three semantic classes; this disagreement likely reflects
not just the limitations of large crowdsourcing efforts but also the uncertainty inherent in
naturalistic NLI.
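For concreteness, here is a minimal sketch of the gold-label rule described above (assuming exactly five labels per pair; this is an illustration, not the authors' code).

```python
from collections import Counter

def gold_label(labels):
    """Return the label chosen by at least three of the five annotators, else '-' (no consensus).
    A sketch of the SNLI validation rule described above."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= 3 else "-"

print(gold_label(["entailment", "entailment", "neutral", "entailment", "contradiction"]))  # entailment
```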
The paper finally reports the accuracies of various models on the dataset, ranging from
rule-based and statistical models to neural network based models. Since the paper is dated,
the best results are obtained with LSTMs, which previously could not perform well because of
the smaller datasets available. Even models that used lexicalized features showed better
performance on the SNLI dataset than on the SICK dataset, which again was due to SNLI's sheer
size.
Adversarial NLI (ANLI)
This paper introduces a new large-scale NLI benchmark dataset, collected via an iterative,
adversarial human-and-model-in-the-loop process. The paper shows that models trained on this
new dataset achieve state-of-the-art performance on various popular NLI benchmarks, while its
new test set poses a more difficult challenge.
While challenging benchmarks in other areas took years to surpass, NLU benchmarks have
struggled to keep pace with rapid model progress and are surpassed soon after release. This
fast pace also makes us question whether NLU models are genuinely as good as their benchmark
performance suggests. Considerable evidence suggests that state-of-the-art models learn to
exploit spurious statistical patterns in datasets instead of learning meaning the way humans
do. Given this, human annotators should be able to construct examples that expose such
shortcomings of the models.
In the task setup, the starting point is a base model trained on NLI data. Rather than
employing automated adversarial methods, the model's "adversary" here is a human annotator.
Given a context (premise) and a desired target label, the human writer is asked to provide a
hypothesis that fools the model into misclassifying the label. For every misclassified
example, the writer provides the reason they think it was misclassified, and these examples
are also verified by three human verifiers. Once data collection is finished, a new training
set is constructed from the verified correct examples.
Qualified Mechanical Turk workers were used for the annotations. Annotators were presented
with a context and a target label (either 'entailment', 'contradiction', or 'neutral') and
asked to write a hypothesis that corresponds to the label. The model's predictions (label
probabilities) for the pair are obtained and shown to the annotators. If the model's
prediction is incorrect, the job is complete; if not, the worker continues to write hypotheses
until the model makes an error. Three rounds of annotation were performed.
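The collection procedure can be summarized with the following rough sketch (purely illustrative; ask_writer, model_predict, and verify are hypothetical stand-ins for the human writer, the adversary model, and the human verification step, not the paper's code).

```python
# Rough sketch of one human-and-model-in-the-loop collection episode (illustrative only).
def collect_example(context, target_label, ask_writer, model_predict, verify, max_tries=10):
    for _ in range(max_tries):
        hypothesis = ask_writer(context, target_label)      # human writes a hypothesis
        predicted = model_predict(context, hypothesis)      # adversary model labels the pair
        if predicted != target_label:                       # model fooled: candidate example
            if verify(context, hypothesis, target_label):   # checked by human verifiers
                return (context, hypothesis, target_label)  # goes into the new training set
    return None  # the writer could not fool the model within the budget
```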
Round 1 used a BERT-Large model trained on a concatenation of SNLI and MNLI; the
best-performing model that could be trained was selected as the starting point for the dataset
collection procedure. Contexts were randomly sampled short multi-sentence passages from
Wikipedia (of 250-600 characters), drawn from the manually curated HotpotQA training set.
Contexts are either ground-truth contexts from that dataset, or Wikipedia passages retrieved
using TF-IDF based on a HotpotQA question.
Round 2 used a more powerful RoBERTa model trained on SNLI, MNLI, an NLI version of
FEVER, and the training data from the previous round. Hyperparameters were tuned to give
the best performance on the previous dataset, and using different random seeds, different
models were created. During annotation, an ensemble was constructed by randomly picking
a model from the model set as the adversary at each turn, helping avoid annotators
exploiting vulnerabilities in a single model. Similar as before, new non-overlapping contexts
were taken from Wikipedia via HotpotQA.
Round 3 selected a more diverse set of contexts in order to explore robustness under domain
transfer. The included contexts were: news (extracted from Common Crawl), fiction (extracted
from StoryCloze and CBT), formal spoken text (excerpted from court and presidential debate
transcripts in the Manually Annotated Sub-Corpus (MASC) of the Open American National Corpus),
and causal or procedural text, which describes sequences of events or actions, extracted from
WikiHow. Annotations were also collected using the longer contexts present in the GLUE RTE
training data, which came from the RTE5 dataset. An even stronger RoBERTa ensemble was trained
by adding the training set from the second round (A2) to the training data.
Adversarial Training for Cross-Lingual NLI
This paper proposes the use of multilingual representation models to leverage labeled data
from one language to train a cross-lingual model applicable to multiple languages for NLI.
Further, it proposes a data augmentation strategy for better cross-lingual NLI by enriching
the data to reflect more diversity in a semantically faithful way. To do this, the paper
proposes two methods of training a generative model to produce synthesized examples, and then
leverages the resulting data in an adversarial training regimen for more robustness.
The paper proposes a novel data augmentation scheme to synthesize controllable and much less
noisy data for cross-lingual NLI. This augmentation consists of two parts. One serves to
encourage language adaptation by reordering source-language words based on word alignments, to
better cope with typological divergence between languages; this is denoted Reorder
Augmentation (RA). The other seeks to enrich the set of semantic relationships between a
premise and pertinent hypotheses, and is denoted Semantic Augmentation (SA). Both are achieved
by learning corresponding sequence-to-sequence (Seq2Seq) models; a rough illustration of the
reordering idea is sketched below.
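The sketch below is purely illustrative, an assumption about how reordering targets could be built from word alignments rather than the paper's actual pipeline: source tokens are rearranged to follow the target-language order implied by the alignments.

```python
# Illustrative only: reorder source tokens to follow the target-language order implied by
# word alignments; alignments are (source_index, target_index) pairs from an external aligner.
def reorder(src_tokens, alignments):
    tgt_pos = {s: t for s, t in alignments}
    # sort by aligned target position; unaligned tokens fall back to their source position
    order = sorted(range(len(src_tokens)), key=lambda i: (tgt_pos.get(i, i), i))
    return [src_tokens[i] for i in order]

print(reorder(["the", "red", "car"], [(0, 0), (1, 2), (2, 1)]))  # ['the', 'car', 'red']
```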
XNLI, the most prominent cross-lingual NLI corpus, is used for evaluation. The results of the
evaluation can be seen below:
RESULTS
● Compared with vanilla XLM-R without adversarial training, XLM-R with PGD works better
across a range of non-English languages, showing the effectiveness of adversarial training for
more robustness in cross-lingual settings (a rough sketch of embedding-level adversarial
training follows this list).
● XLM-R, when trained with EA or RA, outperforms the setting without augmentation for English
and some non-English languages, though it does not achieve sufficiently stronger results in
terms of the average accuracy across different languages.
● This suggests that XLM-R struggles to benefit from the augmented instances from
RA for better generalizability.
● In contrast, when trained with SA, XLM-R performs better than without SA examples
for most languages, confirming that our semantic augmentation is beneficial.
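For reference, here is a minimal sketch of PGD-style adversarial training on the encoder's input embeddings. This is an assumption about the general recipe rather than the paper's exact implementation; model stands for any Hugging Face sequence classifier that accepts inputs_embeds.

```python
# Minimal sketch of PGD adversarial training on input embeddings (illustrative only).
import torch
import torch.nn.functional as F

def pgd_adversarial_loss(model, embeds, attention_mask, labels, eps=1e-2, alpha=1e-3, steps=3):
    delta = torch.zeros_like(embeds, requires_grad=True)
    for _ in range(steps):
        logits = model(inputs_embeds=embeds + delta, attention_mask=attention_mask).logits
        loss = F.cross_entropy(logits, labels)
        grad, = torch.autograd.grad(loss, delta)
        # gradient ascent on the perturbation, projected back into the L-infinity eps-ball
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps).detach().requires_grad_(True)
    logits = model(inputs_embeds=embeds + delta, attention_mask=attention_mask).logits
    return F.cross_entropy(logits, labels)  # adversarial loss, typically added to the clean loss
```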
Multihop-KGQA
The following is the list of the contributions made towards the project:
● Rewrote the code and reproduced the results from the paper [1].
● Explored the effect of knowledge graph embedding models in the Knowledge Graph Embedding
module (TuckER and ComplEx) [2].
● Explored the effect of various transformer models in the Question Embedding module
(RoBERTa, ALBERT, SBERT) [3].
● Verified the importance of the Relation Matching (RM) module and n-hop filtering (ablation
study).
[1] https://github.com/MSurfer20/MultiHop-KGQA
[2] Trained embeddings
[3] Trained QA models Link 2
Model Description
EmbedKGQA has three modules:
1. KG Embedding Module: This module contains a KG embedding model to learn
embeddings for all entities in the input KG.
2. Question Embedding Module: We pass the question text into RoBERTa and use the
hidden states of the last layer to get a 768-dim representation of the question. This is then
passed through 4 fully connected NN layers with ReLU activation and projected onto the
complex space to get the question embedding.
3. Answer Selection Module: This module uses the outputs of modules 1 and 2 to select the
final answer by scoring the <head-entity, question> pair against all possible answer entities.
The model is learned by minimizing the binary cross-entropy loss between the sigmoid of the
scores and the target labels, where the target label is 1 for the correct answers and 0
otherwise.
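A rough sketch of the scoring and loss described above is given below. Shapes and names are assumptions; complex-valued embeddings are stored as concatenated real and imaginary halves, following the ComplEx scoring function, and this is not the project's exact code.

```python
# Illustrative sketch of ComplEx-style answer scoring with a BCE loss.
import torch
import torch.nn.functional as F

def complex_score(head, question, all_entities):
    """Re(<head, question, conj(entity)>) for every candidate answer entity."""
    h_re, h_im = head.chunk(2, dim=-1)            # [batch, d]
    q_re, q_im = question.chunk(2, dim=-1)        # [batch, d]
    e_re, e_im = all_entities.chunk(2, dim=-1)    # [num_entities, d]
    re_part = h_re * q_re - h_im * q_im
    im_part = h_re * q_im + h_im * q_re
    return re_part @ e_re.t() + im_part @ e_im.t()  # [batch, num_entities]

def answer_selection_loss(head, question, all_entities, answer_mask):
    # answer_mask is 1 for correct answer entities, 0 otherwise
    scores = complex_score(head, question, all_entities)
    return F.binary_cross_entropy_with_logits(scores, answer_mask.float())
```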
Experiments
MetaQA: MetaQA has different partitions of the dataset for 1-hop, 2-hop, and 3-hop
questions.
In the paper, an LSTM is used as the question embedding model and ComplEx as the KG embedding
on the MetaQA dataset. We trained TuckER+LSTM as a new experiment in both the KG-Half and
KG-Full settings. We observed major improvements from using TuckER embeddings on 3-hop
questions, while for 1-hop and 2-hop questions ComplEx performs slightly better than TuckER.
The results can be seen in Table 1.
Table 1: MetaQA results with ComplEx vs. TuckER KG embeddings (three values per hop setting).

Setting  Model    1-hop                    2-hop                    3-hop
KG-Full  ComplEx  0.9712  0.9995  0.9998   0.9412  0.9855  0.9929   0.5025  0.8405  0.9203
KG-Full  TuckER   0.9551  0.9981  0.9997   0.9313  0.987   0.9928   0.7381  0.936   0.9609
KG-Half  ComplEx  0.6878  0.7458  0.7655   0.5217  0.6304  0.6671   0.4354  0.6782  0.754
KG-Half  TuckER   0.6843  0.7433  0.7624   0.5046  0.6304  0.6655   0.7196  0.9116  0.9394
WebQSP: WebQSP has a relatively small number of training examples but uses a large KG
(Freebase) as background knowledge. This makes multi-hop KGQA much harder. Training
KG embedding took a lot of time. In the paper, RoBERTa is used as the question embedding
model and ComplEx is used as the KG embedding. We trained ComplEx + Sentence Transformer as a
new experiment. In both the KG-half and KG-full settings, the Sentence Transformer beats
RoBERTa. We also trained TuckER+RoBERTa, TuckER+SBERT, and TuckER+ALBERT, but did not get good
accuracy with them; we tried various combinations of hyperparameters, but the training loss
did not decrease beyond a point.
Ablation Studies
Relation Matching: To perform relation matching, we chose a scoring function S(r, q) that
ranks each relation r for a given question q. Relations with a score > 0.5 are selected into a
set Ra. Then, for each candidate entity a', we find the relations on the shortest path between
the head entity h and a' (set Ra'). The relation score is defined as |Ra ∩ Ra'|. A linear
combination of the relation score and the ComplEx score is used to find the answer entity, as
sketched below.
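A small sketch of this re-scoring step follows; the function and variable names and the combination weight gamma are illustrative assumptions, not the project's exact code.

```python
# Illustrative sketch of relation-matching re-ranking.
def rerank(candidates, complex_scores, question_relations, path_relations, gamma=0.5):
    """question_relations: set Ra of relations with S(r, q) > 0.5;
    path_relations[a]: relations on the shortest path from the head entity to candidate a."""
    best, best_score = None, float("-inf")
    for a in candidates:
        rel_score = len(question_relations & path_relations[a])   # |Ra ∩ Ra'|
        score = complex_scores[a] + gamma * rel_score             # linear combination
        if score > best_score:
            best, best_score = a, score
    return best
```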