CL Honours Report

Spring 2022
Name: Suyash Vardhan Mathur
Roll no: 2019114006
Objectives
1. To gain a proper understanding of Natural Language Inference and look at the points
where it can go wrong.
2. Understand the various datasets that have been created for the NLI task, including
those that use adversarial techniques to improve data quality. Also, understand why
older NLI datasets do not live up to the mark and why they were problematic.
3. Look at various approaches that have been used to address the shortcomings of NLI.
These include the usage of knowledge graphs, open-domain NLI, etc.
4. In order to gain a proper understanding of GNNs and knowledge graphs, I also worked
on a project on Multi-hop Question Answering using Knowledge Graphs, which
will help in continuing the work of improving NLI using KGs.

Problem Background
Natural Language Inference Task
Natural language inference (NLI) is the task of determining whether a "hypothesis" is true
(entailment), false (contradiction), or undetermined (neutral) given a "premise". While
datasets may use different label sets, the core task remains the same throughout NLI.
An example of the task is:

Premise | Label | Hypothesis
A man inspects the uniform of a figure in some East Asian country. | contradiction | The man is sleeping.
An older and younger man smiling. | neutral | Two men are smiling and laughing at the cats playing on the floor.
A soccer game with multiple males playing. | entailment | Some men are playing a sport.
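To make the task concrete, below is a minimal sketch of how a premise-hypothesis pair can be
scored with an off-the-shelf NLI classifier. The use of the HuggingFace transformers library and
the publicly available roberta-large-mnli checkpoint are assumptions made for illustration; none
of this tooling is prescribed by the papers reviewed in this report.

```python
# Minimal sketch: scoring a premise-hypothesis pair with a pretrained NLI model.
# Assumes the HuggingFace `transformers` library and the public `roberta-large-mnli`
# checkpoint; both are illustrative choices, not part of this report's work.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-large-mnli")
model = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

premise = "A soccer game with multiple males playing."
hypothesis = "Some men are playing a sport."

inputs = tokenizer(premise, hypothesis, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits          # shape: (1, 3)

# For this checkpoint the label order is typically contradiction / neutral / entailment.
labels = ["contradiction", "neutral", "entailment"]
probs = logits.softmax(dim=-1).squeeze()
print({label: round(p.item(), 3) for label, p in zip(labels, probs)})
```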

The semantic concepts of entailment and contradiction are central to all aspects of natural
language meaning, from the lexicon to the content of entire texts. Thus, natural language
inference (NLI) — characterizing and using these relations in computational systems — is
essential in tasks ranging from information retrieval to semantic parsing to commonsense
reasoning. Approaches to NLI range from earlier symbolic and statistical methods to more
recent deep learning models. Benchmark datasets used for NLI include SNLI, MultiNLI, and
SciTail, among others.

All of these datasets have certain shortcomings, which are visible in models exploiting
annotation artifacts and syntactic cues instead of the actual meaning for the task. In order
to combat such shortcomings, other datasets have been created that use adversarial
techniques to give models training instances that remove the usage of such spurious
patterns in the NLI task.

Multi-hop KGQA Task


Knowledge Graphs (KGs) are multi-relational graphs consisting of entities as nodes and
relations among them as typed edges. The goal of the Question Answering over KG (KGQA)
task is to answer natural language queries posed over the KG. Multi-hop KGQA requires
reasoning over multiple edges of the KG to arrive at the right answer, while single-hop
questions can be answered with a single reasoning step over the graph.
Thus, the formal problem statement of the project can be defined as: "Given a KG, a natural
language question q and a topic entity h, we wish to find the answer entity a in the
knowledge graph that best answers the question. The training data for this problem consists
of a KG as background knowledge along with a list of (question, entity, answer) tuples."
Here, we need to note that there is no path annotation in the training data – we are only
provided with the question and the answer. The model is expected to find the reasoning
chain on its own.
An example of the problem can be seen below, where we are given a graph whose vertices
are various real-life entities (movies, actors, directors, etc.) and whose edges represent the
relations between them (directed_in, acted_in, etc.).

Fig1: Example graph


Here, questions like Genre of GANGSTER NO. 1 can be answered in a single hop, and are
thus single-hop questions that need a single fact to determine the answer. On the other
hand, there are questions like Genre of movies written by LOUIS MELLIS, which require
multiple hops over the KG to find the answer entity – namely two hops here. Such questions
are called n-hop or multi-hop questions.
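To illustrate the difference between single-hop and multi-hop questions, here is a small,
self-contained sketch over a toy triple store. The entity and relation names are illustrative
stand-ins and are not taken from the actual MetaQA graph.

```python
# Toy illustration of single-hop vs. multi-hop lookups over a KG stored as
# (head, relation, tail) triples. Names are made up for illustration.
from collections import defaultdict

triples = [
    ("Gangster No. 1", "has_genre", "Crime"),
    ("Louis Mellis", "wrote", "Gangster No. 1"),
    ("Louis Mellis", "wrote", "44 Inch Chest"),
    ("44 Inch Chest", "has_genre", "Drama"),
]

index = defaultdict(set)
for h, r, t in triples:
    index[(h, r)].add(t)

def hop(entities, relation):
    """Follow one typed edge from every entity in the current frontier."""
    return {t for e in entities for t in index[(e, relation)]}

# Single-hop: "Genre of Gangster No. 1"
print(hop({"Gangster No. 1"}, "has_genre"))      # {'Crime'}

# Two-hop: "Genre of movies written by Louis Mellis"
movies = hop({"Louis Mellis"}, "wrote")
print(hop(movies, "has_genre"))                  # {'Crime', 'Drama'}
```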

Studies/Experiments
NLI Task
I looked at the following papers to understand the various NLI datasets and various
benchmarks on them at the time they were published:
● Stanford NLI dataset (SNLI)
● Adversarial NLI
● Dataset created using Adversarial Training for Cross-Lingual NLI
SNLI
The SNLI corpus was the first-of-its-kind corpus in terms of its large size. Earlier corpora,
despite being high-quality, hand-labeled datasets that stimulated innovative logical and
statistical models of natural language reasoning, were limited by their small size (fewer than
a thousand examples each), which restricted their utility as a testbed for learned distributed
representations. Even the SICK dataset, a jump in terms of size with only 4,500 training
sentences, introduced some spurious patterns into the data. Further, some datasets that
were automatically created or labelled contained a high amount of noise, which made them
suitable only as supplementary data, and not as the primary data for the task. The existing
corpora also suffered from indeterminacies of event and entity coreference, which lead to
high indeterminacy regarding the correct semantic label. For example, consider the sentence
pair A boat sank in the Pacific Ocean and A boat sank in the Atlantic Ocean. The pair
could be labeled as a contradiction if one assumes that the two sentences refer to the same
single event, but could also reasonably be labeled as neutral if that assumption is not made.

The SNLI dataset tackles the various problems above. At 570K pairs, it is two orders of
magnitude larger than all other resources of its type. Further, all of its sentences and labels
were written by humans in a grounded, naturalistic context. In the paper, the corpus is used
to evaluate a variety of models for natural language inference, including rule-based systems,
simple linear classifiers, and neural network-based models.
Two models are found to have comparable performance: a feature-rich classifier and a neural
network centred around an LSTM. By further evaluating the LSTM, it is shown that it can be
adapted to an existing NLI challenge task, yielding the best reported performance by a
neural network model. SNLI tries to surmount the coreference issue as follows:
● examples were grounded in specific scenarios, and the premise and hypothesis
sentences in each example were constrained to describe that scenario from the
same perspective
● the prompt gave participants the freedom to produce entirely novel sentences within
the task setting, yielding richer examples
● a subset of the resulting sentences were sent to a validation task aimed at providing
a highly reliable set of annotations
Amazon Mechanical Turk was used for data collection, where each worker was presented
with premise scene descriptions from a pre-existing corpus and asked to supply a hypothesis
for each of the three labels — entailment, neutral, and contradiction — forcing the data to be
balanced among these classes. Workers were given specific instructions, and a minimum
sentence length was not enforced. For the premises, captions from the Flickr30k corpus
were used.

In order to measure the quality of the corpus, and in order to construct maximally useful
testing and development sets, an additional round of validation for about 10% of the data
was performed. Sentence pairs were presented to workers in batches of five, and each pair
was supplied to four annotators, yielding five labels per pair including the label used by the
original author. If any one of the three labels was chosen by at least three of the five
annotators, it was chosen as the gold label. The overall rate of agreement was extremely
high, suggesting that the corpus is of sufficiently high quality. The paper also gives insights
into the levels of disagreement across the three semantic classes. This disagreement likely
reflects not just the limitations of large crowdsourcing efforts but also the uncertainty
inherent in naturalistic NLI.
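The gold-label rule described above is easy to state in code. The sketch below assumes the
five-labels-per-pair setup and three-vote threshold described in the paper; the "-" placeholder
for pairs without consensus is an assumption made for illustration.

```python
# Sketch of the SNLI-style gold-label rule: a pair keeps a gold label only if at
# least three of its five labels (author + four validators) agree; otherwise it
# is marked as having no consensus.
from collections import Counter

def gold_label(labels, min_votes=3):
    label, votes = Counter(labels).most_common(1)[0]
    return label if votes >= min_votes else "-"   # "-" marks no consensus

print(gold_label(["entailment", "entailment", "neutral",
                  "entailment", "entailment"]))   # entailment
print(gold_label(["neutral", "contradiction", "entailment",
                  "neutral", "contradiction"]))   # -
```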

The paper finally looks at the accuracies of various models on the dataset, ranging from
rule-based and statistical models to neural network-based models. Since the paper is dated,
it reports the best results for LSTMs, which previously could not reach strong performance
due to the smaller datasets available. Even models that used lexicalized features showed
better performance on the SNLI dataset as compared to the SICK dataset, which was again
due to its sheer size.

Adversarial NLI (ANLI)
This paper introduces a new large-scale NLI benchmark dataset, collected via an iterative,
adversarial human-and-model-in-the-loop process. The paper shows that models trained on
this new dataset achieve SOTA performance on various popular NLI benchmarks, while the
new test set poses a more difficult challenge.

While challenging benchmarks in other areas took years to surpass, NLU benchmarks have
struggled to keep pace with rapid model progress. This fast pace also makes us question
whether NLU models are genuinely as good as their benchmark performance suggests.
Considerable evidence suggests that SOTA models learn to exploit spurious statistical
patterns in datasets instead of learning meaning the way humans do. Given this, human
annotators should be able to construct examples that expose such shortcomings of the
models.

Thus, the paper proposes an iterative, adversarial human-and-model-in-the-loop solution for
NLU dataset collection that addresses both benchmark longevity and robustness issues.
In the first stage, human annotators devise examples that the current best models cannot
determine the correct label for. The strengthened model is then subjected to the same
procedure and weaknesses are collected over several rounds. After each round, a new
model is trained and a new test set is set aside. Thus, not only is the resultant dataset
harder than existing benchmarks, but this process also yields a "moving post" dynamic
target for NLU systems, rather than a static benchmark.

In the task setup, the starting point is a base model trained on NLI data. Rather than
employing automated adversarial methods, here the model's "adversary" is a human
annotator. Given a context (premise) and a desired target label, the human writer is asked
to provide a hypothesis that fools the model into misclassifying the label. For every
misclassified example, the writer provided the reason why they think it was misclassified.
These were also verified by three human verifiers. Once data collection is finished, a new
training set is constructed from the verified correct examples.

Qualified Mechanical Turk workers were used for annotations. Annotators were presented
with a context and a target label — either 'entailment', 'contradiction', or 'neutral' — and asked
to write a hypothesis that corresponds to the label. Model predictions (label probabilities) for
the pair are obtained and shown to the annotators. If the model prediction is incorrect, the
job is complete. If not, the worker continues to write hypotheses until the model makes an
error. Three rounds of annotation were performed.
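The collection procedure described above can be summarized schematically as follows. The
helper names (model, human_writes_hypothesis, human_verifies) and the max_tries limit are
hypothetical placeholders for the adversary model and crowdworker interfaces, not part of any
released ANLI tooling.

```python
# Schematic of one human-and-model-in-the-loop collection round.
def collect_round(contexts, target_labels, model,
                  human_writes_hypothesis, human_verifies, max_tries=10):
    verified_examples = []
    for context, target in zip(contexts, target_labels):
        for _ in range(max_tries):
            hypothesis = human_writes_hypothesis(context, target)
            predicted = model.predict(context, hypothesis)
            if predicted != target:                      # the model was fooled
                if human_verifies(context, hypothesis, target):
                    verified_examples.append((context, hypothesis, target))
                break                                    # move to the next context
    return verified_examples   # becomes training data for the next round's model
```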

Round 1 used a BERT-Large model trained on a concatenation of SNLI and MNLI, and
selected the best-performing model that could be trained as the starting point for the dataset
collection procedure. Contexts were randomly sampled short multi-sentence passages from
Wikipedia (of 250-600 characters) from the manually curated HotpotQA training set.
Contexts are either ground-truth contexts from that dataset, or Wikipedia passages retrieved
using TF-IDF based on a HotpotQA question.

Round 2 used a more powerful RoBERTa model trained on SNLI, MNLI, an NLI version of
FEVER, and the training data from the previous round. Hyperparameters were tuned to give
the best performance on the previous dataset, and different models were created using
different random seeds. During annotation, an ensemble was constructed by randomly
picking a model from the model set as the adversary at each turn, which helps avoid
annotators exploiting vulnerabilities in a single model. As before, new non-overlapping
contexts were taken from Wikipedia via HotpotQA.

Round 3 selected a more diverse set of contexts, in order to explore robustness under
domain transfer. The included contexts were: news (extracted from Common Crawl), fiction
(extracted from StoryCloze and CBT), formal spoken text (excerpted from court and
presidential debate transcripts in the Manually Annotated Sub-Corpus (MASC) of the Open
American National Corpus), and causal or procedural text, which describes sequences of
events or actions, extracted from WikiHow. Annotations were also collected using the longer
contexts present in the GLUE RTE training data, which came from the RTE5 dataset. An
even stronger RoBERTa ensemble was trained by adding the training set from the second
round (A2) to the training data.

The dataset is, by design, collected to be more difficult than previous datasets. It also
remedies a problem with SNLI, namely that its contexts (or premises) are very short,
because they were selected from the image captioning domain. The authors believe longer
contexts should naturally lead to harder examples, and so constructed ANLI contexts from
longer, multi-sentence source material.

The key observations from evaluating various SOTA models on the ANLI dataset are as
follows. The base model for each round performs very poorly on that round's test set, which
is the expected outcome:
● Round 1: the base model gets the entire test set wrong, by design.
● Rounds 2 and 3: an ensemble was used, so performance is not necessarily zero.
However, as it turns out, performance still falls well below chance, indicating that
workers did not find vulnerabilities specific to a single model, but ones generally
applicable to that model class.
● Round 3 is more difficult (yields lower performance) than round 2, and round 2 is
more difficult than round 1. This is true for all model architectures.
● Generally, our results indicate that training on more rounds improves model
performance. This is true for all model architectures.
● We obtain state-of-the-art performance on both SNLI and MNLI with the RoBERTa
model finetuned on our new data. The RoBERTa paper reports a score of 90.2 for
both MNLI-matched and -mismatched dev, while we obtain 91.0 and 90.7.
● However, the base (RoBERTa) models for rounds 2 and 3 are outperformed by both
BERT and XLNet (rows 5, 6 and 10). This shows that annotators found examples
that RoBERTa generally struggles with, which cannot be mitigated by more examples
alone. It also implies that BERT, XLNet, and RoBERTa all have different weaknesses,
possibly as a function of their training data.
● We also observe that continuously augmenting data does not degrade performance.
Even though ANLI training data is different from SNLI and MNLI, adding it to the
training set does not harm performance on those tasks.
● It was also observed that the exclusive test subset difference is small for the dataset.
This is because we included an exclusive test subset (ANLI-E) with examples from
annotators never seen in training, and find negligible differences, indicating that our
models do not over-rely on annotators' writing styles.
● We sample from the respective datasets to ensure exactly equal amounts of training
data. The adversarial data improves performance, including on SNLI and MNLI,
when we replace part of those datasets with the adversarial data. This suggests that
the adversarial data is more data-efficient than "normally collected" data. Adversarial
data collected in later rounds is of higher quality and more data-efficient.
● For SNLI and MNLI, concerns have been raised about the propensity of models to
pick up on spurious artifacts that are present just in the hypotheses. Here, we
compare full models to models trained only on the hypothesis. Hypothesis-only
models perform poorly on ANLI and obtain good performance on SNLI and MNLI.
● In rounds 2 and 3, RoBERTa is not much better than hypothesis-only. This could
mean two things: either the test data is very difficult, or the training data is not good.
The test sets are so difficult that state-of-the-art models cannot outperform a
hypothesis-only prior.
Adversarial Training augmented data for Cross-Lingual NLI
Recently, pretrained language model architectures such as BERT (Devlin et al., 2019) have
been shown capable of learning joint multilingual representations with self-supervised
objectives under a shared vocabulary, simply by combining the input from multiple
languages. Such representations greatly facilitate cross-lingual applications.

This paper proposes using multilingual representation models to leverage labeled data from
one language to train a cross-lingual NLI model that is applicable to multiple languages.
Further, it proposes a data augmentation strategy for better cross-lingual NLI by enriching
the data to reflect more diversity in a semantically faithful way. To do this, the paper
proposes two methods of training a generative model to induce synthesized examples, and
then leverages the resulting data using an adversarial training regimen for more robustness.

To boost the performance of cross-lingual models, an intuitive thought is to draw on
unlabeled data from the target language so as to enable the model to better account for the
specifics of that language, rather than just being fine-tuned on the source language. For
text, data augmentation is challenging, and straightforward techniques include simple
operations on words within the original training sequences, such as synonym replacement,
random insertion, random swapping, or random deletion. In practice, however, there are
two notable problems. One is that the synthesized data from data augmentation techniques
may be noisy and unreliable. Second, new examples may diverge from the distribution of
the original data. These problems are particularly pronounced in the case of NLI, since
modified versions of sentences may no longer have the same entailments.

This paper proposes a novel data augmentation scheme to synthesize controllable and
much less noisy data for cross-lingual NLI. This augmentation consists of two parts. One
serves to encourage language adaptation by means of reordering source language words
based on word alignments, to better cope with typological divergence between languages,
denoted as Reorder Augmentation (RA). Another seeks to enrich the set of semantic
relationships between a premise and pertinent hypotheses, denoted as Semantic
Augmentation (SA). Both are achieved by learning corresponding sequence-to-sequence
(Seq2Seq) models.

The data augmentation performed is of the following types:


● Reorder Augmentation: Reorder augmentation is based on the intuition of making a
model more robust with respect to differences in word order typology.
● Semantic Augmentation: The second augmentation strategy involves training a
controllable model that, given a sentence and a label describing the desired
relationship, seeks to emit a second sentence that stands in said relationship to the
input sentence (a sketch of such training pairs follows this list).
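As a rough illustration of the Semantic Augmentation idea, the sketch below shows one way
training pairs for such a controllable Seq2Seq model could be derived from an existing NLI
corpus, by prepending the desired label to the premise as a control code. The exact input
encoding used in the paper is not reproduced here; this format is an assumption for illustration.

```python
# Sketch: deriving (source, target) training pairs for a label-controlled
# Seq2Seq generator from NLI examples. The "<label>" prefix is an assumed,
# illustrative control-code format.
def sa_training_pairs(nli_examples):
    pairs = []
    for premise, hypothesis, label in nli_examples:
        source = f"<{label}> {premise}"   # control code steers the generator
        pairs.append((source, hypothesis))
    return pairs

examples = [("A man is playing a guitar.", "A man is making music.", "entailment")]
print(sa_training_pairs(examples))
# [('<entailment> A man is playing a guitar.', 'A man is making music.')]
```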
Adversarial Training: As a special training regimen, the paper adopts adversarial training,
which seeks to minimize the maximal loss incurred by label-preserving adversarial
perturbations, thereby promising to make the model more robust. Nonetheless, the gains
observed from it in practice have been somewhat limited in both monolingual and
cross-lingual settings. We conjecture that this is because it has previously merely been
invoked as an additional form of monolingual regularization.
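For concreteness, here is a minimal PyTorch sketch of embedding-space projected gradient
descent (PGD) adversarial training, the kind of regimen described above and referenced in the
results below. The step size, perturbation bound, number of steps, and the use of a
HuggingFace-style model accepting inputs_embeds are illustrative assumptions, not the exact
recipe from the paper.

```python
# Minimal sketch of embedding-space PGD adversarial training (assumptions noted above).
import torch

def pgd_adversarial_loss(model, embeds, labels, loss_fn,
                         eps=1e-2, alpha=1e-3, steps=3):
    """Return the loss at an approximately worst-case perturbation of `embeds`."""
    delta = torch.zeros_like(embeds, requires_grad=True)
    for _ in range(steps):
        loss = loss_fn(model(inputs_embeds=embeds + delta).logits, labels)
        grad, = torch.autograd.grad(loss, delta)
        # Gradient ascent on the perturbation, then project onto the L-inf ball.
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps).detach()
        delta.requires_grad_(True)
    # Final forward pass on the strongest perturbation found.
    return loss_fn(model(inputs_embeds=embeds + delta).logits, labels)
```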

XNLI, the most prominent cross-lingual NLI corpus, is used for evaluation. The key results of
the evaluation are summarized below:

RESULTS
● Compared with vanilla XLM-R without adversarial training, XLM-R with PGD works
better across a range of non-English languages, which shows the effectiveness of
adversarial training for more robustness in cross-lingual settings.
● We observe that XLM-R, when trained with EA or RA, outperforms the setting without
augmentation for English and some non-English languages, though it does not
achieve sufficiently stronger results in terms of the average accuracy across different
languages.
● This suggests that XLM-R struggles to benefit from the augmented instances from
RA for better generalizability.
● In contrast, when trained with SA, XLM-R performs better than without SA examples
for most languages, confirming that our semantic augmentation is beneficial.

Multihop-KGQA
The following is the list of the contributions made towards the project:
● Rewrote the code and reproduced the results from the paper [1].
● Explored the effect of knowledge graph embedding models in the Knowledge Graph
Embedding module (TuckER and ComplEx) [2].
● Explored the effect of various transformer models in the Question Embedding module
(RoBERTa, ALBERT, SBERT) [3].
● Verified the importance of the Relation Matching (RM) module and n-hop filtering
(ablation study).

[1] https://github.com/MSurfer20/MultiHop-KGQA
[2] Trained embeddings
[3] Trained QA models Link 2
Model Description
EmbedKGQA has three modules:
1. KG Embedding Module: This module contains a KG embedding model to learn
embeddings for all entities in the input KG.
2. Question Embedding Module: We pass the question text into RoBERTa and use the
hidden states of the last layer to get a 768-dim representation of the question. This is then
passed through 4 fully connected NN layers with ReLU activation and projected onto the
complex space to get the question embedding.

3. Answer Selection Module: This module uses the outputs of modules 1 and 2 to select the
final answer by scoring the <head entity, question> pair against all possible answers.

The model is learned by minimizing the binary cross-entropy loss between the sigmoid of the
scores and the target labels, where the target label is 1 for the correct answers and 0
otherwise.
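A condensed PyTorch sketch of the three modules is given below. The 768-dim question
representation, the four fully connected ReLU layers projected onto complex space, the
ComplEx-style scoring against all entities, and the BCE objective follow the description above;
the entity embedding dimension, the use of roberta-base, and other details are simplifying
assumptions rather than the exact EmbedKGQA implementation.

```python
# Simplified sketch of the EmbedKGQA modules described above (assumptions noted).
import torch
import torch.nn as nn
from transformers import AutoModel

class EmbedKGQASketch(nn.Module):
    def __init__(self, num_entities, ent_dim=200):
        super().__init__()
        # Module 1: KG embeddings (real and imaginary parts per entity), assumed
        # to be pretrained with a ComplEx-style model in practice.
        self.entity_re = nn.Embedding(num_entities, ent_dim)
        self.entity_im = nn.Embedding(num_entities, ent_dim)
        # Module 2: RoBERTa encoder followed by 4 fully connected layers with
        # ReLU activations, projecting the 768-dim question vector into complex space.
        self.encoder = AutoModel.from_pretrained("roberta-base")
        layers, dim = [], 768
        for _ in range(3):
            layers += [nn.Linear(dim, dim), nn.ReLU()]
        layers.append(nn.Linear(dim, 2 * ent_dim))
        self.projection = nn.Sequential(*layers)

    def forward(self, input_ids, attention_mask, head_idx):
        # Question embedding from the last-layer hidden state of the first token.
        hidden = self.encoder(input_ids, attention_mask=attention_mask)
        q = self.projection(hidden.last_hidden_state[:, 0])          # (B, 2d)
        q_re, q_im = q.chunk(2, dim=-1)
        h_re, h_im = self.entity_re(head_idx), self.entity_im(head_idx)
        t_re, t_im = self.entity_re.weight, self.entity_im.weight    # all entities
        # Module 3: ComplEx score of the <head, question> pair against every entity.
        re_part = h_re * q_re - h_im * q_im
        im_part = h_re * q_im + h_im * q_re
        return re_part @ t_re.T + im_part @ t_im.T                   # (B, num_entities)

# Training objective: binary cross-entropy between sigmoid(scores) and a
# multi-hot target vector that is 1 for the correct answers and 0 otherwise.
loss_fn = nn.BCEWithLogitsLoss()
```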
Experiments
MetaQA: MetaQA has different partitions of the dataset for 1-hop, 2-hop, and 3-hop
questions.
In the paper, an LSTM is used as the question embedding model and ComplEx is used as
the KG embedding on the MetaQA dataset. We trained TuckER+LSTM as a new experiment
in both the KG-half and KG-full settings.
We observed major improvements from using TuckER embeddings for 3-hop questions. For
1-hop and 2-hop questions, ComplEx performs slightly better than TuckER. The results can
be seen in Table 1.

KG-Full | 1-Hop (Hits@1 / @5 / @10) | 2-Hop (Hits@1 / @5 / @10) | 3-Hop (Hits@1 / @5 / @10)
ComplEx | 0.9712 / 0.9995 / 0.9998 | 0.9412 / 0.9855 / 0.9929 | 0.5025 / 0.8405 / 0.9203
TuckER | 0.9551 / 0.9981 / 0.9997 | 0.9313 / 0.987 / 0.9928 | 0.7381 / 0.936 / 0.9609

KG-half | 1-Hop (Hits@1 / @5 / @10) | 2-Hop (Hits@1 / @5 / @10) | 3-Hop (Hits@1 / @5 / @10)
ComplEx | 0.6878 / 0.7458 / 0.7655 | 0.5217 / 0.6304 / 0.6671 | 0.4354 / 0.6782 / 0.754
TuckER | 0.6843 / 0.7433 / 0.7624 | 0.5046 / 0.6304 / 0.6655 | 0.7196 / 0.9116 / 0.9394

Table 1: Results for MetaQA

WebQSP: WebQSP has a relatively small number of training examples but uses a large KG
(Freebase) as background knowledge, which makes multi-hop KGQA much harder. Training
the KG embeddings took a lot of time. In the paper, RoBERTa is used as the question
embedding model and ComplEx is used as the KG embedding. We trained ComplEx +
Sentence Transformer as a new experiment. In both the KG-half and KG-full settings,
Sentence Transformer beats RoBERTa. We also trained TuckER+RoBERTa, TuckER+SBERT,
and TuckER+ALBERT, but did not get good accuracy with them. We tried various
combinations of hyperparameters, but the training loss did not decrease beyond a point.

KG embed | Half/Full | Ques embed | Hits@1 | Hits@5 | Hits@10
ComplEx | full | SBERT | 0.5594 | 0.6957 | 0.7333
ComplEx | half | SBERT | 0.4465 | 0.5386 | 0.5613
ComplEx | full | RoBERTa | 0.5496 | 0.6762 | 0.7197
ComplEx | half | RoBERTa | 0.4127 | 0.5114 | 0.5419

Table 2: Experiments on ComplEx and different question embeddings (WebQSP)

Ablation Studies
Relation Matching: To perform relation matching, we chose a scoring function S(r, q) that
ranks each relation r for a given question q.

Relations with a score > 0.5 are chosen (set Ra). Then, for each candidate entity a', we find
the relations on the shortest path between the head entity h and a' (set Ra'). The relation
score is defined as |Ra ∩ Ra'|. A linear combination of the relation score and the ComplEx
score is used to find the answer entity.

Relation matching has a significant impact on the performance.
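A schematic of this re-ranking step is shown below. The relation scorer S(r, q), the
shortest-path relation lookup, and the mixing weight gamma are represented as abstract
callables and parameters; their names are assumptions made for illustration.

```python
# Sketch of relation-matching re-ranking: combine the ComplEx score with the
# overlap between question-relevant relations (R_a) and path relations (R_a').
def rerank(candidates, question, score_relation, relations_on_path,
           complex_scores, gamma=0.5):
    # R_a: relations the question-relation scorer S(r, q) ranks above 0.5.
    r_a = {r for r, s in score_relation(question) if s > 0.5}
    best, best_score = None, float("-inf")
    for a in candidates:
        r_a_prime = relations_on_path(a)          # relations on shortest path h -> a
        relation_score = len(r_a & r_a_prime)     # |R_a ∩ R_a'|
        total = complex_scores[a] + gamma * relation_score
        if total > best_score:
            best, best_score = a, total
    return best
```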


N-hop filtering: The WebQSP KG has an order of magnitude more entities than MetaQA
(1.8M versus 134k), so the number of possible answers is large. Reducing the set of
candidate answers to an n-hop neighbourhood of the head entity therefore showed improved
performance.
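The filtering step can be illustrated with a simple breadth-first search that keeps only entities
reachable within n edges of the head entity; the adjacency-list representation used here is an
assumption for illustration, not the actual Freebase storage format.

```python
# Sketch of n-hop candidate filtering via BFS over an adjacency-list view of the KG.
from collections import deque

def n_hop_neighbourhood(adjacency, head, n):
    frontier, seen = deque([(head, 0)]), {head}
    while frontier:
        node, depth = frontier.popleft()
        if depth == n:
            continue
        for neighbour in adjacency.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                frontier.append((neighbour, depth + 1))
    return seen - {head}   # candidate answers within n hops of the head entity
```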
Setting | Without filtering | 1-hop filtering | 2-hop filtering | 3-hop filtering
With Relation Matching | 66.6 | 78 | 72.5 | CUDA out of memory error
Without Relation Matching | 48.1 | 63.2 | 58.7 |

Table 3: Ablation studies


Conclusion
In addition to the work above, I also read various probing papers in NLI, in particular Is My Model
Using the Right Evidence? Systematic Probes for Examining Evidence-Based Tabular
Reasoning. Based on the introductory papers that I read, the probing papers, and the GNN
exploration done during the semester, I am thinking of going with the problem statement of using
GNNs and KGs to improve NLI.
