
SemEval-2023 Task 4: Fine-tuning vs Prompting,

Can Language Models Understand Human Values?

Pingwei SUN
BDT program, HKUST
psunah@connect.ust.hk

Abstract

Accurately handling the underlying support values in sentences is crucial for understanding the speaker's tendencies, yet it poses a challenging task in natural language understanding (NLU). In this article, we explore the potential of fine-tuning and prompt tuning on this downstream task, using the Human Value Detection 2023 dataset. Additionally, we attempt to validate whether models can effectively solve the problem based on the knowledge acquired during the pre-training stage. We are also interested in the capabilities of large language models (LLMs) aligned with RLHF on this task, and some preliminary attempts are presented.

1 Introduction

The persuasiveness of arguments is heavily influenced by individuals' values, and variations in the priority assigned to these values can lead to disputes, especially between different cultures. To address this, computational linguistics leverages human values to categorize and evaluate arguments. Schwartz's value categorization (Schwartz, 1994) serves as a common framework. However, existing datasets have limitations, such as small sizes and cultural biases. To overcome these issues, the competition organizers propose an extension, Touché23-ValueEval (Mirzakhmedova et al., 2023), containing 9,324 diverse arguments from various sources and cultures. This dataset aims to facilitate the development of Pre-trained Language Models (PLMs) for the automatic identification of human values in persuasive communication, supporting a broad array of applications. The task attracted many researchers, and various methods were proposed to solve the problem.

However, the leaderboard suggests that this task still requires more in-depth research: the top performer achieved an average F1 score of only 0.56. Most participating teams treated it as a classification task and employed techniques such as model ensembling and dataset expansion to improve scores. Taking into account computational resources and the rise of LLMs, our project focuses on the following questions based on the dataset:

• Q1: Does prompt tuning work compared with fine-tuning on this complex downstream task?

• Q2: Can PLMs handle human values using knowledge from the pre-training stage?

• Q3: How do LLMs perform on the task after being aligned with human preferences?

2 Related Work

Human Value Detection 2023. Among the submissions, most teams applied transformer-based models as their backbone and treated the problem as a classification task.

Team Adam Smith (Schroter et al., 2023) employs an ensemble strategy including DeBERTa and RoBERTa. They trained these models for loss minimization or F1-score maximization on three folds; each RoBERTa model was pretrained on the IBM-ArgQ-Rank30KArgs dataset. Ensembling involved averaging predictions with an optimized decision threshold. They also experimented with a stacked meta-classifier based on logistic regression.

Team John Arthur (Balikas, 2023) fine-tuned a DeBERTa model on the task's data, using a concatenated representation of stance, premise, and conclusion. They found that separate token symbols for the stance improved classification. The model was trained to minimize binary cross-entropy loss, and the team observed performance benefits with more training data.

Team PAI (Ma, 2023) combined models trained on various input datasets. They applied weighted voting based on F1 score for ensembling and explored different loss functions, ultimately favoring a class-balanced loss. Different classification thresholds were tested, but no performance improvement was achieved.
Feature            Value Type   Example
Argument ID        Str          "A01002"
Conclusion         Str          "We should ban fast food"
Stance             Str          ["supporting", "against"]
Premise            Str          "Fast food should be banned because it is really bad ..."
Value Labels       Int          [0, 1] per value category ("Achievement", "Stimulation", etc.)
Value Description  Str          "Self-direction: thought. It is good to have own ideas and interests. Contained values and associated arguments (examples): ..."

Table 1: The dataset used in the experiments, with the corresponding types and examples. Stance is mapped to single words to simplify the prompting template. Square brackets denote a single selection from the listed options. The Value Descriptions are defined in Schwartz, 1994.

Fine-tuning. Since the release of BERT (Devlin et al., 2019), the approach of large-scale pre-training followed by fine-tuning on downstream tasks has been widely explored and applied to various NLP subtasks.

Through the integration of a range of training techniques, this method has achieved impressive results. The bi-encoder structure employed during pre-training inherently endows the model with the ability to effectively extract semantic information from sequences. However, it performs poorly in few-shot and zero-shot scenarios.

Prompting. Recently, researchers have been attempting to transform NLP tasks into a seq-to-seq format using prompts. T5 (Raffel et al., 2020) marked the first successful endeavor in this direction. Such an approach leverages the knowledge acquired by language models during the pre-training phase rather than treating them as feature extractors. Consequently, it exhibits excellent performance on few-shot and zero-shot tasks, with the added benefit of reduced fine-tuning costs.

GPT, with its decoder structure, demonstrates enhanced capabilities in language modeling. Team Hitachi (Tsunokake et al., 2023) used BART, T5, and GPT-3, feeding the data to them in question-answering format.

3 Methods

To answer the questions posed in § 1, we propose the following techniques, to the best of our knowledge, to measure the potential of models on the complex task of detecting human values. Figure 1 illustrates our workflow in the project.

3.1 Data Utilization

The data file contains various annotations and supplementary datasets from platforms like Zhihu and The New York Times. Experiments are conducted on the main dataset, and due to the unavailability of test labels, all results in this paper reflect the models' performance on the validation set.

Before commencing the experiments, the data files are reorganized and certain features are pre-processed, as shown in Table 1.

3.2 Fine-tuning as a Classification Task

Classifier. To solve the multi-label classification task, there are two methods to choose from. On the one hand, it is sensible to use a single classifier, apply the sigmoid function to each x_i in the output {x_1, x_2, ..., x_C}, and then calculate the binary cross-entropy loss as follows:

L = − Σ_{i=1}^{C} [ y_i · log σ(x_i) + (1 − y_i) · log(1 − σ(x_i)) ]    (1)

On the other hand, it also works to set up as many classifiers as there are labels, with each classifier handling one category.

Hidden States. Serving as an encoding module, BERT can provide various levels of features of the sequence. It is commonly agreed that the deeper layers output high-level semantic information (Sun et al., 2019) and that the first token [CLS] can be used for sequence classification.

Avoid Forgetting. Since the model has learned solid knowledge during pre-training, we need to find an appropriate way to arrange the learning rates for fine-tuning. The lower layers handle more general information, so they should be updated with relatively lower learning rates. We use a decay factor to achieve layer-wise learning rates.
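As a concrete illustration of § 3.2, the following PyTorch sketch wires a single classification head with sigmoid/BCE (Eq. 1) on top of the [CLS] state and builds layer-wise learning rates from a decay factor. The model name and the values 2e-5 and 0.97 follow Table 2; everything else (class and variable names, the 20-label count) is illustrative, not the authors' actual code:

```python
import torch
from torch import nn
from transformers import AutoModel

NUM_LABELS = 20  # value categories in Touché23-ValueEval

class ValueClassifier(nn.Module):
    """RoBERTa encoder + single multi-label head (the SH variant in Table 3)."""
    def __init__(self, name: str = "roberta-large"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(name)
        self.head = nn.Linear(self.encoder.config.hidden_size, NUM_LABELS)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        return self.head(hidden[:, 0])  # [CLS] token as the sequence representation

model = ValueClassifier()
criterion = nn.BCEWithLogitsLoss()  # sigmoid + binary cross-entropy, Eq. (1)

# Layer-wise learning rates: the top layer gets the base LR, lower layers are
# scaled down by decay**distance so general pre-trained knowledge changes less.
# (The paper additionally trains only the last 8 layers + classifier, cf. Table 2.)
base_lr, decay = 2e-5, 0.97
groups = [{"params": model.head.parameters(), "lr": base_lr}]
layers = model.encoder.encoder.layer
for i, layer in enumerate(layers):
    scale = decay ** (len(layers) - 1 - i)
    groups.append({"params": layer.parameters(), "lr": base_lr * scale})
optimizer = torch.optim.AdamW(groups)
```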
[Figure 1 appears here: a workflow diagram mapping PLM types (encoders RoBERTa and DeBERTa; seq2seq T5; decoders GPT2 and LLMs) through task modes (classification with added classifiers, MLM predicting a masked "Yes"/"No" per label, GLM generating a judgement per label), tuning methods (fine-tuning with the PLM unfixed; prompt tuning and P-tuning with the PLM fixed), advanced techniques (layer-wise LR, contrastive learning), and templates (CLS, BC, OA, CoT) to the target questions Q1-Q3.]
Figure 1: Illustration of the workflow of our experiments. Tracks marked by different colors stand for combinations of models, tasks, and methods, the details of which are given in § 4. Dashed lines mark methods that are theoretically available but not considered in this project because of poor performance or high computational demands. The colored dots are experiment results and indicate the analysis for the questions posed in § 1.

3.3 Contrastive Learning

This project is a complex Natural Language Understanding task, so obtaining a good embedding representation of the sequence is crucial for the subsequent training of the classifier.

Inspired by SimCSE (Gao et al., 2021), a Contrastive Learning (CL) loss is applied to optimize the embedding states of sequences. Since the target is multi-label classification, positive and negative samples cannot be defined directly inside a mini-batch. Consequently, we compute the CL loss according to the following formulas:

w_ij = (y_i^T · y_j) / (Σ_{k=1}^{B} y_i^T · y_k + ε)

ℓ_i^CL = − log ( Σ_{j=1}^{B} w_ij · e^{sim(x_i, x_j)/τ′} / Σ_{j=1}^{B} e^{sim(x_i, x_j)/τ′} )    (2)

where y is the label vector, x is the embedding representation, and B is the batch size. τ′ is the temperature hyper-parameter of CL, a scaling factor on the similarity scores.

3.4 Prompt Tuning Techniques

When doing prompt tuning, the model should be transferred to other modes to fit the template's requirements, as shown in Figure 2. The details of template construction are introduced below.

Binary choice template. A natural idea is to transform the multi-label classification task into several binary choice questions. These can be solved either with a Masked Language Model (MLM) by filling in the blank or with a Generative Language Model (GLM) by generation.

Open answering template with knowledgeable verbalizer. When templates become more complex and open, they often stimulate the model's internal knowledge better. At this point, we need a verbalizer to map the output results to target labels, and the choice of verbalizer significantly impacts the outcomes (Hu et al., 2022).

Notably, the dataset includes descriptions and examples of human values. However, these were artificially defined by linguists two decades ago. To enhance the dataset's generalization, we introduce an LLM to rewrite the descriptions and provide synonyms, which serve as a knowledgeable verbalizer to map the output results to human value labels.

Chain-of-Thought template. Language models, particularly LLMs, have demonstrated remarkable reasoning capabilities, as evidenced by studies such as (Kojima et al., 2022). Despite the challenge of directly comprehending whether specific human values are implied in triples, models can acquire relevant knowledge through a systematic prompting process. This step-by-step approach allows LLMs to navigate and extract pertinent information, ultimately enabling them to effectively address the intricacies of the classification problem.

In light of this, we have incorporated templates in the form of CoT. These templates serve as structured prompts that guide the model in interpreting and understanding the nuances of human values with extra knowledge.
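To make the CL loss of § 3.3 concrete, here is a minimal PyTorch sketch of Eq. (2). We assume cosine similarity for sim(·,·); all names are illustrative rather than the authors' actual implementation:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(x, y, tau=0.1, eps=1e-8):
    """Label-weighted contrastive loss, Eq. (2).

    x: (B, d) sequence embeddings x_i; y: (B, C) multi-hot value labels y_i.
    """
    # w_ij = (y_i . y_j) / (sum_k y_i . y_k + eps): label overlap defines soft positives
    overlap = y.float() @ y.float().T                     # (B, B)
    w = overlap / (overlap.sum(dim=1, keepdim=True) + eps)

    # sim(x_i, x_j) / tau', here cosine similarity scaled by the temperature
    z = F.normalize(x, dim=1)
    sim = (z @ z.T) / tau                                 # (B, B)

    # l_i = -log( sum_j w_ij * e^{sim_ij} / sum_j e^{sim_ij} ), averaged over the batch
    softmax = torch.softmax(sim, dim=1)
    return -torch.log((w * softmax).sum(dim=1) + eps).mean()
```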
Prompt Template Board

CLS Template
Input: Human Values are indicated in the premise: [Premise] and the conclusion: [Conclusion] [Stance] it.
Output: [Logits for classifiers]
(Solves all labels in one run; available in encoder-with-classifier mode.)

MBC Template
Input: Given the premise: [Premise] and the conclusion: [Conclusion] [Stance] it. Is it based on the [Value Description]? [MASK]
Output: "Yes" or "No"
(Available in MLM mode.)

BCA Template
Input: Question: Given the premise: [Premise] and the conclusion: [Conclusion] [Stance] it. Is it based on the [Value Description]? The answer is
Output: "Yes" or "No"
(Available in GLM mode.)

OA Template
Input: I claim that [Conclusion] because I [Stance] the [Premise]. In the aspect of [Value Label], use some words to describe what kind of person I am.
Output: True if any verbalizer word appears in the answer, else False.
(Available in GLM mode.)

CoT Template
Input 1: What kind of human values are indicated in the premise: [Premise] and the conclusion: [Conclusion] [Stance] it?
Output 1: #Reply from LLM#
Input 2: Do you think [#Reply from LLM#] is aligned with [human value description from ChatGPT], which can also be called [synonyms from ChatGPT]?
Output 2: "Yes" or "No"

Figure 2: Prompt templates for the different task processing modes, including classification (CLS), masked binary choice (MBC), binary choice answering (BCA), open answering (OA), and Chain-of-Thought (CoT). The bracketed content represents the features in the dataset, shown in Table 1.
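As an illustration of how these templates can be instantiated (the paper's actual implementation uses OpenPrompt; this plain-Python rendering of the MBC template is a hypothetical sketch, and `render_mbc` is an invented helper name):

```python
def render_mbc(example: dict, value_description: str) -> str:
    """Fill the MBC template of Figure 2 with one argument and one value description."""
    return (
        f"Given the premise: {example['Premise']} "
        f"and the conclusion: {example['Conclusion']} {example['Stance']} it. "
        f"Is it based on the {value_description}? <mask>"
    )

prompt = render_mbc(
    {"Premise": "Fast food should be banned because it is really bad ...",
     "Conclusion": "We should ban fast food",
     "Stance": "supporting"},
    "Self-direction: thought. It is good to have own ideas and interests.",
)
# The MLM then predicts "Yes"/"No" at the <mask> position, once per value category.
```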

4 Experiments

The following experiments are conducted to answer the questions raised in § 1. Since we aim to explore fine-tuning methods under a limited computational budget, tricks such as model ensembling to improve scores are not applied.

4.1 Exp-I: Fine-tuning

Based on the previous teams' papers, we choose RoBERTa and DeBERTa followed by a classifier for the fine-tuning experiments. The Premise, Conclusion, and Stance features are directly fed to the model. The hyperparameter settings for training are listed in Table 2.

The performance of different models and classifiers is presented in Table 3. RoBERTa with multiple classifiers is selected after the preliminary fine-tuning experiment.

Then Contrastive Learning is introduced for further improvement of the performance. In this part, we treat it either as a pre-training task or as an extra loss term at the fine-tuning stage. The second strategy performs slightly better, and both improve the macro F1 score.

Parameters           Value
Trainer
  Batch size         8
  Epochs             3.0
  Lr scheduler       cosine
  Warmup ratio       0.1
  Lr decay           0.97
Optimizer
  Optimizer          AdamW
  Learning rate      2e-5
  Trainable params   last 8 layers + classifier(s)

Table 2: We tried various combinations of the hyperparameters and compared validation scores after training; the above is the optimal choice among them.
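Translated into Hugging Face Trainer terms, the Table 2 settings would look roughly like this. This is a sketch: the argument names are standard TrainingArguments fields, the output directory is invented, and the layer-wise decay still has to be applied through custom optimizer groups (see § 3.2):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="hvd-roberta",        # hypothetical output path
    per_device_train_batch_size=8,   # Batch size
    num_train_epochs=3.0,            # Epochs
    lr_scheduler_type="cosine",      # Lr scheduler
    warmup_ratio=0.1,                # Warmup ratio
    learning_rate=2e-5,              # Base learning rate for AdamW
)
# The 0.97 layer-wise decay and the "last 8 layers + classifier(s)" trainable set
# are handled separately by building custom optimizer parameter groups.
```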
Model                        Macro F1 score
RoBERTa-Large
  w/ SH                      .481
  w/ MH                      .507
DeBERTa-Large
  w/ SH                      .477
  w/ MH                      .493
RoBERTa-Large w/ MH
  CL pre-train               .518
  CL fine-tune               .522

Table 3: Performance of fine-tuned encoding models with classifiers on the validation set. SH means a single classification head, while MH means multiple heads.

5 Analysis

5.1 Effectiveness of Fine-tuning and Input Description

We achieved a relatively competitive result compared with last year's leaderboard. At the same time, we validated the contribution of contrastive learning to obtaining more discriminative embedding representations.

Furthermore, we observed that, in handling NLU tasks with complex inputs, providing reasonable descriptions of the input components enhances model performance compared to directly concatenating and feeding in the individual components.
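The macro F1 scores in Table 3 can be computed from validation predictions in a few lines. A minimal sketch, assuming thresholded sigmoid outputs over the 20 labels; the function name and threshold are illustrative:

```python
import numpy as np
from sklearn.metrics import f1_score

def macro_f1(logits: np.ndarray, labels: np.ndarray, threshold: float = 0.5) -> float:
    """Macro F1 over the value categories from raw logits (shape: (N, 20))."""
    probs = 1.0 / (1.0 + np.exp(-logits))      # sigmoid per label
    preds = (probs >= threshold).astype(int)   # multi-hot predictions
    return f1_score(labels, preds, average="macro", zero_division=0)
```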

4.2 Exp-II: Prompt Tuning

The previously mentioned prompt templates are tested sequentially in this section, in conjunction with models of different parameter sizes and structures. The training strategies are inspired by Liu et al., 2021, and the implementations are based on OpenPrompt (Ding et al., 2022).

To be specific, the CLS template is treated as a hard prompt whose parameters are frozen; it is fed to encoding models with trainable multiple classifiers. Its trainable parameters are consistent with the previous fine-tuning settings.

Templates in the form of mask filling and question answering are also proposed. In these settings, we keep the PLMs fixed and allow the prompts' parameters to be updated. The extra knowledge in the OA template is processed in advance; it is manually selected from ChatGPT feedback.

4.3 Exp-III: Works on LLMs

The capabilities of LLMs on this complex NLU task are also what we plan to explore. Experiments are conducted with a CoT template designed to stimulate the logical inference ability and knowledge inside LLMs.

Considering the computational budget, no gradient-updating process is included for this template. Instead, scores are obtained by running 5% of the validation dataset through local inference (Llama-7B) or API calls (GPT series) and computing F1 scores on the predicted results, as shown in Table 4.

5.2 Performance of Prompt Tuning

Finding versions of different models with the same parameter size is a challenge; to ensure the validity of the results, we attempted to compensate by matching the trainable parameters during training.

From the table, it is evident that on models smaller than 500M parameters, fine-tuning outperforms prompt tuning, which is reasonable given its larger number of trainable parameters. However, when the model size approaches 1B (two to three times that of the former), prompt tuning shows performance comparable to fine-tuning. Yet, during training, we observed a larger variance in the performance of prompt tuning as the initialization of the templates changed.

Does this mean fine-tuning is always the better choice in this range of parameter sizes (<1B)? Our two additional experiments explain this in more detail. The results on NLI and few-shot settings are listed in Table 5.

We built a simple NLI task using the Premise, Conclusion, and Stance features in the dataset and tuned some models using the same method as before. As can be seen, there is no significant gap between fine-tuning and prompt tuning on this simpler task, even with fewer trainable parameters. This indicates that as task complexity increases, prompt tuning requires a larger base model, meaning better language modeling abilities and more knowledge, to achieve performance comparable to the encoder + classifier paradigm.

When faced with few-shot datasets, prompt tuning shows obvious advantages. In contrast, fine-tuning seems to fail at the task.
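To make the Exp-III pipeline of § 4.3 concrete, here is a minimal sketch of the two-step CoT querying from Figure 2. `ask_llm` is a hypothetical stand-in for Llama-7B local inference or a GPT-series API call, and all other names are illustrative:

```python
def ask_llm(prompt: str) -> str:
    """Placeholder for Llama-7B local inference or a GPT-series API call."""
    raise NotImplementedError

def cot_detect(premise: str, conclusion: str, stance: str, values: dict) -> dict:
    """Two-step CoT: free-form value analysis, then one Yes/No check per label.

    values maps each category name to (description, synonyms), both from ChatGPT.
    """
    # Step 1: elicit an open-ended analysis of the argument's values.
    reply = ask_llm(
        f"What kind of human values are indicated in the premise: {premise} "
        f"and the conclusion: {conclusion} {stance} it?"
    )
    # Step 2: map the free-form reply onto each label with a binary question.
    preds = {}
    for name, (description, synonyms) in values.items():
        answer = ask_llm(
            f"Do you think [{reply}] is aligned with {description}, "
            f"which can also be called {synonyms}?"
        )
        preds[name] = answer.strip().lower().startswith("yes")
    return preds
```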
Method / F1 score   All   Per-category F1 (20 value categories)
Classification
RoBERTa w/ MH       .51   -
RoBERTa-CLS         .54   -
DeBERTa-CLS         .52   -
MLM
RoBERTa-MBC         .47   .52 .63 .22 .31 .58 .33 .56 .30 .69 .62 .58 .55 .27 .16 .50 .32 .66 .75 .39 .43
DeBERTa-MBC         -     -
GLM
T5†-BCA             .45   .49 .59 .23 .30 .58 .37 .50 .25 .70 .61 .44 .46 .25 .21 .48 .28 .68 .73 .33 .50
T5†-OA              .45   .49 .58 .25 .31 .58 .32 .49 .27 .73 .60 .49 .45 .25 .19 .47 .28 .66 .74 .35 .52
T5-BCA              .52   .57 .70 .18 .44 .65 .41 .54 .33 .78 .68 .67 .59 .30 .13 .50 .35 .76 .85 .49 .56
T5-OA               .50   .51 .69 .13 .27 .61 .44 .49 .30 .74 .62 .58 .55 .36 .17 .52 .41 .73 .80 .45 .55
GPT2†-BCA           .42   .41 .57 .15 .21 .55 .26 .43 .23 .66 .53 .47 .49 .31 .20 .42 .28 .64 .72 .36 .53
GPT2†-OA            .44   .45 .62 .11 .42 .47 .31 .41 .23 .71 .56 .44 .51 .33 .20 .45 .30 .67 .77 .39 .52
GPT2-BCA            .46   .44 .63 .19 .23 .58 .38 .47 .25 .69 .56 .52 .54 .34 .23 .43 .28 .68 .79 .32 .56
GPT2-OA             .48   .47 .65 .25 .29 .56 .37 .51 .31 .72 .60 .51 .55 .27 .19 .50 .33 .67 .81 .52 .54
LLM-CoT
Llama-7B            .49   .55 .66 .20 .39 .64 .35 .48 .26 .77 .63 .54 .53 .35 .20 .53 .38 .71 .80 .34 .42
ChatGPT             .53   .52 .71 .26 .33 .61 .48 .52 .38 .75 .65 .60 .56 .47 .16 .55 .42 .72 .83 .42 .57

The 20 per-category columns cover the value categories: Self-direction: thought and action; Stimulation; Hedonism; Achievement; Power: dominance and resources; Face; Security: personal and societal; Tradition; Conformity: rules and interpersonal; Humility; Benevolence: caring and dependability; Universalism: concern, nature, tolerance, and objectivity.

Table 4: F1 scores of the various prompt tuning methods. T5 and GPT2 marked with † are the base and medium versions respectively; the others are the large versions, except for the LLMs. There is an issue with DeBERTaForMaskedLM in Hugging Face, so DeBERTa-MBC is of no reference value.

5.3 Prompting LLMs to Detect Human Values

In the case of LLMs, even without any fine-tuning, a well-designed questioning style (prompts) can enable them to perform well on the task. However, this improvement is more uncertain than with the previously mentioned methods (as can be seen from the variation in scores across categories) and relies on an experienced questioner. Failures in a few categories may be attributed to inconsistencies between the external knowledge from ChatGPT and the task definitions.

Models / Task   NLI    5-shot HVD
RoBERTa         .872   -
T5              .866   .413
GPT2            .849   .408

Table 5: Results of NLI and few-shot Human Value Detection. The model sizes are close and all below 500M. NLI is evaluated by accuracy, while HVD still uses the F1 score.

6 Conclusion

6.1 Fine-tuning vs Prompt Tuning

This problem involves multiple variables, as mentioned in the analysis section, and each approach has its own strengths and weaknesses. Briefly, prompt tuning can match and sometimes surpass the performance of direct fine-tuning when the model size is large enough, and it exhibits better generalization and enhanced transferability. However, direct fine-tuning, as the more straightforward approach, is worth considering when training samples are abundant, as it delivers stable performance with a smaller model, which is particularly crucial in industrial applications.

6.2 Human Values Detection

PLMs of medium size: From the F1 scores, it is evident that the models require further improvement. However, the accuracy metric in the experiments shows that the models' detection capability is viable; the main direction for improvement lies in reducing false positives.

Chat-LLMs: Despite being evaluated on only a small subset of the data, LLMs have demonstrated commendable performance in understanding human values.
6.3 Limitations

When comparing the performance of fine-tuning and prompt tuning, the inability to control model parameter sizes and trainable parameters reduces the persuasiveness of the results. However, such conclusions remain meaningful for practical applications because, under various trade-off conditions, considerations across all aspects are often needed rather than an itemized comparison of pros and cons.

References

Georgios Balikas. 2023. John-Arthur at SemEval-2023 Task 4: Fine-Tuning Large Language Models for Arguments Classification. In Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval'23), pages 1428-1432, Toronto, Canada. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171-4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Ning Ding, Shengding Hu, Weilin Zhao, Yulin Chen, Zhiyuan Liu, Haitao Zheng, and Maosong Sun. 2022. OpenPrompt: An open-source framework for prompt-learning. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 105-113, Dublin, Ireland. Association for Computational Linguistics.

Tianyu Gao, Xingcheng Yao, and Danqi Chen. 2021. SimCSE: Simple contrastive learning of sentence embeddings. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6894-6910, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Shengding Hu, Ning Ding, Huadong Wang, Zhiyuan Liu, Jingang Wang, Juanzi Li, Wei Wu, and Maosong Sun. 2022. Knowledgeable prompt-tuning: Incorporating knowledge into prompt verbalizer for text classification. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2225-2240, Dublin, Ireland. Association for Computational Linguistics.

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. Advances in Neural Information Processing Systems, 35:22199-22213.

Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2021. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing.

Long Ma. 2023. PAI at SemEval-2023 Task 4: A General Multi-label Classification System with Class-balanced Loss Function and Ensemble Module. In Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval'23), pages 256-261, Toronto, Canada. Association for Computational Linguistics.

Nailia Mirzakhmedova, Johannes Kiesel, Milad Alshomary, Maximilian Heinrich, Nicolas Handke, Xiaoni Cai, Barriere Valentin, Doratossadat Dastgheib, Omid Ghahroodi, Mohammad Ali Sadraei, Ehsaneddin Asgari, Lea Kawaletz, Henning Wachsmuth, and Benno Stein. 2023. The Touché23-ValueEval dataset for identifying human values behind arguments.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21:140:1-140:67.

Daniel Schroter, Daryna Dementieva, and Georg Groh. 2023. Adam-Smith at SemEval-2023 Task 4: Discovering Human Values in Arguments with Ensembles of Transformer-based Models. In Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval'23), pages 532-541, Toronto, Canada. Association for Computational Linguistics.

Shalom H. Schwartz. 1994. Are there universal aspects in the structure and contents of human values? Journal of Social Issues, 50(4):19-45.

Chi Sun, Xipeng Qiu, Yige Xu, and Xuanjing Huang. 2019. How to fine-tune BERT for text classification? In Chinese Computational Linguistics: 18th China National Conference, CCL 2019, Kunming, China, October 18-20, 2019, Proceedings 18. Springer.

Masaya Tsunokake, Atsuki Yamaguchi, Yuta Koreeda, Hiroaki Ozaki, and Yasuhiro Sogawa. 2023. Hitachi at SemEval-2023 Task 4: Exploring Various Task Formulations Reveals the Importance of Description Texts on Human Values. In Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval'23), pages 1723-1735, Toronto, Canada. Association for Computational Linguistics.
