0% found this document useful (0 votes)

20 views7 pages

2403.09720v1

This document discusses the SemEval-2023 Task 4, focusing on the effectiveness of fine-tuning versus prompting in understanding human values through language models. It explores various methodologies, including the use of large language models (LLMs) and the Human Value Detection dataset, while posing key questions about model performance and alignment with human preferences. The findings indicate that while fine-tuning has shown some effectiveness, further research is needed to enhance understanding and classification of human values in natural language processing tasks.

Uploaded by

Shumaila Anwar

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

20 views7 pages

2403.09720v1

Uploaded by

Shumaila Anwar

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 7

SemEval-2023 Task 4: Fine-tuning vs Prompting,

Can Language Models Understand Human Values?

Pingwei SUN
BDT program, HKUST
psunah@connect.ust.hk

Abstract model ensembling and expanding the dataset to im-

prove scores. Taking into account computational
Accurately handling the underlying support val-
resources and the rise of LLMs, our project will fo-
ues in sentences is crucial for understanding
the speaker’s tendencies, yet it poses a chal-
cus on the following questions based on the dataset:
arXiv:2403.09720v1 [cs.CL] 12 Mar 2024

lenging task in natural language understanding

• Q1: Does prompt-tuning work compared with
(NLU). In this article, we explore the poten-
tial of fine-tuning and prompt tuning in this fine-tuning on the complex downstream task?
downstream task, using the Human Value De-
• Q2: Can PLMs handle human values powered
tection 2023. Additionally, we attempt to vali-
date whether models can effectively solve the by knowledge from the pre-training stage?
problem based on the knowledge acquired dur-
ing the pre-training stage. Simultaneously, our
• Q3: How do LLMs perform on the task after
interest lies in the capabilities of large language aligning with human preference?
models (LLMs) aligned with RLHF in this task,
and some preliminary attempts are presented. 2 Related Work
Human Value Detection 2023. Among the sub-
1 Introduction
missions, most teams applied transformers-based
The persuasiveness of arguments is heavily influ- models as their backbone and treated it as a classi-
enced by individuals’ values, and variations in the fication task.
priority assigned to these values can lead to dis- Team Adam Smith (Schroter et al., 2023) em-
putes, especially between different cultures. To ploys an ensemble strategy, including DeBERTa
address this, computational linguistics leverages and RoBERTa. They trained these models for loss
human values to categorize and evaluate arguments. minimization or F1-score maximization on three
Schwartz’s value categorization(Schwartz, 1994) folds each RoBERTa model was pretrained on the
serves as a common framework. However, existing IBM-ArgQ-Rank30KArgs dataset. Ensembling in-
datasets have limitations, such as small sizes and volved averaging predictions with an optimized de-
cultural biases. To overcome these issues, the com- cision threshold. They experimented with a stacked
petition holder proposes an extension, Touché23- meta-classifier based on logistic regression.
ValueEval (Mirzakhmedova et al., 2023), contain- Team John Arthur (Balikas, 2023) fine-tuned a
ing 9324 diverse arguments from various sources DeBERTa model on the task’s data, using a con-
and cultures. This dataset aims to facilitate the de- catenated representation of stance, premise, and
velopment of Pre-trained Language Models (PLM) conclusion. They found that separate token sym-
for the automatic identification of human values bols for stance improved classification. The model
in persuasive communication, supporting a broad was trained to minimize binary cross-entropy loss,
array of applications. The task attracted many re- and the team observed performance benefits with
searchers to work on it and various methods were more training data.
proposed to solve the problem. Team PAI (Ma, 2023) combined models using
However, from the leaderboard, this task still various input datasets. They applied weight voting
requires more in-depth research, with the top per- based on F1 score for ensembling and explored
former only achieving an average F1 score of 0.56. different loss functions, ultimately favoring a class-
Most participating teams tried to solve it as a clas- balanced loss. Classification thresholds were tested,
sification task and employed techniques such as but no performance improvement was achieved.
Feature Value Type Example
Argument ID Str ”A01002”
Conclusion Str ”We should ban fast food”
Stance Str [”supporting”, ”against”]
Premise Str ”Fast food should be banned because it is really bad ...”
Value Labels Int [0, 1]
Value categories: ”Achievement”, ”Stimulation”, etc.
Value Description Str ”Self-direction: thought. It is good to have own ideas and interests.
Contained values and associated arguments (examples): ...”

Table 1: The dataset used in the experiment with the corresponding types and examples. Stance is mapped to single
words to simplify the prompting template. Square brackets represent single selections from among. The Value
Description are defined in Schwartz, 1994.

Fine-tuning. Since the release of BERT (De- 3.1 Data Utilization

vlin et al., 2019), the approach of large-scale pre- The data file contains various annotations and sup-
training followed by fine-tuning on downstream plementary datasets from platforms like Zhihu and
tasks has been widely explored and applied to vari- The New York Times. Experiments are conducted
ous NLP subtasks. on the main dataset, and due to the unavailability
Through the integration of a range of training of test labels, all results in this paper reflect the
techniques, this method has achieved impressive model’s performance on the validation set.
results. The bi-encoder structure employed during Before commencing the experiments, the data
pre-training inherently endows it with the ability files are reorganized, and certain features are pre-
to effectively extract semantic information from se- processed, as shown in Table 1.
quences. However, it performs poorly when facing
3.2 Fine-tuning as a Classification Task
few-shot and zero-shot scenarios.
Classifier. To solve the multi-label classification
Prompting. Recently, researchers have been
task, there are two methods to choose from. On
attempting to transform NLP tasks into a seq-to-
the one hand, it is sensible to use a single classi-
seq format using prompts. T5 (Raffel et al., 2020)
fier, apply the sigmoid function to each xi in the
marked the first successful endeavor in this direc-
output = {x1 , x2 , ..., xc }, and then calculate the
tion. Such an approach leverages the knowledge ac-
Binary Cross Entropy Loss as follows.
quired by language models during the pre-training
C
phase rather than treating them as feature extractors. X
Consequently, it exhibits excellent performance on yi · log σ(xi ) + (1 − yi ) · log(1 − σ(xi )) (1)
i=1
few-shot and zero-shot tasks, with the added benefit
of reduced fine-tuning costs. On the other hand, it also works by setting up
the same number of classifiers as the number of
GPT, with its decoder structure, demonstrates labels, each classifier to handle one category.
enhanced capabilities in language modeling. The Hidden States. Serving as an encoding module,
team Hitachi (Tsunokake et al., 2023) has used BERT can provide various levels of features of the
BART, T5, and GPT3, feeding the data to them in sequence. It is commonly agreed that the deeper
question-answering format. layers always output high-level semantic informa-
tion (Sun et al., 2019) and the first token [CLS] can
be used for sequence classification.
3 Methods Avoid Forgetting. Since the model has learned
solid knowledge during pre-training, we need to
To answer the questions proposed in § 1, we pro- find an appropriate way to arrange the learning
pose the following techniques, to our best knowl- rate for fine-tuning. The lower layers handle more
edge, to measure the potential of models on the general information so they should be updated with
complex task of detecting human values. Figure 1 relatively lower learning rates. We use the decay
illustrates our workflow in the project. factor to achieve layer-wise learning rates.
Advanced
PLM Type Task Mode Tuning Method Target Question
Technique

Multiple Classifiers
Classification Q1
Add classifiers after the last Finet-une vs Prompt
Encoder: transformer layer. Layer-wise LR
RoBERTa, DeBERTa Fine-tuning
(Unfix PLM) Contrastive Learning

MLM Q2
Seq2Seq: CLS Template
Predict the masked value Prompt + Fine-tuning Human Value with PLMs
T5 "Yes" or "No" of each label.
BC Template
Prompt tuning
Decoder: (Fix PLM) OA Template
GLM
GPT2, LLMs LLM
Q3
Generate judgement for each Performance of LLMs
label then map to labels. CoT Template

P-tuning

Figure 1: Illustration of the workflow of our experiments. Tracks marked by different colors stand for combinations
of models, tasks, and methods, details of which are told in § 4. The dashed lines mean methods are theoretically
available but we do not consider them in the project because of poor performance or being computationally
demanding. The colorful dots are experiment results and indicate the analysis for the questions proposed in § 1.

3.3 Contrastive Learning Model(MLM) by filling the blank or a Generative

This project is a complex Natural Language Un- Language Model(GLM) by generation.
derstanding task, so obtaining a good embedding Open answering template with knowledge-
representation of the sequence is crucial for the able verbalizer. When our templates become more
subsequent training of the classifier. complex and open, it often allows for better stimu-
Inspired by the SimCSE (Gao et al., 2021), Con- lation of internal model knowledge. At this point,
trastive Learning (CL) loss is applied to optimize we need a verbalizer to map the output results to
the embedding states of sequences. Since the target labels, and the choice of verbalizer significantly
is multi-label classification, the positive and neg- impacts the outcomes (Hu et al., 2022).
ative samples can not be defined directly inside a Notably, the dataset includes descriptions and
mini-batch. Consequently, we compute the CL loss examples of human values. However, they were
according to the following formulas: artificially defined by linguists two decades ago.
To enhance the dataset’s generalization, we intro-
yiT · yj duce LLM to rewrite the descriptions and provide
wij = PB
T · yk + ε synonyms, which will serve as a knowledgeable
k=1 yi
PB sim(xi ,xj ) (2) verbalizer to map the output results to labels of
j=1 wij · e
τ′
CL ℓi = − log human values.
sim(xi ,xj )
PB Chain-of-Thought template. Language mod-
j=1 e
τ′
els, particularly LLMs, have demonstrated remark-
where y is the label vector, x is the embedding rep- able reasoning capabilities, as evidenced by studies
resentation and B is the batch size. The τ ′ stands such as (Kojima et al., 2022). Despite the challenge
for the temperature hyper-parameter of CL, which of directly comprehending whether specific human
is a scaling factor of similarity scores. values are implied in triples, models can acquire rel-
evant knowledge through a systematic prompting
3.4 Prompt Tuning Techniques process. This step-by-step approach allows LLMs
When doing prompt tuning, the model should be to navigate and extract pertinent information, ul-
transferred to other modes to fit the template’s re- timately enabling them to effectively address the
quirements, shown in Figure 2. The details of tem- intricacies of classification problems.
plate construction are introduced below. In light of this, we have incorporated templates
Binary choice template. It is easy to come in the form of CoT. These templates serve as struc-
up with the idea to transfer the multi-label clas- tured prompts that guide the model in interpreting
sification task to several binary choice questions. and understanding the nuances of human values
It can be solved either with a Masked Language within extra knowledge.
Prompt Template Board

CLS Template MBC Template

Input: Input:
Human Values are indicated in the premise: [Premise] and the Given the premise: [Premise] and the conclusion: [Conclusion]
conclusion: [Conclusion] [Stance] it. [Stance] it. Is it based on the [Value Description]? [MASK]
Output: Output:
[Logits for classifiers] "Yes" or "No"

BCA Template OA Template

Input: Input:
Question: Given the premise: [Premise] and the conclusion: . I claim that [Conclusion] because I [Stance] the
[Conclusion] [Stance] it. Is it based on the [ValueDescription]? [Premise]. In the aspect of [Value Label], use some words to
The answer is describe what kind of person am I?
Output: Output:
"Yes" or "No" True if any in the answer, else False.

CoT Template
Solving all labels in one run
Input 1:
What kind of human values are indicated in the premise: Available in encoder with classifier mode
[Premise] and the conclusion: [Conclusion] [Stance] it?
Output 1: Available in MLM mode
#Reply from LLM#
Input 2: Available in GLM mode
Do you think [#Reply from LLM#] is aligned with ,
which can also be called ? Human value descriptions from ChatGPT
Output 2:
"Yes" or "No" Synonyms from ChatGPT

Figure 2: Prompt templates for different task processing modes, including classification(CLS), masked binary
choice(MBC), binary choice answering(BCA), open answering(OA), and Chain-of-Thought(CoT). The bolded
content in brackets represents the features in the dataset, shown in Table 1.

4 Experiments we treat it as either a pre-training task or an ex-

tra loss item at the fine-tuning stage. The second
The following experiments are conducted to an- strategy performs slightly better, and both of them
swer questions mentioned in § 1. Since we aim make progress on the macro F1 score.
to explore fine-tuning methods and have a limited
computational budget, tricks such as model ensem-
Parameters Value
bling to improve scores are not applied.
Trainer
4.1 Exp-I: Fine-tuning Batch size 8
Epochs 3.0
Based on the previous team’s papers, we choose Lr scheduler cosine
RoBERTa and DeBERTa followed by a classifier Warmup ratio 0.1
for fine-tuning experiments. The Premise, Con- Lr decay 0.97
clusion, and Stance features are directly fed to the Optimizer
model. Settings of hyperparameters for training are Optimzier Adamw
listed in Table 2. Learning rate 2e-5
The performance of different models and classi- Trainable param last 8 layers + classifier(s)
fiers is presented in Table 3. RoBERTa with mul-
tiple classifiers is selected after the preliminary Table 2: We try various combinations of the hyperpa-
fine-tuning experiment. rameters and compare validation scores after training.
Then Contrastive Learning is introduced for fur- The above is an optimal choice among them.
ther improvement of the performance. In this part,
Model Macro F1 score 5 Analysis
RoBERT aLarge
w/ SH .481 5.1 Effectiveness of Fine-tuning and Input
w/ MH .507 Description
DeBERT aLarge
We achieved a relatively competitive result com-
w/ SH .477
pared with last year’s leaderboard. At the same
w/ MH .493
time, we also validated the contribution of con-
RoBERT aLarge w/ MH
trastive learning to obtaining more discriminative
CL pre-train .518
embedding representations.
CL fine-tune .522
Furthermore, we observed that, in handling NLU
Table 3: Performance of finetuned encoding models tasks with complex inputs, providing reasonable
with classifiers on the validation set. SH means single descriptions for complex input examples enhances
classification head, while MH means multiple heads. model performance compared to directly concate-
nating and inputting individual components.

4.2 Exp-II: Prompt Tuning 5.2 Performance of Prompt Tuning

The previously mentioned prompt templates are Finding versions of different models with the same
sequentially tested in this section in conjunction parameter size is a challenge, to ensure the validity
with models of different parameter sizes and struc- of the results, we attempted to compensate by the
tures. The training strategies are inspired by Liu trainable parameters during training.
et al., 2021 and implementations are based on From the table, it is evident that on models with
OpenPrompt (Ding et al., 2022). a size less than 500M, fine-tuning outperforms
prompt tuning, which is reasonable being given
To be specific, the CLS template is treated as a more trainable parameters. However, when the
hard prompt of which parameters are frozen and model size approaches 1B (two to three times that
fed to encoding models with trainable multiple clas- of the formers), prompt tuning shows comparable
sifiers. Its trainable parameters are consistent with performance to fine-tuning. Yet, during training,
the previous fine-tuning settings. we observed a larger variance in the performance
Templates in the forms of mask filling and ques- of prompt tuning with changes in the initialization
tion answering are also proposed. At this time, we methods of templates.
set PLMs fixed and allow prompts’ parameters to Does it mean fine-tuning is always a better
be updated. Extra knowledge in the OA template is choice in this range of parameter size (<1B)? Our
processed in advance, which is manually selected two additional experiments explain this in more
from feedback from ChatGPT. detail. The results of NLI and few-shot are listed
in Table 5.
We built a simple NLI task using the Premise
4.3 Exp-III: Works on LLMs Conclusion and Stance features in the dataset and
tuned some models using the same method as be-
The capabilities of the LLMs on this complex NLU fore. As can be seen, there is no significant gap
task are also what we plan to explore. Experiments between fine-tuning and prompt tuning in this sim-
are conducted with a CoT template which is de- pler task, even with fewer trainable parameters,
signed to stimulate the logical inference ability and indicating that with the increase in task complexity,
knowledge inside LLMs. prompt tuning requires a larger base model, mean-
Considering the computational budget, no gra- ing better language modeling abilities and more
dient updating process is included in this template. knowledge to achieve performance comparable to
Instead, the scores are obtained by taking 5% of the the encoder + classifier paradigm.
validation dataset for local inference (Llama7B) or When faced with few-shot datasets, prompt tun-
API calls (GPT series) and computing F1 scores on ing shows obvious advantages. In contrast, fine-
predicted results, which is shown in Table 4. tuning seems failed to solve the task.
Benevolence: dependability
Conformity: interpersonal

Universalism: objectivity
Universalism: tolerance
Self-direction: thought

Universalism: concern

Universalism: nature
Self-direction: action

Benevolence: caring
Power: dominance

Security: personal

Conformity: rules
Security: societal
Power: resources
Achievement
Stimulation

Hedonism

Tradition

Humility
Face
Method / F1 score All
Classification
RoBERTa w/ MH .51 - - - - - - - - - - - - - - - - - - - -
RoBERTa-CLS .54 - - - - - - - - - - - - - - - - - - - -
DeBERTa-CLS .52 - - - - - - - - - - - - - - - - - - - -
MLM
RoBERTa-MBC .47 .52 .63 .22 .31 .58 .33 .56 .30 .69 .62 .58 .55 .27 .16 .50 .32 .66 .75 .39 .43
DeBERTa-MBC - - - - - - - - - - - - - - - - - - - - -
GLM
T5† -BCA .45 .49 .59 .23 .30 .58 .37 .50 .25 .70 .61 .44 .46 .25 .21 .48 .28 .68 .73 .33 .50
T5† -OA .45 .49 .58 .25 .31 .58 .32 .49 .27 .73 .60 .49 .45 .25 .19 .47 .28 .66 .74 .35 .52
T5-BCA .52 .57 .70 .18 .44 .65 .41 .54 .33 .78 .68 .67 .59 .30 .13 .50 .35 .76 .85 .49 .56
T5-OA .50 .51 .69 .13 .27 .61 .44 .49 .30 .74 .62 .58 .55 .36 .17 .52 .41 .73 .80 .45 .55
GPT2† -BCA .42 .41 .57 .15 .21 .55 .26 .43 .23 .66 .53 .47 .49 .31 .20 .42 .28 .64 .72 .36 .53
GPT2† -OA .44 .45 .62 .11 .42 .47 .31 .41 .23 .71 .56 .44 .51 .33 .20 .45 .30 .67 .77 .39 .52
GPT2-BCA .46 .44 .63 .19 .23 .58 .38 .47 .25 .69 .56 .52 .54 .34 .23 .43 .28 .68 .79 .32 .56
GPT2-OA .48 .47 .65 .25 .29 .56 .37 .51 .31 .72 .60 .51 .55 .27 .19 .50 .33 .67 .81 .52 .54
LLM-CoT
Llama-7B .49 .55 .66 .20 .39 .64 .35 .48 .26 .77 .63 .54 .53 .35 .20 .53 .38 .71 .80 .34 .42
ChatGPT .53 .52 .71 .26 .33 .61 .48 .52 .38 .75 .65 .60 .56 .47 .16 .55 .42 .72 .83 .42 .57

Table 4: F1 score of various prompt tuning methods. T5 and GPT2 with † stand for the base and medium
versions respectively and the others are of their large versions except for LLMs. There is something wrong with
DeBERTaForMaskedLM in huggingface, so DeBERTa-MBC is of no reference value.

5.3 Prompting LLMs to Detect Human Values 6 Conclusion

6.1 Fine-tuning vs Prompt Tuning
In the aspect of LLMs, even without any fine- This is a problem that involves multiple variables
tuning, a well-designed questioning style (prompts) as we mentioned in the analysis section. Each
can enable them to perform well on the task. approach has its own strengths and weaknesses.
However, this improvement is more uncertain Briefly, prompt tuning can achieve and some-
than the previously mentioned methods (as can be times surpass the performance of direct fine-tuning
seen from the variation in scores across categories) when the model size is large enough. It exhibits
and relies on an experienced questioner. Failures in better generalization and enhanced transferability.
a few categories may be attributed to inconsisten- However, direct fine-tuning, as a more straightfor-
cies between external knowledge from ChatGPT ward approach, is worth considering when train-
and task definitions. ing samples are abundant, as it allows for stable
performance leveraging a smaller model, which is
particularly crucial in industrial applications.

Models / Task NLI 5-shot HVD 6.2 Human values detection

RoBERTa .872 - PLMs of medium size: From the F1 scores, it is ev-
T5 .866 .413 ident that the model requires further improvement.
GPT2 .849 .408 However, the accuracy metric in the experiments
reflects the model’s detection capability is viable.
Table 5: Results of NLI and few-shot Human Value
Detection. The model sizes are close and all below
The main direction for improvement lies in reduc-
500M. NLI is evaluated by accuracy while HVD still ing false positives.
uses the F1 score. Chat-LLMs: Despite evaluating only a small sub-
set of data, LLMs have demonstrated commendable
performance in understanding human values.
6.3 Limitations Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang,
Hiroaki Hayashi, and Graham Neubig. 2021. Pre-
When comparing the performance of fine-tuning train, prompt, and predict: A systematic survey of
and prompt tuning, the inability to control model prompting methods in natural language processing.
parameter sizes and trainable parameters leads to a
Long Ma. 2023. PAI at SemEval-2023 Task 4: A Gen-
decrease in the persuasiveness of the results. eral Multi-label Classification System with Class-
However, such conclusions remain meaningful balanced Loss Function and Ensemble Module. In
for practical applications because, in the face of Proceedings of the 17th International Workshop on
various trade-off conditions, considerations across Semantic Evaluation (SemEval’23), pages 256–261,
Toronto, Canada. Association for Computational Lin-
all aspects are often needed rather than an itemized
guistics.
comparison of pros and cons.
Nailia Mirzakhmedova, Johannes Kiesel, Milad Al-
shomary, Maximilian Heinrich, Nicolas Handke, Xi-
References aoni Cai, Barriere Valentin, Doratossadat Dastgheib,
Omid Ghahroodi, Mohammad Ali Sadraei, Ehsaned-
Georgios Balikas. 2023. John-Arthur at SemEval-2023 din Asgari, Lea Kawaletz, Henning Wachsmuth, and
Task 4: Fine-Tuning Large Language Models for Ar- Benno Stein. 2023. The touché23-valueeval dataset
guments Classification. In Proceedings of the 17th for identifying human values behind arguments.
International Workshop on Semantic Evaluation (Se-
mEval’23), pages 1428–1432, Toronto, Canada. As- Colin Raffel, Noam Shazeer, Adam Roberts, Katherine
sociation for Computational Linguistics. Lee, Sharan Narang, Michael Matena, Yanqi Zhou,
Wei Li, and Peter J. Liu. 2020. Exploring the limits
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and of transfer learning with a unified text-to-text trans-
Kristina Toutanova. 2019. BERT: Pre-training of former. J. Mach. Learn. Res., 21:140:1–140:67.
deep bidirectional transformers for language under-
standing. In Proceedings of the 2019 Conference of Daniel Schroter, Daryna Dementieva, and Georg Groh.
the North American Chapter of the Association for 2023. Adam-Smith at SemEval-2023 Task 4: Discov-
Computational Linguistics: Human Language Tech- ering Human Values in Arguments with Ensembles
nologies, Volume 1 (Long and Short Papers), pages of Transformer-based Models. In Proceedings of the
4171–4186, Minneapolis, Minnesota. Association for 17th International Workshop on Semantic Evalua-
Computational Linguistics. tion (SemEval’23), pages 532–541, Toronto, Canada.
Association for Computational Linguistics.
Ning Ding, Shengding Hu, Weilin Zhao, Yulin Chen,
Zhiyuan Liu, Haitao Zheng, and Maosong Sun. 2022. Shalom H Schwartz. 1994. Are there universal aspects
OpenPrompt: An open-source framework for prompt- in the structure and contents of human values? Jour-
learning. In Proceedings of the 60th Annual Meet- nal of social issues, 50(4):19–45.
ing of the Association for Computational Linguistics:
System Demonstrations, pages 105–113, Dublin, Ire- Chi Sun, Xipeng Qiu, Yige Xu, and Xuanjing Huang.
land. Association for Computational Linguistics. 2019. How to fine-tune bert for text classification?
In Chinese Computational Linguistics: 18th China
Tianyu Gao, Xingcheng Yao, and Danqi Chen. 2021. National Conference, CCL 2019, Kunming, China,
SimCSE: Simple contrastive learning of sentence em- October 18–20, 2019, Proceedings 18. Springer.
beddings. In Proceedings of the 2021 Conference
on Empirical Methods in Natural Language Process- Masaya Tsunokake, Atsuki Yamaguchi, Yuta Koreeda,
ing, pages 6894–6910, Online and Punta Cana, Do- Hiroaki Ozaki, and Yasuhiro Sogawa. 2023. Hitachi
minican Republic. Association for Computational at SemEval-2023 Task 4: Exploring Various Task
Linguistics. Formulations Reveals the Importance of Description
Texts on Human Values. In Proceedings of the 17th
Shengding Hu, Ning Ding, Huadong Wang, Zhiyuan International Workshop on Semantic Evaluation (Se-
Liu, Jingang Wang, Juanzi Li, Wei Wu, and Maosong mEval’23), pages 1723–1735, Toronto, Canada. As-
Sun. 2022. Knowledgeable prompt-tuning: Incor- sociation for Computational Linguistics.
porating knowledge into prompt verbalizer for text
classification. In Proceedings of the 60th Annual
Meeting of the Association for Computational Lin-
guistics (Volume 1: Long Papers), pages 2225–2240,
Dublin, Ireland. Association for Computational Lin-
guistics.

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yu-

taka Matsuo, and Yusuke Iwasawa. 2022. Large lan-
guage models are zero-shot reasoners. Advances in
neural information processing systems, 35:22199–
22213.

Training Language Models To Follow Instructions With Human Feedback
No ratings yet
Training Language Models To Follow Instructions With Human Feedback
68 pages
NLP-week9-fine-tuning_and_IR
No ratings yet
NLP-week9-fine-tuning_and_IR
64 pages
lec20.LLM
No ratings yet
lec20.LLM
58 pages
2306.05087v2
No ratings yet
2306.05087v2
21 pages
Training the application of LLM
No ratings yet
Training the application of LLM
68 pages
2505.17063v1
No ratings yet
2505.17063v1
34 pages
Tmlr 2023 Raft
No ratings yet
Tmlr 2023 Raft
29 pages
Automatic Essay Grading
No ratings yet
Automatic Essay Grading
20 pages
Advancement in NLP Paper
No ratings yet
Advancement in NLP Paper
49 pages
20250216_Kamal_Hina_92133558_DLBCSIAW01
No ratings yet
20250216_Kamal_Hina_92133558_DLBCSIAW01
15 pages
2310.01119v2
No ratings yet
2310.01119v2
12 pages
Language Models (Mostly) Know What They Know: Anthropic
No ratings yet
Language Models (Mostly) Know What They Know: Anthropic
43 pages
L S LL MAF: Abel Upervised A Inetuning
No ratings yet
L S LL MAF: Abel Upervised A Inetuning
12 pages
Lecture 7
No ratings yet
Lecture 7
66 pages
2411.05193v2
No ratings yet
2411.05193v2
17 pages
ELREA 多个lora适配器动态选取
No ratings yet
ELREA 多个lora适配器动态选取
29 pages
Fine-Tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping
No ratings yet
Fine-Tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping
11 pages
2506.17766v1
No ratings yet
2506.17766v1
7 pages
BERT and Transformer
No ratings yet
BERT and Transformer
48 pages
NLP DL Lecture4
No ratings yet
NLP DL Lecture4
78 pages
Continual-Alignment
No ratings yet
Continual-Alignment
51 pages
Summary_Foundations on LLMs
No ratings yet
Summary_Foundations on LLMs
6 pages
Introduction To LLMS: Transformers Types of Llms Configuration Settings
100% (2)
Introduction To LLMS: Transformers Types of Llms Configuration Settings
7 pages
All about Encoder-Decoder Models
No ratings yet
All about Encoder-Decoder Models
50 pages
2402.13669v2
No ratings yet
2402.13669v2
16 pages
Perspectives in Business Ethics
No ratings yet
Perspectives in Business Ethics
113 pages
Paper - Training Chain-Of-Thought via Latent-Variable Inference
No ratings yet
Paper - Training Chain-Of-Thought via Latent-Variable Inference
23 pages
How To Fine-Tune BERT For Text Classification?: Corresponding Author The Source Codes Are Available at
No ratings yet
How To Fine-Tune BERT For Text Classification?: Corresponding Author The Source Codes Are Available at
10 pages
Finetuning LLM for vulnerability detection
No ratings yet
Finetuning LLM for vulnerability detection
12 pages
2310.06422v2
No ratings yet
2310.06422v2
7 pages
Large Language Models for Text Classification Case Study and 2rl2h1dz4onu
No ratings yet
Large Language Models for Text Classification Case Study and 2rl2h1dz4onu
12 pages
Improving BERT-Based Text Classification With Auxiliary Sentence and Domain Knowledge
No ratings yet
Improving BERT-Based Text Classification With Auxiliary Sentence and Domain Knowledge
16 pages
2023.semeval-1.75
No ratings yet
2023.semeval-1.75
7 pages
Upstream Mitigation Is Not All You Need: Testing The Bias Transfer Hypothesis in Pre-Trained Language Models
No ratings yet
Upstream Mitigation Is Not All You Need: Testing The Bias Transfer Hypothesis in Pre-Trained Language Models
19 pages
855 Roberta A Robustly Optimized B
No ratings yet
855 Roberta A Robustly Optimized B
15 pages
Interpreting Language Models Through Knowledge Graph Extraction
No ratings yet
Interpreting Language Models Through Knowledge Graph Extraction
13 pages
3
No ratings yet
3
5 pages
2 4 Chapters
No ratings yet
2 4 Chapters
3 pages
PTR: Prompt Tuning With Rules For Text Classification: Petroni Et Al. 2019
No ratings yet
PTR: Prompt Tuning With Rules For Text Classification: Petroni Et Al. 2019
12 pages
2 Ruf
No ratings yet
2 Ruf
11 pages
Measuring Massive Multitask Language Understanding
No ratings yet
Measuring Massive Multitask Language Understanding
25 pages
Bert
No ratings yet
Bert
10 pages
Model
No ratings yet
Model
5 pages
split_1363534026993628405
No ratings yet
split_1363534026993628405
2 pages
German's Next Language Model - Branden Chan, Stefan Schweter, and Timo Moller
No ratings yet
German's Next Language Model - Branden Chan, Stefan Schweter, and Timo Moller
9 pages
ST 04
No ratings yet
ST 04
4 pages
B1022 B1025 MFP Service Manual
100% (1)
B1022 B1025 MFP Service Manual
294 pages
Logical Reasoning
No ratings yet
Logical Reasoning
12 pages
Generative Pretraining From Pixels V2
No ratings yet
Generative Pretraining From Pixels V2
12 pages
2024 Findings-Eacl 141
No ratings yet
2024 Findings-Eacl 141
17 pages
Improving Language Model Behavior by Training On A Curated Dataset
No ratings yet
Improving Language Model Behavior by Training On A Curated Dataset
9 pages
2403.01081v3
No ratings yet
2403.01081v3
10 pages
Nemo-Aligner: Scalable Toolkit For Efficient Model Align-Ment
No ratings yet
Nemo-Aligner: Scalable Toolkit For Efficient Model Align-Ment
16 pages
25636-1454-21112-2-10-20230927
No ratings yet
25636-1454-21112-2-10-20230927
4 pages
Applying Deep Learning To Answer Selection - A Study and An Open Task
No ratings yet
Applying Deep Learning To Answer Selection - A Study and An Open Task
8 pages
Gen AI Assignment
No ratings yet
Gen AI Assignment
5 pages
L L M C S - I: Arge Anguage Odels AN ELF Mprove
No ratings yet
L L M C S - I: Arge Anguage Odels AN ELF Mprove
19 pages
Rethinking Benchmark and Contamination For Language Models With Rephrased Samples
No ratings yet
Rethinking Benchmark and Contamination For Language Models With Rephrased Samples
16 pages
Startup2 Elementary TB
No ratings yet
Startup2 Elementary TB
148 pages
Setting Up A Simple NAS Server On Linux Ubuntu - Micro Fusion Insight
No ratings yet
Setting Up A Simple NAS Server On Linux Ubuntu - Micro Fusion Insight
34 pages
StutiSharma
No ratings yet
StutiSharma
22 pages
Paper 36
No ratings yet
Paper 36
13 pages
Java PPT - 3 by Adi
No ratings yet
Java PPT - 3 by Adi
35 pages
2023.ranlp-1.86
No ratings yet
2023.ranlp-1.86
11 pages
Six Sigma
100% (2)
Six Sigma
377 pages
Azure Interview Questions
No ratings yet
Azure Interview Questions
19 pages
UQX OpIns en
No ratings yet
UQX OpIns en
66 pages
Farah CV
No ratings yet
Farah CV
1 page
2023.semeval-1.313
No ratings yet
2023.semeval-1.313
17 pages
Etwork: Penetration Testing
No ratings yet
Etwork: Penetration Testing
19 pages
DDWRT WireGuard Client Setup Guide v37
No ratings yet
DDWRT WireGuard Client Setup Guide v37
23 pages
Advanced Data Structures Lab
No ratings yet
Advanced Data Structures Lab
14 pages
3350-1 DS 20210720-Datasheet
100% (1)
3350-1 DS 20210720-Datasheet
16 pages
Adf7030 1 PDF
No ratings yet
Adf7030 1 PDF
55 pages
Railway Tracking System Abstract
No ratings yet
Railway Tracking System Abstract
48 pages
Farah CV
No ratings yet
Farah CV
1 page
01 IT110-Intro1
No ratings yet
01 IT110-Intro1
26 pages
Running EventStatsticsUtil To Understand How Events Are Stored in The OBM Database
No ratings yet
Running EventStatsticsUtil To Understand How Events Are Stored in The OBM Database
5 pages
SB6190 microSD Flash Guide
No ratings yet
SB6190 microSD Flash Guide
2 pages
Blockchain Technology
No ratings yet
Blockchain Technology
22 pages
Modified 31-JUL-2011 Type HOWTO Status MODERATED
No ratings yet
Modified 31-JUL-2011 Type HOWTO Status MODERATED
5 pages
DigitalCommunityMappingTool FinalDraft LR
No ratings yet
DigitalCommunityMappingTool FinalDraft LR
16 pages
Dell Poweredge 12g Server Bios
No ratings yet
Dell Poweredge 12g Server Bios
16 pages
Common Git Commands
No ratings yet
Common Git Commands
9 pages
China Telecom Phone SW Versions
No ratings yet
China Telecom Phone SW Versions
7 pages
Use of HL7 To Integrate A HIS-subsystem
No ratings yet
Use of HL7 To Integrate A HIS-subsystem
5 pages
W0030 DWGSeriesGatewayUpgradingIntroduction
No ratings yet
W0030 DWGSeriesGatewayUpgradingIntroduction
6 pages
Get File - Think Rich Pinoy Book PDF
No ratings yet
Get File - Think Rich Pinoy Book PDF
4 pages
Technical Aptitude Test Sample
No ratings yet
Technical Aptitude Test Sample
2 pages
SBI Homepage Re-Design (12 Column Grid) : Header
No ratings yet
SBI Homepage Re-Design (12 Column Grid) : Header
2 pages
Licencia Cocodrile Tech
No ratings yet
Licencia Cocodrile Tech
2 pages
Bootstrapping Language-Image Pretraining: The Complete Guide for Developers and Engineers
From Everand
Bootstrapping Language-Image Pretraining: The Complete Guide for Developers and Engineers
William Smith
No ratings yet
Constrained Conditional Model: Fundamentals and Applications
From Everand
Constrained Conditional Model: Fundamentals and Applications
Fouad Sabry
No ratings yet
Machine Learning: Fundamentals and Applications
From Everand
Machine Learning: Fundamentals and Applications
Fouad Sabry
No ratings yet

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.

2403.09720v1

Uploaded by

2403.09720v1

Uploaded by

SemEval-2023 Task 4: Fine-tuning vs Prompting,

Can Language Models Understand Human Values?

Abstract model ensembling and expanding the dataset to im-

lenging task in natural language understanding

Fine-tuning. Since the release of BERT (De- 3.1 Data Utilization

3.3 Contrastive Learning Model(MLM) by filling the blank or a Generative

CLS Template MBC Template

BCA Template OA Template

4 Experiments we treat it as either a pre-training task or an ex-

4.2 Exp-II: Prompt Tuning 5.2 Performance of Prompt Tuning

5.3 Prompting LLMs to Detect Human Values 6 Conclusion

Models / Task NLI 5-shot HVD 6.2 Human values detection

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yu-

You might also like

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.