2403.09720v1
Pingwei SUN
BDT program, HKUST
psunah@connect.ust.hk
Table 1: The dataset used in the experiments with the corresponding types and examples. Stance is mapped to single words to simplify the prompting template. Square brackets indicate a single selection from the listed options. The Value Descriptions are defined in Schwartz (1994).
[Figure 1 diagram: encoder models (RoBERTa, DeBERTa) handle classification by adding classifiers after the last transformer layer; Seq2Seq models (T5) handle MLM-style prediction of the masked value or a "Yes"/"No" for each label; decoder models (GPT2, LLMs) handle GLM-style generation of a judgement for each label that is then mapped to labels. Training methods comprise fine-tuning (unfixed PLM, layer-wise LR, contrastive learning), prompt + fine-tuning, and prompt tuning / P-tuning (fixed PLM), applied through the CLS, BC, OA, and CoT templates. The tracks feed Q1 (fine-tune vs. prompt), Q2 (human value with PLMs), and Q3 (performance of LLMs).]
Figure 1: Illustration of the workflow of our experiments. Tracks marked by different colors stand for combinations of models, tasks, and methods, the details of which are described in § 4. Dashed lines denote methods that are theoretically applicable but not considered in this project due to poor performance or high computational cost. The colored dots are experimental results and indicate the analyses of the questions proposed in § 1.
[Figure 2 diagram: each template pairs an input with an output rule. A Yes/No-style template reads "Question: Given the premise: [Premise] and the conclusion: [Conclusion], I [Stance] it. Is it based on the [Value Description]? The answer is" and outputs "Yes" or "No". The OA template reads "I claim that [Conclusion] because I [Stance] the [Premise]. In the aspect of [Value Label], use some words to describe what kind of person I am." and outputs True if any synonym appears in the answer, else False. The CoT template solves all labels in one run: Input 1 asks "What kind of human values are indicated in the premise: [Premise] and the conclusion: [Conclusion] [Stance] it?", Output 1 is the reply from the LLM, Input 2 asks whether that reply is aligned with the value label, which can also be called its synonyms, and Output 2 is "Yes" or "No". Markers in the figure indicate availability in the encoder-with-classifier, MLM, and GLM modes; the human value descriptions and synonyms come from ChatGPT.]
Figure 2: Prompt templates for different task processing modes, including classification (CLS), masked binary choice (MBC), binary choice answering (BCA), open answering (OA), and Chain-of-Thought (CoT). The bolded content in brackets represents the features in the dataset, shown in Table 1.
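To make the template format concrete, the sketch below fills the Yes/No-style template from a single dataset record with plain string formatting. It is only an illustration: the helper name, the example record, and the exact wording of the value description are assumptions, not the authors' code.

```python
# Illustrative only: fill the Yes/No-style template from one dataset record.
# Field names follow Table 1; the record values and the value description
# are made-up examples, not items from the actual dataset.

def fill_yes_no_template(record: dict, value_description: str) -> str:
    return (
        f"Question: Given the premise: {record['Premise']} "
        f"and the conclusion: {record['Conclusion']} {record['Stance']} it. "
        f"Is it based on the {value_description}? The answer is"
    )

example = {
    "Premise": "public transport should be free",
    "Conclusion": "free public transport reduces emissions",
    "Stance": "support",  # stance mapped to a single word (see Table 1)
}
print(fill_yes_no_template(example, "protecting the environment"))
```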
The previously mentioned prompt templates are sequentially tested in this section in conjunction with models of different parameter sizes and structures. The training strategies are inspired by Liu et al. (2021), and the implementations are based on OpenPrompt (Ding et al., 2022).
To be specific, the CLS template is treated as a hard prompt whose parameters are frozen and is fed to encoder models together with multiple trainable classifiers. Its trainable parameters are consistent with the previous fine-tuning settings.
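A minimal sketch of this setup (not the authors' exact implementation): the rendered CLS template is fed as plain text to an encoder PLM that remains trainable, and a multi-label head over the first-token representation produces one logit per value category. The model name, the number of labels (20), the single linear head, and the example text are assumptions.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class EncoderWithClassifiers(nn.Module):
    """Encoder PLM plus a multi-label classification head (one logit per value)."""

    def __init__(self, model_name: str = "roberta-large", num_labels: int = 20):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)   # PLM stays trainable here
        self.classifiers = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls_vec = hidden.last_hidden_state[:, 0]                # first-token representation
        return self.classifiers(cls_vec)                        # (batch, num_labels)

tokenizer = AutoTokenizer.from_pretrained("roberta-large")
model = EncoderWithClassifiers()

# The hard prompt is just rendered template text; it is "frozen" by construction
# because it is plain text rather than trainable embeddings. The wording below is
# a placeholder for the actual CLS template.
text = ("Premise: public transport should be free. "
        "Conclusion: free public transport reduces emissions. Stance: support.")
batch = tokenizer(text, return_tensors="pt", truncation=True)
logits = model(batch["input_ids"], batch["attention_mask"])     # shape (1, 20)
targets = torch.zeros_like(logits)                              # dummy multi-label targets
loss = nn.BCEWithLogitsLoss()(logits, targets)                  # multi-label objective
```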
Templates in the form of mask filling and question answering are also proposed. In this case, we keep the PLMs fixed and only allow the prompts' parameters to be updated. The extra knowledge in the OA template is processed in advance and manually selected from ChatGPT feedback.
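The fixed-PLM setting can be pictured as follows; this is a generic soft-prompt sketch rather than the OpenPrompt code actually used. Trainable prompt embeddings are prepended to the input embeddings of a frozen T5, so only the prompt receives gradient updates; the model name, prompt length, and vocabulary-based initialization are assumptions.

```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer, T5ForConditionalGeneration

class SoftPromptT5(nn.Module):
    """Frozen T5 with a trainable soft prompt prepended to the encoder input."""

    def __init__(self, model_name: str = "t5-large", num_prompt_tokens: int = 20):
        super().__init__()
        self.t5 = T5ForConditionalGeneration.from_pretrained(model_name)
        for p in self.t5.parameters():                     # fix the PLM
            p.requires_grad = False
        vocab = self.t5.get_input_embeddings().weight
        idx = torch.randint(0, vocab.size(0), (num_prompt_tokens,))
        # initialize the prompt from random vocabulary embeddings (one common choice)
        self.soft_prompt = nn.Parameter(vocab[idx].detach().clone())

    def forward(self, input_ids, attention_mask, labels):
        token_embeds = self.t5.get_input_embeddings()(input_ids)          # (B, L, D)
        prompt = self.soft_prompt.unsqueeze(0).expand(token_embeds.size(0), -1, -1)
        inputs_embeds = torch.cat([prompt, token_embeds], dim=1)
        prompt_mask = torch.ones(
            token_embeds.size(0), self.soft_prompt.size(0),
            dtype=attention_mask.dtype, device=attention_mask.device)
        mask = torch.cat([prompt_mask, attention_mask], dim=1)
        return self.t5(inputs_embeds=inputs_embeds, attention_mask=mask, labels=labels)

tokenizer = AutoTokenizer.from_pretrained("t5-large")
model = SoftPromptT5()
enc = tokenizer("Is the argument based on Universalism: nature? The answer is",
                return_tensors="pt")
dec = tokenizer("Yes", return_tensors="pt")
out = model(enc["input_ids"], enc["attention_mask"], labels=dec["input_ids"])
out.loss.backward()        # gradients flow only into model.soft_prompt
```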
4.3 Exp-III: Works on LLMs
The capabilities of LLMs on this complex NLU task are also what we plan to explore. Experiments are conducted with a CoT template designed to stimulate the logical inference ability and the knowledge inside LLMs.
Considering the computational budget, no gradient-updating process is included in this template. Instead, the scores are obtained by taking 5% of the validation dataset for local inference (Llama-7B) or API calls (GPT series) and computing F1 scores on the predicted results, as shown in Table 4.
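The gradient-free evaluation loop could look roughly like the sketch below. Here `ask_llm` is a placeholder for either local Llama-7B inference or a GPT-series API call, the two prompts paraphrase the CoT template in Figure 2, and the scoring uses scikit-learn's F1; these specifics are assumptions rather than the authors' script.

```python
from sklearn.metrics import f1_score

def ask_llm(prompt: str) -> str:
    """Placeholder for local Llama-7B inference or a GPT-series API call."""
    raise NotImplementedError("plug in the local model or API client here")

def cot_predict(premise: str, conclusion: str, stance: str,
                value_label: str, synonyms: str) -> int:
    # Turn 1: ask which human values the argument indicates.
    reply = ask_llm(
        f"What kind of human values are indicated in the premise: {premise} "
        f"and the conclusion: {conclusion} {stance} it?")
    # Turn 2: ask whether that free-form reply aligns with the target value label.
    verdict = ask_llm(
        f"Do you think [{reply}] is aligned with {value_label}, "
        f"which can also be called {synonyms}?")
    return 1 if "yes" in verdict.lower() else 0

def evaluate(examples, value_label, synonyms):
    """examples: iterable of (premise, conclusion, stance, gold_binary_label)."""
    gold, pred = zip(*[
        (label, cot_predict(p, c, s, value_label, synonyms))
        for p, c, s, label in examples])
    return f1_score(gold, pred)
```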
Finding versions of different models with the same parameter size is a challenge; to ensure the validity of the results, we attempted to compensate through the number of trainable parameters during training.

From the table, it is evident that on models smaller than 500M parameters, fine-tuning outperforms prompt tuning, which is reasonable given its larger number of trainable parameters. However, when the model size approaches 1B (two to three times the size of the smaller models), prompt tuning shows performance comparable to fine-tuning. Yet, during training, we observed a larger variance in the performance of prompt tuning as the initialization method of the templates changed.

Does this mean that fine-tuning is always the better choice in this range of parameter sizes (<1B)? Our two additional experiments explain this in more detail. The results of the NLI and few-shot experiments are listed in Table 5.
We built a simple NLI task using the Premise, Conclusion, and Stance features in the dataset and tuned some models using the same method as before. As can be seen, there is no significant gap between fine-tuning and prompt tuning on this simpler task, even with fewer trainable parameters, indicating that as task complexity increases, prompt tuning requires a larger base model, that is, better language modeling ability and more knowledge, to achieve performance comparable to the encoder + classifier paradigm.
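As a rough illustration of this reformulation (the paper does not spell out the exact mapping, so the stance-to-label scheme below is an assumption): each record's premise and conclusion become the sentence pair, and the single-word stance becomes the binary NLI-style label.

```python
# Hedged sketch of recasting Premise/Conclusion/Stance into an NLI-style task.
# The stance words and their label mapping are assumptions, not the paper's scheme.

STANCE_TO_LABEL = {"support": 1, "oppose": 0}   # entailment-like vs. contradiction-like

def to_nli_example(record: dict) -> dict:
    return {
        "premise": record["Premise"],
        "hypothesis": record["Conclusion"],
        "label": STANCE_TO_LABEL[record["Stance"]],
    }

rows = [
    {"Premise": "public transport should be free",
     "Conclusion": "free public transport reduces emissions",
     "Stance": "support"},
]
nli_dataset = [to_nli_example(r) for r in rows]
print(nli_dataset[0])
```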
When faced with few-shot datasets, prompt tuning shows obvious advantages. In contrast, fine-tuning seems to fail to solve the task.
Method / F1 score: the "All" column reports the overall F1; the remaining columns report per-label F1 for the 20 value categories (Achievement, Benevolence: caring, Benevolence: dependability, Conformity: interpersonal, Conformity: rules, Face, Hedonism, Humility, Power: dominance, Power: resources, Security: personal, Security: societal, Self-direction: action, Self-direction: thought, Stimulation, Tradition, Universalism: concern, Universalism: nature, Universalism: objectivity, Universalism: tolerance).
Classification
RoBERTa w/ MH .51 - - - - - - - - - - - - - - - - - - - -
RoBERTa-CLS .54 - - - - - - - - - - - - - - - - - - - -
DeBERTa-CLS .52 - - - - - - - - - - - - - - - - - - - -
MLM
RoBERTa-MBC .47 .52 .63 .22 .31 .58 .33 .56 .30 .69 .62 .58 .55 .27 .16 .50 .32 .66 .75 .39 .43
DeBERTa-MBC - - - - - - - - - - - - - - - - - - - - -
GLM
T5† -BCA .45 .49 .59 .23 .30 .58 .37 .50 .25 .70 .61 .44 .46 .25 .21 .48 .28 .68 .73 .33 .50
T5† -OA .45 .49 .58 .25 .31 .58 .32 .49 .27 .73 .60 .49 .45 .25 .19 .47 .28 .66 .74 .35 .52
T5-BCA .52 .57 .70 .18 .44 .65 .41 .54 .33 .78 .68 .67 .59 .30 .13 .50 .35 .76 .85 .49 .56
T5-OA .50 .51 .69 .13 .27 .61 .44 .49 .30 .74 .62 .58 .55 .36 .17 .52 .41 .73 .80 .45 .55
GPT2† -BCA .42 .41 .57 .15 .21 .55 .26 .43 .23 .66 .53 .47 .49 .31 .20 .42 .28 .64 .72 .36 .53
GPT2† -OA .44 .45 .62 .11 .42 .47 .31 .41 .23 .71 .56 .44 .51 .33 .20 .45 .30 .67 .77 .39 .52
GPT2-BCA .46 .44 .63 .19 .23 .58 .38 .47 .25 .69 .56 .52 .54 .34 .23 .43 .28 .68 .79 .32 .56
GPT2-OA .48 .47 .65 .25 .29 .56 .37 .51 .31 .72 .60 .51 .55 .27 .19 .50 .33 .67 .81 .52 .54
LLM-CoT
Llama-7B .49 .55 .66 .20 .39 .64 .35 .48 .26 .77 .63 .54 .53 .35 .20 .53 .38 .71 .80 .34 .42
ChatGPT .53 .52 .71 .26 .33 .61 .48 .52 .38 .75 .65 .60 .56 .47 .16 .55 .42 .72 .83 .42 .57
Table 4: F1 scores of various prompt tuning methods. T5 and GPT2 marked with † denote the base and medium versions, respectively; the others are the large versions, except for the LLMs. Due to an issue with DeBERTaForMaskedLM in Hugging Face, DeBERTa-MBC is of no reference value.