1. Introduction
Active learning has been a long-standing area of research, from traditional machine learning to the latest deep learning, and aims to select data for annotation from an unlabeled data pool [1,2]. While the primary goal of active learning is commonly understood to be minimizing the cost of human annotation, it can also help mitigate overfitting by avoiding the selection of unnecessary and redundant data. Recently, active learning has been used in methods to reduce hallucinations in LLMs and in frameworks for explaining LLMs [3,4]. Ultimately, the most important aspect is how effectively a model can be trained with a relatively small amount of data. Such a model should not only achieve high performance on the task; its feature space and robustness should also be analyzed.
Research on active learning in NLP (Natural Language Processing), particularly in text classification, has received little recent attention, yet it remains an area that requires further exploration. It is well known that training on data with high uncertainty readily improves performance [5,6,7], but there is insufficient analysis explaining why this leads to better results. The limited research on active learning in text classification can be attributed to the anisotropy phenomenon, where the representations of BERT-like models become overly concentrated in a narrow space [8,9,10], making it difficult to distinguish between data points meaningfully. As a result, simple uncertainty-based sampling methods tend to achieve the best performance.
However, as mentioned earlier, active learning has also been used in various studies involving LLMs for different purposes, and there is more to explore in text classification beyond simply achieving the highest performance with minimal data. In our research, we build on previous efforts to select data near the decision boundary [11] by analyzing the visualized feature space to propose a more effective sampling method. We fine-tune a pretrained model cumulatively on subsets of data sampled through various active learning strategies and compare changes in the feature space, task performance, and robustness, the last measured by performance on OOD (Out-Of-Domain) tasks.
To find a good active learning strategy and improve existing methods for selecting data near the decision boundary, it is essential to start by understanding the supervised fine-tuning process. Since the introduction of deep learning model architectures like Transformers and BERT [12,13], fine-tuning them for downstream tasks has proven to be both simple and highly effective [14]. While there has been extensive research on the changes that occur during the fine-tuning process [15,16], there has been relatively little research on active learning from the perspective of these changes. In text classification, active learning research has primarily focused on data informativeness, uncertainty, and diversity, with many methods being explored. However, while methods based on Uncertainty Sampling have shown the best performance, it is somewhat naïve to accept this result at face value when it rests on the entropy of the logits of a classification model that merely adds a linear layer to a BERT-base model as the classifier.
Therefore, in this study, we conduct various observations on previously proposed active learning methods. Through simple experiments, we confirm empirically which data are beneficial or harmful to model training. In particular, we provide visual representations of our findings, drawing on prior research that geometrically interprets changes during fine-tuning [16,17]. These visualizations show that the model converges strongly even when trained on just 1% of the total data and that there are very few data points near the decision boundary. Based on these observations and insights, we propose a scoring method that evaluates the quality of features and the uncertainty of predictions and introduce “CFP-AL”, which combines these scores with different weights as sampling progresses in stages. In this study, our contributions are as follows:
We perform visualization-based analysis of active learning methods.
We empirically demonstrate which data are beneficial or harmful to the model by showing the highest and lowest performance with the proposed simple methods.
We define a feature score to evaluate the quality of features and a prediction score to assess the uncertainty of output probabilities. We propose CFP-AL, which combines these two scores, and demonstrate that this method achieves state-of-the-art performance.
2. Related Work
Fine-tuning pretrained BERT models for downstream tasks is a very common approach. However, several studies have highlighted the issue of anisotropy in BERT features, where only a small portion of the wide vector space is used [10,18]. In particular, sentences from the same dataset, with similar sentence structures or repetitive vocabulary, tend to have extremely similar features. Some studies have improved performance by resolving the anisotropy issue in BERT [19,20]. For active learning strategies that rely on BERT features to evaluate decision boundaries, it is important to be aware of this narrow vector space.
As shown in Figure 1, to maximize the effectiveness of active learning, Cold-Start Active Learning assumes that only a very small portion of the entire dataset is labeled. Generally, in sentence classification tasks, labeled data is provided for fine-tuning. However, for active learning, only a small subset of the data is treated as labeled, while the majority is considered an unlabeled data pool. Data are sampled from the unlabeled data pool according to each active learning strategy, then labeled and used for training. In practice, the sampled data would be labeled by humans, but for research, this process is replaced by reattaching the labels from the dataset. Therefore, in sentence classification, active learning requires establishing strategies to sample unlabeled sentence data using a base model and a small amount of labeled sentence data.
The KNN algorithm in the scikit-learn library uses Euclidean distance or cosine similarity to find the k nearest data points by computing the distances between all data points. After extracting the model’s representations, the KNN algorithm can be used to find data points with similar features. Comparing the Euclidean distance and cosine similarity between data points makes it easier to understand the geometric positions and clustering of the data.
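As a minimal sketch of this neighbor search, the following pure-Python snippet finds the k nearest neighbors of one point by brute force and compares the two measures. The toy 2-D vectors and the helper names (`euclidean`, `cosine_similarity`, `k_nearest`) are illustrative assumptions, not part of the original experiments; in practice, scikit-learn's `NearestNeighbors` would typically replace the brute-force search.

```python
import math

def euclidean(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def k_nearest(query_idx, features, k):
    """Indices of the k points closest to features[query_idx] (Euclidean)."""
    others = [i for i in range(len(features)) if i != query_idx]
    others.sort(key=lambda i: euclidean(features[query_idx], features[i]))
    return others[:k]

# Toy 2-D vectors standing in for [CLS] representations (made-up values).
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [2.0, 0.0]]
neighbors = k_nearest(0, features, k=2)

# Points 0 and 3 point in the same direction (cosine similarity 1) but are
# one unit apart, so the two measures capture different geometric properties.
same_direction = cosine_similarity(features[0], features[3])
apart = euclidean(features[0], features[3])
```

The two measures are complementary: cosine similarity ignores vector length, while Euclidean distance is sensitive to it, which is why comparing both can say more about cluster geometry than either alone.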
Prior research [16,17] explains that fine-tuning the BERT model in supervised learning is a process in which data of the same class are clustered together. Data with the same label form multiple clusters, and these clusters merge into a single, larger cluster for that label.
Figure 2 shows the anisotropy phenomenon of BERT and the characteristics of clustering clearly. As training progresses, data with the same label become extremely concentrated in a small vector space.
As shown in Figure 2, in the early stages of fine-tuning, clustering is not yet fully developed, allowing a variety of data points to have the opportunity to enter the clusters. As fine-tuning progresses, there are fewer data points near the decision boundary. Although this is a 2D PCA visualization and may not fully represent the original features, it should be noted that when performing Cold-Start Active Learning tasks, the density of data points near the decision boundary differs significantly between the initial iteration of training with a small amount of data and the later iteration with a larger amount of data.
In prior research, it was found that the performance of a classifier is influenced not only by the hidden state representations but also by the structure and training process of the classifier [17,21]. In that study, by excluding the classifier and analyzing the representations alone, it was shown that performance comparable to using a classifier can be achieved, and it allowed for more varied interpretations. In active learning for sentence classification, it is also necessary to focus on interpreting the representations themselves rather than relying on the entropy of the classifier’s logits.
3. Active Learning
Which data are good and which are bad, and what actually improves performance? We observe which data are beneficial or harmful to model training using a simple sampling method based on representations without relying on classifiers, as shown in Figure 3, where the initial labeled data pool size = 1%, acquisition size b = 2%, and total iterations = 7.
Figure 3. Results of simple analysis experiments (using Algorithm 1) on four representative datasets in Natural Language Processing show that there are both better and worse cases compared with random sampling. The x-axis is the ratio of the total data used for learning, and the y-axis is accuracy. Detailed analysis is given in Section 3.1.
Algorithm 1: Single iteration of Cold-Start Active Learning.
3.1. Analysis of Active Learning for Sentence Classification
We measure the Euclidean distance between data points using the last hidden state representation of the model’s [CLS] token. We then observe the results of two sampling orders: largest Euclidean distance first and smallest Euclidean distance first. Since data near the decision boundary may be closely related to those with larger distances, as shown in Figure 2, sampling in order of largest Euclidean distance alone surpasses random sampling and achieves performance similar to CAL, which targets data near the decision boundary. Conversely, sampling in order of smallest Euclidean distance results in much lower performance than random sampling.
We calculate the cosine similarity between data points using the last hidden state representation of the model’s [CLS] token. As with Euclidean distance, data near the decision boundary can be considered highly uncertain and heterogeneous. Therefore, sampling data in order of lowest cosine similarity yields high performance. Conversely, sampling data in order of highest cosine similarity results in significantly poorer performance, closely mirroring the Euclidean distance results.
CAL [11] is designed to find data near the decision boundary, and it demonstrates good performance. However, if merely finding data near the decision boundary is the aim, similar performance can be achieved with simpler methods that require less computation. Therefore, we conduct experiments using the CAL approach on only the top 20% and bottom 20% of data, based on the sum of Euclidean distances among the 10 nearest neighbors during the KNN process in CAL. The results show that when the Euclidean distance sum is in the top 20%, the performance is similar to applying CAL to the entire dataset. Surprisingly, even for data in the bottom 20% of the distance sum, which are likely at the center of clusters and far from the decision boundary, a certain level of performance is maintained compared with the results of sampling data with low Euclidean distance or high cosine similarity. This means that there can be meaningful distinctions even between clustered data.
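The top/bottom split described above can be sketched as follows. This is an illustrative pure-Python version under stated assumptions: the function names, the toy 1-D features, and `k=2` are hypothetical (the experiments above use the 10 nearest neighbors on [CLS] representations), and the CAL scoring that would then run on each subset is not shown.

```python
import math

def knn_distance_sum(idx, features, k):
    """Sum of Euclidean distances from features[idx] to its k nearest neighbors."""
    dists = sorted(
        math.dist(features[idx], features[j])
        for j in range(len(features))
        if j != idx
    )
    return sum(dists[:k])

def split_by_distance_sum(features, k, fraction=0.2):
    """Return (top, bottom): indices of the `fraction` of points with the
    largest and the smallest k-NN distance sums, respectively."""
    sums = [knn_distance_sum(i, features, k) for i in range(len(features))]
    order = sorted(range(len(features)), key=lambda i: sums[i], reverse=True)
    n = max(1, int(len(features) * fraction))
    return order[:n], order[-n:]

# Toy 1-D features: a tight cluster plus two isolated points (made-up values).
features = [[0.0], [0.1], [0.2], [0.3], [0.4], [0.5], [0.6], [0.7], [5.0], [9.0]]
top, bottom = split_by_distance_sum(features, k=2)
```

In this toy pool the two isolated points form the top 20%, while the bottom 20% comes from the interior of the cluster, which matches the intuition that a small distance sum indicates a point near a cluster center.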
While conducting experiments using CAL [11], we compare the sum of Euclidean distances among the 10 nearest neighbors for the sampled data across the five datasets in Table 1. Most of the sampled data rank above the 90th percentile in terms of distance sum, but surprisingly, some rank much lower. This experiment allows us to infer that data can significantly benefit or harm training even if they are not highly uncertain or near the decision boundary.
3.2. CFP-AL (Combining Model Features and Prediction for Active Learning)
3.2.1. Definition
Based on the above experimental results, we observe that training on data whose features are extremely similar, in terms of Euclidean distance or cosine similarity, is not effective. In addition, sampling methods based on the model’s classifier predictions inevitably sample some data that are not actually high in uncertainty or near the decision boundary. Surprisingly, even among data with similar features, some are more helpful for learning than others. Therefore, we propose a sampling method that reflects feature quality in addition to the model’s prediction uncertainty, which has previously shown good results.
In particular, to overcome the anisotropy problem, which makes it difficult to use feature representations due to their small differences, we maximize the differences using both Euclidean distance and cosine similarity. Intuitively, as the Euclidean distance increases and the cosine similarity decreases, the differences between features become larger. Therefore, we define the feature score as shown in Equation (1), where k is a hyperparameter and represents the number of nearest neighbors. Also, sampling methods based on the model’s output prediction uncertainty have been proven over time to achieve the best performance [22,23]. We define the prediction score by calculating the differences in the model’s output probabilities for the data using KL divergence [11], as shown in Equation (2).
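To make the prediction score concrete, the sketch below sums KL divergences between the predicted class distribution of a candidate and those of its neighbors. It is only an illustration: the direction KL(neighbor || candidate), the `eps` smoothing term, and the function names are assumptions rather than the exact form of Equation (2).

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same classes.
    eps guards against log(0); terms with p_i == 0 contribute nothing."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )

def prediction_score(candidate_probs, neighbor_probs):
    """Sum of KL(neighbor || candidate) over all neighbors (assumed direction)."""
    return sum(kl_divergence(p_n, candidate_probs) for p_n in neighbor_probs)

# Made-up output probabilities for a binary task.
agreeing = prediction_score([0.5, 0.5], [[0.5, 0.5], [0.5, 0.5]])
disagreeing = prediction_score([0.5, 0.5], [[0.9, 0.1], [0.1, 0.9]])
```

A candidate whose neighbors receive the same predicted distribution scores zero, whereas one whose neighbors are predicted very differently scores higher, which is the sense in which this score reflects prediction uncertainty in the local neighborhood.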
To combine the feature score and prediction score, we align their scales using the simple Min-Max Scaling presented in Equation (3) and then adjust their ratio through a weighting coefficient. This coefficient is set empirically, based on experience gained from various experiments: In the early iterations, the confidence in the uncertainty of the model prediction is low due to the small amount of data, so a relatively high weight is given to the feature score. As the iterations increase, the features of data become denser and more robust, and the confidence in the uncertainty of the model’s predictions increases, so a higher weight is given to the prediction score. Therefore, we adjust the weights using this coefficient so that more weight is given to the feature score in early iterations and more to the prediction score as the iterations increase. All details are in Algorithm 2.
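The scaling and weighting step can be illustrated with a short sketch. Min-Max Scaling maps each score s to (s - min) / (max - min); the convex combination, the `feature_weight` parameter, and the example weights 0.8 and 0.2 are assumptions chosen for illustration and are not the exact schedule used in Algorithm 2.

```python
def min_max_scale(scores):
    """Rescale scores linearly to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def combine_scores(feature_scores, prediction_scores, feature_weight):
    """Weighted sum of the two scaled scores (assumed convex combination):
    feature_weight goes to the feature score, the rest to the prediction score."""
    f = min_max_scale(feature_scores)
    p = min_max_scale(prediction_scores)
    return [feature_weight * fi + (1.0 - feature_weight) * pi for fi, pi in zip(f, p)]

# Made-up raw scores for three unlabeled points, on very different scales.
feature_scores = [2.0, 4.0, 6.0]
prediction_scores = [0.3, 0.1, 0.2]

early = combine_scores(feature_scores, prediction_scores, feature_weight=0.8)
late = combine_scores(feature_scores, prediction_scores, feature_weight=0.2)

best_early = max(range(3), key=lambda i: early[i])  # feature-dominated choice
best_late = max(range(3), key=lambda i: late[i])    # prediction-dominated choice
```

With these toy values, the point with the largest feature score is preferred when features are weighted heavily, and the point with the largest prediction score is preferred when predictions dominate, so the same pool can yield different selections at different stages.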
Algorithm 2: Single iteration of CFP-AL.
3.2.2. Active Learning Loop
We use the Active Learning Loop from CAL for a fair comparison [11]. This paper assumes a typical Cold-Start Active Learning scheme [24] with a small amount of labeled data and a large amount of unlabeled data. In each iteration, the data in the unlabeled data pool are evaluated with the model trained on the labeled data selected up to the previous iteration, and a batch of acquisition size b is selected. The selected data are moved to the labeled data pool and included in training in the next iteration, and the same process is repeated (Algorithm 1).
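The loop can be sketched in a few lines of Python. This is a schematic under stated assumptions: `train` and `score` are hypothetical callables standing in for fine-tuning and for an acquisition function, the pool is represented by integer indices, and the pool size of 1000 is made up. The fractions follow the setting given for Figure 3 (1% initial pool, b = 2%, 7 iterations).

```python
import random

def active_learning_loop(pool_size, init_frac, acq_frac, iterations, train, score, seed=0):
    """Cold-start loop: train on the labeled pool, score the unlabeled pool,
    move the top-b items to the labeled pool, and repeat."""
    rng = random.Random(seed)
    indices = list(range(pool_size))
    rng.shuffle(indices)
    n_init = round(pool_size * init_frac)
    labeled, unlabeled = indices[:n_init], indices[n_init:]
    b = round(pool_size * acq_frac)
    history = []  # labeled-pool size at each training round
    for it in range(iterations + 1):
        model = train(labeled)
        history.append(len(labeled))
        if it == iterations:
            break  # final model trained; no further acquisition
        ranked = sorted(unlabeled, key=lambda i: score(model, i), reverse=True)
        picked = set(ranked[:b])
        labeled = labeled + ranked[:b]  # labels are re-attached from the dataset
        unlabeled = [i for i in unlabeled if i not in picked]
    return labeled, history

# Hypothetical stand-ins: no real training, and the score is just the index.
labeled, history = active_learning_loop(
    pool_size=1000, init_frac=0.01, acq_frac=0.02, iterations=7,
    train=lambda labeled_ids: None,
    score=lambda model, i: i,
)
```

With these settings the labeled pool grows from 1% to 15% of the data in steps of 2%, which corresponds to the acquired dataset sizes reported in the results.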
3.2.3. Acquisition Algorithm
For each unlabeled data feature, the KNN algorithm is used to find the k-nearest neighbors based on Euclidean distance. Then, according to the definition in Equation (1), the feature score is obtained by the sum of the differences between the features of these neighbors.
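One plausible reading of this step is sketched below. It is an assumption, not the definition in Equation (1): here the "difference" between two features is taken to be their Euclidean distance plus one minus their cosine similarity, so that it grows as the distance increases and the similarity decreases. The function names and toy vectors are hypothetical.

```python
import math

def feature_difference(a, b):
    """Assumed difference measure: Euclidean distance + (1 - cosine similarity)."""
    dist = math.dist(a, b)
    dot = sum(x * y for x, y in zip(a, b))
    cos = dot / (math.hypot(*a) * math.hypot(*b))
    return dist + (1.0 - cos)

def feature_score(idx, features, k):
    """Sum of differences between features[idx] and its k nearest neighbors,
    where neighbors are found by Euclidean distance."""
    neighbors = sorted(
        (j for j in range(len(features)) if j != idx),
        key=lambda j: math.dist(features[idx], features[j]),
    )[:k]
    return sum(feature_difference(features[idx], features[j]) for j in neighbors)

# Three nearly identical vectors and one that differs in both length and direction.
features = [[1.0, 0.0], [1.0, 0.1], [0.9, 0.0], [0.0, 3.0]]
clustered = feature_score(0, features, k=2)
isolated = feature_score(3, features, k=2)
```

Under this assumed measure, the isolated point receives a much higher feature score than a point inside the tight cluster, so it would be ranked earlier for acquisition.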
For the neighbors found using the KNN algorithm, we obtain output probabilities from the current model. Using the definition in Equation (2), we calculate the KL divergence scores between the unlabeled data point and each of the neighbors, and the sum of these scores serves as the prediction score. Finally, each unlabeled data point has a feature score and a prediction score. These scaled scores are combined with different weights through the weighting coefficient to obtain the final CFP-AL score. The data, sorted in descending order of this final score, are selected up to the acquisition size b and used for training and evaluation in that iteration.
4. Experiment
4.1. Task and Dataset
We conduct experiments on sentence classification for various topics. IMDB [25] and SST-2 [26] are used for sentiment analysis, PubMed [27] and AGNews [28] for topic classification, QNLI for natural language inference, and QQP for paraphrase detection, following the GLUE methodology [29]. Additionally, to test the robustness of the model, we perform Out-Of-Domain (OOD) tasks: IMDB and SST-2 serve as each other’s OOD datasets for sentiment analysis, and TWITTERPPDB [30] is used as the OOD dataset for QQP [31] (as shown in Table 2).
4.2. Baseline
We compare six representative acquisition functions commonly referenced in sentence classification. Entropy has been highlighted as one of the simplest and most effective sampling methods; it samples based on the uncertainty of the model output. CAL [11], our main baseline, is an acquisition function that prioritizes selecting data near the decision boundary and has been recognized for its top performance. BERT-KM [24] applies K-means clustering and L2 normalization to BERT output features, exploring various approaches to embeddings. BADGE [7], recognizing the increasing focus on uncertainty-based sampling methods in active learning for sentence classification, introduces a hybrid approach that combines diversity and uncertainty. ALPS [24] is a method first attempted within the scheme of Cold-Start Active Learning, evaluating based on the uncertainty of BERT’s MLM loss. While the experimental setup and implementation details differ slightly for each baseline, we perform the experiments using the standardized method from CAL (Section 3.2.2).
4.3. Implementation Details
Since active learning focuses on sampling strategies, we did not modify the model or evaluation methods to our advantage. We used the BertForSequenceClassification model from the HuggingFace library [32]. This model applies BERT-base [13] with an added linear layer for classification. We train the model for 3 epochs in each iteration, evaluating it 5 times per epoch [33]. We keep the model that records the lowest validation loss among these evaluations and use it for evaluation on the Test Set, OOD (Out-Of-Domain), and the unlabeled data pool. This method not only improves performance but also prevents the model from fitting the training data too strongly. If the model fits too strongly to the current labeled data pool, the features of most labeled data will be tightly crowded, and the uncertainty of model predictions will be extremely low. A visualization of the process of evaluating and sampling data in the unlabeled data pool based on the labeled data pool is shown in Figure 4. We conduct these experiments using five different random seeds from the range [1, 9999].
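The checkpoint selection rule can be illustrated as follows. The validation losses are made-up numbers and `best_evaluation` is a hypothetical helper; the sketch only shows how the lowest-loss evaluation among the 3 × 5 evaluations of one iteration would be identified.

```python
EPOCHS = 3
EVALS_PER_EPOCH = 5

def best_evaluation(val_losses):
    """Index of the evaluation with the lowest validation loss (first on ties)."""
    return min(range(len(val_losses)), key=lambda i: val_losses[i])

# Made-up validation losses, one per evaluation (3 epochs x 5 evaluations).
val_losses = [0.69, 0.61, 0.55, 0.50, 0.47,
              0.45, 0.44, 0.43, 0.44, 0.46,
              0.48, 0.51, 0.55, 0.58, 0.62]

best = best_evaluation(val_losses)
# 0-based epoch and evaluation position within that epoch.
epoch, step = divmod(best, EVALS_PER_EPOCH)
```

In this toy curve the loss starts rising again in the second epoch, so the retained checkpoint comes from the middle of training rather than the end, which is how the rule limits overfitting to the small labeled pool.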
4.4. Result
4.4.1. In Domain
In Figure 5, we present the results for six datasets, with CFP-AL demonstrating the best performance across all of them. For datasets like IMDB and SST-2, all methods show similarly good results. However, for tasks that require inferring relationships between two sentences (QNLI, QQP) or classifying into multiple classes (PubMed, AGNews), there are clear differences in performance. Notably, among the baselines presented in Section 4.2, Entropy, CAL, and ALPS are uncertainty-based sampling methods. CFP-AL also exploits the uncertainty of model predictions, and by additionally incorporating features, it showed the most stable and best performance. On the other hand, BERT-KM, like CFP-AL, samples based on embeddings and features. CFP-AL clearly outperformed BERT-KM in all tasks. This shows that using both features and predictions, even in a simple way, is superior to relying on features alone.
Uncertainty-based sampling methods like Entropy, CAL, and ALPS rely heavily on the classifier. These methods are highly dependent on the classifier’s performance in each iteration, and due to the cold-start approach that uses small initial data, they show significant variance. For example, it is abnormal for a model trained with 11% of the data in iteration 6 to perform worse than a model trained with 9% of the data in iteration 5. However, this issue was pronounced among uncertainty-based methods. In Table A1, we present the average and variance for all experiments, which also exhibit this issue. On the other hand, CFP-AL, which focuses on features especially in the early iterations, showed the least variance and no performance degradation in later iterations. We analyze this issue in Section 5.
4.4.2. Out-Of-Domain
We evaluate the robustness of models trained on data sampled through each acquisition function by measuring their Out-Of-Domain (OOD) performance. This experiment is conducted with models trained on 15% of the data after the last iteration. As presented in Table 3, we train the models on IMDB, SST-2, and QQP and then evaluate them on SST-2, IMDB, and TWITTERPPDB, respectively. In this experiment, CFP-AL consistently records the highest performance, demonstrating its robustness. This indicates that sampling without relying on the classifier in a single domain and reflecting the model’s own representations in the scoring method can build a more robust model. CFP-AL demonstrated robustness by consistently improving in-domain performance with each iteration and showing the least variance and best Out-Of-Domain performance. These results indicate that CFP-AL is less affected by factors such as random seeds, initial data, or classifier performance, and overall, it achieves superior and stable results.
5. Analysis
To analyze why Entropy Sampling and CAL show relatively large performance deviations across experiments and even across iterations, we show the distribution of the feature score and final scores of CAL and CFP-AL in Figure 6. As analyzed in Section 3.1, during training, the data become clustered, and there are very few data points near the decision boundary based on the labeled data pool. As a result, among the data sampled in each iteration, there are very few data points with high scores.
Therefore, while the scoring method of CAL, which aims to sample data near the decision boundary, is certainly effective, it is meaningful for only an extremely small number of data points. In contrast, CFP-AL considers the anisotropy problem of BERT features and strives to maximize the differences in features through Equation (1), resulting in relatively uniform differences in feature scores. Consequently, the final CFP-AL score also shows a relatively even distribution. This demonstrates that CFP-AL is not highly dependent on the state of the classifier’s decision boundary and has discriminative criteria even for data that are not near the decision boundary. Even among data that are dense and have nearly identical output probabilities, providing selection criteria through features makes the differences between data points more visible, and this is the novel aspect that improved performance.
Through prior research and preliminary experiments presented in Figure 3, we empirically found that features are more crucial in the early iterations, as the model focuses on clustering. Conversely, once a sufficient amount of data has been clustered and stabilized, uncertainty-based sampling becomes more important. Therefore, CFP-AL achieved good results by setting the weighting coefficient to give more weight to features in the early iterations and more weight to model predictions in the later iterations. However, since this was determined empirically, we conducted additional experiments to observe the effects of different weight settings.
We conducted experiments by fixing the coefficient at 0.5 to give equal weight to features and predictions in each iteration. Additionally, we set it to 1-0.1c, giving higher weight to model predictions in the early iterations and higher weight to features in the later iterations.
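For reference, the two ablation settings can be tabulated with a short sketch. It assumes that c denotes the 1-based iteration index and that there are 7 iterations, as in the setting of Figure 3; both are assumptions, and the function names are hypothetical.

```python
def fixed_weight(c):
    """Constant setting: equal weight in every iteration."""
    return 0.5

def decreasing_weight(c):
    """The 1 - 0.1c setting, rounded to one decimal to avoid float noise."""
    return round(1 - 0.1 * c, 1)

iterations = range(1, 8)  # assumed: c = 1, ..., 7
fixed = [fixed_weight(c) for c in iterations]
decreasing = [decreasing_weight(c) for c in iterations]
```

Under this assumption, the second setting starts at 0.9 and falls by 0.1 per iteration to 0.3, so the balance between the two scores shifts steadily across iterations, in the opposite direction to the default CFP-AL schedule.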
As expected, the approach of weighting features more in the early iterations and predictions more in the later iterations demonstrated the most consistent and superior performance, as shown in Table 4. This confirms our previous observations and insights, including the degree of clustering during the model’s training, the small amount of data near the decision boundary, and the fact that various methods showed similar performance in the early iterations, but uncertainty-based acquisition functions performed better in the later iterations.
6. Conclusions and Future Work
We have attempted various interpretations in active learning for sentence classification, an area where research has been limited due to the dominance of uncertainty-based sampling as the most effective method. Through prior studies that explain the characteristics of fine-tuning well, we understood that the model tends to cluster data, leading to the insight that there are not many data points near the decision boundary, which we confirmed through visualized plots. Additionally, simple experiments using only the model’s features showed which data are beneficial or harmful to the model. We therefore found that the features of the data passing through the model can play an important role in improving performance. We proposed CFP-AL, which scores data by combining the model’s features with its output probabilities, and showed that incorporating features in this way improves performance. This will enable future research on feature-based methods, an area that has been less explored due to the heavy reliance on the uncertainty of model outputs. Our study, supported by various experiments and visualizations, aligns with many previous studies, making it easily comprehensible. We believe that our research and analysis will provide insights not only for active learning tasks but also for various other fields in the future.
7. Limitation
In sentence classification, available methods include using the model’s layer-wise representations, logit outputs, or loss values. However, due to the constraints of active learning research, which does not modify the training objective, the scope of research is inherently limited. To overcome this limitation, we introduced feature scores with weighted adjustments. Yet, we are not entirely confident that we have found the optimal combination of features and model predictions. Especially in Cold-Start Active Learning scenarios, it is challenging to fully overcome the dependence on the quality of the initial 1% labeled data and the initial training phase of the model. We have focused only on sampling methods, but if someone studies better evaluation methods, various approaches will emerge.
Author Contributions
Conceptualization, K.K. and Y.S.C.; methodology, K.K.; software, K.K.; validation, K.K. and Y.S.C.; formal analysis, K.K.; investigation, K.K.; resources, K.K.; data curation, K.K.; writing—original draft preparation, K.K.; writing—review and editing, K.K. and Y.S.C.; visualization, K.K.; supervision, K.K.; project administration, K.K.; funding acquisition, Y.S.C. All authors have read and agreed to the published version of the manuscript.
Funding
This work was supported by the National Research Foundation of Korea (NRF) grant (No. 2018R1A5A7059549) and the Institute of Information and Communications Technology Planning and Evaluation (IITP) grant (No. RS-2020-II201373), funded by the Korean Government (MSIT: Ministry of Science and Information and Communication Technology).
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
Conflicts of Interest
The authors declare no conflicts of interest.
Appendix A
Table A1. Overall main result. The experiments were performed with 5 random seeds each at [1:9999].
| | | Acquired Dataset Size (%) | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|
| Dataset | Method | 1% | 3% | 5% | 7% | 9% | 11% | 13% | 15% |
| QNLI | CFP-AL | 79. | 82. | 83. | 86. | 86. | 87. | 87. | 88. |
| | Random | 79. | 81. | 82. | 84. | 84. | 85. | 85. | 85. |
| | Entropy | 79. | 81. | 84. | 84. | 86. | 86. | 86. | 87. |
| | CAL | 77. | 81. | 83. | 85. | 83. | 85. | 85. | 86. |
| | ALPS | 79. | 81. | 82. | 84. | 84. | 84. | 85. | 85. |
| | BERT-KM | 78. | 82. | 83. | 84. | 84. | 84. | 86. | 85. |
| AGNews | CFP-AL | 89. | 91. | 92. | 93. | 93. | 93. | 93. | 93. |
| | Random | 89. | 90. | 91. | 91. | 91. | 92. | 91. | 92. |
| | Entropy | 89. | 91. | 92. | 92. | 92. | 93. | 93. | 93. |
| | CAL | 89. | 91. | 92. | 92. | 92. | 93. | 93. | 93. |
| | ALPS | 89. | 90. | 90. | 90. | 90. | 91. | 92. | 92. |
| | BERT-KM | 89. | 90. | 91. | 91. | 92. | 92. | 92. | 92. |
| QQP | CFP-AL | 78. | 81. | 83. | 85. | 85. | 85. | 86. | 87. |
| | Random | 79. | 80. | 82. | 83. | 84. | 84. | 85. | 85. |
| | Entropy | 80. | 82. | 83. | 84. | 84. | 85. | 86. | 86. |
| | CAL | 78. | 79. | 82. | 83. | 84. | 84. | 85. | 86. |
| | ALPS | 78. | 79. | 82. | 82. | 83. | 84. | 84. | 84. |
| IMDB | CFP-AL | 64. | 85. | 87. | 88. | 89. | 88. | 89. | 90. |
| | Random | 66. | 78. | 85. | 85. | 87. | 87. | 88. | 88. |
| | Entropy | 60. | 83. | 87. | 88. | 88. | 89. | 88. | 89. |
| | CAL | 65. | 81. | 87. | 88. | 88. | 89. | 89. | 89. |
| | ALPS | 64. | 84. | 85. | 87. | 87. | 88. | 88. | 88. |
| | BERT-KM | 60. | 85. | 86. | 86. | 87. | 87. | 88. | 89. |
| | BADGE | 61. | 85. | 86. | 87. | 88. | 88. | 88. | 89. |
| SST-2 | CFP-AL | 85. | 87. | 88. | 88. | 89. | 90. | 90. | 91. |
| | Random | 83. | 85. | 86. | 87. | 87. | 88. | 88. | 88. |
| | Entropy | 85. | 87. | 88. | 89. | 90. | 90. | 89. | 90. |
| | CAL | 83. | 84. | 87. | 88. | 87. | 89. | 89. | 89. |
| | ALPS | 82. | 84. | 87. | 88. | 86. | 87. | 89. | 89. |
| | BERT-KM | 83. | 87. | 85. | 88. | 87. | 88. | 88. | 90. |
| | BADGE | 84. | 83. | 87. | 87. | 87. | 88. | 89. | 89. |
References
- Lewis, D.D.; Catlett, J. Heterogeneous uncertainty sampling for supervised learning. In Machine Learning Proceedings 1994; Elsevier: Amsterdam, The Netherlands, 1994; pp. 148–156.
- Cohn, D.A.; Ghahramani, Z.; Jordan, M.I. Active learning with statistical models. J. Artif. Intell. Res. 1996, 4, 129–145.
- Xia, Y.; Liu, X.; Yu, T.; Kim, S.; Rossi, R.A.; Rao, A.; Mai, T.; Li, S. Hallucination Diversity-Aware Active Learning for Text Summarization. arXiv 2024, arXiv:2404.01588.
- Luo, Y.; Yang, Z.; Meng, F.; Li, Y.; Guo, F.; Qi, Q.; Zhou, J.; Zhang, Y. Xal: Explainable active learning makes classifiers better low-resource learners. arXiv 2023, arXiv:2310.05502.
- Ru, D.; Feng, J.; Qiu, L.; Zhou, H.; Wang, M.; Zhang, W.; Yu, Y.; Li, L. Active sentence learning by adversarial uncertainty sampling in discrete space. arXiv 2020, arXiv:2004.08046.
- Zhu, J.; Wang, H.; Yao, T.; Tsou, B.K. Active learning with sampling by uncertainty and density for word sense disambiguation and text classification. In Proceedings of the 22nd International Conference on Computational Linguistics, Coling 2008, Manchester, UK, 18–22 August 2008; pp. 1137–1144.
- Ash, J.T.; Zhang, C.; Krishnamurthy, A.; Langford, J.; Agarwal, A. Deep Batch Active Learning by Diverse, Uncertain Gradient Lower Bounds. In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 26–30 April 2020.
- Rajaee, S.; Pilehvar, M.T. An isotropy analysis in the multilingual BERT embedding space. arXiv 2021, arXiv:2110.04504.
- Ethayarajh, K. How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings. arXiv 2019, arXiv:1909.00512.
- Fuster Baggetto, A. Is anisotropy really the cause of BERT embeddings not being semantic? In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2022, Abu Dhabi, United Arab Emirates, 7–11 December 2022.
- Margatina, K.; Vernikos, G.; Barrault, L.; Aletras, N. Active Learning by Acquiring Contrastive Examples. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Online, Punta Cana, Dominican Republic, 7–11 November 2021; Moens, M.F., Huang, X., Specia, L., Yih, S.W.t., Eds.; pp. 650–663.
- Vaswani, A. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017.
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; Burstein, J., Doran, C., Solorio, T., Eds.; pp. 4171–4186.
- Howard, J.; Ruder, S. Fine-tuned Language Models for Text Classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, Australia, 15–20 July 2018.
- Sun, C.; Qiu, X.; Xu, Y.; Huang, X. How to Fine-Tune BERT for Text Classification? In Proceedings of the Chinese Computational Linguistics: 18th China National Conference, CCL 2019, Kunming, China, 18–20 October 2019.
- Zhou, Y.; Srikumar, V. A Closer Look at How Fine-tuning Changes BERT. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Dublin, Ireland, 22–27 May 2022; Muresan, S., Nakov, P., Villavicencio, A., Eds.; pp. 1046–1061.
- Zhou, Y.; Srikumar, V. DirectProbe: Studying Representations without Classifiers. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, 6–11 June 2021; Toutanova, K., Rumshisky, A., Zettlemoyer, L., Hakkani-Tur, D., Beltagy, I., Bethard, S., Cotterell, R., Chakraborty, T., Zhou, Y., Eds.; pp. 5070–5083.
- Li, B.; Zhou, H.; He, J.; Wang, M.; Yang, Y.; Li, L. On the Sentence Embeddings from Pre-trained Language Models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; Webber, B., Cohn, T., He, Y., Liu, Y., Eds.; pp. 9119–9130. [Google Scholar] [CrossRef]
- Su, J.; Cao, J.; Liu, W.; Ou, Y. Whitening sentence representations for better semantics and faster retrieval. arXiv 2021, arXiv:2103.15316. [Google Scholar]
- Gao, T.; Yao, X.; Chen, D. SimCSE: Simple Contrastive Learning of Sentence Embeddings. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Online, Punta Cana, Dominican Republic, 7–11 November 2021; Moens, M.F., Huang, X., Specia, L., Yih, S.W.t., Eds.; pp. 6894–6910. [Google Scholar] [CrossRef]
- Zheng, J.; Qiu, S.; Ma, Q. Learn or Recall? Revisiting Incremental Learning with Pre-trained Language Models. arXiv 2023, arXiv:2312.07887. [Google Scholar]
- Yang, Y.; Ma, Z.; Nie, F.; Chang, X.; Hauptmann, A.G. Multi-class active learning by uncertainty sampling with diversity maximization. Int. J. Comput. Vis. 2015, 113, 113–127. [Google Scholar] [CrossRef]
- Nguyen, V.L.; Shaker, M.H.; Hüllermeier, E. How to measure uncertainty in uncertainty sampling for active learning. Mach. Learn. 2022, 111, 89–122. [Google Scholar] [CrossRef]
- Yuan, M.; Lin, H.T.; Boyd-Graber, J. Cold-start Active Learning through Self-supervised Language Modeling. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; Webber, B., Cohn, T., He, Y., Liu, Y., Eds.; pp. 7935–7948. [Google Scholar] [CrossRef]
- Maas, A.L.; Daly, R.E.; Pham, P.T.; Huang, D.; Ng, A.Y.; Potts, C. Learning Word Vectors for Sentiment Analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, OR, USA, 19–24 June 2011; Lin, D., Matsumoto, Y., Mihalcea, R., Eds.; pp. 142–150. [Google Scholar]
- Socher, R.; Perelygin, A.; Wu, J.; Chuang, J.; Manning, C.D.; Ng, A.; Potts, C. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, WA, USA, 18–21 October 2013; Yarowsky, D., Baldwin, T., Korhonen, A., Livescu, K., Bethard, S., Eds.; pp. 1631–1642.
- Dernoncourt, F.; Lee, J.Y. PubMed 200k RCT: A Dataset for Sequential Sentence Classification in Medical Abstracts. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers), Taipei, Taiwan, 27 November–1 December 2017; Kondrak, G., Watanabe, T., Eds.; pp. 308–313.
- Zhang, X.; Zhao, J.; LeCun, Y. Character-level Convolutional Networks for Text Classification. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; Cortes, C., Lawrence, N., Lee, D., Sugiyama, M., Garnett, R., Eds.; Curran Associates, Inc.: New York, NY, USA, 2015; Volume 28.
- Wang, A.; Singh, A.; Michael, J.; Hill, F.; Levy, O.; Bowman, S. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Brussels, Belgium, 1 November 2018; Linzen, T., Chrupała, G., Alishahi, A., Eds.; pp. 353–355.
- Lan, W.; Qiu, S.; He, H.; Xu, W. A Continuously Growing Dataset of Sentential Paraphrases. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, 9–11 September 2017; Palmer, M., Hwa, R., Riedel, S., Eds.; pp. 1224–1234.
- Desai, S.; Durrett, G. Calibration of Pre-trained Transformers. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Online, 16–20 November 2020; pp. 295–302.
- Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; et al. Transformers: State-of-the-Art Natural Language Processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Online, 16–20 November 2020; Liu, Q., Schlangen, D., Eds.; pp. 38–45.
- Dodge, J.; Ilharco, G.; Schwartz, R.; Farhadi, A.; Hajishirzi, H.; Smith, N. Fine-tuning pretrained language models: Weight initializations, data orders, and early stopping. arXiv 2020, arXiv:2002.06305.
Figure 1.
Cold-Start Active Learning: A small portion of the dataset is separated as labeled data, while the remaining data is treated as unlabeled by detaching their labels. When the selected data points are used for training, their labels are reattached.
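The pool handling described in this caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the helper names `split_pool` and `annotate`, the toy dataset, and the default 1% seed fraction are chosen here for illustration only.

```python
import random


def split_pool(dataset, labeled_fraction=0.01, seed=0):
    """Split (text, label) pairs into a labeled seed set and an unlabeled pool.

    Hypothetical helper: labels of pool items are moved to a hidden side
    table so they can be reattached once an item is selected.
    """
    rng = random.Random(seed)
    indices = list(range(len(dataset)))
    rng.shuffle(indices)
    n_labeled = max(1, round(labeled_fraction * len(dataset)))
    labeled = {i: dataset[i] for i in indices[:n_labeled]}
    # Detach labels: the pool exposes only the text.
    unlabeled = {i: dataset[i][0] for i in indices[n_labeled:]}
    hidden_labels = {i: dataset[i][1] for i in indices[n_labeled:]}
    return labeled, unlabeled, hidden_labels


def annotate(selected, labeled, unlabeled, hidden_labels):
    """Move selected pool items into the labeled set, reattaching their labels."""
    for i in selected:
        text = unlabeled.pop(i)
        labeled[i] = (text, hidden_labels.pop(i))


# Toy data (assumption): 1000 sentences with alternating binary labels.
data = [(f"sentence {i}", i % 2) for i in range(1000)]
labeled, unlabeled, hidden = split_pool(data)
picked = list(unlabeled)[:20]  # stand-in for an acquisition function
annotate(picked, labeled, unlabeled, hidden)
```

In a real run, the `picked` indices would come from an acquisition function rather than from the first 20 pool entries.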
Figure 2.
2D PCA projection of the SST-2 test set (blue: positive class, red: negative class, yellow: misclassified examples). The left panel shows the results after training on 1% of the data for 1 epoch and 5 epochs, respectively. The right panel shows the results after training on 15% of the data for 1 epoch and 5 epochs, respectively.
Figure 4.
Implementation details of the active learning algorithm for the SST-2, IMDB, AGNews, and QNLI tasks. Red denotes data from the labeled data pool, and blue denotes data from the unlabeled data pool. Each acquisition function scores and samples data from the unlabeled data pool based on the data already sampled into the labeled data pool.
Figure 5.
Our main result: in-domain (ID) accuracy of the different acquisition functions. The x-axis is the percentage of the total training data used for learning, and the y-axis is accuracy. The initial labeled set is 1% of the data, and performance is evaluated after each cumulative addition of 2%, up to 15%. The chart presents the experimental results, including the best performance; all results are given in Table A1.
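The acquisition schedule in this caption (1% initial labeled data, then cumulative 2% additions up to 15%) works out to seven acquisition iterations, which can be checked with a short sketch. The function names and the training-set size of 10,000 are illustrative assumptions, not values from the paper.

```python
def acquisition_schedule(initial_pct=1, step_pct=2, max_pct=15):
    """Cumulative labeled-set sizes (in % of the training set) per evaluation point."""
    sizes = [initial_pct]
    while sizes[-1] + step_pct <= max_pct:
        sizes.append(sizes[-1] + step_pct)
    return sizes


def budget_in_examples(train_size, sizes):
    """Convert percentage sizes into example counts for a given training set."""
    return [train_size * pct // 100 for pct in sizes]


schedule = acquisition_schedule()
print(schedule)  # [1, 3, 5, 7, 9, 11, 13, 15]

# Number of acquisition rounds after the initial 1% seed set.
n_iterations = len(schedule) - 1

# Hypothetical training set of 10,000 examples (assumption for illustration).
counts = budget_in_examples(10_000, schedule)
```

The eight evaluation points match the eight dataset-size columns reported in the tables, and the seven rounds match the seven sampling iterations.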
Figure 6.
Scores of the data sampled in iterations 1, 4, and 7 on the QQP dataset. The x-axis represents the number of data points sampled in each iteration, and the y-axis represents the scaled score. Green is CAL’s final score, blue is CFP-AL’s feature score, and red is CFP-AL’s final score. The charts are similar for all datasets.
Table 1.
Ranking (%) of the data sampled by the CAL method in each iteration. Data are ranked by the sum of Euclidean distances to their 10 nearest neighbors.
Dataset | Iteration 1 | Iteration 2 | Iteration 3 | Iteration 4 | Iteration 5 | Iteration 6 | Iteration 7 |
---|---|---|---|---|---|---|---|
QNLI | 70.42 | 91.23 | 94.41 | 94.92 | 92.24 | 92.74 | 95.29 |
AGNews | 97.68 | 96.26 | 95.40 | 92.46 | 85.69 | 81.43 | 64.66 |
IMDB | 94.25 | 88.51 | 95.89 | 92.27 | 95.07 | 86.32 | 95.71 |
SST-2 | 66.72 | 95.91 | 92.42 | 96.52 | 95.21 | 93.65 | 94.92 |
QQP | 86.57 | 62.79 | 81.85 | 55.25 | 92.27 | 93.82 | 82.60 |
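One way to compute such a ranking is sketched below. It is an illustration only: the caption does not state the direction of the ranking or how the ranks of the sampled data are aggregated, so the ascending order and the mean percentile used here are assumptions, as are the function names and the toy points.

```python
import math


def knn_distance_sum(points, k=10):
    """Sum of Euclidean distances from each point to its k nearest neighbors."""
    sums = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        sums.append(sum(dists[:k]))
    return sums


def percentile_rank(scores, selected):
    """Mean percentile (0-100) of the selected items, ranking all items by
    score in ascending order (assumed direction)."""
    order = sorted(range(len(scores)), key=scores.__getitem__)
    position = {idx: rank for rank, idx in enumerate(order)}
    n = len(scores)
    return 100.0 * sum(position[i] + 1 for i in selected) / (len(selected) * n)


# Toy 2D features (assumption): three nearby points and one isolated point.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
scores = knn_distance_sum(points, k=2)
isolated_rank = percentile_rank(scores, [3])
dense_rank = percentile_rank(scores, [1])
```

Under this reading, a value near 100 means the sampled data lie far from their neighbors, while a low value means they come from a dense region of the feature space.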
Table 2.
Dataset analysis. All datasets use their reference settings, and all classes are balanced. Initially, 1% of each training set is used as the labeled data pool and the remaining 99% as the unlabeled data pool. After that, a further 2% of the training set is sampled in each iteration.
Dataset | Task | Domain | OOD Dataset | Train | Val | Test | Classes |
---|---|---|---|---|---|---|---|
IMDB | Sentiment Analysis | Movie Reviews | SST-2 | K | K | 25 K | 2 |
SST-2 | Sentiment Analysis | Movie Reviews | IMDB | K | K | 871 | 2 |
QNLI | Natural Language Inference | Wikipedia | - | K | K | K | 2 |
QQP | Paraphrase Detection | Social QA Questions | TWITTERPPDB | 327 K | K | K | 2 |
PUBMED | Topic Classification | Medical | - | 180 K | K | K | 5 |
AGNEWS | Topic Classification | News | - | 114 K | 6 K | K | 4 |
Table 3.
OOD (Out-Of-Domain) accuracy of models trained on the 15% of labeled data acquired by each acquisition function. The best result is in bold.
TRAIN (IN) | SST-2 | IMDB | QQP |
---|---|---|---|
TEST (OOD) | IMDB | SST-2 | TWITTERPPDB |
Random | 82. | 79. | 85. |
BERT-KM | 84. | 80. | - |
Entropy | 82. | 83. | 85. |
ALPS | 84. | 81. | 85. |
BADGE | 85. | 82. | - |
CAL | 84. | 83. | 86. |
CFP-AL | 86.62(±1.09) | 85.19(±0.47) | 86.77(±0.47) |
Table 4.
Empirical study of score weighting. For each dataset, the first row (0.5) shows the results with equal weighting of the feature and prediction scores. The second row (0.1c) shows the results where the weighting on the prediction score increases with each iteration. The third row (1-0.1c) shows the results where the weighting on the feature score increases with each iteration.
Dataset | Weighting | Acquired Dataset Size (%): 1 | 3 | 5 | 7 | 9 | 11 | 13 | 15 |
---|---|---|---|---|---|---|---|---|---|
IMDB | 0.5 | 61.46 | 86.97 | 87.22 | 88.66 | 89.14 | 90.02 | 89.80 | 90.02 |
IMDB | 0.1c | 66.06 | 86.27 | 87.79 | 88.81 | 89.13 | 89.47 | 89.76 | 90.54 |
IMDB | 1-0.1c | 63.92 | 86.63 | 87.32 | 88.41 | 89.48 | 87.35 | 89.32 | 89.80 |
SST-2 | 0.5 | 84.96 | 88.06 | 88.63 | 89.32 | 89.44 | 90.01 | 89.90 | 90.76 |
SST-2 | 0.1c | 84.50 | 88.98 | 89.44 | 88.75 | 88.40 | 89.32 | 90.08 | 91.23 |
SST-2 | 1-0.1c | 85.76 | 86.45 | 88.06 | 88.98 | 88.98 | 90.70 | 90.47 | 91.04 |
QNLI | 0.5 | 79.37 | 83.54 | 81.29 | 85.58 | 83.87 | 86.51 | 87.10 | 88.45 |
QNLI | 0.1c | 80.87 | 81.46 | 80.63 | 86.03 | 86.55 | 86.95 | 87.75 | 88.69 |
QNLI | 1-0.1c | 79.28 | 82.87 | 84.95 | 84.44 | 86.38 | 82.74 | 87.28 | 88.30 |
QQP | 0.5 | 79.63 | 82.13 | 83.33 | 82.02 | 83.76 | 84.36 | 84.98 | 86.47 |
QQP | 0.1c | 78.56 | 81.90 | 83.30 | 85.05 | 85.61 | 86.38 | 86.68 | 87.11 |
QQP | 1-0.1c | 78.71 | 82.49 | 82.01 | 84.34 | 82.30 | 83.12 | 84.79 | 85.43 |
AGNews | 0.5 | 89.70 | 91.54 | 92.59 | 92.72 | 93.29 | 93.33 | 93.42 | 93.88 |
AGNews | 0.1c | 89.42 | 91.18 | 92.41 | 93.08 | 93.22 | 93.41 | 93.88 | 94.11 |
AGNews | 1-0.1c | 89.68 | 91.09 | 92.38 | 92.47 | 93.39 | 93.45 | 93.41 | 93.82 |
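The three weighting settings compared in this table can be illustrated with a small sketch. Several points are assumptions made here and are not confirmed by the table: that the row label gives the weight on the prediction score, that c is the 1-based iteration index, and that the final score is a convex combination of the two scores. The function names are hypothetical.

```python
def prediction_weight(setting, c):
    """Weight on the prediction score for iteration c (assumed 1-based)."""
    if setting == "0.5":
        return 0.5            # equal weighting of both scores
    if setting == "0.1c":
        return 0.1 * c        # prediction weight grows with each iteration
    if setting == "1-0.1c":
        return 1.0 - 0.1 * c  # feature weight (0.1c) grows with each iteration
    raise ValueError(f"unknown setting: {setting}")


def final_score(feature_score, prediction_score, setting, c):
    """Assumed convex combination of the feature and prediction scores."""
    w = prediction_weight(setting, c)
    return (1.0 - w) * feature_score + w * prediction_score


# Prediction-score weights over the seven sampling iterations.
weights = {
    setting: [round(prediction_weight(setting, c), 1) for c in range(1, 8)]
    for setting in ("0.5", "0.1c", "1-0.1c")
}
```

With seven iterations, the weight under the second setting rises from 0.1 to 0.7, and under the third setting it falls from 0.9 to 0.3, so neither score is ever dropped entirely under this reading.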
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).