by
1. The thesis submitted is my/our own original work completed while pursuing a degree at Brac University.
3. The thesis does not contain material which has been accepted, or submitted, for any other degree or diploma at a university or other institution.
Nafisa Mehreen
23241114
Approval
The thesis titled “Towards Personalized Education: Integrating AI in Learning Environments through Bangla Language.” submitted by
Examining Committee:
Supervisor:
Program Coordinator:
(Member)
Head of Department:
(Chair)
Abstract
This paper argues that AI and NLP can be harnessed to build a system with the potential to change current learning methods by offering custom-made learning experiences. The focus of this paper is on developing an intelligent system that addresses individual students' learning abilities and proposes individualized features to improve their study efficiency. Despite Bangla being the seventh most spoken language in the world, little work has been done in the educational sector that leverages AI for it. The primary goal of this paper is therefore to implement such features in the Bangla language. To attain this objective, we use natural language processing algorithms curated specifically for Bangla; the intelligence of the system rests on its interpretation of the language. Furthermore, this paper dives into the analysis and curation of suitable datasets, which include a broad range of language patterns and instructional resources that are critical for training and evaluating question answering. Using these algorithms and datasets, the paper aims to create an AI-driven educational framework capable of answering context-driven questions. In short, this paper intends to pave the path toward a more inclusive, effective, and personalized educational experience for Bangla-speaking students by combining AI and Bangla NLP technologies.
Table of Contents
Declaration i
Approval ii
Abstract iii
Table of Contents iv
List of Figures vi
1 Introduction 1
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Research Objective . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Thesis Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2 Literature Review 5
2.1 Detailed Literature Review . . . . . . . . . . . . . . . . . . . . . . . . 5
4.2.3 Finetuning BanglaT5 . . . . . . . . . . . . . . . . . . . . . . . 24
4.2.4 Finetuning BanglaGPT . . . . . . . . . . . . . . . . . . . . . . 25
4.2.5 Finetuning XLM-RoBERTa . . . . . . . . . . . . . . . . . . . 27
4.3 Result Analysis and Model Selection . . . . . . . . . . . . . . . . . . 29
6 Conclusion 48
6.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
6.2 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
6.3 Future Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
Bibliography 53
List of Figures
Chapter 1
Introduction
1.1 Introduction
Artificial Intelligence (AI) is transforming education worldwide. It provides customized, efficient, and responsive solutions to common learning obstacles. Natural Language Processing (NLP) enables education apps to replicate human-like comprehension and communication with students. The apps can give customized feedback
and guidance. Hasan et al. [21] stated that these technologies are revolutionizing the manner in which students interact with streams of knowledge, enabling systems to ascertain understanding, recommend resources, and guide students through the learning process. However, according to Rahman et al. [19], most of this advancement has catered almost exclusively to English and a limited cluster of high-resource languages, leaving behind languages like Bangla, which, despite being the seventh most spoken language in the world, is significantly under-represented in AI and NLP advancements.
Research has shown that intelligent learning environments, when linguistically and culturally localized, can have a significant impact on learning engagement and comprehension, as stated by Patel [12]. An intelligent system powered by artificial intelligence and backed by Bengali-medium curriculum content can serve as a virtual tutor. It can explain concepts, test knowledge, and respond to questions. In low-resource settings where human tutors may not be available or affordable, such tools are essential. Islam and Jahan [22] noted that technology-driven interventions, especially if tailored for the national language, can assist in bridging Bangladesh's urban-rural knowledge divide. Moreover, according to Roy et al. [23], if they are built with local feedback, such systems can accommodate students from diverse dialectal and socio-economic backgrounds, thus making education inclusive.
Sarker et al. [32] believe that despite the evident need and potential, several hindrances stand in the way of developing stable Bangla NLP systems. One of them is the lack of large-scale, annotated datasets for training and testing AI models. Typically, recent models like BERT or T5 have been trained on English or multilingual datasets that do not reflect the grammatical and contextual richness of Bangla. Hence, fine-tuning them on Bangla educational datasets is crucial for real-world performance. Additionally, language-specific phenomena like compound words, free word order, and orthographic variation make Bangla NLP particularly challenging compared to structurally simpler languages like English, as stated by Ahmed et al. [6].
The limitations do not end at data; deployment is also constrained by infrastructural factors. Even with improved smartphone availability, schools and homes in rural areas of Bangladesh continue to have weak internet connectivity and a shortage of digital equipment. Rahim et al. [18] reported that these infrastructural shortcomings constrain the potential scope of AI-driven learning platforms. Even where there is connectivity, digital literacy is low, particularly among teachers and parents, which makes it more difficult to integrate AI tools into the general learning environment.
These constraints also raise issues of algorithmic fairness. If AI systems are trained on non-representative datasets that have not been validated from an unbiased standpoint, they are likely to generate biased results, which can produce an unfair learning system. Mehrabi et al. [17] pointed out that biases in NLP models arise from unbalanced data or a lack of dialectal variation, which can skew results and damage users' trust. In education, bias can lead to misleading feedback or incorrect difficulty levels, damaging student performance. Therefore, fairness, transparency, and representative training are not optional extras but essential moral obligations for researchers.
Overall, this work will help build the still-emerging field of Bangla NLP by creating a useful educational tool with real-world application. It addresses data scarcity with judicious curation, addresses infrastructural constraints with a focus on lightweight deployment, and aims to achieve inclusivity by creating a tool accessible to Bangla-speaking students of various regions and financial backgrounds. Through this initiative, we aim to set a precedent for more inclusive and productive AI-based educational solutions in Bangladesh and lay the foundation for future work on low-resource language technology.
• Chapter 1 - This chapter provides the background of this study, defines the
research problem and describes the objectives.
• Chapter 2 - This chapter explores relevant ideas, models and previous research to provide a foundation for our study.
• Chapter 3 - This is the core chapter of our study, where we introduce a new dataset based on NCTB Textbooks for Bangla Medium students. Dataset analysis and validation are performed accordingly.
• Chapter 4 - This chapter discusses fine-tuning the existing models that were used to test our dataset. We selected BanglaT5 as our base model, following result analysis and evaluation of the fine-tuned models.
• Chapter 6 - This chapter concludes our entire study, mentions the limitations
and provides recommendations for future work.
Chapter 2
Literature Review
This research paper by Aleedy et al. [3] describes how an interactive AI agent is built to automatically generate conversations between a human and a computer. NLP and deep learning techniques consisting of Long Short-Term Memory (LSTM) networks, Convolutional Neural Networks, and Gated Recurrent Units were used to predict relevant automatic responses to customer inquiries. The system was tested and evaluated, and the results show that the LSTM model performs best of all the models. The goal of this paper is to improve customer support by responding to customers' inquiries as quickly and accurately as possible. The limited integration of outside information sources into the communication model is the primary research gap in this paper. The model focuses on past dialogues and overlooks other sources such as databases, the internet, and APIs, which can limit the AI agent's ability to respond with information that is correct or reliable.
cover all possible questions. According to the article, the system has limitations due to anaphora resolution, meaning it may struggle to identify the antecedent of a pronoun. Also, question generation was tested only in the domain of news articles, where it could produce grammatically correct answers. The document also mentions that the system may not perform well on complex sentences or sentences with words and phrases that require tags beyond the proposed set of rules. The authors would like to add more types of questions, improve the UD tagging model, and extend to other domains. Moreover, for further research, they want to test the system using minor details, named entities, and common nouns.
Alwahaishi [25] states that this study proposes a new model that overcomes limitations of previous dialogue systems by evaluating voice/video attributes and characteristics. The chatbot also contains an emotional representation that is related to biosignals and non-anthropomorphic expressive behavior. Additionally, the proposed model aims to enable more personalized interactive agents. Reinforcement Learning algorithms proved helpful in optimizing statistical dialogue managers in this paper as well, and these strategies were also useful for dealing with problems of speech understanding. Furthermore, the system can serve as a prototype for different applications with the diverse input and output emotional channels they need. Future work will involve further testing and refinement of the model as outlined in the paper. The authors also expressed interest in applying the model to mental health or education, and in extending the research to other areas, such as adding new modalities to the model while considering the ethics of conversational interfaces.
In their studies, Khan et al. [16] presented an extensive comparative analysis of the impact of each default and custom component. The chatbot system was constructed in Bangla to ease communication between businesses and Bengali users. The data they gathered turned out to be imbalanced, so they also had to address the imbalanced-class problem. The Rasa framework was used to develop interactive agents for maintaining conversations in Bangla. Nevertheless, the Rasa framework proved ineffective when applied to Bangla because it is a low-resource language, which necessitated creating personalized components specific to the Bengali language. Additionally, they indicated their desire to expand and improve the overall quality of existing datasets in future work. To achieve this, they will gather more data and perform intensive scrutiny and evaluation. They also intend to add further custom components to Rasa, such as recent transformer models and multilingual support, among others.
This paper by Gutiérrez [28] aims to investigate the role of Artificial Intelligence and chatbot-based learning built on open conversational tools such as ChatGPT. The study involves a handful of participants who engage with ChatGPT daily to learn English; their proficiency is then assessed through questionnaires and interviews. The results of this research show how beneficial ChatGPT can be to learners when applied in various learning contexts, making it a tool that complements and improves the field of research on technology-enhanced language learning. Moreover, the AIIA system uses OpenAI's GPT-3.5 model for its NLP capabilities. The model was selected for its advanced text generation and few-shot learning abilities. The system architecture includes a NodeJS backend and uses text embeddings for efficient document retrieval and query response. AIIA provides features such as personalized learning pathways, quiz and flashcard generation, and real-time feedback. However, this study is limited to conversations at a beginner level and does not reflect how a complex pattern of questions could impact the performance of AI chatbots. According to the author, further research on adaptive and interactive learning in AI-based chatbots is required to address these scenarios.
This chatbot by Dan et al. [26] implements an LLM and adapts a vast English language model to Chinese to perform intelligent essay assessment, Socratic teaching, and emotional support. It takes feedback grounded in human psychology and from frontline teachers to assess the accuracy of the bot, and fine-tunes the data for a deeper understanding of retrieval-augmented open question answering, psychology-based dynamic emotional support, and stimulating critical thinking. EduChat addresses challenges in the educational applications of LLMs by pre-training on a vast corpus of educational books and diverse instructions to establish domain-specific knowledge. It further fine-tunes the model with high-quality, customized instructions to enhance educational functions. Additionally, EduChat uses retrieval-augmented techniques to incorporate real-time data, ensuring accurate and up-to-date responses. The case studies claim that it provides precise answers with relevant information and tries to guide students like a teacher. However, this chatbot is limited to the Chinese language, and for now it can only assess essays and questions and provide intelligent emotional support. The researchers state that further study is needed to improve the LLM and integrate more educational features.
In this research, Jennifer et al. [29] propose a deep learning-based bilingual bot named BilinBot, which supports both English and Bangla. It implements AI and NLP for conversational purposes and uses Google's BERT to process the language. LSTM and GRU networks are used to train on the datasets efficiently. To implement the LSTM, they tokenized the data, built the network, and trained multiple models with different optimizers. For the deep learning model, they added an embedding layer and a dense output layer with an activation function. Moreover, to evaluate predictability, they used a Python implementation of the ROUGE score. They obtained ROUGE scores of 73%-96% on the English datasets and 73%-76% on the Bangla datasets. According to their analysis, these results are ahead of most state-of-the-art deep learning-based chatbots. The limitations of this chatbot lie in the categories in which it can perform, because of the limited datasets available in the Bangla language. Also, it can only be used in two languages; further research is needed to extend it to other languages as well and make it a multilingual chatbot.
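The ROUGE-1 score used above is, at its core, clipped unigram overlap between a reference reply and a generated one. As an illustrative sketch (not the authors' implementation), a minimal ROUGE-1 F1 can be computed like this:

```python
from collections import Counter

def rouge1_f1(reference: str, candidate: str) -> float:
    """ROUGE-1 F1: clipped unigram overlap between reference and candidate."""
    ref, cand = Counter(reference.split()), Counter(candidate.split())
    # Clip: each candidate word counts at most as often as it appears in the reference.
    overlap = sum(min(cand[w], ref[w]) for w in cand)
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1("the cat sat on the mat", "the cat sat on a mat")  # ≈ 0.833
```

The clipping step prevents a candidate from inflating its score by repeating a reference word; production evaluations would use a library implementation with stemming and ROUGE-2/L variants as well.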
sations. Along with that, the researchers have kept safety and accuracy as top priorities, and LaMDA's responses are designed to be respectful and factually correct. LaMDA can be a great fit for education and content-recommendation apps, as it can execute specific tasks if finely tuned. Additionally, since LaMDA is an AI that holds conversations, it continuously tries to improve the quality and accuracy of its replies. However, the paper also discusses several research gaps in LaMDA's development. The conversational system needs continuous fine-tuning for improvement, which is expensive and time-consuming, and subjective human judgments can cause inconsistent fine-tuning. LaMDA has learned to use external knowledge better but still sometimes misrepresents facts, mostly in cases where better evaluation metrics are needed. Bias mitigation is also hard, as it requires better detection techniques. Additionally, LaMDA's high energy consumption calls for more sustainable practices.
The research paper by Hasnat, Chowdhury, and Khan [1] presents an OCR system for the Bangla language, BanglaOCR. It is the first open-source Optical Character Recognition (OCR) system for Bangla script. The system was developed by incorporating Tesseract's optical character recognition engine with CRBLP's Bangla script processing tools. The authors included a thorough methodology covering training data preparation, image pre-processing, segmentation, and two-level post-processing for better accuracy. Their system achieves up to 93 percent accuracy on clean printed documents. It also includes a GUI, a spell checker, and support for both Windows and Linux environments.
The research paper by Lewis et al. [10] gives an account of BART, a denoising autoencoder developed for pretraining sequence-to-sequence models. BART is designed on the basis of a Transformer architecture with a bidirectional encoder and an autoregressive decoder. During pre-training, BART corrupts text in different ways, such as random sentence shuffling or a novel in-filling scheme, and then learns to reconstruct the original text. This approach thereby combines the bidirectional encoding of BERT with the autoregressive decoding of GPT in one generalized model. Consequently, the authors showed performance better than previous methods on several benchmarks, including GLUE, SQuAD, and XSum, with notable improvements in ROUGE-1 and BLEU. Its flexible noising scheme makes it highly effective across a broad range of NLP tasks. Despite this high performance, there are still gaps in BART's research. To begin with, the methods used for pretraining corruption could be more diverse and task-specific. In addition, when generating unsupported information it tends to hallucinate, showing a need for improved factual accuracy, and relevance within ELI5-style tasks has not yet been a focus. More generally, while BART works well in open-ended and casual conversational settings, on loosely defined tasks such as ELI5, where requests are long-form and open-ended, it often returns irrelevant answers, calling for much higher adaptability across task clusters.
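The two corruption schemes named above, sentence shuffling and span in-filling with a single mask token, can be sketched in a few lines. This is a toy illustration of the idea, not the paper's implementation; the `<mask>` token string and the naive sentence splitter are assumptions:

```python
import random

def shuffle_sentences(text: str, rng: random.Random) -> str:
    """Sentence permutation: split on '. ' and shuffle the order."""
    sentences = [s for s in text.split(". ") if s]
    rng.shuffle(sentences)
    return ". ".join(sentences)

def infill(tokens: list, span_start: int, span_len: int, mask: str = "<mask>") -> list:
    """Text in-filling: replace an entire span with a single mask token,
    so the model must also predict how many tokens are missing."""
    return tokens[:span_start] + [mask] + tokens[span_start + span_len:]

corrupted = infill("the quick brown fox jumps".split(), 1, 3)
# The pretraining objective is then to reconstruct the original text
# from the corrupted version with the autoregressive decoder.
```

Replacing a whole span with one mask (rather than one mask per token) is what distinguishes BART's in-filling from BERT-style masking: the decoder must infer the span length as well as its content.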
Islam et al. [5] address the planning and development of a Bangla virtual AI assistant, 'Adheetee', in this paper. The proposed model can follow and carry out a wide set of commands given in Bangla to smart devices. Commands are categorized into basic and core commands and are handled using techniques like Named Entity Recognition, Keyword Extraction, and Cosine Similarity. The algorithms mentioned here are implemented to help the virtual assistant use its own intelligence to respond to a variety of commands. The system can give responses independently, or it can use external APIs to find responses for unknown commands. However, the model is somewhat dependent on other systems, since it mostly needs external APIs for unknown data; this limits the model's independence and affects its performance. For further research, they want to work with more complex commands and decrease the dependency on external APIs. They are also developing software for both iOS and Android.
In this research, Rajpurkar et al. [2] took a deep dive into the Stanford Question Answering Dataset (SQuAD), a large, high-quality dataset that supports the training of machine reading comprehension models. The dataset is preprocessed to convert the text into a machine-readable format: it tokenizes the texts, converts words to numerical representations, and creates features that can be utilized by a model. The researchers also pointed out that there is a significant gap between the performance of machine models and humans. In SQuAD, answers are spans extracted from the passages themselves, so the dataset avoids multiple-choice answers. Models are trained by adjusting their parameters to minimize the error between predicted answers and the correct answers.
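For concreteness, a SQuAD-style record stores each answer as a text span plus its character offset into the context, so the span can always be recovered by slicing. A minimal illustrative record (the field names follow the public SQuAD format; the content here is invented):

```python
import json

# One SQuAD-style record: answers are character spans inside the context.
record = {
    "context": "The Stanford Question Answering Dataset was released in 2016.",
    "qas": [{
        "id": "q1",
        "question": "When was SQuAD released?",
        "answers": [{"text": "2016", "answer_start": 56}],
    }],
}

# The answer span is recoverable directly from the context by slicing.
ans = record["qas"][0]["answers"][0]
span = record["context"][ans["answer_start"]:ans["answer_start"] + len(ans["text"])]
serialized = json.dumps(record, ensure_ascii=False)
```

This span-plus-offset design is what lets extractive models be trained to predict start and end positions rather than free-form text.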
corpus such as Wikipedia in this example. This enables the model to pick up related documents at runtime and base its generation on them, creating wiser and more up-to-date outputs. The paper introduces two versions of RAG: RAG-Sequence, in which the whole answer is generated conditioned on a single retrieved document at a time and the results are marginalized over documents, and RAG-Token, in which each token may condition on a different retrieved passage. The retriever is implemented based on Dense Passage Retrieval (DPR), and the retriever and generator are trained end-to-end within a differentiable framework. The model is evaluated on a series of open-domain QA benchmarks including Natural Questions, WebQuestions, and CuratedTREC. RAG outperforms existing pipelines that use independent retriever-reader models and even achieved state-of-the-art results on some of the datasets at the time of publication. A notable advantage is that RAG can return evidence citations, which increases explainability and user trust. In addition, in contrast to purely parametric models, RAG can be updated by refreshing the corpus or retriever index rather than by retraining. However, the architecture carries additional computational overhead from multi-passage encoding and real-time retrieval. Despite this, RAG paved the way for an enormous diversity of retrieval-augmented language systems and had a subsequent impact on later work in open-domain QA, summarization, and even coding assistance. It marked a milestone for NLP, moving from purely generative models to hybrid models that combine symbolic and neural reasoning.
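The retrieve-then-generate pattern can be illustrated with a deliberately simplified sketch: a bag-of-words cosine retriever stands in for DPR's dense encoders, and prompt assembly stands in for the generator. The corpus sentences and function names are invented for illustration:

```python
from collections import Counter
from math import sqrt

# Toy corpus standing in for a Wikipedia-scale index.
corpus = [
    "Bangla is the seventh most spoken language in the world.",
    "BanglaT5 is a sequence-to-sequence model pretrained on Bangla text.",
    "Dense Passage Retrieval encodes queries and passages into vectors.",
]

def vec(text: str) -> Counter:
    """Bag-of-words vector; lowercased, punctuation stripped."""
    cleaned = "".join(c for c in text.lower() if c.isalnum() or c.isspace() or c == "-")
    return Counter(cleaned.split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, k: int = 1) -> list:
    """Return the k corpus passages most similar to the query."""
    q = vec(query)
    return sorted(corpus, key=lambda d: cosine(q, vec(d)), reverse=True)[:k]

# "Generation" is stood in for by prompt assembly: a real RAG system would
# condition a seq2seq generator on the retrieved passage plus the question.
question = "Which model is pretrained on Bangla text?"
prompt = f"context: {retrieve(question)[0]} question: {question}"
```

Swapping the toy `vec`/`cosine` pair for dense embeddings and a nearest-neighbour index is, conceptually, all that separates this sketch from the DPR retriever in the actual RAG pipeline.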
Asai et al. [34] address the deficiencies of standard RAG models with the creation of Self-RAG, a novel retrieval-augmented architecture that allows the model to self-regulate its own retrieval and evaluation process. Unlike earlier practices that retrieve a fixed number of documents and then generate, Self-RAG trains the model to decide when and how to retrieve, and whether to re-evaluate or rewrite, using internal critique tokens. This approach reflects a paradigm shift from static augmentation towards agentic reasoning, where the language model recognizes its gaps in knowledge and inaccuracies in facts. Self-RAG inserts target tokens at training time: [RETRIEVE] tells the model to initiate a retrieval query, and [CRITIQUE] tokens allow it to assess the adequacy and salience of retrieved responses. Generation is recursive, and the model is allowed to revise its response based on later critiques and retrievals. This structure more closely simulates human problem-solving and avoids depending on earlier retrieval responses. Empirical evaluation on QA, fact verification, and summarization tasks shows that Self-RAG significantly surpasses static RAG baselines, ChatGPT, and LLaMA-2 systems, particularly in terms of factual accuracy and reference trust. On open-domain QA, it reduces hallucination rates and boosts BLEU scores by considerable margins. Furthermore, qualitative assessment shows that the critique step is helpful for identifying and revising wrong outputs, demonstrating the system's capacity for self-correction. Self-RAG's main limitation is higher inference cost: multiple steps and rounds of decoding incur latency. More sophisticated training data and supervision strategies are also required to train retrieval timing and critique effectively. Nonetheless, Self-RAG sets a new benchmark for adaptive retrieval-augmented models by offering a scalable, self-regulated pipeline for applications that require reliable, evidence-based responses.
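The control flow Self-RAG learns (decide whether to retrieve, generate, then critique and possibly revise) can be caricatured as an explicit loop. In the real system these decisions are special tokens emitted by the trained model itself; here `needs_retrieval` and `critique_ok` are hypothetical stand-ins:

```python
# Hypothetical stand-ins for decisions the trained model emits as special
# tokens ([RETRIEVE], [CRITIQUE]); real Self-RAG learns these end-to-end.
def needs_retrieval(question: str) -> bool:
    return any(w in question.lower() for w in ("who", "when", "where"))

def critique_ok(answer: str, passages: list) -> bool:
    # Toy adequacy critique: the answer must be grounded in a retrieved passage.
    return any(answer in p for p in passages)

def self_rag(question, generate, retrieve, max_rounds: int = 3) -> str:
    """Retrieve on demand, then critique and revise up to max_rounds times."""
    passages = []
    answer = generate(question, passages)
    for _ in range(max_rounds):
        if needs_retrieval(question) and not passages:
            passages = retrieve(question)           # [RETRIEVE]
            answer = generate(question, passages)   # regenerate with evidence
        if critique_ok(answer, passages):           # [CRITIQUE] passed
            break
        answer = generate(question, passages)       # revise and re-critique
    return answer

# Toy generator/retriever to exercise the loop:
def toy_retrieve(q):
    return ["Ada Lovelace wrote the first program in the 1840s."]

def toy_generate(q, passages):
    return "Ada Lovelace" if passages else "unknown"

answer = self_rag("Who wrote the first program?", toy_generate, toy_retrieve)
```

The extra rounds in the loop are exactly where the latency cost noted above comes from: each critique that fails triggers another decoding pass.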
Gao et al. [36] offer a timely systematic survey of Retrieval-Augmented Generation (RAG) techniques adapted to Large Language Models (LLMs). With the rapid proliferation of LLMs like GPT, PaLM, and LLaMA, and the increasing need to ground their output in external knowledge, this paper categorizes the evolving architectures, processes, and applications of RAG in a unifying framework. The survey draws on over 200 research papers and provides a comprehensive taxonomy of the components that make up a typical RAG system. The authors divide RAG's development into three major phases: Naïve RAG, using fixed retriever + generator pipelines; Advanced RAG, using re-ranking, query reformulation, and late fusion; and Modular RAG, where retrieval and generation are flexible, learnable modules that can be controlled by agents or policies. The retrieval module is addressed in terms of sparse vs. dense paradigms (e.g., BM25 vs. DPR), while generation approaches include Fusion-in-Decoder, retrieval-augmented prompting, and fine-tuned encoders. One of the key contributions is the authors' incorporation of evaluation criteria and standards. They note the difficulty of measuring groundedness, factuality, latency, and memory efficiency for RAG systems and suggest a standardized evaluation framework. The paper also identifies challenges such as index staleness, retrieval noise, and security risks such as prompt injection via returned documents. In addition, the survey explores emerging trends such as multimodal RAG (text with images or audio), long-context RAG through memory compression, and tool-augmented RAG, where LLMs use retrieval as part of a broader reasoning pipeline. The authors conclude by defining open research directions such as life-long retrieval learning, retrieval for instruction-tuned models, and privacy-aware RAG. The paper is a seminal guide for researchers and practitioners who want to deploy or understand retrieval-augmented systems at scale. Its extensive coverage and insightful classification make it an essential guide to navigating the fast-evolving RAG space.
The study by Zulkarnain et al. [33] introduces bbOCR, a comprehensive and open-source OCR pipeline developed to digitize printed Bengali documents across various domains. The system aims to address the lack of integrated OCR solutions for Bengali, which remains underserved despite its global ranking as the sixth most spoken language. The pipeline incorporates several modules such as geometric and illumination correction, document layout analysis, line and word detection, word recognition, and HTML reconstruction. These modules help to accurately convert scanned Bengali documents into structured and editable formats. A notable contribution of the paper is the introduction of a Bengali text recognition model named APSIS-Net, alongside two synthetic training datasets named Bengali SynthTIGER and SynthINDIC, and a large evaluation dataset named BCD3, which contains annotated samples from 9 domains. The evaluation uses both traditional OCR metrics and new system-level reconstruction metrics to demonstrate that bbOCR significantly outperforms Tesseract in both accuracy and layout preservation. However, while the pipeline is highly effective for printed texts, it currently does not support handwritten documents or documents containing tables. It also shows limitations when dealing with blurred images. Future plans include expanding to multilingual capabilities and integrating table recognition. This paper highlights a critical advancement in Bengali document digitization but also signals the need for further improvements to handle more complex document types and handwritten scripts.
The paper by Du et al. [9] introduces PP-OCR, an ultra-lightweight Optical Character Recognition (OCR) system designed to balance model capability against model size. The system is divided into three main parts: text detection using Differentiable Binarization (DB), detected-box rectification, and CRNN-based text recognition. Each module is optimized using a variety of enhancement and slimming techniques, such as a lightweight backbone, cosine learning rate decay, FPGM pruning, and PACT quantization. These strategies collectively reduce the total model size to as low as 3.5M for Chinese character recognition and 2.8M for alphanumeric symbol recognition without compromising much on accuracy. The system is trained on massive datasets and is validated across multiple languages such as French, Korean, Japanese, and German. One key strength of this paper is its extensive ablation studies showing the effectiveness of each optimization strategy. However, the system lags behind large-scale models in terms of F-score. Nevertheless, the trade-off offers significant practical advantages for deployment on mobile or embedded devices. Future improvements could target the quantization of more complex components like LSTMs and expanding training data diversity for better multilingual generalization.
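Among the slimming strategies listed, cosine learning rate decay anneals the learning rate smoothly from its peak toward (near) zero over training. A minimal sketch of the schedule:

```python
from math import cos, pi

def cosine_lr(step: int, total_steps: int, lr_max: float, lr_min: float = 0.0) -> float:
    """Cosine decay: lr_min + (lr_max - lr_min) * (1 + cos(pi * t / T)) / 2."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + cos(pi * step / total_steps))

schedule = [cosine_lr(t, 100, 0.001) for t in range(101)]
# The rate starts at lr_max, falls slowly at first, fastest mid-training,
# and flattens out near lr_min at the end.
```

The slow start and gentle finish are the practical appeal: early steps keep a high rate for fast progress, while late steps take tiny updates that help convergence.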
12
action for better context and relevance. An answer-aware attention mechanism was
also used alongside a coarse-to-fine generation scenario. Therefore, it considerably
enhances question quality and relevance, thus establishing new standards for SQG
research based on a novel SQG-specific dataset. Current Sequential Question Generation (SQG) research faces numerous challenges, including error cascades, limited context capture, and low-quality datasets, all of which lower the quality and coherence of generated questions. Although the authors introduce a semi-autoregressive model, further studies are required to enhance question informativeness, develop more efficient clustering and generation methods, and improve coverage of images and knowledge graphs, among other things.
Chapter 3
For extracting the data, the OCR method was used. The extracted texts were broken down into meaningful contexts constrained to fewer than 1,000 characters. The purpose of this constraint was to keep each context manageable for question generation and answer extraction. The dataset follows the SQuAD (Stanford Question Answering Dataset) format.
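The chunking step can be sketched as follows; splitting on sentence-final punctuation (including the Bangla danda) is an assumption about the exact implementation, and `split_into_contexts` and `to_squad_entry` are hypothetical helpers, not the thesis's actual code:

```python
import re

def split_into_contexts(text, max_chars=1000):
    """Greedily pack whole sentences into contexts of at most max_chars."""
    # Split on the Bangla danda (।) as well as Latin sentence punctuation.
    sentences = [s.strip() for s in re.split(r"(?<=[।.!?])\s+", text) if s.strip()]
    contexts, current = [], ""
    for sent in sentences:
        candidate = (current + " " + sent).strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                contexts.append(current)
            current = sent[:max_chars]  # a single oversized sentence is hard-cut
    if current:
        contexts.append(current)
    return contexts

def to_squad_entry(title, context, question, answer):
    """Wrap one QA pair in a SQuAD-style dict with a character answer_start."""
    return {
        "title": title,
        "context": context,
        "question": question,
        "answers": {"text": [answer], "answer_start": [context.find(answer)]},
    }
```

Each resulting context stays under the 1,000-character budget while ending on a sentence boundary, which keeps the answer-span indices meaningful.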
Figure 3.2: End of Dataset
This structured format ensured the integrity and accuracy of the data, making
it suitable for subsequent stages of the research such as analysis, evaluation, and
potential model training in natural language processing tasks involving the Bangla language.
This chart shows the number of questions per chapter, where the x-axis represents the chapter and the y-axis the number of questions associated with it. The number of questions differs from chapter to chapter: some chapters have fewer than 100 questions while others have more than 300, though most fall in the 200-300 range. The largest number of questions comes from Chapter 11 of the Bangladesh and Global Studies book, from which more than 400 questions were derived.
Figure 3.4: Context Length vs Frequency in Dataset
This histogram shows the distribution of context length, indicating how frequently different context lengths occur in the dataset. The x-axis shows the length of the context text (likely in characters), and the y-axis the frequency. The x-axis ranges from 0 to 1750: the shortest texts are near 0 and the longest near 1750. The peak lies in the 250-500 range, where the frequency count is around 200.
This chart shows the average context and answer length by chapter, where the x-axis represents chapters and the y-axis the average length in characters (the answer start is character-based). The average context length reaches 1,200 characters in some chapters, while the average answer length never exceeds 20 characters; the context length, on the other hand, never falls below 200 characters.
Figure 3.6: Distribution of Answer Start Position
The bar chart shows the distribution of answer start positions, where the x-axis gives the answer start position (as a character index) and the y-axis the frequency. The distribution is right-skewed: position 0 has a frequency as high as 500, and very few answers start in the range between 800 and 1600.
After finishing the dataset, we found that it consists of 3,473 rows and 6 columns, which included the following:
4. Answer: the answer span, which must be relevant to the context
5. Chapter: An integer referring to a chapter
The first test revealed that there were no missing or null values in any of the columns. However, the answer start column was stored as a string/object instead of an integer, so it was marked for conversion to avoid alignment validation issues.
1. There were no failing rows for title length not between 3 and 60.
2. There were 56 rows whose context strings were too short to support meaningful question answering.
4. There were 26 rows whose answer fields were either too short or too long, i.e., the answer length was not between 1 and 100.
5. There were 7 rows with non-integer answer start values, which broke the span annotation logic.
Checking these rules created the core criteria for filtering during the cleaning process.
• Removed the rows where the context consists of fewer than 60 characters.
• If the answer length was not between 1 and 100, then those were removed.
• In the end, the final dataset came down to 3,385 rows.
By following these systematic rules, a reliable resource for building Bangla question answering was produced: issues related to type mismatches, invalid spans and length outliers had been resolved.
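The cleaning rules above can be sketched as a row filter; the dict keys used here are illustrative stand-ins, not the thesis's actual column names:

```python
def is_valid_row(row):
    """Apply the filtering criteria from the data-cleaning step."""
    context, answer = row["context"], row["answer"]
    # answer_start may have been stored as a string; coerce it to int.
    try:
        start = int(row["answer_start"])
    except (ValueError, TypeError):
        return False
    if len(context) < 60:                # context too short to support QA
        return False
    if not (1 <= len(answer) <= 100):    # answer span too short or too long
        return False
    # The span must actually align with the context text.
    return context[start:start + len(answer)] == answer

rows = [
    {"context": "x" * 80, "answer": "x" * 5, "answer_start": "0"},   # kept
    {"context": "short", "answer": "sh", "answer_start": 0},          # context < 60
    {"context": "y" * 80, "answer": "y" * 5, "answer_start": "abc"},  # non-integer start
]
clean = [r for r in rows if is_valid_row(r)]
```

The final alignment check (comparing the slice at `answer_start` against the answer text) is what makes invalid spans detectable at all, which is why the string-to-integer conversion had to happen first.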
Chapter 4
4.1 Models
4.1.1 Bangla T5
The BanglaT5 [20] model follows a transformer-based encoder-decoder architecture, based on the Text-to-Text Transfer Transformer (T5). Encoding starts when a sentence enters as input: it passes through multiple layers of self-attention and feed-forward networks and is converted into a latent representation. The decoder, in turn, generates the output sequence using the encoder's latent representation and the previously generated tokens. To focus on different portions of the sequence, the model uses multi-head self-attention in both the encoder and the decoder.
Since the words of a sentence follow a sequential order and the transformer architecture cannot infer that order by itself, positional information is added to the input embeddings before processing, so that each token carries its position in the sequence. The BanglaT5 model uses 12 encoder and 12 decoder layers; each layer has 12 attention heads and a feed-forward network with 3,072 units.
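As an illustration of positional encoding, the sinusoidal scheme from the original Transformer assigns position pos, in the dimension pair (2i, 2i+1), the values below. This is one common scheme only; T5-style models typically use learned relative position biases instead, so this is illustrative rather than a description of BanglaT5's internals:

```latex
PE_{(pos,\,2i)}   = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right),
\qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
```

Because each dimension pair oscillates at a different wavelength, every position receives a unique pattern that the attention layers can exploit.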
The model is trained using a Bangla-specific SentencePiece tokenizer, which breaks Bangla text down into smaller units, or tokens. The Bangla language has a complex combination of characters, so the tokenizer splits text into meaningful sub-parts, which helps the model handle unseen text. Even if the model has not seen a particular word during training, it can still process it by recognizing its subword tokens. With almost 223 million parameters, BanglaT5 can handle complex Bengali text generation tasks.
T5 Tokenizer
For the BanglaT5 model, the tokenizer works similarly to other transformer-based models. It uses a SentencePiece model, a type of subword tokenizer that splits words into smaller units called subwords. This makes it significantly more efficient at handling rare and unseen words. The method is based on Byte-Pair Encoding (BPE), which allows the tokenizer to treat whole common Bangla words as single tokens while breaking less frequent words into parts. Special tokens such as <pad>, <unk>, <s> and </s> are used for padding, unknown words, and sequence boundaries.
4.1.2 BanglaGPT
BanglaGPT[31] follows a transformer-based architecture, specially optimized for
Bangla-related services. It is built on a similar structure to GPT (Generative Pre-
trained Transformer), where the model is designed to predict the next word in
sequence in order to handle tasks such as text generation, completion and compre-
hension.
The model uses a decoder-only architecture; that is, it processes tokens in order from left to right. It uses masked multi-head self-attention, which helps it consider different parts of the input at each decoding step. This focused approach enables the model to capture long-term dependencies and contextual meaning in Bangla.
BanglaGPT also uses positional encoding to preserve word order, which is important for contextual understanding in Bangla, where word order affects meaning. Trained on a Bangla-specific corpus with a subword tokenizer, it tokenizes Bangla script into subword units. This helps it deal with difficult and previously unseen words, allowing it to perform well in a wide range of Bangla language tasks.
The model's ability to produce smooth and consistent Bangla text comes from its extensive pretraining on large Bangla datasets. With over a billion parameters, BanglaGPT can perform tasks such as text generation and question answering in Bangla.
GPT Tokenizer
The BanglaGPT model uses subword tokenization based on BPE, which proves very efficient for handling Bangla text. It handles Bangla's complex script, such as compound characters and diacritic marks, by splitting them into subword tokens.
4.1.3 XLM-RoBERTa
XLM-RoBERTa[4] is a multilingual model based on the RoBERTa framework, which
is itself an improved version of BERT (Bidirectional Encoder Representations from
Transformers). It is pretrained on text in several languages, including Bangla, using the masked language modeling (MLM) objective. In MLM, some words in a sentence are masked, and the model is trained to predict these masked words from their context.
The model is a multi-layered transformer encoder with self-attention that processes input sequences in both directions, taking into account dependencies between words rather than reading them in a single fixed order. This bidirectional characteristic is needed for understanding complex syntactic relations in Bengali sentences.
Exact Match
Exact Match is a common metric in NLP, used especially for question answering and text generation models to evaluate the accuracy of model predictions. It is calculated as the ratio of the number of exactly correct predictions to the total number of predictions. It is most helpful where precise answers are crucial: it gives no credit for partial matches, making it a strict measure of correctness that sets it apart from other metrics such as the F1 score.
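A minimal sketch of Exact Match; the normalization rules here (lowercasing, stripping the Bangla danda and common punctuation) are an assumption, since evaluation scripts differ in their exact normalization:

```python
import re

def normalize(text):
    """Lowercase, strip punctuation (including the Bangla danda), squeeze spaces."""
    text = re.sub(r"[।,.?!\"']", " ", text.lower())
    return " ".join(text.split())

def exact_match(predictions, references):
    """Fraction of predictions that equal their reference after normalization."""
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)

print(exact_match(["ঢাকা।", "১৯৭১"], ["ঢাকা", "১৯৫২"]))  # → 0.5
```

Note that a prediction differing from the reference by a single token scores 0 under EM, which is exactly the strictness described above.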
4.2.2 Model Finetuning
Initially, we selected three models that are highly efficient for the natural language processing tasks we implement for the Bangla language: BanglaT5, BanglaGPT and XLM-RoBERTa. These models are well structured, through their tokenization tools, for tasks such as summarization, question answering and text-to-text generation, and all of them are pre-trained on Bangla. We then fine-tuned these models using our own dataset, created from scratch from the 'Bangladesh and Global Studies' book in SQuAD 2.0 format. To evaluate the fine-tuning, we use F1 and Exact Match scores as metrics.
Table 4.1: Training and Validation Loss over Epochs for Finetuned BanglaT5
Figure 4.1: Finetuned BanglaT5
This shows that the training and validation loss for our dataset was initially high but gradually decreased to a stable state over the course of the fine-tuning process. At the 10th epoch the loss was at its lowest, indicating that the model fit our dataset properly.
Then, we ran a question-answering test with the fine-tuned BanglaT5 model and computed F1 and Exact Match scores. On the test split of our dataset, we obtained an F1 score of 87.82% and an Exact Match score of 78.23%.
the training function. The finetuning process ran for 10 epochs and it took a total
time of almost 3 hours to complete with our GPU. Here are the results we got from
our finetuning process with visualization.
Table 4.2: Training and Validation Loss over Epochs for Finetuned BanglaGPT
This data shows that the initial training and validation loss while fine-tuning this model was quite high; both decreased gradually up to the 10th epoch, but the training loss remained comparatively high. The evaluation scores for this model were 74.19% F1 and 67.53% Exact Match. Here is an example question-answer set:
Figure 4.4: Example Question Answer Set in Finetuned BanglaGPT
Table 4.3: Training and Validation Loss over Epochs for Finetuned XLM-RoBERTa
Figure 4.5: Finetuned XLM-RoBERTa
From this we can see that the training and validation loss were both initially much higher and decreased almost linearly over the early epochs. After the 4th epoch the training loss kept decreasing, but with a smaller slope. The validation loss, however, was less stable: it rose slightly until the 5th epoch, dipped until the 6th, rose again at the 7th, and then decreased gradually with a very small slope. After the 10th epoch both losses were low, though still higher than those of the BanglaT5 model. The F1 score for this model is 81.03% and the Exact Match score is 73.92%. We also tested question answering, and here is the result:
4.3 Result Analysis and Model Selection
Based on the evaluation metrics, BanglaT5 stands at the top for the Bangla chatbot we are trying to build, as it showed the best performance: the lowest average training loss (0.8742), the highest F1 (87.82%), and the highest Exact Match (78.23%). Without fine-tuning, BanglaT5 did not stand out, but once fine-tuned it was the strongest model. On the other hand, BanglaGPT has a larger parameter count (1.2B) and it showed
Models                   Params   EM      F1
mBERT                    180M     67.12   72.64
XLM-R (base)             270M     68.09   74.27
XLM-R (large)            550M     73.15   79.06
sahajBERT                18M      65.48   70.69
BanglishBERT             110M     72.43   78.40
BanglaBERT               110M     72.63   79.34
BanglaT5                 247M     68.50   74.80
BanglaGPT                1.2B     65.23   73.72
Finetuned BanglaT5       247M     78.23   87.82
Finetuned XLM-R (Large)  550M     73.92   81.03
Finetuned BanglaGPT      1.2B     67.53   74.19
higher average training loss (2.6683) and lower Exact Match (67.53) and F1 (74.19) scores, so it does not stand out here and might not generalize as well as the other models, though it remains suitable for large-scale data processing and contextual understanding. XLM-RoBERTa, in turn, performed much better than BanglaGPT, with moderately lower training loss and higher Exact Match (73.92) and F1 (81.03) scores, though it still falls behind the fine-tuned BanglaT5. This model can be a good option if we require higher accuracy without a relatively increased training loss.
Figure 4.7: Comparison of The Finetuned Models
To sum up, BanglaT5 has the best metrics among the models, which makes it the ideal choice for our work. Especially for task-specific objectives like ours, fine-tuned BanglaT5 stands at the top.
Chapter 5
Input Modalities
The system offers two major input options: (a) a direct textual interface for entering user questions and (b) an image-based interface that links the system to printed textbooks. This fallback capability is very important for the real-life applicability of this technology in the education sector of Bangladesh, where students commonly use hard-copy textbooks, printed handouts and exam papers that are not digitally available.
changing their meaning. This starts with Unicode normalization under NFC (Nor-
malization Form Canonical Composition), which allows consistent representation of
Bangla characters, which are especially complex because of conjunct forms, as well
as diacritics used in the script. The system then cleans the text by deleting redundant internal spacing and leading and trailing whitespace. It also removes extraneous or non-Bangla characters that could confuse the embedding model, such as special characters and emojis, while syntactically and semantically significant marks such as full stops, commas and question marks are kept. Lastly, the cleaned, normalized question is tokenized with a user-defined Bangla tokenizer that is compatible with the downstream sentence embedding model.
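The normalization and cleaning steps can be sketched with the standard library as follows; the exact character filter is an assumption (keeping the Bengali Unicode block U+0980-U+09FF plus a few punctuation marks), not the system's actual rules:

```python
import re
import unicodedata

def clean_question(text):
    """NFC-normalize, drop non-Bangla noise, keep key punctuation, squeeze spaces."""
    text = unicodedata.normalize("NFC", text)
    # Keep Bangla characters (U+0980-U+09FF), whitespace, and a few
    # syntactically meaningful marks; drop emojis and other symbols.
    text = re.sub(r"[^\u0980-\u09FF\s।,.?]", "", text)
    return " ".join(text.split())

print(clean_question("  বাংলাদেশের   রাজধানী কী? 😀 "))
```

NFC normalization matters because visually identical Bangla conjuncts can be encoded as different codepoint sequences, which would otherwise embed to different vectors.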
be further subdivided, cutting strategically at semantic boundaries rather than at a fixed length that may fall in the middle of a meaningful sentence.
Vector Indexing using FAISS
To enable efficient large-scale search, the system relies on Facebook AI Similarity Search (FAISS), a library optimized for similarity search over dense vectors. The index is built with the IndexIVFFlat structure, which combines an Inverted File System (IVF) with a flat inner-product quantizer. This hybrid approach partitions the embedding space into clusters (found by k-means during training), and each new vector is added to the posting list of its nearest centroid. At retrieval time only a subset of these posting lists is searched, which sharply reduces look-up time. A representative sample of the embedded chunks is used to train the index, i.e., to find the initial centroids. The value of nlist (the number of clusters) is selected empirically to trade off retrieval latency against recall; it is often set to roughly the square root of the number of chunks. Inner-product similarity over cosine-normalized vectors is then used to estimate semantic distance.
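FAISS itself is a compiled library; as a minimal numpy sketch of the core retrieval operation (inner product over L2-normalized vectors, equivalent to cosine similarity), with an exhaustive search standing in for IVF cluster probing:

```python
import numpy as np

def l2_normalize(x):
    """Normalize rows to unit length so inner product equals cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def search(index_vectors, query, k=2):
    """Return indices of the top-k chunks by cosine similarity to the query."""
    sims = l2_normalize(index_vectors) @ l2_normalize(query)
    return np.argsort(-sims)[:k]

# Toy 4-dimensional "chunk embeddings"; real SBERT vectors have hundreds of dims.
chunks = np.array([[1.0, 0.0, 0.0, 0.0],
                   [0.9, 0.1, 0.0, 0.0],
                   [0.0, 0.0, 1.0, 0.0]])
query = np.array([1.0, 0.05, 0.0, 0.0])
print(search(chunks, query))  # the two chunks closest in direction to the query
```

An IVF index computes the same similarity but only against vectors in the probed clusters, which is where the latency saving comes from.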
The trained index is serialised to disk and cached in memory at system boot-up. This lowers cold-start latency and allows fast deployment during both research and production. The index is re-trained and re-indexed regularly to accommodate new textbook content and embedding refinements, so the indexed entries always reflect the newest content. In summary, our RAG system builds a semantically rich, efficiently searchable representation of raw educational content through this embedding and vector indexing process. With the help of a Bangla-optimized SBERT model and a highly scalable FAISS ANN index, the system can quickly retrieve contextually relevant information. This infrastructure is a precondition for reliable real-time context selection and for the subsequent generative components of the QA system.
there is semantic matching between the query vector and the document chunks that were previously embedded into the vector database.
talk of the same thing in different words, such as “Poets and novelists of Bengal”, but this lexical gap can be bridged by the cosine-based similarity search.
enhance the relevance, specificity and diversity of the contextual information fed into the generative model.

Initial Lexical Filtering using Jaccard Similarity

The first stage of this re-ranking pipeline removes outlier passages through lexical-level pruning with Jaccard similarity. Jaccard similarity measures how many tokens are shared between the question the user types in and a given retrieved chunk. It is a coarse but efficient measure of lexical overlap: when it is very small, the retrieved chunk may share no vocabulary or core concepts with the question even though the dense embeddings rate it as topically related. For example, a question concerning the “Bangla Independence Movement” can mistakenly recall a text on the “Modern Literature Movement”; the two are irrelevant to each other, yet their embeddings can appear similar. Any chunk whose Jaccard score falls below a configurable threshold (0.2-0.3) is eliminated to avoid introducing misleading or noisy context.
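A sketch of the Jaccard pruning step; whitespace tokenization and the example threshold of 0.2 are assumptions:

```python
def jaccard(a_tokens, b_tokens):
    """Jaccard similarity of two token sets: |A ∩ B| / |A ∪ B|."""
    a, b = set(a_tokens), set(b_tokens)
    return len(a & b) / len(a | b) if a | b else 0.0

def prune_chunks(question, chunks, threshold=0.2):
    """Drop retrieved chunks with too little lexical overlap with the question."""
    q = question.split()
    return [c for c in chunks if jaccard(q, c.split()) >= threshold]

chunks = ["the independence movement of Bangla", "modern literature movement"]
kept = prune_chunks("Bangla independence movement history", chunks)
```

The second chunk shares only the token "movement" with the question (Jaccard about 0.17) and is pruned, while the first passes the threshold comfortably.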
to produce diminishing improvements in the quality of the generated answers while increasing generation time and noise. The chosen passages are token-trimmed, when needed, to fit the 512-768-token context window of the BanglaT5 model (depending on the tokenizer's vocabulary). Where possible, passages are truncated at sentence boundaries to maintain syntactic and semantic completeness.
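As a sketch of this trimming step, with whitespace tokens standing in for the real tokenizer's token count and an illustrative 512-token budget:

```python
import re

def trim_at_sentence_boundary(passage, max_tokens=512):
    """Keep whole sentences until adding the next one would exceed the budget."""
    sentences = [s for s in re.split(r"(?<=[।.!?])\s+", passage) if s]
    kept, used = [], 0
    for sent in sentences:
        n = len(sent.split())          # proxy for the tokenizer's token count
        if used + n > max_tokens:
            break
        kept.append(sent)
        used += n
    # Fall back to a hard token cut if even the first sentence is too long.
    return " ".join(kept) if kept else " ".join(passage.split()[:max_tokens])
```

Cutting only at sentence ends means the model never receives a context that stops mid-clause, which is the completeness property described above.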
2. Question-answer sets composed by people, which reflect the formal, factual, pedagogical emphasis of textbook answers. They are extensive and cover a wide range of topics such as Bangla literature, history, general science and social studies.
3. Adversarial and null-context examples intended to teach the model that a question may not be answerable from the given input. These examples ensure that the model learns to abstain from answering rather than hallucinate answers to questions it is not certain about.
This fine-tuning approach makes the model resilient across a wide range of educational scenarios, such as vague inquiries, partially matching contexts, and scarce question context.
The model retains a full encoder-decoder structure, in that the encoder operates over the complete input sequence while the decoder attends over the entire set of encoded tokens during generation. This arrangement allows the model to combine information from multiple passages without fragmentation, unlike retrieval models that process passages or context windows in isolation.
Figure 5.1: BanglaT5 EM and F1 Metrics with Test Dataset
We have observed that the discrepancy in EM (Exact Match) and F1 scores between our base fine-tuned model and our RAG (Retrieval-Augmented Generation) pipeline arises from key architectural and functional differences. Firstly, EM evaluates whether the predicted answer exactly matches the reference answer after normalization, whereas the F1 score measures token-level overlap by combining precision and recall.
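The token-level F1 used by SQuAD-style evaluation can be sketched as follows (answer normalization is omitted here for brevity; duplicate tokens are handled with multiset counts):

```python
from collections import Counter

def token_f1(prediction, reference):
    """Token-overlap F1: harmonic mean of token precision and recall."""
    pred, ref = prediction.split(), reference.split()
    common = sum((Counter(pred) & Counter(ref)).values())
    if common == 0:
        return 0.0
    precision = common / len(pred)
    recall = common / len(ref)
    return 2 * precision * recall / (precision + recall)

# A longer but correct generation is penalized relative to the concise gold answer.
print(token_f1("the theory was developed by Albert Einstein", "Albert Einstein"))
```

This is exactly why a verbose RAG answer can be factually right yet score below a terse extractive prediction: extra tokens lower precision even when recall is perfect.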
Our base fine-tuned model (BERT, RoBERTa, or T5) was trained end-to-end on
a supervised QA dataset such as SQuAD or Natural Questions. As the context
passages are fixed and provided during training, the model learns to extract or
generate answers in a controlled format. This direct supervision helps to consistently
produce precise outputs. As a result, it leads to strong EM and F1 scores on
evaluation benchmarks.
However, the RAG pipeline adopts a totally different two-stage architecture. Firstly,
a retriever component dynamically searches a large corpus (using FAISS or similar
vector-based methods) to find top-k relevant passages. Then, a generator like BART
or T5 generates an answer conditioned on the retrieved documents. Unlike our base model, however, RAG is not trained in a fully end-to-end manner, especially if the retriever is frozen or tuned independently; this disjointed training often introduces challenges.
It is also important to note that RAG has clear strengths: it excels in open-domain QA and real-world applications where information must be retrieved dynamically, particularly when questions require knowledge not present in the training set.
In such scenarios, RAG can outperform closed-book models in terms of semantic
accuracy, even if the EM/F1 scores appear lower.
To illustrate the point: given the question, “Who developed the theory of relativ-
ity?”, our fine-tuned model might return “Albert Einstein,” yielding perfect EM and
F1. On the other hand, RAG will probably generate: “The theory of relativity was
developed by physicist Albert Einstein in the early 20th century.” Even though the
answer is factually accurate and even more informative, this answer fails the exact
match criterion and includes extra tokens that reduce the F1 score.
In summary, our fine-tuned model tends to achieve higher EM and F1 scores due to its precise and concise outputs, while our RAG pipeline trades metric scores for retrieval flexibility and more informative, though less literal, generated answers. To improve RAG's performance, we are mainly exploring strategies such as fine-tuning the retriever on in-domain data, applying re-ranking methods to filter for more relevant documents, and using post-processing techniques to align generated answers more closely with expected outputs.
Figure 5.3: OpenAI Prompt for Question Generation
5.4 User Interface
5.4.1 OCR Integration
The OCR system integrates with the user application to start a lightweight, parallel OCR processing pipeline whenever a user uploads a new image or document, e.g., a page of a reference book, a printed worksheet, or handwritten class notes. It works the same way as the default OCR module in offline corpus ingestion, going through grayscale conversion, binarization, noise filtering, and skew correction before parsing with Tesseract OCR v5.3, configured to support Bangla script and equation types.
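The first two preprocessing stages can be sketched in numpy as below; the fixed threshold is an assumption (production pipelines often use Otsu or adaptive thresholding), and the Tesseract call itself is omitted:

```python
import numpy as np

def to_grayscale(rgb):
    """Luma-weighted grayscale conversion of an H x W x 3 uint8 image."""
    return (rgb @ np.array([0.299, 0.587, 0.114])).astype(np.uint8)

def binarize(gray, threshold=128):
    """Map pixels to pure black (0) or white (255) for cleaner OCR input."""
    return np.where(gray > threshold, 255, 0).astype(np.uint8)

page = np.full((2, 2, 3), 200, dtype=np.uint8)   # a bright synthetic "page"
binary = binarize(to_grayscale(page))
```

Binarization matters for Bangla in particular: faint diacritics and the matra line survive thresholding much better after the image has been flattened to high-contrast black and white.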
The extracted text is then preprocessed and chunked, with the maximum token length restricted to suit the downstream models. Each chunk is embedded in real time with the same sentence transformer used during offline corpus indexing (l3cube-pune/bengali-sentence-similarity-sbert), so that the resulting vectors are comparable to the already constructed index.
These newly embedded vectors are not written to the disk-resident FAISS store, so the integrity of the persistent master index is never jeopardized. Instead, they are placed in a volatile, in-memory FAISS IndexFlatIP instance dedicated to the running session. This architecture isolates user-provided documents, prevents contamination of the global dataset, and preserves data reproducibility and versioning. Here is an example of the OCR input:
Figure 5.6: OCR output
The first two screenshots show how the user can upload an image file in JPG, PNG, or JPEG format containing a block of Bangla text. This can be a scanned page, a PDF screenshot, or a photo of printed material; the picture needs to be clear. This image works as the input for our OCR (Optical Character Recognition) function. The second screenshot shows the result after processing the image: in the extracted-text box, the embedded text is converted into plain, editable text. This feature helps users quickly retrieve Bangla text from a picture and reuse content from printed or image-based documents without manual typing. It can be especially useful for students, teachers, researchers, or anyone working with scanned academic or reference materials. The extracted text can also provide context for the application's other functions: question-answer generation and MCQ generation.
This interface shows the question generation input and output. The user provides a Bangla context and the number of questions they want; the system then generates that many meaningful, context-relevant questions directly from the provided context. This helps users practice exam-style questions and makes it efficient for educators and students to generate and practice questions quickly.
Figure 5.8: Short Question/Answer
Figure 5.9: MCQ Generation Input
Chapter 6
Conclusion
6.1 Conclusions
To put it briefly, our study creates an intelligent question-answer system specifically designed to meet the educational needs of Bangla-speaking learners, focusing primarily on question-answer generation for learning efficiency. We accomplish this by utilizing the BanglaT5 model, fine-tuned to our needs on our own dataset, to build a system that processes SQuAD-format data alongside an efficient RAG pipeline. Our goal is to develop a system that can generate quizzes and give feedback on context-based questions for Bengali-medium students. We firmly believe our application will help Bengali-speaking students on their educational paths while also improving their proficiency through self-assessment. Further study is required to extend the dataset to other books and to fine-tune the model for greater precision and effectiveness in Bangla AI-driven learning.
6.2 Limitations
No research is without limitations, and this paper is no different. The following limitations were identified:
1. One of the main limitations of the system is the limited dataset for training
and fine-tuning the model. There are very few resources when it comes to
Bangla language compared to other languages.
2. If the given context is vague or irrelevant, the intelligent system fails to give a feasible response. It is trained on fact-based answers and is therefore incapable of critical thinking or data analysis.
6.3 Future Works
Our study primarily focuses on the BanglaT5 model and on data collected from the NCTB curriculum textbooks for classes 9-10, namely Bangladesh and Global Studies and Bangla First Paper. For future improvement of the system, more data can be collected from other textbooks to enhance the model's knowledge base. The enhanced dataset could also be used to train LLMs such as Llama or GPT-4 for better generative performance and accuracy in this question-answering system; these models can handle complex queries, offer a better understanding of the Bangla language, and generate smarter context-based answers where the BanglaT5 model is limited. Moreover, multimodal input and cloud or container deployment will make the tool more flexible and scalable for both educational and real-world environments. All of these features work towards a smarter, unified, and accessible Bangla QA system for widespread use.
Bibliography
[10] M. Lewis, Y. Liu, N. Goyal, et al., “Bart: Denoising sequence-to-sequence pre-
training for natural language generation, translation, and comprehension,” in
Proceedings of the 58th Annual Meeting of the Association for Computational
Linguistics, 2020, pp. 7871–7880. doi: 10.18653/v1/2020.acl-main.703.
[11] P. Lewis, E. Perez, A. Piktus, et al., “Retrieval-augmented generation for
knowledge-intensive NLP tasks,” in Advances in Neural Information Process-
ing Systems (NeurIPS), 2020. doi: 10.48550/arXiv.2005.11401. [Online].
Available: https://arxiv.org/abs/2005.11401.
[12] A. Patel and A. Ghosh, “Localized learning tools: The missing link in educa-
tional ai,” AI for Inclusive Education, vol. 5, no. 1, pp. 41–56, 2020.
[13] M. Bommadi, S. Terupally, and R. Mamidi, “Automatic learning assistant in
telugu,” in Proceedings of the 1st Workshop on Document-Grounded Dialogue
and Conversational Question Answering (DialDoc), 2021, pp. 29–37. doi: 10.
18653/v1/2021.dialdoc-1.4.
[14] M. H. Chowdhury and R. Shahnaz, “Digital divide in Bangladesh: Challenges and policy implications,” Journal of Information Policy, vol. 11, pp. 202–223, 2021.
[15] M. M. Hasan, A. Roy, and M. T. Hasan, “Alapi: An automated voice chat system in Bangla language,” in 2021 International Conference on Electronics, Communications and Information Technology (ICECIT), Khulna, Bangladesh, 2021, pp. 1–4. doi: 10.1109/ICECIT54077.2021.9641323.
[16] F. S. Khan, M. A. Mushabbir, M. S. Irbaz, and M. A. A. Nasim, “End-to-end natural language understanding pipeline for Bangla conversational agents,” in 2021 20th IEEE International Conference on Machine Learning and Applications (ICMLA), Pasadena, CA, USA, 2021, pp. 205–210. doi: 10.1109/ICMLA52953.2021.00039.
[17] N. Mehrabi, F. Morstatter, N. Saxena, K. Lerman, and A. Galstyan, “A survey
on bias and fairness in machine learning,” ACM Computing Surveys (CSUR),
vol. 54, no. 6, pp. 1–35, 2021.
[18] F. Rahim and S. Munna, “Technology access and digital literacy in rural
education,” Journal of ICT and Development, vol. 9, no. 2, pp. 12–26, 2021.
[19] O. Rahman and S. Talukder, “Underrepresentation of Bangla in modern NLP and its consequences,” Journal of South Asian AI Research, vol. 3, no. 2, pp. 55–70, 2021.
[20] A. Bhattacharjee, T. Hasan, W. U. Ahmad, and R. Shahriyar, “BanglaNLG: Benchmarks and resources for evaluating low-resource natural language generation in Bangla,” CoRR, vol. abs/2205.11081, 2022. arXiv: 2205.11081. [Online]. Available: https://arxiv.org/abs/2205.11081.
[21] T. Hasan, N. Alam, R. U. Ahmed, et al., “BanglaAI: Building NLP tools for low-resource Bangla language,” in Proceedings of the Language Resources and Evaluation Conference, 2022.
[22] N. Islam and S. Jahan, “AI and education in rural Bangladesh: Opportunities and infrastructural gaps,” Bangladesh Journal of ICT, vol. 5, no. 2, pp. 33–45, 2022.
[23] S. Roy, T. Sultana, and J. Islam, “BanglaNLP 2.0: Challenges and progress,” Journal of Asian Computational Linguistics, vol. 4, no. 1, pp. 66–79, 2022.
[24] R. Thoppilan, D. D. Freitas, J. Hall, et al., “LaMDA: Language models for dialog applications,” arXiv, 2022. doi: 10.48550/arXiv.2201.08239.
[25] S. Alwahaishi, “A smart interactive behavioral chatbot: A theoretical prototype,” in 2023 8th International Engineering Conference on Renewable Energy Sustainability (ieCRES), Gaza, State of Palestine, 2023, pp. 1–5. doi: 10.1109/ieCRES57315.2023.10209534.
[26] Y. Dan, Z. Lei, Y. Gu, et al., “EduChat: A large-scale language model-based chatbot system for intelligent education,” arXiv, 2023. doi: 10.48550/arXiv.2308.02773.
[27] S. Deode, J. Gadre, A. Kajale, A. Joshi, and R. Joshi, “L3Cube-IndicSBERT: A simple approach for learning cross-lingual sentence representations using multilingual BERT,” arXiv preprint arXiv:2304.11434, 2023.
[28] L. Gutiérrez, “Artificial intelligence in language education: Navigating the potential and challenges of chatbots and NLP,” RSELTL, vol. 1, no. 3, pp. 180–191, Sep. 2023. doi: 10.62583/rseltl.v1i3.44.
[29] S. S. Jennifer, S. A. Islam, S. S. Koly, R. A. Tuhin, M. S. H. Khan, and M. M. Uddin, “BilinBot: A bilingual chatbot using deep learning,” in 2023 5th International Congress on Human-Computer Interaction, Optimization and Robotic Applications (HORA), Istanbul, Turkiye, 2023, pp. 1–6. doi: 10.1109/HORA58378.2023.10156681.
[30] R. Sajja, Y. Sermet, M. Cikmaz, D. Cwiertny, and I. Demir, “Artificial intelligence-enabled intelligent assistant for personalized and adaptive learning in higher education,” arXiv, 2023. doi: 10.48550/arXiv.2309.10892.
[31] M. S. Salim, H. Murad, D. Das, and F. Ahmed, “BanglaGPT: A generative pretrained transformer-based model for Bangla language,” in 2023 International Conference on Information and Communication Technology for Sustainable Development (ICICT4SD), 2023, pp. 56–59. doi: 10.1109/ICICT4SD59951.2023.10303383.
[32] S. Sarker, A. Rahman, and M. Khandakar, “Building annotated datasets for Bangla NLP: Issues and directions,” Transactions in Asian Language Processing, vol. 12, no. 1, pp. 1–15, 2023.
[33] I. M. Zulkarnain, S. B. Islam, M. Z. A. Z. Farabe, et al., BBOCR: An open-source multi-domain OCR pipeline for Bengali documents, 2023. arXiv: 2308.10647 [cs.CV]. [Online]. Available: https://arxiv.org/abs/2308.10647.
[34] A. Asai, Z. Wu, Y. Wang, A. Sil, and H. Hajishirzi, “Self-RAG: Learning to retrieve, generate, and critique through self-reflection,” in Proceedings of the 12th International Conference on Learning Representations (ICLR), 2024. doi: 10.48550/arXiv.2310.11511. [Online]. Available: https://arxiv.org/abs/2310.11511.
[35] M. Douze, A. Guzhva, C. Deng, et al., “The Faiss library,” arXiv preprint arXiv:2401.08281, 2024. arXiv: 2401.08281 [cs.LG].
[36] Y. Gao, Y. Xiong, X. Gao, et al., “Retrieval-augmented generation for large language models: A survey,” arXiv preprint arXiv:2312.10997, 2024. doi: 10.48550/arXiv.2312.10997. [Online]. Available: https://arxiv.org/abs/2312.10997.
[37] N. Nawal, S. Basak, and R. Shahriyar, “Effective retrieval-augmented generation for open domain question answering in Bengali,” Ph.D. dissertation, Jul. 2024. doi: 10.13140/RG.2.2.15361.06248.
[38] P. Narayan, R. T, T. Mg, K. Krishna, and P. V, “Retrieval-augmented generation for multiple-choice questions and answers generation,” Procedia Computer Science, vol. 259, Jan. 2025. doi: 10.1016/j.procs.2025.03.352.