

IntellBot: Retrieval Augmented LLM Chatbot for Cyber Threat Knowledge Delivery

Dincy R. Arikkat¹, Abhinav M.¹, Navya Binu¹, Parvathi M.¹, Navya Biju¹, K. S. Arunima¹,
Vinod P.²,¹, Rafidha Rehiman K. A.¹, and Mauro Conti²

¹ Department of Computer Applications, Cochin University of Science and Technology, Kerala, India
² Department of Mathematics, University of Padua, Padua, Italy

arXiv:2411.05442v1 [cs.IR] 8 Nov 2024

Abstract. In the rapidly evolving landscape of cyber security, intelligent
chatbots are gaining prominence. Artificial Intelligence, Machine
Learning, and Natural Language Processing empower these chatbots to
handle user inquiries and deliver threat intelligence. This makes cyber
security knowledge readily available to both professionals and the public.
Traditional rule-based chatbots often lack flexibility and struggle to
adapt to user interactions. In contrast, Large Language Model-based
chatbots offer contextually relevant information across multiple domains
and adapt to evolving conversational contexts. In this work, we develop
IntellBot, an advanced cyber security Chatbot built on top of cutting-
edge technologies like Large Language Models and Langchain alongside a
Retrieval-Augmented Generation model to deliver superior capabilities.
This chatbot gathers information from diverse data sources to create a
comprehensive knowledge base covering known vulnerabilities, recent cy-
ber attacks, and emerging threats. It delivers tailored responses, serving
as a primary hub for cyber security insights. By providing instant access
to relevant information and resources, IntellBot enhances threat
intelligence, incident response, and overall security posture, saving time
and empowering users with knowledge of cyber security best practices.
Moreover, we analyzed the performance of our copilot using a two-stage
evaluation strategy. We achieved a BERT score above 0.8 with the indirect
approach and cosine similarity scores ranging from 0.8 to 1, which affirms
the accuracy of our copilot. Additionally, we used RAGAS to evaluate
the RAG model, and all evaluation metrics consistently produced scores
above 0.77, highlighting the efficacy of our system.

Keywords: Large Language Model, Security Chatbot, Threat Intelligence,
Cyber Security, Retrieval-Augmented Generation

1 Introduction
The rise of interconnected technologies creates new vulnerabilities, leading to
a surge in complex cyber threats and demanding a faster response to security
incidents. Intelligent tools such as chatbots that can provide threat
intelligence about the current landscape are therefore crucial. Such chatbots
empower cyber security professionals by efficiently answering their queries and
keeping them informed about the evolving threat landscape [1]. Chatbots are
computer programs that mimic human conversation through text or voice-based
communication [2]. Earlier researchers used rule-based or knowledge-based
techniques to implement chatbots [3]. A rule-based chatbot
operates on a set of predefined rules and patterns. However, rule-based systems
are not very flexible: they can have trouble answering complicated or unclear
questions, and they do not learn from user interactions [4]. To process
user input and provide appropriate responses, chatbots make use of
technologies including Artificial Intelligence (AI), Machine Learning (ML), and
Natural Language Processing (NLP).
The emerging landscape of human-computer interaction and NLP has led to
the development of chatbots based on Large Language Models (LLMs).
Unlike traditional rule-based or other chatbot systems, LLM chatbots deliver
refined and contextually relevant information to users. The proficiency of
these chatbots extends to tasks such as language comprehension, sentiment
analysis, and generating logical responses across various domains [5,6]. Additionally,
they offer scalability, effectively handle diverse user queries and adapt to evolving
conversational contexts [7]. Recently, security organizations have employed chat-
bots to explore cyber incidents happening in the real world [8]. A cyber security
chatbot is an AI-powered conversational agent designed to assist organizations
in addressing cyber security related inquiries, incidents, and concerns [9]. Se-
curity chatbots provide instant support to users, ensuring timely resolution of
issues irrespective of temporal constraints. The chatbot frees up valuable hu-
man resources, allowing cyber security professionals to focus on more strategic
actions and threat mitigation efforts. The chatbot’s round-the-clock availability
goes beyond the limitations of human support staff, guaranteeing uninterrupted
assistance to users worldwide. It helps to reduce the search time for cyber secu-
rity engineers by providing instant access to relevant information, solutions, and
resources, optimizing the troubleshooting and resolution of security incidents.
This paper looks into the architecture, implementation, and advantages of an
advanced cyber security chatbot named IntellBot, empowered by cutting-edge
technologies such as LLM and Langchain. This chatbot offers cyber security in-
sights and advice, benefiting not only security experts and organizations but also
the general public. By providing easily accessible information and resources, it
helps users take proactive measures to protect themselves and their
organizations from cyber threats. The efficiency demonstrated by the chatbot
yields cost savings through reduced manual intervention and saves time by
delivering quick and accurate responses to inquiries.
The major contributions of this paper include:

– We compile a security knowledge base from diverse sources, including APT
reports, security books, the National Vulnerability Database (NVD), security
blog articles, and reports from threat intelligence platforms. Our dataset
encompasses a total of 2,447 PDF documents, consisting of 445 cyber security
books and 2,002 Advanced Persistent Threat (APT) reports. Additionally,
we gathered details of 7,989 malicious file hashes from VirusTotal, storing
the responses in JSON format. Furthermore, we compiled 2,959 URL details.
Moreover, we gathered 197,256 vulnerability details and scraped 21,825
security blog posts, all saved in CSV format.
– We develop an AI-based cyber security copilot by integrating the comprehensive
security knowledge base with cutting-edge Large Language Model (LLM)
capabilities.
– We evaluate the performance of IntellBot with a two-stage approach. First,
we analyze the performance through an indirect approach, in which the
BERT score is calculated between the user query and five bot-generated
questions derived from the response to the query. In the second stage, we
measure the cosine similarity between the bot-generated responses and human
responses for the same query using Word2Vec, GloVe, and BERT embeddings.
The rest of the article is organized as follows: Section 2 reviews related
research, Section 3 explains the background for the study, Section 4
outlines the methodology employed in this study, Section 5 presents the
experimental results, including the interface and IntellBot output, and Section 6
concludes the work and proposes future research directions.

2 Related work
Large language models excel in various NLP tasks, including creative text gen-
eration, translation, and question answering [10]. These AI models, empowered
by massive amounts of text data, offer versatility due to their ability to han-
dle unseen data and require less task-specific training than traditional models.
Notably, LLMs are advantageous when data scarcity is a concern. However, fine-
tuned models may still be preferable for tasks with abundant data. While LLMs
hold significant promise, real-world applicability, interpretability, and potential
biases remain challenges. A few research works are available in the state of the
art, integrating NLP and LLM for chatbot creation.
In [11], the authors proposed an approach for using chatbots to improve
communication between citizens and government. The chatbots help to under-
stand citizens’ queries and provide them with information or complete trans-
actions. Pandya et al. [12] introduced Sahaay, an open-source system that uses
LLM to automate customer service. Sahaay directly retrieves information from
a company’s website and leverages Google’s Flan T5 LLM to answer customer
queries. The model is further trained with Hugging Face Instructor Embedding
to improve its understanding of meaning. Looking to address the limitations of
traditional, inflexible methods in entrepreneurship education, a new study [13]
proposed a groundbreaking solution: an LLM chatbot acting as a startup coach
simulator. This system tackles the iterative nature of startup development by
leveraging a conversational interface powered by LLMs like GPT-3. They created
the chatbot to give users real-time feedback on crucial aspects such as their
business model, product-market fit, and financial projections. In [14], the
authors proposed an LLM-based chatbot for the Fintech industry. They integrated
technologies, including Langchain, to enhance response accuracy and relevance.
Initial findings suggest a positive impact on user knowledge regarding the
Digital Tenge project.
Oliveira et al. [15] proposed a solution to improve student communication
and information access. Their virtual assistant chatbot offers quick and
accurate answers to frequently asked questions about admissions, programs,
registrations, financial aid, and campus resources. This eliminates the need for
students to spend time searching for information manually. Building on LLM
advancements, Arora et al. [16] introduced JEEBENCH, a challenging dataset assessing
reasoning through complex physics, chemistry, and maths problems. They gath-
ered data from engineering exams and identified shortcomings in existing LLMs,
such as GPT-4, particularly in maths core concepts, algebra, and arithmetic.
This highlights the need for future research on integrating mathematical logic
into LLMs and evaluating their decision-making abilities.
A recent study by McIntosh et al. [17] introduces a novel chatbot assistant
specifically designed for cyber security Governance, Risk Management, and Com-
pliance (GRC). This virtual assistant leverages GPT-based models to focus on
policy development within the GRC domain, targeting researchers, practition-
ers, and organizations. Furthermore, they employ Game Theory to assess the
effectiveness of policies generated by GPTs. SecBot [9] is a conversational agent
built using ML and NLP techniques to assist non-experts with cyber security de-
cisions. They implemented Dual Intent and Entity Transformer (DIET), which
extracts entities from the text using Conditional Random Fields (CRFs). Their
knowledge databases store information on cyber security threats and solutions,
allowing SecBot to recommend actions to avoid or mitigate problems. In [18],
Bessani et al. investigated how LLMs can be used to improve threat awareness
and detection. The study proposes using chatbots for tasks like binary
classification and Named Entity Recognition (NER) to generate Indicators of
Compromise (IoCs) that can be used for proactive cyber threat mitigation. To
evaluate their effectiveness, the performance of these LLM chatbots is compared
against specialized models using real-world data collected from Twitter. The
findings of this research emphasize the value of LLMs in the realm of Cyber
Threat Intelligence (CTI) and showcase the project’s utilization of GPT-based
and open-source chatbot technologies.

3 Background

3.1 Langchain

Langchain³ is a Python library that has been developed to simplify the creation
of applications that utilize LLMs. It has emerged as a valuable resource for
constructing intelligent chatbots. The core of Langchain is built on a modular and
extensible framework that abstracts the complexities of working with LLMs. It
provides a range of fundamental tools for loading and managing LLM models
from different sources, tokenizing input text, batching requests, and managing
other common LLM operations. A notable feature of Langchain is its
compatibility with a diverse selection of LLM models from prominent providers like
OpenAI⁴, Anthropic⁵, Cohere⁶, and AI21⁷, as well as open-source models such
as LlamaCP⁸ and GPT-J⁹.

³ https://www.langchain.com/

Furthermore, Langchain offers advanced tools for enhancing the capabilities
of LLMs, which are particularly beneficial in the context of a cyber security chat-
bot. These tools encompass memory components for preserving conversational
context, agents for breaking down complex tasks into manageable subtasks, and
mechanisms for integrating knowledge bases. In building our cyber security chat-
bot, we leveraged several key components from the Langchain framework. These
key components are as follows:
– Prompts: Prompts are crafted instructions given to a language model to
guide its responses in specific contexts. Prompts could include structured
queries or instructions to ensure that the model provides relevant and accu-
rate information.
– Document Loader: Document loaders are tools or scripts used to ingest and
preprocess documents. They load documents from a file system,
database, or any other storage medium and prepare them for input into
the language model.
– Agents: An agent serves as the intermediary between humans and the LLM.
It interacts with the LLM to execute tasks or generate outputs according to
provided instructions. This can take the form of a script, program, or user-
friendly chat interface. The agent receives inputs, crafts a prompt to guide
the LLM, sends it to the LLM, and handles the LLM’s response to achieve
the intended result.
– Chains: The chain represents the sequence of operations or steps that the
agent follows to interact with the language model. This could include prepro-
cessing inputs, sending queries to the language model, receiving responses,
and post-processing or formatting the outputs as needed.
– LLM: LangChain’s core engine is the LLM, acting as the language model
itself. It is responsible for understanding inputs (prompts or queries) in nat-
ural language and generating appropriate responses or outputs based on its
training data and algorithms.
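
The way these components fit together can be illustrated with a minimal, library-free sketch. It is not Langchain's API: the names `SimplePrompt`, `fake_llm`, and `run_chain` are hypothetical, and the stub model only echoes the question so the example stays self-contained.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SimplePrompt:
    """Hypothetical stand-in for a prompt template component."""
    template: str

    def format(self, **values: str) -> str:
        return self.template.format(**values)


def fake_llm(prompt: str) -> str:
    """Stub LLM (assumption): echoes the last prompt line instead of calling a model."""
    last_line = prompt.splitlines()[-1]
    return f"[stub answer to] {last_line}"


def run_chain(steps: List[Callable[[str], str]], user_input: str) -> str:
    """A chain is an ordered list of steps; each step's output feeds the next."""
    value = user_input
    for step in steps:
        value = step(value)
    return value


prompt = SimplePrompt("You are a cyber security expert.\nQuestion: {question}")
chain = [
    str.strip,                              # preprocess the raw input
    lambda q: prompt.format(question=q),    # craft the prompt
    fake_llm,                               # query the (stub) language model
    str.rstrip,                             # post-process the output
]
answer = run_chain(chain, "  What is phishing?  ")
print(answer)
```

In a real deployment the stub would be replaced by a call to an actual model, while the surrounding sequence of steps stays the same.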

3.2 RAG Model


Understanding how Retrieval-Augmented Generation (RAG) models work with
LLM is essential for optimizing chatbots. LLMs lay the groundwork, empowering
chatbots to understand and generate human-like text. When integrated with
RAG models, which combine retrieval-based and generation-based approaches,
chatbots can autonomously derive precise responses from documents. The typical
RAG framework begins when a user submits a query or question, typically as
a text prompt seeking a detailed and accurate response. This query is initially
processed by the retriever model, which searches a large database or document
corpus to identify and retrieve the most relevant information. The model employs
techniques such as sparse or dense vector search, utilizing methods like TF-IDF,
BM25, or dense embeddings from transformer models like BERT. It selects a
set of top k relevant documents or passages most likely to contain the necessary
information.

⁴ https://openai.com/
⁵ https://www.anthropic.com/
⁶ https://cohere.com/
⁷ https://www.ai21.com/
⁸ https://github.com/ggerganov/llama.cpp
⁹ https://www.eleuther.ai/artifacts/gpt-j

Once the relevant documents are retrieved, they are prepared for the gener-
ative phase. This preparation involves extracting and formatting the pertinent
content to provide context for the generative model. The generative model, of-
ten a transformer-based language model like GPT, receives both the original
query and the retrieved documents as input. By integrating this information,
the model generates a coherent and contextually appropriate response, combin-
ing insights from the retrieved documents with its pre-trained knowledge. This
response is then delivered to the user, grounded in the retrieved information,
which enhances its reliability and reduces the likelihood of factual inaccuracies
or hallucinations. RAG ensures that responses generated by an LLM are not
solely dependent on static or outdated training data. Rather, the model utilizes
the dataset containing different documents to deliver accurate responses.
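
The retrieve-then-generate flow described above can be sketched with a small sparse retriever. This is only an illustration under stated assumptions: it uses a simple TF-IDF weighting rather than the dense embeddings IntellBot uses, the three-sentence corpus is invented, and `retrieve` and `build_prompt` are hypothetical helper names. The generation step is left out; the sketch stops at the prompt that would be passed to the generative model.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    return text.lower().split()


def tfidf_vectors(docs: List[str]) -> Tuple[List[Dict[str, float]], Dict[str, float]]:
    """Build one sparse TF-IDF vector per document, plus the IDF table."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in doc_freq.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(toks).items()} for toks in tokenized]
    return vectors, idf


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, docs: List[str], k: int = 2) -> List[str]:
    """Return the k documents most similar to the query."""
    vectors, idf = tfidf_vectors(docs)
    q_vec = {t: tf * idf.get(t, 0.0) for t, tf in Counter(tokenize(query)).items()}
    ranked = sorted(range(len(docs)), key=lambda i: cosine(q_vec, vectors[i]), reverse=True)
    return [docs[i] for i in ranked[:k]]


def build_prompt(query: str, passages: List[str]) -> str:
    """Combine the query with retrieved passages as context for the generator."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\n\nQuestion: {query}"


corpus = [
    "FIN8 deployed the Sardonic backdoor in recent intrusions",
    "Phishing emails often carry malicious attachments",
    "The NVD lists known software vulnerabilities",
]
query = "Which backdoor did FIN8 deploy"
top = retrieve(query, corpus, k=1)
print(build_prompt(query, top))
```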

4 Methodology

This section outlines the systematic procedure employed in the development of a
cyber security chatbot using LLM technology. The comprehensive architecture of
IntellBot is depicted in Fig. 1. Our system encompasses three phases: (i) Creation
of the Security Knowledge Base, (ii) Generation of the Query Response Interface,
and (iii) Evaluation of Chatbot Responses against those generated by humans
(Subject Matter Experts). The creation of the security knowledge base involves data
collection, document loading, text segmentation, conversion of textual data into
numerical format, and storage of embedding vectors in the Facebook AI
Similarity Search (FAISS) vector store. Subsequently, a user-friendly interface was
developed using Streamlit, enabling users to pose security-related queries. The
interface and response generation were implemented using the Large
Language Model. Finally, we evaluate IntellBot’s responses using a two-stage
strategy involving indirect proof and cosine similarity. Detailed descriptions of
all procedures are provided in the subsequent subsections.

4.1 Data Collection

Fig. 1: Proposed architecture of the security chatbot created using the Large
Language Model

The study employed four different types of documents to create a comprehensive
dataset for analysis: PDFs, CSVs, JSONs, and URLs. The
dataset consisted of a total of 2,447 PDF documents, comprising 445 cyber security
books drawn from different GitHub pages and 2,002 Advanced Persistent
Threat (APT) reports sourced from vx-underground¹⁰, providing detailed
information on various malware attacks and procedures. We also collected data
from the threat intelligence platform VirusTotal¹¹: we searched for 7,989
malicious file hashes in VirusTotal and saved the responses in JSON format.
Moreover, custom crawlers were developed to collect information from websites
such as KrebsOnSecurity and Malwarebytes, focusing on URLs related to cyber
threats and security incidents. Additionally, vulnerability information was
gathered from the National Vulnerability Database (NVD)¹². Furthermore, we
scraped security-related blogs from platforms like HackerNews¹³,
BleepingComputer¹⁴, CISCO¹⁵, CSO¹⁶, etc., and saved the blog content and titles
into CSV files. We collected 197,256 vulnerability details and 21,825 blog posts,
which are saved in CSV format. Gathering data from these diverse sources
enriched the cyber security dataset, making it more comprehensive and valuable
for analysis and research.

¹⁰ https://vx-underground.org/APTs
¹¹ https://www.virustotal.com/gui/home/upload
¹² https://nvd.nist.gov/vuln/data-feeds

Furthermore, we performed some data preprocessing steps that aim to im-
prove data quality and usability. In the case of CSV and JSON data, preprocess-
ing typically includes identifying and removing duplicate entries to ensure data
integrity and accuracy. Dealing with HTML data obtained through web crawling
requires special attention, as it often contains newline and tab characters that
can interfere with the analysis. To tackle this issue, preprocessing techniques
are applied to eliminate or substitute these characters, thus maintaining the
consistency and cleanliness of the data.
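
A minimal sketch of these two preprocessing steps is shown below. The field names (`title`, `content`) and the helper names are assumptions chosen for illustration; the paper does not specify the exact columns or the substitution used.

```python
import re
from typing import Dict, List, Sequence


def drop_duplicates(rows: List[Dict[str, str]], key_fields: Sequence[str]) -> List[Dict[str, str]]:
    """Keep the first occurrence of each record, comparing only key_fields."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row.get(field, "") for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def clean_whitespace(text: str) -> str:
    """Replace runs of newlines, tabs, and spaces with a single space."""
    return re.sub(r"\s+", " ", text).strip()


rows = [
    {"title": "A", "content": "x"},
    {"title": "A", "content": "x"},   # exact duplicate, dropped
    {"title": "B", "content": "y"},
]
print(drop_duplicates(rows, ["title", "content"]))
print(clean_whitespace("FIN7\n\tdeploys  Cl0p\r\n"))
```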

4.2 Document Loading

Different document loaders were employed to load data in formats such
as PDFs, CSVs, JSONs, and URLs. The PyPDFLoader was used to load all
2,447 PDF documents and extract their metadata, page content,
and page numbers. The JSONLoader handled the JSON files obtained from the
VirusTotal website, extracting data using the jq Python
package. The RecursiveUrlLoader used a tailored extractor function relying on
BeautifulSoup to traverse webpages starting from root URLs. It then extracted
text content from these pages to facilitate subsequent processing. Additionally,
data from 38 CSV files was loaded using the CSVLoader. Its load function returns
structured data from these files, making them suitable for further analysis
in our research. Also, content scraped from 7,889 web pages of hackernews sites
was collected and saved as CSV files.

4.3 Text splitting

Text splitting divides the input text into smaller units, such as words or
tokens, allowing for easier data processing and extraction of meaningful features.
This breakdown enables us to better understand the user’s queries or commands
and enhance comprehension. In our work, we used the Recursive Character
Text Splitter to carry out this task, with the parameters chunk size
and chunk overlap. The chunk size parameter determines the maximum length
of the individual segments into which a text document is divided. It specifies each
chunk’s number of characters, words, or tokens. We fixed the chunk size at 1000
to ensure consistent and manageable segments for analysis. The chunk overlap
indicates the level of shared content between consecutive text segments. A higher
overlap suggests more redundancy among chunks, while a lower overlap may lead
to more distinct segments. In our study, we chose a chunk overlap value of 50
to balance overlap and coherence, maintaining continuity across segmented text
data.

¹³ https://thehackernews.com/
¹⁴ https://www.bleepingcomputer.com/news/security/
¹⁵ https://blog.talosintelligence.com/
¹⁶ https://www.csoonline.com/in/cybercrime/page/
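
The effect of the two parameters can be seen in a simplified sketch. It is a plain sliding window over characters, not the Recursive Character Text Splitter itself (which also tries to break on separators such as paragraphs and sentences); `split_text` is a hypothetical name, and the defaults mirror the chunk size of 1000 and overlap of 50 stated above, assuming both are counted in characters.

```python
from typing import List


def split_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 50) -> List[str]:
    """Split text into fixed-size character chunks that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


sample = "".join(chr(97 + i % 26) for i in range(2500))
chunks = split_text(sample)
print([len(c) for c in chunks])
```

Each chunk starts 950 characters after the previous one, so the last 50 characters of one chunk reappear at the start of the next.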

4.4 Conversion of text into Numerical format

Embedding techniques are crucial for transforming textual data into a
numerical representation. This methodology involves representing chunks of text as
vectors, which allows for the effective capture of semantic relationships and
similarities. In our work, we employed the Sentence Transformer model from the
Hugging Face framework to generate these embeddings, specifically
the model “sentence-transformers/all-mpnet-base-v2”¹⁷. We used
embeddings twice within our chatbot system. First, we used
embeddings to encode the text chunks produced by the text splitting phase.
Second, we employed embeddings to encode the questions posed by users.
To accomplish this, we utilized the OpenAI model (GPT-3.5-turbo), using its
capabilities to convert user queries into vector representations. By incorporating
embeddings in both stages of the chatbot, we improved the system’s capacity
to understand and respond to user queries efficiently. This resulted in seamless
interaction and enhanced information retrieval within the chatbot system.

4.5 Creation of Vector Store

After converting the data into numerical format, we stored it in a vector store,
a specialized database designed for efficient storage and retrieval of items. Each
piece of data within the vector store is represented as a high-dimensional vector in
a multi-dimensional space, allowing for comparisons and similarity calculations.
To improve storage efficiency and query performance, we used the Facebook
AI Similarity Search (FAISS)¹⁸ framework for constructing and managing the
vector store. FAISS is well-known for its capability to optimize similarity search
operations, ensuring quick and accurate retrieval of similar items from extensive
datasets. We created separate vector stores for each data type, including PDFs,
CSV files, URLs, and JSON data. To amalgamate the outputs from these various
vector stores and retrievers into a cohesive final response, we employed an en-
semble retriever. This ensemble method enhances response accuracy and depth
by leveraging the unique strengths of each individual retriever. By embracing
this approach, we balanced storage efficiency and query performance, enabling
swift and effective retrieval of similar items from the vector store.
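
The text does not state how the ensemble retriever merges the per-store results. One common choice is weighted reciprocal rank fusion, sketched here purely as an assumption; the document identifiers, the function name, and the constant `c = 60` are illustrative rather than taken from the paper.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Sequence


def reciprocal_rank_fusion(
    rankings: Sequence[List[str]],
    weights: Optional[Sequence[float]] = None,
    c: int = 60,
) -> List[str]:
    """Merge several ranked lists; items ranked highly by several retrievers rise to the top."""
    if weights is None:
        weights = [1.0] * len(rankings)
    scores: Dict[str, float] = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (c + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical results from the PDF, CSV, and JSON vector stores.
pdf_hits = ["apt_report_12", "book_3"]
csv_hits = ["blog_88", "apt_report_12"]
json_hits = ["hash_5"]
fused = reciprocal_rank_fusion([pdf_hits, csv_hits, json_hits])
print(fused)
```
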
¹⁷ https://huggingface.co/sentence-transformers/all-mpnet-base-v2
¹⁸ https://python.langchain.com/docs/integrations/vectorstores/faiss/

Similarity Search: Similarity search is a critical component of document
retrieval systems, especially in the context of responding to user queries by
identifying pertinent documents from a large corpus. When a query is submitted,
the similarity search algorithm calculates similarity scores between the query
and each document, facilitating the retrieval of the top_k (in our case 3) most
similar contexts. These retrieved contexts are then used as the foundation for
generating responses to the query. Furthermore, the file names of the retrieved
documents are stored in the ‘metadata’ parameter for reference and tracking
purposes. After retrieving the most similar documents, the system can create
responses to the query by analyzing the content of these documents. By utilizing
the information found in the retrieved documents, the system can offer relevant
and informative responses that effectively address the user’s query.
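
As a rough illustration of top_k retrieval with source tracking, the sketch below performs a brute-force cosine search over a tiny in-memory store. It does not use FAISS, whose indexes and distance metric may differ; the three-dimensional vectors, file names, and `top_k_search` helper are invented for the example.

```python
import math
from typing import Any, Dict, List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_search(query_vec: List[float], store: List[Dict[str, Any]], k: int = 3) -> List[Dict[str, Any]]:
    """Score every stored chunk against the query and keep the k best, with metadata."""
    scored = [(cosine_similarity(query_vec, entry["vector"]), entry) for entry in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [
        {"score": score, "text": entry["text"], "metadata": entry["metadata"]}
        for score, entry in scored[:k]
    ]


store = [
    {"vector": [1.0, 0.0, 0.0], "text": "chunk about FIN8", "metadata": {"source": "apt_fin8.pdf"}},
    {"vector": [0.9, 0.1, 0.0], "text": "chunk about FIN7", "metadata": {"source": "blog_fin7.csv"}},
    {"vector": [0.0, 1.0, 0.0], "text": "chunk about a CVE", "metadata": {"source": "nvd.csv"}},
    {"vector": [0.0, 0.0, 1.0], "text": "chunk about a hash", "metadata": {"source": "hashes.json"}},
]
hits = top_k_search([1.0, 0.05, 0.0], store, k=3)
for hit in hits:
    print(round(hit["score"], 3), hit["metadata"]["source"])
```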

4.6 Designing Prompt


This work uses a prompt-based approach to design a cyber security chatbot.
A prompt, in this context, is a structured instruction provided to the chatbot
that guides it in formulating a response to user queries. It acts as a bridge
between the user’s question and the chatbot’s internal knowledge base. This
chatbot is designed to mimic a human cyber security expert by utilizing various
information sources, such as context1, context2, and context3, from our vector
store and presenting the response. When the user encounters a cyber security
issue and poses a query, the chatbot analyzes this context alongside the user’s
specific query.

[System]
"You are a cyber security expert. Provide the responses for the question considering the
context below. Your responses should consider factors such as the relevance, accuracy,
depth, creativity, and level of detail of their responses."

[User Query]
{What was the name of the ransomware used by FIN8?}

[context 1]
{the ability to drop arbitrary files and exfiltrate file contents from the compromised
machine to an actor-controlled infrastructure. This is not the first time FIN8 has been
detected using Sardonic in connection with a ransomware attack...}

[context 2]
{CONTENT: An exhaustive analysis of FIN7 has unmasked the cybercrime syndicate’s
organizational hierarchy, alongside unraveling its role as an affiliate for mounting
ransomware attacks...}

[context 3]
{CONTENT: The notorious cybercrime group known as FIN7 has been observed de-
ploying Cl0p (aka Clop) ransomware, marking the threat actor’s first ransomware cam-
paign in late 2021...}

[Response]
FIN8 has been detected using the White Rabbit ransomware, which is based on Sar-
donic.
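
The layout of the prompt shown above can be assembled programmatically. The sketch below is one possible way to do so: the system message is copied from the example, while `build_prompt`, the placeholder context strings, and the exact line layout are assumptions rather than the authors' template.

```python
from typing import List

SYSTEM_MESSAGE = (
    "You are a cyber security expert. Provide the responses for the question "
    "considering the context below. Your responses should consider factors such as "
    "the relevance, accuracy, depth, creativity, and level of detail of their responses."
)


def build_prompt(query: str, contexts: List[str]) -> str:
    """Lay out the system message, user query, and retrieved contexts in order."""
    parts = ["[System]", SYSTEM_MESSAGE, "", "[User Query]", query, ""]
    for index, context in enumerate(contexts, start=1):
        parts += [f"[context {index}]", context, ""]
    parts.append("[Response]")  # the model's answer is generated after this marker
    return "\n".join(parts)


example = build_prompt(
    "What was the name of the ransomware used by FIN8?",
    ["first retrieved chunk", "second retrieved chunk", "third retrieved chunk"],
)
print(example)
```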

4.7 Streamlit User Interface

To streamline interface development for our cyber security chatbot, we used
Streamlit¹⁹, a framework known for its simplicity and rapid application
development capabilities. Streamlit proved highly advantageous in creating an
interactive interface with minimal coding requirements, aligning well with
our data processing framework, LangChain. We used Streamlit version 1.32.2,
which provides a chat_input function for entering user queries.

4.8 Response Evaluation

After developing the IntellBot for querying security related questions, we need to
evaluate the accuracy of its responses. To assess the performance of the bot, we
propose a two-stage evaluation approach. Initially, we assessed the correctness of
our IntellBot’s responses to a query by generating five bot questions based on the
retrieved response. We then validated the correctness of the response through
indirect proof. Subsequently, we strengthened the evaluation by providing man-
ual answers to the questions and comparing them with the bot’s response using
similarity scores.
Step 1: Response Evaluation using bot-generated questions
In this stage, we employed an indirect proof strategy. We determined the
bot answer $A$ to the cyber security query $Q_0$ and generated five subsequent bot
questions $Q_1, Q_2, \ldots, Q_5$ based on $A$. The validation process involved comparing
these generated questions with the original query $Q_0$ to assess the correctness of
$A$.
Then, we applied indirect proof or proof by contradiction, which asserts that
assuming a statement’s opposite leads to a logical inconsistency. It relies on
the following logic: If the bot answer $A$ was incorrect, the questions $\{Q_i\}_{i=1}^{5}$
derived from $A$ would logically diverge from $Q_0$. However, if the similarity score,
$\mathrm{sim}(Q_0, Q_i)$, is high for all $i$ (ranging from 1 to 5), it contradicts our initial
assumption of $A$’s incorrectness. This contradiction suggests that $A$ correctly
addresses $Q_0$.
This approach proceeds as follows:

– Assumption: Initially, we assume that the bot-generated answer $A$ to the initial
query $Q_0$ is incorrect.
– Implication: If $A$ is incorrect, the subsequent questions $\{Q_i\}_{i=1}^{5}$ generated
by the bot based on $A$ should not be similar to $Q_0$.
– Proof: The process begins with computing the BERT score between $Q_0$ and
each question in $\{Q_i\}_{i=1}^{5}$ to quantify their semantic alignment. BERT score computes
similarity scores between pairs of texts using contextual embeddings from
Bidirectional Encoder Representations from Transformers (BERT). Higher
BERT score values indicate stronger semantic coherence between the texts,
suggesting a closer alignment in meaning and context. We calculate the
average BERT score across the set of questions $\{Q_i\}_{i=1}^{5}$ generated from $A$. If
this average BERT score yields high values when compared to $Q_0$, it implies
significant semantic similarity between $Q_0$ and the bot-generated questions
$\{Q_i\}_{i=1}^{5}$.
The core of the proof lies in the contradiction analysis: assuming $A$ is
incorrect leads to an expectation that the generated questions $\{Q_i\}_{i=1}^{5}$ would
diverge in meaning from $Q_0$. However, if the observed BERT score values
between $Q_0$ and $\{Q_i\}_{i=1}^{5}$ are consistently high, it contradicts the initial
assumption of $A$’s incorrectness. This contradiction indicates that $A$ likely
provides a correct answer to $Q_0$.

¹⁹ https://streamlit.io/

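
The decision logic of this indirect check can be sketched as follows. To keep the example dependency-free, a simple token-overlap (Jaccard) score stands in for the BERT score; both that substitute and the 0.8 threshold are assumptions for illustration, and `indirect_check` is a hypothetical helper name.

```python
from statistics import mean
from typing import Callable, Dict, List


def jaccard_similarity(a: str, b: str) -> float:
    """Stand-in similarity (assumption): token overlap instead of BERT score."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def indirect_check(
    q0: str,
    generated_questions: List[str],
    sim: Callable[[str, str], float] = jaccard_similarity,
    threshold: float = 0.8,
) -> Dict[str, object]:
    """Average sim(Q0, Qi); a high average contradicts the assumption that A is incorrect."""
    if not generated_questions:
        raise ValueError("at least one generated question is required")
    scores = [sim(q0, q) for q in generated_questions]
    average = mean(scores)
    return {
        "scores": scores,
        "average": average,
        "contradicts_assumption": average >= threshold,
    }


q0 = "which ransomware did fin8 use"
aligned = indirect_check(q0, [q0] * 5)
diverged = indirect_check(q0, ["how do i reset my router password"] * 5)
print(aligned["average"], diverged["average"])
```
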
Step 2: Response Evaluation using Human Answer
In this step, we further confirm that the IntellBot’s response to a query is
correct by comparing it with a manual answer. To do this, we curated manual
answers to each question before querying IntellBot and collected the chatbot re-
sponses via the interface. Subsequently, we compared the bot-generated answer
with its corresponding human answer using cosine similarity, which measures
how close two sentences are in meaning. To measure cosine similarity, we deter-
mine the embeddings of the answers using GloVe [19], Word2Vec [20], and BERT.
Global Vectors for Word Representation (GloVe) creates dense vector representa-
tions of words based on global word co-occurrence statistics, effectively capturing
semantic information. Word2Vec generates embeddings by training neural net-
works on large text corpora, capturing semantic relationships and contextual
similarities.
BERT differs from GloVe and Word2Vec in that it is a transformer-based
model pre-trained on large-scale text data. BERT generates embeddings that
consider both left and right context in all layers, allowing it to capture deep
contextual meanings of words and sentences. Finally, we used Equation 1 to
calculate the similarity between bot and human responses:
\[
CS(A_b, A_h) = \frac{A_b \cdot A_h}{\lVert A_b \rVert \times \lVert A_h \rVert} \tag{1}
\]
where $A_b$ and $A_h$ are two vectors representing the bot answer and the human
answer, respectively, $\cdot$ represents the dot product operation (sum of products
of corresponding elements), and $\lVert A_b \rVert$ and $\lVert A_h \rVert$ represent the magnitude (length) of
vectors $A_b$ and $A_h$, calculated using the L2 norm (square root of the sum of
squares of elements).
If the similarity score is high, then it indicates that the semantic content
of the IntellBot’s response is very close to that of a manually curated correct
answer. This similarity reflects the accuracy and correctness of the IntellBot’s
response to a query.
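As a small worked instance of Equation 1, consider two hypothetical three-dimensional answer vectors. They are chosen only for illustration; the embeddings used in practice have hundreds of dimensions.

$$A_b = (1, 2, 2), \qquad A_h = (2, 1, 2)$$
$$A_b \cdot A_h = 1 \cdot 2 + 2 \cdot 1 + 2 \cdot 2 = 8, \qquad \|A_b\| = \sqrt{1 + 4 + 4} = 3, \qquad \|A_h\| = \sqrt{4 + 1 + 4} = 3$$
$$CS(A_b, A_h) = \frac{8}{3 \times 3} \approx 0.889$$

A score of 1 would mean that the two vectors point in exactly the same direction, so a value of about 0.889 indicates answers that are close in meaning but not identical.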

5 Experimental Result
Experimental Setup: Our experimentation was conducted on a Windows 11
Pro system with an Intel Core i9 processor, 32 GB of RAM, and an NVIDIA
Quadro P2000 with 5 GB GDDR5X memory. Additionally, we utilized Google
Colab to create the vector store. Our setup utilized various tools and frame-
works, including HuggingFace for instructive embedding, RetrievalQA for creat-
ing chains, OpenAI for defining the LLM, PromptTemplate for prompt defini-
tion, and FAISS for vector storage.

5.1 Response Evaluation


In this section, we provide a detailed overview of the query responses received
from our chatbot. We also evaluate the responses using a two-phase approach
based on bot-generated questions and manually written answers.
Initially, we developed IntellBot using the process presented in Section 4. We
then tested IntellBot with various queries related to vulnerabilities, attacks,
malware, and campaigns. For instance, one inquiry focused on versions of
packages that are vulnerable to Regular Expression Denial of Service (ReDoS)
via email validation, to which the bot accurately responded with the package
forms before 1.2.1, from 1.3.0, and before 1.3.2. Similarly, for queries about
the Taidoor attacks, the bot reported that the attackers used a compromised
command-and-control server in China, along with other compromised machines, for
their infrastructure, and that they employed tools such as MS-SQL attack tools,
CNC programs, and FTP servers. Subsequently, we evaluated the responses
generated by IntellBot using the two approaches discussed in Section 4.8. To do
this, we generated 100 questions from each type of document and queried
IntellBot to obtain answers.

Response Evaluation Using Bot-generated Questions: In this step, we generated
five bot questions from each query response. We then calculated the BERT score
between the initial query ($Q_0$) and each of these five generated questions
($\{Q_i\}_{i=1}^{5}$), as described in Section 4.8. Subsequently, we determined
the average similarity score.
Table 1 presents a sample of questions, their corresponding responses, the five
bot-generated questions for each response, their BERT scores, and the average
BERT score. From the table, we observe that the BERT score between the initial
query ($Q_0$) and the bot-generated questions ($\{Q_i\}_{i=1}^{5}$) is
consistently high, with per-query averages of 0.904 and 0.910. This indicates
that the generated questions closely match the content of the initial query.
We also analyzed all query responses with this method, and the average BERT
score exceeds 0.80 for every $Q_0$ and its bot-generated questions
$\{Q_i\}_{i=1}^{5}$. Based on the indirect proof discussed in Section 4.8, we
therefore take the bot-generated response to be correct for the specific query.
This finding suggests that IntellBot consistently provides accurate and
relevant responses to the questions posed, reinforcing its effectiveness in
generating appropriate answers.

Table 1: BERT Score (BS) comparison between the initial query ($Q_0$) and five
bot-generated questions ($\{Q_i\}_{i=1}^{5}$).

Query: Who were the primary victims targeted in the Mirage campaign?
IntellBot Response: The primary victims targeted in the Mirage campaign were
organizations involved in the contest for rights to survey natural gas and oil
in the South China Sea, including a high profile oil company in the
Philippines, a military organization in Taiwan, an energy company in Canada,
and organizations in Brazil, Israel, Egypt, and Nigeria.
Five Bot Questions (BS):
  1. What is the Mirage campaign and who were the primary targets? (0.921)
  2. How were the organizations involved in the contest for rights to survey
     natural gas and oil in the South China Sea affected by the Mirage
     campaign? (0.878)
  3. What were the potential motives behind the Mirage campaign? (0.924)
  4. How were the victims identified and targeted in the Mirage campaign?
     (0.945)
  5. What steps can organizations take to protect themselves from similar
     cyber attacks in the future? (0.851)
Average BS: 0.904

Query: What infrastructure was used to carry out the Taidoor attacks?
IntellBot Response: The Taidoor attacks were carried out using compromised
systems, including a primary command-and-control server located in China. The
attackers also used various tools and programs, such as an MS-SQL attack tool,
CNC programs, and a portable HTTP file server, to carry out their attacks.
Five Bot Questions (BS):
  1. How were the Taidoor attacks carried out? (0.929)
  2. What tools and programs were used by the attackers in the Taidoor
     attacks? (0.925)
  3. Where was the primary command-and-control server located for the Taidoor
     attacks? (0.907)
  4. How did the attackers gain access to the compromised systems used in the
     Taidoor attacks? (0.903)
  5. What types of attacks were carried out using the MS-SQL attack tool, CNC
     programs, and portable HTTP file server in the Taidoor attacks? (0.886)
Average BS: 0.910

Response Evaluation Using Human Answer: After evaluating the responses using
the question comparison approach, we further assessed the performance of
IntellBot by comparing its responses with manual answers. For this, we manually
wrote an answer for each query and generated a response from IntellBot. To
verify the accuracy of the response, we determined the cosine similarity
between the bot-generated response and the manual answer. This process involved
several text preprocessing steps, including converting text to lowercase,
removing stopwords, and splitting sentences into individual words to
standardize the input. We then converted each word into a numerical vector
using GloVe and Word2Vec embeddings, utilizing
GoogleNews-vectors-negative300.bin for Word2Vec and glove.6B.300d.txt for GloVe
to capture semantic meaning. In addition, we calculated cosine similarity using
BERT embeddings, which provide a more contextual representation of the text.
For this, we initialized the tokenizer and model with bert-base-uncased. After
computing the embeddings for both the bot responses and the manual answers, we
assessed the cosine similarity to quantify the alignment between them.

Table 2: Cosine similarity scores of 10 random questions, comparing each bot
answer with its human answer. The topics are represented as follows: Vul for
vulnerability-related questions, APT for APT attack-related questions, Blog for
queries from security blog articles, and VT for VirusTotal reports. Similarity
scores were calculated using three different embedding techniques: Word2Vec,
GloVe, and BERT, denoted as S1, S2, and S3, respectively.

Type: Vul
Query: What versions of the package are vulnerable to Regular Expression Denial
of Service (ReDoS) via email validation?
Human Answer: The package forms before 1.2.1, from 1.3.0, and before 1.3.2 are
vulnerable to Regular Expression Denial of Service (ReDoS) via email
validation.
Bot Answer: The package forms before 1.2.1, from 1.3.0 and before 1.3.2 are
vulnerable to Regular Expression Denial of Service (ReDoS) via email validation
S1 = 0.986, S2 = 1, S3 = 0.998

Type: Vul
Query: What vulnerability was identified in SnapGear appliances before the
3.1.4u5 firmware?
Human Answer: denial of service (complete packet loss) via a packet flood
Bot Answer: Remote attackers could cause a denial of service by flooding the
system with packets.
S1 = 0.557, S2 = 0.716, S3 = 0.979

Type: APT
Query: What infrastructure was used to carry out the Taidoor attacks?
Human Answer: C&C servers and email addresses.
Bot Answer: The attackers used a compromised command-and-control server located
in China, as well as other compromised machines that hosted their entire
infrastructure. They also used various tools such as MS-SQL attack tool, CNC
programs, and FTP servers to carry out the Taidoor attacks.
S1 = 0.492, S2 = 0.907, S3 = 0.908

Type: APT
Query: Who were the primary victims targeted in the Mirage campaign?
Human Answer: The disclosed victims included a high-profile oil company in the
Philippines, a military organization in Taiwan, an energy company in Canada,
and other entities in Brazil, Israel, Egypt, and Nigeria.
Bot Answer: The primary victims targeted in the Mirage campaign were
organizations involved in the contest for rights to survey natural gas and oil
in the South China Sea, including a high profile oil company in the
Philippines, a military organization in Taiwan, an energy company in Canada,
and organizations in Brazil, Israel, Egypt, and Nigeria.
S1 = 0.895, S2 = 0.959, S3 = 0.964

Type: APT
Query: What is the name of the tool used by the Naikon APT group for remote
administration of compromised systems?
Human Answer: The tool used by the Naikon APT group for remote administration
is called XSControl, which is described as the group’s victim management
software.
Bot Answer: The tool used by the Naikon APT group for remote administration of
compromised systems is called "xsControl/naikon."
S1 = 0.898, S2 = 0.927, S3 = 0.997

Type: Blog
Query: What type of malware infected the Turkish airline employee’s computer
and facilitated the theft of Airbus credentials?
Human Answer: The prevalent and powerful info-stealing trojan called RedLine
infected the Turkish airline employee’s computer.
Bot Answer: The malware that infected the Turkish airline employee’s computer
was an info-stealer, specifically the RedLine trojan.
S1 = 0.833, S2 = 0.864, S3 = 0.996

Type: Blog
Query: What are some aliases of the Jupyter Infostealer malware?
Human Answer: Polazert, SolarMarker, and Yellow Cockatoo
Bot Answer: Polazert, SolarMarker, and Yellow Cockatoo.
S1 = 0.764, S2 = 1, S3 = 0.999

Type: Blog
Query: When did the attack on Kaseya take place?
Human Answer: The attack on Kaseya occurred over the July 4 weekend in 2021.
Bot Answer: The attack on Kaseya took place on the July 4 weekend in 2021.
S1 = 0.946, S2 = 0.943, S3 = 1

Type: VT
Query: List the different versions of Win32/Injector.EDSK?
Human Answer: 26690, 26977, and 26603
Bot Answer: The different versions of Win32/Injector.EDSK are 26690, 26977, and
26603.
S1 = 1, S2 = 0.701, S3 = 0.977

Type: VT
Query: List the different version of Trojan.Win32.PRIVATELOADER.YXCLPZ?
Human Answer: 10.0.0.1040 and 11.0.0.1006.
Bot Answer: The different versions of Trojan.Win32.PRIVATELOADER.YXCLPZ are
"10.0.0.1040" and "11.0.0.1006".
S1 = 1, S2 = 0.970, S3 = 0.970

Fig. 2: Comparison of average cosine similarity scores generated by Word2Vec,
GloVe, and BERT embeddings related to various topics. For each topic, 100
queries were selected, and the cosine similarity between bot and human responses
was computed.
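The preprocessing and similarity steps described in this subsection (lowercasing, stopword removal, word splitting, word-vector lookup, cosine similarity) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the stopword list and the three-dimensional vectors are hypothetical toy values standing in for the 300-dimensional GloVe or Word2Vec vectors, and averaging the word vectors into one sentence vector is an assumed pooling strategy that the text does not specify.

```python
import math
import re

# Hypothetical stopword list and toy 3-dimensional vectors (illustration only).
STOPWORDS = {"the", "a", "an", "of", "on", "in", "is", "was", "to", "and"}
TOY_VECTORS = {
    "attack": [0.9, 0.1, 0.0],
    "kaseya": [0.1, 0.8, 0.1],
    "occurred": [0.5, 0.5, 0.0],
    "july": [0.0, 0.2, 0.9],
    "weekend": [0.1, 0.1, 0.8],
}


def preprocess(text):
    """Lowercase, split into word tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def sentence_vector(tokens, vectors, dim=3):
    """Average the vectors of known tokens (assumed pooling strategy)."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]


def cosine_similarity(a, b):
    """Equation 1: dot product divided by the product of the L2 norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


human = "The attack on Kaseya occurred over the July 4 weekend in 2021."
bot = "The attack on Kaseya took place on the July 4 weekend in 2021."
score = cosine_similarity(
    sentence_vector(preprocess(human), TOY_VECTORS),
    sentence_vector(preprocess(bot), TOY_VECTORS),
)
print(round(score, 3))
```

Words missing from the vector table are simply skipped here, so the two answers differ only through the word "occurred". With real pretrained vectors the vocabulary coverage, and therefore the score, would differ.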
Table 2 lists the cosine similarity between the bot answers and human answers
for 10 randomly selected queries. The table shows that the bot responses
generally align closely with the human responses. High similarity scores
between the bot-generated and human responses support the accuracy of the
IntellBot-generated answers. Although a few questions exhibit similarity scores
below 0.8, we manually verified these questions and found that the
bot-generated responses were correct and contained additional information.
Furthermore, we computed the cosine similarity scores for the responses to
400 queries related to different security topics. Subsequently, we determined
the average similarity score for each type of query. Figure 2 illustrates the
average similarity scores between bot-generated and human responses using the
different embeddings. Across the various query topics, BERT embeddings
consistently yield higher similarity than Word2Vec and GloVe embeddings.
As depicted in Figure 2, the average similarity scores between bot and human
answers range from 0.8 to 1, indicating IntellBot’s capability to provide
responses closely resembling manual answers in terms of cosine similarity.

5.2 Evaluation of RAG model

In this section, we evaluate our RAG model using the Retrieval-Augmented
Generation Assessment System (RAGAS) [21]. RAGAS provides a comprehensive
evaluation by assessing both the retrieval and generation aspects of the model
rather than just the final response. For evaluating retrieval systems, RAGAS
employs metrics such as context_recall and context_precision. For generation
assessment, it utilizes metrics like faithfulness to identify hallucinations
and answer_relevancy to measure how precisely the answers address the queries.

– Faithfulness: Measures the factual accuracy of the generated answer (A)
with respect to the provided context (C). This is performed in two steps:
Given a query (Q) and the generated answer (A), an LLM extracts a set
of statements $S(A) = \{s_1, s_2, \ldots, s_n\}$ from A. Then, each statement
$s_i \in S(A)$ is verified against the context C to determine if it is
supported. The faithfulness score F is computed as:
$$F = \frac{|T|}{|S|}$$
where $|T|$ is the number of statements supported by the context C and $|S|$
is the total number of statements in S(A).
– Answer Relevancy: Evaluates how relevant and precise the generated answer
(A) is to the query (Q). RAGAS uses an LLM to generate m potential
questions $\{q_1, q_2, \ldots, q_m\}$ that the answer A could address. For each
question $q_i$, embeddings are obtained and compared with the embedding of the
original query Q. The answer relevancy score AR is computed as:
$$AR = \frac{1}{m} \sum_{i=1}^{m} \mathrm{sim}(Q, q_i)$$
where $\mathrm{sim}(Q, q_i)$ denotes the cosine similarity between the
embeddings of Q and $q_i$.
– Context Recall (CR): Assesses the retriever’s ability to retrieve all
necessary information for answering the query. RAGAS compares the retrieved
context (C) with the ground truth answer (G) by checking if each statement
in G is present in C. The context recall score CR is computed as:
$$CR = \frac{|G_s \text{ in } C|}{|G_s|}$$
where $|G_s \text{ in } C|$ is the count of statements from G that are present
in C, and $|G_s|$ is the total number of statements in G.

– Context Precision (CP): Evaluates the accuracy of the retrieved or generated
content by measuring the proportion of relevant statements among
all the retrieved statements. It is calculated as the ratio of the number of
relevant statements retrieved to the total number of statements retrieved:
$$CP = \frac{|G_s \text{ in } C|}{|C_s|}$$
where $|G_s \text{ in } C|$ is the count of statements from the ground truth
answer G that are present in the retrieved context C, and $|C_s|$ is the total
number of statements in the retrieved context C.
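The four ratios defined above can be sketched in plain Python. This is a simplified illustration and not the RAGAS library API: the LLM-based steps (statement extraction, support checking, question generation) are replaced by precomputed inputs, and all function names and sample values are hypothetical.

```python
from statistics import mean


def ratio(flags):
    """Fraction of True entries in a list of per-statement judgements."""
    if not flags:
        raise ValueError("no statements to score")
    return sum(flags) / len(flags)


def faithfulness(supported):
    """F = |T| / |S|: share of answer statements supported by the context."""
    return ratio(supported)


def answer_relevancy(similarities):
    """AR = (1/m) * sum of sim(Q, q_i) over the m generated questions."""
    return mean(similarities)


def context_recall(truth_in_context):
    """CR: share of ground-truth statements found in the retrieved context."""
    return ratio(truth_in_context)


def context_precision(n_truth_in_context, n_context_statements):
    """CP: ground-truth statements found in C over all statements in C."""
    if n_context_statements == 0:
        raise ValueError("retrieved context is empty")
    return n_truth_in_context / n_context_statements


# Hypothetical judgements for a single query.
f = faithfulness([True, True, True, False])   # 3 of 4 statements supported
ar = answer_relevancy([0.9, 0.8, 0.7])        # cosine similarities to Q
cr = context_recall([True, True, False])      # 2 of 3 truth statements in C
cp = context_precision(2, 5)                  # 2 relevant of 5 retrieved
print(f, round(ar, 3), round(cr, 3), cp)
```

Averaging each of these per-query values over all queries of one data type would give figures comparable in form to the per-type averages the evaluation reports.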
Table 3 presents the average scores for the four metrics across different types
of cyber data sources: Vulnerability, APT Report, Security Blogs, and VirusTotal
Report. For vulnerability-related queries, the system exhibits a high Context
Precision of 0.934 and Context Recall of 0.933, indicating accurate and
comprehensive retrieval. The Faithfulness score of 0.908 and the Answer
Relevancy score of 0.855 further indicate that the generated answers are
accurate and relevant. Queries related to APT Reports show solid performance
with a CP of 0.881 and CR of 0.847, slightly lower than for vulnerabilities,
suggesting room for minor improvement in retrieval completeness. The
Faithfulness score of 0.878 and the Answer Relevancy score of 0.922 are strong,
indicating that the generated answers are relevant and accurate. Queries about
information posted in security blog articles have a lower CP of 0.805 but a
higher CR of 0.889, showing that while the system retrieves most of the
relevant information, it also includes some irrelevant data. The Faithfulness
score of 0.871 and AR score of 0.883 are good. We also asked queries related to
VirusTotal reports of malicious hashes; here the scores are close to 0.80
across all metrics (0.766 to 0.805), reflecting challenges in both retrieval
accuracy and answer relevance. Overall, these results indicate that the system
performs well across various types of cyber data. The implementation details
are available on our GitHub page:
https://github.com/OPTIMA-CTI/IntellBot.

Table 3: Average retrieval and generation metric scores for each type of cyber
data.

Type                 Retrieval          Generation
                     CP       CR        F        AR
Vulnerability        0.934    0.933     0.908    0.855
APT Report           0.881    0.847     0.878    0.922
Security Blogs       0.805    0.889     0.871    0.883
VirusTotal Report    0.786    0.805     0.766    0.798

6 Conclusion
The security chatbot, IntellBot, developed in this work demonstrates a novel
approach to leveraging Large Language Models and vector databases for answering
queries related to cyber security. IntellBot utilizes the LangChain framework,
which enables the architecting of Retrieval-Augmented Generation systems with
numerous tools to transform, store, search, and retrieve information that
refines language model responses. Combining LLMs with vector databases
highlights the potential for intelligent information retrieval systems tailored
to cyber security. We assessed IntellBot’s performance using a two-stage
approach. Initially, we analyzed performance through an indirect method by
calculating the BERT score between the user query and five bot-generated
questions from the response to that query. In the second stage, we measured the
cosine similarity between the bot-generated responses and human responses for
the same query using Word2Vec, GloVe, and BERT embeddings. Our findings show
that in the first stage we achieved an average BERT score greater than 0.8, and
in the second stage we obtained average cosine similarity scores ranging from
0.8 to 1, indicating highly accurate query responses. In addition, the RAGAS
evaluation metrics, which range from 0.766 to 0.934, highlight the system’s
effective performance across various types of cyber queries. In the future, we
plan to integrate more data sources to enhance the chatbot’s capabilities and
expand its relevance in cyber security contexts. Moreover, we intend to
incorporate more evaluation techniques to comprehensively assess the chatbot’s
responses. Additionally, we plan to leverage reinforcement learning with human
judgment to refine and optimize the bot’s responses.

Acknowledgment

This work was partly supported by the HORIZON Europe Framework Pro-
gramme through the project “OPTIMA-Organization sPecific Threat Intelligence
Mining and sharing" (101063107), funded by the European Union. Views and
opinions expressed are however those of the author(s) only and do not neces-
sarily reflect those of the European Union. Neither the European Union nor the
granting authority can be held responsible for them.

References
1. Ju Yoen Lee. Can an artificial intelligence chatbot be the author of a scholarly
article? Journal of educational evaluation for health professions, 20, 2023.
2. Eleni Adamopoulou and Lefteris Moussiades. An overview of chatbot technology.
In IFIP international conference on artificial intelligence applications and innova-
tions, pages 373–383. Springer, 2020.
3. Eric WT Ngai, Maggie CM Lee, Mei Luo, Patrick SL Chan, and Tenglu Liang. An
intelligent knowledge-based chatbot for customer service. Electronic Commerce
Research and Applications, 50:101098, 2021.
4. Sandeep A Thorat and Vishakha Jadhav. A review on implementation issues
of rule-based chatbot systems. In Proceedings of the international conference on
innovative computing & communications (ICICC), 2020.
5. Jin K Kim, Michael Chua, Mandy Rickard, and Armando Lorenzo. Chatgpt and
large language model (llm) chatbots: The current state of acceptability and a
proposal for guidelines on utilization in academic medicine. Journal of Pediatric
Urology, 2023.
6. Steven I Ross, Fernando Martinez, Stephanie Houde, Michael Muller, and Justin D
Weisz. The programmer’s assistant: Conversational interaction with a large lan-
guage model for software development. In Proceedings of the 28th International
Conference on Intelligent User Interfaces, pages 491–514, 2023.
7. Alexei A Birkun and Adhish Gautam. Large language model-based chatbot as a
source of advice on first aid in heart attack. Current Problems in Cardiology, page
102048, 2023.
8. Martin Hasal, Jana Nowaková, Khalifa Ahmed Saghair, Hussam Abdulla, Václav
Snášel, and Lidia Ogiela. Chatbots: Security, privacy, data protection, and social
aspects. Concurrency and Computation: Practice and Experience, 33(19):e6426,
2021.
9. Muriel Figueredo Franco, Bruno Rodrigues, Eder John Scheid, Arthur Jacobs,
Christian Killer, Lisandro Zambenedetti Granville, and Burkhard Stiller. Secbot:
a business-driven conversational agent for cybersecurity planning and manage-
ment. In 2020 16th international conference on network and service management
(CNSM), pages 1–7, University of Zurich, 2020. IEEE.
10. Jingfeng Yang, Hongye Jin, Ruixiang Tang, Xiaotian Han, Qizhang Feng, Haoming
Jiang, Shaochen Zhong, Bing Yin, and Xia Hu. Harnessing the power of llms in
practice: A survey on chatgpt and beyond. ACM Transactions on Knowledge
Discovery from Data, pages 1–23, 2023.
11. Aggeliki Androutsopoulou, Nikos Karacapilidis, Euripidis Loukis, and Yannis
Charalabidis. Transforming the communication between citizens and government
through ai-guided chatbots. Government information quarterly, 36(2):358–367,
2019.
12. Keivalya Pandya and Mehfuza Holia. Automating customer service using
langchain: Building custom open-source gpt chatbot for organizations, 2023.
13. Joseph Benjamin Ilagan and Jose Ramon Ilagan. A prototype of a chatbot for eval-
uating and refining student startup ideas using a large language model. EdArXiv
Preprints, 05 2023.
14. Aigerim Mansurova, Aliya Nugumanova, and Zhansaya Makhambetova. Develop-
ment of a question answering chatbot for blockchain domain. Scientific Journal of
Astana IT University, pages 27–40, 2023.
15. Pedro Filipe Oliveira and Paulo Matos. Introducing a chatbot to the web por-
tal of a higher education institution to enhance student interaction. Engineering
Proceedings, 56(1):128, 2023.
16. Daman Arora, Himanshu Gaurav Singh, and Mausam. Have llms advanced
enough? a challenging problem solving benchmark for large language models, 2023.
17. Timothy McIntosh, Tong Liu, Teo Susnjak, Hooman Alavizadeh, Alex Ng, Raza
Nowrozy, and Paul Watters. Harnessing gpt-4 for generation of cybersecurity
grc policies: A focus on ransomware attack mitigation. Computers & security,
134:103424, 2023.
18. Samaneh Shafee, Alysson Bessani, and Pedro M Ferreira. Evaluation of llm chat-
bots for osint-based cyberthreat awareness. arXiv preprint arXiv:2401.15127, 2024.
19. Jeffrey Pennington, Richard Socher, and Christopher D Manning. Glove: Global
vectors for word representation. In Proceedings of the 2014 conference on empirical
methods in natural language processing (EMNLP), pages 1532–1543, 2014.
20. Kenneth Ward Church. Word2vec. Natural Language Engineering, 23(1):155–162,
2017.
21. Shahul Es, Jithin James, Luis Espinosa-Anke, and Steven Schockaert. Ra-
gas: Automated evaluation of retrieval augmented generation. arXiv preprint
arXiv:2309.15217, 2023.
