IntellBot
Dincy R. Arikkat¹, Abhinav M.¹, Navya Binu¹, Parvathi M.¹, Navya Biju¹, K. S. Arunima¹, Vinod P.²,¹, Rafidha Rehiman K. A.¹, and Mauro Conti²

¹ Department of Computer Applications, Cochin University of Science and Technology, Kerala, India
² Department of Mathematics, University of Padua, Padua, Italy

arXiv:2411.05442v1 [cs.IR] 8 Nov 2024
1 Introduction
The rise of interconnected technologies creates new vulnerabilities, leading to a surge in complex cyber threats and demanding faster responses to security incidents. Intelligence-based tools such as chatbots that can provide threat intelligence about the current landscape are therefore crucial. The chatbots
2 Related work
Large language models excel in various NLP tasks, including creative text generation, translation, and question answering [10]. These AI models, empowered by massive amounts of text data, offer versatility due to their ability to handle unseen data, and they require less task-specific training than traditional models. Notably, LLMs are advantageous when data scarcity is a concern; however, fine-tuned models may still be preferable for tasks with abundant data. While LLMs hold significant promise, real-world applicability, interpretability, and potential biases remain challenges. A few works in the state of the art integrate NLP and LLMs for chatbot creation.
In [11], the authors proposed an approach for using chatbots to improve communication between citizens and government. The chatbots help to understand citizens' queries and provide them with information or complete transactions. Pandya et al. [12] introduced Sahaay, an open-source system that uses LLMs to automate customer service. Sahaay retrieves information directly from a company's website and leverages Google's Flan-T5 LLM to answer customer queries; the model is further trained with Hugging Face Instructor Embeddings to improve its understanding of meaning. To address the limitations of traditional, inflexible methods in entrepreneurship education, a recent study [13] proposed an LLM chatbot acting as a startup coach simulator. This system tackles the iterative nature of startup development by leveraging a conversational interface powered by LLMs such as GPT-3; the chatbot provides entrepreneurs with real-time feedback on crucial aspects such as their business model, product-market fit, and financial projections. In [14], the authors proposed an LLM-based chatbot for the Fintech industry. They integrated technologies,
3 Background
3.1 LangChain
LangChain (https://www.langchain.com/) is a Python library developed to simplify the creation of applications that utilize LLMs, and it has emerged as a valuable resource for constructing intelligent chatbots. The core of LangChain is built on a modular and extensible framework that abstracts the complexities of working with LLMs. It provides a range of fundamental tools for loading and managing LLM models from different sources, tokenizing input text, batching requests, and managing other common LLM operations. A notable feature of LangChain is its compatibility with a diverse selection of LLM models from prominent providers such as OpenAI, Anthropic, Cohere, and AI21, as well as open-source models such as LlamaCpp and GPT-J.
Furthermore, LangChain offers advanced tools for enhancing the capabilities of LLMs, which are particularly beneficial in the context of a cyber security chatbot. These tools encompass memory components for preserving conversational context, agents for breaking down complex tasks into manageable subtasks, and mechanisms for integrating knowledge bases. In building our cyber security chatbot, we leveraged the following key components of the LangChain framework:
– Prompts: Prompts are crafted instructions given to a language model to guide its responses in specific contexts. They can include structured queries or instructions that ensure the model provides relevant and accurate information.
– Document Loader: Document loaders are tools or scripts used to ingest and preprocess documents. They load documents from a file system, database, or other storage medium and prepare them for input into the language model.
– Agents: An agent serves as the intermediary between humans and the LLM. It interacts with the LLM to execute tasks or generate outputs according to provided instructions, and can take the form of a script, program, or user-friendly chat interface. The agent receives inputs, crafts a prompt to guide the LLM, sends it to the LLM, and handles the LLM's response to achieve the intended result.
– Chains: A chain represents the sequence of operations or steps that the agent follows to interact with the language model. This can include preprocessing inputs, sending queries to the language model, receiving responses, and post-processing or formatting the outputs as needed.
– LLM: LangChain's core engine is the LLM itself. It is responsible for understanding inputs (prompts or queries) in natural language and generating appropriate responses based on its training data and algorithms.
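The interplay of these components can be illustrated with a minimal, self-contained sketch. This is plain Python, not LangChain's actual API; every class and function name here is illustrative only, and the stubbed LLM and retriever stand in for the real model and vector store:

```python
# Minimal sketch of the prompt -> chain -> LLM flow described above.
# All names are illustrative; LangChain's real classes differ.

def fake_llm(prompt: str) -> str:
    """Stand-in for a real LLM: returns a canned answer for the demo."""
    if "ransomware" in prompt.lower():
        return "FIN8 has been observed deploying the White Rabbit ransomware."
    return "I do not have enough context to answer."

def build_prompt(question: str, context: str) -> str:
    # The "Prompts" component: structured instructions plus retrieved context.
    return (
        "You are a cyber security expert. Answer using the context below.\n"
        f"Context: {context}\n"
        f"Question: {question}"
    )

def run_chain(question: str, retriever, llm) -> str:
    # The "Chain": retrieve -> craft prompt -> query LLM -> post-process.
    context = retriever(question)
    prompt = build_prompt(question, context)
    return llm(prompt).strip()

# A trivial "retriever" standing in for the vector-store lookup.
docs = {"ransomware": "FIN8 has been detected using Sardonic with White Rabbit."}
retriever = lambda q: next((d for k, d in docs.items() if k in q.lower()), "")

answer = run_chain("What ransomware did FIN8 use?", retriever, fake_llm)
```

The agent's role is exactly this orchestration: it receives the user input, assembles the prompt, calls the model, and returns the handled response.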
4 Methodology
Fig. 1: Proposed architecture of the security chatbot created using the Large Language Model
dataset consisted of a total of 2,447 PDF documents comprising 445 cyber security books referred from different GitHub pages and 2,002 Advanced Persistent Threat (APT) reports sourced from vx-underground (https://vx-underground.org/APTs), providing detailed information on various malware attacks and procedures. We also collected data from the threat intelligence platform VirusTotal (https://www.virustotal.com/gui/home/upload), searching 7,989 malicious file hashes and saving the responses in JSON format. Moreover, custom crawlers were developed to collect information from websites such as KrebsOnSecurity and Malwarebytes, focusing on URLs related to cyber threats and security incidents. Additionally, vulnerability information was gathered from the National Vulnerability Database (NVD, https://nvd.nist.gov/vuln/data-feeds). Furthermore, we
Different document loaders were employed to load data in formats such as PDF, CSV, JSON, and URL. The PyPDFLoader was used to load the 2,447 PDF documents and extract metadata, page content, and page numbers. The JSONLoader handled the JSON files obtained from VirusTotal, extracting data effectively using the jq Python package. The RecursiveUrlLoader utilized a tailored extractor function relying on BeautifulSoup to traverse webpages starting from root URLs and extracted their text content for subsequent processing. Additionally, data from 38 CSV files was loaded using the CSVLoader, whose load function yields structured data from these files suitable for further analysis. Content scraped from 7,889 web pages of security news sites such as The Hacker News (https://thehackernews.com/), BleepingComputer (https://www.bleepingcomputer.com/news/security/), Cisco Talos (https://blog.talosintelligence.com/), and CSO Online (https://www.csoonline.com/in/cybercrime/page/) was also collected and saved as CSV files.
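The multi-format loading step above can be sketched as a simple dispatch by file extension. The parser functions here are toy stand-ins for the LangChain loaders named in the text (PyPDFLoader, JSONLoader, CSVLoader, RecursiveUrlLoader), not their real implementations:

```python
# Sketch of dispatching documents to a format-specific loader.
import csv
import io
import json

def load_json(text: str) -> list[str]:
    # JSONLoader analogue: pull the field values out of a VirusTotal-style report.
    data = json.loads(text)
    return [str(v) for v in data.values()]

def load_csv(text: str) -> list[str]:
    # CSVLoader analogue: one document per row.
    return [" ".join(row) for row in csv.reader(io.StringIO(text))]

LOADERS = {".json": load_json, ".csv": load_csv}

def load_document(name: str, text: str) -> list[str]:
    # Route the raw content to the loader matching its extension.
    ext = name[name.rfind("."):]
    return LOADERS[ext](text)

docs = load_document("report.json", '{"sha256": "abc", "verdict": "malicious"}')
rows = load_document("feed.csv", "cve,score\nCVE-2024-0001,9.8")
```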
Text splitting converts the input into smaller units, such as words or tokens, allowing for easier data processing and the extraction of meaningful features. This breakdown enables a better understanding of users' queries or commands and enhances comprehension. In our work, we utilized the Recursive Character Text Splitter to carry out this task, configured with two parameters: chunk size and chunk overlap. The chunk size parameter determines the maximum length of the individual segments into which a text document is divided, specified as a number of characters, words, or tokens per chunk. We fixed the chunk size at 1000 to ensure consistent and manageable segments for analysis. Chunk overlap indicates the amount of shared content between consecutive text segments: a higher overlap yields more redundancy among chunks, while a lower overlap produces more distinct segments. In our study, we chose a chunk overlap of 50 to balance redundancy and coherence, maintaining continuity across the segmented text data.
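The effect of the chunk_size and chunk_overlap parameters can be sketched with a simplified character-level splitter. This is a toy analogue, not the Recursive Character Text Splitter's actual algorithm, which additionally tries to break on separators (paragraphs, sentences) before falling back to characters:

```python
# Toy fixed-size splitter illustrating chunk_size / chunk_overlap.

def split_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 50) -> list[str]:
    if chunk_overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk size")
    step = chunk_size - chunk_overlap  # each new chunk starts `step` chars later
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

doc = "A" * 2500
chunks = split_text(doc, chunk_size=1000, chunk_overlap=50)
# starts at 0, 950, 1900 -> 3 chunks; consecutive chunks share 50 characters
```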
After converting the data into numerical format using sentence-transformer embeddings (https://huggingface.co/sentence-transformers/all-mpnet-base-v2), we stored it in a vector store, a specialized database designed for efficient storage and retrieval of items. Each piece of data within the vector store is represented as a high-dimensional vector in a multi-dimensional space, allowing for comparisons and similarity calculations. To enhance storage efficiency and query performance, we utilized the Facebook AI Similarity Search (FAISS) framework (https://python.langchain.com/docs/integrations/vectorstores/faiss/) for constructing and managing the vector store. FAISS is well known for optimizing similarity search operations, ensuring quick and accurate retrieval of similar items from extensive datasets. We created separate vector stores for each data type: PDFs, CSV files, URLs, and JSON data. To amalgamate the outputs from these vector stores and retrievers into a cohesive final response, we employed an ensemble retriever, which enhances response accuracy and depth by leveraging the unique strengths of each individual retriever. This approach balances storage efficiency and query performance, enabling swift and effective retrieval of similar items from the vector store.
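The per-type vector stores and the ensemble merge can be sketched as follows. The embed() function here is a deliberately crude stand-in (a character histogram) for the real sentence-embedding model, and the classes are illustrative, not FAISS or LangChain APIs:

```python
# Sketch of a FAISS-style cosine-similarity search plus a simple ensemble merge.
import numpy as np

def embed(text: str) -> np.ndarray:
    # Toy embedding: normalized character histogram (real systems use transformers).
    v = np.zeros(26)
    for ch in text.lower():
        if "a" <= ch <= "z":
            v[ord(ch) - ord("a")] += 1
    n = np.linalg.norm(v)
    return v / n if n else v

class VectorStore:
    def __init__(self, docs: list[str]):
        self.docs = docs
        self.matrix = np.stack([embed(d) for d in docs])

    def search(self, query: str, k: int = 1) -> list[str]:
        scores = self.matrix @ embed(query)        # cosine similarity (unit vectors)
        top = np.argsort(scores)[::-1][:k]
        return [self.docs[i] for i in top]

def ensemble_search(stores: list["VectorStore"], query: str) -> list[str]:
    # Merge results from the per-format stores, dropping duplicates.
    merged = []
    for store in stores:
        for doc in store.search(query):
            if doc not in merged:
                merged.append(doc)
    return merged

pdf_store = VectorStore(["ransomware report", "phishing guide"])
csv_store = VectorStore(["ransomware indicators", "benign traffic log"])
hits = ensemble_search([pdf_store, csv_store], "ransomware")
```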
[System]
"You are a cyber security expert. Provide the responses for the question considering the context below. Your responses should consider factors such as the relevance, accuracy, depth, creativity, and level of detail of their responses."
[User Query]
{What was the name of the ransomware used by FIN8?}
[context 1]
{the ability to drop arbitrary files and exfiltrate file contents from the compromised machine to an actor-controlled infrastructure. This is not the first time FIN8 has been detected using Sardonic in connection with a ransomware attack...}
[context 2]
{CONTENT: An exhaustive analysis of FIN7 has unmasked the cybercrime syndicate's organizational hierarchy, alongside unraveling its role as an affiliate for mounting ransomware attacks...}
[context 3]
{CONTENT: The notorious cybercrime group known as FIN7 has been observed deploying Cl0p (aka Clop) ransomware, marking the threat actor's first ransomware campaign in late 2021...}
[Response]
FIN8 has been detected using the White Rabbit ransomware, which is based on Sardonic.
To build the interface for our cyber security chatbot, we used Streamlit, a framework renowned for its simplicity and rapid application development capabilities. Streamlit proved highly advantageous in creating an interactive interface with minimal coding, seamlessly aligning with our data processing framework, LangChain. We used Streamlit version 1.32.2, which provides a chat_input function to accept user queries.
After developing IntellBot for querying security-related questions, we need to evaluate the accuracy of its responses. To assess the bot's performance, we propose a two-stage evaluation approach. Initially, we assessed the correctness of IntellBot's response to a query by generating five bot questions based on the retrieved response and validating the correctness of the response through indirect proof. Subsequently, we strengthened the evaluation by providing manual answers to the questions and comparing them with the bot's responses using similarity scores.
Step 1: Response Evaluation using Bot-Generated Questions
In this stage, we employed an indirect proof strategy. We obtained the bot answer A to the cyber security query Q_0 and generated five subsequent bot questions Q_1, Q_2, ..., Q_5 based on A. The validation process involved comparing these generated questions with the original query Q_0 to assess the correctness of A.
We then applied indirect proof, or proof by contradiction, which asserts that assuming a statement's opposite leads to a logical inconsistency. It relies on the following logic: if the bot answer A were incorrect, the questions {Q_i} (i = 1, ..., 5) derived from A would logically diverge from Q_0. However, if the similarity score sim(Q_0, Q_i) is high for all i from 1 to 5, this contradicts our initial assumption of A's incorrectness, suggesting that A correctly addresses Q_0.
This approach proceeds as follows:
– Proof: The process begins with computing the BERT score between Q_0 and each Q_i (i = 1, ..., 5) to quantify their semantic alignment. The BERT score computes similarity between pairs of texts using contextual embeddings from Bidirectional Encoder Representations from Transformers (BERT); higher values indicate stronger semantic coherence between the texts, suggesting a closer alignment in meaning and context. We calculate the average BERT score across the set of questions generated from A. If this average is high when the questions are compared to Q_0, it implies significant semantic similarity between Q_0 and the bot-generated questions.
The core of the proof lies in the contradiction analysis: assuming A is incorrect leads to the expectation that the generated questions would diverge in meaning from Q_0. However, if the observed BERT scores between Q_0 and the generated questions are consistently high, this contradicts the initial assumption of A's incorrectness, indicating that A likely provides a correct answer to Q_0.
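The Step 1 decision rule above can be sketched as averaging a similarity function over the generated questions and applying a threshold. Note the assumptions: token-level Jaccard overlap stands in here for the BERT score, and the threshold value is illustrative, not taken from the paper:

```python
# Sketch of the Step 1 check: accept the answer A if the generated questions
# stay close, on average, to the original query Q0.

def sim(a: str, b: str) -> float:
    # Jaccard token overlap as a crude stand-in for the BERT score.
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def answer_is_consistent(q0: str, generated: list[str], threshold: float = 0.3) -> bool:
    avg = sum(sim(q0, qi) for qi in generated) / len(generated)
    # Contradiction argument: an incorrect answer would yield questions that
    # diverge from q0, i.e. a low average similarity.
    return avg >= threshold

q0 = "who were the primary victims targeted in the mirage campaign"
generated = [
    "what is the mirage campaign and who were the primary targets",
    "how were the victims identified and targeted in the mirage campaign",
]
ok = answer_is_consistent(q0, generated)
```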
Step 2: Response Evaluation using Human Answers
In this step, we further confirm that IntellBot's response to a query is correct by comparing it with a manual answer. To do this, we curated a manual answer to each question before querying IntellBot and collected the chatbot's responses via the interface. We then compared each bot-generated answer with its corresponding human answer using cosine similarity, which measures how close two sentences are in meaning. To measure cosine similarity, we computed embeddings of the answers using GloVe [19], Word2Vec [20], and BERT. Global Vectors for Word Representation (GloVe) creates dense vector representations of words based on global word co-occurrence statistics, effectively capturing semantic information. Word2Vec generates embeddings by training neural networks on large text corpora, capturing semantic relationships and contextual similarities.
BERT differs from GloVe and Word2Vec in that it is a transformer-based model pre-trained on large-scale text data; it generates embeddings that consider both left and right context in all layers, allowing it to capture deep contextual meanings of words and sentences. Finally, we used Equation 1 to calculate the similarity between bot and human responses:
CS(A_b, A_h) = (A_b · A_h) / (‖A_b‖ × ‖A_h‖)    (1)

where A_b and A_h are the vectors representing the bot answer and the human answer, respectively; · denotes the dot product (sum of products of corresponding elements); and ‖A_b‖ and ‖A_h‖ are the magnitudes (lengths) of the vectors, computed using the L2 norm (square root of the sum of squares of elements).
A high similarity score indicates that the semantic content of IntellBot's response closely matches that of the manually curated correct answer, reflecting the accuracy and correctness of IntellBot's response to the query.
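Equation 1 can be computed directly over two embedding vectors. A minimal sketch, where the short vectors are toy stand-ins for the GloVe/Word2Vec/BERT embeddings described above:

```python
# Cosine similarity as in Equation 1: CS(Ab, Ah) = (Ab . Ah) / (||Ab|| * ||Ah||)
import math

def cosine_similarity(ab: list[float], ah: list[float]) -> float:
    dot = sum(x * y for x, y in zip(ab, ah))
    norm_ab = math.sqrt(sum(x * x for x in ab))
    norm_ah = math.sqrt(sum(y * y for y in ah))
    if norm_ab == 0 or norm_ah == 0:
        return 0.0  # convention: undefined for zero vectors
    return dot / (norm_ab * norm_ah)

# Identical vectors -> similarity 1; orthogonal vectors -> 0.
s_same = cosine_similarity([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
s_orth = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```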
5 Experimental Results
Experimental Setup: Our experiments were conducted on a Windows 11 Pro system with an Intel Core i9 processor, 32 GB of RAM, and an NVIDIA Quadro P2000 GPU with 5 GB of GDDR5X memory. Additionally, we utilized Google Colab to create the vector store. Our setup employed various tools and frameworks, including Hugging Face Instructor embeddings, RetrievalQA for creating chains, OpenAI for defining the LLM, PromptTemplate for prompt definition, and FAISS for vector storage.
Table 1: BERT Score (BS) comparison between the initial query (Q_0) and five bot-generated questions.

Query 1: Who were the primary victims targeted in the Mirage campaign?
IntellBot Response: The primary victims targeted in the Mirage campaign were organizations involved in the contest for rights to survey natural gas and oil in the South China Sea, including a high profile oil company in the Philippines, a military organization in Taiwan, an energy company in Canada, and organizations in Brazil, Israel, Egypt, and Nigeria.
Five Bot Questions (BS):
– What is the Mirage campaign and who were the primary targets? (0.921)
– How were the organizations involved in the contest for rights to survey natural gas and oil in the South China Sea affected by the Mirage campaign? (0.878)
– What were the potential motives behind the Mirage campaign? (0.924)
– How were the victims identified and targeted in the Mirage campaign? (0.945)
– What steps can organizations take to protect themselves from similar cyber attacks in the future? (0.851)
Average BS: 0.904

Query 2: What infrastructure was used to carry out the Taidoor attacks?
IntellBot Response: The Taidoor attacks were carried out using compromised systems, including a primary command-and-control server located in China. The attackers also used various tools and programs, such as an MS-SQL attack tool, CNC programs, and a portable HTTP file server, to carry out their attacks.
Five Bot Questions (BS):
– How were the Taidoor attacks carried out? (0.929)
– What tools and programs were used by the attackers in the Taidoor attacks? (0.925)
– Where was the primary command-and-control server located for the Taidoor attacks? (0.907)
– How did the attackers gain access to the compromised systems used in the Taidoor attacks? (0.903)
– What types of attacks were carried out using the MS-SQL attack tool, CNC programs, and portable HTTP file server in the Taidoor attacks? (0.886)
Average BS: 0.910
Table 2: Cosine similarity scores of 10 random questions with their bot answers and human answers. Topics: Vul for vulnerability-related questions, APT for APT attack-related questions, Blog for queries from security blog articles, and VT for VirusTotal reports. Similarity scores were calculated using three different embedding techniques: Word2Vec, GloVe, and BERT, denoted as S1, S2, and S3, respectively.

1. [Vul] Query: What versions of the package are vulnerable to Regular Expression Denial of Service (ReDoS) via email validation?
Human Answer: The package forms before 1.2.1, from 1.3.0, and before 1.3.2 are vulnerable to Regular Expression Denial of Service (ReDoS) via email validation.
Bot Answer: The package forms before 1.2.1, from 1.3.0 and before 1.3.2 are vulnerable to Regular Expression Denial of Service (ReDoS) via email validation.
S1 = 0.986, S2 = 1, S3 = 0.998

2. [Vul] Query: What vulnerability was identified in SnapGear appliances before the 3.1.4u5 firmware?
Human Answer: denial of service (complete packet loss) via a packet flood
Bot Answer: Remote attackers could cause a denial of service by flooding the system with packets.
S1 = 0.557, S2 = 0.716, S3 = 0.979

3. [APT] Query: What infrastructure was used to carry out the Taidoor attacks?
Human Answer: C&C servers and email addresses.
Bot Answer: The attackers used a compromised command-and-control server located in China, as well as other compromised machines that hosted their entire infrastructure. They also used various tools such as MS-SQL attack tool, CNC programs, and FTP servers to carry out the Taidoor attacks.
S1 = 0.492, S2 = 0.907, S3 = 0.908

4. [APT] Query: Who were the primary victims targeted in the Mirage campaign?
Human Answer: The disclosed victims included a high-profile oil company in the Philippines, a military organization in Taiwan, an energy company in Canada, and other entities in Brazil, Israel, Egypt, and Nigeria.
Bot Answer: The primary victims targeted in the Mirage campaign were organizations involved in the contest for rights to survey natural gas and oil in the South China Sea, including a high profile oil company in the Philippines, a military organization in Taiwan, an energy company in Canada, and organizations in Brazil, Israel, Egypt, and Nigeria.
S1 = 0.895, S2 = 0.959, S3 = 0.964

5. [APT] Query: What is the name of the tool used by the Naikon APT group for remote administration of compromised systems?
Human Answer: The tool used by the Naikon APT group for remote administration is called XSControl, which is described as the group's victim management software.
Bot Answer: The tool used by the Naikon APT group for remote administration of compromised systems is called "xsControl/naikon."
S1 = 0.898, S2 = 0.927, S3 = 0.997

6. [Blog] Query: What type of malware infected the Turkish airline employee's computer and facilitated the theft of Airbus credentials?
Human Answer: The prevalent and powerful info-stealing trojan called RedLine infected the Turkish airline employee's computer.
Bot Answer: The malware that infected the Turkish airline employee's computer was an info-stealer, specifically the RedLine trojan.
S1 = 0.833, S2 = 0.864, S3 = 0.996

7. [Blog] Query: What are some aliases of the Jupyter Infostealer malware?
Human Answer: Polazert, SolarMarker, and Yellow Cockatoo
Bot Answer: Polazert, SolarMarker, and Yellow Cockatoo.
S1 = 0.764, S2 = 1, S3 = 0.999

8. [Blog] Query: When did the attack on Kaseya take place?
Human Answer: The attack on Kaseya occurred over the July 4 weekend in 2021.
Bot Answer: The attack on Kaseya took place on the July 4 weekend in 2021.
S1 = 0.946, S2 = 0.943, S3 = 1

9. [VT] Query: List the different versions of Win32/Injector.EDSK?
Human Answer: 26690, 26977, and 26603
Bot Answer: The different versions of Win32/Injector.EDSK are 26690, 26977, and 26603.
S1 = 1, S2 = 0.701, S3 = 0.977

10. [VT] Query: List the different versions of Trojan.Win32.PRIVATELOADER.YXCLPZ?
Human Answer: 10.0.0.1040 and 11.0.0.1006.
Bot Answer: The different versions of Trojan.Win32.PRIVATELOADER.YXCLPZ are "10.0.0.1040" and "11.0.0.1006".
S1 = 1, S2 = 0.970, S3 = 0.970
more contextual representation of the text. For this, we initialized the tokenizer and model with bert-base-uncased. After computing the embeddings for both the bot responses and the manual answers, we assessed their cosine similarity to quantify the alignment between them.
The cosine similarity between the bot answers and human answers for 10 randomly selected queries is shown in Table 2. The table reveals that the bot responses closely align with the human responses, showcasing near-identical content. High similarity scores between the bot-generated and human responses affirm the accuracy of IntellBot-generated answers. Although a few questions exhibit similarity scores below 0.8, we manually verified them and found that the bot-generated responses were correct and contained additional information.
Furthermore, we computed the cosine similarity scores for the responses to 400 queries related to different security topics and determined the average similarity score for each type of query. Figure 2 illustrates the average similarity scores between bot-generated and human responses using different embeddings. Across the various query topics, BERT embeddings consistently yield higher similarity than Word2Vec and GloVe embeddings. As depicted in Figure 2, the similarity scores between bot and human answers range from 0.8 to 1, indicating IntellBot's capability to provide responses closely resembling manual answers in terms of cosine similarity.
In this section, we evaluate our RAG model using the Retrieval-Augmented Generation Assessment System (RAGAS) [21]. RAGAS provides a comprehensive evaluation by assessing both the retrieval and generation aspects of the model rather than just the final response. For evaluating retrieval, RAGAS employs metrics such as context_recall and context_precision; for generation, it utilizes metrics such as faithfulness, to identify hallucinations, and answer_relevancy, to measure how precisely the answers address the queries.

Faithfulness (F) is computed as

F = |T| / |S|

where |T| is the number of statements in the generated answer that can be inferred from the retrieved context and |S| is the total number of statements in the generated answer. Context recall (CR) is computed as

CR = |G_s ∩ C| / |G_s|

where |G_s ∩ C| is the count of statements from the ground truth G that are present in the context C, and |G_s| is the total number of statements in G.
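The context-recall computation can be sketched as a statement-level membership check. This is a simplification: RAGAS uses an LLM judge to decide whether each ground-truth statement is attributable to the context, which is approximated here by a naive substring match:

```python
# Simplified context_recall: fraction of ground-truth statements found in the
# retrieved context. RAGAS uses an LLM judge instead of substring matching.

def context_recall(ground_truth_statements: list[str], context: str) -> float:
    supported = sum(
        1 for s in ground_truth_statements if s.lower() in context.lower()
    )
    return supported / len(ground_truth_statements)

context = ("FIN8 has been detected using Sardonic. "
           "The White Rabbit ransomware is based on Sardonic.")
statements = [
    "FIN8 has been detected using Sardonic",
    "White Rabbit is based on Sardonic",  # paraphrase: the naive match misses it
]
cr = context_recall(statements, context)
```

The second statement illustrates why RAGAS prefers an LLM judge: a paraphrase that is clearly supported by the context fails a literal substring test.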
Table 3: Average retrieval and generation metric scores for each type of cyber data (CP: context_precision, CR: context_recall, F: faithfulness, AR: answer_relevancy).

Type              | CP    | CR    | F     | AR
Vulnerability     | 0.934 | 0.933 | 0.908 | 0.855
APT Report        | 0.881 | 0.847 | 0.878 | 0.922
Security Blogs    | 0.805 | 0.889 | 0.871 | 0.883
VirusTotal Report | 0.786 | 0.805 | 0.766 | 0.798
6 Conclusion
The security chatbot IntellBot developed in this work demonstrates a novel approach to leveraging Large Language Models and vector databases for answering queries related to cyber security. IntellBot utilizes the LangChain framework, which enables the architecting of Retrieval-Augmented Generation systems with numerous tools to transform, store, search, and retrieve information that refines language model responses. Combining LLMs with vector databases highlights the potential for intelligent information retrieval systems tailored to cyber security. We assessed IntellBot's performance using a two-stage approach. Initially, we analyzed performance through an indirect method by calculating the BERT score between the user query and five bot-generated questions derived from the response to that query. In the second stage, we measured the cosine similarity between the bot-generated responses and human responses for the same query using Word2Vec, GloVe, and BERT embeddings. Our findings show that in the first stage we achieved BERT scores greater than 0.8, and we consistently obtained cosine similarity scores ranging from 0.8 to 1, indicating highly accurate query responses. RAGAS evaluation metrics, consistently above 0.77, further highlight the system's effective performance across various types of cyber queries. In the future, we plan to integrate more data sources to enhance the chatbot's capabilities and expand its relevance in cyber security contexts. Moreover, we intend to incorporate additional evaluation techniques to comprehensively assess the chatbot's responses. Additionally, we plan to leverage reinforcement learning with human judgment to refine and optimize the bot's responses.
Acknowledgment
This work was partly supported by the HORIZON Europe Framework Programme through the project "OPTIMA – Organization sPecific Threat Intelligence Mining and sharing" (101063107), funded by the European Union. Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union. Neither the European Union nor the granting authority can be held responsible for them.
References
1. Ju Yoen Lee. Can an artificial intelligence chatbot be the author of a scholarly article? Journal of Educational Evaluation for Health Professions, 20, 2023.
2. Eleni Adamopoulou and Lefteris Moussiades. An overview of chatbot technology. In IFIP International Conference on Artificial Intelligence Applications and Innovations, pages 373–383. Springer, 2020.
3. Eric WT Ngai, Maggie CM Lee, Mei Luo, Patrick SL Chan, and Tenglu Liang. An intelligent knowledge-based chatbot for customer service. Electronic Commerce Research and Applications, 50:101098, 2021.
19. Jeffrey Pennington, Richard Socher, and Christopher D Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.
20. Kenneth Ward Church. Word2Vec. Natural Language Engineering, 23(1):155–162, 2017.
21. Shahul Es, Jithin James, Luis Espinosa-Anke, and Steven Schockaert. RAGAS: Automated evaluation of retrieval augmented generation. arXiv preprint arXiv:2309.15217, 2023.