Malware Analysis Using Machine Learning (Paper Presented)
MACHINE LEARNING
The Project report submitted in partial fulfillment of the requirements for the award of the degree of
BACHELOR OF ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING
Submitted by
AKASH NAIR.V.S
(REG.NO: 20BFS007)
Mr. MIDHUN.S
Assistant Professor and Head, Department of DCFS
DECEMBER - 2022
SREE SARASWATHI THYAGARAJA COLLEGE POLLACHI
(Autonomous)
(Affiliated to Bharathiar University, Coimbatore)
CERTIFICATE
This is to certify that the project work entitled COMPARATIVE STUDY OF FILELESS
MALWARE DETECTION USING MACHINE LEARNING submitted to Sree
Saraswathi Thyagaraja College (Autonomous), Pollachi, affiliated to Bharathiar
University, Coimbatore in partial fulfillment of the requirements for the award of the
degree of BACHELOR OF DIGITAL AND CYBER FORENSICS SCIENCE is a
record of original work done by AKASH NAIR.V.S under my supervision and guidance
and the report has not previously formed the basis for the award of any Degree/ Diploma/
Associateship/ Fellowship or other similar title to any candidate of any University.
14-12-2022 GUIDE
Counter Signed By
I, JAISON.V.R hereby declare that the project report entitled “COMPARATIVE STUDY
OF FILELESS MALWARE DETECTION USING MACHINE LEARNING”
submitted to Sree Saraswathi Thyagaraja College (Autonomous), Pollachi, affiliated to
Bharathiar University, Coimbatore in partial fulfillment of the requirements for the
award of the degree of BACHELOR OF DIGITAL AND CYBER FORENSICS
SCIENCE is a record of original work done by me under the guidance of Mr. MIDHUN.S, Assistant Professor and Head, Department of DIGITAL AND CYBER
FORENSICS SCIENCE and it has not previously formed the basis for the award of any
Degree/Diploma /Associateship /Fellowship or other similar title to any candidate of any
University.
I take this opportunity to express my gratitude and sincere thanks to everyone who helped
me in my project.
I express my deep sense of gratitude and sincere thanks to my beloved staff members
MRS. VINEETHA V, MR. FEBIN PRAKASH & MRS. D.MANJULA, who allowed me to
carry out this project and gave me complete freedom to utilize the resources of the
department.
It is my prime duty to solemnly express my deep sense of gratitude and sincere thanks to my
guide, Mr. MIDHUN.S, MCA, Assistant Professor and Head, UG Department of
Digital and Cyber Forensic Science, for his valuable advice and excellent guidance in
completing the project successfully.
I also convey my heartfelt thanks to my parents, friends and all the staff members of the
Department of DIGITAL AND CYBER FORENSICS SCIENCE for their valuable support
which energized me to complete this project.
SYNOPSIS
Malware attacks have been among the most serious cyber risks faced by different
countries, and the number of reported vulnerabilities and malware samples is increasing rapidly.
The study of malware behavior has received tremendous attention from researchers. Several
factors have led to the increase of malware attacks. Malware authors create and deploy
malware that can mutate and take different forms, such as ransomware and fileless malware,
in order to avoid detection. It is difficult to detect such malware and cyber attacks using
traditional cyber security procedures. Solutions for new-generation cyber attacks and for other
security problems rely on various artificial intelligence techniques. Research shows that over
the last decade, malware has been growing exponentially, causing substantial financial losses
to various organizations. Different anti-malware companies have been proposing solutions to
defend against attacks from this malware. The velocity, volume, and complexity of malware
pose new challenges to the anti-malware community. Current state-of-the-art research shows
that researchers and anti-virus organizations have recently started applying machine learning
and deep learning methods for malware analysis and detection. We have used a memory dump
method and applied unsupervised learning in addition to supervised learning for malware
classification.
In this project we propose a method for the identification of malware, including new
threats, using memory dumps and various artificial intelligence techniques such as machine
learning and deep learning. The processes in the system are analyzed with various machine
learning models in order to find the most efficient and accurate model among them. With
the help of a deep learning model, existing malware can also be identified. We carry out a
comparative study of four machine learning algorithms to determine which is the most
accurate, so that the best-performing algorithm can be chosen without repeating the
evaluation.
TABLE OF CONTENTS
Chapter 1: Introduction
4.1.3 Example
4.2.2 Example
4.3.1 Example
4.4 XGBOOST
5.1 Prerequisite
6.1 Overview
Chapter 9: Conclusion
References
Chapter 1: Introduction
In the early days, idealistic hackers attacked computers because they were eager to
prove themselves. In today's world, however, cracking machines is an industry. Despite
recent improvements in software and hardware security, attacks on computer systems have
increased in both frequency and sophistication. Regrettably, there are major drawbacks to
current methods for detecting and analyzing unknown code samples. The Internet is a
critical part of our everyday lives, and the services it offers are growing daily. Numerous
reports indicate that malware's impact is worsening at an alarming pace. As malware
diversity grows, anti-virus scanners are unable to fulfill security needs, resulting in attacks
on millions of hosts. According to Kaspersky Labs, around 6,563,145 different hosts were
targeted and 4,000,000 unique malware artifacts were found in 2015. Juniper Research
(2016), in particular, projected that by 2019 the cost of data breaches would rise to $2.1
trillion globally. Current studies show that script-kiddies are generating more and more
automated attacks.
To date, attacks on commercial and government organizations, such as ransomware and
malware, continue to pose a significant threat and challenge. Such attacks can come in
various ways and sizes. An enormous challenge is the ability of the global security
community to develop and provide expertise in cybersecurity. There is widespread
awareness of the global scarcity of cybersecurity skills and talent. Cybercrimes, such as financial
fraud, child exploitation online and payment fraud, are so common that they demand
international 24-hour response and collaboration between multinational law enforcement
agencies. For single users and organizations, malware defense of computer systems is
therefore one of the most critical cybersecurity activities, as even a single attack may result
in compromised data and significant losses. Malware attacks have been among the most
serious cyber risks faced by different countries. The number of reported vulnerabilities and
malware samples is also increasing rapidly, and the study of malware behavior has received
tremendous attention from researchers. Several factors have led to the growth of malware
attacks. Malware authors create and deploy malware that can mutate and take different
forms, such as ransomware and fileless malware, in order to evade detection. It is difficult
to detect such malware and cyber attacks using traditional cyber security procedures.
Solutions for new-generation cyber attacks rely on various machine learning techniques.
Brief:
Malware, short for malicious software, consists of programming (code, scripts, active
content, and other software) designed to disrupt or deny operation, gather information that
leads to loss of privacy or exploitation, gain unauthorized access to system resources, and
other abusive behavior. It is a general term used to define a variety of forms of hostile,
intrusive, or annoying software or program code. Software is considered to be malware
based on the perceived intent of the creator rather than any particular features. Malware
includes computer viruses, worms, Trojan horses, spyware, dishonest adware, crime-ware,
most rootkits, and other malicious and unwanted software or programs.
In 2008, Symantec published a report that "the release rate of malicious code and other
unwanted programs may be exceeding that of legitimate software applications.” According
to F-Secure, "As much malware was produced in 2007 as in the previous 20 years
altogether.”
Since the rise of widespread Internet access, malicious software has been designed for a
profit, for example forced advertising. For instance, since 2003, the majority of widespread
viruses and worms have been designed to take control of users' computers for black-market
exploitation. Another category of malware is spyware: programs designed to monitor users'
web browsing and steal private information. Spyware programs do not spread like viruses;
instead, they are installed by exploiting security holes or are packaged with user-installed
software, such as peer-to-peer applications.
Clearly, there is a very urgent need to find a suitable method to detect infected files, which
can even detect new viruses by studying the structure of system calls made by malware.
An increase in the use of exploit kits (programs used by cybercriminals to exploit system
vulnerabilities) led to an explosion of malware delivered online during the 2000s.
Automated SQL injection (a technique used to attack data-driven applications) and other
forms of mass website compromises increased distribution capabilities in 2007. Since then,
the number of malware attacks has grown exponentially, doubling or more each year.
At the start of the new millennium, internet and email worms made headlines across the
globe:
Throughout 2002 and 2003, internet users were plagued by out-of-control popups and other
Javascript bombs. Around this time, socially engineered worms and spam proxies began to
appear. Phishing and other credit card scams also took off during this period, along with
notable internet worms like Blaster and Slammer. Slammer, released in 2003, caused a
denial of service (DoS) on some internet hosts and slowed internet traffic. Below are some
other notable malware incidents from this time:
2004: An email worm war broke out between the authors of MyDoom, Bagle, and
Netsky. Ironically, this feud led to improved email scanning and higher adoption
rates of email filtering, which eventually nearly eliminated mass-spreading email
worms.
2005: The discovery and disclosure of the now-infamous Sony rootkit led to the
inclusion of rootkits in most modern-day malware.
2006: Various financial scams, Nigerian 419 scams, phishing, and lottery scams
were prevalent at this time. Though not directly malware-related, such scams
continued the profit-motivated criminal activity made possible by the internet.
2007: Website compromises escalated due in large part to the discovery and
disclosure of MPack, a crimeware kit used to deliver exploits online. Compromises
included the Miami Dolphins stadium site, Tom’s Hardware, The Sun, MySpace,
Bebo, Photobucket, and The India Times website. By the end of 2007, SQL
injection attacks had begun to ramp up; victims included the popular Cute Overload
and IKEA websites.
2008: By now, attackers were employing stolen FTP credentials and leveraging
weak configurations to inject IFrames on tens of thousands of smaller websites. In
June 2008, the Asprox botnet facilitated automated SQL injection attacks, claiming
Walmart as one of its victims.
2009: In early 2009, Gumblar emerged, infecting systems running older versions of
Windows. Its methodology was quickly adopted by other attackers, leading to
botnets that are harder to detect.
In the last decade or so, attacks have taken advantage of new technologies, including
cryptocurrency and the Internet of Things (IoT).
2010: Industrial computer systems were targets of the 2010 Stuxnet worm. This
malicious tool targeted machinery on factory assembly lines. It was so damaging
that it's thought to have caused the destruction of several hundred of Iran's
uranium-enrichment centrifuges.
2011: A Microsoft-specific Trojan horse called ZeroAccess downloaded malware
on computers via botnets. It was mostly hidden from the operating system using
rootkits and was propagated by Bitcoin mining tools.
2012: As part of a worrying trend, Shamoon targeted computers in the energy
sector. Cited by cybersecurity lab CrySyS as "the most complex malware ever
found," Flame was used for cyber espionage in the Middle East.
2013: An early instance of ransomware, CryptoLocker was a Trojan horse that
locked the files on a user's computer, prompting them to pay a ransom for the
decryption key. Gameover ZeuS used keystroke logging to steal users' login details
from financial transaction sites.
2014: The Trojan horse known as Regin was thought to have been developed in the
U.S. and U.K. for espionage and mass surveillance purposes.
2016: Locky infected several million computers in Europe, including over 5,000
computers per hour just in Germany. Mirai launched highly disruptive distributed
DoS (DDoS) attacks on several prominent websites and infected the IoT.
2017: The global WannaCry ransomware attack was halted when a cybersecurity
researcher found a "kill switch" within the ransomware code. Petya, another
instance of ransomware, was also released, using a similar exploit to the one used
by WannaCry.
2018: As cryptocurrency started to gain traction, Thanatos became the first
ransomware to accept payments in Bitcoin Cash.
❖ Ransomware:
Ransomware is software that uses encryption to disable a target’s access to its data
until a ransom is paid. The victim organization is rendered partially or totally
unable to operate until it pays, but there is no guarantee that payment will result in
the necessary decryption key or that the decryption key provided will function
properly.
❖ Fileless Malware:
Fileless malware doesn't install anything initially; instead, it makes changes to files
that are native to the operating system, such as PowerShell or WMI. Because the
operating system recognizes the edited files as legitimate, a fileless attack is not
caught by antivirus software — and because these attacks are stealthy, they are up
to ten times more successful than traditional malware attacks.
❖ Spyware:
Spyware collects information about users’ activities without their knowledge or
consent. This can include passwords, pins, payment information and unstructured
messages. The use of spyware is not limited to the desktop browser: it can also
operate in a critical app or on a mobile phone. Even if the data stolen is not critical,
the effects of spyware often ripple throughout the organization as performance is
degraded and productivity eroded.
❖ Adware:
Adware tracks a user’s surfing activity to determine which ads to serve them.
Although adware is similar to spyware, it does not install any software on a user’s
computer, nor does it capture keystrokes. The danger in adware is the erosion of a
user’s privacy — the data captured by adware is collated with data captured, overtly
or covertly, about the user’s activity elsewhere on the internet and used to create a
profile of that person which includes who their friends are, what they’ve purchased,
where they’ve traveled, and more. That information can be shared or sold to
advertisers without the user’s consent.
❖ Trojan:
A Trojan disguises itself as desirable code or software. Once downloaded by
unsuspecting users, the Trojan can take control of victims’ systems for malicious
purposes. Trojans may hide in games, apps, or even software patches, or they may
be embedded in attachments included in phishing emails.
❖ Worms:
Worms target vulnerabilities in operating systems to install themselves into
networks. They may gain access in several ways: through backdoors built into
software, through unintentional software vulnerabilities, or through flash drives.
Once in place, worms can be used by malicious actors to launch DDoS attacks,
steal sensitive data, or conduct ransomware attacks.
❖ Virus:
A virus is a piece of code that inserts itself into an application and executes when
the app is run. Once inside a network, a virus may be used to steal sensitive data,
launch DDoS attacks or conduct ransomware attacks.
❖ Rootkits:
A rootkit is software that gives malicious actors remote control of a victim’s
computer with full administrative privileges. Rootkits can be injected into
applications, kernels, hypervisors, or firmware. They spread through phishing,
malicious attachments, malicious downloads, and compromised shared drives.
Rootkits can also be used to conceal other malware, such as keyloggers.
❖ Keyloggers:
A keylogger is a type of spyware that monitors user activity. Keyloggers have
legitimate uses; businesses can use them to monitor employee activity and families
may use them to keep track of children’s online behaviors. However, when installed
for malicious purposes, keyloggers can be used to steal password data, banking
information and other sensitive information. Keyloggers can be inserted into a
system through phishing, social engineering or malicious downloads.
❖ Bot/Botnets:
A bot is a software application that performs automated tasks on command. They’re
used for legitimate purposes, such as indexing search engines, but when used for
malicious purposes, they take the form of self-propagating malware that can
connect back to a central server. Usually, bots are used in large numbers to create a
botnet, which is a network of bots used to launch broad remotely-controlled floods
of attacks, such as DDoS attacks. Botnets can become quite expansive. For
example, the Mirai IoT botnet ranged from 800,000 to 2.5M computers.
❖ Mobile Malware:
Attacks targeting mobile devices have risen 50 percent since last year. Mobile
malware threats are as various as those targeting desktops and include Trojans,
ransomware, advertising click fraud and more. They are distributed through
phishing and malicious downloads and are a particular problem for jailbroken
phones, which tend to lack the default protections that were part of those devices’
original operating systems.
❖ Wiper Malware:
A wiper is a type of malware with a single purpose: to erase user data and ensure it
can’t be recovered. Wipers are used to take down computer networks in public or
private companies across various sectors. Threat actors also use wipers to cover up
traces left after an intrusion, weakening their victim’s ability to respond.
In this way, hackers present malware in a form aimed at persuading people to install it. Because it
seems legitimate, users do not know what the program really is. Usually, we install it thinking that
it is secure, but on the contrary, it is a major threat. That is how the malware gets into a
system. Once on the system, it disperses and hides in numerous files, making it very
difficult to identify. In order to access and record personal or useful information, it may
connect directly to the operating system and start encrypting it. Detection of malware is
defined as the process of searching for malware files and directories. There are several tools and
methods available to detect malware that make detection efficient and reliable. Some of the general
strategies for malware detection are:
● Signature-based
● Heuristic Analysis
● Anti-malware Software
● Sandbox
Several classifiers have been implemented, such as linear classifiers (logistic regression,
the naive Bayes classifier), support vector machines, neural networks, random forests, etc.
Malware can be identified through both static and dynamic analysis:
Antivirus software is used to prevent, detect, and remove malware, including but not
limited to computer viruses, computer worms, Trojan horses, spyware and adware. A variety
of strategies are typically employed by the antivirus engines. Signature-based detection
involves searching for known patterns of data within executable code. However, it is
possible for a computer to be infected with a new virus for which no signatures exist. To
counter such “zero-day” threats, heuristics can be used to identify new viruses or variants
of existing viruses by looking for known malicious code. Some antivirus engines can also make
predictions by executing files in a sandbox and analyzing the results.
Often, antivirus software can impair a computer's performance. Any incorrect decision may
lead to a security breach, since it runs at the highly trusted kernel level of the operating
system. If the antivirus software employs heuristic detection, success depends on achieving
the right balance between false positives and false negatives. Today, malware may no
longer be executable files. Powerful macros in Microsoft Word could also present a
security risk. Traditionally, antivirus software heavily relied upon signatures to identify
malware. However, because of newer kinds of malware, signature-based approaches are no
longer effective.
Although standard antivirus can effectively contain virus outbreaks, for large enterprises,
any breach could be potentially fatal. Virus makers are employing "oligomorphic",
"polymorphic" and, "metamorphic" viruses, which encrypt parts of themselves or modify
themselves as a method of disguise, so as to not match virus signatures in the dictionary.
Studies in 2007 showed that the effectiveness of antivirus software had decreased
drastically, particularly against unknown or zero-day attacks. Detection rates dropped
from 40-50% in 2006 to 20-30% in 2007. The problem is magnified by the changing intent
of virus makers. Independent testing on all the major virus scanners consistently shows that
none provide 100% virus detection.
In recent years, machine learning approaches have come into significantly high demand in
many industries and businesses for the purpose of obtaining meaningful data insights and
automating analysis. Machine Learning (ML) is one of the emerging domains closely
associated with the research field of Artificial Intelligence (AI). Machine learning, in
conjunction with AI, is referred to as a field of study that gives computers the ability to
learn without being explicitly programmed; according to Tom M. Mitchell's definition,
“A computer program is said to learn from experience E with respect to some class of
tasks T and performance measure P, if its performance at tasks in T, as measured by P,
improves with experience E.”
Clearly, machine learning can be defined as the ability of a computer program, based on
computational algorithms, to automatically learn the underlying patterns from given
information and data and to provide useful insights. Besides, knowledge discovery
processes, which include data mining, are crucial in machine learning programs, as the
extraction of known and unknown knowledge from large-scale data sources is used as the
basis for data insights and for further exploration toward key decisions from the given
data. Numerous applications widely adopt machine learning techniques, such as stock
prediction, credit scoring, smart medical checks, malware detection, and many more, as
these applications are beneficial in delivering useful predictive analysis. The following
diagram shows a general scheme for machine learning.
A supervised model refers to an algorithmic learning model that infers the underlying
patterns and relationships between labeled data and the target values of unlabelled data
that are subject to prediction. Consider a malware detection example based on the machine
learning classification approach, where a labeled training dataset of files, with labels of
benign and malicious, is used for the learning and training tasks of the model. The labels
identify each data point in the dataset. The model is trained to generalize the patterns and
feature knowledge from the given dataset. The model then applies a classification function
to unseen, unlabelled test data, where it classifies and predicts according to the supplied
labels and the trained dataset, and produces a possible outcome prediction.
Fig. 2 Supervised learning overview
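To make this concrete, the following is a minimal sketch of such a supervised benign/malicious classifier. The file name memory_features.csv, its column layout, and the choice of logistic regression are illustrative assumptions, not the project's actual pipeline.

```python
# Minimal sketch of the supervised workflow described above (illustrative only).
# Assumes a hypothetical CSV "memory_features.csv" whose rows are samples, with
# numeric feature columns and a "label" column holding "benign" or "malicious".
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

df = pd.read_csv("memory_features.csv")          # labeled training dataset
X = df.drop(columns=["label"])                   # feature matrix
y = df["label"]                                  # benign / malicious labels

# Hold out unseen test data to evaluate the trained model.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Accuracy on unseen data:", accuracy_score(y_test, model.predict(X_test)))
```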
Memory analysis can overcome the limitations of static and dynamic analysis methods.
With memory analysis, the limitations of malware signatures created as a result of static
analysis can be overcome. Memory-based features can also overcome some dynamic
analysis limitations, such as the hidden behavior of malware during analysis. Although
memory analysis is basically a static analysis, it is a known fact that new generation
malware does not exhibit some behaviors during static analysis. However, since such
hidden behaviors can be detected with memory analysis, it provides significant gains in
malware detection compared to traditional static analysis. Malware leaves some traces in
memory. With memory analysis, some information about the behavioral characteristics of
malware can be obtained using information such as terminated processes, DLL records,
registries, active network connections, and running services. Memory analysis work
consists of two stages, memory acquisition and memory analysis. Memory acquisition is
the stage of obtaining a full image of the machine memory. Memory analysis is the phase
of examining and analyzing the movements of malware, usually using a forensic memory
tool. In this way, it becomes possible to detect hidden malware with memory analysis.
Basically, each computational device is composed of two principal components that
perform the computational processing and basic instructions of a system: the physical
memory and the processor. These components are considered to hold forensic value, as the
processor includes program executions and the processes of the central processing unit
(CPU) of the whole computer system, whereas the volatile physical memory consists of
temporarily stored data related to the processor and the executed programs of the active
system. In terms of modern computer system architecture, the CPU is often referred to as the
processor; it indirectly accesses and requests the main memory (RAM) via the Memory
Controller Hub for instructional commands to execute and for data to process.
The need for memory acquisition has increased as more information involved in
cybercrimes and network attacks is stored in computer memory. Memory acquisition is
highly prioritized for any identified live compromised computer, as memory contains
extremely volatile data that ranks at the top of the order of volatility. Memory images and
snapshots can only be captured from a running system, since the memory's content fades
away once the system is turned off completely or rebooted.
Volatile and Non-volatile memory are the two types of memory available in the system.
Volatile memory stores data temporarily, whereas non-volatile memory stores data permanently in the
system. Memory stores the current working state of processes, registers, process stacks, deleted
files, and encrypted data. Volatile memory or Random Access Memory (RAM) only
maintains its data while the computer or device is powered on. Non-volatile Memory, or
NVRAM, is for longer term storage. When a computer is powered off, evidence in RAM is
lost and normally cannot be recovered, however, the data in NVRAM often remains after
the system is powered off and can be analyzed after the fact.
The most widely used memory forensics platform for memory acquisition and analysis is
known to be Volatility Framework. This tool is beneficial to analyze captured and imaged
volatile memory for valuable information about the runtime state of the system, and
provides the ability to link artifacts from traditional forensic analysis. The tool also
provides a range of plugins to analyze the memory artifacts of the six main areas mentioned
earlier. In addition, the framework is Python-based and can also be used as a Python library.
At the initial analysis of a memory image, it is important to distinguish the system's running
processes. The following section briefly describes the system process as an artifact,
along with the process-related artifact logs that are used for the analysis with the Volatility
Framework plugins.
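As an illustration of how such plugins can be driven programmatically, the sketch below runs two process-related plugins over a memory image and saves their output for later feature extraction. The image name, the plugin list, and the assumption that Volatility 3 is installed as vol.py are hypothetical; the exact invocation depends on the Volatility version used.

```python
# Hedged sketch: run Volatility 3 process plugins over a memory image from Python.
# Paths and the plugin list are assumptions, not the project's exact workflow.
import subprocess

MEMORY_IMAGE = "memdump.dmp"                    # hypothetical memory image
PLUGINS = ["windows.pslist", "windows.psscan"]  # process listing / scanning plugins

for plugin in PLUGINS:
    # Equivalent to running: python3 vol.py -f memdump.dmp windows.pslist
    result = subprocess.run(
        ["python3", "vol.py", "-f", MEMORY_IMAGE, plugin],
        capture_output=True, text=True, check=True,
    )
    # Each plugin prints a table of processes; keep it for later feature extraction.
    with open(f"{plugin}.txt", "w") as fh:
        fh.write(result.stdout)
```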
Chapter 4: Algorithms Applied
● DECISION TREE
● RANDOM FOREST
● SVM
● XGBOOST
4.1 DECISION TREE
A decision tree is a supervised learning technique that can be used for both classification
and regression problems, but it is mostly suited to classification. It is a tree-structured
classifier, with internal nodes representing features of the dataset, branches representing
decision rules, and each leaf node representing an outcome. A decision tree has two types
of nodes: decision nodes and leaf nodes. Decision nodes are used to make decisions and
have multiple branches, while leaf nodes are the results of those decisions and contain no
further branches. A decision or test is made based on the features of the given dataset. It is
a graphical representation for obtaining all possible solutions to a problem/decision based
on given conditions. It is called a decision tree because, like a tree, it starts from a root node
and expands into further branches to build a tree-like structure. To build the tree, we use
the CART algorithm, which stands for Classification and Regression Tree algorithm. A
decision tree simply asks a question and splits into subtrees based on the answer (yes/no).
● Root node: The root node is the starting point of the decision tree. It
represents the entire data set, which is further divided into two or more
homogeneous data sets.
● Leaf node: A leaf node is the last output node and the tree cannot be further
split after a leaf node is obtained.
● Splitting: Splitting is the process of splitting a decision node/root node into
subnodes according to specified criteria.
● Branch/subtree: A subtree formed by splitting the tree.
● Pruning: Pruning removes unwanted branches from trees.
● Parent/child nodes: The root node of the tree is called the parent node, and
the other nodes are called child nodes.
4.1.2 Working
In a decision tree, the algorithm for predicting the class of a given record starts at the root
node of the tree. The algorithm compares the value of the root attribute with the
corresponding attribute of the actual record and, based on that comparison, follows a
branch to jump to the next node. At the next node, the algorithm again compares the
attribute values with the other subnodes and continues. This process repeats until a leaf
node of the tree is reached. The complete process can be better understood with the
following algorithm (a short code sketch follows the steps):
Step 1: Start the tree with the root node, S, which contains the complete dataset.
Step 2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
Step 3: Divide S into subsets containing the possible values of the best attribute.
Step 4: Create a decision tree node containing the best attribute.
Step 5: Recursively build a new decision tree using the subsets of the dataset created in
Step 3. Continue this process until a stage is reached where the nodes cannot be classified
any further; the last node is then called a leaf node.
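The sketch below shows this recursive procedure using scikit-learn's CART implementation. The CSV name and the hyperparameter values are assumptions made only for illustration.

```python
# Sketch of the stepwise procedure above using scikit-learn's CART implementation.
# "memory_features.csv" and the hyperparameters are illustrative assumptions.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.model_selection import train_test_split

df = pd.read_csv("memory_features.csv")            # hypothetical labeled dataset
X, y = df.drop(columns=["label"]), df["label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# criterion selects the Attribute Selection Measure: "entropy" corresponds to
# information gain, "gini" to the Gini index; max_depth limits the recursion.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=5, random_state=0)
tree.fit(X_train, y_train)

print(export_text(tree, feature_names=list(X.columns)))  # the learned decision rules
print("Test accuracy:", tree.score(X_test, y_test))
```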
4.1.3 Example
Suppose there is a candidate who has a job offer and wants to decide whether to accept it.
To solve this problem, the decision tree starts from the root node (the Salary attribute,
selected by ASM). The root node is split into the next decision node (distance from the
office) and a leaf node based on the corresponding labels. The next decision node is further
split into another decision node (cab facility) and a leaf node. Finally, that decision node is
split into two leaf nodes (accepted offer and declined offer). Consider the following
illustration.
A major problem when implementing decision trees is how to select the best attributes for
the root node and subnodes. Therefore, to solve such problems, there is a technique called
Attribute Selection Measure or ASM. This measure makes it easy to choose the best
attributes for the nodes of the tree. There are two popular techniques for ASM, which are:
● Information Gain
● Gini Index
1. Information Gain:
Information gain measures the change in entropy after the dataset is split on an attribute:
Information Gain = Entropy(S) - [(Weighted Average) x Entropy(each feature)]
Entropy(S) = -P(Yes) log2 P(Yes) - P(No) log2 P(No)
Where,
S = the total number of samples
P(Yes) = Probability of Yes
P(No) = Probability of No
2. Gini Index:
The Gini index measures impurity; an attribute with a low Gini index is preferred when
creating binary splits with the CART algorithm:
Gini Index = 1 - Σj (Pj)²
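As a small worked illustration of these two measures, the helper functions below compute entropy, the Gini index, and the information gain of a split for a list of class labels; the example labels are made up for demonstration.

```python
# Helper functions illustrating the two Attribute Selection Measures above.
# The example labels are invented purely for demonstration.
import numpy as np

def entropy(labels):
    # Entropy(S) = -P(Yes) * log2 P(Yes) - P(No) * log2 P(No)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def gini_index(labels):
    # Gini Index = 1 - sum over classes of P_j^2
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - (p ** 2).sum())

def information_gain(parent_labels, child_label_groups):
    # Gain = Entropy(parent) - weighted average entropy of the child subsets
    n = len(parent_labels)
    weighted = sum(len(g) / n * entropy(g) for g in child_label_groups)
    return entropy(parent_labels) - weighted

labels = ["malicious"] * 6 + ["benign"] * 4
print(entropy(labels), gini_index(labels))
# A perfect split recovers the full entropy of the parent as information gain.
print(information_gain(labels, [labels[:6], labels[6:]]))
```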
● It may have an overfitting issue, which can be resolved using the Random Forest
algorithm.
● For more class labels, the computational complexity of the decision tree may
increase.
4.2 RANDOM FOREST
Random Forest is a popular machine learning algorithm that belongs to the supervised
learning technique. It can be used for both classification and regression problems in ML.
It is based on the concept of ensemble learning, which is the process of combining multiple
classifiers to solve a complex problem and improve the performance of the model.
As the name suggests, "Random Forest is a classifier that contains a number of decision
trees on various subsets of the given dataset and takes the average to improve the
predictive accuracy of that dataset." Instead of relying on one decision tree, the random
forest takes the prediction from each tree and, based on the majority vote of those
predictions, outputs the final prediction.
A greater number of trees in the forest leads to higher accuracy and prevents the
problem of overfitting.
The below diagram explains the working of the Random Forest algorithm:
Fig. 6 Working of Random Forest
Since the random forest combines multiple trees to predict the class of the dataset, it is
possible that some decision trees may predict the correct output, while others may not. But
together, all the trees predict the correct output. Therefore, below are two assumptions for a
better Random forest classifier:
● There should be some actual values in the feature variable of the dataset so that the
classifier can predict accurate results rather than a guessed result.
● The predictions from each tree must have very low correlations.
4.2.1 Working
Random Forest works in two phases: the first is to create the random forest by combining
N decision trees, and the second is to make predictions using the trees created in the first phase.
The working process can be explained by the steps and diagram below:
Step 1: Select random K data points from the training set.
Step 2: Build the decision trees associated with the selected data points (subsets).
Step 3: Choose the number N of decision trees that you want to build.
The working of the algorithm can be better understood with the example below:
4.2.2 Example
Suppose there is a dataset that contains multiple fruit images. This dataset is given to
the Random Forest classifier. The dataset is divided into subsets and given to each decision
tree. During the training phase, each decision tree produces a prediction result, and when a
new data point arrives, the Random Forest classifier predicts the final decision based on the
majority of those results. Consider the image below:
Random Forest is mostly used in sectors such as the following:
1. Banking: Banking sector mostly uses this algorithm for the identification of
loan risk.
2. Medicine: With the help of this algorithm, disease trends and risks of the
disease can be identified.
3. Land Use: We can identify the areas of similar land use by this algorithm.
● Although random forest can be used for both classification and regression tasks, it
is less suitable for regression tasks.
● A trained forest may require significant memory for storage, due to the need for
retaining the information from several hundred individual trees.
● A forest is less interpretable than a single decision tree. Single trees may be
visualized as a sequence of decisions.
● Although random forests can be an improvement on single decision trees, more
sophisticated techniques are available. Prediction accuracy on complex problems is
usually inferior to gradient-boosted trees.
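To round off this section, the sketch below trains the bagged-tree ensemble just described on the same hypothetical feature CSV used earlier; the file name and hyperparameters are assumptions for illustration only.

```python
# Sketch of the Random Forest ensemble described above (illustrative only).
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

df = pd.read_csv("memory_features.csv")            # hypothetical labeled dataset
X, y = df.drop(columns=["label"]), df["label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# n_estimators is the number N of decision trees; each tree is grown on a
# bootstrap sample, and the final class is decided by majority vote.
forest = RandomForestClassifier(n_estimators=200, random_state=0, n_jobs=-1)
forest.fit(X_train, y_train)

print(classification_report(y_test, forest.predict(X_test)))
# Feature importances hint at which memory artifacts drive the classification.
print(sorted(zip(forest.feature_importances_, X.columns), reverse=True)[:5])
```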
4.3 SVM
Support Vector Machine (SVM) is a supervised learning algorithm used for classification
(and regression); its goal is to create the best decision boundary, or hyperplane, that
separates the data into classes. SVM chooses the extreme points/vectors that help in
creating the hyperplane. These extreme cases are called support vectors, and hence the
algorithm is termed a Support Vector Machine. Consider the diagram below, in which two
different categories are classified using a decision boundary or hyperplane:
4.3.1 Example
SVM can be understood with an example similar to the one used for the KNN classifier.
Suppose we see a strange cat that also has some features of dogs; if we want a model that
can accurately identify whether it is a cat or a dog, such a model can be created using the
SVM algorithm. We first train our model with many images of cats and dogs so that it can
learn their different features, and then we test it with this strange creature. The support
vector machine creates a decision boundary between the two classes (cat and dog) and
chooses the extreme cases (support vectors) of cats and dogs. On the basis of the support
vectors, it classifies the creature as a cat. Consider the diagram below:
Fig. 9 Working of SVM
● Linear SVM: Linear SVM is used for linearly separable data, which means that if a
dataset can be classified into two classes by using a single straight line, then such
data is termed linearly separable data, and the classifier used is called a Linear
SVM classifier.
● Non-linear SVM: Non-linear SVM is used for non-linearly separable data, which
means that if a dataset cannot be classified by using a straight line, then such data is
termed non-linear data, and the classifier used is called a Non-linear SVM classifier.
The dimensions of the hyperplane depend on the number of features present in the dataset:
if there are 2 features (as shown in the image), the hyperplane will be a straight line, and if
there are 3 features, the hyperplane will be a 2-dimensional plane.
We always create the hyperplane that has the maximum margin, that is, the maximum
distance between the hyperplane and the nearest data points of each class.
Support Vectors: The data points or vectors that are closest to the hyperplane and affect
its position are termed support vectors. Since these vectors support the hyperplane, they
are called support vectors.
4.3.3 Working
Linear SVM :
The working of the SVM algorithm can be understood by using an example. Suppose we
have a dataset that has two tags (green and blue), and the dataset has two features x1 and
x2. We want a classifier that can classify the pair (x1, x2) of coordinates as either green or
blue. Consider the image below:
Since it is a 2-D space, we can easily separate these two classes by just using a straight
line. But there can be multiple lines that can separate these classes. Consider the image
below:
Fig. 10.2 Separation of dataset using multiple lines
Hence, the SVM algorithm helps to find the best line or decision boundary; this best
boundary or region is called a hyperplane. SVM algorithm finds the closest point of the
lines from both the classes. These points are called support vectors. The distance between
the vectors and the hyperplane is called the margin. And the goal of SVM is to maximize
this margin. The hyperplane with maximum margin is called the optimal hyperplane.
So to separate these data points, we need to add one more dimension. For linear data, we
have used two dimensions x and y, so for non-linear data, we will add a third dimension z.
It can be calculated as:
z = x² + y²
By adding the third dimension, the sample space will become as below image:
Fig. 10.5 Sample space of non linear SVM in 3D
So now, SVM will divide the datasets into classes in the following way. Consider the
below image:
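The sketch below trains both a linear and an RBF-kernel SVM on the same hypothetical feature CSV; the scaling step, file name, and parameter values are assumptions added for illustration.

```python
# Sketch of linear and non-linear SVMs on the hypothetical feature CSV.
# Features are standardized first, since SVMs are sensitive to feature scale.
import pandas as pd
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

df = pd.read_csv("memory_features.csv")
X, y = df.drop(columns=["label"]), df["label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Linear SVM: finds the maximum-margin hyperplane for linearly separable data.
linear_svm = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0))
# Non-linear SVM: the RBF kernel plays the role of the added dimension z = x^2 + y^2.
rbf_svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))

for name, model in [("linear", linear_svm), ("rbf", rbf_svm)]:
    model.fit(X_train, y_train)
    print(name, "accuracy:", model.score(X_test, y_test))
```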
4.4. XGBOOST
Ever since its introduction in 2014, XGBoost has been lauded as the holy grail of machine
learning hackathons and competitions. From predicting ad click-through rates to
classifying high energy physics events, XGBoost has proved its mettle in terms of
performance – and speed.
Tianqi Chen, one of the co-creators of XGBoost, announced in 2016 that the innovative
system features and algorithmic optimizations in XGBoost had made it ten times faster
than the most sought-after machine learning solutions of the time.
XGBoost is an ensemble learning method. Sometimes, it may not be sufficient to rely upon
the results of just one machine learning model. Ensemble learning offers a systematic
solution to combine the predictive power of multiple learners. The resultant is a single
model which gives the aggregated output from several models.
The models that form the ensemble, also known as base learners, could be either from the
same learning algorithm or different learning algorithms. Bagging and boosting are two
widely used ensemble learners. Though these two techniques can be used with several
statistical models, the most predominant usage has been with decision trees.
Let’s briefly discuss bagging before taking a more detailed look at the concept of boosting.
● Bagging : While decision trees are one of the most easily interpretable models,
they exhibit highly variable behavior. Consider a single training dataset that we
randomly split into two parts. Now, let’s use each part to train a decision tree in
order to obtain two models.
When we fit both these models, they would yield different results. Decision trees
are said to be associated with high variance due to this behavior. Bagging, or
bootstrap aggregation, helps to reduce the variance of any learner. Several decision
trees, generated in parallel, form the base learners of the bagging technique.
Data sampled with replacement is fed to these learners for training. The final
prediction is the averaged output from all the learners.
● Boosting : In boosting, the trees are built sequentially such that each subsequent
tree aims to reduce the errors of the previous tree. Each tree learns from its
predecessors and updates the residual errors. Hence, the tree that grows next in the
sequence will learn from an updated version of the residuals.
The base learners in boosting are weak learners in which the bias is high, and the
predictive power is just a tad better than random guessing. Each of these weak
learners contributes some vital information for prediction, enabling the boosting
technique to produce a strong learner by effectively combining these weak learners.
The final strong learner brings down both the bias and the variance.
In contrast to bagging techniques like Random Forest, in which trees are grown to
their maximum extent, boosting makes use of trees with fewer splits. Such small
trees, which are not very deep, are highly interpretable. Parameters like the number
of trees or iterations, the rate at which the gradient boosting learns, and the depth of
the tree, could be optimally selected through validation techniques like k-fold cross
validation. Having a large number of trees might lead to overfitting. So, it is
necessary to carefully choose the stopping criteria for boosting.
To improve the performance of an initial model F1, we can fit a new learner h1 to the
residuals of F1 (the differences between the true targets and F1's predictions) and
create a new model F2:
F2(x) = F1(x) + h1(x)
This can be done for m iterations, until the residuals have been minimized as much as
possible:
Fm(x) = Fm-1(x) + hm(x)
Here, the additive learners do not disturb the functions created in the previous
steps. Instead, they impart information of their own to bring down the errors.
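A toy numerical sketch of this additive update is shown below: a shallow regression tree F1 is fit to synthetic data, a second tree h1 is fit to its residuals, and F2 = F1 + h1 reduces the error. The data and tree depths are invented purely to demonstrate the idea.

```python
# Toy sketch of boosting on residuals: F2(x) = F1(x) + h1(x).
# The synthetic data and tree depths are invented for demonstration only.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)   # synthetic target

F1 = DecisionTreeRegressor(max_depth=2).fit(X, y)       # weak learner (shallow tree)
residuals = y - F1.predict(X)                           # what F1 got wrong
h1 = DecisionTreeRegressor(max_depth=2).fit(X, residuals)

F2_pred = F1.predict(X) + h1.predict(X)                 # additive update
print("MSE of F1:", np.mean((y - F1.predict(X)) ** 2))
print("MSE of F2:", np.mean((y - F2_pred) ** 2))        # lower than F1's error
```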
4.4.3 Advantages of XGBOOST
XGBoost is a popular implementation of gradient boosting. Let’s discuss some features of
XGBoost that make it so interesting.
● Regularization: XGBoost has an option to penalize complex models through both
L1 and L2 regularization. Regularization helps in preventing overfitting
● Handling sparse data: Missing values or data processing steps like one-hot
encoding make data sparse. XGBoost incorporates a sparsity-aware split finding
algorithm to handle different types of sparsity patterns in the data
● Weighted quantile sketch: Most existing tree based algorithms can find the split
points when the data points are of equal weights (using quantile sketch algorithm).
However, they are not equipped to handle weighted data. XGBoost has a
distributed weighted quantile sketch algorithm to effectively handle weighted data
● Block structure for parallel learning: For faster computing, XGBoost can make use
of multiple cores on the CPU. This is possible because of a block structure in its
system design. Data is sorted and stored in in-memory units called blocks. Unlike
other algorithms, this enables the data layout to be reused by subsequent iterations,
instead of computing it again. This feature also proves useful for steps like split
finding and column sub-sampling
● Cache awareness: In XGBoost, non-contiguous memory access is required to get
the gradient statistics by row index. Hence, XGBoost has been designed to make
optimal use of hardware. This is done by allocating internal buffers in each thread,
where the gradient statistics can be stored
● Out-of-core computing: This feature optimizes the available disk space and
maximizes its usage when handling huge datasets that do not fit into memory
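The sketch below trains an XGBoost classifier on the same hypothetical feature CSV; the hyperparameter values are illustrative defaults, not the project's tuned settings.

```python
# Sketch of training XGBoost on the hypothetical memory-feature CSV.
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score

df = pd.read_csv("memory_features.csv")
X = df.drop(columns=["label"])
y = LabelEncoder().fit_transform(df["label"])           # benign/malicious -> 0/1

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = xgb.XGBClassifier(
    n_estimators=300,       # number of boosted trees
    max_depth=4,            # shallow trees, as described for boosting
    learning_rate=0.1,      # shrinkage applied to each tree's contribution
    reg_lambda=1.0,         # L2 regularization to penalize complex models
    n_jobs=-1,
)
model.fit(X_train, y_train)
print("XGBoost accuracy:", accuracy_score(y_test, model.predict(X_test)))
```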
This chapter details the prerequisites, environment setup, and analysis techniques used to
investigate and analyze a memory image for process-related artifacts that are likely to be
hidden in memory, with consideration of other memory artifacts. It also describes the data
collection approach for the machine learning models.
5.1 Prerequisite
2. Host Operating System (Version: Windows 11, 64-bit, Build 9600, 6.3.9600)
Memory image dataset: a research dataset was utilized for this experimental study of
analyzing the process-related artifacts of memory images. The dataset consists of realistic
memory snapshots (4300 positive and 300 negative) acquired from compromised and
uncompromised Windows 11 virtual machines. The memory images/dumps were acquired
using the in-built memory dump feature in Windows, stored as .dmp files, and converted
into Comma Separated Value (.csv) files. Briefly, the memory images were compromised
using several malware samples based on obfuscation and evasion techniques.
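As an illustration of how such converted CSV files could be combined into a single labeled dataset for the models, a small sketch follows; the file names and the simple missing-value handling are assumptions, not the project's exact preprocessing.

```python
# Sketch: merge the converted .csv memory-dump features into one labeled dataset.
# File names and preprocessing choices are assumptions for illustration.
import pandas as pd

benign = pd.read_csv("uncompromised_dumps.csv")      # negative samples
malicious = pd.read_csv("compromised_dumps.csv")     # positive samples
benign["label"] = 0
malicious["label"] = 1

data = pd.concat([benign, malicious], ignore_index=True)
data = data.fillna(0)                                # simple handling of missing values
data.to_csv("memory_features.csv", index=False)      # consumed by the model sketches
print(data["label"].value_counts())
```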
5.2 Analysis Approach
Step 3: Then use the NotMyFault tool to trigger an unexpected error in the system.
This section details the implementation of the tools, including the main implementation
code for data collection and preprocessing, the machine learning models, and the classifier,
with regard to the programming tools used.
6.1 Overview
The memory dumps are analyzed using the various machine learning algorithms, whose
accuracies are given below:
By analyzing the above table, it is concluded that the most efficient algorithm for malware
analysis is Random Forest.
This is a comparative analysis of the different machine learning algorithms used for
malware analysis, based on their prediction accuracy and error. It is found that the accuracy
of the Random Forest algorithm is higher than that of any of the other algorithms used for
the malware analysis, and its error value is also lower than that of any other machine
learning algorithm used.
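For reference, a sketch of how such a comparison can be reproduced is given below: the four algorithms are trained on the same split and their accuracy and error are reported. The hyperparameters are illustrative, and the project's reported results table is not reproduced here.

```python
# Sketch of the comparative study: evaluate the four algorithms on one split.
# Hyperparameters are illustrative; the project's reported figures are not reproduced.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier

df = pd.read_csv("memory_features.csv")
X = df.drop(columns=["label"])
y = LabelEncoder().fit_transform(df["label"])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "SVM": SVC(),
    "XGBoost": XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.1),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{name:13s} accuracy={acc:.4f} error={1 - acc:.4f}")
```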
Chapter 8: Future Work
This chapter outlines some sustainable work and assumptions that might be addressed in
further research on the project. Some of the project's remaining sub-aims of
implementation, testing and experimentation have been left for future work due to the
limitations and time constraints of the project.
In addition, this project requires further analysis and extraction of other artifact data from
memory images, which is considerable and very time consuming. Some of the sub-aims
and functional requirements that could be developed as future work are listed below:
• Testing and improving the results of the Random Forest classifier model by better
understanding and validating the feature attributes, for example prioritizing the
pslist and psscan features over the other features when classifying.
• Implementing the desired functional requirements of the classifier tool so that it
can show more results beyond the initial outcomes of the project.
• Including other investigation and analysis areas of memory forensics, such as
DLLs, handles and threads, which is considerably important for future research.
Chapter 9: Conclusion
As per the study, it is found that Random Forest was able to classify the malware with
99.4% accuracy. We have proposed a malware detection module based on advanced data
mining and machine learning. While such a method may not be suitable for home users,
being very processor-heavy, it can be implemented at the enterprise gateway level to act
as a central antivirus engine that supplements the antivirus software present on end-user
computers. This will not only easily detect known viruses but also act as a knowledge base
that will detect newer forms of harmful files. While such a model requires costly
infrastructure, it can help protect invaluable enterprise data from security threats and
prevent immense financial damage.
References