Malware Analysis Using Machine Learning (Paper Presented)


COMPARATIVE STUDY OF FILELESS MALWARES USING

MACHINE LEARNING

The Project report submitted in partial fulfillment of the requirements for the award of the degree of
BACHELOR OF ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING
Submitted by
AKASH NAIR.V.S
(REG.NO: 20BFS007)

Under the Guidance of

Mr .MIDHUN.S
Assistant Professor and Head, Department of DCFS

DEPARTMENT OF DIGITAL AND CYBER FORENSICS SCIENCE

SREE SARASWATHI THYAGARAJA COLLEGE


(Autonomous)

An Autonomous, NAAC Re-Accredited with A Grade, ISO 21001:2018 Certified
Institution, Affiliated to Bharathiar University, Coimbatore Approved by AICTE
for MBA/MCA and by UGC for 2(f) & 12(B) status Pollachi-642 107

DECEMBER - 2022
SREE SARASWATHI THYAGARAJA COLLEGE POLLACHI
(Autonomous)
(Affiliated to Bharathiar University, Coimbatore)

CERTIFICATE

This is to certify that the project work entitled COMPARATIVE STUDY OF FILELESS
MALWARE DETECTION USING MACHINE LEARNING submitted to Sree
Saraswathi Thyagaraja College (Autonomous), Pollachi, affiliated to Bharathiar
University, Coimbatore in partial fulfillment of the requirements for the award of the
degree of BACHELOR OF DIGITAL AND CYBER FORENSICS SCIENCE is a
record of original work done by AKASH NAIR.V.S under my supervision and guidance
and the report has not previously formed the basis for the award of any Degree/ Diploma/
Associateship/ Fellowship or other similar title to any candidate of any University.

14-12-2022 GUIDE

POLLACHI (Mr .MIDHUN.S )

Counter Signed By

HOD DIRECTOR-SoCS PRINCIPAL

(Mr .MIDHUN.S ) (Dr. A. Abdul Rasheed) (Dr.A.Somu)

Viva-Voce Examination held on _________________

INTERNAL EXAMINER EXTERNAL EXAMINER


DECLARATION

I, JAISON.V.R hereby declare that the project report entitled “COMPARATIVE STUDY
OF FILELESS MALWARE DETECTION USING MACHINE LEARNING”
submitted to Sree Saraswathi Thyagaraja College (Autonomous), Pollachi, affiliated to
Bharathiar University, Coimbatore in partial fulfillment of the requirements for the
award of the degree of BACHELOR OF DIGITAL AND CYBER FORENSICS
SCIENCE is a record of original work done by me under the guidance of Mr
.MIDHUN.S, Assistant Professor and Head, Department of DIGITAL AND CYBER
FORENSICS SCIENCE and it has not previously formed the basis for the award of any
Degree/Diploma /Associateship /Fellowship or other similar title to any candidate of any
University.

POLLACHI Signature of the Candidate


14-12-2022 (AKASH NAIR.V.S)
ACKNOWLEDGEMENT

I take this opportunity to express my gratitude and sincere thanks to everyone who helped
me in my project.

I wish to express my heartfelt thanks to the Management of Sree Saraswathi Thyagaraja
College for providing me with excellent infrastructure during the course of study and
project.

I wish to express my deep sense of gratitude to Dr. A. SOMU,
MBA.,M.Com.,M.Phil.,Ph.D, Principal, Sree Saraswathi Thyagaraja College for
providing me excellent facilities and encouragement during the course of study and
project.

I express my sincere and heartfelt thanks to Dr. A.Abdul Rasheed,
B.Sc.,MCA.,M.E.,Ph.D., Director-School of Computing Science, Sree Saraswathi
Thyagaraja College for the support he offered.

I express my deep sense of gratitude and sincere thanks to my beloved staff members
MRS. VINEETHA V, MR. FEBIN PRAKASH & MRS. D.MANJULA, who allowed me to
carry out this project and gave me complete freedom to utilize the resources of the
department.

It is my prime duty to solemnly express my deep sense of gratitude and sincere thanks to my
guide Mr .MIDHUN.S, MCA, Assistant Professor and Head, UG Department of
Digital and Cyber Forensic Science for his valuable advice and excellent guidance to
complete the project successfully.

I also convey my heartfelt thanks to my parents, friends and all the staff members of the
Department of DIGITAL AND CYBER FORENSICS SCIENCE for their valuable support
which energized me to complete this project.
SYNOPSIS

Malware attacks are among the most serious cyber risks that countries face, and the
number of reported vulnerabilities and malware samples is increasing rapidly. The study of
malware behavior has consequently received tremendous attention from researchers. Several
factors contribute to the increase in malware attacks. Malware authors create and deploy
malware that can mutate and that takes different forms, such as ransomware and fileless
malware, in order to avoid detection. Such malware and cyber attacks are difficult to
detect using traditional cyber security procedures, so solutions for new-generation cyber
attacks rely on various artificial intelligence techniques. Research shows that over the
last decade malware has been growing exponentially, causing substantial financial losses
to various organizations. Different anti-malware companies have been proposing solutions to
defend against these attacks, but the velocity, volume, and complexity of malware are
posing new challenges to the anti-malware community. Current state-of-the-art research
shows that researchers and anti-virus organizations have recently started applying machine
learning and deep learning methods to malware analysis and detection. We have used a
memory dump method and applied unsupervised learning in addition to supervised learning
for malware classification.

In this project we propose a method for identifying malware, including new threats, using
memory dumps and artificial intelligence techniques such as machine learning and deep
learning. The processes in the system are analyzed with several machine learning models in
order to find the most efficient and accurate one. With the help of a deep learning model,
existing malware can also be identified. We carry out a comparative study of four machine
learning algorithms to determine which is the most accurate. This study can help
practitioners choose the best algorithm and save time.
TABLE OF CONTENTS

Chapter 1 : Introduction
1.1.0 Evolution of Malware
1.2.0 History of Malware
1.3.0 Types of Malware
1.4.0 Malware detection Techniques
1.5.0 Need of Machine learning in Malware Detection

Chapter 2 : Machine learning
2.1.0 Machine learning techniques
2.1.1 Supervised learning
2.1.2 Unsupervised learning

Chapter 3 : Memory Analysis
3.1.0 Memory and its architecture
3.2.0 Memory acquisition/ Memory dump
3.3.0 Volatility framework

Chapter 4 : Algorithms applied
4.1. Decision Tree
4.1.1 Decision Tree Terminologies
4.1.2 Working of Decision Tree
4.1.3 Example
4.1.4 Advantages of the decision tree
4.1.5 Disadvantages of the Decision Tree
4.2. Random Forest
4.2.1 Working of Random Forest
4.2.2 Example
4.2.3 Application of Random Forest
4.2.4 Advantages of Random Forest
4.2.5 Disadvantages of Random Forest
4.3. SVM Algorithm
4.3.1 Example
4.3.2 Types of SVM
4.3.3 Working of SVM
4.4 XGBOOST
4.4.1 Why XGBOOST?
4.4.2 Ensemble Learning
4.4.3 Advantages of XGBOOST
4.4.4 Disadvantages of XGBOOST

Chapter 5 : Investigation and Analysis
5.1 Prerequisite
5.2 Analysis Approach
5.2.1 Acquisition of a compromised memory image

Chapter 6 : Implementation
6.1 Overview
6.2 Source code

Chapter 7 : Prediction Analysis

Chapter 8 : Future Work

Chapter 9 : Conclusion

References
Chapter 1: Introduction

In the early days, idealistic hackers attacked computers because they were eager to
prove themselves. Today, however, cracking machines is an industry. Despite recent
improvements in software and hardware security, attacks on computer systems have
increased in both frequency and sophistication. Regrettably, current methods for detecting
and analyzing unknown code samples have major drawbacks. The Internet is a critical part
of our everyday lives, and the number of services offered on it is rising daily. Numerous
reports indicate that the impact of malware is worsening at an alarming pace. Although
malware diversity is growing, anti-virus scanners are unable to meet security needs,
resulting in attacks on millions of hosts. According to Kaspersky Labs, around 6,563,145
different hosts were targeted, and in 2015, 4,000,000 unique malware artifacts were found.
Juniper Research (2016), in particular, projected that the cost of data breaches would rise
to $2.1 trillion globally by 2019. Current studies show that script-kiddies are generating
more and more automated attacks. To date, attacks on commercial and government
organizations, such as ransomware and other malware, continue to pose a significant threat
and challenge. Such attacks can come in various forms and sizes. Developing and providing
cybersecurity expertise is an enormous challenge for the global security community, and
the global scarcity of cybersecurity talent is widely recognized. Cybercrimes, such as
financial fraud, online child exploitation and payment fraud, are so common that they
demand an international 24-hour response and collaboration between multinational law
enforcement agencies. For both individual users and organizations, defending computer
systems against malware is therefore one of the most critical cybersecurity activities, as
even a single attack may result in compromised data and significant losses. Malware
attacks are among the most serious cyber risks that countries face, and the number of
reported vulnerabilities and malware samples is increasing rapidly. The study of malware
behavior has consequently received tremendous attention from researchers. Several factors
contribute to the growth of malware attacks. Malware authors create and deploy malware
that can mutate and that takes different forms, such as ransomware and fileless malware,
in order to avoid detection. Such malware and cyber attacks are difficult to detect using
traditional cyber security procedures, so solutions for new-generation cyber attacks rely
on various machine learning techniques.

1.1 Evolution of Malware


The diversity, sophistication and availability of malicious software present
enormous challenges to protecting networks and computer systems from attacks. Malware is
continually changing, which forces security researchers and scientists to strengthen their
cyber defenses to keep pace. The prevalence of malware has increased owing to polymorphic
and metamorphic methods, which are used to avoid detection and conceal the malware's true
intent. Polymorphic malware uses a polymorphic engine to mutate its code while keeping the
original functionality intact. The two most common ways to conceal code are packing and
encryption. Packers hide a program's real code behind one or more layers of compression;
at runtime, the unpacking routines restore the original code and execute it in memory.
Crypters encrypt and manipulate malware, or part of its code, to make it harder for
researchers to analyze. A crypter includes a stub that is used to encrypt and decrypt the
malicious code. Metamorphic malware rewrites its code into an equivalent form each time it
propagates. Malware authors can use multiple transformation techniques, including but not
limited to register renaming, code permutation, code expansion, code shrinking and
insertion of garbage code. The combination of these techniques has resulted in rapidly
growing quantities of malware, making forensic investigations of malware cases
time-consuming, expensive and more complicated. Conventional antivirus solutions that rely
on signature-based and heuristic/behavioral methods have some issues. A signature is a
unique feature or collection of features that, like a fingerprint, uniquely identifies an
executable. Signature-based approaches, however, are unable to identify unknown types of
malware. To overcome this problem, security researchers proposed behavior-based detection,
which analyzes the features and behavior of a file to decide whether it is malware,
although the search and evaluation may take some time. To address these drawbacks of
conventional antivirus engines and keep pace with new attacks and variants, researchers
have begun using machine learning to supplement their solutions, as machine learning is
well suited to processing large quantities of data.
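
The limitation of signature matching described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that a signature is a SHA-256 digest of the whole sample; real antivirus signatures are more elaborate, and the byte strings below are hypothetical placeholders rather than real malware.

```python
import hashlib

def signature(code: bytes) -> str:
    """Return a SHA-256 digest, used here as a stand-in for a file signature."""
    return hashlib.sha256(code).hexdigest()

# Hypothetical "known malware" sample and the signature database built from it.
known_sample = b"\x90\x90PAYLOAD-DO-SOMETHING-BAD"
signature_db = {signature(known_sample)}

def is_known_malware(code: bytes) -> bool:
    """Signature-based check: flag only exact matches in the database."""
    return signature(code) in signature_db

# A metamorphic-style variant: the same payload with garbage bytes inserted.
variant = b"\x90\x90\x00\x00PAYLOAD-DO-SOMETHING-BAD"

print(is_known_malware(known_sample))  # True
print(is_known_malware(variant))       # False
```

Inserting two garbage bytes leaves the payload unchanged but produces a different digest, so the variant is not matched by the database.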

Brief:
Malware, short for malicious software, consists of programming (code, scripts, active
content, and other software) designed to disrupt or deny operation, gather information that
leads to loss of privacy or exploitation, gain unauthorized access to system resources, and
other abusive behavior. It is a general term used to define a variety of forms of hostile,
intrusive, or annoying software or program code. Software is considered to be malware
based on the perceived intent of the creator rather than any particular features. Malware
includes computer viruses, worms, Trojan horses, spyware, dishonest adware, crime-ware,
most rootkits, and other malicious and unwanted software or programs.

In 2008, Symantec published a report stating that "the release rate of malicious code and
other unwanted programs may be exceeding that of legitimate software applications."
According to F-Secure, "As much malware was produced in 2007 as in the previous 20 years
altogether."

Since the rise of widespread Internet access, malicious software has been designed for
profit, for example through forced advertising. For instance, since 2003, the majority of
widespread viruses and worms have been designed to take control of users' computers for
black-market exploitation. Another category of malware, spyware, consists of programs
designed to monitor users' web browsing and steal private information. Spyware programs do
not spread like viruses; instead, they are installed by exploiting security holes or are
packaged with user-installed software, such as peer-to-peer applications.

Clearly, there is a very urgent need to find a suitable method to detect infected files, which
can even detect new viruses by studying the structure of system calls made by malware.

1.2. HISTORY OF MALWARE


A malicious software (malware) program is an application whose developer or
sender has malicious intent. While most programs and files you install or download are
completely harmless, some are designed to further hidden agendas, such as destroying your
files, stealing your information, or extracting a payment. During the late 1980s, most
malicious programs were simple boot sector and file infectors spread via floppy disks. As
computer network adoption and expansion continued through the first half of the 1990s,
malware distribution became easier, so volume increased. As technologies standardized,
certain types of malware proliferated. Macro viruses that exploited Microsoft Office
products (and that enabled malware to be spread via email attachments) gained a
distribution boost from the increased adoption of email. By the mid-1990s, businesses
became increasingly affected, due in large part to macro viruses, meaning propagation had
become network-driven.

Malware in the 21st Century

An increase in the use of exploit kits (programs used by cybercriminals to exploit system
vulnerabilities) led to an explosion of malware delivered online during the 2000s.
Automated SQL injection (a technique used to attack data-driven applications) and other
forms of mass website compromises increased distribution capabilities in 2007. Since then,
the number of malware attacks has grown exponentially, doubling or more each year.

At the start of the new millennium, internet and email worms made headlines across the
globe:

● ILOVEYOU attacked tens of millions of Windows-based computers in
2000.
● The Anna Kournikova email worm, launched in 2001, caused problems
in email servers around the world.
● Sircam, which was active in 2001, spread itself through email on
Windows-based systems.
● The CodeRed worm spread in 2001 by taking advantage of a buffer
overflow vulnerability.
● Nimda, which also appeared in 2001, affected computers running various
versions of Windows.

A Timeline of Early 2000's Malware

Throughout 2002 and 2003, internet users were plagued by out-of-control popups and other
Javascript bombs. Around this time, socially engineered worms and spam proxies began to
appear. Phishing and other credit card scams also took off during this period, along with
notable internet worms like Blaster and Slammer. Slammer, released in 2003, caused a
denial of service (DoS) on some internet hosts and slowed internet traffic. Below are some
other notable malware incidents from this time:
2004: An email worm war broke out between the authors of MyDoom, Bagle, and
Netsky. Ironically, this feud led to improved email scanning and higher adoption
rates of email filtering, which eventually nearly eliminated mass-spreading email
worms.
2005: The discovery and disclosure of the now-infamous Sony rootkit led to the
inclusion of rootkits in most modern-day malware.
2006: Various financial scams, Nigerian 419 scams, phishing, and lottery scams
were prevalent at this time. Though not directly malware-related, such scams
continued the profit-motivated criminal activity made possible by the internet.
2007: Website compromises escalated due in large part to the discovery and
disclosure of MPack, a crimeware kit used to deliver exploits online. Compromises
included the Miami Dolphins stadium site, Tom’s Hardware, The Sun, MySpace,
Bebo, Photobucket, and The India Times website. By the end of 2007, SQL
injection attacks had begun to ramp up; victims included the popular Cute Overload
and IKEA websites.
2008: By now, attackers were employing stolen FTP credentials and leveraging
weak configurations to inject IFrames on tens of thousands of smaller websites. In
June 2008, the Asprox botnet facilitated automated SQL injection attacks, claiming
Walmart as one of its victims.
2009: In early 2009, Gumblar emerged, infecting systems running older versions of
Windows. Its methodology was quickly adopted by other attackers, leading to
botnets that are harder to detect.

Malware Since 2010

In the last decade or so, attacks have taken advantage of new technologies, including
cryptocurrency and the Internet of Things (IoT).

2010: Industrial computer systems were targets of the 2010 Stuxnet worm. This
malicious tool targeted machinery on factory assembly lines. It was so damaging
that it's thought to have caused the destruction of several hundred of Iran's
uranium-enrichment centrifuges.
2011: A Microsoft-specific Trojan horse called ZeroAccess downloaded malware
on computers via botnets. It was mostly hidden from the operating system using
rootkits and was propagated by Bitcoin mining tools.
2012: As part of a worrying trend, Shamoon targeted computers in the energy
sector. Cited by cybersecurity lab CrySyS as "the most complex malware ever
found," Flame was used for cyber espionage in the Middle East.
2013: An early instance of ransomware, CryptoLocker was a Trojan horse that
locked the files on a user's computer, prompting them to pay a ransom for the
decryption key. Gameover ZeuS used keystroke logging to steal users' login details
from financial transaction sites.
2014: The Trojan horse known as Regin was thought to have been developed in the
U.S. and U.K. for espionage and mass surveillance purposes.
2016: Locky infected several million computers in Europe, including over 5,000
computers per hour just in Germany. Mirai launched highly disruptive distributed
DoS (DDoS) attacks on several prominent websites and infected the IoT.
2017: The global WannaCry ransomware attack was halted when a cybersecurity
researcher found a "kill switch" within the ransomware code. Petya, another
instance of ransomware, was also released, using a similar exploit to the one used
by WannaCry.
2018: As cryptocurrency started to gain traction, Thanatos became the first
ransomware to accept payments in Bitcoin Cash.

1.3. TYPES OF MALWARE


While there are many different variations of malware, you are most likely to encounter the
following malware types:

❖ Ransomware:
Ransomware is software that uses encryption to disable a target’s access to its data
until a ransom is paid. The victim organization is rendered partially or totally
unable to operate until it pays, but there is no guarantee that payment will result in
the necessary decryption key or that the decryption key provided will function
properly.
❖ Fileless Malware:
Fileless malware doesn’t install anything initially; instead, it makes changes to files
that are native to the operating system, such as PowerShell or WMI. Because the
operating system recognizes the edited files as legitimate, a fileless attack is not
caught by antivirus software — and because these attacks are stealthy, they are up
to ten times more successful than traditional malware attacks.
❖ Spyware:
Spyware collects information about users’ activities without their knowledge or
consent. This can include passwords, PINs, payment information and unstructured
messages. The use of spyware is not limited to the desktop browser: it can also
operate in a critical app or on a mobile phone. Even if the data stolen is not critical,
the effects of spyware often ripple throughout the organization as performance is
degraded and productivity eroded.
❖ Adware:
Adware tracks a user’s surfing activity to determine which ads to serve them.
Although adware is similar to spyware, it does not install any software on a user’s
computer, nor does it capture keystrokes. The danger in adware is the erosion of a
user’s privacy — the data captured by adware is collated with data captured, overtly
or covertly, about the user’s activity elsewhere on the internet and used to create a
profile of that person which includes who their friends are, what they’ve purchased,
where they’ve traveled, and more. That information can be shared or sold to
advertisers without the user’s consent.
❖ Trojan:
A Trojan disguises itself as desirable code or software. Once downloaded by
unsuspecting users, the Trojan can take control of victims’ systems for malicious
purposes. Trojans may hide in games, apps, or even software patches, or they may
be embedded in attachments included in phishing emails.
❖ Worms:
Worms target vulnerabilities in operating systems to install themselves into
networks. They may gain access in several ways: through backdoors built into
software, through unintentional software vulnerabilities, or through flash drives.
Once in place, worms can be used by malicious actors to launch DDoS attacks,
steal sensitive data, or conduct ransomware attacks.
❖ Virus:
A virus is a piece of code that inserts itself into an application and executes when
the app is run. Once inside a network, a virus may be used to steal sensitive data,
launch DDoS attacks or conduct ransomware attacks.
❖ Rootkits:
A rootkit is software that gives malicious actors remote control of a victim’s
computer with full administrative privileges. Rootkits can be injected into
applications, kernels, hypervisors, or firmware. They spread through phishing,
malicious attachments, malicious downloads, and compromised shared drives.
Rootkits can also be used to conceal other malware, such as keyloggers.
❖ Keyloggers:
A keylogger is a type of spyware that monitors user activity. Keyloggers have
legitimate uses; businesses can use them to monitor employee activity and families
may use them to keep track of children’s online behaviors. However, when installed
for malicious purposes, keyloggers can be used to steal password data, banking
information and other sensitive information. Keyloggers can be inserted into a
system through phishing, social engineering or malicious downloads.
❖ Bot/Botnets:
A bot is a software application that performs automated tasks on command. They’re
used for legitimate purposes, such as indexing search engines, but when used for
malicious purposes, they take the form of self-propagating malware that can
connect back to a central server. Usually, bots are used in large numbers to create a
botnet, which is a network of bots used to launch broad remotely-controlled floods
of attacks, such as DDoS attacks. Botnets can become quite expansive. For
example, the Mirai IoT botnet ranged from 800,000 to 2.5M computers.

❖ Mobile Malware:
Attacks targeting mobile devices have risen 50 percent since last year. Mobile
malware threats are as various as those targeting desktops and include Trojans,
ransomware, advertising click fraud and more. They are distributed through
phishing and malicious downloads and are a particular problem for jailbroken
phones, which tend to lack the default protections that were part of those devices’
original operating systems.
❖ Wiper Malware:
A wiper is a type of malware with a single purpose: to erase user data and ensure it
can’t be recovered. Wipers are used to take down computer networks in public or
private companies across various sectors. Threat actors also use wipers to cover up
traces left after an intrusion, weakening their victim’s ability to respond.

1.4. MALWARE DETECTION TECHNIQUES

Hackers present malware in a way that is aimed at persuading people to install it. Because
it appears legitimate, users often do not know what the programme really is. We usually
install it thinking that it is secure when, on the contrary, it is a major threat. That is
how malware gets into a system. Once on the system, it disperses and hides in numerous
files, making it very difficult to identify. In order to access and record personal or
useful information, it may connect directly to the operating system and start encrypting
it. Malware detection is defined as the process of searching for malware files and
directories. There are several tools and methods available that make malware detection
efficient and reliable. Some of the general strategies for malware detection are:
● Signature-based
● Heuristic Analysis
● Anti-malware Software
● Sandbox

Several classifiers have been applied, such as linear classifiers (logistic regression,
naive Bayes classifier), support vector machines, neural networks, random forests, etc.
Malware can be identified through both static and dynamic analysis:

● Static analysis: examining the code without executing it
● Dynamic analysis: behavioral analysis

Current Antivirus Software

Antivirus software is used to prevent, detect, and remove malware, including but not
limited to computer viruses, computer worms, Trojan horses, spyware and adware. Antivirus
engines typically employ a variety of strategies. Signature-based detection involves
searching for known patterns of data within executable code. However, it is possible for a
computer to be infected with a new virus for which no signature exists. To counter such
"zero-day" threats, heuristics can be used to identify new viruses or variants of existing
viruses by looking for known malicious code. Some antivirus products can also make
predictions by executing files in a sandbox and analyzing the results.

Often, antivirus software can impair a computer's performance. Any incorrect decision may
lead to a security breach, since it runs at the highly trusted kernel level of the operating
system. If the antivirus software employs heuristic detection, success depends on achieving
the right balance between false positives and false negatives. Today, malware may no
longer be executable files. Powerful macros in Microsoft Word could also present a
security risk. Traditionally, antivirus software heavily relied upon signatures to identify
malware. However, because of newer kinds of malware, signature-based approaches are no
longer effective.
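
The balance between false positives and false negatives mentioned above can be made concrete with a small sketch. The suspicion scores, labels and thresholds below are hypothetical values chosen for illustration and do not come from this project's dataset.

```python
def error_rates(scores, labels, threshold):
    """Return (false_positive_rate, false_negative_rate) at a given threshold.

    scores: heuristic suspicion scores in [0, 1] (hypothetical values)
    labels: 1 for malware, 0 for benign
    """
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    benign = labels.count(0)
    malicious = labels.count(1)
    return fp / benign, fn / malicious

scores = [0.10, 0.30, 0.45, 0.60, 0.55, 0.70, 0.85, 0.95]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# A low threshold flags more benign files; a high threshold misses more malware.
for threshold in (0.4, 0.58, 0.8):
    fpr, fnr = error_rates(scores, labels, threshold)
    print(f"threshold={threshold:.2f}  FPR={fpr:.2f}  FNR={fnr:.2f}")
```

With these values, raising the threshold from 0.4 to 0.8 lowers the false positive rate from 0.50 to 0.00 while raising the false negative rate from 0.00 to 0.50.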

Although standard antivirus can effectively contain virus outbreaks, for large enterprises,
any breach could be potentially fatal. Virus makers are employing "oligomorphic",
"polymorphic" and, "metamorphic" viruses, which encrypt parts of themselves or modify
themselves as a method of disguise, so as to not match virus signatures in the dictionary.

Studies in 2007 showed that the effectiveness of antivirus software had decreased
drastically, particularly against unknown or zero day attacks. Detection rates have dropped
from 40-50% in 2006 to 20-30% in 2007. The problem is magnified by the changing intent
of virus makers. Independent testing on all the major virus scanners consistently shows that
none provide 100% virus detection.

1.5. NEED FOR MACHINE LEARNING IN MALWARE DETECTION

Machine learning has brought drastic change to many industries, including
cybersecurity, over the last decade. Among cybersecurity experts, there is a general belief
that AI-powered anti-malware tools can help detect modern malware attacks and improve
scanning engines. Evidence for this belief is the number of studies on machine-learning-based
malware detection strategies reported in the last few years. According to Google Scholar,
the number of research papers released in 2018 is 7720, a 95 percent rise over 2015 and a
476 percent increase over 2010. This rise in the number of studies is the product of
several factors, including but not limited to the increase in publicly available labeled
malware feeds, the increase in computing capacity together with its falling price, and
the evolution of the field of machine learning, which has achieved ground-breaking
success in a wide range of tasks such as computer vision and speech recognition.
Depending on the type of analysis, conventional machine learning methods can be
divided into two main categories: static and dynamic approaches. The primary
difference between them is that static methods extract features from static analysis of
the malware, while dynamic methods extract features from dynamic analysis. A third
category, known as hybrid approaches, may also be considered. Hybrid methods incorporate
elements of both static and dynamic analysis. In addition, neural networks have excelled at
learning features from raw inputs in diverse fields, and recent developments in machine
learning for cybersecurity mirror this performance in the malware domain.
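
The distinction between static, dynamic and hybrid approaches can be sketched as three feature extractors. The features chosen here (file size, printable-byte ratio, counts of selected Windows API calls) and the function names are illustrative assumptions, not the feature set used in this project.

```python
def static_features(file_bytes: bytes) -> list:
    """Features computed without running the sample (illustrative choice)."""
    size = len(file_bytes)
    printable = sum(1 for b in file_bytes if 32 <= b < 127)
    return [float(size), printable / size if size else 0.0]

def dynamic_features(api_calls: list) -> list:
    """Features computed from an observed behaviour trace (illustrative choice)."""
    suspicious = {"VirtualAllocEx", "WriteProcessMemory", "CreateRemoteThread"}
    hits = sum(1 for call in api_calls if call in suspicious)
    return [float(len(api_calls)), float(hits)]

def hybrid_features(file_bytes: bytes, api_calls: list) -> list:
    """Hybrid approach: concatenate the static and dynamic feature vectors."""
    return static_features(file_bytes) + dynamic_features(api_calls)

# Hypothetical sample bytes and a hypothetical API-call trace.
sample_bytes = b"MZ\x00\x00hello"
trace = ["OpenProcess", "VirtualAllocEx", "WriteProcessMemory", "CreateRemoteThread"]
print(hybrid_features(sample_bytes, trace))
```

The hybrid vector is simply the static features followed by the dynamic ones, so a classifier trained on it can draw on both kinds of evidence.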
Chapter 2: Machine Learning

In recent years, Machine Learning approaches have come into high demand in many
industries and businesses for obtaining meaningful data insights and automating analysis.
Machine Learning (ML) is an emerging domain closely associated with the research field of
Artificial Intelligence (AI). Machine Learning, in conjunction with AI, is described as a
field of study that gives computers the ability to learn without being explicitly
programmed. According to Tom M. Mitchell's definition of ML, "A computer program is said
to learn from experience E with respect to some class of tasks T and performance measure
P, if its performance at tasks in T, as measured by P, improves with experience E."

Machine Learning can thus be defined as the ability of a computer program, based on
computational algorithms, to automatically learn the underlying patterns in given
information and data and provide useful insights. Knowledge discovery processes, which
include Data Mining, are also crucial in Machine Learning programs, as the knowledge
extracted from large-scale data sources forms the basis of data insights and of further
exploration that supports key decisions. Numerous applications are adopting Machine
Learning techniques, such as stock prediction, credit scoring, smart medical checks,
malware detection, and many more, because they deliver useful predictive analysis. The
following diagram shows a general scheme for Machine Learning.

Fig. 1 Machine Learning Overview


2.1. Machine learning techniques

Machine Learning provides various approaches, such as classification, regression, pattern
recognition and many more, that are built on mathematical and statistical
algorithms which derive knowledge from the data of a sample dataset. In general, there
are several types of Machine Learning models, ranging from Supervised Learning,
Unsupervised Learning and Semi-supervised Learning to Reinforcement Learning and Deep
Learning. These models are used for different purposes, and each type of learning model
performs either descriptive or predictive analysis depending on the chosen algorithm,
the type of analysis required to solve the problem, and the type of dataset used for the
analysis. The most commonly used Machine Learning models are the Supervised learning
model and the Unsupervised learning model. The following sections discuss these two
main categories of machine learning models and their uses.

2.1.1. Supervised machine learning

A supervised model is an algorithmic learning model that infers the underlying
patterns and relationships between labeled data and its target values, and then uses
them to predict the outcome for unlabelled data. Consider a malware detection example
based on the Machine Learning classification approach, where a training dataset of
files labeled as benign or malicious is used for the learning and training tasks of the
model. The labels identify each item of the dataset. The model is trained to capture
the generalized patterns and feature knowledge in the given dataset. The model then
applies its classification function to unseen test data, which is unlabelled, and
predicts an outcome for each item according to the supplied labels and the training
dataset.
Fig. 2 Supervised learning overview
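
The train-then-predict flow described above can be illustrated with a minimal sketch. It is not the model used in this project: it assumes a simple nearest-centroid classifier, and the two numeric features and their values are hypothetical.

```python
from math import dist

# Hypothetical labeled training data: each sample is a pair of made-up
# numeric features together with its label.
train = [
    ((1.0, 0.0), "benign"),
    ((2.0, 1.0), "benign"),
    ((8.0, 6.0), "malicious"),
    ((9.0, 7.0), "malicious"),
]

def fit_centroids(samples):
    """Training step: average the feature vectors of each label."""
    grouped = {}
    for features, label in samples:
        grouped.setdefault(label, []).append(features)
    return {
        label: tuple(sum(column) / len(column) for column in zip(*rows))
        for label, rows in grouped.items()
    }

def predict(centroids, features):
    """Prediction step: return the label whose centroid is nearest."""
    return min(centroids, key=lambda label: dist(centroids[label], features))

centroids = fit_centroids(train)
print(predict(centroids, (1.5, 0.5)))   # unseen, unlabelled sample
print(predict(centroids, (8.5, 6.0)))   # unseen, unlabelled sample
```

The labels are only used during training; at prediction time the model receives the features alone.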

2.1.2. Unsupervised machine learning

The Unsupervised learning model requires only an unlabelled input
dataset. This learning model uses clustering and grouping algorithms that can
automatically find regularity in the unlabelled data without human interference. It
filters and groups the unlabelled data into small clusters of similar features and
gives each cluster a suitable label based on the similarity patterns recognized in the
dataset. The Unsupervised learning model is therefore considered useful when labeling
large datasets. Some uses of the Unsupervised model are found in the areas of data
compression, outlier detection, classification, and human learning.

Fig. 3 Unsupervised learning overview
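
As an illustration of grouping unlabelled data, the following is a minimal k-means sketch. K-means is only one possible clustering algorithm and is assumed here for illustration; the points and the choice of starting centroids are hypothetical.

```python
from math import dist

def k_means(points, centroids, iterations=10):
    """Group unlabelled points around the given starting centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        # Assignment step: each point joins its nearest centroid.
        for point in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: dist(point, centroids[i]))
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(axis) / len(axis) for axis in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical unlabelled data with two visible groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]

# Simplification: start from the first and last point instead of random picks.
centroids, clusters = k_means(points, [points[0], points[-1]])
for index, cluster in enumerate(clusters):
    print("cluster", index, cluster)
```

No labels are supplied at any point; the two groups emerge from the distances between the points alone.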


Chapter 3: Memory Analysis

Memory analysis can overcome limitations of both static and dynamic analysis methods.
It is not bound by the malware signatures produced by static analysis, and
memory-based features can also overcome some dynamic analysis limitations, such as
malware hiding its behavior during analysis. Although memory analysis is basically a
static analysis, new generation malware is known not to exhibit some behaviors during
static analysis. Since such hidden behaviors can be detected with memory analysis, it
provides significant gains in malware detection compared to traditional static
analysis. Malware leaves traces in memory. With memory analysis, information about the
behavioral characteristics of malware can be obtained from sources such as terminated
processes, DLL records, registries, active network connections, and running services.
Memory analysis work consists of two stages: memory acquisition and memory analysis.
Memory acquisition is the stage of obtaining a full image of the machine memory. Memory
analysis is the phase of examining and analyzing the activity of malware, usually with
a memory forensics tool. In this way, it becomes possible to detect hidden malware.

3.1. Memory and its architecture

Each computational device is built around two principal components that carry out the
computational processing and basic instructions of a system: the physical memory and
the processor. Both components hold forensic value. The processor, or central
processing unit (CPU), executes programs and runs the processes of the whole computer
system, whereas the volatile physical memory temporarily stores data related to the
processor and to the programs executing on the active system. In modern computer
system architecture, the CPU accesses the main memory (RAM) indirectly, through the
Memory Controller Hub, to request the instructions and data it needs to execute.

Memory is commonly known as random-access memory (RAM) because data stored at any
location can be accessed in any order. Memory also holds the most volatile data in a
computer system, as its contents are lost when the system is powered off. In addition,
memory can collect, access and transfer data between the input/output (I/O) controllers
and the processor via the connected Northbridge and Southbridge units. This means that
information about externally connected devices and storage media resides in main
memory and can be used for forensic investigation and analysis.

3.2. Memory acquisition/Memory dump

The need for memory acquisition has increased as more of the information involved in
cybercrimes and network attacks is held in a computer’s memory. Memory acquisition is
given high priority for any live computer identified as compromised, because memory
contains extremely volatile data that ranks at the top of the order of data
volatility. Memory images and snapshots can only be captured from a running system,
because once the system is turned off completely or rebooted the contents of memory
fade away.

Volatile and non-volatile memory are the two types of memory available in the system.
Volatile memory stores data temporarily, while non-volatile memory stores data
permanently. Memory holds the current state of processes, registers, process stacks,
deleted
files, and encrypted data. Volatile memory or Random Access Memory (RAM) only
maintains its data while the computer or device is powered on. Non-volatile Memory, or
NVRAM, is for longer term storage. When a computer is powered off, evidence in RAM is
lost and normally cannot be recovered, however, the data in NVRAM often remains after
the system is powered off and can be analyzed after the fact.

Acquisition is done with two different approaches: 1) live system/device and 2) dead
system/device. A live system requires a different technique to retrieve data than a
dead system. A Faraday bag is used to collect devices before the forensic examination
proceeds. Acquisition is the technique by which evidence is collected from the seized
device through which a crime was committed. A write blocker is attached to the seized
device while the data is collected, so that the evidence is not changed, and a hash
value is calculated. The RAM and registry are then dumped with a RAM dump memory
forensic tool, which collects all the data from the RAM and generates the reg.mem file.
This file is analyzed in EnCase tools and a report is generated. If the retrieved data
matches the original, the accused can be convicted on the basis of this evidence.
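
The hash comparison mentioned above can be sketched as follows. This is an illustration only: SHA-256 is an assumed choice of hash function, and an in-memory byte stream stands in for the acquired image file, which in practice would be opened in binary mode.

```python
import hashlib
import io

def sha256_of_stream(stream, chunk_size=1024 * 1024):
    """Hash a binary stream in chunks so large images need not fit in RAM."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()

# Hypothetical stand-in for an acquired memory image.
image = io.BytesIO(b"example memory image bytes")

hash_at_acquisition = sha256_of_stream(image)   # recorded when the dump is taken
image.seek(0)
hash_before_analysis = sha256_of_stream(image)  # recomputed before analysis

# Matching values indicate the evidence was not altered in between.
print(hash_at_acquisition == hash_before_analysis)
```

Any change to the image, however small, produces a different hash value, which is why the value is recorded at acquisition time.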

3.3. Volatility Framework

The most widely used memory forensics platform for memory acquisition and analysis is
the Volatility Framework. This tool is used to analyze captured and imaged volatile
memory for valuable information about the runtime state of the system, and it provides
the ability to link artifacts from traditional forensic analysis. The tool also
provides a range of plugins to analyze the memory artifacts of the six main areas
mentioned earlier. In addition, the framework is Python based and can also be used as
a Python library. At the initial analysis of a memory image, it is important to
identify the processes running on the system. The following section briefly describes
the system process as an artifact, along with the process-related artifact logs that
are used for the analysis with the Volatility Framework plugins.
Chapter 4: Algorithms Applied

● DECISION TREE
● RANDOM FOREST
● SVM
● XGBOOST

4.1 DECISION TREE:

A decision tree is a supervised learning technique that can be used for both
classification and regression problems, but it is mostly used for classification. It is
a tree-structured classifier, with internal nodes representing features of the data
set, branches representing decision rules, and each leaf node representing a result.
The decision tree has two types of nodes: decision nodes and leaf nodes. Decision nodes
are used to make decisions and have multiple branches, while leaf nodes are the results
of those decisions and contain no further branches. Each decision or test is made on
the features of the given dataset. A decision tree is a graphical representation of all
possible solutions to a problem or decision under certain conditions. It is called a
decision tree because, like a tree, it starts from a root node and expands along
branches to build a tree-like structure. To build the tree, we use the CART algorithm,
which stands for Classification and Regression Tree algorithm. A decision tree simply
asks a question and, based on the answer (yes/no), splits further into subtrees.

Fig. 4 Decision Tree structure


4.1.1 Decision Tree Terminologies

● Root node: The root node is the starting point of the decision tree. It
represents the entire data set, which is further divided into two or more
homogeneous data sets.
● Leaf node: A leaf node is the last output node and the tree cannot be further
split after a leaf node is obtained.
● Splitting: Splitting is the process of splitting a decision node/root node into
subnodes according to specified criteria.
● Branch/subtree: A tree formed by splitting a tree.
● Pruning: Pruning removes unwanted branches from trees.
● Parent/child nodes: A node that is split into subnodes is called the parent
node of those subnodes, and the subnodes are called its child nodes.

4.1.2 Working

In a decision tree, the algorithm for predicting the class of a given record starts at
the root node of the tree. The algorithm compares the value of the root attribute with
the corresponding attribute of the record and, based on that comparison, follows a
branch to the next node.

At the next node, the algorithm again compares the attribute value with the subnodes
and moves on. Processing continues until a leaf node of the tree is reached. The
complete process can be better understood with the following algorithm:

Step 1: Start the tree with the root node, S, containing the complete dataset.
Step 2: Find the best attribute in the dataset using the Attribute Selection Measure
(ASM).
Step 3: Divide S into subsets containing the possible values of the best attribute.
Step 4: Create a decision tree node containing the best attribute.
Step 5: Recursively build a new decision tree using the subsets of the dataset
created in Step 3. Continue this process until you reach a stage where the nodes
cannot be classified any further; the final node is called a leaf node.
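
The steps above can be sketched as a short recursive procedure. This is a minimal illustration rather than the CART implementation used in the project: it assumes categorical attributes, uses information gain as the attribute selection measure, and the job-offer rows are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    subsets = {}
    for row in rows:
        subsets.setdefault(row[attribute], []).append(row[target])
    remainder = sum(len(s) / len(rows) * entropy(s) for s in subsets.values())
    return entropy([row[target] for row in rows]) - remainder

def build_tree(rows, attributes, target):
    labels = [row[target] for row in rows]
    # Leaf node: the subset is pure or no attributes remain to test.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Step 2: pick the best attribute with the attribute selection measure.
    best = max(attributes, key=lambda a: information_gain(rows, a, target))
    remaining = [a for a in attributes if a != best]
    # Steps 3-5: split on each value of the best attribute and recurse.
    node = {"attribute": best, "children": {}}
    for value in sorted({row[best] for row in rows}):
        subset = [row for row in rows if row[best] == value]
        node["children"][value] = build_tree(subset, remaining, target)
    return node

def classify(tree, row):
    # Walk from the root to a leaf, following the branch for each answer.
    while isinstance(tree, dict):
        tree = tree["children"][row[tree["attribute"]]]
    return tree

# Hypothetical training rows.
rows = [
    {"salary": "low", "distance": "near", "cab": "yes", "decision": "decline"},
    {"salary": "low", "distance": "far", "cab": "no", "decision": "decline"},
    {"salary": "high", "distance": "near", "cab": "no", "decision": "accept"},
    {"salary": "high", "distance": "near", "cab": "yes", "decision": "accept"},
    {"salary": "high", "distance": "far", "cab": "yes", "decision": "accept"},
    {"salary": "high", "distance": "far", "cab": "no", "decision": "decline"},
]
tree = build_tree(rows, ["salary", "distance", "cab"], "decision")
print(tree)
print(classify(tree, {"salary": "high", "distance": "far", "cab": "yes"}))
```

The sketch does not handle attribute values that were never seen during training; a fuller implementation would need a fallback for that case.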
4.1.3 Example
Suppose a candidate has a job offer and wants to decide whether to accept it. To solve
this problem, the decision tree starts from the root node (the Salary attribute,
chosen by ASM). The root node is split into the next decision node (distance from the
office) and a leaf node, based on the corresponding labels. The next decision node is
further divided into a decision node (cab facility) and a leaf node. Finally, the
decision node is split into two leaf nodes (accepted offer and declined offer).
Consider the following illustration.

Fig. 5 Decision Tree Working Model


Attribute Selection Measures :

A major problem when implementing decision trees is how to select the best attributes for
the root node and subnodes. Therefore, to solve such problems, there is a technique called
Attribute Selection Measure or ASM. This measure makes it easy to choose the best
attributes for the nodes of the tree. There are two popular techniques for ASM, which are:

● Information Gain
● Gini Index

1. Information Gain:

● Information Gain is a measure of the change in entropy after segmenting the
dataset based on attributes.
● Calculate how much information the features provide about the class.
● Split the nodes according to the value of the obtained information and build
a decision tree.
● The decision tree algorithm always tries to maximize the value of
information gain and the node/attribute with the highest information gain is
split first. It can be calculated using the following formula:

$\text{Information Gain} = \text{Entropy}(S) - [(\text{Weighted Average}) \times \text{Entropy}(\text{each attribute})]$

Entropy: Entropy is used to measure the impurity of a particular attribute. It
specifies the randomness of the data. Entropy can be calculated as follows:

$\text{Entropy}(S) = -P(\text{yes}) \log_2 P(\text{yes}) - P(\text{no}) \log_2 P(\text{no})$

Where,

S = the set of all samples

P(yes) = probability of yes

P(no) = probability of no
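
A small worked example may make the two formulas concrete. The counts are hypothetical: a parent node with 9 "yes" and 5 "no" samples is split by some attribute into two subsets with (6 yes, 2 no) and (3 yes, 3 no).

```python
from math import log2

def entropy(yes, no):
    """Entropy of a node from its yes/no counts."""
    total = yes + no
    result = 0.0
    for count in (yes, no):
        if count:  # a class with zero samples contributes nothing
            p = count / total
            result -= p * log2(p)
    return result

def information_gain(parent, subsets):
    """Parent entropy minus the size-weighted average entropy of the subsets."""
    total = sum(parent)
    weighted = sum((yes + no) / total * entropy(yes, no) for yes, no in subsets)
    return entropy(*parent) - weighted

parent = (9, 5)             # hypothetical counts before the split
subsets = [(6, 2), (3, 3)]  # hypothetical counts after the split

print(round(entropy(*parent), 3))
print(round(information_gain(parent, subsets), 3))
```

An attribute whose subsets are purer than the parent yields a larger gain, which is why the attribute with the highest gain is split first.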

2. Gini Index:

● The Gini Index is a measure of impurity or purity used in building decision
trees in the CART (Classification and Regression Tree) algorithm.
● Attributes with a low Gini index should take precedence over attributes with
a high Gini index.
● Only binary splits are created; the CART algorithm uses the Gini index to
create the binary splits.
● The Gini index can be calculated using the following formula:

$\text{Gini Index} = 1 - \sum_j P_j^2$
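
The formula can be applied to compare candidate binary splits, as in the following minimal sketch. The labels and the two candidate splits are hypothetical, and weighting each side by its size is the usual convention assumed here.

```python
from collections import Counter

def gini(labels):
    """Gini index of a node: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def weighted_gini(left, right):
    """Impurity of a binary split, weighting each side by its size."""
    total = len(left) + len(right)
    return len(left) / total * gini(left) + len(right) / total * gini(right)

# Two hypothetical ways of splitting the same seven samples.
split_a = (["malicious"] * 3, ["benign"] * 3 + ["malicious"])
split_b = (["malicious", "benign"] * 2, ["malicious", "malicious", "benign"])

print(gini(["benign"] * 4))              # a pure node
print(gini(["benign", "malicious"] * 2)) # an evenly mixed node
print(weighted_gini(*split_a) < weighted_gini(*split_b))
```

The split with the lower weighted Gini index produces purer child nodes and would be preferred.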

4.1.4 Advantages of the Decision Tree

● It is simple to understand as it follows the same process which a human follows
while making any decision in real-life.
● It can be very useful for solving decision-related problems.
● It helps to think about all the possible outcomes for a problem.
● There is less requirement of data cleaning compared to other algorithms.

4.1.5 Disadvantages of the Decision Tree

● A decision tree may contain many layers, which makes it complex.

● It may have an overfitting issue, which can be mitigated using the Random Forest
algorithm.

● With more class labels, the computational complexity of the decision tree may
increase.

4.2. RANDOM FOREST

Random Forest is a popular machine learning algorithm that belongs to the supervised
learning technique. It can be used for both Classification and Regression problems in ML.
It is based on the concept of ensemble learning, which is a process of combining multiple
classifiers to solve a complex problem and to improve the performance of the model.

As the name suggests, "Random Forest is a classifier that contains a number of decision
trees on various subsets of the given dataset and takes the average to improve the
predictive accuracy of that dataset." Instead of relying on one decision tree, the
random forest takes the prediction from each tree and predicts the final output based
on the majority vote of those predictions.

A greater number of trees in the forest generally leads to higher accuracy and reduces
the risk of overfitting.

The below diagram explains the working of the Random Forest algorithm:
Fig. 6 Working of Random Forest

Assumptions for Random Forest :

Since the random forest combines multiple trees to predict the class of the dataset, it is
possible that some decision trees may predict the correct output, while others may not. But
together, all the trees predict the correct output. Therefore, below are two assumptions for a
better Random forest classifier:

● There should be some actual values in the feature variable of the dataset so that the
classifier can predict accurate results rather than a guessed result.
● The predictions from each tree must have very low correlations.

4.2.1 Working
Random Forest works in two phases: the first is to create the random forest by
combining N decision trees, and the second is to make predictions with each tree
created in the first phase.

The Working process can be explained in the below steps and diagram:

Step-1: Select random K data points from the training set.

Step-2: Build the decision trees associated with the selected data points (Subsets).

Step-3: Choose the number N for decision trees that you want to build.

Step-4: Repeat Steps 1 and 2 until N trees have been built.


Step-5: For new data points, find the predictions of each decision tree, and assign
the new data points to the category that wins the majority votes.
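
The steps can be sketched as follows. This is a simplified illustration, not the classifier used in the project: each tree is reduced to a single-split stump, the choice of one random feature per tree is an assumption, samples that miss a class are redrawn for simplicity, and the data is hypothetical.

```python
import random
from collections import Counter

def train_stump(sample, feature):
    """Fit a one-split tree: threshold halfway between the two class means."""
    means = {}
    for label in ("benign", "malicious"):
        values = [x[feature] for x, y in sample if y == label]
        means[label] = sum(values) / len(values)
    threshold = (means["benign"] + means["malicious"]) / 2
    high_label = "malicious" if means["malicious"] > means["benign"] else "benign"
    low_label = "benign" if high_label == "malicious" else "malicious"
    return lambda x: high_label if x[feature] > threshold else low_label

def build_forest(data, n_trees, k, seed=0):
    rng = random.Random(seed)
    forest = []
    while len(forest) < n_trees:                       # Steps 3-4: build N trees
        sample = [rng.choice(data) for _ in range(k)]  # Step 1: K random points
        if len({label for _, label in sample}) < 2:
            continue  # simplification: redraw a sample that misses a class
        feature = rng.randrange(len(data[0][0]))       # assumed: random feature
        forest.append(train_stump(sample, feature))    # Step 2: tree per subset
    return forest

def predict(forest, x):
    votes = Counter(tree(x) for tree in forest)        # Step 5: majority vote
    return votes.most_common(1)[0][0]

# Hypothetical two-feature training set.
data = [
    ((1.0, 2.0), "benign"), ((2.0, 1.0), "benign"), ((1.5, 1.5), "benign"),
    ((8.0, 9.0), "malicious"), ((9.0, 8.0), "malicious"), ((8.5, 8.5), "malicious"),
]
forest = build_forest(data, n_trees=25, k=len(data))
print(predict(forest, (1.2, 1.8)))
print(predict(forest, (8.8, 8.1)))
```

An odd number of trees is used so that a two-class vote cannot end in a tie.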

The working of the algorithm can be better understood by the below example:

4.2.2 Example

Suppose there is a dataset that contains multiple fruit images. So, this dataset is given to
the Random forest classifier. The dataset is divided into subsets and given to each decision
tree. During the training phase, each decision tree produces a prediction result, and when a
new data point occurs, then based on the majority of results, the Random Forest classifier
predicts the final decision. Consider the below image:

Fig. 7 Working example of Random Forest

4.2.3 Applications of Random Forest

There are mainly four sectors where Random Forest is most often used:

1. Banking: Banking sector mostly uses this algorithm for the identification of
loan risk.
2. Medicine: With the help of this algorithm, disease trends and risks of the
disease can be identified.

3. Land Use: We can identify the areas of similar land use by this algorithm.

4. Marketing: Marketing trends can be identified using this algorithm.

4.2.4 Advantages of Random Forest

● Random Forest is capable of performing both Classification and Regression
tasks.
● It is capable of handling large datasets with high dimensionality.
● It enhances the accuracy of the model and reduces the overfitting issue.
● Its trees can be trained independently, which keeps training time
comparatively low.
● It predicts output with high accuracy and runs efficiently even on large
datasets.
● It can also maintain accuracy when a large proportion of data is missing.

4.2.5 Disadvantages of Random Forest

● Although random forest can be used for both classification and regression tasks, it
is less well suited to regression tasks.
● A trained forest may require significant memory for storage, due to the need for
retaining the information from several hundred individual trees.
● A forest is less interpretable than a single decision tree. Single trees may be
visualized as a sequence of decisions.
● Although random forests can be an improvement on single decision trees, more
sophisticated techniques are available. Prediction accuracy on complex problems is
usually inferior to gradient-boosted trees.

4.3. SVM ALGORITHM


Support Vector Machine or SVM is one of the most popular Supervised Learning
algorithms, which is used for Classification as well as Regression problems. However,
primarily, it is used for Classification problems in Machine Learning.
The goal of the SVM algorithm is to create the best line or decision boundary that can
segregate n-dimensional space into classes so that we can easily put the new data point in
the correct category in the future. This best decision boundary is called a hyperplane.

SVM chooses the extreme points/vectors that help in creating the hyperplane. These
extreme cases are called support vectors, and hence the algorithm is termed as Support
Vector Machine. Consider the below diagram in which there are two different categories
that are classified using a decision boundary or hyperplane:

Fig. 8 Example of SVM graph

4.3.1 Example

SVM can be understood with a simple example. Suppose we see a strange cat that also
has some features of dogs. If we want a model that can accurately identify whether it
is a cat or a dog, such a model can be created by using the SVM algorithm. We first
train our model with lots of images of cats and dogs so that it can learn the different
features of cats and dogs, and then we test it with this strange creature. The SVM
creates a decision boundary between the two classes (cat and dog) and chooses the
extreme cases (support vectors) of each class. On the basis of the support vectors, it
will classify the creature as a cat. Consider the below diagram:
Fig. 9 Working of SVM

4.3.2 Types of SVM

SVM can be of two types:

● Linear SVM: Linear SVM is used for linearly separable data. If a dataset can be
classified into two classes by using a single straight line, then such data is
termed linearly separable data, and the classifier used is called a Linear SVM
classifier.

● Non-linear SVM: Non-linear SVM is used for non-linearly separable data. If a
dataset cannot be classified by using a straight line, then such data is termed
non-linear data, and the classifier used is called a Non-linear SVM classifier.

Hyperplane and Support Vectors in the SVM algorithm:

Hyperplane: There can be multiple lines/decision boundaries to segregate the classes in
n-dimensional space, but we need to find the best decision boundary that helps to
classify the data points. This best boundary is known as the hyperplane of SVM.

The dimensions of the hyperplane depend on the number of features present in the
dataset: if there are 2 features (as shown in the image), the hyperplane will be a
straight line, and if there are 3 features, the hyperplane will be a 2-dimensional
plane.
We always create the hyperplane that has the maximum margin, which means the maximum
distance between the hyperplane and the nearest data points of each class.

Support Vectors: The data points or vectors that are closest to the hyperplane and
which affect the position of the hyperplane are termed Support Vectors. Since these
vectors support the hyperplane, they are called support vectors.

4.3.3 Working
Linear SVM :
The working of the SVM algorithm can be understood by using an example. Suppose we
have a dataset that has two tags (green and blue) and two features, x1 and x2. We want
a classifier that can classify each pair (x1, x2) of coordinates as either green or
blue. Consider the below image:

Fig. 10.1 Classification of dataset in linear SVM

Since this is a 2-d space, we can easily separate these two classes by just using a
straight line. But there can be multiple lines that separate these classes. Consider
the below image:
Fig. 10.2 Separation of dataset using multiple lines

Hence, the SVM algorithm helps to find the best line or decision boundary; this best
boundary is called a hyperplane. The SVM algorithm finds the points of each class that
lie closest to the boundary. These points are called support vectors. The distance
between the support vectors and the hyperplane is called the margin, and the goal of
SVM is to maximize this margin. The hyperplane with maximum margin is called the
optimal hyperplane.

Fig. 10.3 Illustration of Optimal hyperplane
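
The idea of preferring the boundary with the larger margin can be illustrated with a minimal sketch. It does not train an SVM: it only compares two hand-picked candidate lines of the form w1*x1 + w2*x2 + b = 0, and the points and line coefficients are hypothetical.

```python
from math import hypot, isclose

def distance_to_line(point, w, b):
    """Perpendicular distance from a 2-D point to the line w.x + b = 0."""
    return abs(w[0] * point[0] + w[1] * point[1] + b) / hypot(*w)

def margin(points, w, b):
    """Distance from the line to the closest data point."""
    return min(distance_to_line(p, w, b) for p in points)

def support_vectors(points, w, b):
    """Points that lie exactly at the margin distance from the line."""
    m = margin(points, w, b)
    return [p for p in points if isclose(distance_to_line(p, w, b), m)]

# Hypothetical linearly separable data.
green = [(1, 1), (2, 1), (1, 2)]
blue = [(4, 4), (5, 4), (4, 5)]
points = green + blue

line_a = ((1, 1), -5.5)  # x1 + x2 = 5.5
line_b = ((1, 0), -3.0)  # x1 = 3

print(margin(points, *line_a), margin(points, *line_b))
print(support_vectors(points, *line_a))
```

Both lines separate the two classes, but the first leaves more room on either side, so it is the better candidate of the two.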


Non-Linear SVM
If data is linearly arranged, then we can separate it by using a straight line, but for
non-linear data, we cannot draw a single straight line. Consider the below image:

Fig. 10.4 Classification of dataset in Non-linear SVM

So to separate these data points, we need to add one more dimension. For linear data, we
have used two dimensions x and y, so for non-linear data, we will add a third dimension z.
It can be calculated as:

$z = x^2 + y^2$

By adding the third dimension, the sample space will become as below image:
Fig. 10.5 Sample space of non linear SVM in 3D

So now, SVM will divide the datasets into classes in the following way. Consider the
below image:

Fig. 10.6 Division of dataset in Non linear SVM


Since we are in 3-d space, the boundary appears as a plane parallel to the x-y plane.
If we convert it back to 2-d space with z = 1, it becomes:

Fig. 10.7 Illustration of best hyperplane in non linear SVM

Hence we get a circle of radius 1 in the case of non-linear data.
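
The effect of the added dimension can be checked with a minimal sketch. The sample points and the class names "inner" and "outer" are hypothetical; the mapping and the threshold z = 1 follow the text.

```python
def lift(point):
    """Map a 2-D point (x, y) into 3-D by adding z = x^2 + y^2."""
    x, y = point
    return (x, y, x ** 2 + y ** 2)

def classify(point, threshold=1.0):
    """Separate the lifted points with the plane z = threshold."""
    return "inner" if lift(point)[2] < threshold else "outer"

# Hypothetical data: one class near the origin, the other on a ring around it.
inner = [(0.5, 0.0), (0.0, -0.5), (-0.3, 0.4)]
outer = [(2.0, 0.0), (0.0, -2.0), (-1.5, 1.5)]

for point in inner + outer:
    print(point, lift(point)[2], classify(point))
```

No straight line in the original (x, y) plane separates the two groups, but a single threshold on z does, and that threshold corresponds to a circle of radius 1 in the original space.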

4.4. XGBOOST
Since its introduction in 2014, XGBoost has been a popular choice in machine learning
hackathons and competitions. From predicting ad click-through rates to
classifying high energy physics events, XGBoost has proved its mettle in terms of
both performance and speed.

Tianqi Chen, one of the co-creators of XGBoost, reported in 2016 that the system
features and algorithmic optimizations in XGBoost make it about 10 times faster
than popular existing machine learning solutions.

4.4.1 Why XGBOOST ?


The strength of this algorithm lies in its scalability, which drives fast learning
through parallel and distributed computing and offers efficient memory usage. It is
no wonder then that CERN recognized it as the best approach to classify signals from
the Large Hadron Collider. This particular challenge posed by CERN required a solution
that would be scalable enough to process data generated at the rate of 3 petabytes per
year and to effectively distinguish an extremely rare signal from background noise in
a complex physical process. XGBoost emerged as the most useful, straightforward and
robust solution.

4.4.2 Ensemble learning

XGBoost is an ensemble learning method. Sometimes, it may not be sufficient to rely upon
the results of just one machine learning model. Ensemble learning offers a systematic
solution to combine the predictive power of multiple learners. The resultant is a single
model which gives the aggregated output from several models.

The models that form the ensemble, also known as base learners, could be either from the
same learning algorithm or different learning algorithms. Bagging and boosting are two
widely used ensemble learners. Though these two techniques can be used with several
statistical models, the most predominant usage has been with decision trees.

Let’s briefly discuss bagging before taking a more detailed look at the concept of boosting.

● Bagging : While decision trees are one of the most easily interpretable models,
they exhibit highly variable behavior. Consider a single training dataset that we
randomly split into two parts. Now, let’s use each part to train a decision tree in
order to obtain two models.

When we fit both these models, they would yield different results. Decision trees
are said to be associated with high variance due to this behavior. Bagging, or
bootstrap aggregation, helps to reduce the variance in any learner. Several decision
trees, generated in parallel, form the base learners of the bagging technique.
Data sampled with replacement is fed to these learners for training. The final
prediction is the averaged output from all the learners.
● Boosting : In boosting, the trees are built sequentially such that each subsequent
tree aims to reduce the errors of the previous tree. Each tree learns from its
predecessors and updates the residual errors. Hence, the tree that grows next in the
sequence will learn from an updated version of the residuals.
The base learners in boosting are weak learners in which the bias is high, and the
predictive power is just a tad better than random guessing. Each of these weak
learners contributes some vital information for prediction, enabling the boosting
technique to produce a strong learner by effectively combining these weak learners.
The final strong learner brings down both the bias and the variance.

In contrast to bagging techniques like Random Forest, in which trees are grown to
their maximum extent, boosting makes use of trees with fewer splits. Such small
trees, which are not very deep, are highly interpretable. Parameters like the number
of trees or iterations, the rate at which the gradient boosting learns, and the depth of
the tree, could be optimally selected through validation techniques like k-fold cross
validation. Having a large number of trees might lead to overfitting. So, it is
necessary to carefully choose the stopping criteria for boosting.

The boosting ensemble technique consists of three simple steps:


● An initial model F0 is defined to predict the target variable y. This model
will be associated with a residual (y – F0).
● A new model h1 is fit to the residuals from the previous step.
● Now, F0 and h1 are combined to give F1, the boosted version of F0. The
mean squared error from F1 will be lower than that from F0:

$F_1(x) = F_0(x) + h_1(x)$

To improve the performance of F1, we could model after the residuals of F1 and
create a new model F2:

$F_2(x) = F_1(x) + h_2(x)$

This can be done for ‘m’ iterations, until residuals have been minimized as much as
possible:

$F_m(x) = F_{m-1}(x) + h_m(x)$

Here, the additive learners do not disturb the functions created in the previous
steps. Instead, they impart information of their own to bring down the errors.
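
The additive procedure above can be sketched in a few lines. This is an illustration of plain residual fitting, not of XGBoost itself: it assumes a single numeric feature, uses the mean of y as F0, fits a one-split regression stump as each learner h, and the data is hypothetical.

```python
def mse(y, predictions):
    return sum((a - b) ** 2 for a, b in zip(y, predictions)) / len(y)

def fit_stump(x, residuals):
    """Weak learner: the single threshold that best fits the residuals."""
    best = None
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean

def boost(x, y, rounds):
    f0 = sum(y) / len(y)                  # initial model F0
    predictions = [f0] * len(y)
    learners, errors = [], [mse(y, predictions)]
    for _ in range(rounds):
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        h = fit_stump(x, residuals)       # new model fit to the residuals
        learners.append(h)
        # F_m = F_(m-1) + h_m; earlier learners are left unchanged.
        predictions = [pi + h(xi) for pi, xi in zip(predictions, x)]
        errors.append(mse(y, predictions))
    return f0, learners, errors

def predict(f0, learners, xi):
    return f0 + sum(h(xi) for h in learners)

# Hypothetical one-feature regression data.
x = [1, 2, 3, 4, 5, 6]
y = [1.0, 1.2, 3.1, 3.0, 5.2, 5.1]

f0, learners, errors = boost(x, y, rounds=5)
print([round(e, 4) for e in errors])
```

The printed list shows the mean squared error after F0 and after each added learner; practical implementations usually also scale each learner by a learning rate, which is omitted here.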
4.4.3 Advantages of XGBOOST
XGBoost is a popular implementation of gradient boosting. Let’s discuss some features of
XGBoost that make it so interesting.
● Regularization: XGBoost has an option to penalize complex models through both
L1 and L2 regularization. Regularization helps in preventing overfitting
● Handling sparse data: Missing values or data processing steps like one-hot
encoding make data sparse. XGBoost incorporates a sparsity-aware split finding
algorithm to handle different types of sparsity patterns in the data
● Weighted quantile sketch: Most existing tree based algorithms can find the split
points when the data points are of equal weights (using quantile sketch algorithm).
However, they are not equipped to handle weighted data. XGBoost has a
distributed weighted quantile sketch algorithm to effectively handle weighted data
● Block structure for parallel learning: For faster computing, XGBoost can make use
of multiple cores on the CPU. This is possible because of a block structure in its
system design. Data is sorted and stored in in-memory units called blocks. Unlike
other algorithms, this enables the data layout to be reused by subsequent iterations,
instead of computing it again. This feature also serves useful for steps like split
finding and column sub-sampling
● Cache awareness: In XGBoost, non-contiguous memory access is required to get
the gradient statistics by row index. Hence, XGBoost has been designed to make
optimal use of hardware. This is done by allocating internal buffers in each thread,
where the gradient statistics can be stored
● Out-of-core computing: This feature optimizes the available disk space and
maximizes its usage when handling huge datasets that do not fit into memory

4.4.4 Disadvantages of XGBOOST

● XGBoost does not perform so well on sparse and unstructured data.


● A common thing often forgotten is that Gradient Boosting is very sensitive to
outliers since every classifier is forced to fix the errors in the predecessor learners.
● The overall method is hard to scale. Because each estimator bases its
corrections on the previous predictors, the procedure is difficult to
streamline.
Chapter 5: Investigation and Analysis

This chapter details the prerequisites, the environment setup, and the analysis
techniques used to investigate and analyze a memory image for process-related artifacts
that are likely to be hidden in memory, with consideration of other memory artifacts.
It also describes the data collection approach for the Machine Learning models.

5.1 Prerequisite

Equipment and Setup:

1. ROG Strix G15 (15 inches, 2020)

2. Host Operating System (Version: Windows 11, 64-bit, Build 9600, 6.3.9600)

3. Hardware Processing CPU Intel i7-10870H @ 2.20GHz, RAM 16 GB

4. VMware Workstation 15 Player (Version: 15.5.2 Build-15785246)

   1. Installed latest version of Linux Mint MATE (19.3) operating system as a
      virtual machine (VM)

   2. VM specifications:
      ▪ RAM size = 2 GB
      ▪ Internal hard disk: 64 GB of free available space used

Memory image dataset: A research dataset was used for this experimental study of
process-related artifacts in memory images. The dataset consists of realistic memory
snapshots (4300 positive and 300 negative) acquired from uncompromised and
compromised Windows 11 virtual machines. The memory images (dumps) were acquired
using the built-in memory dump feature in Windows, stored as .dmp files, and
converted into comma-separated values (.csv). The memory images were compromised
using several malware samples that rely on obfuscation-based evasion techniques.
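
Because the two classes are far from equal in size, it helps to know what accuracy a trivial model would reach by always predicting the larger class. The short calculation below uses only the class counts stated above; it assumes, which the report does not state, that the evaluation data keeps the same class ratio.

```python
# Class counts as reported for the memory image dataset.
positives = 4300
negatives = 300

total = positives + negatives
# Accuracy of a trivial model that always predicts the larger class.
majority_share = max(positives, negatives) / total
imbalance_ratio = max(positives, negatives) / min(positives, negatives)

print(f"total samples: {total}")
print(f"majority-class baseline accuracy: {majority_share:.2%}")
print(f"imbalance ratio: {imbalance_ratio:.1f}:1")
```

Under that assumption, a model's accuracy is best read against this baseline rather than against 50%.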

5.2 Analysis Approach

Investigating a compromised memory image requires a deep understanding of different
memory forensics investigation and analysis techniques in order to uncover intrusion
sources with all associated artifacts. Given the limited timeframe of the project,
the investigation and analysis of a compromised memory is restricted in scope and
focuses mainly on the process-related artifacts that can be found in memory, in
order to identify the memory's process activities and behaviors as benign or
malicious.

5.2.1 Acquisition of a compromised memory image

After successfully installing a Windows machine in VMware, we used the built-in
memory dump feature of the Windows machine to extract the memory dump file. The
steps we followed for this process are explained below:

Step 1: Open the system properties from the About page.

Fig 11.1 System properties


Step 2: Open the Startup and Recovery menu and, under Write debugging information,
select Complete Memory Dump.

Fig 11.2 Startup and recovery menu

Step 3: Use the NotMyFault tool to deliberately trigger a system crash, which causes
Windows to write the memory dump.

Fig 11.3 Notmyfault tool menu


Step 4: Finally, the dump file is written to the desired location.

Fig 11.4 Result of data acquisition


Chapter 6: Implementation

This chapter details the implementation of the tools, including the main
implementation code for data collection and preprocessing, the machine learning
models, and the classifier, together with the programming tools used.

6.1 Overview

1. The Data Collection and Preprocessing graphical user interface (GUI) is mainly
responsible for importing, loading, and preprocessing the raw psxview logs and for
creating a validated, preprocessed, labeled dataset as a CSV file. The raw input
logs are collected from the user-specified input location labeled "data-input". The
output dataset is saved in a directory called "data-output" inside the Data
Collection and Preprocessing directory.

2. Different machine learning classification models were implemented as simple
Python scripts in order to train, test, and evaluate them and find the best
performing model for classifying the given dataset. The classification models
developed were Random Forest, Decision Trees, Neural Network, Naïve Bayes, and
Support Vector Machines. Each model writes a report as a text file that includes
the evaluation and performance metrics and the prediction outcomes of the model for
the given dataset.

3. The Classifier graphical user interface (GUI) is mainly responsible for
importing the processed dataset, training the model, testing it, and classifying
unseen data. This implementation also includes the best performing machine learning
classification model, which is used as the memory classifier.
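
The preprocessing step in item 1 can be sketched as follows. This is a minimal illustration and not the project's actual code: it assumes the logs follow the text layout of the Volatility 2 psxview plugin (offset, name, PID, then seven True/False columns), the function names are hypothetical, and the two sample rows and the label string are made up for the example.

```python
import csv
import io
import re

# Assumed psxview column order (Volatility 2 text output); adjust if the logs differ.
FLAG_COLUMNS = ["pslist", "psscan", "thrdproc", "pspcid", "csrss", "session", "deskthrd"]

# One data row: physical offset, process name, PID, then seven True/False flags.
ROW = re.compile(
    r"^(?P<offset>0x[0-9a-fA-F]+)\s+(?P<name>.+?)\s+(?P<pid>\d+)\s+"
    r"(?P<flags>(?:(?:True|False)\s*){7})"
)


def parse_psxview(text, label):
    """Turn raw psxview text into labeled rows with 0/1 feature columns."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match is None:
            continue  # header, separator, or blank line
        row = {"name": match["name"], "pid": int(match["pid"])}
        for column, value in zip(FLAG_COLUMNS, match["flags"].split()):
            row[column] = 1 if value == "True" else 0
        row["label"] = label
        rows.append(row)
    return rows


def rows_to_csv(rows):
    """Serialize the labeled rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["name", "pid", *FLAG_COLUMNS, "label"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up sample log, for illustration only.
SAMPLE_LOG = """\
Offset(P)  Name                    PID pslist psscan thrdproc pspcid csrss session deskthrd ExitTime
---------- -------------------- ------ ------ ------ -------- ------ ----- ------- -------- --------
0x06499b80 svchost.exe            1148 True   True   True     True   True  True    True
0x04a065d0 hidden.exe              976 False  True   True     True   True  True    True
"""

rows = parse_psxview(SAMPLE_LOG, label="malicious")
csv_text = rows_to_csv(rows)
print(csv_text)
```

A process that psscan finds but pslist does not, like the second sample row, is the kind of discrepancy the flag columns are meant to expose to the classifier.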

6.2 Source Code


Chapter 7: Prediction Analysis

The memory dumps were analyzed using several machine learning algorithms, whose
accuracy is given below:

Sl. No.   Algorithm           Accuracy (%)   Error (%)

1.        Decision Tree       99.098         0.902

2.        Random Forest       99.431         0.569

3.        AdaBoost            98.449         1.551

4.        LinearRegression    58.348         41.652

Table 1.0 Results of prediction

From the above table, we conclude that Random Forest is the most accurate of the
algorithms evaluated for malware analysis.

This chapter compares the machine learning algorithms used for malware analysis by
their prediction accuracy and error. The accuracy of the Random Forest algorithm
(99.431%) is higher than that of any other algorithm used, and its error (0.569%) is
correspondingly lower than that of any other algorithm.
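
The relationship between the two columns of Table 1.0 is simply error = 100 - accuracy (both in percent). The short sketch below recomputes the error column from the reported accuracies and ranks the algorithms; the variable names are illustrative.

```python
# Accuracy values (percent) copied from Table 1.0.
accuracy = {
    "Decision Tree": 99.098,
    "Random Forest": 99.431,
    "AdaBoost": 98.449,
    "LinearRegression": 58.348,
}

# Error is the complement of accuracy, rounded to the table's three decimals.
error = {name: round(100.0 - value, 3) for name, value in accuracy.items()}

# Rank the algorithms from most to least accurate.
ranking = sorted(accuracy, key=accuracy.get, reverse=True)
best = ranking[0]

for name in ranking:
    print(f"{name:<18} accuracy={accuracy[name]:.3f}  error={error[name]:.3f}")
print("best:", best)
```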

Chapter 8: Future Work

This chapter outlines follow-up work and open assumptions that further research on
the project might address. Some of the project's remaining sub-aims for
implementation, testing, and experimentation have been left for future work because
of the limitations and time constraints of the project.

In addition, this project requires further analysis and extraction of other artifact
data from memory images, which is a considerable and very time-consuming task. Some
of the sub-aims and functional requirements that could be developed as future work
are listed below:

• Testing and improving the results of the Random Forest classifier model by better
understanding and validating the feature attributes, for example by prioritizing
the pslist and psscan features over the other features when classifying.

• Implementing the remaining functional requirements of the classifier tool, which
could extend the initial outcomes of the project.

• Constructing an automated predictor that uses Volatility plugins to import memory
images directly and look up the vital artifacts that can be retrieved easily and
used to distinguish a memory image as benign or malicious.

• Improving the data collection and preprocessing tool to include options for
preprocessing multiple process-related artifacts and testing them against the
classifier.

• Extending future research to other investigation and analysis areas of memory
forensics, including DLLs, handles, and threads.

Chapter 9: Conclusion

The study found that Random Forest was able to classify the malware with 99.4%
accuracy. We have proposed a malware detection module based on advanced data mining
and machine learning. Because it is very processor-heavy, such a method may not be
suitable for home users, but it can be implemented at the enterprise gateway level
to act as a central antivirus engine that supplements the antivirus software on
end-user computers. It will not only detect known viruses easily, but also act as a
knowledge base for detecting newer forms of harmful files. Although a costly model
requires costly infrastructure, it can help protect invaluable enterprise data from
security threats and prevent immense financial damage.