An Intrusion Detection System

ABSTRACT
With the rapid growth of interconnected devices in the Internet of Things (IoT), network security
faces new challenges due to limited computational resources, memory constraints, and the unique protocols
used in these devices. This project introduces a lightweight Intrusion Detection System (IDS) optimized for
resource-constrained environments by leveraging Benford's Law and analyzing network flow size
differences. Our IDS systematically monitors network traffic and identifies deviations from normal flow
patterns, utilizing the first significant digit distribution predicted by Benford's Law. By applying linear
regression and error metrics on network traffic samples, we successfully detect anomalies indicative of cyber
attacks. Experimental evaluations using the NSL-KDD dataset demonstrate the effectiveness of this IDS
approach in distinguishing between normal and malicious TCP flows. The results show promise for scalable
deployment in IoT and similar resource-constrained networks, paving the way for improved, accessible
network security solutions.

TABLE OF CONTENTS
CHAPTER 1: INTRODUCTION
1.1 INTRODUCTION
1.2 LITERATURE REVIEW
CHAPTER 2: PROJECT MECHANISM
2.1 WORK FLOW OF THE SYSTEM
2.2 WORKING OF THE SYSTEM
CHAPTER 3: TECHNOLOGY STACK
3.1 PYTHON
3.2 VS CODE
CHAPTER 4: RESULTS AND DISCUSSION
CHAPTER 5: ADVANTAGES AND CHALLENGES
5.1 ADVANTAGES
5.2 CHALLENGES
CHAPTER 6: CONCLUSION AND FUTURE SCOPE
6.1 CONCLUSION
6.2 FUTURE SCOPE
REFERENCES
APPENDIX
LIST OF FIGURES
Fig 2.1 Flow chart of the project
Fig 2.2 Plot of χ² vs window size in normal flow
Fig 2.3 Validity of Benford’s law for normal TCP flow
Fig 4.1 Data flow that does not have intrusion
Fig 4.2 Data flow that has intrusion
LIST OF TABLES
Table 2.1 NSL-KDD dataset classes distribution

CHAPTER 1
INTRODUCTION
1.1 INTRODUCTION
The number of internet-connected devices is increasing exponentially, which makes it much more challenging to ensure the security and availability of network services for users. Over the last decade, organizations have developed various tools and techniques to protect networks against security threats, such as access control mechanisms, user authentication, and firewalls. Although these solutions prevent unauthorized access by outsiders, they are not resilient against insider attacks. The intrusion detection system (IDS) [1] was therefore developed to act as a second line of defense against information loss to intruders.
IDS can be classified into two major types: network-based intrusion detection systems (NIDS) and host-based intrusion detection systems (HIDS). A NIDS is installed within the network, where it monitors network traffic and identifies potential threats such as denial-of-service attacks and port scans. A HIDS is installed on an individual host computer and analyzes local host activity for any intrusion or threat in the system [2]. Based on the detection technique, IDS can be further classified into three types: signature-based, anomaly-based, and hybrid. A signature-based IDS searches the network traffic for specific predefined sequences of bytes or packets. It is very fast at detecting known attacks but cannot detect unknown or zero-day attacks. An anomaly-based IDS characterizes the behavior of normal traffic flow and treats anything that deviates from this behavior as an anomaly. This method is effective at finding unknown attacks.
However, anomaly-based systems often misclassify legitimate traffic, which leads to a high False Positive Rate (FPR). A hybrid IDS, on the other hand, combines the properties of signature-based and anomaly-based IDS to achieve better accuracy. Network-based IDS is further classified into two types based on the source of the data analyzed: packet-based and flow-based intrusion detection systems. A packet-based IDS has to inspect each packet in the network traffic, including payload and headers. A flow-based IDS instead looks at aggregated information about similar packets in a network flow, which reduces the amount of data to be analyzed while still providing information and patterns about the flow. Packet-based systems mostly rely on signature-based detection, whereas flow-based systems mostly support anomaly-based detection.
It has been found that the flow size difference of normal TCP flows approximately follows Benford’s law, whereas that of malicious TCP flows deviates from it [4]. The first-digit form of Benford’s law is applied to the flow size difference. Since Benford’s law can work with only a few (even a single) feature values and requires only simple mathematical operations, it is well suited to resource-constrained systems such as IoT devices. In this direction, we propose an efficient, lightweight IDS using Benford’s law.
1.2 LITERATURE REVIEW

1. Types of Network Intrusion Detection Systems
Intrusion Detection Systems (IDS) serve as a defensive measure to detect and prevent malicious
activities within a network, acting as an additional layer of protection even with other security tools like
firewalls and antivirus software. According to Kazienko and Dorosz (2003), an IDS is essential for detecting
and thwarting hostile activities that could compromise network integrity if left unchecked. Amoroso (1999)
further defines intrusion detection as “a process of identifying and responding to malicious activity targeted
at computing and networking resources.” Unlike firewalls and antivirus programs, which protect against
unauthorized access from external threats, IDS monitors and flags unusual or malicious internal activities,
enhancing overall network security.
IDS technologies are generally categorized into four types:
1. Network-based IDS (NIDS): Used to monitor network protocols and detect suspicious activity,
NIDS is typically deployed on virtual private networks, remote access servers, and routers (Sturmer,
2013).
2. Wireless IDS (WIDS): Similar to NIDS but specifically designed for wireless networks, WIDS
identifies unauthorized access points and potential misuse within wireless communication channels
(Adams).
3. Host-based IDS (HIDS): Operating on individual hosts, HIDS monitors for changes in system files,
network traffic anomalies, and suspicious application processes (Sturmer).
4. Network Behavior Analysis (NBA): NBA focuses on detecting unusual network behaviors and
traffic patterns, identifying potential threats through deviations from standard network flow
(Seehorn).
Attacks detected by IDS may be internal or external. Internal attacks originate from within the
network, often executed by trusted insiders who exploit their access to sensitive data. These attacks are
especially damaging as insiders benefit from organizational trust and physical access, making detection more
complex. Conversely, external attacks generally come from the internet, involving actors outside the network
perimeter. With rising incidents of insider attacks and increasing regulatory demands, organizations face
significant challenges in securing data from internal threats while adhering to compliance requirements
(Magalhaes, 2003).
2. Network-based Intrusion Prevention Systems
Network-based Intrusion Prevention Systems (NIPS) operate at the network level to scrutinize
incoming and outgoing packets, allowing or blocking them based on predefined security policies, similar to a
firewall. According to Scarfone and Mell (2007), NIPS provides extensive logging capabilities, which help
in tracking detected incidents for further analysis and alert review. Information logged by NIPS includes:
• Timestamp and packet ID for chronological tracking,
• Event or action type with priority and severity ratings,
• Source and destination IP addresses,
• Protocols used in the transport and application layers,
• Data payloads, capturing the nature of application requests and responses,
• State-related information, such as authenticated usernames.
NIPS also facilitates information gathering on host systems within the network, offering insights into
operating systems, applications, and network characteristics, which supports proactive threat assessment and
enhances the network’s defensive capabilities.
3. Network Intrusion Detection System
Network Intrusion Detection Systems (NIDS) continuously monitor network traffic, identifying
patterns that may signify a potential attack. As noted by Rozenblum (2001), a NIDS serves multiple
purposes: monitoring user and system activities, analyzing vulnerabilities, assessing file integrity, detecting
attack patterns, and logging policy violations. Unlike firewalls, which primarily block unauthorized access,
NIDS can detect subtle indications of intrusions even after an initial entry. NIDS configurations involve the
strategic placement of intrusion detection sensors at network entry points to feed information to a central
management console. This centralized setup allows administrators to analyze logs, update attack signatures,
and manage sensor configurations, enhancing the network’s overall resilience.
4. Phishing
Phishing is a deceptive technique used by cyber attackers to acquire sensitive information by
impersonating legitimate entities. Parno, Kuo, and Perrig describe phishing as a social engineering attack that
often involves mock websites mimicking authentic sites to lure victims. This form of fraud became widely
recognized in 1996, initially targeting AOL accounts, and has since evolved into a sophisticated threat that
continues to exploit trust in familiar online platforms (Reid, 2009). Rouse (2007) describes phishing attacks
as akin to “fishing” for information, where attackers cast a wide net, hoping to capture unsuspecting victims
who interact with fraudulent links or websites. The growth of online interactions has exacerbated phishing
risks, making it a persistent threat in both personal and organizational contexts.
5. Detecting and Preventing Phishing
IDS can play a crucial role in detecting and preventing phishing by using known phishing signatures
to block suspicious IP addresses. Liniger and Vines (2005) highlight the importance of signature-based
detection for identifying characteristics of known phishing attacks. With phishing attacks becoming more
prevalent and sophisticated, organizations need robust Intrusion Detection and Prevention Systems that can
monitor beyond traditional email channels, especially as social media adoption grows (Kaspian, 2013).
Comprehensive IDS solutions are thus essential for protecting sensitive information, as phishing attacks
increasingly target organizations through a range of digital communication channels.
CHAPTER 2
PROJECT MECHANISM
2.1 WORK FLOW OF THE SYSTEM
Fig 2.1 Flow chart of the project
2.2 WORKING OF THE SYSTEM
A. Traffic data
We used NSL-KDD [12] as the traffic dataset for our project, as it is one of the most widely used datasets for intrusion detection systems. The dataset consists of basic and derived features for fine-grained analysis of data in time and sequence windows. It has 41 features divided into three categories: basic, content, and traffic. NSL-KDD provides a training dataset and a testing dataset. We used only the training dataset, which contains 125,973 samples of network traffic flow, because we need only a few windows of flow samples and this number of samples is sufficient for our experiments. The 42nd attribute is a label that marks each flow as normal or attack. The attack classes are divided into four types: DoS (Denial of Service), Probe, U2R (User to Root), and R2L (Remote to Local). The distribution of total samples and TCP samples for each attack category is shown in Table 2.1.
Dataset           Total     Normal    DoS      Probe    U2R    R2L
#Total Samples    125973    67343     45927    11656    52     995
#TCP Samples      102689    53600     42188    5857     49     995
Table 2.1 NSL-KDD dataset classes distribution
The NSL-KDD training dataset contains 67,343 normal traffic flows; the rest are malicious. As we aim to apply Benford’s law to TCP flows only, we extracted all TCP connection samples from the 125,973 samples of the NSL-KDD training dataset, which gave a dataset of 102,689 samples. This dataset is further divided into three datasets: 1) Normal dataset - which contains all normal TCP connections (53,600 flows). 2) Malicious dataset - which contains all malicious TCP connections (49,089 flows). 3) Mixed dataset - which contains normal and malicious flows in different proportions.
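The split described above can be sketched as follows. This is a minimal illustration rather than the project's actual preprocessing script: the NSL-KDD files ship without a header row, so the column names protocol_type and label are names we assume were assigned when loading, and the toy rows below are made up.

```python
import pandas as pd

def split_tcp_flows(flows: pd.DataFrame):
    """Return (normal, malicious) TCP subsets of a flow table.

    Assumes hypothetical column names 'protocol_type' and 'label'.
    """
    tcp = flows[flows["protocol_type"] == "tcp"]
    normal = tcp[tcp["label"] == "normal"]
    malicious = tcp[tcp["label"] != "normal"]
    return normal, malicious

# Toy stand-in for the training file (values are made up for illustration).
toy = pd.DataFrame({
    "protocol_type": ["tcp", "udp", "tcp", "tcp", "icmp"],
    "src_bytes": [491, 146, 0, 232, 8],
    "label": ["normal", "normal", "neptune", "normal", "smurf"],
})
normal, malicious = split_tcp_flows(toy)
print(len(normal), len(malicious))
```

A mixed dataset could then be built by concatenating samples drawn from the two subsets in the desired proportion.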
B. The Metrics
We use a metric called “flow size difference”, which is the difference between two consecutive values of a feature having numeric values. Next, we discuss the flow size difference and the reason for using it.
1) Flow size difference: It is defined as the numeric difference between the sizes of two consecutive TCP flows. It is a promising metric because it inherits the long-tailed nature of flow sizes. Iorliam et al. [11] discussed the validity of the flow size difference following Benford’s law. We ignore the sign of the flow size difference and consider only its absolute value.
C. Detection method
In the previous section, we introduced the flow size difference, which is a core part of the detection method. In this section, we briefly explain how to use this metric to distinguish normal and attack flows based on Benford’s law. First, we divide the dataset into two sets: the first contains all the normal flows and the second contains all the attack flows. A three-step procedure is then used to obtain the first-digit frequency distribution of any feature having numeric values.
These three steps are:
(1) Calculate the flow size difference of the feature values for normal TCP flows and store the absolute values in an array.
(2) Extract the first significant digit from the result of (1). The extracted digit may be zero (for example, if two consecutive values of a feature are equal, the difference is zero, and so is its first digit). We need to handle this zero-valued case because Benford’s distribution is defined only for digits 1 to 9. In the proposed method, we delete all zeros from our collection of first significant digits.
(3) Calculate and plot the percentage frequency distribution of the first significant digit from the result of (2).
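The three steps above can be sketched in a few lines of Python. This is a minimal illustration that assumes integer-valued features (such as byte counts); the function name first_digit_distribution and the sample values are hypothetical.

```python
import numpy as np

def first_digit_distribution(values):
    """Percentage frequency of first digits 1..9 of absolute consecutive differences."""
    values = np.asarray(values, dtype=np.int64)
    diffs = np.abs(np.diff(values))       # step 1: absolute flow size difference
    diffs = diffs[diffs > 0]              # step 2: zero differences are discarded
    if diffs.size == 0:
        raise ValueError("no non-zero differences in this window")
    first = np.array([int(str(v)[0]) for v in diffs])  # step 2: leading digit
    counts = np.bincount(first, minlength=10)[1:10]    # counts for digits 1..9
    return 100.0 * counts / counts.sum()               # step 3: percentages

# Made-up src_bytes values for six consecutive flows.
dist = first_digit_distribution([491, 146, 146, 232, 199, 1032])
print(dict(zip(range(1, 10), dist)))
```

In this toy input the non-zero differences are 345, 86, 33 and 833, so digits 3 and 8 each receive 50% of the mass; a real window would contain thousands of flows.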
These steps are followed for each feature having numeric values, and the plots corresponding to normal flows are obtained. The process is then repeated for all features in the second dataset, which contains only attack flows. A feature is said to follow Benford’s law if its plot for normal flows closely matches the expected Benford’s law curve while its plot for attack flows deviates strongly from it. We found only seven features in the NSL-KDD dataset that follow Benford’s law: duration, src_bytes, dst_bytes, count, srv_count, dst_host_count and dst_host_srv_count.
Benford’s law requires the distribution to be calculated over a window. In a real-life scenario, the size of the window plays a crucial role. We found that the window size that gives the best result can be obtained using the χ² divergence method.
From Benford’s law we know that the probability of observing d as the first significant digit is
$$P(d) = \log_{10}\left(1 + \frac{1}{d}\right), \qquad d = 1, 2, \ldots, 9.$$
Benford’s law thus provides a vector that contains the percentage distribution for each first digit value. If we substitute the distribution value in place of P(d) in the Inverse-Benford function for each first digit d, we obtain nine pairs of values (1, f(1)), (2, f(2)), ..., (9, f(9)). We then plot these points with the digits 1, 2, ..., 9 on the x-axis and the corresponding function values f(1), f(2), ..., f(9) on the y-axis. We call this curve the Inverse-Benford distribution.
Further, we applied linear regression to find the best-fit line for the Inverse-Benford distribution. We then evaluated different error metrics, namely Mean Absolute Error (MAE), Mean Square Error (MSE), and Root Mean Square Error (RMSE), between the best-fit line and the Inverse-Benford distribution. The error values are observed to be significantly lower for normal flows and higher for attack flows, depending on how far the curve deviates from a straight line. The idea is to set a threshold on these error metrics. If the calculated error values are higher than the threshold, the window of flows is classified as attack flow; otherwise it is classified as normal flow.
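The regression-and-threshold idea can be sketched as below. Several parts are assumptions on our side: we take the Inverse-Benford function to be f = 1 / (10^P - 1), which is what results from solving P = log10(1 + 1/d) for d (so an ideal Benford distribution maps onto the straight line f(d) = d); the clipping constant eps, the choice of RMSE as the deciding metric, and the threshold value 0.5 are all hypothetical and would need tuning on real traffic.

```python
import numpy as np

DIGITS = np.arange(1, 10)
BENFORD = np.log10(1 + 1 / DIGITS)  # expected first-digit probabilities

def inverse_benford(p, eps=1e-6):
    """Assumed Inverse-Benford function: 1 / (10**p - 1), with p clipped away from 0."""
    p = np.clip(np.asarray(p, dtype=float), eps, None)
    return 1.0 / (10.0 ** p - 1.0)

def fit_errors(p_observed):
    """Fit a line to the Inverse-Benford points and return (MAE, MSE, RMSE)."""
    y = inverse_benford(p_observed)
    slope, intercept = np.polyfit(DIGITS, y, deg=1)
    residuals = y - (slope * DIGITS + intercept)
    mae = float(np.mean(np.abs(residuals)))
    mse = float(np.mean(residuals ** 2))
    return mae, mse, float(np.sqrt(mse))

def is_attack_window(p_observed, rmse_threshold=0.5):
    """Flag a window whose RMSE exceeds a (hypothetical) threshold."""
    return fit_errors(p_observed)[2] > rmse_threshold

# Made-up window in which most differences start with the digit 5.
skewed = [0.02, 0.02, 0.02, 0.02, 0.84, 0.02, 0.02, 0.02, 0.02]
print(is_attack_window(BENFORD), is_attack_window(skewed))
```

Only nine points are fitted per window, so the cost of the regression is negligible next to collecting the window itself.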
D. TCP flows
In this subsection, we discuss TCP flows. A flow is a set of packets that have similar properties and are close in time, i.e., they have the same source and destination. Working with TCP flows rather than individual packets reduces the amount of data to be analyzed. A TCP flow corresponds to a TCP connection or TCP session, which consists of a set of IP packets starting with a SYN packet and ending with a FIN or RST packet. These flows are generated by a large population of users across the internet, and they are mutually independent.
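To make the notion of a flow concrete, the sketch below aggregates packets into flow sizes. It is a deliberately simplified illustration: packets are represented as hypothetical (source, destination, size) tuples, and flows are keyed by the source and destination pair only, whereas a real flow exporter would also use ports and the SYN/FIN/RST boundaries mentioned above.

```python
from collections import defaultdict

def flow_sizes(packets):
    """Sum packet sizes per (src, dst) pair, keeping first-seen order.

    packets: iterable of (src, dst, size_in_bytes) tuples, a simplified
    stand-in for parsed packet headers.
    """
    totals = defaultdict(int)
    for src, dst, size in packets:
        totals[(src, dst)] += size
    return dict(totals)

# Made-up packets: two belong to one flow, one to another.
packets = [
    ("10.0.0.1", "10.0.0.9", 60),
    ("10.0.0.2", "10.0.0.9", 40),
    ("10.0.0.1", "10.0.0.9", 1500),
]
sizes = flow_sizes(packets)
print(sizes)
```

The list of flow sizes in arrival order is the input from which consecutive flow size differences are computed.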
E. Window size
Since Benford’s law is based on a distribution, we need to collect a given number of samples using a window-based method [9]. For a given metric, we can construct the observed distribution and then compare it with the expected distribution to detect any deviation. We found that increasing the window size makes the first-digit frequency calculation more accurate, but beyond a certain window size the accuracy improves very little while the processing and memory cost keeps increasing. Choosing an appropriate window size is therefore important if the model is to be applicable to resource-limited devices (such as IoT systems). The window size should be selected so that even a small deviation from the target distribution can be captured easily. Since Benford’s law has only 9 values, the window size does not need to be very large. We found that a window size of W = 2000 to 3000 with a step size of S = 50 to 500 gives accurate results. The step size (window slide) defines how far the flow window slides at one time. In the window-based method the step size can vary from 1 to W; generally S = W/2 is a better choice.
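The window and step parameters can be illustrated with a small generator. This is a sketch only; the function name sliding_windows is hypothetical, and the tiny W and S below are chosen so the result is easy to read, not because they are suitable for detection.

```python
def sliding_windows(samples, window_size, step_size):
    """Yield consecutive windows of `window_size` samples, advancing by `step_size`."""
    if window_size <= 0 or step_size <= 0:
        raise ValueError("window_size and step_size must be positive")
    for start in range(0, len(samples) - window_size + 1, step_size):
        yield samples[start:start + window_size]

# W = 4 and S = W / 2 = 2 over ten samples gives four overlapping windows.
windows = list(sliding_windows(list(range(10)), window_size=4, step_size=2))
print(windows)
```

Because the generator yields one window at a time, only W samples need to be held in memory, which matters on resource-limited devices.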
F. Measure of goodness of fit

To check how close our model is to the target model, we need a metric that measures goodness of fit. One well-known metric is the χ² test, which is commonly used in applications of Benford’s law. The chi-square goodness-of-fit test is a non-parametric test used to determine whether the observed values of a statistical model differ significantly from the expected values. Mathematically it is defined as:
$$\chi^2 = \sum_{d=1}^{9} \frac{(P_d - \hat{P}_d)^2}{\hat{P}_d}$$
where d is the first digit (1, 2, ..., 9), $P_d$ is the actual observed frequency, and $\hat{P}_d$ is the expected frequency of the data set. Figure 2.2 plots χ² against window size for normal traffic flows in the network. The plot shows that as the window size increases, the χ² value decreases, which means that the actual distribution of the first digit becomes very similar to the expected distribution. The reason is that a small window cannot cover all the first-digit values in the data, so some values may be missing. As the window size increases, all the first-digit values are covered within the window and the distribution tends toward the expected distribution. As the plot shows, a window size of approximately 2500 gives low deviation. We can process this many flows at one time to detect whether the flow is normal or contains an attack in a real-life scenario.
Fig 2.2 Plot of χ² vs window size in normal flow
Fig 2.3 Validity of Benford’s law for normal TCP flow.
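The χ² divergence used in Section F can be computed as in the following sketch. It works on relative frequencies (fractions that sum to 1) and compares against the Benford first-digit probabilities; the function name chi_square and the uniform test distribution are our own illustrative choices.

```python
import numpy as np

# Expected first-digit probabilities under Benford's law for digits 1..9.
BENFORD = np.log10(1 + 1 / np.arange(1, 10))

def chi_square(observed, expected=BENFORD):
    """Chi-square divergence between observed and expected first-digit frequencies."""
    observed = np.asarray(observed, dtype=float)
    return float(np.sum((observed - expected) ** 2 / expected))

uniform = np.full(9, 1 / 9)  # every digit equally likely, which is not Benford-like
print(chi_square(BENFORD), chi_square(uniform))
```

Evaluating this function on windows of increasing size is one way to reproduce a curve of the kind shown in Fig 2.2.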
G. Code Explanation
This code reads a CSV file containing a list of numbers, cleans the data, calculates expected probabilities for
each digit's appearance in the first and second positions, then compares these expected probabilities with
observed probabilities in the data. If observed probabilities fall outside calculated confidence intervals,
they’re flagged as violations. Finally, it creates bar charts for expected vs. observed frequencies and saves
them, along with flagged data.
1. Importing Libraries
import numpy as np
import math
from scipy import stats
import pandas as pd
import matplotlib
matplotlib.use('Agg') # Use a non-GUI backend
from matplotlib import pyplot as plt
• numpy is used for numerical operations, such as creating matrices.
• math provides mathematical functions.
• scipy.stats is used for statistical functions, such as calculating confidence intervals.
• pandas is used for data manipulation and reading CSV files.
• matplotlib is used for plotting; the 'Agg' backend lets the code run and save images without a display.
2. main() Function Definition

The main() function takes two arguments:
• csv_file: the path to the CSV file containing transaction data.
• alpha_level: the confidence level used for the confidence intervals (for example, 0.999).
3. calc_expected_probability() Function Definition

This helper function calculates the expected probability of a specific digit d appearing at a certain position n
in a number.
def calc_expected_probability(d: int, n: int) -> float:
    """Expected probability of digit d at position n (1-based) under Benford's Law."""
    prob = 0
    if (d == 0) and (n == 1):
        # Zero can never be the first significant digit.
        prob = 0
    elif n == 1:
        prob = math.log10(1 + (1 / d))
    else:
        # Sum over every possible (n-1)-digit prefix k.
        l_bound = 10 ** (n - 2)
        u_bound = 10 ** (n - 1)
        for k in range(l_bound, u_bound):
            prob += math.log10(1 + (1 / (10 * k + d)))
    return round(prob, 3)
• For n = 1, it computes the probability using log10(1 + 1/d); this formula comes from Benford's Law.
• For n > 1, it iterates over all possible numbers k with n-1 digits and sums the probability of each prefix k followed by digit d, which gives the expected frequency of digit d at position n.
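As a sanity check on the formula, the probabilities for each position should sum to 1. The sketch below uses an unrounded variant, here called benford_probability (a hypothetical name, not part of the script), to confirm this for the first and second digit; with the 3-decimal rounding used in the script, the sums are only approximately 1.

```python
import math

def benford_probability(d, n):
    """Unrounded probability that digit d appears at position n (1-based)."""
    if n == 1:
        return 0.0 if d == 0 else math.log10(1 + 1 / d)
    return sum(math.log10(1 + 1 / (10 * k + d))
               for k in range(10 ** (n - 2), 10 ** (n - 1)))

first = [benford_probability(d, 1) for d in range(10)]
second = [benford_probability(d, 2) for d in range(10)]
print(round(sum(first), 6), round(sum(second), 6))
print([round(p, 3) for p in first])
```

The first-digit values fall from about 0.301 for the digit 1 to about 0.046 for the digit 9, while the second-digit values are much flatter.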
4. Expected Probability Matrix Calculation

expected_prob_matrix = np.zeros((10, 2))
for d_i in range(expected_prob_matrix.shape[0]):
    for n_i in range(expected_prob_matrix.shape[1]):
        expected_prob_matrix[d_i, n_i] = calc_expected_probability(d_i, n_i + 1)
• Initializes a 10x2 matrix for storing expected probabilities.
• Each cell holds the probability of digit d_i at position n_i + 1 (column 0 is the first digit, column 1 the second).
5. Loading and Cleaning Data

data = pd.read_csv(csv_file, header=None, dtype=str)  # read values as text so the .str accessor works
data.columns = ['original']
data['cleaned'] = data.original.str.replace("-", "", regex=False)
data['cleaned'] = data.cleaned.str.replace(",", "", regex=False)
data['first_digit'] = data.cleaned.str[0].astype(int)
data['second_digit'] = data.cleaned.str[1]
# Single-character values have no second digit; mark them with the 'NaN' sentinel.
data['second_digit'] = data['second_digit'].apply(lambda x: int(x) if isinstance(x, str) and x.isdigit() else 'NaN')
• Loads the CSV data into a DataFrame, reading every value as text.
• Removes - and , characters from each value.
• Extracts the first and second digits of each number (where applicable) and stores them in separate columns.
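The cleaning step above takes the first characters of each value as its digits, which works for positive integers without leading zeros. For inputs such as 0.0452, a variant that looks for significant digits may be preferable. The helper below is a hypothetical sketch of that idea, not part of the script, and it assumes plain decimal notation (no exponents such as 1e5).

```python
import re

def leading_digits(text):
    """Return (first, second) significant digits of a numeric string.

    Either element is None when the value has no such digit.
    """
    digits = re.sub(r"\D", "", text).lstrip("0")  # keep digits, drop leading zeros
    if not digits:
        return None, None
    first = int(digits[0])
    second = int(digits[1]) if len(digits) > 1 else None
    return first, second

for sample in ["-1,250.75", "0.0452", "7", "0"]:
    print(sample, leading_digits(sample))
```

Returning None instead of a 'NaN' string keeps missing digits easy to filter with ordinary checks.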
6. Observed Probability Matrix Calculation
observed_prob_matrix = np.zeros((10, 2))
# Number of usable values in each digit column (rows without a second digit are excluded).
data_cols_length = [data.first_digit.count(), data[data.second_digit != 'NaN'].second_digit.count()]
for d_i in range(observed_prob_matrix.shape[0]):
    for n_i in range(observed_prob_matrix.shape[1]):
        digit_col = data.iloc[:, n_i + 2]  # column 2 is first_digit, column 3 is second_digit
        observed_prob_matrix[d_i, n_i] = (digit_col == d_i).sum() / data_cols_length[n_i]
• Creates a 10x2 matrix for observed probabilities.
• data_cols_length stores the number of usable values in the first_digit and second_digit columns.
• For each d_i and n_i, calculates the probability as the count of occurrences divided by the total count.
7. Confidence Interval Calculation

u_bound_matrix = np.zeros((10, 2))
l_bound_matrix = np.zeros((10, 2))
h_length_matrix = np.zeros((10, 2))

def calc_confidence_interval(alpha, d_i, n_i):
    # alpha is the confidence level (e.g. 0.999), so this is the two-sided z-score.
    alpha_stat = stats.norm.ppf(1 - (1 - alpha) / 2)
    prop_exp = expected_prob_matrix[d_i, n_i]
    n_obs = data_cols_length[n_i]
    # Normal-approximation margin of error plus a continuity correction of 1 / (2n).
    half_length = alpha_stat * (prop_exp * (1 - prop_exp) / n_obs) ** .5 + (1 / (2 * n_obs))
    u_bound = prop_exp + half_length
    l_bound = prop_exp - half_length
    h_length_matrix[d_i, n_i] = half_length
    u_bound_matrix[d_i, n_i] = u_bound
    l_bound_matrix[d_i, n_i] = max(l_bound, 0)

# Fill the interval matrices for every digit and position.
for d_i in range(10):
    for n_i in range(2):
        calc_confidence_interval(alpha_level, d_i, n_i)
• Calculates a confidence interval around each digit’s expected probability using the normal approximation.
• alpha_stat is the z-score for the given confidence level.
• half_length is the margin of error, calculated for each cell.
• u_bound_matrix and l_bound_matrix store the confidence interval limits.
8. Identifying Violations
list_of_rules_violated = []
for d_i in range(10):
    for n_i in range(2):
        obs = observed_prob_matrix[d_i, n_i]
        u_bound = u_bound_matrix[d_i, n_i]
        l_bound = l_bound_matrix[d_i, n_i]
        if (obs > u_bound) or (obs < l_bound):
            print("Violation: Digit {} in {} place".format(d_i, n_i + 1))
            list_of_rules_violated.append((d_i, n_i))
• Checks whether each observed probability lies outside its confidence interval.
• If it does, the digit-position pair is added to list_of_rules_violated.
9. Retrieving and Saving Violations Data

data_that_violated = []
for (d_i, n_i) in list_of_rules_violated:
    rule_violated_data = data[data.iloc[:, n_i + 2] == d_i]['original'].index.to_list()
    for number in rule_violated_data:
        data_that_violated.append(number)
violations_df = data.iloc[data_that_violated, [0, 2, 3]].drop_duplicates('original')
print("Total violations: {}".format(violations_df['original'].count()))
• Collects the rows whose first or second digit violates a rule, then creates a violations_df DataFrame containing the unique violations.
10. Plotting Expected and Observed Frequencies

def plot_charts(digit_location):
    x = np.arange(10)
    width = .35
    fig, ax = plt.subplots()
    ax.bar(x - width / 2, expected_prob_matrix[:, digit_location], width,
           yerr=h_length_matrix[:, digit_location], capsize=7, label='Expected', color='#d3d3d3')
    ax.bar(x + width / 2, observed_prob_matrix[:, digit_location], width, label='Given Data', color='blue')
    ax.legend()
    ax.set_xticks(x)
    position = 'First' if digit_location == 0 else 'Second'
    ax.set_title('{} Digit Frequencies\n (Confidence Interval: {}%)'.format(position, alpha_level * 100))
    ax.set_xlabel('Digit')
    ax.set_ylabel('Frequency')
    return fig
• Creates bar charts for expected and observed frequencies, with error bars representing the confidence intervals.
11. Saving Plots and Exporting CSV

plt_0 = plot_charts(0)
plt_1 = plot_charts(1)
plt_0.savefig('first_digit_plot.png')
plt_1.savefig('second_digit_plot.png')
violations_df.to_csv('transactions_to_investigate.csv')
• Saves the bar charts and exports the violations data to a CSV file.
12. Running the Script

if __name__ == "__main__":
    main('transactions_real.csv', .999)
• Calls the main function with the CSV file path and a confidence level of 0.999 (99.9%).
CHAPTER 3
TECHNOLOGY STACK
3.1 PYTHON
Python is a programming language that is widely used in web applications, software development,
data science, and machine learning (ML). Developers use Python because it is efficient and easy to learn and
can run on many different platforms. Python software is free to download, integrates well with all types of
systems, and increases development speed.
The Python language has several use cases in application development, including the following examples:
Server-side web development
Server-side web development includes the complex backend functions that websites perform to
display information to the user. For example, websites must interact with databases, talk to other websites,
and protect data when sending it over the network.
Python is useful for writing server-side code because it offers many libraries that consist of prewritten
code for complex backend functions. Developers also use a wide range of Python frameworks that provide
all the necessary tools to build web applications faster and more easily. For example, developers can create
the skeleton web application in seconds because they don’t need to write it from scratch. They can then test it
using the framework’s testing tools, without depending on external testing tools.
Automation with Python scripts
A scripting language is a programming language that automates tasks that humans normally perform.
Programmers widely use Python scripts to automate many day-to-day tasks such as the following:
 Renaming a large number of files at once
 Converting a file to another file type
 Removing duplicate words in a text file
 Performing basic mathematical operations
 Sending email messages
 Downloading content
 Performing basic log analysis
 Finding errors in multiple files
Data science and machine learning
Data science is extracting valuable knowledge from data, and machine learning (ML) teaches computers
to automatically learn from the data and make accurate predictions. Data scientists use Python for data
science tasks such as the following:
 Fixing and removing incorrect data, which is known as data cleaning
 Extracting and selecting various features of data
 Data labeling, which is adding meaningful names for the data
 Finding different statistics from data
 Visualizing data by using charts and graphs such as line charts, bar graphs, histograms, and pie charts
Software development
Software developers often use Python for different development tasks and software applications such as
the following:
 Keeping track of bugs in the software code
 Automatically building the software
 Handling software project management
 Developing software prototypes
 Developing desktop applications using Graphical User Interface (GUI) libraries
 Developing simple text-based games to more complex video games
Software test automation
Software testing is the process of checking whether the actual results from the software match the
expected results to ensure that the software is error-free.
 Developers use Python unit test frameworks, such as Unittest, Robot, and PyUnit, to test the
functions they write.
 Software testers use Python to write test cases for various test scenarios. For example, they use it to
test the user interface of a web application, multiple software components, and new features.
Developers can use several tools to automatically run test scripts. These tools are known as Continuous
Integration/Continuous Deployment (CI/CD) tools. Software testers and developers use CI/CD tools such as
Travis CI and Jenkins to automate tests. The CI/CD tool automatically runs the Python test scripts and
reports the test results whenever developers introduce new code changes.
3.2 VS CODE
Visual Studio Code is a free, lightweight but powerful source code editor that runs on your desktop
and on the web and is available for Windows, macOS, Linux, and Raspberry Pi OS. It comes with built-in
support for JavaScript, TypeScript, and Node.js and has a rich ecosystem of extensions for other
programming languages (such as C++, C#, Java, Python, PHP, and Go), runtimes (such as .NET and Unity),
environments (such as Docker and Kubernetes), and clouds (such as Amazon Web Services, Microsoft
Azure, and Google Cloud Platform).
Aside from the whole idea of being lightweight and starting quickly, Visual Studio Code has
IntelliSense code completion for variables, methods, and imported modules; graphical debugging; linting,
multi-cursor editing, parameter hints, and other powerful editing features; snazzy code navigation and
refactoring; and built-in source code control including Git support. Much of this was adapted from Visual
Studio technology.
Visual Studio Code proper is built using the Electron shell, Node.js, TypeScript, and the Language
Server Protocol, and is updated monthly. The many extensions are updated as often as needed. The richness
of support varies across the different programming languages and their extensions, ranging from simple
syntax highlighting and bracket matching to debugging and refactoring. You can add basic support for your
favorite language through TextMate colorizers if no language server is available.
The code in the Visual Studio Code repository is open source under the MIT License. The Visual
Studio Code product itself ships under a standard Microsoft product license, as it has a small percentage of
Microsoft-specific customizations. It’s free despite the commercial license.
CHAPTER 4
RESULTS AND DISCUSSION
For the experimental analysis, we used the NSL-KDD dataset, which groups attacks into four
categories: DoS, Probe, R2L, and U2R. The distribution of total samples and TCP samples for each attack
category in the NSL-KDD dataset is shown in Table 2.1. According to the detection method, we retain only
the seven features of the NSL-KDD dataset that follow the Benford’s law distribution. Figure 2.3 shows the
frequency distribution curves for these seven features along with a blue curve that shows the distribution
predicted by Benford’s law. It can be observed from Figure 2.3 that, for normal flows, the distribution
curves of these seven features closely follow the Benford’s law curve. Figure 4.1 shows data without any
intrusion: there is no significant deviation between the expected and the given data. Figure 4.2 shows
attack flows, where the given data deviates from the expected distribution.
Fig 4.1 Data flow that does not have intrusion

Fig 4.2 Data flow that has intrusion
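The comparison between expected and given data can be illustrated with a short, self-contained sketch. It computes a chi-square statistic between observed leading-digit counts and Benford's law. The two input samples are hypothetical stand-ins rather than NSL-KDD data, and the 5% critical value is an assumed cut-off chosen for illustration, not the threshold used in the project.

```python
import math
from collections import Counter

# Benford's Law: expected probability of each leading digit 1-9.
BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def leading_digit(value: int) -> int:
    """Return the first significant digit of a non-zero integer."""
    return int(str(abs(value))[0])


def chi_square_vs_benford(values) -> float:
    """Chi-square statistic of the observed leading digits against Benford's Law."""
    digits = [leading_digit(v) for v in values if v != 0]
    counts = Counter(digits)
    n = len(digits)
    return sum(
        (counts.get(d, 0) - n * p) ** 2 / (n * p) for d, p in BENFORD.items()
    )


# Hypothetical inputs: powers of two are known to follow Benford's Law closely,
# while near-constant values (all starting with 4) clearly do not.
benford_like = [2 ** k for k in range(1, 200)]
near_constant = [40 + (i % 5) for i in range(199)]

# Assumed cut-off: chi-square critical value for 8 degrees of freedom at 5%.
CRITICAL_VALUE = 15.507

for name, sample in (("benford_like", benford_like), ("near_constant", near_constant)):
    stat = chi_square_vs_benford(sample)
    verdict = "deviates" if stat > CRITICAL_VALUE else "consistent"
    print(f"{name}: chi-square = {stat:.2f} ({verdict})")
```

A sample that stays below the cut-off corresponds to the no-intrusion case, and a sample far above it corresponds to the intrusion case.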

CHAPTER 5
ADVANTAGES AND CHALLENGES
5.1 ADVANTAGES
The project we developed has several key advantages, particularly for applications in resource-constrained
environments such as IoT networks:
1. Efficiency and Resource Optimization:
• Benford's Law operates on statistical distributions rather than on exhaustive analysis of
packet data, so the system requires fewer computational resources, making it well suited to
IoT and other low-power devices.
• The proposed IDS uses a flow-based approach rather than packet-based inspection, focusing
on the size differences in network flows. This approach reduces data processing requirements,
conserving memory and processing power.
2. Scalability for Large Networks:
• This lightweight system is designed to handle extensive IoT networks where traditional,
resource-intensive IDS models may not be feasible. Scalability is achieved by analyzing
fewer data points without compromising detection effectiveness, allowing the system to
monitor and protect large IoT networks with minimal performance impact.
3. Enhanced Detection of Anomalous Behavior:
• The system's reliance on Benford's Law and flow size differences allows for effective
anomaly detection. By focusing on the natural statistical patterns in network traffic, it can
identify deviations associated with malicious activities. Because it inspects flow sizes rather
than payload contents, it can in principle do so even for encrypted or obfuscated traffic
flows, although this has not been evaluated here (see Section 5.2).
• This approach is robust against various types of cyber threats, including denial-of-service
(DoS) attacks, port scans, and other network-based intrusions that may bypass signature-based
detection systems.
4. High Detection Accuracy and Low False Positives:
• The use of linear regression and error metrics (e.g., Mean Absolute Error, Mean Squared
Error) enables the system to differentiate between normal and malicious flows with high
accuracy. The model adjusts its threshold values based on observed errors, helping to keep
false positives low, which is crucial for minimizing unnecessary alerts and reducing the
workload on security teams.
5. Adaptability for Different Attack Ratios:
• The system maintains detection accuracy even as the proportion of malicious traffic in a
network changes. This makes it suitable for the varying network conditions that are common
in real-world applications.
6. Applicability to a Wide Range of Attack Types:
• Because the NSL-KDD dataset includes multiple attack categories (DoS, Probe, R2L, and
U2R), the system can be evaluated against diverse threat vectors. This versatility enhances
its usability across different industries and threat landscapes.
7. Potential for Real-World Implementation:
• The system's lightweight nature, combined with its detection methods, makes it a good
candidate for real-world deployment, especially in sectors where IoT devices are prevalent
(e.g., smart homes, industrial IoT, healthcare). Its design prioritizes compatibility with
real-time monitoring, allowing for proactive threat management in live networks.
8. Foundational Work for Future Enhancements:
• The project establishes a framework for incorporating advanced machine learning techniques
for further threat categorization and type-specific detection. This foundation could enable
future expansions, such as integrating machine learning classifiers or additional statistical
models, to refine the IDS's capabilities even further.
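The flow-size-difference and error-metric steps mentioned in the list above can be sketched as follows. This is a simplified illustration under stated assumptions: it compares the first-digit distribution directly with Benford's law and omits the linear regression step. The function names, the synthetic flow sizes, and the 0.05 MAE threshold are all hypothetical placeholders, not values taken from the project.

```python
import math
import random

# Benford's Law probabilities for leading digits 1-9.
BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]


def flow_size_differences(flow_sizes):
    """Absolute differences between consecutive flow sizes, skipping zero differences."""
    return [abs(b - a) for a, b in zip(flow_sizes, flow_sizes[1:]) if a != b]


def first_digit_distribution(values):
    """Relative frequency of leading digits 1-9 among positive integers."""
    counts = [0] * 9
    for v in values:
        counts[int(str(v)[0]) - 1] += 1
    total = sum(counts)
    return [c / total for c in counts]


def error_metrics(observed):
    """Mean Absolute Error and Mean Squared Error against Benford's Law."""
    errors = [o - e for o, e in zip(observed, BENFORD)]
    mae = sum(abs(x) for x in errors) / len(errors)
    mse = sum(x * x for x in errors) / len(errors)
    return mae, mse


def is_suspicious(flow_sizes, mae_threshold=0.05):
    """Flag a window of flows whose MAE exceeds an assumed, illustrative threshold."""
    observed = first_digit_distribution(flow_size_differences(flow_sizes))
    mae, _ = error_metrics(observed)
    return mae > mae_threshold


# Synthetic windows (assumptions): widely varying sizes versus a repetitive pattern.
random.seed(7)
varied_flows = [int(10 ** random.uniform(2, 6)) for _ in range(2000)]
repetitive_flows = [40 + 20 * (i % 2) for i in range(2000)]

print("varied flows suspicious:", is_suspicious(varied_flows))
print("repetitive flows suspicious:", is_suspicious(repetitive_flows))
```

Only nine counters and a handful of arithmetic operations are needed per window, which is the source of the low resource footprint claimed above.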
5.2 CHALLENGES
• False Positives: Legitimate but unusual traffic patterns may deviate from Benford's Law,
potentially triggering false alarms.
• Dataset Limitations: Reliance on the NSL-KDD dataset might reduce accuracy in diverse,
real-world scenarios with varied traffic behaviors.
• Resource Constraints: Although designed for IoT devices, the IDS's statistical calculations
could still strain low-power devices.
• Encrypted Traffic: Encrypted data poses a challenge, as the IDS depends on analyzing
unencrypted traffic patterns to detect anomalies.
• Threshold Tuning: The IDS requires fine-tuning of detection thresholds to fit different network
environments, which may complicate deployment.
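One possible way to approach the threshold tuning challenge is to calibrate the threshold from error values measured on traffic known to be normal. The sketch below is an assumption made for illustration, not the procedure used in this project. The function calibrate_threshold and the sample error values are hypothetical.

```python
import statistics


def calibrate_threshold(normal_errors, percentile=99):
    """Return the error level at the given percentile of known-normal windows.

    Roughly (100 - percentile)% of normal windows will exceed the returned
    value, which bounds the false-positive rate on the calibration data.
    """
    cut_points = statistics.quantiles(normal_errors, n=100, method="inclusive")
    return cut_points[percentile - 1]


# Hypothetical error values from 100 normal traffic windows.
normal_errors = [i / 1000 for i in range(1, 101)]

threshold = calibrate_threshold(normal_errors, percentile=99)
false_alarms = sum(error > threshold for error in normal_errors)
print(f"threshold = {threshold:.5f}, false alarms on calibration data = {false_alarms}")
```

Repeating the calibration per network gives each environment its own threshold, at the cost of a calibration phase before deployment.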
CHAPTER 6
CONCLUSION AND FUTURE SCOPE
6.1 CONCLUSION
The proposed intrusion detection system (IDS) leveraging Benford's Law and network flow size
differences provides an efficient solution for detecting malicious activity in resource-constrained
environments, such as IoT networks. By focusing on statistical patterns rather than exhaustive packet-level
analysis, this IDS offers a scalable and cost-effective alternative to traditional methods, allowing it to operate
with minimal computational overhead. Experimental results on the NSL-KDD dataset indicate strong
detection accuracy and low false-positive rates, affirming its potential for real-world applications in IoT and
large-scale networks. The implementation produced the output we expected, with only minor, negligible
variations.
6.2 FUTURE SCOPE
• Enhanced Attack Classification: Future work could integrate machine learning techniques for
classifying different attack types once detected, improving response strategies based on specific
threat categories.
• Adaptation for Encrypted Traffic: Enhancements for analyzing encrypted traffic patterns without
payload inspection could make the IDS applicable to privacy-focused networks.
• Real-World Testing and Dataset Expansion: Testing the IDS on more diverse and contemporary
datasets would ensure adaptability to various network environments and emerging threat vectors.
• Dynamic Threshold Adjustment: Implementing adaptive threshold mechanisms could help the IDS
automatically adjust to fluctuating network conditions, improving detection reliability.
• Integration with IoT Security Frameworks: Deploying this IDS as part of a comprehensive IoT
security framework would allow for seamless threat management across heterogeneous devices and
protocols, advancing its utility in practical, large-scale deployments.
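The dynamic threshold adjustment proposed above could take many forms. One minimal sketch, assuming an exponentially weighted moving average of recent error values, is given here. The class AdaptiveThreshold, its parameters (alpha, k, warmup), and the sample error stream are hypothetical and are not part of the current implementation.

```python
import math


class AdaptiveThreshold:
    """Flags an error value that exceeds a moving mean by k moving standard deviations."""

    def __init__(self, alpha=0.2, k=3.0, warmup=5):
        self.alpha = alpha    # weight given to the newest observation
        self.k = k            # number of standard deviations tolerated
        self.warmup = warmup  # observations to collect before flagging anything
        self.mean = None
        self.var = 0.0
        self.seen = 0

    def is_anomalous(self, error: float) -> bool:
        if self.mean is None:
            self.mean = error
            self.seen = 1
            return False
        threshold = self.mean + self.k * math.sqrt(self.var)
        flagged = self.seen >= self.warmup and error > threshold
        if not flagged:
            # Only normal-looking values update the baseline, so an attack
            # cannot drag the threshold upwards.
            diff = error - self.mean
            self.mean += self.alpha * diff
            self.var = (1 - self.alpha) * (self.var + self.alpha * diff * diff)
            self.seen += 1
        return flagged


# Hypothetical per-window error values: eight normal windows, then a spike.
stream = [0.010, 0.012, 0.011, 0.009, 0.010, 0.011, 0.012, 0.010, 0.200]
detector = AdaptiveThreshold()
flags = [detector.is_anomalous(error) for error in stream]
print(flags)
```

The detector keeps only three numbers of state, which fits the lightweight design goal, but the choice of alpha and k remains a tuning decision.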
REFERENCES
[1] Anna L. Buczak and Erhan Guven, “A Survey of Data Mining and Machine Learning Methods for Cyber
Security Intrusion Detection,” IEEE Communications Surveys & Tutorials, vol. 18, no. 2, pp. 1153-1176,
2016.
[2] Chandola V, Banerjee A, Kumar V. Anomaly detection: A survey. ACM Computing Surveys (CSUR),
2009, 41(3): 15.
[3] Z. Jasak and L. Banjanovic-Mehmedovic, “Detecting Anomalies by Benford’s Law,” 2008 IEEE
International Symposium on Signal Processing and Information Technology, Sarajevo, 2008.
[4] Iorliam A, Tirunagari S, Ho A T S, et al. “Flow Size Difference” Can Make a Difference: Detecting
Malicious TCP Network Flows Based on Benford’s Law. arXiv preprint arXiv:1609.04214, 2016.
[5] F. Benford, “The law of anomalous numbers,” Proc. Amer. Philos. Soc., vol. 78, 1938, pp. 551-572.
[6] Nigrini M J. The detection of income tax evasion through an analysis of digital frequencies. Doctoral
dissertation, University of Cincinnati, Cincinnati, 1992.
[7] Li X H, Zhao Y Q, Liao M, et al. Detection of tampered region for JPEG images by using mode-based
first digit features. EURASIP Journal on Advances in Signal Processing, 2012, 2012(1): 190.
[8] Sambridge M, Tkalčić H, Jackson A. Benford’s law in the natural sciences. Geophysical Research
Letters, 2010, 37(22).
[9] Arshadi L, Jahangir A H. An empirical study on TCP flow interarrival time distribution for normal and
anomalous traffic. International Journal of Communication Systems, 2014.
[10] Asadi A N. An approach for detecting anomalies by assessing the interarrival time of UDP packets and
flows using Benford’s law. 2015 2nd International Conference on Knowledge-Based Engineering and
Innovation (KBEI). IEEE, 2015: 257-262.
[11] Iorliam A, Tirunagari S, Ho A T S, et al. “Flow Size Difference” Can Make a Difference: Detecting
Malicious TCP Network Flows Based on Benford’s Law. arXiv preprint arXiv:1609.04214, 2016.
[12] G. Meena and R. R. Choudhary, “A review paper on IDS classification using KDD 99 and NSL KDD
dataset in WEKA,” in International Conference on Computer, Communications and Electronics, Jaipur, pp.
553-558, 2017.
[13] A. Kumar, M. Sung, J. J. Xu, and J. Wang, “Data streaming algorithms for efficient and accurate
estimation of flow size distribution,” in Proceedings of 2004 Joint International Conference on Measurement
and Modeling of Computer Systems (ACM SIGMETRICS ’04/IFIP Performance ’04). ACM, 2004, pp. 177–
188.
[14] P. Tune and D. Veitch, “OFSS: Skampling for the flow size distribution,” in Proceedings of 2014
Conference on Internet Measurement Conference (IMC 2014). ACM, 2014, pp. 235–240.
APPENDIX
Program Code
import math

import matplotlib
matplotlib.use('Agg')  # Use a non-GUI backend so plots can be saved without a display
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from scipy import stats


def main(csv_file, alpha_level):
    # Expected (Benford) probability that digit d appears in position n
    # (n = 1 is the first digit, n = 2 the second digit)
    def calc_expected_probability(d: int, n: int) -> float:
        prob = 0
        if (d == 0) and (n == 1):
            prob = 0  # A leading digit is never zero
        elif n == 1:
            prob = math.log10(1 + (1 / d))
        else:
            l_bound = 10 ** (n - 2)
            u_bound = 10 ** (n - 1)
            for k in range(l_bound, u_bound):
                prob += math.log10(1 + (1 / (10 * k + d)))
        return round(prob, 3)

    # Rows are digits 0-9; columns are the first and second digit positions
    expected_prob_matrix = np.zeros((10, 2))
    for d_i in range(expected_prob_matrix.shape[0]):
        for n_i in range(expected_prob_matrix.shape[1]):
            expected_prob_matrix[d_i, n_i] = calc_expected_probability(d_i, n_i + 1)

    # Read one value per row as text, then strip minus signs and thousands separators
    data = pd.read_csv(csv_file, header=None, dtype=str)
    data.columns = ['original']
    data['cleaned'] = data.original.str.replace("-", "", regex=False)
    data['cleaned'] = data.cleaned.str.replace(",", "", regex=False)
    data['first_digit'] = data.cleaned.str[0].astype(int)
    data['second_digit'] = data.cleaned.str[1]
    # Single-character values have no second digit; mark them with the string 'NaN'
    data['second_digit'] = data['second_digit'].apply(
        lambda x: int(x) if isinstance(x, str) and x.isdigit() else 'NaN')

    # Observed share of each digit in each position
    observed_prob_matrix = np.zeros((10, 2))
    data_cols_length = [data.first_digit.count(),
                        data[data.second_digit != 'NaN'].second_digit.count()]
    for d_i in range(observed_prob_matrix.shape[0]):
        for n_i in range(observed_prob_matrix.shape[1]):
            observed_prob_matrix[d_i, n_i] = \
                data[data.iloc[:, n_i + 2] == d_i].iloc[:, n_i + 2].count() / data_cols_length[n_i]

    u_bound_matrix = np.zeros((10, 2))
    l_bound_matrix = np.zeros((10, 2))
    h_length_matrix = np.zeros((10, 2))

    # alpha is the confidence level (e.g. 0.999). The interval uses a normal
    # approximation plus a continuity correction of 1 / (2N).
    def calc_confidence_interval(alpha):
        alpha_stat = stats.norm.ppf(1 - (1 - alpha) / 2)
        for d_i in range(expected_prob_matrix.shape[0]):
            for n_i in range(expected_prob_matrix.shape[1]):
                prop_exp = expected_prob_matrix[d_i, n_i]
                half_length = (alpha_stat * (prop_exp * (1 - prop_exp) / data_cols_length[n_i]) ** .5
                               + (1 / (2 * data_cols_length[n_i])))
                u_bound = prop_exp + half_length
                l_bound = prop_exp - half_length
                h_length_matrix[d_i, n_i] = half_length
                u_bound_matrix[d_i, n_i] = u_bound
                l_bound_matrix[d_i, n_i] = max(l_bound, 0)

    calc_confidence_interval(alpha_level)

    # A digit/position pair is violated when its observed share leaves the interval
    list_of_rules_violated = []
    for d_i in range(observed_prob_matrix.shape[0]):
        for n_i in range(observed_prob_matrix.shape[1]):
            obs = observed_prob_matrix[d_i, n_i]
            u_bound = u_bound_matrix[d_i, n_i]
            l_bound = l_bound_matrix[d_i, n_i]
            if (obs > u_bound) or (obs < l_bound):
                print("Violation: Digit {} in {} place".format(d_i, n_i + 1))
                list_of_rules_violated.append((d_i, n_i))

    # Collect the rows whose digits belong to a violated digit/position pair
    data_that_violated = []
    for (d_i, n_i) in list_of_rules_violated:
        rule_violated_data = data[data.iloc[:, n_i + 2] == d_i]['original'].index.to_list()
        for number in rule_violated_data:
            data_that_violated.append(number)
    violations_df = data.iloc[data_that_violated, [0, 2, 3]].drop_duplicates('original')
    print("Total violations: {}".format(violations_df['original'].count()))

    def plot_charts(digit_location):
        x = np.arange(10)
        width = .35
        fig, ax = plt.subplots()
        ax.bar(x - width / 2, expected_prob_matrix[:, digit_location], width,
               yerr=h_length_matrix[:, digit_location],
               capsize=7,
               label='Expected',
               color='#d3d3d3')
        ax.bar(x + width / 2, observed_prob_matrix[:, digit_location], width,
               label='Given Data',
               color='blue')
        ax.legend()
        ax.set_xticks(x)
        if digit_location == 0:
            title = 'First Digit Frequencies\n (Confidence Interval: {}%)'.format(alpha_level * 100)
        else:
            title = 'Second Digit Frequencies\n (Confidence Interval: {}%)'.format(alpha_level * 100)
        ax.set_title(title)
        ax.set_xlabel('Digit')
        ax.set_ylabel('Frequency')
        return fig

    # Save the plots instead of showing them
    plt_0 = plot_charts(0)
    plt_1 = plot_charts(1)
    plt_0.savefig('first_digit_plot.png')
    plt_1.savefig('second_digit_plot.png')

    # Export CSV
    violations_df.to_csv('transactions_to_investigate.csv')


if __name__ == "__main__":
    main('transactions_real.csv', .999)
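For reference, the quantities computed by the program can be written out as formulas. The symbols below are notation assumed for this summary: D_1 and D_2 are the first and second digits, p is the expected probability (rounded to three decimals in the code), \hat{p} is the observed share, N is the number of values with a digit in that position, c is the confidence level passed as alpha_level, and z_q is the q-quantile of the standard normal distribution.

```latex
% Expected probabilities (calc_expected_probability)
P(D_1 = d) = \log_{10}\!\left(1 + \frac{1}{d}\right), \qquad d = 1, \dots, 9

P(D_2 = d) = \sum_{k=1}^{9} \log_{10}\!\left(1 + \frac{1}{10k + d}\right), \qquad d = 0, \dots, 9

% Half-length of the confidence interval (calc_confidence_interval)
h = z_{(1+c)/2} \sqrt{\frac{p\,(1 - p)}{N}} + \frac{1}{2N}

% A digit is reported as a violation when the observed share leaves the interval
\hat{p} \notin \bigl[\, \max(0,\; p - h),\; p + h \,\bigr]

% Worked example: d = 1, p = 0.301, N = 1000, c = 0.999, so z_{0.9995} \approx 3.2905
h \approx 3.2905 \sqrt{\frac{0.301 \times 0.699}{1000}} + \frac{1}{2000} \approx 0.0482
\quad\Longrightarrow\quad [\,0.2528,\; 0.3492\,]
```

In this worked example, a first-digit share of 1 below about 25.3% or above about 34.9% among 1000 values would be reported as a violation.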
