Intrusion Detection and Online Booking Using Apache Kafka and Spark
Submitted By
BHARGAVI R KAMAT
(4NI16CS023)
CERTIFICATE
Certified that the seminar work entitled "Intrusion Detection and Online Booking
using Apache Kafka and Spark" is a work carried out by Bhargavi R Kamat bearing
4NI16CS023 in partial fulfilment of the seminar prescribed by The National Institute of
Engineering, an Autonomous Institution under Visvesvaraya Technological University,
Belgaum, for the Eighth Semester B.E., Computer Science & Engineering. It is certified
that all corrections/suggestions indicated for Internal Assessment have been incorporated.
The seminar report has been approved as it satisfies the academic requirements in respect
of the seminar work prescribed for the Eighth Semester.
(Mrs. Poornima N)
ACKNOWLEDGEMENT
I would like to express my sincere gratitude to all those who helped me in
completing the seminar successfully.
Finally, I thank my family and friends for being a constant source of inspiration and
advice.
BHARGAVI R KAMAT
ABSTRACT
In the information era, network traffic has grown complex because of massive
Internet-based services and rapidly growing volumes of data. As network traffic has
increased, cyberattacks have increased dramatically. Therefore, cybersecurity intrusion
detection has become a challenging research area in recent years. An intrusion detection
system requires high-level protection and must detect modern and complex attacks with
high accuracy. Nowadays, big data analytics is the key to solving marketing, security and
privacy problems in an extremely competitive financial market and in government. When a
huge amount of stream data flows in within a short period of time, real-time decision
making is difficult. Performance analysis is therefore extremely important for
administrators and developers to avoid bottlenecks. The aim is to reduce processing time
by using Apache Kafka and Spark Streaming. The integration of Apache Kafka and Spark
Streaming performs better in terms of processing time and fault tolerance on huge amounts
of data. According to the results, fault tolerance is provided by the multiple brokers of
Kafka and the parallel recovery of Spark Streaming, and the multiple partitions of Apache
Kafka improve the processing time of the integrated system.
A stream processing system combines and processes data before the data is stored in a
storage medium. Such a system is built from multiple elements called SPEs (Stream
Processing Elements); each SPE takes input from a data producer, performs computation
and generates output.
TABLE OF CONTENTS
1 INTRODUCTION
  1.1 The Concept of Intrusion Detection and Online Booking using Apache Kafka and Spark
2 LITERATURE SURVEY
3 EXISTING AND PROPOSED SYSTEMS
  3.1 Existing Systems
  3.2 Proposed System
4 SYSTEM ARCHITECTURE
  4.1 Apache Kafka Architecture
  4.2 Spark Streaming Architecture
5 SYSTEM IMPLEMENTATION
  5.1 How does Apache Kafka work?
  5.2 How does Spark Streaming work?
  5.3 Apache Kafka and Spark
6 ADVANTAGES AND DISADVANTAGES
  6.1 Advantages
    6.1.1 Apache Kafka
    6.1.2 Spark Streaming
  6.2 Disadvantages
    6.2.1 Apache Kafka
    6.2.2 Spark Streaming
7 APPLICATIONS
CONCLUSION
CHAPTER 1
INTRODUCTION
1.1 The Concept of Intrusion Detection and Online Booking using Kafka and Spark
Data is one of the key ingredients of Internet-based applications. In current trends for
Internet applications, data used for real-time analytics has become part of production data.
Data is generated in large volumes through various activities: a social network platform
produces data from clicks, while retail produces data through orders, sales and shipments.
Such data can be considered stream data. Stream processing has become a popular
paradigm that allows us to obtain results in real time, continuously, over large volumes of
fresh data.
Streaming of data is a popular model that enables real-time data analytics over streaming
data. In the current era, Apache Kafka is the most popular architecture used for processing
stream data. Kafka is scalable, distributed and reliable, resulting in high throughput. It also
provides an API similar to a messaging system.
In the information era, network traffic has grown complex because of massive Internet-
based services and rapidly growing volumes of data. As network traffic has increased,
cyberattacks have increased dramatically.
Therefore, cybersecurity intrusion detection has been a challenging research area in recent
years. An intrusion detection system requires high-level protection and must detect modern
and complex attacks with high accuracy.
Big data analytics is the key to solving marketing, security and privacy problems in an
extremely competitive financial market and in government. When a huge amount of stream
data flows in within a short period of time, real-time decision making is difficult.
A messaging system is used to transfer data from one application to another, so that
applications can focus only on the data and not on how the data is shared. There are many
traditional messaging systems, but most of them do not handle big data in a real-time
environment. Distributed messaging systems focus on reliable message queuing. There are
two types of message patterns: point-to-point (P2P) and publish-subscribe. The publish-
subscribe pattern, also called pub-sub, is the one Kafka's messaging system uses.
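The distinction between the two patterns can be sketched in plain Python (a toy in-memory model for illustration only, not Kafka's actual implementation; all names are hypothetical):

```python
from collections import defaultdict, deque

class ToyBroker:
    """Toy message broker illustrating point-to-point vs publish-subscribe."""
    def __init__(self):
        self.queues = defaultdict(deque)      # point-to-point: one queue per name
        self.subscribers = defaultdict(list)  # pub-sub: many subscribers per topic

    # --- point-to-point: each message is consumed by exactly one receiver ---
    def send(self, queue, msg):
        self.queues[queue].append(msg)

    def receive(self, queue):
        return self.queues[queue].popleft()   # message is removed once read

    # --- publish-subscribe: every subscriber gets a copy of each message ---
    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, msg):
        for handler in self.subscribers[topic]:
            handler(msg)

broker = ToyBroker()
broker.send("orders", "order-1")
print(broker.receive("orders"))               # only one consumer gets it

seen = []
broker.subscribe("alerts", seen.append)
broker.subscribe("alerts", lambda m: seen.append(m.upper()))
broker.publish("alerts", "intrusion")
print(seen)                                   # every subscriber got a copy
```

In the point-to-point queue each message is consumed exactly once, while `publish` delivers a copy to every subscriber, which is the behaviour Kafka generalizes with topics and consumer groups.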
By 2020, nearly 50 billion devices were expected to be connected to the Internet as a result
of the extensive reach of communication and advanced technologies. The protection of
secure and sensitive information has a significant impact on government and business
sectors. Every software system creates log files; hence, the amount of log data generated
each day exceeds petabytes. Log data from network traffic has high volume, variety and
velocity. Anomalies can be predicted and future attacks prevented using these log data: the
behaviour of hackers can be captured and possible attacks analyzed from logged
transactions.
CHAPTER 2
LITERATURE SURVEY
CHAPTER 3
EXISTING AND PROPOSED SYSTEMS
Big data analytics has recently been developed and is widely used in areas such as
banking, insurance, healthcare, education, manufacturing and risk and fraud management.
Big data analytics helps organizations profile customers based on different features.
Applying analytics to big data can improve performance and extract valuable information.
Big data analytics can detect suspicious activities and investigate historical data to predict
future attacks. It can handle the unstructured, semi-structured and structured data that are
generated daily.
In the proposed system, Apache Kafka and Spark Streaming are highlighted as a
stream processing architecture because their integration can support real-time big data
analytics. The streaming processing architecture consists of two layers: the first layer is
Apache Kafka, which ingests the streaming data, and the second layer is Apache Spark
Streaming, which processes the generated data streams.
The goal is to design a system in which there are no discrepancies while multiple users are
booking tickets: Apache Kafka streams and pipelines the user data as each booking takes
place, and Spark then processes the data to update the databases and to check for
fraudulent activity, with intrusion detection providing high-level protection.
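As a rough illustration of the consistency check described above, a stream handler could accept the first booking per seat and flag later conflicting attempts (hypothetical event format and function names; in the proposed system this logic would run inside Spark over Kafka streams):

```python
def process_bookings(events):
    """Accept the first booking per (show, seat); flag later ones as conflicts."""
    booked = {}      # (show_id, seat) -> user who booked first
    conflicts = []
    for event in events:
        key = (event["show_id"], event["seat"])
        if key in booked:
            conflicts.append(event)   # possible double booking / fraud attempt
        else:
            booked[key] = event["user"]
    return booked, conflicts

# Simulated stream of booking events arriving in order.
stream = [
    {"user": "alice", "show_id": 7, "seat": "A1"},
    {"user": "bob",   "show_id": 7, "seat": "A2"},
    {"user": "carol", "show_id": 7, "seat": "A1"},   # conflicts with alice
]
booked, conflicts = process_bookings(stream)
print(booked[(7, "A1")])   # alice keeps the seat
print(len(conflicts))      # 1 conflicting attempt flagged for review
```

Because Kafka preserves ordering within a partition, keying booking events by seat (or show) would guarantee that such first-come-first-served logic sees events in a consistent order.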
The main motivation of the intrusion detection system is to detect attacks in a huge
amount of data at high speed. Heterogeneous data and time-critical transactions can be
handled by big data streaming analytics, and big data streaming analytics using Apache
Kafka and Spark Streaming can perform intrusion detection in real time. When a huge
amount of streaming data flows into the system within a second, fault tolerance becomes
an issue. The multiple brokers of Apache Kafka and the Resilient Distributed Dataset
(RDD) abstraction of Apache Spark Streaming can recover the loss after a failure occurs.
The major purpose is to investigate the impact of the number of stream records on
processing time and to improve the processing efficiency of Apache Kafka and Spark
Streaming. The integration of Apache Kafka and Spark Streaming performs better in terms
of processing time and fault tolerance on huge amounts of data: fault tolerance is provided
by the multiple brokers of Kafka and the parallel recovery of Spark Streaming, and the
multiple partitions of Apache Kafka improve the processing time of the integrated system.
CHAPTER 4
SYSTEM ARCHITECTURE
CHAPTER 5
SYSTEM IMPLEMENTATION
5.1 How does Apache Kafka work?
A Kafka cluster comprises one or more servers, called brokers, that collect messages,
provide reliable data storage and publish them under related topics. Apache Zookeeper is
used to track the status of cluster nodes in Kafka. Producers send messages to a broker; the
broker keeps each message, and the consumer collects the data without loss. A single-node
cluster contains one broker, as shown in the figure.
Every broker holds several partitions. For each partition, one replica acts as the leader
and the others act as followers. The leader handles read and write operations, and the
followers replicate the leader. When the leader fails, the followers automatically elect a
new leader. Multiple partitions therefore allow Kafka to process in parallel and achieve
better processing time.
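The leader/follower failover described above can be sketched as picking the first surviving replica (a deliberate simplification; real Kafka elects leaders from the in-sync replica set via the cluster controller):

```python
def elect_leader(replicas, failed):
    """Pick the first replica that has not failed as the partition leader."""
    for broker_id in replicas:
        if broker_id not in failed:
            return broker_id
    raise RuntimeError("no live replica for this partition")

# Partition 0 is replicated on brokers 1, 2 and 3; broker 1 starts as leader.
replicas = [1, 2, 3]
print(elect_leader(replicas, failed=set()))   # broker 1 leads
print(elect_leader(replicas, failed={1}))     # broker 2 takes over on failure
print(elect_leader(replicas, failed={1, 2}))  # broker 3 is the last resort
```

The point of the sketch is that as long as one replica survives, reads and writes can continue, which is how multiple brokers give Kafka its fault tolerance.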
To understand the Kafka framework, we must be aware of some terminology:
Topic: A topic is the feed through which messages are stored and published; all Kafka
messages are organized into topics. If you wish to read messages you read them from a
specific topic, and if you wish to send a message you send it to a specific topic. Producer
applications write data to topics and consumer applications read from topics. A Kafka
topic is divided into multiple partitions.
Producers: Producers publish messages to one or more Kafka topics by sending data to
Kafka brokers. A producer can also send messages to a partition of its choice.
Consumers: Consumers read data from brokers. They subscribe to one or more topics and
consume published messages by pulling data from the brokers.
Connectors: Connectors are responsible for pulling stream data from Producers and
delivering stream data to Consumers or Stream Processors.
Broker: A Kafka cluster typically consists of multiple brokers to maintain load balance.
Kafka brokers are stateless, so they use Zookeeper to maintain their cluster state.
Stream Processor: Stream Processors are applications that transform data streams of
topics into other data streams of topics in the Kafka cluster.
Zookeeper: Zookeeper is used to manage and coordinate the Kafka brokers. The
Zookeeper service mainly notifies producers and consumers about the presence of a new
broker or the failure of a broker in the Kafka system.
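A producer that "sends messages to a partition of its choice" typically does so by hashing the message key, so all messages with the same key keep their order within one partition. A minimal sketch follows (Kafka's default partitioner uses a murmur2 hash; `zlib.crc32` stands in here only to keep the example within the standard library):

```python
import zlib

NUM_PARTITIONS = 3

def choose_partition(key: bytes, num_partitions: int = NUM_PARTITIONS) -> int:
    """Deterministically map a message key to a partition index.

    Toy stand-in for Kafka's default (murmur2-based) partitioner."""
    return zlib.crc32(key) % num_partitions

# The same key always lands in the same partition, preserving per-key order.
assert choose_partition(b"user-42") == choose_partition(b"user-42")

for key in (b"user-1", b"user-2", b"user-3"):
    print(key.decode(), "-> partition", choose_partition(key))
```

Spreading keys across partitions this way is what lets multiple consumers read a topic in parallel without breaking per-key ordering.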
Fig. 5.1.3: Consumer
5.2 How does Spark Streaming work?
Spark Streaming is an extension of the core Spark API that enables scalable, high-
throughput, fault-tolerant stream processing of live data streams. Data can be ingested
from many sources such as Kafka, Flume, Kinesis or TCP sockets, and can be processed
using complex algorithms expressed with high-level functions like map, reduce, join and
window. Finally, processed data can be pushed out to file systems, databases and live
dashboards. In fact, you can also apply Spark's machine learning and graph processing
algorithms to data streams.
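The map/reduce style of computation can be illustrated without a cluster. This plain-Python sketch (not the actual DStream API) performs the classic word count that Spark Streaming would apply to each micro-batch:

```python
from functools import reduce
from collections import Counter

def word_count(batch):
    """Map each line to words, then reduce to per-word counts (one micro-batch)."""
    words = (w for line in batch for w in line.split())      # map / flatMap step
    return reduce(lambda acc, w: acc + Counter({w: 1}), words, Counter())

# Two simulated micro-batches of log lines arriving over time.
micro_batches = [
    ["error login failed", "error timeout"],
    ["login ok"],
]
for batch in micro_batches:
    print(dict(word_count(batch)))
```

In real Spark Streaming the same map and reduce steps run distributed over the cluster, once per batch interval, rather than in a single Python process.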
5.3 Apache Kafka and Spark
First, network traffic transactions are retrieved from CSV files or from the data pipeline
streamed through Apache Kafka. These transactions are transmitted via Apache Kafka to
obtain data streams: Kafka ingests network traffic transactions in real time, captures each
streaming data event and then distributes the huge volume of data streams without loss,
thanks to partitioning and replication. After that, Spark Streaming processes these data
streams. Spark Streaming divides the data according to the specified batch interval
duration, and the streaming output for each batch is generated in the Spark UI. Finally, the
performance results in the Spark UI are evaluated and analyzed. The streaming processing
architecture is shown in the figure.
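The batching step can be sketched as grouping timestamped records into consecutive batch intervals (a plain-Python simplification of what Spark Streaming does internally; names are illustrative):

```python
from collections import defaultdict

def split_into_batches(records, batch_interval):
    """Group (timestamp, value) records into consecutive batch intervals."""
    batches = defaultdict(list)
    for ts, value in records:
        batches[int(ts // batch_interval)].append(value)  # which interval?
    return dict(batches)

# Records arriving over ~25 seconds, with a 10-second batch interval.
records = [(1, "a"), (4, "b"), (12, "c"), (19, "d"), (23, "e")]
print(split_into_batches(records, batch_interval=10))
# {0: ['a', 'b'], 1: ['c', 'd'], 2: ['e']}
```

Each resulting batch corresponds to one RDD in Spark Streaming, which is why a larger batch interval trades latency for throughput.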
CHAPTER 6
ADVANTAGES AND DISADVANTAGES
6.1 ADVANTAGES
6.1.1 Apache Kafka
e. Scalability: Kafka can be scaled out on the fly by adding additional nodes, without
incurring any downtime. Moreover, inside the Kafka cluster, message handling is fully
transparent and seamless.
f. Distributed: The distributed architecture of Kafka makes it scalable, using capabilities
like replication and partitioning.
g. Message Broker Capabilities: Kafka works very well as a replacement for a more
traditional message broker. Here, a message broker refers to an intermediary program that
translates messages from the formal messaging protocol of the publisher to the formal
messaging protocol of the receiver.
h. High Concurrency: Kafka can handle thousands of messages per second, in low-latency
conditions and with high throughput. In addition, it permits reading and writing messages
at high concurrency.
i. Persistent by Default: As discussed above, messages are persistent, which makes Kafka
durable and reliable.
j. Consumer Friendly: It is possible to integrate Kafka with a variety of consumers. The
best part of Kafka is that it can behave or act differently according to the consumer it
integrates with, because each consumer has a different ability to handle the messages
coming out of Kafka. Moreover, Kafka integrates well with consumers written in a variety
of languages.
k. Batch Handling Capable (ETL-like functionality): Kafka can also be employed for
batch-like use cases and can do the work of a traditional ETL tool, owing to its capability
of persisting messages.
l. Variety of Use Cases: Kafka can manage the variety of use cases commonly required for
a data lake, for example log aggregation, web activity tracking, and so on.
m. Real-Time Handling: Kafka can handle real-time data pipelines. Since we need a
technology to handle real-time messages from applications, this is one of the core reasons
for choosing Kafka.
6.1.2 Spark Streaming
Spark Core, the underlying library of Apache Spark, provides parallel and distributed
processing. Streaming data can be combined with static datasets as well as with interactive
queries, and Spark Streaming can be integrated with advanced processing libraries such as
Spark SQL, machine learning (MLlib) and graph processing (GraphX).
6.2 DISADVANTAGES
6.2.1 Apache Kafka
a. No Complete Set of Monitoring Tools: Kafka lacks a full set of management and
monitoring tools. Hence, enterprise support staff can feel anxious about choosing Kafka
and supporting it in the long run.
b. Issues with Message Tweaking: The broker uses certain system calls to deliver messages
to the consumer. However, Kafka's performance reduces significantly if a message needs
some tweaking; it performs well only when the message is unchanged, because it can then
exploit the capabilities of the operating system.
c. No Wildcard Topic Selection: Kafka only matches exact topic names and does not
support wildcard topic selection, which makes it incapable of addressing certain use cases.
d. Lack of Pace: There can be a problem with the pace of development, since the client
APIs needed for other languages are maintained by different individuals and corporations.
e. Reduces Performance: In general, there are no issues with individual message size.
However, brokers and consumers start compressing messages as the size increases; when
these are decompressed, node memory is gradually consumed. Compression also happens
as the data flows through the pipeline, which affects throughput and performance.
f. Behaves Clumsily: Kafka sometimes starts behaving clumsily and slowly when the
number of queues in a cluster increases.
g. Lacks Some Messaging Paradigms: Some messaging paradigms, such as request/reply
and point-to-point queues, are missing in Kafka. Not always, but for certain use cases, this
is problematic.
6.2.2 Spark Streaming
a. No Support for True Real-Time Processing: In Spark Streaming, the arriving live stream
of data is divided into batches of a pre-defined interval, and each batch is treated as a
Resilient Distributed Dataset (RDD). These RDDs are then processed using operations like
map, reduce and join, and the results are returned in batches. Thus Spark Streaming is not
true real-time processing but near-real-time, micro-batch processing of live data.
b. Problem with Small Files: If we use Spark with Hadoop, we come across the small-file
problem. HDFS is suited to a limited number of large files rather than a large number of
small files. Another place where Spark lags behind is when data is stored gzipped in S3:
this pattern works well except when there are lots of small gzipped files, because Spark
must then fetch those files over the network and uncompress them. A gzipped file can be
uncompressed only if the entire file is on one core, so a large span of time is spent burning
cores unzipping files in sequence.
c. No File Management System: Apache Spark does not have its own file management
system; it relies on some other platform such as Hadoop or a cloud-based platform, which
is one of Spark's known issues.
d. Expensive: Apache Spark requires a lot of RAM to run in-memory, so the cost of Spark
is quite high.
e. Iterative Processing: In Spark, data iterates in batches, and each iteration is scheduled
and executed separately.
CHAPTER 7
APPLICATIONS
Traditional messaging systems have existed for a long time and play an important role
in data processing. IBM WebSphere MQ allows an application to insert messages into
multiple queues atomically. In JMS, individual messages are acknowledged after
processing. More recently, the Hedwig system, developed by Yahoo!, provides a
distributed pub-sub system that is scalable and offers strong durability guarantees.
Apache Kafka works in combination with HBase and Spark for real-time analytics and
for processing streaming data. Nowadays, many multinational companies use Apache
Kafka in their use cases.
CONCLUSION
In this work, we focused on how to work with Kafka and how to tune its deployment.
Kafka helps stream processing developers use their big data processing architecture
effectively. Kafka defines a pull-based model that allows applications to consume data
whenever needed, and it achieves higher throughput than traditional messaging systems.
At present, big data analytics is the key to the security and privacy challenges. The stream
processing architecture increases the processing speed of intrusion detection. According to
the outcomes, multiple partitions in Apache Kafka and Spark Streaming batch intervals
between 10 and 50 seconds give better performance in the integration of Kafka and Spark
Streaming.