Intrusion Detection and Online Booking Using Apache Kafka and Spark
Submitted By
BHARGAVI R KAMAT
(4NI16CS023)
CERTIFICATE
Certified that the seminar work entitled "Intrusion Detection and Online Booking
using Apache Kafka and Spark" is a work carried out by Bhargavi R Kamat bearing
4NI16CS023 in partial fulfilment of the seminar prescribed by The National Institute of
Engineering, an Autonomous Institution under Visvesvaraya Technological University,
Belgaum, for the Eighth Semester B.E., Computer Science & Engineering. It is certified
that all corrections/suggestions indicated for Internal Assessment have been incorporated.
The seminar report has been approved as it satisfies the academic requirements in respect
of the seminar work prescribed for the Eighth Semester.
(Mrs. Poornima N)
ACKNOWLEDGEMENT
I would like to express my sincere gratitude to all those who helped me in
completing the seminar successfully.
Finally, I thank my family and friends for being a constant source of inspiration and
advice.
BHARGAVI R KAMAT
ABSTRACT
In the information era, network traffic has grown complex because of massive
Internet-based services and rapidly growing volumes of data. As network traffic has
increased, cyberattacks have increased dramatically. Therefore, cybersecurity intrusion
detection has become a challenging research area in recent years. An intrusion detection
system requires high-level protection and must detect modern and complex attacks with
high accuracy. Nowadays, big data analytics is the key to solving marketing, security and
privacy problems in an extremely competitive financial market and in government. When a
huge amount of stream data flows in within a short period of time, real-time decision
making is difficult. Performance analysis is therefore extremely important for
administrators and developers to avoid bottlenecks. The aim is to reduce processing time
by using Apache Kafka and Spark Streaming. The integration of Apache Kafka and Spark
Streaming performs better in terms of processing time and fault tolerance on huge amounts
of data. According to the results, fault tolerance is provided by the multiple brokers of
Kafka and the parallel recovery of Spark Streaming, and the multiple partitions of Apache
Kafka improve the processing time of the integrated system.
A stream processing system combines and processes data before the data is stored in a
storage medium. Such a system is built from multiple elements called SPEs (Stream
Processing Elements); each SPE takes input from a data producer, performs computation
and generates output.
TABLE OF CONTENTS
1 INTRODUCTION
  1.1 The Concept of Intrusion Detection and Online Booking using Apache Kafka and Spark
2 LITERATURE SURVEY
3 EXISTING AND PROPOSED SYSTEMS
  3.1 Existing Systems
  3.2 Proposed System
4 SYSTEM ARCHITECTURE
  4.1 Apache Kafka Architecture
  4.2 Spark Streaming Architecture
5 SYSTEM IMPLEMENTATION
  5.1 How does Apache Kafka work?
  5.2 How does Spark Streaming work?
  5.3 Apache Kafka and Spark
6 ADVANTAGES AND DISADVANTAGES
  6.1 Advantages
    6.1.1 Apache Kafka
    6.1.2 Spark Streaming
  6.2 Disadvantages
    6.2.1 Apache Kafka
    6.2.2 Spark Streaming
7 APPLICATIONS
CONCLUSION
CHAPTER 1
INTRODUCTION
1.1 The Concept of Intrusion Detection and Online Booking using Kafka and Spark
Data is one of the key ingredients of Internet-based applications. In current trends for
Internet applications, data used for real-time analytics has become part of production data.
Data is generated in large volumes through various activities: a social network platform
produces data from clicks, while retail produces data through orders, sales and shipments.
Such data can be considered stream data. Stream processing has become a popular
paradigm that allows us to obtain results in real time, continuously, over large volumes of
fresh data.
Streaming of data is a popular model that enables real-time data analytics over streaming
data. In the current era, Apache Kafka is the most popular architecture used for processing
stream data. Kafka is scalable, distributed and reliable, resulting in high throughput. It also
provides an API similar to a messaging system.
In the information era, network traffic has grown complex because of massive Internet-
based services and rapidly growing volumes of data. As network traffic has increased,
cyberattacks have increased dramatically.
Therefore, cybersecurity intrusion detection has been a challenging research area in recent
years. An intrusion detection system requires high-level protection and must detect modern
and complex attacks with high accuracy.
Big data analytics is the key to solving marketing, security and privacy problems in an
extremely competitive financial market and in government. When a huge amount of stream
data flows in within a short period of time, real-time decision making is difficult.
A messaging system is used to transfer data from one application to another, so that
applications can focus only on the data and not on how the data is shared. There are many
traditional messaging systems, but most of them do not handle big data in a real-time
environment. Distributed messaging systems focus on reliable message queuing. There are
two types of message patterns: point-to-point (P2P) and publish-subscribe. The publish-
subscribe pattern, also called pub-sub, is the one Kafka's messaging system uses.
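The distinction between the two patterns can be sketched in plain Python (a toy in-memory model for illustration only, not Kafka's actual implementation; all names are hypothetical):

```python
from collections import defaultdict, deque

class ToyBroker:
    """Toy message broker illustrating point-to-point vs publish-subscribe."""
    def __init__(self):
        self.queues = defaultdict(deque)      # point-to-point: one queue per name
        self.subscribers = defaultdict(list)  # pub-sub: many subscribers per topic

    # --- point-to-point: each message is consumed by exactly one receiver ---
    def send(self, queue, msg):
        self.queues[queue].append(msg)

    def receive(self, queue):
        return self.queues[queue].popleft()   # message is removed once read

    # --- publish-subscribe: every subscriber gets a copy of each message ---
    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, msg):
        for handler in self.subscribers[topic]:
            handler(msg)

broker = ToyBroker()
broker.send("orders", "order-1")
print(broker.receive("orders"))               # only one consumer gets it

seen = []
broker.subscribe("alerts", seen.append)
broker.subscribe("alerts", lambda m: seen.append(m.upper()))
broker.publish("alerts", "intrusion")
print(seen)                                   # every subscriber got a copy
```

In the point-to-point queue each message is consumed exactly once, while `publish` delivers a copy to every subscriber, which is the behaviour Kafka generalizes with topics and consumer groups.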
By 2020, nearly 50 billion devices were expected to be connected to the Internet as a result
of the extensive reach of communication and advanced technologies. The protection of
secure and sensitive information has a significant impact on government and business
sectors. Every software system creates log files; hence, the amount of log data generated
each day exceeds petabytes. Log data from network traffic has high volume, variety and
velocity. Anomalies can be predicted and future attacks prevented using these log data: the
behaviour of hackers can be captured and possible attacks analyzed from logged
transactions.
CHAPTER 2
LITERATURE SURVEY
CHAPTER 3
EXISTING AND PROPOSED SYSTEMS
Big data analytics has recently been developed and is widely used in areas such as
banking, insurance, healthcare, education, manufacturing and risk and fraud management.
Big data analytics helps organizations profile customers based on different features.
Applying analytics to big data can improve performance and extract valuable information.
Big data analytics can detect suspicious activities and investigate historical data to predict
future attacks. It can handle the unstructured, semi-structured and structured data that are
generated daily.
In the proposed system, Apache Kafka and Spark Streaming are highlighted as a
stream processing architecture because their integration can support real-time big data
analytics. The streaming processing architecture consists of two layers: the first layer is
Apache Kafka, which ingests the streaming data, and the second layer is Apache Spark
Streaming, which processes the generated data streams.
The goal is to design a system in which there are no discrepancies while multiple users are
booking tickets: Apache Kafka streams and pipelines the user data as each booking takes
place, and Spark then processes the data to update the databases and to check for
fraudulent activity, with intrusion detection providing high-level protection.
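As a rough illustration of the consistency check described above, a stream handler could accept the first booking per seat and flag later conflicting attempts (hypothetical event format and function names; in the proposed system this logic would run inside Spark over Kafka streams):

```python
def process_bookings(events):
    """Accept the first booking per (show, seat); flag later ones as conflicts."""
    booked = {}      # (show_id, seat) -> user who booked first
    conflicts = []
    for event in events:
        key = (event["show_id"], event["seat"])
        if key in booked:
            conflicts.append(event)   # possible double booking / fraud attempt
        else:
            booked[key] = event["user"]
    return booked, conflicts

# Simulated stream of booking events arriving in order.
stream = [
    {"user": "alice", "show_id": 7, "seat": "A1"},
    {"user": "bob",   "show_id": 7, "seat": "A2"},
    {"user": "carol", "show_id": 7, "seat": "A1"},   # conflicts with alice
]
booked, conflicts = process_bookings(stream)
print(booked[(7, "A1")])   # alice keeps the seat
print(len(conflicts))      # 1 conflicting attempt flagged for review
```

Because Kafka preserves ordering within a partition, keying booking events by seat (or show) would guarantee that such first-come-first-served logic sees events in a consistent order.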
The main motivation of the intrusion detection system is to detect attacks in a huge
amount of data at high speed. Heterogeneous data and time-critical transactions can be
handled by big data streaming analytics, and big data streaming analytics using Apache
Kafka and Spark Streaming can perform intrusion detection in real time. When a huge
amount of streaming data flows into the system within a second, fault tolerance becomes
an issue. The multiple brokers of Apache Kafka and the Resilient Distributed Dataset
(RDD) abstraction of Apache Spark Streaming can recover the loss after a failure occurs.
The major purpose is to investigate the impact of the number of stream records on
processing time and to improve the processing efficiency of Apache Kafka and Spark
Streaming. The integration of Apache Kafka and Spark Streaming performs better in terms
of processing time and fault tolerance on huge amounts of data: fault tolerance is provided
by the multiple brokers of Kafka and the parallel recovery of Spark Streaming, and the
multiple partitions of Apache Kafka improve the processing time of the integrated system.
CHAPTER 4
SYSTEM ARCHITECTURE
CHAPTER 5
SYSTEM IMPLEMENTATION
5.1 How does Apache Kafka work?
A Kafka cluster comprises one or more servers, called brokers, that collect messages,
provide reliable data storage and publish them under related topics. Apache Zookeeper is
used to track the status of cluster nodes in Kafka. Producers send messages to a broker; the
broker keeps each message, and the consumer collects the data without loss. A single-node
cluster contains one broker, as shown in the figure.
Every broker holds several partitions. For each partition, one replica acts as the leader
and the others act as followers. The leader handles read and write operations, and the
followers replicate the leader. When the leader fails, the followers automatically elect a
new leader. Multiple partitions therefore allow Kafka to process in parallel and achieve
better processing time.
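The leader/follower failover described above can be sketched as picking the first surviving replica (a deliberate simplification; real Kafka elects leaders from the in-sync replica set via the cluster controller):

```python
def elect_leader(replicas, failed):
    """Pick the first replica that has not failed as the partition leader."""
    for broker_id in replicas:
        if broker_id not in failed:
            return broker_id
    raise RuntimeError("no live replica for this partition")

# Partition 0 is replicated on brokers 1, 2 and 3; broker 1 starts as leader.
replicas = [1, 2, 3]
print(elect_leader(replicas, failed=set()))   # broker 1 leads
print(elect_leader(replicas, failed={1}))     # broker 2 takes over on failure
print(elect_leader(replicas, failed={1, 2}))  # broker 3 is the last resort
```

The point of the sketch is that as long as one replica survives, reads and writes can continue, which is how multiple brokers give Kafka its fault tolerance.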
To understand the Kafka framework, we must be aware of some terminology:
Topic: A topic is the feed through which messages are stored and published; all Kafka
messages are organized into topics. If you wish to read messages you read them from a
specific topic, and if you wish to send a message you send it to a specific topic. Producer
applications write data to topics and consumer applications read from topics. A Kafka
topic is divided into multiple partitions.
Producers: Producers publish messages to one or more Kafka topics by sending data to
Kafka brokers. A producer can also send messages to a partition of its choice.
Consumers: Consumers read data from brokers. They subscribe to one or more topics and
consume published messages by pulling data from the brokers.
Connectors: Connectors are responsible for pulling stream data from Producers and
delivering stream data to Consumers or Stream Processors.
Broker: A Kafka cluster typically consists of multiple brokers to maintain load balance.
Kafka brokers are stateless, so they use Zookeeper to maintain their cluster state.
Stream Processor: Stream Processors are applications that transform data streams of
topics into other data streams of topics in the Kafka cluster.
Zookeeper: Zookeeper is used to manage and coordinate the Kafka brokers. The
Zookeeper service mainly notifies producers and consumers about the presence of a new
broker or the failure of a broker in the Kafka system.
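A producer that "sends messages to a partition of its choice" typically does so by hashing the message key, so all messages with the same key keep their order within one partition. A minimal sketch follows (Kafka's default partitioner uses a murmur2 hash; `zlib.crc32` stands in here only to keep the example within the standard library):

```python
import zlib

NUM_PARTITIONS = 3

def choose_partition(key: bytes, num_partitions: int = NUM_PARTITIONS) -> int:
    """Deterministically map a message key to a partition index.

    Toy stand-in for Kafka's default (murmur2-based) partitioner."""
    return zlib.crc32(key) % num_partitions

# The same key always lands in the same partition, preserving per-key order.
assert choose_partition(b"user-42") == choose_partition(b"user-42")

for key in (b"user-1", b"user-2", b"user-3"):
    print(key.decode(), "-> partition", choose_partition(key))
```

Spreading keys across partitions this way is what lets multiple consumers read a topic in parallel without breaking per-key ordering.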
Fig. 5.1.3: Consumer
5.2 How does Spark Streaming work?
Spark Streaming is an extension of the core Spark API that enables scalable, high-
throughput, fault-tolerant stream processing of live data streams. Data can be ingested
from many sources such as Kafka, Flume, Kinesis or TCP sockets, and can be processed
using complex algorithms expressed with high-level functions like map, reduce, join and
window. Finally, processed data can be pushed out to file systems, databases and live
dashboards. In fact, you can also apply Spark's machine learning and graph processing
algorithms to data streams.
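The map/reduce style of computation can be illustrated without a cluster. This plain-Python sketch (not the actual DStream API) performs the classic word count that Spark Streaming would apply to each micro-batch:

```python
from functools import reduce
from collections import Counter

def word_count(batch):
    """Map each line to words, then reduce to per-word counts (one micro-batch)."""
    words = (w for line in batch for w in line.split())      # map / flatMap step
    return reduce(lambda acc, w: acc + Counter({w: 1}), words, Counter())

# Two simulated micro-batches of log lines arriving over time.
micro_batches = [
    ["error login failed", "error timeout"],
    ["login ok"],
]
for batch in micro_batches:
    print(dict(word_count(batch)))
```

In real Spark Streaming the same map and reduce steps run distributed over the cluster, once per batch interval, rather than in a single Python process.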
5.3 Apache Kafka and Spark
First, network traffic transactions are retrieved from CSV files or from the data pipeline
streamed through Apache Kafka. These transactions are transmitted via Apache Kafka to
obtain data streams: Kafka ingests network traffic transactions in real time, captures each
streaming data event and then distributes the huge volume of data streams without loss,
thanks to partitioning and replication. After that, Spark Streaming processes these data
streams. Spark Streaming divides the data according to the specified batch interval
duration, and the streaming output for each batch is generated in the Spark UI. Finally, the
performance results in the Spark UI are evaluated and analyzed. The streaming processing
architecture is shown in the figure.
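The batching step can be sketched as grouping timestamped records into consecutive batch intervals (a plain-Python simplification of what Spark Streaming does internally; names are illustrative):

```python
from collections import defaultdict

def split_into_batches(records, batch_interval):
    """Group (timestamp, value) records into consecutive batch intervals."""
    batches = defaultdict(list)
    for ts, value in records:
        batches[int(ts // batch_interval)].append(value)  # which interval?
    return dict(batches)

# Records arriving over ~25 seconds, with a 10-second batch interval.
records = [(1, "a"), (4, "b"), (12, "c"), (19, "d"), (23, "e")]
print(split_into_batches(records, batch_interval=10))
# {0: ['a', 'b'], 1: ['c', 'd'], 2: ['e']}
```

Each resulting batch corresponds to one RDD in Spark Streaming, which is why a larger batch interval trades latency for throughput.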
CHAPTER 6
ADVANTAGES AND DISADVANTAGES
6.1 ADVANTAGES
6.1.1 Apache Kafka
e. Scalability: Kafka can be scaled out on the fly by adding additional nodes, without
incurring any downtime. Moreover, inside the Kafka cluster, message handling is fully
transparent and seamless.
f. Distributed: The distributed architecture of Kafka makes it scalable, using capabilities
like replication and partitioning.
g. Message Broker Capabilities: Kafka works very well as a replacement for a more
traditional message broker. Here, a message broker refers to an intermediary program that
translates messages from the formal messaging protocol of the publisher to the formal
messaging protocol of the receiver.
h. High Concurrency: Kafka can handle thousands of messages per second, in low-latency
conditions and with high throughput. In addition, it permits reading and writing messages
at high concurrency.
i. Persistent by Default: As discussed above, messages are persistent, which makes Kafka
durable and reliable.
j. Consumer Friendly: It is possible to integrate Kafka with a variety of consumers. The
best part of Kafka is that it can behave or act differently according to the consumer it
integrates with, because each consumer has a different ability to handle the messages
coming out of Kafka. Moreover, Kafka integrates well with consumers written in a variety
of languages.
k. Batch Handling Capable (ETL-like functionality): Kafka can also be employed for
batch-like use cases and can do the work of a traditional ETL tool, owing to its capability
of persisting messages.
l. Variety of Use Cases: Kafka can manage the variety of use cases commonly required for
a data lake, for example log aggregation, web activity tracking, and so on.
m. Real-Time Handling: Kafka can handle real-time data pipelines. Since we need a
technology to handle real-time messages from applications, this is one of the core reasons
for choosing Kafka.
6.1.2 Spark Streaming
Spark Core, the underlying library of Apache Spark, provides parallel and distributed
processing. Streaming data can be combined with static datasets as well as with interactive
queries, and Spark Streaming can be integrated with advanced processing libraries such as
Spark SQL, machine learning (MLlib) and graph processing (GraphX).
6.2 DISADVANTAGES
6.2.1 Apache Kafka
a. No Complete Set of Monitoring Tools: Kafka lacks a full set of management and
monitoring tools. Hence, enterprise support staff can feel anxious about choosing Kafka
and supporting it in the long run.
b. Issues with Message Tweaking: The broker uses certain system calls to deliver messages
to the consumer. However, Kafka's performance reduces significantly if a message needs
some tweaking; it performs well only when the message is unchanged, because it can then
exploit the capabilities of the operating system.
c. No Wildcard Topic Selection: Kafka only matches exact topic names and does not
support wildcard topic selection, which makes it incapable of addressing certain use cases.
d. Lack of Pace: There can be a problem with the pace of development, since the client
APIs needed for other languages are maintained by different individuals and corporations.
e. Reduces Performance: In general, there are no issues with individual message size.
However, brokers and consumers start compressing messages as the size increases; when
these are decompressed, node memory is gradually consumed. Compression also happens
as the data flows through the pipeline, which affects throughput and performance.
f. Behaves Clumsily: Kafka sometimes starts behaving clumsily and slowly when the
number of queues in a cluster increases.
g. Lacks Some Messaging Paradigms: Some messaging paradigms, such as request/reply
and point-to-point queues, are missing in Kafka. Not always, but for certain use cases, this
is problematic.
6.2.2 Spark Streaming
a. No Support for True Real-Time Processing: In Spark Streaming, the arriving live stream
of data is divided into batches of a pre-defined interval, and each batch is treated as a
Resilient Distributed Dataset (RDD). These RDDs are then processed using operations like
map, reduce and join, and the results are returned in batches. Thus Spark Streaming is not
true real-time processing but near-real-time, micro-batch processing of live data.
b. Problem with Small Files: If we use Spark with Hadoop, we come across the small-file
problem. HDFS is suited to a limited number of large files rather than a large number of
small files. Another place where Spark lags behind is when data is stored gzipped in S3:
this pattern works well except when there are lots of small gzipped files, because Spark
must then fetch those files over the network and uncompress them. A gzipped file can be
uncompressed only if the entire file is on one core, so a large span of time is spent burning
cores unzipping files in sequence.
c. No File Management System: Apache Spark does not have its own file management
system; it relies on some other platform such as Hadoop or a cloud-based platform, which
is one of Spark's known issues.
d. Expensive: Apache Spark requires a lot of RAM to run in-memory, so the cost of Spark
is quite high.
e. Iterative Processing: In Spark, data iterates in batches, and each iteration is scheduled
and executed separately.
CHAPTER 7
APPLICATIONS
Traditional messaging systems have existed for a long time and play an important role
in data processing. IBM WebSphere MQ allows an application to insert messages into
multiple queues atomically. In JMS, individual messages are acknowledged after
processing. More recently, the Hedwig system, developed by Yahoo!, provides a
distributed pub-sub system that is scalable and offers strong durability guarantees.
Apache Kafka works in combination with HBase and Spark for real-time analytics and
for processing streaming data. Nowadays, many multinational companies use Apache
Kafka in their use cases.
CONCLUSION
In this work, we focused on how to work with Kafka and how to tune its deployment.
Kafka helps stream processing developers use their big data processing architecture
effectively. Kafka defines a pull-based model that allows applications to consume data
whenever needed, and it achieves higher throughput than traditional messaging systems.
At present, big data analytics is the key to the security and privacy challenges. The stream
processing architecture increases the processing speed of intrusion detection. According to
the outcomes, multiple partitions in Apache Kafka and Spark Streaming batch intervals
between 10 and 50 seconds give better performance in the integration of Kafka and Spark
Streaming.