BDA
Introduction:
Apache Spark Streaming is a scalable and fault-tolerant framework for real-time data
processing. It extends the capabilities of Spark, originally designed for batch processing, to support streaming data by processing it in micro-batches. This enables near real-time analytics
for use cases such as fraud detection, log monitoring, and social media trend analysis.
Key Features:
Architecture:
Example Code:
python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="RealTimeTransactionAnalysis")
ssc = StreamingContext(sc, 10)  # Batch interval of 10 seconds

# A DStream source and an output operation are required before the context can start;
# a text socket on localhost:9999 is used here as a simple stand-in for a transaction feed
transactions = ssc.socketTextStream("localhost", 9999)
transactions.count().pprint()  # print the number of records received in each 10-second batch

ssc.start()
ssc.awaitTermination()
Introduction:
Apache Kafka is a distributed, real-time event-streaming platform that enables high-
throughput, fault-tolerant communication between applications. It is widely used for building
data pipelines and event-driven architectures.
Key Components:
Workflow:
Example Workflow:
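A minimal sketch of such a workflow, in which a producer publishes events to a topic and a consumer subscribes to it (the kafka-python client, the localhost:9092 broker address, and the "transactions" topic are all assumptions made for illustration):
python
from kafka import KafkaProducer, KafkaConsumer

# Producer: publish a message to the "transactions" topic (broker address is assumed)
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("transactions", b'{"id": 1, "amount": 250.0}')
producer.flush()

# Consumer: subscribe to the same topic and read events as they arrive
consumer = KafkaConsumer("transactions",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest")
for message in consumer:
    print(message.value)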
Introduction:
The streaming ecosystem refers to the collection of tools and frameworks used to manage,
process, and analyze real-time data streams. It includes components for ingestion, processing,
storage, and visualization.
Key Components:
• In a financial market, stock trading data is ingested by Kafka, processed in real time
by Spark Streaming, stored in Elasticsearch, and visualized on a Grafana dashboard to
identify trends.
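A minimal sketch of the ingestion and processing stages of such a pipeline, using Spark Structured Streaming's Kafka source (the broker address, the "stock-trades" topic, and the console sink standing in for Elasticsearch are assumptions; the Spark Kafka connector package must be available):
python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("StockTradeStream").getOrCreate()

# Ingest the stream of trades from a Kafka topic (names are illustrative)
trades = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "stock-trades")
          .load())

# Kafka delivers the value as bytes; cast it to a string for processing
parsed = trades.selectExpr("CAST(value AS STRING) AS trade")

# Write to the console here; a real pipeline would write to Elasticsearch instead
query = parsed.writeStream.format("console").start()
query.awaitTermination()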
Introduction:
A big data pipeline for real-time computing is a series of steps or stages that process
continuous streams of data. It ensures seamless ingestion, transformation, analysis, and
storage of data as it flows through the system.
Components:
1. Data Sources: IoT devices, social media feeds, transaction logs, or APIs generate the
raw data.
2. Ingestion Layer: Tools like Kafka and Flume handle the collection and
transportation of data to processing systems.
3. Processing Layer: Frameworks like Spark Streaming or Flink process the data in real
time to generate actionable insights.
4. Storage Layer: Processed data is stored in databases (e.g., Cassandra, Elasticsearch)
for persistence or future analysis.
5. Visualization Layer: Dashboards display real-time insights for monitoring and
decision-making.
• A logistics company monitors vehicle locations in real time. The pipeline collects
GPS data, processes it to calculate routes, stores it for tracking, and displays live
vehicle locations on a dashboard.
• Kafka is used by LinkedIn to track user activity in real time. Data from millions of
users is ingested, processed, and analyzed to provide personalized content
recommendations.
• Flink is used in ride-sharing applications to match drivers and riders in real time,
optimizing routes and reducing wait times.
Assignment No. 4 of Business Data Analytics
Machine Learning (ML) is a subset of Artificial Intelligence (AI) that focuses on building
algorithms capable of learning patterns from data and making decisions without explicit
programming. It involves training models on historical data to predict or classify future data
points based on learned patterns.
1. Supervised Learning: The model is trained using labeled data. The algorithm learns
the relationship between input features and target labels. Examples: Classification and
regression.
2. Unsupervised Learning: The algorithm works with unlabeled data to find hidden
patterns. Examples: Clustering, anomaly detection.
3. Reinforcement Learning: The model learns by interacting with an environment and
receiving feedback through rewards or penalties.
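A small scikit-learn sketch contrasting the first two types on the same dataset (the dataset and parameters are chosen only for illustration):
python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Supervised: the labels y are used during training
clf = LogisticRegression(max_iter=200).fit(X, y)
print("Supervised accuracy:", clf.score(X, y))

# Unsupervised: only X is used; the algorithm finds groups on its own
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("Cluster labels of first 10 samples:", km.labels_[:10])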
Bayes’ Theorem: Bayes’ Theorem provides a way to calculate the posterior probability P(C|X) of a class C given a set of features X:
P(C|X) = P(X|C) · P(C) / P(X)
Where:
• P(C|X) is the posterior probability of class C given the features X.
• P(X|C) is the likelihood of the features X given class C.
• P(C) is the prior probability of class C.
• P(X) is the evidence, i.e., the overall probability of the features.
The Naive Bayes classifier applies this theorem as follows:
1. Calculate Priors: For each class, calculate the probability based on the frequency of
that class in the training data.
2. Calculate Likelihoods: For each feature in the data, calculate the probability that it
belongs to each class (using conditional probabilities).
3. Apply Bayes' Theorem: Multiply the prior probability by the likelihood for each
class, and then select the class with the highest posterior probability.
• Spam Email Classification: Given an email with certain words, Naive Bayes
calculates the probability of it being "spam" or "ham" (non-spam) based on prior
training data. The algorithm uses the occurrence of words like "free", "offer", and
"win" to classify emails.
python
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

# Train a Gaussian Naive Bayes classifier on the Iris dataset and evaluate it
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = GaussianNB().fit(X_train, y_train)
print("Accuracy:", model.score(X_test, y_test))
Deep Learning is a subset of machine learning that focuses on using artificial neural
networks to model complex patterns and representations in data. Unlike traditional machine
learning algorithms that require manual feature engineering, deep learning automatically
learns hierarchical features from raw data, making it effective for tasks like image
recognition, natural language processing, and speech recognition.
Key Concepts:
1. Neural Networks: Deep learning models are built on neural networks, which are
composed of layers of nodes (neurons). These networks have multiple hidden layers,
hence the term "deep".
2. Activation Functions: Non-linear functions like ReLU (Rectified Linear Unit) and
Sigmoid are used to introduce non-linearity into the model, allowing it to learn
complex patterns.
3. Backpropagation: A method used to optimize the neural network by calculating the
gradient of the loss function with respect to the model's parameters, and adjusting
them accordingly.
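A short NumPy sketch of the two activation functions mentioned above:
python
import numpy as np

def relu(x):
    # ReLU keeps positive values and zeroes out negatives
    return np.maximum(0, x)

def sigmoid(x):
    # Sigmoid squashes any input into the range (0, 1)
    return 1 / (1 + np.exp(-x))

x = np.array([-2.0, -0.5, 0.0, 1.0, 3.0])
print(relu(x))     # [0.  0.  0.  1.  3.]
print(sigmoid(x))  # values strictly between 0 and 1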
1. Feedforward Neural Networks (FNN): The simplest type of neural network, where
data moves in one direction from input to output.
2. Convolutional Neural Networks (CNN): Specialized for image processing tasks.
They use convolutional layers to automatically detect features like edges and textures.
3. Recurrent Neural Networks (RNN): Used for sequential data like time series or
text. RNNs maintain memory across timesteps, allowing them to model dependencies
in sequences.
4. Generative Adversarial Networks (GAN): Consist of two networks, a generator and
a discriminator, which compete with each other to generate realistic data (e.g.,
images).
python
from keras.models import Sequential
from keras.layers import Dense, Conv2D, Flatten
from keras.datasets import cifar10
from keras.utils import to_categorical
(x_train, y_train), (x_test, y_test) = cifar10.load_data()
y_train, y_test = to_categorical(y_train, 10), to_categorical(y_test, 10)
# A deliberately small CNN: one convolutional layer followed by a softmax classifier
model = Sequential([Conv2D(32, (3, 3), activation="relu", input_shape=(32, 32, 3)),
                    Flatten(), Dense(10, activation="softmax")])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, epochs=1, batch_size=64)
Machine Learning with Spark refers to the use of Apache Spark for running large-scale
machine learning algorithms on big data. Spark provides a distributed computing framework,
making it suitable for handling massive datasets that do not fit into memory on a single
machine. It provides a built-in library called MLlib for scalable machine learning.
1. Scalability: Spark can process terabytes of data across multiple nodes in a cluster,
providing efficient distributed computation.
2. High-Level APIs: Spark provides high-level APIs in languages like Python, Scala,
and Java for ease of use.
3. Distributed Algorithms: MLlib includes algorithms for classification, regression,
clustering, collaborative filtering, and more.
4. Pipeline Support: Spark MLlib supports pipeline APIs for assembling ML workflows. A pipeline includes stages like data transformation, training, and prediction (a short sketch follows the K-Means example below).
MLlib Algorithms:
python
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession

# Create the SparkSession that the DataFrame operations below rely on
spark = SparkSession.builder.getOrCreate()

# Sample data
data = [(0, 1.0, 2.0), (1, 1.5, 1.8), (2, 5.0, 8.0), (3, 8.0, 8.0), (4, 1.0, 0.6)]
df = spark.createDataFrame(data, ["ID", "Feature1", "Feature2"])

# Feature transformation: combine the numeric columns into a single vector column
assembler = VectorAssembler(inputCols=["Feature1", "Feature2"], outputCol="features")
df = assembler.transform(df)

# Train a K-Means model and attach a cluster prediction to each row
kmeans = KMeans(k=2, seed=1)
model = kmeans.fit(df)
predictions = model.transform(df)
predictions.show()
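The pipeline support mentioned earlier can be sketched as follows; this is a minimal illustration that chains the same kind of assembler and K-Means stages into one workflow (the data is a small made-up sample):
python
from pyspark.ml import Pipeline
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(0, 1.0, 2.0), (1, 1.5, 1.8), (2, 5.0, 8.0)],
    ["ID", "Feature1", "Feature2"])

# A pipeline chains the feature transformation and the estimator into one workflow
assembler = VectorAssembler(inputCols=["Feature1", "Feature2"], outputCol="features")
kmeans = KMeans(k=2, seed=1)
pipeline = Pipeline(stages=[assembler, kmeans])

model = pipeline.fit(df)      # runs every stage in order
model.transform(df).show()    # adds a "prediction" column with the cluster IDs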
Apache Mahout is a machine learning library that focuses on scalable machine learning
algorithms for big data processing. It runs on top of Apache Hadoop and provides
implementations for classification, clustering, and recommendation algorithms. K-Means is a
clustering algorithm in Mahout that is widely used for grouping similar data points.
K-Means Algorithm:
1. Initialization: Randomly select k centroids (cluster centers) from the dataset.
2. Assignment Step: Assign each data point to the nearest centroid based on the
Euclidean distance.
3. Update Step: Recalculate the centroids by averaging the data points assigned to each
cluster.
4. Repeat: Iterate between the assignment and update steps until convergence (i.e.,
when centroids do not change significantly).
bash
mahout kmeans -i input/data -c output/centroids -o output/clusters -k 3 -xm hadoop
In this example, Mahout’s K-Means algorithm runs on Hadoop with 3 clusters (k=3) for data
stored in HDFS.
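The assignment and update steps of K-Means can also be sketched directly in NumPy; this is an illustrative toy implementation, not Mahout's code:
python
import numpy as np

def kmeans(points, k, iterations=10, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Initialization: pick k random data points as the starting centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iterations):
        # 2. Assignment: each point goes to its nearest centroid (Euclidean distance)
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # 3. Update: each centroid becomes the mean of the points assigned to it
        centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
    # 4. In practice, iterate until the centroids stop moving; a fixed count keeps this short
    return centroids, labels

data = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [8.0, 8.0], [1.0, 0.6]])
print(kmeans(data, k=2))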
Assignment No. 5 of Business Data Analytics
1. Document-Oriented Storage:
o MongoDB stores data in BSON (Binary JSON) format. Each record is a
document, containing key-value pairs similar to JSON objects.
o Example:
json
{
  "name": "John Doe",
  "age": 30,
  "skills": ["Python", "MongoDB"]
}
2. Schema Flexibility:
o MongoDB allows dynamic schema design, enabling you to store documents of
varying structures in the same collection.
o Example: A collection may have some documents with "email" fields and
others without.
3. Horizontal Scalability:
o MongoDB supports sharding, a method of distributing data across multiple
servers to handle large-scale data storage and query loads.
4. Indexing:
o MongoDB supports a variety of indexes, such as single-field, compound, and
geospatial indexes, to optimize query performance.
5. High Availability:
o With replication, MongoDB ensures data redundancy and availability by
maintaining copies of data on multiple nodes.
6. Aggregation Framework:
o MongoDB provides powerful tools for performing data aggregation, such as filtering, grouping, and transforming data (a short sketch follows this list).
7. GridFS:
o A feature to store and retrieve large files (e.g., images, videos) that exceed the
document size limit of 16MB.
8. Ad Hoc Queries:
o MongoDB supports a range of query operations, including field selection,
regular expressions, and range queries.
9. Integration with Programming Languages:
o MongoDB offers drivers and libraries for integration with popular
programming languages like Python, Java, Node.js, and more.
10. Security:
o MongoDB provides features like authentication, authorization, and SSL
encryption for secure data management.
11. Transactions:
o Supports multi-document ACID transactions, ensuring data consistency.
12. Change Streams:
o Real-time notifications when data in the database changes, useful for building
event-driven applications.
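A minimal PyMongo sketch of the aggregation framework from item 6 (the connection string and the database, collection, and field names are all assumptions):
python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed local server
orders = client["shop"]["orders"]                  # assumed database and collection

# Filter, group, and sort in a single aggregation pipeline
pipeline = [
    {"$match": {"status": "shipped"}},                               # filtering
    {"$group": {"_id": "$customer", "total": {"$sum": "$amount"}}},  # grouping
    {"$sort": {"total": -1}},                                        # ordering the result
]
for doc in orders.aggregate(pipeline):
    print(doc)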
Schema design in MongoDB involves structuring your database collections and documents to
optimize performance and scalability. Since MongoDB is schema-less, the design depends on
the specific application requirements.
Key Principles
• Decide between embedding related data in a single document and referencing it across collections; embedding suits data that is read together, as in the example below where a user's orders are stored inside the user document:
json
{
  "user_id": 1,
  "name": "Alice",
  "orders": [
    { "order_id": 101, "amount": 250 },
    { "order_id": 102, "amount": 150 }
  ]
}
Indexes in MongoDB improve the efficiency of query execution by reducing the amount of
data scanned. Without an index, MongoDB performs a collection scan, which is slower for
large datasets.
Types of Indexes:
1. Single-Field Index:
o Index on a single field.
o Example: Indexing the "name" field.
o Command:
javascript
db.collection.createIndex({ name: 1 });
2. Compound Index:
o Index on multiple fields.
o Example: Indexing "name" and "age" fields.
o Command:
javascript
db.collection.createIndex({ name: 1, age: -1 });
3. Text Index:
o For text search queries.
o Command:
javascript
db.collection.createIndex({ description: "text" });
4. Unique Index:
o Ensures unique values for the indexed field.
o Command:
javascript
db.collection.createIndex({ email: 1 }, { unique: true });
5. TTL Index:
o Automatically deletes documents after a certain time.
o Command:
javascript
db.collection.createIndex({ createdAt: 1 }, { expireAfterSeconds: 3600 });
6. Geospatial Index:
o For querying geographical data.
o Command:
javascript
db.collection.createIndex({ location: "2dsphere" });
Example of how an index improves a query: without an index, the query below triggers a full collection scan.
javascript
db.users.find({ name: "Alice" });
After creating an index on the "name" field, the same query is served from the index instead of scanning every document:
javascript
db.users.createIndex({ name: 1 });
db.users.find({ name: "Alice" });
CRUD operations refer to the four basic operations to interact with a database: Create, Read,
Update, Delete.
1. Create Operation
javascript
db.users.insertOne({
name: "Alice",
age: 30,
email: "alice@example.com"
});
db.users.insertMany([
{ name: "Bob", age: 25, email: "bob@example.com" },
{ name: "Carol", age: 35, email: "carol@example.com" }
]);
2. Read Operation
Retrieve all documents:
javascript
db.users.find();
Retrieve only documents matching a condition (age greater than 25):
javascript
db.users.find({ age: { $gt: 25 } });
3. Update Operation
Update a single document:
javascript
db.users.updateOne({ name: "Alice" }, { $set: { age: 31 } });
Update all documents matching a condition:
javascript
db.users.updateMany({ age: { $lt: 30 } }, { $set: { status: "young" } });
4. Delete Operation
Delete a single document:
javascript
db.users.deleteOne({ name: "Bob" });
Delete all documents matching a condition:
javascript
db.users.deleteMany({ age: { $lt: 30 } });