Cloud Computing Module 2
Cloud Applications
Cloud applications are software programs that run on cloud infrastructure and are accessed over
the internet. They benefit from on-demand scalability, pay-per-use billing, and ubiquitous
accessibility.
1. Scientific Applications
2. Business Applications
3. Social Media and Consumer Applications
Examples: Facebook, Instagram, and YouTube, all of which rely on cloud infrastructure to store and
deliver content efficiently.
4. Mobile Applications
Cloud backends support mobile apps with storage, notifications, and synchronization.
5. Real-time Analytics
Conclusion
Cloud computing enables a wide range of real-world applications by leveraging distinct service
paradigms. These paradigms define the responsibilities of the cloud provider vs. the user and
facilitate scalable, cost-effective, and global computing solutions.
1. Performance Isolation
Issue: Shared infrastructure (multi-tenancy) can lead to performance fluctuations.
Impact: Virtual machines (VMs) may experience unpredictable latency or bandwidth due to
noisy neighbors.
Example: A VM running a high-priority task may slow down if another VM on the same host
consumes excessive resources.
2. Reliability and Availability
Issue: A failure in shared cloud infrastructure can cascade across many tenants.
Example: In Amazon’s 2012 outage, a power failure in one data center caused cascading
failures due to flawed recovery mechanisms.
3. Security and Privacy
Issue: Multi-tenancy increases risks such as data breaches, insider threats, and unauthorized
access.
Impact: Sensitive data stored in public clouds may violate compliance regulations (e.g.,
GDPR, HIPAA).
Example: A malicious insider in a cloud provider could access confidential enterprise data.
4. Data Transfer Bottlenecks
Issue: Moving large datasets to and from the cloud is slow over standard networks.
Example: Transferring 1 TB (≈ 8 × 10^12 bits) over a 10 Mbps link takes roughly 9 days (≈ 8 × 10^5 s);
a 1 Gbps link reduces this to about 2 hours (≈ 8 × 10^3 s).
5. Vendor Lock-In
Issue: Proprietary APIs and services make migration between providers difficult.
6. Resource Management and Auto-Scaling
Example: Auto-scaling in AWS must balance cost and response time for variable workloads.
7. Software Licensing
Impact: Costs may spike if licenses don’t align with pay-per-use cloud models.
Example: Running licensed software on 100 VMs temporarily could incur prohibitive fees.
8. Communication Latency
Issue: Cloud applications often rely on high-speed interconnects, but data-center networks and WANs introduce delays.
Example: EC2’s network latency (about 145 μs) is roughly 70× higher than that of dedicated
supercomputers (e.g., Carver at 2.1 μs).
9. Data Governance
Impact: Users may lose control over their data (e.g., when it is stored in foreign jurisdictions).
Example: Government cloud initiatives must ensure data sovereignty and auditability.
Characteristics:
o Statelessness: Servers treat each request independently (no session data stored).
Description: Applications are built as loosely coupled services that communicate via
standardized protocols.
Key Technologies:
Description: A lightweight alternative to SOAP, using HTTP methods (GET, POST, PUT,
DELETE).
Advantages:
Phases:
Description: Systems react to events (e.g., messages, triggers) rather than requests.
Components:
Description: Decentralized networks where nodes (peers) share resources without a central
server.
Characteristics:
7. Microservices Architecture
Benefits:
o Fault Isolation: A failure in one service doesn’t crash the entire system.
8. Workflow-Based Systems
Patterns:
Use Case: Leader election, configuration management (e.g., Kafka for messaging).
Comparison of the architectural styles by scalability, fault tolerance, and typical use case.
Orchestrating tasks: Managing dependencies (e.g., Task B runs only after Task A completes).
Workflows are modeled using directed activity graphs or workflow description languages (WFDL),
resembling flowcharts with tasks as nodes and dependencies as edges.
Lifecycle of a Workflow
The lifecycle of a workflow consists of four primary phases, analogous to the lifecycle of a traditional
computer program (see Figure 4.1 in the document):
1. Creation
Activities:
2. Definition
Activities:
o Example: A batch processing workflow might use an AND split to process data
chunks in parallel.
3. Verification
Activities:
o Check for:
Deadlocks: Circular dependencies (e.g., Task A waits for Task B, which waits
for Task A).
o Example: In Figure 4.2(a), if Task D is chosen after B, Task F never runs, violating
liveness.
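The deadlock check above amounts to detecting a cycle in the activity graph. A minimal sketch, assuming the graph is given as a dictionary mapping each task to the tasks it depends on (the example graphs are hypothetical):
Python code
# Detect circular dependencies (deadlocks) in a workflow activity graph.
def has_deadlock(graph):
    WHITE, GRAY, BLACK = 0, 1, 2                # unvisited / on current path / finished
    color = {task: WHITE for task in graph}

    def visit(task):
        color[task] = GRAY
        for dep in graph.get(task, []):
            if color.get(dep, WHITE) == GRAY:   # back edge -> circular dependency
                return True
            if color.get(dep, WHITE) == WHITE and visit(dep):
                return True
        color[task] = BLACK
        return False

    return any(color[t] == WHITE and visit(t) for t in list(graph))

print(has_deadlock({"A": ["B"], "B": ["A"]}))                # True: A waits for B, B waits for A
print(has_deadlock({"A": [], "B": ["A"], "C": ["A", "B"]}))  # False: no cycle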
4. Enactment
Activities:
o Coordination Models:
Strong Coordination: Centralized control (e.g., a master process monitors
tasks).
o Monitoring: Track progress and handle exceptions (e.g., retry failed tasks).
o Example: The GrepTheWeb application (Section 4.7) uses queues and controllers to
manage tasks like launching EC2 instances and merging results.
• The sequence pattern occurs when several tasks have to be scheduled one after the completion of
the other [see Figure 4.3(a)].
• The AND split pattern requires several tasks to be executed concurrently. Both tasks B and C are
activated when task A terminates [see Figure 4.3(b)].
In case of an explicit AND split, the activity graph has a routing node and all activities connected to
the routing node are activated as soon as the flow of control reaches the routing node. In the case of
an implicit AND split, activities are connected directly and conditions can be associated with
branches linking an activity with the next ones. Only when the conditions associated with a branch
are true are the tasks activated.
• The synchronization pattern requires several concurrent activities to terminate before an activity
can start. In our example, task C can only start after both tasks A and B terminate [see Figure 4.3(c)].
• The XOR split requires a decision; after the completion of task A, either B or C can be activated [see
Figure 4.3(d)].
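These control-flow patterns can be sketched with ordinary concurrency primitives. A minimal illustration combining sequence, AND split, synchronization, and XOR split, assuming Python's concurrent.futures (the task bodies are hypothetical placeholders):
Python code
from concurrent.futures import ThreadPoolExecutor, wait

def task(name):                      # hypothetical placeholder for a real activity
    print("running task", name)
    return name

with ThreadPoolExecutor() as pool:
    # Sequence: B and C are only scheduled after A completes.
    a = pool.submit(task, "A").result()
    # AND split: B and C are activated concurrently once A terminates.
    b = pool.submit(task, "B")
    c = pool.submit(task, "C")
    # Synchronization: D starts only after both B and C terminate.
    wait([b, c])
    d = pool.submit(task, "D").result()
    # XOR split: after D, exactly one of E or F is activated (a decision).
    chosen = "E" if d == "D" else "F"
    pool.submit(task, chosen).result()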
1. Core Concepts
ZooKeeper treats each distributed process as a deterministic finite state machine (FSM).
All replicas must execute the same sequence of commands to maintain consistency.
2. ZooKeeper Architecture
(A) Components
1. Servers (Ensemble)
2. Clients
o Read requests are served locally; writes are forwarded to the leader.
3. Znodes (Data Nodes)
o Similar to filesystem inodes but store state metadata (e.g., configuration, locks).
o Two types:
3. Use Cases
1. Distributed Locking
2. Configuration Management
3. Leader Election
4. Service Discovery
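These use cases map directly onto znode operations. A minimal sketch with the kazoo Python client, assuming a ZooKeeper ensemble reachable at 127.0.0.1:2181 (the paths, payloads, and identifiers are hypothetical):
Python code
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")     # assumed ensemble address
zk.start()

# Configuration management: shared settings live in a persistent znode.
zk.ensure_path("/app/config")
zk.set("/app/config", b"batch_size=64")
value, stat = zk.get("/app/config")

# Distributed locking: only one client at a time enters the critical section.
lock = zk.Lock("/app/locks/job-42", "worker-1")
with lock:
    pass  # work that must not run concurrently

# Leader election: the elected client runs the supplied callback.
election = zk.Election("/app/election", "worker-1")
# election.run(become_leader)    # blocks until this client wins, then calls become_leader

zk.stop()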
MapReduce is a distributed data processing model designed for large-scale parallel computation in
cloud environments.
It simplifies batch processing by breaking a job into two phases:
Map (splitting and processing the data) and
Reduce (aggregating the results). A detailed explanation follows.
1. Core Principles
1. Divide-and-Conquer Approach
o Splits large datasets into smaller chunks processed in parallel.
o Inspired by functional programming (map and reduce operations in LISP).
2. Key-Value Pair Processing
o Input & output data are structured as <key, value> pairs.
3. Fault Tolerance
o Automatically handles node failures by re-executing tasks.
2. Phases of MapReduce
(A) Map Phase
Input: A set of <key, value> pairs (e.g., <document_name, text>).
Process:
o Each Map task processes a split of the input data.
o Applies a user-defined map() function to emit intermediate <k, v> pairs.
Example:
Python code
# Input: <doc1, "hello world hello">
def map(key, value):
    # Emit one intermediate <word, 1> pair for each word in the document text.
    for word in value.split():
        yield (word, 1)  # Output: [("hello", 1), ("world", 1), ("hello", 1)]
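The Reduce side of the same word-count example aggregates all values emitted for one intermediate key. A matching sketch, following the same generator style as the map() above:
Python code
# Input: <"hello", [1, 1]>  — all intermediate values collected for one key
def reduce(key, values):
    # Sum the per-occurrence counts produced by the Map tasks.
    yield (key, sum(values))   # Output: ("hello", 2)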
3. Execution Workflow
1. Input Splitting:
o Data is divided into 16–64 MB chunks (e.g., in HDFS).
2. Master-Worker Model:
o Master Node: Assigns tasks to workers and tracks progress.
o Worker Nodes: Run Map/Reduce tasks.
3. Data Flow:
o Map Workers: Read input splits → emit intermediate data to local disk.
o Reduce Workers: Fetch intermediate data → aggregate → write final output.
4. Fault Tolerance
Task Retries: Failed tasks are re-executed on other nodes.
Checkpointing: Master periodically saves state to recover from failures.
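A minimal sketch of the retry idea, assuming hypothetical task and worker objects (this is not Hadoop's actual API):
Python code
# Re-execute a failed task up to max_retries times, preferring a different worker each time.
def run_with_retries(task, workers, max_retries=3):
    last_error = None
    for attempt in range(max_retries):
        worker = workers[attempt % len(workers)]   # rotate through available nodes
        try:
            return worker.execute(task)
        except Exception as err:                   # treat any failure as a task failure
            last_error = err
    raise RuntimeError(f"task failed after {max_retries} attempts") from last_error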
5. Example Applications
1. Word Count (Classic Example)
o Counts word frequencies in documents.
2. Distributed Sort
o Sorts large datasets (e.g., Google’s web indexing).
3. Log Analysis
o Processes server logs to detect trends (e.g., AWS GrepTheWeb).
When a user program invokes the MapReduce function, the following sequence of actions takes place
(see Figure 4.6):
1. The run-time library splits the input files into M splits of 16 to 64 MB each, identifies a number N
of systems to run, and starts multiple copies of the program, one of the systems being the master and
the others workers. The master assigns each idle system either a Map or a Reduce task. The
master makes O(M + R) scheduling decisions and keeps O(M × R) worker state vectors in memory. These
considerations limit the size of M and R; at the same time, efficiency considerations require that
M, R ≫ N.
2. A worker assigned a Map task reads the corresponding input split, parses the key-value pairs, and
passes each pair to the user-defined Map function. The intermediate pairs produced by the Map
function are buffered in memory before being written to a local disk and partitioned into R regions
by the partitioning function.
3. The locations of these buffered pairs on the local disk are passed back to the master, which is
responsible for forwarding them to the Reduce workers. A Reduce worker uses remote procedure
calls to read the buffered data from the local disks of the Map workers; after reading all the
intermediate data, it sorts it by the intermediate keys. For each unique intermediate key, the key and
the corresponding set of intermediate values are passed to a user-defined Reduce function. The
output of the Reduce function is appended to a final output file.
4. When all Map and Reduce tasks have been completed, the master wakes up the user program.
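The whole flow can be simulated sequentially in a few lines. A sketch of the split → map → partition → sort → reduce pipeline described above, using the word-count functions and R = 2 partitions (all names and data are illustrative):
Python code
from collections import defaultdict

R = 2   # number of Reduce partitions (regions)

def map_fn(key, value):
    for word in value.split():
        yield (word, 1)

def reduce_fn(key, values):
    yield (key, sum(values))

splits = {"doc1": "hello world hello", "doc2": "world of clouds"}

# Map phase: map each split and partition intermediate pairs by hash(key) mod R.
partitions = [defaultdict(list) for _ in range(R)]
for doc, text in splits.items():
    for k, v in map_fn(doc, text):
        partitions[hash(k) % R][k].append(v)

# Reduce phase: each region is sorted by key and reduced independently.
output = {}
for region in partitions:
    for k in sorted(region):
        for key, total in reduce_fn(k, region[k]):
            output[key] = total

print(output)   # e.g. {'clouds': 1, 'hello': 2, 'of': 1, 'world': 2}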
Cloud Computing for Biology Research
2. Architectural Approaches
(A) Workflow Automation
Tools: Apache Airflow, Nextflow.
Use Case:
o Cirrus (Section 4.10): A cloud platform for legacy biology apps, orchestrating tasks
like:
Data Preprocessing → Alignment → Analysis.
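A chain like Cirrus's preprocessing → alignment → analysis could be expressed as a DAG in a workflow engine. A minimal sketch, assuming Apache Airflow 2.x (the DAG id, schedule, and shell commands are hypothetical placeholders):
Python code
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="genomics_pipeline",            # hypothetical name
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,                # run only when triggered
    catchup=False,
) as dag:
    preprocess = BashOperator(task_id="preprocess", bash_command="echo preprocess data")
    align = BashOperator(task_id="align", bash_command="echo run alignment")
    analyze = BashOperator(task_id="analyze", bash_command="echo run analysis")

    preprocess >> align >> analyze         # Data Preprocessing -> Alignment -> Analysis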
(B) Distributed Data Storage
Systems:
o Hadoop HDFS: Stores genomic data (e.g., FASTQ files).
o Amazon S3/Google Cloud Storage: Hosts public datasets (e.g., 1000 Genomes
Project).
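Reading and writing such object stores from code is straightforward. A minimal sketch with boto3 (the bucket and object keys are hypothetical; credentials are assumed to come from the environment or an instance role):
Python code
import boto3

s3 = boto3.client("s3")

# Upload a raw sequencing file and fetch a shared reference dataset.
s3.upload_file("sample_001.fastq", "my-genomics-bucket", "raw/sample_001.fastq")
s3.download_file("my-genomics-bucket", "reference/hg38.fa", "hg38.fa")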
(C) Hybrid Cloud for Sensitive Data
Challenge: Privacy laws (e.g., HIPAA) restrict genomic data movement.
Solution:
o Private Cloud: Stores raw patient data.
o Public Cloud: Runs anonymized analyses (e.g., AWS GovCloud).
4. Key Challenges
Data Volume (Petabytes): addressed with distributed storage (S3, HDFS) plus compression.
5. Future Directions
AI/ML Integration: Train models on genomic data (e.g., Google DeepVariant).
Federated Learning: Analyze data across hospitals without centralizing it.
Quantum Computing: Accelerate protein-folding simulations (e.g., Google’s Quantum AI).