Distrsyslectureset1 Win20
CS230
Distributed Computing Systems
Winter 2020
Lecture 1 - Introduction to Distributed Computing
Wed 5:00-8:50p.m., ALP 2200
Nalini Venkatasubramanian
nalini@uci.edu
CS230: Distributed Computing Systems
Course logistics and details
● TA for Course
● Nailah Alhassoun (nailah@uci.edu)
● Homeworks
● Written homeworks
● Problem sets
● Includes paper summaries (1-2 papers on the
specific topic from the reading list)
● Course Examination (tentatively Week 8)
● Course Project
● In groups of 3
● Will require use of open source distributed computing
platforms
● Suggested projects will be available on webpage
CS230: Distributed Computing Systems
Prerequisite Knowledge
● Necessary – Operating Systems Concepts and
Principles, basic computer system architecture
● Highly Desirable – Understanding of Computer
Networks, Network Protocols
● Necessary – Basic programming skills in Java,
Python, C++,…
Distributed Systems 5
CompSci 230 Grading Policy
● Homeworks - 30% of final grade
• 4 homeworks - one for each segment of the course
– Problem sets, paper summaries (2 in each set)
• A homework due approximately every 2 weeks
• Make sure to follow instructions while writing and creating
summary sets.
• Extra Credit - Summary of 2 distributed computing related
distinguished talks this quarter
Project meeting
Week 6, Feb 12: Group Communication, ALM (Project update/initial demo)
Week 7, Feb 19: Publish/Subscribe, Fault Tolerance (Homework 3)
Lecture Schedule
● Weeks 7,8: Messaging and Communication in
Distributed Systems
● Naming in Distributed Systems
● Gossip, Tree, Mesh Protocols
● Group Communication
● Weeks 9,10: Non-functional “ilities” in distributed
systems
● Reliability and Fault Tolerance
● Quality of Service and Real-time Needs
● Sample Distributed Systems (time permitting)
● P2P, Grid and Cloud Computing, Mobile/Pervasive
What is not covered
Distributed Systems
● Lamport’s Definition
● “You know you have one when the crash of a computer you have
never heard of stops you from getting any work done.”
● Andrew Tanenbaum
A distributed system is a collection of independent computers that
appear to the users of the system as a single computer.
● “An interconnected collection of autonomous processes” - Wan
Fokkink (an algorithmic view)
● FOLDOC (Free On-line Dictionary of Computing)
A collection of (probably heterogeneous) automata whose distribution is transparent to the user
so that the system appears as one local machine. This is in contrast to a network, where the
user is aware that there are several machines, and their location, storage replication, load
balancing and functionality is not transparent. Distributed systems usually use some kind of
“client-server organization”
People-to-Computer Ratio Over Time
What is a Distributed System?
Internet
More Examples: Banking systems, Communication (messaging, email), Distributed information systems (WWW,
federated DBs), Manufacturing and process control, Inventory systems, ecommerce, Cloud platforms, mobile
computing infrastructures, pervasive/IoT systems
Distributed Computing Systems
Globus Grid Computing Toolkit
Gnutella P2P Network
● Programming models:
• Tightly coupled vs. loosely coupled, message-based vs. shared
variable
Parallel Computing Systems
ILLIAC 2 (UIllinois), K-computer (Japan), Tianhe-1 (China)
Applications: climate modeling, earthquake simulations, genome analysis, protein
folding, nuclear fusion research, …
P2P Communications
MSN, Skype, Social Networking Apps
Use the vast resources of machines at the edge of the Internet to build a network that
allows resource sharing without any central authority.
Real-time distributed systems
● Correct system function depends on timeliness
● Feedback/control loops
● Sensors and actuators
● Hard real-time systems -
● Failure if response time too long.
● Secondary storage is limited
● Soft real-time systems -
● Less accurate if response time is too long.
● Useful in applications such as multimedia, virtual reality.
Sample SmartSpaces Built - UCI
● Responsphere - A campus-wide infrastructure to instrument, monitor,
disaster drills & technology validation
● SAFIRE - Situational awareness for fire incident command
● OpsTalk - Speech based awareness & alerting system for soldiers on the field
[Figure: speech flows through acoustic capture and acoustic analysis into SA
applications (alerts, conversation monitoring)]
● SCALE - A smart community awareness and alerting testbed @ Montgomery County,
MD. A NIST/White House SmartAmerica Project extended to Global Cities Challenge.
Today’s Platforms Landscape - examples
System Goal
Why Distributed Computing?
● Inherent distribution
● Bridge customers, suppliers, and companies at
different sites.
● remote data access - e.g. web
● Support for interaction - email/messaging/social media
● Computation Speedup - improved performance
● Fault tolerance and Reliability
● Resource Sharing
● Exploitation of special hardware
● Scalability
● Flexibility
Why are Distributed Systems Hard?
● Scale
● numeric, geographic, administrative
● Loss of control over parts of the system
● Unreliability of message passing
● unreliable communication, insecure communication,
costly communication
● Failure
● Parts of the system are down or inaccessible
● Independent failure is desirable
An entertaining talk: https://www.youtube.com/watch?v=JG2ESDGwHHY
Design goals of a distributed system
● Sharing
● HW, SW, services, applications
● Openness (extensibility)
● use of standard interfaces, advertise services,
microkernels
● Concurrency
● compete vs. cooperate
● Scalability
● avoids centralization
● Fault tolerance/availability
● Transparency
● location, migration, replication, failure, concurrency
Key Questions
● What are the main entities in the system?
● How do they interact?
● How does the system operate?
● What are the characteristics that affect their
individual and collective behavior?
Classifying Distributed Systems
● Based on Architectural Models
● Client-Server, Peer-to-peer, Proxy based,…
● Based on computation/communication - degree
of synchrony
● Synchronous, Asynchronous
• No single node serves as a server
More Architectural Models
Mobile code
Multiple servers
Proxy
Computation in distributed systems
● Synchronous system
● make assumptions about relative speeds of processes and delays
associated with communication channels
● constrains implementation of processes and communication
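One consequence of the synchrony assumption can be sketched in code: if process speed and message delay are bounded, a missing heartbeat can be treated as evidence of a crash. The heartbeat protocol, the function name, and the parameters `period` and `max_delay` are illustrative assumptions, not part of the lecture.

```python
def suspect_crashed(last_heartbeat, now, period, max_delay):
    """Return True if a heartbeat is overdue under the assumed timing bounds.

    period    -- assumed interval at which the remote process sends heartbeats
    max_delay -- assumed upper bound on message delay (the synchrony assumption)
    """
    return now - last_heartbeat > period + max_delay


# Heartbeats every 1.0 s, delay bounded by 0.5 s (hypothetical values).
on_time = suspect_crashed(last_heartbeat=0.0, now=1.2, period=1.0, max_delay=0.5)
overdue = suspect_crashed(last_heartbeat=0.0, now=1.6, period=1.0, max_delay=0.5)
```

Without such bounds (an asynchronous system) the same timeout can only raise a suspicion, since a slow process and a crashed one look identical.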
Parallel Computing Systems
● Special case of a distributed system
● often to run a special application
● Designed to run a single program faster
● Supercomputer - high-end parallel machine
Aurora: USA’s First Exascale computer
Imagine …
- A computer so powerful that it
can predict future climate
patterns, saving millions of
people from drought, flood, and
devastation.
SIMD (Single Instruction Multiple Data)
[Figure: a single instruction stream applied by the processor to multiple data streams D0 … Dn]
A computer which exploits multiple data streams against a single instruction
stream to perform operations which may be naturally parallelized.
For example, an array processor or GPU.
MISD (Multiple Instruction Single Data)
Multiple instructions operate on a single data stream.
Uncommon architecture which is generally used for fault tolerance.
Heterogeneous systems operate on the same data stream and
aim to agree on the result.
Examples include the Space Shuttle flight control computer.
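The "aim to agree on the result" idea can be sketched as majority voting over redundant implementations that all receive the same input. This is only an illustration: the function names and the three toy implementations are hypothetical, and real flight-control voting is far more involved.

```python
from collections import Counter


def vote(results):
    """Return the value reported by a strict majority, or None if there is none."""
    value, count = Counter(results).most_common(1)[0]
    return value if count > len(results) / 2 else None


def redundant_compute(x, implementations):
    """Run every implementation on the same input and vote on the outputs."""
    return vote([impl(x) for impl in implementations])


# Three independently written "squaring" routines; the third one is faulty.
implementations = [lambda x: x * x, lambda x: x ** 2, lambda x: x + x]
agreed = redundant_compute(3, implementations)
```

With one faulty implementation out of three, the two correct ones still outvote it; with no majority the voter reports failure instead of guessing.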
MIMD (Multiple Instruction Multiple Data)
[Figure: multiple processors, each with its own instruction stream and data streams]
Multiple autonomous processors simultaneously executing different instructions on
different data.
Distributed systems are generally recognized to be MIMD architectures;
either exploiting a single shared memory space or a distributed memory space.
Communication in Distributed Systems
● Provide support for entities to communicate
among themselves
● Centralized (traditional) OS’s - local communication
support
● Distributed systems - communication across machine
boundaries (WAN, LAN).
● 2 paradigms
● Message Passing
● Processes communicate by exchanging messages
● Distributed Shared Memory (DSM)
● Communication through a virtual shared memory.
Message Passing
State State
Message
● Basic primitives
● Send message, Receive message
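A minimal sketch of the two primitives, assuming an in-process `queue.Queue` stands in for the network link and a `None` sentinel marks the end of the stream (both are illustrative choices, not part of the lecture):

```python
import queue
import threading

channel = queue.Queue()   # stands in for the link between two processes
received = []


def receiver():
    while True:
        msg = channel.get()      # receive: blocks until a message arrives
        if msg is None:          # sentinel meaning "no more messages"
            break
        received.append(msg)


worker = threading.Thread(target=receiver)
worker.start()
for i in range(3):
    channel.put(f"msg-{i}")      # send: returns as soon as the message is queued
channel.put(None)
worker.join()
```

The sender and receiver share no variables other than the channel; all state moves between them as messages.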
Messaging issues
Synchronous
● atomic action requiring the participation of the sender and
receiver.
● Blocking send: blocks until message is transmitted out of the
system send queue
● Blocking receive: blocks until message arrives in receive queue
Asynchronous
● Non-blocking send: sending process continues after message is sent
● Blocking or non-blocking receive: Blocking receive implemented by
timeout or threads. Non-blocking receive proceeds while waiting for
message. Message is queued (BUFFERED) upon arrival.
● Unreliable communication
• Best effort, No ACK’s or retransmissions
• Application programmer designs own reliability mechanism
● Reliable communication
• Different degrees of reliability
• Processes have some guarantee that messages will be delivered.
• Reliability mechanisms - ACKs, NACKs.
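How an application might build its own reliability mechanism on top of a best-effort channel can be sketched as stop-and-wait retransmission. The `LossyLink` class, its deterministic drop pattern, and `reliable_send` are hypothetical; a `False` return models the sender timing out without an ACK, and lost ACKs are not modeled.

```python
class LossyLink:
    """Hypothetical best-effort link that drops the transmissions listed in `drop`."""

    def __init__(self, drop):
        self.drop = set(drop)     # indices of transmissions that get lost
        self.count = 0            # transmissions attempted so far
        self.delivered = []       # what the receiver actually got

    def transmit(self, seq, payload):
        """Return True if the message arrives (ACK received), False if it is lost."""
        index = self.count
        self.count += 1
        if index in self.drop:
            return False
        self.delivered.append((seq, payload))
        return True


def reliable_send(link, seq, payload, max_retries=5):
    """Retransmit until an ACK arrives; return the number of attempts used."""
    for attempt in range(1, max_retries + 1):
        if link.transmit(seq, payload):
            return attempt
    raise TimeoutError(f"no ACK for message {seq}")


link = LossyLink(drop={0, 1, 3})
attempts = [reliable_send(link, seq, data) for seq, data in enumerate(["a", "b", "c"])]
```

The receiver ends up with every message exactly once and in order, at the cost of extra transmissions and sender-side blocking.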
Synchronous vs. Asynchronous
Fault Models in Distributed Systems
● Crash failures
● A processor experiences a crash failure when it ceases
to operate at some point without any warning. Failure
may not be detectable by other processors.
● Failstop - processor fails by halting; detectable by
other processors.
● Byzantine failures
● completely unconstrained failures
● conservative, worst-case assumption for behavior of
hardware and software
● covers the possibility of intelligent (human) intrusion.
Other Fault Models in Distributed Systems
● Dealing with message loss
● Crash + Link
● Processor fails by halting. Link fails by losing
messages but does not delay, duplicate or corrupt
messages.
● Receive Omission
● processor receives only a subset of messages sent to
it.
● Send Omission
● processor fails by transmitting only a subset of the
messages it actually attempts to send.
● General Omission
● Receive and/or send omission
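The omission models above differ only in where a message disappears, which a small fault-injection sketch can make concrete. The function and its `send_omitted` / `receive_omitted` parameters are hypothetical names introduced for illustration.

```python
def deliver(messages, send_omitted=(), receive_omitted=()):
    """Return the messages the receiving processor actually processes.

    send_omitted    -- indices the sender attempts to send but never transmits
    receive_omitted -- indices that arrive but that the receiver fails to receive
    """
    on_wire = [(i, m) for i, m in enumerate(messages) if i not in send_omitted]
    return [m for i, m in on_wire if i not in receive_omitted]


msgs = ["m0", "m1", "m2", "m3"]
no_fault = deliver(msgs)
send_fault = deliver(msgs, send_omitted={1})
general_fault = deliver(msgs, send_omitted={1}, receive_omitted={3})
```

In every case the messages that do get through are unmodified, which is what separates omission failures from Byzantine ones.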
Failure Models
Omission and arbitrary failures
Other distributed system issues
● Concurrency and Synchronization
● Distributed Deadlocks
● Time in distributed systems
● Naming
● Replication
● improve availability and performance
● Migration
● of processes and data
● Security
● eavesdropping, masquerading, message tampering,
replaying
Middleware for distributed systems
● Middleware is the software between the application programs and
the Operating System/base networking.
● An Integration Fabric that knits together applications, devices, systems
software, data
● Distributed Middleware
● Provides a comprehensive set of higher-level distributed computing
capabilities and a set of interfaces to access the capabilities of the
system.
● Provides Higher-level programming abstraction for developing
distributed applications
● Higher than “lower” level abstractions, such as sockets and monitors,
provided by the operating system
● Includes software technologies to help manage complexity and
heterogeneity inherent to the development of distributed
systems/applications/information systems. Enables modular
interconnection of distributed “services”.
Useful Management Services: Naming and Directory Service, State Capture Service, Event Service,
Transaction Service, Fault Detection Service, Discovery/trading Service, Replication Service, Migration
Services
cf: Arno Jacobsen lectures, Univ. of Toronto
Types of Middleware
● Integrated Sets of Services
● DCE from OSF - provides key distributed technologies, including RPC, a distributed
naming service, time synchronization service, a distributed file system, a network
security service, and a threads package.
[Figure: DCE architecture. Applications; DCE Distributed File Service; DCE Distributed
Time Service, DCE Directory Service, Other Basic Services; DCE Remote Procedure Calls;
DCE Threads Services; Operating System Transport Services; with DCE Security Service
and Management alongside]
Distributed Computing Environment (DCE)
● DCE - from the Open Software Foundation (OSF), offers an environment
that spans multiple architectures, protocols, and operating systems
(supported by major software vendors)
● It provides key distributed technologies, including RPC, a distributed naming service, time
synchronization service, a distributed file system, a network security service, and a threads
package.
[Figure: DCE architecture. Applications, Management, DCE Security Service, DCE Distributed
File Service, DCE Distributed Time Service, DCE Directory Service, Other Basic Services]
● Java Model
● Objects and threads are separate entities
● Threads are objects in themselves
● Can be joined together (complex object implements
java.lang.Runnable)
• BUT: Properties of connection between object and thread are not
well-defined or understood
Java and Concurrency
● Java has a passive object model
● Objects, threads separate entities
● Primitive control over interactions
● Synchronization capabilities also primitive
● The “synchronized” keyword guarantees safety but not
liveness
● Deadlock is easy to create
● Fair scheduling is not an option
Actors:
A Model of Distributed Objects
[Figure: three actors, each with an interface, thread, state, and procedure, exchanging messages]
Actor system - collection of independent agents interacting via
message passing
Features
• Acquaintances
• initial, created, acquired
• History Sensitive
• Asynchronous communication
An actor can do one of three things:
1. Create a new actor and initialize its behavior
2. Send a message to an existing actor
3. Change its local state or behavior
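The three actor operations can be sketched with a single-threaded toy runtime. Everything here (the `Actor` and `ActorSystem` classes, the round-robin scheduler, the example behaviors) is an assumption made for illustration and does not correspond to any particular actor library.

```python
from collections import deque


class Actor:
    """Minimal actor: a mailbox, local state, and a replaceable behavior."""

    def __init__(self, behavior, state=None):
        self.behavior = behavior
        self.state = state
        self.mailbox = deque()

    def send(self, message):
        # 2. Send a message: asynchronous, the message is only queued here.
        self.mailbox.append(message)


class ActorSystem:
    def __init__(self):
        self.actors = []

    def create(self, behavior, state=None):
        # 1. Create a new actor and initialize its behavior.
        actor = Actor(behavior, state)
        self.actors.append(actor)
        return actor

    def run(self):
        """Deliver queued messages one at a time until every mailbox is empty."""
        progress = True
        while progress:
            progress = False
            for actor in list(self.actors):
                if actor.mailbox:
                    actor.behavior(actor, actor.mailbox.popleft())
                    progress = True


log = []


def logger(actor, message):
    log.append(message)


def counting(actor, message):
    actor.state["count"] += 1            # 3. Change local state ...
    if actor.state["count"] == 2:
        actor.behavior = full            # ... or replace the behavior.


def full(actor, message):
    # The logger is an acquaintance held in this actor's state.
    actor.state["log"].send(f"rejected {message}")


system = ActorSystem()
log_actor = system.create(logger)
counter = system.create(counting, {"count": 0, "log": log_actor})
for item in ["a", "b", "c"]:
    counter.send(item)
system.run()
```

The counter is history sensitive: its reaction to the third message depends on the two it has already processed.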
Distributed Objects
● Techniques
• Message Passing
– Object knows about network;
– Network data is minimum
• Argument/Return Passing
– Like RPC.
– Network data = args + return result + names
• Serializing and Sending Object
– Actual object code is sent. Might require synchronization.
– Network data = object code + object state + sync info
• Shared Memory
– based on DSM implementation
– Network Data = Data touched + synchronization info
● Issues with Distributed Objects
• Abstraction
• Performance
• Latency
• Partial failure
• Synchronization
• Complexity
• …..
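The argument/return passing technique ("network data = args + return result + names") can be sketched by marshalling a call into bytes and back. JSON as the wire format, the `REGISTRY` table, and the function names are assumptions for illustration; the in-process call to `handle_request` stands in for a real network round trip.

```python
import json

# Procedures exposed by the hypothetical "server" side.
REGISTRY = {"add": lambda a, b: a + b}


def marshal_call(name, *args):
    """Client side: encode the procedure name and arguments for the wire."""
    return json.dumps({"name": name, "args": args}).encode("utf-8")


def handle_request(request_bytes):
    """Server side: decode the request, run the procedure, encode the result."""
    request = json.loads(request_bytes.decode("utf-8"))
    result = REGISTRY[request["name"]](*request["args"])
    return json.dumps({"result": result}).encode("utf-8")


def call(name, *args):
    """Client stub: looks like a local call, but only bytes cross the boundary."""
    reply = handle_request(marshal_call(name, *args))
    return json.loads(reply.decode("utf-8"))["result"]


total = call("add", 2, 3)
```

Only the name, the arguments, and the result travel between the two sides; no object code or object state is shipped.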
● An example: Netflix
● Offers Online streaming video service (17,000+ titles in
2010)
● Netflix website with support for video search
● Recommendation engines
● Instant playback on 100s of devices including Xbox,
game consoles, Roku, mobile devices, etc.
● Transcoding service
●…
Netflix App: version 0 (how it started)
● Plays movies on demand on a mobile device
Server
Netflix.com
Simple Design
• Web Services standards
• Netflix owns the data center
• Uses a fairly standard server
Challenges with Version 0
[Figure: Netflix Home; Movies: master copies; Amazon.com]
Features of new version
Akamai
Multi-tier View of Cloud Computing
● Good to view cloud applications running
in a data center in a tiered way
● Outer tier near the edge of the cloud
hosts applications & web-sites
● Clients typically use web browsers or
web services interface to talk to the
outer tier
● focus is on vast numbers of clients &
rapid response.
● Inside the cloud (next tier) we find high
volume services that operate in a
pipelined manner, asynchronously
● Caching to support nimble outer tier
services
● Deep inside the cloud is a world of
virtual computer clusters that are
scheduled to share resources and on
which applications like MapReduce
(Hadoop) are very popular
[Figure: tier 1 and tier 2 services in front of shards, an index, and a DB]
In the outer tiers replication is key
● We need to replicate
● Processing: each client has what seems to be a
private, dedicated server (for a little while)
● Data: as much as possible, that server has copies of
the data it needs to respond to client requests without
any delay at all
● Control information: the entire structure is managed
in an agreed-upon way by a decentralized cloud
management infrastructure
But in a more general setting, with updates and
faults, consistency becomes hard to maintain
across the replicas (more later)
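Why updates plus faults make consistency hard can be seen in a toy sketch: a write that misses one replica leaves the copies disagreeing. The `Replica` class, `replicated_write`, and the `unreachable` parameter are hypothetical names used only for this illustration.

```python
class Replica:
    """One copy of the data, as held by a server in the outer tier."""

    def __init__(self):
        self.data = {}

    def write(self, key, value):
        self.data[key] = value

    def read(self, key):
        return self.data.get(key)


def replicated_write(replicas, key, value, unreachable=()):
    """Apply a write to every replica except those listed as unreachable."""
    for index, replica in enumerate(replicas):
        if index not in unreachable:
            replica.write(key, value)


replicas = [Replica(), Replica()]
replicated_write(replicas, "title", "v1")
replicated_write(replicas, "title", "v2", unreachable={1})   # replica 1 misses the update
views = [replica.read("title") for replica in replicas]
```

A client served by replica 1 now sees stale data; deciding how to detect and repair this is the consistency problem the later lectures return to.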
Tradeoffs in Distributed Systems
Some interesting experiences
“Hopelessness and Confidence in Distributed Systems Design”
https://youtu.be/TlU1opuCXB0
Tradeoffs: The CAP Conjecture
(Eric Brewer: PODC 2000 Keynote)