ECS765P - W9 - Large-Scale Graph Processing
[Figure: data pipeline stages: Data Sources, Ingestion, Storage, Processing, Output]
● Graph Applications
● Graph Databases
● Graph Databases with Python
● Pregel
● GraphX
Graph Definition
Communities of college football network, using colors for conferences and spatial clustering for identified communities
https://www.ese.wustl.edu/~nehorai/research/network_science/Lu_Community_Detection_SR_2018.html
Bipartite graphs
Bipartite: the nodes can be partitioned into two groups such that every edge connects a node in one group to a node in the other
https://en.wikipedia.org/wiki/Bipartite_graph
Example: "Stable marriage/matching" problem: how to find a stable matching between two equally sized
sets of elements, given an ordering of preferences for each element. A matching is a bijection from the
elements of one set to the elements of the other set.
https://en.wikipedia.org/wiki/Stable_marriage_problem
Not stable: a matching is not stable if there is an element A of the first set that prefers some element B
of the second set over the element A is currently matched with, and B also prefers A over the element
B is currently matched with.
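The instability condition above can be checked directly. The following is a minimal sketch, assuming preferences are given as ordered lists (most preferred first); the function names `blocking_pairs` and `is_stable` and the toy data are illustrative, not from the slides.

```python
def blocking_pairs(matching, prefs_a, prefs_b):
    """Return all pairs (a, b) that make the matching unstable."""
    # Invert the matching so we can look up each b's current partner.
    partner_of_b = {b: a for a, b in matching.items()}
    pairs = []
    for a, current_b in matching.items():
        # Every b that a ranks strictly above its current partner.
        better_for_a = prefs_a[a][:prefs_a[a].index(current_b)]
        for b in better_for_a:
            current_a = partner_of_b[b]
            # b also prefers a over its current partner: (a, b) blocks the matching.
            if prefs_b[b].index(a) < prefs_b[b].index(current_a):
                pairs.append((a, b))
    return pairs


def is_stable(matching, prefs_a, prefs_b):
    """A matching is stable when it has no blocking pair."""
    return not blocking_pairs(matching, prefs_a, prefs_b)


# Hypothetical preference lists: everyone ranks the "1" element first.
prefs_a = {"a1": ["b1", "b2"], "a2": ["b1", "b2"]}
prefs_b = {"b1": ["a1", "a2"], "b2": ["a1", "a2"]}

stable = {"a1": "b1", "a2": "b2"}
unstable = {"a1": "b2", "a2": "b1"}  # a1 and b1 prefer each other
```

Here `unstable` is rejected because a1 and b1 would both rather be matched to each other than to their assigned partners.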
[Figure: degree distribution of a scale-free network; axes: Degree, Frequency]
https://en.wikipedia.org/wiki/Scale-free_network
Big Data Processing: Week 9
Topic List:
● Graph Applications
● Graph Databases
● Graph Databases with Python
● Pregel
● GraphX
Graph Management / Storage
Graph database: a database that uses graph structures with nodes, edges and properties to store data
Like relational (SQL) databases, it is ACID: database transactions are Atomic, Consistent, Isolated and
Durable logical units of work (https://en.wikipedia.org/wiki/ACID)
[Figure: graph data model; labels: Property, Entity]
Why is SQL not suitable for graph-based data?
Example query: get the non-immediate friends of Person001 who are up to 3 hops away
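In a relational schema this query typically needs one self-join of the friendship table per hop, whereas on a graph it is a bounded breadth-first traversal. A minimal sketch in plain Python, assuming the friendships are held in an adjacency dictionary (the data and function names are hypothetical):

```python
from collections import deque


def friends_within(graph, start, max_hops):
    """Map each person reachable within max_hops to their hop distance."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        if dist[person] == max_hops:
            continue  # do not expand beyond the hop limit
        for friend in graph.get(person, ()):
            if friend not in dist:
                dist[friend] = dist[person] + 1
                queue.append(friend)
    return dist


def non_immediate_friends(graph, start, max_hops=3):
    """People 2..max_hops hops away (immediate friends are 1 hop away)."""
    dist = friends_within(graph, start, max_hops)
    return sorted(p for p, d in dist.items() if 2 <= d <= max_hops)


# Hypothetical friendship chain: 001 - 002 - 003 - 004 - 005
friendships = {
    "Person001": ["Person002"],
    "Person002": ["Person001", "Person003"],
    "Person003": ["Person002", "Person004"],
    "Person004": ["Person003", "Person005"],
    "Person005": ["Person004"],
}
result = non_immediate_friends(friendships, "Person001")
```

Person002 is excluded as an immediate friend and Person005 is excluded because it is 4 hops away.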
Cypher query language
In Neo4j Desktop, open the Movies project and run the command :play movies
Then follow the instructions to create the movies database and run the queries
Movies Database neo4j
Find movies released in the 1990s
Find actors up to 4 hops away from Kevin Bacon
Find the shortest path between two actors
Find co-co-actors of Tom Hanks (co-actors of his co-actors who have not acted with him)
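The four tasks above could be written in Cypher roughly as follows. This is a sketch held in Python strings so the queries can later be sent from a client; it assumes the Movies dataset's `Person` and `Movie` labels, the `ACTED_IN` relationship and the `released`, `title` and `name` properties. The dictionary name and the `$source` / `$target` parameter names are illustrative.

```python
# Cypher sketches for the Movies database, keyed by task.
MOVIE_QUERIES = {
    "nineties": (
        "MATCH (m:Movie) "
        "WHERE m.released >= 1990 AND m.released < 2000 "
        "RETURN m.title AS title"
    ),
    "bacon_4_hops": (
        'MATCH (bacon:Person {name: "Kevin Bacon"})-[*1..4]-(p:Person) '
        "RETURN DISTINCT p.name AS name"
    ),
    "shortest_path": (
        "MATCH path = shortestPath("
        "(a:Person {name: $source})-[*]-(b:Person {name: $target})) "
        "RETURN path"
    ),
    "co_co_actors": (
        'MATCH (tom:Person {name: "Tom Hanks"})-[:ACTED_IN]->(:Movie)'
        "<-[:ACTED_IN]-(co:Person), "
        "(co)-[:ACTED_IN]->(:Movie)<-[:ACTED_IN]-(coco:Person) "
        "WHERE NOT (tom)-[:ACTED_IN]->(:Movie)<-[:ACTED_IN]-(coco) "
        "AND tom <> coco "
        "RETURN coco.name AS name, count(*) AS strength "
        "ORDER BY strength DESC"
    ),
}
```

Each string can be pasted into the Neo4j browser as-is; the shortest-path query expects the two actor names as parameters.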
A Neo4j database can be accessed from Python using the neo4j library
Neo4j Python: Obtaining a JSON Graph
Obtain a JSON representation of the graph defined in the movies database, showing movie titles and their actors/cast
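A minimal sketch of the conversion step, assuming the query below returns one row per movie with its title and cast list. The rows here are hard-coded stand-ins for what the driver would return, and the node/link layout is one possible JSON shape, not necessarily the one used in the lecture code.

```python
import json

# Assumed Cypher query producing the rows consumed below.
GRAPH_QUERY = (
    "MATCH (m:Movie)<-[:ACTED_IN]-(a:Person) "
    "RETURN m.title AS movie, collect(a.name) AS cast"
)


def rows_to_graph(rows):
    """Turn (movie, cast) rows into a {"nodes": [...], "links": [...]} dict."""
    nodes, links = [], []
    actor_index = {}  # actor name -> position in nodes, so actors are not duplicated
    for row in rows:
        movie_index = len(nodes)
        nodes.append({"title": row["movie"], "label": "movie"})
        for name in row["cast"]:
            if name not in actor_index:
                actor_index[name] = len(nodes)
                nodes.append({"title": name, "label": "actor"})
            # Link from the actor node to the movie node, by position in nodes.
            links.append({"source": actor_index[name], "target": movie_index})
    return {"nodes": nodes, "links": links}


# Stand-in rows; with the real driver these would come from the query result.
rows = [
    {"movie": "Apollo 13", "cast": ["Tom Hanks", "Kevin Bacon"]},
    {"movie": "Cast Away", "cast": ["Tom Hanks"]},
]
graph = rows_to_graph(rows)
graph_json = json.dumps(graph, indent=2)
```

Tom Hanks appears once as a node and is linked to both movies.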
Neo4j Python: Search Functionality
Search the database for movies whose titles contain the substring held in the variable q
https://neo4j.com/docs/cypher-manual/current/clauses/where/#query-where-regex
https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/util/regex/Pattern.html
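Cypher's `=~` operator matches the whole string against a Java regular expression, so a substring search needs `.*` on both sides. A minimal sketch of building that pattern from `q`; the query text and the helper name `title_pattern` are assumptions, and Python's `re` module is used here only to emulate the match locally.

```python
import re

# Assumed Cypher query; $pattern is supplied as a query parameter.
SEARCH_QUERY = "MATCH (m:Movie) WHERE m.title =~ $pattern RETURN m.title AS title"


def title_pattern(q):
    """Case-insensitive 'contains q' pattern; q is escaped so it is matched literally."""
    return "(?i).*" + re.escape(q) + ".*"


pattern = title_pattern("matrix")

# Local emulation of the whole-string match on some sample titles.
titles = ["The Matrix", "The Matrix Reloaded", "Cast Away"]
matches = [t for t in titles if re.fullmatch(pattern, t)]
```

Passing `q` as a parameter rather than concatenating it into the query text also avoids Cypher injection.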
Graph Traversal in MapReduce
Approach: parallel processing of each vertex
● Each Map/Reduce function has access to limited information: one node and its links
Iterative executions of a MapReduce job
● Map: compute something on each node, potentially sending information to that node or to other nodes;
this information is aggregated by the Reducers.
● Reduce: compute something on each unique node
● The output of the reducers in iteration #n becomes the input of the mappers in iteration #n+1
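The iterative scheme above can be sketched for unweighted shortest paths. This is a single-process simulation under stated assumptions, not Hadoop code: each record is `node -> (distance, neighbours)`, the shuffle is emulated with a dictionary, and all function names are illustrative.

```python
import math
from collections import defaultdict


def mapper(node, state):
    distance, neighbours = state
    # Pass the graph structure along so the reducer can rebuild the record.
    yield node, ("adj", neighbours)
    yield node, ("dist", distance)
    # A reached node offers each neighbour a path that is one hop longer.
    if distance < math.inf:
        for neighbour in neighbours:
            yield neighbour, ("dist", distance + 1)


def reducer(node, values):
    neighbours, best = [], math.inf
    for kind, payload in values:
        if kind == "adj":
            neighbours = payload
        else:
            best = min(best, payload)  # keep the shortest distance seen
    return node, (best, neighbours)


def run_iteration(records):
    grouped = defaultdict(list)  # emulates the shuffle: group values by key
    for node, state in records.items():
        for key, value in mapper(node, state):
            grouped[key].append(value)
    return dict(reducer(key, values) for key, values in grouped.items())


def shortest_paths(adjacency, source):
    records = {n: (0 if n == source else math.inf, nbrs)
               for n, nbrs in adjacency.items()}
    iterations = 0
    while True:
        updated = run_iteration(records)  # output of iteration n feeds n+1
        iterations += 1
        if updated == records:  # no distance changed: converged
            return {n: d for n, (d, _) in updated.items()}, iterations
        records = updated


adjacency = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"], "E": []}
distances, iterations = shortest_paths(adjacency, "A")
```

Every iteration re-emits the whole adjacency structure, including nodes whose distance can no longer change, which is one reason this approach is considered inefficient.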
Finding the Shortest Path: Intuition
Inefficient
[1] Malewicz, G., Austern, M. H., Bik, A. J., Dehnert, J. C., Horn, I., Leiser, N., & Czajkowski, G. (2010, June). "Pregel: a system for large-scale graph
processing." In Proceedings of the ACM SIGMOD
Pregel: Think like a vertex
https://people.cs.rutgers.edu/~pxk/417/notes/pregel.html
Pregel’s node/vertex-centric processing model
Pregel-style graph processing systems
Computation is iterative, organised as a sequence of supersteps
● In every superstep, a function is executed at each vertex
Vertices can send messages to their neighbours
Messages arrive in the next superstep
Computation is executed in parallel
● Each vertex is independent of the rest within the same superstep
● Messages are the synchronization mechanism
https://people.cs.rutgers.edu/~pxk/417/notes/pregel.html
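The superstep model can be illustrated with maximum-value propagation, where every vertex ends up holding the largest value in the graph. This is a single-process sketch under stated assumptions (a directed graph given as out-edge lists, hypothetical names and data), not the Pregel API.

```python
def pregel_max(values, out_edges):
    """Propagate the maximum vertex value using synchronous supersteps."""
    values = dict(values)
    # Superstep 0: every vertex sends its value to its out-neighbours.
    inbox = {v: [] for v in values}
    for v in values:
        for neighbour in out_edges.get(v, ()):
            inbox[neighbour].append(values[v])
    # Keep running supersteps while any message is in flight.
    while any(inbox.values()):
        outbox = {v: [] for v in values}
        for v, messages in inbox.items():  # conceptually executed in parallel
            if messages and max(messages) > values[v]:
                # The vertex learned a larger value: update and notify neighbours.
                values[v] = max(messages)
                for neighbour in out_edges.get(v, ()):
                    outbox[neighbour].append(values[v])
            # Otherwise the vertex votes to halt by sending nothing.
        inbox = outbox  # messages are delivered in the next superstep
    return values


initial = {"A": 3, "B": 6, "C": 2, "D": 1}
out_edges = {"A": ["B"], "B": ["A", "D"], "C": ["B", "D"], "D": ["C"]}
final = pregel_max(initial, out_edges)
```

Each vertex only reads its own value and the messages addressed to it, so the per-vertex loop body could run on different workers with the message exchange acting as the barrier between supersteps.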
Google’s PageRank
PageRank is a link analysis algorithm
The rank value indicates the importance of a particular web page
A hyperlink to a page counts as a vote of support
A page that is linked to by many pages with high PageRank receives a high rank itself
Example: A PageRank of 0.5 means there is a 50% chance that a person clicking on a random link
will be directed to the document with a PageRank of 0.5
Page, L., Brin, S., Motwani, R., & Winograd, T. (1999). The PageRank citation ranking: bringing order to the web. WWW
PageRank Example
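$r_{k+1}(P_i) = \sum_{P_j \to P_i} \frac{r_k(P_j)}{d(P_j)}$
where the sum runs over the neighbors $P_j$ that link to $P_i$, $r_k(P_j)$ is the rank of the neighbor and $d(P_j)$ is the outdegree of the neighbor
Initial value: $r_0(P_i) = 1/N$, where $N$ is the number of pages ($N = 6$ in this example)
$r_1(P_2) = r_0(P_3)/d(P_3) + r_0(P_1)/d(P_1) = (1/6)/3 + (1/6)/2 = 1/18 + 1/12 = 30/216 = 5/36$
$r_2(P_2) = r_1(P_3)/d(P_3) + r_1(P_1)/d(P_1) = (1/12)/3 + (1/18)/2 = 1/36 + 1/36 = 1/18$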
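The two updates above can be reproduced with exact fractions. The link structure below is an assumption: the slide's graph is not reproduced in the text, so this is one 6-page graph chosen to be consistent with the numbers shown (P2 is linked from P1 with outdegree 2 and from P3 with outdegree 3). The update is the simplified rule without a damping factor.

```python
from fractions import Fraction


def pagerank_step(ranks, out_links):
    """One iteration: each page splits its rank equally among its out-links."""
    new_ranks = {page: Fraction(0) for page in ranks}
    for page, targets in out_links.items():
        for target in targets:
            new_ranks[target] += ranks[page] / len(targets)
    return new_ranks


# Hypothetical graph consistent with the worked example (P2 has no out-links).
out_links = {
    "P1": ["P2", "P3"],
    "P2": [],
    "P3": ["P1", "P2", "P5"],
    "P4": ["P5", "P6"],
    "P5": ["P4", "P6"],
    "P6": ["P4"],
}

r0 = {page: Fraction(1, len(out_links)) for page in out_links}  # 1 / N
r1 = pagerank_step(r0, out_links)
r2 = pagerank_step(r1, out_links)
```

With this graph, `r1["P2"]` is 5/36 and `r2["P2"]` is 1/18, matching the hand calculation. Because P2 has no out-links, its rank is not passed on in this simplified rule.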
https://en.wikipedia.org/wiki/PageRank
Spark GraphX
Spark’s library for graph processing
Provides specialized RDDs for representing the graph structure as well as its information (property graphs)
Provides methods for creating graphs, transforming them, and computing common graph metrics
and algorithms
GraphX is written in Scala and has no Python API → GraphFrames is the library for graph processing on Spark from Python
Spark GraphX Property Graphs
Spark GraphX RDD
Holds graph data and provides methods for manipulating it
VertexRDD[VD]: an RDD of (VertexId, VD) pairs
Vertex IDs must be 64-bit integers (VertexId is a Long)
VD (vertex data) holds the vertex properties
EdgeRDD[ED]: an RDD of Edge[ED] objects
Each Edge holds the source and destination vertex IDs and the edge properties (ED)
Technically a directed graph: every edge has a source and a destination
Triplets
Join of source vertex, destination vertex, and edge
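The triplet view can be pictured as a join of the edge list with the vertex table on both endpoints. A minimal plain-Python sketch with hypothetical data (this mimics the idea, not the GraphX API):

```python
# Vertex table: id -> properties (ids are integers, as GraphX requires).
vertices = {
    1: {"name": "Alice", "age": 28},
    2: {"name": "Bob", "age": 27},
    3: {"name": "Charlie", "age": 65},
}

# Edge list: (source id, destination id, edge property).
edges = [(2, 1, "follows"), (3, 1, "follows"), (1, 3, "follows")]


def triplets(vertices, edges):
    """Join each edge with the properties of its source and destination."""
    return [
        {"src": vertices[src], "attr": attr, "dst": vertices[dst]}
        for src, dst, attr in edges
    ]


for t in triplets(vertices, edges):
    print(f'{t["src"]["name"]} {t["attr"]} {t["dst"]["name"]}')
```

Each triplet carries everything needed to evaluate a per-edge function without further lookups.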
GraphX predefined methods
A Graph RDD has multiple convenience methods that provide access to its information and implement
relevant operations
https://spark.apache.org/docs/latest/graphx-programming-guide.html
Graph aggregate computation
Aggregate transformations send messages along each edge and combine the messages received at each vertex
graph.aggregateMessages: this operator applies a user-defined sendMsg function to each edge triplet in
the graph and then uses the mergeMsg function to aggregate those messages at their destination vertex.
https://spark.apache.org/docs/latest/graphx-programming-guide.html#aggregate-messages-aggregatemessages
Age of the oldest follower of each node
(Scala code)
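The Scala code from the slide is not reproduced in this text. As a stand-in, the following plain-Python sketch emulates the send/merge pattern for this example; `aggregate_messages` and its argument layout are illustrative and do not match the GraphX signature, and the ages are made up.

```python
def aggregate_messages(vertices, edges, send_msg, merge_msg):
    """Apply send_msg to every edge triplet and merge messages per target vertex."""
    inbox = {}
    for src, dst, attr in edges:
        for target, msg in send_msg(src, vertices[src], dst, vertices[dst], attr):
            if target in inbox:
                inbox[target] = merge_msg(inbox[target], msg)
            else:
                inbox[target] = msg
    return inbox


# Vertex property: age. An edge (src, dst, attr) means "src follows dst".
ages = {1: 28, 2: 27, 3: 65, 4: 42}
follows = [(2, 1, None), (3, 1, None), (4, 1, None), (1, 3, None)]


def send_follower_age(src, src_age, dst, dst_age, attr):
    # Each follower sends its own age to the vertex it follows.
    yield dst, src_age


# Merging with max keeps the age of the oldest follower.
oldest_follower = aggregate_messages(ages, follows, send_follower_age, max)
```

Vertices 2 and 4 have no followers, receive no messages, and are therefore absent from the result.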
http://webprojects.eecs.qmul.ac.uk/ag316/notesSite/BDP_slides/Week7%20%7C%20BigGraphs/ECS640-9-BigGraphs.pdf