PSOSM Lectures
2.1
URL: https://www.youtube.com/watch?v=WYB6V0gTJps
Week: "2"
OSM API
Lets you interact with a social media platform
programmatically, e.g. to collect data
Each platform has its own API and rate limits
Python
Has a lot of libraries; we will be using it
Data Formats
JSON
XML
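A minimal sketch of how the same record looks in the two formats, using only the Python standard library. The post record and its field names are made up for illustration and do not come from any real API.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical post record, made up for illustration.
json_text = '{"id": 101, "user": "alice", "likes": 3}'
xml_text = '<post id="101"><user>alice</user><likes>3</likes></post>'

# JSON maps directly onto Python dicts, lists, strings and numbers.
post_from_json = json.loads(json_text)

# XML is a tree of elements; values arrive as strings and
# need explicit conversion.
root = ET.fromstring(xml_text)
post_from_xml = {
    "id": int(root.get("id")),
    "user": root.findtext("user"),
    "likes": int(root.findtext("likes")),
}

print(post_from_json == post_from_xml)
```

Both parsers end up with the same dictionary; the XML path just needs more manual work for types.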
MySQL
Relational database; stores data in rows and columns
Viewed with phpMyAdmin
MongoDB
Data is stored as objects (documents)
The lecture recommends RoboMongo, but I would prefer Compose
Graph API
All content in Meta is stored in graph format and every
interaction is an edge in the graph
All nodes have a unique numeric id
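A toy sketch of the graph idea above: nodes keyed by a unique numeric id, interactions stored as edges. The node fields and the relation names ("friend", "created", "liked") are hypothetical and are not actual Graph API names.

```python
# Toy graph: every node has a unique numeric id,
# every interaction is an edge (source, relation, target).
nodes = {
    1: {"type": "user", "name": "alice"},
    2: {"type": "user", "name": "bob"},
    3: {"type": "post", "text": "hello"},
}
edges = [
    (1, "friend", 2),
    (1, "created", 3),
    (2, "liked", 3),
]


def neighbours(node_id, relation):
    """Ids reached from node_id over edges with the given relation."""
    return [dst for src, rel, dst in edges
            if src == node_id and rel == relation]


def incoming(node_id, relation):
    """Ids that point at node_id over edges with the given relation."""
    return [src for src, rel, dst in edges
            if dst == node_id and rel == relation]


print(neighbours(1, "created"))
print(incoming(3, "liked"))
```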
2.2
URL: https://www.youtube.com/watch?v=yf4jNe3X_mg
Week: "2"
Our Objective:
Methodology
1. Data Collection and Filtering
2. Data Characterization
3. Classification Module
1. Feature Generation
2. Obtaining Ground Truth
4. Evaluating Results
Data Collection
Data Filtering
The Guardian manually annotated the data and distributed it
publicly
There were twice as many fake images as real images
Analysis Helps: Who, When, Where, What, Why and How
Network Analysis:
Within an hour the spread is exponential
The user who started it may not be its biggest spreader
Classification
Results:
2.T
Week: "2"
Reddit API
praw is the Python Reddit API Wrapper
Flair is a label similar to a hashtag
3.1
URL: https://www.youtube.com/watch?v=SkK6ejOS-XE
Week: "3"
Facebook
Features are quite different
Facebook connections are bidirectional (friendship is mutual)
Connections are more personal
FBI
Just like TweetCred
Method:
3.2
URL: https://www.youtube.com/watch?v=4C4P_tzjthc
Week: "3"
API Keys
Do not share API Keys (Lol)
Privacy
Every context has different privacy expectations
Westin's 3 Categories
Fundamentalists - 25%
Pragmatists - 60%
Unconcerned - 15%
Dog example
A black dog can be identified via the internet
#privacyindia12
3.T.1
URL: https://www.youtube.com/watch?v=ERvCBJn-9tU
Week: "3"
Twitter API
Read/Write based tokens
Advertising Management is also provided
Tweepy - Python Wrapper
Search API
Based on query
Based on geocode
Based on Language
recent/popular/mixed
Based on count
Only covers the last 7 days
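A rough sketch of how the search options listed above map onto request parameters. It only builds the URL and sends nothing. The endpoint and parameter names (`q`, `geocode`, `lang`, `result_type`, `count`) are assumed from the v1.1 standard search API and should be checked against the current documentation. `build_search_url` is a hypothetical helper, not part of Tweepy.

```python
from urllib.parse import urlencode

# Assumed v1.1 standard search endpoint; verify before use.
SEARCH_URL = "https://api.twitter.com/1.1/search/tweets.json"
RESULT_TYPES = ("recent", "popular", "mixed")


def build_search_url(query, geocode=None, lang=None,
                     result_type="mixed", count=15):
    """Build a search URL from the options in the notes (hypothetical helper)."""
    if result_type not in RESULT_TYPES:
        raise ValueError(f"result_type must be one of {RESULT_TYPES}")
    params = {"q": query, "result_type": result_type, "count": count}
    if geocode is not None:
        params["geocode"] = geocode  # "latitude,longitude,radius"
    if lang is not None:
        params["lang"] = lang
    return SEARCH_URL + "?" + urlencode(params)


url = build_search_url("#privacy", geocode="28.61,77.21,10km",
                       lang="en", result_type="recent", count=50)
print(url)
```

A real request also needs authentication with the API keys, which is what a wrapper such as Tweepy handles.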
Streaming API
4.1
URL: https://www.youtube.com/watch?v=oiqJZ3FIX_w
Week: "4"
Hard to Define
It cannot be usefully addressed at all
Claim of individuals/groups to determine for themselves
when, how and to what extent information about them is
communicated to others
Control of information
Forms of Privacy
Information
Internet
Communication
Telephone
Territorial
Living Space
Bodily
Self
Question
Can we use publicly available images (e.g. from FB) and
off-the-shelf models to reidentify anyone?
The goal is to
Latanya Sweeney
Combined medical data with a voter list to reidentify
individuals
Independent data sets can be linked to reidentify people
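A toy sketch of the linkage idea: neither table alone connects a name to a diagnosis, but joining them on shared attributes does. All records are fabricated. The choice of ZIP code, date of birth and sex as join fields is an assumption based on Sweeney's well-known result, since the notes do not list the fields.

```python
# Fabricated toy records for illustration only.
medical = [
    {"zip": "00001", "dob": "1970-01-01", "sex": "M", "diagnosis": "flu"},
    {"zip": "00002", "dob": "1980-02-02", "sex": "F", "diagnosis": "asthma"},
]
voters = [
    {"name": "A. Example", "zip": "00001", "dob": "1970-01-01", "sex": "M"},
    {"name": "B. Sample", "zip": "00002", "dob": "1991-03-03", "sex": "F"},
]

# Assumed quasi-identifiers shared by both data sets.
QUASI_IDENTIFIERS = ("zip", "dob", "sex")


def key(record):
    """Tuple of quasi-identifier values used as the join key."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def link(medical_rows, voter_rows):
    """Return (name, diagnosis) pairs where the join key matches one voter."""
    names_by_key = {}
    for voter in voter_rows:
        names_by_key.setdefault(key(voter), []).append(voter["name"])
    matches = []
    for row in medical_rows:
        names = names_by_key.get(key(row), [])
        if len(names) == 1:  # a unique match means the row is reidentified
            matches.append((names[0], row["diagnosis"]))
    return matches


reidentified = link(medical, voters)
print(reidentified)
```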
Experiment 1
Data
Approach
Results
Experiment 2
Offline-to-online comparison
Pictures from an FB college network used to identify students
Data
25k profiles
26k Images
114k Faces
Process
Results
Experiment 3
Combination of 1 and 2
Predicted SSN from public data
Faces / FB data + Public data --> SSN
For 27% of subjects, the first 5 SSN digits were predicted
correctly within 4 attempts, starting from their faces
4.T.1
URL: https://www.youtube.com/watch?v=pEyizxN3K84
Week: "4"
numpy
Arrays are faster
Consumes less memory
Mechanism to specify data types
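import numpy as np

# creating zeroes
np.zeros(10, dtype=int)
# zero matrix
np.zeros((3, 3))
# random elements
b = np.random.random((3, 3))
# convert to a type (returns a new array, b is unchanged)
b.astype("int16")
# identity matrix (ones on the diagonal)
np.eye(5)
# generate random integer array
# ex: 20 random integers from 1 to 149 (upper bound is exclusive)
r = np.random.randint(1, 150, size=20)
# sample arrays used in the examples below
a = np.arange(16)
x = np.arange(9).reshape(3, 3)
y = np.ones((3, 3), dtype=int)
# number of dimensions
a.ndim
# number of elements
a.size
# get element (i, j) of an m x n matrix
# ex: 2,2
x[2, 2]
# Slicing
# ex: selecting every second element (every second row in 2-D)
x[::2]
# Concatenate (the arrays are passed as one tuple)
np.concatenate((x, y))
# along an axis
np.concatenate((x, y), axis=0)
# horizontally (column-wise)
np.hstack((x, y))
# vertically (row-wise)
np.vstack((x, y))
# Splitting Array
# reshape into a 4x4 grid first
grid = a.reshape(4, 4)
# Vertical Split (rows before and after index 2)
np.vsplit(grid, [2])
# Horizontal Split (columns before and after index 2)
np.hsplit(grid, [2])
# matrix multiplication
np.matmul(x, y)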
4.T.2
URL: https://www.youtube.com/watch?v=V-PozDJ7c1A
Week: "4"
pandas
Data analysis library
Has a lot of inbuilt
5.1
URL: https://www.youtube.com/watch?v=rPmjaAB8AAk
Week: "5"
Foursquare
Foursquare can be used to trace where people live
You check in to a place
You can leave tips about places
You become mayor of a place if you checked in there on more
days than anyone else in the past 60 days
Mayors may get perks such as free parking spots
Policing
5.2
URL: https://www.youtube.com/watch?v=d9T9VVoUcKE
Week: "5"
Objective
Whether OSNs can help police get actionable information
about crime and residents' opinions about policing
activities in urban cities of India
Methodology
Collect data from BLR City Police
Filter posts and comments for relevance
1.6K comments and 250 posts
Data Coding
1. Content Based
Missing
Query
Traffic
2. Style
Formal
Informal
3. Type
- Acknowledge to
- Reply to
- Follow-up by
- Ignored by
Lexical Analysis using word trees
Result
Engagement Type
Mostly Acknowledging
Replying
Follow up
Ignore (1/3)
Understanding Victimization
Accountability
Understanding Needs/Wants
5.3
URL: https://www.youtube.com/watch?v=Ao_ZuLPVlP8
Week: "5"
Research Questions
Methodology
1. Topics
1. N Gram Analysis
2. K-means Clustering
2. Engagement
1. No. of police/citizens who comment on posts
2. Distinct citizens who comment on posts
3. Average no. of likes and comments
3. Emotional (LIWC and ANEW dictionary)
1. Valence - Positivity and Negativity
2. Arousal - Intensity
4. Social and Cognitive (LIWC)
1. Interpersonal Focus
2. Social Orientation
3. Cognition
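The n-gram step of the topic analysis can be sketched as below: a minimal standard-library version that counts word bigrams across comments. The comments are made up, and whitespace tokenisation is an assumption; the notes do not say how the text was preprocessed.

```python
from collections import Counter


def ngrams(text, n):
    """Return the list of word n-grams (as tuples) in a text."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Made-up comments for illustration only.
comments = [
    "traffic jam near the signal",
    "huge traffic jam again",
    "thank you for the quick reply",
]

# Count every bigram across all comments.
bigram_counts = Counter(bg for c in comments for bg in ngrams(c, 2))
top = bigram_counts.most_common(1)
print(top)
```

Frequent n-grams give a first view of the topics; the K-means step would then cluster posts using such features.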
Clusters of Topic
Engagement Characteristics
Emotional Characteristics
Why it matters?
Helps police improve policing and community sensing
Enables emotional support for residents experiencing safety
concerns
Tech Implications
Help gauge changing emotions and behavior
Sense and record citizens' reactions and share them with
decision makers
6.1
URL: https://www.youtube.com/watch?v=z1IqDHJm6N0
Week: "6"
eCrimes in OSM
Phishing
Act of tricking someone into handing over their login
credentials in order to exploit personal information
Spear Phishing: Targets specific people
Whaling: Targets senior executives such as CEOs
Link in FB saying "There was some issue with Facebook login,
click here to solve it"
Example:
FB tech support DMs you
New Login system
Social Reputation
People respect you based on the number of followers you have
Social status dictates reputation
A lot of them are manipulated:
Paid good reviews for bad products
Fake followers
Click baiting
Getting you to click on links
#hijacking
Using a hashtag to sell products or promote something
that has nothing to do with the hashtag
Compromised Account
Hacked accounts posting wrong information or other bad
activity
Impersonation
Pretending to be someone else
6.T.4
URL: https://www.youtube.com/watch?v=d6bi0QTaX5Y
Week: "6"
SNA Metrics
Centrality
Indegree: Most influential
Outdegree: Who disseminates information
Betweenness: Quickly approachable; a node through which a
good number of nodes reach other sets of nodes
Closeness: How close a node is to the other nodes
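A small standard-library sketch of three of the metrics above on a made-up directed graph. Closeness here is taken as the number of reachable nodes divided by the sum of hop distances to them, which is one common definition and is an assumption; betweenness is left out to keep the example short.

```python
from collections import deque

# Toy directed graph, made up for illustration: (u, v) means u -> v.
edges = [("a", "b"), ("c", "b"), ("d", "b"), ("b", "e"), ("e", "a")]
nodes = sorted({n for edge in edges for n in edge})

out_neighbours = {n: [] for n in nodes}
indegree = {n: 0 for n in nodes}
for u, v in edges:
    out_neighbours[u].append(v)
    indegree[v] += 1
outdegree = {n: len(out_neighbours[n]) for n in nodes}


def distances_from(source):
    """Breadth-first search: hop counts from source to reachable nodes."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for nxt in out_neighbours[current]:
            if nxt not in dist:
                dist[nxt] = dist[current] + 1
                queue.append(nxt)
    return dist


def closeness(node):
    """Reachable nodes / sum of distances; 0.0 if nothing is reachable."""
    dist = distances_from(node)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0


print(max(indegree, key=indegree.get))
print(outdegree)
print(closeness("b"))
```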
Community & Modularity
Adjacency Matrix
GraphML Format
7.1
URL: https://www.youtube.com/watch?v=oxXCzyRdTio
Week: "7"
Spammers
Top 100,000 spam followers account for 60% of all links
acquired by the spammers
Top spam-followers tend to reciprocate all links established
to them by spammers
Spammers try to increase their in-links
A user with fewer followers is less likely to reciprocate
a follow by a spammer; responsiveness increases with the
number of followers
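A toy sketch of what "reciprocating a link" means in terms of follow edges. The account names and edges are made up, and `reciprocation_rate` is a hypothetical helper rather than a metric defined in the lecture.

```python
# Made-up follow edges: (u, v) means u follows v.
follows = {
    ("spam1", "u1"), ("u1", "spam1"),
    ("spam1", "u2"),
    ("spam1", "u3"), ("u3", "spam1"),
}


def reciprocation_rate(source, follow_edges):
    """Fraction of accounts followed by source that follow source back."""
    targets = {v for u, v in follow_edges if u == source}
    if not targets:
        return 0.0
    followed_back = {t for t in targets if (t, source) in follow_edges}
    return len(followed_back) / len(targets)


rate = reciprocation_rate("spam1", follows)
print(rate)
```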
Link Farmers
7.2
URL: https://www.youtube.com/watch?v=lz_IivUTQjk
Week: "7"
Methodology
Analysis Metrics
Timer Nudge
Sentiment Nudge
It was losing the context
Many cancelled because of this nudge
Post Frequency Dropped: 13 -> 7
Conclusion
Interventions help users make better decisions
More work is needed to understand which nudge works when
7.3
URL: https://www.youtube.com/watch?v=AfTNyw3_TdE
Week: "7"
Semantic Attack
Urgency in Subject
Spelling Mistakes
Links take you to random websites
1. Phishing
2. Context-aware phishing / spear phishing
3. Whaling
4. Vishing: Over phone
5. Smishing: Over SMS
6. Social Phishing
Social Phishing
Methodology
Flow
1. Public data is harvested
2. Data is correlated and stored in RDB
3. Heuristics are used to craft spoofed email message by Eve
4. Message is sent to Bob
5. Bob follows the link contained and is sent to an unchecked
redirect
6. Bob is sent to the attacker's site, whuffo.com
7. Bob is asked for creds
8. Bob's creds are verified with university authenticator
9. Then
a. Bob is phished
b. Not phished, could try again
Victims
Control: 16%, which is higher than usual
Social: 72%, consistent with other experiments
Success rate
70% of authentications occurred in the first 12 hours
Takedown has to be successful and quick
Younger targets were more vulnerable
Science department had the maximum difference between
control and social
Technology had the fewest victims #satisfying
Repeated Authentications
The users actually tried again because of the overload
message
Some even tried 80 times
Gender
Opposite gender was the highest
Male to Male was least
Females seem to be more vulnerable
Reactions
Anger
Some subjects called for the researchers to be fired
Psychological cost
Unethical and illegal
Denial
Nobody admitted they fell for it
Misunderstanding over spoofing emails
Underestimation of publicly available information
Conclusions
7.T.5
URL: https://www.youtube.com/watch?v=Wqrea2rTV7I
Week: "6"
nltk
NLP based library for string operations on data
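import string
import nltk

# Sample tokens; needs a one-time nltk.download("stopwords")
tokens = "an example sentence, with punctuation!".split()
# Replace punctuation
obj = str.maketrans("", "", string.punctuation)
tokens = [i.translate(obj) for i in tokens]
# Remove stopwords
stop_words = set(nltk.corpus.stopwords.words("english"))
tokens = [i for i in tokens if i not in stop_words and len(i) > 2]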
8-9-10-11
Week: 8, 9, 10, 11
8.1
De-duplicating audience
Profile Linking approach
Values change over time: people's usernames change
Profile pic and description change very often
Given two user profiles and their respective username sets,
each composed of past and current usernames, find whether
the profiles refer to a single individual
Why only usernames?
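A rough sketch of the problem statement above: compare every username in one set with every username in the other and link the profiles if the best pair is similar enough. Using `difflib.SequenceMatcher` with a 0.8 threshold is an assumption for illustration and is not the method from the lecture; the usernames are made up.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_match(usernames_a, usernames_b):
    """Highest-scoring (score, a, b) over all cross-set username pairs."""
    pairs = ((similarity(a, b), a, b)
             for a in usernames_a for b in usernames_b)
    return max(pairs, default=(0.0, None, None))


def same_person(usernames_a, usernames_b, threshold=0.8):
    """Guess whether two username sets belong to one individual."""
    score, _, _ = best_match(usernames_a, usernames_b)
    return score >= threshold


# Made-up username histories (past and current) for two profiles.
profile_a = {"jdoe_92", "john.doe"}
profile_b = {"JohnDoe", "jd_photos"}
print(best_match(profile_a, profile_b))
print(same_person(profile_a, profile_b))
```

Because each set contains past usernames too, a match can be found even when the current usernames look unrelated.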
8.2
Anonymous Networks
4chan
Whisper
Secret
Yik Yak
Wickr
Why do we need anonymous networks?
Increasing awareness
Snowden Disclosure
PRISM surveillance program
Bal Thackeray Incident
What is Whisper?
GUID: Globally Unique Identifier, removed in 2014
55% of whispers get no replies
94% of replies arrive within one day
Unlikely to get attention later
80% of users post fewer than 10 whispers in total
15% only reply
30% only post and never reply
Average degree is very high
Very low clustering
No small world phenomenon
Assortativity is very low -> resembles a random graph
18% content deleted compared to 4% in Twitter
70% of deleted whispers are deleted within a week
2% stay after a month
In 90% of user pairs, both users are located in the same
"State"
75% have their distance < 40 miles
Smaller user population in the same nearby area, higher
chance of encounter
Active people have higher chances of meeting
New people -> New Posts does not work here
Users disengage
New users make 20% content
8.T.6
Gephi is basically a graph visualization tool
You can give nodes and edges as csv
Stats tab
Filters tab
Range for Followers, Out degree
Intersection to combine filters
Network Diameter
Density
Modularity
Page Rank
Connected Components
Layout allows you to visualize in different ways
Node Size according to data
Color according to Modularity
Direct Selection
Rectangle Selection
Drag Tool
Painter Tool
Node Pencil
Edge Pencil
Edit tool
Camera Button
Preview Tab
Select in Data Laboratory
Show in overview button
Tag cloud
Export filtered to new workspace
9.1
Location Based services
1. Foursquare
2. Yelp
3. Gowalla
4. Facebook
5. Twitter
Perceptions in OSM and Mobile Network
Mayorship is an incentive
pleaserobme.com looked at tweets: if a user talks about a
location X while belonging to another location, it means
they are travelling
People have designed cities based on data from Foursquare
Badges and mayorships: gamification of apps
Users can post tips, can serve as feedback
done or to-do
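The pleaserobme.com observation above can be sketched as a naive check: if a post mentions a known place that is not the user's home, the user is probably away. The place list, the substring matching and the function names are all assumptions for illustration; this is not how the actual site worked.

```python
def mentioned_places(text, known_places):
    """Known place names that appear in the text (case-insensitive)."""
    lowered = text.lower()
    return {place for place in known_places if place.lower() in lowered}


def looks_like_travelling(home, text, known_places):
    """True if the text mentions a known place other than home."""
    away = {place for place in mentioned_places(text, known_places)
            if place.lower() != home.lower()}
    return bool(away)


# Hypothetical gazetteer of place names.
KNOWN_PLACES = ["Delhi", "Mumbai", "Bengaluru"]

print(looks_like_travelling("Delhi", "Just landed in Mumbai!", KNOWN_PLACES))
print(looks_like_travelling("Delhi", "Stuck in Delhi traffic", KNOWN_PLACES))
```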
9.2
we are interested in done, to-do and mayorship
yahoo/geo/placefinder
few users have many mayorships and most have only one
few cities have many mayorships but most have very few
Some found correlation between number of mayorships and
number of tips and dones
New York is common in tips and dones
There are chances that there are mayorships but no tips
Dones are sparser
Lots of tips are generated 1hr apart
70% of the users have average distance of 150km
10% have 6000km
Frequency is 24hrs in most cases
Take all users, and classify
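A sketch of how an average distance per user, like the 150km figure above, could be computed from a sequence of tip locations. Using the haversine great-circle formula over consecutive points is an assumption; the notes do not say how distance was measured.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def average_consecutive_distance_km(points):
    """Mean distance between consecutive (lat, lon) points; 0.0 if < 2 points."""
    if len(points) < 2:
        return 0.0
    legs = [haversine_km(*p, *q) for p, q in zip(points, points[1:])]
    return sum(legs) / len(legs)


# Made-up coordinates: three points along the equator, one degree apart.
tips = [(0.0, 0.0), (0.0, 1.0), (0.0, 2.0)]
print(average_consecutive_distance_km(tips))
```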
9.T.7
python-highcharts
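from highcharts import Highchart

chart = Highchart()
options = {
    'chart': {
        'type': "bar"
    },
    'title': {
        'text': "Highchart bar"
    },
    'legend': {
        'enabled': True
    },
    'xAxis': {
        'categories': ['User 1', 'User 2']
    },
    'yAxis': {
        'title': {
            'text': "Number of Followers"
        }
    },
}
# apply the options to the chart
chart.set_dict_options(options)
# save_file appends ".html" to the given name
chart.save_file('bar-chart')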
10.1
Location based on other social networks
Twitter is the highest
0.5% of users were geotagged
reproducibility of the research