Detection of Internet Scam Using Logistic Regression: Mehrbod Sharifi Eugene Fink Jaime G. Carbonell
Mehrbod Sharifi
mehrbod@cs.cmu.edu
Eugene Fink
eugenefink@cmu.edu
Jaime G. Carbonell
jgc@cs.cmu.edu
I. INTRODUCTION
- Spam emails.
- User-generated content on generally benign websites, such as blog comments and user reviews.
- Online advertisement: ad networks with banners or textual ads; classified ads on Craigslist, eBay, Amazon, and other similar websites.
- Outside the Internet: hard mail; TV and radio ads.
Approach
Source
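The dataset is a set of labeled websites. Let $y_i \in \{0, 1\}$ be the label for the $i$-th website, represented by the feature vector $x_i$, where $y_i = 1$ marks a scam. In logistic regression, the probability of being scam is:

$$P(y_i = 1 \mid x_i) = \frac{1}{1 + \exp(-w^\top x_i)},$$

where $w$ is the vector of feature weights.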
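As a minimal sketch of the logistic model, the function below maps a weighted sum of feature values to a probability between 0 and 1. The function name `scam_probability`, the optional `bias` term, and the example weights and feature values are all hypothetical; they are not taken from the experiments.

```python
import math


def scam_probability(features, weights, bias=0.0):
    """Return P(scam | features) under a logistic regression model."""
    # Linear score: weighted sum of the feature values plus an intercept.
    score = bias + sum(w * x for w, x in zip(weights, features))
    # Numerically stable evaluation of the logistic (sigmoid) function.
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    exp_score = math.exp(score)
    return exp_score / (1.0 + exp_score)


# Hypothetical two-feature website with hypothetical weights.
example_weights = [0.8, -1.2]
example_features = [1.0, 0.5]
probability = scam_probability(example_features, example_weights)
```

A score of zero gives a probability of exactly 0.5, so the sign of the score decides which side of the default decision threshold a website falls on.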
Feature Name                                 Definition

Google (google.com)
  search_result_count
Alexa (alexa.com)
  reviews_count
  Rating
  traffic_rank
  us_traffic_rank
  sites_linking_in
IP Info (ipinfodb.com)
  latitude, longitude                        Server coordinates.
  country_code                               Country of the server.
  ip_count                                   Number of IP addresses for this domain name.
Whois (internic.net/whois.html)
  country_code
  created_days, updated_days, expires_days
Wikipedia (wikipedia.org)
  years_in_business, company_revenue, employees
                                             Company's capitalization.
Google Safebrowsing (code.google.com/apis/safebrowsing)
  safe
Truste (truste.com)
  member
Compete (compete.com)
  unique_monthly_visitors
  monthly_visit
  traffic_rank

  site_is_good, site_is_spam, malware, pop_ups, scam, bad_shopping_experience, browser_exploits
  total_comments
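The features above come from several independent sources, so a given website may lack some of them. The sketch below shows one way to turn a partially filled record into a fixed-length numeric vector. The chosen feature subset, the helper name `to_feature_vector`, and the use of 0.0 for missing values are assumptions of this sketch; the text does not state how missing values are handled.

```python
# A hypothetical subset of the feature names, in a fixed order.
FEATURE_ORDER = [
    "search_result_count",
    "traffic_rank",
    "sites_linking_in",
    "ip_count",
    "created_days",
]


def to_feature_vector(record, feature_order=FEATURE_ORDER, missing_value=0.0):
    """Convert a dict of available features into a fixed-length list of floats."""
    vector = []
    for name in feature_order:
        value = record.get(name)
        # Assumption: a feature that is absent (or None) is replaced by a default.
        vector.append(missing_value if value is None else float(value))
    return vector


# Hypothetical website for which only two features could be collected.
record = {"traffic_rank": 125000, "ip_count": 2}
vector = to_feature_vector(record)
```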
Experiments
We tested the technique using ten-fold cross-validation on each dataset; Table 3 summarizes the results. We used the following four performance measures.
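For readers unfamiliar with the protocol, ten-fold cross-validation can be sketched as follows: the examples are shuffled once, divided into ten folds, and each fold serves once as the test set while the other nine are used for training. The helper name `k_fold_indices`, the fixed seed, and the example size of 25 are hypothetical choices for illustration.

```python
import random


def k_fold_indices(n_examples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    # Fold j takes every k-th shuffled index, so fold sizes differ by at most one.
    folds = [indices[j::k] for j in range(k)]
    for j in range(k):
        test = folds[j]
        train = [i for m, fold in enumerate(folds) if m != j for i in fold]
        yield train, test


# Hypothetical dataset of 25 websites split into ten folds.
splits = list(k_fold_indices(25, k=10))
```

Each example appears in exactly one test fold, so every website is scored once by a model that never saw it during training.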
Table 3 values:
F1: 0.9749, 0.9923, 0.9809, 0.9583, 0.9795, 0.9803, 0.9534
AUC: 0.9667, 0.9990, 0.9858, 0.9582
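F1 and AUC appear among the reported measures. The sketch below computes both from scratch on a tiny hypothetical example: F1 as the harmonic mean of precision and recall, and AUC as the fraction of (scam, non-scam) pairs in which the scam receives the higher score, with ties counted as one half. The labels, scores, and the 0.5 threshold are illustrative assumptions, not data from the experiments.

```python
def f1_score(labels, predictions):
    """F1 for the positive class (label 1), given hard 0/1 predictions."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auc(labels, scores):
    """Probability that a random positive outscores a random negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Hypothetical labels (1 = scam) and predicted scam probabilities.
labels = [1, 1, 0, 0, 1]
scores = [0.9, 0.4, 0.35, 0.8, 0.7]
predictions = [1 if s >= 0.5 else 0 for s in scores]
example_f1 = f1_score(labels, predictions)
example_auc = auc(labels, scores)
```

The pairwise AUC is quadratic in the number of examples, which is acceptable for a sketch; a rank-based computation would be the usual choice for large datasets.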
[Figures: performance plots with AUC as an axis label. Legend entries: Random, Search, Traffic. AUC values shown: .999, .767, .500 and .844, .940, .986.]
Negative-Weight Features    Weight     % with value
latitude                    -0.2265     7%
site_is_good                -0.0494    32%
child_safety                -0.0314    58%
vendor_reliability          -0.0147    64%
total_positive_comments     -0.0144    35%
trustworthiness             -0.0087    65%
privacy                     -0.0079    64%
total_negative_comments     -0.0011    48%
created_days                -0.0005    36%
employees                   -0.0002     5%

Positive-Weight Features    Weight     % with value
traffic_rank                 0.2190    61%
search_result_count          0.2092    70%
country_code                 0.0494    51%
malware                      0.0135    26%
updated_days                 0.0021    36%
expires_days                 0.0002    36%
ip_count                     0.0001    66%
scam                         0.0001    28%
reviews_count                0.0001    26%
site_is_spam                 0.0001    29%
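To illustrate how the sign and size of a weight affect a prediction, the sketch below multiplies a few of the listed weights by feature values and sorts the resulting contributions to the log-odds of scam. The weights are copied from the tables; the feature values are hypothetical, and the premise that inputs are scaled to comparable ranges is an assumption of this sketch, since the scaling used in the experiments is not stated here.

```python
# A subset of the learned weights, copied from the weight tables.
WEIGHTS = {
    "traffic_rank": 0.2190,
    "search_result_count": 0.2092,
    "malware": 0.0135,
    "site_is_good": -0.0494,
    "created_days": -0.0005,
}


def contributions(feature_values, weights=WEIGHTS):
    """Return (feature, weight * value) pairs, largest absolute effect first."""
    pairs = [
        (name, weights[name] * value)
        for name, value in feature_values.items()
        if name in weights
    ]
    return sorted(pairs, key=lambda item: abs(item[1]), reverse=True)


# Hypothetical website with feature values assumed to be scaled to [0, 1].
site = {"traffic_rank": 1.0, "site_is_good": 1.0, "malware": 0.0}
ranked = contributions(site)
log_odds = sum(value for _, value in ranked)
```

Positive contributions push the prediction toward scam and negative ones away from it, so in this hypothetical case the poor traffic rank outweighs the favorable `site_is_good` signal.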