2020 6th International Symposium on System and Software Reliability (ISSSR)

IP Geolocation Method Based on Neighbor IP Sequences

Yong Gan
Zhengzhou Institute of Engineering and Technology
Zhengzhou, China
e-mail: ganyong@zzuli.edu.cn

Helin Zhang
School of Computer and Communication Engineering
Zhengzhou University of Light Industry
Zhengzhou, China
e-mail: zhanghelin5460@foxmail.com

Yuanbo Liu
School of Computer and Communication Engineering
Zhengzhou University of Light Industry
Zhengzhou, China
e-mail: zzulilyb@163.com

Lei He
School of Computer and Communication Engineering
Zhengzhou University of Light Industry
Zhengzhou, China
e-mail: helei@zzuli.edu.cn

Abstract—IP geolocation aims to determine the geographic location of an Internet host, which can improve the performance and security of Internet applications and enable novel services. Existing methods usually rely on measurement-based or data-analysis-based techniques and rarely consider the relationship between IP addresses. This paper proposes an IP geolocation approach based on neighbor sequences. Considering the aggregation characteristics of IP addresses, an IP address location library and mobile traffic data are taken as the original data. The neighbor sequences of IP addresses are first calculated and converted into the corresponding latitude and longitude sequences, and then a model is built and solved. Experimental results on IP addresses in Henan Province, China show that the geographic location of an IP address can be determined from its neighbor IP sequences with a mean error between 20 and 30 kilometers. The method can also be combined with other methods to obtain better results.

Keywords: IP geolocation; neighbor sequences; latitude and longitude

I. INTRODUCTION

When a user device does not provide GPS positioning, IP address geolocation is the preferred method for determining the location of the user. As the Internet user base grows, a large number of Internet services need to obtain the user's location information. Accurately determining a user's geographic location has therefore become a significant issue for Internet applications.

Existing IP address geolocation technologies can be divided into two categories: 1) Methods based on data analysis, which use non-measurement techniques and obtain the geographic location corresponding to an IP address by processing and analyzing WHOIS databases, web pages and other related data. 2) Measurement-based methods, whose principle is to obtain the network topology, route hops, or delay to the target through traceroute and similar probes, and to infer the geographic location of the IP address from them. Such methods can be further subdivided into methods based on space theory and methods based on probability estimation [1], mainly including GeoPing [2], TBG [3], Posit [4] and so on.

In IP address geolocation based on data analysis, the data can be divided into structured, semi-structured, and unstructured data according to its degree of structure. Registration and filing records are common structured data containing IP addresses (segments) and their corresponding geographic information [5], such as WHOIS databases, DNS LOC records, DNS names, etc. These data can be used to infer the geographic location of IP addresses.

Wang et al. [1] divided geolocation methods based on registration records into three categories: 1) inference from network structure and database information; 2) querying the WHOIS database; 3) measuring hostnames and inferring from database information. Geolocation based on registration records may deviate considerably, because some large entities have multiple servers scattered in different places while the domain name is registered to a single address [4]. In semi-structured data, geographic information may exist as an attribute of an object in the data. Dan et al. [5] extracted IP addresses and GPS coordinates from mobile device search engine logs to construct the most extensive ground truth set to date, with 8.4 million IP address geolocation records. Web pages are the most common unstructured data. Text mining [6], data mining, and statistics [7] can be used to extract geographic information from massive numbers of Web pages and associate it with specific IP addresses (segments) to achieve IP geolocation. Guo et al. [8] proposed the Structon method, which combines Web mining, inference and IP traceroute to reach a more accurate IP address location. Backstrom et al. [9] proposed a method based on the social graph, which uses the positions of a user's friends to determine the user's location, and achieved good positioning results.

978-0-7381-0497-3/20/$31.00 ©2020 IEEE


DOI 10.1109/ISSSR51244.2020.00016
Authorized licensed use limited to: HKBK College of Engineering. Downloaded on October 04,2024 at 04:37:32 UTC from IEEE Xplore. Restrictions apply.
Guo et al. [8] found that network administrators tend to assign consecutive IP address segments to the same area or to similar locations, which means that consecutive IP addresses tend to cluster geographically, that is, consecutive IP addresses tend to be geographic neighbors. This paper refers to this assumption as the aggregation characteristics of IP addresses. Under this assumption, if the geographic locations of other IP addresses neighboring an IP address of unknown location can be obtained, the physical location range of the target IP address can be inferred from the locations of these neighbor IP addresses.

Based on the above assumption, this paper proposes an IP address geolocation method based on neighbor sequences. The neighbor sequences of the IP address are first calculated and converted to the corresponding latitude and longitude, and then a model is built and solved. This paper uses the following two types of data to verify the method: 1) an open-source IP positioning database [10], which contains IP address segments and their corresponding administrative division addresses; 2) experimental data of a province in China sampled from the education and scientific research network. This data is the network traffic of mobile applications; the traffic contains IP addresses together with latitude and longitude, which gives a correspondence between them. The experimental results show that the geographic location of an IP address can be determined from its neighbor IP sequences with an average positioning error of 20-30 kilometers, which achieves positioning at the district and county level. This method provides a new solution to the problem of IP address geolocation. It can also be combined with other methods based on measurement or data analysis to obtain better results. The basic principle of the IP address geolocation method based on neighbor sequences proposed in this paper is shown in Figure 1.

[Figure 1 diagram: from IP address data that contains geographic information, each neighbor IP address is mapped to its latitude and longitude; these pairs are used to train a model, and the model maps the target IP address to a latitude and longitude.]

Figure 1. The basic principle of geolocation

II. RELATED DEFINITIONS

The common representation of an IP address [11, 12] is four decimal numbers separated by ".". It is essentially a 32-bit binary integer, which can be expressed as $IP = b_{32} b_{31} \ldots b_1$, where each $b_i$ ($1 \le i \le 32$) is a binary digit 0 or 1, so mathematical operations can be performed on it.

In order to measure the proximity of IP addresses, this paper needs to define the distance between two IP addresses. The definition of IP distance should meet the following two conditions:
1) The distance between two identical IP addresses is 0;
2) Between different IP addresses, the longer the common prefix, the smaller the distance. This property ensures that the IP address distance within the same network is always smaller than the IP address distance between different networks.

Definition 1 (IP distance) IP distance is defined as:

$$d(p_1, p_2) = \begin{cases} 0, & p_1 = p_2 \\ \mathrm{length}(p_1 \oplus p_2), & \text{otherwise} \end{cases} \tag{1}$$

Here $\mathrm{length}(x)$ is the number of bits of the binary number $x$ without leading zeros. In the result of the XOR operation, bits that are equal in the two IP addresses become 0 and bits that differ become 1, so the common prefix of the two IP addresses is a run of zeros in the high-order bits of the result. Therefore, the length function measures the longest common prefix of the two IP addresses: the longer the longest common prefix, the smaller the function value. Definition 1 thus satisfies the two conditions required of an IP distance.

For example, for the three IP addresses listed in Table 1, $d(A,B)=4$ and $d(A,C)=24$. This indicates that A is closer to B than to C, which is consistent with their dotted-decimal notation.

TABLE I. EXAMPLES OF IP ADDRESSES AND THEIR BINARY FORM

Number   IP address        Binary form
A        115.158.71.170    01110011100111100100011110101010
B        115.158.71.161    01110011100111100100011110100001
C        115.84.129.64     01110011010101001000000101000000

Definition 2 (Proximity IP Set and Proximity IP Address) For a specified IP address $p$, the set of all IP addresses at distance $n$ from $p$ is denoted $S_{p,n}$ and is called the n-proximity IP set of $p$. The elements of $S_{p,n}$ are called the n-proximity IP addresses of $p$ and are denoted $N_{p,n}$. For a given IP address $p$, its neighbor IP set $S_{p,n}$ is uniquely determined. According to the definition of IP distance, the XOR of any element of $S_{p,n}$ with $p$ has exactly $n$ bits. This means that the high-order bits of these elements are the same as those of $p$, the $n$-th bit differs from bit $b_n$ of $p$, and the lowest $n-1$ bits $b_i$ ($1 \le i \le n-1$) can each be 0 or 1. Therefore, $S_{p,n}$ can be represented by the closed IP address range $[N1_{p,n}, N2_{p,n}]$, where $N1_{p,n} = b_{32} \ldots b_{n+1} \bar{b}_n 0 \ldots 0$ and $N2_{p,n} = b_{32} \ldots b_{n+1} \bar{b}_n 1 \ldots 1$, with $\bar{b}_n$ the complement of $b_n$.

Definition 3 (neighbor sequences) Given the minimum distance $n_s$ and maximum distance $n_e$ for the IP address $p$, the sequence of IP addresses $N_{p,n_s}, N_{p,n_s+1}, \ldots, N_{p,n_e}$ is called the neighbor sequence of $p$.
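Definitions 1 to 3 can be illustrated with a short sketch. It assumes Python's standard `ipaddress` module; the function names `ip_distance`, `proximity_range` and `neighbor_ranges` are hypothetical and are not taken from the paper.

```python
import ipaddress
from typing import Dict, Tuple


def ip_to_int(addr: str) -> int:
    """Convert a dotted-decimal IPv4 address to its 32-bit integer value."""
    return int(ipaddress.IPv4Address(addr))


def ip_distance(p1: str, p2: str) -> int:
    """Definition 1: number of bits of p1 XOR p2 without leading zeros.

    int.bit_length() of 0 is 0, which covers the p1 == p2 case.
    """
    return (ip_to_int(p1) ^ ip_to_int(p2)).bit_length()


def proximity_range(p: str, n: int) -> Tuple[str, str]:
    """Definition 2: closed range [N1, N2] of addresses at distance n from p."""
    if not 0 <= n <= 32:
        raise ValueError("n must be between 0 and 32")
    if n == 0:
        return p, p
    flipped = ip_to_int(p) ^ (1 << (n - 1))   # complement bit n (1 = lowest bit)
    low_mask = (1 << (n - 1)) - 1             # the n-1 free low-order bits
    n1 = flipped & ~low_mask                  # free bits all 0
    n2 = flipped | low_mask                   # free bits all 1
    return str(ipaddress.IPv4Address(n1)), str(ipaddress.IPv4Address(n2))


def neighbor_ranges(p: str, ns: int, ne: int) -> Dict[int, Tuple[str, str]]:
    """Definition 3: one proximity range per distance n in [ns, ne]."""
    return {n: proximity_range(p, n) for n in range(ns, ne + 1)}


print(ip_distance("115.158.71.170", "115.158.71.161"))  # 4
print(proximity_range("115.158.71.170", 4))
# ('115.158.71.160', '115.158.71.167')
```

Address B (115.158.71.161) falls inside the distance-4 range of address A, which matches $d(A,B)=4$. Representing each proximity set by its two endpoints keeps later lookups to a simple range comparison on integers.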

III. DESCRIPTION OF GEOLOCATION ALGORITHM

The goal of IP address positioning is that, given an IP address, the location information of its physical space can be quickly queried, including the precise longitude and latitude and the coarse-grained administrative address [13].

Based on the aggregation characteristics of IP addresses, this paper proposes a geolocation method for IP addresses based on neighbor sequences. This method infers the location range of the target IP address from the latitude and longitude of multiple neighbor IPs. To this end, a mapping table from IP address to latitude and longitude is first needed as the benchmark input of the algorithm.

We have two kinds of original data: 1) An open-source IP address location database, in which each record is a six-tuple (starting IP address, ending IP address, country, province, city, district), where the district (county) field may be empty; 2) The mobile application network traffic of a province in China extracted from the China Education and Research Network, which can be parsed to obtain triples of the form (IP address, longitude, latitude). Using Baidu's geocoding API [14], the administrative address of each record in the IP address location database is encoded into latitude and longitude to form a quadruple (starting IP address, ending IP address, longitude, latitude).

For mobile application network traffic, the records are aggregated by IP address so that each IP address gets a set of latitude and longitude points, and the maximum distance between these points is calculated. If the maximum distance is more than 20 kilometers, the IP address is considered to be dynamically allocated with scattered locations, and it is not considered in this paper. After discarding the data whose maximum distance exceeds 20 kilometers, the center of each remaining set of points is taken as the final latitude and longitude, and the number of aggregated records is stored. Figure 2 shows the flow chart of the algorithm.

[Figure 2 flowchart: the IPIPNET database is converted from addresses to latitude and longitude, giving (Start IP, End IP, Longitude, Latitude); the mobile traffic data is aggregated by address, giving (IP, longitude, latitude); neighbor sequences are then extracted and converted to latitude and longitude, and the model is built and solved.]

Figure 2. Flowchart of algorithms

A. Construct the Latitude and Longitude Corresponding to the Neighbor Sequences

According to Definition 2 and formula (1), for the target IP address $p$ and the IP distance $n$, the neighbor set $S_{p,n}$ is calculated as:

$$S_{p,n} = \begin{cases} [p, p], & n = 0 \\ [N1_{p,n}, N2_{p,n}], & \text{otherwise} \end{cases} \tag{2}$$

For each IP address $p$ in $S_a$, calculate the $n_e - n_s + 1$ neighbor IP sets $[N1_{p,n}, N2_{p,n}]$, then traverse $S_a$ and select an IP address $N_{p,n}$ that lies in each neighbor set. If no IP address satisfies the condition, traverse the set $S_b$ to find the IP address segment with the longest intersection with the neighbor set. Finally, map these neighbor IP addresses (segments) to the corresponding latitude and longitude. Table 2 lists the parameters of the algorithm and their descriptions. The algorithm is described as follows:

TABLE II. DESCRIPTION OF PARAMETERS

Parameter        Explanation
ns, ne           The minimum distance and maximum distance between the neighbor IP and the target IP
Np,n             IP address with distance n from IP address p
Sa, Sb           The processed mobile application network traffic data set and the IPIPNET data set, respectively
Strain, Stest    IP address training set and test set
T                The set of latitude and longitude sequences corresponding to neighbor sequences

Calculate the latitude and longitude of neighbor sequences:

Input: Sa, Sb, ns, ne
Output: Set T of latitude and longitude sequences corresponding to neighbor sequences

BEGIN
  T = Ø
  FOR p ∈ Sa DO
    R = Ø
    FOR n ← ns TO ne DO
      Calculate the neighbor IP set [N1p,n, N2p,n] at distance n from p, and select the IP set M of addresses in Sa that lie in the range [N1p,n, N2p,n]
      IF Size(M) = 1 THEN
        R[n] ← Latitude and longitude of the element of M
      ELSE IF Size(M) ≠ 0 THEN
        R[n] ← Latitude and longitude of the element of M with the largest aggregation number
      ELSE
        Select the IP segment L in Sb with the longest intersection with [N1p,n, N2p,n]
        R[n] ← Latitude and longitude of L
      END
    END
    T[p] ← R
  END
  RETURN T
END

B. Deduce the Latitude and Longitude of the Target Address

Based on the above process, the IP address location problem can be transformed into using the latitude and longitude corresponding to the neighbor sequences to infer the latitude and longitude of the target IP address. Taking the longitude and latitude sequence $X_i$ as input and the longitude and latitude $\hat{Y}_i$ of the target IP address as output:

$$\hat{Y}_i = f(X_i) \tag{3}$$

The actual task is to estimate the function $f$. If $Y$ and $X$ are linearly related, then:

$$\hat{Y}_i = \omega^{T} X_i \tag{4}$$

where $\hat{Y}_i$ is a 2-dimensional row vector, $\omega$ is an $(n_e - n_s + 1)$-dimensional column vector, and $X_i$ is an $(n_e - n_s + 1) \times 2$ matrix. In order to learn this function, the loss function is defined as:

$$\mathrm{Loss}(X) = \sum_{i=1}^{m} (\hat{Y}_i - Y_i)(\hat{Y}_i - Y_i)^{T} \tag{5}$$

where $Y_i$ is the true value for the $i$-th set of data $X_i$. The required model is:

$$\omega = \arg\min_{\omega} \mathrm{Loss}(X) \tag{6}$$

This paper chooses mini-batch gradient descent [15] to solve the model. It is faster than batch gradient descent and more accurate than stochastic gradient descent.

The main parameters of the algorithm are the minimum distance $n_s$ and the maximum distance $n_e$ of the neighbor sequences. When the IP distance is large, there is no significant correlation between the geographic locations of the two addresses. Conversely, when the IP distance is small enough, both addresses can be assumed to be in the same small LAN. However, when determining the geographic location of the target IP address, one cannot consider only neighbors at a sufficiently small distance: IP address pairs at very small distances from each other do not appear in the data in large numbers, and relying on such occasional data would lead to overfitting of the model.

IV. EXPERIMENTAL RESULTS AND ANALYSIS

Based on the existing data, this paper selects different $n_s$ and $n_e$ for comparison. Considering that shorter neighbor sequences are more susceptible to noisy data, the length of the neighbor sequences selected in this paper is 8-16. Table 3 lists the accuracy of the algorithm under different parameters.

TABLE III. PERFORMANCE COMPARISON OF ALGORITHMS WITH DIFFERENT PARAMETERS

ns   ne   Average error / km   Median error / km   Variance   Minimum error / km   Maximum error / km
8    16   26.86                16.53               827.17     0.35                 210.58
8    20   26.47                14.45               931.25     0.37                 245.30
8    24   27.97                17.15               1195.51    0.40                 266.90
9    16   27.87                18.76               1048.57    0.07                 245.22
9    20   26.51                17.99               1045.82    0.11                 238.36
9    24   28.44                18.74               1045.85    0.49                 254.11
10   17   28.75                14.39               1182.65    0.22                 257.48
10   20   27.45                17.35               877.75     0.55                 223.62
10   24   30.17                17.46               1111.18    0.15                 232.11
11   18   28.31                19.12               955.34     0.07                 220.86
11   20   31.06                23.08               969.17     0.18                 226.94
11   24   29.35                21.06               952.20     0.10                 223.86
12   19   32.94                24.04               1059.94    0.65                 256.11
12   20   32.70                23.92               1079.66    0.55                 225.44
12   24   32.20                21.72               1139.81    0.12                 236.60

It can be seen from Figure 3 and Figure 4 that, under different parameters, the error distribution follows the same pattern: as the error distance increases, the number of errors decreases rapidly. In addition, although the rate of this decrease differs between parameter settings, the accuracy of the algorithm does not differ by orders of magnitude. This indicates that, within the selected ranges of $n_s$ and $n_e$, the physical location of the neighbor sequences is strongly correlated with the physical location of the target IP address.

However, it can still be seen that with $n_s=8$ and $n_e=20$, compared with $n_s=8$ and $n_e=24$, the number of errors at a given distance is larger when the error distance is small and smaller when the error distance is large. These two points make the average error and median error of the algorithm smaller when $n_s=8$ and $n_e=20$, thus making the algorithm more accurate.
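The fitting step of Eqs. (4) to (6) and the error statistics of the kind reported in Table III can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the data is synthetic, coordinates are small offsets in degrees from a reference point so that a fixed learning rate stays stable, the variance is taken as the population variance, and all names (`fit_minibatch`, `haversine_km`, `error_summary`, `make_toy_data`) and hyperparameters are hypothetical.

```python
import math
import random
import statistics
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]  # (longitude, latitude) in degrees


def predict(w: Sequence[float], x: Sequence[Point]) -> Point:
    """Eq. (4): the prediction is a weighted sum of the neighbor coordinates."""
    lon = sum(wk * pt[0] for wk, pt in zip(w, x))
    lat = sum(wk * pt[1] for wk, pt in zip(w, x))
    return lon, lat


def fit_minibatch(xs, ys, lr=0.1, epochs=200, batch_size=8, seed=0) -> List[float]:
    """Minimise the squared loss of Eq. (5) with mini-batch gradient descent."""
    rng = random.Random(seed)
    k = len(xs[0])
    w = [1.0 / k] * k  # start from the plain average of the neighbors
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)
        for start in range(0, len(order), batch_size):
            batch = order[start:start + batch_size]
            grad = [0.0] * k
            for i in batch:
                p_lon, p_lat = predict(w, xs[i])
                e_lon, e_lat = p_lon - ys[i][0], p_lat - ys[i][1]
                for j, (n_lon, n_lat) in enumerate(xs[i]):
                    grad[j] += 2.0 * (e_lon * n_lon + e_lat * n_lat)
            w = [wj - lr * gj / len(batch) for wj, gj in zip(w, grad)]
    return w


def haversine_km(a: Point, b: Point) -> float:
    """Great-circle distance in km, assuming a mean Earth radius of 6371 km."""
    lon1, lat1, lon2, lat2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def error_summary(errors_km: Sequence[float]) -> Dict[str, float]:
    """Average, median, variance, minimum and maximum of the error distances."""
    return {
        "average": statistics.mean(errors_km),
        "median": statistics.median(errors_km),
        "variance": statistics.pvariance(errors_km),  # population variance (assumption)
        "minimum": min(errors_km),
        "maximum": max(errors_km),
    }


def make_toy_data(m: int = 64, seed: int = 1):
    """Synthetic samples: three neighbors each, target generated by known weights."""
    rng = random.Random(seed)
    true_w = [0.5, 0.3, 0.2]
    xs = [[(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in true_w] for _ in range(m)]
    ys = [predict(true_w, x) for x in xs]
    return xs, ys, true_w


xs, ys, true_w = make_toy_data()
w = fit_minibatch(xs, ys)
errors = [haversine_km(predict(w, x), y) for x, y in zip(xs, ys)]
summary = error_summary(errors)
print([round(v, 3) for v in w], summary)
```

On real data the weights would be fitted on a training set and the summary computed on a held-out test set, and the learning rate would need tuning or the inputs rescaling, since raw longitude and latitude values are much larger than the toy offsets used here.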

49

Authorized licensed use limited to: HKBK College of Engineering. Downloaded on October 04,2024 at 04:37:32 UTC from IEEE Xplore. Restrictions apply.
Figure 3. Error distribution of algorithm when ns=8 and ne=20

Figure 4. Error distribution of algorithm when ns=8 and ne=24

It can be observed from Table 3 and Figures 3 and 4 that $n_s$ and $n_e$ affect the accuracy of the algorithm. In actual use, different data sets have different distributions, so the specific values of $n_s$ and $n_e$ should be determined experimentally, after studying the distribution of the data, to obtain the best positioning performance.

The average error of the algorithm is 20-30 kilometers, and the median error is about 20 kilometers, which basically achieves positioning at the district and county level. Table 4 lists the positioning results of the algorithm and the results given by other IP address libraries. It can be seen that the results given by the algorithm have a finer granularity, and the accuracy is comparable to that of each database.

TABLE IV. GEOLOCATION RESULTS

IP address        Algorithm results               Baidu                           ChunZhen
117.158.221.215   Zhongyuan District, Zhengzhou   Zhongyuan District, Zhengzhou   Zhengzhou
219.157.12.36     Luolong District, Luoyang       Hebi                            Luoyang
218.196.192.25    Longting District, Kaifeng      Kaifeng                         Kaifeng
42.239.250.215    Qibin District, Hebi            Hebi                            Hebi
182.127.208.2     Xiping County, Zhumadian        Xiping County, Zhumadian        Zhengzhou
61.158.128.9      Luolong District, Luoyang       Kaifeng                         Luoyang
175.106.255.25    Hualong District, Puyang        Hualong District, Puyang        Puyang
221.176.240.15    Qibin District, Hebi            Qibin District, Hebi            Hebi

V. CONCLUSION

Regarding the geographic location of IP addresses, existing data-analysis-based methods pay more attention to the extended attributes of IP addresses, such as administrative division addresses and domain name registration records, and pay less attention to the relationship between IP addresses. This paper focuses on the relationship between IP addresses. According to the aggregation characteristics of IP addresses, a geolocation method for IP addresses based on neighbor sequences is proposed. First, the neighbor sequences of the IP address are selected and converted to the corresponding latitude and longitude sequences; then a linear model is fitted to the latitude and longitude sequences, and the parameter that minimizes the loss function is selected as the model. Experiments show that, with appropriate parameters, this method can give a fairly accurate physical location for an IP address. Limited by the data size of the experiment, the positioning error of this experiment is 20-30 kilometers on average. If the amount of data reaches the million level, this method is expected to limit the average positioning error to within 10 kilometers.

Starting from the relationship between IP addresses, this method considers the aggregation characteristics of IP addresses, provides a new idea for the IP address geolocation problem, and is a strong complement to existing IP address geolocation technology. For data-analysis-based methods, if the relationship between IP addresses is introduced on top of the extended attributes of IP addresses, a large amount of information is added and more features become available; the model then has more parameters and greater expressive power, so better results can be expected. Accurate IP address positioning results can effectively support related Internet industries, such as network security and advertising. This work is based on IPv4 addresses. Although IPv6 addresses are not widely used at present, they are growing rapidly. The 37th China Internet Development Report [16] shows that from 2014 to 2015 the annual growth rate of the number of IPv4 addresses in China reached 9.6%. The next step of this work will be dedicated to the geographic positioning of IPv6 addresses.
ACKNOWLEDGMENT

This work is supported by the National Natural Science Foundation of China (NSFC) (No. 61572445) and the NSFC Joint Fund Key Project (No. U1804263).

REFERENCES

[1] Jinxia, W., Xiaoyan, X., Min, Y., & Tianning, Z. (2016). IP Geolocation Technology Research Based on Network Measurement. Sixth International Conference on Instrumentation & Measurement. IEEE.
[2] Padmanabhan, V. N., & Subramanian, L. (2001). An investigation of geographic mapping techniques for internet hosts. ACM SIGCOMM Computer Communication Review.
[3] Katz-Bassett, E., John, J. P., Krishnamurthy, A., Wetherall, D., & Chawathe, Y. (2006). Towards IP geolocation using delay and topology measurements. ACM SIGCOMM Conference on Internet Measurement. ACM.
[4] Eriksson, B., Barford, P., Maggs, B., & Nowak, R. (2012). Posit: a lightweight approach for IP geolocation. ACM SIGMETRICS Performance Evaluation Review, 40(2), 2-11.
[5] Ovidiu Dan, Vaibhav Parikh, & Brian D. Davison. (2016). Improving IP geolocation using query logs.
[6] Feitosa, R. M., Labidi, S., Santos, A. L. S. D., & Santos, N. (2013). Social Recommendation in Location-Based Social Network Using Text Mining. International Conference on Intelligent Systems. IEEE Computer Society.
[7] Yuan, T., Cheng, J., Zhang, X., Liu, Q., & Lu, H. (2013). A Weighted One Class Collaborative Filtering with Content Topic Features. International Conference on Multimedia Modeling. Springer Berlin Heidelberg.
[8] Guo, C., Liu, Y., Shen, W., Wang, H. J., & Zhang, Y. (2009). Mining the Web and the Internet for Accurate IP Address Geolocations. Infocom. IEEE.
[9] Backstrom, L., Sun, E., & Marlow, C. (2010). Find me if you can: Improving geographical prediction with social and spatial proximity. Proceedings of the 19th International Conference on World Wide Web, WWW 2010, Raleigh, North Carolina, USA, April 26-30, 2010. ACM.
[10] Poese, I., Uhlig, S., Mohamed Ali Kâafar, Donnet, B., & Gueye, B. (2011). IP geolocation databases: unreliable?. ACM SIGCOMM Computer Communication Review, 41(2), 53-56.
[11] Tetsuya, S., & Hiroyuki, O. (2018). IP address assignment device and IP address assignment method.
[12] Almohri, H. M. J., Watson, L. T., & Evans, D. (2019). Predictability of IP address allocations for cloud computing platforms. IEEE Transactions on Information Forensics and Security, PP(99).
[13] Zu, S., Luo, X., Liu, S., Liu, Y., & Liu, F. (2018). City-level IP geolocation algorithm based on PoP network topology. IEEE Access, PP, 1-1.
[14] http://lbsyun.baidu.com/index.php?title=webapi/guide/webservice-geocoding
[15] Khirirat, S., Feyzmahdavian, H. R., & Johansson, M. (2017). Mini-batch gradient descent: Faster convergence under data sparsity. IEEE Conference on Decision & Control. IEEE.
[16] CNNIC. The 37th Statistical Report on Internet Development in China. http://www.cnnic.net.cn.
