IP Geolocation Method Based On Neighbor IP Sequences
Abstract—IP geolocation aims to determine the geographic location of an Internet host, which can improve the performance and security of Internet applications and enable novel services. Existing methods usually rely on measurement-based or data analysis-based technology and rarely consider the relationship between IP addresses. This paper proposes an IP geolocation approach based on neighbor sequences. Considering the aggregation characteristics of IP addresses, an IP address location library and mobile traffic data are taken as the original data. The neighbor sequences of IP addresses are first calculated, then converted into the corresponding latitude and longitude sequences, and finally a model is built and solved. Experimental results on IP addresses in Henan Province, China show that the geographic location of an IP address can be determined from its neighbor IP sequences with a mean error between 20 and 30 kilometers. The method can also be combined with other methods to obtain better results.

Keywords: IP geolocation; neighbor sequences; latitude and longitude

I. INTRODUCTION

When a user device does not provide a GPS position, IP address geolocation is the preferred method for determining the user's location. As the Internet user base grows, a large number of Internet services need to obtain the user's location information. How to accurately determine a user's geographic location has become a significant issue for Internet applications.

Existing IP address geolocation technologies can be divided into two categories: 1) Data analysis-based methods, which use non-measurement technology and obtain the geographic location corresponding to an IP address by processing and analyzing the WHOIS database, web pages and other related data. 2) Measurement-based methods, whose principle is to obtain the network topology, route hops, or delay to the target through traceroute, and from these to infer the geographic location of the IP address. Such methods can be further subdivided into methods based on space theory and methods based on probability estimation [1], mainly including GeoPing [2], TBG [3], Posit [4] and so on.

In data analysis-based IP geolocation, the data can be divided into structured, semi-structured, and unstructured data according to its degree of structure. Registration and filing records are common structured data containing IP addresses (segments) and their corresponding geographic information [5], such as the WHOIS database, DNS LOC records, DNS names, etc. These data can be used to infer the geographic location of IP addresses.

Wang et al. [1] divided geolocation methods based on registration records into three categories: 1) inferring from network structure and database information; 2) querying the WHOIS database; 3) measuring hostnames and inferring based on database information. Geolocation based on registration records may deviate widely, because some large entities have multiple servers scattered in different places while the domain name is registered to a single address [4]. In semi-structured data, geographic information may exist as an attribute of an object in the data. Dan et al. [5] extracted IP addresses and GPS coordinates from mobile device search engine logs to construct the largest ground-truth set to date, with 8.4 million IP address geolocation records. Web pages are the most common unstructured data. Text mining [6], data mining, and statistics [7] can be used to extract geographic information from massive numbers of Web pages and associate it with specific IP addresses (segments) to achieve IP geolocation. Guo et al. [8] proposed the Structon method, which combines Web mining, inference and IP traceroute to reach a more accurate IP address location. Backstrom et al. [9] proposed a method based on the social graph, using the positions of a user's friends to determine the user's location, and achieved good positioning results.
IP address data that contains geographic information

A   115.158.71.170   01110011100111100100011110101010
B   115.158.71.161   01110011100111100100011110100001
C   115.84.129.64    01110011100000010111111101000000
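The binary column of rows A and B above can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; `to_bits` and `shared_prefix_len` are hypothetical helper names introduced here for illustration, and the shared-prefix length is shown only as one common way to quantify how close two addresses are, not as the paper's formal definition of IP distance.

```python
import ipaddress


def to_bits(ip: str) -> str:
    """Return the 32-bit binary string of a dotted-quad IPv4 address."""
    return format(int(ipaddress.IPv4Address(ip)), "032b")


def shared_prefix_len(ip_a: str, ip_b: str) -> int:
    """Number of leading bits two IPv4 addresses have in common (0..32)."""
    diff = int(ipaddress.IPv4Address(ip_a)) ^ int(ipaddress.IPv4Address(ip_b))
    # bit_length() is the position of the highest differing bit (0 if equal).
    return 32 - diff.bit_length()


if __name__ == "__main__":
    a, b = "115.158.71.170", "115.158.71.161"
    print(a, to_bits(a))
    print(b, to_bits(b))
    print("shared prefix bits:", shared_prefix_len(a, b))
```

Addresses A and B share a long prefix and differ only in the low-order bits, which is the kind of closeness the neighbor-sequence idea builds on.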
Authorized licensed use limited to: HKBK College of Engineering. Downloaded on October 04,2024 at 04:37:32 UTC from IEEE Xplore. Restrictions apply.
III. DESCRIPTION OF GEOLOCATION ALGORITHM

The goal of IP address positioning is that, given an IP address, the location of its physical space can be quickly queried, including the precise longitude and latitude and the coarse-grained administrative address [13].

Based on the aggregation characteristics of IP addresses, this paper proposes a geolocation method for IP addresses based on neighbor sequences. The method infers the location range of the target IP address from the latitude and longitude of multiple neighbor IPs. To this end, a mapping table from IP address to latitude and longitude is first needed as the benchmark input of the algorithm.

We have two kinds of original data: 1) An open-source IP address location database, in which each record is a six-tuple (starting IP address, ending IP address, country, province, city, district), where the district (or county) may be empty; 2) The mobile application network traffic of a province in China, extracted from the China Education and Research Network, which can be parsed to obtain triples of the form (IP address, longitude, latitude). Using Baidu's geocoding API [14], the administrative address of each record in the IP address location database is encoded into latitude and longitude to form a quadruple (starting IP address, ending IP address, longitude, latitude).

For mobile application network traffic, the records are aggregated by IP address, so that each IP address gets a set of latitude and longitude points, and the maximum distance between these points is calculated. If the maximum distance is more than 20 kilometers, the IP address is considered to be dynamically allocated across scattered locations and is not considered in this paper. After discarding the data with a maximum distance above 20 kilometers, the center of the remaining points of each IP address is taken as its final latitude and longitude, and the number of aggregated records is recorded. Figure 2 shows the flow chart of the algorithm.

[Figure 2: IPIPNET database → convert address to latitude and longitude → (Start IP, End IP, Longitude, Latitude); mobile traffic data → aggregate by address → (IP, longitude, latitude); both → extract neighboring sequences → convert to latitude and longitude → build and solve the model]

Figure 2. Flowchart of the algorithm

A. Construct the Latitude and Longitude Corresponding to the Neighbor Sequences

According to Definition 2 and Formula (1), for the target IP address p and the IP distance n, the neighbor set Sp,n is calculated as:

$$S_{p,n} = \begin{cases} [p,\ p], & n = 0 \\ [N1_{p,n},\ N2_{p,n}], & \text{otherwise} \end{cases} \tag{2}$$

For each IP address p in Sa, calculate the ne−ns+1 neighbor IP sets [N1p,n, N2p,n], then traverse Sa and select the IP addresses Np,n that fall in each neighbor set. If no IP address satisfies the condition, traverse the set Sb to find the longest IP address segment that intersects the neighbor set. Finally, map these neighbor IP addresses (segments) to the corresponding latitude and longitude. Table 2 lists the parameters of the algorithm and their descriptions.

TABLE II. DESCRIPTION OF PARAMETERS

Parameter       Explanation
ns, ne          The minimum and maximum distance between the neighbor IP and the target IP
Np,n            IP address at distance n from IP address p
Sa, Sb          The processed mobile application network traffic data set and the IPIPNET data set, respectively
Strain, Stest   IP address training set and test set
T               The set of latitude and longitude sequences corresponding to neighbor sequences

The algorithm is described as follows:

Calculate the latitude and longitude of neighbor sequences:

Input: Sa, Sb, ns, ne
Output: Set T of latitude and longitude sequences corresponding to neighbor sequences
BEGIN
  T = Ø
  FOR p ∈ Sa DO
    R = Ø
    FOR n ← ns TO ne DO
      Calculate the neighbor IP set [N1p,n, N2p,n] at distance n from p; select the set M of IPs in Sa within the range [N1p,n, N2p,n]
      IF Size(M) = 1 THEN
        R[n] ← Latitude and longitude of the element of M
      ELSE IF Size(M) ≠ 0 THEN
        R[n] ← Latitude and longitude of the element of M with the largest aggregation number
      ELSE
        Select the IP segment L in Sb with the longest intersection with [N1p,n, N2p,n]
        R[n] ← Latitude and longitude of L
      END IF
    END FOR
    T[p] ← R
  END FOR
  RETURN T
END

B. Deduce the Latitude and Longitude of the Target Address

Based on the above process, the IP address location problem can be transformed into using the latitude and longitude corresponding to the neighbor sequences to infer the latitude and longitude of the target IP address. Taking the longitude and latitude sequence $X_i$ as input and the predicted longitude and latitude $\hat{Y}_i$ of the target IP address as output:

$$\hat{Y}_i = f(X_i) \tag{3}$$

The actual task is to estimate the function f. If Y and X are linearly related, then:

$$\hat{Y}_i = \omega^T X_i \tag{4}$$

where $\hat{Y}_i$ is a 2-dimensional row vector, $\omega$ is an (ne−ns+1)-dimensional column vector, and $X_i$ is an (ne−ns+1)×2 matrix. To learn this function, the loss function is defined as:

$$Loss(X) = \sum_{i=1}^{m} (\hat{Y}_i - Y_i)(\hat{Y}_i - Y_i)^T \tag{5}$$

where $Y_i$ is the true value for the i-th set of data $X_i$. The required model is:

$$\omega = \arg\min_{\omega} Loss(X) \tag{6}$$

This paper chooses mini-batch gradient descent [15] to solve the model. It is faster than batch gradient descent and more accurate than stochastic gradient descent.

The main parameters of the algorithm are the minimum distance ns and the maximum distance ne of the neighbor sequences. When the IP distance is large, there is no significant correlation between the geographic locations of the two addresses. Conversely, when the IP distance is small enough, both addresses can be assumed to be in the same small LAN. However, when determining the geographic location of the target IP address, neighbor sequences with a sufficiently small distance cannot be the only ones considered: IP address pairs at very small distances from each other do not appear in the data in large numbers, and relying on such occasional data leads to overfitting of the model.

IV. EXPERIMENTAL RESULTS AND ANALYSIS

Based on the existing data, this paper selects different ns and ne for comparison. Considering that shorter neighbor sequences are more susceptible to noisy data, the length of the neighbor sequences selected in this paper is 8-16. Table 3 lists the accuracy of the algorithm under different parameters.

TABLE III. PERFORMANCE COMPARISON OF ALGORITHMS WITH DIFFERENT PARAMETERS

ns   ne   Average error / km   Median error / km   Variance   Minimum error / km   Maximum error / km
8    16   26.86                16.53               827.17     0.35                 210.58
8    20   26.47                14.45               931.25     0.37                 245.30
8    24   27.97                17.15               1195.51    0.40                 266.90
9    16   27.87                18.76               1048.57    0.07                 245.22
9    20   26.51                17.99               1045.82    0.11                 238.36
9    24   28.44                18.74               1045.85    0.49                 254.11
10   17   28.75                14.39               1182.65    0.22                 257.48
10   20   27.45                17.35               877.75     0.55                 223.62
10   24   30.17                17.46               1111.18    0.15                 232.11
11   18   28.31                19.12               955.34     0.07                 220.86
11   20   31.06                23.08               969.17     0.18                 226.94
11   24   29.35                21.06               952.20     0.10                 223.86
12   19   32.94                24.04               1059.94    0.65                 256.11
12   20   32.70                23.92               1079.66    0.55                 225.44
12   24   32.20                21.72               1139.81    0.12                 236.60

It can be seen from Figure 3 and Figure 4 that, under different parameters, the error distribution follows the same pattern: as the error distance increases, the number of errors decreases rapidly. In addition, although the rate of this decrease differs between parameter settings, the accuracy of the algorithm does not differ by orders of magnitude. This indicates that within the selected ns and ne ranges, the physical location of the neighbor sequences is strongly correlated with the physical location of the target IP address.

However, it can still be seen that with ns = 8 and ne = 20, compared with ns = 8 and ne = 24, there are more errors at small error distances and fewer errors at large error distances. These two points make the average error and median error of the algorithm smaller when ns = 8 and ne = 20, thus making the algorithm more accurate.
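The error columns reported in Table 3 (average, median, variance, minimum, maximum, in kilometers) can be computed from predicted and true coordinates roughly as follows. This is a sketch under stated assumptions: the paper does not say how the distance between two points is measured, so the haversine great-circle distance is used here, and the population variance is an assumption as well. The function names and the sample coordinates are hypothetical.

```python
import math
import statistics

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    h = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def error_summary(predicted, actual):
    """Summarize geolocation errors for paired lists of (lat, lon) tuples."""
    errors = [haversine_km(p[0], p[1], a[0], a[1])
              for p, a in zip(predicted, actual)]
    return {
        "average": statistics.fmean(errors),
        "median": statistics.median(errors),
        "variance": statistics.pvariance(errors),  # population variance (assumed)
        "minimum": min(errors),
        "maximum": max(errors),
    }


if __name__ == "__main__":
    # Hypothetical predictions and ground truth, for illustration only.
    predicted = [(34.75, 113.62), (34.62, 112.45), (35.30, 113.93)]
    actual = [(34.80, 113.66), (34.66, 112.43), (35.10, 113.80)]
    for name, value in error_summary(predicted, actual).items():
        print(f"{name}: {value:.2f}")
```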
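The linear model of Section III-B, Equations (4) to (6), can be fit with mini-batch gradient descent along the following lines. This is an illustrative sketch rather than the authors' implementation: the learning rate, batch size, epoch count, zero initialization and the synthetic data are all assumptions, and the function names are hypothetical. Each sample X is a list of (latitude, longitude) pairs, one per neighbor distance, and the prediction is their weighted sum.

```python
import random


def predict(omega, X):
    """Equation (4): weighted sum of the neighbor coordinates -> (lat, lon)."""
    lat = sum(w * row[0] for w, row in zip(omega, X))
    lon = sum(w * row[1] for w, row in zip(omega, X))
    return lat, lon


def loss(omega, Xs, Ys):
    """Equation (5): sum of squared residuals over all samples."""
    total = 0.0
    for X, Y in zip(Xs, Ys):
        lat, lon = predict(omega, X)
        total += (lat - Y[0]) ** 2 + (lon - Y[1]) ** 2
    return total


def fit_minibatch(Xs, Ys, lr=5e-6, batch_size=16, epochs=50, seed=0):
    """Minimize the loss over omega with mini-batch gradient descent."""
    rng = random.Random(seed)
    k = len(Xs[0])                # number of neighbor distances, ne - ns + 1
    omega = [0.0] * k             # zero initialization (assumed)
    order = list(range(len(Xs)))
    for _ in range(epochs):
        rng.shuffle(order)
        for start in range(0, len(order), batch_size):
            batch = order[start:start + batch_size]
            grad = [0.0] * k
            for i in batch:
                lat, lon = predict(omega, Xs[i])
                r_lat, r_lon = lat - Ys[i][0], lon - Ys[i][1]
                for j, row in enumerate(Xs[i]):
                    # d/d omega_j of the squared residual of sample i
                    grad[j] += 2.0 * (r_lat * row[0] + r_lon * row[1])
            omega = [w - lr * g / len(batch) for w, g in zip(omega, grad)]
    return omega


if __name__ == "__main__":
    # Synthetic data: three neighbors scattered around one point, and a
    # target located at their mean. Purely illustrative.
    rng = random.Random(1)
    Xs = [[(34.7 + rng.uniform(-0.05, 0.05), 113.6 + rng.uniform(-0.05, 0.05))
           for _ in range(3)] for _ in range(200)]
    Ys = [(sum(r[0] for r in X) / 3, sum(r[1] for r in X) / 3) for X in Xs]
    omega = fit_minibatch(Xs, Ys)
    print("omega:", omega)
    print("loss:", loss(omega, Xs, Ys))
```

Because raw latitude and longitude values are large, the stable learning rate is small and depends on the data and on the sequence length, so it would need tuning on real inputs.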
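The preprocessing of the mobile traffic data described in Section III (group records by IP address, discard addresses whose points are more than 20 kilometers apart, keep the center and the record count) can be sketched as below. The paper does not specify the distance formula or how the center is computed, so the haversine distance and the arithmetic mean of the coordinates are assumptions of this sketch; the function names and sample records are hypothetical.

```python
import math
from collections import defaultdict
from itertools import combinations


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    h = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def aggregate_by_ip(records, max_spread_km=20.0):
    """Aggregate (ip, longitude, latitude) triples per IP address.

    Returns {ip: (longitude, latitude, count)} for every IP whose points
    all lie within max_spread_km of each other; other IPs are dropped.
    """
    points = defaultdict(list)
    for ip, lon, lat in records:
        points[ip].append((lon, lat))

    result = {}
    for ip, pts in points.items():
        # Largest pairwise distance; 0 when the IP has a single point.
        spread = max(
            (haversine_km(a[1], a[0], b[1], b[0])
             for a, b in combinations(pts, 2)),
            default=0.0,
        )
        if spread > max_spread_km:
            continue  # treated as dynamically allocated and scattered
        lon_c = sum(p[0] for p in pts) / len(pts)
        lat_c = sum(p[1] for p in pts) / len(pts)
        result[ip] = (lon_c, lat_c, len(pts))
    return result


if __name__ == "__main__":
    # Hypothetical records using documentation-range addresses.
    records = [
        ("192.0.2.1", 113.60, 34.70),
        ("192.0.2.1", 113.62, 34.72),
        ("192.0.2.2", 113.60, 34.70),
        ("192.0.2.2", 114.60, 34.70),  # roughly 90 km away, so dropped
    ]
    print(aggregate_by_ip(records))
```

The recorded count is what the neighbor-selection step can later use as the aggregation number when several candidate neighbors fall in the same range.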
Figure 3. Error distribution of algorithm when ns=8 and ne=20

Figure 4. Error distribution of algorithm when ns=8 and ne=24

It can be observed from Table 3 and Figures 3 and 4 that ns and ne affect the accuracy of the algorithm. In actual use, different data sets have different distributions, so the specific values of ns and ne should be determined experimentally, after studying the distribution of the data, to obtain the best positioning performance.

The average error of the algorithm is 20-30 kilometers, and the median error is about 20 kilometers, which basically achieves positioning at the district and county level. Table 4 lists the positioning results of the algorithm and the results given by other IP address libraries. It can be seen that the results given by the algorithm have a finer granularity, and the accuracy is comparable to that of each database.

TABLE IV. GEOLOCATION RESULTS

IP address        Algorithm results              Baidu                          ChunZhen
117.158.221.215   Zhongyuan District, Zhengzhou  Zhongyuan District, Zhengzhou  Zhengzhou
219.157.12.36     Luolong District, Luoyang      Hebi                           Luoyang
218.196.192.25    Longting District, Kaifeng     Kaifeng                        Kaifeng
42.239.250.215    Qibin District, Hebi           Hebi                           Hebi
182.127.208.2     Xiping County, Zhumadian       Xiping County, Zhumadian       Zhengzhou
61.158.128.9      Luolong District, Luoyang      Kaifeng                        Luoyang
175.106.255.25    Hualong District, Puyang       Hualong District, Puyang       Puyang
221.176.240.15    Qibin District, Hebi           Qibin District, Hebi           Hebi

V. CONCLUSION

For IP address geolocation, existing data analysis-based methods pay more attention to the extension attributes of IP addresses, such as administrative division addresses and domain name registration records, and less attention to the relationship between IP addresses. This paper focuses on that relationship. Based on the aggregation characteristics of IP addresses, a geolocation method based on neighbor sequences is proposed. First, the neighbor sequences of an IP address are selected and converted to the corresponding latitude and longitude sequences; then a linear model is fit to the latitude and longitude sequences, and the parameter that minimizes the loss function is selected as the model. Experiments show that, with appropriate parameters, this method can give a fairly accurate physical location for an IP address. Limited by the data size of the experiment, the positioning error is 20-30 kilometers on average. If the amount of data reaches the million level, the method is expected to limit the average positioning error to within 10 kilometers.

Starting from the relationship between IP addresses, this method considers the aggregation characteristics of IP addresses, provides a new idea for the IP geolocation problem, and is a strong complement to existing IP geolocation technology. In data analysis-based methods, if the relationship between IP addresses is introduced in addition to the extension attributes of IP addresses, a large amount of information is added and more features become available; the model then has more parameters and greater expressive power, so better results can be expected. Accurate IP address positioning can effectively support related Internet industries, such as network security and advertising. This work is based on IPv4 addresses. Although IPv6 addresses are not widely used at present, they are growing rapidly: the 37th China Internet Development Report [16] shows that from 2014 to 2015 the annual growth rate of the number of IPv6 addresses in China reached 9.6%. The next step will be dedicated to the geographic positioning of IPv6 addresses.
ACKNOWLEDGMENT
This work is supported by the National Natural Science Foundation of China (NSFC) (No. 61572445) and the NSFC Joint Fund Key Project (No. U1804263).

REFERENCES

[1] Jinxia, W., Xiaoyan, X., Min, Y., & Tianning, Z. (2016). IP Geolocation Technology Research Based on Network Measurement. Sixth International Conference on Instrumentation & Measurement. IEEE.
[2] Padmanabhan, V. N., & Subramanian, L. (2001). An investigation of geographic mapping techniques for internet hosts. ACM SIGCOMM Computer Communication Review.
[3] Katz-Bassett, E., John, J. P., Krishnamurthy, A., Wetherall, D., & Chawathe, Y. (2006). Towards IP geolocation using delay and topology measurements. ACM SIGCOMM Conference on Internet Measurement. ACM.
[4] Eriksson, B., Barford, P., Maggs, B., & Nowak, R. (2012). Posit: a lightweight approach for IP geolocation. ACM SIGMETRICS Performance Evaluation Review, 40(2), 2-11.
[5] Ovidiu Dan, Vaibhav Parikh, & Brian D. Davison. (2016). Improving IP geolocation using query logs.
[6] Feitosa, R. M., Labidi, S., Santos, A. L. S. D., & Santos, N. (2013). Social Recommendation in Location-Based Social Network Using Text Mining. International Conference on Intelligent Systems. IEEE Computer Society.
[7] Yuan, T., Cheng, J., Zhang, X., Liu, Q., & Lu, H. (2013). A Weighted One Class Collaborative Filtering with Content Topic Features. International Conference on Multimedia Modeling. Springer Berlin Heidelberg.
[8] Guo, C., Liu, Y., Shen, W., Wang, H. J., & Zhang, Y. (2009). Mining the Web and the Internet for Accurate IP Address Geolocations. INFOCOM. IEEE.
[9] Backstrom, L., Sun, E., & Marlow, C. (2010). Find me if you can: Improving geographical prediction with social and spatial proximity. Proceedings of the 19th International Conference on World Wide Web, WWW 2010, Raleigh, North Carolina, USA, April 26-30, 2010. ACM.
[10] Poese, I., Uhlig, S., Mohamed Ali Kâafar, Donnet, B., & Gueye, B. (2011). IP geolocation databases: unreliable?. ACM SIGCOMM Computer Communication Review, 41(2), 53-56.
[11] Tetsuya, S., & Hiroyuki, O. (2018). IP address assignment device and IP address assignment method.
[12] Almohri, H. M. J., Watson, L. T., & Evans, D. (2019). Predictability of IP address allocations for cloud computing platforms. IEEE Transactions on Information Forensics and Security, PP(99).
[13] Zu, S., Luo, X., Liu, S., Liu, Y., & Liu, F. (2018). City-level IP geolocation algorithm based on PoP network topology. IEEE Access, PP, 1-1.
[14] http://lbsyun.baidu.com/index.php?title=webapi/guide/webservice-geocoding
[15] Khirirat, S., Feyzmahdavian, H. R., & Johansson, M. (2017). Mini-batch gradient descent: Faster convergence under data sparsity. IEEE Conference on Decision & Control. IEEE.
[16] CNNIC. The 37th Statistical Report on Internet Development in China. http://www.cnnic.net.cn.