Crawling Hidden Objects With KNN Queries
ABSTRACT
Many websites offering Location Based Services (LBS) provide a kNN search interface
that returns the top-k nearest-neighbor objects (e.g., nearest restaurants) for a given query
location. This paper addresses the problem of crawling all objects efficiently from an
LBS website through the public kNN web search interface it provides. Specifically, we
develop crawling algorithms for 2D and higher-dimensional spaces, and
demonstrate through theoretical analysis that the overhead of our algorithms can be
bounded by a function of the number of dimensions and the number of crawled objects,
regardless of the underlying distribution of the objects. We also extend the algorithms to
exploit auxiliary information about the underlying data distribution when it is
available, e.g., the population density of an area, which is often positively correlated
with the density of LBS objects. Extensive experiments on real-world
datasets demonstrate the superiority of our algorithms over the state-of-the-art
competitors in the literature.
CHAPTER 1
INTRODUCTION
SYSTEM ANALYSIS
The key technical challenge in crawling through a kNN interface is minimizing the
number of queries issued to the LBS service. This requirement stems from the limits
that most LBS services impose on the number of queries allowed from an IP address
or a user account (in the case of an API service such as Google Maps) in a given time
period (e.g., one day). For example, Twitter limits the search rate to 180 queries per
15 minutes. No algorithm can accomplish the task with fewer than n/k queries, where
n is the output size (i.e., the number of crawled objects), because each query returns
at most k of the n objects. The best we can hope for is therefore an output-sensitive
algorithm whose query cost is as close to n/k as possible.
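To make the n/k lower bound concrete, the following minimal sketch computes the smallest possible number of queries and the number of rate-limit windows they would occupy. The function names and the example values of n and k are hypothetical and chosen only for illustration; the figure of 180 queries per 15-minute window is the Twitter limit quoted above.

```python
import math


def min_queries(n: int, k: int) -> int:
    """Lower bound on queries: each query returns at most k of the n objects."""
    return math.ceil(n / k)


def min_rate_windows(n: int, k: int, queries_per_window: int) -> int:
    """Fewest rate-limit windows needed to issue the lower-bound number of queries."""
    return math.ceil(min_queries(n, k) / queries_per_window)


if __name__ == "__main__":
    # Hypothetical workload: 100,000 hidden objects behind a top-10 interface.
    n, k = 100_000, 10
    print(min_queries(n, k))            # 10000 queries at the very least
    print(min_rate_windows(n, k, 180))  # 56 windows of 15 minutes each
```

Even this best case occupies 56 windows, i.e., 14 hours of crawling, which is why every query above the lower bound matters.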
We start by addressing the kNN crawling problem in 1-D spaces and propose a 1-D
crawling algorithm whose query cost is bounded above by O(n/k), where n is the
number of output objects and k is the top-k restriction. We then use the 1-D algorithm as a
building block for kNN crawling over 2-D spaces and present a theoretical analysis
showing that the query cost of the algorithm depends only on the number of output objects
n, not on the spatial distribution of the data. We then extend our solution
to the general case of m-D spaces, which is the first such solution in the literature. Our
contributions also include a comprehensive set of experiments on both synthetic and
real-world data sets. The results demonstrate the superiority of our algorithms over the
existing solutions.
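The text does not spell out the 1-D algorithm, so the following is only an illustrative sketch of how a 1-D kNN crawl can work, not the algorithm analyzed here, and no O(n/k) guarantee is claimed for it. It assumes a simulated oracle (make_knn_oracle, a hypothetical stand-in for the LBS interface), objects at distinct locations, and a plain interval-splitting strategy: query the midpoint of an uncovered interval, mark the span reached by the k-th nearest result as covered, and continue with the uncovered remainders on either side.

```python
from typing import Callable, Dict, List, Set, Tuple


def make_knn_oracle(hidden: List[float], k: int) -> Tuple[Callable[[float], List[float]], Dict[str, int]]:
    """Simulate a hidden 1-D kNN interface (an assumption, not a real LBS API)."""
    stats = {"queries": 0}

    def query(q: float) -> List[float]:
        stats["queries"] += 1
        # Return the k objects nearest to q, as a top-k interface would.
        return sorted(hidden, key=lambda x: abs(x - q))[:k]

    return query, stats


def crawl_1d(query: Callable[[float], List[float]], k: int, lo: float, hi: float) -> Set[float]:
    """Collect every object in [lo, hi] using only top-k queries."""
    found: Set[float] = set()
    pending = [(lo, hi)]  # intervals not yet known to be fully covered
    while pending:
        a, b = pending.pop()
        q = (a + b) / 2.0
        result = query(q)
        found.update(x for x in result if lo <= x <= hi)
        if len(result) < k:
            break  # fewer than k results: the whole database was returned
        # Every object strictly closer to q than r is in the result,
        # so only the parts of [a, b] at distance >= r can hide more objects.
        r = max(abs(x - q) for x in result)
        if a < b:
            if a <= q - r:
                pending.append((a, q - r))
            if q + r <= b:
                pending.append((q + r, b))
    return found


if __name__ == "__main__":
    hidden = [1.0, 2.0, 2.5, 4.0, 7.0, 7.5, 9.0, 12.0, 15.0, 15.5]
    query, stats = make_knn_oracle(hidden, k=3)
    crawled = crawl_1d(query, k=3, lo=0.0, hi=16.0)
    print(sorted(crawled) == sorted(hidden), stats["queries"])
```

Dense regions produce a small covered radius and hence more splitting, while sparse regions are covered by a single query; that behavior is what makes the query cost depend on the output size rather than on the extent of the region.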
For 2-D spaces, we also take external knowledge into consideration to improve
crawling performance. The experimental results show the effectiveness of our
proposed algorithms.