Lazy Classifiers Using P-Trees
ith attribute), into a P-tree. An example is given in Figure 1. P-tree operations include AND, OR and NOT [3]. The NOT operation is a straightforward translation of each count to its quadrant-complement. The AND and OR operations are shown in Figure 2.
[Figure 1. P-tree and PM-tree]

[Figure 2. AND and OR operations on P-trees: P-tree-1, P-tree-2, AND-Result, OR-Result]
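To make the quadrant logic behind Figures 1 and 2 concrete, the following Python sketch (illustrative only, not the authors' implementation) models a PM-tree node as '1' for a pure-1 quadrant, '0' for a pure-0 quadrant, or a list of four child quadrants; NOT, AND, OR and the root count then follow directly.

# Minimal, illustrative PM-tree sketch (not the paper's implementation).
# A node is '1' (pure-1 quadrant), '0' (pure-0 quadrant), or a list of exactly
# four child nodes, ordered upper-left, upper-right, lower-left, lower-right.

def build(bits):
    """Build a PM-tree from a square 0/1 matrix whose side is a power of two."""
    if all(b == 1 for row in bits for b in row):
        return '1'
    if all(b == 0 for row in bits for b in row):
        return '0'
    h = len(bits) // 2
    quads = [[row[:h] for row in bits[:h]], [row[h:] for row in bits[:h]],
             [row[:h] for row in bits[h:]], [row[h:] for row in bits[h:]]]
    return [build(q) for q in quads]

def ptree_not(node):
    """NOT: complement each pure quadrant, recurse into mixed ones."""
    if node == '1':
        return '0'
    if node == '0':
        return '1'
    return [ptree_not(child) for child in node]

def ptree_and(a, b):
    """AND: a pure-0 operand dominates, a pure-1 operand is the identity."""
    if a == '0' or b == '0':
        return '0'
    if a == '1':
        return b
    if b == '1':
        return a
    kids = [ptree_and(x, y) for x, y in zip(a, b)]
    return '0' if all(k == '0' for k in kids) else kids

def ptree_or(a, b):
    """OR: a pure-1 operand dominates, a pure-0 operand is the identity."""
    if a == '1' or b == '1':
        return '1'
    if a == '0':
        return b
    if b == '0':
        return a
    kids = [ptree_or(x, y) for x, y in zip(a, b)]
    return '1' if all(k == '1' for k in kids) else kids

def root_count(node, size):
    """Number of 1-bits covered by a node that spans `size` cells."""
    if node == '1':
        return size
    if node == '0':
        return 0
    return sum(root_count(child, size // 4) for child in node)

For the 8x8 bit group of Figure 1, root_count(build(bits), 64) would return its root count of 39.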
metrics have been proposed. For two data points, X = <x_1, x_2, x_3, ..., x_{n-1}> and Y = <y_1, y_2, y_3, ..., y_{n-1}>, the Euclidean similarity function is defined as d_2(X, Y) = \sqrt{\sum_{i=1}^{n-1} (x_i - y_i)^2}. It can be generalized to the Minkowski similarity function, d_q(X, Y) = \left( \sum_{i=1}^{n-1} w_i |x_i - y_i|^q \right)^{1/q}. If q = 2, this gives the Euclidean function. If q = 1, it gives the Manhattan distance, d_1(X, Y) = \sum_{i=1}^{n-1} |x_i - y_i|. If q = \infty, it gives the max function, d_\infty(X, Y) = \max_{i=1}^{n-1} |x_i - y_i|.

We proposed a metric using P-trees, called HOBBit [3]. The HOBBit metric measures distance based on the most significant consecutive bit positions, starting from the left (the highest-order bit). The HOBBit similarity between two integers A and B is defined by

S_H(A, B) = \max\{ s \mid 0 \le i \le s \Rightarrow a_i = b_i \}    (eq. 1)

where a_i and b_i are the ith bits of A and B respectively. The HOBBit distance between two tuples X and Y is defined attribute by attribute, taking the worst (maximum) HOBBit distance over the corresponding attribute values [3]. (A small illustrative sketch of these metrics is given after the list below.)

2) Find the k nearest neighbors using the selected distance metric.
3) Find the plurality class of the k nearest neighbors.
4) Assign that class to the sample to be classified.
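The following sketch is a plain reimplementation of the distance definitions above, not the P-tree-based computation. The bit width m = 8, the unit weights, and the reading of S_H as the length of the common most-significant bit prefix are assumptions made for the example.

def minkowski(x, y, q, w=None):
    """d_q(X, Y) = (sum_i w_i * |x_i - y_i|^q)^(1/q); q = 2 is Euclidean, q = 1 is Manhattan."""
    w = w if w is not None else [1.0] * len(x)
    return sum(wi * abs(a - b) ** q for wi, a, b in zip(w, x, y)) ** (1.0 / q)

def max_distance(x, y):
    """d_inf(X, Y) = max_i |x_i - y_i| (the limit of d_q as q grows)."""
    return max(abs(a - b) for a, b in zip(x, y))

def hobbit_similarity(a, b, m=8):
    """Length of the common most-significant bit prefix of a and b (cf. eq. 1)."""
    s = 0
    for bit in range(m - 1, -1, -1):            # scan from the highest-order bit down
        if ((a >> bit) & 1) == ((b >> bit) & 1):
            s += 1
        else:
            break
    return s

def hobbit_distance(a, b, m=8):
    """Per-attribute HOBBit distance: bits left after the common prefix (0 = identical values)."""
    return m - hobbit_similarity(a, b, m)

def hobbit_tuple_distance(x, y, m=8):
    """Tuple distance: the worst per-attribute HOBBit distance."""
    return max(hobbit_distance(a, b, m) for a, b in zip(x, y))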
Database scans are performed to find the nearest neighbors, which is the bottleneck of the method. In PKNN, by using P-trees, we can quickly find the nearest neighbors of a given sample. For example, given a tuple <v_1, v_2, ..., v_n>, we can get the exact count of this tuple by calculating the root count of P(v_1, v_2, ..., v_n) = P_{1,v_1} AND P_{2,v_2} AND ... AND P_{n,v_n}.

With the HOBBit metric, we first find all exact matches with the given data sample X, then extend to data points at distance 1 from X (with at most one bit difference in each attribute), and so on, until we have k nearest neighbors.

Instead of taking exactly k nearest neighbors, we take the closure of the k-NN set; that is, we include all of the boundary neighbors that are at the same distance from the sample point as the kth neighbor. Closed-KNN is clearly a superset of the KNN set. As the performance analysis below shows, closed-KNN noticeably improves accuracy; a small sketch of this ring-expansion search is given below.
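To keep the sketch of the ring-expansion search self-contained, each "P-tree" is modelled here as a plain Python integer used as a bit vector over the training tuples, so AND is & and the root count is a popcount; in PKNN these would be genuine P-tree operations. The function names and the bit width m = 8 are illustrative.

def bit_ptree(column, bit):
    """Bit vector (as an int) of the training rows whose attribute value has `bit` set."""
    vec = 0
    for row, value in enumerate(column):
        if (value >> bit) & 1:
            vec |= 1 << row
    return vec

def hobbit_ring(columns, sample, d, m=8):
    """Bit vector of training tuples within HOBBit distance d of `sample`:
    every attribute must agree with the sample on its top m - d bits."""
    n_rows = len(columns[0])
    all_rows = (1 << n_rows) - 1
    result = all_rows
    for col, v in zip(columns, sample):
        for bit in range(m - 1, d - 1, -1):     # the m - d most significant bits
            p = bit_ptree(col, bit)
            if not (v >> bit) & 1:
                p = ~p & all_rows               # complement, i.e. the NOT operation
            result &= p                         # the AND operation
    return result

def closed_knn(columns, sample, k, m=8):
    """Grow the HOBBit radius until at least k tuples are covered (the closed-KNN set)."""
    for d in range(m + 1):
        ring = hobbit_ring(columns, sample, d, m)
        if bin(ring).count("1") >= k:
            return d, ring                      # every tuple at distance <= d is included
    return m, (1 << len(columns[0])) - 1

closed_knn returns the final radius together with a bit vector marking the closed-KNN set; the plurality class of the marked training tuples is then assigned to the sample, as in steps 3 and 4 above.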
Podium (weighting) functions that decrease with distance can be used (e.g., Gaussian, Kriging, etc.).
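The excerpt does not give PINE's podium function explicitly; as one concrete possibility, the sketch below weights each HOBBit ring with a Gaussian that decreases with distance and takes a weighted plurality vote. The function names, the Gaussian choice and the sigma parameter are assumptions for illustration.

import math
from collections import defaultdict

def gaussian_weight(distance, sigma=1.0):
    """One possible podium weight: decreases smoothly with distance."""
    return math.exp(-(distance ** 2) / (2 * sigma ** 2))

def podium_vote(ring_by_distance, labels, sigma=1.0):
    """Weighted plurality vote over neighbor rings.

    ring_by_distance: {d: bit vector of training tuples at HOBBit distance exactly d}
                      (e.g., the difference of two consecutive rings from the
                      closed-KNN sketch above).
    labels: class label of each training tuple, indexed by row number.
    """
    votes = defaultdict(float)
    for d, ring in ring_by_distance.items():
        weight = gaussian_weight(d, sigma)
        row = 0
        while ring:
            if ring & 1:
                votes[labels[row]] += weight
            ring >>= 1
            row += 1
    return max(votes, key=votes.get)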
4. PERFORMANCE ANALYSIS
We have performed experiments on real image data sets, which consist of an aerial TIFF image (with Red, Green and Blue bands), moisture, nitrate, and a yield map of the Oaks area in North Dakota. In these data sets, yield is the class label attribute. The data sets are available at [7]. We tested KNN with various metrics, PKNN, and PINE. The accuracies of these different implementations are given in Figure 3.

[Figure 3. Accuracy comparison for KNN, PKNN and PINE. Horizontal axis: Training Set Size (number of tuples), 256 to 262144; vertical axis: accuracy, 0 to 50.]

We see that PKNN performs better than KNN with the various metrics, including Euclidean, Manhattan, Max and HOBBit. PINE further improves the accuracy, and the improvement becomes more apparent as the training set size increases. Both PKNN and PINE perform well compared with random guessing, which would give 12.5% accuracy on this data set with 8 class values.

5. CONCLUSION

In the proposed methods, the nearest neighbors are found through P-tree operations instead of scanning the database. In addition, with the proposed closed-KNN, accuracy is increased. In the PINE method, a podium function is applied to different neighbors, so accuracy is further improved.

Lazy classifiers are particularly useful for classification on data streams. In data streams, new data keep arriving, so building a new classifier each time can be very expensive. By using P-trees, we can build lazy classifiers efficiently and effectively for stream data.

6. REFERENCES

[3] M. Khan, Q. Ding and W. Perrizo, "K-Nearest Neighbor Classification on Spatial Data Streams Using P-Trees", PAKDD 2002, Springer-Verlag, LNAI 2336, 2002, pp. 517-528.
[4] B. V. Dasarathy, "Nearest-Neighbor Classification Techniques", IEEE Computer Society Press, Los Alamitos, CA, 1991.
[5] M. James, "Classification Algorithms", John Wiley & Sons, New York, 1985.
[6] M. A. Neifeld and D. Psaltis, "Optical Implementations of Radial Basis Classifiers", Applied Optics, Vol. 32, No. 8, 1993, pp. 1370-1379.
[7] TIFF image data sets. Available at http://midas-10.cs.ndsu.nodak.edu/data/images/
[8] J. Han and M. Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, 2001.