An Optimized k-Harmonic Mean Based Clustering User Navigation Patterns

R. Gobinath1, M. Hemalatha2

Department of Computer Science, Karpagam University, Coimbatore, India


Department of Computer Science, Karpagam University, Coimbatore, India
(iamgobinathmca1, csresearchhema2)@gmail.com

Abstract - Web mining is an application process that plays
an important role in analyzing the behavior of web site
users. Web usage mining, a subcategory of web mining, has a
major impact on web personalization. This paper deals with
extracting the necessary information from web access log
files and applying clustering so that navigational patterns
can be analyzed easily for web personalization. The adapted
k-harmonic algorithm is used for clustering the navigational
patterns obtained from the various iterative processes.
Keywords - Web mining, Web Usage Mining, K-Harmonic Clustering

I. INTRODUCTION
The surfing path of a user not only carries the footprint
of the user's navigational history; it also reflects the
user's mindset while using the web site. The behavior of
the user can be clearly observed from the navigational
path, which is very useful in web site improvement [1].
Web sites are built mainly around the interests of their
users, so they should be organized in a convenient way that
satisfies the users' basic needs. The access log files
gathered from the web server contain the detailed history
of previous users, which is much needed in web site design
[2] [3]. The navigation path chosen by each user may differ
from those of other users depending on interest and needs.
The strategies used for surfing web sites may also vary
depending on the time spent on the web site and the comfort
level in using it for gathering information [4] [5] [6].
The web site designer plays a vital role in directing the
user to a short navigation path by implementing ideas
collected from previous users' navigation path sequences
[7] [8]. A web mining application can extract the
information necessary for a better arrangement of
navigational paths from access log files by following
basic data mining techniques. Web access log entries are
single-line statements stored on the web server that
contain the information necessary for analyzing users'
navigation behavior. However, some items that are unwanted
for web site personalization are also present in them.
These unwanted items should be removed from the acquired
access log files before proceeding to the actual web site
personalization process. Time and user identification play
an important role in the pattern discovery process. The
sessions and user IP addresses stored in the access log
files on a web server are used for identifying a particular
user. The features necessary for the personalization stage
have to be extracted after identifying the user and
session. This is the most important stage of the overall
web personalization process. The navigation patterns, which
are very important for predicting the behavior of the user,
are identified. The extracted user navigation paths have to
be arranged sequentially in the analysis process. This
paper attempts to explain the concept of analyzing web log
files with the data mining techniques involved by
presenting the overall methodology that precedes pattern
analysis.

978-1-4799-1597-2/13/$31.00 2013 IEEE

II. BACKGROUND
The proposed framework is based on the following
processes.
1) Data collection
2) Pre-processing
3) Feature extraction
4) Pattern Discovery
5) Pattern analysis
The methodology involved in this paper is shown in the
following architecture.
[Figure blocks: Access Log Files; Data Pre-Processing;
User & Session Identification; Feature Extraction;
Sequence Clustering]

Fig.1. Architecture of sequence clustering process.

A. Data Collection
Web access log files are the information stored on web
servers that holds the navigational history of the users.
Each entry is a single-line statement that contains full
information about the user's progress on the web site. A
single-line access log entry holds the following
information about the user who visited.

2013 IEEE International Conference on Computational Intelligence and Computing Research

a) IP address of the visitor
b) User ID
c) The session (date and time)
d) The requested file
e) Protocol used
f) Status of the process (success/failure)
g) Bytes sent
h) The site from which the visitor came
i) The browser used
The fields mentioned above are very important in the
personalization of web sites. A sample access log entry is
shown below.
ip2283.unr - - [16/Nov/2005:00:01:03 -0500] "GET
/dmcourse/dm.css HTTP/1.1" 200 155
"http://www.kdnuggets.com/dmcourse/data_mining_course/assignments/assignment-3.html"
"Mozilla/4.0 (compatible;
Web log files usually follow the W3C extended, NCSA or IIS
log file format. The W3C extended format allows the fields
logged for each request to be chosen, whereas the NCSA and
IIS formats log a fixed set of fields for each request.
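As an illustration of how a single-line entry can be split into the fields listed above, the following is a minimal sketch that assumes the NCSA combined layout (host, identity, user, time, request, status, bytes, referrer, user agent). The pattern, the function name `parse_line` and the agent string in the usage line are assumptions made for this example and are not taken from the paper.

```python
import re
from datetime import datetime

# Assumed layout: NCSA combined log format. The referrer and agent
# fields are optional so that common-format lines also match.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)


def parse_line(line):
    """Return a dict of log fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    rec = match.groupdict()
    # Convert the textual fields that later steps need as numbers or dates.
    rec["time"] = datetime.strptime(rec["time"], "%d/%b/%Y:%H:%M:%S %z")
    rec["status"] = int(rec["status"])
    rec["size"] = 0 if rec["size"] == "-" else int(rec["size"])
    return rec


# Usage: the agent string here is a made-up placeholder.
example = ('ip2283.unr - - [16/Nov/2005:00:01:03 -0500] '
           '"GET /dmcourse/dm.css HTTP/1.1" 200 155 '
           '"http://www.kdnuggets.com/" "ExampleBrowser/1.0"')
record = parse_line(example)
```

A regular expression with named groups keeps the mapping between log positions and field names explicit, which makes the later cleaning and identification steps easier to follow.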
B. Pre-processing
Data pre-processing is the most important step in the data
mining process. Identifying each single-line statement can
be a difficult job in the cleaning process. Each
single-line statement is divided into different attributes
by identifying the separators in the access log files. Data
that are unwanted or irrelevant for the personalization
stage have to be removed as early as possible. The
collected web log files may contain adulterants for the
personalization process. Some of these adulterants are
listed below.
a) Duplicate files
b) Image files
c) Robotic files
d) Files with .css extension
e) Success status code
The cleaned records are used to identify the user and
session, which is an important step in feature
extraction.
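The cleaning step can be sketched as a simple filter over parsed records. This is only one possible reading of the list above: the image extensions, the robot hints and the record keys are assumptions, and item (e) is interpreted here as keeping only requests whose status code indicates success (2xx), which the paper does not state explicitly.

```python
# Assumed values, chosen for illustration only.
IMAGE_EXTENSIONS = (".gif", ".jpg", ".jpeg", ".png", ".ico", ".bmp")
ROBOT_HINTS = ("bot", "crawler", "spider")


def is_robot(rec):
    """Heuristic robot check: robots.txt requests or a bot-like agent."""
    agent = (rec.get("agent") or "").lower()
    return rec["path"] == "/robots.txt" or any(h in agent for h in ROBOT_HINTS)


def clean(records):
    """Drop duplicates, images, style sheets, robots and failed requests."""
    seen = set()
    kept = []
    for rec in records:
        path = rec["path"].split("?", 1)[0].lower()
        key = (rec["host"], rec["time"], rec["path"])
        if key in seen:               # a) duplicate entry
            continue
        seen.add(key)
        if path.endswith(IMAGE_EXTENSIONS) or path.endswith(".css"):
            continue                  # b) image files, d) .css files
        if is_robot(rec):             # c) robot traffic
            continue
        if not 200 <= rec["status"] < 300:
            continue                  # e) assumed: keep successful requests only
        kept.append(rec)
    return kept
```

Each record is expected to be a dictionary with `host`, `time`, `path`, `status` and optionally `agent` keys, which are hypothetical names used for this sketch.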
C. User and Session identification
The users in the cleaned access log files are
identified by implementing the following method [12].
If the IP address does not exist then
    Consider the user as a new user
End if
If the IP address exists and the (browser version or
operating system) does not exist then
    Consider the user entry as a new user
Else
    Next entry
End if
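The rule above can be sketched in a few lines: an entry counts as a new user when its IP address has not been seen, or when the IP address is known but the browser or operating system (taken here from the agent field) is new. The function name and record keys are assumptions for this example.

```python
def identify_users(records):
    """Return one user id per record, in input order.

    A user is keyed by (IP address, agent), so an unseen IP address or a
    known IP address with an unseen agent both yield a new user id.
    """
    known = {}      # (host, agent) -> user id
    user_ids = []
    for rec in records:
        key = (rec["host"], rec["agent"])
        if key not in known:
            known[key] = len(known) + 1
        user_ids.append(known[key])
    return user_ids
```

Keying on the pair rather than on the IP address alone is what separates several visitors who share one address, for example behind a proxy.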
The time spent by the user on the web site is known as a
session. A session is limited to 30 minutes of activity by
the user. The referrer-based method is used for identifying
a new session in the web log files.
For each entry in SLT
    Sort the log data by IP address, agent and time
Next entry
Read each entry in SLT
    If the IP address and agent do not exist then
        If (requested_time_i - requested_time_(i-1)) > session
        time-out or session time-out does not belong to session
        history then
            Increment the value of session by 1
            Consider the user session as new
        End if
    End if
Next entry
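A minimal sketch of the time-out part of this procedure is given below. It is one reading of the pseudocode, not the authors' implementation: a new session starts when the (IP address, agent) pair changes or when the gap between consecutive requests exceeds the 30-minute time-out. The referrer check is left out, and the names `assign_sessions` and `SESSION_TIMEOUT` are assumptions.

```python
from datetime import timedelta

SESSION_TIMEOUT = timedelta(minutes=30)


def assign_sessions(records, timeout=SESSION_TIMEOUT):
    """Return (session_id, record) pairs sorted by IP, agent and time."""
    ordered = sorted(records, key=lambda r: (r["host"], r["agent"], r["time"]))
    session_id = 0
    prev = None
    result = []
    for rec in ordered:
        same_user = (prev is not None and
                     (rec["host"], rec["agent"]) == (prev["host"], prev["agent"]))
        # New session: different user, or too long since the last request.
        if not same_user or rec["time"] - prev["time"] > timeout:
            session_id += 1
        result.append((session_id, rec))
        prev = rec
    return result
```

Sorting first means that each user's requests are contiguous, so a single pass that compares each entry with its predecessor is enough.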
D. Feature extraction
Feature extraction is the final step of the preprocessing
stage, and it is optional in many web data preprocessing
pipelines. Features extracted from the web data can be used
effectively in the subsequent pattern analysis stage of
personalization. The feature extraction process is based on
a few categories [11].
First n and Last n pages
In this feature set, the first n and last n pages visited
by the visitor in a transaction are taken into
consideration. This feature set is used to obtain more
information in the data mining process.
First n and Last m pages
In this attribute selection method, each entry consists of
the first n and last m pages accessed. This type of
grouping contains significant information for
discriminating between groups of visitors.
Top N-Most-Frequently visited pages
In this technique, the most frequently visited pages are
identified. The top N most frequently visited pages are
chosen as attributes in this feature set. The idea is that
visitors' browsing behavior can be classified based on
whether or not they visited a frequently visited page.
Infrequently Visited Pages
In this feature set, the infrequently visited pages are
selected in order to analyze why visitors avoid viewing
them and whether those pages should be considered for
modification.
Top N-Most-Time Spent pages
This feature set has the same attributes as the Top
N-Most-Frequently visited pages set. Here, an attribute
value is the amount of time, in seconds, spent by the
visitor on that particular frequently visited page. The
duration is calculated as the time difference between two
consecutive page requests. This feature set is based on the
conjecture that time spent on frequently visited pages
might be a highly distinguishing factor for categorizing
visitors.


Path Navigation
In this attribute selection approach, how users reached a
particular web page is analyzed. The associations that
exist among web page navigations are considered in
analyzing user browsing patterns.
First and Last Pages Visited
The pages that are viewed first and last are identified,
including whether visitors entered the web site through the
main page or from a different page. Inferring why visitors
left after reading these pages can be difficult: it could
be that these pages satisfied their information
requirements, or that they had become irritated by the time
they arrived at these pages.
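Several of the feature sets described in this section can be sketched with short helper functions. The sketch assumes that each session is already available as an ordered list of page paths (or of page and timestamp pairs). The function names are hypothetical, and the last page of a session gets no duration because no following request exists to measure it against.

```python
from collections import Counter


def first_n_last_m(pages, n, m):
    """First n and last m pages of one session (use m == n for 'last n')."""
    return pages[:n], (pages[-m:] if m > 0 else [])


def top_n_pages(sessions, n):
    """The n most frequently visited pages over all sessions."""
    counts = Counter(page for pages in sessions for page in pages)
    return [page for page, _ in counts.most_common(n)]


def time_spent(visits):
    """Seconds spent per page, from consecutive (page, datetime) requests."""
    seconds = {}
    for (page, start), (_, following) in zip(visits, visits[1:]):
        elapsed = (following - start).total_seconds()
        seconds[page] = seconds.get(page, 0.0) + elapsed
    return seconds
```

The infrequently visited pages can be obtained from the same counts by taking the pages at the low end of the ranking instead of the high end.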
E. Sequences clustering
The categorized navigational paths can be arranged in a
certain order for easy analysis. The clustering concept
implemented here for grouping navigation paths differs from
that of some other researchers. The Extensible Markup
Language (XML) has been used for sequential web page
representation in the form of the Log Markup Language
[9] [10]. The Log Markup Language arranges web pages by
considering the indices processed in sessions. Using the
Log Markup Language instead of plain statements simplifies
the calculation of the distance between navigational
sequences. The use of a different technique does not
prevent the researcher from using data mining algorithms in
the clustering process. Although web mining follows the
techniques of data mining, the navigational path clustering
carried out by researchers uses a raw clustering process.
The raw clustering concept of collecting the groups can
slow down the analysis process. The clustering technique
known as k-harmonic means, which is applied here to
sequence clustering, was described by Bin Zhang et al. in
1999 [13], who showed the difference between the k-means
and k-harmonic means clustering algorithms.
K-harmonic clustering
The k-harmonic means algorithm (KHM) is a method similar
to k-means (KM) that arises from a different objective
function [13]. The KHM objective function uses the
harmonic mean of the distances from each navigational path
to all of the cluster centers.
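$$\mathrm{KHM}(X, C) = \sum_{i=1}^{n} \frac{k}{\sum_{j=1}^{k} \dfrac{1}{\lVert x_i - c_j \rVert^{p}}} \qquad (1)$$

where $x_i$ is the $i$-th of the $n$ navigational paths and $c_j$
is the $j$-th of the $k$ cluster centers.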

Here p is an input parameter, and typically p ≥ 2. The
harmonic mean gives a good (low) score for each sequential
navigational path when that path is close to any one center
of the groups. This is a property of the harmonic mean; it
is similar to the minimum function used by KM, but it is a
smooth, differentiable function. The k-harmonic means
algorithm seeks to minimize the squared quantization error,
i.e. the within-cluster variance. Implementing k-harmonic
means in the navigational path clustering process increases
the efficiency of the forthcoming pattern recognition
stage.
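To make the objective concrete, the following sketch evaluates the harmonic-mean objective and performs one center update. It rests on two assumptions that are not stated in the paper: each navigational path has already been encoded as a numeric feature vector, and the update uses the soft membership and per-point weight commonly given for KHM in the clustering literature. The names `khm_objective` and `khm_step` and the small constant `eps` (which guards against a zero distance) are hypothetical.

```python
import math


def khm_objective(points, centers, p=2.0, eps=1e-12):
    """Sum over points of the harmonic mean of d(x, c_j)^p over the centers."""
    k = len(centers)
    total = 0.0
    for x in points:
        inv_sum = sum(1.0 / max(math.dist(x, c) ** p, eps) for c in centers)
        total += k / inv_sum
    return total


def khm_step(points, centers, p=2.0, eps=1e-12):
    """One KHM center update (assumed standard membership/weight form)."""
    dim = len(centers[0])
    num = [[0.0] * dim for _ in centers]
    den = [0.0] * len(centers)
    for x in points:
        d = [max(math.dist(x, c), eps) for c in centers]
        a = [di ** (-p - 2) for di in d]      # d^(-p-2) for each center
        b = sum(di ** (-p) for di in d)       # sum of d^(-p) over centers
        for j, aj in enumerate(a):
            # membership * weight = (a_j / sum(a)) * (sum(a) / b^2)
            coef = aj / (b * b)
            den[j] += coef
            for t in range(dim):
                num[j][t] += coef * x[t]
    return [[num[j][t] / den[j] for t in range(dim)]
            for j in range(len(centers))]
```

Repeating `khm_step` until the centers stop moving gives a simple clustering loop. Because every point contributes to every center with a smoothly varying weight, the result tends to depend less on the initial centers than a hard assignment would.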
III. EXPERIMENTAL RESULTS
The experiment was conducted with the help of a web
access log tool. The kdlog access files were used for the
process, and the result shows that the top 10 most
frequently used paths form different clusters. The
sequential user navigation paths are clustered by applying
the k-harmonic means clustering method.
TABLE I
SEQUENTIAL CLUSTERING

Top 10 Page path                                  Cluster sequence
/ (default page)                                  860
/fdsearch/search.pl?(parameters)                  174
/dmcourse/data_mining_course/course_notes.pdf     174
/software/ (default page)                         166
/robots.txt                                       155
/jobs/index.html                                  114
/gpspubs/aimag-kdd-overview-1996-Fayyad.pdf       109
/software/index.html                              102
/phpBB/viewtopic.php?(parameters)                 98
The navigation paths used most frequently by users are
shown in TABLE II. The root page accounts for a larger
percentage of entries than any other sequential path.
TABLE II
USER ENTRY IN PERCENTAGE

Path Listing (Pages/Files)       Visitors   Percentage of visitors
/                                666        16.20%
/software/                       187        4.55%
/jobs/                           136        3.31%
/datasets/                       64         1.56%
/companies/consulting.html       69         1.68%
/dmcourse/data_mining_course/    53         1.29%
/software/suites.html            64         1.56%
/software/visualization.html     52         1.26%
/news/2005/n21/                  57         1.39%
/software/text.html              47         1.14%

IV. CONCLUSION
The challenges faced by the web site designer in
obtaining clear information for easy web site design can be
tackled by implementing web mining techniques. The problem
of retrieving the navigational patterns necessary for the
web personalization process can be solved with this method,
and the navigational patterns are then taken forward for
analysis. This paper focuses on extracting information from
access log files and extracting the necessary navigational
patterns. The extracted navigational patterns are arranged
in clusters by applying the k-harmonic clustering
technique. Implementing an algorithm for recognizing the
patterns necessary for the final personalization process
will be carried out in future work.

REFERENCES
[1] J. Nielsen, Designing Web usability: the practice of
simplicity. Indianapolis, IN: New Riders Press, 2000.
[2] A. Cooper, The inmates are running the asylum.
Indianapolis, IN: SAMS, 1999.
[3] J. Preece, Y. Rogers and H. Sharp, Interaction design,
New York, NY: John Wiley and Sons, Inc, 2002.
[4] M. Graff, Individual differences in hypertext browsing
strategies, Behaviour and Information Technology, Vol.
24, no. 2, 2005.
[5] P. Pirolli and S. Card, Information foraging,
Psychological Review, Vol. 106, no. 4, 1999.
[6] L.D. Catledge, and J.E. Pitkow, Characterizing browsing
strategies in the World-Wide Web, Computer Networks
and ISDN Systems, Vol. 27, no.6, 1995.
[7] J. Holsanova, Tracking multimodal interaction with new
media, Paper presented at the workshop on The Citizen's
Use and Comprehension of Information on the Internet,
Uppsala, 2004.
[8] M.J. Bates, The design of browsing and berrypicking
techniques for the online search interface, Online
Review, Vol. 13, no. 5, 1989.
[9] T. Bray, C.M. Sperberg-McQueen, Extensible markup
language (XML) 1.0: W3C recommendation, Technical
Report REC-xml-19980210, World Wide Web Consortium,
1998.
[10] J.R. Punin, M.S. Krishnamoorthy, M.J. Zaki, Web usage
mining: languages and algorithms, Technical Report,
Rensselaer Polytechnic Institute, 2001.
[11] R. Gobinath and M. Hemalatha, Optimized Feature
Extraction for Identifying user Behavior in Web Mining ,
European Journal of Scientific Research, Vol. 105, no. 3,
2012.
[12] R. Gobinath and M. Hemalatha, Improved Preprocessing
Techniques for Analyzing Patterns in Web Personalization
Process, International Journal of Computer Application,
Vol. 58, no. 3, 2013.
[13] Bin Zhang, Meichun Hsu and Umeshwar Dayal, K-Harmonic
Means - A Data Clustering Algorithm, Hewlett-Packard
Research Laboratory, 1999.
