Algorithm For Tracing Visitors' On-Line Behaviors
International Journal of Computer Applications (0975 – 8887)
Volume 87 – No.3, February 2014
2.1 What is the need for tracing visitors’ on-line behaviors in web usage mining
Visitors’ on-line behaviors must be traced for website usage analysis.
This analysis yields knowledge about how visitors use a website, which
can provide guidelines for website reorganization and help to prevent
disorientation. It also helps designers to place important information
where visitors look for it. Tracing is also needed for pre-fetching and
caching web pages, and it enables adaptive websites (personalization).
This is represented in “Fig 3.” given below.

2.3 Web log
The aim of a web log file is to create user profiles by identifying
users’ browsing similarities with previous users. The raw weblog data
must be cleaned, condensed and transformed before data mining is
performed. Weblog information can be integrated with web content and
web structure mining to help webpage ranking and web document
classification. The interaction details of users with a website are
recorded automatically in web servers in the form of weblogs [2].
Weblogs are kept as lines of text in the web server, proxy server and
browser [8]. Various forms of logs are server access logs, server
referrer logs, agent logs, client-side cookies, user profiles, search
engine logs and database logs. These are considered as input for
knowing the end user behavior in web usage mining.
Log files are those files that list the actions that have
occurred [18]. Log files hold many parameters which have been employed
in recognizing user browsing patterns. Some of the parameters are user
name, visiting path traversed, timestamp, page last visited, success
rate, user agent, URL and request type [17].

2.4 Transfer / Access Log
The information on users’ requests from their web browsers is stored in
the transfer/access log.

Table 1. Transfer/Access Log
Time | Date | Host name | File requested | Amount of data transferred | Status of report

2.5 Referrer Log
The two fields recorded in the referrer log are the URL and the
referrer URL.

Table 2. Referrer Log
URL | Referrer URL

2.6 Error Log
The errors and the requests which have failed are collected in the
error log. A user request may fail not only when a page holds a link to
a file that does not exist, but also when the user is not permitted to
access a particular page. It is depicted in the following “Fig 4.”

[Figure 4. Labels: Request, SERVER, CLIENT]

... unwanted data and shapes the required data by filling in missing
values, smoothing noisy data, identifying or removing outliers and
resolving inconsistencies. Dirty data always causes confusion when it
is processed in the mining process.

3.2 Data integration
Many categories of databases, data cubes or files are collected and
integrated together in this step.

3.3 Data transformation
This step consists of normalization and aggregation.

3.4 Data reduction
It produces a reduced representation of the volume of data collected
and processed, yet it gives the same or similar analytical results.

3.5 Data discretization
It is a part of the data reduction step. It is applied only to
numerical data, not to all types of data.

4. PREPROCESSING OF WEB USAGE DATA
Generally in web usage mining, preprocessing [9] is considered an
essential task and a means of reaching the goal. As suggested in the
referred paper [1], the intelligent system web usage preprocessor
separates human accesses from search engine accesses before the
preprocessing techniques are applied. At present it is difficult to
obtain good quality data, and without quality data there are no quality
mining results, because quality decisions depend on the quality of the
data. Duplicate or missing data may create incorrect or even misleading
statistics. The data warehouse also requires consistent integration of
quality data. Moreover, data extraction, cleaning and transformation
take up most of the work in building a data warehouse. It is depicted
in the following “Fig 5.”
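The cleaning, transformation and discretization steps of Section 3 can be illustrated with a small Python sketch. It is only an illustration under stated assumptions, not the preprocessing used in this paper: the record fields ("host", "file", "bytes"), the choice to fill a missing byte count with 0, min-max scaling as the normalization, and the bin edges are all hypothetical.

```python
def clean(records):
    """Data cleaning: fill missing byte counts and drop exact duplicates."""
    seen, cleaned = set(), []
    for record in records:
        filled = dict(record)
        # Assumption: a missing "bytes" value is filled with 0.
        if filled.get("bytes") is None:
            filled["bytes"] = 0
        key = tuple(sorted(filled.items()))
        if key not in seen:
            seen.add(key)
            cleaned.append(filled)
    return cleaned


def normalize(values):
    """Data transformation: min-max scaling of numeric values to [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def discretize(value, edges, labels):
    """Data discretization: map a number to a label.

    `labels` must hold one more entry than `edges`; a value below
    edges[i] (and not below any earlier edge) receives labels[i].
    """
    for edge, label in zip(edges, labels):
        if value < edge:
            return label
    return labels[-1]


if __name__ == "__main__":
    raw = [
        {"host": "a.example", "file": "/index.html", "bytes": 500},
        {"host": "a.example", "file": "/index.html", "bytes": 500},  # duplicate
        {"host": "b.example", "file": "/paper.pdf", "bytes": None},  # missing value
        {"host": "c.example", "file": "/video.mp4", "bytes": 200000},
    ]
    rows = clean(raw)
    scaled = normalize([r["bytes"] for r in rows])
    for row, s in zip(rows, scaled):
        size = discretize(row["bytes"], [1000, 100000], ["small", "medium", "large"])
        print(row["host"], round(s, 4), size)
```

Discretization is applied only to the numeric "bytes" field here, in line with the remark in Section 3.5 that it concerns numerical data.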
5. AN OVERVIEW OF EXISTING METHODOLOGIES
The research paper [2] studies and presents several data preparation
techniques for the access stream that are applied before the mining
process can be started. These are used to improve the performance of
data preprocessing and to identify unique sessions and unique users.
The proposed methods help to discover meaningful patterns and
relationships from the access stream of the user, and they have been
shown to be valid and useful by various research tests. Yang Bin et al.
[19] used negative association rules in the discovery of web visitors’
patterns. Negative association rules have been deployed to address the
deficiencies of positive rules. Data preprocessing is known to be an
essential step for an effective mining process. In paper [9], a novel
pre-processing technique is proposed which removes local and global
noise and web robots. The anonymous Microsoft web dataset and the
MSNBC.com anonymous web dataset are used for evaluating this
preprocessing technique.
The paper [1] describes the effective and complete preprocessing of the
access stream before the actual mining process. Here the site topology
is used to identify the user and to complete the missing path. To label
the session, the time duration between two consecutive websites visited
by the particular user is calculated. It is calculated every time a
user switches from one website to another, together with the amount of
time spent on each website.

6. PROPOSED METHODOLOGY
The proposed method describes user behavior and creates user clusters
and site clusters. It also gives information such as which sites are
the most and least popular, which website is most commonly used by
visitors and from which search engine visitors come most frequently. In
this method, if the IP address is unique then a similar user cluster is
created; if the IP address is the same and the user name is not unique,
and the agent log, operating system and browser are different, then a
distinguish user cluster is created.
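The user clustering rule of Section 6 can be read in more than one way. The following Python sketch shows one possible reading, offered as an assumption rather than as the authors' implementation: an IP address that always appears with a single profile (user name, user agent, operating system, browser) goes to the similar user cluster, while an IP address seen with several different profiles contributes each profile to the distinguish user cluster. The `LogRecord` type and its field names are hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class LogRecord(NamedTuple):
    """Hypothetical cookie-based weblog record (field names are assumptions)."""
    ip: str
    user_name: str
    user_agent: str
    os: str
    browser: str


def cluster_users(records):
    """Return (similar, distinguish) lists of (ip, profile) pairs."""
    profiles_by_ip = defaultdict(set)
    for r in records:
        profiles_by_ip[r.ip].add((r.user_name, r.user_agent, r.os, r.browser))

    similar, distinguish = [], []
    for ip, profiles in profiles_by_ip.items():
        if len(profiles) == 1:
            # One profile behind this IP address: treated as a single user.
            similar.append((ip, next(iter(profiles))))
        else:
            # Same IP address, different profiles: separate users.
            distinguish.extend((ip, p) for p in sorted(profiles))
    return similar, distinguish


if __name__ == "__main__":
    log = [
        LogRecord("10.0.0.1", "asha", "Mozilla/5.0", "Windows", "Firefox"),
        LogRecord("10.0.0.1", "asha", "Mozilla/5.0", "Windows", "Firefox"),
        LogRecord("10.0.0.2", "ravi", "Mozilla/5.0", "Linux", "Chrome"),
        LogRecord("10.0.0.2", "meena", "Mozilla/4.0", "Windows", "IE"),
    ]
    similar, distinguish = cluster_users(log)
    print(len(similar), "similar,", len(distinguish), "distinguish")
```

Under this reading every user falls into exactly one of the two clusters, so the two cluster sizes add up to the total number of users.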
A. Steps followed
1. Create similar user cluster and distinguish user cluster based on
   IP address.
2. Create site clusters based on frequently accessed sites.
3. If the number of sites in the current site cluster is greater than
   in the previous site cluster, then assign it as the most popular
   site.
4. Return the most popular site.
5. Otherwise assume that it is the least popular site.
6. Return the least popular site and repeat until all user & site
   clusters are processed.

VOB algorithm
Input
    Web log files
Output
    User cluster, site cluster, most popular site, least popular site.
Algorithm
    If (IP address is unique) then
        Create similar_user_cluster;
        Return similar_user_cluster;
    If (IP address is same and user name is not unique, agent log,
        operating system and browsers are different) then
        Create distinguish_user_cluster;
        Return distinguish_user_cluster;
    For i = sitecluster_1 to sitecluster_n do
        If (no. of sites in current site cluster > previous site cluster) then
            Most Popular = current_site_cluster
            return “most popular site”
        else
            Least Popular = current_site_cluster
            return “least popular site”
    repeat until all user & site clusters are processed.

In the proposed method VOB, clustering plays a key role in classifying
web visitors on the basis of user click history and similarity measure.
This algorithm considers four entities, namely IP address, user name,
website name and frequency of accessed sites. Cookie-based weblogs are
taken as the input, which mainly serves to classify the unique users
and helps to create user clusters.

Here, the website and webpage navigation behavior are considered as the
basic source for tracing the visitors’ online behavior and for
identifying the interest of the user in accessing the various websites.
Based on the number of sites in the site clusters, a website is
concluded to be the most popular or the least popular one. The
frequency is calculated from the time difference and the total number
of clicks on a particular website given in a log file. Hence the VOB
algorithm effectively traces the behavior of online users, which
supports the website usage analysis.

7. EXPERIMENTAL SETUP AND PERFORMANCE ANALYSIS
The weblog files were collected from a college web server and browser
machine for a period of 6 months from January 2013 to June 2013. For
implementation, Java (JDK 1.6) is used on a system which possesses an
Intel Core i3 processor with 4 GB RAM.

7.1 Performance evaluation
The performance evaluation is done by analyzing the dataset taken. In
the period of 6 months, the total number of users is 5080. By the
proposed algorithm, web visitors are classified on the basis of user
click history and similarity measure. The processed dataset is given
below.

Table 3. Processed Dataset
Label                    | Processed Value
Total No. of users       | 5080
Similar user cluster     | 2279
Distinguish user cluster | 2801

The following “Fig 6.” shows the creation of user cluster.

[Figure 6. User cluster creation. Bar chart titled “User Cluster
Creation” showing Total No. of users, Similar User Cluster and
Distinguish User Cluster; vertical axis from 0 to 6000]

The VOB algorithm identifies the users based on the data collected from
cookies. This algorithm takes all the users and their requests into
account for processing. According to the results, the proposed VOB
algorithm succeeds in classifying users into the similar user cluster
and the distinguish user cluster.

The total number of sites visited by the users is calculated as 12682.
Among these, the maximum number of visits, 4700 in total, was made to
educational websites. The users gave the next preference to social
networking sites.

The number of visits made to social networking sites is 3269. Also,
3031 users referred to research sites. In the months of April and May
alone, 1230 users used electronic commerce websites.
The number of visits to entertainment sites is 452, which explicitly
shows the minimum desirability of that kind of site. “Fig 7.” shows
that site clusters are created based on frequently accessed sites.

From “Fig 8” it is seen that the maximum weightage is given to
educational sites compared with other sites such as entertainment,
social, electronic commerce and research.
The most popular website is identified based on the condition that the
number of sites in the current site cluster is greater than in the
previous site cluster. Otherwise it is assumed to be the least popular
site. The same procedure is repeated until all user and site clusters
have been processed.
“Fig 9.” shows that the proposed algorithm proves its efficiency in
classifying the preferences of users for various categories of
websites.
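The category counts reported in Section 7.1 can be checked with a few lines of Python, together with a sketch of the site cluster pass. This is a hedged illustration, not the authors' Java implementation. The counts are those stated in the text; the dictionary layout and the function name `most_and_least_popular` are assumptions. Where the pseudocode compares each cluster only with the previous one, the sketch keeps a running maximum and minimum so that a single pass yields both answers.

```python
# Visit counts per site category, as reported in Section 7.1.
VISITS = {
    "educational": 4700,
    "social networking": 3269,
    "research": 3031,
    "electronic commerce": 1230,
    "entertainment": 452,
}


def most_and_least_popular(site_clusters):
    """Return (most popular, least popular) names from a name -> count mapping.

    Variation on the VOB pseudocode: a running maximum and minimum are
    tracked instead of comparing only with the previous cluster.
    """
    if not site_clusters:
        raise ValueError("at least one site cluster is required")
    items = iter(site_clusters.items())
    most = least = next(items)
    for name, count in items:
        if count > most[1]:
            most = (name, count)
        elif count < least[1]:
            least = (name, count)
    return most[0], least[0]


if __name__ == "__main__":
    total = sum(VISITS.values())
    print("total visits:", total)  # 12682, matching the reported total
    most, least = most_and_least_popular(VISITS)
    print("most popular:", most)
    print("least popular:", least)
    for name, count in VISITS.items():
        print(f"{name}: {count / total:.1%}")
```

The five category counts add up to the reported total of 12682, and the two user clusters of Table 3 (2279 and 2801) likewise add up to the 5080 users.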
Figure 8. Overall Performance of VOB algorithm

In this algorithm, the creation of user clusters and site clusters is
considered the main task, and it helps to carry out website usage
analysis based on the users’ website surfing behavior.

8. CONCLUSION AND SUMMARY
Web usage mining has emerged as an essential tool for realizing more
personalized, user-friendly and business-optimal web services. The key
is to use the user click-stream data for many mining purposes.
Traditionally, web usage mining is used by e-commerce sites to organize
their sites and to increase profits. The newly proposed algorithm,
Visitors’ Online Behavior (VOB), identifies user behavior and creates
the user cluster, the site cluster, the most popular website and the
least popular website. Visitors’ on-line behaviors must be traced for
website usage analysis. This analysis yields knowledge about how
visitors use a website, which can provide guidelines for website
reorganization and help to prevent disorientation.

10. REFERENCES
[1] V.V.R. Maheswara Rao and Dr. V. Valli Kumari, “An Enhanced
    Pre-Processing Research Framework For Web Log Data Using A Learning
    Algorithm”, Netcom 2010, CSCP 01, pp. 01–15, 2011.

[2] Mr. Sanjay Bapu Thakare and Prof. Sangram. Z. Gawali, “A Effective
    and Complete Preprocessing for Web Usage Mining”, (IJCSE)
    International Journal on Computer Science and Engineering, Vol. 02,
    No. 03, 2010, 848-851.

[3] Hussain. T, Asghar. S and Masood. N, “Web usage mining: A survey on
    preprocessing of web log file”, Information and emerging
    technologies, 2010, ISBN: 978-1-4244-8001-2.

[4] Jiawei Han and Micheline Kamber, “Data mining - Concepts and
    Techniques”, second edition, Elsevier, Reprint 2010.

[5] http://www.galeas.de/webmining.html.

[6] M. Malarvizhi, S. A. Sahaaya Arul Mary, “Preprocessing of
    Educational Institution Web Log Data for Finding Frequent Patterns
    using Weighted Association Rule Mining Technique”, European Journal
    of Scientific Research, ISSN 1450-216X, Vol. 74, No. 4, 617-633,
    2012.

[7] Sheetal A. Raiyani and Shailendra Jain, “Efficient Preprocessing
    technique using Web log mining”, International Journal of
    Advancements in Research & Technology, 1(6), ISSN 2278-7763, 2012.

[8] J. Srivastava, R. Cooley, M. Deshpande, and P.N. Tan, “Web Usage
    Mining: Discovery and Applications of Usage Patterns from Web
    Data”, ACM SIGKDD Explorat. Newsletter, 2000.

[9] V. Chitraa, Dr. Antony Selvadoss Devamani, “A Novel Technique for
    Sessions Identification in Web Usage Mining Preprocessing”,
    International Journal of Computer Applications, Volume 34 – No. 9,
    2012.

[10] Vijayashri Losarwar and Dr. Madhuri Joshi, Data Preprocessing in
    Web Usage Mining, International Conference on Artificial
    Intelligence and Embedded Systems (ICAIES'2012), July 15-16,
    Singapore, 2012.

[11] R. Suguna et al., “User interest level based preprocessing
    algorithms using web usage mining”, International Journal on
    Computer Science and Engineering.

[12] Navin Kumar Tyagi, A.K. Solanki & Sanjay Tyagi, An algorithmic
    approach to data preprocessing in web usage mining, International
    Journal of Information Technology and Knowledge Management,
    July-December 2010, Volume 2, No. 2, pp. 279-283.

[13] Cooley, R., B. Mobasher and J. Srivastava, 1997. Web mining:
    Information and pattern discovery on the World Wide Web. Proceeding
    of the 9th IEEE International Conference on Tools with Artificial
    Intelligence, Newport Beach, CA., pp: 558-567. DOI:
    10.1109/TAI.1997.632303

[14] Srivastava, J., R. Cooley, M. Deshpande and P.N. Tan, 2000. Web
    usage mining: discovery and applications of usage patterns from Web
    data. ACM SIGKDD Explorat. Newsletter, 1: 12-23. DOI:
    10.1145/846183.846188

[15] Agrawal, R. and R. Srikant, 1994. Fast algorithms for mining
    association rules in large database. Proceeding of the 20th
    Conference on Very Large Data Bases, Sept. 12-15, Morgan Kaufmann
    Publishers Inc., San Francisco, CA. USA., pp: 487-499. DOI:
    10.1234/12345678

[16] C.P. Sumathi et al., Automatic Recommendation of Web Pages in Web
    Usage Mining, (IJCSE) International Journal on Computer Science and
    Engineering, Vol. 02, No. 09, 2010, 3046-3052.

[17] Nanhay Singh, Achin Jain, Ram Shringar Raw, Comparison Analysis Of
    Web Usage Mining Using Pattern Recognition Techniques,
    International Journal of Data Mining & Knowledge Management Process
    (IJDKP), Vol. 3, No. 4, July 2013.

[18] L.K. Joshila Grace, V. Maheswari, Dhinaharan Nagamalai, “Analysis
    of Weblogs and Web User in Web Mining”, International Journal of
    Network Security & Its Applications (IJNSA), Vol. 3, No. 1, January
    2011.

[19] Yang Bin, Dong Xianguin, Shi Fufu, “Research of Web Usage Mining
    based on Negative Association Rules”, International Forum on
    Computer Science-Technology and Applications, 2009.
IJCATM : www.ijcaonline.org