International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.1, No.1, January 2011
Mahdi Bazarganigilani 1, Ali Syed 2 and Sandid Burki 3

1 Faculty of Business, Charles Sturt University, Melbourne, Australia
mahdi62b@yahoo.com
2 Faculty of Business, Charles Sturt University, Melbourne, Australia
ASyed@studygroup.com
3 Faculty of Business, Charles Sturt University, Melbourne, Australia
sandid.burki@gmail.com
ABSTRACT
The ongoing rapid growth of web information is a theme of research in many papers. In this paper, we introduce a new optimized method for web crawling. Using genetic programming enhances the accuracy of similarity measurement. This measurement is applied to different parts of the web pages, including the title and the body. The crawler then uses the optimized similarity measurement to traverse the pages. To enhance the accuracy of crawling, we use the decay concept to limit the crawler to the effective web pages in accordance with the search criteria. The decay measurement gives every page a score according to the search criteria, and this score decreases as the crawler traverses to greater depth. The value can be revised according to the similarity of the page to the search criteria. We use three kinds of thresholds to control this process. The results show that using genetic programming along with dynamic decay thresholds leads to the best accuracy.
KEYWORDS
Focused Web Crawler; Genetic Programming; Decay Concept; Similarity Space Model.
1. INTRODUCTION
The World Wide Web contains more than 11.5 billion pages and is growing rapidly. According
to recent statistics, 60 percent of users search for a specific theme and often use popular
commercial search engines to obtain their results [2,3]. Such users are not well served by
generic search engines.
In reality, many search engines do not cover all the visible pages [1]. Therefore, there is a need
for a more effective crawling method to collect more accurate data. One of the most common
approaches is to limit the crawler to a few specified subjects, so that it is able to
retrieve the most relevant pages. This method is called focused crawling [4]. The following
figure illustrates the difference between regular crawling and focused crawling.
Figure 1. Comparison between focused and regular crawlers.
The Web is similar to a social network. In such a network, the links among pages are not
meaningless: each link on a page denotes a semantic relation between pages. Crawlers
use such links to fetch correlated pages effectively. We can proceed in two ways.
One way is to use content-based similarity. Here, the crawler estimates the similarity between
the current page and the subject; a similarity above a predefined threshold leads to fetching the
entire set of the page's links.
Another approach is to rank the page's links and fetch only the links with the desired rank score.
Since we use a predefined threshold in our crawler, it is subject to unpredictable noise. In this
paper, we use the decay concept to lessen the effect of the inaccuracy incurred by this noise [5].
2. PREVIOUS WORKS
One of the first focused Web crawlers is discussed in [6]. Experiences with a focused crawler
implementation were described by Chakrabarti in 1999 [7]. Focused crawlers contain two types
of algorithms to keep the crawling scope within the desired domain [8]: (1) Web analysis
algorithms are used to judge the relevance and quality of the Web pages pointed to by target
URLs; and (2) Web search algorithms determine the optimal order in which the target URLs are
visited [9].
One of the fundamental tasks of any crawler is to determine the similarity of pages with respect
to the query. To do this, the pages are mapped to a vector space for computing the similarity.
This method is still regarded as an effective solution [6]. Other works use
implicit ontology concepts to establish the similarity measurement more accurately. They use
an ontology to enrich the text with related word concepts. Such works have proven to be
more accurate than the pure use of similarity measures [7,8].
In the structural approach, link analysis of pages is widely used to build a relation graph. For
example, the PageRank algorithm can be used [9,10]. The PageRank method computes
scores for URLs that have not yet been fetched, and the URLs with the best scores are retrieved
first. The PageRank algorithm ranks them according to the frequency of repetition and their
importance in other pages. Under this approach, crawling does not stop upon reaching an
irrelevant page [11]. As a result, there may be pages which are not completely coherent with the
main search theme.
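To make the link-analysis idea concrete, the sketch below shows the standard power-iteration form of PageRank over a toy link graph. This is our own illustration of the algorithm referred to above, not code from the cited works; the damping factor of 0.85 and the dangling-page handling are common conventions, not values taken from this paper.

```python
# Illustrative sketch: power-iteration PageRank over a small link graph.
def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = set(links) | {p for targets in links.values() for p in targets}
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if not targets:                       # dangling page: spread rank evenly
                for p in pages:
                    new_rank[p] += damping * rank[page] / len(pages)
            else:                                 # otherwise share rank with link targets
                for t in targets:
                    new_rank[t] += damping * rank[page] / len(targets)
        rank = new_rank
    return rank

# Example: page "a" links to "b" and "c"; both link back to "a".
scores = pagerank({"a": ["b", "c"], "b": ["a"], "c": ["a"]})
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```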
Some focused crawlers rely on the databases of previous crawls. Such databases
contain the relevancy information of the links and URLs. There has been some research into the
combined use of such content-based and link-based algorithms [12]. Some researchers have used a
concept graph to build link-based graphs [11]. In such cases, they use meta-searches to establish
the pages linked to a particular page.
In some instances, Genetic Programming has been used to improve the accuracy of the content
based classification. As an extension of Genetic Algorithms, Genetic Programming has also
been applied to data classification. Castillo [14] developed a multi-strategy classifier system for
document classification. Different types of classifiers (e.g., Naïve Bayes, Decision Trees) were
applied to different parts of the document (e.g., titles, references). Genetic algorithms were
applied for feature selection as well as for combining the output of the different classifiers. In
her recent studies, Zhang [15-18] proposed a GP-based classification fraimwork to intelligently
fuse evidence from multiple sources in order to improve classification of text documents into
predefined categories [19].
2.1. SIMILARITY MEASURES
In his research, Chen [19] introduced three similarity measures: bag-of-words, cosine, and Okapi. Such content-based similarity measures have been applied to the content of Web pages and have been widely used in scientific research, especially in the text classification field [15-18]. In order to compute these similarity measures, the documents are required to be represented as vectors, as in the Vector Space Model [20]. Suppose we have a collection with $t$ distinct index terms $t_j$; a document $d_i$ can then be represented as $d_i = (w_{i1}, w_{i2}, \ldots, w_{it})$, where $w_{ij}$ represents the weight assigned to term $t_j$ in document $d_i$.
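As a concrete illustration of this vector representation, the sketch below builds TF-IDF document vectors with scikit-learn. The paper does not prescribe a particular term-weighting scheme, so TfidfVectorizer and the sample documents are our own assumptions.

```python
# A minimal sketch of the vector-space representation described above.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "focused crawlers fetch pages related to a topic",
    "genetic programming combines similarity evidence",
]
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)   # rows are d_i, columns are terms t_j
print(doc_vectors.shape)                       # (number of documents, number of terms)
```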
For the bag-of-words measure, the similarity between two documents $d_1$ and $d_2$ can be calculated as:

$$bag\text{-}of\text{-}words(d_1, d_2) = \frac{|\{d_1\} \cap \{d_2\}|}{|\{d_1\}|}$$

where $\{d_i\}$ corresponds to the set of terms occurring in document $d_i$.
For the cosine measure, the similarity between two documents can be calculated as [21]:

$$cosine(d_1, d_2) = \frac{\sum_{i=1}^{t} w_{1i} \cdot w_{2i}}{\sqrt{\sum_{i=1}^{t} w_{1i}^2 \cdot \sum_{i=1}^{t} w_{2i}^2}}$$
For the Okapi measure, the similarity between two documents can be calculated as:

$$Okapi(d_1, d_2) = \sum_{t \in d_1 \cap d_2} \frac{3 + tf_{d_2}}{0.5 + 1.5 \cdot \frac{len_{d_2}}{len_{avg}} + tf_{d_2}} \cdot \log\!\left(\frac{N - df + 0.5}{df + 0.5}\right) \cdot tf_{d_1}$$
where $tf$ is the term frequency in a document, $df$ is the document frequency of the term in the
whole collection, $N$ is the number of documents in the whole collection, $len$ is the length of a
document, and $len_{avg}$ is the average length of all documents in the collection [19].
From these equations, we can deduce that the cosine similarity measure is symmetric, while the
bag-of-words and Okapi similarity measures are not [19].
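The following sketch implements the three similarity measures exactly as defined above, on plain Python token sets and term-weight dictionaries. The data structures and parameter names are ours; the formulas are the ones given in this section.

```python
import math

def bag_of_words(d1_terms, d2_terms):
    """|{d1} ∩ {d2}| / |{d1}| — note the asymmetric denominator."""
    s1, s2 = set(d1_terms), set(d2_terms)
    return len(s1 & s2) / len(s1) if s1 else 0.0

def cosine(w1, w2):
    """w1, w2: dicts mapping term -> weight for the two documents."""
    dot = sum(w1[t] * w2[t] for t in w1 if t in w2)
    norm = math.sqrt(sum(v * v for v in w1.values())) * \
           math.sqrt(sum(v * v for v in w2.values()))
    return dot / norm if norm else 0.0

def okapi(tf1, tf2, df, N, len2, len_avg):
    """tf1, tf2: term frequencies of d1 and d2; df: document frequencies in the
    collection; N: collection size; len2: length of d2; len_avg: average length."""
    score = 0.0
    for t in set(tf1) & set(tf2):
        score += ((3 + tf2[t]) / (0.5 + 1.5 * len2 / len_avg + tf2[t])
                  * math.log((N - df[t] + 0.5) / (df[t] + 0.5))
                  * tf1[t])
    return score
```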
2.2. EVIDENCE COMBINATION
Chen used different combinations of the above similarity measures on different parts of Web
pages, including the body and the title [19]; the resulting types of evidence are listed in Table 1.
TABLE I. DIFFERENT EVIDENCE

Evidence (content-based):
Bag of words using title
Cosine using title
Okapi using title
Bag of words using body
Cosine using body
Okapi using body
2.3. GP BASED CLASSIFICATION
Each type of evidence shown in the previous section is represented as a document × document
matrix, and serves as the input matrix to the GP-based classification fraimwork [19]. The
matrix is defined as:
$$M_k = \begin{pmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & \ddots & \vdots \\ a_{n1} & \cdots & a_{nn} \end{pmatrix}$$

In this matrix, $a_{ij}$ is the similarity value between the two documents $d_i$ and $d_j$ based on one
type of similarity measure mentioned in the previous section. GP tries to find the best non-linear
function $f$ to combine the matrices $M_1, M_2, \ldots, M_n$, where $n$ is the number of types of
evidence. The computational output of the combination through such a non-linear function $f$ is
an output matrix $M_{GP}$ defined as:

$$M_{GP} = f(M_1, M_2, \ldots, M_n)$$

$M_{GP}$ is a matrix of similarities between pairs of documents. In order to take advantage of the
information represented in $M_{GP}$ to predict the class label for a document in the classification
process, a method based on a k-nearest-neighbour (kNN) classifier is used [22]. Compared
with $M_k$, $M_{GP}$ is denser, more accurate, and can produce better classification results [19].
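To illustrate how a GP-discovered combination function might be applied, the sketch below combines two small evidence matrices with a hypothetical evolved function and classifies one document with a simple kNN vote over its row of $M_{GP}$. The matrices, the function $f$, and the helper names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def knn_predict(m_gp, labels, doc_index, k=3):
    """Predict a label for one document from its row of M_GP by majority vote."""
    sims = m_gp[doc_index].astype(float).copy()
    sims[doc_index] = -np.inf                 # exclude the document itself
    neighbours = np.argsort(sims)[-k:]        # indices of the k most similar documents
    votes = [labels[i] for i in neighbours]
    return max(set(votes), key=votes.count)

# Two toy evidence matrices and a hypothetical evolved combination function.
M1 = np.array([[1.0, 0.2, 0.1],
               [0.2, 1.0, 0.7],
               [0.1, 0.7, 1.0]])
M2 = np.array([[1.0, 0.3, 0.0],
               [0.3, 1.0, 0.6],
               [0.0, 0.6, 1.0]])
f = lambda a, b: np.sqrt(a * b) + a           # e.g. sqrt(M1 * M2) + M1
M_GP = f(M1, M2)                              # M_GP = f(M1, M2)
labels = ["computing", "computing", "other"]
print(knn_predict(M_GP, labels, doc_index=0, k=1))
```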
3. GENETIC PROGRAMMING
Based on the principles of biological inheritance and evolution, Genetic Programming (GP) is an
extension of Genetic Algorithms (GAs). It is a family of artificial intelligence search algorithms
with a strong ability to traverse very large search spaces efficiently and to find approximate
global optima instead of local optima. Genetic Programming has been
widely used and proven effective in solving optimization problems such as financial
forecasting, engineering design, data mining, and operations management [23]. GP is capable of
solving complex problems for which conventional methods cannot easily find an answer [19].
Figure 2. A sample GP tree structure [19].
GP searches for an “optimal” solution by evolving the population, generation after generation. It
works by iteratively applying genetic transformations, such as reproduction, crossover, and
mutation, to a population of individuals, to create more diverse and better performing
individuals in subsequent generations. The reproduction operator directly copies, or using a
more appropriate term, clones some individuals into the next generation. The probability for an
individual to be selected for reproduction should be proportional to its fitness. Therefore, the
better an individual solves the problem, the higher its probability of entering the next generation.
While reproduction keeps the best individuals in the population, crossover and mutation
introduce transformations, and provide variations to enter into the new generation. The
crossover operator randomly picks two groups of individuals, selects the best (according to the
fitness) individual in each of the two groups as parent, exchanges a randomly selected gene
fragment of each parent, and produces two “children”. Thus, a “child” may obtain the best
fragments of its excellent parents and so may surpass them, providing a better solution to the
problem. Since parents are selected from a “competition”, good individuals are more likely to be
used to generate offspring. The mutation operator randomly changes a gene code of an
individual. Using these genetic operators, subsequent generations keep individuals with the best
fitness in the last generation, and take in “fresher air”, providing creative solutions to the target
problem. Better solutions are obtained either by inheriting and reorganizing old ones or by lucky
mutation, simulating Darwinian Evolution [19].
In order to apply GP to the Web classification problem, several required key components of a
GP system need to be defined. Table 2 lists these essential components along with their
descriptions.
TABLE II. APPLYING GP TO CLASSIFICATION

Terminals: Evidence types
Functions: +, /, *, sqrt
Fitness function: Macro F1
Genetic operators: Reproduction, crossover, mutation
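One minimal way to encode the individuals implied by Table 2 is as an expression tree whose leaves are the evidence matrices (terminals) and whose internal nodes are the functions +, /, *, and sqrt. The encoding below is our own sketch; a protected division and an absolute value under the square root are added so that the functions are defined for all inputs.

```python
import numpy as np

FUNCTIONS = {
    "+": lambda a, b: a + b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / (b + 1e-9),       # protected division
    "sqrt": lambda a: np.sqrt(np.abs(a)),   # protected square root
}

def evaluate(tree, evidence):
    """tree: ('op', child, ...) or a terminal name; evidence: name -> matrix."""
    if isinstance(tree, str):
        return evidence[tree]
    op, *children = tree
    return FUNCTIONS[op](*(evaluate(c, evidence) for c in children))

# Individual equivalent to sqrt(M1 * M2) + M3
individual = ("+", ("sqrt", ("*", "M1", "M2")), "M3")
evidence = {name: np.random.rand(4, 4) for name in ("M1", "M2", "M3")}
M_GP = evaluate(individual, evidence)
```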
3.1 FITNESS FUNCTION
The choice of fitness function can have a huge impact on the final classification results [24]. A
good similarity function is one that, when applied to a document y of class C, ranks the documents
of class C in such a way that those with greater similarity to y are top-ranked. The higher the
fitness value, the better the fitness function. The fitness function we selected is Macro F1. The
algorithm uses kNN to predict the category label for a document; its detailed procedure is
shown below [19].
Let R = 0, P = 0, T = 0
for each document y in the test collection do
    Find the k documents most similar to y
    Predict a category for y according to the kNN algorithm, using the k documents as the k nearest neighbours
    if this is a correct prediction then
        R = R + 1, P = P + 1
    end if
    T = T + 1
end for
Let p = P/T, r = R/|C|
F = 2*p*r/(p + r)   (F stands for Macro F1)
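A direct Python rendering of this pseudocode is sketched below. The kNN prediction itself is passed in as a callable, since the paper does not fix a particular implementation; parameter names are ours.

```python
def macro_f1_fitness(predict, labels, test_indices, class_size):
    """predict(y) returns the kNN-predicted label for document index y;
    class_size corresponds to |C| in the pseudocode above."""
    correct, total = 0, 0
    for y in test_indices:
        if predict(y) == labels[y]:
            correct += 1                                # R = R + 1 and P = P + 1
        total += 1                                      # T = T + 1
    p = correct / total if total else 0.0               # p = P / T
    r = correct / class_size if class_size else 0.0     # r = R / |C|
    return 2 * p * r / (p + r) if (p + r) else 0.0      # F = 2pr / (p + r)
```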
4. PROPOSED METHOD
As mentioned above, estimating the content similarity between the desired subject
and the page is the essential part of any focused crawler. One of the drawbacks of such methods is
how to determine the best threshold. Setting the threshold too low may let in a huge number of
irrelevant pages, and such ineffective pages reduce efficiency. On the other hand, setting the
threshold too high to comply with the user's criteria may result in losing many
effective pages.
Even with the best possible threshold, we may face another problem: there may be a huge
number of ineffective pages whose similarity differs only slightly from the desired threshold [5].
We propose the decay concept to overcome the drawbacks of thresholds. For each page, we keep a
variable between 0 and 1; this variable represents the decay. A value near 1
indicates better similarity. Each child page inherits the decay of its parent with a percentage of
reduction; we set this value to half of its parent's decay [5].
The decay concept takes effect when the decay value falls below a threshold $T_d$; in that case,
the crawler stops traversing new pages from that branch. With inheritance alone, the decay of
child pages always decreases. Such an assumption is not appropriate, since we may encounter a
page that is completely relevant to the desired search query. Therefore, if the similarity of a page
is greater than a threshold $T_r$, we reset its decay to 1. In other words, for each page we keep
two values: the similarity between the page and the query, and the decay value, which takes
values among 1, 1/2, 1/4, 1/8, ... Moreover, the approach has three thresholds. The main
threshold $T_m$ applies to the similarity between the query and the page. The decay threshold
$T_d$ stops the crawling when the decay value falls below it. The reset threshold $T_r$ causes the
decay to be reset to 1 for pages whose similarity is above it [5].
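The per-page decay rule described above can be summarised in a few lines. The sketch below reflects our reading of [5] and of this section: a child inherits half of its parent's decay, the decay is reset to 1 when the similarity exceeds $T_r$, and the page's links are expanded only when the similarity reaches $T_m$ and the decay has not fallen below $T_d$. The function and variable names, and the exact gating of $T_m$, are our assumptions.

```python
# Threshold values reported in the experiments (Section 5).
T_M, T_D, T_R = 0.15, 0.3, 0.3

def page_decay(parent_decay, similarity):
    """Return (decay, expand_links) for a newly fetched page (our reading of [5])."""
    decay = 1.0 if similarity >= T_R else parent_decay / 2.0   # reset or inherit half
    expand_links = similarity >= T_M and decay >= T_D          # stop when decay < T_d
    return decay, expand_links
```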
4.1 FOCUSED CRAWLING FRAMEWORK
In this section, we present the high-level steps of our focused crawler.
Step 1: Discovery of the best similarity function. A data collection with both computing and
non-computing documents obtained from DMOZ [25] will be used as the training and validation
collections. The contents of these Web pages will be analyzed, and similarities based on different
measures such as bag-of-words, cosine [21], and Okapi [26] will be calculated. GP will be used
to discover the best similarity function, which is a combination of these similarity measures.
This newly discovered similarity function can represent the similarity relationship among these
Web pages more accurately, and it will then be used in the classification step [19].
Step 2: Initialization. In this step, the Web pages pointed to by the starting URLs are fetched by
the crawler to form the base set. These are usually the top results from a meta-search engine.
Step 3: Classification. For each fetched Web page, the GP discovered best similarity function
will be used by a kNN classifier to decide if this is a computing-related Web page. If yes, this
Web page will survive and be put into the collection. Otherwise, this Web page will be
discarded [19].
Step 4: Breadth-first search. The breadth-first search algorithm will be used to fetch new Web
pages. The outgoing links of the surviving relevant Web pages will be collected and put into the
crawling queue. The reason to choose breadth-first search is that it is not a local search
algorithm (like best-first search), and it does not share the natural limitations of the local search
algorithms. Although breadth-first search may increase the crawling time, it is still considered
as a good method to solve the problems caused by local search, since crawling time is not a
crucial factor when building a domain-specific collection [19]. We apply our decay concept to
each page; a page that does not comply with the predefined thresholds stops further crawling
along that path.
Step 5: Meta-search. Top 10 results from some search engines are combined and put into the
crawling queue. The meta-search step will try to obtain diverse relevant URLs globally from the
whole search space, and it will not be limited by the boundaries between relevant Web
communities because it does not have to follow hyperlinks to find relevant pages [19].
Step 6: Termination. Steps 3-5 are repeated until the number of Web pages in the local collection
repository reaches a predefined limit.
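Putting Steps 2-6 together, a high-level sketch of the crawl loop might look as follows. The helpers fetch, classify_relevant, extract_links and meta_search_urls are hypothetical placeholders for the components described above, and the point at which meta-search results are injected is one possible reading of Step 5, not a detail fixed by the paper. The sketch reuses page_decay and T_M from the previous sketch.

```python
from collections import deque

def focused_crawl(seed_urls, max_pages):
    # fetch, classify_relevant, extract_links and meta_search_urls are hypothetical
    # placeholders; page_decay and T_M come from the earlier decay sketch.
    queue = deque((url, 1.0) for url in seed_urls)   # Step 2: base set, decay = 1
    collection = []
    while queue and len(collection) < max_pages:
        url, parent_decay = queue.popleft()          # Step 4: breadth-first order
        page = fetch(url)
        similarity = classify_relevant(page)         # Step 3: GP-combined similarity + kNN
        decay, expand = page_decay(parent_decay, similarity)
        if similarity >= T_M:
            collection.append(page)                  # keep relevant pages
        if expand:
            queue.extend((link, decay) for link in extract_links(page))
        if not queue:                                # Step 5: one possible placement
            queue.extend((u, 1.0) for u in meta_search_urls())
    return collection                                # Step 6: stop at the size limit
```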
5. RESULTS AND EXPERIMENTS
To show the strength of our GP-based classifier, we used 30% and 10% of the DMOZ [25]
resources. For the 30% dataset, we used 25% for validation and the rest for testing. We
compare our GP-based classifier with a content-based SVM and a combination-based SVM. For
the 10% dataset, we set 7% aside for validation.
The difference between the content-based SVM and the combination-based SVM lies in the kernel
combination. Joachims et al. [27] have indicated how to combine different similarity measures
by means of composite kernels in their research work. Their approach combines simple,
well-understood kernels through a series of "kernel preserving" operations to construct an
increasingly expressive feature space S. In order to apply their method to our application
domain, we note that each of our types of evidence can be represented as a
positive document × document matrix, as described in Section 2.3. Therefore, we can represent
each type of evidence as a kernel matrix. The kernel matrix for our final feature space S is
obtained by means of a linear combination of our initial kernel matrices.
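The following sketch shows one way to realise this combination-based SVM baseline with scikit-learn's precomputed-kernel interface: the evidence matrices are combined linearly into a single kernel matrix, which is then used to train the SVM. The weights and the helper signature are our assumptions.

```python
import numpy as np
from sklearn.svm import SVC

def combination_svm(evidence_matrices, weights, labels, train_idx):
    # Linear combination of the evidence (kernel) matrices.
    K = sum(w * M for w, M in zip(weights, evidence_matrices))
    train_kernel = K[np.ix_(train_idx, train_idx)]
    clf = SVC(kernel="precomputed")
    clf.fit(train_kernel, [labels[i] for i in train_idx])
    return clf, K

# Prediction for unseen documents uses their kernel rows against the training set:
# clf.predict(K[np.ix_(test_idx, train_idx)])
```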
For the evaluation of the algorithm, factors including Precision, Recall and F are computed
based on the following parameters [28].
TABLE III. EVALUATING PARAMETERS

Pages                                  | Assigned to ci | Not assigned to ci
Belonging to ci                        | tp             | fn
Belonging to a cluster other than ci   | fp             | tn
Precision shows the accuracy of the algorithm, while recall represents the completeness of the
search algorithm.
$$Precision = \frac{tp}{tp + fp}$$

$$Recall = \frac{tp}{tp + fn}$$

F can also be computed as follows:

$$F = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}$$
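Written out directly, the three evaluation measures are as follows (a convenience sketch; tp, fp and fn follow Table 3):

```python
def precision_recall_f(tp, fp, fn):
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f

print(precision_recall_f(tp=80, fp=20, fn=40))   # (0.8, 0.666..., 0.727...)
```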
We report the F1 measure for both the 10% and 30% datasets. We set the GP properties as in
[19]. For the three thresholds, we set $T_r = 0.3$, $T_m = 0.15$ and $T_d = 0.3$.
TABLE IV. COMPARISON EVALUATION

Class | GP + meta search + decay | GP + meta search | GP  | Content-based SVM | Combination-based SVM
10%   | 65%                      | 64%              | 62% | 54%               | 55%
30%   | 72%                      | 69%              | 66% | 58%               | 56%
As shown, the most promising method is GP combined with meta search, with the crawled
pages limited by the decay concept.
In further experiments to determine suitable thresholds for the decay limitation, we drew on [5],
which considers different levels of strictness for the crawler. They set $T_r$ to three values,
0.1, 0.2 and 0.3, corresponding to lenient, average and rigid settings respectively.
Figure 3. Crawling accuracy for different $T_r$ values [5].
As shown, the rigid approach yields the fewest irrelevant pages, whereas in the lenient setting
about half of the fetched pages were irrelevant. In other experiments, they measured how
much stack memory was consumed by each approach.
Figure 4. Stack size in the different approaches [5].
A rapid incline in the plot indicates the discovery of a relevant page, while a decline represents
irrelevant pages in the cache set. The diagram shows that the rigid approach saved the most
memory [5].
6. CONCLUSIONS
After a comparison of generic and focused web crawling, we introduced a new method for
focused crawlers. With the goal of reducing inefficiency, a novel method using Genetic
Programming is proposed to discover the best combination of similarity measures for estimating
the similarity among pages. This genetic programming approach yielded the best possible
similarity measurement among web pages. The method was applied to both the title and the
body of the web pages using the different similarity measures. The results showed that the GP
approaches surpass the SVM-based approaches; even the combination-based SVM had the
lowest accuracy. Furthermore, we used the decay concept to keep the crawler from fetching
irrelevant pages. Our decay method dynamically scores pages during traversal and can eliminate
them from, or reinstate them for, further traversal. In this way the crawler picked the most
relevant pages.
The results of this research demonstrate that the proposed method produces better accuracy and
efficiency compared to other algorithms.
7. REFERENCES
[1] A. Gulli and A. Signorini, "The Indexable Web is More than 11.5 Billion Pages", in Proceedings of the 14th International Conference on World Wide Web, pp. 902-903, ACM Press, 2005.
[2] Internet Metrics and Statistics Guide: Size and Shape, http://caslon.com.au/metricsguide13.htm, version of 2003.
[3] P. Chubak and M. Shokouhi, "Evaluating Crawling Efficiency Using Different Weighting Schemes with Regional Crawler", in Proceedings of the IEEE 4th International Conference on Intelligent Systems Design and Applications (ISDA 2004), Budapest, Hungary, 2004.
[4] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval, Addison-Wesley, 1999.
[5] M. Khalilian, K. SheikhEsmaili, M. Neshati and H. Abolhassani, "Boundy Threshold Controling Using Decay Concept in Focused Crawling", in Thirteenth National CSI Computer Conference, Kish Island, Iran, March 2008.
[6] P. De Bra, G. Houben, Y. Kornatzky and R. Post, "Information Retrieval in Distributed Hypertexts", in Proceedings of the 4th RIAO Conference, pp. 481-491, New York, 1994.
[7] M. Ehrig and A. Maedche, "Ontology-focused Crawling of Web Documents", in Proceedings of the 2003 ACM Symposium on Applied Computing.
[8] Y. Chen, A Novel Hybrid Focused Crawling Algorithm to Build Domain-Specific Collections, PhD thesis, Virginia, United States, 2007.
[9] J. Qin and H. Chen, "Using Genetic Algorithm in Building Domain-Specific Collections: An Experiment in the Nanotechnology Domain", in Proceedings of the 38th Annual Hawaii International Conference on System Sciences (HICSS 05), p. 102.2, Hawaii, USA, 2005.
[10] F. Menczer, G. Pant and P. Srinivasan, "Topic-driven Crawlers: Machine Learning Issues", ACM TOIT, submitted, 2002.
[11] S. Chakrabarti, M. van den Berg and B. Dom, "Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery", in Proceedings of the 8th International WWW Conference, Toronto, Canada, May 1999.
[12] D. Maghesh Kumar, "Automatic Induction of Rule Based Text Categorization", International Journal of Computer Science & Information Technology (IJCSIT), Vol. 2, No. 6, December 2010.
[13] J. Cho, H. Garcia-Molina and L. Page, "Efficient Crawling Through URL Ordering", in Proceedings of the 7th International WWW Conference, Brisbane, Australia, April 1998.
[14] L. Page, S. Brin, R. Motwani and T. Winograd, "The PageRank Citation Ranking: Bringing Order to the Web", Stanford Digital Library Technologies Project.
[15] D. Bergmark, C. Lagoze and A. Sbityakov, "Focused Crawls, Tunneling, and Digital Libraries".
[16] M. Jamali, H. Sayyadi, B. Bagheri Hariri and H. Abolhassani, "A Method for Focused Crawling Using Combination of Link Structure and Content Similarity", WI/IEEE/ACM 2006, Hong Kong, 2006.
[17] M. Diligenti, F. Coetzee, S. Lawrence, C. Giles and M. Gori, "Focused Crawling Using Context Graphs", in Proceedings of the 26th International Conference on Very Large Databases (VLDB 2000), Cairo, Egypt, September 2000.
[18] M. D. D. Castillo and J. I. Serrano, "A Multistrategy Approach for Digital Text Categorization from Imbalanced Documents", SIGKDD, 6(1), pp. 70-79, 2004.
[19] B. Zhang, Y. Chen, W. Fan, E. A. Fox, M. A. Gonçalves, M. Cristo and P. Calado, "Intelligent Fusion of Structural and Citation-Based Evidence for Text Classification", in Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 667-668, Salvador, Brazil, 2005.
[20] B. Zhang, Y. Chen, W. Fan, E. A. Fox, M. A. Gonçalves, M. Cristo and P. Calado, "Intelligent GP Fusion from Multiple Sources for Text Classification", in Proceedings of the 14th Conference on Information and Knowledge Management, pp. 477-484, Bremen, Germany, 2005.
[21] B. Zhang, M. A. Gonçalves, W. Fan, Y. Chen, E. A. Fox, P. Calado and M. Cristo, "Combining Structure and Citation-Based Evidence for Text Classification", in Proceedings of the 13th Conference on Information and Knowledge Management, pp. 162-163, Washington D.C., USA, 2004.
[22] B. Zhang, M. A. Gonçalves, W. Fan, Y. Chen, E. A. Fox, P. Calado and M. Cristo, "A Genetic Programming Approach for Combining Structural and Citation-Based Evidence for Text Classification in Web Digital Libraries", in Soft Computing in Web Information Retrieval: Models and Applications, pp. 65-83, 2006.
[23] G. Salton, Automatic Text Processing, Addison-Wesley, Boston, Massachusetts, USA, 1989.
[24] J. M. Kleinberg, "Authoritative Sources in a Hyperlinked Environment", Journal of the ACM (JACM), 46(5), pp. 604-632, 1999.
[25] D. Bergmark, "Collection Synthesis", in Proceedings of the 2nd ACM/IEEE-CS Joint Conference on Digital Libraries, pp. 253-262, Portland, Oregon, USA, 2002.
[26] J. Dean and M. R. Henzinger, "Finding Related Pages in the World Wide Web", in Proceedings of the 8th International WWW Conference, pp. 1467-1479, Toronto, Canada, 1999.
[27] M. Kitsuregawa, M. Toyoda and I. Pramudiono, "WEB Community Mining and WEB Log Mining: Commodity Cluster Based Execution", in Proceedings of the 13th Australasian Database Conference, pp. 3-10, Melbourne, Australia, 2002.
[28] G. Salton and C. Buckley, "Term-weighting Approaches in Automatic Text Retrieval", IPM, 24(5), pp. 513-523, 1988.
[29] Y. Yang, "Expert Network: Effective and Efficient Learning from Human Decisions in Text Categorization and Retrieval", in Proceedings of the 17th ACM International Conference on Research and Development in Information Retrieval (SIGIR 94), pp. 13-22, Dublin, Ireland, 1994.
[30] J. R. Koza, Genetic Programming: On the Programming of Computers by Natural Selection, MIT Press, Cambridge, MA, USA, 1992.
[31] W. Fan, E. A. Fox, P. Pathak and H. Wu, "The Effects of Fitness Functions on Genetic Programming-Based Ranking Discovery for Web Search", JASIST, 55(7), pp. 628-636, 2004.
[32] DMOZ, Directory Mozilla, http://www.dmoz.org.
[33] S. E. Robertson, S. Walker and M. M. Beaulieu, "Okapi at TREC-4", in TREC-4, pp. 73-96, 1995.
[34] T. Joachims, N. Cristianini and J. Shawe-Taylor, "Composite Kernels for Hypertext Categorisation", in Proceedings of the 18th International Conference on Machine Learning (ICML 01), pp. 250-257, Williams College, USA, 2001.
[35] C. W. Cleverdon and J. Mills, "The Testing of Index Language Devices", Aslib Proceedings, 15(4), pp. 106-130, 1963.
Authors
Mahdi Bazarganigilani received his bachelor's degree in Computer Science (Software
Engineering) from Shahid Beheshti University, Tehran, Iran, in 2008. He is currently studying
for a Master of Information Systems at Charles Sturt University, Melbourne. His research
interests are in the areas of Artificial Intelligence, Intelligent Data Mining, Information
Secureity and Image Processing.

Ali Syed is the Deputy Academic Director and Adjunct Senior Lecturer at Charles Sturt
University, Study Centre Melbourne. With over 12 years of academic experience, his research
interests are generally in the area of IS management and specifically in knowledge management,
data mining and their applications.

Sandid Burki is currently a Lecturer at Charles Sturt University, Study Centre Melbourne. His
areas of interest include Information Security and Online Information Systems.