Chapter 02
Literature Survey
Data mining (DM) algorithms show their best results on numerical data, but with the emergence of
statistical and machine learning techniques, algorithms have also been developed to mine
non-numerical data and relational databases [34].
Earlier, most DM algorithms employed only statistical techniques [35], but nowadays computing
techniques such as artificial intelligence, machine learning and pattern recognition are also an
integral part of DM [29], [34], so that huge heterogeneous data stored in data warehouses
can be easily mined [36], [37].
DM applications are successfully implemented in various fields like health care, finance, retail,
telecommunication, fraud detection, risk analysis, education, etc. [38], [39], [40], [41]. Due to
increasing complexities in various fields and evolving technologies, there are new challenges to
DM which include different data formats, distributed databases, networking resources etc.
Data mining and knowledge discovery in databases (KDD) are related to each other and to other
fields such as machine learning, statistics, and databases. Knowledge discovery in databases is
the process of finding useful knowledge in large datasets. Data preparation, pattern search,
knowledge evaluation and refinement are the steps of KDD [42]. Data mining is one step in the
overall KDD process, which consists of collection and pre-processing of data, data mining,
interpretation and evaluation of the discovered knowledge, and finally post-processing [43]. The basic
objective of KDD is to make data meaningful by developing methods and techniques for
effective mining, but the major problem faced by the KDD process is mapping huge, heterogeneous
data into a more understandable, abstract and useful form [44], [45].
The phrase knowledge discovery in databases emphasizes that knowledge is the end
product of data-driven discovery [12], [44], [46], [47], [48]. The data mining step of KDD
relies heavily on known techniques from machine learning, pattern recognition, and statistics to
find patterns in data.
Data warehousing is one of the fields of databases [44], [47], [49], [50], which helps in business
analytics and decision support. Data warehousing helps set the stage for KDD in two ways: (a)
data cleaning and (b) data access. The approach followed for analysis of data warehouses is
called online analytical processing (OLAP) [14], [51], [52], [53], [54].
The data mining step of the KDD process involves repeated application of particular data mining
methods. There are two types of goals: (a) verification, in which the system is limited to verifying
a user's hypothesis, and (b) discovery, in which the system autonomously finds new patterns.
DM helps in determining patterns from observed data. Knowledge is inferred from the
fitted models. Two primary mathematical formalisms are used in model fitting: (a) statistical
and (b) logical [44].
Primary goals of data mining in practice are prediction and description. In prediction some
variables and fields in the database are used to predict unknown values of other variables of
interest, and description helps in finding human-understandable patterns describing the data
[13],[15].
Classification is learning a function that maps (classifies) a data item into one of several
predefined classes [6]. The classification methods of data mining are used as part of knowledge
discovery applications, which include (a) classifying trends in financial markets, (b) education
and (c) identifying objects of interest in large image datasets [7]. Regression is a predictive
technique that maps a data item to a prediction variable. Clustering is a descriptive task which
helps in identifying a finite set of categories or clusters to describe the data, e.g. identifying those
students who are short of attendance and who have shown poor performance in sessionals [8],
[9], [10]. Examples of clustering applications in a knowledge discovery context include
discovering similar groups [11]. Summarization involves methods such as calculating means and
standard deviations. More sophisticated methods involve the derivation of abstract rules,
visualization techniques, and the discovery of functional relationships between variables [44],
[45]. Summarization techniques are often applied to interactive exploratory data analysis and
automated report generation.
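The clustering task described above can be sketched with a plain k-means procedure. This is only an illustrative sketch: the student records (attendance percentage, sessional marks), the choice of two clusters, the starting centroids and the function name `kmeans` are all assumptions made for the example and do not come from the surveyed works.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign every point to its nearest centroid, then move
    each centroid to the mean of the points assigned to it."""
    centroids = [tuple(c) for c in centroids]
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: mean of each cluster (an empty cluster keeps its centroid).
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centroids.append(old)
        if new_centroids == centroids:
            break  # assignments are stable
        centroids = new_centroids
    return centroids, clusters

# Hypothetical records: (attendance %, sessional marks).
students = [(92, 78), (88, 82), (95, 90), (55, 35), (60, 40), (48, 30)]
centroids, clusters = kmeans(students, centroids=[students[0], students[3]])
at_risk = clusters[1]  # the low-attendance, low-marks group
```

With these assumed records, the second cluster collects the three students with low attendance and low sessional marks, which is the kind of group the text suggests singling out for attention.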
Decision Trees are useful for multiple variable analyses. They split a data set into branch-like
segments [56], [57].
These methods consist of techniques for prediction. Examples include Feed-Forward Neural
Networks, Adaptive Spline Methods, Projection Pursuit Regression, Multi-Layer Perceptrons,
Generalized Linear Models, Bayesian networks, Decision Trees, and Support Vector Machines
[58], [59].
In example-based methods, predictions on new examples are derived from those examples in the
model for which predictions are known. The techniques include Nearest Neighbor Classification and
Regression Algorithms and Case-Based Reasoning Systems.
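A minimal sketch of nearest neighbor classification may make the idea concrete: a new example is labelled by majority vote among the k most similar stored examples. The stored records, the labels and the function name `knn_classify` are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Label `query` by majority vote among its k nearest training examples.

    `train` is a list of (features, label) pairs; distance is Euclidean."""
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical historical records: (attendance %, sessional marks) -> result.
history = [
    ((90, 80), "pass"), ((85, 75), "pass"), ((80, 70), "pass"),
    ((50, 35), "fail"), ((55, 30), "fail"), ((45, 40), "fail"),
]
prediction = knn_classify(history, (58, 38), k=3)
```

No model is fitted in advance; the stored examples themselves act as the model, which is why such methods need a well-defined distance between examples.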
One can identify three primary components [35], [36], [44] in any DM algorithm:
2. Model evaluation: Model-evaluation criteria are statements of how well a particular pattern
or model meets the goals of the knowledge discovery process. Predictive models are
judged by the prediction accuracy on some dataset and descriptive models are evaluated
along the dimensions of predictive accuracy, novelty, utility, and understandability of the
model.
3. Search: A search method consists of two components: (a) parameter search and (b) model
search. Once the model representation and the model-evaluation criteria are fixed, the data
mining problem is reduced to an optimization task over the observed dataset.
Larger Databases: There are databases with hundreds of fields and tables and millions of records,
and deriving useful information from these is itself a challenge. Agrawal et al.
suggested methods for dealing with large data volumes using efficient algorithmic
approaches, because as the dataset grows, so does the chance of finding patterns that
are invalid [60]. One solution to this problem is to use prior knowledge to identify irrelevant
variables.
Pattern updating: Rapid changes to the data, or deletion of data, can
make previously discovered patterns invalid [55], [61], [62]. A possible solution is to
develop methods for updating the patterns.
Problem of missing and noisy data: This problem is related to business databases [16] and
mostly happens when KDD methods and tools easily incorporate prior knowledge
about a problem.
Deviation detection detects and explains why certain records cannot be put into specific
segments [1].
According to an IBM report, the three main steps in DM are preparing the data, reducing the data
and finally, looking for useful information [4].
Deciding on an appropriate sampling system and transformations, cleaning the data, and dealing
with missing fields and records
Predictive modeling uses inductive reasoning techniques and algorithms like neural networks
[63].
Database segmentation uses statistical clustering techniques to partition data into clusters [64].
2.6 DM Techniques
There are different data mining techniques which are used to extract information from a data set
and transform it into an understandable format for further use. Table 2.1 shows different data
mining techniques and their roles.
2.6.1 Statistics
Statistics is a vital component of data selection, sampling, data mining, and knowledge
evaluation. In the data cleaning process, statistics offers techniques to detect outliers, to
simplify data when necessary, to estimate noise, and to deal with missing data using estimation
techniques [65], [66].
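As a hedged illustration of these data cleaning roles, the sketch below fills missing values with the mean of the observed values and flags outliers by their z-score. The marks, the threshold of 2 standard deviations and the helper names `impute_missing` and `find_outliers` are assumptions for the example; the cited works may use other estimators.

```python
import statistics

def impute_missing(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = statistics.mean(observed)
    return [mean if v is None else v for v in values]

def find_outliers(values, threshold=2.0):
    """Return values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    if stdev == 0:
        return []
    return [v for v in values if abs(v - mean) / stdev > threshold]

# Hypothetical sessional marks with one missing entry and one suspicious value.
marks = [62, 58, None, 65, 60, 61, 59, 5, 63]
cleaned = impute_missing(marks)
outliers = find_outliers(cleaned)
```

Note that the mean used for imputation is itself pulled down by the suspicious value, so in practice one might detect outliers first or impute with a more robust estimate such as the median.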
One of the most useful data mining techniques for e-learning is classification. Classification
maps data into a predefined group of classes. Classification is a supervised learning approach;
a classifier with high accuracy is beneficial for identifying students with low academic
performance at an early stage.
Techniques | Roles
Classification | Pre-defined examples.
Clustering | Identification of similar classes of objects.
Prediction | Regression technique.
Association Rules | Find frequent item sets among large data sets.
Neural Networks | Derive meaning from complex or imprecise data and can be used to extract patterns and detect trends that are complex.
Decision Trees | Represent sets of decisions using CART (Classification and Regression Trees), CHAID (Chi Square Automatic Interaction Detection), C4.5 and ID3.
Nearest Neighbor method | Classify each record in a dataset based on a combination of the classes of the K records which are most similar to it in a historical dataset.
Classification [67] is the process of finding a set of models which describe and distinguish
data classes or concepts. The derived results may be represented in various forms, such as
classification (IF-THEN) rules, decision trees, or neural networks. The models can then be used for
predicting the class label of data objects. In many applications, there is a need to predict some
missing data values rather than class labels. This is usually the case when the predicted values are
numerical data, and it is often specifically referred to as prediction.
2.7 Clustering
Clustering groups data into classes that are not predefined, and it can identify dense and sparse
regions in object space. Unlike classification and prediction, which analyze class-labeled data objects,
clustering analyses data objects without consulting a known class label. The class labels are not
present in the training data and clustering can be used to generate such labels. Clusters of objects
are formed so that objects within a cluster have high similarity in comparison to one another, but
are very dissimilar to objects in other clusters. Each cluster formed can be viewed as a class of
objects, from which rules can be derived [33]. Application of clustering in education can help in
2.8 Association
Association rule mining finds sets of binary variables that occur together repeatedly in a
transaction database. Apriori is a widely used association rule mining algorithm [66], [68].
Association analysis is the discovery of association rules showing attribute-value conditions that
occur frequently together in a given set of data. The association rule A=>B indicates that database
tuples that satisfy the conditions in A are also likely to satisfy the conditions in B.
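The strength of a rule A=>B is commonly measured by its support (the fraction of transactions containing both A and B) and its confidence (the fraction of transactions containing A that also contain B). The sketch below computes both for a toy transaction set; the item names and the helper names `support` and `confidence` are hypothetical, and this is not an implementation of Apriori itself, only of the two measures it relies on.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated probability of `consequent` given `antecedent` for the rule A=>B."""
    base = support(transactions, antecedent)
    both = support(transactions, set(antecedent) | set(consequent))
    return both / base if base else 0.0

# Hypothetical student "transactions": attributes observed together.
transactions = [
    {"low_attendance", "poor_sessional"},
    {"low_attendance", "poor_sessional"},
    {"low_attendance"},
    {"poor_sessional"},
    {"good_attendance"},
]
rule_support = support(transactions, {"low_attendance", "poor_sessional"})
rule_confidence = confidence(transactions, {"low_attendance"}, {"poor_sessional"})
```

Here the rule low_attendance => poor_sessional holds in 2 of 5 transactions (support 0.4) and in 2 of the 3 transactions that contain low_attendance (confidence about 0.67).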
AI techniques consist of pattern recognition, machine learning, and neural networks. Other
techniques in AI such as knowledge acquisition, knowledge representation, and search are
relevant to the various processes in DM.
Decision trees are non-linear data structures which start from the root node and end with a leaf
node. Decision trees represent sets of decisions. This approach can generate rules for the
classification of a data set. Specific decision tree methods include Classification and Regression
Trees (CART) and Chi Square Automatic Interaction Detection (CHAID) [69]. These techniques
are used for classification of a data set. They provide a set of rules that are applied to an
unclassified dataset to predict results. CART typically requires less data preparation than
CHAID.
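How a tree produces a rule can be sketched with a single split. The example below searches for the threshold on one numeric attribute that minimizes the weighted Gini impurity, the split criterion typically associated with CART. The data, the attribute and the function names `gini` and `best_threshold` are assumptions for illustration; a full tree would apply this search recursively to each branch.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_threshold(values, labels):
    """Try each midpoint between adjacent distinct values and return the
    (threshold, weighted_impurity) pair with the lowest impurity."""
    best = (None, gini(labels))  # no split at all is the baseline
    distinct = sorted(set(values))
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [lab for v, lab in zip(values, labels) if v <= t]
        right = [lab for v, lab in zip(values, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best

# Hypothetical data: attendance % and class result.
attendance = [45, 50, 55, 80, 85, 90]
result = ["fail", "fail", "fail", "pass", "pass", "pass"]
threshold, impurity = best_threshold(attendance, result)
```

For this toy data the split reads as the rule "IF attendance <= 67.5 THEN fail ELSE pass", which is the kind of rule that can then be applied to an unclassified dataset.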
2.9.3 Visualization
Visual DM techniques are helpful in exploratory data analysis and in mining large databases.
This approach requires the integration of a human into the DM process. There are examples of
visualization techniques that work on large data sets and produce interactive displays [70].
There are various techniques for visualizing multidimensional data, such as scatter plot matrices,
coplots, matrices, parallel coordinates, projection matrices, and other geometric projection
techniques, as well as icon-based, hierarchical, web-based, graph-based, and dynamic
techniques.
Web mining is the application of data mining to discover patterns from the Web in the form
of data collected from online information databases, hyperlinks, and digital data. Data mining
techniques used in web mining include classification (supervised learning) and clustering
(unsupervised learning) [71], [72].
Increasing computational capacity and the emergence of the latest electronic devices have led to
the ubiquitous or pervasive computing paradigm [73]. Ubiquitous computing environments give
rise to Ubiquitous Data Mining (UDM).
Multimedia data includes images, video, audio, and animation. Data mining techniques
applied to multimedia data include rule-based decision tree classification algorithms, Artificial
Neural Networks, instance-based learning algorithms, Support Vector Machines, association rule
mining, and clustering methods [74].
Spatial data includes astronomical data and data related to space technology. Spatial data mining
involves the use of spatial warehouses, spatial data cubes, spatial OLAP, and clustering methods [75].
Other data mining areas include visualization, medical, pattern, wireless networks, association
rule based mining.
Applying data mining techniques to educational data for knowledge discovery is significant to
educational organizations as well as students. Knowledge-driven data supports educational
decision support systems. Educational data mining enhances our understanding of learning by
finding educational trends which include improving student performance, course selection, in-
house training, and faculty development. Using linear regression analysis [29], some factors are
correlated to
income. Data min
improvement ratio, and increase the outcome. Thus, data mining techniques
are used to operate on large volumes of data to discover hidden patterns and relationships which
help in effective decision making [65].
According to Han and Kamber, data mining software should be developed in such a manner that
it allows users to analyze data from different dimensions, categorize it and
summarize the derived results [36]. Data mining can be applied to traditional as well as distance
education. There are many general data mining tools that provide mining algorithms, filtering,
and visualization techniques. Some examples of data mining tools are DBMiner, Clementine,
Intelligent Miner, RapidMiner and Weka [29]. DM combines machine learning, statistics, and
visualization techniques to discover and extract knowledge. Questionnaires and feedback forms
are often used to collect data on attitudes towards educational patterns or trends,
interest in technologies, and the teaching methodologies followed; the data collected is then
analyzed using techniques such as decision trees and neural networks.
There are different mining models like Decision Trees, Naive Bayes, Support Vector Machines,
Linear Regression, Minimum Description Length, K-Nearest Neighbors and K-Means. By using
these models, one can obtain student behavior patterns and course behavior patterns, predict student
retention and course suitability, and devise personalized intervention strategies [32].
Information visualization techniques can be used to graphically represent student data collected by
web-based educational systems, such as the technologies a student is most interested in or the
interest shown in solving questionnaires [76]. According to Tsantis and
Castellani, s the evaluation of an e-learning
system [77]. Visualization techniques can be applied to conversations among online groups, social
networking websites, etc. These techniques are also helpful for instructors, who can manipulate
the graphical representations generated to gauge the understanding and interest of their learners.
Srivastava et al. state that web mining is used to extract knowledge from web data
[78]. In web content mining, useful information is extracted from the contents of web documents, and
web usage mining is another technique, which discovers meaningful patterns from data generated by
client-server transactions on one or more web localities.
Clustering and classification both assign data objects to groups. Clustering is unsupervised and
classification is supervised. Classification and prediction are also related techniques:
classification predicts class labels, whereas prediction models continuous-valued functions. An
outlier is an observation that is unusually large or small relative to the other values in a dataset.
According to Liu, a decision tree algorithm (C5.0) and data cube technology are used for
managing classroom processes [79]. Induction analysis helps in identifying potential student
groups having similar characteristics. Talavera et al. propose mining student data using
clustering to discover patterns reflecting user behaviors [80].
Tang et al. have given the concept of data clustering for web-based learning which helps in
solving learner based problems [81]. They find clusters of students with similar learning
characteristics based on the sequence and the contents of the pages they visited.
Association rule mining is a popular method for discovering relationships among sets of items in
large databases. Here one or more attributes of a dataset are associated with each other using
IF-THEN statements.
Ha et al. [82] perform web page navigational structure analysis from web-based virtual
classrooms, e-learning portals and web pages navigated by learners.
The association fuzzy rules are implemented in a personalized e-learning material recommender
system. Fuzzy matching rules are used
and a list of learning materials [82]. Romero et al. [83] propose to use grammar-based genetic
programming with optimization techniques for providing a feedback to authors who designed
In text mining, mining is done on text data and is related to web content mining. It is an
interdisciplinary area involving machine learning and data mining, statistics, information
retrieval and natural language processing [76], [84]. Text mining can work with unstructured or
semi-structured datasets such as full-text documents, HTML files, emails, etc.
Data mining and text mining technologies are used in Web-based educational systems for shared
learning. Text mining is used for a discussion board for expanded correspondence analysis.
Learners select the relevant category which represents their comment and the system provides
Dringus et al. [85] and Abdous et al. [86] have proposed to use text mining as a strategy for
assessing conversations among irregular discussion forums. Text mining techniques also help in
evaluating the progress of a thread or user group discussions. Data can be retrieved from pdf
interactive multimedia productions for helping the evaluation of multimedia presentations for
statistics purpose and for extracting relevant data [82], [83]. Web-based educational systems
collect large amounts of student data from weblog history, which can be further analyzed for
deriving meaningful patterns [75].
Tang et al. have proposed to construct a personalized web-based application by which mining
can be done on both the framework and structure of the courseware. Keyword-driven text mining
algorithms are used to select articles for distance learning students [81].
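The idea of keyword-driven selection can be sketched as ranking documents by how often a set of keywords occurs in them. This is a deliberately simple, hypothetical scheme: the article titles and texts, the keyword set and the function names `keyword_score` and `select_articles` are invented for the example, and the algorithms used in [81] are not described here and may differ.

```python
import re
from collections import Counter

def keyword_score(text, keywords):
    """Count how many times any of the keywords occurs in the text."""
    words = Counter(re.findall(r"[a-z]+", text.lower()))
    return sum(words[k] for k in keywords)

def select_articles(articles, keywords, top_n=1):
    """Return the titles of the `top_n` articles with the highest keyword score."""
    ranked = sorted(articles,
                    key=lambda title: keyword_score(articles[title], keywords),
                    reverse=True)
    return ranked[:top_n]

# Hypothetical article collection: title -> full text.
articles = {
    "Intro to Clustering": "Clustering groups similar students. Clustering needs no labels.",
    "Campus News": "The library opens a new reading room this week.",
    "Decision Trees": "A decision tree can support classification of students.",
}
selected = select_articles(articles, {"clustering", "students"}, top_n=2)
```

A practical system would normally add stemming, stop-word removal and term weighting, which are omitted to keep the sketch short.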
2.19 Conclusion
In this chapter, a literature survey has been conducted from a knowledge discovery perspective on
the role of data mining in an educational environment. Educational Data Mining is an emerging
field related to several well-established areas of research, including e-learning, web mining and
text mining. Data mining techniques have been used to analyze educational data and extract
useful information from large amounts of data.
The KDD field is related to the development of methods and techniques which make the data
relevant. In the educational sector, software and visualization techniques can be developed using
data mining t
helps us to cluster those students who need special attention in their studies. Knowledge
discovery in databases results in better decision-making related to the latest technologies used in
classroom teaching as well as faculty enhancement programs and in-house training etc. Using
data mining techniques, one can achieve refined data from distributed databases. Data Mining is
an efficient tool for improving institutional effectiveness and student learning. Knowledge
acquired by educational data mining not only helps teachers to manage their classes and improve
their teaching skills and their students' learning processes, but also provides feedback to
institutions to improve their infrastructure and quality.
Using techniques like decision trees, the class result of students is predicted based on the
attributes taken. Decision tree classifiers have been used on student data to predict student
performance in the class result. These techniques help in identifying those students who (a) are
short of attendance and (b) have shown poor performance in sessionals.