Chapter 02

sentiment analysis part2

LITERATURE SURVEY

Chapter 2
Literature Survey

2.1 Historical Trends in Data Mining

Data mining is a combination of many disciplines, such as database management systems (DBMS),
statistics, artificial intelligence (AI), and machine learning (ML) [33]. Data mining produces
useful patterns when algorithmic methods are applied to observational data.

2.1.1 Data Mining Trends

Data mining algorithms work best on numerical data, but with advances in statistical and machine
learning techniques, algorithms have also been developed to mine non-numerical data and
relational databases [34].

Earlier, most DM algorithms employed only statistical techniques [35], but nowadays computing
techniques such as artificial intelligence, machine learning, and pattern recognition are also
an integral part of DM [29], [34], so that huge heterogeneous data stored in data warehouses
can be easily mined [36], [37].

DM applications have been successfully implemented in various fields such as health care,
finance, retail, telecommunication, fraud detection, risk analysis, and education [38], [39],
[40], [41]. Due to increasing complexity in these fields and evolving technologies, DM faces new
challenges, including different data formats, distributed databases, and networking resources.

2.2 Knowledge Discovery in Databases

Data mining and knowledge discovery in databases (KDD) are related to each other and to fields
such as machine learning, statistics, and databases. Knowledge discovery in databases is the
process of finding useful knowledge in large datasets. Data preparation, pattern search,
knowledge evaluation, and refinement are the steps of KDD [42]. Data mining is one step in the
overall KDD process, which consists of collection and pre-processing of data, data mining,
interpretation and evaluation of the discovered knowledge, and finally post-processing [43]. The
basic objective of KDD is to make data meaningful by developing methods and techniques for
effective mining, but the major problem faced by the KDD process is mapping huge and
heterogeneous data into an understandable, more abstract, and more useful form [44], [45].

The phrase knowledge discovery in databases emphasizes that knowledge is the end product of
data-driven discovery [12], [44], [46], [47], [48]. The data mining step of KDD relies heavily
on known techniques from machine learning, pattern recognition, and statistics to find patterns
in data.

Data warehousing is a database field [44], [47], [49], [50] that supports business analytics and
decision support. Data warehousing helps set the stage for KDD in two ways: (a) data cleaning
and (b) data access. The approach followed for the analysis of data warehouses is called online
analytical processing (OLAP) [14], [51], [52], [53], [54].

2.2.1 The Data Mining Step of the KDD Process

The data mining step of the KDD process involves repeated, iterative application of particular
data mining methods. There are two types of goals: (a) verification, in which the system is
limited to verifying the user's hypothesis, and (b) discovery, in which the system autonomously
finds new patterns.

DM helps in determining patterns from observed data. Knowledge is inferred from the fitted
models. Two primary mathematical formalisms are used in model fitting: (a) statistical and
(b) logical [44].

2.2.2 Data Mining Methods

The primary goals of data mining in practice are prediction and description. In prediction, some
variables or fields in the database are used to predict unknown values of other variables of
interest, whereas description finds human-understandable patterns describing the data
[13], [15].

Classification is learning a function that maps (classifies) a data item into one of several
predefined classes [6]. Classification methods are used as part of knowledge discovery
applications, which include (a) classifying trends in financial markets, (b) education, and
(c) identifying objects of interest in large image datasets [7]. Regression is a predictive
technique that maps a data item to a prediction variable. Clustering is a descriptive task that
identifies a finite set of categories or clusters to describe the data, e.g. identifying those
students who are short of attendance and who have shown poor performance in sessionals [8],
[9], [10]. Examples of clustering applications in a knowledge discovery context include
discovering similar groups [11]. Summarization involves methods such as calculating means and
standard deviations. Other methods involve the derivation of abstract rules, visualization
techniques, and the discovery of functional relationships between variables [44], [45].
Summarization techniques are often applied to interactive exploratory data analysis and
automated report generation.
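
As a minimal sketch of summarization, the mean and standard deviation mentioned above can be computed per group with Python's standard library. The attendance figures and section names below are hypothetical and serve only as an illustration.

```python
from statistics import mean, stdev

# Hypothetical attendance percentages, grouped by class section.
attendance = {
    "A": [82.0, 91.0, 76.0, 88.0],
    "B": [64.0, 70.0, 58.0, 73.0],
}

def summarize(groups):
    """Return {group: (mean, sample standard deviation)} for each group."""
    return {name: (mean(values), stdev(values)) for name, values in groups.items()}

summary = summarize(attendance)
for name, (avg, sd) in summary.items():
    print(f"section {name}: mean={avg:.2f}, std={sd:.2f}")
```

A report generator could emit one such line per group, which is the kind of automated summary the paragraph refers to.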

2.2.2.1 Decision Trees and Rules

Decision Trees are useful for multiple variable analyses. They split a data set into branch-like
segments [56], [57].

2.2.2.2 Classification Methods

These methods consist of techniques for prediction. Examples include Feed Forward Neural
Networks, Adaptive Spline Methods, Projection Pursuit Regression, Multi-Layer Perceptrons,
Generalized Linear Models, Bayesian networks, Decision Trees, and Support Vector Machines
[58], [59].

2.2.2.3 Example-Based Methods

In these methods, predictions for new examples are derived from the stored examples whose
outcomes are already known. The techniques include Nearest Neighbor Classification and
Regression Algorithms and Case-Based Reasoning Systems.
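
The idea of predicting from stored examples can be sketched with a small nearest neighbor classifier. This is an illustrative sketch only: the function name knn_classify, the Euclidean distance, the choice k=3, and the student records are assumptions, not taken from the cited works.

```python
from collections import Counter
import math

def knn_classify(examples, query, k=3):
    """Majority vote among the k stored examples closest to `query`."""
    by_distance = sorted(examples, key=lambda ex: math.dist(ex[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical stored examples: (attendance %, sessional marks) -> outcome
examples = [
    ((90, 80), "pass"), ((85, 75), "pass"), ((88, 70), "pass"),
    ((50, 40), "fail"), ((55, 35), "fail"), ((60, 45), "fail"),
]
prediction = knn_classify(examples, (58, 42), k=3)
print(prediction)
```

No model is fitted in advance; all the work happens at query time, which is what distinguishes example-based methods from the model-fitting methods listed earlier.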

2.3 The Components of Data Mining Algorithms

One can identify three primary components [35], [36], [44] in any DM algorithm:

1. Model representation: A model representation is used to describe or extract patterns.

2. Model evaluation: Model-evaluation criteria are statements of how well a particular pattern
or model meets the goals of the knowledge discovery process. Predictive models are judged by
their prediction accuracy on some dataset, and descriptive models are evaluated along the
dimensions of predictive accuracy, novelty, utility, and understandability of the model.

3. Search: A search method consists of two components: (a) parameter search and (b) model
search. Once the model representation and the model-evaluation criteria are fixed, the data
mining problem is reduced to an optimization task on the observational dataset.
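
The three components can be made concrete with a deliberately small sketch. Here the representation is assumed to be a straight line, the evaluation criterion is assumed to be mean squared error, and the search is an exhaustive grid over two parameters. All names and data are hypothetical.

```python
from itertools import product

# Hypothetical observations: (hours studied, marks obtained).
data = [(1, 12), (2, 22), (3, 32), (4, 42)]

def model(a, b, x):
    """Model representation: a straight line y = a*x + b."""
    return a * x + b

def mse(a, b, points):
    """Model evaluation: mean squared prediction error on `points`."""
    return sum((model(a, b, x) - y) ** 2 for x, y in points) / len(points)

def grid_search(points, a_values, b_values):
    """Parameter search: try every (a, b) pair and keep the lowest error."""
    return min(product(a_values, b_values), key=lambda ab: mse(ab[0], ab[1], points))

best_a, best_b = grid_search(data, range(0, 21), range(0, 11))
print(best_a, best_b)
```

Model search, the second search component, would wrap a loop like this around several candidate representations; it is omitted here to keep the sketch short.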

2.4 Research and Application Challenges

Larger Databases: There are databases with hundreds of fields and tables and millions of
records, and deriving useful information from them is itself a challenge. Agrawal et al.
suggested methods for dealing with large data volumes using efficient algorithmic approaches,
because as datasets grow, so does the chance of finding patterns that are invalid [60]. One
solution to this problem is to use prior knowledge to identify irrelevant variables.

Pattern updation: Rapid changes to, or deletion of, data can make previously discovered
patterns invalid [55], [61], [62]. A possible solution is to develop methods for updating the
patterns.

Problem of missing and noisy data: This problem is related to business databases [16] and
mostly happens when KDD methods and tools easily incorporate prior knowledge
about a problem.

2.5 Steps in Data Mining

Link analysis identifies useful associations among datasets [1].

Deviation detection detects and explains why certain records cannot be put into specific
segments [1].

According to an IBM report, the three main steps in DM are preparing the data, reducing the
data, and finally looking for useful information [4].

Fayyad et al. proposed the following steps of data mining [44]:

Retrieving the data from a large database

Selecting the relevant subset to work with

Deciding on an appropriate sampling strategy and transformations, cleaning the data, and
dealing with missing fields and records

Fitting models to the pre-processed data

Predictive modeling uses inductive reasoning techniques and algorithms like neural networks
[63].

Database segmentation uses statistical clustering techniques to partition data into clusters [64].

2.6 DM Techniques

There are different data mining techniques which are used to extract information from a data set
and transform it into an understandable format for further use. Table 2.1 shows different data
mining techniques and their roles.

2.6.1 Statistics

Statistics is a vital component of data selection, sampling, data mining, and knowledge
evaluation. In the data cleaning process, statistics offers techniques to detect outliers, to
simplify data when necessary, and to estimate noise; it also deals with missing data using
estimation techniques [65], [66].
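
One simple estimation technique for missing data is mean imputation. The sketch below assumes that missing entries are encoded as None and that the column mean is an acceptable estimate; both are assumptions made for illustration, not the method of the cited works.

```python
from statistics import mean

def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot estimate a mean from no observed values")
    fill = mean(observed)
    return [fill if v is None else v for v in values]

# Hypothetical sessional marks with two missing records.
marks = [40, None, 55, 45, None, 60]
filled = impute_mean(marks)
print(filled)
```

Mean imputation keeps the column average unchanged but shrinks its variance, so it is best treated as a baseline rather than a general remedy.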

2.6.2 Classification and Prediction

One of the most useful data mining techniques for e-learning is classification. Classification
maps data into the predefined group of classes. Classification is a supervised learning approach

performance with high accuracy is more beneficial for identifying the low academic performance
of the students at the beginning.

Table 2.1 Data Mining Techniques and their Roles

Technique                 Role
Classification            Pre-defined examples.
Clustering                Identification of similar classes of objects.
Prediction                Regression technique.
Association Rules         Find frequent item sets among large data sets.
Neural Networks           Derive meaning from complex or imprecise data and can be used to extract patterns and detect trends that are complex.
Decision Trees            Represent sets of decisions using CART (Classification and Regression Trees), CHAID (Chi Square Automatic Interaction Detection), C4.5, and ID3.
Nearest Neighbor method   Classify each record in a dataset based on a combination of the classes of the K records which are most similar to it in a historical dataset.

Classification [67] is the process of finding a set of models that describe and distinguish
data classes or concepts. The derived models may be represented in various forms, such as
classification (IF-THEN) rules, decision trees, or neural networks. The models can then be used
for predicting the class label of data objects. In many applications, there is a need to
predict missing data values rather than class labels. When the predicted values are numerical,
this is often specifically referred to as prediction.

2.7 Clustering

Clustering groups data into classes that are not predefined, and it can identify dense and
sparse regions in object space. Unlike classification and prediction, which analyze
class-labeled data objects, clustering analyzes data objects without consulting a known class
label. The class labels are not present in the training data, and clustering can be used to
generate such labels. Clusters of objects are formed so that objects within a cluster have high
similarity to one another but are very dissimilar to objects in other clusters. Each cluster
formed can be viewed as a class of objects, from which rules can be derived [33]. Application
of clustering in education can help in

2.8 Association

Association rule mining finds sets of binary variables that occur together repeatedly in a
transaction database. Apriori is an association rule mining algorithm [66], [68]. Association
analysis is the discovery of association rules showing attribute-value conditions that occur
frequently together in a given set of data. The association rule A=>B means that database
tuples that satisfy the conditions in A are also likely to satisfy the conditions in B.
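
How often a rule A=>B holds is commonly quantified by two standard measures, support and confidence. These measures are not named in the text above, so the sketch below is an illustration under that assumption, using hypothetical course-enrolment transactions.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) over the transactions.
    Assumes the antecedent occurs at least once (non-zero support)."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)

# Hypothetical transactions: the set of courses each student enrolled in.
transactions = [
    {"statistics", "data mining"},
    {"statistics", "data mining", "databases"},
    {"statistics", "databases"},
    {"data mining"},
]
s = support(transactions, {"statistics", "data mining"})
c = confidence(transactions, {"statistics"}, {"data mining"})
print(f"support={s:.2f}, confidence={c:.2f}")
```

Algorithms such as Apriori avoid scanning every candidate item set by pruning those whose subsets already fall below a minimum support; this sketch only evaluates a single given rule.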

2.9 Techniques for Mining Transactional/Relational Database

2.9.1 Artificial Intelligence (AI) Techniques

AI techniques consist of pattern recognition, machine learning, and neural networks. Other
techniques in AI such as knowledge acquisition, knowledge representation, and search are
relevant to the various processes in DM.

2.9.2 Decision Tree Approach

A decision tree is a non-linear data structure in which every path starts at the root node and
ends at a leaf node. Decision trees represent sets of decisions. This approach can generate
rules for the classification of a data set. Specific decision tree methods include
Classification and Regression Trees (CART) and Chi Square Automatic Interaction Detection
(CHAID) [69]. These techniques are used for classification of a data set. They provide a set of
rules that are applied to an unclassified dataset to predict results. CART typically requires
less data preparation than CHAID.
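
A single split of a classification tree can be sketched as follows. The sketch uses Gini impurity, the split criterion usually associated with CART, on one numeric attribute; it is not a full CART or CHAID implementation, and the attendance records are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0 means a pure node)."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_threshold(rows):
    """Find the threshold on a single numeric attribute that minimizes
    the size-weighted Gini impurity of the two resulting branches."""
    best = None
    values = sorted({x for x, _ in rows})
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [label for x, label in rows if x <= t]
        right = [label for x, label in rows if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
        if best is None or score < best[1]:
            best = (t, score)
    return best

def majority(labels):
    """Most common class label in a branch."""
    return Counter(labels).most_common(1)[0][0]

# Hypothetical records: (attendance %, class result)
rows = [(45, "fail"), (55, "fail"), (60, "fail"), (75, "pass"), (80, "pass"), (92, "pass")]
threshold, impurity = best_threshold(rows)
left_label = majority([label for x, label in rows if x <= threshold])
right_label = majority([label for x, label in rows if x > threshold])
rule = f"IF attendance <= {threshold} THEN {left_label} ELSE {right_label}"
print(rule)
```

A full tree would apply the same search recursively to each branch and over every attribute, stopping when a branch is pure or too small to split.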

2.9.3 Visualization

Visual DM techniques are helpful in exploratory data analysis and in mining large databases.
This approach requires integrating a human into the DM process. There are examples of
visualization techniques that work on large data sets and produce interactive displays [70].

There are various techniques for visualizing multidimensional data, such as scatter plot
matrices, coplots, matrices, parallel coordinates, projection matrices, and other geometric
projection techniques, as well as icon-based techniques, hierarchical techniques, web-based
techniques, graph-based techniques, and dynamic techniques.

2.10 Various Data Mining Areas

2.10.1 Web Mining

Web mining is the application of data mining to discover patterns from the Web, in the form of
data collected from online information databases, hyperlinks, and digital data. The data mining
techniques used in web mining are classification (supervised learning) and clustering
(unsupervised learning) [71], [72].

2.10.2 Ubiquitous Data Mining

Increasing computational capacity and the emergence of the latest electronic devices have led
to the ubiquitous, or pervasive, computing paradigm [73]. Ubiquitous computing environments
give rise to Ubiquitous Data Mining (UDM).

2.11 Data Mining using Multimedia

Multimedia data includes images, video, audio, and animation. Data mining techniques applied
to multimedia data include rule-based decision tree classification algorithms, Artificial
Neural Networks, instance-based learning algorithms, Support Vector Machines, association rule
mining, and clustering methods [74].

2.12 Spatial Data Mining

Spatial data includes astronomical data and data related to space technology. Spatial data
mining includes the use of spatial warehouses, spatial data cubes, spatial OLAP, and clustering
methods [75].

2.13 Emergence of Data Mining in Other Fields

Other data mining areas include visualization, medical data mining, pattern mining, wireless
networks, and association-rule-based mining.

2.14 Performance Improvement in Education Sector

2.14.1 Data Mining Techniques for Education Sector

Applying data mining techniques to educational data for knowledge discovery is significant to
educational organizations as well as students. Knowledge-driven data supports educational
decision support systems. Educational data mining enhances our understanding of learning by
finding educational trends, which can be used for improving student performance, course
selection, in-house training, and faculty development. Using linear regression analysis [29],
some factors are correlated to
income. Data min
improvement ratio, and increase the outcome. Thus, data mining techniques
are used to operate on large volumes of data to discover hidden patterns and relationship which
help in effective decision making [65].

According to Han and Kamber, data mining software should be developed in such a manner that it
allows users to analyze data from different dimensions, categorize it, and summarize the
derived results [36]. Data mining can be applied to traditional as well as distance education.
There are many general data mining tools that provide mining algorithms, filtering, and
visualization techniques. Some examples of data mining tools are DBMiner, Clementine,
Intelligent Miner, RapidMiner, and Weka [29]. DM combines machine learning, statistics, and
visualization techniques to discover and extract knowledge. Questionnaires and feedback forms
are often used to collect data on attitudes towards educational patterns or trends, interest in
technologies, and the teaching methodologies followed; the collected data is then analyzed
using techniques such as decision trees and neural networks.

There are different mining models, such as Decision Trees, Naive Bayes, Support Vector
Machines, Linear Regression, Minimum Description Length, K-Nearest Neighbors, and K-Means. By
using these models, one can discover student behavior patterns and course behavior patterns,
predict student retention and course suitability, and devise personalized intervention
strategies [32].

2.14.2 Statistics and Visualization

Information visualization techniques can be used to graphically represent student data
collected by web-based educational systems, such as the technologies in which a student has
shown the most interest or the interest shown in solving questionnaires [76]. According to
Tsantis and
Castellani, s the evaluation of an e-learning
system [77]. Visualization techniques involve conversations among online groups, social
networking websites etc. These techniques are also helpful for instructors, who can manipulate
the graphical representations generated and gain an understanding of their learners and their
interests.

2.15 Web Mining

Srivastava et al. proposed that web mining be used to extract knowledge from web data [78]. In
web content mining, useful information is extracted from the contents of web documents; web
usage mining is another technique, which discovers meaningful patterns from data generated by
client-server transactions on one or more web localities.

2.15.1 Clustering, Classification and Outlier Detection

Clustering and classification are both grouping methods: clustering is unsupervised and
classification is supervised. Classification and prediction are also related techniques:
classification predicts class labels, whereas prediction models continuous-valued functions.
An outlier is an observation that is unusually large or small relative to the other values in a
dataset.
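
One common way to make "unusually large or small" operational is the interquartile range (IQR) rule. The rule and the multiplier 1.5 are conventional choices assumed here for illustration and are not prescribed by the text; the marks are hypothetical.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]

# Hypothetical sessional marks; 98 and 5 sit far from the rest.
marks = [52, 55, 57, 58, 60, 61, 63, 65, 98, 5]
outliers = iqr_outliers(marks)
print(outliers)
```

Because quartiles are barely affected by a few extreme values, this rule is less sensitive to the outliers themselves than a rule based on the mean and standard deviation.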

According to Liu, a decision tree algorithm (C5.0) and data cube technology are used for
managing classroom processes [79]. Induction analysis helps in identifying potential student
groups having similar characteristics. Talavera et al. propose mining student data using
clustering to discover patterns reflecting user behaviors [80].

2.15.2 Adaptive and Intelligent Web-Based Educational Systems

Tang et al. introduced data clustering for web-based learning, which helps in solving
learner-based problems [81]. They find clusters of students with similar learning
characteristics based on the sequence and the contents of the pages they visited.

2.16 Association Rule Mining

Association rule mining is a popular method for finding relationships among a set of items in
large databases. Here, one or more attributes of a dataset are associated with each other using
IF-THEN statements.

2.16.1 Particular Web-Based Courses

Ha et al. [82] perform web page navigational structure analysis on web-based virtual
classrooms, e-learning portals, and web pages navigated by learners.

The association fuzzy rules are implemented in a personalized e-learning material recommender
system. Fuzzy matching rules are used
and a list of learning materials [82]. Romero et al. [83] propose to use grammar-based genetic
programming with optimization techniques for providing a feedback to authors who designed

2.17 Text Mining

In text mining, mining is performed on text data; it is related to web content mining. It is an
interdisciplinary area involving machine learning, data mining, statistics, information
retrieval, and natural language processing [76], [84]. Text mining can work with unstructured
or semi-structured datasets such as full-text documents, HTML files, and emails.

2.18 Web-Based Educational Systems

Data mining and text mining technologies are used in Web-based educational systems for shared
learning. Text mining is used for a discussion board for expanded correspondence analysis.
Learners select the relevant category which represents his/her comment and the system provides

2.18.1 Well-Known Learning Content Management Systems

Dringus et al. [85] and Abdous et al. [86] have proposed using text mining as a strategy for
assessing conversations in asynchronous discussion forums. Text mining techniques also help in
evaluating the progress of a thread or of user group discussions. Data can be retrieved from
PDF interactive multimedia productions to help evaluate multimedia presentations, for
statistical purposes, and to extract relevant data [82], [83]. Web-based educational systems
collect large amounts of student data from weblog history, which can be further analyzed to
derive meaningful patterns [75].

2.18.2 Adaptive and Intelligent Web-Based Educational Systems

Tang et al. proposed constructing a personalized web-based application in which mining can be
done on both the framework and the structure of the courseware. Keyword-driven text mining
algorithms are used to select articles for distance learning students [81].
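
A keyword-driven selection step can be sketched as a simple scoring function. This is not the algorithm of Tang et al.; it is a hypothetical illustration that counts keyword occurrences and ranks articles by that count, with made-up titles and text.

```python
import re
from collections import Counter

def keyword_score(text, keywords):
    """Count how many times any of the (lowercase) keywords occurs in `text`."""
    tokens = Counter(re.findall(r"[a-z]+", text.lower()))
    return sum(tokens[k] for k in keywords)

def select_articles(articles, keywords, top_n=1):
    """Return the titles of the top_n articles by keyword score."""
    ranked = sorted(
        articles,
        key=lambda title: keyword_score(articles[title], keywords),
        reverse=True,
    )
    return ranked[:top_n]

# Hypothetical article snippets keyed by title.
articles = {
    "Clustering learners": "Clustering groups learners; clustering needs no labels.",
    "Campus news": "The library opens late on Friday.",
    "Decision trees": "A decision tree predicts a class label.",
}
selected = select_articles(articles, {"clustering", "labels"})
print(selected)
```

Exact token matching is the simplest choice; note that it treats "label" and "labels" as different words, which stemming would normally address.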

2.19 Conclusion

In this chapter, a literature survey has been conducted from the knowledge discovery
perspective, covering the role of data mining in an educational environment. Educational Data
Mining is an emerging field related to several well-established areas of research, including
e-learning, web mining, and text mining. Data mining techniques have been used to analyze
educational data and extract useful information from a large amount of data.

The KDD field is related to the development of methods and techniques which make the data
relevant. In the educational sector, software and visualization techniques can be developed using
data mining t
helps us to cluster those students who need special attention in their studies. Knowledge
discovery in databases results in better decision-making related to the latest technologies
used in classroom teaching as well as faculty enhancement programs and in-house training. Using
data mining techniques, one can obtain refined data from distributed databases. Data mining is
an efficient tool for improving institutional effectiveness and student learning. Knowledge
acquired by educational data mining not only helps teachers to manage their classes and improve
their teaching skills and their students' learning processes, but also provides feedback to
institutions to improve their infrastructure and quality.

Using techniques like decision trees, the class results of students are predicted based on the
chosen attributes. Decision tree classifiers have been used on student data to predict student
performance in the class result. These techniques help in identifying students who (a) are
short of attendance or (b) have shown poor performance in sessionals.

The main finding of using these techniques is the gathering


academic performance. Other helpful techniques include K-Means clustering, K-Nearest Neighbors,
and neural networks, through which students are grouped based on attributes such as (a) class
performance, (b) sessional marks, and (c) attendance in class. The centroid values are
calculated from the educational dataset for K clusters. This enhances the decision-making
approach to monitoring the performance of students. On increasing the number of clusters K, the
accuracy improves on a large dataset and a better grouping of the data can be found. It also
helps us to cluster those students who need special attention. This review of data mining is
helpful for finding useful patterns in educational data sets.
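
The centroid calculation for K clusters can be sketched with a minimal K-Means loop. The student records, the hand-picked initial centroids, and the fixed iteration count are all assumptions made for illustration; a practical run would normally randomize the initial centroids and test for convergence.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Lloyd's iteration: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical students: (attendance %, sessional marks)
students = [(90, 80), (85, 75), (88, 82), (50, 40), (55, 35), (45, 42)]
# Initial centroids are picked by hand here for reproducibility.
centroids, clusters = kmeans(students, centroids=[(90, 80), (50, 40)])
print(centroids)
```

In this toy data the second cluster collects the students with low attendance and low sessional marks, which is the group the chapter describes as needing special attention.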
