IV-cse DM Viva Questions
Data preprocessing analyses the data set and provides the characteristics of the data set, such as
the relation name, attributes, and instances, i.e., the number of data entries.
Typical data quality problems are:
Missing values
Noisy data (human/machine errors)
Inconsistent data
Data cleaning tasks (a small sketch follows this list):
Handling missing values
Identifying outliers and smoothing out noisy data
Correcting inconsistent data
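A minimal sketch of these cleaning tasks, assuming the pandas and NumPy libraries are available; the column names, example values, and the z-score threshold below are purely illustrative and not taken from any particular data set.

```python
import numpy as np
import pandas as pd

# Hypothetical example data; values chosen only to illustrate the cleaning steps.
df = pd.DataFrame({
    "age":    [23, np.nan, 45, 31, 120, 29],      # 120 looks like a noisy/outlier entry
    "salary": [30000, 42000, np.nan, 39000, 41000, 36000],
})

# 1. Handle missing values: fill numeric gaps with the column mean
df["age"] = df["age"].fillna(df["age"].mean())
df["salary"] = df["salary"].fillna(df["salary"].mean())

# 2. Identify outliers with a simple z-score rule (threshold chosen for illustration)
#    and replace them with the median
z = (df["age"] - df["age"].mean()) / df["age"].std()
df.loc[z.abs() > 1.5, "age"] = df["age"].median()

# 3. Smooth noisy data by binning: equal-width bins, each value replaced by its bin mean
df["salary_bin"] = pd.cut(df["salary"], bins=3)
df["salary_smoothed"] = df.groupby("salary_bin", observed=True)["salary"].transform("mean")

print(df)
```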
Define metadata.
A: Metadata is simply defined as data about data. In other words, metadata is the summarized
data that leads us to the detailed data.
Explain data mart.
A: A data mart contains a subset of organization-wide data. This subset of data is valuable to
specific groups of an organization. In other words, a data mart contains data specific to a
particular group.
Data preprocessing is a data mining technique which is used to transform raw data into a useful
and efficient format.
Discretization converts the type of an attribute from numeric to nominal, which can improve the
efficiency of the result.
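A small sketch of discretization, assuming pandas; the attribute, the bin count, and the labels are illustrative.

```python
import pandas as pd

# Hypothetical numeric attribute
ages = pd.Series([23, 29, 31, 45, 52, 60, 67])

# Discretize: convert the numeric attribute to a nominal one via equal-width binning
age_group = pd.cut(ages, bins=3, labels=["young", "middle-aged", "senior"])
print(age_group)
# The resulting categorical (nominal) attribute can now be used by methods
# that expect nominal inputs.
```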
Perform Classification of data using Weka
Classification is the process of finding a model that describes and distinguishes data classes and
concepts. The derived model can be represented in various forms, such as classification
rules, decision trees, or mathematical formulae. In this experiment we choose the J48 tree
classifier to classify the data classes.
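J48 is Weka's implementation of the C4.5 decision tree. Below is a rough analogue sketched with scikit-learn's CART-based DecisionTreeClassifier (similar in spirit but not identical to J48); the Iris data set and the parameters are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# criterion="entropy" uses information gain, as C4.5/J48 does
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3)
tree.fit(X_train, y_train)

print("accuracy:", tree.score(X_test, y_test))
print(export_text(tree))  # the learned tree printed as readable rules
```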
Apriori uses a "bottom up" approach, where frequent subsets are extended one item at a time (a
step known as candidate generation), and groups of candidates are tested against the data. The
algorithm terminates when no further successful extensions are found.
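A minimal sketch of the bottom-up candidate generation just described, using only the Python standard library; the transactions and the minimum support are invented for illustration.

```python
# Toy transaction database and an absolute minimum support count (illustrative)
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
]
min_support = 2

def frequent(candidates):
    # Keep only itemsets contained in at least min_support transactions
    return {c for c in candidates
            if sum(c <= t for t in transactions) >= min_support}

# Pass 1: frequent 1-itemsets
level = frequent({frozenset([item]) for t in transactions for item in t})
all_frequent = set(level)

# Candidate generation: extend frequent itemsets one item at a time,
# then test the candidates against the data; stop when nothing survives.
k = 2
while level:
    candidates = {a | b for a in level for b in level if len(a | b) == k}
    level = frequent(candidates)
    all_frequent |= level
    k += 1

for itemset in sorted(all_frequent, key=len):
    print(set(itemset), "support:", sum(itemset <= t for t in transactions))
```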
7. Briefly state the difference between a data warehouse and a data mart.
o A data warehouse is made up of many data marts. A data warehouse contains many
subject areas, but a data mart generally focuses on one subject area. For example, if there
is a data warehouse for a bank, there can be one data mart for accounts, one for loans, etc.
This is a high-level definition. Metadata is data about data; for example, if a data mart
receives a file, the metadata will contain information such as how many columns it has,
whether the file is fixed-width or delimited, the ordering of fields, the data types of the
fields, etc.
8. Differentiate between data mining and data warehousing.
o Data warehousing is merely extracting data from different sources, cleaning the data, and
storing it in the warehouse, whereas data mining aims to examine or explore the data
using queries. These queries can be fired on the data warehouse. Exploring the data in
data mining helps in reporting, planning strategies, finding meaningful patterns, etc.
E.g., a data warehouse of a company stores all the relevant information about its projects.
9. What is the difference between OLTP and OLAP?
o OLTP is the transaction system that collects business data, whereas OLAP is the
reporting and analysis system on that data.
OLTP systems are optimized for INSERT and UPDATE operations and are therefore highly
normalized. On the other hand, OLAP systems are deliberately denormalized for fast data
retrieval through SELECT operations.
Star Schema
Each dimension in a star schema is represented with only one dimension table.
This dimension table contains the set of attributes.
Consider, for example, the sales data of a company with respect to four
dimensions, namely time, item, branch, and location.
There is a fact table at the center. It contains the keys to each of the four dimensions.
The fact table also contains the attributes, namely dollars sold and units sold.
Note − Each dimension has only one dimension table and each table holds a set of attributes.
For example, the location dimension table contains the attribute set {location_key, street, city,
province_or_state, country}. This constraint may cause data redundancy. For example, the cities
"Vancouver" and "Victoria" are both in the Canadian province of British Columbia.
The entries for such cities may cause data redundancy along the attributes province_or_state
and country.
Snowflake Schema
Some dimension tables in the Snowflake schema are normalized.
The normalization splits up the data into additional tables.
Unlike the star schema, the dimension tables in a snowflake schema are normalized. For
example, the item dimension table of the star schema is normalized and split into two
dimension tables, namely the item and supplier tables.
Now the item dimension table contains the attributes item_key, item_name, type, brand,
and supplier_key.
The supplier key is linked to the supplier dimension table. The supplier dimension table
contains the attributes supplier_key and supplier_type.
Note − Due to the normalization in the snowflake schema, the redundancy is reduced and
therefore it becomes easier to maintain and it saves storage space.
Fact Constellation Schema
A fact constellation has multiple fact tables. It is also known as a galaxy schema.
For example, a fact constellation may contain two fact tables, namely sales and shipping.
Schema Definition
A multidimensional schema is defined using the Data Mining Query Language (DMQL). The two
primitives, cube definition and dimension definition, can be used for defining data
warehouses and data marts.
What is Dimension Table?
A dimension table is a table which contains attributes of the measurements stored in fact tables.
This table consists of hierarchies, categories, and logic that can be used to traverse nodes.
For example, the average number of bricks produced by one person/machine is a measure of the
business process.
The k-means algorithm defines the centroid of a cluster as the mean value of the points within
the cluster. It proceeds as follows:
First, it randomly selects k of the objects in D, each of which initially represents a cluster mean or
centre. Each of the remaining objects is assigned to the cluster to which it is most similar, based
on the Euclidean distance between the object and the cluster mean. The algorithm then iteratively
improves the within-cluster variation: for each cluster it computes a new mean using the objects
assigned to that cluster in the previous iteration, and all the objects are reassigned using the
updated means as the new cluster centres. This is repeated until the assignment becomes stable.
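A minimal sketch of k-means on 2-D points, using only the Python standard library; the sample points and the value of k are illustrative.

```python
import math
import random

def kmeans(points, k, max_iter=100):
    centres = random.sample(points, k)              # step 1: pick k objects as initial means
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:                            # step 2: assign each object to nearest centre
            i = min(range(k), key=lambda c: math.dist(p, centres[c]))
            clusters[i].append(p)
        new_centres = [                             # step 3: recompute each cluster mean
            tuple(sum(coords) / len(cl) for coords in zip(*cl)) if cl else centres[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centres == centres:                  # step 4: stop when the means are stable
            break
        centres = new_centres
    return centres, clusters

points = [(1, 1), (1.5, 2), (5, 7), (8, 8), (1, 0.6), (9, 11)]
centres, clusters = kmeans(points, k=2)
print(centres)
```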
K-medoids, or Partitioning Around Medoids (PAM), is a partitioning algorithm based on medoids,
i.e., central objects. It proceeds as follows:
Arbitrarily choose k of the objects in D as the initial representative objects. Then assign each
remaining object to the cluster with the nearest representative object, and randomly select a
non-representative object. Compute the total cost of swapping a representative object with the
non-representative object. If the total cost is less than zero, perform the swap and form the new
set of representative objects.
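A rough sketch of the PAM swap step, again with only the standard library; the sample points and k are illustrative, and for brevity every non-representative object is tried as a swap candidate rather than a randomly selected one.

```python
import math
import random

def total_cost(points, medoids):
    # Sum of distances from every point to its nearest medoid
    return sum(min(math.dist(p, m) for m in medoids) for p in points)

def pam(points, k, max_iter=100):
    medoids = random.sample(points, k)          # arbitrary initial representative objects
    for _ in range(max_iter):
        improved = False
        for m in list(medoids):
            for candidate in points:
                if candidate in medoids:
                    continue
                trial = [candidate if x == m else x for x in medoids]
                # Swap only if it reduces the total cost (i.e., the cost change is negative)
                if total_cost(points, trial) < total_cost(points, medoids):
                    medoids = trial
                    improved = True
        if not improved:                        # no beneficial swap left: stop
            break
    return medoids

points = [(1, 1), (1.5, 2), (5, 7), (8, 8), (1, 0.6), (9, 11)]
print(pam(points, k=2))
```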
7. Classification Algorithms
The nearest-neighbour rule determines the class of an unknown data point on the basis of its
closest neighbour, whose class is already known. The nearest neighbours are computed on the
basis of a value k, which indicates how many nearest neighbours are to be considered when
characterizing the point. Because it uses more than one closest neighbour to determine the class
to which the given data point belongs, it is called k-nearest neighbours (KNN). The data samples
need to be kept in memory at runtime, hence the method is referred to as a memory-based
technique.
In weighted KNN, the training points are assigned weights according to their distances from the
sample data point. Even so, computational complexity and memory requirements remain the
primary concerns.
To overcome the memory limitation, the size of the data set is reduced: repeated patterns that do
not add information are eliminated from the training data set, and training points that do not
influence the result are also eliminated. The training data set can further be organized using
different structures to improve on the memory limit of KNN; the KNN implementation can use a
ball tree, a k-d tree, or an orthogonal search tree.
The tree-structured training data is further divided into nodes, and techniques such as NFL and
tunable metrics divide the training data set according to planes. Using these techniques the speed
of the basic KNN algorithm can be increased. Consider that an object is sampled with a set of
different attributes, and assume its group can be determined from its attributes; different
algorithms can then be used to automate the classification process. In pseudocode, the k-nearest
neighbour algorithm can be expressed as: calculate the distance D(X, Y) between X and every
object Y in the training set, then assign to X the class that is most common among its k nearest
objects (a sketch follows).
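A minimal sketch of this basic KNN rule using only the Python standard library; the training points, the query point, and k are illustrative.

```python
import math
from collections import Counter

def knn_classify(query, training_set, k=3):
    # training_set is a list of (point, class_label) pairs
    # Calculate the distance D(X, Y) between X and every object Y in the training set
    distances = [(math.dist(query, point), label) for point, label in training_set]
    # Keep the k nearest neighbours and return the most frequent class among them
    nearest = sorted(distances)[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

training_set = [
    ((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
    ((5.0, 5.0), "B"), ((5.2, 4.8), "B"), ((4.9, 5.3), "B"),
]
print(knn_classify((1.1, 0.9), training_set, k=3))   # expected: "A"
```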
8. Case study