Data Warehousing and Data Mining
Web Mining
NOTE:
The MAKAUT course structure and syllabus for the 6th semester was changed in 2021. Previously,
DATA WAREHOUSING AND DATA MINING was in the 7th semester. The subject has been redesigned
and shifted to the 6th semester as per the present curriculum, and its organization has changed
slightly. With this in mind, we are providing the relevant MAKAUT university
solutions and some model questions and answers for newly introduced topics, so that students can
get an idea of university question patterns.
POPULAR PUBLICATIONS
c) It is a relational database
Answer: (b)
2. A data warehouse is said to contain a 'subject oriented' collection of data because [WBUT 2009,
2013]
d) It is a generalization of 'object-oriented'
Answer: (a)
3. A data warehouse is said to contain a 'time-varying' collection of data because [WBUT 2010,
2013, 2015]
c) Every key structure of a data warehouse contains, either implicitly or explicitly, an element of time
Answer: (c)
c) Database applications
b) OLTP applications
b) Data warehouse can be used for information processing (query, report) and analytical
processing
d) Data warehouse can be used for information processing (query, report), analytical processing
and data mining
Answer: (d)
Define the types of Data Marts. [WBUT 2009, 2010, 2011, 2018]
Answer:
1st Part:
A data mart is a group of subjects that are organized in a way that allows them to assist
departments in making specific decisions. For example, the advertising department will have its
own data mart, while the finance department will have a data mart that is separate from it. In
addition to this, each department will have full ownership of the software, hardware, and other
components that make up their data mart.
2nd Part:
Independent data marts are sourced from data captured from OLTP systems, from external providers, or from
data generated locally within a particular department or geographic area.
2. Define data mining. What are the advantages of data mining over traditional approaches? [WBUT
2009]
Answer:
1st Part:
Data mining, which is also known as knowledge discovery, is one of the most popular topics in
information technology. It concerns the process of automatically extracting useful information and
has the promise of discovering hidden relationships that exist in large databases. These
relationships represent valuable knowledge that is crucial for many applications. Data mining is
not confined to the analysis of data stored in data warehouses. It may analyze data existing at
more detailed granularities than the summarized data provided in a data warehouse. It may also
analyze transactional, textual, spatial, and multimedia data, which are difficult to model with
current multidimensional database technology.
2nd Part:
With the help of data mining, organizations are in a better position to predict the future regarding
the business trend, the possible amount of revenue that could be generated, the orders that could
be expected and the type of customers that could be approached. The traditional approaches will
not be able to generate such accurate results as they use simpler algorithms. One major advantage
of data mining over a traditional statistical approach is its ability to deal directly with
heterogeneous data fields.
The advantages of data mining help businesses grow, help keep customers satisfied, and help
in many other areas such as data management.
OR,
Explain support, confidence, frequent item set and give a formal definition of association rule.
[WBUT 2013]
OR,
What is an Association Rule? Define Support, Confidence, Item set and Frequent item set in
Association Rule Mining? [WBUT 2017]
Answer:
To illustrate the concepts, we use a small example from the supermarket domain. The set of items
is I = {milk, bread, butter, beer}, and a small database containing the items is shown in the table
below.
Transaction Items
1 Milk, bread
2 Bread, butter
3 Beer
5 Bread, butter
The confidence of a rule is defined as conf(X ⇒ Y) = supp(X ∪ Y) / supp(X). For example, the rule
{milk, bread} ⇒ {butter} has a confidence of 0.2/0.4 = 0.5 in the database in the table, which means that
for 50% of the transactions containing milk and bread the rule is correct. Confidence can be
interpreted as an estimate of the probability P(Y | X), the probability of finding the RHS of the rule
in transactions under the condition that these transactions also contain the LHS.
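The arithmetic above can be checked with a short Python sketch. It assumes that support means the fraction of transactions containing the itemset, and that the transaction missing from the table (number 4) is {milk, bread, butter}; that row is not given in the table and is chosen here only because it reproduces the supports 0.4 and 0.2 quoted in the text. The names transactions, supp and conf are illustrative.

```python
# Toy database from the table. Transaction 4 is an ASSUMPTION: it is missing
# from the table and is chosen to match the supports quoted in the text.
transactions = [
    {"milk", "bread"},            # 1
    {"bread", "butter"},          # 2
    {"beer"},                     # 3
    {"milk", "bread", "butter"},  # 4 (assumed)
    {"bread", "butter"},          # 5
]


def supp(itemset, db):
    """Fraction of transactions in db that contain every item of itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in db) / len(db)


def conf(lhs, rhs, db):
    """conf(X => Y) = supp(X union Y) / supp(X)."""
    return supp(set(lhs) | set(rhs), db) / supp(lhs, db)


print(supp({"milk", "bread"}, transactions))              # 0.4
print(supp({"milk", "bread", "butter"}, transactions))    # 0.2
print(conf({"milk", "bread"}, {"butter"}, transactions))  # 0.5
```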
In many (but not all) situations, we only care about association rules or causalities involving sets of
items that appear frequently in baskets. For example, we cannot run a good marketing strategy
involving items that no one buys anyway. Thus, much data mining starts with the assumption that
we only care about sets of items with high support; i.e., they appear together in many baskets. We
then find association rules or causalities only involving a high-support set of items; that is, the set must appear in
at least a certain percentage of the baskets, called the support threshold. We use the term frequent
itemset for "a set S that appears in at least fraction s of the baskets," where s is some chosen
constant, typically 0.01 or 1%.
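As a rough illustration of the frequent itemset definition, the sketch below enumerates every non-empty subset of items and keeps those whose support reaches the threshold s. The baskets reuse the supermarket example, with basket 4 assumed to be {milk, bread, butter} because that row is missing from the table. The threshold s = 0.4 is an assumption chosen only so that a five-basket toy database gives a non-trivial answer; the function name frequent_itemsets is illustrative.

```python
from itertools import combinations

# Toy baskets from the supermarket example. Basket 4 is an ASSUMPTION
# (it is missing from the table in the text).
baskets = [
    {"milk", "bread"},
    {"bread", "butter"},
    {"beer"},
    {"milk", "bread", "butter"},  # assumed
    {"bread", "butter"},
]


def frequent_itemsets(baskets, s):
    """Return {itemset: support} for every itemset with support >= s.

    Brute force: every non-empty subset of the items is tested, so this
    is only suitable for tiny examples.
    """
    items = sorted(set().union(*baskets))
    n = len(baskets)
    result = {}
    for k in range(1, len(items) + 1):
        for candidate in combinations(items, k):
            support = sum(set(candidate) <= b for b in baskets) / n
            if support >= s:
                result[candidate] = support
    return result


frequent = frequent_itemsets(baskets, s=0.4)
for itemset, support in frequent.items():
    print(itemset, support)
```

Testing every subset grows exponentially with the number of items. Practical algorithms such as Apriori avoid this by using the fact that every subset of a frequent itemset must itself be frequent.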
Association rules are statements of the form {X1, X2, ..., Xn} ⇒ Y, meaning that if we find all of
X1, X2, ..., Xn in the market basket, then we have a good chance of finding Y. The probability of finding
Y for us to accept this rule is the confidence of the rule. We normally would search only for rules
that had confidence above a certain threshold.
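A minimal sketch of searching for rules with a single item Y on the right-hand side, keeping only those that pass both a support threshold and a confidence threshold, might look as follows. The baskets reuse the supermarket example with basket 4 assumed to be {milk, bread, butter}. The thresholds 0.4 and 0.7 and the names count and find_rules are assumptions made for illustration.

```python
from itertools import combinations

# Toy baskets from the supermarket example. Basket 4 is an ASSUMPTION
# (it is missing from the table in the text).
baskets = [
    {"milk", "bread"},
    {"bread", "butter"},
    {"beer"},
    {"milk", "bread", "butter"},  # assumed
    {"bread", "butter"},
]


def count(itemset):
    """Number of baskets that contain every item of itemset."""
    return sum(itemset <= b for b in baskets)


def find_rules(min_support, min_confidence):
    """Return (lhs, y, confidence) for each rule lhs => y passing both thresholds."""
    items = sorted(set().union(*baskets))
    n = len(baskets)
    found = []
    for k in range(1, len(items)):
        for lhs in combinations(items, k):
            lhs_set = frozenset(lhs)
            lhs_count = count(lhs_set)
            if lhs_count == 0:
                continue  # confidence is undefined when the LHS never occurs
            for y in items:
                if y in lhs_set:
                    continue
                both = count(lhs_set | {y})
                # Support of the whole set {X1..Xn, Y}, then confidence of the rule.
                if both / n >= min_support and both / lhs_count >= min_confidence:
                    found.append((lhs, y, both / lhs_count))
    return found


rules = find_rules(min_support=0.4, min_confidence=0.7)
for lhs, y, c in rules:
    print(set(lhs), "=>", y, f"confidence={c:.2f}")
```

Under these assumed thresholds, {bread} ⇒ milk is rejected (confidence 0.5) even though {milk, bread} is a high-support set, which shows that the two thresholds filter independently.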