Data Warehouse

1.Definition of a Data Warehouse: A data warehouse is a centralized repository that stores large volumes of structured and semi-structured data from various sources. It is designed for query and analysis rather than transaction processing. Data is consolidated, transformed, and stored to support decision-making processes.

2.Need for a Separate Data Warehouse
Data Integration: Combines data from disparate sources, providing a unified view.
Historical Analysis: Stores historical data, enabling trend analysis over time.
Performance: Optimizes query performance for complex analytical queries.
Data Quality: Ensures consistency, accuracy, and reliability through data cleansing.
Decision Support: Facilitates business intelligence activities like reporting, data mining, and OLAP (Online Analytical Processing).

3.Enterprise Data Warehouse (EDW): An Enterprise Data Warehouse is a centralised repository that stores integrated data from multiple business sources. It provides a comprehensive and consistent view of the enterprise's data, supporting decision-making at all levels of the organisation.

4.Enterprise Data Warehouse Key Characteristics:
Centralization: All data is stored in a single, central repository.
Integration: Data from various sources is cleaned, transformed, and integrated to ensure consistency.
Scalability: Designed to handle large volumes of data and a wide variety of data types.
Complexity: Often complex to design and maintain due to the need for comprehensive data integration and storage solutions.
Cost: Typically more expensive to implement and maintain compared to other models due to the scale and complexity.

5.Data Mart: A Data Mart is a subset of a data warehouse, typically focused on a specific business line or team. It is designed to meet the needs of a particular group of users, such as the sales department, finance team, or marketing department.

6.Data Mart Key Characteristics:
Focused Scope: Limited to specific business areas or departments.
Simpler Design: Easier to implement and maintain compared to an EDW.
Faster Access: Designed to provide quicker access to relevant data for the targeted user group.
Cost-Effective: Less expensive to implement due to the reduced scope and complexity.
Data Mart Types:
Dependent Data Mart: Extracted from an existing EDW, ensuring consistency with the central data repository.
Independent Data Mart: Built directly from source systems, independent of an EDW.

7.Virtual Warehouse: A Virtual Warehouse is a logical data warehouse that provides a unified view of data without physically consolidating it into a single repository. It uses virtualization technologies to create a layer that allows users to query and analyze data from multiple sources as if it were stored in one place.

8.Virtual Warehouse Key Characteristics:
Logical Integration: Data remains in its original source systems but is accessed and integrated virtually.
Flexibility: Easier to adapt to changes in the data environment.
Lower Cost: Reduced need for physical storage and data movement.
Performance: Can be affected by the performance of the underlying source systems and the virtualization layer.

9.Differences between Operational Database Systems and Data Warehouses
Operational Database | Data Warehouse
i. Operational systems are designed to support high-volume transaction processing. | i. Data warehousing systems are typically designed to support high-volume analytical processing.
ii. Operational systems are usually concerned with current data. | ii. Data warehousing systems are usually concerned with historical data.
iii. Data within operational systems is updated regularly as needed. | iii. Data is non-volatile; new data may be added regularly.
iv. It is designed for real-time business dealings and processes. | iv. It is designed for analysis of business measures by subject area, categories, and attributes.
v. It supports thousands of concurrent clients. | v. It supports a few concurrent clients relative to OLTP.
vi. A small amount of data is accessed per operation. | vi. A large amount of data is accessed per query.

10.Data Cube: A Data Cube is a multi-dimensional array of values used to represent data in a data warehouse. It allows data to be modelled and viewed in multiple dimensions, which are often referred to as the attributes or features of the data.

11.Key Characteristics of Data Cube:
Dimensions: Represent the perspectives or entities with respect to which an organisation wants to keep records (e.g., time, geography, products).
Facts: Numeric data points of interest, usually aggregatable measures like sales.
Hierarchies: Each dimension can have hierarchical levels.
Dice: A sub-cube created by selecting specific values for multiple dimensions.

12.Conceptual Modeling of Data Warehouse: Conceptual modeling of a data warehouse involves creating a high-level blueprint that outlines the structure and organization of the data warehouse. It focuses on defining the main entities, relationships, and data flows without worrying about the technical implementation details.

13.Key Techniques of Conceptual Modeling of Data Warehouse:
Star Schema: The most common data warehouse schema. It consists of a central fact table surrounded by dimension tables. Each dimension table contains the attributes of one dimension.
Snowflake Schema: A more normalized form of the star schema where dimension tables are further normalized into multiple related tables.

14.Concept Hierarchies: Concept hierarchies are a way to organize data into multiple levels of granularity, allowing users to navigate through different layers of data abstraction. They define a sequence of mappings from a set of low-level concepts to higher-level, more general concepts.

15.Measures: Their Categorization and Computation
Measures: Measures are the quantitative data points in a data warehouse, representing the metrics that users want to analyze.
Categorization of Measures:
Distributive Measures: Can be computed on partitions of the data and the partial results combined, giving the same result as computing over the whole dataset. Example: SUM, COUNT.
Algebraic Measures: Can be computed using a finite number of distributive measures. Example: AVERAGE (computed as SUM/COUNT), RATIO.
Holistic Measures: Require the entire dataset to compute the result. Example: MEDIAN, MODE.
Computation of Measures: Measures are typically computed using aggregate functions such as SUM, COUNT, AVERAGE, MIN, and MAX. They can be derived or calculated using complex formulas involving basic measures.

18.OLEP (Online Evolutionary Processing): While OLAP focuses on the analytical processing of historical data, OLEP refers to a paradigm often associated with the processing of evolving or real-time data, supporting continuous and adaptive queries. This is not as commonly referenced as OLAP, but it generally deals with data that changes over time and requires immediate, adaptive responses.

What is Metadata: Metadata is data that provides information about other data, such as details that describe the content, quality, structure, and management of a dataset. For example, metadata for a digital photo might include the date it was taken, the camera settings, and the location.
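
The distributive, algebraic, and holistic measure categories described in item 15 differ in whether partial results can be combined. The following is a minimal Python sketch of that difference, assuming a hypothetical set of sales amounts stored in two partitions; the data and variable names are illustrative and not part of the notes.

```python
from statistics import median

# Hypothetical sales amounts stored in two partitions (illustrative data).
partition_a = [10, 20, 30]
partition_b = [40, 100]
all_rows = partition_a + partition_b

# Distributive: SUM and COUNT over the whole dataset equal the
# combination of the per-partition results.
total = sum(partition_a) + sum(partition_b)
count = len(partition_a) + len(partition_b)

# Algebraic: AVERAGE is derived from a fixed number of distributive measures.
average = total / count

# Holistic: MEDIAN needs all rows at once. Combining the partition
# medians generally gives a different (wrong) answer.
true_median = median(all_rows)
median_of_medians = median([median(partition_a), median(partition_b)])

print(total, count, average, true_median, median_of_medians)
```

In this sketch the median of the partition medians differs from the true median, which is why holistic measures cannot be pre-aggregated per partition the way SUM and COUNT can.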

16.OLAP Operations: OLAP operations enable users to analyze multi-dimensional data interactively, allowing for insights from different perspectives and granularities. These operations are typically performed on a multi-dimensional data model or a data cube. Here are the key OLAP operations:

Roll-Up: Aggregates data by climbing up a concept hierarchy or by reducing dimensions. Example: Rolling up sales data from the day level to the month level.

Drill-Down: Breaks down data into finer levels of detail, the opposite of roll-up. Example: Drilling down from the year level to the quarter level in sales data.

Slice: Fixes a single value on one dimension of the multi-dimensional data cube, resulting in a sub-cube. Example: Slicing data to view sales figures for a specific year only.

Dice: Selects specific values on two or more dimensions to create a sub-cube, providing a more focused dataset. Example: Dicing data to view sales figures for specific products and regions.

Pivot (Rotate): Reorients the data cube, allowing data to be viewed from different perspectives. Example: Rotating the cube to swap rows and columns in a report to get a different view.

17.Operations in the Multidimensional Data Model (OLAP): The multi-dimensional data model is foundational for OLAP operations, enabling sophisticated data analysis through various operations:

Aggregation: Summarizes data along one or more dimensions. Example: Summing up sales figures for each product category.

Navigation: Involves moving through different levels of data detail, including drill-down and roll-up operations. Example: Navigating from yearly sales data to monthly sales data.

Selection: Filters data to focus on specific criteria. Example: Selecting sales data for a particular region or time period.

Computation: Performs calculations on data, such as averages, ratios, and percentages. Example: Calculating the average sales per customer.

18.Data Warehouse Design and Usage
Design: Requirement gathering, data modeling (conceptual, logical, physical), ETL process design, metadata management, and architecture planning (centralized, federated, hybrid).
Usage: Supports complex querying and data analysis for informed decision-making. Tools: OLAP operations (roll-up, drill-down, slice, dice).

19.From Online Analytical Processing to Multidimensional Data Mining

Online Analytical Processing (OLAP):
Purpose: Facilitates complex queries and analysis of data in a multidimensional format.
Operations: Roll-up, drill-down, slice, dice, and pivot.
Tools: OLAP servers (ROLAP, MOLAP, HOLAP) and cube structures.

Multidimensional Data Mining:
Purpose: Discover patterns, correlations, and anomalies in large datasets.
Techniques: Association rule mining, classification, clustering, regression, outlier detection.
Integration with OLAP: Applies data mining techniques to multidimensional data stored in OLAP cubes, enhancing analysis with trend analysis, forecasting, and predictive modeling.

20.Data Warehouse Implementation

Implementation Steps:

Planning and Analysis: Objective: Define project scope, objectives, timelines, and conduct feasibility studies. Activities: Risk assessment and assembling a project team.

Design: Objective: Create detailed data models and ETL processes. Activities: Develop conceptual, logical, and physical schemas; plan for data quality, security, and governance.

Development: Objective: Build the data warehouse infrastructure. Activities: Set up servers and storage, implement ETL processes, develop the data warehouse database and metadata repository.

Testing: Objective: Ensure the system works correctly. Activities: Perform unit testing, integration testing, validate data accuracy and consistency, conduct user acceptance testing (UAT).

Deployment: Objective: Make the data warehouse operational. Activities: Migrate data, set up access controls, train end-users, roll out the system.

Maintenance and Support: Objective: Keep the data warehouse running smoothly. Activities: Monitor performance, update data, address issues, plan for scalability and system upgrades.

21.What is Data Mining: It is the process of discovering patterns, trends, correlations, and anomalies within large datasets using techniques from statistics, machine learning, and database systems. The goal is to extract valuable information from raw data and transform it into an understandable structure for further use, such as decision-making, prediction, and knowledge discovery.

22.Process of Knowledge Discovery in Databases (KDD)

The process of Knowledge Discovery in Databases (KDD) is a comprehensive process of converting raw data into useful information and knowledge. It consists of several steps:

1. Data Cleaning:
○ Purpose: Remove noise and correct inconsistencies in the data.
○ Activities: Handling missing values, correcting errors, and smoothing noisy data.
2. Data Integration:
○ Purpose: Combine data from multiple sources into a coherent dataset.
○ Activities: Merging databases, data warehouses, or different data formats.
3. Data Selection:
○ Purpose: Select relevant data for analysis.
○ Activities: Choosing a subset of attributes or records from the dataset.
4. Data Transformation:
○ Purpose: Transform data into suitable formats for mining.
○ Activities: Normalization, aggregation, generalization, and feature extraction.
5. Data Mining:
○ Purpose: Apply algorithms to extract patterns from the data.
○ Activities: Using techniques such as classification, regression, clustering, association, etc.

23.Example of KDD Process

Consider an e-commerce company that wants to understand customer purchasing behaviors to improve marketing strategies:

1. Data Cleaning: Remove duplicate entries, correct data entry errors, handle missing values in transaction records.
2. Data Integration: Combine customer data from CRM systems, web analytics, and transaction databases into a single dataset.
3. Data Selection: Select relevant attributes such as customer demographics, purchase history, and browsing behavior.

28.Association Rule Learning: Association rule learning is a method used in data mining to discover interesting relationships, patterns, or associations among a set of items in large datasets. It aims to identify rules that predict the occurrence of an item based on the occurrences of other items.
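
The roll-up, drill-down, slice, and dice operations listed under OLAP Operations can be illustrated on a tiny in-memory fact table. This is only a sketch in plain Python: the fact rows, the dimension order (year, month, region, product), and the helper names `roll_up`, `slice_cube`, and `dice_cube` are all hypothetical, and a real OLAP server would run these operations against a cube or a star schema instead.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, month, region, product, sales).
facts = [
    (2023, "Jan", "East", "Pen", 100),
    (2023, "Jan", "West", "Pen", 80),
    (2023, "Feb", "East", "Ink", 50),
    (2024, "Jan", "East", "Pen", 120),
    (2024, "Feb", "West", "Ink", 70),
]

def roll_up(rows, key):
    """Aggregate sales by the grouping that key(row) returns."""
    totals = defaultdict(int)
    for row in rows:
        totals[key(row)] += row[4]
    return dict(totals)

def slice_cube(rows, year):
    """Slice: fix one dimension (year) to a single value."""
    return [r for r in rows if r[0] == year]

def dice_cube(rows, regions, products):
    """Dice: keep rows matching chosen values on two dimensions."""
    return [r for r in rows if r[2] in regions and r[3] in products]

# Month-level totals are the finer view; rolling up gives year-level totals.
# Reading by_month after by_year corresponds to a drill-down.
by_month = roll_up(facts, key=lambda r: (r[0], r[1]))
by_year = roll_up(facts, key=lambda r: r[0])

sales_2023 = slice_cube(facts, 2023)
east_pens = dice_cube(facts, regions={"East"}, products={"Pen"})
```

Here `by_year` and `by_month` are the same measure at two levels of the time hierarchy, while `sales_2023` and `east_pens` are sub-cubes obtained by filtering on one and two dimensions respectively.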

24.Types of Repositories

1. Data Warehouses:
● Description: Centralized repositories that store integrated data from multiple sources, designed for query and analysis.
● Characteristics: Structured, subject-oriented, time-variant, and non-volatile.
● Use Case: Business intelligence and reporting, historical data analysis.

2. Databases:
● Description: Structured collections of data, organized in tables and managed by database management systems (DBMS).
● Characteristics: Organized into schemas, supports transactions, ensures data integrity.
● Use Case: Online transaction processing (OLTP), data storage for applications.

3. Data Lakes:
● Description: Storage repositories that hold large amounts of raw data in its native format until it is needed.
● Characteristics: Highly scalable, supports structured and unstructured data, schema-on-read.
● Use Case: Big data analytics, storing unstructured data, data exploration.

25.Data Mining Tasks

1. Descriptive Tasks:
● Clustering: Grouping similar data objects into clusters based on their characteristics.
○ Example: Market segmentation to identify distinct customer groups.
● Association Rule Mining: Discovering interesting relationships between variables in large datasets.
○ Example: Market basket analysis to find products frequently bought together.
● Summarization: Providing a compact description of a dataset.
○ Example: Generating a summary report of sales data.

2. Predictive Tasks:
● Classification: Assigning items to predefined categories or classes.
○ Example: Email spam detection.
● Regression: Predicting a continuous-valued attribute based on input variables.
○ Example: Predicting house prices based on features like size, location, and age.
● Time Series Analysis: Analyzing time-ordered data to extract meaningful statistics and characteristics.
○ Example: Forecasting stock prices or weather conditions.

3. Sequential Pattern Mining: Discovering regular sequences of events or patterns over time.

26.Data Mining Trends

1. Big Data:
● Description: Managing and analyzing large volumes of data that are beyond the capability of traditional database systems.
● Trend: Leveraging technologies like Hadoop, Spark, and distributed computing for big data analytics.

2. Cloud Computing:
● Description: Utilizing cloud resources for scalable and flexible data mining operations.
● Trend: Adoption of cloud platforms (e.g., AWS, Google Cloud, Azure) for data storage and analytics.

3. Real-Time Data Mining:
● Description: Analyzing data as it is generated to provide immediate insights.
● Trend: Use of streaming data processing frameworks (e.g., Apache Kafka, Apache Flink).

27.Data Mining Issues

1. Data Quality:
● Description: Ensuring the accuracy, completeness, and consistency of data.
● Issue: Handling noisy, incomplete, and inconsistent data that can affect the results of data mining.

2. Scalability:
● Description: Efficiently processing and analyzing large datasets.
● Issue: Developing algorithms that can scale with the increasing volume and complexity of data.

3. Data Integration:
● Description: Combining data from various heterogeneous sources into a unified dataset.
● Issue: Addressing challenges related to data format, schema integration, and semantic consistency.

4. Privacy Concerns:
● Description: Protecting sensitive data from unauthorized access and misuse.
● Issue: Ensuring that data mining practices comply with data protection regulations (e.g., GDPR).

29.How Association Rule Learning Works:

1. Identify Frequent Itemsets: Find all sets of items (itemsets) that have support above a certain threshold.
2. Generate Association Rules: From the frequent itemsets, generate rules that have confidence above a certain threshold.
3. Evaluate and Prune: Evaluate the generated rules using metrics like lift and prune the ones that are not interesting.

30.Apriori Algorithm: The Apriori algorithm is a classic algorithm used to find frequent itemsets and generate association rules. It uses a bottom-up approach where frequent subsets are extended one item at a time (known as candidate generation), and groups of candidates are tested against the data.

31.Steps of the Apriori Algorithm:

1. Generate Candidate Itemsets: Start with itemsets of length 1. Generate larger itemsets by combining the smaller itemsets.
2. Calculate Support: For each candidate itemset, calculate its support.
3. Prune: Remove itemsets that do not meet the minimum support threshold.
4. Repeat: Repeat the process to generate itemsets of increasing length until no more frequent itemsets are found.
5. Generate Rules: From the frequent itemsets, generate rules and calculate their confidence.

Example:
● Dataset: {1, 2, 3}, {1, 2}, {2, 3}, {1, 3}, {2, 3}
● Minimum Support Threshold: 0.4 (2 out of 5 transactions)

32.FP-Growth Algorithm

FP-Growth Algorithm: The FP-Growth (Frequent Pattern Growth) algorithm is an alternative to the Apriori algorithm that eliminates the need for candidate generation. It uses a divide-and-conquer strategy by constructing a compact data structure called the FP-tree (Frequent Pattern Tree) and then extracting frequent itemsets directly from this tree.

Steps of the FP-Growth Algorithm:

1. Build the FP-Tree:
○ Scan the database to determine the frequency of each item.
○ Order the items by frequency and construct the FP-tree by inserting transactions.
2. Mine the FP-Tree:
○ Starting from the least frequent items, extract each item's conditional pattern base and generate frequent itemsets from it.

Example:
● Dataset: {A, B, C}, {A, B}, {B, C}, {A, C}, {B, C}
33.Applications of Association Rule Learning:
1. Market Basket Analysis: Identifying products frequently bought together.
2. Cross-Selling: Recommending additional products based on customer purchases.
3. Fraud Detection: Detecting unusual patterns that may indicate fraudulent activity.
4. Healthcare: Discovering associations between symptoms and diseases.
5. Web Usage Mining: Understanding user navigation patterns on websites.

34.Unsupervised Learning: Unsupervised learning is a type of machine learning where the algorithm is trained on unlabeled data, meaning there are no predefined labels or outcomes. The goal is to infer the natural structure present within a set of data points. The most common tasks in unsupervised learning are clustering and association.

35.Clustering Algorithms

1. K-Means Clustering: Partition the dataset into K clusters, where each data point belongs to the cluster with the nearest mean (centroid).

● Algorithm Steps:
1. Initialize K centroids randomly.
2. Assign each data point to the nearest centroid.
3. Recalculate the centroids as the mean of the points in each cluster.
4. Repeat steps 2 and 3 until convergence (centroids no longer change).

2. K-Medoids Clustering (PAM): Similar to K-Means but uses medoids (representative points) instead of means to define clusters.

● Algorithm Steps:
1. Initialize K medoids randomly.
2. Assign each data point to the nearest medoid.
3. For each medoid, replace it with a non-medoid point and compute the total cost of the configuration.
4. If the total cost decreases, adopt the new configuration; otherwise, keep the existing medoid.
5. Repeat steps 2 to 4 until convergence.

3. Hierarchical Clustering: Create a tree of clusters (dendrogram) that illustrates the arrangement of the clusters produced.

● Types:
1. Agglomerative (bottom-up): Start with each data point as a single cluster and merge the closest pairs of clusters iteratively.
2. Divisive (top-down): Start with all data points in one cluster and recursively split them into smaller clusters.
● Algorithm Steps (agglomerative):
1. Compute the distance matrix.
2. Find the closest pair of clusters and merge them.
3. Update the distance matrix to reflect the new cluster.
4. Repeat until all points are in a single cluster or a stopping criterion is met.

4. Graph-Based Clustering:
● Basic Idea: Model the dataset as a graph where each node represents a data point, and edges represent the similarity between points.
● Algorithm Steps:
1. Construct a similarity graph (e.g., k-nearest neighbor graph).
2. Apply graph partitioning methods to find clusters (e.g., spectral clustering).

36.Cluster Analysis Basics, Cluster Evaluation

Cluster Analysis Basics:

● Objective: Organize a set of objects into clusters such that objects in the same cluster are more similar to each other than to those in other clusters.
● Applications: Customer segmentation, image segmentation, document clustering, bioinformatics.

37.Outlier Detection and Analysis

Outlier Detection: Outlier detection is the process of identifying data points that significantly differ from the rest of the dataset. These points can indicate variability in the data or signal abnormal behavior.

Methods of Outlier Detection:

1. Statistical Methods: Assume a distribution for the data (e.g., Gaussian) and identify points that deviate significantly from this distribution.
○ Example: Z-score, Grubbs' test.
2. Distance-Based Methods: Identify points that are far from their neighbors.
○ Example: k-Nearest Neighbors (k-NN) based outlier detection.
3. Density-Based Methods: Identify points in low-density regions as outliers.
○ Example: Local Outlier Factor (LOF).
4. Clustering-Based Methods: Treat points not belonging to any cluster or in very small clusters as outliers.
○ Example: DBSCAN (Density-Based Spatial Clustering of Applications with Noise).

Outlier Analysis:

● Applications: Fraud detection, network security, fault detection, data cleaning.
● Challenges: High dimensionality, scalability, mixed-type data, defining what constitutes an outlier.

38.Supervised Learning: Supervised learning is a type of machine learning where the model is trained on a labeled dataset. Each training example consists of an input and a corresponding desired output, also known as the label. The goal is to learn a mapping from inputs to outputs that can be used to predict the labels of new, unseen data.

39.Classification in Supervised Learning
Classification: Classification is a supervised learning task where the objective is to categorize input data into predefined classes or categories. The model is trained on a dataset containing input-output pairs and learns to assign a class label to new instances based on the input features.

40.Issues Regarding Classification:

1. Overfitting: The model performs exceptionally well on the training data but poorly on new, unseen data due to its complexity.
2. Underfitting: The model is too simple to capture the underlying patterns in the data, resulting in poor performance on both training and test data.
3. Imbalanced Data: When the classes in the dataset are not equally represented, leading to a model biased towards the majority class.
4. Feature Selection: Identifying the most relevant features that contribute to the prediction, which can improve model performance and reduce complexity.
5. Noise: Presence of irrelevant or erroneous data points that can affect the model's accuracy.

41.Types of Classifiers:

1. Binary Classification:
○ Description: Classifies data into two distinct classes.
○ Examples: Spam vs. non-spam emails, disease present vs. disease absent.
○ Common Algorithms: Logistic Regression, Support Vector Machines (SVM), Decision Trees, Naïve Bayes.
2. Multiclass Classification:
○ Description: Classifies data into more than two classes.
○ Examples: Classifying types of fruits (e.g., apple, banana, orange), categorizing news articles into different topics.
○ Common Algorithms: Decision Trees, Random Forest, Neural Networks, k-Nearest Neighbors (k-NN), Naïve Bayes.

42.Classification Approaches

1. Bayesian Classification - Naïve Bayes:

● Basic Idea: Applies Bayes' theorem with the assumption that features are independent given the class.
● Bayes' Theorem: P(C|X) = P(X|C) ⋅ P(C) / P(X)

Types:

● Gaussian Naïve Bayes: Assumes features follow a Gaussian distribution.
● Multinomial Naïve Bayes: Used for discrete data (e.g., word counts in text).
● Bernoulli Naïve Bayes: Used for binary/Boolean features.
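
The K-Means steps listed under Clustering Algorithms (assign each point to the nearest centroid, recompute the centroids, repeat until they stop changing) can be sketched in plain Python as follows. This is an illustrative version under stated assumptions: the 2-D points are made up, and the initial centroids are passed in explicitly instead of being chosen randomly so that the run is reproducible. The function name `k_means` is a hypothetical helper, not a library call.

```python
import math

def k_means(points, centroids, max_iter=100):
    """Plain K-Means on 2-D points; returns (centroids, labels)."""
    labels = []
    for _ in range(max_iter):
        # Step 2: assign each point to the nearest centroid.
        labels = [min(range(len(centroids)),
                      key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        # Step 3: recompute each centroid as the mean of its points.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(tuple(sum(c) / len(members)
                                           for c in zip(*members)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid
        # Step 4: stop when the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels

# Made-up data: two visually separate groups of points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = k_means(points, centroids=[(1.0, 1.0), (8.0, 8.0)])
print(labels, centroids)
```

With random initialization, as in the steps described in these notes, different runs can converge to different clusterings, which is why K-Means is usually run several times and the best result kept.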

2. Association-Based Classification:

● Basic Idea: Uses association rules discovered in data to build a classifier.
● Steps:
1. Discover frequent itemsets using algorithms like Apriori.
2. Generate association rules from these itemsets.
3. Build a classifier by selecting rules with high confidence and support.
● Example: If a customer buys bread and butter, they are likely to buy milk.
● Pros: Can handle categorical data, interpretable rules.
● Cons: Can be computationally expensive, especially with a large number of rules.

3. Rule-Based Classifier:

● Basic Idea: Uses a set of "if-then" rules for classification.
● Rule Format: If condition(s) -> then class.
● Rule Generation:
○ Direct Method: Extract rules directly from the data using algorithms like RIPPER, CN2.
○ Indirect Method: Extract rules from other classifiers like decision trees (e.g., C4.5).
● Example: If age < 30 and income = high, then class = “young professional”.
● Pros: Interpretable, flexible, easy to implement.
● Cons: May not handle continuous features well, rule conflict resolution can be complex.

43.Example Classification Approaches

1. Naïve Bayes:
○ Application: Email spam filtering.
○ Description: Calculate probabilities of each email being spam or not based on word frequencies.
2. Association-Based Classification:
○ Application: Market basket analysis.
○ Description: Use frequent itemsets to predict future purchases based on current cart contents.
3. Rule-Based Classifier:
○ Application: Customer segmentation.
○ Description: Create rules based on customer attributes to classify them into segments.

44.Web Mining: Web mining involves extracting useful information and knowledge from web data, which includes web content, web structure, and web usage data. It can be categorized into three main types:

i.Web Content Mining: Focuses on extracting useful information from the content of web pages.
ii.Web Structure Mining: Analyzes the structure of hyperlinks within the web to discover patterns and relationships.
iii.Web Usage Mining: Analyzes user interaction data (e.g., web logs) to understand user behavior and improve web services.

45.Mining the Web Page Layout Structure:

● Objective: Understand and extract meaningful information from the arrangement of elements on a web page, such as headers, paragraphs, images, and links.
● Techniques:
○ DOM Tree Parsing: The Document Object Model (DOM) represents the structure of a web page. Mining involves parsing the DOM tree to extract layout information.
○ XPath/CSS Selectors: Used to navigate and extract specific elements from the web page.
○ Visual Segmentation: Techniques like VIPS (Vision-based Page Segmentation) segment a web page into visually distinct blocks to understand the layout and hierarchy.

46.Mining Web Link Structure:

● Objective: Analyze the hyperlinks between web pages to discover relationships, patterns, and the overall structure of the web.
● Key Concepts:
○ PageRank: An algorithm used by Google Search to rank web pages in its search engine results. It measures the importance of web pages based on the number and quality of links to them.
○ HITS Algorithm: Hyperlink-Induced Topic Search identifies two types of web pages: hubs (pages that link to many other pages) and authorities (pages that are linked to by many hubs).
○ Graph Theory: Representing the web as a graph, with nodes as web pages and edges as hyperlinks, to apply graph-based algorithms for analysis.

47.Mining Multimedia Data on the Web:

● Objective: Extract and analyze multimedia content (images, videos, audio) from the web to derive useful information.
● Techniques:
○ Image Mining: Using techniques like object recognition, image classification, and clustering to analyze images.
○ Video Mining: Analyzing video content using methods such as scene detection, keyframe extraction, and activity recognition.
○ Audio Mining: Extracting information from audio content through techniques like speech recognition, audio classification, and sentiment analysis.

48.Distributed Data Mining (DDM): Distributed Data Mining (DDM) refers to the process of extracting knowledge and patterns from large datasets distributed across multiple locations, heterogeneous environments, or decentralised systems. DDM is essential for handling vast amounts of data generated in various fields such as finance, telecommunications, healthcare, and e-commerce, where data is often stored in distributed systems.

49.Automatic Classification of Web Documents:

● Objective: Categorize web documents into predefined classes automatically.
● Techniques:
○ Text Classification Algorithms: Such as Naïve Bayes, Support Vector Machines (SVM), and Neural Networks.
○ Feature Extraction: Techniques like TF-IDF (Term Frequency-Inverse Document Frequency) and word embeddings (e.g., Word2Vec, BERT) to represent the text content.
○ Clustering: Grouping similar documents together using algorithms like k-means or hierarchical clustering for exploratory analysis.

50.Web Usage Mining:

● Objective: Analyze user interaction data from web logs to understand user behavior and improve web services.
● Steps:
○ Data Collection: Gathering data from web server logs, browser logs, user profiles, and cookies.
○ Preprocessing: Cleaning and transforming raw data into a usable format (e.g., session identification, user identification).
○ Pattern Discovery: Using techniques such as association rule mining, clustering, and sequential pattern mining to find interesting patterns in web usage data.
○ Pattern Analysis: Interpreting the discovered patterns to make informed decisions about website design, content, and marketing strategies.

Types of Knowledge Discovery in Data Mining

1. Classification:
○ Purpose: Assign items to predefined categories.
○ Example: Email spam detection.
2. Clustering:
○ Purpose: Group similar items together without predefined categories.
○ Example: Customer segmentation based on buying behavior.
3. Association Rule Learning:
○ Purpose: Discover relationships between variables in large datasets.
○ Example: Market basket analysis to find products often bought together.
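
The PageRank idea mentioned under Mining Web Link Structure (a page is important when many important pages link to it) can be sketched with a simple power iteration. This is a textbook-style simplification for illustration, not the production algorithm of any search engine. The damping factor of 0.85 is a conventional choice assumed here, and the four-page link graph and the function name `pagerank` are hypothetical.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank on a dict {page: [pages it links to]}.

    Assumes every link target also appears as a key in links.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if outgoing:
                # A page passes its rank equally to the pages it links to.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical four-page site: C is linked to by A, B, and D.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
rank = pagerank(links)
for page, score in sorted(rank.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```

In this graph page C collects links from three pages and ends up with the highest score, while D, which no page links to, keeps only the baseline share.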

Advantage and Disadvantage of data mart

Advantages of a Data Mart:

1. Improved Performance: Data marts are smaller and more focused than data warehouses, allowing for faster query responses and better performance for specific departmental needs.
2. Cost-Effective: Implementing a data mart is generally less expensive than a full-scale data warehouse. Data marts require fewer resources and less infrastructure, making them a cost-effective solution for smaller projects or departments.

Disadvantages of a Data Mart:

1. Data Silos: Implementing multiple data marts can lead to the creation of data silos, where data is isolated and not easily shared or integrated across the organization. This can hinder overall data analysis and decision-making.
2. Inconsistency: Different data marts might use different standards and definitions, leading to inconsistencies in data interpretation and reporting across the organization.

Applications of Data Mining

1. Retail and E-commerce:
○ Customer Segmentation: Identify customer groups based on purchasing behavior to improve marketing strategies.
○ Market Basket Analysis: Determine products frequently bought together to enhance cross-selling and product placement.
2. Healthcare:
○ Disease Prediction and Diagnosis: Analyze patient data to predict diseases and improve early diagnosis.
○ Treatment Effectiveness: Evaluate the success of treatments by analyzing patient outcomes.
3. Finance and Banking:
○ Fraud Detection: Identify unusual transaction patterns that indicate potential fraud.
○ Risk Management: Assess credit risks by analyzing customer financial data and payment histories.

Data Warehouse Three-Tier Architecture

Ans: A data warehouse employs a three-tier architecture to efficiently manage data processing, storage, and access. This architecture consists of the bottom tier, middle tier, and top tier.

1. Bottom Tier: Data Source Layer

Function: Extracts data from various source systems and prepares it for storage.

Components:

● Data Sources: Operational databases, ERP systems, flat files, and external sources.
● ETL Processes: Tools that perform Extract, Transform, and Load operations to cleanse, integrate, and aggregate data before loading it into the data warehouse.

2. Middle Tier: Data Storage and Management Layer

Function: Stores and manages cleaned and transformed data, supporting efficient querying and analysis.

Components:

● Data Warehouse Database: Central repository optimized for read-intensive operations.
● Data Marts: Subsets of the data warehouse tailored for specific departments or business units.
● OLAP Servers: Online Analytical Processing servers that support complex queries and multidimensional analysis.

3. Top Tier: Presentation and Analysis Layer

Function: Provides tools for data reporting, analysis, and visualization, enabling end-users to derive insights from the data warehouse.

Components:

● Query and Reporting Tools: Allow generation of standard and ad-hoc reports.
● Data Mining Tools: Discover patterns and relationships through statistical analysis and machine learning.
● Dashboards and Visualization Tools: Offer graphical representations of data through charts and dashboards for easier interpretation.

Difference between Data mining vs Data warehouse?

Data Mining:

i. Data mining is the process of determining data patterns.
ii. Data mining is generally considered as the process of extracting useful data from a large set of data.
iii. Business entrepreneurs carry out data mining with the help of engineers.
iv. In data mining, data is analyzed repeatedly.
v. Data mining uses pattern recognition techniques to identify patterns.

Data Warehousing:

i. A data warehouse is a database system designed for analytics.
ii. Data warehousing is the process of combining all the relevant data.
iii. Data warehousing is entirely carried out by the engineers.
iv. In data warehousing, data is stored periodically.
v. Data warehousing is the process of extracting and storing data in a way that allows easier reporting.

Feature of good cluster

A good cluster in data clustering exhibits several key features:

1. High Intra-cluster Similarity: Instances within the same cluster should be similar to each other. This means that the distance between data points within a cluster should be minimized (equivalently, their similarity should be maximized).
2. Low Inter-cluster Similarity: Instances from different clusters should be dissimilar. This implies that the distance or dissimilarity measure between clusters should be maximized.
3. Compactness: Clusters should be tightly packed, meaning that data points within a cluster should be close to each other. This ensures that the cluster represents a distinct group.

Pre Pruning and Post Pruning approach in classification

Ans: Prepruning: Prepruning involves stopping the tree construction process early, before it becomes fully grown, based on certain conditions.
Purpose: It prevents the tree from becoming overly complex and capturing noise in the training data, thus improving its ability to generalize to unseen data.
Example: Setting a maximum depth limit for the tree, limiting the number of leaf nodes, or requiring a minimum number of instances in a node before further splitting.

Post-pruning: Post-pruning involves constructing the full decision tree and then removing or collapsing certain nodes or branches based on pruning criteria.
Purpose: It allows the tree to grow fully and capture all patterns in the training data, and then simplifies it to improve its performance on unseen data.
Example: Using techniques like reduced-error pruning or cost-complexity pruning.
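
The prepruning conditions described in this answer (a maximum depth and a minimum number of instances per node) can be sketched in plain Python. This is a toy illustration under stated assumptions, not a production decision tree: it uses Gini impurity and threshold splits on numeric features, and the names `build_tree`, `max_depth`, and `min_samples_split` are choices made for this sketch. Post-pruning is not shown.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return (feature, threshold) that strictly lowers weighted Gini, or None."""
    best, best_score = None, gini(labels)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if score < best_score:
                best, best_score = (f, t), score
    return best

def build_tree(rows, labels, depth=0, max_depth=None, min_samples_split=2):
    node = {"majority": Counter(labels).most_common(1)[0][0]}
    # Prepruning: stop before the tree is fully grown.
    if (max_depth is not None and depth >= max_depth) or len(rows) < min_samples_split:
        return node
    split = best_split(rows, labels)
    if split is None:  # no split improves purity, so this stays a leaf
        return node
    f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    node["feature"], node["threshold"] = f, t
    node["left"] = build_tree([r for r, _ in left], [y for _, y in left],
                              depth + 1, max_depth, min_samples_split)
    node["right"] = build_tree([r for r, _ in right], [y for _, y in right],
                               depth + 1, max_depth, min_samples_split)
    return node

def predict(node, row):
    """Walk from the root to a leaf and return its majority class."""
    while "feature" in node:
        node = node["left"] if row[node["feature"]] <= node["threshold"] else node["right"]
    return node["majority"]

def count_leaves(node):
    if "feature" not in node:
        return 1
    return count_leaves(node["left"]) + count_leaves(node["right"])

# Tiny made-up dataset with one noisy label (the "a" at x = 4).
rows = [[1], [2], [3], [4], [5], [6]]
labels = ["a", "a", "b", "a", "b", "b"]

full = build_tree(rows, labels)                    # grows until every leaf is pure
shallow = build_tree(rows, labels, max_depth=1)    # prepruned: a single split
print(count_leaves(full), count_leaves(shallow))
```

The fully grown tree carves out a separate leaf for the noisy point, while the depth-limited tree keeps only the main split, which is the trade-off prepruning is meant to control.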
Association algorithm in data mining

Ans: In data mining, an association algorithm is a technique used to discover interesting
relationships or associations among a large set of data items. It is commonly applied
in market basket analysis to uncover patterns in consumer behavior.

1. Definition: An association algorithm is a computational method used to
uncover patterns of association or co-occurrence among a set of items in a
large dataset.
2. Purpose: It's primarily used for market basket analysis to identify
relationships between items purchased together, which helps in
understanding customer behavior, optimizing product placement, and
designing targeted marketing strategies.
3. Popular algorithms: Common association algorithms include Apriori,
FP-Growth, and Eclat. These algorithms employ different strategies to
efficiently mine associations from large transactional datasets. Apriori, for
example, relies on candidate generation and pruning, while FP-Growth avoids
candidate generation by compressing the transactions into a tree structure.
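
As a rough illustration of candidate generation and pruning, the following Python sketch mines frequent itemsets level by level in the style of Apriori and then computes the confidence of a rule. The transactions, the 0.4 support threshold, and the function names are made up for this example; real implementations are considerably more optimized.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    tsets = [frozenset(t) for t in transactions]
    n = len(tsets)

    def support(itemset):
        return sum(1 for t in tsets if itemset <= t) / n

    current = {frozenset([item]) for t in tsets for item in t}  # 1-item candidates
    frequent = {}
    k = 1
    while current:
        # Keep only the candidates that meet the support threshold.
        level = {}
        for c in current:
            s = support(c)
            if s >= min_support:
                level[c] = s
        frequent.update(level)
        k += 1
        # Candidate generation: join frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Pruning: every (k-1)-subset of a candidate must itself be frequent.
        current = {c for c in candidates
                   if all(frozenset(s) in level for s in combinations(c, k - 1))}
    return frequent

def confidence(frequent, antecedent, consequent):
    """Confidence of antecedent -> consequent = support(both) / support(antecedent)."""
    a = frozenset(antecedent)
    return frequent[a | frozenset(consequent)] / frequent[a]

# Made-up market basket data.
transactions = [
    ["bread", "milk"],
    ["bread", "butter"],
    ["bread", "milk", "butter"],
    ["milk"],
    ["bread", "milk"],
]

frequent = apriori(transactions, min_support=0.4)
for itemset, s in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
print(confidence(frequent, ["butter"], ["bread"]))
```

Here the pair milk and butter falls below the threshold, so the three-item candidate containing it is pruned without ever counting its support, which is the saving the pruning step is meant to provide.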
