Data Warehousing Part B


"Data Warehouse is a subject-oriented, integrated, and time-variant store of information in
support of management's decisions."

Characteristics of Data Warehouse:


 Subject-Oriented
Data warehouses typically provide a simple and concise view around a particular subject, such as
customer, product, or sales, instead of the organization's overall ongoing operations. This is done by
excluding data that is not useful for the subject and including all data needed by the
users to understand the subject.

 Integrated
A data warehouse integrates various heterogeneous data sources such as RDBMSs, flat files, and
online transaction records. Data cleaning and integration must be performed during data
warehousing to ensure consistency in naming conventions, attribute types, etc., among the different
data sources.

 Time-Variant
Historical information is kept in a data warehouse. For example, one can retrieve data from 3
months, 6 months, or 12 months ago, or even older data, from a data warehouse. Once data is stored
in the data warehouse, it cannot be modified, altered, or updated. Prior data is not deleted when
new data is added, making it persistent and non-volatile. Data from the past is kept for comparisons,
patterns, and predictive analysis.
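
The time-variant idea can be illustrated with a minimal Python sketch. It assumes a hypothetical append-only list named price_history and two helper functions invented for this example. Each load adds dated rows and never rewrites earlier ones, so a value can be looked up as of any past date.

```python
from datetime import date

# Append-only history: each load adds dated rows and never rewrites old ones.
# The structure and names here are hypothetical, chosen only for illustration.
price_history = []

def load_snapshot(load_date, product, price):
    """Add a new dated row; existing rows are left untouched."""
    price_history.append({"load_date": load_date, "product": product, "price": price})

def price_as_of(product, as_of):
    """Return the most recent price loaded on or before as_of, or None."""
    rows = [r for r in price_history
            if r["product"] == product and r["load_date"] <= as_of]
    if not rows:
        return None
    return max(rows, key=lambda r: r["load_date"])["price"]

load_snapshot(date(2024, 1, 1), "widget", 10.0)
load_snapshot(date(2024, 4, 1), "widget", 12.0)
load_snapshot(date(2024, 7, 1), "widget", 11.5)
```

Because older rows are kept, a query such as price_as_of("widget", date(2024, 5, 15)) still sees the April price even after the July load.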

 Non-Volatile
The data warehouse is a physically separate data store, which is transformed from the
source operational RDBMS. Operational updates of data do not occur in the data warehouse,
i.e., update, insert, and delete operations are not performed. It usually requires only two
procedures for data access: initial loading of data and access to data. Therefore, the DW does
not require transaction processing, recovery, and concurrency capabilities, which allows for
a substantial speedup of data retrieval. Non-volatile means that, once entered into the warehouse,
data should not change.

Need for Data Warehousing:

Here are a few situations that indicate the need for a data warehouse:

1. Because there are several data sources to query

You might store some customer data in your application database, but other
information might be locked away in a cloud service like Salesforce. Each data source stores part
of the data you need to build a complete customer profile, but combining these sources can be a
huge challenge.

A robust data warehouse will extract, transform, and load (ETL) your data to help you
consolidate different sources into a central repository so you can view useful summaries or
projections that no single system could provide.
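
As a rough illustration of consolidating sources, the Python sketch below merges rows from two sources into one profile per customer. The sample rows, the field names, and the consolidate function are all assumptions made up for this example. A real ETL job would read from the actual application database and cloud service.

```python
# Hypothetical sample data: neither source alone holds the full customer profile.
app_db_rows = [
    {"customer_id": 1, "name": "Asha Rao", "plan": "pro"},
    {"customer_id": 2, "name": "Ben Ortiz", "plan": "free"},
]
crm_rows = [
    {"customer_id": 1, "account_owner": "Dana", "open_deals": 2},
    {"customer_id": 3, "account_owner": "Eli", "open_deals": 1},
]

def consolidate(*sources):
    """Merge rows from every source into one profile per customer_id."""
    profiles = {}
    for rows in sources:
        for row in rows:
            # Create the profile on first sight, then add this source's fields.
            profile = profiles.setdefault(row["customer_id"], {})
            profile.update(row)
    return profiles

# The central repository: one combined record per customer.
warehouse = consolidate(app_db_rows, crm_rows)
```

Customer 1 appears in both sources, so its warehouse record carries the plan from the application database and the account owner from the CRM.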

2. Because data must be transformed

In an ideal world, all the information you need maintains the same structure, regardless of
where it came from. In the real world, we have legacy systems, multiple operating systems, and
different programming languages that all treat common entities like dates, time zones,
currency, etc. differently. This is where the “T” in ETL comes in. As you extract data from each source, you
can transform it and save it in a common format. This allows you to standardize data types,
remove corrupted data, and even apply custom business rules depending on how you plan to use
the data.
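
A minimal sketch of such a transform step in Python follows. It assumes three source date formats and simple currency strings. The field names, the DATE_FORMATS tuple, and the helper functions are hypothetical. The sketch standardizes dates to ISO format, converts amounts to Decimal, and drops records whose date cannot be parsed.

```python
from datetime import datetime
from decimal import Decimal

# Assumed date formats used by the different source systems.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y%m%d")

def normalize_date(text):
    """Return the date as an ISO string, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None

def normalize_amount(text):
    """Strip currency symbols and thousands separators, return a Decimal."""
    cleaned = text.replace("$", "").replace(",", "").strip()
    return Decimal(cleaned)

def transform(record):
    """Standardize one record; return None for corrupted records."""
    order_date = normalize_date(record["order_date"])
    if order_date is None:
        return None
    return {"order_date": order_date, "amount": normalize_amount(record["amount"])}

raw = [
    {"order_date": "2024-03-05", "amount": "1,200.50"},
    {"order_date": "05/03/2024", "amount": "$99"},
    {"order_date": "20240305", "amount": "15.00"},
    {"order_date": "not a date", "amount": "1"},
]
# Keep only the records that survived the transform.
clean = [row for row in map(transform, raw) if row is not None]
```

All three valid records end up with the same date representation, and the corrupted fourth record is removed.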

3. Because there is a high volume of data

All of the data retrieval operations above become more expensive and difficult the more data
you have. Running inefficient queries on 5 petabytes of data is problematic. More data is
inherently harder to manage, but having a data warehouse to centralize and optimize it will make a
huge difference.

4. Because there is a need for user- or team-specific dashboards and reports

As your company grows, different stakeholders will need access to your data for different
reasons. For example, the customer service team might care about usage patterns, while the sales
team might care more about customers. When querying your data is a difficult manual operation,
serving each of these needs separately is hard. A data warehouse can offer different views or
dashboards for different stakeholders.

5. Because data is stored in high-availability (HA) systems

Uptime is always important, but in some applications, high availability is promised or is a
legal requirement. Some companies work by processing requests at night or during scheduled
outages, but this is limiting. These delays mean that your business teams cannot independently
gather important data quickly, which will put you at risk if your competition is able to respond to
changes faster than you are.

6. Because there is a need to apply data mining or machine learning to the data

Once you have a warehouse, you can use tools like “Weka” to discover insights from the data.
Once the necessary models are built, one can run them to respond immediately to changes or
abnormalities.
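
To make "running a model to respond to abnormalities" concrete, here is a deliberately simple Python sketch. It is not Weka and not a real data mining model. As a stand-in, it fits a mean and standard deviation to hypothetical daily sales figures and flags values more than three standard deviations from the mean. The data, function names, and threshold are all assumptions.

```python
from statistics import mean, stdev

# Hypothetical daily sales totals pulled from the warehouse.
daily_sales = [100, 102, 98, 101, 99, 103, 97]

def fit(history):
    """Build a trivial 'model': the mean and sample standard deviation."""
    return mean(history), stdev(history)

def is_abnormal(value, model, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mu, sigma = model
    return abs(value - mu) > threshold * sigma

model = fit(daily_sales)
```

Once fitted, the model can be applied to each new value as it arrives, for example is_abnormal(140, model).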
Trends in data warehousing:

1. Multiple Data Types:


Traditionally, decision support systems were divided into two groups: data warehousing dealt with structured
data; knowledge management involved unstructured data. The different types of data that need to be
integrated in the data warehouse to support decision making more effectively are as follows:
• Structured Numeric
• Image
• Spatial
• Structured Text
• Unstructured Document
• Video
• Audio
Companies are realizing there is a need to integrate both structured and unstructured data in
their data warehouses.

[Spatial Data:
Adding spatial data will greatly enhance the value of your data warehouse. Address, street block, city
quadrant, county, state, and zone are examples of spatial data. Vendors have begun to address the need to
include spatial data. Some database vendors are providing spatial extenders to their products using SQL
extensions to bring spatial and business data together.]

2. Data Visualization:
When a user queries your data warehouse and expects to see results only in the form of output
lists or spreadsheets, your data warehouse is already outdated. You need to display results in the
form of graphics and charts as well. Every user now expects to see the results shown as charts. Data
visualization helps the user to interpret query results quickly and easily.
There are three major trends in data visualization software:

More Chart Types. The numerical results are converted into a pie chart, a scatter plot, or another
chart type. Now the list of chart types supported by data visualization software has grown much
longer.

Interactive Visualization. Dynamic chart types help users to review a result chart, manipulate it,
and then see newer views online.

Visualization of Complex and Large Result Sets. Newer visualization software can visualize
thousands of result points and complex data structures.

Advanced Visualization Techniques

Chart Manipulation. A user can rotate a chart or dynamically change the chart type to get a clearer view of
the results.

Drill Down. The visualization first presents the results at the summary level. The user can then drill down the
visualization to display further visualizations at subsequent levels of detail.
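
The drill-down behaviour described above can be sketched in Python. The sketch assumes a hypothetical year, quarter, month hierarchy and made-up sales rows. The drill_down function returns totals at the level just below whatever the user has already selected.

```python
from collections import defaultdict

# Assumed hierarchy, ordered from summary level to most detailed level.
LEVELS = ("year", "quarter", "month")

# Hypothetical sales rows.
sales = [
    {"year": 2024, "quarter": "Q1", "month": "Jan", "amount": 100},
    {"year": 2024, "quarter": "Q1", "month": "Feb", "amount": 150},
    {"year": 2024, "quarter": "Q2", "month": "Apr", "amount": 200},
    {"year": 2023, "quarter": "Q4", "month": "Dec", "amount": 80},
]

def drill_down(rows, selected=()):
    """Totals at the level just below the members already selected.

    selected=() gives the summary by year, selected=(2024,) gives the
    quarters of 2024, and selected=(2024, "Q1") gives the months of Q1 2024.
    """
    if len(selected) >= len(LEVELS):
        raise ValueError("already at the most detailed level")
    next_level = LEVELS[len(selected)]
    totals = defaultdict(int)
    for row in rows:
        # Keep only rows that belong to every member chosen so far.
        if all(row[level] == member for level, member in zip(LEVELS, selected)):
            totals[row[next_level]] += row["amount"]
    return dict(totals)
```

Calling drill_down(sales) gives the yearly summary, and drill_down(sales, (2024,)) shows the quarters behind the 2024 total.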

Advanced Interaction. These techniques provide a minimally invasive user interface. The user simply double
clicks a part of the visualization and then drags and drops representations of data entities. Or, the user simply
right clicks and chooses options from a menu. Visual query is the most advanced of user interaction features.

3. Parallel Processing
A very effective way to speed up query processing, data loading, and index creation is to
use parallel processing. Hardware configurations and software
techniques go hand in hand to accomplish parallel processing. A task is divided into smaller units,
and these smaller units are executed concurrently.
Parallel Processing Hardware Options. In a parallel processing environment, you will find these
characteristics: multiple CPUs, memory modules, one or more server nodes, and high-speed
communication links between interconnected nodes. The figure below indicates the three options and
their comparative merits.

[Figure: Parallel processing hardware options.
SMP: multiple CPUs on a common bus, with shared memory and shared disks.
MPP: multiple CPUs, each with its own memory (MEM) and disk.
Cluster: nodes of CPUs with shared memory, connected by a common high-speed bus to a shared disk.]
Parallel Processing Software Implementation:
You will have to ensure that the software can allocate units of a larger task to the hardware
components appropriately.
Parallel processing software must be capable of performing the following steps:

• Analyzing a large task to identify independent units that can be executed in parallel
• Identifying which of the smaller units must be executed one after the other
• Executing the independent units in parallel and the dependent units in the
proper sequence
• Collecting, collating, and consolidating the results returned by the smaller units
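
The four steps above can be sketched in Python with the standard concurrent.futures module. This is only an illustration of the structure, under stated assumptions: the task is a simple sum, the helper names are invented, and threads are used for simplicity. A real warehouse would spread the work across processes or nodes. In CPython, threads do not speed up pure-Python arithmetic.

```python
from concurrent.futures import ThreadPoolExecutor

def split(rows, parts):
    """Step 1: break a large task into independent chunks."""
    size = max(1, -(-len(rows) // parts))  # ceiling division, at least 1
    return [rows[i:i + size] for i in range(0, len(rows), size)]

def partial_sum(chunk):
    """The independent unit of work: it needs no data from other chunks."""
    return sum(chunk)

def parallel_total(rows, workers=4):
    chunks = split(rows, workers)
    # Step 3: execute the independent units concurrently.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_sum, chunks))
    # Steps 2 and 4: the final consolidation depends on every partial
    # result, so it runs only after all of them have been collected.
    return sum(partials)
```

For example, parallel_total(list(range(1, 101))) splits the hundred values into four chunks and combines the four partial sums.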

You will realize the following significant advantages when you adopt parallel processing in your
data warehouse:

• Performance improvement for query processing, data loading, and index creation
• Scalability, allowing the addition of CPUs and memory modules without any changes to
the existing application
• Fault tolerance so that the database would be available even when some of the parallel
processors fail
• Single logical view of the database even though the data may reside on the
disks of multiple nodes

4. Query Tools
Vendors have greatly enhanced their query tools in the following functions:

• Flexible presentation: Easy to use and able to present results online and on reports
in many different formats
• Aggregate awareness: Able to recognize the existence of summary or aggregate
tables and automatically route queries to the summary tables when summarized
results are desired
• Crossing subject areas: Able to cross over from one subject data mart to another
automatically
• Multiple heterogeneous sources: Capable of accessing heterogeneous data sources on
different platforms
• Integration: Integrate query tools for online queries, batch reports, and data extraction
for analysis, and provide a seamless interface to go from one type of output to another
• Overcoming SQL limitations: Provide SQL extensions to handle requests that cannot
usually be done through standard SQL
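
Aggregate awareness from the list above can be illustrated with a small Python sketch using the standard sqlite3 module. Everything specific here is an assumption made for the example: the table names, the SUMMARY_TABLES catalog, and the total_sales function. It is not how any particular vendor's tool works. The point is only the routing decision: use a summary table when one matches the requested grouping, otherwise scan the detail table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales_detail (sale_date TEXT, region TEXT, amount INTEGER);
    CREATE TABLE sales_by_region (region TEXT PRIMARY KEY, total INTEGER);
    INSERT INTO sales_detail VALUES
        ('2024-01-05', 'East', 100),
        ('2024-01-06', 'East', 50),
        ('2024-01-06', 'West', 70);
    INSERT INTO sales_by_region
        SELECT region, SUM(amount) FROM sales_detail GROUP BY region;
""")

# Hypothetical catalog: grouping columns -> (summary table, measure column).
SUMMARY_TABLES = {("region",): ("sales_by_region", "total")}

def total_sales(group_by):
    """Route to a summary table when one matches, else scan the detail table.

    Column names are interpolated directly into the SQL, so this sketch
    assumes they come from trusted code, never from user input.
    """
    key = tuple(group_by)
    columns = ", ".join(key)
    if key in SUMMARY_TABLES:
        source, measure = SUMMARY_TABLES[key]
        sql = f"SELECT {columns}, {measure} FROM {source} ORDER BY {columns}"
    else:
        source = "sales_detail"
        sql = (f"SELECT {columns}, SUM(amount) FROM {source} "
               f"GROUP BY {columns} ORDER BY {columns}")
    return source, conn.execute(sql).fetchall()
```

A request grouped by region is answered from sales_by_region, while a request grouped by sale_date falls back to sales_detail.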

5. Browser Tools
If the users have to go to the data warehouse directly, they need to know what information
is available there. The users need good browser tools to browse through the informational
metadata and search to locate the specific pieces of information they want to receive.
Similarly, when you are part of the IT team developing your company’s data
warehouse, you need to identify the data sources, the data structures, and the business rules.
You also need good browser tools to browse through the information about the data sources.
Here are some recent trends in enhancements to browser tools:

• Tools are extensible to allow definition of any type of data or informational object
• Inclusion of open APIs (application program interfaces)
• Provision of several types of browsing functions including navigation through
hierarchical groupings
• Allowing users to browse the catalog (data dictionary or metadata), find an
informational object of interest, and proceed further to launch the appropriate query
tool with the relevant parameters
• Applying Web browsing and search techniques to browse through the
information catalogs

6. Data Fusion

Various types of data from multiple disparate sources need to be integrated or fused
together and stored in the data warehouse. Data fusion is a technology dealing with the
merging of data from disparate sources.
Data fusion not only deals with the merging of data from various sources, it also has
another application in data warehousing. In present-day warehouses, we tend to collect data
in astronomical proportions. The more information stored, the more difficult it is to find the
right information at the right time. Data fusion technology is expected to address this
problem also.

7. Multidimensional Analysis
Today, every data warehouse environment provides for multidimensional analysis. This is
becoming an integral part of the information delivery system of the data warehouse.
Provision of multidimensional analysis to your users simply means that they will be able to
analyze business measurements in many different ways. Multidimensional analysis is also
synonymous with online analytical processing (OLAP).
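
As a minimal sketch of analyzing one business measurement in many different ways, the Python code below totals a sales measure by any combination of dimensions. The fact rows, the dimension names, and the analyze function are hypothetical. A real OLAP engine would add precomputed cubes, hierarchies, and much more.

```python
from collections import defaultdict

# Hypothetical fact rows: three dimensions and one measure.
facts = [
    {"region": "East", "product": "A", "quarter": "Q1", "sales": 100},
    {"region": "East", "product": "B", "quarter": "Q1", "sales": 40},
    {"region": "West", "product": "A", "quarter": "Q1", "sales": 70},
    {"region": "West", "product": "A", "quarter": "Q2", "sales": 90},
]

def analyze(rows, by, measure="sales"):
    """Total the measure for each combination of the chosen dimensions."""
    totals = defaultdict(int)
    for row in rows:
        # The key is the tuple of this row's values for the chosen dimensions.
        totals[tuple(row[dim] for dim in by)] += row[measure]
    return dict(totals)
```

The same facts can be viewed by region with analyze(facts, ["region"]), by product and quarter with analyze(facts, ["product", "quarter"]), or as a grand total with an empty dimension list.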

8. Agent Technology
A software agent is a program that is capable of performing a predefined programmable
task on behalf of the user. For example, on the Internet, software agents can be used to sort
and filter out e-mail according to rules defined by the user. Within the data warehouse,
software agents are beginning to be used to alert the users of predefined business
conditions. Some vendors specialize in alert system tools.
As the size of data warehouses continues to grow, agent technology gets applied more and
more. Whenever a threat or opportunity condition is discovered through elaborate analysis,
it makes sense to describe the event to a software agent program. This program will then
automatically signal to the analyst every time that condition is encountered in the future.
Software agents may even be used for routine monitoring of business performance: an agent
program can alert the responsible user every time a predefined performance condition occurs.
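
An alert agent of this kind can be sketched in a few lines of Python. The AlertAgent class, the low-stock condition, and the sample rows are all hypothetical. The sketch only shows the pattern of describing a condition once and having the agent signal each time it is encountered.

```python
class AlertAgent:
    """Watches incoming records and records an alert when a condition holds."""

    def __init__(self, name, condition):
        self.name = name
        self.condition = condition  # function: record -> bool
        self.alerts = []

    def observe(self, record):
        """Check one record; store an alert message if the condition is met."""
        if self.condition(record):
            self.alerts.append(f"{self.name}: {record}")
            return True
        return False

# Hypothetical predefined business condition: stock below the reorder point.
low_stock = AlertAgent(
    "low stock", lambda r: r["units_on_hand"] < r["reorder_point"]
)

# Hypothetical rows arriving in today's load.
todays_rows = [
    {"sku": "A1", "units_on_hand": 3, "reorder_point": 10},
    {"sku": "B2", "units_on_hand": 50, "reorder_point": 10},
    {"sku": "C3", "units_on_hand": 10, "reorder_point": 10},
]
for row in todays_rows:
    low_stock.observe(row)
```

Only the first row meets the condition, so the agent holds a single alert after the load. In practice the alert would be sent to the analyst rather than kept in a list.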

9. Syndicated Data
The value of the data content is derived not only from the internal operational systems,
but from suitable external data as well. With the escalating growth of data warehouse
implementations, the market for syndicated data is rapidly expanding.
Examples of the traditional suppliers of syndicated data are A. C. Nielsen and Information
Resources, Inc. for retail data and Dun & Bradstreet and Reuters for financial and economic
data. Some of the earlier data warehouses were incorporating syndicated data from such
traditional suppliers to enrich the data content.
Now data warehouse developers are looking at a host of new suppliers dealing with many
other types of syndicated data. The more recent data warehouses receive demographic,
psychographic, market research, and other kinds of useful data from new suppliers.
Syndicated data is becoming big business.
