Best Practice in Data Management
Barry Williams
Contents

Chapter 1. Welcome .......... 3
Chapter 2. BI Architectures .......... 5
Chapter 3. Data Architectures for the Future .......... 42
Chapter 4. Data Management Getting Started .......... 51
Chapter 5. Data Warehouses and Data Marts .......... 56
Chapter 6. Design Patterns for Data Models .......... 67
Chapter 7. Enterprise Data Models .......... 81
Chapter 8. Knowledge Management .......... 92
Chapter 9. MetaData Management .......... 106
Chapter 10. Quality of a Data Model .......... 115
Chapter 1. Welcome
1.1 What is this?
This is a collection of essays on Best Practice in Data Management. Data Management is like a Slowly-Changing Dimension: it changes imperceptibly, and then, after about a year on average, you realise that the landscape has changed. Our intention in this book is to capture and define Best Practice at a particular point in time and then keep it up to date with new versions of the book every quarter. We have started with the topics that we find come up most often in questions to our Database Answers web site. I hope you enjoy this book and would be very pleased to have your comments at barryw@databaseanswers.org. Also, please contact us if there is a topic that you would like to see included.
Our intention with this book is to provide a reference for the selected Topics in Data Management, covering:
- An understanding of what each Topic involves
- Some Templates to use to get started
- Which organisations and individuals provide Thought Leadership
1.4 Topics
In this book we cover some important topics in Data Management, including:
- Business Intelligence (BI)
- Checking the Quality of a Data Model
Chapter 2. BI Architectures
Here is the link for the LinkedIn Group on BI:
http://www.linkedin.com/groups/Business-Intelligence-Group-23006
[Diagram: BI Architecture, showing a Reporting Layer (Report Templates) fed by a Data Warehouse, built on a Data Integration Layer (Single Version of the Truth, Enterprise Data Model, ETL).]
BI and Reports take data from Data Marts, and many of the same considerations apply when it comes to determining Best Practice. One difference is that it is necessary to have a clearer understanding of the business operations and of how the right kind of Performance Reports can provide insight to the business users. This leads to the need for a management education process to be in place, so that the evolution of Performance Reports can be planned in a logical manner, from basic summaries to KPIs, Dashboards and so on.

2.5.2 Data Models
This Data Model shows the Data Warehouse for the FBP 3 Project. It is called a Reporting Database for consistency with the previously existing Project documentation.
Attributes in black are Headings or Dimensions, and attributes in red are Data Items.

[Data Model: Reporting_Database, with Dimension tables Regions (Region, Region_Name), Banner (Banner, Banner_Name) and Departments (Department, Department_Name), and a fact table keyed on Record_ID with foreign keys to Country, Region, Banner, Department, LP Page, Store Number, Time_Period and Currency, and Data Items including Cash Inventory, Cash Sales, Clearance Unit Sales, Distro Units, Financial Data, Inventory, Inventory Ticket Plans, Inventory Ticket Actuals, Financial Data - Till, MD Data, Merch Need, Unit Production Targets, Unit Sales - Plans, Unit Sales - Actual, TY Data, LY Data and Other Data.]
What.2 - What is Business Intelligence?
This Stage produces and delivers BI and Performance Reports that meet the requirements of all levels of management. It must be responsive to requests for change: user requirements are always evolving, so the approach and supporting software must be flexible. A sensible approach is to develop Report Templates supported by the appropriate Generic Software.
What.3 - How do we assess our User Report Maturity Level?

1) Blank Template

Template Name:
Date:

Report Type    | User Category | User Category | User Category
Weekly Totals  |               |               |
Traffic Lights |               |               |
Dashboards     |               |               |
KPIs           |               |               |
Other          |               |               |

2) Completed Template

Template Name: Report Maturity Level
Date: April 1st, 2010

Report Type    | User Category | User Category | User Category
Weekly Totals  | Common        | Common        | Common
Traffic Lights | In use        | Aware         | Unaware
Dashboards     | In use        | Unaware       | Unaware
KPIs           | In use        | Aware         | Unaware
Other          | Mashups       |               |
2.2.2.2 Availability of Models and Data Marts

1) Blank Template
This Template is used to keep track of the availability of Master Data Models and Data Marts.

Template Name:
Date:

Category        | Master Data Models | Data Marts
Customers       |                    |
Finance         |                    |
Products        |                    |
Purchase Orders |                    |
Stores          |                    |
Warehouses      |                    |

Example entries: N/A; Available; SEED; DTR but needs work; DTR but needs work
Report Name:
Date Produced:

Product Name | Week 1 Date  | Week 2 Date  | Week 3 Date  | Week 4 Date  | Total
             | <Value in s> | <Value in s> | <Value in s> | <Value in s> | <Value in s>
2) Completed Template
These figures are fictitious.

Report Name: Value of Weekly Product Movements
Date Produced: March 18th, 2010

Product Name        | Dec 6th 2009 | Dec 13th 2009 | Dec 20th 2009 | Dec 27th 2009 | Total
Beer                | 40,000       | 60,000        | 70,000        | 80,000        | 160,000
Cigarettes          | 50,000       | 60,000        | 70,000        | 80,000        | 160,000
Cigars & cigarillos | 25,000       | 30,000        | 31,000        | 32,000        | 118,000
Leaded Petrol       | 90,000       | 91,000        | 92,000        | 93,000        | 366,000
Unleaded Petrol     | 100,000      | 120,000       | 133,000       | 140,000       | 490,000
Total               | 205,000      | 361,000       | 396,000       | 425,000       | 1,300,000
Why.1 - Why is this Stage important ? The value and benefits of Reports are always a major part of the justification of the cost of designing and installing a Database.
Step 1. Determine whether Users are ready for KPIs, Traffic Lights and Dashboards.
Step 2. Check availability of Master Data Models.
Step 3. Check availability of Data Marts.
Step 4. Check availability of Report Specifications and SQL Views for Reports.
Step 5. Perform Gap Analysis to identify any missing data that must be sourced.
Step 6. Analyse common aspects of requirements for Performance Reports.
There are three Templates in this Section:
1. User Report Maturity Level
2. Availability of Master Data Models and Data Marts
3. Templates for Performance Reports
Here's another Kick-Start Tutorial:
Step 1. Assess the level of maturity of the Users concerning KPIs, Dashboards, etc.
Step 2. Check availability of Master Data Models and Data Marts.
Step 3. Check availability of Report Specifications and SQL Views for Reports.
Step 4. Tailor the approach accordingly.
Step 5. Aim for results in 6 months and interim results in 3 months.
If you have a Question that is not addressed here, please feel free to email us your Question at barryw@databaseanswers.org.
How.2 - How do we measure progress in Business Intelligence?
Check for:
- a Statement of User Requirements, ideally with specifications of Templates
- Software Design Patterns
How.3 - How do I combine Excel data in my Reports?
Data in Excel spreadsheets is structured in tabular format, which corresponds exactly to the way in which data is stored in a relational database. Spreadsheets are also in common use, and their data frequently needs to be integrated with other data within an organisation. We would therefore expect to find a wide range of solutions available to solve this problem. Here is a small sample:
- An ODBC connection can be established for a spreadsheet.
- Informatica allows spreadsheets to be defined as a Data Source.
- Microsoft's SQL Server Integration Services also lets Excel be defined as a Source.
- Oracle provides a facility to define EXTERNAL tables, which can be spreadsheets.
- Salesforce.com provides their Excel Connector.

How.4 - How do you meet your Chief Executive's Report requirements?
In order to respond to this situation appropriately, it is necessary to have an Information Catalogue, a Data Architecture and Data Lineage. The solution then involves the following Steps:
Step 1. Produce a draft Report for the Chief Exec's approval.
Step 2. Trace the lineage and perform a gap analysis for all new data items.
Step 3. Talk to the Data Owners and establish when and how the data can be made available.
Step 4. Produce a Plan and timescale.
Step 5. Review your Plan with the Chief Exec and obtain their agreement and formal sign-off.
Step 6. Deliver!
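As a minimal sketch of the idea behind these solutions, the following Python fragment loads spreadsheet data (exported as CSV, a common lowest-common-denominator route) into a relational staging table, where it can then be joined with the rest of the organisation's data. The table and column names are our own illustration, not from any particular product.

```python
import csv
import io
import sqlite3

# Hypothetical weekly-sales figures exported from a spreadsheet as CSV.
CSV_TEXT = """product,week,amount
Beer,2009-12-06,40000
Beer,2009-12-13,60000
Cigars,2009-12-06,25000
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE spreadsheet_sales (product TEXT, week TEXT, amount INTEGER)")

# Load each CSV row into the staging table using named parameters.
reader = csv.DictReader(io.StringIO(CSV_TEXT))
conn.executemany(
    "INSERT INTO spreadsheet_sales VALUES (:product, :week, :amount)",
    reader,
)

# The spreadsheet data can now be queried and joined like any other table.
total = conn.execute(
    "SELECT SUM(amount) FROM spreadsheet_sales WHERE product = 'Beer'"
).fetchone()[0]
print(total)  # 100000
```

Tools such as SSIS or Informatica automate exactly this extract-and-stage step, with the spreadsheet declared as a Data Source instead of a hand-written loader.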
How.5 - How do I produce Integrated Performance Reports?
Reports for Senior Management fall into two categories:
- Standard Reports
- On-demand Reports
For Standard Reports it is possible to define Templates. For On-demand Reports, the aim is to define a flexible approach so as to be able to respond to changes in Requirements in a timely manner. The key action here is to establish a unified Reporting Data Platform. This will involve aspects previously discussed, including MDM and CMI, and will certainly involve Data Lineage. Senior Management will want to take a view of the integrated data and not focus on details of derivation. Therefore, we have to follow the MDM approach with Data Lineage for each item in the Integrated Performance Reports.

Key Performance Indicators (KPIs)
Question: What are Key Performance Indicators (KPIs)?
Key Performance Indicators (KPIs) are in common use and represent one aspect of Best Practice.
2.6.1 Best Practice
An Enterprise Data Warehouse is a repository of all the data within the enterprise that is used in Performance Reports. It is intended to guarantee a Single Version of the Truth. It typically stores very detailed data, such as Sales Receipts, and can maintain historical data going back many years. A Data Mart, on the other hand, stores data for specific functional areas, such as Purchases and Sales, and the data is usually limited in timescale and might even have a limited life span. A Data Warehouse can be either a single very large Repository of Data or it can be built as an interlocking set of Data Marts.
Each Data Mart would store data related to a separate business area, such as Sales, or for a specific Report or family of Reports. In passing, we can mention that there are two well-known authorities in the broad field of Data Warehouses and Data Marts. The first is Bill Inmon, who favours the first approach of large, all-encompassing Data Warehouses. The second is Ralph Kimball, who favours related Data Marts. Inmon and Kimball both write well and present convincing arguments for their points of view. A sensible approach is to start with a single Data Warehouse and then to create Data Marts for specific business requirements as they occur. In order to link Data Marts, they need to share the same values for the same Dimensions, such as Stores or Products. These are called Conformed Dimensions. Without Conformed Dimensions, it is impossible to compare and accumulate related values in Data Marts.
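The point about Conformed Dimensions can be made concrete with a small sketch: two Data Marts share the same Calendar dimension values, so their facts can be compared side by side. The table names and figures are illustrative only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Conformed Calendar dimension shared by both Data Marts.
CREATE TABLE calendar (day_date TEXT PRIMARY KEY, week_number INTEGER);
CREATE TABLE purchases_mart (day_date TEXT REFERENCES calendar, amount INTEGER);
CREATE TABLE sales_mart (day_date TEXT REFERENCES calendar, amount INTEGER);

INSERT INTO calendar VALUES ('2010-04-01', 13), ('2010-04-02', 13);
INSERT INTO purchases_mart VALUES ('2010-04-01', 500), ('2010-04-02', 700);
INSERT INTO sales_mart VALUES ('2010-04-01', 900), ('2010-04-02', 1100);
""")

# Because both marts share the same Calendar values, their facts can be
# accumulated and compared at the week level.
row = conn.execute("""
    SELECT c.week_number, SUM(p.amount), SUM(s.amount)
    FROM calendar c
    JOIN purchases_mart p ON p.day_date = c.day_date
    JOIN sales_mart s ON s.day_date = c.day_date
    GROUP BY c.week_number
""").fetchone()
print(row)  # (13, 1200, 2000)
```

If the two marts used different keys or spellings for the same dates (or Stores, or Products), this join would silently drop rows, which is exactly why conformance matters.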
An Agile Approach is very important because it is inevitable that user requirements will change from time to time. We can predict four Phases in the evolution of User Requirements:
1. Give me everything.
2. Give me these Reports on a regular basis and give me an ad-hoc Enquiry facility.
3. I want integrated KPIs and Dashboards.
4. I want to be notified automatically if I have any situation requiring urgent attention.
In order to meet these Requirements, we need to put in place an Agile Data Warehouse with flexible Data Marts and an integrated BI Toolkit.
The best example is probably Dates. In this example, the Date field is a Conformed Dimension shared by the Purchasing and Sales Data Marts, but the Suppliers and Stores Dimensions are not conformed.

[Diagram: a Calendar Dimension (Date) shared by the Purchasing Data Mart (Ticket Number, PO Date, Supplier ID, PO ID) and the Sales Data Mart (Date of Sale, Store Number); the Suppliers Dimension (Supplier ID) belongs only to Purchasing, and the Stores Dimension (Store Number) only to Sales.]
This table conveys the levels of conformance within a Dimension by grouping the base Dimension with conformed rollup totals. The two left-hand columns are Dimensions and the top row shows Facts. The 'Yes' cells indicate the Dimensions that have to be conformed in order for the analyses to be valid; for example, they show that if we have Product-level data then Product is a conformed dimension. The entries are illustrative, for discussion purposes.
[Table: conformance matrix. Dimension rows: Date (Day, Week, Month), Product (Product, Product Category), Organisation (Warehouse, Store, Division); Fact columns: Orders, Shipments, Inventories, Sales, Returns, Demand Forecast. Each 'Yes' cell marks a Dimension that must be conformed for that Fact.]
This shows the design for three representative Online Shopping Data Marts:

Data Mart 1 for Online Shopping: Record_ID, Shopping_Date, Total_Amount
Data Mart 2 for Online Shopping: Record_ID, Shopping_Date_Time, Product Type, Total_Amount
Data Mart 3 for Online Shopping: Record_ID, Shopping_Date_Time, Customer_ID, Total_Amount
This design shows two date fields because that is a common pattern; however, there could be more or fewer than two. In a similar way, it shows six Dimensions and six Facts for illustrative purposes, but these, of course, could be any number of attributes.
Ref_Calendar: Day_Date_Time, Week_Number, Month_Number, Year_Number

Data Mart - Generic Design: Record_ID, Date_1 (FK), Date_2 (FK), Dimension_1 (FK), Dimension_2 (FK), Dimension_3 (FK), Dimension_4 (FK), Dimension_5 (FK), Dimension_6 (FK), Fact_1, Fact_2, Fact_3, Fact_4, Fact_5, Fact_6

Dimension tables: CALENDAR (Day Number (PK)), PRODUCT TYPES (Product Type Code), CUSTOMER TYPES (Customer Type Code)

DATA_MART_FACTS: Fact ID (PK), Date, Movement ID, Product ID, Product Type Code, Customer ID
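The generic Data Mart design above can be sketched as DDL. This is a reduced version (two dates and two dimensions instead of six of each) under our own illustrative names; the pattern of date and dimension foreign keys around numeric facts is the point.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ref_calendar (
    day_date_time TEXT PRIMARY KEY,
    week_number   INTEGER,
    month_number  INTEGER,
    year_number   INTEGER
);

-- Generic Data Mart: every Dimension is a foreign key, Facts are numeric.
-- (Two dates and two dimensions here; the full design has six of each.)
CREATE TABLE data_mart_generic (
    record_id   INTEGER PRIMARY KEY,
    date_1      TEXT REFERENCES ref_calendar(day_date_time),
    date_2      TEXT REFERENCES ref_calendar(day_date_time),
    dimension_1 TEXT,
    dimension_2 TEXT,
    fact_1      REAL,
    fact_2      REAL
);
""")
conn.execute("INSERT INTO ref_calendar VALUES ('2010-04-01', 13, 4, 2010)")
conn.execute(
    "INSERT INTO data_mart_generic VALUES "
    "(1, '2010-04-01', '2010-04-01', 'Store 7', 'Beer', 40000, 120)"
)

# Slicing a fact by a dimension value is the basic Data Mart query shape.
fact = conn.execute(
    "SELECT fact_1 FROM data_mart_generic WHERE dimension_2 = 'Beer'"
).fetchone()[0]
print(fact)  # 40000.0
```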
What is a Data Mart?
- A Repository of total and detailed data with a standard structure.
- This structure is usually a Facts Table, where all the data for analysis is held, together with a number of associated Dimension Tables.
- Generic software is used, supported by common Report Templates.

How do we get started?
Step 1. Understand the Users' Data Requirements.
Step 2. Determine the available Data.
Step 3. Reconcile standards and reference data.
Step 4. Establish a common view of the Data Platform.
Step 5. Choose the product or use bespoke SQL.
Step 6. Design the Templates and agree the design with Users.
Step 7. Populate the Templates with sample data.
What.1: What is a Data Mart?
These questions are from this page: http://www.databaseanswers.org/data_marts.htm
A Data Mart is a Repository of summary, total and detailed data to meet User Requirements for Reports. Data Marts always have a standard structure, called a Dimensional Data Model, which means that it is possible to use Generic Software and adopt a common approach based on Templates. Describing a Data Mart is a good way to get User buy-in, because it can easily be explained in a logical manner which is very user-friendly.
What.2 : What are Data Mart Templates ? Data Marts have a common design of Dimension fields and Facts. Templates are important because they represent a tremendous Kick-Start approach to the design of Data Marts for a specific business area. They are produced by exploiting the common design of Dimensions and Facts. A range of Data Mart diagrams is available in the Case Studies on the Database Answers Web Site.
Why.1: Why is this Stage important?
It provides a single point of reference for all the data available within the organisation for producing Reports.

How.1: How do we get started?
These questions are from this page: http://www.databaseanswers.org/data_marts_questions.htm
To get started, follow these Steps:
1. Get a broad understanding of Users' Data Requirements.
2. Establish a common view of the Data Platform.
3. Determine the available Data.
4. Reconcile standards and reference data.
5. Choose the product or use bespoke SQL.
6. Use Templates and agree the design with Users.
7. Populate Templates with sample data.
8. Get sign-off on demo specs in 1 month; aim for results for a champion in 3 months and final results in 6 months.
9. Adjust timescales in the light of experience.
How.2: How do we measure progress with Data Marts?
Check the level of Users' understanding. Check for the existence of Templates.
How.3: How do I improve the performance of my Data Mart?
Every DBMS produces what is called an Execution Plan for every SELECT statement. Examining this plan shows how the Data Mart's tables are being accessed, for example whether indexes are used or full table scans occur, and is the natural starting point for performance tuning.
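Each DBMS exposes its execution plan in its own way; as one concrete example, SQLite offers EXPLAIN QUERY PLAN, shown below against an illustrative fact table. The exact wording of the plan varies by SQLite version, but it names the index it chooses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (store_id INTEGER, amount INTEGER)")
conn.execute("CREATE INDEX idx_store ON facts (store_id)")

# EXPLAIN QUERY PLAN is SQLite's form of an execution plan; other DBMSs
# expose the same idea (e.g. EXPLAIN PLAN in Oracle, SHOWPLAN in SQL Server).
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT SUM(amount) FROM facts WHERE store_id = 7"
).fetchall()
plan_text = " ".join(str(row) for row in plan)
print(plan_text)  # e.g. "... SEARCH facts USING INDEX idx_store (store_id=?) ..."
```

If the plan showed a full scan of `facts` instead of a search on `idx_store`, adding or adjusting an index would be the first tuning step to consider.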
2.7.2 The Canonical Enterprise Data Model This offers a Design Pattern that can apply to any high-level System or Business Function that is providing data to be loaded into the Data Warehouse.
[Diagram: the Canonical Enterprise Data Model, with the core entities CUSTOMERS, PRODUCTS / SERVICES, EVENTS, ORGANISATION, SUPPLIERS and DOCUMENTS.]
2.7.4 Data Warehouse ERD This diagram shows the Entities that contribute to the Data Landscape for the Warehouse.
[Diagram: entities include Addresses, Customers, Documents, Events, Ref_Customer_Types, Ref_Document_Types, Warehouses and Suppliers_Products.]
Data Warehouse Phase 1 Generic: Fact_ID, DWH_Data_Type_Code (FK), Customer_ID (FK), Document_ID (FK), Event_ID (FK), Product_ID (FK), Reporting_Day_Date_Time (FK), Staff_ID (FK), Store_ID (FK), Supplier_ID (FK), Warehouse_ID (FK), Date_From, Date_To, Dimension_1 (FK), Dimension_2 (FK), Dimension_3 (FK), Amount, Count, Fact_1, Fact_2, Fact_3, Fact_4, Fact_5, Fact_6

Ref_Calendar: Day_Date_Time, Week_Number, Month_Number, Year_Number
Dimension_1: Dimension_1, Dimension_1_Description
Dimension_2: Dimension_2, Dimension_2_Description
Dimension_3: Dimension_3, Dimension_3_Description

DWH_Data_Types: DWH_Data_Type_Code, DWH_Data_Type_Name (e.g. Gift Card Basic Data, Gift Card Totals, Vendor Compliance Totals)

Data Warehouse Phase 1: Fact_ID, DWH_Data_Type_Code (FK), Customer_ID (FK), Document_ID (FK), Event_ID (FK), Product_ID (FK), Reporting_Date_Time (FK), Staff_ID (FK), Store_ID (FK), Supplier_ID (FK), Warehouse_ID (FK), Damage_Category_Code (FK), Gift_Card_Type (FK), Total_Amount, Total_Count, Total_Card_Amount, Total_Card_Count, Damage_Goods_Amount, Damage_Goods_Count
2.7.7 ETL Template for Validation Specifications
This shows the validation for the Event entity in the EDM.

DATA ITEM | TYPE | VALIDATION | COMMENT
event_id | Integer | Unique internal Identifier for each specific Customer Event. | For example, a Customer makes an Appointment.
External Event ID | Text (15) | Unique external ID for each specific Event; optional and cannot be validated. | For example, a Housing Repair Job Number.
Event Type | Text (15) | Reference Data. |
Associate | Integer | Link to Associate Table. | For example, a Court for Youth Offenders or a School for Pupils.
Contact | Integer | Link to Contact Table. | For example, a Social Worker.
Staff | Integer | Optional, but links to Staff Table if specified. |
Supplier | Integer | Optional, but links to Supplier Table if specified. | For example, a Housing Contractor for repairs.
Event Address | Integer | Optional. | For example, where an Offence took place.
Event Outcome | Text (15) | Reference Data. | For example, Satisfactory.
Event Status | Text (15) | Reference Data. |
Event Start Date & Time | Date | Mandatory. |
Event End Date & Time | Date | Optional. |
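A validation specification like this translates directly into code. The sketch below implements a few of the Event rules (Integer identifier, Reference Data lookup, mandatory start date, optional end date); the reference-data values and the helper itself are our own illustration.

```python
from datetime import datetime

# Illustrative reference data; in practice these come from reference tables.
EVENT_TYPES = {"Appointment", "Repair"}
EVENT_STATUSES = {"Open", "Closed"}

def validate_event(event: dict) -> list:
    """Return a list of validation errors for one Event record."""
    errors = []
    if not isinstance(event.get("event_id"), int):
        errors.append("event_id must be an Integer")
    if event.get("event_type") not in EVENT_TYPES:
        errors.append("event_type must be valid Reference Data")
    if not event.get("start_date"):
        errors.append("Event Start Date is Mandatory")
    else:
        try:
            datetime.strptime(event["start_date"], "%Y-%m-%d %H:%M")
        except ValueError:
            errors.append("Event Start Date must be a Date & Time")
    # Event End Date is Optional, so it is only checked when present.
    if event.get("end_date"):
        try:
            datetime.strptime(event["end_date"], "%Y-%m-%d %H:%M")
        except ValueError:
            errors.append("Event End Date must be a Date & Time")
    return errors

good = {"event_id": 1, "event_type": "Appointment", "start_date": "2010-04-01 09:00"}
bad = {"event_id": "x", "event_type": "???", "start_date": ""}
print(validate_event(good))       # []
print(len(validate_event(bad)))   # 3
```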
Two further blank Templates use the following columns:

Data Profiling Template: DATA ITEM | DESCRIPTION | MIN VALUE | MAX VALUE | COMMENTS (example MIN VALUE: Dec-31-1998)

Data Quality Template: DATA ITEM | DESCRIPTION | NULLABLE | RULES | DATE | % QUALITY (example NULLABLE: Yes)
ETL Transformations (blank Template)
Header fields: Project Title | Known As | Development End Date | Additional Comments
Columns: Trigger | Source (e.g. Table): Data Item, Data Type | Target (e.g. XML File): Data Item, Data Type | Job Schedule | Rule Specification
Mapping Specifications (completed Template; specifications taken from migrating sample Customer data)
Project Title: Creation of a Data Extract for Customers
Date: April 1st, 2010
Additional Comments: These Specifications are subject to review by Stakeholders.
Trigger: When CUSTOMERS.DAT_VAL = SYSDATE

Source (include DB type and name):
Table CUSTOMERS, columns: ID NVARCHAR2(8), DAT_VALID DATE, PHON_NUMBER NVARCHAR2(35), FAX_NUMBER NVARCHAR2(35), TELEX_NUMBER NVARCHAR2(35), E_MAIL_ADDRESS NVARCHAR2(70), COUNTRY_ID NVARCHAR2(2), TRADING_ROLE NVARCHAR2(1), POST_CODE NVARCHAR2(9), REG_CODE NVARCHAR2(3), GEO_INF_CODE NVARCHAR2(8)

Example mappings:
CUSTOMERS.ID NVARCHAR2(8) -> OFFICE Unique ID CHAR(8)
CUSTOMERS.COUNTRY_ID NVARCHAR2(2) -> OFFICE Code CHAR(2), Copy as-is
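A mapping specification of this kind can be held as data and applied mechanically. The sketch below does this for a few columns; the target field names and the 'upper' rule are our own illustration, not from the original specification.

```python
# A mapping specification as data: each entry maps a source column to a
# target field with a transformation rule.
MAPPING = [
    {"source": "ID",         "target": "office_unique_id", "rule": "copy"},
    {"source": "COUNTRY_ID", "target": "office_code",      "rule": "copy"},
    {"source": "POST_CODE",  "target": "post_code",        "rule": "upper"},  # illustrative rule
]

RULES = {
    "copy":  lambda v: v,
    "upper": lambda v: v.upper() if isinstance(v, str) else v,
}

def apply_mapping(source_row: dict) -> dict:
    """Build a target record by applying each mapping rule in turn."""
    return {
        m["target"]: RULES[m["rule"]](source_row.get(m["source"]))
        for m in MAPPING
    }

customer = {"ID": "C0000123", "COUNTRY_ID": "GB", "POST_CODE": "sw1a 1aa"}
print(apply_mapping(customer))
# {'office_unique_id': 'C0000123', 'office_code': 'GB', 'post_code': 'SW1A 1AA'}
```

Keeping the mapping as data rather than hard-coded logic means the specification document and the running ETL job can stay in step.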
What.1 - What is Data Integration?
Here is the web link for the Questions on the Database Answers web site: http://www.databaseanswers.org/data_integration.htm

Data Integration is concerned with combining data from various Sources into one consistent stream. It provides a single view of the truth for the things of importance to the organisation, such as Customers (or Traders), Products and Movements; for example, an essential Single View of a Customer. It includes Data Quality, Master Data Management and mapping specifications.

Some key points:
- Data Integration provides a natural point at which Data Quality can be assessed and addressed.
- When Data is of uniformly good quality, it can be integrated and made available as a consistent View.
- Details of the Integration, such as mapping specifications, are held in a Glossary, as described in the Information Catalog Stage.
- The current incarnation of Data Integration is Master Data Management (MDM).
What.2 - What is Master Data Management (MDM)?
One of the major components in Master Data Management (MDM) is Customers. MDM can be defined as providing a Single View of the Things of Importance within an organisation. Master Data Management applies the same principles to all the Things of Interest in an organisation, typically including Employees, Products and Suppliers. We have discussed a Single View of the Customer, and MDM involves the same kind of operations as a CMI: that is, identification and removal of duplicates, and putting procedures in place to eliminate duplicates in any new data loaded into the Databases. There is a wide choice of software vendors offering MDM products; de-duplication and address validation is a niche market in this area. On the Database Answers web site there is a Tutorial on Getting Started in MDM, and there is a sister web site devoted to the topic of MDM-as-a-Service.
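As a minimal sketch of the de-duplication step, the fragment below matches customer records on a normalised key (name plus postcode) and keeps one survivor per key. Real MDM products use far more sophisticated fuzzy matching and survivorship rules; all the names here are illustrative.

```python
def match_key(record: dict) -> tuple:
    """Normalise a record into a matching key (a very crude match rule)."""
    name = record["name"].upper().replace("LTD", "LIMITED").strip()
    postcode = record["postcode"].upper().replace(" ", "")
    return (name, postcode)

def deduplicate(records: list) -> list:
    """Keep the first record seen for each match key (the 'survivor')."""
    survivors = {}
    for record in records:
        survivors.setdefault(match_key(record), record)
    return list(survivors.values())

customers = [
    {"name": "Acme Ltd",     "postcode": "SW1A 1AA"},
    {"name": "ACME LIMITED", "postcode": "sw1a1aa"},
    {"name": "Bloggs & Co",  "postcode": "EC1A 1BB"},
]
print(len(deduplicate(customers)))  # 2
```

The same match key would also be applied to new records before loading, which is the "eliminate duplicates in any new data" procedure mentioned above.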
What.3 - What are Conceptual, Logical and Physical Data Models?
Wikipedia has some useful entries on Conceptual, Logical and Physical Data Models. Conceptual Data Models do not conventionally show Foreign Keys or Attributes; they make clear the Entities and Relationships in a Data Model, and are very useful for discussing Requirements with Users because they show only the basics. Logical Data Models add Foreign Keys and Attributes; they are very useful for publishing a complete statement of the data involved. Physical Data Models are very close to the Database design; they are very useful for discussions between the Data Analyst, DBAs and developers.
What.4: What does ETL stand for?
Wikipedia has an entry on ETL which is worth a look. ETL stands for Extract, Transform and Load:
- Extract means extracting data from Data Sources.
- Transform covers many tasks, including selection of the data of interest, validation and clean-up of the selected data, and changing the format and content of the data.
- Load means loading the data into the designated Target.
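The three ETL steps can be sketched end to end in a few lines. This is a toy run under assumed names: the source is a small in-memory list standing in for a real Data Source, and the target table is illustrative.

```python
import sqlite3

# Extract: read rows from the Data Source (here, an in-memory list standing
# in for a file or database extract).
source_rows = [
    {"name": " acme ltd ",  "amount": "100"},
    {"name": "BLOGGS & CO", "amount": "250"},
]
extracted = list(source_rows)

# Transform: select, validate and clean up the data of interest.
transformed = [
    {"name": r["name"].strip().upper(), "amount": int(r["amount"])}
    for r in extracted
    if r["amount"].isdigit()  # validation: amount must be numeric
]

# Load: write the cleaned rows into the designated Target.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (name TEXT, amount INTEGER)")
conn.executemany("INSERT INTO target VALUES (:name, :amount)", transformed)

total = conn.execute("SELECT SUM(amount) FROM target").fetchone()[0]
print(total)  # 350
```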
Why.1 - Why is this Stage important?
It provides one view of the truth, and it offers a point at which Data Integrity can be measured and User involvement obtained to improve Quality until it meets User standards.
How.1 - How do we get started?
Data Profiling is a good starting point for determining the quality of the data and drafting some simple validation and transformation rules that can be used to get started. For example, replace LTD by LIMITED (or vice versa), and '&' by AND. The design approach requires Data Models for the areas within Scope. It will also require Generic Data Models to support one view of the truth for major entities, such as Traders or Customers. This one view will be implemented as Master Data Management (MDM).
- Get a broad understanding of the data available and of the Data Sources.
- Establish a common view of the Data Platform.
- Determine the available Data.
- Choose the MDM product.
- Determine the strategy for Clouds, e.g. Reference Data available globally.
Milestones:
- In 1 month, produce Generic Data Models.
- In 3 months, confirm the GDM with sample data and Facilitated Workshops, and choose the MDM product.
- In 6 months, implement MDM and publish the GDM and CMI on the Intranet.
- Adjust timescales in the light of experience.
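A first cut of Data Profiling needs nothing more than row counts, null counts, distinct counts and min/max per column. The sketch below shows the idea on illustrative records; profiling tools produce the same statistics at scale.

```python
def profile(rows: list, column: str) -> dict:
    """Profile one column: row count, nulls, distinct values, min and max."""
    values = [r.get(column) for r in rows]
    present = [v for v in values if v not in (None, "")]
    return {
        "rows": len(values),
        "nulls": len(values) - len(present),
        "distinct": len(set(present)),
        "min": min(present) if present else None,
        "max": max(present) if present else None,
    }

traders = [
    {"name": "ACME LTD",     "country": "GB"},
    {"name": "ACME LIMITED", "country": "GB"},
    {"name": "BLOGGS & CO",  "country": ""},
]
print(profile(traders, "country"))
# {'rows': 3, 'nulls': 1, 'distinct': 1, 'min': 'GB', 'max': 'GB'}
```

The null count and distinct count are exactly what drives the simple clean-up rules mentioned above (e.g. spotting that ACME appears as both LTD and LIMITED).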
Data Integration covers a number of Steps, each of which can have its own Templates. Examples are included here for Data Profiling and Mapping Specifications.
How.2 - How do we follow Best Practice?
These Steps define a Tutorial of Best Practice:
Step 1. Define the Target, which is usually a Single View Data Model.
Step 2. Define the Data Sources.
Step 3. Define the Mapping Specifications from the Sources to the Target.
Step 4. Define the Data Platform.
Step 5. Identify Standards to be followed.
This Tutorial is described in detail in a separate document, entitled Data_Integration_Tutorial.doc. These questions come from this page: http://www.databaseanswers.org/data_integration_questions.htm. If you have a Question that is not addressed here, please feel free to email us your Question.

How.3 - How do we measure progress in Data Integration?
Look for the existence of the following items:
- Generic Data Models
- An Enterprise Data Platform
- Identification of the Data Sources
- Selection of an MDM Product
- Implementation of a Customer Master Index or an appropriate alternative
How.5 - How do I establish a Strategy for Data Quality?
A successful Strategy for Data Quality as an Enterprise Issue must include both organisational and technical aspects.
Typical organisational aspects are:
- Commitment from senior management
- Establishing the slogan 'Data Quality is an Enterprise Issue' as a top-down edict
- Identification of the Top 20 Applications and Data Owners across the Enterprise
- Agreeing sign-off procedures with Data Owners and Users
Technical aspects:
- Establish Key Quality Indicators (KQIs), for example duplicate Customer records
- Agree a target Data Quality percentage
- Define KQI Reports and dashboards
- Develop SQL to measure KQIs
- Define procedures to improve KQIs

How.6 - How do I handle multiple types of Databases?
This could include Oracle, SQL Server and DB2. The key to handling multiple types of Database is to think of them in terms of an integrated Data Platform.
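The "Develop SQL to measure KQIs" point can be sketched directly. The example below measures one KQI, duplicate Customer records, as a percentage; the table layout and sample data are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "ACME"), (2, "ACME"), (3, "BLOGGS"), (4, "CHARLIE")],
)

# KQI: percentage of Customer records whose name occurs more than once.
duplicates = conn.execute("""
    SELECT COUNT(*) FROM customers
    WHERE name IN (
        SELECT name FROM customers GROUP BY name HAVING COUNT(*) > 1
    )
""").fetchone()[0]
total = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]

kqi = 100.0 * duplicates / total
print(kqi)  # 50.0
```

Run on a schedule, figures like this feed the KQI Reports and dashboards, and the trend over time shows whether the improvement procedures are working.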
Template columns: SOURCES | CONTACT | TYPE | DATA ITEMS | COMMENTS
Data Sources include all the major places where important data is created or used, including:
- Applications
- Databases
- Spreadsheets
- XML files, and so on
It also includes information related to each Stage in the Best Practice Road Map, covering People, Roles and Responsibilities. This information is stored within an Information Catalog: a Repository recording the Data Sources for all major Applications, Databases, Spreadsheets and so on, together with the data and information related to each Stage in the Best Practice Road Map, including details of People, Roles and Responsibilities.
Why.1 - Why Is this Stage important ? It is important because it provides the starting-point for a range of utilities which combine to provide very valuable functionality.
How.1 - How do we get started?
These are the four basic Steps:
Step 1. Agree the initial content and revise it at regular intervals.
Step 2. Identify the individuals responsible for data gathering and dissemination.
Step 3. Start with a bottom-up approach and focus on working documents, such as Invoices or Movement Authorisations.
Step 4. Follow with a top-down approach and focus on Reports.
3.2 Implementation
Within any large organisation there is a range of structured and unstructured data, including Databases, email and documents stored in facilities such as SharePoint. SharePoint can be used for:
- Project Documentation
- Minutes of Meetings
- Policies and Procedures
- Email Archives with Search (e.g. Epsilon)
- Registered users with profiles
- Publish and Subscribe
[Diagram: layered Data Architecture, with a Reporting Layer above a Data Integration Layer (Single Version of the Truth, Enterprise Data Model, ETL).]
3.6 Progress towards a Layered Data Architecture
This diagram shows the overall Architecture:

[Diagram: BI Architecture, with a Reporting Layer (BI, e.g. Ladder Plan and FBP) and Report Templates.]
3.7 Evolution of the Data Integration Layer
There will be a number of Phases in the development of the Data Integration Layer (DIL). Six types of facility can be added, in the sequence shown in the box.
These facilities can be added in four possible Phases:
Phase 1 - Design a Thin Slice from end to end for a typical Business Scenario.
Phase 2 - Add Entities from analysis of the business functions involved.
Phase 3 - Address Data Quality, including data profiling, data consistency, correctness and so on.
Phase 4 - Enhance features in MDM, the Data Warehouse and Web Services.
[Box: facilities to be added to the Data Integration Layer include Data Profiling and Web Services.]
3.8 Progress towards a Layered Data Architecture
This diagram is a simple way to keep track of progress towards the implementation of your Data Architecture. The yellow areas indicate specific Deliverables that have been produced.
Deliverable | Status
Data Warehouse design | Work to be completed
Enterprise Data Model (EDM) V.2 | Design produced; work to be completed
Mapping Template for EDM | Template produced; work to be completed
Single Version of the Truth for Customers, Organisation, Products and Stores | Design produced
Data Quality Validation and Clean-up | Template produced; work to be completed
Data Profiling (e.g. Information Catalogue) | Template produced; work to be completed
Information Catalogue | Generic design produced; review to be completed
[Diagram: Stages and Data Sources in the Lifecycle of a Customer - Plan (Forecasting System), Buy (KPI = Vendor Delivery), Deliver (Inventory) and Sell (e.g. POS) feed the Enterprise Data Model (EDM) and Single Version of the Truth, mapped from the EDM to the DWH and on to Data Mart 1 and Data Mart 2, each serving its own Report Family.]
Ubiquitous Data Available at Any Time, and Any Place with Any Device
Web Services Design Enterprise Service Bus (ESB) Enterprise Data Model Data Virtualisation (Middleware) IBM, Informatica, Microsoft, Oracle, etc.
Implementation
Data Sources / Databases CSV, ODBC, Oracle, SAP, SQL Server, etc.
http://www.linkedin.com/groups?home=&gid=142267&trk=anet_ug_hm
http://www.linkedin.com/groups/Master-Data-Management-Interest-Group-53314
4.1 Introduction
This Chapter presents a Strategy for Enterprise Data Management. It presents some reasons why you should have a Strategy and it then recommends a tried and tested approach to putting a Strategy in place.
These four Stages are shown in this diagram, along with an indication of the characteristics of each Stage :-
4.4.2 Self-Assessment This Section provides more details for the As-Is evaluation. It can be used as a method of Self-Assessment for you to identify where an organisation is situated in the maturity of each Component in Enterprise Data Management.
COMPONENT / BASIC / AVERAGE / IDEAL :-

1) Data Sources
- Basic : Knowledge in the heads of individuals. No Data Models and poor documentation of links between code and databases.
- Average : Top 20 Applications known, with a list of Data Sources and Owners. Basic Data Dictionary in place.
- Ideal : Agile development with refactoring techniques. Data Models and sign-off by DBAs on all changes. User access and sign-off for the Data Dictionary.

2) Data Integration
- Basic : Ad-hoc integration using bespoke SQL Scripts. Nobody understands Master Data Management (MDM).
- Average : MDM is planned.
- Ideal : MDM approved, with data owner sign-off.

3) Data Quality
- Basic : Could be improved. DQ Policy under consideration.
- Average : Sufficient. DQ Policy under implementation.
- Ideal : Accurate, valid and relevant. Data Quality established as an Enterprise Issue, with an Enterprise-level Policy.

4) Performance Reports
- Basic : One-off, often independent Dept. Spreadsheets.
- Average : Independent Maps, KPIs and drill-down to detailed Reports.
- Ideal : Integrated Maps, KPIs and drill-downs for the Chief Exec.

5) Mashups
- Basic : None.
- Average : Isolated development.
- Ideal : Users aware.

6) Data Governance
- Basic : None.
- Average : No end-to-end agreement.
- Ideal : Procedures published; Roles, Responsibilities and Sign-off all in place. Data Lineage known and auditable.

7) Information Catalogue
- Basic : Stand-alone Tool.
- Ideal : Provided over the Intranet for User Sign-off.
4.6 Building Blocks in the Road Map
This diagram shows the six Building Blocks in the Road Map. They start at the top with Data Governance and then go down to the Information Catalogue, which is used to record all the important information about Data Management within an organisation.
http://www.linkedin.com/groups/Data-Warehousing-Professionals-Group-is-124955.S.39415157
http://www.linkedin.com/groups/Data-Warehouse-Architecture-1377377?gid=1377377&mostPopular=&trk=tyah
http://tdwi.org/
5.1 Introduction
This document describes a Data Warehouse Strategy. It is divided into three Chapters : a Business perspective, a Data Architecture perspective and a Plan. The proposal is to start with Credit Cards in a Retail organisation as the first Project. The format of sample transaction data can be drafted, and sample Reports are also available. Therefore, it will be possible to define both the Data Source and sample User Requirements, and to design the Data Architecture that will integrate these two.
[Diagram: Lifecycle stages and systems - Plan (LadderPlan), Buy (POTS), Promote, Sell (POS) and Analyse (Sales Audit).]
5.3.2 A Strategy for Data Warehousing This document describes a Data Warehouse Strategy. It recommends two Phases :-
Phase 1. Installing a single Data Warehouse to store all enterprise Data. We will also install tuning software to monitor performance, to predict when we need to move to Phase 2.
Phase 2. Adding Data Marts dedicated to specific functional areas.
We show examples of the analysis of Credit Cards and of Damaged Goods supplied by Vendors.
5.3.3 Template for User Requirements There are two aspects to the specification of User Requirements : The totals required, and other derived figures.
For example, the requirement might be for the total number of transactions and their value on Credit Cards on a weekly basis in a specific period of time. In this way, a sample Report layout can be drafted for approval by the User.
Report Name : Credit Card Weekly Report
Period starting : January 1st 2011

                    Week 1   Week 2   Week 3   Week 4   Week 5   Week 6
Credit Card Count
Total Value
5.3.4 Benchmarks This Section establishes Benchmarks that will be used to evaluate predicted performance from a Netezza Appliance.
5.8.1 Conformance Analysis This table conveys the levels of conformance within a Dimension by grouping the base Dimension with conformed rollup totals. The two left-hand columns are Dimensions and the top row shows Facts. The 'Yes' fields indicate the Dimensions that have to be conformed in order for the analyses to be valid. They show that if we have Product-level data then the Product is a conformed dimension. They are illustrative, for discussion purposes.
[Table: conformance matrix. Rows are the Dimension rollups - Date (Day, Week, Month), Product (Product, Product Category) and Organisation (Warehouse, Store, Division). Columns are the Facts - Orders, Shipments, Inventories, Sales, Returns and Demand Forecast. A 'Yes' in a cell marks a Dimension that must be conformed for that Fact.]
[Diagram: ERWin Model for Data Warehouse Phase 1]
- Dimension entities : Customers_, Documents_, Events, Ref_Customer_Types, Ref_Document_Types, Warehouses, Suppliers_Products_.
- Generic Fact table (Data Warehouse Phase 1 Generic) : Fact_ID, DWH_Data_Type_Code (FK), Customer_ID (FK), Document_ID (FK), Event_ID (FK), Product_ID (FK), Reporting_Day_Date_Time (FK), Staff_ID (FK), Store_ID (FK), Supplier_ID (FK), Warehouse_ID (FK), Date_From, Date_To, Dimension_1 (FK), Dimension_2 (FK), Dimension_3 (FK), Amount, Count, Fact_1 to Fact_6.
- Ref_Calendar : Day_Date_Time, Week_Number, Month_Number, Year_Number.
- Generic dimensions : Dimension_1, Dimension_2 and Dimension_3, each with a code and a Description.
- DWH_Data_Types : DWH_Data_Type_Code, DWH_Data_Type_Name (eg Gift Card Basic Data, Gift Card Totals, Vendor Compliance Totals).
- Specific Fact table (Data Warehouse Phase 1) : Fact_ID, DWH_Data_Type_Code (FK), Customer_ID (FK), Document_ID (FK), Event_ID (FK), Product_ID (FK), Reporting_Date_Time (FK), Staff_ID (FK), Store_ID (FK), Supplier_ID (FK), Warehouse_ID (FK), Damage_Category_Code (FK), Gift_Card_Type (FK), Total_Amount, Total_Count, Total_Card_Amount, Total_Card_Count, Damage_Goods_Amount, Damage_Goods_Count.
5.A.2 Volumes These statistics can be used in an evaluation of the performance of a Data Warehouse Appliance, such as IBM's Netezza or Oracle's Exadata.
[Table: Volumes - columns are PHASE, SYSTEM, FREQUENCY and the volume counts (DAILY, WEEKLY, MONTHLY, ANNUAL); rows cover the Plan and Buy Phases with the Ladder Plan, POTS and POCC Systems, at a Weekly frequency (volume figures not shown).]
5.A.3 Growth in Volume This table shows some predicted growth in the volume of data that will need to be stored in the Data Warehouse.
[Table: Growth in Volume - columns are PHASE, SYSTEM and the growth counts (DAILY, WEEKLY, MONTHLY); rows cover the Plan and Buy Phases with the Ladder Plan, POTS and POCC Systems (growth figures not shown).]
5.A.4 User Workload You will need to provide more specifics here, and these can be obtained from discussion with Users. This table shows some typical Enquiries and Reports that Users might want to run.

PHASE     SYSTEM        FREQUENCY          ENQUIRIES           DATA
Plan      Forecasting   Weekly
Buy       PO Tracking
Promote
Sell      POS           Minute-by-Minute   Total Daily Sales   SUM(Sales) GROUP BY Day
Analyse   Sales Audit
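The 'Total Daily Sales' enquiry above can be sketched directly. This is an illustration only, using Python's built-in sqlite3 module with made-up sample data; the table and column names are assumptions, not the actual POS schema.

```python
import sqlite3

# Illustrative POS table - names and values are assumptions, not the real schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pos_sales (sale_date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO pos_sales VALUES (?, ?)",
    [("2011-05-01", 10.0), ("2011-05-01", 5.0), ("2011-05-02", 7.5)],
)

# The Total Daily Sales enquiry : SUM(Sales) GROUP BY Day.
daily_totals = conn.execute(
    "SELECT sale_date, SUM(amount) FROM pos_sales"
    " GROUP BY sale_date ORDER BY sale_date"
).fetchall()
print(daily_totals)
```

The same GROUP BY pattern serves the Weekly and Monthly frequencies by grouping on a week or month column instead of the day.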
And here's the LinkedIn Group on Data Modelling and Meta Data Management :-
http://www.linkedin.com/groups?home=&gid=988447&trk=anet_ug_hm
6.2 Concepts
6.2.1 One-to-Many Relationships A Customer can place many Demands or Orders for Products. This defines a One-to-Many Relationship.
A Data Modeller would say "For every Customer, there are many Demands or Orders". This is shown in a Data Model as follows :-
6.2.2 Many-to-Many Relationships We can also say that an Order can request many Products.
A Many-to-Many Relationship cannot be implemented directly in Relational Databases. Therefore we resolve this many-to-many into two one-to-many Relationships, which we show in a Data Model as follows :-
When we look closely at this Data Model, we can see that the Primary Key is composed of the Order_ID and Product_ID fields. This reflects the underlying logic, which states that every combination of Order and Product is unique. In the Database, this will define a new record. When we see this situation in a Database, we can say that this reflects a many-to-many Relationship. However, we can also show the same situation in a slightly different way, which reflects the standard design approach of using a surrogate key as the Primary Key and showing the Order and Product IDs simply as Foreign Keys.
The benefit of this approach is that it avoids compound Primary Keys growing ever larger where more dependent Tables cascade downwards. The benefit of the previous approach is that it avoids the possibility of orphan records in the Products in an Order table.
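The composite-key resolution can be sketched as follows. This is an illustration (assumed table and column names, using Python's sqlite3): the Primary Key on (order_id, product_id) enforces the rule that every combination of Order and Product is unique.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders   (order_id   INTEGER PRIMARY KEY);
CREATE TABLE products (product_id INTEGER PRIMARY KEY);
-- Resolution of the many-to-many : composite Primary Key variant.
CREATE TABLE products_in_order (
    order_id   INTEGER REFERENCES orders,
    product_id INTEGER REFERENCES products,
    quantity   INTEGER,
    PRIMARY KEY (order_id, product_id)   -- every Order/Product combination is unique
);
INSERT INTO orders   VALUES (1);
INSERT INTO products VALUES (100);
INSERT INTO products_in_order VALUES (1, 100, 2);
""")

# A second row for the same Order/Product combination violates the composite key.
try:
    conn.execute("INSERT INTO products_in_order VALUES (1, 100, 3)")
    duplicate_allowed = True
except sqlite3.IntegrityError:
    duplicate_allowed = False
print(duplicate_allowed)
```

With the surrogate-key variant, a single ID column becomes the Primary Key, and the same uniqueness would instead need a separate unique constraint on (order_id, product_id).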
6.2.3 Rabbit's Ears We start with the definition of a Unit, which at its simplest looks like this :- In this case, we use a meaningless ID which is simply a unique number.
Then we think about the fact that every Unit is part of a larger organisation. In other words, every Unit reports to a higher level within the overall organisation.
Fortunately, we can show this in a very simple and economical fashion by creating a relationship that adds a parent ID to every Unit. This is accomplished by adding a relationship that joins the table to itself.
This is formally called a Recursive or Reflexive relationship, and informally called 'Rabbit's Ears', and it looks like this :-
The Unit at the very top of organisation has no-one to report to, and a Unit at the lowest level does not have any other Unit reporting to it. In other words, this relationship is Optional at the top and bottom levels. We show this by the small letter O at each end of the line which marks the relationship.
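A minimal sketch of the 'Rabbit's Ears' table (assumed names, using Python's sqlite3): the parent column is nullable, which makes the relationship optional at the top, and a recursive query can then walk the reporting chain upwards.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE units (
    unit_id        INTEGER PRIMARY KEY,               -- meaningless ID, simply a unique number
    unit_name      TEXT,
    parent_unit_id INTEGER REFERENCES units (unit_id) -- NULL at the very top : optional
);
INSERT INTO units VALUES (1, 'Head Office', NULL);
INSERT INTO units VALUES (2, 'Division A', 1);
INSERT INTO units VALUES (3, 'Department A1', 2);
""")

# Walk up the hierarchy from Department A1 to the top of the organisation.
chain = [row[0] for row in conn.execute("""
    WITH RECURSIVE chain(unit_id, parent_unit_id) AS (
        SELECT unit_id, parent_unit_id FROM units WHERE unit_name = 'Department A1'
        UNION ALL
        SELECT u.unit_id, u.parent_unit_id
        FROM units u JOIN chain c ON u.unit_id = c.parent_unit_id
    )
    SELECT unit_id FROM chain
""")]
print(chain)
```

The recursion stops naturally when it reaches the Unit whose parent is NULL, which is exactly the optionality described above.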
6.2.4 Inheritance Inheritance is a very simple and very powerful concept. We can see examples of Inheritance in practice when we look around us every day.
For example, when we think about Houses, we implicitly include Bungalows and Ski Lodges, and maybe even Apartments, Beach Huts and House Boats.
In a similar way, when we discuss Aircraft we might be talking about Rotary and Fixed-Wing Aircraft.
However, when we want to design or review a Data Model that includes Aircraft, then we need to analyse how different kinds of Aircraft are shown in the design of the Data Model.
We use the concept of Inheritance to achieve this. Inheritance is exactly what it sounds like. It means that at a high level, we identify the general name of the Thing of Interest and the characteristics that all of these Things share. For example, an Aircraft will have a name for the type of Aircraft, such as Tornado and it will be of a certain type, such as Fixed Wing or Rotary.
At the lower level of Fixed-Wing Aircraft, an Aircraft will have a minimum length for the runway that the Aircraft needs in order to take off.
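One common way to implement this (a sketch with assumed names and an illustrative runway value; sub-typing can also be done within a single table) is a supertype table for the shared attributes and a subtype table keyed by the same ID:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Supertype : the attributes that all Aircraft share.
CREATE TABLE aircraft (
    aircraft_id   INTEGER PRIMARY KEY,
    aircraft_name TEXT,
    aircraft_type TEXT CHECK (aircraft_type IN ('Fixed Wing', 'Rotary'))
);
-- Subtype : attributes specific to Fixed-Wing Aircraft, keyed by the same ID.
CREATE TABLE fixed_wing_aircraft (
    aircraft_id         INTEGER PRIMARY KEY REFERENCES aircraft (aircraft_id),
    min_runway_length_m INTEGER   -- illustrative attribute; the value below is invented
);
INSERT INTO aircraft VALUES (1, 'Tornado', 'Fixed Wing');
INSERT INTO fixed_wing_aircraft VALUES (1, 900);
""")

# Joining supertype to subtype recovers the full picture for a Fixed-Wing Aircraft.
row = conn.execute("""
    SELECT a.aircraft_name, f.min_runway_length_m
    FROM aircraft a
    JOIN fixed_wing_aircraft f USING (aircraft_id)
""").fetchone()
print(row)
```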
6.2.5 Reference Data Reference Data is very important. Wherever possible, it should conform to appropriate external standards, particularly national or international standards. For example, the International Organization for Standardization (ISO) publishes standards for Country Codes, Currency Codes, Language Codes and so on. For Materiel and Products, NATO has published the National Codification Bureau (NCB) code. This is in use within the UK MOD and is administered from Kentigern House, Glasgow.
We could describe it in these terms :- Customers place Orders for Products of different Types.
The design of this Data Warehouse simply puts all data into a big basket.
6.2.6 Reviewing the Design of a Data Warehouse The design of any Data Warehouse will conform to this pattern, with Dimensions and Facts. Dimensions correspond to Primary Keys in the associated Tables (ie the Entities in the ERD) and the Facts are the derived values that are available.
Therefore, reviewing the Design of a Data Warehouse involves looking for this Design Pattern.
With one exception, the Relationships are optional because the Enquiries need not involve any particular Dimension.
The one exception to this rule is that the Relationship to the Calendar is mandatory because an Enquiry will always include a Date. Of course, an Enquiry might include all data since the first records, but the principle still applies.
- Which Customers ordered the most Products ?
- Which were the most popular Products in the first week of April ?
- What was the average time it took to respond to Orders for Aircraft Engines ?
- How many Orders did we receive in May ?
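The last question maps onto the pattern directly. This sketch (assumed star-schema names, using Python's sqlite3, with invented sample figures) joins the Facts to the mandatory Calendar Dimension:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ref_calendar (
    day_date     TEXT PRIMARY KEY,
    month_number INTEGER
);
CREATE TABLE order_facts (
    fact_id     INTEGER PRIMARY KEY,
    day_date    TEXT NOT NULL REFERENCES ref_calendar (day_date), -- mandatory Calendar link
    order_count INTEGER
);
INSERT INTO ref_calendar VALUES ('2011-05-01', 5), ('2011-05-02', 5), ('2011-06-01', 6);
INSERT INTO order_facts  VALUES (1, '2011-05-01', 3), (2, '2011-05-02', 4), (3, '2011-06-01', 2);
""")

-- = None  # (placeholder removed)
```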
6.3 Applications
Deliveries
Maintenance
The scope of this Data Model is the Maintenance of Assets by Third-Party Companies. The Business Rules state :-
* An Asset can have a Maintenance Contract.
* An Asset consists of Asset Parts.
* Faults occur with these Parts from time to time.
* Third-Party Companies employ Maintenance Engineers to maintain these Assets.
* Engineers pay Visits, which are recorded in a Fault Log.
* They correct the Faults and this is recorded in the Fault Log.
http://www.databaseanswers.org/data_warehouse/index.htm
Many organisations have an EDM which is a great deal more complex than this one. We are using this one because it is very simple and therefore useful for this Chapter, but at the same time it applies to a wide range of complex organisations.
[Diagram: Enterprise Data Model with CUSTOMERS, PRODUCTS / SERVICES, EVENTS, ORGANISATION, SUPPLIERS and DOCUMENTS.]
Examples are :-
- An example of a Document is a Sales Receipt.
- An example of an Event is a Customer making a Purchase using a Gift Card.
- An example of the Organisation is a Store.
- An example of an Organisation is an Associate.
[Diagram: Design Pattern with CUSTOMERS, PRODUCTS and ORGANISATION (eg Staff).]
This Section discusses how the Generic Design Pattern applies to the processing of a Purchase Order (PO). Enter your new Line of Business (LOB) Application here. This could be POS, Inventory Control and so on.
7.4 Create the LOB Business Data Model
7.6.1 Description The first Step is to create the LOB Business Data Model for the new area. This example provides a template. Create your own Business Data Model here, based on a Design Pattern.
[Diagram: LOB Business Data Model, showing :-
- CUSTOMERS (Anonymous)
- Tender Types, eg Gift Cards : Card Nr, Date of Issue, Card Balance, Date last Used
- Retail Transactions : Transaction ID, Transaction Type, Card Nr, Store ID, Staff ID, Date of Transaction, Amount, Adjustment Amount
- ORGANISATION : Staff, Stores
- Transaction Types (Reference data), eg Make purchase, Load money on card, make ...
- Sales Receipts : Tender Type (eg Credit Card), Card Balance (N/A)
These three fields are Generic and provide future-proofing.]
7.6.4 ERWin Data Model for new LOB This is where you design a more formal version of the Data Model shown above.
7.6.5 Define the Transaction Data
7.6.5.1 Structure of Transaction Data There will usually be a table of Transactions data to be migrated and loaded into the Data Warehouse. This table can be used as a Template.

COLUMN NAME      DATA TYPE   LENGTH   DESCRIPTION
Acct             varchar     20       Account (Card Number ?)
TranStatus       varchar     2        Transaction Status
TranRespCode     Int         4        Transaction Response Code
TranType         varchar     50       Transaction Type
CardType         varchar     10       Card Type
Store            Int         4        Store Number
ChainID          varchar     10       Chain ID
Terminal         Int         4        Terminal (the Register or Till)
Invoice          decimal     9        Invoice
StrTranNo        decimal     9        Str Transaction Number
AuthCode         varchar     10       Authorisation Code
TranDate         datetime    8        Transaction Date
Trantime         varchar     10       Transaction Time
Amount           decimal     9        Amount
TranAmount       decimal     9        Transaction Amount
Swipe            varchar     2        Swipe (Cash or Key)
GcUsed_On        varchar     10       GC Used On
Reconcile        varchar     2        Reconcile
Reconcile_info   varchar     100      Reconcile Info
GcTran_Amount    decimal     9        GC Tran Amount
McTran_Amount    decimal     9        MC Tran Amount
Notes            varchar     200      Notes
Reason           varchar     200      Reason
User_Updated     varchar     50       User Updated

Other annotations in the source table : Comments ('For Bulk Activation', 'Generated Register receipt', 'Always generated'), DIM/FACT markers and Ref/Master markers.
7.6.5.2 Define the Transaction Data to be Migrated This table provides a Template. TBC = To Be Confirmed.

DESCRIPTION                 DIM/FACT     REQUIRED Y/N
Action Code                 DIM          Y
Card Number                 DIM          Y
Card Type                   DIM          Y
Chain ID                    TBC          N
GC Used On                  DIM          N (TBC)
Invoice                     TBC          N
Order Status                DIM          Y
Staff ID                    TBC          Y
Store Number                DIM          Y
Swipe                       DIM          N
Till (Terminal)             DIM          N (TBC)
Transaction Status          DIM          Y
Transaction Response Code   DIM          Y
Transaction Type            DIM          Y
Promotion ID                DIM          Y
Notes                       Not used ?   N
Reason                      Not used     N
Reconcile                   DIM          N
Reconcile Info              DIM          N
Swipe                       DIM          N
User Updated                DIM          N (TBC)
User ID                     DIM          N (TBC)
Str Transaction Number      FACT         Y
Authorisation               FACT         N
7.7.2 ERWin Model for Data Warehouse This Data Model shows the first draft of the Data Warehouse.
[ERWin Model: Data Warehouse for Gift Card Data, Barry Williams, June 27th 2011]
- Ref_Calendar_ : Day_Date, Year_Nr, Month_Nr
- Data_Categories_ : Data_Category_Code, Data_Category_Description (eg Gift Card Transactions)
- Cards_ : Card_Number, Card_Details
- EVENTS_ : Event_ID, Event_Details
- Promotions_ : Promotion_ID, Promotion_Details
- Staff_ : Staff_ID, Staff_Details (eg Associate, Buyer, Manager)
- Stores_ : Store_ID, Store_Details
- Data_Warehouse_Facts_jun27th : Fact_ID, Card_Number (FK), Card_Type_Code (FK), Customer_ID, Data_Category_Code (FK), Day_Date (FK), Dimension_Code, Document_ID, Event_ID (FK), Event_Outcome_Code, Event_Type_Code (FK), Order_Status_Code (FK), Promotion_ID (FK), Staff_ID (FK), Store_ID (FK), Tender_Type_Code (FK), Transaction_Date (FK), Transaction_Time (FK), Transaction_Type_Code (FK), Amount, Transaction Amount, GC Tran Amount, MC Tran Amount, plus Totals, Graphs, Trends and Other Derived Figures
- Ref_Card_Type_ : Card_Type_Code, Card_Type_Description
- Ref_Event_Type_ : Event_Type_Code, Event_Type_Description (eg Customer makes Purchase)
- Ref_Order_Status_ : Order_Status_Code, Order_Status_Description
- Ref_Tender_Type_ : Tender_Type_Code, Tender_Type_Description (eg Gift Card - SVC)
- Ref_Transaction_Status_ : Transaction_Status_Code, Transaction_Status_Description
- Ref_Transaction_Type_ : Transaction_Type_Code, Transaction_Type_Description
[Diagram: Data Marts. The Sales Report has columns DATE, REGION, PRODUCT, SALES (000s) and REFUNDS, served by the Month, Region and Product Dimensions.
- Data Mart for Sales Report : Year Number, Month, Region, Product, Sales Amount, Refund Count
- Ref_Year_ : Year_Number
- Ref_Month_ : Month_Number
- DataMart_for_Aging_Report : Year_Number (FK), Month_Number (FK), Card_Count, GC_Balance, MC_Balance, Total_Balance]
The key aspects are the insights and experience. These are specific to a particular problem or area of activity. The way that they are structured and the manner in which they can be searched and enhanced are important for their usability.
KM is built on a Knowledgebase
This diagram shows how a Knowledgebase (KB) can be created to answer Questions about Data Management. It is shown on this page of the Database Answers Web Site :-
http://www.databaseanswers.org/A_Self_Service_Web_Site.htm
A Feedback facility is provided so that the Knowledgebase can be enhanced in response to the experience of the Users. A Personal Workspace makes it possible for specific Users to save their results over a series of sessions. In this way, they can learn from their experiences and the Workspace can be analysed to improve the performance in the future.
http://www.linkedin.com/
There is currently no Group specifically for KM, but there are over 1 million individual Groups, each of which is a KM Group dedicated to a specific area of interest.
8.5 A Data Model for KM This Data Model shows how a Database could be designed to support KM. It shows that the Knowledge is related to Topics, which have Properties and can include Contacts. A typical Topic would be 'How do we do Data Migration ?'
- The accumulation of knowledge, ideas, experience, heuristics and what works
- The structure of Enquiries against the Knowledgebase (KB)
- The design of the KB, which helps enrichment over time
- The ability to tailor the KB to provide prepared results for specific Roles and Questions
- Believing that 'Build it and they will come'
  o KM needs care and feeding, by nurturing and marketing
- Don't create Review Bottlenecks
  o Aim to get quality content reviewed and out quickly
- Link each Case to at least one Solution
- Don't treat KM as a Project
  o It requires an ongoing commitment
http://www.databaseanswers.com/faqs.htm
3) The next level up is Database-driven Content, like this Tutorial on Master Data Management
http://www.databaseanswers.org/pi_best_practice_display/manual.asp?manual_id=BP_MDM
4) The next level up is a dedicated KM System, which can be installed either on premise or in the Cloud. This can be an adaptation of a System for a related purpose, such as Incident Reporting or Trouble-Shooting. Two examples in the MOD would be the HP CASD System or Remedy from BMC. This approach is attractive because it is cost-effective and it provides an opportunity for clarification of the requirements. 5) The next level up is a Web Site on the Internet, such as Knowledge Plaza
http://www.knowledgeplaza.net/
This provides a one-stop shop to store all file types and resources. 6) The top level is a Social Computing approach, which combines everything and offers very user-friendly facilities for people to provide Feedback. This includes Blogs, Wikis, Emails and Social Networks like LinkedIn. This is discussed in a recent book entitled 'The Wisdom of Crowds'.
8.8 Self-Assessment
The Internet offers facilities for organisations to carry out a Self-Assessment to determine how they compare to others in their industry. Here's the Web Link - http://www.myknowledgeiq.com/?src=kat-web-hi Here is the result of the Analysis of a typical Situation :-
Appendix 8.A Specifications for a Task Log System 8.A.1 Introduction
This document is a starting-point for discussion and covers the Database design, Requirements and Screenshots.
8.A.2 Screenshots
8.A.2.1 Home Page This Screenshot shows a draft set of facilities for :-
- Feedback and User Comments on any topic
- Enquiries
- Entering and updating Tasks, Lessons Learned and Requirements
8.A.6.2 Optional
FUNCTION         CASD                     BMC Remedy   Other (TBD)
Processes        Enter, Update
Task_Processes   Generate automatically
Staff details    Enter, Update

8.A.6.3 Reference Data
FUNCTION         CASD                     BMC Remedy   Other (TBD)
Roles            Enter, Update, Delete
Task Outcomes    Enter, Update, Delete
Task Status      Enter, Update, Delete
8.A.4 Data Model for the Task Log
Appendix 8.B. Details of selected Products
8.B.1 Inspire KM Inspire KM focusses on Self-Service and aims to reduce customer support. It offers the following features, which make it stand out from its competition :-
- A completely web-based self-help system
- Categorize information by area, product, etc
- Built-in feedback for customers and staff
- Customers store their own favourites list
- Popular search terms to find help
- Glossary of terms to define technical words
- Active Response System integrates into any website form
- Searchable knowledge items and attachments (Microsoft Office and PDF)
- RSS feeds allow customers to instantly see new knowledge items
8.B.2 Knowledge Plaza Knowledge Plaza is an example of the current generation of Internet-based products that build on what has been learned and also incorporate Web 2.0 features of user involvement and feedback. Knowledge Plaza allows businesses of all sizes to capture what actually matters to them. That's the day-to-day flow of information and knowledge which needs to be saved, retrieved and used to make or support business decisions. Using many of the most talked-about aspects of Web 2.0 (tagging, social networks, RSS, realtime dashboards and social search), Knowledge Plaza enables users to collaborate around information sources and contribute to an ever-growing knowledge base. In search of the great search and customer experience, companies are turning to enterprise search (ES) or enterprise knowledge management (EKM). Both of these solutions are designed to help users search through large bodies of information, but the way in which they do so is fundamentally different and has a profound effect on the user's search experience and success.
8.B.3 OpenKM
OpenKM is an Open Source solution which focusses on Document Management. It supports Search, Thesaurus and Workflow facilities. Here is the link for the Web Site :-
http://www.openkm.com/
- Does the KB support global, multilingual content, including synchronising local content in a master document ?
- Focus of delivery
  o Assisted service or support - phone, web, chat or email
  o Self-Service Support
- Complexity of Questions resolved by the KB
  o Low-level - Issues resolved in 15 minutes
  o Medium - Issues resolved in an hour or less
  o High - Issues resolved in a day or less
  o Very high - not more than 25 issues resolved in a month
http://www.linkedin.com/groups?gid=38815&mostPopular=&trk=tyah
It is clear from these questions that data is useless without Metadata. Data Profiling can involve these kinds of analyses :-
- Domain : whether the data in the column conforms to the defined values or range of values it is expected to take
  o For example, ages of children in kindergarten are expected to be between 4 and 9. An age outside this range would be considered bad data.
- Pattern : a Phone number for Inner London should start 020 7.
- Frequency Counts : if most of our customers are in London, then the largest number of occurrences of City should be London.
- Useful Statistics include :-
  o Minimum value
  o Maximum value
  o Average
- Business Rules for bad data would include :-
  o A Start Date later than an End Date.
  o A Start Date that is almost invariably wrong, such as a Start Date before the organisation was in existence.
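These checks are straightforward to sketch in code. The sample values and thresholds below are illustrative assumptions, not real profiling output:

```python
from collections import Counter

# Illustrative sample values only - not real profiling output.
ages   = [4, 5, 6, 7, 9, 14]              # kindergarten ages, expected range 4-9
phones = ["02071234567", "01611234567"]   # Inner London numbers start 020 7
cities = ["London", "London", "Leeds"]

# Domain check : flag values outside the expected range.
bad_ages = [a for a in ages if not 4 <= a <= 9]

# Pattern check : flag phone numbers without the Inner London prefix.
non_london_phones = [p for p in phones if not p.startswith("0207")]

# Frequency counts and basic statistics.
most_common_city = Counter(cities).most_common(1)[0][0]
age_min, age_max, age_avg = min(ages), max(ages), sum(ages) / len(ages)

# Business Rule : a Start Date must not be later than its End Date.
date_pairs = [("2011-01-01", "2011-03-31"), ("2011-06-01", "2011-05-01")]
bad_date_pairs = [p for p in date_pairs if p[0] > p[1]]   # ISO dates compare as strings
print(bad_ages, non_london_phones, most_common_city, bad_date_pairs)
```

Real profiling tools apply the same kinds of rule over every column of every table and report the exceptions.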
9.2 Types of Tools that can be used, and how
Typical Tools for Data Profiling produce Reports about specific fields in Tables. These Reports look at values of any field in any Table in a Database. The most common products are available for Data Profiling as part of Data Quality. In other words, a common strategy starts with Data Profiling and builds a Metadata Registry based around the results of the kind of analysis obtained from Profiling. In this space, there are four different kinds of Vendors :-
- The major players, such as IBM
- The niche players, such as DataFlux
- The Open Source players, such as DataCleaner and Talend Open Profiler
- In-the-Cloud Vendors, such as Informatica
Vendors in Gartner's Leaders Quadrant for Data Quality include :-
- Data Foundations (a 'Cool Vendor')
- DataFlux
- IBM
- Trillium
This is a representative list of Vendors that provides a good starting-point :-
- Ab Initio Data Profiler Software
- Astera Software
- Business Data Quality Ltd
- Citrus Technology
- Datamartist Profiler
- Datiris Profiler
- MeasureGroup Automated Data Analysis
- SAP BusinessObjects
- Trillium Software
9.3 How are Tools used ? Typically, these Tools are used by someone in an Administrator role, sitting at a PC and running software to analyse Data Sources. He or she will update the contents of the Registry as a result of the analysis. Additional content can be obtained by running queries on the System Catalogues for specific Databases. The Metadata Manager will also enter details of mappings between different Schemas. The diagram below shows Tomorrow's Data Quality Architecture and is taken from a Presentation given by Barry Williams as the Data Architect for the London Borough of Ealing. The title was 'A Strategy for establishing Enterprise Data Quality'. The Data Dictionary could be implemented as a Metadata Registry. DQ Admin could be a Data Architect in the role of Metadata Manager.
OneData MDR from Data Foundations is currently the only Registry which is ISO/IEC 11179-compliant. The price of OneData is several hundred thousand pounds. It provides a useful benchmark against which other options can be evaluated. Here is the Web Link :-
o http://www.datafoundations.com/solutions_metadata_registries.jsp
Another promising initiative is UDEF, about which Wikipedia says :- "The Universal Data Element Framework (UDEF) provides the foundation for building an enterprise-wide controlled vocabulary. It is a standard way of indexing enterprise information that can produce big cost savings. UDEF simplifies information management through consistent classification and assignment of a global standard identifier to the data names, and then relating them to similar data element concepts defined by other organizations. Though this approach is a small part of the overall picture, it is potentially a crucial enabler of semantic interoperability."
Data profiling is the process of examining the data in a database and collecting statistics about that data. The purpose of these statistics may be to :-
o Find out whether existing data can easily be used for other purposes
o Assess the Data Quality
o Assess the challenges involved in integrating data
o Assess whether metadata accurately describes the actual values
o Understand data challenges early on, so that late project surprises are avoided
o Establish an enterprise view of all data, for uses such as Master Data Management
Data Profiling can be a way to involve business users to provide context about the data, giving meaning to columns of data that were previously poorly defined by metadata and documentation.
9.10.2 Data Profiling Tools Here are two Open Source products for Data Profiling :-
Commercial Vendors for data profiling software :-
- Ab Initio Data Profiler Software
- Astera Software
- Business Data Quality Ltd
- Citrus Technology
This section discusses the results of profiling activity, and how they can be stored in such a way that they can be useful and used again in the future. A Metadata Registry can be used to record the agreements reached in discussions. An appropriate Registry Tool can provide controlled access on a Publish-and-Subscribe basis. Metadata can be used in many different ways, and the ROI on the time invested in obtaining Metadata is very high. For example, answering Data Profiling questions such as :-
- Which sources of physical data have the best quality for master data construction ?
- Contention of physical data types across different instances of the same data. For example, names with different lengths can lead to truncated names.
- Multiple meanings being attached to the same data in different data repositories. For example, does everybody agree to a common definition of a Customer or Product ?
Also answering Data Governance questions such as : An overview of sensitive data and an opportunity to ensure that all instances are maintained under appropriate security regimes
9.7 Data Model for MetaData Management This Model shows that Metadata Items have Properties and each Property has a series of Values. For a Date, typical values would be MIN and MAX. These are derived by Profiling and can be used to validate the data.
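The validation step can be sketched as follows. The registry contents mirror the Items-Properties-Values structure; the Item names and MIN/MAX values here are illustrative assumptions:

```python
# Illustrative registry contents - Item names and MIN/MAX values are assumptions.
metadata_registry = {
    "Order.Amount": {"MIN": 0.0, "MAX": 10000.0},
    "Customer.Age": {"MIN": 18, "MAX": 120},
}

def validate(item, value):
    """Check a value against the MIN/MAX Properties recorded for the Item."""
    props = metadata_registry[item]
    return props["MIN"] <= value <= props["MAX"]

# Values outside the profiled range are flagged as potential bad data.
violations = [v for v in (250.0, -5.0, 12000.0) if not validate("Order.Amount", v)]
print(violations)
```

The point of the design is that the validation logic stays generic; only the Properties in the Registry change as Profiling refines them.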
9.8 Integrated Data Model for Knowledge and MetaData Management
This shows an Integrated Data Model that would support both Knowledge and Metadata. It can be used to evaluate the features available in a possible solution.
9.10.1 The Simplified Version This shows that all Information is composed of Facts that are related to a number of other Facts. Facts can be searched or indexed by Tags.
This clarifies the Types of Facts, the possible Formats and Types of Relationships. This Data Model could be used to evaluate the features offered by any possible System solution.
A Top-Level Business Data Model can be created using Microsoft Word and is intended for business users and a non-technical audience.
The other Models referred to in this document will always be created by a Data Modelling Tool such as ERWin or IBM's Rational Rose. They could be described as Conceptual, Logical or Physical Models.
Conceptual Models show the Things of Interest which are in Scope, for example, Organizations and Products. They may or may not include Keys, and will certainly not include physical Data Types, such as the length of character strings. They may include Many-to-Many relationships.
Logical Models will include Primary and Foreign Keys, and often the Modelling Tool will provide a facility to generate a Physical Model from a Logical one.
Physical Models are often close to the actual design of an operational Database. They will always show data types and field lengths.
1.2 Example of a simple Business Data Model This Model was created in Word and shows Organizations, Requisitions and Products. The flow of logic in a Data Model should go from top-left to bottom-right. This means that the more fundamental Things are on the top and to the left.
[Diagram: Organizations, Stores, Requisitions, Products, Products in a Requisition]
A Requisition can ask for an engine to be supplied, like this Hornet Engine for the US Navy :-
This version shows that Organizations and Products each have a hierarchy so that an Organization is part of a higher Organization. Similarly, a Product can be part of a more complex Product.
[Diagram: Organizations, Stores, Requisitions, Products, Products in a Requisition (with hierarchies on Organizations and Products)]
The diagram is shown on this page of the Navy Web Site for the Naval Air Systems Command :-
http://www.navair.navy.mil/index.cfm?fuseaction=home.display&key=OrganizationalStructure
10.2 Draft the Business Rules

Business Rules are valuable because they define, in plain English with business terminology, the underlying relationships between the Terms that appear in a Data Model. The User community will then be able to agree and sign off the Rules. Here is a small example.
Nr    DESCRIPTION
D.1   A Requisition must be raised by a valid Organization. Not every Organization will raise a Requisition.
P.1

TERM          DEFINITION
Requisition   A Request or Requisition for Products to be supplied to the Requesting Organization.
Product       An Asset that can be separately ordered. It can be a Component and a part of a larger Assembly. It can be very small, such as a Washer, or very large, such as a Tornado aircraft. Very large Products can be subject to a separate Requisition Process.
10.4 Check that the Data Model is correct

There may be errors which have a simple explanation, for example, the incorrect use of the Modelling Tool. Any errors should be discussed and resolved with the Modeller and the Users.
This is where the Glossary and Business Rules are very valuable.

10.5 Review with Users

At this point, review the Business Rules and the Glossary with Users and aim to get Sign-Off. Make any necessary changes to format and contents.

10.6 Check Normalised Design

10.6.1 Normalised Design

This discussion applies to Entity-Relationship Diagrams (ERDs) and not to Data Warehouses. We will start by defining the Rules for Normalisation so that we can recognise cases where they have been broken.
The Rules are based on the work of Ted Codd :-
http://www.databaseanswers.org/codds_page.htm
Rule 1 : One of his rules can be summarised as :- The Data in a Table must belong to the Key, the Whole Key and Nothing but the Key, so help me Codd.
This means, for example, that a record in an ORGANIZATIONS Table must contain data only about the Organization, and nothing about people in the Organization, or activities of the Organization. It might include things like the name of the Organization and when the Organization was founded. Check 1 : Can the values of every data item in a table be derived only from the Primary Key ?
Rule 2 : Another of Codd's rules stated that derived data must not be included. For example, the headcount for an Organization would not be included in the Organizations Table because it can be derived by counting the records of members in the Organization. Check 2 : Can any data item be derived from other items ?
Rule 3 : There must be no repeating groups in a Table. The one uncomfortable exception is Addresses. They are very often stored as a number of repeated lines called Address_Line_1, Address_Line_2, and so on. Check 3 : Do any column names repeat in the same table ?
Rule 4 : An item of data must only be in one Table. For example, the name of an Organization would appear only in the Organizations Table. Check 4 : Does the same item of data appear in more than one table ?
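Rule 2 in particular can be illustrated with a short sketch (table names and data are illustrative): the Organizations Table stores nothing derivable, and the headcount is produced on demand by counting Members, so it can never become inconsistent with the underlying records :-

```python
import sqlite3

# A sketch of a normalised design that passes the Checks above.
# Rule 2: headcount is NOT stored in Organizations - it is derived.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Organizations (          -- data about the Organization only (Rule 1)
    organization_id   INTEGER PRIMARY KEY,
    organization_name TEXT,
    date_founded      TEXT);
CREATE TABLE Members (                -- people belong in their own Table
    member_id       INTEGER PRIMARY KEY,
    organization_id INTEGER REFERENCES Organizations(organization_id),
    member_name     TEXT);
""")
con.execute("INSERT INTO Organizations VALUES (1, 'Example Stores Unit', '1966-05-01')")
con.executemany(
    "INSERT INTO Members (organization_id, member_name) VALUES (1, ?)",
    [("Smith",), ("Jones",), ("Brown",)])

# Check 2: the headcount is derivable, so it is stored nowhere.
(headcount,) = con.execute(
    "SELECT COUNT(*) FROM Members WHERE organization_id = 1").fetchone()
print(headcount)  # 3
```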
10.7 Reference Data

10.7.1 Background

A list should be made of the Reference Data referred to in a Data Model. When the list is complete, it should be analysed for consistency. For example, there will not usually be any relationships between the Reference Data. However, if there are any, then they should be sensible and consistent. For example, a Town might be in a County, which would be in a Country. These could all be classified as Reference Data with relationships which should be validated.
Typical Reference Data could include Job Titles and Types of Products. In passing, we should note that Organizations, Job Titles and Products are all examples of hierarchical structures. Job Titles will change only very, very rarely. However, when a hierarchy is stored in a Table which is joined to itself then, of course, the Table will have a Recursive Relationship to itself. Therefore, wherever these occur, we would expect to find Data Models with compact and powerful structures.

10.7.2 Standards

Any appropriate national or international Standards must be considered when values for Reference Data are decided. These include MOD, NATO and ISO standards. For example, ISO publishes standards for Country Codes and NATO maintains standards for Product classification. Therefore any Data Model relating to Products should consider this standard and, where appropriate, the necessary Tables should be added to the Model.
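The Recursive Relationship mentioned above can be sketched as follows (table names and sample data are illustrative): each row in the Organizations Table points to its parent row in the same Table, and the hierarchy is walked with a recursive query :-

```python
import sqlite3

# A sketch of the recursive ('self-join') structure for hierarchical
# data such as Organizations: each row points to its parent row.
con = sqlite3.connect(":memory:")
con.execute("""
CREATE TABLE Organizations (
    organization_id        INTEGER PRIMARY KEY,
    parent_organization_id INTEGER REFERENCES Organizations(organization_id),
    organization_name      TEXT)""")
con.executemany("INSERT INTO Organizations VALUES (?, ?, ?)", [
    (1, None, "Headquarters"),         # top of the hierarchy
    (2, 1,    "Regional Command"),
    (3, 2,    "Logistics Department"),
])

# Walk the hierarchy with a recursive common table expression.
rows = con.execute("""
    WITH RECURSIVE chain(id, name, depth) AS (
        SELECT organization_id, organization_name, 0
        FROM Organizations WHERE parent_organization_id IS NULL
        UNION ALL
        SELECT o.organization_id, o.organization_name, c.depth + 1
        FROM Organizations o JOIN chain c
          ON o.parent_organization_id = c.id)
    SELECT name, depth FROM chain ORDER BY depth""").fetchall()
for name, depth in rows:
    print("  " * depth + name)
```

One Table with one Foreign Key to itself handles a hierarchy of any depth, which is why these structures are so compact.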
Data about Categories and Types often consists of fixed values, but some can change infrequently. For example, a new Aircraft Type was introduced with Unmanned Aircraft. The values then became Fixed-Wing, Rotary and Unmanned. This is an example of Slowly-Changing Data. It highlights the fact that what constitutes Reference Data can be subjective and may be defined differently in Data Models created by different people or organisations.
Check Nr   DESCRIPTION                                                                            GOOD   OK (Y/N) ?
1          Can the values of every data item in a table be derived only from the Primary Key ?    Y
2          Can any data item be derived from other items ?                                        N
3          Do any column names repeat in the same table ?                                         N
4          Does the same item of data appear in more than one table ?                             N
PF, which is shown in the Products_in_a_Requisition Table, stands for Primary and Foreign Key. This is a Primary Key in one Table which is also a link to another Table, where it is also a Primary Key.
We use the concept of Inheritance, where we have Super-Types and Sub-Types. Inheritance is exactly what it sounds like.
It means that at a high level, we identify the general name of the Thing of Interest and the characteristics that all of these Things share.
For example, an Aircraft will have a name for the type of Aircraft, such as Tornado, and it will be of a certain type, such as Fixed-Wing or Rotary.
At the lower level of Fixed-Wing Aircraft, an Aircraft will have a minimum length for the runway that the Aircraft needs in order to take off.
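The Super-Type and Sub-Type structure described above can be sketched like this (the runway length figure and the table names are illustrative assumptions, not data from the book): the Aircraft Table holds what all Aircraft share, and the Fixed_Wing_Aircraft Table shares the same Primary Key and adds the lower-level attributes :-

```python
import sqlite3

# A sketch of Inheritance: Aircraft is the Super-Type, and
# Fixed_Wing_Aircraft is a Sub-Type sharing the same Primary Key.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Aircraft (                        -- Super-Type
    aircraft_id   INTEGER PRIMARY KEY,
    aircraft_name TEXT,                        -- e.g. 'Tornado'
    aircraft_type TEXT);                       -- 'Fixed-Wing' or 'Rotary'
CREATE TABLE Fixed_Wing_Aircraft (             -- Sub-Type: shares the Key
    aircraft_id         INTEGER PRIMARY KEY
                        REFERENCES Aircraft(aircraft_id),
    min_runway_length_m INTEGER);              -- only meaningful for Fixed-Wing
""")
con.execute("INSERT INTO Aircraft VALUES (1, 'Tornado', 'Fixed-Wing')")
# 900 m is an illustrative figure, not a real specification.
con.execute("INSERT INTO Fixed_Wing_Aircraft VALUES (1, 900)")

row = con.execute("""
    SELECT a.aircraft_name, f.min_runway_length_m
    FROM Aircraft a
    JOIN Fixed_Wing_Aircraft f USING (aircraft_id)""").fetchone()
print(row)
```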
However, there is an exception to this, which is when a one-off event can occur that involves a substantial amount of data. In that case, it would not be good to create a large number of fields which will be blank in the large majority of cases.
For example, when a Soldier joins the Army there might be data that is involved only with the joining details. The basic data for the Soldier, such as Date of Birth and Place of Birth, will be part of his or her basic records. If a separate Table exists for Joining Details then it would contain such things as date and place of joining. The Soldiers Table would then have a One-to-One relationship with the Joining Details Table. In other words, it can sometimes be acceptable to see a One-to-One relationship in a Data Model. If that happens, it is necessary to establish the associated Business Rules to clarify the conditions.
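The acceptable One-to-One case can be sketched like this (table and column names, and the sample data, are illustrative): the one-off joining data lives in its own Table, sharing the Soldiers Table's Primary Key, so the Soldiers Table carries no mostly-blank fields :-

```python
import sqlite3

# A sketch of a deliberate One-to-One relationship: one-off joining
# details are split out of the Soldiers Table into their own Table.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Soldiers (
    soldier_id     INTEGER PRIMARY KEY,
    soldier_name   TEXT,
    date_of_birth  TEXT,
    place_of_birth TEXT);
CREATE TABLE Joining_Details (                 -- One-to-One with Soldiers:
    soldier_id       INTEGER PRIMARY KEY       -- the shared Primary Key
                     REFERENCES Soldiers(soldier_id),
    date_of_joining  TEXT,
    place_of_joining TEXT);
""")
con.execute("INSERT INTO Soldiers VALUES (1, 'Pte Atkins', '1989-04-01', 'London')")
con.execute("INSERT INTO Joining_Details VALUES (1, '2008-09-15', 'Catterick')")

row = con.execute("""
    SELECT s.soldier_name, j.date_of_joining
    FROM Soldiers s
    JOIN Joining_Details j USING (soldier_id)""").fetchone()
print(row)
```

Making soldier_id both the Primary Key and a Foreign Key in Joining_Details is what enforces the One-to-One: each Soldier can have at most one Joining Details record.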
[Diagram: Dimensional Model. A central Facts table (Requisition ID, Product ID, Product Type Code, Job Title, Organization ID; Date of Requisition, Products in Requisitions, Organization raising Requisition, Averages, Counts, Totals, KPIs and Dashboards, Other Derived Figures) is surrounded by the Dimensions Calendar (Day Date), Product Types (Product Type Code), Requisitions (Requisition ID), Job Titles (Job Title Code), Products (Product ID) and Organizations (Unit ID).]
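The Dimensional Model above can be sketched as a star schema: a central Facts table whose Foreign Keys point at the surrounding Dimension tables. The column names follow the diagram; the data types, the quantity measure and the sample data are assumptions for illustration :-

```python
import sqlite3

# A sketch of the star schema from the diagram: one central Facts
# table keyed by the Dimensions (data types are assumed).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Calendar      (day_date TEXT PRIMARY KEY);
CREATE TABLE Product_Types (product_type_code TEXT PRIMARY KEY);
CREATE TABLE Products      (product_id INTEGER PRIMARY KEY);
CREATE TABLE Job_Titles    (job_title_code TEXT PRIMARY KEY);
CREATE TABLE Organizations (unit_id INTEGER PRIMARY KEY);
CREATE TABLE Requisitions  (requisition_id INTEGER PRIMARY KEY);
CREATE TABLE Facts (
    requisition_id      INTEGER REFERENCES Requisitions(requisition_id),
    product_id          INTEGER REFERENCES Products(product_id),
    product_type_code   TEXT    REFERENCES Product_Types(product_type_code),
    job_title_code      TEXT    REFERENCES Job_Titles(job_title_code),
    unit_id             INTEGER REFERENCES Organizations(unit_id),
    date_of_requisition TEXT    REFERENCES Calendar(day_date),
    quantity            INTEGER);  -- basis for Averages, Counts and Totals
""")
# A Total per Organization - the kind of derived figure a Dashboard shows.
con.execute("INSERT INTO Organizations VALUES (1)")
con.execute("INSERT INTO Products VALUES (10)")
con.executemany(
    "INSERT INTO Facts (product_id, unit_id, quantity) VALUES (?, 1, ?)",
    [(10, 5), (10, 7)])
(total,) = con.execute(
    "SELECT SUM(quantity) FROM Facts WHERE unit_id = 1").fetchone()
print(total)  # 12
```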
We might say that Naming Standards are nice to have. In other words, they are not essential but they reflect Best Practice.
Appendix 10.A Checklist for Quality Assurance of a Data Model
This Checklist extends the basic concept of the Data Model Scorecard, which was originated by Steve Hoberman. Each Feature is recorded with a Yes/No answer.

Nr 1 : Can the Scope of the Model be defined ? (Essential)
COMMENT : There must be clear alignment of the Model with the business and the User community.

Nr 2 : Can the Requirements be defined ? (Essential)
COMMENT : Requirements must be defined.

Nr 3 : Does the Model meet the Requirements ? (Essential)
COMMENT : The Model must meet the Requirements.

Nr 4 : Is there a comprehensive Glossary of Terms ? (Essential)
COMMENT : Very important that the value of a Glossary is recognised.

Nr 5 : Have comprehensive Business Rules been defined ? (Essential)
COMMENT : Very important that the value of Rules is recognised.

Nr 6 : Normalisation Checks (Good value of Y or N). (Desirable)
1. Can the values of every data item in a table be derived only from the Primary Key ? (Y)
2. Can any data item be derived from other items ? (N)
3. Do any column names repeat in the same table ? (N)
4. Does the same item of data appear in more than one table ? (N)
COMMENT : The Model might be OK even if it fails all of these Checks. The power and flexibility of the Relational Approach makes it possible to handle all sorts of errors and still provide the foundation for an operational Database.

Nr 7 : Is there compliance with any relevant Data Standards ? (Not essential)
If any of the Essential Questions have a No answer, then the Model is not Acceptable. Any No answers to Desirable or Not essential Questions do not affect the acceptability of the Model but mean that it could be improved.
RESULT                            RATING           COMMENT
All Essential Features are Yes    Acceptable
Any Essential Features are No     Not Acceptable
Typical Summary
The results for a typical Model might result in this summary :- Reservations are that the documentation does not demonstrate that the Data Model meets the User Requirements. The Data Model shows some weaknesses, which the supplier has agreed to address.
For a Health Check :- No action is required beyond the presentation of a Report, because the QA is simply to establish the As-Is situation.
For a proposed Application :- It is essential that the Model accurately meets the User Requirements. If it does not, then it must be corrected in discussion with the Users and the Modeller.
For Data Migration :- It is essential that the Model is correct at the detailed level of Tables, Fields and Data Types.
10.B.1.2 Our Conclusions

Our conclusions are that this is not a good Data Model. Reasons include :-
1. It contains Reference Data which is not appropriate at the Top Level.
2. There is no description of the functional area that the Model supports.
Our first activity, therefore, is to produce an equivalent Business Model that we can use as the basis for discussion.
Corrective Actions include :-
1. Create a simple Business Data Model. This should be a Model in Word that does not include Reference Data.
2. Produce a short description.
3. Create a Glossary of Terms.
4. Define the representative Business Rules.
5. Identify the intended Users and the Owners of the Model.
10.B.1.3 Functional Description

In this diagram, arrows point from Children to Parents. The Scope of the Data Model includes Requisitions for Products from Organizations. The Functional Description is a simple one-liner :- Organizations raise Requisitions for Products.
10.B.1.4 A Specific Model in Word The Specific Version is consistent with the Generic Version and looks like this.
[Diagram: Ship, Central Stores, Officer, Department, Requisition, Product]
10.B.1.5 A Generic Model in Word In this diagram, arrows point from Children to Parents. Our Generic Data Model (from Section 1.2) looks like this :-
[Diagram: Units, Stores, Products]
10.B.1.6 A Top-Level Generic Data Model This is a top-level Model that was created using a Data Modelling Tool. It shows useful detail, such as details of the Relationships. It also replaced a Many-to-Many with two One-to-Many Relationships.
This diagram is a generic version of the one below and is useful in providing a higher-level context for lower-level, more specific Models. An additional level of detail shows the 'Rabbit's Ears' relationship that implements hierarchical relationships for Organizations and for Products.
This Model corrects an error in the original Model that we were given. The error was that the original did not show that a Ship has Departments independent of Officers.
Step 10.B.2 Draft the Business Rules These Rules must be phrased in unambiguous English. Where possible, the English should make it possible to implement a Rule in a Data Model. For example, Rule 1 makes it clear that there is a One-to-Many Relationship between a Ship and an Officer.
1. A Ship is staffed with many Officers.
2. A Ship's Departments raise Requisitions.
3. A Requisition must be authorised by an Officer.
4. An Officer is assigned to one Ship at any point in time.
5. An Officer is assigned to one or many Ships during the course of their career.
Templates

Here is a sample Template for Business Rules :-

Nr      RELATES TO                 OWNER      DEFINITION
BR.D.1  Customers, Requisitions    John Doe
BR.D.2  Customers, Requisitions    TBD        A Requisition must be associated with a valid Customer.
BR.D.3  Requisitions, Products     John Doe   A Requisition can refer to one or many Products.
BR.D.4  Requisitions, Products     John Doe   A Product can appear in zero, one or many Requisitions. Therefore, there is a Many-to-Many Relationship between Requisitions and Products.

TERM         DEFINITION
Customer     Any Organization that can raise a Requisition.
Requisition  A request for Assets to be supplied. The format of a request can be an electronic message, a paper Form and so on.
Product      An Asset that can be supplied on request. It can be something small, like a Pencil, or something large, like a Printer. The term Equipment is reserved for major items, such as Printers. The word Asset is used to refer to smaller items, such as Pencils.
Step 10.B.4 Check that the Data Model is correct

The Rules will help in determining whether the Model is correct. In this case, there is an error in that Officers are shown coming in between Ships and Departments. The reality is that Departments exist without Officers. This is corrected in the Top-Level Data Model in Section 1.10.

Step 10.B.5 Review with Users

Review and revise as necessary.

Step 10.B.6 Check Normalised Design

The design looks normalised and therefore is acceptable. The Reference Data looks appropriate and is not related, and therefore is acceptable.

Step 10.B.7 Look for Design Patterns

This Business Model shows these examples of Design Patterns :-
1. a One-to-Many Relationship between Ship and Officer
2. a Many-to-Many Relationship between Requisition and Product
It does not show Inheritance, but in general we would not expect to find it here. There are a number of reasons why Inheritance does not appear, for example :-
1. Inheritance is not appropriate in this case.
2. Inheritance does not show in a Data Model for a physical Database.
Step 10.B.8 Review any Data Warehouses

In this Case Study, this Step is not necessary because we do not have a Data Warehouse or Data Mart.

Step 10.B.9 Check Naming Standards

Standards that are common include :-
1. Initial Capitals with lower case elsewhere, for example, Organization_ID
2. All capitals, for example, ORGANIZATION_ID
3. Lower case everywhere, for example, organization_id
Any of these Standards is acceptable.
It will be necessary to analyse any discrepancies and decide on a standard to resolve them.

Step 10.B.11 Check for Defaults

Default values are a powerful technique for adding values in a Data Model. They can be used to enforce consistency. Probably the most common example is to specify that the date of entry and creation of a new record should be the current System Date. This applies to new Customers, Orders and the Date of any Payment or Adjustment, and so on.

Step 10.B.12 Determine the Assurance Level

Appendix A defines the process to be followed and discusses appropriate remedial follow-up action.
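The default-value technique from Step 10.B.11 can be sketched as follows (the table name and data are illustrative): a DEFAULT clause makes the database itself stamp each new record with the current date, so consistency does not depend on the application remembering to do it :-

```python
import sqlite3

# A sketch of Step 10.B.11: the database supplies the creation date
# by default, enforcing consistency across all new records.
con = sqlite3.connect(":memory:")
con.execute("""
CREATE TABLE Customers (
    customer_id   INTEGER PRIMARY KEY,
    customer_name TEXT,
    date_created  TEXT NOT NULL DEFAULT CURRENT_DATE)""")

# No date is supplied - the DEFAULT fills it in as YYYY-MM-DD.
con.execute("INSERT INTO Customers (customer_name) VALUES ('Acme Ltd')")
(name, created) = con.execute(
    "SELECT customer_name, date_created FROM Customers").fetchone()
print(name, created)
```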