Data Vault
Contents
● What is Data Vault architecture?
■ Satellites
■ Links
■ Hubs
● Which entities in 3NF modeling are the above terms analogous to?
● What is VaultSpeed?
● What are its key features?
● Which sources can VaultSpeed connect to? What acts as input to VaultSpeed: data, metadata, files or snapshots?
● What are the different tiers of VaultSpeed? Is it free or chargeable?
● Which version of VaultSpeed supports Data Vault?
● What are the different target connectors/database links it can support?
● How does VaultSpeed handshake with Matillion? Describe the entire process.
● Describe the part played by each component (VaultSpeed, Matillion & Snowflake) when deploying an end-to-end ETL pipeline into a Data Vault model architecture in the sink database.
● How is CI/CD done during this entire pipeline generation?
● What are the charges attached when integrating VaultSpeed with Matillion?
● How does Data Vault respond to schema changes?
Data Vault is an innovative data modeling methodology for large-scale data warehouse platforms. Data Vault is designed to deliver an Enterprise Data Warehouse while addressing the drawbacks of the normalized (3rd Normal Form) and dimensional modeling techniques. It combines the centralized raw data repository of the Inmon approach with the incremental build advantages of Kimball. Data Vault consists of Hubs, Links & Satellites as its core structures.
Hubs: Hubs are for business keys and are the most important part of the Data Vault methodology. Hub tables don't contain any context data or details about the entity. They only contain the defined business key and a few mandated Data Vault fields.
The basic structure of the Hub table is as follows:
⇒ Mandatory Columns:
● Hub Sequence Identifier (generally a number generated from a database)
● “Business Key” Value (generally a string to handle any data type)
● Load Date (generally a date and time)
● Record Source (generally a string)
⇒ Loading Pattern (a code sketch follows this list)
● Select the distinct list of business keys from the source
● Add timestamp and record source
● Insert into the Hub if the value does not exist
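To make the Hub loading pattern concrete, here is a minimal in-memory Python sketch. The hub_customer table name, customer_no business key and "crm_extract" record source are illustrative assumptions, not part of the standard; a real implementation would run a set-based insert against the Hub table.

```python
from datetime import datetime, timezone

# Minimal in-memory sketch of the Hub loading pattern (illustrative only).
hub_customer = {}   # business_key -> hub row
next_hub_id = 1     # stands in for a database sequence

def load_hub(source_rows, record_source):
    """Insert a Hub row for each distinct business key that is not yet known."""
    global next_hub_id
    load_date = datetime.now(timezone.utc)
    for business_key in {row["customer_no"] for row in source_rows}:  # distinct keys only
        if business_key not in hub_customer:                          # insert only if new
            hub_customer[business_key] = {
                "hub_customer_id": next_hub_id,   # Hub sequence identifier
                "customer_no": business_key,      # business key value
                "load_date": load_date,           # load date and time
                "record_source": record_source,   # record source
            }
            next_hub_id += 1

load_hub([{"customer_no": "C001"}, {"customer_no": "C002"}], "crm_extract")
```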
Links: Links store the intersection of business keys (Hubs). Links can be considered the glue that
holds the data vault model together. Just like the Hub, a Link table structure contains no
contextual information about the entities. It just defines the relationship between business keys
from two or more Hubs.
The basic structure and treatment of the Link table is as follows:
⇒ Mandatory Columns
● Link Sequence Identifier (a database number)
● Load Date and Time (generally a date field)
● Record Source (generally a string)
● At least two Sequence Identifiers (from Hubs or other Links; generally numbers)
⇒ Loading Pattern (a code sketch follows this list)
● Select the distinct list of business key combinations from the source
● Add timestamp and record source
● Look up the Data Vault identifier from either the Hub or the Link
● Insert into the Link if the value does not exist
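Continuing the illustrative example above, the Link loading pattern can be sketched the same way. The hub_order Hub and the customer/order relationship are assumptions for the example; in a real set-based load the Hub lookups would be joins.

```python
from datetime import datetime, timezone

# Minimal in-memory sketch of the Link loading pattern (illustrative only).
# hub_customer / hub_order stand for two Hubs loaded with the pattern shown above.
link_customer_order = {}   # (hub_customer_id, hub_order_id) -> link row
next_link_id = 1

def load_link(source_rows, hub_customer, hub_order, record_source):
    """Insert a Link row for each distinct combination of Hub identifiers not yet known."""
    global next_link_id
    load_date = datetime.now(timezone.utc)
    combos = {(row["customer_no"], row["order_no"]) for row in source_rows}  # distinct combinations
    for customer_no, order_no in combos:
        key = (hub_customer[customer_no]["hub_customer_id"],   # look up the Hub identifiers
               hub_order[order_no]["hub_order_id"])
        if key not in link_customer_order:                     # insert only if new
            link_customer_order[key] = {
                "link_id": next_link_id,          # Link sequence identifier
                "hub_customer_id": key[0],
                "hub_order_id": key[1],
                "load_date": load_date,
                "record_source": record_source,
            }
            next_link_id += 1
```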
Satellites: In Data Vault architecture, a Satellite houses all the contextual details regarding an entity. Satellite tables contain all the descriptive information; they are time-aware, and tracking change over time is their main function. Satellites are always directly related and subordinate to a Hub or a Link, and they provide context and definition to business keys. When there is a change in the data, a new row is inserted with the changed data. These records are differentiated from one another by using the hash key and one of the Data Vault mandated fields: the load_date.
The basic structure and treatment of the Satellite table is as follows:
⇒ Mandatory Columns
● Hub or Link Sequence Identifier
● Load Date
● Load Date End
● Record Source
⇒ Loading Pattern (a code sketch follows this list)
● Select the list of attributes from the source
● Add timestamp and record source
● Compare to the existing applicable set of satellite records and insert when a change has been detected
● Look up and use the applicable Hub identifier or the Link identifier
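A minimal sketch of the Satellite loading pattern, reusing the illustrative hub_customer example above and a hash-based change comparison. The sat_customer name and attributes are assumptions, and end-dating of superseded rows (Load Date End) is omitted for brevity.

```python
import hashlib
from datetime import datetime, timezone

# Minimal in-memory sketch of the Satellite loading pattern (illustrative only).
sat_customer = []   # append-only list of satellite rows

def hash_diff(attributes):
    """Hash the descriptive attributes so a change can be detected with one comparison."""
    payload = "|".join(str(attributes[k]) for k in sorted(attributes))
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

def load_satellite(hub_customer_id, attributes, record_source):
    """Insert a new Satellite row only when the descriptive attributes have changed."""
    new_hash = hash_diff(attributes)
    history = [r for r in sat_customer if r["hub_customer_id"] == hub_customer_id]
    latest = max(history, key=lambda r: r["load_date"], default=None)
    if latest is None or latest["hash_diff"] != new_hash:      # change detected
        sat_customer.append({
            "hub_customer_id": hub_customer_id,                # parent Hub identifier
            "load_date": datetime.now(timezone.utc),
            "record_source": record_source,
            "hash_diff": new_hash,
            **attributes,                                      # descriptive context
        })

load_satellite(1, {"name": "Acme Ltd", "city": "Leuven"}, "crm_extract")
load_satellite(1, {"name": "Acme Ltd", "city": "Leuven"}, "crm_extract")  # unchanged: no new row
load_satellite(1, {"name": "Acme Ltd", "city": "Gent"}, "crm_extract")    # changed: new row added
```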
What problem is Data Vault trying to solve?
Before summarizing the challenges that Data Vault is trying to address, it's worth considering the alternative data modeling approaches and their corresponding data architectures.
Enterprise Data Warehouse: The diagram below shows a potential Enterprise Data
Architecture.
With the EDW approach, data is loaded into a transient Landing Area, after which a series of ETL
processes are used to load data into a 3rd Normal form enterprise data warehouse. The data is
subsequently extracted into dimensional data marts for analysis and reporting.
1. Time to Market: The Enterprise Data Warehouse must first integrate data from each of the
source systems into a central data repository before it’s available for reporting, which adds
time and effort to the project.
2. Complexity and Skill: A data warehouse may need to integrate data from a hundred
sources, and designing an enterprise-wide data model to support a complex business
environment is a significant challenge that requires highly skilled data modeling experts.
3. Lack of Flexibility: A third normal form model tends to model the existing data relationships,
which can produce a relatively inflexible solution that needs significant rework as additional
sources are added. Worse still, over-zealous data modeling experts often attempt to
overcome this by delivering over-complex generic models that are almost impossible to
understand.
Dimensional Design Approach:
The diagram below illustrates a potential data architecture for a classic Dimensional Data
Warehouse design.
The approach above dispenses with the EDW to quickly deliver results to end-users. However, over time, this approach ran into many challenges that became increasingly painful to deal with. These included:
1. Increasing code complexity: The ETL code (Extract, Transform, and Load) was becoming so complicated that it was no longer manageable. Replacing the ETL tool (Informatica) with Oracle scripts helped (as we simplified the solution as we went), but that wasn't the root of the problem. We were trying to restructure the incoming data, deduplicate, clean and conform the data, and apply changing business rules over time. Doing all of these steps in a single code base was very hard indeed.
2. Lack of Raw Data: As the landing area was purely transient (deleted and reloaded each
time), we had no historical record of raw data. This made it difficult for analysts to discover
valuable new data relationships, and the increasing importance of Data Science, which (above
all) needs raw data, was simply ignored.
3. Managing History: As we had no history of raw data and only loaded the attributes needed
for analysis, it became difficult to back-populate additional data feeds.
4. Lineage was challenging: As both the technical and business logic was implemented in
ever-increasing sedimentary layers of source code, it was almost impossible to track the
lineage of a data item from the report back to the source system.
The business loved the initial speed of delivery. However, as time went on, it became
increasingly hard to maintain the pace as the solution became increasingly complex, and
business rules changed over time.
The diagram below shows a potential data architecture used by the Data Vault methodology. While at first glance it looks very similar to the Enterprise Data Warehouse architecture above, it has a few significant differences, which include:
Data Loading: As the data is loaded from the Landing Area into the Raw Data Vault, the process
is purely one of restructuring the format (rather than content) of the data. The source data is
neither cleaned nor modified, and could be entirely reconstructed without issue.
Separation of Responsibility: The Raw Vault holds the unmodified raw data, and the only
processing is entirely technical, to physically restructure the data. The business rules deliver
additional tables and rows to extend the Raw Vault with a Business Vault. This means the
business rules are both derived from and stored separately from the raw data. This separation
of responsibility makes it easier to manage business rule changes over time and reduces
overall system complexity.
Business Rules: The results of business rules, including deduplication, conformed results, and
even calculations are stored centrally in the Business Vault. This helps avoid duplicate
computation and potential inconsistencies when results are calculated for two or more data
marts.
Data Marts: Unlike the Kimball method in which calculated results are stored in Fact and
Dimension tables in the Data Marts, using the Data Vault approach, the Data Marts are often
ephemeral, and may be implemented as views directly over the Business and Raw Vault. This
makes them easier to modify over time and avoids the risk of inconsistent results. If
views don’t provide the necessary level of performance, then the option exists to store results in
a table.
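As a hedged illustration of this design choice, a data mart "dimension" can simply be a view over vault tables. The table and view names below are assumptions, and sqlite3 is used only to keep the example self-contained.

```python
import sqlite3

# Sketch of a data mart delivered as a view over Raw/Business Vault tables
# (illustrative names; not a prescribed layout).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE hub_customer (hub_customer_id INTEGER, customer_no TEXT);
    CREATE TABLE sat_customer (hub_customer_id INTEGER, name TEXT, city TEXT, load_date TEXT);

    -- The 'dimension' is just a view: no data is copied out of the vault.
    CREATE VIEW dim_customer AS
    SELECT h.customer_no, s.name, s.city, s.load_date
    FROM hub_customer h
    JOIN sat_customer s ON s.hub_customer_id = h.hub_customer_id;
""")
rows = conn.execute("SELECT * FROM dim_customer").fetchall()   # query the mart like a table
```

If the view is too slow, materializing the same query into a table remains a drop-in change.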
What is VaultSpeed and what are its key features?
⇒ No-code approach: VaultSpeed is a tool that provides a no-code approach to integrating and modeling data from a multitude of sources and technologies.
⇒ Load data efficiently, manage history and handle change: VaultSpeed facilitates data warehouse automation with its capability to load data efficiently, manage history and handle change. It also supports emerging data architectures, such as the data hub and data mesh.
⇒ Build and scale quickly: VaultSpeed is cloud native, enabling it to build and scale more
quickly and efficiently.
What sources can VaultSpeed connect to?
⇒ VaultSpeed uses a Java-based application known as the "agent" to initiate connections with the defined data sources.
⇒ The agent has the following tasks:
1. Connect to the sources through JDBC links and collect the metadata from each source. There are two versions of the agent: the standard agent and the extended agent. The standard agent supports only the standard databases, while the extended agent supports a wide range of source systems through JDBC URL formats or cloud applications. The agent stores the metadata in a set of CSV files and sends them to the cloud app (see the illustrative sketch after this list).
2. Download the generated DDL and ETL code from the cloud.
3. Deploy the generated code to the target data warehouse or ETL tool.
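The following Python sketch illustrates the general idea behind step 1 only: collect source metadata over a database connection and stage it as CSV. It is not the VaultSpeed agent's actual code; sqlite3 merely stands in for a JDBC connection, and the file names are arbitrary assumptions.

```python
import csv
import sqlite3

# Generic illustration of metadata harvesting to CSV (not the VaultSpeed agent itself).
def export_table_metadata(db_path, out_csv):
    conn = sqlite3.connect(db_path)
    with open(out_csv, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["table_name", "column_name", "data_type", "is_nullable"])
        tables = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'").fetchall()
        for (table,) in tables:
            # PRAGMA table_info returns (cid, name, type, notnull, dflt_value, pk)
            for _, col, col_type, notnull, _, _ in conn.execute(f"PRAGMA table_info({table})"):
                writer.writerow([table, col, col_type, "NO" if notnull else "YES"])
    conn.close()

export_table_metadata("source.db", "source_metadata.csv")
```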
VaultSpeed supports connecting to these sources: VaultSpeed data sources
⇒ VaultSpeed reads the metadata of the sources or data files after establishing the connection to the source. Metadata is used as the input for VaultSpeed.
Please follow this guide for Git CI/CD deployment with VaultSpeed.
After the initial data load takes place, it is necessary to load the delta between the source and target data at regular intervals. Incremental loading of newly added or modified data can be based on any type of incremental data delivery: CDC, files or source journaling tables. The incremental loading logic takes care of all dependencies and specifics when data is already loaded in the DWH.
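As a generic sketch of how such a delta load can work, the snippet below uses a simple updated_at watermark as a stand-in for CDC, files or journaling tables; it is not the code VaultSpeed generates, and the column names are assumptions.

```python
from datetime import datetime, timezone

# Generic sketch of watermark-based incremental (delta) loading.
last_loaded_at = datetime(2024, 1, 1, tzinfo=timezone.utc)    # persisted high-water mark

def extract_delta(source_rows, watermark):
    """Return only the rows added or modified since the previous load."""
    return [row for row in source_rows if row["updated_at"] > watermark]

source_rows = [
    {"customer_no": "C001", "updated_at": datetime(2023, 12, 30, tzinfo=timezone.utc)},
    {"customer_no": "C002", "updated_at": datetime(2024, 2, 15, tzinfo=timezone.utc)},
]
delta = extract_delta(source_rows, last_loaded_at)             # only C002 is picked up
if delta:
    last_loaded_at = max(row["updated_at"] for row in delta)   # advance the watermark
```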