Data Vault


Matillion + VaultSpeed for Implementing Data Vault

Contents
● What is Data Vault Architecture
■ Satellites
■ Links
■ Hubs
● Which entities in 3NF modeling are the above terms analogous to?
● What is VaultSpeed?
● What are its key features?
● Which sources can VaultSpeed connect to? What acts as input to VaultSpeed: data,
metadata, files, or snapshots?
● What are the different tiers of VaultSpeed? Is it free or chargeable?
● What version of VaultSpeed supports Data Vault?
● What are the different target connectors/database links it can support?
● How does VaultSpeed handshake with Matillion? Describe the entire process.
● Describe the part played by each component (VaultSpeed, Matillion & Snowflake) when
deploying an E2E ETL pipeline into a Data Vault model architecture in the sink database
● How is CI/CD done during this entire pipeline generation?
● What are the charges attached when integrating VaultSpeed with Matillion?
● How does Data Vault respond to schema changes?
Data Vault is an innovative data modeling methodology for large-scale Data Warehouse
platforms. Data Vault is designed to deliver an Enterprise Data Warehouse while addressing the
drawbacks of the normalized (3rd normal form) and Dimensional Modeling techniques. It
combines the centralized raw data repository of the Inmon approach with the incremental build
advantages of Kimball. The core structure of Data Vault consists of Hubs, Links, and Satellites.

Hubs are for business keys. They are the most important part of the Data Vault methodology.
Hub tables don't contain any context data or details about the entity. They only contain the
defined business key and a few mandated Data Vault fields.
The basic structure of the Hub table is as follows:
⇒ Mandatory Columns:
● Hub Sequence Identifier (generally a number generated from a database)
● “Business Key” Value (generally a string to handle any data type)
● Load Date (generally a date and time)
● Record Source (generally a string)

⇒ Loading Pattern
● Select a distinct list of business keys from the source
● Add timestamp and record source
● Insert into the Hub if the value does not exist
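
To make the pattern concrete, here is a minimal Python sketch of the Hub load over in-memory rows. It is purely illustrative: a real implementation would be SQL/ELT generated by your tooling, and the column and source names here are invented.

```python
from datetime import datetime, timezone

def load_hub(hub_rows, source_rows, business_key_col, record_source):
    """Hub loading pattern: distinct business keys, timestamp + record source,
    insert only when the key is not already present."""
    load_ts = datetime.now(timezone.utc)
    existing_keys = {r["business_key"] for r in hub_rows}
    next_id = len(hub_rows) + 1  # stand-in for a database-generated sequence

    # Select DISTINCT business keys from the source feed
    for key in sorted({str(r[business_key_col]) for r in source_rows}):
        if key not in existing_keys:  # insert into the Hub only if it does not exist
            hub_rows.append({
                "hub_id": next_id,
                "business_key": key,
                "load_date": load_ts,
                "record_source": record_source,
            })
            next_id += 1
    return hub_rows

# Two source rows with the same customer number yield a single Hub row
hub_customer = load_hub([], [{"cust_no": 1001}, {"cust_no": 1001}], "cust_no", "CRM")
```
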
Links: Links store the intersection of business keys (Hubs). Links can be considered the glue that
holds the Data Vault model together. Just like the Hub, a Link table structure contains no
contextual information about the entities. It just defines the relationship between business keys
from two or more Hubs.
The basic structure and treatment of the Link table is as follows:
⇒ Mandatory Columns
● Link Sequence Identifier (a database number)
● Load Date and Time (generally a date field)
● Record Source (generally a string)
● At least two Sequence Identifiers (from Hubs or other Links; these are numbers).

⇒ Loading Pattern
● Select a distinct list of business key combinations from the source
● Add timestamp and record source
● Look up the Data Vault identifier from either the Hub or Link
● Insert into the Link if the value does not exist
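
A matching sketch for the Link load, again purely illustrative: the customer/order Hub lookups and column names are invented, and stand in for the identifier lookups a generated ETL job would perform.

```python
from datetime import datetime, timezone

def load_link(link_rows, customer_hub, order_hub, source_rows, record_source):
    """Link loading pattern: distinct business-key combinations, resolved to Hub
    identifiers, inserted only when the combination is not already present."""
    load_ts = datetime.now(timezone.utc)
    existing = {(r["customer_hid"], r["order_hid"]) for r in link_rows}
    next_id = len(link_rows) + 1  # stand-in for a database-generated sequence

    # Select DISTINCT business-key combinations from the source
    for cust_key, order_key in {(r["cust_no"], r["order_no"]) for r in source_rows}:
        # Look up the Data Vault identifiers from the Hubs
        pair = (customer_hub[cust_key], order_hub[order_key])
        if pair not in existing:  # insert into the Link only if it does not exist
            link_rows.append({
                "link_id": next_id,
                "customer_hid": pair[0],
                "order_hid": pair[1],
                "load_date": load_ts,
                "record_source": record_source,
            })
            next_id += 1
    return link_rows

# Hub lookups map business keys to Hub sequence identifiers
links = load_link([], {1001: 1}, {"SO-9": 1},
                  [{"cust_no": 1001, "order_no": "SO-9"}], "CRM")
```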

Satellite: In Data Vault architecture, a Satellite houses all the contextual details regarding an
entity. Satellite tables contain all the descriptive information; they are time-aware, and tracking
change over time is their main function. Satellites are always directly related and subordinate to
a Hub or a Link. They provide context and definition to business keys. When there is a change in
the data, a new row must be inserted with the changed data. These records are differentiated
from one another by utilizing the hash key and one of the Data Vault mandated fields: the
load_date.
The basic structure and treatment of the Satellite table is as follows:

⇒ Mandatory Columns
● Hub or Link Sequence Identifier
● Load Date
● Load Date End
● Record Source

⇒ Loading Pattern
● Select list of attributes from the source
● Add timestamp and record source
● Compare to the existing applicable set of satellite records and insert when a
change has been detected
● Lookup and use the applicable Hub identifier or the Link identifier
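
A sketch of the Satellite load with change detection; the hash-diff approach and the column names are illustrative assumptions, not a prescribed implementation.

```python
import hashlib
from datetime import datetime, timezone

def load_satellite(sat_rows, hub_id, attributes, record_source):
    """Satellite loading pattern: insert a new row only when the descriptive
    attributes for the Hub (or Link) key have changed since the last load."""
    hash_diff = hashlib.md5(repr(sorted(attributes.items())).encode()).hexdigest()

    # Compare against the latest existing Satellite row for this key
    history = [r for r in sat_rows if r["hub_id"] == hub_id]
    latest = max(history, key=lambda r: r["load_date"], default=None)

    if latest is None or latest["hash_diff"] != hash_diff:  # change detected
        sat_rows.append({
            "hub_id": hub_id,
            "load_date": datetime.now(timezone.utc),
            "record_source": record_source,
            "hash_diff": hash_diff,
            **attributes,
        })
    return sat_rows

# A second load with unchanged attributes adds no new row
sat = load_satellite([], 1, {"name": "Acme", "city": "Leuven"}, "CRM")
sat = load_satellite(sat, 1, {"name": "Acme", "city": "Leuven"}, "CRM")
assert len(sat) == 1
```
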
What Problem is Data Vault trying to solve?
Before summarizing the challenges that Data Vault is trying to address, it’s worth considering
the alternative data modeling approach and corresponding data architectures.
Enterprise Data Warehouse: The diagram below shows a potential Enterprise Data
Architecture.

With the EDW approach, data is loaded into a transient Landing Area, after which a series of ETL
processes are used to load data into a 3rd Normal form enterprise data warehouse. The data is
subsequently extracted into dimensional data marts for analysis and reporting.

The most significant disadvantages of this approach include:

1. Time to Market: The Enterprise Data Warehouse must first integrate data from each of the
source systems into a central data repository before it’s available for reporting, which adds
time and effort to the project.
2. Complexity and Skill: A data warehouse may need to integrate data from a hundred
sources, and designing an enterprise-wide data model to support a complex business
environment is a significant challenge that requires highly skilled data modeling experts.
3. Lack of Flexibility: A third normal form model tends to model the existing data relationships,
which can produce a relatively inflexible solution that needs significant rework as additional
sources are added. Worse still, over-zealous data modeling experts often attempt to
overcome this by delivering over-complex generic models that are almost impossible to
understand.
Dimensional Design Approach:
The diagram below illustrates a potential data architecture for a classic Dimensional Data
Warehouse design.

The approach above dispenses with the EDW to quickly deliver results to end-users. However,
over time, many challenges emerged that became increasingly painful to deal with. These
included:

1. Increasing code complexity: The ETL code (Extract, Transform, and Load) was becoming so
complicated it was no longer manageable. Replacing the ETL tool (Informatica) with Oracle
scripts helped (as we simplified the solution as we went), but that wasn’t the root of the
problem. We were trying to restructure the incoming data, deduplicate, clean and conform the
data, and apply changing business rules over time. Doing all these steps in a single code base
was very hard indeed.

2. Lack of Raw Data: As the landing area was purely transient (deleted and reloaded each
time), we had no historical record of raw data. This made it difficult for analysts to discover
valuable new data relationships, and the increasing importance of Data Science, which (above
all) needs raw data, was simply ignored.
3. Managing History: As we had no history of raw data and only loaded the attributes needed
for analysis, it became difficult to back-populate additional data feeds.
4. Lineage was challenging: As both the technical and business logic were implemented in
ever-increasing sedimentary layers of source code, it was almost impossible to track the
lineage of a data item from the report back to the source system.

The business loved the initial speed of delivery. However, as time went on, it became
increasingly hard to maintain the pace as the solution became increasingly complex, and
business rules changed over time.

Data Vault Architecture

The diagram below shows a potential data architecture used by the Data Vault methodology.

While at first glance it looks very similar to the Enterprise Data Warehouse architecture above, it
has a few significant differences, which include:

Data Loading: As the data is loaded from the Landing Area into the Raw Data Vault, the process
is purely one of restructuring the format (rather than content) of the data. The source data is
neither cleaned nor modified, and could be entirely reconstructed without issue.
Separation of Responsibility: The Raw Vault holds the unmodified raw data, and the only
processing is entirely technical, to physically restructure the data. The business rules deliver
additional tables and rows to extend the Raw Vault with a Business Vault. This means the
business rules are both derived from and stored separately from the raw data. This separation
of responsibility makes it easier to manage business rule changes over time and reduces
overall system complexity.
Business Rules: The results of business rules, including deduplication, conformed results, and
even calculations are stored centrally in the Business Vault. This helps avoid duplicate
computation and potential inconsistencies when results are calculated for two or more data
marts.
Data Marts: Unlike the Kimball method in which calculated results are stored in Fact and
Dimension tables in the Data Marts, using the Data Vault approach, the Data Marts are often
ephemeral, and may be implemented as views directly over the Business and Raw Vault. This
makes them easier to modify over time and avoids the risk of inconsistent results. If
views don’t provide the necessary level of performance, then the option exists to store results in
a table.
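
As a rough illustration of the "data marts as views" idea, here is a small Python helper that composes such a view over a Hub and its current Satellite rows; every object and column name in it is invented for the example.

```python
def dimension_view_ddl(view_name, hub, satellite, key_col):
    """Compose a dimension as a view over the Raw/Business Vault instead of a
    persisted table, so mart logic can change without reloading data."""
    return (
        f"CREATE OR REPLACE VIEW {view_name} AS\n"
        f"SELECT h.business_key, s.*\n"
        f"FROM {hub} h\n"
        f"JOIN {satellite} s ON s.{key_col} = h.{key_col}\n"
        f"WHERE s.load_end_date IS NULL  -- current Satellite rows only"
    )

print(dimension_view_ddl("mart.dim_customer", "raw_vault.hub_customer",
                         "raw_vault.sat_customer", "hub_customer_id"))
```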

What is VaultSpeed?


VaultSpeed is a data warehouse automation tool. With VaultSpeed we can automate the
upfront data warehouse design and development on top of Data Vault 2.0 architecture.
VaultSpeed combines automation, Data Vault modeling and cloud-native performance,
all to make data warehouse projects less error-prone and time-consuming.
⇒ Data warehouse automation (DWA) is replacing standard methods for building data
warehouses. It automates the planning, modeling, and integration steps to keep pace with an
ever-increasing amount of data and sources.
⇒ It streamlines the processes to extract, transform and load (ETL) data (full load or
incremental load). It uses auto-mapping and job scheduling to eliminate repetitive steps.

What are VaultSpeed's key features?

⇒ No-code approach: VaultSpeed provides a no-code approach to integrating and modeling
data from a multitude of sources and technologies.
⇒ Load data efficiently, manage history and handle change: it facilitates data warehouse
automation, with a unique capability to load data efficiently, manage history and handle
change. Moreover, it supports upcoming modern data architectures, such as data hub and
data mesh.
⇒ Build and scale quickly: VaultSpeed is cloud-native, enabling it to build and scale more
quickly and efficiently.
Which sources can VaultSpeed connect to?
⇒ VaultSpeed uses a Java-based application known as the “agent” to initiate connections with
defined data sources.
⇒ The agent has the following tasks:
1. Connect to the sources through JDBC links and collect the metadata from each source. There
are two versions of the agent: the standard agent and the extended agent. The standard agent
supports only the standard databases, while the extended agent supports a wide range of
source systems through JDBC URL formats or cloud applications.
The agent stores the metadata in a set of CSV files and sends them to the cloud app.
2. Download the generated DDL and ETL code from the cloud.
3. Deploy the generated code to the target data warehouse or ETL tool.
VaultSpeed supports connecting to these sources: VaultSpeed data sources

⇒ After establishing the connection to a source, VaultSpeed reads the metadata of the source
tables or data files. Metadata is used as the input for VaultSpeed.
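
As a rough, purely illustrative sketch of a metadata-only harvest (not VaultSpeed's actual agent code), the snippet below collects table and column metadata into a CSV file, using SQLite's catalog in place of a JDBC source.

```python
import csv
import sqlite3

def harvest_metadata(conn, out_path):
    """Collect table/column metadata from the source catalog and write it to CSV;
    only this metadata, not the data itself, would be sent on to the cloud app."""
    cur = conn.execute(
        "SELECT m.name AS table_name, p.name AS column_name, p.type AS data_type "
        "FROM sqlite_master AS m JOIN pragma_table_info(m.name) AS p "
        "WHERE m.type = 'table'"
    )
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["table_name", "column_name", "data_type"])
        writer.writerows(cur.fetchall())

# Example against an in-memory database standing in for a JDBC-connected source
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE customer (cust_no INTEGER, name TEXT)")
harvest_metadata(src, "source_metadata.csv")
```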

What are the different tiers of VaultSpeed?


⇒ VaultSpeed doesn't offer a trial or test version. It only has paid licenses; the pricing
structure is below.

What version of VaultSpeed supports Data Vault?


⇒ VaultSpeed works on top of Data Vault architecture only. All versions of VaultSpeed
work with Data Vault.
What are the different target connectors/database links it can support?
⇒ VaultSpeed supports these database links as source & target: VaultSpeed targets

Describe the part played by each component (VaultSpeed, Matillion & Snowflake) when
deploying an E2E ETL pipeline into a Data Vault model architecture in the sink database
⇒ Matillion ETL provides users with data loading and transformation solutions built for
the cloud. With VaultSpeed, ETL jobs are generated in Matillion, allowing users to view the
loading logic in detail.
⇒ VaultSpeed will generate all the code necessary to automatically create jobs in
Matillion ETL to load your data into your Data Vault on Snowflake.
⇒ VaultSpeed's automation engine generates ETL jobs for Matillion and orchestration
code to execute those jobs.
⇒ VaultSpeed will generate all DDL specific to Snowflake and the ETL code for the Matillion
ETL jobs.
⇒ In Run mode, Matillion takes care of data loading from source to target using
VaultSpeed's auto-generated mappings; you can monitor your loads from within the
Matillion interface. Additionally, you can build custom pre-staging and information
mart business logic using Matillion's ETL designer tools, and automate custom business
rules into the solution with custom VaultSpeed Studio automation templates.

Continuous Pipeline Deployment in VaultSpeed with Matillion


VaultSpeed supports deploying your generated code to your database or Git
repositories, or to any other target if you use the custom script option. For automatic
pipeline deployment, VaultSpeed supports three options:
● JDBC connection to automatically deploy the code to a target database.
● Deploy directly to a Git repository.
● Write your own custom script to do the deployment.

Deployment via Git:


To deploy to Git, on the Automatic deployment page, select a generation, click
generate and choose GIT. Then enter a connection name and click deploy.
Make sure to use an RSA key when using key-based authentication, not an OpenSSH-format key.
Also, instead of using ssh-add, you might have to add the key to ~/.ssh/config yourself.
The Git connection has to be added to the connections.properties file of the agent with
the following properties (attach image below):

Please follow this guide: git CI/CD deployment with vaultspeed

What are the charges attached when integrating VaultSpeed with Matillion?
⇒ The VaultSpeed & Matillion integration doesn't have a combined pricing structure;
each product follows its own pricing.
⇒ Matillion is billed in credits based on your service tier, and VaultSpeed is billed
based on your license type.
⇒ In a VaultSpeed & Matillion integration, the cost for all the ETL jobs occurs in Matillion,
and VaultSpeed is charged as per your license. (You can also use VaultSpeed's own ETL
engine, which is included in the VaultSpeed license; that will save you the Matillion
charges.)

How does Data Vault respond to schema changes?


The initial load – the first time a source system is loaded into the Data Vault – copies
the entire set of data. This set can consist of files, source tables, external tables,
database dump files, etc. Initial loads use the INI schema, which is automatically
created. The loading logic for any initial loads is pretty straightforward since there is no
previously loaded data to consider. Initial loads are tailored for substantial loading
volumes.

After the initial data loading takes place, it's necessary to capture and load the delta between
the source and target data at regular intervals. Incremental data loading of the newly
added or modified data can be based on any type of incremental data delivery: CDC,
files or source journaling tables. Incremental loading logic takes care of all
dependencies and specifics when data is already loaded in the DWH.

VaultSpeed incremental loads have a built-in solution to handle loading windows.
Loading windows will vary in size based on the loading frequency. The incremental
loads are tuned for batch and micro-batch loading performance.
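
As an illustrative sketch only (assuming the delta feed exposes a change timestamp; the names and logic are invented, not VaultSpeed's), a loading window can be applied like this:

```python
from datetime import datetime, timezone

def select_delta(source_rows, window_start, window_end):
    """Pick only the changes that fall inside the current loading window."""
    return [r for r in source_rows if window_start < r["change_ts"] <= window_end]

def run_incremental_load(source_rows, last_loaded_until):
    """Close the window at 'now', load its delta, and return the new high-water mark."""
    window_end = datetime.now(timezone.utc)
    delta = select_delta(source_rows, last_loaded_until, window_end)
    # ... apply the Hub/Link/Satellite loading patterns to `delta` here ...
    return window_end
```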
