Keboola Advanced Training - Public PDF

This document provides an overview of Keboola best practices for data transformation and loading. Some key points include: 1. Use a business data model (BDM) to describe the business entities, properties, and relationships in a technology-agnostic way. This provides a shared understanding and supports reuse. 2. Implement a multi-project architecture with staging (L1) and integration (L2) projects to isolate transformations and merge data. 3. Avoid complex transformations in a single phase. Instead, split work into simple atomic queries and reuse intermediate results. 4. Define input/output mappings carefully to reduce load volumes and enable incremental loads based on change detection. 5. Leverage


Keboola Best Practices

Keboola Training Overview


● BDM Methodology
● Keboola Basic Training
● Advanced Training (Best Practices)
● Generic Extractor Introduction
● Component Creation
● Project Health
● API Overview
[Platform overview diagram: data sources (databases, SaaS apps, advertising / RTB, public APIs) are extracted, transformed, and loaded into analysis-ready outputs, data sandboxes, 3rd-party ML and data science tools, and augmentation services, with an app store catalogue of components and a flexible setup for fast time to value.]
Core Architecture (Project Detail)
● Set of microservices interacting together via the Storage component.
● Each component either writes its result to Storage or loads data from it.
● Storage is a separate layer that may be implemented on various technologies (Snowflake, Redshift, Azure, etc.)
All-in-one cloud environment
Component & Component Configuration
Transformation & Transformation Buckets
Transformation Phases & Dependencies
Multi-Project Architecture
● Each department is not limited to a single project. You can stage your data
pipeline into a set of projects that share data
● Consider them as L1 and L2 transformation projects
○ L1 - Minor cleanup data pipelines, isolated transformations forming single tables
○ L2 - Final data pipelines, possibly merging multiple tables from L1 together to form BDM
objects
Business Data Model (BDM)
● Method of describing a business in the language of data
● Independent of the underlying technology
● Defines and describes the “objects”, “properties” and “values” that are key to
the business operation
● Provides:
○ Extendable data model that supports all current and future data
initiatives
○ Unified terminology and business understanding through data across
departments
○ Precursor of multi-project architecture
BDM example: CRM
Multi-Project Architecture
Transformations
● Transformation Script (SQL, R, Python or
OpenRefine backend) which you can use to
manipulate your data.
● Operate in a completely separate provisioned
workspace created for each transformation.
● Independent of storage backends.
● Versioning, complete history + fast rollback
● Sandboxes
○ Separated environments for each user/project
○ Available for all backends

Limits (Soft)
● Sandbox disk space is limited to 10GB.
● Memory is limited to 8GB.
● Maximum runtime - 6 hrs
● Limits are soft -> it is possible to increase the limits
on request per project.
Avoid ALTER SESSION Statements
● Avoid ALTER SESSION statements within transformations, as they may lead
to unpredictable behaviour. Because all transformations in a phase are executed in a
single Snowflake workspace, ALTER SESSION statements may be executed in
an unpredictable order.
● Also, the loading and unloading sessions are separate from your
transformation/sandbox session, so the format may change unexpectedly.
● Using explicit statements instead of a global SESSION parameter also
leads to better readability -> explicit is always better.
● More info in the docs
Avoid ALTER SESSION Statements
Example:

ALTER SESSION SET TIMESTAMP_OUTPUT_FORMAT = 'YYYY-MM-DD HH24:MI:SS';

Alternative:

SET DEF_TIMESTAMP_FORMAT = 'YYYY-MM-DD HH24:MI:SS';

SELECT TO_CHAR("datetime"::TIMESTAMP, $DEF_TIMESTAMP_FORMAT);
Set Timezone Explicitly
Be careful when working with timestamps. TIMESTAMP casting may convert the date to the local
timezone of the worker. It is always better to convert explicitly to the required
timezone and to use TIMESTAMP_TZ when you need the timezone information.

Dangerous:

ALTER SESSION SET timezone = 'Asia/Bangkok';

Better:

SET DEF_TIMEZONE = 'Asia/Bangkok';

SELECT CONVERT_TIMEZONE($DEF_TIMEZONE, '2019-03-05 12:00:00 +02:00'::TIMESTAMP_TZ);
Avoid SELECT * Statements
● Generic SELECT * statements should be avoided and columns always listed
explicitly.

Reasoning

● Using * statements makes it difficult to control the effects of an upstream
structural change on the input (e.g. a column being added).
● More lines of code are traded for clarity - it is clear directly from the code
which columns are being used.
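A minimal sketch of the difference, using hypothetical table and column names:

-- Avoid: silently changes shape when "orders" gains or loses columns
SELECT * FROM "orders";

-- Prefer: the consumed columns are explicit and reviewable
SELECT
    "id",
    "customer_id",
    "amount",
    "created_at"
FROM "orders";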
Transformation Phases vs. Dependencies
Phases

● Each phase executes its transformation code in a separate workspace
● Phases ensure transformation execution order
● Phases are to be used when you absolutely need to export the result into Storage before
another group of tasks processes the outputs, typically when you are combining different
transformation backends within a transformation bucket, e.g. preparing data sets using SQL so
that they can be processed by a Python script.

Dependencies

● Define the order of transformation execution within a single phase
● Use a single Snowflake workspace
○ Performance advantages - I/O is done only once
○ All objects created within the phase are visible from all transformations involved
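A minimal sketch of sharing objects within a single phase, with hypothetical table names - a table created by one transformation can be read directly by a dependent transformation in the same workspace, without a round trip through Storage:

-- Transformation A (runs first within the phase)
CREATE TABLE "stg_orders_clean" AS
SELECT "id", "customer_id", "amount"::NUMBER(12,2) AS "amount"
FROM "orders"
WHERE "status" <> 'cancelled';

-- Transformation B (depends on A, same phase, same workspace)
CREATE TABLE "out_customer_totals" AS
SELECT "customer_id", SUM("amount") AS "total_amount"
FROM "stg_orders_clean"
GROUP BY "customer_id";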
Transformation Phases vs. Dependencies

Running in workspaces
Phases vs. Dependencies: Best Practice
General rule: we suggest using mainly dependencies in order to ensure all
necessary transformations happen when you trigger a single one manually
within the bucket.

Reasoning

● Using dependencies instead of a phase enables execution parallelisation
and parallel I/O loads, which in turn provides better performance.
● Objects can be shared across transformations, allowing better code
segmentation.
SQL Dep Feature
Keboola Connection provides integration with the SQL Dep service.

This service is very useful for analysing transformation queries; it
allows exploring the relationships between objects (tables, columns,
queries) created within a transformation.
Avoid Complex Nested Queries
● Split complex nested queries into as many atomic pieces as possible.

Reasoning:

● Better code readability -> easier maintenance.
● Complex nested queries may easily exceed the DWH query execution time
limit. Using multiple simple queries helps to avoid such issues.
● You can reuse those “temp” tables in multiple queries within the same phase.

Also read: https://multithreaded.stitchfix.com/blog/2019/05/21/maintainable-etls/
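A minimal sketch of the idea, with hypothetical table names - instead of one deeply nested query, each step becomes its own small table that later queries in the same phase can reuse:

-- Step 1: filter once, keep only the columns you need
CREATE TABLE "tmp_events_filtered" AS
SELECT "user_id", "event_type", "created_at"
FROM "events"
WHERE "created_at"::TIMESTAMP >= '2019-01-01';

-- Step 2: aggregate the intermediate result (reusable by other queries too)
CREATE TABLE "tmp_daily_counts" AS
SELECT "user_id", DATE_TRUNC('day', "created_at"::TIMESTAMP) AS "day", COUNT(*) AS "events"
FROM "tmp_events_filtered"
GROUP BY "user_id", DATE_TRUNC('day', "created_at"::TIMESTAMP);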


Variables
TRUNCATE and INSERT
Case statements -> Mapping table
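One of these techniques, replacing long CASE statements with a mapping table, might look like this sketch (table and column names are hypothetical):

-- Instead of a long, hard-to-maintain CASE expression...
-- CASE "country" WHEN 'CZ' THEN 'EMEA' WHEN 'DE' THEN 'EMEA' WHEN 'US' THEN 'AMER' ... END

-- ...keep the mapping as data and join it:
CREATE TABLE "map_country_region" AS
SELECT * FROM VALUES ('CZ', 'EMEA'), ('DE', 'EMEA'), ('US', 'AMER') AS t ("country", "region");

SELECT o."id", m."region"
FROM "orders" o
LEFT JOIN "map_country_region" m ON o."country" = m."country";

In practice the mapping table would usually live in Storage and arrive through the input mapping, so it can be maintained without touching the SQL.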
Defining Input / Output Mapping
Defining Input / Output Mapping
● Declares what input data from Storage is used within the
transformation and gets copied into the secure, separate workspace where
the transformation runs, and what data is collected back.
● There are several techniques that allow additional control over I/O
mappings and enhance the performance of the I/O stage.

I/O Features:

● Table import mode
● Filters
● Data-types, cleaning
Copy Table Mode
Default mode of loading tables that allows definition
of filters:

Column filter:
choose a subset of columns - reduces load time and
volume transferred

Incremental load:
Define a “Changed in last” interval to load only data that
has been changed in the specified period.

● Not defined on a time dimension within the data
itself. Uses table metadata to bring in only rows
that have been updated.
Filters

When using ‘Copy Table’ mode it is possible to define
additional filters and data-type conversions prior to load.

Data Filter: A simple filter on a particular attribute - a list of
values that should be included / excluded.

Data types: Define the data types of input columns. By default
everything is VARCHAR, but it is possible to do the casting
at the input level.

● Note that using this feature might hide the datatype
conversion from the code and reduce its clarity.
Clone Table Mode
Advanced mode, leveraging the zero-copy clone
functionality of the Snowflake DWH.

● This mode provides a significant performance
boost when loading extremely large tables in full,
or for speeding up transformations with a large
number of tables on the input.
● Note that when using this mode no filters can be
set up and the table is loaded as is from the
Storage.
● Using the `Clone Table` option only provides a
performance enhancement; it does not affect the
credits consumed.
Incremental vs. Full Load
● An incremental data flow processes only data changed since the last successful run
and adds it to the previously processed data.
● A full data flow processes everything every day.
● Keboola behaves as follows:
○ Full load
○ Incremental with PK - Type 1 slowly changing dimension (SCD)
○ Incremental without PK - Type 2 SCD
○ Note: There are ways to have Type 3 SCD, either by utilizing a
transformation or a custom component
■ such as Table Snapshot
■ leochan.event_snapshotting
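As a rough illustration (not Keboola's actual implementation), an incremental load with a primary key behaves approximately like a Snowflake MERGE on that key; table and column names here are hypothetical:

MERGE INTO "customers" AS target
USING "customers_increment" AS source
    ON target."id" = source."id"
WHEN MATCHED THEN UPDATE SET
    "email" = source."email",
    "updated_at" = source."updated_at"
WHEN NOT MATCHED THEN INSERT ("id", "email", "updated_at")
    VALUES (source."id", source."email", source."updated_at");

Without a primary key the increment is simply appended to the existing rows.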
Automatic / Manual Incremental Load
Automatic Incremental Load
- Append all data that has been added or changed since the last successful run. If a
primary key is specified, updates will be applied to rows with matching primary
key values.

Manual Incremental Load
- Append all selected data. If a primary key is specified, updates will be applied to
rows with matching primary key values.

(Writers only at this point)

Demo
Break Time
Orchestration Notifications

Set up notifications of orchestrations to a group chat

a. Full transparency
b. Faster reaction time
Event-triggered Orchestration
Trigger your orchestrations based on source
data changes.

● Cooldown period: limits the number of
triggers within a given period, e.g. max
once per 5 min
● All tables have to be changed for the trigger
to apply

What is a change?

● Any table import, even if it does not
contain any new data.

TIP: Use artificial “state” tables to control the triggers.
These may contain some additional run info.
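A hedged sketch of the “state” table trick (names are hypothetical): the producing pipeline writes a small run-info table at the end of its run, and the downstream orchestration's trigger watches this table instead of the raw source tables.

-- Output-mapped to Storage at the end of the producing pipeline;
-- importing it is the "change" that fires the downstream trigger.
CREATE TABLE "pipeline_state" AS
SELECT
    CURRENT_TIMESTAMP()                        AS "finished_at",
    (SELECT COUNT(*) FROM "orders_increment")  AS "rows_loaded",
    'daily_orders_load'                        AS "pipeline";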
Storage
● Storage is the central KBC component managing everything related to storing data and
accessing it.
● Implemented as a layer on top of various database engines / storage services
○ Snowflake
○ S3
○ Redshift, Azure

Table Storage
● Stores tabular data created by components or table imports
● [Currently] based on Snowflake backend
● Easy backups and restorations - Snapshots, “Timetravel” restores
● No limitation by data types; semi-structured data formats natively supported by Snowflake
(JSON, Avro, ORC, Parquet, or XML)
● Organised into “buckets” and tables; can be shared across projects / organisations
● Can be accessed only via the Storage API or by interacting with components.

File Storage
● Can be used to store an arbitrary file (e.g. R models)
● Every data load is stored in Files before it is processed and pushed into a table.
In Complete Control
● Any interaction with Storage is logged
● Every payload, result of a transformation or component is stored in Files before pushing to Storage
● Every Storage job is linked to the actual component job via RunID (searchable in Jobs tab)
Data Lineage
Storage Columns Descriptions
- Columns in Storage now
support Markdown language
for text descriptions, similar to
other Keboola components

Demo
Data Types
● Everything is treated as a String in Storage
○ More flexibility when loading data
○ A datatype can be defined as metadata on a column in
Storage
○ Datatypes are also transferred from the DB
extractors
● Generally, data types are needed on the output
○ Define datatypes on columns in the output stage -> they
will be transferred to any new writer configuration
○ No need to define them in transformations, except
for some particular cases, e.g. for date operations:
DATEADD(day, 2, "date_col"::DATE).
○ Snowflake infers datatypes, so SUM(value) works
even if it is a String
○ It is recommended to cast explicitly in the code
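A short illustration of the last two points, using a hypothetical table - relying on Snowflake's implicit conversion works, but an explicit cast documents the intent:

-- Works thanks to implicit conversion, but hides the intent:
SELECT SUM("value") FROM "sales";

-- Preferred: cast explicitly in the code
SELECT SUM("value"::NUMBER(12,2)) AS "total_value" FROM "sales";

SELECT DATEADD(day, 2, "date_col"::DATE) AS "date_plus_two" FROM "sales";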
Override Manual Data Types
- Data types set by extractors can now be overridden by the user, by adding a
new row to the “data type” of the column.
- The source data type will still exist for lineage and audit purposes, but the override data
type will take precedence.

Demo
Defined Data Types
- Metadata for data types in Storage is now exposed in the Storage UI
- Setting data types in Storage is only for metadata purposes; the underlying
Keboola Storage data is still stored as VARCHAR
- Impact:
- By setting the data type, your transformations will automatically set the input mapping of
the tables to the selected data type
- If the column contains data type errors, your transformations will fail in the input phase

More info

Demo
Storage Table Operations
Delete full table
● Use with caution! Don’t do this just to replace a
table.
Remove or add primary key
● May cause failure of dependent transformations
(check I/O mappings)
Create column
● May break backward dependencies.
Delete column
● May break forward dependencies.
Change column datatype
● Will be inherited by any new transformation or
writer, make sure the data is valid for the datatype.

NOTE: Always be careful when executing any of these operations and
refer to the Graph view to check any affected dependencies.
Chained Aliases
- An alias can be created from
another alias.
- Also aliases created in Shared
Buckets are propagated to linked
buckets and can be further aliased.
- Previously, this feature did not allow
chained aliases or sharing aliases
across projects
Metadata Layer
Conceptually, these types of metadata are collected or generated:
● The catalog information of datasets, such as schema structure, business description, business
context, dataset business taxonomy, location, responsibility/ownership information, PII tags, etc.
● Operational metadata, which includes job and execution information, such as timestamps,
components reading from / writing into the dataset, etc.
● Lineage information metadata, which is the connection between components (jobs) and datasets.
● Data profiling statistics to reveal the column-level and set-level characteristics of the dataset.
This can be further leveraged for data QA analysis.
Metadata collection and generation:
● Storage API
● System-collected information - operational metadata
● Generated data - Data profiling, data lineage
How to NOT use Keboola Storage
Snowflake, like other analytical databases, is not suitable for the following cases:

● Transactional workloads (OLTP)
● Key-value access with a high request rate
● Blob or document storage
● Over-normalized data

Additionally, no external access to Keboola Storage is allowed, due to the
metadata layer used on top of the DB.
Use Input Mapping as Health Check
Using Sandboxes
● Safe environment for you to explore, analyze and experiment with copies of
selected data
● Place to troubleshoot and develop transformation scripts without
modifying the actual data
● Unique sandbox per user
● Jupyter, SQL, R
● Online interface & Remote connection credentials
Using Sandboxes
Code Templates
- Save a Jupyter notebook in File Storage with a predefined tag.
- If a sandbox is loaded from a transformation, the transformation code will be appended after the template code.
Code Templates (cont.)
- Similar feature to the Jupyter notebook templates, for RStudio
- (_r_sandbox_template_)
- Also the ability to create personal templates

More information:
https://help.keboola.com/manipulation/transformations/sandbox/#code-templates
Markdown Descriptions
● Each component and Storage object can be described with Markdown text
● Important for team work
● Enables “jump-in & fix” work style
● Utilize markdown formatting for text readability (headers, bullet points, etc.)
● Recommended content:
○ Storage: Data structures & types, foreign keys, etc. PII/Personal Data columns,
internal/temp table vs. production
○ Transformations: Function in the dataflow in broad terms, inputs, outputs, TODO section,
unknowns or known quirks to signal possible issues in the future, hardcoded values
○ What happens within the transformation: how you got from A to B and what the logical
steps of the transformation are.
Descriptions in Markdown
Descriptions in Markdown
Versioning
● A configuration change of any
component (that means
extractors, transformation buckets,
writers and orchestrations) -> a new
version of the whole configuration
● Available in the UI and API
● Easy RESTORE or FORK
● Tip: Use FORK functionality to make
a quick copy of
configuration/transformation
bucket to avoid manual work
Versioning Improvements
- Save with descriptions
- Transformations UI
Avoiding Tedious Work
Keboola Logs Everything
● Component and Transformation runs
● Versioning
● Storage events
● Tip: Storage imports
Extensive logging
Every Storage Job contains detail with all information about
the execution.

● Run ID referencing the actual component job
● Link to the affected table
● Timestamps, duration and transfer size
● File ID as a reference to the actual payload in Files
storage
● Type of load: incremental/full
● Total number of rows imported
● Other results or warnings
Data Retention & Table Snapshots
● Table Snapshots
● All changes saved
○ default data retention range is 7 days
● Ability to trigger ad-hoc snapshot
○ Manually
○ API call
● No need for date and user name

● Component Trash
○ Easy configuration restore
Use the Weapon of Choice
Components
Components

● Extractor – allows customers to get data from new sources. It brings in data from external
sources (usually an API) rather than processing input tables.
● Application – further enriches the data or adds value in new ways. It processes input tables stored as CSV
files and generates result tables as CSV files.
● Writer – pushes data into new systems and consumption methods. It does not generate any data in the KBC
project.
● Processor – adjusts the inputs or outputs of other components. It has to be run together with one of the
above components.
● All components are run using the Docker Runner
● All components must adhere to the common interface
● The list of all available components can be seen at https://components.keboola.com/

Generic extractor

● Generic Extractor is a KBC component acting like a customizable HTTP REST client.
● It can be configured to extract data from virtually any API and offers a vast number of configuration options.
● An entirely new extractor for KBC can be built in less than an hour.
Extractors
Extractor types
● Database extractors: SQL Databases and NoSQL MongoDB
● Communication, Social Networks and Marketing and Sales extractors
● Other extractors such as Geocoding-Augmentation or GoogleDrive
● Keboola-provided or 3rd-party

Generic Extractor

● Very flexible; supports various types of authentication, pagination, and nested objects. Fits almost all REST-like
services.
● Supports iterations (run with multiple parameters)
● Functions
○ allow you to add extra flexibility when needed.
○ Can be used in several places of the Generic Extractor configuration to introduce dynamically generated
values instead of those provided statically.
● Incremental load, remembers last state
● Can be set up and used as a standalone component
A custom component
Common interface
● Predefined set of input and output folders for tables and files,

● a configuration file,
○ Custom-defined JSON injected into the Docker container
● environment variables and return values.

Other Features

● Logging, manifest files for working with table and file meta-data
● the OAuth part of the configuration file, and
● actions for quick synchronous tasks.
● Docker Runner provides tools for encryption and OAuth2 authorization.
● Custom Science apps (legacy)
● Generic UI
● Processors
● Easy to set up CI workflows - Travis, Quay, Bitbucket Pipelines
Processors
Processor is a special type of component which may be used before or after
running an arbitrary component (extractor, writer, etc.).

● They perform an atomic action on the data before it is loaded into the
component (writer) or into Storage.
● They may be plugged into any component.
○ Some components provide UI support for processors; otherwise
they need to be added to the configuration via an API call.

A processor would typically be used with a component like the S3 Extractor, which
also uses processors in the background. The main advantage is that you do
not need to write any additional code or customize an existing application.

Use case: Let’s say you receive a CSV report generated by a legacy system via
S3 and the report contains a 10-line header before the actual data =>
such a CSV file is invalid and would fail when loaded directly into
Storage.

Solution: Plug in a skip-lines processor that would skip the first X rows and
convert the CSV file to a valid form.
Integration - technical intro
Integrate KBC with other systems.

● Use KBC just to exchange data (using the Storage API).
● Use KBC as a data-handling backbone for your product.
● Wrap KBC in your own UI for your customers.
● Control the whole data processing pipeline within KBC from the
outside.
● Control any component of KBC programmatically.
Docker runner (behind the scenes)
