Keboola Advanced Training - Public PDF
[Platform overview diagram: data sources (SaaS apps, advertising / RTB) -> extract, transform, load -> analysis-ready data; flexible setup, fast time to value]
Core Architecture (Project Detail)
A set of microservices interacting together via the Storage component.
Limits (Soft)
● Sandbox disk space is limited to 10GB.
● Memory is limited to 8GB.
● Maximum runtime is limited to 6 hours.
● Limits are soft -> they can be increased on request per project.
Avoid ALTER SESSION Statements
● Avoid ALTER SESSION statements within transformations, as they may lead
to unpredictable behaviour. Because transformation phases are executed in a
single Snowflake workspace, ALTER SESSION statements may run in an
unpredictable order.
● The loading and unloading sessions are also separate from your
transformation/sandbox session, so the format may change unexpectedly.
● Using explicit statements instead of a global SESSION parameter also
leads to better readability -> explicit is always better.
● More info in the docs
Avoid ALTER SESSION Statements
Example: an ALTER SESSION statement setting a global timestamp format (see the sketch below).
Alternative:
SELECT TO_CHAR("datetime"::TIMESTAMP, $DEF_TIMESTAMP_FORMAT);
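A minimal sketch of both approaches, assuming a hypothetical "my_table" with a text "datetime" column (the table name and the exact format string are illustrative, not from the slides):

-- Risky: a session-wide setting silently affects every later statement
-- executed in the shared transformation workspace
ALTER SESSION SET TIMESTAMP_OUTPUT_FORMAT = 'YYYY-MM-DD HH24:MI:SS';
SELECT "datetime"::TIMESTAMP FROM "my_table";

-- Explicit: the format is declared as a variable and visible right where it is used
SET DEF_TIMESTAMP_FORMAT = 'YYYY-MM-DD HH24:MI:SS';
SELECT TO_CHAR("datetime"::TIMESTAMP, $DEF_TIMESTAMP_FORMAT) FROM "my_table";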
Set Timezone Explicitly
Be careful when working with timestamps. A TIMESTAMP cast may convert the date to the local
timezone of the worker. It is always better to convert explicitly to the required timezone and to
use TIMESTAMP_TZ when you need to keep the timezone information.
Dangerous: an implicit TIMESTAMP cast.
Better: an explicit timezone conversion (see the sketch below).
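A minimal sketch of the difference, assuming a hypothetical "events" table with a text "created_at" column; the timezone names are illustrative:

-- Dangerous: the implicit cast may be resolved in the worker's local timezone
SELECT "created_at"::TIMESTAMP AS "created_ts"
FROM "events";

-- Better: name the source and target timezones explicitly
-- (and use TIMESTAMP_TZ downstream when the offset itself must be kept)
SELECT
    CONVERT_TIMEZONE('UTC', 'Europe/Prague', "created_at"::TIMESTAMP_NTZ) AS "created_local"
FROM "events";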
[Reasoning diagram: dependencies vs. running in workspaces]
Phases vs. Dependencies: Best Practice
General rule: we suggest relying mainly on dependencies, so that all
necessary transformations run when you trigger a single one manually
within the bucket.
I/O Features:
Column filter: choose a subset of columns - reduces load time and the volume of data transferred.
Incremental load: define a "Changed in last" interval to load only data that changed in the specified period.
Demo
Break Time
Orchestration Notifications
a. Full transparency
b. Faster reaction time
Event-triggered Orchestration
Trigger your orchestrations based on source
data changes.
What is a change?
Table Storage
● Stores tabular data created by components or table imports
● [Currently] based on Snowflake backend
● Easy backups and restorations - Snapshots, “Timetravel” restores
● No limitation by data types; semi-structured data formats natively supported by Snowflake
(JSON, Avro, ORC, Parquet, or XML)
● Organised into "buckets" and tables, can be shared across projects/organisations
● Can be accessed only via the Storage API or by interacting with components.
File Storage
● Can be used to store arbitrary files (e.g. R models)
● Every data load is stored in Files before it is processed and pushed into a table.
In Complete Control
● Any interaction with Storage is logged
● Every payload and every result of a transformation or component is stored in Files before being pushed to Storage
● Every Storage job is linked to the actual component job via RunID (searchable in Jobs tab)
Data Lineage
Storage Columns Descriptions
- Columns in Storage now support the Markdown language for text descriptions, similar to other Keboola components
Demo
Data Types
● Everything is treated as a String in Storage
○ More flexibility when loading data
○ The datatype can be defined as metadata on a column in Storage
○ Datatypes are also transferred from the DB extractors
● Generally, data types are needed on the output
○ Define datatypes on columns in the output stage -> they will be
transferred to any new writer configuration
○ No need to define them in transformations, except for some particular
cases, e.g. date operations: DATEADD(day, 2, "date_col"::DATE)
○ Snowflake infers datatypes, so SUM(value) works even if it is a String
○ It is recommended to cast explicitly in the code (see the sketch below)
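A minimal sketch of explicit casting inside a transformation, assuming a hypothetical "orders" table whose columns all arrive from Storage as VARCHAR (table and column names are illustrative):

-- Cast explicitly where the type matters instead of relying on Snowflake's inference
SELECT
    "order_id",
    "amount"::NUMBER(38, 2)             AS "amount",    -- explicit numeric cast
    DATEADD(day, 2, "created_at"::DATE) AS "due_date"   -- date arithmetic needs a DATE
FROM "orders";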
Override Manual Data Types
- Data types set by extractors can now be overridden by the user by adding a
new row to the column's "data type".
- The source data type is kept for lineage and audit purposes, but the override
data type takes precedence.
Demo
Defined Data Types
- Metadata for data types in Storage is now exposed in the Storage UI
- Setting data types in Storage is only for metadata purposes; the underlying
data in Keboola Storage is still stored as VARCHAR
- Impact:
- By setting the data type, your transformations will automatically set the input mapping of
the tables to the selected data type
- If the column contains data type errors, your transformations will fail in the input phase
More info
Demo
Storage Table Operations
Delete full table
● Use with caution! Don't do this just to replace the
table.
Remove or add primary key
● May cause failure of dependent transformations
(check I/O mappings)
Create column
● May break backward dependencies.
Delete column
● May break forward dependencies.
Change column datatype
● Will be inherited by any new transformation or
writer, make sure the data is valid for the datatype.
More information:
https://help.keboola.com/manipulation/transformations/sandbox/#code-templates
Markdown Descriptions
● Each component and storage can be described with Markdown text
● Important for team work
● Enables “jump-in & fix” work style
● Utilize markdown formatting for text readability (headers, bullet points, etc.)
● Recommended content:
○ Storage: Data structures & types, foreign keys, etc. PII/Personal Data columns,
internal/temp table vs. production
○ Transformations: Function in the dataflow in broad terms, inputs, outputs, TODO section,
unknowns or known quirks to signal possible issues in the future, hardcoded values
○ What happens within the transformation, how you got from A to B, and what the logical
steps of the transformation are.
Descriptions in Markdown
Versioning
● A configuration change of any component
(extractors, transformation buckets,
writers and orchestrations) -> new
version of the whole configuration
● Available in the UI and API
● Easy RESTORE or FORK
● Tip: use the FORK functionality to make
a quick copy of a
configuration/transformation bucket
and avoid manual work
Versioning Improvements
- Save with descriptions
- Transformations UI
Avoiding Tedious Work
Keboola Logs Everything
● Component and Transformation runs
● Versioning
● Storage events
● Tip: Storage imports
Extensive logging
Every Storage job contains a detail view with all information about
the execution.
● Component Trash
○ Easy configuration restore
Use the Weapon of Choice
Components
● Extractor – allows customers to get data from new sources. It only processes input from external
sources (usually an API).
● Application – further enriches the data or adds value in new ways. It processes input tables stored as CSV
files and generates result tables as CSV files.
● Writer – pushes data into new systems and consumption methods. It does not generate any data in the KBC
project.
● Processor – adjusts the inputs or outputs of other components. It has to be run together with one of the
above components.
● All components are run using the Docker Runner
● All components must adhere to the common interface
● The list of all available components can be found at https://components.keboola.com/
Generic Extractor
● The Generic Extractor is a KBC component acting as a customizable HTTP REST client.
● It can be configured to extract data from virtually any API and offers a vast number of configuration options.
● An entirely new extractor for KBC can be built in less than an hour.
Extractors
Extractor types
● Database extractors: SQL Databases and NoSQL MongoDB
● Communication, Social Networks and Marketing and Sales extractors
● Other extractors such as Geocoding-Augmentation or GoogleDrive
● Keboola-provided and 3rd-party extractors
Generic Extractor
● Very flexible; supports various types of authentication, pagination, and nested objects. Fits almost all REST-like
services.
● Supports iterations (run with multiple parameters)
● Functions
○ allow you to add extra flexibility when needed.
○ Can be used in several places of the Generic Extractor configuration to introduce dynamically generated
values instead of those provided statically.
● Incremental load, remembers last state
● Can be set up and used as a standalone component
A custom component
Common interface
● Predefined set of input and output folders for tables and files,
● a configuration file,
○ Custom-defined JSON injected into the Docker container
● environment variables and return values.
Other Features
● Logging, manifest files for working with table and file metadata,
● the OAuth part of the configuration file, and
● actions for quick synchronous tasks.
● Docker Runner provides tools for encryption and OAuth2 authorization.
● Custom Science apps (legacy)
● Generic UI
● Processors
● Easy to set up CI workflows - Travis, Quay, Bitbucket pipelines
Processors
A processor is a special type of component which may be used before or after
running an arbitrary component (extractor, writer, etc.).
● Processors perform an atomic action on the data before it is loaded into the
component (writer) or into Storage.
● They may be plugged into any component.
○ Some components provide UI support for processors; otherwise they
need to be added to a configuration via an API call.
Use case: let's say you receive a CSV report generated by a legacy system via
S3, and the report contains a 10-line header before the actual data => such a
CSV file is invalid and would fail when loaded directly into Storage.
Solution: plug in a skip-lines processor that skips the first X rows and
converts the CSV file to a valid form.
Integration - technical intro
Integrate KBC with other systems.