
Snowflake - It is a cloud data platform for data storage as well as for analytical purposes.

It is an advanced data platform delivered as Software as a Service (SaaS). It is a cloud-based data warehouse platform built on top of public cloud infrastructure (AWS, Azure, or GCP).

In contrast to traditional data warehouse solutions, Snowflake provides a warehouse that is faster, easier to set up, and more flexible. The software is natively designed for the cloud, which means you cannot install it on an on-premises system.
It is a leader in data management solutions for analytics.

● No administrator required: it is a self-managed cloud data warehouse platform.

● All data is encrypted in transit and at rest: after data is read from the source file, it is encrypted, and the encrypted data is loaded into the target.
● Data backup inherent in the cloud: when you create a table, Snowflake ensures that the table is replicated properly. In case of a failure, you will be able to get your tables back.
● Built-in query optimization: query optimization happens automatically. We do not have to tell the optimizer how to execute queries.
● It follows all the ACID properties (Atomicity, Consistency, Isolation, Durability).
● By default, every command is auto-committed. If you don't want that, you can change it.
● Basic visualizations and dashboards can be created.

Snowflake's unique multi-layered architecture allows for performance, scalability, elasticity, and
concurrency.

Learn more about Snowflake’s Architecture in the Snowflake Documentation -


https://docs.snowflake.com/en/user-guide/intro-key-concepts

Snowflake runs completely on cloud infrastructure. All components of Snowflake’s service (other than
optional command line clients, drivers, and connectors), run in public cloud infrastructures.

Snowflake uses virtual compute instances for its compute needs and a storage service for persistent
storage of data. Snowflake cannot be run on private cloud infrastructures (on-premises or hosted).

Traditional architectures are either shared-disk architectures, which use multiple nodes to access data on a single storage system, or shared-nothing architectures, which store a part of the data on each node of the data warehouse.

Now, Snowflake combines the benefits of both these designs in an innovative new architecture.
Snowflake processes queries using MPP (Massively Parallel Processing) compute clusters, where each node in a cluster stores a portion of the entire data set locally. So you can say that Snowflake uses a mix of shared-disk and shared-nothing architecture.
Snowflake’s architecture is a hybrid of traditional shared-disk and shared-nothing database architectures.
Similar to shared-disk architectures, Snowflake uses a central data repository for persisted data that is
accessible from all compute nodes in the platform. But similar to shared-nothing architectures, Snowflake
processes queries using MPP (massively parallel processing) compute clusters where each node in the
cluster stores a portion of the entire data set locally. This approach offers the data management simplicity
of a shared-disk architecture, but with the performance and scale-out benefits of a shared-nothing
architecture.

Snowflake architecture has 3 layers


Database Storage layer : Stores the data

● When data is loaded into Snowflake, Snowflake reorganizes that data into its internal optimized,
compressed, columnar format. Snowflake stores this optimized data in cloud storage.

● It stores all the data (structured or semi-structured) in databases (a database is a logical grouping of objects, consisting primarily of tables and views). All tasks related to the data are handled through SQL queries.
● The underlying file system in Snowflake is managed by S3 (object storage) in the Snowflake account. The data in S3 is encrypted (it is always stored in encrypted form), compressed, and distributed (stored across multiple partitions) to optimize performance.
● Pass-through pricing: you pay for the amount of storage used.
● Up to 5x compression: data is not stored in its regular format; the data stored in Snowflake is compressed first.
● Native support for semi-structured data.
❖ The data resides inside tables, which are physically stored as files in Amazon S3.
❖ Each file holds blocks of data stored in columnar form.
❖ The data is never overwritten in cloud storage.

Query Processing layer / Compute layer / Virtual warehouse (VW) layer : process the data
in a database.

● Snowflake processes the queries in a virtual warehouse.


● A virtual warehouse is a cluster of compute resources with CPU, Memory, and disk.
● The larger the warehouse, the more compute resources in the cluster.
● The amount of compute resources doubles with each size increase.
● Each cluster can obtain all data in the storage layer and run separately so that warehouses do not
share or compete for compute resources.
● Virtual warehouses are typically used for data loading or for running queries, and can do both tasks simultaneously.
● These virtual warehouses can be scaled up or down without any downtime.
● Elastic (turns on instantly): the moment we create a warehouse, it turns on automatically.
● Scale up (complex queries): if more resources are required, the warehouse can be scaled up.
● Suspend: the warehouse is suspended when not in use and resumed when needed.
❖ We pay only while the warehouse runs; as long as the warehouse is not running, no charges are applied.
❖ Snowflake charges per second. That means for every second the warehouse runs, you pay for those compute resources.
❖ A warehouse is able to access data in any database.
❖ It transparently caches data accessed by queries.

● Each virtual warehouse is an independent compute cluster that does not share compute resources
with other virtual warehouses. As a result, each virtual warehouse has no impact on the
performance of other virtual warehouses.
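To make the sizing, auto-suspend, and per-second billing behaviour above concrete, here is a minimal sketch of creating and resizing a virtual warehouse; the name my_wh and the parameter values are illustrative assumptions, not values from this course.

CREATE WAREHOUSE my_wh
  WITH WAREHOUSE_SIZE = 'XSMALL'    -- can be scaled up later without downtime
       AUTO_SUSPEND = 300           -- suspend after 300 seconds of inactivity (stops per-second billing)
       AUTO_RESUME = TRUE           -- resume automatically when a query is submitted
       INITIALLY_SUSPENDED = TRUE;

-- Scale up the same warehouse for complex queries
ALTER WAREHOUSE my_wh SET WAREHOUSE_SIZE = 'MEDIUM';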
For more information, see https://docs.snowflake.com/en/user-guide/warehouses

https://docs.snowflake.com/en/user-guide/ui-query-profile

Snowflake compute costs depend on the amount of time warehouses have run and the sizes of the running warehouses.

Snowflake data storage costs include persistent data stored in permanent tables and data retained to enable data recovery (Time Travel and Fail-safe).

Cloud Service layer :

● It’s a collection of services which coordinates activities across snowflake.


● This layer coordinates and handles all other services in Snowflake, including sessions, encryption, SQL compilation, etc.
● Services offered by this layer are Authentication, Infrastructure management, Metadata
management, Query parsing and Optimization and Access control.
● The cloud services layer also runs on compute instances provisioned by Snowflake from the
cloud provider.

❖ Each node can handle 'x' number of requests.

❖ A cluster is a combination of nodes; four or five nodes can make up a large cluster.

Benefits:
● Cloud Agnostic
● Performance and speed
● User-friendly UI/UX
● Reduced Administration overhead
● On-Demand Pricing
● Support Variety of File Formats

Connecting to Snowflake:

Snowflake supports multiple ways of connecting to the service:


● A web-based user interface from which all aspects of managing and using Snowflake can be
accessed.

● Command line clients (e.g. SnowSQL) which can also access all aspects of managing and using
Snowflake.
● ODBC and JDBC drivers that can be used by other applications (e.g. Tableau) to connect to
Snowflake.
● Native connectors (e.g. Python, Spark) that can be used to develop applications for connecting to
Snowflake.
● Third-party connectors that can be used to connect applications such as ETL tools (e.g.
Informatica) and BI tools (e.g. ThoughtSpot) to Snowflake.

Supported Regions:
● Each Snowflake account is hosted in a single region
● Identical features and services across regions
● Unit costs for credits and data storage differ by region

Snowflake Editions :

Micro Partitions:
● A micro-partition is a contiguous unit of storage that holds table data
○ 50 - 500 MB of uncompressed data is stored
○ Generally 10 MB to 16 MB compressed
● That is, each micro-partition holds 10 to 16 MB of compressed data, which is equivalent to 50 - 500 MB of uncompressed data.
● There can be many micro-partitions per table, depending on how much data is loaded.
● These partitions are IMMUTABLE: once they are created, they can never be modified. For every insert or update, a new partition is created.
● The cloud services layer stores metadata about every micro-partition
○ MIN/MAX (range of values in each column)
○ Number of distinct values
○ NULL count
● We cannot create indexes in Snowflake.
● Snowflake automatically collects and maintains metadata (stored in the cloud services layer) about tables and their underlying micro-partitions, including:
○ Table level:
■ Row count
■ Table size (in bytes)
■ File reference and table versions
○ Micro-partition column level:
■ Number of distinct values
■ MIN and MAX values
■ NULL count

Data Storage Billing :
● Billed for actual storage use
○ Daily average Terabytes per month
● On-Demand pricing
○ Billed in arrears for the storage used
○ Around $40/Terabyte/month
○ Minimum monthly charge of $25
● Pre-Purchased Capacity
○ Billed up front with a commitment to a certain capacity
○ Price varies on amount and cloud platform
○ Customer is notified at 70% of capacity

❖ The METERING_HISTORY view in the SNOWFLAKE.ACCOUNT_USAGE schema provides the number of credits used in the account. This information is also available in the Web UI.
❖ Scale out for Concurrency and Scale up for Performance.
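As a hedged illustration of checking credit usage through the METERING_HISTORY view mentioned above (the column names are taken from the standard ACCOUNT_USAGE view, assumed here rather than stated in these notes):

SELECT service_type,
       start_time,
       credits_used
FROM SNOWFLAKE.ACCOUNT_USAGE.METERING_HISTORY
WHERE start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())   -- last 7 days of usage
ORDER BY start_time;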

Zero-Copy Cloning
● Quickly takes a snapshot of any table, schema, or database.
● When the clone is created
○ All micro-partitions in both tables are fully shared.
○ Micro-partition storage is owned by the oldest table; the clone references them
● No additional storage costs until changes are made to the original or the clone.
● Often used to quickly spin up Dev or Test environments.
● Effective backup option as well.

Documentation: https://docs.snowflake.com/en/user-guide/object-clone

● The cloned table points to the same micro-partitions as the original table.
● If records are modified via the cloned table, a new micro-partition is created with the modified data, and it belongs to the clone. The original table will not see the new micro-partition.
● In the same way, if any changes are made in the original table, the cloned table will not see them.
● When you clone a database, all the schemas and the tables in that database will be cloned.
● When you clone a schema, all the tables in that schema will be cloned.

Any DDL command is a metadata-only operation; the metadata is maintained in the cloud services layer. Only when a statement has to go below the cloud services layer does the compute engine start.
A warehouse is started for DML commands, not for DDL commands.

If ID and CLONE_GROUP_ID are different in TABLE_STORAGE_METRICS, then the table is a cloned table.

Example:
CREATE TABLE empclone CLONE emp;

Practice:

create database demo_db;

-- Cloning Tables
-- Create a sample table

CREATE OR REPLACE TABLE demo_db.public.employees

(emp_id number,
first_name varchar,
last_name varchar
);

-- Populate the table

Insert into demo_db.public.employees


values(100,'John','Smith'),
(200,'Sam','White'),
(300,'Bob','Jones'),
(400,'Linda','Carter');

-- Show the content of the table employees in the demo_db database and the public schema

select * from demo_db.public.employees;

-- Create a clone of the table

CREATE OR REPLACE TABLE demo_db.public.employees_clone CLONE employees;

-- Show the content of the clone which should be the same as the original table that it was cloned from.

select * from demo_db.public.employees_clone;

-- Add one more record to the clone table

insert into demo_db.public.employees_clone values(500,'Mike','Jones');

-- Verify the content of the clone table to show the original set of records and the additional new record that we just added

select * from demo_db.public.employees_clone;

-- Verify the content of the original employees table. It should be the original content without the record we added to the clone.

select * from demo_db.public.employees;

-- Delete emp_id 100 from the employees table.

delete from demo_db.public.employees where emp_id = 100;

-- Check the content of the employees table. The employee with Emp_id = 100 is no longer there

select * from demo_db.public.employees;

-- Check the content of the employees_clone table. The employee with Emp_id = 100 is still there.

select * from demo_db.public.employees_clone;

-- Cloning Databases

-- Create a database clone (demo_db_clone) from the original database demo_db

CREATE or replace DATABASE demo_db_clone CLONE demo_db;

-- Point to the demo_db_clone database

use database demo_db_clone;

-- Drop the table employees from the demo_db_clone database

drop table demo_db_clone.public.employees;

-- Drop the table employees_clone from the demo_db_clone database

drop table demo_db_clone.public.employees_clone;

-- Showing tables in the demo_db_clone would show neither the employees table nor the employees_clone table. They are gone.

show tables;

-- Point to the original demo_db database

use demo_db;

-- Showing the tables in the original demo_db database will show all the tables in that database including the employees and employees_clone tables

show tables;

-- Cloning Schema

-- Create a cloned schema from the original public schema

CREATE or replace SCHEMA public_clone CLONE public;

-- Point to the public schema

use schema public;

-- Show tables in the original public schema. That should return all the tables in the public schema including a table called employees

show tables;

-- Point to the public_clone schema

use schema public_clone;

-- Show tables in the public_clone schema

show tables;

-- Drop the table employees_clone in the public_clone schema;

drop table public_clone.employees_clone;

-- Show tables in the public_clone schema. The result of that command should not have in it a table called employees_clone

show tables;

-- Point to the original public schema

use schema public;

-- The result of that command should include in it the table called employees_clone since it was dropped from the clone schema and not the original public schema.

show tables;

TIME TRAVEL:
● This is one of the powerful CDP (Continuous Data Protection) features for ensuring the maintenance and availability of historical data.
● It helps in recovering data-related objects (databases, tables, schemas) that have been deleted accidentally or intentionally.
● It allows duplicating and backing up data from key points in the past.
● It also allows analyzing data usage/manipulation over a specified period of time.
● We can retrieve the data in a table as it existed at a specified time.

❖ Both Time Travel and Fail-Safe require additional storage.

Documentation: https://docs.snowflake.com/en/user-guide/data-time-travel

CREATE TABLE my_table (id int)
DATA_RETENTION_TIME_IN_DAYS=90;

ALTER TABLE my_table
SET DATA_RETENTION_TIME_IN_DAYS=30;

Query clauses to support Time Travel Actions:

● At a specific time

SELECT * FROM my_table1
AT(TIMESTAMP => 'Mon, 01 May 2020 16:20:00' :: timestamp);

● At an OFFSET

SELECT * FROM my_table1
AT (OFFSET => -60 * 5);

● BEFORE a specific query

SELECT * FROM my_table1
BEFORE (STATEMENT => '8e5d0ca9-005e-446e-b858-a8f5b37c5726');

● Cloning Historical Objects

CREATE TABLE restored_table CLONE my_table1
AT (TIMESTAMP => 'Mon, 09 May 2020 01:01:00 +0300' :: timestamp);

CREATE DATABASE restored_db CLONE my_db
BEFORE (STATEMENT => '8e5d0ca9-005e-446e-b858-a8f5b37c5726');

● Restoring objects

UNDROP TABLE/SCHEMA/DATABASE

● We keep the old versions of the micro-partitions for the specified retention time.

SHOW PARAMETERS FOR TABLE my_table1;
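A minimal sketch of restoring a dropped object with UNDROP, reusing the my_table1 name from the examples above:

DROP TABLE my_table1;

-- List dropped tables still within their Time Travel retention period
SHOW TABLES HISTORY LIKE 'my_table1';

-- Restore the most recent version of the dropped table
UNDROP TABLE my_table1;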

FAIL-SAFE:

● The defined period for Fail-safe is 7 days.
● It is a non-configurable, 7-day retention of historical data that begins after the Time Travel retention expires.
● Once data is in Fail-safe, users cannot access or modify it; it is accessible only to Snowflake personnel.
● Admins can view Fail-Safe use in the Snowflake Web UI under Account>Billing and
Usage.
● It is not supported for Temporary and Transient tables.

Documentation: https://docs.snowflake.com/en/user-guide/data-failsafe

REPLICATION:
● It helps to keep database objects and stored data synchronized between one or more accounts (within the same organization).
● With replication, we can sync data from one cloud service provider (e.g. Azure) to another cloud service provider (e.g. AWS).
● We can share data across clouds using replication.
● The unit of replication is a database. We can replicate permanent and transient databases in Snowflake; temporary databases cannot be replicated.
● The secondary (replicated) database is always read-only. This secondary (standby) database can be promoted to primary if the primary database fails.

Documentation: https://docs.snowflake.com/en/user-guide/database-replication-intro
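A hedged sketch of the basic replication flow described above; the database, organization, and account names are placeholders.

-- On the source (primary) account: allow replication to a target account
ALTER DATABASE sales_db ENABLE REPLICATION TO ACCOUNTS myorg.target_account;

-- On the target account: create a read-only secondary database
CREATE DATABASE sales_db AS REPLICA OF myorg.source_account.sales_db;

-- Refresh (synchronize) the secondary database; typically run on a schedule, e.g. via a task
ALTER DATABASE sales_db REFRESH;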

Replication Use Cases:

● Business continuity and Disaster Recovery


● Secure Data sharing across Regions/Clouds.
● Data portability for Account Migrations.

Introduction to Business Continuity & Disaster Recovery Documentation:


https://docs.snowflake.com/en/user-guide/replication-intro

Sharing Data Securely Across Regions and Cloud Platforms Documentation:


https://docs.snowflake.com/en/user-guide/secure-data-sharing-across-regions-plaforms

Replicated Database Objects:

● Tables
○ Permanent
○ Transient
○ Automatic Clustering of Clustered tables
○ Constraints
● Sequences
● Views (Both Standard and Secured)
○ Materialized (Both Standard and Secured)
● File Formats
● Stored Procedures
● User-Defined functions (UDF)
○ SQL and Javascript
● Policies
● Tags
○ Object Tagging

Database Replication and Encryption:

● Snowflake encrypts database files in-transit from the source account to the target account.
● If Tri-Secret Secure is enabled (for the source and the target accounts) the files are
encrypted using the public key for an encryption key pair.
● The encryption key pair is protected by the account master key (AMK) for the target
account.

Compute for Database Replication:

● Replication operations use Snowflake-provided compute resources to copy data between


accounts.
● Replication utilization is shown as a special Snowflake-provided warehouse named
REPLICATION.
● Query either of the following:

○ REPLICATION_USAGE_HISTORY table function


○ REPLICATION_USAGE_HISTORY view

Data Transfer for Database Replication:

● Initial database replication and subsequent synchronization operations incur data transfer
charges
○ These are termed “egress fees” and only apply when data is passed across regions
and/or across cloud providers.
○ The cost is passed along to customers.
● Rate or pricing is determined by the location of the source and target accounts, and the
cloud provider.
● Data Transfer Usage is shown in Billing and Usage Data Transfer.

SQL Support in Snowflake:

Logical Data Organization:


● Databases and Schemas logically organize data within a snowflake account.
● A database is a logical grouping of schemas
○ Each database belongs to a single account.
● A schema is a logical grouping of database objects, such as tables and views.

Table Types:

Views:
A view allows the result of a query to be accessed as if it were a table. The query is specified in
the CREATE VIEW statement.
Views serve a variety of purposes, including combining, segregating, and protecting data. For example, you can create separate views that meet the needs of different types of employees, such as doctors and accountants at a hospital.
A view can be used almost anywhere that a table can be used (joins, subqueries, etc.).

● A materialized view increases performance.
● If we need to perform aggregate calculations that would take a long time on a regular table, we can compute that aggregate data once and store it in a materialized view. The next time a similar aggregate calculation is needed, it reads the data from the materialized view instead.
● Materialized views are refreshed automatically, so they give up-to-date information on their base tables.

Example:
CREATE VIEW v1 AS SELECT * FROM emp;

CREATE MATERIALIZED VIEW mv1 AS SELECT * FROM emp;

❖ Creating a normal view does not take time, whereas creating a materialized view does, because a normal view only has to store the definition, while a materialized view also has to store the data.

Constraints:

● Snowflake provides support for defining and maintaining constraints.


● The only constraint snowflake enforces is NOT NULL.
● Primarily for data modeling purposes, and to support client tools that use constraints.
○ Example: Tableau supports using constraints for join culling, which can improve
performance.
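A short sketch showing that constraints can be declared but only NOT NULL is enforced; the table and column names are illustrative.

CREATE OR REPLACE TABLE customers (
    cust_id   NUMBER  PRIMARY KEY,   -- recorded for modeling/tools, not enforced
    email     VARCHAR UNIQUE,        -- not enforced
    cust_name VARCHAR NOT NULL       -- the only constraint Snowflake enforces
);

-- Succeeds even though cust_id is duplicated, because PRIMARY KEY is not enforced
INSERT INTO customers VALUES (1, 'a@x.com', 'Alice'), (1, 'b@x.com', 'Bob');

-- Fails, because NOT NULL is enforced
INSERT INTO customers VALUES (2, 'c@x.com', NULL);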

Standard SQL:

● Snowflake supports most DDL and DML defined in SQL:2003 including:


○ Database and schema DDL.
■ CREATE, DROP, ALTER
■ Snowflake extensions include CREATE DATABASE..CLONE and
UNDROP DATABASE.
○ Table and View DDL.
■ CREATE, DROP, ALTER
■ Snowflake extensions include CREATE TABLE..CLONE and UNDROP
TABLE.

■ Snowflake also supports CREATE SEQUENCE.
○ General DML
■ INSERT including multi-table insert
■ MERGE, UPDATE, DELETE, and TRUNCATE
● Query Syntax
○ Snowflake SELECT supports all the standard syntax options including:
■ All standard JOIN types: INNER, OUTER, SELF, LEFT, and RIGHT
JOINS
■ PIVOT - used to transform a narrow table into a wider table for reporting
purposes.
■ SAMPLE - returns a random subset of rows from a table
■ GROUP BY options including CUBE, GROUPING SETS, and ROLLUP.
○ Since Snowflake supports SQL:2003, most of the options supported by other standard RDBMSs will work.
○ Snowflake extensions for a SELECT statement include the AT and BEFORE
options.
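Two small illustrations of the query-syntax extensions listed above; the monthly_sales table is a hypothetical example, while lineitem comes from the TPC-H sample data used elsewhere in these notes.

-- SAMPLE: return roughly 10% of the rows of a table
SELECT * FROM lineitem SAMPLE (10);

-- PIVOT: turn a narrow table (empid, month, amount) into one column per month
SELECT *
FROM monthly_sales
  PIVOT (SUM(amount) FOR month IN ('JAN', 'FEB', 'MAR')) AS p;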

Snowflake Transactions:

● Snowflake Transactions are Designed for Analytic Workloads (OLAP)


● Snowflake supports ACID transactions
○ Atomic, Consistent, Isolated, Durable
● It is not designed for OLTP transaction workloads.
● Two types of transaction scope
○ Autocommit
■ DML statement executed without explicitly starting a transaction is
automatically committed on success or rolled back on failure at end of
statement.
■ DDL statements are automatically committed.
○ Explicit – multi-statement transactions:
■ BEGIN, START TRANSACTION, COMMIT, ROLLBACK
● Every statement is executed in the scope of a transaction.
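A minimal sketch of an explicit multi-statement transaction; the accounts table is a placeholder.

BEGIN TRANSACTION;

UPDATE accounts SET balance = balance - 100 WHERE account_id = 1;
UPDATE accounts SET balance = balance + 100 WHERE account_id = 2;

COMMIT;   -- or ROLLBACK; to undo both statements together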

Describe Object:

● Describes the details for a specified object

Example: DESCRIBE DATABASE my_database;

● Used for
○ Tables, Schemas and Views
○ Sequences
○ File Formats
○ Stages
○ Pipes
○ Tasks and Streams
○ Functions and Procedures

Show Objects:

● List the existing objects for the specified object type. Output includes:
○ Common properties (name, creation timestamp, owning role, comment, etc).
○ Object-specific properties

Example: SHOW DATABASES;


SHOW TABLES;

● LIKE <pattern> can be used to filter output by object name

Example: ‘%testing%’;

● Used for
○ Database objects (Tables, schemas, views, file formats, sequences, stages, tasks,
pipes, etc.)
○ Account Object Types (Warehouses, databases, grants, roles, users, etc.)
○ Session / User Operations (Parameters, variables, transactions, locks, etc.)
○ Account Operations (Replication databases, replication accounts, regions, global
accounts, etc.)

SHOW PARAMETERS;

– It will show the session parameters.

SHOW PARAMETERS FOR TABLE employees;

– It will show the table parameters.

GET_DDL:

● Returns a DDL statement that can be used to recreate the specified object.

Example: SELECT GET_DDL('VIEW', 'REGIONAL MEMBERS');

SELECT GET_DDL('TABLE', 'EMPLOYEES');

● Used for
○ Databases, Schemas, Tables, External tables
○ Views
○ Streams and Tasks
○ Sequences
○ File Formats, Pipes
○ UDFs, Stored Procedures
○ Policies

CACHING AND QUERY PERFORMANCE:

When a query is executed for the first time, it takes more time; but when the same query is executed again, it runs very fast.

This is because the results of the query are stored in a cache, and for subsequent executions the data comes from the cache. If an identical query is run, the data is served from the cache only.

The advantage is that the compilation time is reduced.

● The optimizer in Snowflake is a cost-based optimizer. It decides how the query needs to be executed and creates an execution plan, which is given to the next layer. The next layer is the compute layer, which in turn works with the storage layer. As per the instructions given by the optimizer, the data is fetched from the storage layer by reading multiple partitions.
● After the data is fetched from storage, a result set is created; this is the data that is displayed when you execute the query. A copy of that result set is first stored in the query result cache, and from there it is given to the user.
● This query result cache is kept for the next 24 hours. Within these 24 hours, if you run an identical query, the data comes from the cache instead of going to disk.

Types of Cache:

1. Metadata Cache
2. Query Result cache
3. Data cache

Metadata cache and Query result cache are both available in the cloud service layer.
Data cache will be in the compute layer i.e., Virtual Warehouse layer.

For MAX, MIN, and COUNT, the information does not need to pass through the virtual warehouse layer; the result comes from the metadata cache. So nothing needs to be paid to Snowflake for executing such a query.

1. METADATA CACHE:

Whenever a table is created, the metadata cache stores not only the table structure, but also how many records the table contains and the minimum and maximum value of each column for every partition. (SHOW commands are also served from it.)

● Metadata is stored in the cloud services layer

● Micro-partition level data
○ Row count
● Micro-partition column-level data
○ MIN/MAX values
○ Number of DISTINCT values
○ Number of NULL values
● Table versions and references to physical files (.fdn)
● It also stores how many times the table has been modified today

Micro-Partition Metadata Cache

● Used by SQL optimizer to speed up query compilation


● Used for queries that can be answered completely by metadata
○ SHOW commands
○ MIN, MAX (only for integer and date data types)
○ COUNT
● Metadata lookups are fast and do not use a virtual warehouse.
● No virtual warehouse used

○ NOTE: Cloud services charges may still apply if cloud services usage is more than 10% of your overall compute time.

2. QUERY RESULT CACHE:

● Query results are stored and managed by the cloud service layer
● Snowflake will read the data from this layer only if the identical query is run, and base
tables have not changed
● Available to other users:

○ SELECT: Any role that has the SELECT permission on all the tables involved
can use the Query result cache

How it Works?

● Result sets are cached for 24 hours; counter resets each time matching query is re-used
● Result reuse controlled by USE_CACHED_RESULT parameter at account/user/session
level

SHOW PARAMETERS FOR account;

● Eligibility requirement for query to use result set cache:


○ Exact same SQL query (*except maybe whitespace)

SELECT max(emp_id) FROM employees;

SELECT MAX(emp_id) FROM employees;

The two queries above are not identical because the query text differs (the ASCII values of uppercase and lowercase characters are different), so the result cache is not reused.

○ Result must be deterministic (no random function)


○ Changes CAN be made to source table(s), but only if they do not affect any
micro-partitions relevant to query
○ Must have the right permissions to use it

Query result cache use cases

● Any query which is running repeatedly


○ Example: Static dashboards
● Refine the output of another query
○ Use TABLE function RESULT_SCAN (<query id>);
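A hedged sketch of refining a previous result with RESULT_SCAN; LAST_QUERY_ID() picks up the immediately preceding query in the session, and the quoted column names come from the SHOW TABLES output.

SHOW TABLES;

-- Filter the output of the SHOW command above using SQL
SELECT "name", "rows"
FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()))
WHERE "rows" > 0;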

Benefits

● Fast
● Will never give stale results
● No virtual warehouse used
○ Unless the cache is accessed using a RESULT_SCAN

3. DATA CACHE:

Example: if a query fetches only 2 out of 10 records, the query result cache stores just those 2 records, whereas the data cache stores the data read from storage for all 10 records.
To test the data cache without interference from the result cache, disable result reuse:
ALTER SESSION SET USE_CACHED_RESULT=FALSE;

● Stores file headers and column data from queries


○ Stores the data, not the result
● Stored to SSD in virtual warehouse
● When a similar query is run, Snowflake will use as much data from the cache as possible
● Available for all queries run on the same virtual warehouse

Query Example:

SELECT firstname FROM customer WHERE state = 'CA';

● The data cache will only capture the data in the partition returned in the query
● The data cache will only contain the columns in the query
● In this example above, the data cached will only be the records returned by the query.
● For example, if the "state = 'CA'" WHERE clause returned 20 records out of 100, only the firstname and state columns will be cached for the 20 records returned.

Effectiveness Tip:
● Group and execute similar queries on the same virtual warehouse to maximize data cache
reuse, for performance and query optimization.

How Data Cache works?

● When a query is run, file headers and column data retrieved are stored on SSD
● Virtual warehouse will first read any locally available data, then read remainder from
remote cloud storage
● Data is flushed out in a Least Recently Used (LRU) fashion when cache fills

● Remote disk I/O in the data cache refers to the process of reading or writing data from or
to the remote storage location where Snowflake stores its data.

What is “EXPLAIN PLAN”?

For every query, you can generate an execution plan. EXPLAIN does not show you the data; rather, it explains how the query will be executed.

EXPLAIN
SELECT
l_orderkey
FROM
lineitem
LIMIT 10;

● Snowflake command that displays the logical execution steps without executing
● Results show in JSON, text or tabular format
● Helpful for performance tuning and for controlling costs

What’s In a Plan?
Key Points:
● Partition Pruning
● Join Ordering
● Join Types

QUERY PROFILE

Profile Overview

● Initialization: setup activities prior to processing


● Processing: CPU data processing
● Local Disk I/O: blocked on local SSD on node
● Remote Disk I/O: blocked on remote cloud storage
● Network Communication: blocked on network data transfer
● Synchronization: sync activities between processes

SQL PERFORMANCE TIPS

Top Performance Tip:

● Row operations are performed before GROUP operations


● Check row operations first
○ First, Check FROM and WHERE clauses
○ Then, check GROUP BY and HAVING clauses
● Use appropriate filters, as early as possible
○ Filters only help where they are applicable
○ The Optimizer uses the filters to prune out unnecessary data

JOIN ON UNIQUE KEYS

Tips for Effectiveness:


● Ensure keys are distinct
● Understand the relationships between your tables before joining
● Avoid many-to-many joins
● Avoid unintentional cross joins

Troubleshooting Scenario:
● Joining on non-unique keys can explode your data output (join explosion)
○ Each row in table 1 matches multiple rows in table 2

SNOWFLAKE BUILT-IN OPTIMIZATIONS

● Snowflake provides patented micro-partition pruning to optimize query performance:


○ Static partition pruning based on columns in WHERE clause
○ Dynamic partition pruning based on JOIN columns of a query
● To assist SQL optimizer:

○ Apply appropriate filters as early as possible in the query
○ For naturally clustered tables, apply appropriate predicate columns (e.g., date
columns) that have a higher correlation to ingestion order

TEMPORARY TABLE
● Great for long calculations that need to be referenced multiple times in same query
● Great for materializing intermediate result
● Backed by micro-partitions and may assist in pruning
● Exist only during current session
● Disappear when you disconnect

SQL NATIVE FUNCTIONS IN FILTER


● Most of the time, the SQL pruner can determine the statistics (min/max) of the function
from the statistics of the filter column
○ Partition pruning can occur for the filter, leading to better query performance

SELECT COUNT(*) FROM orders
WHERE TO_DATE(o_orderdate, 'yyyy-mm-dd') = '1993-02-04';

● Sometimes, it’s not possible for the SQL pruner to determine the statistics of the function
from the statistics of the filter column

SELECT COUNT(*) FROM orders
WHERE TO_CHAR(o_orderdate, 'yyyy-mm-dd') = '1993-02-04';

Cache demo

--create sample table


use warehouse compute_wh;
use database test;
create table nation as select * from SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.NATION;

-- check usage of Metadata Cache in query profile


alter warehouse compute_wh suspend;
select max(n_nationkey) from Nation;

-- Check usage of query result cache. First execution of the query will fetch data from the storage layer and subsequent execution of the identical query will show the data from the result cache

alter warehouse compute_wh resume;


select n_nationkey, n_name from nation where n_nationkey >10;

--check usage of data cache. You should see that the percentage scanned from cache is 100.00%.

alter session set use_cached_result=FALSE;


select n_nationkey, n_name from nation where n_nationkey >10;

-- add columns, run the query again and check the query profile. Percentage scanned from cache should be less than 100% as we have added additional columns which were not fetched earlier

select n_nationkey, n_name, n_regionkey, n_comment from nation where n_nationkey >10;

DATA LOADING AND UNLOADING

Snowflake does ELT - Extract, Load and Transform.

You need to create a stage: from external files, you load data into the stage, and from the stage you load data into the target table.
Stage - Temporary place where the data is stored before it is moved to a table
Target - Permanent or transient table where the data will be stored

External files (txt/csv/json)  --PUT-->  Stage  --COPY INTO-->  Target table

● No transformations while loading data to stage. It’s just dumping data


● PUT reads data from files, compresses the data, and loads it into the stage.
● The COPY INTO command is used to read data from the stage and load it into the target. In COPY INTO, we can perform transformations.
● PUT command cannot be used in the Web UI. That’s the reason, Command line interface
is used for loading and unloading data.
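A minimal sketch of the two-step flow described above, run from SnowSQL (PUT is not available in the web UI); the file path, stage name, and table name are placeholders.

PUT file:///tmp/employees.csv @my_stage;    -- compresses and uploads the local file to the stage

COPY INTO employees
FROM @my_stage/employees.csv.gz
FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);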

Before performing data load operations, we need to create


● Stage

● Target table

DATA LOADING:

Snowflake supports loading data from files staged in any of the following locations, regardless of the
cloud platform for your Snowflake account:
● Internal (i.e. Snowflake) stages
● Amazon S3
● Google Cloud Storage
● Microsoft Azure blob storage
Snowflake supports both bulk data loading and continuous data loading (Snowpipe). Likewise, Snowflake
supports unloading data from tables into any of the above staging locations.

For more information, see Loading Data into Snowflake - https://docs.snowflake.com/en/user-guide-data-load

Note
Some data transfer billing charges may apply when loading data from files staged across different
platforms. For more information, see Understanding Data Transfer Cost.

Snowflake Documentation:

● Overview of Data Loading


● Supported File Locations and Formats
● Bulk Loading
● Continuous Loading Using Snowpipe
● Loading Using the Web Interface

● High level data loading process

STAGES IN SNOWFLAKE
Stages actually incur cost.
Internal Named stage and User stage are created at the virtual warehouse layer.
Table stage is created at the storage layer.
All stages are physical objects because they will be holding the data.

STAGES:

● A stage specifies where data files are stored (i.e. “staged”) so that the data in the files can
be loaded into a table.
● Types of Stages
○ User Stages
○ Table Stages
○ Internal Named Stages
○ External Stages
● By default, each user and table in Snowflake is automatically allocated an internal stage
for staging data files to be loaded. In addition, you can create named internal stages.
● File staging information is required during both steps in the data loading process:
○ You must specify an internal stage in the PUT command when uploading files to
Snowflake.

○ You must specify the same stage in the COPY INTO <table> command when
loading data into a table.

USER STAGES:

● Each user has a Snowflake stage allocated to them by default for storing files.
● It is a convenient option if your files will only be accessed by a single user, but need to be
copied into multiple tables.
● User stages are referenced using @~; e.g. use LIST @~ to list the files in a user stage.
● Unlike named stages, user stages cannot be altered or dropped.
● User stages do not support setting file format options. Instead, you must specify file
format and copy options as part of the COPY INTO <table> command.

● This option is not appropriate if:


○ Multiple users require access to the files.
○ The current user does not have INSERT privileges on the tables the data will be
loaded into.

TABLE STAGES:

● Each table has a Snowflake stage allocated to it by default for storing files.
● Is a convenient option if your files need to be accessible to multiple users and only need
to be copied into a single table.
● Table stages have the following characteristics and limitations:
○ Table stages have the same name as the table; e.g. a table named mytable has a
stage referenced as @%mytable.
○ Unlike named stages, table stages cannot be altered or dropped.
○ Table stages do not support setting file format options. Instead, you must specify
file format and copy options as part of the COPY INTO <table> command.
○ Table stages do not support transforming data while loading it (i.e. using a query
as the source for the COPY command).
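A short sketch of using a table stage; the local file path is a placeholder.

-- Upload a local file into the stage that belongs to table mytable
PUT file:///tmp/mytable_data.csv @%mytable;

-- List the staged files, then load them; file format options must go on the COPY command
LIST @%mytable;
COPY INTO mytable FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);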

INTERNAL NAMED STAGES:

● Stores data files internally within Snowflake


● Are named database objects that provide the greatest degree of flexibility for data
loading.
● Because they are database objects, the security/access rules that apply to all objects
apply:

○ Users with the appropriate privileges on the stage can load data into any table.
○ Ownership of the stage can be transferred to another role, and privileges granted
to use the stage can be modified to add or remove roles.
○ When you create a stage, you must explicitly grant privileges on the stage to one
or more roles before users with those roles can use the stage.

● Example:

CREATE OR REPLACE STAGE my_stage file_format = my_csv_format;

CREATE OR REPLACE STAGE my_stage file_format = (type = 'csv' field_delimiter = '|' skip_header = 1);

EXTERNAL STAGE:

● References data files stored in a location outside of Snowflake. Currently, the following
cloud storage services are supported:
○ Amazon S3 buckets
○ Google Cloud Storage buckets
○ Microsoft Azure containers

● CREATE OR REPLACE STAGE my_ext_stage
url='s3://load/files/'
credentials=(aws_key_id='1a2b3c' aws_secret_key='4x5y6z');

BULK LOADING from a LOCAL FILE SYSTEM: First you configure, and then you load the data.

● Configuring
○ Preparing a data load
○ Choosing a Stage for Local Files (current topic)
● Loading
○ Staging data from local file system
○ Copying data from internal storage (to the table)

If you don't specify the file format, it defaults to 'CSV', with ',' as the field delimiter and a skip header of 0.

WHAT IS A STAGE?

Cloud file repository that simplifies and streamlines bulk loading and unloading.
● Stage can be internal or external
○ Internal : stored internally in Snowflake

○ External : stored in an external location
● Effectiveness tip: Create stage object to manage ingestion workload
● Data can be queried directly from a stage without loading it first (i.e. without loading it into the target table).
● You can even run a SELECT statement against a stage to view the data.

WHAT IS A FILE FORMAT ?


Named object that stores information needed to parse files during load/unload.

● Specifies file type to load or unload (CSV, JSON, etc.)


● Includes type-specific formatting options
● Specify FILE FORMAT object
○ As part of a Table
○ As part of a Named stage
○ In the COPY INTO command
● File Formats are used in this order:
○ Copy into, if available
○ Stage, if available
○ Table, if available
○ Otherwise, use CSV with default settings
● File format not strictly required as some commands allow specifying the file format
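A minimal sketch of creating a named file format and attaching it to a named stage; the names are illustrative.

CREATE OR REPLACE FILE FORMAT my_csv_format
  TYPE = 'CSV'
  FIELD_DELIMITER = '|'
  SKIP_HEADER = 1
  NULL_IF = ('NULL', '');

-- Attach the named file format to a named internal stage
CREATE OR REPLACE STAGE my_stage
  FILE_FORMAT = my_csv_format;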

SUPPORTED FILE FORMATS:

● Structured
○ Delimited
■ CSV (delimiter can be comma, tab, pipe or other)
● Semi-Structured
○ JSON – JavaScript Object Notation. Lightweight data interchange format.
○ AVRO – row-based storage format for Hadoop
○ ORC – Optimized Row Columnar. Used to store Hive data
○ Parquet – Columnar file format that stores binary data. Used in the Hadoop
ecosystem
○ XML – Extensible Markup Language. Simple text-based format for representing
structured information.

Note: Support for XML is currently in preview; it is not generally available.

COPY INTO:
● Requires an active warehouse to execute
● Can specify copy options to control how errors are handled:
● Regular expressions can be used to filter the files to be loaded
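A hedged example combining the options above; the stage, pattern, and table names are placeholders.

COPY INTO sales
FROM @my_stage/daily/
PATTERN = '.*sales_.*[.]csv[.]gz'   -- regular expression to filter the staged files
ON_ERROR = 'CONTINUE'               -- skip bad rows instead of aborting the load
FORCE = FALSE;                      -- set TRUE to reload files that were already loaded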

If you want to load the same files once again, specify FORCE = TRUE.

LINK – https://docs.snowflake.com/en/sql-reference/sql/copy-into-table#examples

https://medium.com/plumbersofdatascience/how-to-ingest-data-from-s3-to-snowflake-with-snowpipe-7729f94d1797

DATA LOADING RECOMMENDATIONS:


1. File Size
2. Location Path

1. File Size and Number:


● File size and number of files are crucial to optimizing load performance
● Split large files before loading into Snowflake
● Recommended file size should always be 100MB to 250MB after compression

Warehouse Size    # files in parallel
XS                8
S                 16
M                 32
L                 64
XL                128

SERIAL COPY VS PARALLEL COPY:

● You can either serial copy or parallel copy.


● If you are going for a parallel copy, 95% of the warehouse will be used.
● If you are going for a serial copy, 2% of the warehouse will be used.
● If you have created a clustered warehouse, it will go for the parallel load. If you have
gone for a single cluster, it will go for a serial load.

FILE ORGANIZATION:
● Organize data in logical paths (e.g., subject area and create date)

/system/market/daily/2018/09/05

● Use wildcards late in the file path definition, to reduce scanning:

COPY INTO table
FROM @mystage/system/market/daily/2018

DATA VALIDATION AND ERROR HANDLING:

DATA LOADING TRANSFORMATIONS AND MONITORING

Transforming Data During Load:

● The COPY command supports column reordering, column omission, and CAST using a SELECT statement

○ NOT SUPPORTED: Joins, Filters, Aggregations


○ Can include SEQUENCE columns, current_timestamp(), or other column
functions during data load
● The VALIDATION_MODE parameter does not support transformations in COPY
statements
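Two hedged sketches of the points above: a COPY with a simple column transformation, and a separate VALIDATION_MODE check (the stage, file, and table names are placeholders; VALIDATION_MODE cannot be combined with the transformation query).

-- Reorder and derive columns during the load
COPY INTO employees (emp_id, full_name, load_ts)
FROM (SELECT $1, $2 || ' ' || $3, CURRENT_TIMESTAMP()
      FROM @my_stage/employees.csv.gz)
FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);

-- Validate the staged files without loading them
COPY INTO employees
FROM @my_stage/employees.csv.gz
FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1)
VALIDATION_MODE = 'RETURN_ERRORS';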

COPY INTO vs INSERT:

● Snowflake is optimized for bulk load and batched DML using the COPY INTO
command
● Use COPY INTO to load the data rather than INSERT with SELECT
● Use INSERT only if needed for transformations not supported by COPY INTO
● Batch INSERT statements
○ INSERT w/ SELECT
○ CREATE TABLE AS SELECT (CTAS)
○ Minimize frequent single row DMLs

CONTINUOUS DATA LOADING:
As data arrives in the S3 bucket, it should automatically be read from S3 and loaded into the database table; for that we use the continuous data loading process.

In case of Continuous data loading, we have to use Snowpipe.

Methods to Auto ingest the Data:

SNOWPIPE REST API TIPS:

● Snowflake manages the compute required to execute COPY INTO commands


● Snowpipe keeps track of which files it has loaded
● When pipe is recreated using CREATE OR REPLACE PIPE, the load history is reset to
empty
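A hedged sketch of a Snowpipe with auto-ingest on an external stage; the pipe, stage, and table names are placeholders, and the cloud-side event notification setup is not shown.

CREATE OR REPLACE PIPE sales_pipe
  AUTO_INGEST = TRUE
AS
  COPY INTO sales
  FROM @my_ext_stage/daily/
  FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);

-- Check the pipe status and its notification channel
SELECT SYSTEM$PIPE_STATUS('sales_pipe');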

LINK — (COPY INTO) https://docs.snowflake.com/en/sql-reference/sql/copy-into-table#examples

How to Ingest Data from S3 to Snowflake with Snowpipe — https://docs.snowflake.com/en/sql-reference/sql/copy-into-table#examples

SNOWPIPE BILLING: You have to pay for using Snowpipe. Snowflake provides and manages the compute used by Snowpipe (serverless).

● Serverless model: does not require a virtual warehouse


● Snowflake provides and manages compute resources
○ Capacity grows and shrinks depending on load
● Accounts charged based on actual compute usage
○ Don’t need to worry about suspending a warehouse
○ Charged per-second, per-core
● Utilization cost of 0.06 credits per 1000 files notified via REST calls, auto-ingest or
manual REFRESH

DATA UNLOADING:
Unloading means taking data out of a Snowflake table, bringing it to a stage, and from there moving it to a file.

● Unload Destinations
● Unload with SELECT
● File Paths and Names

UNLOAD SYNTAX:

● Syntax for internal or external stage unload

COPY INTO @my_stage
FROM my_data
FILE_FORMAT = (FORMAT_NAME = 'my_format');

● Can use any SQL command and express joins:

COPY INTO @my_stage
FROM (SELECT column1, column2
FROM my_table m JOIN your_table y ON m.id = y.id)
FILE_FORMAT = (FORMAT_NAME = 'my_format');

● Use @% to unload into a table stage

COPY INTO @%my_table
FROM (SELECT column1, column2 FROM my_table)
FILE_FORMAT = (FORMAT_NAME = 'my_format');

❖ When loading data, you cannot use joins; when unloading data, you can use joins.

FILE PATHS AND NAMES:

● To set the file path for the files, add the path after the stage name

COPY INTO @mytable/CSVFiles/ FROM mytable;

● To set the file name for the files, add the name after the folder name

COPY INTO @mytable/CSVFiles/testfile FROM mytable;

● Snowflake appends a suffix to the file name that includes:


○ The number of the compute core in the virtual warehouse
○ The unload thread
○ A file number

Example: testfile_0_0_0.csv

UNLOAD TIPS:

● Can unload into any flat, delimited plain text format (CSV delimited with comma, tab,
etc.)
● A SELECT statement can be used to unload a table to multi-column, semi-structured
format (Parquet or JSON only)
● Can also unload the file in compressed format

TASKS

A task is essentially a scheduler that helps to schedule a single SQL statement or a stored procedure.
The task engine supports both CRON and non-CRON scheduling.
Snowflake ensures that only one instance of a task with a schedule is executed at any point in time.
If a parent task fails, the child task does not execute.

CREATING AND MANAGING TASKS

Working With Tasks:

● Executes a DDL/DML/SQL statement or Stored Procedure


● Runs on a defined schedule, or after completion of a previous task
● Use Cases:
○ Keep aggregates up-to-date
○ Generate data for periodic reports

○ Copy data into or out of Snowflake

Tasks Workflow:

● Create a Task Administrator Role (Or use ACCOUNTADMIN)


● CREATE TASK mytask…
○ Specify a warehouse or use a serverless warehouse
○ Optionally specify a schedule
○ Optionally specify a condition
○ Define the task

● ALTER TASK mytask… RESUME

Scheduling is always in minutes.
USER_TASK_MANAGED_INITIAL_WAREHOUSE_SIZE is the parameter for the serverless warehouse.

Specifying a Schedule:

● A schedule can be expressed in two ways:

1. ‘<num> minute’
○ Task will run every <num> minutes
○ Example: ‘60 minute’.

2. USING CRON <expression> <timezone>


○ Specifies a cron expression and time zone for running the task.
○ Supports a subset of standard cron utility syntax.

For a single task, a schedule must be defined or the task will never run. A schedule must not be used in a child task; instead use 'AFTER'.

Building a Simple Snowflake Task:

1. CREATE OR REPLACE TABLE mytable(dt timestamp);

2. CREATE OR REPLACE TASK mytask_minute
WAREHOUSE=COMPUTE_WH
schedule='1 MINUTE'
AS
INSERT INTO mytable values (current_timestamp);

3. CREATE OR REPLACE TASK mytask_hour
warehouse=COMPUTE_WH
schedule='USING CRON 0 9-17 * * SUN America/Los_Angeles'
TIMESTAMP_INPUT_FORMAT='YYYY-MM-DD HH24'
AS
INSERT INTO mytable values (CURRENT_TIMESTAMP);

4. SHOW TASKS;

-- TASK DOES NOT GET RESUMED AUTOMATICALLY

5. ALTER TASK MYTASK_MINUTE RESUME;

6. ALTER TASK MYTASK_MINUTE SUSPEND;

7. SELECT * FROM TABLE (information_schema.task_history()) order by scheduled_time;

8. SELECT * FROM MYTABLE;

9. Drop task mytask_hour;

Adding Dependent (Child) Tasks onto an Existing Task

First Task:
CREATE OR REPLACE TASK TASK_DEBUG
WAREHOUSE = COMPUTE_WH
SCHEDULE = '1 MINUTE'
TIMESTAMP_INPUT_FORMAT = 'YYYY-MM-DD HH24'
AS
insert into taskdebug
with arr as (select array_construct('A','B','C','D','E','F') arr)
select arr[ABS(MOD(RANDOM(), array_size(arr)))], CURRENT_TIMESTAMP() from arr;

-- CREATE A 2ND TASK WITH THE AFTER CLAUSE

CREATE OR REPLACE TASK TASK_DEBUG_2
WAREHOUSE = COMPUTE_WH
AFTER TASK_DEBUG -- SPECIFY WHICH TASK IT OPERATES AFTER
AS
insert into taskdebug
select 'X', CURRENT_TIMESTAMP();

Note: TASK_DEBUG_2 must be enabled first and then the root task

❖ If parent task fails child task does not execute


❖ Any CREATE command stores its information in metadata only, which is nothing but the cloud services layer.

Practise:

create or replace table mytable(dt timestamp);

create OR REPLACE task mytask_minute


WAREHOUSE = COMPUTE_WH
schedule = '1 MINUTE'
as
insert into mytable values (current_timestamp);

/* Create OR REPLACE task mytask_hour


warehouse = COMPUTE_WH
schedule ='USING CRON 0 9-17 * * SUN America/Los_Angeles'
TIMESTAMP_INPUT_FORMAT= 'YYYY-MM-DD HH24'
AS
INSERT INTO mytable values (CURRENT_TIMESTAMP)
*/

create or replace table child_table(x varchar(50));

create OR REPLACE task childtask


WAREHOUSE = COMPUTE_WH
after mytask_minute
as
insert into child_table values ('this is the child task');

show tasks;
alter task mytask_minute suspend;
alter task childtask resume;
alter task mytask_minute resume;

USER DEFINED FUNCTIONS:

● Perform custom operations that are not available through the built-in functions
● Can be written in:
○ JavaScript
○ SQL
○ Scala or Java (using Snowpark)
● No DDL/DML support
● Can be unsecure or secure
● Return a singular scalar value or, if defined as a table function, a set of rows

SQL UDF EXAMPLE

CREATE OR REPLACE FUNCTION order_cnt(custkey number(38,0))
RETURNS number(38,0)
AS
$$
SELECT COUNT(1)
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."ORDERS"
WHERE o_custkey = custkey
$$;

SELECT C_name, C_address, order_cnt(C_custkey)
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."CUSTOMER";

JAVASCRIPT UDF:

CREATE OR REPLACE FUNCTION convert2fahrenheit (kelvin double)


RETURNS double
LANGUAGE JAVASCRIPT
AS
$$
return (KELVIN * 9/5 - 459.67);
$$;

SELECT convert2fahrenheit(290);

STORED PROCEDURES:

● Allow procedural logic and error handling that straight SQL does not support
● Implemented through JavaScript and, optionally (commonly), SQL
● JavaScript provides the control structures
● SQL is executed within the JavaScript by calling functions in an API

● Argument names are:
○ Case-insensitive in the SQL portion of stored procedure code
○ Case-sensitive in the JavaScript portion
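A minimal sketch of a JavaScript stored procedure that wraps a SQL statement; the procedure name, the events table, and the column names are illustrative assumptions.

CREATE OR REPLACE PROCEDURE purge_old_rows(days FLOAT)
RETURNS STRING
LANGUAGE JAVASCRIPT
AS
$$
  // Argument names are referenced in uppercase inside the JavaScript body
  var stmt = snowflake.createStatement({
      sqlText: "DELETE FROM events WHERE event_ts < DATEADD('day', ?, CURRENT_TIMESTAMP())",
      binds: [-DAYS]   // bind the negated number of days into the SQL statement
  });
  stmt.execute();
  return "Purge complete";
$$;

CALL purge_old_rows(30);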

Example:
User vamsi creates a procedure p1 that refers to the EMP table; vamsi has permissions on the emp table.
p1 (the procedure created by user vamsi) is then given to user1.

//user1 calls p1. user1 does not have an emp table.
Call p1;

//When user1 executes the procedure, it deletes the records from vamsi's emp table. That is called owner's rights.

Caller's rights - if the person who is executing the procedure has the table, then their own table is affected.

Owner's rights - even if the person calling the stored procedure does not have the table in their schema, the procedure still executes and affects the owner's table.
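To make the distinction concrete, here is a hedged sketch using the EXECUTE AS clause; the procedure names and the emp table follow the example above.

-- Owner's rights (the default): runs with the privileges of the role that owns the procedure
CREATE OR REPLACE PROCEDURE p1()
RETURNS STRING
LANGUAGE JAVASCRIPT
EXECUTE AS OWNER
AS
$$
  snowflake.execute({ sqlText: "DELETE FROM emp" });
  return "Deleted from the owner's emp table";
$$;

-- Caller's rights: runs with the privileges and session context of the calling role
CREATE OR REPLACE PROCEDURE p2()
RETURNS STRING
LANGUAGE JAVASCRIPT
EXECUTE AS CALLER
AS
$$
  snowflake.execute({ sqlText: "DELETE FROM emp" });
  return "Deleted from the caller's emp table";
$$;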

DATA SHARING

- Enables sharing of data through named Snowflake objects called shares.
- You can share tables, secure views (normal views cannot be shared) and secure UDFs.
- You share objects in your database (as the data provider) with other Snowflake accounts (data consumers).
- Snowflake always shares live data.
- Shares are read-only.

BENEFITS:

● Provide direct access


● Eliminates data silos
● Increase business efficiency

INTRODUCTION:

● Data Sharing enables sharing selected objects in a database in your account with other Snowflake
accounts. The following Snowflake database objects can be shared:
● Tables
● External tables
● Secure views
● Secure materialized views
● Secure UDFs

● The data producer can provide access to his live data within minutes without copying or moving
the data to any number of data consumers.
● The data consumer can query the shared data from the data provider without any performance bottlenecks, thanks to Snowflake's multi-cluster shared data architecture.

What is a Share?

● Named Snowflake objects that encapsulate all of the information required to share a database.
● The privileges that grant access to the database(s) and the schema containing the objects to share.
● The privileges that grant access to the specific objects in the database.
● The consumer accounts with which the database and its objects are shared.

● Shares are secure, configurable, and controlled 100% by the provider account:
● New objects added to a share become immediately available to all consumers, providing real-time
access to shared data.

● Access to a share (or any of the objects in a share) can be revoked at any time.
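A hedged sketch of direct sharing, reusing the demo_db objects created earlier in these notes; the consumer account identifier is a placeholder.

-- Provider account
CREATE SHARE emp_share;
GRANT USAGE  ON DATABASE demo_db                  TO SHARE emp_share;
GRANT USAGE  ON SCHEMA   demo_db.public           TO SHARE emp_share;
GRANT SELECT ON TABLE    demo_db.public.employees TO SHARE emp_share;
ALTER SHARE emp_share ADD ACCOUNTS = myorg.consumer_account;

-- Consumer account: create a read-only database from the share
CREATE DATABASE shared_emp FROM SHARE provider_account.emp_share;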

Options of Data Sharing:

1. Listing - In this you offer a share and additional metadata as a data product to one or more
accounts.
2. Direct Sharing - In which you directly share specific database objects to another account in your
region.
3. Data Exchange - In this you set up and manage a group of accounts and offer a share to that
group.

Data Provider:

- Any Snowflake account that creates shares and makes them available to other Snowflake accounts to consume.
- For each database you share, Snowflake supports using grants to provide granular access control to selected objects in the database.
- There is no limit on how many shares you can create.

Data Consumer:

- Any account that chooses to create database from share made available by the provider.
- Once the database is created from the shared object, we can access and query the object.
- We can consume as many shares as possible from data providers but can create only one database
per share.

Reader Account:

- It is for 3rd-party access.

- It is for a consumer who does not have a Snowflake account and is not ready to become a Snowflake customer.
- A reader account belongs to the provider account that created it.
- A reader account can only consume data from the provider that created it.

Legacy Data Share Challenges:

● Handling increased data size


● Decrypting sensitive data
● Changing file formats and schema
● Sharing data in real time
● Cleaning data

How does Sharing Work?

● With Secure Data Sharing, no actual data is copied or transferred between accounts.
● All sharing is accomplished through Snowflake’s unique services layer and metadata store.
● Shared data does not take up any storage in a consumer account and does not contribute to the
consumer’s monthly data storage charges.
● The only charges to consumers are for the compute resources (i.e. virtual warehouses) used to
query the shared data.

● Any full Snowflake account can both provide and consume shared data.

● Snowflake also supports reader accounts, a special type of account that consumes shared data from a single provider account.

ACCESS CONTROL:

Introduction:

Two Models:
1. Discretionary Access Control (DAC): Each object has an owner, who can in turn grant access to
that object.
2. Role-based Access Control (RBAC): Access privileges are assigned to roles, which are in turn
assigned to users.

Key Concepts:
● Securable object: An entity to which access can be granted. Unless allowed by a grant, access will
be denied.
● Role: An entity to which privileges can be granted. Roles are in turn assigned to users. Note that
roles can also be assigned to other roles, creating a role hierarchy.

● Privilege: A defined level of access to an object. Multiple distinct privileges may be used to
control the granularity of access granted.
● User: A user identity recognized by Snowflake, whether associated with a person or a program.

Securable Objects:

● Every securable object resides within a logical container in a hierarchy of containers. The top-
most container is the customer account.
● All other securable objects (such as TABLE, FUNCTION, FILE FORMAT, STAGE,
SEQUENCE, etc.) are contained within a SCHEMA object within a DATABASE.
● Every securable object is owned by a single role, which is typically the role used to create the object.
● The owning role has all privileges on the object by default, including the ability to grant or revoke
privileges on the object to other roles.
● Access to objects is defined by privileges granted to roles. The following are examples of
privileges on various objects in Snowflake:
○ Ability to create a warehouse.
○ Ability to list tables contained in a schema.
○ Ability to add data to a table.

Roles:

● Roles are the entities to which privileges on securable objects can be granted and revoked.
● Roles are assigned to users to allow them to perform actions required for business functions in
their organization.
● A user can be assigned multiple roles
● Roles are of two types:
○ System defined
○ User defined

System defined roles:

● ACCOUNTADMIN – The account admin is an extremely powerful role; it has all the privileges of SECURITYADMIN and SYSADMIN. The role should only be used for the initial setup of Snowflake. This role can also access billing information and visualize the resources used by each warehouse.
● SECURITYADMIN – The SECURITYADMIN (Security Administrator) is responsible for
users, roles and privileges. All roles, user, and privileges should be owned and created by the
security administrator.
● SYSADMIN – The SYSADMIN (Systems Admin) oversees creating objects inside Snowflake.
The SYSADMIN is responsible for all databases, schemas, tables, and views.
● PUBLIC – This is automatically granted to every user and role and is publicly available.
● USERADMIN – Role that is dedicated to user and role management only. More specifically, this
role:
○ Is granted the CREATE USER and CREATE ROLE security privileges.

○ Can create users and roles in the account.
○ This role can also manage users and roles that it owns.
● USERADMIN is a subset of SECURITYADMIN.

Custom Roles:

● Custom roles (i.e. any roles other than the system-defined roles) can be created by the SECURITYADMIN role as well as by any role to which the CREATE ROLE privilege has been granted.
● By default, the newly-created role is not assigned to any user, nor granted to any other role.
● It is recommended to create a hierarchy of custom roles, with the top-most custom role assigned to the system role SYSADMIN (see the sketch after this list).
○ This role structure allows system administrators to manage all objects in the account,
such as warehouses and database objects, while restricting management of users and roles
to the SECURITYADMIN OR ACCOUNTADMIN roles.
○ If a custom role is not assigned to SYSADMIN through a role hierarchy, the system
administrators will not be able to manage the objects owned by the role.
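A minimal sketch of a custom role wired into the recommended hierarchy (the role, warehouse, database, and user names are hypothetical):

USE ROLE USERADMIN;
CREATE ROLE analyst;

USE ROLE SECURITYADMIN;
GRANT USAGE ON WAREHOUSE compute_wh TO ROLE analyst;
GRANT USAGE ON DATABASE sales_db TO ROLE analyst;

-- Attach the custom role under SYSADMIN so system administrators can manage the objects it owns
GRANT ROLE analyst TO ROLE SYSADMIN;

-- Finally, assign the role to a user
GRANT ROLE analyst TO USER john_doe;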

Privileges:

● For each securable object, there is a set of privileges that can be granted on it.
● Privileges must be granted on individual objects, e.g. the SELECT privilege on the mytable table.
● To simplify grant management, future grants allow defining an initial set of privileges on objects
created in a schema; i.e. the SELECT privilege on all new tables created in the myschema
schema.
● Privileges are managed using the GRANT and REVOKE commands (see the sketch after this list).
● In regular (i.e. non-managed) schemas, use of these commands is restricted to the role that owns
an object (i.e. has the OWNERSHIP privilege on the object) or roles that have the MANAGE
GRANTS global privilege for the object (typically the SECURITYADMIN role).
● In managed access schemas, object owners lose the ability to make grant decisions. Only the
schema owner or a role with the MANAGE GRANTS privilege can grant privileges on objects in
the schema, including future grants, centralizing privilege management.
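A short sketch of GRANT, REVOKE, and a future grant (the database, schema, table, and role names are hypothetical):

GRANT SELECT ON TABLE mydb.myschema.mytable TO ROLE analyst;
REVOKE SELECT ON TABLE mydb.myschema.mytable FROM ROLE analyst;

-- Future grant: every table created later in myschema automatically gets SELECT for the analyst role
GRANT SELECT ON FUTURE TABLES IN SCHEMA mydb.myschema TO ROLE analyst;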

Object Ownership and Control:

● Roles Own Everything
● Users Own Nothing
● Ownership Can Be Transferred
● Owning an Object is Different from Owning a Role

RESOURCE MONITOR

Introduction:

● A virtual warehouse consumes Snowflake credits while it runs.
● To help control costs and avoid unexpected credit usage caused by running warehouses, Snowflake provides resource monitors.
● The number of credits consumed depends on the size of the warehouse and how long it runs.
● Limits can be set for a specified interval or date range. When these limits are reached and/or approached, the resource monitor can trigger various actions, such as sending alert notifications and/or suspending the warehouses.
● Resource monitors can only be created by account administrators (i.e. users with the ACCOUNTADMIN role); however, account administrators can choose to enable users with other roles to view and modify resource monitors using SQL (a sketch follows the documentation link below).

Documentation: https://docs.snowflake.com/en/user-guide/resource-monitors
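A minimal sketch of creating a monthly resource monitor and assigning it to a warehouse (the quota, thresholds, and warehouse name are assumptions):

USE ROLE ACCOUNTADMIN;

CREATE RESOURCE MONITOR monthly_limit
  WITH CREDIT_QUOTA = 100
       FREQUENCY = MONTHLY
       START_TIMESTAMP = IMMEDIATELY
  TRIGGERS ON 75 PERCENT DO NOTIFY
           ON 95 PERCENT DO SUSPEND
           ON 100 PERCENT DO SUSPEND_IMMEDIATE;

ALTER WAREHOUSE compute_wh SET RESOURCE_MONITOR = monthly_limit;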

Tips:

● Resource monitors are not intended for strictly controlling consumption on an hourly basis; they
are intended for tracking and controlling credit consumption per interval (day, week, month, etc.)
● They are not intended for setting precise limits on credit usage (i.e. down to the level of
individual credits).

For example, when credit quota thresholds are reached for a resource monitor, the assigned warehouses
may take some time to suspend, even when the action is Suspend Immediate, thereby consuming
additional credits.

Resource Monitor Notifications:

● When a resource monitor reaches the threshold for an action, it generates one of the following notifications, based on the action performed:
● The assigned warehouses will be suspended after all running queries complete.
● All running queries in the assigned warehouses will be cancelled and the warehouses
suspended immediately.
● A threshold has been reached, but no action has been performed.
● The notification is sent to all account administrators who have enabled receipt of
notifications.
● Notification can be received by account administrators through the web interface and/or
email; however, by default, notifications are not enabled:
● To receive notifications, each account administrator must explicitly enable notifications
through their preferences in the web interface.
● In addition, if an account administrator chooses to receive email notifications, they must
provide a valid email address (and verify the address) before they will receive any emails.

Clustering Key:

● A clustering key is a subset of columns in a table (or expressions on a table) that are
explicitly designed to co-locate the data in the table in the same micro-partitions.
● Some general indicators that can help determine whether to define a clustering key for a
table include:
○ Queries on the table are running slower than expected or have noticeably
degraded over time.
○ The clustering depth for the table is large (the clustering-information example after this list shows how to check it).
○ A clustering key can be defined at table creation or afterward, and can also be altered or dropped at any time.
● Clustering is very useful for very large tables where the ordering of the columns is not optimal, or where extensive DML operations on the table have caused the table's natural clustering to degrade.
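To check how well a table is clustered on a given key, Snowflake provides system functions; the table and column names below are hypothetical:

-- Detailed clustering statistics (average depth, overlap histogram, etc.)
SELECT SYSTEM$CLUSTERING_INFORMATION('sales_db.public.orders', '(order_date)');

-- Average clustering depth only
SELECT SYSTEM$CLUSTERING_DEPTH('sales_db.public.orders', '(order_date)');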

Benefits of Defining a Clustering Key:

● Improved scan efficiency in queries by skipping data that does not match filtering
predicates.
● Better column compression than in tables with no clustering.
● After a key has been defined on a table, no administration is required, unless you choose
to drop or modify the key.
● All future maintenance on the rows in the table (to ensure optimal clustering) is performed automatically by Snowflake.

Note: Although clustering can improve the performance and reduce the cost of some queries, the
compute resources used to perform clustering consume credits. As such, you should cluster only
when queries will benefit substantially from the clustering.

● Queries benefit from clustering when the queries filter or sort on the clustering key for
the table.

Clustering Examples:

Clustering by base columns

● CREATE OR REPLACE TABLE t1 (c1 date, c2 string, c3 number) CLUSTER BY (c1, c2);
● SHOW TABLES LIKE 't1';

Clustering by Expression

● CREATE OR REPLACE TABLE t2 (c1 timestamp, c2 string, c3 number) CLUSTER BY (to_date(c1), substring(c2, 0, 10));
● SHOW TABLES LIKE 't2';

Dropping a clustering key

● ALTER TABLE t1 DROP CLUSTERING KEY;
● SHOW TABLES LIKE 't1';

Re-clustering:

● As DML operations (INSERT, UPDATE, MERGE, DELETE, COPY) are performed on a clustered table, the data in the table might become less clustered. Periodic/regular reclustering of the table is required to maintain optimal clustering.
● During reclustering, Snowflake uses the clustering key for a clustered table to reorganize the column data, so that related records are relocated to the same micro-partition. This DML operation deletes the affected records and re-inserts them, grouped according to the clustering key.

● Reclustering in Snowflake is automatic; no maintenance is needed.

❖ Clustering Keys & Clustered Tables - https://docs.snowflake.com/en/user-guide/tables-clustering-keys#what-is-a-clustering-key

Usage Notes:

● If you define two or more columns/expressions as the clustering key for a table, the order
has an impact on how the data is clustered in micro-partitions.
● An existing clustering key is copied when a table is created using CREATE TABLE ……
CLONE.
● An existing clustering key is not propagated when a table is created using CREATE
TABLE …. LIKE.
● An existing clustering key is not supported when a table is created using CREATE TABLE … AS SELECT; however, you can define a clustering key after the table is created (as shown in the sketch below).
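A short sketch of that case, using hypothetical table and column names:

CREATE OR REPLACE TABLE orders_2023 AS
  SELECT * FROM orders WHERE order_year = 2023;   -- no clustering key is carried over

ALTER TABLE orders_2023 CLUSTER BY (order_date);  -- define the key after creation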

When to use Clustering:

● Snowflake handles micro-partitioning automatically for every table; a clustering key is an optional, additional optimization.
● You have fields that are accessed frequently in WHERE clauses.
● You have tables that contain data in the multi-terabyte (TB) range.
● You have columns that are actively used in filter clauses and in queries that aggregate data. For example, when queries frequently use a date column as a filter condition, choosing that date column as the clustering key is a good idea.
● You need more granular pruning than the table's natural (load-order) clustering provides.
● To get the most benefit, use the same column(s) for the clustering key that your queries already use for filtering.

Attention:

● Clustering keys are not intended for all tables.
● The size of a table, as well as the query performance for the table, should dictate whether to define a clustering key for the table.
● In particular, to see performance improvements from a clustering key, a table has to be large enough to consist of a sufficiently large number of micro-partitions, and the column(s) defined in the clustering key have to provide sufficient filtering to select a subset of these micro-partitions.
● In general, tables in multi-terabyte (TB) range will experience the most benefit from
clustering, particularly if DML is performed regularly/continually on these tables.

1. Check the time it takes to execute the query below - it will take a long time.

USE SCHEMA SNOWFLAKE_SAMPLE_DATA.TPCDS_SF10TCL;

SELECT *
FROM store_returns, date_dim
WHERE sr_returned_date_sk = d_date_sk;

In the query profile, check how many bytes were scanned and how many partitions were scanned.

2. Execute the query below, which selects only a few columns (keeping the same join), and again check the time taken and bytes scanned. Both should be reduced because fewer columns appear in the select list.

USE SCHEMA SNOWFLAKE_SAMPLE_DATA.TPCDS_SF10TCL;

SELECT
    d_year
   ,sr_customer_sk AS ctr_customer_sk
   ,sr_store_sk AS ctr_store_sk
   ,sr_return_amt_inc_tax
FROM store_returns, date_dim
WHERE sr_returned_date_sk = d_date_sk;

3. Add a filter on d_year to the WHERE clause and check the time and bytes scanned; partition pruning should reduce both further.

USE SCHEMA SNOWFLAKE_SAMPLE_DATA.TPCDS_SF10TCL;

SELECT
    d_year
   ,sr_customer_sk AS ctr_customer_sk
   ,sr_store_sk AS ctr_store_sk
   ,sr_return_amt_inc_tax
FROM store_returns, date_dim
WHERE sr_returned_date_sk = d_date_sk
  AND d_year = 1999;

2. Clustering

CREATE TABLE TRANSACTIONS (
    TXN_ID      STRING,
    TXN_DATE    DATE,
    CUSTOMER_ID STRING,
    QUANTITY    DECIMAL(20),
    PRICE       DECIMAL(30,2),
    COUNTRY_CD  STRING
);

Please execute the query below 10 times to load a good amount of data.

INSERT INTO TRANSACTIONS
SELECT
    UUID_STRING() AS TXN_ID
   ,DATEADD(DAY, UNIFORM(1, 500, RANDOM()) * -1, '2020-10-15') AS TXN_DATE
   ,UUID_STRING() AS CUSTOMER_ID
   ,UNIFORM(1, 10, RANDOM()) AS QUANTITY
   ,UNIFORM(1, 200, RANDOM()) AS PRICE
   ,RANDSTR(2, RANDOM()) AS COUNTRY_CD
FROM TABLE(GENERATOR(ROWCOUNT => 10000000));

Execute the query below and check the number of partitions scanned in the query profile.

SELECT * FROM TRANSACTIONS
WHERE TXN_DATE BETWEEN DATEADD(DAY, -31, '2020-10-15') AND '2020-10-15';

Since the WHERE clause filters on TXN_DATE, cluster the table on that column.

ALTER TABLE TRANSACTIONS CLUSTER BY ( TXN_DATE );

Execute the query below again and check the profile - clustering should have reduced the number of partitions scanned.
SELECT * FROM TRANSACTIONS

WHERE TXN_DATE BETWEEN DATEADD(DAY, -31, '2020-10-15') AND '2020-10-15';
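As an optional follow-up, you can also inspect the clustering statistics of the TRANSACTIONS table created above to see the effect of the clustering key:

SELECT SYSTEM$CLUSTERING_INFORMATION('TRANSACTIONS', '(TXN_DATE)');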
