Snowflake
Snowflake is an advanced data platform provided as Software-as-a-Service (SaaS). It is a cloud-based data warehouse platform built on top of public cloud infrastructure (AWS, Azure, or GCP).
In contrast to traditional data warehouse solutions, Snowflake provides a warehouse that is faster, easier to set up, and more flexible. The software is natively designed for the cloud, which means you cannot install it on an on-premises system.
It is a leader in data management solutions for analytics.
Snowflake's unique multi-layered architecture allows for performance, scalability, elasticity, and
concurrency.
Snowflake runs completely on cloud infrastructure. All components of Snowflake’s service (other than
optional command line clients, drivers, and connectors), run in public cloud infrastructures.
Snowflake uses virtual compute instances for its compute needs and a storage service for persistent
storage of data. Snowflake cannot be run on private cloud infrastructures (on-premises or hosted).
Traditional architectures are either shared-disk, in which multiple nodes access data on a single storage system, or shared-nothing, in which each node of the data warehouse stores a part of the data locally.
Snowflake combines the benefits of both approaches in an innovative new design. Snowflake processes queries using MPP (Massively Parallel Processing) compute clusters, where each node in a cluster stores a portion of the entire data set locally. So Snowflake can be described as a mix of shared-disk and shared-nothing architecture.
Snowflake’s architecture is a hybrid of traditional shared-disk and shared-nothing database architectures.
Similar to shared-disk architectures, Snowflake uses a central data repository for persisted data that is
accessible from all compute nodes in the platform. But similar to shared-nothing architectures, Snowflake
processes queries using MPP (massively parallel processing) compute clusters where each node in the
cluster stores a portion of the entire data set locally. This approach offers the data management simplicity
of a shared-disk architecture, but with the performance and scale-out benefits of a shared-nothing
architecture.
● When data is loaded into Snowflake, Snowflake reorganizes that data into its internal optimized,
compressed, columnar format. Snowflake stores this optimized data in cloud storage.
● It stores all data (structured or semi-structured) in databases (a database is a logical grouping of objects, consisting primarily of tables and views). All tasks related to the data are handled through SQL queries.
● The underlying file storage is managed by the cloud provider's object store (e.g., S3 in an AWS-hosted Snowflake account). The data in this storage is always encrypted, compressed, and distributed across multiple partitions to optimize performance.
● Pass-through pricing: you pay only for the amount of storage actually used.
● Up to 5x compression: data is not stored in its original format; it is compressed before being stored.
● Native support for semi-structured data.
❖ The data resides inside tables, which are physically stored as files in cloud storage (e.g., Amazon S3).
❖ Each file holds blocks of data stored in a columnar format.
❖ The data files are never overwritten in cloud storage; they are immutable.
Query Processing layer / Compute layer / Virtual Warehouse (VW) layer: processes the data in a database.
● Each virtual warehouse is an independent compute cluster that does not share compute resources
with other virtual warehouses. As a result, each virtual warehouse has no impact on the
performance of other virtual warehouses.
For more information, see https://docs.snowflake.com/en/user-guide/warehouses
https://docs.snowflake.com/en/user-guide/ui-query-profile
Snowflake compute costs depend on the amount of time warehouses have run and the sizes of the running warehouses.
Snowflake data storage costs include persistent data stored in permanent tables and data retained to enable data recovery (Time Travel and Fail-safe).
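As a rough illustration of how compute cost is controlled, a warehouse can be created with a given size and set to auto-suspend when idle (a minimal sketch; the warehouse name and settings are only examples):
CREATE WAREHOUSE IF NOT EXISTS demo_wh
WAREHOUSE_SIZE = 'XSMALL'    -- size determines the credit consumption rate
AUTO_SUSPEND = 60            -- suspend after 60 seconds of inactivity
AUTO_RESUME = TRUE;          -- resume automatically when a query arrives
ALTER WAREHOUSE demo_wh SET WAREHOUSE_SIZE = 'MEDIUM';   -- resize on demand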
Cloud Services layer: coordinates activities across Snowflake, such as authentication, metadata management, query parsing and optimization, and access control.
Key benefits of Snowflake:
● Cloud agnostic
● Performance and speed
● User-friendly UI/UX
● Reduced administration overhead
● On-demand pricing
● Supports a variety of file formats
Connecting to Snowflake:
● Command line clients (e.g. SnowSQL) which can also access all aspects of managing and using
Snowflake.
● ODBC and JDBC drivers that can be used by other applications (e.g. Tableau) to connect to
Snowflake.
● Native connectors (e.g. Python, Spark) that can be used to develop applications for connecting to
Snowflake.
● Third-party connectors that can be used to connect applications such as ETL tools (e.g.
Informatica) and BI tools (e.g. ThoughtSpot) to Snowflake.
Supported Regions:
● Each Snowflake account is hosted in a single region
● Features and services are identical across regions
● Unit costs for credits and data storage differ by region
Snowflake Editions :
Micro Partitions:
● A micro-partition is a contiguous unit of storage that holds table data
○ 50 - 500 MB of uncompressed data
○ Generally about 10 MB to 16 MB compressed
● In other words, each micro-partition holds roughly 10 to 16 MB of compressed data, which is equivalent to 50 - 500 MB of uncompressed data.
● A table can have many micro-partitions, depending on how much data is loaded into it.
● Micro-partitions are IMMUTABLE: once created, they can never be modified. Every insert or update creates new micro-partitions.
● The cloud services layer stores metadata about every micro-partition:
○ MIN/MAX (range of values in each column)
○ Number of distinct values
○ NULL count
● We cannot create Indexes in Snowflake.
● Snowflake automatically collects and maintains metadata (stored in the cloud services layer) about tables and their underlying micro-partitions, including:
○ Table level:
■ Row count
■ Table size (in bytes)
■ File reference and table versions
○ Micro-partition column level:
■ Number of distinct values
■ MIN and MAX values
■ NULL count
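Because this metadata is collected automatically, some simple queries can be answered from the cloud services layer alone, without starting a warehouse. Illustrative queries against a hypothetical table:
SELECT COUNT(*) FROM employees;                   -- row count comes from table metadata
SELECT MIN(emp_id), MAX(emp_id) FROM employees;   -- served from per-partition MIN/MAX metadata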
Data Storage Billing :
● Billed for actual storage used
○ Daily average terabytes per month
● On-Demand pricing
○ Billed in arrears for the storage used
○ Around $40/terabyte/month
○ Minimum monthly charge of $25
● Pre-Purchased Capacity
○ Billed up front with a commitment to a certain capacity
○ Price varies on amount and cloud platform
○ Customer is notified at 70% of capacity
Zero-Copy Cloning
● Quickly takes a snapshot of any table, schema, or database.
● When the clone is created
○ All micro-partitions in both tables are fully shared.
○ Micro-partition storage is owned by the original (oldest) table; the clone only references it
● No additional storage costs until changes are made to the original or the clone.
● Often used to quickly spin up Dev or Test environments.
● Effective backup option as well.
Documentation: https://docs.snowflake.com/en/user-guide/object-clone
● The cloned table points to the same micro-partitions as the original table.
● If records are modified through the cloned table, a new micro-partition is created with the modified data and is referenced only by the clone. The original table will not see the new micro-partition.
● In the same way, if any changes are made to the original table, the cloned table will not see them.
● When you clone a database, all the schemas and tables in that database are cloned.
● When you clone a schema, all the tables in that schema are cloned.
Any DDL command (including cloning) is a metadata-only operation; the metadata is maintained in the cloud services layer. The compute engine is involved only when processing moves below the cloud services layer. A warehouse is started for DML commands, not for DDL commands.
Example:
CREATE TABLE empclone CLONE emp;
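Cloning works the same way one level up; a quick sketch using example object names:
CREATE SCHEMA public_clone CLONE public;
CREATE DATABASE demo_db_clone CLONE demo_db;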
Practice:
-- Cloning Tables
-- Create a sample table
create or replace table demo_db.public.employees
(emp_id number,
first_name varchar,
last_name varchar
);
-- Show the content of the table employees in the demo_db database and the public schema
-- Show the content of the clone which should be the same as the original table that it was cloned from.
-- Verify the content of the clone table to show the original set of records and the additional new record that we just added
-- Verify the content of the original employees table. It should be the original content without the record we added to the clone.
select * from demo_db.public.employees;
-- Check the content of the employees table. The employee with Emp_id = 100 is no longer there
-- Check the content of the employees_clone table. The employee with Emp_id = 100 is still there.
-- Cloning Databases
-- Showing tables in the demo_db_clone would show neither the employees table nor the employees_clone table. They are gone.
show tables;
use demo_db;
-- Showing the tables in the original demo_db database will show all the tables in that database including the employees and employees_clone tables
show tables;
-- Cloning Schema
-- Show tables in the original public schema. That should return all the tables in the public schema including a table called employees;
show tables;
show tables;
-- Show tables in the public_clone schema. The result of that command should not have in it a table called employees_clone;
show tables;
-- The result of that command should include in it the table called employees_clone since it was dropped from the clone schema and not the original public schema.
show tables;
TIME TRAVEL:
● Time Travel is one of the powerful CDP (Continuous Data Protection) features for ensuring the maintenance and availability of historical data.
● It helps in recovering data-related objects (databases, schemas, tables) that have been dropped accidentally or intentionally.
● It allows duplicating and backing up data from key points in the past.
● It also allows analyzing data usage / manipulation over a specified period of time.
● We can retrieve the data present in a table at a specified point in time.
Documentation: https://docs.snowflake.com/en/user-guide/data-time-travel
● At a specific time:
SELECT * FROM my_table1
AT(TIMESTAMP => 'Mon, 01 May 2020 16:20:00' :: timestamp);
● At an OFFSET (see the sketch after this list)
● Restoring dropped objects:
UNDROP TABLE/SCHEMA/DATABASE
● Snowflake keeps the old versions of the micro-partitions for the specified retention time.
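A few illustrative Time Travel statements (object names and the query ID placeholder are examples):
-- Query the table as it was 5 minutes ago
SELECT * FROM my_table1 AT(OFFSET => -60*5);
-- Query the table as it was just before a specific statement ran
SELECT * FROM my_table1 BEFORE(STATEMENT => '<query_id>');
-- Recover a dropped table
UNDROP TABLE my_table1;
-- Change the Time Travel retention period (up to 90 days on Enterprise edition)
ALTER TABLE my_table1 SET DATA_RETENTION_TIME_IN_DAYS = 30;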
FAIL-SAFE:
● The defined period for Fail-safe is 7 days.
● It is a non-configurable, 7-day retention of historical data that begins after the Time Travel retention period ends.
● Once data is in Fail-safe, users cannot query or modify it; it is accessible only to Snowflake personnel.
● Admins can view Fail-safe usage in the Snowflake web UI under Account > Billing & Usage.
● Fail-safe is not supported for temporary and transient tables.
Documentation: https://docs.snowflake.com/en/user-guide/data-failsafe
REPLICATION:
● Replication helps keep database objects and stored data synchronized between one or more accounts (within the same organization).
● With replication, we can sync data from one cloud service provider (e.g., Azure) to another (e.g., AWS).
● We can also share data across clouds using replication.
● The unit of replication is a database. Permanent and transient databases can be replicated; temporary databases cannot.
● The secondary (replica) database is always read-only. This secondary (standby) database can be promoted to become the primary database if the primary fails.
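A minimal sketch of the commands involved, assuming example account and database names:
-- On the source (primary) account: allow replication of the database to another account
ALTER DATABASE mydb ENABLE REPLICATION TO ACCOUNTS myorg.account2;
-- On the target account: create the secondary (read-only) database and refresh it
CREATE DATABASE mydb AS REPLICA OF myorg.account1.mydb;
ALTER DATABASE mydb REFRESH;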
Documentation: https://docs.snowflake.com/en/user-guide/database-replication-intro
Database objects that are replicated include:
● Tables
○ Permanent
○ Transient
○ Automatic Clustering of Clustered tables
○ Constraints
● Sequences
● Views (Both Standard and Secured)
○ Materialized (Both Standard and Secured)
● File Formats
● Stored Procedures
● User-Defined functions (UDF)
○ SQL and Javascript
● Policies
● Tags
○ Object Tagging
● Snowflake encrypts database files in-transit from the source account to the target account.
● If Tri-Secret Secure is enabled (for the source and the target accounts) the files are
encrypted using the public key for an encryption key pair.
● The encryption key pair is protected by the account master key (AMK) for the target
account.
● Initial database replication and subsequent synchronization operations incur data transfer
charges
○ These are termed “egress fees” and only apply when data is passed across regions
and/or across cloud providers.
○ The cost is passed along to customers.
● Rate or pricing is determined by the location of the source and target accounts, and the
cloud provider.
● Data Transfer Usage is shown in Billing and Usage Data Transfer.
Table Types:
Views:
A view allows the result of a query to be accessed as if it were a table. The query is specified in
the CREATE VIEW statement.
Views serve a variety of purposes, including combining, segregating, and protecting data. For example, you can create separate views that meet the needs of different types of employees, such as doctors and accountants at a hospital.
A view can be used almost anywhere that a table can be used (joins, subqueries, etc.).
● Materialized views improve query performance.
● If an aggregate calculation on a regular table would normally take a long time, we can pre-compute that aggregate data and store it in a materialized view. The next time a similar aggregate calculation is needed, it is read from the materialized view instead.
● Materialized views are refreshed automatically, so they return up-to-date information from the underlying tables.
Example:
CREATE VIEW v1 AS SELECT * FROM emp;
❖ Creating a normal view does not take time, whereas creating a materialized view does, because a normal view only has to store the query definition while a materialized view also has to store the data.
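For comparison, a materialized view sketch (table and column names are examples; note that a Snowflake materialized view can reference only a single table, with no joins):
CREATE MATERIALIZED VIEW mv_sales_by_region AS
SELECT region, SUM(amount) AS total_amount
FROM sales
GROUP BY region;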
Constraints:
Standard SQL:
■ Snowflake also supports CREATE SEQUENCE (see the sketch after this list).
○ DML
■ INSERT, including multi-table insert
■ MERGE, UPDATE, DELETE, and TRUNCATE
● Query Syntax
○ Snowflake SELECT supports all the standard syntax options including:
■ All standard JOIN types: INNER, OUTER, SELF, LEFT, and RIGHT JOIN
■ PIVOT - used to transform a narrow table into a wider table for reporting purposes.
■ SAMPLE - returns a random subset of rows from a table.
■ GROUP BY options including CUBE, GROUPING SETS, and ROLLUP.
○ Since Snowflake supports SQL:2003, most of the options supported by other standard RDBMSs will work.
○ Snowflake extensions for a SELECT statement include the AT and BEFORE options (used for Time Travel).
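A small sequence sketch (the name is an example):
CREATE SEQUENCE seq_emp_id START = 1 INCREMENT = 1;
SELECT seq_emp_id.NEXTVAL;   -- returns the next value on each call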
Snowflake Transactions:
Describe Object:
● Describes the details for a specified object
● Used for
○ Tables, Schemas and Views
○ Sequences
○ File Formats
○ Stages
○ Pipes
○ Tasks and Streams
○ Functions and Procedures
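For example, using objects from the earlier examples:
DESCRIBE TABLE employees;   -- columns, types, defaults, constraints
DESCRIBE VIEW v1;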
Show Objects:
● List the existing objects for the specified object type. Output includes:
○ Common properties (name, creation timestamp, owning role, comment, etc).
○ Object-specific properties
Example: SHOW TABLES LIKE '%testing%';
● Used for
○ Database objects (Tables, schemas, views, file formats, sequences, stages, tasks,
pipes, etc.)
○ Account Object Types (Warehouses, databases, grants, roles, users, etc.)
○ Session / User Operations (Parameters, variables, transactions, locks, etc.)
○ Account Operations (Replication databases, replication accounts, regions, global accounts, etc.)
SHOW PARAMETERS;
SHOW PARAMETERS IN TABLE employees;
GET_DDL:
● Returns a DDL statement that can be used to recreate the specified object.
● Used for
○ Databases, Schemas, Tables, External tables
○ Views
○ Streams and Tasks
○ Sequences
○ File Formats, Pipes
○ UDFs, Stored Procedures
○ Policies
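For example, using the table from the cloning practice:
SELECT GET_DDL('TABLE', 'demo_db.public.employees');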
When a query is executed for the first time it takes more time, but when the same query is executed again it runs much faster. This is because the results of the query are stored in a cache, and subsequent executions of an identical query are served from that cache instead of being recomputed.
● Snowflake's optimizer is a cost-based optimizer that decides how the query should be executed. It creates an execution plan and hands it to the next layer, the compute layer, which in turn reads from the storage layer. Following the optimizer's instructions, the required data is fetched from the storage layer by reading the relevant micro-partitions.
● After the data is fetched from storage, a result set is created: the data that is to be displayed for the query. A copy of this result set is first stored in the query result cache, and from there it is returned to the user.
● This query result cache is retained for the next 24 hours. Within these 24 hours, if an identical query is run, the data comes from the cache rather than from disk.
Types of Cache:
1. Metadata Cache
2. Query Result cache
3. Data cache
Metadata cache and Query result cache are both available in the cloud service layer.
Data cache will be in the compute layer i.e., Virtual Warehouse layer.
For MIN, MAX, and COUNT, the information can be obtained without going through the virtual warehouse layer at all; the result comes from the metadata cache. So no warehouse credits are consumed for executing such a query.
1. METADATA CACHE:
Whenever a table is created, the metadata cache stores not only the table structure but also how many records the table contains and the minimum and maximum value of each column per micro-partition. (SHOW commands are also served from this metadata.)
○ NOTE: Cloud services charges may still apply if cloud services usage exceeds 10% of your overall daily compute usage.
2. QUERY RESULT CACHE:
● Query results are stored and managed by the cloud services layer.
● Snowflake reads from this cache only if an identical query is run and the base tables have not changed.
● Available to other users:
○ SELECT: Any role that has the SELECT permission on all the tables involved
can use the Query result cache
How it Works?
● Result sets are cached for 24 hours; counter resets each time matching query is re-used
● Result reuse controlled by USE_CACHED_RESULT parameter at account/user/session
level
Two queries that differ only in letter case are not treated as identical for result-cache reuse, because the query text must match exactly (see the example below).
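For instance, these two statements return the same rows but are not considered identical, so the second does not reuse the first one's cached result:
SELECT * FROM employees;
select * from EMPLOYEES;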
Benefits
● Fast
● Will never give stale results
● No virtual warehouse used
○ Unless the cache is accessed using a RESULT_SCAN
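RESULT_SCAN lets you query a previous result set as if it were a table (a quick sketch):
SELECT * FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()));   -- operate on the previous query's result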
3. DATA CACHE:
Example: if a query returns only 2 out of 10 records, the query result cache stores just those 2 result rows, whereas the data cache holds the data of all 10 records from the scanned partitions (for the columns that were read).
ALTER SESSION SET USE_CACHED_RESULT=FALSE;
Query Example:
● The data cache only captures data from the partitions scanned by the query
● The data cache only contains the columns referenced in the query
● In the example above, the data cached is limited to what the query actually read.
● For example, if the "state='CA'" WHERE clause returned 20 rows out of 100, only the firstname and state columns will be cached for the 20 records returned.
Effectiveness Tip:
● Group and execute similar queries on the same virtual warehouse to maximize data cache
reuse, for performance and query optimization.
● When a query is run, file headers and column data retrieved are stored on SSD
● Virtual warehouse will first read any locally available data, then read remainder from
remote cloud storage
● Data is flushed out in a Least Recently Used (LRU) fashion when cache fills
● Remote disk I/O in the data cache refers to the process of reading or writing data from or
to the remote storage location where Snowflake stores its data.
What is “EXPLAIN PLAN”?
For any query, you can generate an execution plan. EXPLAIN does not return the data; instead it shows how the query would be executed.
EXPLAIN
SELECT
l_orderkey
FROM
lineitem
LIMIT 10;
● Snowflake command that displays the logical execution steps without executing
● Results show in JSON, text or tabular format
● Helpful for performance tuning and for controlling costs
What’s In a Plan?
Key Points:
● Partition Pruning
● Join Ordering
● Join Types
QUERY PROFILE
Profile Overview
SQL PERFORMANCE TIPS
JOIN ON UNIQUE KEYS
Troubleshooting Scenario:
● Joining on non-unique keys can explode your data output (join explosion)
○ Each row in table 1 matches multiple rows in table 2
○ Apply appropriate filters as early as possible in the query
○ For naturally clustered tables, apply appropriate predicate columns (e.g., date
columns) that have a higher correlation to ingestion order
TEMPORARY TABLE
● Great for long calculations that need to be referenced multiple times in same query
● Great for materializing intermediate result
● Backed by micro-partitions and may assist in pruning
● Exist only during current session
● Disappear when you disconnect
● Sometimes, it’s not possible for the SQL pruner to determine the statistics of the function
from the statistics of the filter column
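A small sketch of materializing an intermediate result in a temporary table (names are examples):
CREATE TEMPORARY TABLE tmp_daily_sales AS
SELECT order_date, SUM(amount) AS total_amount
FROM orders
GROUP BY order_date;
SELECT * FROM tmp_daily_sales WHERE total_amount > 1000;   -- reusable within the same session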
Cache demo
-- Check usage of the query result cache: the first execution of the query fetches data from the storage layer, and subsequent executions of the identical query are served from the result cache.
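A query to try, assuming access to the sample database (run it twice and compare the query profiles):
select c_name, c_address
from snowflake_sample_data.tpch_sf1.customer
where c_nationkey = 10;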
-- Check usage of the data cache. On a repeated run you should see that the percentage scanned from cache is 100.00%.
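The first query of this demo is not preserved in the notes; a plausible starting query with fewer columns (assuming the same nation table) would be:
select n_nationkey, n_name from nation where n_nationkey > 10;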
-- Add columns, run the query again and check the query profile. The percentage scanned from cache
-- should be less than 100%, because we added columns that were not fetched earlier.
select n_nationkey, n_name, n_regionkey, n_comment from nation where n_nationkey >10;
You need to create a stage: from external files you load data into the stage, and from the stage you load the data into the target table.
Stage - a temporary landing area where data files are held before they are loaded into a table
Target - a permanent or transient table where the data will be stored
DATA LOADING:
Snowflake supports loading data from files staged in any of the following locations, regardless of the
cloud platform for your Snowflake account:
● Internal (i.e. Snowflake) stages
● Amazon S3
● Google Cloud Storage
● Microsoft Azure blob storage
Snowflake supports both bulk data loading and continuous data loading (Snowpipe). Likewise, Snowflake
supports unloading data from tables into any of the above staging locations.
Note
Some data transfer billing charges may apply when loading data from files staged across different
platforms. For more information, see Understanding Data Transfer Cost.
Snowflake Documentation:
STAGES IN SNOWFLAKE
Stages incur storage cost because they physically hold data files.
Internal stages (named, user, and table stages) are backed by Snowflake-managed cloud storage in the storage layer, while external stages reference files held in your own cloud storage.
All stages are physical objects in the sense that they hold (or point to) data files.
STAGES:
● A stage specifies where data files are stored (i.e. “staged”) so that the data in the files can
be loaded into a table.
● Types of Stages
○ User Stages
○ Table Stages
○ Internal Named Stages
○ External Stages
● By default, each user and table in Snowflake is automatically allocated an internal stage
for staging data files to be loaded. In addition, you can create named internal stages.
● File staging information is required during both steps in the data loading process:
○ You must specify an internal stage in the PUT command when uploading files to
Snowflake.
○ You must specify the same stage in the COPY INTO <table> command when
loading data into a table.
USER STAGES:
● Each user has a Snowflake stage allocated to them by default for storing files.
● It is a convenient option if your files will only be accessed by a single user, but need to be
copied into multiple tables.
● User stages are referenced using @~; e.g. use LIST @~ to list the files in a user stage.
● Unlike named stages, user stages cannot be altered or dropped.
● User stages do not support setting file format options. Instead, you must specify file
format and copy options as part of the COPY INTO <table> command.
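For example, uploading a local file to your user stage and listing it (the file path is an example; PUT must be run from a client such as SnowSQL, not from the web UI):
PUT file:///tmp/employees.csv @~/staged;
LIST @~;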
TABLE STAGES:
● Each table has a Snowflake stage allocated to it by default for storing files.
● Is a convenient option if your files need to be accessible to multiple users and only need
to be copied into a single table.
● Table stages have the following characteristics and limitations:
○ Table stages have the same name as the table; e.g. a table named mytable has a
stage referenced as @%mytable.
○ Unlike named stages, table stages cannot be altered or dropped.
○ Table stages do not support setting file format options. Instead, you must specify
file format and copy options as part of the COPY INTO <table> command.
○ Table stages do not support transforming data while loading it (i.e. using a query
as the source for the COPY command).
INTERNAL NAMED STAGES:
● Named stages are database objects that provide the greatest degree of flexibility for data loading:
○ Users with the appropriate privileges on the stage can load data into any table.
○ Ownership of the stage can be transferred to another role, and privileges granted
to use the stage can be modified to add or remove roles.
○ When you create a stage, you must explicitly grant privileges on the stage to one
or more roles before users with those roles can use the stage.
● Example:
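The example itself is not preserved in the notes; a minimal sketch with illustrative stage and file format names:
CREATE FILE FORMAT my_csv_format TYPE = 'CSV' FIELD_DELIMITER = ',' SKIP_HEADER = 1;
CREATE STAGE my_int_stage FILE_FORMAT = my_csv_format;
LIST @my_int_stage;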
EXTERNAL STAGE:
● References data files stored in a location outside of Snowflake. Currently, the following
cloud storage services are supported:
○ Amazon S3 buckets
○ Google Cloud Storage buckets
○ Microsoft Azure containers
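A sketch of creating an external stage over an S3 bucket (the bucket and credentials are placeholders; in practice a storage integration is preferred over embedding keys):
CREATE STAGE my_ext_stage
URL = 's3://mybucket/load/'
CREDENTIALS = (AWS_KEY_ID = '<key_id>' AWS_SECRET_KEY = '<secret_key>')
FILE_FORMAT = my_csv_format;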
BULK LOADING from a LOCAL FILE SYSTEM: First you have to configure, and then load the data.
● Configuring
○ Preparing a data load
○ Choosing a Stage for Local Files (current topic)
● Loading
○ Staging data from local file system
○ Copying data from internal storage (to the table)
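Putting the two steps together, assuming the named stage and table from the examples above:
-- Step 1: stage the local file (run from SnowSQL)
PUT file:///tmp/employees.csv @my_int_stage;
-- Step 2: copy the staged files into the target table
COPY INTO employees FROM @my_int_stage;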
If you don't specify a file format, the default is CSV, with field delimiter ',' and SKIP_HEADER = 0.
WHAT IS A STAGE?
Cloud file repository that simplifies and streamlines bulk loading and unloading.
● Stage can be internal or external
○ Internal : stored internally in Snowflake
○ External : stored in an external location
● Effectiveness tip: Create stage object to manage ingestion workload
● Data can be queried directly from a stage without unloading it first (without loading it
into the target table)
● You can even run a SELECT statement against a stage to view the data (see the example below).
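For instance, to peek at a staged CSV file without loading it (positional column references; file and format names are examples):
SELECT $1, $2, $3
FROM @my_int_stage/employees.csv
(FILE_FORMAT => 'my_csv_format');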
SUPPORTED FILE FORMATS:
● Structured
○ Delimited
■ CSV (delimiter can be comma, tab, pipe or other)
● Semi-Structured
○ JSON – JavaScript Object Notation. Lightweight data interchange format.
○ AVRO – row-based storage format for Hadoop
○ ORC – Optimized Row Columnar. Used to store Hive data
○ Parquet – Columnar file format that stores binary data. Used in the Hadoop
ecosystem
○ XML – Extensible Markup Language. Simple text-based format for representing
structured information.
COPY INTO:
● Requires an active warehouse to execute
● Can specify copy options to control how errors are handled (e.g., ON_ERROR)
● Regular expressions can be used to filter the files to be loaded
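An illustrative COPY with an error-handling option and a file name pattern (names are examples):
COPY INTO employees
FROM @my_int_stage
PATTERN = '.*employees.*[.]csv'
ON_ERROR = 'CONTINUE';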
If you want to load the same files once again, set FORCE = TRUE in the COPY command.
LINK – https://docs.snowflake.com/en/sql-reference/sql/copy-into-table#examples
https://medium.com/plumbersofdatascience/how-to-ingest-data-from-s3-to-snowflake-with-snowpipe-7729f94d1797
Number of files that can be loaded in parallel, by warehouse size:
XS - 8
S - 16
M - 32
L - 64
XL - 128
FILE ORGANIZATION:
● Organize data in logical paths (e.g., subject area and create date)
/system/market/daily/2018/09/05
COPY INTO table
FROM @mystage/system/market/daily/2018
DATA LOADING TRANSFORMATIONS AND MONITORING
● The COPY command supports column reordering, column omission, and casts using a SELECT statement
● Snowflake is optimized for bulk load and batched DML using the COPY INTO
command
● Use COPY INTO to load the data rather than INSERT with SELECT
● Use INSERT only if needed for transformations not supported by COPY INTO
● Batch INSERT statements
○ INSERT w/ SELECT
○ CREATE TABLE AS SELECT (CTAS)
○ Minimize frequent single row DMLs
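A sketch of a COPY with a transformation, reordering and casting columns via positional references into the staged files (names are examples):
COPY INTO employees (emp_id, first_name, last_name)
FROM (
SELECT $1::NUMBER, $3, $2    -- reorder columns and cast while loading
FROM @my_int_stage
);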
CONTINUOUS DATA LOADING:
As data arrives and sits in the S3 bucket, it should automatically be read from S3 and loaded into the database table; for that we use the continuous data loading process.
In case of Continuous data loading, we have to use Snowpipe.
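A minimal Snowpipe sketch with auto-ingest from an external stage (names are examples; the cloud-side event notification setup is not shown):
CREATE PIPE my_pipe AUTO_INGEST = TRUE AS
COPY INTO employees
FROM @my_ext_stage
FILE_FORMAT = (FORMAT_NAME = 'my_csv_format');
SHOW PIPES;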
SNOWPIPE REST API TIPS:
LINK — (COPY INTO) https://docs.snowflake.com/en/sql-reference/sql/copy-into-table#examples
SNOWPIPE BILLING: You have to pay for using Snowpipe. Snowflake provides the (serverless) compute for Snowpipe, so no user-managed virtual warehouse is needed.
DATA UNLOADING:
Unloading means taking data out of a Snowflake table, writing it to a stage, and from there moving it into files.
● Unload Destinations
● Unload with SELECT
● File Paths and Names
UNLOAD SYNTAX:
FILE_FORMAT = (FORMAT_NAME = 'my_format');
❖ When loading data you cannot use joins, but when unloading data you can, because the unload source can be any SELECT statement.
● To set the file path for the files, add the path after the stage name
● To set the file name for the files, add the name after the folder name
Example: testfile_0_0_0.csv
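A sketch of an unload that sets a path and a file name prefix (names are examples):
COPY INTO @my_int_stage/unload/testfile
FROM (SELECT * FROM employees)
FILE_FORMAT = (FORMAT_NAME = 'my_format');
-- produces files such as testfile_0_0_0.csv under the unload/ path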
UNLOAD TIPS:
● Can unload into any flat, delimited plain text format (CSV delimited with comma, tab,
etc.)
● A SELECT statement can be used to unload a table to multi-column, semi-structured
format (Parquet or JSON only)
● Can also unload the file in compressed format
TASKS
A task is essentially a scheduler that helps schedule a single SQL statement or a stored procedure call.
The task engine supports both CRON and non-CRON scheduling mechanisms.
Snowflake ensures that only one instance of a scheduled task is executed at any point in time.
If a parent task fails, its child tasks do not execute.
○ Copy data into or out of Snowflake
Tasks Workflow:
Schedules are expressed in minutes (or as a CRON expression).
USER_TASK_MANAGED_INITIAL_WAREHOUSE_SIZE is used for serverless (Snowflake-managed) warehouses.
Specifying a Schedule:
1. ‘<num> minute’
○ Task will run every <num> minutes
○ Example: '60 minute'.
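The CRON variant is specified as shown below (the task name, warehouse and cron string are examples):
CREATE TASK nightly_load
WAREHOUSE = COMPUTE_WH
SCHEDULE = 'USING CRON 0 2 * * * America/Los_Angeles'   -- every day at 2 AM
AS
INSERT INTO mytable VALUES (current_timestamp);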
For a standalone (root) task, a schedule must be defined or the task will never run. A schedule must not be used on a child task; use AFTER instead.
Building a Simple Snowflake Task:
AS
INSERT INTO mytable values (current_timestamp);
4. SHOW TASKS;
First Task:
CREATE OR REPLACE TASK TASK_DEBUG
WAREHOUSE = COMPUTE_WH
SCHEDULE = '1 MINUTE'
TIMESTAMP_INPUT_FORMAT = 'YYYY-MM-DD HH24'
AS
insert into taskdebug
with arr as (select array_construct('A','B','C','D','E','F') arr)
select arr[ABS(MOD(RANDOM(), array_size(arr)))], CURRENT_TIMESTAMP() from arr;
-- CREATE A 2ND TASK WITH AFTER CLAUSE
Note: TASK_DEBUG_2 must be enabled (resumed) first, and then the root task.
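The second task itself is not captured in the notes; a minimal sketch of a child task chained with AFTER, plus the resume order, reusing the names from the first task:
CREATE OR REPLACE TASK TASK_DEBUG_2
WAREHOUSE = COMPUTE_WH
AFTER TASK_DEBUG
AS
INSERT INTO taskdebug SELECT 'CHILD', CURRENT_TIMESTAMP();
ALTER TASK TASK_DEBUG_2 RESUME;   -- enable the child first
ALTER TASK TASK_DEBUG RESUME;     -- then enable the root task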
Practice:
create or replace table child_table(x varchar(50));
show tasks;
alter task mytask_minute suspend;
alter task childtask resume;
alter task mytask_minute resume;
USER-DEFINED FUNCTIONS (UDFs):
● Perform custom operations that are not available through the built-in functions
● Can be written in:
○ JavaScript
○ SQL
○ Scala or Java (using Snowpark)
● No DDL/DML support
● Can be unsecure or secure
● Return a singular scalar value or, if defined as a table function, a set of rows
SELECT C_name, C_address, order_cnt(C_custkey)
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."CUSTOMER";
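The definition of order_cnt is not included in the notes; a plausible SQL UDF it could correspond to (a sketch, assuming the TPC-H ORDERS sample table):
CREATE OR REPLACE FUNCTION order_cnt(custkey NUMBER)
RETURNS NUMBER
AS
$$
SELECT COUNT(*)
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."ORDERS"
WHERE O_custkey = custkey
$$;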
JAVASCRIPT UDF:
SELECT convert2fahrenheit(290);
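convert2fahrenheit is likewise not defined in the notes; a minimal JavaScript UDF sketch (assuming the input is a temperature in Kelvin, given the example value 290):
CREATE OR REPLACE FUNCTION convert2fahrenheit(temp_k FLOAT)
RETURNS FLOAT
LANGUAGE JAVASCRIPT
AS
$$
// inside a JavaScript UDF the argument name is referenced in uppercase
return (TEMP_K - 273.15) * 9 / 5 + 32;
$$;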
STORED PROCEDURES:
● Allow procedural logic and error handling that straight SQL does not support
● Implemented through JavaScript and, optionally (commonly), SQL
● JavaScript provides the control structures
● SQL is executed within the JavaScript by calling functions in an API
● Argument names are:
○ Case-insensitive in the SQL portion of stored procedure code
○ Case-sensitive in the JavaScript portion
Example:
User vamsi creates a procedure p1 that refers to (and deletes from) the EMP table; vamsi has permissions on the emp table.
The procedure p1 (created by user vamsi) is then granted to user1.
-- user1 calls p1, even though user1 does not have an emp table of his own:
CALL p1;
-- Because p1 runs with owner rights, the call deletes records from vamsi's emp table.
Caller rights - if the person executing the procedure has the table in their own schema, then their table is affected.
Owner rights - even if the person calling the stored procedure does not have the table in their schema, the procedure still executes and affects the owner's table.
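A sketch of how the rights model is declared when creating the procedure (the body is a minimal illustration, not the actual p1 from the example):
CREATE OR REPLACE PROCEDURE p1()
RETURNS STRING
LANGUAGE JAVASCRIPT
EXECUTE AS OWNER              -- or EXECUTE AS CALLER
AS
$$
// delete from the emp table visible to the effective role
snowflake.execute({sqlText: "DELETE FROM emp"});
return 'done';
$$;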
DATA SHARING
- Enables sharing of data through named Snowflake objects called SHARES.
- You can share tables, secure views (normal views cannot be shared) and secure UDFs.
- Objects in our database (the data provider) are shared with other Snowflake accounts (the data consumers).
- Snowflake always shares live data.
- Shares are read-only for the consumer.
BENEFITS:
INTRODUCTION:
● Data Sharing enables sharing selected objects in a database in your account with other Snowflake
accounts. The following Snowflake database objects can be shared:
● Tables
● External tables
● Secure views
● Secure materialized views
● Secure UDFs
● The data provider can give access to live data within minutes, without copying or moving the data, to any number of data consumers.
● The data consumer can query the shared data from the data provider without any performance bottlenecks, thanks to Snowflake's multi-cluster shared data architecture.
What is a Share?
● Named Snowflake objects that encapsulate all of the information required to share a database.
● The privileges that grant access to the database(s) and the schema containing the objects to share.
● The privileges that grant access to the specific objects in the database.
● The consumer accounts with which the database and its objects are shared.
● Shares are secure, configurable, and controlled 100% by the provider account:
● New objects added to a share become immediately available to all consumers, providing real-time
access to shared data.
● Access to a share (or any of the objects in a share) can be revoked at any time.
1. Listing - In this you offer a share and additional metadata as a data product to one or more
accounts.
2. Direct Sharing - In which you directly share specific database objects to another account in your
region.
3. Data Exchange - In this you set up and manage a group of accounts and offer a share to that
group.
Data Provider:
- Any Snowflake account that creates shares and makes them available to other Snowflake accounts to consume.
- For each database you share, Snowflake supports using grants to provide granular access control to selected objects in the database.
- There is no limit on how many shares you can create.
Data Consumer:
- Any account that chooses to create a database from a share made available by a provider.
- Once the database is created from the share, we can access and query the shared objects.
- We can consume any number of shares from data providers, but can create only one database per share.
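A minimal provider/consumer sketch (account, database, and share names are examples):
-- Provider side
CREATE SHARE my_share;
GRANT USAGE ON DATABASE demo_db TO SHARE my_share;
GRANT USAGE ON SCHEMA demo_db.public TO SHARE my_share;
GRANT SELECT ON TABLE demo_db.public.employees TO SHARE my_share;
ALTER SHARE my_share ADD ACCOUNTS = consumer_account;
-- Consumer side
CREATE DATABASE shared_db FROM SHARE provider_account.my_share;
SELECT * FROM shared_db.public.employees;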
Reader Account:
Legacy Data Share Challenges:
How does Sharing Work?
● With Secure Data Sharing, no actual data is copied or transferred between accounts.
● All sharing is accomplished through Snowflake’s unique services layer and metadata store.
● Shared data does not take up any storage in a consumer account and does not contribute to the
consumer’s monthly data storage charges.
● The only charges to consumers are for the compute resources (i.e. virtual warehouses) used to
query the shared data.
● Any full Snowflake account can both provide and consume shared data.
● Snowflake also supports reader accounts, a special type of account that consumes shared data from a single provider account.
ACCESS CONTROL:
Introduction:
Two Models:
1. Discretionary Access Control (DAC): Each object has an owner, who can in turn grant access to
that object.
2. Role-based Access Control (RBAC): Access privileges are assigned to roles, which are in turn
assigned to users.
Key Concepts:
● Securable object: An entity to which access can be granted. Unless allowed by a grant, access will
be denied.
● Role: An entity to which privileges can be granted. Roles are in turn assigned to users. Note that
roles can also be assigned to other roles, creating a role hierarchy.
● Privilege: A defined level of access to an object. Multiple distinct privileges may be used to
control the granularity of access granted.
● User: A user identity recognized by Snowflake, whether associated with a person or a program.
Securable Objects:
● Every securable object resides within a logical container in a hierarchy of containers. The top-
most container is the customer account.
● All other securable objects (such as TABLE, FUNCTION, FILE FORMAT, STAGE,
SEQUENCE, etc.) are contained within a SCHEMA object within a DATABASE.
● Every securable object is owned by a single role, which is typically the role used to create the
object
● The owning role has all privileges on the object by default, including the ability to grant or revoke
privileges on the object to other roles.
● Access to objects is defined by privileges granted to roles. The following are examples of
privileges on various objects in Snowflake:
○ Ability to create a warehouse.
○ Ability to list tables contained in a schema.
○ Ability to add data to a table.
Roles:
● Roles are the entities to which privileges on securable objects can be granted and revoked.
● Roles are assigned to users to allow them to perform actions required for business functions in
their organization.
● A user can be assigned multiple roles
● Roles are of two types:
○ System defined
○ User defined
● ACCOUNTADMIN – The account administrator is an extremely powerful role; it has all the privileges of SECURITYADMIN and SYSADMIN. The role should only be used for the initial setup of Snowflake. This role can also access billing information and visualize the resources used by each warehouse.
● SECURITYADMIN – The SECURITYADMIN (Security Administrator) is responsible for
users, roles and privileges. All roles, user, and privileges should be owned and created by the
security administrator.
● SYSADMIN – The SYSADMIN (Systems Admin) oversees creating objects inside Snowflake.
The SYSADMIN is responsible for all databases, schemas, tables, and views.
● PUBLIC – This is automatically granted to every user and role and is publicly available.
● USERADMIN – Role that is dedicated to user and role management only. More specifically, this
role:
○ Is granted the CREATE USER and CREATE ROLE security privileges.
○ Can create users and roles in the account.
○ This role can also manage users and roles that it owns.
● USERADMIN is a subset of SECURITYADMIN.
Custom Roles:
● Custom roles (i.e. any roles other than the system-defined roles) can be created by the SECURITYADMIN role, as well as by any role to which the CREATE ROLE privilege has been granted.
● By default, the newly-created role is not assigned to any user, nor granted to any other role.
● It is recommended to create a hierarchy of custom roles, with the top-most custom role assigned to the system role SYSADMIN.
○ This role structure allows system administrators to manage all objects in the account,
such as warehouses and database objects, while restricting management of users and roles
to the SECURITYADMIN OR ACCOUNTADMIN roles.
○ If a custom role is not assigned to SYSADMIN through a role hierarchy, the system
administrators will not be able to manage the objects owned by the role.
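A sketch of creating a custom role and wiring it into the recommended hierarchy (role, user, and object names are examples):
USE ROLE SECURITYADMIN;
CREATE ROLE analyst;
GRANT ROLE analyst TO USER user1;
GRANT ROLE analyst TO ROLE SYSADMIN;   -- attach the custom role under SYSADMIN
GRANT USAGE ON WAREHOUSE compute_wh TO ROLE analyst;
GRANT USAGE ON DATABASE demo_db TO ROLE analyst;
GRANT USAGE ON SCHEMA demo_db.public TO ROLE analyst;
GRANT SELECT ON ALL TABLES IN SCHEMA demo_db.public TO ROLE analyst;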
Privileges:
● For each securable object, there is a set of privileges that can be granted on it.
● Privileges must be granted on individual objects, e.g. the SELECT privilege on the mytable table.
● To simplify grant management, future grants allow defining an initial set of privileges on objects
created in a schema; i.e. the SELECT privilege on all new tables created in the myschema
schema.
● Privileges are managed using the GRANT and REVOKE commands.
● In regular (i.e. non-managed) schemas, use of these commands is restricted to the role that owns
an object (i.e. has the OWNERSHIP privilege on the object) or roles that have the MANAGE
GRANTS global privilege for the object (typically the SECURITYADMIN role).
● In managed access schemas, object owners lose the ability to make grant decisions. Only the
schema owner or a role with the MANAGE GRANTS privilege can grant privileges on objects in
the schema, including future grants, centralizing privilege management.
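For example, a future grant of the kind described above (schema and role names are examples):
GRANT SELECT ON FUTURE TABLES IN SCHEMA demo_db.myschema TO ROLE analyst;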
Object Ownership and Control:
RESOURCE MONITOR
Introduction:
Documentation: https://docs.snowflake.com/en/user-guide/resource-monitors
Tips:
● Resource monitors are not intended for strictly controlling consumption on an hourly basis; they
are intended for tracking and controlling credit consumption per interval (day, week, month, etc.)
● They are not intended for setting precise limits on credit usage (i.e. down to the level of
individual credits).
For example, when credit quota thresholds are reached for a resource monitor, the assigned warehouses
may take some time to suspend, even when the action is Suspend Immediate, thereby consuming
additional credits.
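A sketch of a monthly resource monitor attached to a warehouse (quota and names are examples; resource monitors are created by ACCOUNTADMIN):
CREATE RESOURCE MONITOR monthly_limit
WITH CREDIT_QUOTA = 100
FREQUENCY = MONTHLY
START_TIMESTAMP = IMMEDIATELY
TRIGGERS ON 75 PERCENT DO NOTIFY
ON 100 PERCENT DO SUSPEND
ON 110 PERCENT DO SUSPEND_IMMEDIATE;
ALTER WAREHOUSE compute_wh SET RESOURCE_MONITOR = monthly_limit;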
● When a resource monitor reaches the threshold for an action, it generates one of the following notifications, based on the action performed:
● The assigned warehouses will be suspended after all running queries complete.
● All running queries in the assigned warehouses will be cancelled and the warehouses
suspended immediately.
● A threshold has been reached, but no action has been performed.
● The notification is sent to all account administrators who have enabled receipt of
notifications.
● Notification can be received by account administrators through the web interface and/or
email; however, by default, notifications are not enabled:
● To receive notifications, each account administrator must explicitly enable notifications
through their preferences in the web interface.
● In addition, if an account administrator chooses to receive email notifications, they must
provide a valid email address (and verify the address) before they will receive any emails.
Clustering Key:
● A clustering key is a subset of columns in a table (or expressions on a table) that are
explicitly designed to co-locate the data in the table in the same micro-partitions.
● Some general indicators that can help determine whether to define a clustering key for a
table include:
○ Queries on the table are running slower than expected or have noticeably
degraded over time.
○ The clustering depth for the table is large.
○ A clustering key can be defined at table creation or afterward. The clustering key for a table can also be altered or dropped at any time.
● Clustering is very useful for very large tables where the ordering of the data is not optimal, or where extensive DML on the table has caused the table's natural clustering to degrade.
● Improved scan efficiency in queries by skipping data that does not match filtering
predicates.
● Better column compression than in tables with no clustering.
● After a key has been defined on a table, no administration is required, unless you choose
to drop or modify the key.
● All future maintenance on the rows in the table (to ensure optimal clustering) is performed automatically by Snowflake.
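Clustering health can be inspected with the system functions below (the table and column names follow the example in the next section):
SELECT SYSTEM$CLUSTERING_INFORMATION('t1', '(c1, c2)');   -- depth, overlap and partition statistics
SELECT SYSTEM$CLUSTERING_DEPTH('t1', '(c1)');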
Note: Although clustering can improve the performance and reduce the cost of some queries, the
compute resources used to perform clustering consume credits. As such, you should cluster only
when queries will benefit substantially from the clustering.
● Queries benefit from clustering when the queries filter or sort on the clustering key for
the table.
Clustering Examples:
● CREATE OR REPLACE TABLE t1 (c1 date, c2 string, c3 number) cluster by (c1, c2);
● SHOW TABLES LIKE 't1';
Clustering by Expression
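For example (an illustrative table):
CREATE OR REPLACE TABLE t2 (c1 timestamp, c2 string, c3 number)
CLUSTER BY (TO_DATE(c1), SUBSTRING(c2, 1, 10));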
Re-clustering:
● Reclustering in Snowflake is automatic; no maintenance is needed.
Usage Notes:
● If you define two or more columns/expressions as the clustering key for a table, the order
has an impact on how the data is clustered in micro-partitions.
● An existing clustering key is copied when a table is created using CREATE TABLE ... CLONE.
● An existing clustering key is not propagated when a table is created using CREATE TABLE ... LIKE.
● An existing clustering key is not supported when a table is created using CREATE TABLE ... AS SELECT; however, you can define a clustering key after the table is created.
Attention:
To get pruning benefits from clustering, the column(s) defined in the clustering key have to provide sufficient filtering to select a subset of these micro-partitions.
● In general, tables in multi-terabyte (TB) range will experience the most benefit from
clustering, particularly if DML is performed regularly/continually on these tables.
1. Execute the query below; in the query profile, check how many bytes and how many partitions were scanned.
SELECT *
FROM store_returns, date_dim;
2. Execute the query below with only a few columns and again check the time taken and bytes scanned; both should be reduced by cutting down the columns in the select list.
SELECT
d_year
,sr_customer_sk as ctr_customer_sk
,sr_store_sk as ctr_store_sk
,SR_RETURN_AMT_INC_TAX
FROM store_returns, date_dim;
2. Clustering
CREATE OR REPLACE TABLE TRANSACTIONS (
TXN_ID STRING,
TXN_DATE DATE,
CUSTOMER_ID STRING,
QUANTITY DECIMAL(20),
PRICE DECIMAL(30,2),
COUNTRY_CD STRING
);
INSERT INTO TRANSACTIONS
SELECT
UUID_STRING() AS TXN_ID
'2020-10-15') AS TXN_DATE
,UUID_STRING() AS CUSTOMER_ID
,RANDSTR(2,RANDOM()) AS COUNTRY_CD
WHERE TXN_DATE BETWEEN DATEADD(DAY, -31, '2020-10-15')
AND '2020-10-15';
Execute the query below and check the query profile; with the table clustered, the number of partitions scanned should be reduced (a fuller sketch of the exercise follows after the query).
SELECT * FROM TRANSACTIONS
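The INSERT statement and this final query appear truncated in the notes. A fuller sketch of the exercise under stated assumptions: the table is clustered by TXN_DATE, the data is produced by a row generator, and the pruning demo filters on the clustering key (all three are assumptions, not part of the original notes):
ALTER TABLE TRANSACTIONS CLUSTER BY (TXN_DATE);   -- assumed clustering key
INSERT INTO TRANSACTIONS
SELECT
UUID_STRING() AS TXN_ID
,DATEADD(DAY, -UNIFORM(0, 31, RANDOM()), '2020-10-15') AS TXN_DATE
,UUID_STRING() AS CUSTOMER_ID
,UNIFORM(1, 10, RANDOM()) AS QUANTITY
,UNIFORM(1, 1000, RANDOM()) AS PRICE
,RANDSTR(2, RANDOM()) AS COUNTRY_CD
FROM TABLE(GENERATOR(ROWCOUNT => 1000000));
-- with clustering on TXN_DATE, a filter on the clustering key prunes partitions in the query profile
SELECT * FROM TRANSACTIONS
WHERE TXN_DATE = '2020-10-15';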