GCP Snowflake.pptx
import snowflake.connector

class SnowflakeConnector:
    def __init__(self, account, user, password, warehouse, database, schema, role):
        self.account = account
        self.user = user
        self.password = password
        self.warehouse = warehouse
        self.database = database
        self.schema = schema
        self.role = role

    def connect(self):
        self.connection = snowflake.connector.connect(
            user=self.user,
            password=self.password,
            account=self.account,
            warehouse=self.warehouse,
            database=self.database,
            schema=self.schema,
            role=self.role,  # pass the stored role so the session actually uses it
        )
        self.cursor = self.connection.cursor()

    def close_connection(self):
        self.cursor.close()
        self.connection.close()

# Instantiation (the argument values below are placeholders, not from the original slide):
sf_connector = SnowflakeConnector(
    account="<account_identifier>",
    user="<user>",
    password="<password>",
    warehouse="<warehouse>",
    database="<database>",
    schema="<schema>",
    role="<role>",
)
GETTING STARTED WITH SNOWFLAKE
VIRTUAL WAREHOUSE
◼ A virtual warehouse is a cluster of compute resources in Snowflake. Virtual warehouses are available in
two types:
• Standard
• Snowpark-optimized
◼ A warehouse provides the required resources, such as CPU, memory, and temporary storage, to perform
the following operations in a Snowflake session:
◼ Executing SQL SELECT statements that require compute resources (e.g. retrieving rows from tables and views).
◼ Performing DML operations, such as:
◼ Updating rows in tables (DELETE, INSERT, UPDATE).
◼ Loading data into tables (COPY INTO <table>).
◼ Unloading data from tables (COPY INTO <location>).
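◼ A warehouse must be created and made current before running these operations. A minimal sketch (the warehouse name and settings below are illustrative, not from the slides):
CREATE WAREHOUSE IF NOT EXISTS my_wh
  WAREHOUSE_SIZE = 'XSMALL'     -- smallest standard size
  AUTO_SUSPEND = 60             -- suspend after 60 seconds of inactivity
  AUTO_RESUME = TRUE            -- resume automatically when a query arrives
  INITIALLY_SUSPENDED = TRUE;
USE WAREHOUSE my_wh;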
SNOWFLAKE MICRO-PARTITIONS
DATABASES, TABLES AND VIEWS - OVERVIEW
Benefits of Micro-partitions
◼ In contrast to traditional static partitioning, Snowflake micro-partitions are derived automatically; they
don’t need to be explicitly defined up-front or maintained by users.
◼ As the name suggests, micro-partitions are small in size (50 to 500 MB, before compression), which
enables extremely efficient DML and fine-grained pruning for faster queries.
◼ Micro-partitions can overlap in their range of values, which, combined with their uniformly small size, helps
prevent skew.
◼ Columns are stored independently within micro-partitions, often referred to as columnar storage. This
enables efficient scanning of individual columns; only the columns referenced by a query are scanned.
◼ Columns are also compressed individually within micro-partitions. Snowflake automatically determines
the most efficient compression algorithm for the columns in each micro-partition.
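◼ Pruning can be observed in practice: a filter on a well-clustered column scans only the micro-partitions whose value ranges match, and SYSTEM$CLUSTERING_INFORMATION reports how well a table clusters on given columns. A minimal sketch (table and column names are illustrative):
SELECT * FROM superstore WHERE orderdate = '2023-05-18';  -- only micro-partitions whose ORDERDATE range matches are scanned
SELECT SYSTEM$CLUSTERING_INFORMATION('superstore', '(orderdate)');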
PARTITION OVERLAP
CLUSTERING KEY- OVERVIEW
TYPES OF TABLES
◼ Temporary Tables: Snowflake supports creating temporary tables for storing non-permanent, transitory
data (e.g. ETL data, session-specific data). Temporary tables only exist within the session in which they
were created and persist only for the remainder of the session. As such, they are not visible to other users
or sessions. Once the session ends, data stored in the table is purged completely from the system and,
therefore, is not recoverable, either by the user who created the table or Snowflake.
◼ After creation, temporary tables cannot be converted to any other table type.
CREATE TEMPORARY TABLE mytemptable (id NUMBER, creation_date DATE);
▪ Transient Table: Snowflake supports creating transient tables that persist until explicitly dropped and are
available to all users with the appropriate privileges. Transient tables are similar to permanent tables with
the key difference that they do not have a Fail-safe period. As a result, transient tables are specifically
designed for transitory data that needs to be maintained beyond each session (in contrast to temporary
tables), but does not need the same level of data protection and recovery provided by permanent tables.
CREATE TRANSIENT TABLE mytranstable (id NUMBER, creation_date DATE);
SNOWFLAKE TABLE COMPARISON
Type | Persistence | Cloning (source type => target type) | Time Travel Retention Period (Days) | Fail-safe Period (Days)
Temporary | Remainder of session | Temporary => Temporary, Temporary => Transient | 0 or 1 (default is 1) | 0
Transient | Until explicitly dropped | Transient => Temporary, Transient => Transient | 0 or 1 (default is 1) | 0
Permanent (Standard Edition) | Until explicitly dropped | Permanent => Temporary, Permanent => Transient, Permanent => Permanent | 0 or 1 (default is 1) | 7
Permanent (Enterprise Edition and higher) | Until explicitly dropped | Permanent => Temporary, Permanent => Transient, Permanent => Permanent | 0 to 90 (default is configurable) | 7
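◼ Per the cloning column above, a table can be cloned across types; for example, cloning a permanent table to a transient one (names are illustrative):
CREATE TRANSIENT TABLE mytable_transient_clone CLONE mytable;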
SNOWFLAKE VIEW
◼ Types of Views
• Non-materialized views (usually simply referred to as “views”)
• Materialized views.
◼ Non-materialized Views
◼ Any query expression that returns a valid result can be used to create a non-materialized view, such as:
◼ Selecting some (or all) columns in a table.
◼ Selecting a specific range of data in table columns.
◼ Joining data from two or more tables.
◼ Materialized Views
◼ A materialized view’s results are stored, almost as though the results were a table. This allows faster access, but requires
storage space and active maintenance, both of which incur additional costs.
◼ Secure Views
◼ Secure views have advantages over standard views, including improved data privacy and data sharing; however, they also
have some performance impacts to take into consideration.
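◼ A view is made secure simply by adding the SECURE keyword. A minimal sketch (view, table, and column names are illustrative):
CREATE OR REPLACE SECURE VIEW v_orders_limited AS
  SELECT orderid, orderdate FROM orders;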
VIEWS – MATERIALIZED VIEWS
◼ Materialized views require Enterprise Edition.
◼ A materialized view is a pre-computed data set derived from a query specification (the SELECT in the view
definition) and stored for later use. Because the data is pre-computed, querying a materialized view is
faster than executing a query against the base table of the view.
◼ This performance difference can be significant when a query is run frequently or is sufficiently complex.
As a result, materialized views can speed up expensive aggregation, projection, and selection operations,
especially those that run frequently and that run on large data sets.
◼ Materialized views are particularly useful when:
◼ Query results contain a small number of rows and/or columns relative to the base table (the table on which the view is
defined).
◼ Query results contain results that require significant processing, including:
◼ Analysis of semi-structured data.
◼ Aggregates that take a long time to calculate.
◼ The query is on an external table (i.e. data sets stored in files in an external stage), which might have slower performance
compared to querying native database tables.
◼ The view’s base table does not change frequently.
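◼ A minimal sketch of a materialized view that pre-computes an aggregation over a base table (names and columns are illustrative):
CREATE MATERIALIZED VIEW mv_daily_sales AS
  SELECT orderdate, SUM(sales) AS total_sales
  FROM superstore
  GROUP BY orderdate;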
PERFORMANCE COMPARISON OF VIEWS
 | Performance Benefits | Security Benefits | Simplifies Query Logic | Supports Clustering | Uses Storage | Uses Credits for Maintenance | Notes
Regular table | | | | ✔ | ✔ | |
Regular view | | ✔ | ✔ | | | |
Cached query result | ✔ | | | | | | Used only if data has not changed and if query only uses deterministic functions (e.g. not CURRENT_DATE).
Materialized view | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | Storage and maintenance requirements typically result in increased costs.
External table | | | | | | | Data is maintained outside Snowflake and, therefore, does not incur any storage charges within Snowflake.
Data Type Category | Data Type | Notes
Numeric Data Types | FLOAT, FLOAT4, FLOAT8 [1] |
 | DOUBLE, DOUBLE PRECISION, REAL [1] | Synonymous with FLOAT.
String & Binary Data Types | VARCHAR |
 | CHAR, CHARACTER | Synonymous with VARCHAR except default length is VARCHAR(1).
 | STRING | Synonymous with VARCHAR.
 | TEXT | Synonymous with VARCHAR.
 | BINARY |
 | VARBINARY | Synonymous with BINARY.
Logical Data Types | BOOLEAN | Currently only supported for accounts provisioned after January 25, 2016.
Date & Time Data Types | TIMESTAMP_LTZ | TIMESTAMP with local time zone; time zone, if provided, is not stored.
 | TIMESTAMP_NTZ | TIMESTAMP with no time zone; time zone, if provided, is not stored.
 | TIMESTAMP_TZ | TIMESTAMP with time zone.
Semi-structured Data Types | VARIANT |
 | OBJECT |
 | ARRAY |
Geospatial Data Types | GEOGRAPHY |
 | GEOMETRY |
Vector Data Types | VECTOR |
SNOWFLAKE DATATYPES
Data Type Category | Data Type | Description | Example
Numeric Data Types | NUMBER (NUMERIC) | Variable precision numeric values | NUMBER(10, 2) for values like 12345.67
 | INTEGER | Whole numbers | INTEGER for values like 42
 | FLOAT (DOUBLE) | Floating-point numbers | FLOAT for values like 3.14159
String & Binary Data Types | STRING (VARCHAR) | Variable-length character strings | VARCHAR(255) for values like 'Hello, World!'
 | CHAR (CHARACTER) | Fixed-length character strings | CHAR(10) for values like 'ABCDEFGHIJ'
 | BINARY | Variable-length binary strings | BINARY(100) for binary data up to 100 bytes
Logical Data Types | BOOLEAN | True/False values | BOOLEAN for values like TRUE or FALSE
Date & Time Data Types | DATE | Calendar dates | DATE for values like '2023-05-18'
 | TIME | Time of day | TIME for values like '13:45:30'
 | TIMESTAMP | Date and time | TIMESTAMP for values like '2023-05-18 13:45:30'
 | TIMESTAMP_LTZ | Timestamp with local time zone | TIMESTAMP_LTZ for values like '2023-05-18 13:45:30'
 | TIMESTAMP_NTZ | Timestamp without time zone | TIMESTAMP_NTZ for values like '2023-05-18 13:45:30'
 | TIMESTAMP_TZ | Timestamp with specified time zone | TIMESTAMP_TZ for values like '2023-05-18 13:45:30+01:00'
Semi-structured Data Types | VARIANT | Flexible data type for semi-structured data | VARIANT for JSON data like {"name": "John", "age": 30}
 | OBJECT | Collection of key-value pairs | OBJECT for values like {"key1": "value1", "key2": "value2"}
 | ARRAY | Ordered list of elements | ARRAY for values like [1, 2, 3, 4]
Structured Data Types | STRUCT (emulated using nested objects) | Group of related fields | STRUCT for values like STRUCT<name STRING, age INTEGER>
Geospatial Data Types | GEOGRAPHY | Represents geospatial data | GEOGRAPHY for values like POINT(-122.35 37.55)
Vector Data Types | VECTOR (emulated using ARRAY) | Array of numeric values | VECTOR for values like [1.0, 2.0, 3.0]
Unsupported Data Types | XML | Not natively supported; can be stored as VARIANT | VARIANT for XML data
 | CLOB | Not natively supported; can be stored as STRING | STRING for large text
 | BLOB | Not natively supported; can be stored as BINARY | BINARY for binary large objects
Data Type Conversion | CAST, CONVERT | Convert data from one type to another | CAST('123' AS INTEGER)
 | TO_CHAR, TO_DATE, TO_NUMBER | Functions to convert between data types | TO_DATE('2023-05-18', 'YYYY-MM-DD')
SNOWFLAKE DATA LOAD
BULK LOAD
DATA LOADING TO SNOWFLAKE
◼ Bulk load:
◼ External stages
◼ Internal Stages
◼ User stage
◼ Table stage
◼ Named Stage
◼ Continuous load:
◼ Snowpipe
◼ Snowpipe Streaming
Feature | Supported | Notes
Location of files | Local environment | Files are first copied (“staged”) to an internal (Snowflake) stage, then loaded into a table.
 | Amazon S3 | Files can be loaded directly from any user-supplied bucket.
 | Google Cloud Storage | Files can be loaded directly from any user-supplied bucket.
 | Microsoft Azure cloud storage (Blob storage, Data Lake Storage Gen2, General-purpose v1, General-purpose v2) | Files can be loaded directly from any user-supplied container.
File formats | Delimited files (CSV, TSV, etc.) | Any valid delimiter is supported; default is comma (i.e. CSV).
 | Semi-structured formats: JSON; Avro; ORC; Parquet; XML (supported as a preview feature) |
 | Unstructured formats |
File encoding | File format-specific | For delimited files (CSV, TSV, etc.), the default character set is UTF-8. For semi-structured file formats (JSON, Avro, etc.), the only supported character set is UTF-8.
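◼ For files in a local environment, a typical bulk-load flow stages the file with PUT and then loads it with COPY INTO. A minimal sketch using a table stage (the file path and table name are illustrative):
PUT file://C:\data\superstore.csv @%superstore;
COPY INTO superstore FROM @%superstore FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);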
BULK LOADING FROM MICROSOFT AZURE
CONFIGURING AN AZURE CONTAINER FOR LOADING DATA
◼ Configure a storage integration:
◼ Step 1 Create a Cloud Storage Integration in Snowflake
◼ Step 2 Grant Snowflake Access to the Storage Locations
◼ Step 3 Create an external stage
list @az_stg_superstore;
list @bw_azure_stage_sas;
◼ After the load completes, use the REMOVE command to remove the files in the stage.
◼ REMOVE @mystage/path1/subpath2;
◼ REMOVE @%orders;
◼ RM @~ pattern='.*jun.*';
DIRECTORY TABLES
◼ A directory table is an implicit object layered on a stage (not a separate database object) and is
conceptually similar to an external table because it stores file-level metadata about the data files in the
stage. A directory table has no grantable privileges of its own.
◼ This example retrieves all metadata columns in a directory table for a stage named mystage:
◼ SELECT * FROM DIRECTORY(@mystage);
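◼ A directory table is enabled on the stage itself and can be refreshed to pick up new files. A minimal sketch (the stage name is illustrative):
CREATE OR REPLACE STAGE mystage DIRECTORY = (ENABLE = TRUE);
ALTER STAGE mystage REFRESH;  -- synchronize the directory table with the current stage contents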
DATA UNLOADING – EXTERNAL STAGE – AZURE
◼ Unload using a storage integration:
◼ Unload using a stage:
◼ Create storage integration
CREATE OR REPLACE STORAGE INTEGRATION azure_int
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'AZURE'
  ENABLED = TRUE
  AZURE_TENANT_ID = 'fda27961-1c4d-48ca-b563-4da0bc461de1'
  STORAGE_ALLOWED_LOCATIONS = ('azure://bw2023snowflakedata.blob.core.windows.net/bw2023snowflakedatacontainer/');
◼ Create stage
CREATE OR REPLACE STAGE az_ext_unload_stage_superstore
  URL = 'azure://bw2023snowflakedata.blob.core.windows.net/bw2023snowflakedatacontainer/snowoutput/'
  STORAGE_INTEGRATION = azure_int
  FILE_FORMAT = csv_superstore;
◼ Copy data
COPY INTO @az_ext_unload_stage_superstore from SUPERSTORE_FRMUSERSTG;
◼ Unload without using stage:
◼ Create storage integration
CREATE OR REPLACE STORAGE INTEGRATION azure_int
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'AZURE'
  ENABLED = TRUE
  AZURE_TENANT_ID = 'fda27961-1c4d-48ca-b563-4da0bc461de1'
  STORAGE_ALLOWED_LOCATIONS = ('azure://bw2023snowflakedata.blob.core.windows.net/bw2023snowflakedatacontainer/');
◼ Copy data
COPY INTO 'azure://bw2023snowflakedata.blob.core.windows.net/bw2023snowflakedatacontainer/snowoutput/d1/'
  FROM SUPERSTORE_FRMUSERSTG
  STORAGE_INTEGRATION = azure_int;
◼ Unload using SAS token
◼ Create SAS token: as shown in the previous slide
◼ Create stage:
CREATE OR REPLACE STAGE bw_azure_stage_sas
  URL = 'azure://bw2023snowflakedata.blob.core.windows.net/bw2023snowflakedatacontainer/'
  CREDENTIALS = (AZURE_SAS_TOKEN = 'sv=2022-11-02&ss=bfqt&srt=co&sp=rwdlacupiytfx&se=2024-06-30T10:26:10Z&st=2024-05-21T02:26:10Z&spr=https&sig=jRhgc2t%2Br9kAkDoJvmqpJgys%2FtELpc3IoQJAK%2BS3wA0%3D')
  ENCRYPTION = (TYPE = 'NONE')
  FILE_FORMAT = csv_superstore;
◼ Copy data
COPY INTO @bw_azure_stage_sas FROM SUPERSTORE_FRMUSERSTG;
DATA UNLOADING – VIA INTERNAL STAGE
◼ Unloading Data to a Named Internal Stage
CREATE OR REPLACE STAGE SUPERSTORE_FRMUSERSTG_UNLD_NMED
FILE_FORMAT = csv_superstore_new;
COPY INTO @SUPERSTORE_FRMUSERSTG_UNLD_NMED/unload/ from SUPERSTORE;
GET @SUPERSTORE_FRMUSERSTG_UNLD_NMED/unload/data_0_0_0.csv.gz file://C:\data\unload;
SNOWPIPE
-- Step 1: create storage integration
CREATE OR REPLACE STORAGE INTEGRATION azure_int_new
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'AZURE'
  ENABLED = TRUE
  AZURE_TENANT_ID = 'fda27961-1c4d-48ca-b563-4da0bc461de1'
  STORAGE_ALLOWED_LOCATIONS = ('azure://bw2023snowflakedata.blob.core.windows.net/bw2023snowflakedatacontainer/');
DESC STORAGE INTEGRATION azure_int_new;
GRANT CREATE STAGE ON SCHEMA BWSCHEMA TO ROLE ACCOUNTADMIN;
GRANT USAGE ON INTEGRATION azure_int_new TO ROLE ACCOUNTADMIN;

-- Step 2: create file format
CREATE OR REPLACE FILE FORMAT csv_superstore
  TYPE = CSV SKIP_HEADER = 1 FIELD_DELIMITER = ','
  TRIM_SPACE = FALSE FIELD_OPTIONALLY_ENCLOSED_BY = '"'
  REPLACE_INVALID_CHARACTERS = TRUE
  DATE_FORMAT = AUTO TIME_FORMAT = AUTO TIMESTAMP_FORMAT = AUTO;

-- Step 3: create table structure
CREATE OR REPLACE TABLE BWDATABASE.BWSCHEMA.SUPERSTORE_fromsnowpipe (
  ROWID NUMBER(38,0), ORDERID VARCHAR(16777216), ORDERDATE DATE, SHIPDATE DATE,
  SHIPMODE VARCHAR(16777216), CUSTOMERID VARCHAR(16777216), CUSTOMERNAME VARCHAR(16777216),
  SEGMENT VARCHAR(16777216), COUNTRYREGION VARCHAR(16777216), CITY VARCHAR(16777216),
  STATE VARCHAR(16777216), POSTALCODE NUMBER(38,0), REGION VARCHAR(16777216),
  PRODUCTID VARCHAR(16777216), CATEGORY VARCHAR(16777216), SUBCATEGORY VARCHAR(16777216),
  PRODUCTNAME VARCHAR(16777216), SALES NUMBER(38,4), QUANTITY NUMBER(38,0),
  DISCOUNT NUMBER(38,2), PROFIT NUMBER(38,4));

-- Step 4: create stage and notification integration
CREATE OR REPLACE STAGE az_stg_superstore_snowpipe
  STORAGE_INTEGRATION = azure_int_new FILE_FORMAT = csv_superstore;
LIST @az_stg_superstore_snowpipe;
DROP NOTIFICATION INTEGRATION AZ_NOTIFICATION_INT;
CREATE OR REPLACE NOTIFICATION INTEGRATION AZ_NOTIFICATION_INT
  ENABLED = TRUE TYPE = QUEUE
  NOTIFICATION_PROVIDER = AZURE_STORAGE_QUEUE
  AZURE_STORAGE_QUEUE_PRIMARY_URI = 'https://bw2023snowflakedata.queue.core.windows.net/storagequeues'
  AZURE_TENANT_ID = 'fda27961-1c4d-48ca-b563-4da0bc461de1';
DESC NOTIFICATION INTEGRATION AZ_NOTIFICATION_INT;
CREATE OR REPLACE STAGE bw_azure_stage_sas
  URL = 'azure://bw2023snowflakedata.blob.core.windows.net/bw2023snowflakedatacontainer/'
  CREDENTIALS = (AZURE_SAS_TOKEN = 'sv=2022-11-02&ss=bfqt&srt=co&sp=rwlaciytfx&se=2024-06-29T12:39:17Z&st=2024-05-26T04:39:17Z&spr=https&sig=7JreCeW87wEVBmZdQrcNAJ%2FR9fV%2FJ64L3sV7SYaI5M8%3D')
  ENCRYPTION = (TYPE = 'NONE') FILE_FORMAT = csv_superstore;

-- Step 5: create snowpipe
CREATE PIPE az_superstore_snowpipe_new
  AUTO_INGEST = TRUE INTEGRATION = 'AZ_NOTIFICATION_INT'
  AS COPY INTO SUPERSTORE_fromsnowpipe FROM @bw_azure_stage_sas;

-- Step 6: verify the load
SELECT COUNT(*) FROM SUPERSTORE_fromsnowpipe;
SNOWPIPE - STREAMING
WORKING WITH SEMI-STRUCTURED DATA
◼ Semi-structured Data Types:
◼ The following Snowflake data types can contain semi-structured data:
◼ VARIANT (can contain any other data type). E.g. {"key1": "value1", "key2": "value2"}
◼ ARRAY (can directly contain VARIANT, and thus indirectly contain any other data type, including itself). E.g. [1,2,3]
◼ OBJECT (can directly contain VARIANT, and thus indirectly contain any other data type, including itself). E.g.
{
"outer_key1": {
"inner_key1A": "1a",
"inner_key1B": "1b“
},
"outer_key2": {
"inner_key2": 2
}
}
◼ Querying Semi-structured Data:
◼ Snowflake supports operators for:
◼ Accessing an element in an array.
◼ Retrieving a specified value from a key-value pair in an OBJECT.
◼ Traversing the levels of a hierarchy stored in a VARIANT.
WORKING WITH SEMI-STRUCTURED DATA
◼ Querying Semi-structured Data:
◼ Snowflake supports operators for:
◼ Accessing an element in an array.
◼ select my_array_column[2] from my_table;
◼ select my_array_column[0][0] from my_table;
◼ select array_slice(my_array_column, 5, 10) from my_table;
◼ Inserting data into an ARRAY column
◼ INSERT INTO array_example (array_column)
◼ SELECT ARRAY_CONSTRUCT(12, 'twelve', NULL);
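◼ Retrieving values from key-value pairs in an OBJECT and traversing a VARIANT hierarchy use colon and dot (or bracket) notation. A minimal sketch against the nested example shown earlier (column and key names are illustrative):
◼ SELECT src:outer_key1.inner_key1A::STRING FROM my_table;
◼ SELECT src['outer_key2']['inner_key2'] FROM my_table;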