Snowflake Practice 1
➢ Snowflake architecture
I. Database Storage
II. Query Processing
III. Cloud Services.
Query Processing: - It is like the CPU of the system; it processes all the queries. It contains
(Virtual warehouses)
a) This is the actual processing unit of Snowflake.
b) Snowflake processes queries using "virtual warehouses".
c) Virtual warehouses are considered the muscle of the system.
d) We can scale up and scale down easily.
e) Auto-resume & auto-suspend are available (a warehouse sketch follows below).
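For example (a minimal sketch; the warehouse name and sizes are illustrative, not from these notes):
create or replace warehouse my_wh
warehouse_size = 'XSMALL'
auto_suspend = 60          -- seconds of inactivity before the warehouse suspends
auto_resume = true         -- resumes automatically when a query arrives
initially_suspended = true;
-- scale up or down by altering the size
alter warehouse my_wh set warehouse_size = 'MEDIUM';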
Cloud Services: - Handles all kinds of data management – it is like the brain of the
system.
➢ Snowflake Pricing
Snowflake cost depends on: -
1. Snowflake edition
-Standard -$2.7/Credit
-Enterprise -$4/Credit
-Business critical -$5.4/Credit
-VPS- Depends on Organization
2. Region where the Snowflake account is created
3. Cloud platform where the Snowflake account is hosted
4. Virtual warehouse size.
Types of cost in snowflake
1. Storage Cost
-On demand storage - (Postpaid)
-Capacity or fixed storage – (Prepaid)
-With on-demand storage you pay as you use / pay as you go
-Capacity storage has to be purchased upfront
2. Compute cost
-Snowflake edition
-Region and cloud provider
-warehouse size
-Number of clusters
Or
- Copying the data files from a local machine to an internal stage (Snowflake)
before loading the data into a table (using the 'PUT' command we can push the
files into an internal stage from our local desktop, Linux and UNIX servers);
from there we can load the data into tables using the COPY command.
- Bulk loading uses a virtual warehouse
- Using the COPY command (a sketch follows below)
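Example (a minimal sketch of the PUT + COPY flow; the file, stage and table names are illustrative):
-- push a local file into a named internal stage (run from SnowSQL on the desktop/Linux/UNIX machine)
put file:///tmp/employees.csv @my_int_stage auto_compress = true;
-- load the staged file into the target table
copy into employees
from @my_int_stage
file_format = (type = csv skip_header = 1);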
Continuous loading using Snowpipe (when we want live or real-time data we can
go for Snowpipe). Example - Cricbuzz app
- Snowpipe (a serverless data ingestion service) automates loading data into
Snowflake from sources like AWS S3, GCP, and Azure Blob storage. Snowpipe
supports continuous, real-time or batch loading. This eliminates manual
data loading and keeps your data up to date.
What is the difference between Snowpipe and Snowpipe Streaming?
- Snowpipe loads data files from cloud storage, like S3 and other storage
options. Snowpipe Streaming loads data directly from sources via the
streaming API and client SDK.
Pattern= ‘.*filepattern.*’
Other_optional_props;
Interview question
1. What is the COPY command?
Ans- To load the data from an external or internal stage into a Snowflake table.
The COPY command supports only the below file sources and file formats (a file format sketch follows the lists).
Supported files
-Local Environment
-AWS S3
- GCP
- Microsoft Azure
Supported formats
-Delimited files (CSV, TSV, etc.)
-JSON, Avro, Parquet, XML and ORC
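Example (a sketch of reusable file format objects; the names are illustrative):
-- named file format for comma-delimited files with a header row
create or replace file format my_csv_format
type = csv
field_delimiter = ','
skip_header = 1
field_optionally_enclosed_by = '"';
-- named file format for JSON files
create or replace file format my_json_format
type = json;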
2. Types of stages
• External stage (@)
• Internal stage
-User internal stage
- Table internal stage
-Named internal stage
3. What is meant by an external stage?
• An external stage is an external cloud storage location where data files
are stored.
-AWS S3
- GCP
-Azure Blob storage.
We give this ARN to Snowflake; Snowflake then generates another ARN, which
we have to add in the trust relationship of the IAM role.
(ARN - Amazon Resource Name, S3 - Simple Storage Service, IAM - Identity and
Access Management)
-Go to Snowflake
-Create a storage integration object in Snowflake
-Go back to the S3 bucket and copy the bucket name
-Execute the integration query
Integration syntax:-
create or replace storage integration (AWS_S3_INTEGRATION)
TYPE= EXTERNAL_STAGE
STORAGE_PROVIDER= S3
ENABLED =TRUE
STORAGE_AWS_ROLE_ARN=
"arn:aws:iam::654654607075:role/aws_s3_snowflake_integration"
STORAGE_ALLOWED_LOCATIONS=('s3://raj143s3test/csv')
COMMENT= 'integ with aws s3 bucket';
-To get the ARN ID, execute this query: desc integration AWS_S3_INTEGRATION;
(A stage sketch using this integration follows below.)
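Once the integration is trusted, an external stage can reference it (a sketch; the stage name is illustrative and the URL reuses the allowed location above):
create or replace stage my_s3_stage
url = 's3://raj143s3test/csv'
storage_integration = AWS_S3_INTEGRATION
file_format = (type = csv skip_header = 1);
-- verify that the staged files are visible
list @my_s3_stage;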
Note - We use internal stages when the files are on our local servers/machines
such as Windows, Linux and UNIX.
list @~;
list @%mytable;
list @my_stage;
File_format= ( <file_format_name>)
- Validation_mode
- Return_failed_only
- ON_ERROR
- FORCE
- SIZE_LIMIT
- Truncate_columns
- ENFORCE_LENGTH
- PURGE
- Load_UNCERTAIN_FILES
1. VALIDATION_MODE
Syntax-
Copy into <table_name>
From @externalstage
File_format= (<file_format_name>)
Files= (<file_names>)
Validation_mode= RETURN_N_ROWS/RETURN_ERRORS/RETURN_ALL_ERRORS;
-Validates the data files instead of loading them into the table.
- RETURN_N_ROWS - displays the first N records and fails at the first error record
(to continue past error records, use ON_ERROR=CONTINUE instead).
2. RETURN_FAILED_ONLY
Syntax:-
Copy into <table_name>
FROM @EXTERNALSTAGE
FILE_FORMAT= (file_format_name)
Files= (file_names)
RETURN_FAILED_ONLY= TRUE/FALSE;
- Specifies whether to return only files that have failed to load in the
statement result.
- Default is false.
3. ON_ERROR (Important)
Syntax-
Copy into <table_name>
FROM @STAGE
FILE_FORMAT=(FILE_FORMAT_NAME)
FILES= (FILE_NAMES)
ON_ERROR= CONTINUE/SKIP_FILE/SKIP_FILE_<num>/SKIP_FILE_<num>%/ABORT_STATEMENT;
- SKIP_FILE_<num> - skips a data file when the number of error rows found in the
file equals or exceeds the specified number (an example follows below).
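Example (a minimal sketch of a full COPY using ON_ERROR; the stage, file and table names are illustrative):
-- skip only the bad rows and keep loading the rest of each file
copy into employees
from @my_s3_stage
files = ('employees_1.csv')
file_format = (format_name = 'my_csv_format')
on_error = 'CONTINUE';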
4. FORCE
Syntax-
Copy into <table_name>
From @stage
File_format=(file_format_name)
Files= ('file_names')
FORCE= TRUE/FALSE;
-Loads all the files, regardless of whether they have been loaded previously.
-Default is false; if we don't specify this property, the COPY command will not
reload files that were already loaded.
5. SIZE_LIMIT
Syntax-
Copy into <table_name>
From @stage
File_format=(file_format_name)
Files=('file_names')
Size_limit= <number>;
-Specifies the maximum size (in bytes) of data to be loaded by the COPY statement.
-Once the threshold is exceeded, the COPY operation stops loading further files; the
file in progress is still loaded completely, and at least one file is always loaded.
6. TRUNCATECOLUMNS
Syntax-
Copy into <table_name>
From @stage
File_format= (file_format_name)
Files=('file_names')
Truncatecolumns= TRUE/FALSE;
- Specifies whether to truncate text strings that exceed the target column
length.
- Default is false; that means if you don't specify this option and a text string
exceeds the target column length, the COPY command will fail.
7. PURGE
Syntax-
Copy into <table_name>
FROM @STAGE
FILE_FORMAT=(FILE_FORMAT_NAME)
FILES= ('FILE_NAMES')
PURGE= TRUE/FALSE;
-Specifies whether to remove data files from the stage automatically after the
data is loaded successfully.
-Default is false.
8. LOAD_UNCERTAIN_FILES
Syntax-
Copy into <table_name>
From @stage
File_format=(file_format_name)
Files= ('file_names')
Load_uncertain_files= TRUE/FALSE;
-Specifies to load files for which the load status is unknown; by default the COPY
command skips such files.
Note- The load status is unknown if all the following conditions are true: -
-The initial set of data was loaded into the table more than 64 days earlier.
-If the file was already loaded successfully into the table, this event
occurred more than 64 days earlier.
➢ Semi-structured data - JSON
- Semi-structured
- Can be nested
❖ Process of loading
- Parse the raw data based on the JSON file content (a sketch follows below)
1. Normal JSON
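A sketch of the idea (the table, stage and JSON attribute names are assumed for illustration, not from these notes):
-- land the whole JSON document in a single VARIANT column
create or replace table raw_json (v variant);
copy into raw_json
from @my_int_stage/orders.json
file_format = (type = json);
-- parse the raw data using the colon notation and FLATTEN for nested arrays
select
v:order_id::int as order_id,
v:customer:name::string as customer_name,
f.value:sku::string as sku
from raw_json,
lateral flatten(input => v:items) f;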
Syntax:-
Create or replace pipe pipe_name
Auto_ingest= [TRUE|FALSE]
As
<Copy_statement>;
Notification setup: -
-Configure an event notification on the cloud storage so that Snowpipe receives an
alert when new files arrive.
-The pipe then uses the COPY command to extract the data from the files and load it
into Snowflake tables.
Practical steps:-
-Create a container
-Get the Azure tenant ID from Azure Active Directory (Microsoft Entra ID)
-Go to the container you want to upload to, click the three dots on the right
side, click Access Control (IAM), select role assignment, and add the role
(Storage Blob Data Contributor); click the role, click select members, add
the multi-tenant app name, and then complete the process as usual.
Syntax-
create or replace storage integration azure_integration
type = external_stage
storage_provider = azure
enabled = true
azure_tenant_id = '<azure_tenant_id>'
storage_allowed_locations = ('azure://<storage_account>.blob.core.windows.net/<container>/');
select system$pipe_status('pipe_name');
-lastReceivedMessageTimestamp - specifies the timestamp of the last event message
received from the message queue. If the timestamp is earlier than expected, this
indicates an issue with the notification setup.
-lastForwardedMessageTimestamp - specifies the timestamp of the last "create object"
event message that was forwarded to the pipe. If messages are received but not
forwarded, there is likely a mismatch between the cloud storage path where the new
data files are created and the path specified in the stage and pipe definitions.
- Copy history shows the history of all file loads and errors, if any.
Syntax:-
Select * from table(information_schema.copy_history(
table_name=> 'table_name',
start_time=> <timestamp or expression>));
- If the load operation encounters errors in the data files, the copy history shows
them; VALIDATE_PIPE_LOAD returns the details of the errors encountered.
Syntax:-
Select * from table(information_schema.validate_pipe_load(
pipe_name=> 'pipe_name',
start_time=> <timestamp or expression>));
Managing pipes:-
-Use the DESC pipe_name command to see the pipe properties and the COPY
command it runs.
-It is best practice to pause and resume pipes before and after performing the
below actions (a sketch of pausing/resuming follows below).
-To modify the COPY command, recreating the pipe is the only possible way.
-When you recreate a pipe, all the load history will be dropped.
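Example (a sketch of pausing and resuming; the pipe name is illustrative):
-- pause the pipe before changing related objects
alter pipe my_pipe set pipe_execution_paused = true;
-- resume the pipe after the changes
alter pipe my_pipe set pipe_execution_paused = false;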
- The process for unloading data into files is the same as the loading process,
except in reverse.
Step-1
-Use the COPY INTO <location> command to copy the data from the table into a
stage.
Step-2
1. From a Snowflake stage, use the GET command to download the data files.
2. From S3, use the interface/tools provided by Amazon S3 to get the data
files.
3. From Azure, use the interface/tools provided by Microsoft Azure to get the
data files.
Syntax:-
Copy into @stage
From table_name
<options>;
Unloading Options: - (an example follows the list)
-Overwrite- TRUE|FALSE – specifies whether to overwrite existing files.
-Single- TRUE|FALSE – specifies whether to unload into a single file or
multiple files.
-Detailed_output- TRUE|FALSE - shows the path and name for each file, its
size, and the number of rows that were unloaded to the file.
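Example (a minimal unloading sketch combining both steps; the stage, table and path names are illustrative):
-- step 1: unload the table into a named internal stage as CSV files
copy into @my_unload_stage/employees_
from employees
file_format = (type = csv)
overwrite = true
single = false
detailed_output = true;
-- step 2: download the unloaded files to the local machine (run from SnowSQL)
get @my_unload_stage file:///tmp/unload/;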
➢ Snowflake- Caching
Caching-
-Cache is a temporary storage location that stores copies of files or data so that
they can be accessed faster in the future.
-Cache plays a vital role in saving costs and speeding up results.
-Improves query performance.
Types of cache:-
1. Query result cache - query results are stored for up to 24 hrs.
2. Local disk cache - available while the VW is up and running; it is cleared when
we suspend the VW.
3. Remote disk cache – permanent storage location.
Query result cache:-
-Query results are stored for up to 24 hrs.
-The first time we run a query, Snowflake scans the remote disk (permanent storage);
when we run the same query a second time it does not scan the permanent storage
location, it reads only the query result cache and returns the result.
-We have to run exactly the same query for the result cache to be used.
Local disk cache:-
- It is available while the VW is up and running; it is cleared when we suspend the VW.
- It works even if the queries are not exactly the same; we can run changed queries on
the same data and still get results faster.
- When we run a query the first time, the data is fetched from the remote disk and
stored in the warehouse's local SSD memory.
3. Before executing any statement/query (that means we can get back whatever data
was present in the tables before a statement ran, by using its query id)
Syntax-
Select * from table_name before (statement=> '<query_id>');
Fail-safe:- Fail-safe is nothing but: once the time travel retention period is
over, the data is stored in the fail-safe area.
-Fail-safe provides a 7-day period to recover the historical data.
-This starts immediately once the retention period is completed.
-We can't query or restore the fail-safe data ourselves.
-For restoring fail-safe data we need to contact the Snowflake support
team; the restore may take a few hours or a few days.
- Once the fail-safe period is completed, there is no other way to restore the data.
➢ Snowflake - Zero copy cloning
-Snowflake allows you to create clones of tables, schemas and databases in
seconds.
- We can maintain multiple copies of the same data with no additional storage cost;
this is called zero-copy cloning.
Cloning Syntax:-
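(A minimal sketch; the object names are illustrative.)
-- clone a table
create or replace table employees_clone clone employees;
-- clone a schema
create or replace schema sales_clone clone sales;
-- clone a database
create or replace database dev_db clone prod_db;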
Points to remember :-
- We can't convert one table type to another type.
- We can create transient databases and schemas.
- We can create temp tables with the same name as perm/transient tables. If we
query with that table name, the data is fetched from the temp table in that
session.
How to find the table types?
-Look at the "kind" field in the SHOW TABLES output.
Example:-
Create or replace external table sample_ext
(id int as (value:C1::INT),
Name varchar(20) as (value:C2::VARCHAR),
DEPT INT AS (VALUE:C3::INT))
With location= @MYS3STAGE
File_format= (format_name= 'MYS3CSV');
➢ Snowflake -Task (Scheduling queries)
❖ What is Task:-
-We use tasks for scheduling SQL queries & stored procedures.
-Tasks can be combined with streams for implementing continuous change data
capture.
-We can maintain a DAG (Directed Acyclic Graph) of tasks to keep the
dependencies between tasks.
-Tasks require compute resources to execute SQL code; we can choose either of:
1. Snowflake-managed compute resources (serverless)
2. User-managed (VW).
Task Syntax:-
Create or replace task task_name
Warehouse= <warehouse_name>
Schedule = <time or cron>
After dep_task_name
As
Sql statement;
(Use either Schedule or After for a given task, not both.)
❖ Altering Task:-
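(A sketch of common ALTER TASK operations; tasks are created in a suspended state and must be resumed. The task name is illustrative.)
-- resume a task so it starts running on its schedule
alter task task_a resume;
-- suspend it again (required before changing its definition)
alter task task_a suspend;
-- change the schedule while the task is suspended
alter task task_a set schedule = 'USING CRON 0 6 * * * UTC';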
DAG OF TASK:-
Task -A
CREATE OR REPLACE TASK TASK_A
WAREHOUSE= SAMPLEWH
SCHEDULE = 'USING CRON 30 9 * * * UTC'
AS
<SQL QUERY1>;
Task -B
CREATE OR REPLACE TASK TASK_B
WAREHOUSE= SAMPLEWH
After task_A
AS
<SQL QUERY2>;
Task -C
CREATE OR REPLACE TASK TASK_C
WAREHOUSE= SAMPLEWH
After task_A
AS
<SQL QUERY3>;
Task -D
CREATE OR REPLACE TASK TASK_D
WAREHOUSE= SAMPLEWH
AFTER TASK_B, TASK_C
AS
<SQL QUERY4>;
❖ Task History
-We can check task history from the INFORMATION_SCHEMA table function
TASK_HISTORY.
//To see all task history with last executed task first
Select * from table(information_schema.task_history())
Order by scheduled_time desc;
❖ Troubleshoot task
-If your task is not running as per the schedule, check the below things.
Step-3 :- Verify the permissions granted to the task owner; the owner should have
access to the database, schema, tables and warehouse.
Step-4 :- Verify the condition (only for streams): check
SYSTEM$STREAM_HAS_DATA; the stream may not have data changes to process.
➢ Snowflake streams
❖ What is stream ??
- A stream object records DML changes made to tables, including inserts, updates
and deletes.
- We call this process change data capture (CDC).
- Streams are combined with tasks to set up continuous data pipelines.
- Snowpipe + Stream + Task -> Continuous data load.
❖ Metadata of streams.
-METADATA$ACTION : Indicates the DML operation (INSERT, DELETE) recorded.
-METADATA$ISUPDATE : Indicates whether the operation was part of an UPDATE
statement. Updates to rows in the source object are represented as a pair of
DELETE and INSERT records in the stream, with the metadata column
METADATA$ISUPDATE set to true.
-METADATA$ROW_ID : Specifies the unique and immutable ID for the row, which
can be used to track changes to specific rows over time.
❖ Consuming data from stream
-We can use the MERGE statement for consuming the changes from a stream and
applying them to the target tables (a sketch follows below).
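A sketch of such a MERGE (the source table MY_TABLE, the stream MY_STREAM on it, the target MY_TARGET, and the columns id/name/dept are assumed for illustration):
merge into my_target t
using my_stream s
on t.id = s.id
-- row deleted in the source (not part of an update)
when matched and s.metadata$action = 'DELETE' and s.metadata$isupdate = false
then delete
-- row updated in the source (the INSERT half of the update pair)
when matched and s.metadata$action = 'INSERT' and s.metadata$isupdate = true
then update set t.name = s.name, t.dept = s.dept
-- brand-new row in the source
when not matched and s.metadata$action = 'INSERT'
then insert (id, name, dept) values (s.id, s.name, s.dept);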
❖ Types of streams
1. Standard streams :- A standard stream tracks all DML changes to the source
object, including inserts, updates and deletes (including table truncates).
Syntax- Create or replace stream MY_STREAM ON TABLE MY_TABLE;
2. Append_only streams :- An append-only stream tracks row inserts only.
Update and delete operations (including table truncates) are not recorded.
Syntax:- create or replace stream MY_STREAM ON TABLE MY_TABLE
APPEND_ONLY=TRUE;
3. Insert_only streams:- Supported for external tables only. They track row inserts
only; they do not record delete operations.
Syntax:- create or replace stream my_stream on external table my_table
INSERT_ONLY= TRUE;
Types of views :-
1. Non-Materialized views (Normal Views)
2. Secure views
3. Materialized views.
Secure views:-
-A secure view does not allow unauthorized users to see the definition of the
view.
-Users can't see the underlying SQL query.
Advantages of secure views:-
1. Can protect the data by not exposing it to other users.
2. Prevents users from seeing the underlying tables present in our database.
Syntax;-
Create or replace secure view view_name
As
Select statement ;
Materialized view :-
- A materialized view stores a precomputed result set.
- Querying a materialized view gives better performance than querying the
base tables.
- Can be created on a single table only; can't be built on multiple tables.
- Designed for improved query performance when we are using the same dataset
repeatedly.
- Available in Enterprise edition and higher.
Syntax;-
Create or replace materialized view view_name
As
Sql statement;
❖ Masking policy :-
- It is a way of hiding sensitive data from unauthorized access.
- Masking policies are schema-level objects.
- The same masking policy can be applied to multiple columns.
❖ Dynamic data masking
- The sensitive data in existing tables is not modified by Snowflake; whenever
we run a query, the masking is applied dynamically and the masked data is
displayed. This is called dynamic masking.
- Data can be masked partially.
- Unauthorized users can operate on the data, but they can't view the actual values.
- Mostly, masking policies are applied based on roles (a sketch follows below).
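Example (a sketch of creating and applying a role-based masking policy; the policy, table, column and role names are illustrative):
-- show real values only to privileged roles, mask for everyone else
create or replace masking policy email_mask as (val string) returns string ->
case
when current_role() in ('SYSADMIN', 'HR_ADMIN') then val
else '*****'
end;
-- attach the policy to a column
alter table employees modify column email set masking policy email_mask;
-- detach (unset) the policy before dropping it
alter table employees modify column email unset masking policy;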
//dropping
Drop masking policy policy_name;
Limitations:-
1. Before dropping masking policies, we should unset them from the columns.
2. The data types of the input and output values must be the same.
Data sharing: - First we need to create a share object to share the data.
- Data can be shared securely.
- We can share data with other Snowflake users and with non-Snowflake
users.
- For non-Snowflake users we have to create a reader account and share the
data.
- Provider - who shares the data by creating the share object.
- Consumer - who consumes or uses the shared data.
The below-mentioned objects can be shared with other accounts (a share-creation
sketch follows the list):
-Tables
-External tables
-Secure views
-Secure materialized views
-Secure UDFs
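A sketch of the provider-side steps (the share, database, schema, table and account names are illustrative):
-- create the share object
create or replace share sales_share;
-- grant access on the objects being shared
grant usage on database sales_db to share sales_share;
grant usage on schema sales_db.public to share sales_share;
grant select on table sales_db.public.orders to share sales_share;
-- add the consumer account(s)
alter share sales_share add accounts = consumer_account1;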
Data sampling:-
• Selecting some part of the data or a subset of records from a table.
• Query building and testing.
• Data analysis or understanding.
• Useful in dev environments where we use small warehouses and occupy less
storage.
Sampling method types (examples follow the list):-
1. Bernoulli or row – samples individual rows; e.g., 10% of 4 million rows ≈ 4 lakh rows.
2. System or block – samples micro-partitions; e.g., 10% of 600 micro-partitions = 60
micro-partitions.
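Example (a sketch of both methods; the table name is illustrative):
-- row-level (Bernoulli) sampling: roughly 10% of the rows
select * from orders sample bernoulli (10);
-- block-level (system) sampling: roughly 10% of the micro-partitions
select * from orders sample system (10);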