SnowPro Study — Printable

SnowPro Study — All Notes

Source: Infostrux-Solutions/snowflake-snowpro-core-study-notes (main)

README

Snowflake SnowPro Core Exam Preparation Course Notes

Snowflake Overview and Architecture

Virtual Warehouses

Storage and Data Protection

Data Movement

Account and Security

Performance and Tuning

Semi-Structured Data

Exam Tips

Exam Prep Tips

Read every single “Reading Assets” article from the [COF-C02] SnowPro Core Certification Exam Study Guide. You will find that the real exam questions blend the question types from the practice exams with these real-world examples.

Topics Quizzed on Exams

Sample Exams

Free

Paid

Architecture


Snowflake's architecture is a hybrid of traditional shared-disk and shared-nothing database architectures.

  • Similar to shared-disk architectures:
    • Snowflake uses a central data repository for persisted data accessible from all compute nodes in the platform
    • Offers data management simplicity
  • Similar to shared-nothing architectures (e.g. Hadoop, Spark):
    • Snowflake processes queries using virtual warehouses where each node in the cluster stores a portion of the entire data set locally
    • Offers performance and scale-out benefits

Advantages:

  • Storage, compute and management services are decoupled and can scale independently
  • There is virtually no limit on how much each layer can be scaled
  • Workload isolation with virtual warehouses

Snowflake Layers

Snowflake's unique architecture consists of three layers, all of them highly available. Each layer is also priced separately.

In addition to the three layers mentioned, there is the Cloud Agnostic Layer, which allows Snowflake to run on the major cloud providers (AWS, Azure, GCP) and ensures Snowflake performs comparably across providers.

Storage Layer

  • single copy and source of truth for the data
  • virtually infinitely scalable
  • inherits the cloud provider's availability and durability guarantees
    • e.g. AWS S3 replicates data across three different availability zones within a region
  • data is stored in Snowflake's read-optimized proprietary compressed columnar format
    • due to columns containing similar data, the columnar format results in better compression
    • data is stored compressed
    • data is stored using AES-256 encryption
  • Snowflake manages all aspects of how this data is stored.

Compute (Query Processing) Layer

  • Multi-Cluster Compute (Virtual Warehouses): each Virtual Warehouse is a Massively Parallel Processing (MPP) compute cluster composed of multiple compute nodes allocated by Snowflake from a cloud provider.
  • Carries out the computation of the query plan generated by the Cloud Services Layer.
  • Virtual Warehouses retrieve the data required for processing a query and cache it locally for up to 24 hours
  • Virtual Warehouses can be created or removed almost instantly
  • VWs can also be paused or resumed as needed
  • The customer is only charged for the time the VW is running
  • Each Warehouse is isolated from others and can be configured individually

Cloud Services Layer

Also known as the Global Services Layer, it is a collection of highly available and scalable services that manage all activities across Snowflake. The Cloud Services Layer is shared between accounts (except accounts set up with the Virtual Private Snowflake Edition).

  • Security
    • Authentication
    • Access Control
      • Roles and Users
      • Shares
    • data encryption and key rotation
  • Infrastructure management
    • Compute
    • Centralized storage
      • Micropartitions
      • Micropartition versioning for Time Travel
  • Transaction management (ACID compliant)
  • Query parsing and optimization
    • SQL Optimization
      • Cost-based
      • Automatic JOIN order optimization (no hints required)
      • Automatic statistics gathering
      • Data Pruning based on the metadata and statistics
    • Processing is delegated to a Virtual Warehouse
    • The Cloud Services Layer also handles queries that depend on Metadata only (no Warehouse required), e.g.:
      • Retrieve or modify the session context
      • Retrieve metadata about objects (users, warehouses, databases)
      • Get statistics, e.g.: MIN and MAX of a column, COUNT, number of NULLs
      • DDL retrieval
  • Metadata and Statistics collected and kept up-to-date
  • Time Travel and Cloning
  • Availability
  • Transparent online updates and patches
  • Runs on Snowflake-managed compute

DataSharing

Data Sharing

Introduction to Secure Data Sharing

Secure Data Sharing lets you share selected objects in a database in your (Provider) account with other (Consumer) Snowflake accounts. You can share the following Snowflake database objects:

  • Tables
  • External tables
  • Secure views
  • Secure materialized views
  • Secure UDFs

Shares

Shares are named Snowflake objects that encapsulate all of the information required to share database objects. Using GRANTs, shares encapsulate:

  • the privileges that grant access to the database, schemas and specific objects we want to share
  • the consumer accounts that have access to the share and all objects the share has access to.

Some sample SQL:

-- Create a Share and give SELECT privileges to table
CREATE SHARE myShare;
GRANT USAGE ON DATABASE myDb TO SHARE myShare;
GRANT USAGE ON SCHEMA myDb.public TO SHARE myShare;
GRANT SELECT ON TABLE myDb.public.myTable TO SHARE myShare;
-- Add consumer accounts to the share
ALTER SHARE myShare ADD ACCOUNTS = org1.consumer1,org1.consumer2;

-- Show all Shares in the account
SHOW SHARES;

-- See all the privileges a Share has
SHOW GRANTS TO SHARE myShare;

-- See the accounts (consumers) that have access to the Share
SHOW GRANTS OF SHARE myShare;

Provider

  • Shares data with others
  • No cost to share other than the storage of the shared data
  • Has full control over the data shared and the privileges they give to the share
  • Can revoke privileges granted to the share at any time
  • Pays for the storage of the data they share

Consumer

Accounts that receive the share/data.

  • Shared data does not take up any storage in the Consumer account so Consumers don’t pay for its storage
  • Time Travel is not available to Consumers on shared data
  • Zero-Copy Cloning is not available to Consumers on shared data
  • Consumers can, however, copy shared data using CTAS (CREATE TABLE ... AS SELECT)
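On the consumer side, a share is mounted by creating a database from it. A sketch (account, share and object names are illustrative):

```sql
-- Consumer account: create a read-only database from the Share
CREATE DATABASE shared_db FROM SHARE provider_account.myShare;

-- Shared objects are read-only; CTAS makes a local, modifiable copy
-- (the copy does incur storage costs in the Consumer account)
CREATE TABLE myDb.public.local_copy AS
SELECT * FROM shared_db.public.myTable;
```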

Full Account Consumer

  • The only charges to consumers are for the compute resources (i.e. Virtual Warehouses) used to query the shared data

Reader Account Consumer

  • Consumers who are not Snowflake customers and do not have a Snowflake account
  • The Reader account is set up within the Provider's environment
    • Providers manage all aspects of the Reader Account - users, roles, permissions, warehouses, etc.
    • A reader account can only consume data from the provider account that created it
  • Readers must use Warehouses from the Provider's account; the Provider incurs the compute cost
  • Providers can track and manage the queries Readers can run and can monetize them by charging for the data share or the queries run
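Reader accounts are created by the Provider as managed accounts. A sketch (names and credentials are placeholders):

```sql
-- Provider account: create a Reader account for a non-Snowflake consumer
CREATE MANAGED ACCOUNT reader_acct
  ADMIN_NAME = reader_admin,
  ADMIN_PASSWORD = 'Str0ngPassw0rd!',
  TYPE = READER;

-- List the managed (Reader) accounts in this account
SHOW MANAGED ACCOUNTS;
```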

Sharing Features

  • All database objects shared between accounts are READ-ONLY (i.e. the objects cannot be modified or deleted, and table data cannot be added or changed)
  • Data is shared live:
    • No data movement or copying
    • Consumers immediately see all updates
    • Data can be shared with an unlimited number of Consumers
  • Shares can only be created by users
    • with the ACCOUNTADMIN role
    • or with a role that has been explicitly given the permission to create Shares
  • Shares can be created between accounts but not within the same account
  • Consumers cannot re-share a Share
  • Snowflake does not charge for creating shares although Providers can charge Consumers or Reader Accounts to monetize their data
  • Sharing is available across all Snowflake Editions
  • The Consumer and Provider accounts must be in the same cloud and region to share data
    • A Provider can replicate their data to an account in another cloud and region if they want to share it with a Consumer in the other cloud/region.

Sharing with Secure Views

The preferred way to share data is by using Secure Views:

  • allows sharing only select rows and columns rather than entire tables
  • allows providing different data sets to different Consumers by filtering the data based on the context functions CURRENT_ACCOUNT() and CURRENT_ROLE(), which return the Consumer's account name and current role, respectively.
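A sketch of a Secure View that filters rows per Consumer (the sales and mapping tables are hypothetical):

```sql
-- Mapping table pairs each Consumer account with the regions it may see
CREATE OR REPLACE SECURE VIEW myDb.public.shared_sales AS
SELECT s.*
FROM myDb.public.sales s
JOIN myDb.public.account_map m ON s.region = m.region
WHERE m.consumer_account = CURRENT_ACCOUNT();

-- The Secure View (not the base tables) is granted to the Share
GRANT SELECT ON VIEW myDb.public.shared_sales TO SHARE myShare;
```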

Types

  • Public Data Marketplace
  • Private Data Exchange

Editions

Snowflake Editions

Feature / Edition Matrix

The Snowflake Editions determine what features are available to the customer and what level of service they require. The Snowflake Edition affects the amount charged for compute and data storage.

Standard

Introductory level offering, providing full, unlimited access to all of Snowflake’s core features

  • Complete SQL Data Warehouse
    • Includes standard DDL and DML features
    • Includes advanced DML statements such as multi-table inserts, merge and windowing functions
  • Secure data sharing across regions/clouds
  • Premier Support 24 x 365
  • Always-on enterprise grade encryption in transit and at rest
  • Customer dedicated virtual warehouses
  • Federated authentication
  • Database replication
  • External Functions
  • Snowsight analytics UI
  • Create your own Data Exchange
  • Data Marketplace access
  • Fail-safe
  • Continuous Data Protection
    • Time Travel
    • Network Policies

SPECIFICS

  • 1 day of time travel
  • No Multi-Cluster warehouses
  • No Materialized Views
  • No column-level security
  • No query statement encryption
  • No Failover/Failback
  • No AWS PrivateLink support
  • No Tri-Secret Secure encryption

Enterprise

Designed specifically for the needs of large-scale enterprises and organizations

  • All features of Standard plus:
  • Multi-Cluster Warehouses
  • Up to 90 days of time travel
  • Materialized Views
  • Column-level security
    • Dynamic Data Masking - uses masking policies to mask some columns for users with a lower role at query time
    • External Data Tokenization
      • tokenize sensitive data before loading it into Snowflake
      • you can also detokenize it using masking policies
      • useful for sensitive data like passwords, etc.
  • Search Optimization Service
  • Annual rekey of all encrypted data
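Dynamic Data Masking can be sketched as follows (policy, table and role names are illustrative):

```sql
-- Show the real value only to ANALYST; mask it for everyone else
CREATE OR REPLACE MASKING POLICY email_mask AS (val STRING) RETURNS STRING ->
  CASE WHEN CURRENT_ROLE() IN ('ANALYST') THEN val ELSE '*****' END;

-- Attach the policy to a column; masking is applied at query time
ALTER TABLE myDb.public.customers
  MODIFY COLUMN email SET MASKING POLICY email_mask;
```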

SPECIFICS

  • No query statement encryption
  • No Failover/Failback
  • No Tri-Secret Secure encryption
  • No AWS PrivateLink support

Business Critical

Offers even higher levels of data protection to support the needs of organizations with extremely sensitive data

  • All features of Enterprise plus:

  • Higher level of data protection

    • Data encryption everywhere
    • Query statement encryption
    • Enhanced security policy
    • Database Failover and Failback for business continuity (e.g. replication with failover and failback across regions)
    • AWS PrivateLink support
    • Tri-Secret Secure encryption using customer-managed keys (AWS)
  • Enhanced security, see: Snowflake’s Security & Compliance Reports

    • HIPAA (Health Insurance Portability and Accountability Act) support / HITRUST
    • PCI (PCI-DSS)
    • ISO/IEC 27001
    • FedRAMP Moderate
    • SOC 1 Type II
    • SOC 2 Type II
    • GxP
  • External Functions - AWS API Gateway Private Endpoints support

Virtual Private Snowflake (VPS)

Business Critical Edition but in a completely separate Snowflake environment, isolated from all other Snowflake accounts, to provide the highest level of security for governments and financial institutions

  • All features of Business Critical plus:
  • Dedicated Services Layer, not shared with other accounts
  • Customer dedicated virtual servers wherever the encryption key is in memory
  • Customer dedicated metadata store
  • Additional operational visibility

Government Regions

For government agencies that require compliance with US federal privacy and security standards, such as FIPS 140-2 and FedRAMP, Snowflake provides government regions, supported only on Amazon Web Services and Azure. These government regions require Business Critical Edition or higher.

ObjectModel

Objects

  • All objects in Snowflake are securable - privileges on objects can be granted to Roles and Roles are granted to Users.
  • Additionally, all objects can be interacted with and configured via SQL given the user has sufficient privileges.
  • To own an object means that a role (NOT USER!) has the OWNERSHIP privilege on the object.
  • Each securable object is owned by a single role, which by default is the role used to create the object.
  • Other than the Organization and Account objects, objects can be at the Account level or the Schema level. Examples:
    • Account level objects:
      • Network Policy
      • User
      • Role
      • Grants
      • Warehouse
      • Database
      • Share
      • Resource Monitor
      • Storage Integration
    • Schema level objects:
      • Table
      • External Table
      • View
      • Stream
      • Task
      • Stored Procedure
      • UDF (User Defined Function)
      • Sequence
      • Stage
      • File Format
      • Pipe
  • Note that object identifiers are case insensitive and cannot contain spaces or special characters unless they are enclosed in quotes.
  • When an object identifier is enclosed in quotes, it becomes case sensitive. This should be avoided as it can contribute to hard to debug issues!

  • Schemas and Database Roles are Database Level Objects
  • The definitions of most objects can be retrieved using the GET_DDL() function.
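A quick illustration of identifier case sensitivity and GET_DDL() (table names are illustrative):

```sql
CREATE TABLE myTable (id INT);     -- unquoted: stored and resolved as MYTABLE
CREATE TABLE "myTable" (id INT);   -- quoted: case-sensitive, distinct object

-- Retrieve the definition of an object
SELECT GET_DDL('TABLE', 'MYTABLE');
```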

Table

Tables hold all the data in a database.

Permanent

  • Default table type, used for the highest level of data protection and recovery
  • Persists until dropped
  • 0–90 days of Time Travel depending on the Snowflake edition (0–1 days in Standard Edition, 0-90 days in Enterprise Edition and up)
  • 7 days of Fail-Safe
  • Can be cloned to a permanent, transient or temporary table
    • If you clone a Permanent table to Transient or Temporary, the old partitions will remain Permanent, but any new partitions added to the clone will be Transient/Temporary

Transient

  • Persists until dropped
  • Available across sessions
  • Time Travel: 0 or 1 days
  • No Fail-Safe support
  • Can only be cloned to a Temporary or Transient table

Temporary

  • Used for transitory data
  • Tied to and available only within the context of a single user session. As such, they are not visible to other users or sessions
  • Time Travel: 0 or 1 days
    • A temporary table is purged once the session ends, so the actual retention period is 24 hours or the remainder of the session, whichever is shorter
  • Temporary tables incur storage costs
  • No Fail-Safe support
  • Can only be cloned to a temporary or transient table
  • Can be created with a clustering key, if needed
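The three internal table types side by side (names are illustrative; the 90-day retention assumes Enterprise Edition or higher):

```sql
CREATE TABLE perm_t (id INT)                 -- Permanent (default)
  DATA_RETENTION_TIME_IN_DAYS = 90;          -- Time Travel up to 90 days
CREATE TRANSIENT TABLE trans_t (id INT);     -- 0-1 days Time Travel, no Fail-safe
CREATE TEMPORARY TABLE temp_t (id INT);      -- visible only in this session
```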

External

  • Snowflake table "over" data stored in an external stage
  • Persists until removed
  • Read-only
  • No Time Travel
  • No Fail-Safe support
  • Cloning is not supported
  • XML files are not supported for external tables

View

Named definition of a SQL query which can be queried as if it were a table. Views can be created on all table types, including external tables.

Standard

  • Default view type
  • No data is stored
  • No clustering support
  • Executes as the role which owns it
  • DDL available to any role with access to the view
  • Streams are supported

Materialized

  • Enterprise Edition is required for Materialized views
  • Results of underlying query are stored
    • require storage space and active maintenance, which incur both storage and compute costs
    • while Materialized Views behave more like Tables, Time Travel is not supported
  • Results are auto-refreshed in the background so new data loaded into the table will be propagated into the Materialized View
  • Consumes background compute for the auto-refresh
    • The background compute does not require a customer-provided warehouse; Snowflake manages the compute
  • Clustering is supported
  • NO support for Streams
  • Changes to the schema of base table are not automatically propagated to materialized views
  • Best used for:
    • Query results contain a small number of rows and/or columns relative to the underlying table
    • Queries that require significant processing, e.g. analysis of semi-structured data or expensive aggregates
    • The underlying table is an external one, which might be slower to query directly
    • The view’s base table does not change frequently

Secure

Both Standard and Materialized views can also be defined as Secure

  • Executes as the role which owns it
  • DDL available only to authorized users
  • Can be shared; secure views are mandatory when sharing
  • Snowflake query optimizer bypasses optimizations used for regular views so secure views are less performant
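The view types can be sketched as follows (table and view names are illustrative):

```sql
CREATE VIEW v_std AS                     -- Standard: stores the query definition only
  SELECT id, amount FROM sales;

CREATE MATERIALIZED VIEW v_totals AS     -- Materialized: Enterprise Edition+
  SELECT region, SUM(amount) AS total
  FROM sales GROUP BY region;

CREATE SECURE VIEW v_sec AS              -- Secure: definition hidden, shareable
  SELECT id FROM sales;
```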

Stage

Stages in Snowflake specify where data files are stored (staged) in cloud storage. They facilitate data loading and unloading.

  • When staging uncompressed files in a Snowflake stage, the files are automatically compressed using GZIP unless compression is explicitly disabled
  • Snowflake automatically generates metadata for staged files:
  • METADATA$FILENAME: stage path and name of the data file the current row belongs to
  • METADATA$FILE_ROW_NUMBER: row number for each record in the staged data file
  • $[COLUMN_NUMBER]: staged data file column number; e.g. $1 refers to the first column in the staged file, $2 refers to the second column, etc.
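Staged files can be queried directly using the metadata columns above (stage and file format names are illustrative):

```sql
SELECT METADATA$FILENAME, METADATA$FILE_ROW_NUMBER, $1, $2
FROM @myStage (FILE_FORMAT => 'myCsvFormat');
```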

Named Stages

  • Created manually
  • Referenced with @[STAGE_NAME]
  • can specify file format
  • Supported Cloud Storage services:
    • Amazon S3 Buckets
    • Google Cloud Storage Buckets
    • Microsoft Azure Containers
  • Named Stages can be Internal or External
    • Internal Named Stage: cloud Storage location managed by Snowflake; data is stored internally within Snowflake.
    • Named External Stage: references data files stored outside Snowflake.
      • managed by the user or respective cloud provider and can even be in a different cloud provider than the Snowflake account
      • Creating an external stage requires providing
        • URL where the files are located
        • credentials to access the files
        • decryption keys to enable decryption of the data
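Creating internal and external named stages might look like this (names, URL and credentials are placeholders):

```sql
-- Internal named stage with an inline file format
CREATE STAGE my_int_stage
  FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);

-- External named stage over an S3 bucket
CREATE STAGE my_ext_stage
  URL = 's3://mybucket/load/'
  CREDENTIALS = (AWS_KEY_ID = '...' AWS_SECRET_KEY = '...');
```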

Internal Stages

  • Can only be accessed using the SnowSQL CLI
  • Can be of two types: User and Table Stage

User Stage

OrganizationsAccountsDatabasesAndSchemas

Organizations, Accounts, Databases And Schemas

Organization

Top level securable object which serves as a logical grouping of Accounts.

  • The Organization object is not available by default. It can be enabled on request by Snowflake Support.
  • Once enabled, it makes available features which make use of multiple accounts such as database replication and failover.
  • It can be used for monitoring usage and billing across accounts.
  • The ORGADMIN role is added to the primary account in the Organization to allow management of the contained Account objects
  • By default there is a limit of 25 Accounts in an Organization; this limit can be increased by Snowflake Support

Account

Logical grouping of Databases.

  • Each account is hosted on a single cloud provider and in a single geographic region
  • By default, Snowflake does not move data between regions unless requested
  • The region where the account is provisioned affects:
    • the account and storage price
    • which regulatory certifications you can achieve
    • some Snowflake features which may be region-specific
  • The account's name must be unique within an organization, regardless of which Snowflake region the account is in.
  • The URL of an account (e.g. xy12345.us-east-2.aws.snowflakecomputing.com) consists of
    • the account identifier (sometimes referred to as the account name) xy12345.us-east-2.aws, which can consist of any combination of the following:
      • the account locator xy12345
      • the cloud services region ID us-east-2 and
      • the cloud provider aws
    • and the snowflake domain, snowflakecomputing.com
  • It is possible to request a vanity address to convert the identifier into your company's name.
  • An account is created with the system-defined role ACCOUNTADMIN
  • When an account is created by an ORGADMIN, the account name (acme-marketing-account) will contain a unique organization name acme and the account name marketing-account

Database

Logical container of your data, organized in Schemas.

  • A Database can be created as a clone from another Database
  • A Database can be created as a replica of another Database
  • A Database can be created from a Share object provided by another Account
  • Databases can be created as Transient
    • Time Travel: 0 or 1 days
    • No Fail-Safe support
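The creation options above can be sketched as follows (database, account and share names are illustrative):

```sql
CREATE DATABASE db_clone CLONE myDb;                          -- zero-copy clone
CREATE DATABASE db_replica AS REPLICA OF myOrg.acct2.myDb;    -- replication
CREATE DATABASE db_shared FROM SHARE provider_acct.myShare;   -- from a Share
CREATE TRANSIENT DATABASE db_transient;                       -- no Fail-safe
```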

Schema

Logical grouping of Tables, Views, Stored Procedures, UDFs, Stages, File Formats, Pipes, Sequences, etc. Every Schema belongs to a single Database. When a database is created, two default schemas are created in it:

  • public: default schema for any object created without specifying a schema
  • information_schema: stores metadata information

  • A Schema can be created as a clone of another Schema
  • The Database name and Schema name together form the Namespace for all Schema objects
  • Schemas can be created as Transient.
    • Time Travel: 0 or 1 days
    • No Fail-Safe support

Overview


The Snowflake Platform

Snowflake is a cloud-native data platform delivered as a service:

  • purpose built for the cloud
  • no management of hardware, updates and patches
  • automatic optimizations (partitioning)

Snowflake Releases

  • Snowflake releases new features weekly
  • The deployments happen transparently in the background; users experience no downtime or disruption of service, and are always assured of running on the most-recent release with access to the latest features.

Data Platform Workloads

  • Data Warehouse
    • Structured and relational data
    • ANSI standard SQL
    • ACID compliant transactions
    • Data stored in databases, schemas and tables
  • Data Lake
    • Separation and scalability of storage and compute
    • Schema does not need to be defined up front
    • Native processing of semi-structured data formats
  • Data Engineering
    • Data ingestion
      • batch: COPY INTO
      • continuous (near real-time): Snowpipe
    • Separate compute clusters for different workloads
    • Tasks and Streams for data pipelines
    • Security features:
      • all data encrypted at rest and in transit
      • row-based access control
      • column-level masking
      • Multi-Factor Authentication
  • Data Science
    • Centralized storage removes data management roadblocks
    • Partner ecosystem data science tools can integrate natively with Snowflake
      • Amazon SageMaker
      • DataRobot
      • Dataiku
  • Data Exchange
    • Secure data sharing between accounts in the same cloud provider region
    • Data Marketplace
    • Data Exchange
    • BI with Snowflake partner ecosystem tools
  • Data Applications
    • Connectors and drivers
    • UDFs and Stored Procedures
    • External UDFs (AWS Lambda or Azure Functions)
    • Snowpark

Connecting to Snowflake

  • Snowsight: Snowflake web interface/console
  • SnowSQL: The Snowflake CLI client
  • ODBC: client, requires a Snowflake driver
  • JDBC: client, requires a Snowflake driver
  • SDK (Node, Python, Kafka, Spark, Go…)

Features

  • ANSI SQL compliant OLAP data warehouse
  • Multi-cluster, shared data architecture
  • Self-tuning and self-healing
  • Pay only for what you use: storage and compute
  • Compute (Virtual Warehouses) can be
    • Scaled Up/Down by resizing the warehouse on the fly to accommodate complex, heavy workloads
    • Scaled Out by using multi-cluster Warehouses allowing Snowflake to automatically spin up additional compute clusters to handle large concurrent query workloads
    • Automatically suspended and resumed to accommodate periods of inactivity
  • Time Travel (see below for more info)
  • Fail-Safe (see below for more info)
  • Live data sharing
    • Private 1:1 and 1:many data sharing
    • Public data marketplaces
    • Secure private data exchanges
  • Multi-cloud: AWS, Azure, GCP
  • Zero-Copy Cloning
    • Objects that can be cloned: Databases, Schemas and Tables
    • Only metadata is copied
    • Once an object is cloned, the individual copies evolve separately and their storage increases only by the amount of data that was modified in each copy.

Time Travel

  • Allows access to previous versions of the data - the data in its state at a point of time in the past
  • A minimum of Enterprise edition required for time travel longer than 1 day, up to 90 days
  • The period of time travel can be set by database, schema or table
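Time Travel in practice (object names are illustrative; the statement ID is a placeholder query ID):

```sql
-- Query the table as it was 30 minutes ago
SELECT * FROM myTable AT(OFFSET => -60*30);

-- Restore a table to its state before a given statement ran
CREATE TABLE myTable_restored CLONE myTable
  BEFORE(STATEMENT => '<query_id>');

-- Set the Time Travel retention period (days) on a table
ALTER TABLE myTable SET DATA_RETENTION_TIME_IN_DAYS = 30;
```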

Fail-Safe

  • A data backup feature
  • Snowflake (not the customer) can recover and restore data for up to 7 days after the Time Travel period expires (not configurable)
  • Accessible only by contacting Snowflake support administrators
  • Only available for permanent tables
  • Fail-Safe incurs additional storage costs for the extra 7 days of data retention

Free Trial

You can sign up for a 30-day free trial at https://signup.snowflake.com/. Trying out the Enterprise edition is recommended.

Pricing


Snowflake pricing and cost are based on the actual usage. You pay for what you use and scale storage and compute independently.

Storage Costs

All customers are charged a flat monthly fee per Terabyte for the data they store in Snowflake. Storage cost is measured using the average amount of storage used per month, after compression, for all customer data stored in Snowflake.

The storage includes:

  • Data stored in tables, even temporary ones
  • Historical data for Time Travel
  • Historical data for Fail-Safe
  • Internal Stages

Compute Costs

Customers pay for virtual warehouses using Snowflake credits. The cost of these credits depends on the Snowflake edition that you are using, and the cloud region of the Snowflake account, see https://www.snowflake.com/pricing/. For example, the costs per credit for an account in US East (North Virginia) on AWS were:

  • Standard Edition: $2.00 per credit
  • Enterprise Edition: $3.00 per credit
  • Business Critical Edition: $4.00 per credit

The number of credits that a virtual warehouse consumes is determined by the following:

  • The warehouse size
  • The number of clusters (for multi-cluster warehouses)
  • The time each server in each cluster runs
    • Server runtime is billed per second
    • Server runtime is billed WITH A ONE-MINUTE MINIMUM
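A worked example, assuming the commonly published rate of 4 credits/hour for a Medium warehouse:

```
Medium warehouse     = 4 credits/hour
Runs 30 minutes      = 4 * 0.5         = 2 credits
Runs 10 seconds      = billed 1 minute = 4 * (1/60) ~ 0.067 credits
On Enterprise Edition at $3.00/credit, 2 credits = $6.00
```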

Cloud Services Costs

Customers pay for cloud services using Snowflake credits. The typical utilization of cloud services (up to 10% of daily compute credits) is included for free, meaning most customers will not see incremental charges for cloud service usage.

Data Transfer Costs

Customers who wish to move or copy their data between regions or cloud providers will incur data transfer charges. Features such as External Tables, External Functions, and Data Lake Export may incur data transfer charges.

Purchasing the Snowflake Service

  • On-Demand: Billed in monthly arrears, customers are charged a fixed rate only for the amount of services consumed
  • Pre-paid: Pre-purchase Capacity, which involves a commitment to Snowflake. The Capacity purchased is then consumed monthly, providing lower prices and a long-term price guarantee, among other advantages.

ResourceMonitors

Resource Monitor


To help control costs and avoid unexpected credit usage caused by running warehouses, Snowflake provides resource monitors. A resource monitor can be used to monitor credit usage by virtual warehouses and the cloud services needed to support those warehouses.

  • Limits can be set for a specified interval or date range. When these limits are reached and/or are approaching, the resource monitor can trigger various actions, such as sending alert notifications and/or suspending user-managed warehouses.
  • Resource Monitors can only be created by account administrators (i.e. users with the ACCOUNTADMIN role)
  • Account administrators can grant the MONITOR and MODIFY privileges to other roles, allowing those users to view and modify Resource Monitors
  • Resource Monitors can be set at two different Monitor Levels:
    • Account Level: a single Monitor can be set to control credit usage for all warehouses in your account.
    • Warehouse Level: a monitor can be assigned to one or more warehouses, controlling the credit usage for each warehouse
    • NOTE: A warehouse can only be assigned to a single Resource Monitor below the account level
  • The used credits for a resource monitor reflect the sum of credits consumed by all assigned warehouses within the specified interval, along with the cloud services used to support those warehouses during the same interval.
  • Resource Monitors can be replicated when replicating Databases

Configuring Resource Monitors

We need to specify several parameters when creating a Resource Monitor:

  • Credit Quota: specifies the number of Snowflake credits allocated to the monitor for the specified frequency interval.
    • Any number can be specified
    • Snowflake tracks the used credits/quota within the specified frequency interval by all warehouses assigned to the monitor
    • At the specified interval, this number resets back to 0
  • Monitor Level: specifies whether the resource monitor is used to monitor the credit usage for your entire Account (i.e. all warehouses in the account) or a specific set of individual warehouses.
  • Schedule: Consists of several properties
    • Start: Date and time (i.e. timestamp) when the resource monitor starts monitoring the assigned warehouses. Supported values:
      • Immediately (i.e. current timestamp)
      • Later (i.e. any future timestamp)
      • Note, however, that regardless of the time specified in the start date and time, resource monitors reset at 12:00 AM UTC.
    • Frequency: The interval at which the used credits reset relative to the specified start date. Supported values:
      • Daily
      • Weekly
      • Monthly
      • Yearly
      • Never (used credits never reset; assigned warehouses continue using credits until the credit quota is reached)
    • End: Date and time (i.e. timestamp) when Snowflake suspends the warehouses associated with the resource monitor, regardless of whether the used credits reached any of the thresholds defined for the resource monitor
    • Actions/Triggers: the action to perform if a given threshold is reached within the specified interval
      • Thresholds are usually expressed as a percentage of the credit quota
      • Notifications are sent to all account administrators who have enabled receipt of notifications, as well as to non-administrator users in the notification list
      • Resource monitors support the following actions:
        • Notify & Suspend: send a notification and suspend all assigned warehouses after all queries in flight have completed.
        • Notify & Suspend Immediately: send a notification and suspend all assigned warehouses immediately, which cancels any queries in flight.
        • Notify: Perform no action, but send an alert notification

Some sample SQL:

USE ROLE accountadmin;

CREATE OR REPLACE RESOURCE MONITOR limit1 WITH CREDIT_QUOTA = 2000
    FREQUENCY = WEEKLY
    START_TIMESTAMP = '2019-03-04 00:00 PST'
    TRIGGERS ON 80 PERCENT DO SUSPEND
             ON 100 PERCENT DO SUSPEND_IMMEDIATE;

ALTER WAREHOUSE wh1 SET RESOURCE_MONITOR = limit1;
ALTER WAREHOUSE wh2 SET RESOURCE_MONITOR = limit1;

Warehouse Suspension and Resumption

If a Resource Monitor suspends a Warehouse, it cannot be restarted unless one of these conditions is met:

  • The next interval, if any, starts, as dictated by the start date for the monitor.
  • The credit quota for the monitor is increased.
  • The credit threshold for the suspend action is increased.
  • The warehouses are no longer assigned to the monitor.
  • The monitor is dropped.
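For example, to let the warehouses assigned to the limit1 monitor from the example above resume, an account administrator could raise the quota (the new quota value here is illustrative):

ALTER RESOURCE MONITOR limit1 SET CREDIT_QUOTA = 3000;

-- Alternatively, detach a warehouse from the monitor
ALTER WAREHOUSE wh1 SET RESOURCE_MONITOR = NULL;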

Virtual Warehouses

Virtual Warehouses (VWs) are the compute layer in Snowflake's architecture. A VW is a logical wrapper around a cluster of servers with CPU, memory and disk. Virtual Warehouses are used to run queries and DML operations (such as loading data into tables).

  • The actual compute underlying the VW is fully managed by Snowflake
  • While running, even if idle, a VW consumes Snowflake credits - you are charged for compute
  • VWs can be Standard or Multi-Cluster Warehouses (MCW). Standard Warehouses have only a single compute cluster and cannot scale out

SCALE OUT/IN: Multi-Cluster Warehouses (MCW)

MCWs can automatically SCALE OUT/IN by spawning (or shutting down) additional compute clusters

  • used for handling spikes in concurrent compute requests
  • MCWs are a feature of Snowflake Enterprise Edition and higher
  • MCWs can be set to scale out to a maximum of 10 clusters
  • The number of credits billed is calculated based on the size and the number of warehouses that run within the time period
  • MCWs scale out under two Scaling Policies - Standard and Economy
    • Standard policy: starts additional clusters as soon as queries start queueing (or are expected to queue), favoring performance over credit conservation
    • Economy policy: starts an additional cluster only if the system estimates there is enough query load to keep the new cluster busy for at least 6 minutes, favoring credit conservation
  • Incoming queries are load-balanced across the available running clusters
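A minimal MCW definition might look like the following (the warehouse name and cluster counts are placeholders):

CREATE WAREHOUSE mcw_demo
  WAREHOUSE_SIZE = 'MEDIUM'
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 4
  SCALING_POLICY = 'ECONOMY';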

SCALE UP/DOWN: Resizing Warehouses

A Virtual Warehouse can SCALE UP/DOWN (be resized), manually, at any time, even while running

  • used for handling complex, long-running queries or ingesting large data sets
  • when resized while running, currently executing queries will complete at the current size; queued and new queries will execute at the new size
  • Efficient scale up/down should give approximately linearly proportionate results, e.g. running a query on a double-sized warehouse should cut the query run time in half.
  • It is recommended to keep queries and workloads of similar complexity and with similar compute demands on the same warehouse to simplify compute resource size management
  • Resizing can be included in SQL scripts so a warehouse can be scaled up/down depending on the demands of subsequent queries
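As a sketch of that pattern, a script could resize a warehouse around a heavy workload (wh1 is a placeholder name):

-- Scale up before a complex, long-running workload
ALTER WAREHOUSE wh1 SET WAREHOUSE_SIZE = 'XLARGE';
-- ... run the heavy queries ...
-- Scale back down afterwards
ALTER WAREHOUSE wh1 SET WAREHOUSE_SIZE = 'XSMALL';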

Creating Warehouses

  • One can create an unlimited number of warehouses in their account.
  • XS (Extra Small) is the smallest size of VW with one server per cluster.
    • each size up doubles the number of servers in the cluster as well as the number of Snowflake credits consumed
    • sizes range from XS, through S, M, L, XL, 2XL up to 6XL (currently)
      • the default size for warehouses created through the web interface is XL
      • the default size for warehouses created using SQL is XS
      • 5XL and 6XL warehouses are generally available in AWS regions, and in preview in US Government and Azure regions.
  • A VW can be set to auto-suspend when idle for a configurable number of minutes.
    • By default, the auto-suspend option is enabled.
    • Through the web UI auto-suspend can be set as low as 5 minutes
    • Through SQL auto-suspend can be set to the minimum auto-suspend setting of 1 minute (60 seconds)
    • VWs can also be started or suspended manually
  • A VW can also be set to auto-resume when a new query is submitted
    • Auto-resume is enabled by default
    • A VW will automatically start when first created
    • In SQL a VW can be created in suspended mode by setting the INITIALLY_SUSPENDED = TRUE option
  • When a VW is first created, it has, by default, the USAGE permission granted to the role that owns/creates it since ownership defaults to the object creator
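The options above can be combined in a single CREATE statement; a minimal sketch (wh_demo is a placeholder name):

CREATE WAREHOUSE wh_demo
  WAREHOUSE_SIZE = 'XSMALL'        -- default size when created via SQL
  AUTO_SUSPEND = 60                -- seconds of idle time before suspending
  AUTO_RESUME = TRUE               -- resume automatically when a query arrives
  INITIALLY_SUSPENDED = TRUE;      -- do not start running on creation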

Compute Credits

For more info, see Understanding Compute Cost

  • Customer-managed compute is charged per second with a 60-second minimum each time the warehouse starts
  • Snowflake-created serverless (background) compute is charged per second with no minimum.
  • Credit cost varies depending on cloud provider and region
  • One server in the VW cluster uses one credit per hour, so, for example, an XS (1 server) uses one credit per hour while a Large (8 servers) uses 8 credits per hour

Cloud Services Compute Billing

  • Customers are charged for Cloud Services Layer compute that exceeds 10% of the daily warehouse compute costs for the account.
  • Use the WAREHOUSE_METERING_HISTORY view to see how much cloud compute your account is using
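For example, a query along these lines summarizes cloud services credits per warehouse over the past week (the 7-day window is arbitrary):

SELECT warehouse_name,
       SUM(credits_used_cloud_services) AS cloud_services_credits
FROM snowflake.account_usage.warehouse_metering_history
WHERE start_time > DATEADD('day', -7, CURRENT_TIMESTAMP())
GROUP BY warehouse_name;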

Transactions

A transaction is a sequence of SQL statements that are committed or rolled back as a unit. All statements in the transaction are either applied (i.e., committed) or undone (i.e., rolled back) together. For this reason, transactions are ACID (Atomicity, Consistency, Isolation & Durability) compliant.

  • If a session disconnects and leaves a transaction in a detached state that can be neither committed nor rolled back, Snowflake automatically aborts and rolls back the transaction after 4 hours.
  • You can abort a running transaction with the system function SYSTEM$ABORT_TRANSACTION
  • Each transaction has independent scope
  • Snowflake does not support nested Transactions, although it supports nested Stored Procedure calls
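A minimal transaction sketch (the tables and values are hypothetical); both statements are applied or undone together:

BEGIN;
INSERT INTO orders (id, item) VALUES (1, 'widget');
UPDATE inventory SET qty = qty - 1 WHERE item = 'widget';
COMMIT;  -- ROLLBACK; instead would undo both statements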

Data Storage Layer

The Data Storage Layer is Snowflake managed cloud-based storage available in the major cloud providers (AWS, Azure and GCP).

  • Full native support for semi-structured data: JSON, AVRO, ORC, XML and Parquet
  • Customers are billed for data based on its compressed size as well as any files existing in internal (Snowflake-managed) stages

Micro-Partitions

All data in Snowflake tables is automatically divided into micro-partitions. Micro-partitions are contiguous units of storage that generally hold a maximum of 16 MB of encrypted, compressed data (between 50 MB and 500 MB uncompressed), organized in a columnar way.

  • Micro-partitions are IMMUTABLE and versioned
    • In order to modify data in a micro-partition, a new version of the micro-partition must be created. The older micro-partition is then marked for deletion
    • This is what enables the time travel feature in Snowflake
  • Data is compressed within each micro-partition
    • Within each partition, data is organized in a hybrid columnar format
    • On load, data is automatically analyzed for the optimal compression scheme based on its format
    • Organizing data based on columns containing data of the same type allows for efficient compression which reduces storage and I/O
  • Snowflake attempts to preserve natural data co-location
    • Since data is organized in micro-partitions on load, some natural data clustering and optimization occurs
  • Table records are contained entirely within a micro-partition, they cannot span multiple partitions

Query Pruning

The metadata stored in the Cloud Services Layer allows Snowflake to minimize the data it needs to load from the storage layer in order to resolve a query. Snowflake performs two types of pruning:

  • micro-partition pruning - reading only the micro-partition files needed to resolve a query
  • column pruning - only reading the columns that we need from each micro-partition

The Cloud Services Layer metadata also allows the processing of certain queries (e.g., MIN, MAX) at the Cloud Services Layer without engaging the compute or storage layers. The metadata tracked includes:

  • At the table level:
    • row count
    • table size (in bytes)
    • micro-partition references and table versions
  • At the Micro-Partition level:
    • MIN/MAX (range of values) in each column
      • this allows micro-partition pruning during query optimization
    • number of distinct values
    • NULL count

Data Storage Billing

  • Customers are billed for actual (compressed) storage used based on the daily average, billed per Terabyte per month
    • On-demand Pricing: Billed in arrears at $40 per Terabyte per month with a minimum charge of $25
    • Pre-purchased Capacity:
      • Billed upfront for committed storage capacity
      • Price varies depending on cloud platform
      • Customer is notified when they have consumed 70% of pre-purchased capacity

Bulk Loading

Bulk load is the process of loading batches of data from files available in a stage into Snowflake tables.

  • You need to specify the table name where you want to copy the data, the stage where the files are, the file/patterns you want to copy, and the file format
  • Bulk loading uses a transactional boundary control which wraps each COPY INTO command into an individual transaction that can be rolled back if errors are encountered
  • The information about the loaded files is stored in Snowflake's metadata. You cannot COPY the same file again within the next 64 days unless you force it with the FORCE = TRUE option
    • Renaming a previously ingested file in the stage does not modify the metadata so the file won't be re-ingested with the next COPY INTO command
  • Bulk loading uses user-managed compute resources (Virtual Warehouses) which affect the processing time and cost of data loading.
  • Some data transformations can be performed during data load, e.g. column reordering, column omission, type casting, etc.
    • some transformations like Flatten, Join, Group by, Filters or Aggregations are not supported.
  • You cannot Load/Unload files from your Local Drive, they need to be staged (PUT) or downloaded (GET) first
  • Using the Snowflake UI, you can only load files up to 50MB. You can copy bigger files using SnowSQL.
  • Organizing input data by granular path can improve load performance
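Putting the points above together, a minimal bulk load sketch (stage, path, table and pattern are placeholders):

COPY INTO mydb.public.mytable
  FROM @my_int_stage/2023/06/          -- granular path improves load performance
  PATTERN = '.*sales.*[.]csv'
  FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);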

Staging

Staging from a Local Storage on a Client Machine

We can use the PUT command to UPLOAD files from a local directory/folder to an INTERNAL STAGE (named internal stage, user stage, or table stage):

PUT file:///tmp/data/mydata.csv @my_int_stage;
  • PUT does NOT work with external stages.
  • You cannot use the PUT command from the Snowflake Web UI, you will need to use SnowSQL
  • Compression during a PUT operation uses the local machine's memory and /tmp directory disk space.
  • Uploaded files are automatically encrypted with 128-bit or 256-bit keys
  • The LIST command can be used to list the files in the stage.

Staging from Cloud Storage

  • When loading data from cloud storage it is generally recommended to create an external stage
  • If no stage was created, you will need to provide the location, credentials and decryption keys for each COPY INTO command.
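A sketch of creating a named external stage over S3 (bucket and credentials are placeholders; in practice a storage integration is preferable to inline credentials):

CREATE STAGE my_ext_stage
  URL = 's3://mybucket/data/'
  CREDENTIALS = (AWS_KEY_ID = '<key_id>' AWS_SECRET_KEY = '<secret_key>');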

COPY INTO Command

The COPY INTO command is used to load the data from staged files to an existing table. Snowflake will automatically load all files in the stage matching the query parameters which have not been previously loaded

  • Metadata on which files have been loaded is kept around for about 64 days.

    BEST PRACTICE: Files should not be kept in a stage for longer than 64 days

When loading files, you can:

  • name the files explicitly, e.g. FILES = ('<file_name>'[, '<file_name>'][, ...])
  • provide a pattern to match the files to be loaded: PATTERN = '<regex_pattern>'

Other options:

  • FILE_FORMAT - must be specified unless a file format has already been attached to the stage or the table
  • VALIDATION_MODE - "dry run" to validate the data without loading it
  • SIZE_LIMIT - limit the amount of data being ingested
  • PURGE - delete any files from the stage which were successfully loaded. Default is FALSE
    • If the purge operation fails for any reason, no error is returned
  • RETURN_FAILED_ONLY - returns only the failed rows. Default is FALSE
  • ENFORCE_LENGTH - will cause data that exceeds the column size defined in the table to be treated as errors. Default is TRUE
  • TRUNCATECOLUMNS - will cause data that exceeds the column size defined in the table to be truncated to the column size and not treated as an error. Default is FALSE
  • FORCE - reload previously loaded files. Default is FALSE. TRUNCATE-ing a table resets the metadata for previously loaded files so FORCE is not needed for a TRUNCATE-d table.
  • LOAD_UNCERTAIN_FILES - load files older than 64 days where Snowflake metadata on whether they were loaded or not no longer exists. Default is FALSE
  • ON_ERROR - specify how to handle errors. Default is ABORT_STATEMENT:
    • CONTINUE - skip error rows and continue processing file to the end
    • SKIP_FILE - files with errors are skipped
    • SKIP_FILE_<n> - files with a <n> number of errors are skipped
    • SKIP_FILE_<n>% - files with a <n>% percentage of errors are skipped
    • ABORT_STATEMENT - abort the COPY INTO execution on any error encountered. Rows loaded prior to the error are reverted
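Several of these options can be combined on one command; a hedged sketch (table, stage and file format names are placeholders):

COPY INTO mytable
  FROM @my_stage
  FILE_FORMAT = (FORMAT_NAME = 'my_csv_format')
  ON_ERROR = 'SKIP_FILE_5%'   -- skip files where more than 5% of rows error
  PURGE = TRUE;               -- delete successfully loaded files from the stage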

Validating a Data Load

Data can be validated before loading with the COPY INTO option VALIDATION_MODE which can be set to:

  • RETURN_<n>_ROWS
  • RETURN_ERRORS
  • RETURN_ALL_ERRORS

Validating data after a load can be done with the VALIDATE command which returns all errors encountered during a specific load job into a given table:

VALIDATE([namespace.]<table_name>,
  JOB_ID => { '<query_id>' | _last }
);
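For example, to inspect the errors from the most recent COPY INTO executed against a table in the current session (mytable is a placeholder):

SELECT * FROM TABLE(VALIDATE(mytable, JOB_ID => '_last'));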

Continuous Loading

There are different ways to ingest data continuously into Snowflake:

  • Snowpipe: the easiest and most popular way
  • Snowflake Connector for Kafka: Reads data from Apache Kafka topics and loads the data into a Snowflake table
  • Third-Party Data Integration Tools: You can do it with other supported integration tools

Snowpipe

Enables streaming/pipeline/micro-batch/near-real-time loading of frequent, small volumes of data into Snowflake.

  • Specifies the COPY INTO command that is used to load the data
  • It can be used with internal or external stages
  • Can be used with structured and semi-structured data
  • No scheduling is required
  • Snowpipe is serverless; it doesn't use Virtual Warehouses
  • It is used for small volumes of frequent data which it loads continuously (micro-batches)
  • Snowpipe does not guarantee that files will be loaded in the same order that they were staged
  • Snowpipe retains 14 days of metadata vs. bulk loading which retains metadata for 64 days. You cannot copy the same files with Snowpipe again during these 14 days
    • Renaming a previously ingested file in the stage does not modify the metadata so the file won't be re-ingested at the next SnowPipe run
  • The default behaviour when there is an error loading files is SKIP_FILE rather than bulk copy's ABORT_STATEMENT
  • A Snowpipe can be paused, resumed and its return status can be retrieved.

Snowpipe can be notified of new files in the stage in one of two ways:

  • A call can be made to the Snowflake REST API to trigger the ALTER PIPE <pipe_name> REFRESH command to check the stage for new files to be loaded.
    • It requires key pair authentication with a JSON Web Token (JWT)
    • APIs endpoints available:
      • Data File Ingestion: insertFiles
      • Load History Reports: insertReport and loadHistoryScan
  • You can setup a task which would run the ALTER PIPE <pipe_name> REFRESH command on a schedule

Alternatively, you can use the Snowpipe AUTO_INGEST method, together with notifications setup on your cloud provider so the pipe is automatically notified when a file lands in the stage

  • AUTO_INGEST is only available on AWS external stages by using S3 event messages dispatched by Amazon's SQS (Simple Queue Service)
  • For more information, see Automating Snowpipe for Amazon S3
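A sketch of an auto-ingesting pipe over an S3 external stage (pipe, stage and table names are placeholders):

CREATE PIPE my_pipe
  AUTO_INGEST = TRUE
  AS COPY INTO mytable
     FROM @my_s3_stage
     FILE_FORMAT = (TYPE = 'JSON');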

Once Snowpipe receives a trigger for the ingestion, whether via the REST API or a notification from the cloud provider, files are moved from the stage into an Ingestion Queue and then into the Pipe which contains the COPY INTO command with the source stage and the target table.

Billing

Customers are billed for the cloud compute used by Snowpipe per-second, per-core on actual compute usage.

In addition to the cloud compute costs, there is a utilization charge of 0.06 credits per 1,000 files notified via the REST API or AUTO_INGEST

Continuous Processing

Tasks

A Task is a way to run SQL code on a schedule without having to depend on external tools. A task can execute any one of the following types of SQL code: a single SQL statement, a call to a stored procedure, or procedural logic using Snowflake Scripting.

Task Features

  • Tasks can be chained based on the outcome of other Tasks to form DAGs (directed acyclic graphs)
    • the Root task is the only one which can be scheduled
    • the root task can be queued in case the warehouse is currently busy with another workload
    • child tasks are always queued - they only run after the parent task finishes
    • each task can have a maximum of 100 children tasks
    • a tree of tasks can have a maximum of 1000 tasks, including the root one
    • each non-root task can have dependencies on multiple predecessor tasks, up to 100 predecessors
    • if a predecessor task is dropped, there are two options:
      • its child task becomes a standalone task
      • its child task becomes a root task
    -- Create a child task
    CREATE TASK <newTask> AFTER <rootTask>;
    ALTER TASK <newTask> ADD AFTER <rootTask>;
    

  • Tasks can also be used with Streams
  • Tasks can call Stored Procedures
  • Tasks can be run at a set SCHEDULE interval (in minutes) or with a cron expression
  • Creating a task requires the CREATE TASK privilege
  • Tasks are initially created in the suspended state
    • they can be started/resumed with ALTER TASK <task_name> RESUME;
  • Tasks execute with the privileges of the owning Role
    • if the owner role of a task is dropped, task ownership is reassigned to the role that dropped the owner role
  • Tasks cannot be triggered manually
  • Snowflake ensures only one instance of a task with a schedule is executed at a given time
    • if a task is still running when the next scheduled execution time occurs, that scheduled time is SKIPPED.
  • Tasks have a maximum duration of 60 minutes by default
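A sketch of a scheduled task running on a user-managed warehouse (task, warehouse and table names are placeholders):

-- Runs every night at 02:00 UTC on warehouse wh1
CREATE TASK nightly_load
  WAREHOUSE = wh1
  SCHEDULE = 'USING CRON 0 2 * * * UTC'
AS
  INSERT INTO target_table SELECT * FROM staging_table;

ALTER TASK nightly_load RESUME;  -- tasks are created in the suspended state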

Task Compute Resources

Tasks require compute resources to execute SQL code. When a task is being created, the user has the choice to opt for using their own Virtual Warehouse or let Snowflake use serverless compute:

  • User-managed Virtual Warehouse. To enable that, the Virtual Warehouse where the task would execute must be specified with the WAREHOUSE option, see Optional Parameters
  • Snowflake-managed (i.e. serverless compute model). This is useful for frequently running tasks which would make suspending the warehouse where they run impossible. While Snowflake will automatically scale the serverless compute depending on the workload, the initial size of the serverless compute resource can be specified using USER_TASK_MANAGED_INITIAL_WAREHOUSE_SIZE
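For the serverless model, the WAREHOUSE option is simply omitted; a sketch (the task name and procedure call are placeholders):

CREATE TASK frequent_task
  USER_TASK_MANAGED_INITIAL_WAREHOUSE_SIZE = 'XSMALL'  -- initial serverless size hint
  SCHEDULE = '5 MINUTE'
AS
  CALL my_stored_procedure();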

Task History

You can use the information_schema.task_history() Snowflake function to query the history of task usage within a specified date range. You need one of these privileges to see the task history:

  • You have the ACCOUNTADMIN role
  • You are the owner of a task
  • You have the global MONITOR_EXECUTION privilege
SELECT *
FROM table(information_schema.task_history())
ORDER BY scheduled_time;

Streams

Streams (also known as Table Streams) are Snowflake objects that store metadata about DML changes (INSERT, UPDATE, DELETE) made to tables, views and secure views. A Stream can be created on a table and used for CDC (Change Data Capture) to identify and act on changed records.

When created, a table stream logically takes an initial snapshot of every row in the source table by initializing a point in time (called an offset) as the current transactional version of the table. The stream then records the DML changes after this snapshot was taken.

  • Streams hold no data. They can be thought of as timestamped bookmarks for a particular state of a table
  • Streams can be of three types:
    • Standard Stream
      • Captures all changes in the table records
      • Supported on tables, directory tables, and views
    • Append-Only Stream
      • Tracks row inserts only
      • Supported on tables, directory tables, and views
    • Insert Only
      • Tracks row inserts only
      • Supported on EXTERNAL TABLES only
  • Each stream contains the following metadata:
    • METADATA$ACTION - indicates the DML operation (INSERT, DELETE) recorded. Note that updates are recorded as INSERT, see METADATA$ISUPDATE below.
    • METADATA$ISUPDATE - indicates whether the operation was part of an UPDATE statement
    • METADATA$ROW_ID - unique and immutable ID for the row
  • The SYSTEM$STREAM_HAS_DATA function indicates whether a stream contains change data capture (CDC) records
  • Streams ensure exactly-once semantics for new or changed data in a table
    • a stream only contains the last DML action on a row. A stream will hold only one record per table row, if any, reflecting the current state of the row compared to its initial state when the stream was first created
    • if the row has not been modified, there will be no Stream record at all
    • if the row was modified but then it was reverted to its original state when the stream was first created, the Stream record for that row will be removed
  • Multiple streams can be created for the same table and consumed by different tasks
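A common CDC pattern combines a stream with a task that consumes it only when change records exist; a sketch with placeholder names:

CREATE STREAM my_stream ON TABLE src_table;

CREATE TASK consume_stream
  WAREHOUSE = wh1
  SCHEDULE = '5 MINUTE'
  WHEN SYSTEM$STREAM_HAS_DATA('MY_STREAM')  -- skip runs with no CDC records
AS
  INSERT INTO dst_table SELECT * FROM my_stream;  -- consuming advances the offset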

Data Unloading

Similar to Bulk Loading, unloading uses the COPY INTO command to unload data to an internal or external stage. The difference is that, when unloading, the source and target are reversed: data is copied from a table into a stage.

COPY INTO <stage> FROM <table>;

Unloading to Local Storage on a Client Machine

If data has been unloaded to a Snowflake INTERNAL STAGE (named internal stage, user stage, or table stage), you can use the GET command to DOWNLOAD the files to a local directory/folder on a client machine:

GET @my_int_stage file:///tmp/data/;
  • You cannot use the GET command from the Snowflake Web UI, you will need to use SnowSQL

Unloading to Cloud Storage

  • It is recommended to use an external named stage for data unloading
  • Files can also be unloaded directly to cloud storage by specifying the URL and any necessary credentials

Considerations

  • Supported file formats
    • Flat (CSV, TSV, etc.) for structured data
    • JSON - semi-structured data
    • Parquet - semi-structured data
  • When unloading relational data to JSON or Parquet, you will need to use the OBJECT_CONSTRUCT function to assemble table columns into semi-structured records.
  • When unloading, files are split
    • Each server in the warehouse cluster can write 8 files in parallel, so the larger the warehouse, the more (smaller) files will be generated. Choosing a smaller warehouse will result in fewer, larger files.
    • You can specify the SINGLE option for the COPY INTO command to prevent files from being split, but that results in a significant performance degradation as you'd be using one thread instead of 8 per server.
    • Default average file size is around 16MB, compressed. MAX_FILE_SIZE can be specified to change that but there are some cloud-specific limitations
      • Azure max filesize is 256MB
      • The max filesize for GCP and Amazon is 5GB
  • You can use the LIST and REMOVE commands to manage files in stages.
  • SELECT queries in the COPY INTO statement support the full SQL syntax which allows complex transformations on data egress, e.g. JOIN-ing data from multiple tables, perform aggregations, ordering and filtering, etc.
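An unload sketch combining several of the options above (stage path, table and column names are placeholders):

COPY INTO @my_int_stage/export/
  FROM (SELECT id, name FROM customers WHERE active = TRUE)
  FILE_FORMAT = (TYPE = 'CSV' COMPRESSION = 'GZIP')
  SINGLE = TRUE                -- do not split the output into multiple files
  MAX_FILE_SIZE = 104857600;   -- raise the cap to 100 MB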

Empty Strings vs. Null Values Options

  • FIELD_OPTIONALLY_ENCLOSED_BY - specify the character to use to enclose your fields
  • EMPTY_FIELD_AS_NULL - convert empty fields into NULL, the default is FALSE
  • NULL_IF - convert NULL values to something else
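One way these options can be combined so empty strings and NULLs stay distinguishable in the unloaded files; a sketch with placeholder names:

COPY INTO @my_stage FROM mytable
  FILE_FORMAT = (TYPE = 'CSV'
                 FIELD_OPTIONALLY_ENCLOSED_BY = '"'  -- empty strings become ""
                 EMPTY_FIELD_AS_NULL = FALSE
                 NULL_IF = ('NULL'));                -- NULL values become the literal NULL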

Data Movement

Loading and unloading data in Snowflake can be done in bulk (batch loading) or continuously.

Factors which affect loading times include:

  • the physical location of the stage,
  • GZIP compression efficiency, or
  • the number and types of columns

You can both load and unload data into tables with the COPY command

  • LOAD: copy data from a stage into a Snowflake table
  • UNLOAD: copy data from a table to a stage

File Formats

A schema-level named object which defines the format information required for Snowflake to interpret a file. You can specify different parameters, for example, the file’s delimiter, if you want to skip the header or not, etc.

Format  | Type            | Load | Unload | Binary Format
--------|-----------------|------|--------|--------------
CSV     | Structured      | Yes  | Yes    | No
JSON    | Semi-structured | Yes  | Yes    | No
Parquet | Semi-structured | Yes  | Yes    | Yes
XML     | Semi-structured | Yes  | No     | No
AVRO    | Semi-structured | Yes  | No     | Yes
ORC     | Semi-structured | Yes  | No     | Yes
  • Structured data: CSV
    • It’s the fastest file format to load data
    • May need to specify FIELD_DELIMITER, RECORD_DELIMITER and whether the file header should be skipped with SKIP_HEADER
  • Semi-structured data: JSON, Parquet, Avro, ORC, XML
    • Semi-structured data can be loaded as a VARIANT type in a Snowflake table
    • The maximum size for the VARIANT data type is 16MB
    • it can be queried using JSON notation, see Query Semi-Structured Data
  • A File Format must be available to every COPY INTO command. It can be provided in one of three ways:
    • A File Format can be attached to a stage - automatically applies the File Format to all files stored in the stage
    • A File Format can be attached to a table - automatically applies the File Format to all COPY INTO commands targeting the table
    • A File Format can be specified on the COPY INTO command itself, which overrides any stage- or table-level File Format
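A sketch of a named File Format attached to a stage, after which COPY INTO commands using that stage need no explicit FILE_FORMAT (names are placeholders):

CREATE FILE FORMAT my_csv_format
  TYPE = 'CSV'
  FIELD_DELIMITER = ','
  SKIP_HEADER = 1;

CREATE STAGE my_stage FILE_FORMAT = my_csv_format;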

Access Control

Overview of Access Control

Overview

  • Account- and user-level network policies
  • Online Certificate Status Protocol - OCSP x.509 validation for all Snowflake connectors
  • Optional support for cloud provider private network solutions
    • PrivateLink (AWS and Azure) - no traffic over the public internet
    • DirectConnect (AWS) - dedicated on-premise network connection to AWS

Access Flow

  1. Enforce network policy
  2. Verify account is not locked or disabled
  3. Verify user is not locked or disabled
  4. Resolve account and username from request
  5. Perform basic authentication checks

Account-Level Network Policies

  • IP white/blacklisting - restrict access by IP address or range
  • Use CIDR notation to specify subnet range
  • Only one active network policy per account
  • Must be ACCOUNTADMIN or SECURITYADMIN to configure
  • ALLOWED_IP_LIST is required, BLOCKED_IP_LIST is optional
-- Create a network policy
CREATE NETWORK POLICY <policy_name>
  ALLOWED_IP_LIST = (<ip_address_or_range>, ...)
  BLOCKED_IP_LIST = (<ip_address_in_allowed_range>, ...)
;
-- Set active network policy
ALTER ACCOUNT SET NETWORK_POLICY = <policy_name>;

User-Level Network Policies

If a user-level network policy is assigned to a user who is already logged in from an IP address the policy disallows, they are prevented from executing further queries

ALTER USER <username> SET NETWORK_POLICY = <policy_name>;

Configuration

  • Allow traffic on the following ports:
    • 443: general Snowflake traffic
    • 80: for the OCSP cache server which listens for all Snowflake client communication
  • Allow hostnames:
    • all Snowflake clients require temporary access to the provider cloud storage
    • SYSTEM$ALLOWLIST() lists the required hostnames for your Snowflake account

Authentication

Authentication

Identifies the user and confirms whether they are allowed to login.

Authentication Methods

  • Username and password policy (Snowflake-controlled, cannot be changed)
    • Password can be 8-256 characters long and must contain
      • at least one uppercase letter
      • at least one lowercase letter
      • at least one number
  • MFA: Multi-Factor Authentication with the Duo app
  • SSO/Federated Authentication (SAML 2.0), e.g. Okta
  • Other
    • Oauth: authentication from a 3rd party service or application
    • Key Pair: JSON Web Token (JWT) signed with a private/public RSA key pair
    • SCIM (System for Cross-domain Identity Management): often used with Azure or Okta

Authorization

Authorization

Snowflake's authorization model is a combination of:

  • RBAC: Role-Based Access Control model
    • Privileges are granted to Roles
    • Roles are granted to users
  • DAC: Discretionary Access Control model
    • each object has an owner who can grant others access to the object

All Snowflake objects are individually securable. Access to objects operates on the principle of least privilege: unless allowed by a privilege, access is denied.

Key Concepts

  • User: A person or a program

  • Role: An account-level object to which we can grant privileges

    • We can grant roles to users
    • Roles can be granted to other roles, creating a hierarchy
    • The privileges associated with a Role are inherited by any Roles above that Role in the hierarchy

  • Privilege: A defined level of access to an object, e.g. SELECT PRIVILEGE ON DATABASE db1

    • Privileges that can be set depend on the securable object. They can be very broad or very granular.

Default Roles

  • ORGADMIN
    • Role that manages operations at the organization level. More specifically, this role:
      • Can create accounts in the organization.
      • Can view all accounts in the organization (using SHOW ORGANIZATION ACCOUNTS) as well as all regions enabled for the organization (using SHOW REGIONS).
      • Can view usage information across the organization.
  • ACCOUNTADMIN: top level role
    • encapsulates the SYSADMIN and SECURITYADMIN roles
    • Access to account-level settings
    • Can manage Data Shares
    • Can manage Resource Monitors
    • Can manage Network Policies
  • SECURITYADMIN: Manage Users, Roles and Network policies
    • Access to account-level settings with the exception of Usage & Billing
    • Can manage Network Policies
  • USERADMIN: Manage Users and Roles
  • SYSADMIN: Manage all Snowflake objects, except:
    • Data Shares
    • Resource Monitors
    • Network Policies
  • PUBLIC: Lowest privilege role
    • Automatically granted to every User and Role in the Account

In addition to the default Roles, USERADMIN (or a user with the CREATE ROLE privilege) can create custom roles and assign roles to them.

  • As a rule of thumb, top-level custom roles should be assigned to SYSADMIN to ensure SYSADMIN has access to the objects the custom roles work with.
    • An exception to this rule of thumb is if the custom roles work with highly sensitive information.

Role Commands

-- Create a new Role
CREATE ROLE <NEW_ROLE>;

-- Grant a Role to a parent Role
GRANT ROLE <CHILD_ROLE> TO ROLE <PARENT_ROLE>;

-- Create a user and assign them a Role
CREATE USER <NEW_USER>
    PASSWORD = <PASSWORD> 
    DEFAULT_ROLE = <ROLE>
    MUST_CHANGE_PASSWORD = TRUE
;
GRANT ROLE <ROLE> TO USER <NEW_USER>;

-- Review Privileges of a Role
SHOW GRANTS TO ROLE <ROLE>;

-- See to whom a Role is assigned
SHOW GRANTS OF ROLE <ROLE>;

Other Access Control Mechanisms

  • Secure Views and UDFs (User Defined Functions)
    • Prevent data from being indirectly exposed via programmatic methods
    • Only authorized users can see the view or UDF definition
    • bypass some query optimizations to achieve better security, so they are not as efficient as regular views/UDFs
  • Row-level access - uses CURRENT_ROLE() or CURRENT_USER() to provide row-level security, e.g.
    SELECT ...
    FROM data_table
    JOIN auth_table
      ON auth_table.auth = data_table.auth
      AND auth_table.role = CURRENT_ROLE()
    ;
    

Data Protection

Continuous Data Protection

Encryption at Rest and in Transit

  • Encryption in transit for all client communication using TLS 1.2
  • AES-256 encryption at rest and in motion
  • Encryption at every level, see Hierarchical encryption key model
    • Root Key
    • Account master Key - each account has its own key
    • Object Master Keys - each object has its own key
    • File Keys - each micro-partition file has its own key
  • Automatic key rotation every 30 days, with no customer impact. Key states:
    • active: used for encryption and decryption
    • retired: used for decryption of data previously encrypted with that key
    • destroyed: no longer used
  • Enterprise Edition and above can enable automatic annual re-keying
  • Tri-Secret Secure or Bring Your Own Key
    • uses a composite master key created from a customer-maintained key and a Snowflake-managed key
    • the customer-maintained key must be available at all times for continuous access
  • Staging Encryption
    • In Snowflake-managed internal stages, Snowflake manages encryption
    • Customers can also choose to manage their own cloud storage stages where they would need to secure and encrypt the data prior to loading into Snowflake

Continuous Availability

  • Cloud providers replicate Snowflake storage across at least three availability zones within each cloud region where Snowflake is deployed
    • Each availability zone consists of multiple data centers which are:
      • geographically separated
      • have their own security
      • are on separate power grids
      • have their own individual power backups
  • Snowflake performs fully transparent online updates and patches with no maintenance downtime windows

Snowflake-specific Data Security Features

Time Travel

Understanding & Using Time Travel

Access historical data at any point of time within a defined retention time period. Allows querying/rollback to a previous state of a table/schema/DB within the retention period.

  • Maximum retention period can be configured per account, DB, schema and/or table by setting DATA_RETENTION_TIME_IN_DAYS:
    • Standard Edition - up to 1 day
    • Enterprise Edition and higher - up to 90 days
    • The default retention time is 1 day in all editions unless explicitly changed
    • Can be disabled by setting DATA_RETENTION_TIME_IN_DAYS = 0
    • To view current settings, use SHOW PARAMETERS
  • Allows fixing common mistakes
    • UNDROP a DB/Schema/Table
      • Attempting to restore an object with a name that already exists will result in an error
    • Access the previous state of a table with AT or BEFORE
      • Query the data in its previous state
      • Clone a DB/Schema/Table from its previous state
      -- by Timestamp
      SELECT *
      FROM my_table
      AT(timestamp => 'Sat, 01 May 2021 08:00:00 -0700'::timestamp_tz);
      
      -- by Offset
      SELECT *
      FROM my_table
      AT(offset => -60*15);
      
      -- before query statement ID
      SELECT *
      FROM my_table
      BEFORE(STATEMENT => '8e5d0ca9-005e-44e6-b858-a8f5b37c5726');
      
  • Adds to storage costs for maintaining the additional historical data
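
The retention and restore operations above can be sketched as follows (table names are hypothetical):

```sql
-- Set the Time Travel retention period on a table (Enterprise Edition and higher for > 1 day)
ALTER TABLE my_table SET DATA_RETENTION_TIME_IN_DAYS = 30;

-- Restore a dropped table (errors if a table with that name already exists)
UNDROP TABLE my_table;

-- Clone a table as it was 30 minutes ago
CREATE TABLE my_table_restored CLONE my_table
  AT(offset => -60*30);
```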

Zero-Copy Cloning

Enables taking a quick snapshot of a DB/schema/table.

  • Zero-copy cloning is a metadata-only operation
    • Zero-Copy cloning is a FREE operation
  • Initially, both the original and the cloned objects point to the same micro-partitions
  • No data is copied, so no additional storage costs are incurred until data is changed in the original or in the clone
  • Usage
    • often used to quickly spin up a Dev or Test environment as a zero-copy clone of Production
    • can be used to create backups or data snapshots at given points of time
    • can be used to create read-only snapshots for reporting as of a given point of time, e.g.: month-, year-end, etc.
  • Some considerations:
    • Privileges are not cloned
    • Data History is not cloned
    • Stages cloning
      • Named Internal Stages are not cloned
      • External Stages are cloned
      • Table Stages are cloned
    • Pipes cloning: pipes that reference
      • internal stages are not cloned
      • external stages are cloned
  • Privileges required for cloning objects:
    • Table: SELECT
    • Pipe, Stream, Task: OWNERSHIP
    • all other objects: USAGE
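
Typical cloning commands, sketched with hypothetical object names:

```sql
-- Spin up a Dev environment as a zero-copy clone of Production
CREATE DATABASE dev_db CLONE prod_db;

-- Combine cloning with Time Travel to snapshot a table as of a point in time
CREATE TABLE orders_snapshot CLONE orders
  AT(offset => -60*60);
```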

Fail-Safe

Non-configurable, 7-day retention for historical data after the Time Travel expiration

  • Only accessible by Snowflake support personnel
  • Fail-safe cannot be disabled
  • Admins can view Fail Safe storage use in the Web UI under Account > Billing & Usage
  • Only permanent tables have a Fail-Safe period. Not supported for Temporary, Transient or External tables
  • Adds to storage costs for maintaining the additional historical data

Snowflake replication features

Snowflake provides customers with additional replication features, over and above the cloud providers' replication across availability zones.

Overview

Account and Security Overview

Security Aspects

  • Access: network policies
  • Authentication: Password, MFA or SSO
  • Authorization: combination of Discretionary and Role Based Access Control
  • Data Protection: data encryption at rest and in motion, replication, recovery features
  • Infrastructure: handled by cloud providers

Snowflake Operational Controls

Snowflake’s Security & Compliance Reports

  • NIST 800-53
  • SOC2 Type 2
  • SOC1 Type II
  • ISO/IEC 27001:2013
  • HIPAA (Snowflake Business Critical Edition)
  • PCI (Snowflake Business Critical Edition)
  • FedRAMP (Snowflake Business Critical Edition)

Caching

Caching

One of the most important ways to improve query speed and optimize costs in Snowflake is caching. There are several different caches in Snowflake: the Metadata, Query Result, and Warehouse caches.

Metadata Cache

The Metadata cache is maintained in the Cloud Services Layer and contains object information & statistics (see Architecture). Snowflake automatically stores different types of metadata to improve compilation time and query optimization. This helps with operations like MIN, MAX, COUNT, etc., where Snowflake does not need to use a warehouse (and no compute credits are charged), as the information is already available in the metadata cache.

  • The metadata cache information is stored for 64 days
  • The metadata cache contains information about staged files and their ingestion status
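
For example, a query like the following (table and column names are hypothetical) can be answered from metadata alone, without starting a warehouse:

```sql
-- Served entirely from micro-partition metadata; no compute credits are charged
SELECT COUNT(*), MIN(order_date), MAX(order_date)
FROM orders;
```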

Query Result Cache

The Cloud Services Layer accepts SQL requests from users, coordinates queries, manages transactions and results, and also holds the Query Result Cache

  • Holds the results of every query executed in the past 24 hours; each reuse extends the retention by 24 hours, up to a maximum of 31 days from the original execution
  • The query results are available across virtual warehouses, so cached query results returned are available to all users of the system, provided the underlying data has not changed
    • If the data in the Storage Layer changes, the caches are automatically invalidated.
  • No data is scanned from the Storage Layer and no compute is required so no computing credits are charged for cached results
  • Cached query results do NOT incur Storage costs
  • The query must not contain context functions (e.g. CURRENT_TIME()), as their results change between executions
  • The query must be executed with the same role as the query that produced the cached result
  • You can disable the Query Result cache with this command:
    ALTER SESSION SET USE_CACHED_RESULT = FALSE;
    

Warehouse Cache

The actual SQL is executed across the nodes of a Virtual Warehouse. This layer holds a cache of the micro-partition data queried; it is often referred to as the Local Disk cache, although in reality it is implemented using SSD storage.

  • All data in the compute layer is temporary, and only held as long as the virtual warehouse is active.
  • Suspending or resizing a warehouse will clear the data cache
    • A low Auto-Suspend setting (e.g. auto-suspend after 1 minute) minimizes the use of the Warehouse data cache
    • A resized Warehouse (up or down) starts with a cold (clear) cache
  • You can see how much data was read from Local (Warehouse) storage vs. Remote (Storage layer) storage by inspecting the "Percentage scanned from cache" statistic in the query profile.
  • It is recommended to use dedicated Virtual Warehouses for similar workloads. For example, a Virtual Warehouse for BI tasks, another for Data Science, etc. There is a chance that data cached by the warehouse for similar queries can be re-used
  • The larger a warehouse is, the larger the warehouse cache size is

DataClustering

Data Clustering

Clustering

Typically, data stored in tables is sorted along natural dimensions, for example, by date. This process is called clustering. Data that is not sorted/clustered may hurt query performance, particularly on huge tables.

The Cloud Services Layer collects clustering metadata for the micro-partitions in a table:

  • The number of micro-partitions that comprise the table.
  • The number of micro-partitions containing values that overlap with each other.
  • The depth of the overlapping micro-partitions.

Natural Clustering

As data is loaded into a table, micro-partitions are created based on ingestion order. That is natural clustering.

  • Natural clustering occurs when data arrives in a single data load
  • The ingestion order may be highly correlated with the values in one or more columns, e.g. autoincrement IDs, dates, sequential numbers, etc.
  • The source data organization determines what range of values (MIN/MAX) is represented in each micro-partition

Clustering Depth

Clustering depth measures the average number of overlapping micro-partitions for a specific column value.

Automatic Clustering

Automatic Clustering

Snowflake provides periodic & automatic re-clustering to maintain optimal clustering. When the ingestion-order natural clustering is not advantageous or has degraded due to extensive INSERT operations, Snowflake provides a mechanism for its customers to override its natural clustering algorithms by defining clustering keys. This results in like data (by key) being co-located in the same micro-partitions.

Snowflake only reclusters a clustered table if it will benefit from the operation. It monitors and evaluates the tables to determine whether they would benefit from reclustering, and automatically reclusters them, as needed. In most cases, no tasks are required to enable Automatic Clustering for a table.

Automatic clustering can be suspended and resumed:

ALTER TABLE t1 [SUSPEND|RESUME] RECLUSTER;

Clustering Keys

CREATE|ALTER TABLE ...
CLUSTER BY (<column_or_expression>, ...)
  • Use at most three to four columns (or expressions) per clustering key
  • Prefer and order clustering keys from low to high cardinality which will result in fewer, larger sets of micro-partitions being co-located together
  • It is important to choose a clustering key that has appropriate cardinality:
    • A large enough number of distinct values to enable effective pruning on the table.
    • A small enough number of distinct values to allow Snowflake to effectively group rows in the same micro-partitions.
  • Expressions such as DATE() or TRUNC() can be used for the clustering keys on columns with high cardinality, e.g. TIMESTAMP or hash strings, etc.
  • Snowflake maintains and updates the clustering order in the background which incurs compute and storage costs as micropartitions need to be rewritten
  • Typically, clustering keys are added after ingestion, and/or when natural clustering has degraded.
  • Cluster Keys are placed on columns usually used in WHERE, JOIN and ORDER BY clauses
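
A sketch of the guidance above (table and column names are hypothetical):

```sql
-- Order keys from low to high cardinality; wrap a high-cardinality
-- TIMESTAMP in an expression to reduce its effective cardinality
ALTER TABLE events CLUSTER BY (region, TO_DATE(event_ts));
```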

Clustering Keys Considerations

  • Clustering keys are most useful with tables which
    • are very large (multi-terabyte range)
    • have degraded in query performance as the table grows
    • have large clustering depth values
  • Frequently-changing tables will consume more credits for reclustering
  • The SYSTEM$CLUSTERING_INFORMATION() function can be used to get clustering information on a table's column. Good clustering will have:
    • low average_depth values which indicates low micro-partition MIN/MAX values overlap
    • partition_depth_histogram with a high number of low value depths
  • You can cluster materialized views, as well as tables
  • Snowflake recommends a maximum of 3 or 4 columns (or expressions) per key.
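
Clustering quality can be inspected with system functions (the table and key below are hypothetical):

```sql
-- JSON report including average_depth and partition_depth_histogram
SELECT SYSTEM$CLUSTERING_INFORMATION('events', '(region, TO_DATE(event_ts))');

-- Average clustering depth of the table
SELECT SYSTEM$CLUSTERING_DEPTH('events');
```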

Overview

Performance and Tuning Overview

Query Optimizer Workflow

Typical workflow of the Snowflake query optimizer:

  1. Row Operations: Inspect FROM, JOIN and WHERE clauses for partition pruning
    • Static pruning based on the columns in the WHERE clause based on the MIN/MAX micro-partition metadata
    • Dynamic pruning based on the JOIN columns
  2. Inspect GROUP BY and HAVING
  3. Inspect SELECT, DISTINCT, ORDER BY and LIMIT

Query Optimization Tips

  • Take advantage of partition pruning by limiting the initial data set as much as possible.
    • Use appropriate filters as early as possible
    • Filter on columns which would be naturally ordered (or have been clustered to be ordered) in micro-partitions
      • Natural clustering occurs on columns which have high correlation to ingestion order, e.g. date columns
    • When filtering by string value, avoid wildcards at the beginning of the string as the optimizer cannot take advantage of the MIN/MAX micro-partition metadata
    • Prefer uncorrelated over correlated subqueries. A correlated subquery references columns from the parent query, so it may have to execute once per parent row.
  • Join on unique keys
    • Ensure keys are distinct to avoid join explosions
    • Understand relationships between tables before joining
    • Avoid many-to-many joins
    • Avoid unintentional cross joins
  • Prefer low cardinality columns for GROUP BY clauses
  • Add LIMIT to ORDER BY clauses where possible
  • Limit ORDER BY usage to a single instance in the top level SELECT wherever possible
  • Review the query Profile Overview for
    • spilling to disk
    • high network usage
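
Two of the filtering tips above, illustrated with hypothetical tables:

```sql
-- Good: a range filter on a naturally ordered column enables partition
-- pruning via MIN/MAX micro-partition metadata
SELECT * FROM orders WHERE order_date >= '2023-01-01';

-- Poor: a leading wildcard prevents the optimizer from using
-- MIN/MAX micro-partition metadata
SELECT * FROM customers WHERE last_name LIKE '%son';
```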

Altering Query Behaviour

These parameters can be set at the Account, User, Session or Virtual Warehouse level:

  • STATEMENT_TIMEOUT_IN_SECONDS: how long a query can run before being cancelled by the system. Default: 2 days
  • STATEMENT_QUEUED_TIMEOUT_IN_SECONDS: how long a queued query can wait before being cancelled by the system. Default: 0 (wait forever)
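
For example (warehouse name is hypothetical):

```sql
-- Cancel queries that run longer than 1 hour on this warehouse
ALTER WAREHOUSE my_wh SET STATEMENT_TIMEOUT_IN_SECONDS = 3600;

-- Cancel queries queued for more than 5 minutes in this session
ALTER SESSION SET STATEMENT_QUEUED_TIMEOUT_IN_SECONDS = 300;
```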

VirtualWarehouseScaling

Virtual Warehouse Scaling

  • As queries are submitted, resources are calculated and reserved prior to the query execution
  • When there are insufficient resources, queries are queued until previous queries finish and release resources

Scaling Up

Scaling Up: Resizing a warehouse to handle more complex or data-intensive queries.

| Warehouse Size | Servers per Cluster | Clusters | Servers Available |
|----------------|---------------------|----------|-------------------|
| X-Small        | 1                   | 1        | 1                 |
| Small          | 2                   | 1        | 2                 |
| Medium         | 4                   | 1        | 4                 |
| Large          | 8                   | 1        | 8                 |
| X-Large        | 16                  | 1        | 16                |
| 2X-Large       | 32                  | 1        | 32                |
| 3X-Large       | 64                  | 1        | 64                |
| 4X-Large       | 128                 | 1        | 128               |
  • Scaling up/down is done by changing the size of the warehouse which affects the number of servers in the warehouse cluster
  • A warehouse that is too large may waste compute credits if its resources are not fully utilized
  • A warehouse that is too small may become very slow since it may be resource bound, resulting in spilling data to local or remote disk, further degrading performance.
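
Scaling up is a simple resize (warehouse name is hypothetical):

```sql
-- Resize an existing warehouse; takes effect for newly submitted queries
ALTER WAREHOUSE my_wh SET WAREHOUSE_SIZE = 'LARGE';
```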

Scaling Out

Scaling Out: Adding more clusters to the warehouse without changing its size to handle high concurrency loads.

| Warehouse Size | Servers per Cluster | Clusters | Servers Available |
|----------------|---------------------|----------|-------------------|
| Medium         | 4                   | 1        | 4                 |
| Medium         | 4                   | 2        | 8                 |
| Medium         | 4                   | 3        | 12                |
| Medium         | 4                   | 4        | 16                |

Multi-cluster warehouses can work in

  • Maximized Mode
    • the minimum and maximum number of clusters is the same and is more than one
    • useful for predictable periods of sustained high loads
  • Autoscaling Mode
    • the minimum and maximum number of clusters are different
    • The warehouse will automatically scale up and down (auto-scale) depending on load
    • helps to automate the handling of variable number of queries or users over time
    • Scaling policies
      • Standard
        • useful for relatively smooth transitions from low to high load and back
        • a new cluster is added
          • as soon as queries start queueing, or
          • an incoming query will be queued due to shortage of resources
        • a cluster is shut down after 2-3 consecutive checks (one minute apart) consistently determine that the load on the least-busy cluster can be redistributed to other clusters
      • Economy
        • useful for erratic transitions from low to high load and back, reduces scale out/in reactions to short-lived load spikes or drops
        • a new cluster is added
          • after a query has been queued for up to 6 minutes, or
          • when Snowflake determines that the current work load will keep available clusters busy for at least 6 minutes
        • a cluster is shut down after 5-6 consecutive checks (one minute apart) consistently determine that the load on the least-busy cluster can be redistributed to other clusters
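
Auto-scaling mode is configured by giving the warehouse different minimum and maximum cluster counts (the warehouse name below is hypothetical):

```sql
-- Multi-cluster warehouse in Auto-scale mode with the Economy policy
CREATE WAREHOUSE reporting_wh
  WAREHOUSE_SIZE = 'MEDIUM'
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 4
  SCALING_POLICY = 'ECONOMY'
  AUTO_SUSPEND = 60
  AUTO_RESUME = TRUE;
```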

Scaling Considerations

  • Query processing cannot cross clusters, so even though the total number of servers is the same, a complex query will perform differently on a 4-cluster multi-cluster warehouse made of X-Small clusters (4 servers) than on a single-cluster Medium warehouse (also 4 servers).

LoadAndUnloadSemiStructuredData

Load and Unload Semi-Structured Data

Create File Format

CREATE FILE FORMAT

CREATE FILE FORMAT <file_format_name>
TYPE = { JSON | AVRO | ORC | PARQUET | XML }
[<format_options>];

JSON Data

Some JSON Format Options

  • COMPRESSION: Specifies compression algorithm
    • Supported: GZIP, BZ2, BROTLI, ZSTD, DEFLATE, RAW_DEFLATE, NONE
      • the BROTLI algorithm cannot be detected by AUTO mode and must be specified explicitly
    • Default for loading: AUTO
    • Default for unloading: GZIP
  • FILE_EXTENSION: Specifies file extension to use
    • Default: .JSON
    • Only used for unloading
  • ALLOW_DUPLICATE: Allows duplicate object field names
    • Default: FALSE
    • Only the last field value will be preserved
    • Only used for loading
  • STRIP_OUTER_ARRAY: If TRUE, JSON parser will strip outer array brackets [] so the file is not treated as a single JSON entry
    • Default: FALSE
    • Only used for loading
  • STRIP_NULL_VALUES: If TRUE, JSON parser will remove object fields or array elements with NULL values
    • Default: FALSE
    • Only used for loading
  • DATE_FORMAT: Defines the date format in strings
    • Default: AUTO
    • Used only for loading JSON data into separate columns
  • TIME_FORMAT: Defines the time format in strings
    • Default: AUTO
    • Used only for loading JSON data into separate columns
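
A sketch combining a few of these options (the format name is hypothetical):

```sql
-- JSON file format for a file containing one outer array of objects
CREATE FILE FORMAT my_json_format
  TYPE = JSON
  STRIP_OUTER_ARRAY = TRUE
  STRIP_NULL_VALUES = TRUE
  COMPRESSION = AUTO;
```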

Load Options for JSON Data

Load the file with no parsing or transformations into a VARIANT column:

COPY INTO table_name
FROM <file(s) in stage>
FILE_FORMAT = (FORMAT_NAME = <JSON format with options>);

Transform the file into columns and load into a table with pre-defined columns:

COPY INTO table_name
FROM (
    SELECT <keys>
    FROM <file(s) in stage>
)
FILE_FORMAT = (FORMAT_NAME = <JSON format with options>);

Parquet Data

Some Parquet Format Options

  • COMPRESSION: Specifies compression algorithm
    • Supported: LZO, SNAPPY, NONE
    • Default for loading: AUTO
    • Default for unloading: SNAPPY
  • SNAPPY_COMPRESSION:
    • Default: TRUE
    • Only used for unloading
  • BINARY_AS_TEXT: Whether to interpret columns with no defined logical data type as UTF-8 text. When set to FALSE, Snowflake interprets these columns as binary data.
    • Default: TRUE
    • Only used for loading
  • TRIM_SPACE: Whether to remove leading and trailing white space from strings
    • Default: FALSE
    • Only used for loading Parquet data into separate columns
  • NULL_IF: Replace matching strings with a SQL NULL value
    • Default: \\N
    • Only used for loading Parquet data into separate columns

Load Options for Parquet Data

Load the file with no parsing or transformations into a VARIANT column:

COPY INTO table_name
FROM <file(s) in stage>
FILE_FORMAT = (FORMAT_NAME = <parquet format with options>);

Transform the file into columns and load into a table with pre-defined columns:

COPY INTO table_name
FROM (
    SELECT
      $1:<column_name>::<data_type>,
      $1:<column_name>::<data_type>,
      ...
    FROM <file(s) in stage>
)
FILE_FORMAT = (FORMAT_NAME = <parquet format with options>);

Loading MATCH_BY_COLUMN_NAME Option

  • Load semi-structured data directly into table columns whose names match the top-level keys present in the data
  • Supported for JSON, AVRO, ORC and Parquet file formats
COPY INTO table_name
FROM <file(s) in stage>
FILE_FORMAT = (FORMAT_NAME = <file format>)
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;

Unloading Data

  • Supported formats
    • JSON: Unload from a VARIANT column or reconstruct a JSON format
    • Parquet: Use a SELECT to unload the table columns into a Parquet file
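
A sketch of both unload paths (stage, table and column names are hypothetical):

```sql
-- Unload to JSON by reconstructing objects from relational columns
COPY INTO @my_stage/json/
FROM (SELECT OBJECT_CONSTRUCT('id', id, 'name', name) FROM my_table)
FILE_FORMAT = (TYPE = JSON);

-- Unload selected columns to Parquet
COPY INTO @my_stage/parquet/
FROM (SELECT id, name FROM my_table)
FILE_FORMAT = (TYPE = PARQUET);
```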

Overview

Semi-Structured Data Overview

Snowflake supports the following semi-structured data file formats:

  • JSON
  • Avro - originally developed for Apache Hadoop
  • ORC (Optimized Row Columnar) - used for Hive data
  • Parquet - binary format used in the Hadoop ecosystem
  • XML (Extensible Markup Language)

Semi-structured data is a first-class citizen in Snowflake:

  • ingested in the same way as structured data, with micropartitions, metadata, compression and encryption, to enable quicker query access and partition pruning
  • all SQL operations are supported with native syntax for accessing data even in semi-structured fields

Semi-structured data is supported using the VARIANT data type:

  • holds standard SQL data types as well as arrays and objects
  • Objects - collection of key-value pairs where the values are of VARIANT type
  • Arrays - collection of values where each value is a VARIANT
  • non-native values (e.g. dates, timestamps) are stored as strings in VARIANT columns
  • operations can be slower than when stored as the corresponding data type

QuerySemiStructuredData

Query Semi-Structured Data

-- Sample "weather" VARIANT column data
{ "humidity": 45.6, "cloud_cover": { "density": 25.0, "description": "scattered clouds" }}
  • To access a value in a VARIANT column under a given key, you can use:
    • colon (:) notation: SELECT weather:humidity
    • square brackets ([]) notation: SELECT weather['humidity']
  • To access values of nested objects, you can chain the key names of each nested object with a dot (.) to reach the value: SELECT weather:cloud_cover.density
  • VARIANT data can be cast to SQL data types using the double colon :: syntax:
SELECT
    weather:humidity::NUMBER(10, 2) AS humidity_percent,
    weather:cloud_cover.description::VARCHAR as cloud_description

Semi-structured Data Functions

Semi-structured Data Functions

  • ARRAY_AGG: array creation and manipulation
  • OBJECT_AGG: object creation and manipulation
  • PARSE_JSON: parses JSON text into a VARIANT (XML is parsed with PARSE_XML)
  • TO_ARRAY: conversion to array
  • FLATTEN: a table function that takes a VARIANT, OBJECT, or ARRAY column and explodes compound values into multiple rows. Often used in Lateral Joins. Returns a table with a defined set of columns:
    • SEQ: unique sequence number associated with the input record. Not guaranteed to be continuous or ordered
    • KEY: for maps or objects, the key of the exploded value; NULL for arrays
    • PATH: path to the data structure element that is being flattened
    • INDEX: with arrays, index of the element; NULL otherwise
    • VALUE: value of the flattened element
    • THIS: the element being flattened; useful for recursive flattening with RECURSIVE => TRUE
  • IS_<object_type>: type checking
  • AS_<object_type>: type casting
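
A sketch of FLATTEN in a lateral join (the table and keys below are hypothetical):

```sql
-- Explode a JSON array into one row per element
SELECT f.index,
       f.value:name::VARCHAR AS item_name
FROM my_table,
     LATERAL FLATTEN(input => my_table.json_col:items) f;
```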