SnowPro Study — Printable

SnowPro Study — All Notes

Source: Infostrux-Solutions/snowflake-snowpro-core-study-notes (main)

README

Snowflake SnowPro Core Exam Preparation Course Notes

Snowflake Overview and Architecture

Virtual Warehouses

Storage and Data Protection

Data Movement

Account and Security

Performance and Tuning

Semi-Structured Data

Exam Tips

Exam Prep Tips

Read every single “Reading Assets” article from the [COF-C02] SnowPro Core Certification Exam Study Guide. You will find that the real exam questions blend the question types from the practice exams with these real-world examples.

Topics Quizzed on Exams

Sample Exams

Free

Paid

Architecture


Snowflake's architecture is a hybrid of traditional shared-disk and shared-nothing database architectures.

  • Similar to shared-disk architectures:
    • Snowflake uses a central data repository for persisted data accessible from all compute nodes in the platform
    • Offers data management simplicity
  • Similar to shared-nothing architectures (e.g. Hadoop, Spark):
    • Snowflake processes queries using virtual warehouses where each node in the cluster stores a portion of the entire data set locally
    • Offers performance and scale-out benefits

Advantages:

  • Storage, compute and management services are decoupled and can scale independently
  • There is virtually no limit on how much each layer can be scaled
  • Workload isolation with virtual warehouses

Snowflake Layers

Snowflake's unique architecture consists of three layers, all of them highly available. Each layer is also priced separately.

In addition to the three layers mentioned, there is the Cloud Agnostic Layer, which allows Snowflake to run on the major cloud providers (AWS, Azure, GCP) and ensures Snowflake performs comparably across providers.

Storage Layer

  • single copy and source of truth for the data
  • virtually infinitely scalable
  • inherits the cloud provider's availability and durability guarantees
    • e.g. AWS S3 replicates data across three different availability zones within a region
  • data is stored in Snowflake's read-optimized proprietary compressed columnar format
    • due to columns containing similar data, the columnar format results in better compression
    • data is stored compressed
    • data is stored using AES-256 encryption
  • Snowflake manages all aspects of how this data is stored.

Compute (Query Processing) Layer

  • Multi-Cluster Compute (Virtual Warehouses): each Virtual Warehouse is a Massively Parallel Processing (MPP) compute cluster composed of multiple compute nodes allocated by Snowflake from a cloud provider.
  • Carries out the computation of the query plan generated by the Cloud Services Layer.
  • Virtual Warehouses retrieve the data required for processing a query and cache it locally for up to 24 hours
  • Virtual Warehouses can be created or removed almost instantly
  • VWs can also be paused or resumed as needed
  • The customer is only charged for the time the VW is running
  • Each Warehouse is isolated from others and can be configured individually

Cloud Services Layer

Also known as the Global Services Layer, it is a collection of highly available and scalable services that manage all activities across Snowflake. The Cloud Services Layer is shared between accounts (except accounts set up with the Virtual Private Snowflake Edition).

  • Security
    • Authentication
    • Access Control
      • Roles and Users
      • Shares
    • data encryption and key rotation
  • Infrastructure management
    • Compute
    • Centralized storage
      • Micropartitions
      • Micropartition versioning for Time Travel
  • Transaction management (ACID compliant)
  • Query parsing and optimization
    • SQL Optimization
      • Cost-based
      • Automatic JOIN order optimization (no hints required)
      • Automatic statistics gathering
      • Data Pruning based on the metadata and statistics
    • Processing is delegated to a Virtual Warehouse
    • The Cloud Services Layer also handles queries that depend on Metadata only (no Warehouse required), e.g.:
      • Retrieve or modify the session context
      • Retrieve metadata about objects (users, warehouses, databases)
      • Get statistics, e.g.: MIN and MAX of a column, COUNT, number of NULLs
      • DDL retrieval
  • Metadata and Statistics collected and kept up-to-date
  • Time Travel and Cloning
  • Availability
  • Transparent online updates and patches
  • Runs on Snowflake-managed compute

DataSharing

Data Sharing

Introduction to Secure Data Sharing

Secure Data Sharing lets you share selected objects in a database in your (Provider) account with other (Consumer) Snowflake accounts. You can share the following Snowflake database objects:

  • Tables
  • External tables
  • Secure views
  • Secure materialized views
  • Secure UDFs

Shares

Shares are named Snowflake objects that encapsulate all of the information required to share database objects. Using GRANTs, shares encapsulate:

  • the privileges that grant access to the database, schemas and specific objects we want to share
  • the consumer accounts that have access to the share and all objects the share has access to.

Some sample SQL:

-- Create a Share and give SELECT privileges to table
CREATE SHARE myShare;
GRANT USAGE ON DATABASE myDb TO SHARE myShare;
GRANT USAGE ON SCHEMA myDb.public TO SHARE myShare;
GRANT SELECT ON TABLE myDb.public.myTable TO SHARE myShare;
-- Add consumer accounts to the share
ALTER SHARE myShare ADD ACCOUNTS = org1.consumer1,org1.consumer2;

-- Show all Shares in the account
SHOW SHARES;

-- See all the privileges a Share has
SHOW GRANTS TO SHARE myShare;

-- See the accounts (consumers) that have access to the Share
SHOW GRANTS OF SHARE myShare;

Provider

  • Shares data with others
  • No cost to share other than the storage of the shared data
  • Has full control over the data shared and the privileges they give to the share
  • Can revoke privileges granted to the share at any time
  • Pays for the storage of the data they share

Consumer

Accounts that receive the share/data.

  • Shared data does not take up any storage in the Consumer account so Consumers don’t pay for its storage
  • Time Travel is not available to Consumers on shared data
  • Zero-Copy Cloning is not available to Consumers on shared data
  • Consumers can, however, copy shared data using CTAS (CREATE TABLE ... AS SELECT)
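On the consumer side, a share is mounted by creating a database from it. A sketch (account, share and object names are illustrative):

```sql
-- Consumer account: create a read-only database from the Share
CREATE DATABASE shared_db FROM SHARE provider_account.myShare;

-- Shared objects are read-only; CTAS makes a local, modifiable copy
-- (the copy does incur storage costs in the Consumer account)
CREATE TABLE myDb.public.local_copy AS
SELECT * FROM shared_db.public.myTable;
```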

Full Account Consumer

  • The only charges to consumers are for the compute resources (i.e. Virtual Warehouses) used to query the shared data

Reader Account Consumer

  • Consumers who are not Snowflake customers and do not have a Snowflake account
  • The Reader account is set up within the Provider's environment
    • Providers manage all aspects of the Reader Account - users, roles, permissions, warehouses, etc.
    • A reader account can only consume data from the provider account that created it
  • Readers must use Warehouses from the Provider's account; the Provider incurs the compute cost
  • Providers can track and manage the queries Readers can run and can monetize them by charging for the data share or the queries run
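Reader accounts are created by the Provider as managed accounts. A sketch (names and credentials are placeholders):

```sql
-- Provider account: create a Reader account for a non-Snowflake consumer
CREATE MANAGED ACCOUNT reader_acct
  ADMIN_NAME = reader_admin,
  ADMIN_PASSWORD = 'Str0ngPassw0rd!',
  TYPE = READER;

-- List the managed (Reader) accounts in this account
SHOW MANAGED ACCOUNTS;
```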

Sharing Features

  • All database objects shared between accounts are READ-ONLY (i.e. the objects cannot be modified or deleted, and table data cannot be added or changed)
  • Data is shared live:
    • No data movement or copying
    • Consumers immediately see all updates
    • Data can be shared with an unlimited number of Consumers
  • Shares can only be created by users
    • with the ACCOUNTADMIN role
    • or with a role that has been explicitly given the permission to create Shares
  • Shares can be created between accounts but not within the same account
  • Consumers cannot re-share a Share
  • Snowflake does not charge for creating shares although Providers can charge Consumers or Reader Accounts to monetize their data
  • Sharing is available across all Snowflake Editions
  • The Consumer and Provider accounts must be in the same cloud and region to share data
    • A Provider can replicate their data to an account in another cloud and region if they want to share it with a Consumer in the other cloud/region.

Sharing with Secure Views

The preferred way to share data is by using Secure Views:

  • allows sharing only select rows and columns rather than entire tables
  • allows providing different data sets to different Consumers by filtering the data based on the context functions CURRENT_ACCOUNT() and CURRENT_ROLE(), which return the Consumer's account name and current role, respectively.
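A sketch of a Secure View that filters rows per Consumer (the sales and mapping tables are hypothetical):

```sql
-- Mapping table pairs each Consumer account with the regions it may see
CREATE OR REPLACE SECURE VIEW myDb.public.shared_sales AS
SELECT s.*
FROM myDb.public.sales s
JOIN myDb.public.account_map m ON s.region = m.region
WHERE m.consumer_account = CURRENT_ACCOUNT();

-- The Secure View (not the base tables) is granted to the Share
GRANT SELECT ON VIEW myDb.public.shared_sales TO SHARE myShare;
```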

Types

  • Public Data Marketplace
  • Private Data Exchange

Editions

Snowflake Editions

Feature / Edition Matrix

The Snowflake Editions determine what features are available to the customer and what level of service they require. The Snowflake Edition affects the amount charged for compute and data storage.

Standard

Introductory level offering, providing full, unlimited access to all of Snowflake’s core features

  • Complete SQL Data Warehouse
    • Includes standard DDL and DML features
    • Includes advanced DML statements such as multi-table inserts, merge and windowing functions
  • Secure data sharing across regions/clouds
  • Premier Support 24 x 365
  • Always-on enterprise grade encryption in transit and at rest
  • Customer dedicated virtual warehouses
  • Federated authentication
  • Database replication
  • External Functions
  • Snowsight analytics UI
  • Create your own Data Exchange
  • Data Marketplace access
  • Fail-safe
  • Continuous Data Protection
    • Time Travel
    • Network Policies

SPECIFICS

  • 1 day of time travel
  • No Multi-Cluster warehouses
  • No Materialized Views
  • No column-level security
  • No query statement encryption
  • No Failover/Failback
  • No AWS PrivateLink support
  • No Tri-Secret Secure encryption

Enterprise

Designed specifically for the needs of large-scale enterprises and organizations

  • All features of Standard plus:
  • Multi-Cluster Warehouses
  • Up to 90 days of time travel
  • Materialized Views
  • Column-level security
    • Dynamic Data Masking - uses masking policies to mask some columns for users with a lower role at query time
    • External Data Tokenization
      • tokenize sensitive data before loading it into Snowflake
      • you can also detokenize it using masking policies
      • useful for sensitive data like passwords, etc.
  • Search Optimization Service
  • Annual rekey of all encrypted data
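Dynamic Data Masking can be sketched as follows (policy, table and role names are illustrative):

```sql
-- Show the real value only to ANALYST; mask it for everyone else
CREATE OR REPLACE MASKING POLICY email_mask AS (val STRING) RETURNS STRING ->
  CASE WHEN CURRENT_ROLE() IN ('ANALYST') THEN val ELSE '*****' END;

-- Attach the policy to a column; masking is applied at query time
ALTER TABLE myDb.public.customers
  MODIFY COLUMN email SET MASKING POLICY email_mask;
```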

SPECIFICS

  • No query statement encryption
  • No Failover/Failback
  • No Tri-Secret Secure encryption
  • No AWS PrivateLink support

Business Critical

Offers even higher levels of data protection to support the needs of organizations with extremely sensitive data

  • All features of Enterprise plus:

  • Higher level of data protection

    • Data encryption everywhere
    • Query statement encryption
    • Enhanced security policy
    • Database Failover and Failback for business continuity (e.g. replication with failover and failback across regions)
    • AWS PrivateLink support
    • Tri-Secret Secure encryption using customer-managed keys (AWS)
  • Enhanced security, see: Snowflake’s Security & Compliance Reports

    • HIPAA (Health Insurance Portability and Accountability Act) support / HITRUST
    • PCI (PCI-DSS)
    • ISO/IEC 27001
    • FedRAMP Moderate
    • SOC 1 Type II
    • SOC 2 Type II
    • GxP
  • External Functions - AWS API Gateway Private Endpoints support

Virtual Private Snowflake (VPS)

Business Critical Edition but in a completely separate Snowflake environment, isolated from all other Snowflake accounts, to provide the highest level of security for governments and financial institutions

  • All features of Business Critical plus:
  • Dedicated Services Layer, not shared with other accounts
  • Customer dedicated virtual servers wherever the encryption key is in memory
  • Customer dedicated metadata store
  • Additional operational visibility

Government Regions

For government agencies that require compliance with US federal privacy and security standards, such as FIPS 140-2 and FedRAMP, Snowflake provides government regions, supported only on Amazon Web Services and Azure. These government regions require Business Critical Edition or higher.

ObjectModel

Objects

  • All objects in Snowflake are securable - privileges on objects can be granted to Roles and Roles are granted to Users.
  • Additionally, all objects can be interacted with and configured via SQL given the user has sufficient privileges.
  • To own an object means that a role (NOT USER!) has the OWNERSHIP privilege on the object.
  • Each securable object is owned by a single role, which by default is the role used to create the object.
  • Other than the Organization and Account objects, objects can be at the Account level or the Schema level. Examples:
    • Account level objects:
      • Network Policy
      • User
      • Role
      • Grants
      • Warehouse
      • Database
      • Share
      • Resource Monitor
      • Storage Integration
    • Schema level objects:
      • Table
      • External Table
      • View
      • Stream
      • Task
      • Stored Procedure
      • UDF (User Defined Function)
      • Sequence
      • Stage
      • File Format
      • Pipe
  • Note that object identifiers are case insensitive and cannot contain spaces or special characters unless they are enclosed in quotes.
  • When an object identifier is enclosed in quotes, it becomes case sensitive. This should be avoided as it can contribute to hard to debug issues!

  • Schemas and Database Roles are Database Level Objects
  • The definitions of most objects can be retrieved using the GET_DDL() function.
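A quick illustration of identifier case sensitivity and GET_DDL() (table names are illustrative):

```sql
CREATE TABLE myTable (id INT);     -- unquoted: stored and resolved as MYTABLE
CREATE TABLE "myTable" (id INT);   -- quoted: case-sensitive, distinct object

-- Retrieve the definition of an object
SELECT GET_DDL('TABLE', 'MYTABLE');
```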

Table

Tables hold all the data in a database.

Permanent

  • Default table type, used for the highest level of data protection and recovery
  • Persists until dropped
  • 0–90 days of Time Travel depending on the Snowflake edition (0–1 days in Standard Edition, 0-90 days in Enterprise Edition and up)
  • 7 days of Fail-Safe
  • Can be cloned to a permanent, transient or temporary table
    • If you clone a Permanent table to Transient or Temporary, the old partitions will remain Permanent, but any new partitions added to the clone will be Transient/Temporary

Transient

  • Persists until dropped
  • Available across sessions
  • Time Travel: 0 or 1 days
  • No Fail-Safe support
  • Can only be cloned to a Temporary or Transient table

Temporary

  • Used for transitory data
  • Tied to and available only within the context of a single user session. As such, they are not visible to other users or sessions
  • Time Travel: 0 or 1 days
    • A temporary table is purged once the session ends, so the actual retention period is 24 hours or the remainder of the session, whichever is shorter
  • Temporary tables incur storage costs
  • No Fail-Safe support
  • Can only be cloned to a temporary or transient table
  • Can be created with a clustering key, if needed
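The three internal table types side by side (names are illustrative; the 90-day retention assumes Enterprise Edition or higher):

```sql
CREATE TABLE perm_t (id INT)                 -- Permanent (default)
  DATA_RETENTION_TIME_IN_DAYS = 90;          -- Time Travel up to 90 days
CREATE TRANSIENT TABLE trans_t (id INT);     -- 0-1 days Time Travel, no Fail-safe
CREATE TEMPORARY TABLE temp_t (id INT);      -- visible only in this session
```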

External

  • Snowflake table "over" data stored in an external stage
  • Persists until removed
  • Read-only
  • No Time Travel
  • No Fail-Safe support
  • Cloning is not supported
  • XML files are not supported for external tables

View

Named definition of a SQL query which can be queried as if it were a table. Views can be created on all table types, including external tables.

Standard

  • Default view type
  • No data is stored
  • No clustering support
  • Executes as the role which owns it
  • DDL available to any role with access to the view
  • Streams are supported

Materialized

  • Enterprise Edition is required for Materialized views
  • Results of underlying query are stored
    • require storage space and active maintenance, which incur both storage and compute costs
    • while Materialized Views behave more like Tables, Time Travel is not supported
  • Results are auto-refreshed in the background so new data loaded into the table will be propagated into the Materialized View
  • Consumes background compute for the auto-refresh
    • The background compute does not require a customer-provided warehouse; Snowflake manages the compute
  • Clustering is supported
  • NO support for Streams
  • Changes to the schema of base table are not automatically propagated to materialized views
  • Best used for:
    • Query results contain a small number of rows and/or columns relative to the underlying table
    • Queries that require significant processing, e.g. analysis of semi-structured data or expensive aggregates
    • The underlying table is an external one, which might be slower to query directly
    • The view’s base table does not change frequently

Secure

Both Standard and Materialized views can also be defined as Secure

  • Executes as the role which owns it
  • DDL available only to authorized users
  • Can be shared; secure views are mandatory when sharing
  • Snowflake query optimizer bypasses optimizations used for regular views so secure views are less performant
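The view types can be sketched as follows (table and view names are illustrative):

```sql
CREATE VIEW v_std AS                     -- Standard: stores the query definition only
  SELECT id, amount FROM sales;

CREATE MATERIALIZED VIEW v_totals AS     -- Materialized: Enterprise Edition+
  SELECT region, SUM(amount) AS total
  FROM sales GROUP BY region;

CREATE SECURE VIEW v_sec AS              -- Secure: definition hidden, shareable
  SELECT id FROM sales;
```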

Stage

Stages in Snowflake specify where data files are stored (staged) in cloud storage. They facilitate data loading and unloading.

  • When staging uncompressed files in a Snowflake stage, the files are automatically compressed using GZIP unless compression is explicitly disabled
  • Snowflake automatically generates metadata for staged files:
  • METADATA$FILENAME: stage path and name of the data file the current row belongs to
  • METADATA$FILE_ROW_NUMBER: row number for each record in the staged data file
  • $[COLUMN_NUMBER]: staged data file column number; e.g. $1 refers to the first column in the staged file, $2 refers to the second column, etc.
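Staged files can be queried directly using the metadata columns above (stage and file format names are illustrative):

```sql
SELECT METADATA$FILENAME, METADATA$FILE_ROW_NUMBER, $1, $2
FROM @myStage (FILE_FORMAT => 'myCsvFormat');
```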

Named Stages

  • Created manually
  • Referenced with @[STAGE_NAME]
  • can specify file format
  • Supported Cloud Storage services:
    • Amazon S3 Buckets
    • Google Cloud Storage Buckets
    • Microsoft Azure Containers
  • Named Stages can be Internal or External
    • Internal Named Stage: cloud Storage location managed by Snowflake; data is stored internally within Snowflake.
    • Named External Stage: references data files stored outside Snowflake.
      • managed by the user or respective cloud provider and can even be in a different cloud provider than the Snowflake account
      • Creating an external stage requires providing
        • URL where the files are located
        • credentials to access the files
        • decryption keys to enable decryption of the data
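Creating internal and external named stages might look like this (names, URL and credentials are placeholders):

```sql
-- Internal named stage with an inline file format
CREATE STAGE my_int_stage
  FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);

-- External named stage over an S3 bucket
CREATE STAGE my_ext_stage
  URL = 's3://mybucket/load/'
  CREDENTIALS = (AWS_KEY_ID = '...' AWS_SECRET_KEY = '...');
```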

Internal Stages

  • Can only be accessed using the SnowSQL CLI
  • Can be of two types: User and Table Stage

User Stage

OrganizationsAccountsDatabasesAndSchemas

Organizations, Accounts, Databases And Schemas

Organization

Top level securable object which serves as a logical grouping of Accounts.

  • The Organization object is not available by default. It can be enabled on request by Snowflake Support.
  • Once enabled, it makes available features which make use of multiple accounts such as database replication and failover.
  • It can be used for monitoring usage and billing across accounts.
  • The ORGADMIN role is added to the primary account in the Organization to allow management of the contained Account objects
  • By default there is a limit of 25 Accounts in an Organization; this limit can be increased by Snowflake Support

Account

Logical grouping of Databases.

  • Each account is hosted on a single cloud provider and in a single geographic region
  • By default, Snowflake does not move data between regions unless requested
  • The region where the account is provisioned affects:
    • the account and storage price
    • which regulatory certifications you can achieve
    • some Snowflake features which may be region-specific
  • The account's name must be unique within an organization, regardless of which Snowflake region the account is in.
  • The URL of an account (e.g. xy12345.us-east-2.aws.snowflakecomputing.com) consists of
    • the account identifier (sometimes referred to as the account name) xy12345.us-east-2.aws, which can consist of any combination of the following:
      • the account locator xy12345
      • the cloud services region ID us-east-2 and
      • the cloud provider aws
    • and the snowflake domain, snowflakecomputing.com
  • It is possible to request a vanity address to convert the identifier into your company's name.
  • An account is created with the system-defined role ACCOUNTADMIN
  • When an account is created by an ORGADMIN, the account name (acme-marketing-account) will contain a unique organization name acme and the account name marketing-account

Database

Logical container of your data, organized in Schemas.

  • A Database can be created as a clone from another Database
  • A Database can be created as a replica of another Database
  • A Database can be created from a Share object provided by another Account
  • Databases can be created as Transient
    • Time Travel: 0 or 1 days
    • No Fail-Safe support
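The creation options above can be sketched as follows (database, account and share names are illustrative):

```sql
CREATE DATABASE db_clone CLONE myDb;                          -- zero-copy clone
CREATE DATABASE db_replica AS REPLICA OF myOrg.acct2.myDb;    -- replication
CREATE DATABASE db_shared FROM SHARE provider_acct.myShare;   -- from a Share
CREATE TRANSIENT DATABASE db_transient;                       -- no Fail-safe
```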

Schema

Logical grouping of Tables, Views, Stored Procedures, UDFs, Stages, File Formats, Pipes, Sequences, etc. Every Schema belongs to a single Database. When a database is created, two default schemas are created in it:

  • public: default schema for any object created without specifying a schema
  • information_schema: stores metadata information

  • A Schema can be created as a clone of another Schema
  • The Database name and Schema name together form the Namespace for all Schema objects
  • Schemas can be created as Transient.
    • Time Travel: 0 or 1 days
    • No Fail-Safe support

Overview


The Snowflake Platform

Snowflake is a cloud-native data platform delivered as a service:

  • purpose built for the cloud
  • no management of hardware, updates and patches
  • automatic optimizations (partitioning)

Snowflake Releases

  • Snowflake releases new features weekly
  • The deployments happen transparently in the background; users experience no downtime or disruption of service, and are always assured of running on the most-recent release with access to the latest features.

Data Platform Workloads

  • Data Warehouse
    • Structured and relational data
    • ANSI standard SQL
    • ACID compliant transactions
    • Data stored in databases, schemas and tables
  • Data Lake
    • Separation and scalability of storage and compute
    • Schema does not need to be defined up front
    • Native processing of semi-structured data formats
  • Data Engineering
    • Data ingestion
      • batch: COPY INTO
      • continuous (near real-time): Snowpipe
    • Separate compute clusters for different workloads
    • Tasks and Streams for data pipelines
    • Security features:
      • all data encrypted at rest and in transit
      • row-based access control
      • column-level masking
      • Multi-Factor Authentication
  • Data Science
    • Centralized storage removes data management roadblocks
    • Partner ecosystem data science tools can integrate natively with Snowflake
      • Amazon SageMaker
      • DataRobot
      • Dataiku
  • Data Exchange
    • Secure data sharing between accounts in the same cloud provider region
    • Data Marketplace
    • Data Exchange
    • BI with Snowflake partner ecosystem tools
  • Data Applications
    • Connectors and drivers
    • UDFs and Stored Procedures
    • External UDFs (AWS Lambda or Azure Functions)
    • Snowpark

Connecting to Snowflake

  • Snowsight: Snowflake web interface/console
  • SnowSQL: The Snowflake CLI client
  • ODBC: client, requires a Snowflake driver
  • JDBC: client, requires a Snowflake driver
  • SDK (Node, Python, Kafka, Spark, Go…)

Features

  • ANSI SQL compliant OLAP data warehouse
  • Multi-cluster, shared data architecture
  • Self-tuning and self-healing
  • Pay only for what you use: storage and compute
  • Compute (Virtual Warehouses) can be
    • Scaled Up/Down by resizing the warehouse on the fly to accommodate complex, heavy workloads
    • Scaled Out by using multi-cluster Warehouses allowing Snowflake to automatically spin up additional compute clusters to handle large concurrent query workloads
    • Automatically suspended and resumed to accommodate periods of inactivity
  • Time Travel (see below for more info)
  • Fail-Safe (see below for more info)
  • Live data sharing
    • Private 1:1 and 1:many data sharing
    • Public data marketplaces
    • Secure private data exchanges
  • Multi-cloud: AWS, Azure, GCP
  • Zero-Copy Cloning
    • Objects that can be cloned: Databases, Schemas and Tables
    • Only metadata is copied
    • Once an object is cloned, the individual copies evolve separately and their storage increases only by the amount of data that was modified in each copy.

Time Travel

  • Allows access to previous versions of the data - the data in its state at a point of time in the past
  • A minimum of Enterprise edition required for time travel longer than 1 day, up to 90 days
  • The period of time travel can be set by database, schema or table
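Time Travel in practice (object names are illustrative; the statement ID is a placeholder query ID):

```sql
-- Query the table as it was 30 minutes ago
SELECT * FROM myTable AT(OFFSET => -60*30);

-- Restore a table to its state before a given statement ran
CREATE TABLE myTable_restored CLONE myTable
  BEFORE(STATEMENT => '<query_id>');

-- Set the Time Travel retention period (days) on a table
ALTER TABLE myTable SET DATA_RETENTION_TIME_IN_DAYS = 30;
```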

Fail-Safe

  • A data backup feature
  • Snowflake (not the customer) can recover and restore data for up to 7 days after the Time Travel period expires (not configurable)
  • Accessible only by contacting Snowflake support administrators
  • Only available for permanent tables
  • Fail-Safe incurs additional storage costs for the extra 7 days of data retention

Free Trial

You can sign up for a 30-day free trial at https://signup.snowflake.com/. Trying out the Enterprise edition is recommended.

Pricing


Snowflake pricing and cost are based on the actual usage. You pay for what you use and scale storage and compute independently.

Storage Costs

All customers are charged a flat monthly fee per Terabyte for the data they store in Snowflake. Storage cost is measured using the average amount of storage used per month, after compression, for all customer data stored in Snowflake.

The storage includes:

  • Data stored in tables, even temporary ones
  • Historical data for Time Travel
  • Historical data for Fail-Safe
  • Internal Stages

Compute Costs

Customers pay for virtual warehouses using Snowflake credits. The cost of these credits depends on the Snowflake edition that you are using, and the cloud region of the Snowflake account, see https://www.snowflake.com/pricing/. For example, the costs per credit for an account in US East (North Virginia) on AWS were:

  • Standard Edition: $2.00 per credit
  • Enterprise Edition: $3.00 per credit
  • Business Critical Edition: $4.00 per credit

The number of credits that a virtual warehouse consumes is determined by the following:

  • The warehouse size
  • The number of clusters (for multi-cluster warehouses)
  • The time each server in each cluster runs
    • Server runtime is billed per second
    • Server runtime is billed WITH A ONE-MINUTE MINIMUM
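A worked example, assuming the commonly published rate of 4 credits/hour for a Medium warehouse:

```
Medium warehouse     = 4 credits/hour
Runs 30 minutes      = 4 * 0.5         = 2 credits
Runs 10 seconds      = billed 1 minute = 4 * (1/60) ~ 0.067 credits
On Enterprise Edition at $3.00/credit, 2 credits = $6.00
```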

Cloud Services Costs

Customers pay for cloud services using Snowflake credits. The typical utilization of cloud services (up to 10% of daily compute credits) is included for free, meaning most customers will not see incremental charges for cloud service usage.

Data Transfer Costs

Customers who wish to move or copy their data between regions or cloud providers will incur data transfer charges. Features such as External Tables, External Functions, and Data Lake Export may incur data transfer charges.

Purchasing the Snowflake Service

  • On-Demand: Billed in monthly arrears, customers are charged a fixed rate only for the amount of services consumed
  • Pre-paid: Pre-purchase Capacity, which involves a commitment to Snowflake. The Capacity purchased is then consumed monthly, providing lower prices and a long-term price guarantee, among other advantages.

ResourceMonitors

Resource Monitor


To help control costs and avoid unexpected credit usage caused by running warehouses, Snowflake provides resource monitors. A resource monitor can be used to monitor credit usage by virtual warehouses and the cloud services needed to support those warehouses.

  • Limits can be set for a specified interval or date range. When these limits are reached and/or are approaching, the resource monitor can trigger various actions, such as sending alert notifications and/or suspending user-managed warehouses.
  • Resource Monitors can only be created by account administrators (i.e. users with the ACCOUNTADMIN role)
  • Account administrators can grant the MONITOR and MODIFY privileges to other roles, allowing those users to view and modify Resource Monitors
  • Resource Monitors can be set at two different Monitor Levels:
    • Account Level: a single Monitor can be set to control credit usage for all warehouses in your account.
    • Warehouse Level: a monitor can be assigned to one or more warehouses, controlling the credit usage for each warehouse
    • NOTE: A warehouse can only be assigned to a single Resource Monitor below the account level
  • The used credits for a resource monitor reflect the sum of credits consumed by all assigned warehouses within the specified interval, along with the cloud services used to support those warehouses during the same interval.
  • Resource Monitors can be replicated when replicating Databases

Configuring Resource Monitors

We need to specify several parameters when creating a Resource Monitor:

  • Credit Quota: specifies the number of Snowflake credits allocated to the monitor for the specified frequency interval.
    • Any number can be specified
    • Snowflake tracks the used credits/quota within the specified frequency interval by all warehouses assigned to the monitor
    • At the specified interval, this number resets back to 0
  • Monitor Level: specifies whether the resource monitor is used to monitor the credit usage for your entire Account (i.e. all warehouses in the account) or a specific set of individual warehouses.
  • Schedule: Consists of several properties
    • Start: Date and time (i.e. timestamp) when the resource monitor starts monitoring the assigned warehouses. Supported values:
      • Immediately (i.e. current timestamp)
      • Later (i.e. any future timestamp)
      • Note, however, that regardless of the time specified in the start date and time, resource monitors reset at 12:00 AM UTC.
    • Frequency: The interval at which the used credits reset relative to the specified start date. Supported values:
      • Daily
      • Weekly
      • Monthly
      • Yearly
      • Never (used credits never reset; assigned warehouses continue using credits until the credit quota is reached)
    • End: Date and time (i.e. timestamp) when Snowflake suspends the warehouses associated with the resource monitor, regardless of whether the used credits reached any of the thresholds defined for the resource monitor
    • Actions/Triggers: the action to perform if a given threshold is reached within the specified interval
      • Thresholds are usually expressed as a percentage of the credit quota
      • Notifications are sent to all account administrators who have enabled receipt of notifications, as well as to non-administrator users in the notification list
      • Resource monitors support the following actions:
        • Notify & Suspend: send a notification and suspend all assigned warehouses after all queries in flight have completed.
        • Notify & Suspend Immediately: send a notification and suspend all assigned warehouses immediately, which cancels any queries in flight.
        • Notify: Perform no action, but send an alert notification

Some sample SQL:

USE ROLE accountadmin;

CREATE OR REPLACE RESOURCE MONITOR limit1 WITH CREDIT_QUOTA = 2000
    FREQUENCY = WEEKLY
    START_TIMESTAMP = '2019-03-04 00:00 PST'
    TRIGGERS ON 80 PERCENT DO SUSPEND
             ON 100 PERCENT DO SUSPEND_IMMEDIATE;

ALTER WAREHOUSE wh1 SET RESOURCE_MONITOR = limit1;
ALTER WAREHOUSE wh2 SET RESOURCE_MONITOR = limit1;

Warehouse Suspension and Resumption

If a Resource Monitor suspends a Warehouse, it cannot be restarted unless one of these conditions is met:

  • The next interval, if any, starts, as dictated by the start date for the monitor.
  • The credit quota for the monitor is increased.
  • The credit threshold for the suspend action is increased.
  • The warehouses are no longer assigned to the monitor.
  • The monitor is dropped.
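For example, to let the warehouses assigned to the limit1 monitor from the example above resume, an account administrator could raise the quota (the new quota value here is illustrative):

ALTER RESOURCE MONITOR limit1 SET CREDIT_QUOTA = 3000;

-- Alternatively, detach a warehouse from the monitor
ALTER WAREHOUSE wh1 SET RESOURCE_MONITOR = NULL;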

Virtual Warehouses

Virtual Warehouses (VWs) are the compute layer in Snowflake's architecture. A VW is a logical wrapper around a cluster of servers with CPU, memory and disk. Virtual Warehouses are used to run queries and DML operations (such as loading data into tables).

  • The actual compute underlying the VW is fully managed by Snowflake
  • While running, even if idle, a VW consumes Snowflake credits - you are charged for compute
  • VWs can be Standard or Multi-Cluster Warehouses (MCW). Standard Warehouses have only a single compute cluster and cannot scale out

SCALE OUT/IN: Multi-Cluster Warehouses (MCW)

MCWs can automatically SCALE OUT/IN by spawning (or shutting down) additional compute clusters

  • used for handling spikes in concurrent compute requests
  • MCWs are a feature of Snowflake Enterprise Edition and higher
  • MCWs can be set to scale out to a maximum of 10 clusters
  • The number of credits billed is calculated based on the size and the number of warehouses that run within the time period
  • MCWs scale out under two Scaling Policies - Standard and Economy
    • Standard policy: starts additional clusters as soon as queries start queueing (or are expected to queue), favoring performance over credit conservation
    • Economy policy: starts an additional cluster only if the system estimates there is enough query load to keep the new cluster busy for at least 6 minutes, favoring credit conservation
  • Incoming queries are load-balanced across the available running clusters
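A minimal MCW definition might look like the following (the warehouse name and cluster counts are placeholders):

CREATE WAREHOUSE mcw_demo
  WAREHOUSE_SIZE = 'MEDIUM'
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 4
  SCALING_POLICY = 'ECONOMY';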

SCALE UP/DOWN: Resizing Warehouses

A Virtual Warehouse can SCALE UP/DOWN (be resized), manually, at any time, even while running

  • used for handling complex, long-running queries or ingesting large data sets
  • when resized while running, currently executing queries will complete at the current size; queued and new queries will execute at the new size
  • Efficient scale up/down should give approximately linearly proportionate results, e.g. running a query on a double-sized warehouse should cut the query run time in half.
  • It is recommended to keep queries and workloads of similar complexity and with similar compute demands on the same warehouse to simplify compute resource size management
  • Resizing can be included in SQL scripts so a warehouse can be scaled up/down depending on the demands of subsequent queries
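As a sketch of that pattern, a script could resize a warehouse around a heavy workload (wh1 is a placeholder name):

-- Scale up before a complex, long-running workload
ALTER WAREHOUSE wh1 SET WAREHOUSE_SIZE = 'XLARGE';
-- ... run the heavy queries ...
-- Scale back down afterwards
ALTER WAREHOUSE wh1 SET WAREHOUSE_SIZE = 'XSMALL';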

Creating Warehouses

  • One can create an unlimited number of warehouses in their account.
  • XS (Extra Small) is the smallest size of VW with one server per cluster.
    • each size up doubles the number of servers in the cluster as well as the number of Snowflake credits consumed
    • sizes range from XS, through S, M, L, XL, 2XL up to 6XL (currently)
      • the default size for warehouses created through the web interface is XL
      • the default size for warehouses created using SQL is XS
      • 5XL and 6XL warehouses are generally available in AWS regions, and in preview in US Government and Azure regions.
  • A VW can be set to auto-suspend when idle for a configurable number of minutes.
    • By default, the auto-suspend option is enabled.
    • Through the web UI auto-suspend can be set as low as 5 minutes
    • Through SQL auto-suspend can be set to the minimum auto-suspend setting of 1 minute (60 seconds)
    • VWs can also be started or suspended manually
  • A VW can also be set to auto-resume when a new query is submitted
    • Auto-resume is enabled by default
    • A VW will automatically start when first created
    • In SQL a VW can be created in suspended mode by setting the INITIALLY_SUSPENDED = TRUE option
  • When a VW is first created, it has, by default, the USAGE permission granted to the role that owns/creates it since ownership defaults to the object creator
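The options above can be combined in a single CREATE statement; a minimal sketch (wh_demo is a placeholder name):

CREATE WAREHOUSE wh_demo
  WAREHOUSE_SIZE = 'XSMALL'        -- default size when created via SQL
  AUTO_SUSPEND = 60                -- seconds of idle time before suspending
  AUTO_RESUME = TRUE               -- resume automatically when a query arrives
  INITIALLY_SUSPENDED = TRUE;      -- do not start running on creation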

Compute Credits

For more info, see Understanding Compute Cost

  • Customer-managed compute is charged per second with a 60-second minimum each time the warehouse starts
  • Snowflake-created serverless (background) compute is charged per second with no minimum.
  • Credit cost varies depending on cloud provider and region
  • One server in the VW cluster uses one credit per hour, so, for example, an XS (1 server) uses one credit per hour while a Large (8 servers) uses 8 credits per hour

Cloud Services Compute Billing

  • Customers are charged for Cloud Services Layer compute that exceeds 10% of the daily warehouse compute costs for the account.
  • Use the WAREHOUSE_METERING_HISTORY view to see how much cloud compute your account is using
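For example, a query along these lines summarizes cloud services credits per warehouse over the past week (the 7-day window is arbitrary):

SELECT warehouse_name,
       SUM(credits_used_cloud_services) AS cloud_services_credits
FROM snowflake.account_usage.warehouse_metering_history
WHERE start_time > DATEADD('day', -7, CURRENT_TIMESTAMP())
GROUP BY warehouse_name;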

Transactions

A transaction is a sequence of SQL statements that are committed or rolled back as a unit. All statements in the transaction are either applied (i.e., committed) or undone (i.e., rolled back) together. For this reason, transactions are ACID (Atomicity, Consistency, Isolation & Durability) compliant.

  • If a session disconnects and leaves a transaction in a detached state that can be neither committed nor rolled back, Snowflake automatically aborts and rolls back the transaction after 4 hours.
  • You can abort a running transaction with the system function SYSTEM$ABORT_TRANSACTION
  • Each transaction has independent scope
  • Snowflake does not support nested Transactions, although it supports nested Stored Procedure calls
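A minimal transaction sketch (the tables and values are hypothetical); both statements are applied or undone together:

BEGIN;
INSERT INTO orders (id, item) VALUES (1, 'widget');
UPDATE inventory SET qty = qty - 1 WHERE item = 'widget';
COMMIT;  -- ROLLBACK; instead would undo both statements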

Data Storage Layer

The Data Storage Layer is Snowflake managed cloud-based storage available in the major cloud providers (AWS, Azure and GCP).

  • Full native support for semi-structured data: JSON, AVRO, ORC, XML and Parquet
  • Customers are billed for data based on its compressed size as well as any files existing in internal (Snowflake-managed) stages

Micro-Partitions

All data in Snowflake tables is automatically divided into micro-partitions. Micro-partitions are contiguous units of storage that generally hold a maximum of 16 MB of encrypted, compressed data (between 50 MB and 500 MB uncompressed), organized in a columnar way.

  • Micro-partitions are IMMUTABLE and versioned
    • In order to modify data in a micro-partition, a new version of the micro-partition must be created. The older micro-partition is then marked for deletion
    • This is what enables the time travel feature in Snowflake
  • Data is compressed within each micro-partition
    • Within each partition, data is organized in a hybrid columnar format
    • On load, data is automatically analyzed for the optimal compression scheme based on its format
    • Organizing data based on columns containing data of the same type allows for efficient compression which reduces storage and I/O
  • Snowflake attempts to preserve natural data co-location
    • Since data is organized in micro-partitions on load, some natural data clustering and optimization occurs
  • Table records are contained entirely within a micro-partition, they cannot span multiple partitions

Query Pruning

The metadata stored in the Cloud Services Layer allows Snowflake to minimize the data it needs to load from the storage layer in order to resolve a query. Snowflake performs two types of pruning:

  • micro-partition pruning - reading only the micro-partition files needed to resolve a query
  • column pruning - only reading the columns that we need from each micro-partition

The Cloud Services Layer metadata also allows the processing of certain queries (e.g., MIN, MAX) at the Cloud Services Layer without engaging the compute or storage layers. The metadata tracked includes:

  • At the table level:
    • row count
    • table size (in bytes)
    • micro-partition references and table versions
  • At the Micro-Partition level:
    • MIN/MAX (range of values) in each column
      • this allows micro-partition pruning during query optimization
    • number of distinct values
    • NULL count

Data Storage Billing

  • Customers are billed for actual (compressed) storage used based on the daily average, billed per Terabyte per month
    • On-demand Pricing: Billed in arrears at $40 per Terabyte per month with a minimum charge of $25
    • Pre-purchased Capacity:
      • Billed upfront for committed storage capacity
      • Price varies depending on cloud platform
      • Customer is notified when they have consumed 70% of pre-purchased capacity

Bulk Loading

Bulk load is the process of loading batches of data from files available in a stage into Snowflake tables.

  • You need to specify the table name where you want to copy the data, the stage where the files are, the file/patterns you want to copy, and the file format
  • Bulk loading uses a transactional boundary control which wraps each COPY INTO command into an individual transaction that can be rolled back if errors are encountered
  • The information about the loaded files is stored in Snowflake's metadata. You cannot COPY the same file again within the next 64 days unless you force it with the FORCE = TRUE option
    • Renaming a previously ingested file in the stage does not modify the metadata so the file won't be re-ingested with the next COPY INTO command
  • Bulk loading uses user-managed compute resources (Virtual Warehouses) which affect the processing time and cost of data loading.
  • Some data transformations can be performed during data load, e.g. column reordering, column omission, type casting, etc.
    • some transformations like Flatten, Join, Group by, Filters or Aggregations are not supported.
  • You cannot Load/Unload files from your Local Drive, they need to be staged (PUT) or downloaded (GET) first
  • Using the Snowflake UI, you can only load files up to 50MB. You can copy bigger files using SnowSQL.
  • Organizing input data by granular path can improve load performance
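Putting the points above together, a minimal bulk load sketch (stage, path, table and pattern are placeholders):

COPY INTO mydb.public.mytable
  FROM @my_int_stage/2023/06/          -- granular path improves load performance
  PATTERN = '.*sales.*[.]csv'
  FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);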

Staging

Staging from a Local Storage on a Client Machine

We can use the PUT command to UPLOAD files from a local directory/folder to an INTERNAL STAGE (named internal stage, user stage, or table stage):

PUT file:///tmp/data/mydata.csv @my_int_stage;
  • PUT does NOT work with external stages.
  • You cannot use the PUT command from the Snowflake Web UI, you will need to use SnowSQL
  • Compression during a PUT operation uses the local machine's memory and /tmp directory disk space.
  • Uploaded files are automatically encrypted with 128-bit or 256-bit keys
  • The LIST command can be used to list the files in the stage.

Staging from Cloud Storage

  • When loading data from cloud storage it is generally recommended to create an external stage
  • If no stage was created, you will need to provide the location, credentials and decryption keys for each COPY INTO command.
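A sketch of creating a named external stage over S3 (bucket and credentials are placeholders; in practice a storage integration is preferable to inline credentials):

CREATE STAGE my_ext_stage
  URL = 's3://mybucket/data/'
  CREDENTIALS = (AWS_KEY_ID = '<key_id>' AWS_SECRET_KEY = '<secret_key>');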

COPY INTO Command

The COPY INTO command is used to load the data from staged files to an existing table. Snowflake will automatically load all files in the stage matching the query parameters which have not been previously loaded

  • Metadata on which files have been loaded is kept around for about 64 days.

    BEST PRACTICE: Files should not be kept in a stage for longer than 64 days

When loading files, you can:

  • name the files explicitly, e.g. FILES = ('<file_name>'[, '<file_name>'][, ...])
  • provide a pattern to match the files to be loaded: PATTERN = '<regex_pattern>'

Other options:

  • FILE_FORMAT - must be specified unless a file format has already been attached to the stage or the table
  • VALIDATION_MODE - "dry run" to validate the data without loading it
  • SIZE_LIMIT - limit the amount of data being ingested
  • PURGE - delete any files from the stage which were successfully loaded. Default is FALSE
    • If the purge operation fails for any reason, no error is returned
  • RETURN_FAILED_ONLY - returns only the failed rows. Default is FALSE
  • ENFORCE_LENGTH - will cause data that exceeds the column size defined in the table to be treated as errors. Default is TRUE
  • TRUNCATECOLUMNS - will cause data that exceeds the column size defined in the table to be truncated to the column size and not treated as an error. Default is FALSE
  • FORCE - reload previously loaded files. Default is FALSE. TRUNCATE-ing a table resets the metadata for previously loaded files so FORCE is not needed for a TRUNCATE-d table.
  • LOAD_UNCERTAIN_FILES - load files older than 64 days where Snowflake metadata on whether they were loaded or not no longer exists. Default is FALSE
  • ON_ERROR - specify how to handle errors. Default is ABORT_STATEMENT:
    • CONTINUE - skip error rows and continue processing file to the end
    • SKIP_FILE - files with errors are skipped
    • SKIP_FILE_<n> - files with a <n> number of errors are skipped
    • SKIP_FILE_<n>% - files with a <n>% percentage of errors are skipped
    • ABORT_STATEMENT - abort the COPY INTO execution on any error encountered. Rows loaded prior to the error are reverted
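Several of these options can be combined on one command; a hedged sketch (table, stage and file format names are placeholders):

COPY INTO mytable
  FROM @my_stage
  FILE_FORMAT = (FORMAT_NAME = 'my_csv_format')
  ON_ERROR = 'SKIP_FILE_5%'   -- skip files where more than 5% of rows error
  PURGE = TRUE;               -- delete successfully loaded files from the stage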

Validating a Data Load

Data can be validated before loading with the COPY INTO option VALIDATION_MODE which can be set to:

  • RETURN_<n>_ROWS
  • RETURN_ERRORS
  • RETURN_ALL_ERRORS

Validating data after a load can be done with the VALIDATE command which returns all errors encountered during a specific load job into a given table:

VALIDATE([namespace.]<table_name>,
  JOB_ID => { '<query_id>' | _last }
);
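For example, to inspect the errors from the most recent COPY INTO executed against a table in the current session (mytable is a placeholder):

SELECT * FROM TABLE(VALIDATE(mytable, JOB_ID => '_last'));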

Continuous Loading

There are different ways to ingest data continuously into Snowflake:

  • Snowpipe: the easiest and most popular way
  • Snowflake Connector for Kafka: Reads data from Apache Kafka topics and loads the data into a Snowflake table
  • Third-Party Data Integration Tools: You can do it with other supported integration tools

Snowpipe

Enables streaming/pipeline/micro-batch/near-real-time loading of frequent, small volumes of data into Snowflake.

  • Specifies the COPY INTO command that is used to load the data
  • It can be used with internal or external stages
  • Can be used with structured and semi-structured data
  • No scheduling is required
  • Snowpipe is serverless; it doesn't use Virtual Warehouses
  • It is used for small volumes of frequent data which it loads continuously (micro-batches)
  • Snowpipe does not guarantee that files will be loaded in the same order that they were staged
  • Snowpipe retains 14 days of metadata vs. bulk loading which retains metadata for 64 days. You cannot copy the same files with Snowpipe again during these 14 days
    • Renaming a previously ingested file in the stage does not modify the metadata so the file won't be re-ingested at the next SnowPipe run
  • The default behaviour when there is an error loading files is SKIP_FILE rather than bulk copy's ABORT_STATEMENT
  • A Snowpipe can be paused, resumed and its return status can be retrieved.

Snowpipe can be notified of new files in the stage in one of two ways:

  • A call can be made to the Snowflake REST API to trigger the ALTER PIPE <pipe_name> REFRESH command to check the stage for new files to be loaded.
    • It requires key pair authentication with a JSON Web Token (JWT)
    • APIs endpoints available:
      • Data File Ingestion: insertFiles
      • Load History Reports: insertReport and loadHistoryScan
  • You can setup a task which would run the ALTER PIPE <pipe_name> REFRESH command on a schedule

Alternatively, you can use the Snowpipe AUTO_INGEST method, together with notifications setup on your cloud provider so the pipe is automatically notified when a file lands in the stage

  • AUTO_INGEST is only available on AWS external stages by using S3 event messages dispatched by Amazon's SQS (Simple Queue Service)
  • For more information, see Automating Snowpipe for Amazon S3
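A sketch of an auto-ingesting pipe over an S3 external stage (pipe, stage and table names are placeholders):

CREATE PIPE my_pipe
  AUTO_INGEST = TRUE
  AS COPY INTO mytable
     FROM @my_s3_stage
     FILE_FORMAT = (TYPE = 'JSON');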

Once Snowpipe receives a trigger for the ingestion, whether via the REST API or a notification from the cloud provider, files are moved from the stage into an Ingestion Queue and then into the Pipe which contains the COPY INTO command with the source stage and the target table.

Billing

Customers are billed for the cloud compute used by Snowpipe per-second, per-core on actual compute usage.

In addition to the cloud compute costs, there is a utilization charge of 0.06 credits per 1,000 files notified via the REST API or AUTO_INGEST

Continuous Processing

Tasks

A Task is a way to run SQL code on a schedule without having to depend on external tools. A task can execute any one of the following types of SQL code: a single SQL statement, a call to a stored procedure, or procedural logic using Snowflake Scripting.

Task Features

  • Tasks can be chained based on the outcome of other Tasks to form DAGs (directed acyclic graphs)
    • the Root task is the only one which can be scheduled
    • the root task can be queued in case the warehouse is currently busy with another workload
    • child tasks are always queued - they only run after the parent task finishes
    • each task can have a maximum of 100 children tasks
    • a tree of tasks can have a maximum of 1000 tasks, including the root one
    • each non-root task can have dependencies on multiple predecessor tasks, up to 100 predecessors
    • if a predecessor task is dropped, there are two options:
      • its child task becomes a standalone task
      • its child task becomes a root task
    -- Create a child task
    CREATE TASK <newTask> AFTER <rootTask>;
    ALTER TASK <newTask> ADD AFTER <rootTask>;
    

  • Tasks can also be used with Streams
  • Tasks can call Stored Procedures
  • Tasks can be run at a set SCHEDULE interval (in minutes) or with a cron expression
  • Creating a task requires the CREATE TASK privilege
  • Tasks are initially created in the suspended state
    • they can be started/resumed with ALTER TASK <task_name> RESUME;
  • Tasks execute with the privileges of the owning Role
    • if the owner role of a task is dropped, task ownership is reassigned to the role that dropped the owner role
  • Tasks cannot be triggered manually
  • Snowflake ensures only one instance of a task with a schedule is executed at a given time
    • if a task is still running when the next scheduled execution time occurs, that scheduled time is SKIPPED.
  • Tasks have a maximum duration of 60 minutes by default
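A sketch of a scheduled task running on a user-managed warehouse (task, warehouse and table names are placeholders):

-- Runs every night at 02:00 UTC on warehouse wh1
CREATE TASK nightly_load
  WAREHOUSE = wh1
  SCHEDULE = 'USING CRON 0 2 * * * UTC'
AS
  INSERT INTO target_table SELECT * FROM staging_table;

ALTER TASK nightly_load RESUME;  -- tasks are created in the suspended state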

Task Compute Resources

Tasks require compute resources to execute SQL code. When a task is being created, the user has the choice to opt for using their own Virtual Warehouse or let Snowflake use serverless compute:

  • User-managed Virtual Warehouse. To enable that, the Virtual Warehouse where the task would execute must be specified with the WAREHOUSE option, see Optional Parameters
  • Snowflake-managed (i.e. serverless compute model). This is useful for frequently running tasks which would make suspending the warehouse where they run impossible. While Snowflake will automatically scale the serverless compute depending on the workload, the initial size of the serverless compute resource can be specified using USER_TASK_MANAGED_INITIAL_WAREHOUSE_SIZE
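For the serverless model, the WAREHOUSE option is simply omitted; a sketch (the task name and procedure call are placeholders):

CREATE TASK frequent_task
  USER_TASK_MANAGED_INITIAL_WAREHOUSE_SIZE = 'XSMALL'  -- initial serverless size hint
  SCHEDULE = '5 MINUTE'
AS
  CALL my_stored_procedure();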

Task History

You can use the information_schema.task_history() Snowflake function to query the history of task usage within a specified date range. You need one of these privileges to see the task history:

  • You have the ACCOUNTADMIN role
  • You are the owner of a task
  • You have the global MONITOR_EXECUTION privilege
SELECT *
FROM table(information_schema.task_history())
ORDER BY scheduled_time;

Streams

Streams (also known as Table Streams) are Snowflake objects that store metadata about DML changes (INSERT, UPDATE, DELETE) made to tables, views and secure views. A Stream can be created on a table and used for CDC (Change Data Capture) to identify and act on changed records.

When created, a table stream logically takes an initial snapshot of every row in the source table by initializing a point in time (called an offset) as the current transactional version of the table. The stream then records the DML changes after this snapshot was taken.

  • Streams hold no data. They can be thought of as timestamped bookmarks for a particular state of a table
  • Streams can be of three types:
    • Standard Stream
      • Captures all changes in the table records
      • Supported on tables, directory tables, and views
    • Append-Only Stream
      • Tracks row inserts only
      • Supported on tables, directory tables, and views
    • Insert Only
      • Tracks row inserts only
      • Supported on EXTERNAL TABLES only
  • Each stream contains the following metadata:
    • METADATA$ACTION - indicates the DML operation (INSERT, DELETE) recorded. Note that updates are recorded as INSERT, see METADATA$ISUPDATE below.
    • METADATA$ISUPDATE - indicates whether the operation was part of an UPDATE statement
    • METADATA$ROW_ID - unique and immutable ID for the row
  • The SYSTEM$STREAM_HAS_DATA function indicates whether a stream contains change data capture (CDC) records
  • Streams ensure exactly-once semantics for new or changed data in a table
    • a stream only contains the last DML action on a row. A stream will hold only one record per table row, if any, reflecting the current state of the row compared to its initial state when the stream was first created
    • if the row has not been modified, there will be no Stream record at all
    • if the row was modified but then it was reverted to its original state when the stream was first created, the Stream record for that row will be removed
  • Multiple streams can be created for the same table and consumed by different tasks
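A common CDC pattern combines a stream with a task that consumes it only when change records exist; a sketch with placeholder names:

CREATE STREAM my_stream ON TABLE src_table;

CREATE TASK consume_stream
  WAREHOUSE = wh1
  SCHEDULE = '5 MINUTE'
  WHEN SYSTEM$STREAM_HAS_DATA('MY_STREAM')  -- skip runs with no CDC records
AS
  INSERT INTO dst_table SELECT * FROM my_stream;  -- consuming advances the offset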

Data Unloading

Similar to Bulk Loading, unloading uses the COPY INTO command to unload data to an internal or external stage. The difference is that, when unloading, the source and target are reversed: data is copied from a table into a stage.

COPY INTO <stage> FROM <table>;

Unloading to Local Storage on a Client Machine

If data has been unloaded to a Snowflake INTERNAL STAGE (named internal stage, user stage, or table stage), you can use the GET command to DOWNLOAD the files to a local directory/folder on a client machine:

GET @my_int_stage file:///tmp/data/;
  • You cannot use the GET command from the Snowflake Web UI, you will need to use SnowSQL

Unloading to Cloud Storage

  • It is recommended to use an external named stage for data unloading
  • Files can also be unloaded directly to cloud storage by specifying the URL and any necessary credentials

Considerations

  • Supported file formats
    • Flat (CSV, TSV, etc.) for structured data
    • JSON - semi-structured data
    • Parquet - semi-structured data
  • When unloading relational data to JSON or Parquet, you will need to use the OBJECT_CONSTRUCT function to assemble table columns into semi-structured records.
  • When unloading, files are split
    • Each server in the warehouse cluster can write 8 files in parallel, so the larger the warehouse, the more (smaller) files will be generated. Choosing a smaller warehouse will result in fewer, larger files.
    • You can specify the SINGLE option for the COPY INTO command to prevent files from being split, but that results in a significant performance degradation as you'd be using one thread instead of 8 per server.
    • Default average file size is around 16MB, compressed. MAX_FILE_SIZE can be specified to change that but there are some cloud-specific limitations
      • Azure max filesize is 256MB
      • The max filesize for GCP and Amazon is 5GB
  • You can use the LIST and REMOVE commands to manage files in stages.
  • SELECT queries in the COPY INTO statement support the full SQL syntax which allows complex transformations on data egress, e.g. JOIN-ing data from multiple tables, perform aggregations, ordering and filtering, etc.
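An unload sketch combining several of the options above (stage path, table and column names are placeholders):

COPY INTO @my_int_stage/export/
  FROM (SELECT id, name FROM customers WHERE active = TRUE)
  FILE_FORMAT = (TYPE = 'CSV' COMPRESSION = 'GZIP')
  SINGLE = TRUE                -- do not split the output into multiple files
  MAX_FILE_SIZE = 104857600;   -- raise the cap to 100 MB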

Empty Strings vs. Null Values Options

  • FIELD_OPTIONALLY_ENCLOSED_BY - specify the character to use to enclose your fields
  • EMPTY_FIELD_AS_NULL - convert empty fields into NULL, the default is FALSE
  • NULL_IF - convert NULL values to something else
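One way these options can be combined so empty strings and NULLs stay distinguishable in the unloaded files; a sketch with placeholder names:

COPY INTO @my_stage FROM mytable
  FILE_FORMAT = (TYPE = 'CSV'
                 FIELD_OPTIONALLY_ENCLOSED_BY = '"'  -- empty strings become ""
                 EMPTY_FIELD_AS_NULL = FALSE
                 NULL_IF = ('NULL'));                -- NULL values become the literal NULL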

Data Movement

Loading and unloading data in Snowflake can be done in bulk (batch loading) or continuously.

Factors which affect loading times include:

  • the physical location of the stage,
  • GZIP compression efficiency, or
  • the number and types of columns

You can both load and unload data into tables with the COPY command

  • LOAD: copy data from a stage into a Snowflake table
  • UNLOAD: copy data from a table to a stage

File Formats

A schema-level named object which defines the format information required for Snowflake to interpret a file. You can specify different parameters, for example, the file’s delimiter, if you want to skip the header or not, etc.

Format  | Type            | Load | Unload | Binary Format
--------|-----------------|------|--------|--------------
CSV     | Structured      | Yes  | Yes    | No
JSON    | Semi-structured | Yes  | Yes    | No
Parquet | Semi-structured | Yes  | Yes    | Yes
XML     | Semi-structured | Yes  | No     | No
AVRO    | Semi-structured | Yes  | No     | Yes
ORC     | Semi-structured | Yes  | No     | Yes
  • Structured data: CSV
    • It’s the fastest file format to load data
    • May need to specify FIELD_DELIMITER, RECORD_DELIMITER and whether the file header should be skipped with SKIP_HEADER
  • Semi-structured data: JSON, Parquet, Avro, ORC, XML
    • Semi-structured data can be loaded as a VARIANT type in a Snowflake table
    • The maximum size for the VARIANT data type is 16MB
    • it can be queried using JSON notation, see Query Semi-Structured Data
  • A File Format must be available to every COPY INTO command. It can be provided in one of three ways:
    • A File Format can be attached to a stage - automatically applies the File Format to all files stored in the stage
    • A File Format can be attached to a table - automatically applies the File Format to all COPY INTO commands targeting the table
    • A File Format can be specified on the COPY INTO command itself, which overrides any stage- or table-level File Format
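A sketch of a named File Format attached to a stage, after which COPY INTO commands using that stage need no explicit FILE_FORMAT (names are placeholders):

CREATE FILE FORMAT my_csv_format
  TYPE = 'CSV'
  FIELD_DELIMITER = ','
  SKIP_HEADER = 1;

CREATE STAGE my_stage FILE_FORMAT = my_csv_format;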

Access Control

Overview of Access Control

Overview

  • Account- and user-level network policies
  • Online Certificate Status Protocol - OCSP x.509 validation for all Snowflake connectors
  • Optional support for cloud provider private network solutions
    • PrivateLink (AWS and Azure) - no traffic over the public internet
    • DirectConnect (AWS) - dedicated on-premise network connection to AWS

Access Flow

  1. Enforce network policy
  2. Verify account is not locked or disabled
  3. Verify user is not locked or disabled
  4. Resolve account and username from request
  5. Perform basic authentication checks

Account-Level Network Policies

  • IP white/blacklisting - restrict access by IP address or range
  • Use CIDR notation to specify subnet range
  • Only one active network policy per account
  • Must be ACCOUNTADMIN or SECURITYADMIN to configure
  • ALLOWED_IP_LIST is required, BLOCKED_IP_LIST is optional
-- Create a network policy
CREATE NETWORK POLICY <policy_name>
  ALLOWED_IP_LIST = (<ip_address_or_range>, ...)
  BLOCKED_IP_LIST = (<ip_address_in_allowed_range>, ...)
;
-- Set active network policy
ALTER ACCOUNT SET NETWORK_POLICY = <policy_name>;

User-Level Network Policies

If a user-level network policy is assigned to a user who is already logged in from an IP address the policy disallows, they are prevented from executing further queries

ALTER USER <username> SET NETWORK_POLICY = <policy_name>;

Configuration

  • Allow traffic on the following ports:
    • 443: general Snowflake traffic
    • 80: for the OCSP cache server which listens for all Snowflake client communication
  • Allow hostnames:
    • all Snowflake clients require temporary access to the provider cloud storage
    • SYSTEM$ALLOWLIST() lists the required hostnames for your Snowflake account

Authentication

Authentication

Identifies the user and confirms whether they are allowed to login.

Authentication Methods

  • Username and password policy (Snowflake-controlled, cannot be changed)
    • Password can be 8-256 characters long and must contain
      • at least one uppercase letter
      • at least one lowercase letter
      • at least one number
  • MFA: Multi-Factor Authentication with the Duo app
  • SSO/Federated Authentication (SAML 2.0), e.g. Okta
  • Other
    • Oauth: authentication from a 3rd party service or application
    • Key Pair: JSON Web Token (JWT) signed with a private/public RSA key pair
    • SCIM (System for Cross-domain Identity Management): often used with Azure or Okta

Authorization

Authorization

Snowflake's authorization model is a combination of:

  • RBAC: Role-Based Access Control model
    • Privileges are granted to Roles
    • Roles are granted to users
  • DAC: Discretionary Access Control model
    • each object has an owner who can grant others access to the object

All Snowflake objects are individually securable. Access to objects operates on the principle of least privilege: unless allowed by a privilege, access is denied.

Key Concepts

  • User: A person or a program

  • Role: An account-level object to which we can grant privileges

    • We can grant roles to users
    • Roles can be granted to other roles, creating a hierarchy
    • The privileges associated with a Role are inherited by any Roles above that Role in the hierarchy

  • Privilege: A defined level of access to an object, e.g. SELECT PRIVILEGE ON DATABASE db1

    • Privileges that can be set depend on the securable object. They can be very broad or very granular.

Default Roles

  • ORGADMIN
    • Role that manages operations at the organization level. More specifically, this role:
      • Can create accounts in the organization.
      • Can view all accounts in the organization (using SHOW ORGANIZATION ACCOUNTS) as well as all regions enabled for the organization (using SHOW REGIONS).
      • Can view usage information across the organization.
  • ACCOUNTADMIN: top level role
    • encapsulates the SYSADMIN and SECURITYADMIN roles
    • Access to account-level settings
    • Can manage Data Shares
    • Can manage Resource Monitors
    • Can manage Network Policies
  • SECURITYADMIN: Manage Users, Roles and Network policies
    • Access to account-level settings with the exception of Usage & Billing
    • Can manage Network Policies
  • USERADMIN: Manage Users and Roles
  • SYSADMIN: Manage all Snowflake objects, except:
    • Data Shares
    • Resource Monitors
    • Network Policies
  • PUBLIC: Lowest privilege role
    • Automatically granted to every User and Role in the Account

In addition to the default Roles, USERADMIN (or a user with the CREATE ROLE privilege) can create custom roles and assign roles to them.

  • As a rule of thumb, top-level custom roles should be assigned to SYSADMIN to ensure SYSADMIN has access to the objects the custom roles work with.
    • An exception to this rule of thumb is if the custom roles work with highly sensitive information.

Role Commands

-- Create a new Role
CREATE ROLE <NEW_ROLE>;

-- Grant a Role to a parent Role
GRANT ROLE <CHILD_ROLE> TO ROLE <PARENT_ROLE>;

-- Create a user and assign them a Role
CREATE USER <NEW_USER>
    PASSWORD = <PASSWORD> 
    DEFAULT_ROLE = <ROLE>
    MUST_CHANGE_PASSWORD = TRUE
;
GRANT ROLE <ROLE> TO USER <NEW_USER>;

-- Review Privileges of a Role
SHOW GRANTS TO ROLE <ROLE>;

-- See to whom a Role is assigned
SHOW GRANTS OF ROLE <ROLE>;

Other Access Control Mechanisms

  • Secure Views and UDFs (User Defined Functions)
    • Prevent data from being indirectly exposed via programmatic methods
    • Only authorized users can see the view or UDF definition
    • bypass some query optimizations to achieve better security, so they are not as efficient as regular views/UDFs
  • Row-level access - uses CURRENT_ROLE() or CURRENT_USER() to provide row-level security, e.g.
    SELECT ...
    FROM data_table
    JOIN auth_table
      ON auth_table.auth = data_table.auth
      AND auth_table.role = CURRENT_ROLE()
    ;
    

Data Protection

Continuous Data Protection

Encryption at Rest and in Transit

  • Encryption in transit for all client communication using TLS 1.2
  • AES-256 encryption at rest and in motion
  • Encryption at every level, see Hierarchical encryption key model
    • Root Key
    • Account master Key - each account has its own key
    • Object Master Keys - each object has its own key
    • File Keys - each micro-partition file has its own key
  • Automatic key rotation every 30 days, with no customer impact. Key states:
    • active: used for encryption and decryption
    • retired: used for decryption of data previously encrypted with that key
    • destroyed: no longer used
  • Enterprise Edition and above can enable automatic annual re-keying
  • Tri-Secret Secure or Bring Your Own Key
    • uses a composite master key created from a customer-maintained key and a Snowflake-managed key
    • the customer-maintained key must be available at all times for continuous access
  • Staging Encryption
    • In Snowflake-managed internal stages, Snowflake manages encryption
    • Customers can also choose to manage their own cloud storage stages where they would need to secure and encrypt the data prior to loading into Snowflake

Continuous Availability

  • Cloud providers replicate Snowflake storage across at least three availability zones within each cloud region where Snowflake is deployed
    • Each availability zone consists of multiple data centers which are:
      • geographically separated
      • have their own security
      • are on separate power grids
      • have their own individual power backups
  • Snowflake performs fully transparent online updates and patches with no maintenance downtime windows

Snowflake-specific Data Security Features

Time Travel

Understanding & Using Time Travel

Access historical data at any point of time within a defined retention time period. Allows querying/rollback to a previous state of a table/schema/DB within the retention period.

  • Maximum retention period can be configured per account, DB, schema and/or table by setting DATA_RETENTION_TIME_IN_DAYS:
    • Standard Edition - up to 1 day
    • Enterprise Edition and higher - up to 90 days
    • The default retention time is 1 day in all editions unless explicitly changed
    • Can be disabled by setting DATA_RETENTION_TIME_IN_DAYS = 0
    • To view current settings, use SHOW PARAMETERS
  • Allows fixing common mistakes
    • UNDROP a DB/Schema/Table
      • Attempting to restore an object with a name that already exists will result in an error
    • Access the previous state of a table with AT or BEFORE
      • Query the data in its previous state
      • Clone a DB/Schema/Table from its previous state
      -- by Timestamp
      SELECT *
      FROM my_table
      AT(timestamp => 'Sat, 01 May 2021 08:00:00 -0700'::timestamp_tz);
      
      -- by Offset
      SELECT *
      FROM my_table
      AT(offset => -60*15);
      
      -- before query statement ID
      SELECT *
      FROM my_table
      BEFORE(STATEMENT => '8e5d0ca9-005e-44e6-b858-a8f5b37c5726');
      
  • Adds to storage costs for maintaining the additional historical data
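
The retention and restore operations above can be sketched as follows (table names are hypothetical):

```sql
-- Set the Time Travel retention period on a table (Enterprise Edition and higher for > 1 day)
ALTER TABLE my_table SET DATA_RETENTION_TIME_IN_DAYS = 30;

-- Restore a dropped table (errors if a table with that name already exists)
UNDROP TABLE my_table;

-- Clone a table as it was 30 minutes ago
CREATE TABLE my_table_restored CLONE my_table
  AT(offset => -60*30);
```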

Zero-Copy Cloning

Enables taking a quick snapshot of a DB/schema/table.

  • Zero-copy cloning is a metadata-only operation
    • Zero-Copy cloning is a FREE operation
  • Initially, both the original and the cloned objects point to the same micro-partitions
  • No data is copied, so no additional storage costs are incurred until data is changed in the original or in the clone
  • Usage
    • often used to quickly spin up a Dev or Test environment as a zero-copy clone of Production
    • can be used to create backups or data snapshots at given points of time
    • can be used to create read-only snapshots for reporting as of a given point of time, e.g.: month-, year-end, etc.
  • Some considerations:
    • Privileges are not cloned
    • Data History is not cloned
    • Stages cloning
      • Named Internal Stages are not cloned
      • External Stages are cloned
      • Table Stages are cloned
    • Pipes cloning: pipes that reference
      • internal stages are not cloned
      • external stages are cloned
  • Privileges required for cloning objects:
    • Table: SELECT
    • Pipe, Stream, Task: OWNERSHIP
    • all other objects: USAGE
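
Typical cloning commands, sketched with hypothetical object names:

```sql
-- Spin up a Dev environment as a zero-copy clone of Production
CREATE DATABASE dev_db CLONE prod_db;

-- Combine cloning with Time Travel to snapshot a table as of a point in time
CREATE TABLE orders_snapshot CLONE orders
  AT(offset => -60*60);
```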

Fail-Safe

Non-configurable, 7-day retention for historical data after the Time Travel expiration

  • Only accessible by Snowflake support personnel
  • Fail-safe cannot be disabled
  • Admins can view Fail Safe storage use in the Web UI under Account > Billing & Usage
  • Only permanent tables have a Fail-Safe period. Not supported for Temporary, Transient or External tables
  • Adds to storage costs for maintaining the additional historical data

Snowflake replication features

Snowflake provides customers with additional replication features, over and above the cloud providers' replication across availability zones.

Overview

Account and Security Overview

Security Aspects

  • Access: network policies
  • Authentication: Password, MFA or SSO
  • Authorization: combination of Discretionary and Role Based Access Control
  • Data Protection: data encryption at rest and in motion, replication, recovery features
  • Infrastructure: handled by cloud providers

Snowflake Operational Controls

Snowflake’s Security & Compliance Reports

  • NIST 800-53
  • SOC2 Type 2
  • SOC1 Type II
  • ISO/IEC 27001:2013
  • HIPAA (Snowflake Business Critical Edition)
  • PCI (Snowflake Business Critical Edition)
  • FedRAMP (Snowflake Business Critical Edition)

Caching

Caching

One of the most important ways to improve query speed and optimize costs in Snowflake is caching. There are several different caches in Snowflake: the Metadata, Query Result, and Warehouse caches.

Metadata Cache

The Metadata cache is maintained in the Cloud Services Layer and contains object information & statistics (see Architecture). Snowflake automatically stores different types of metadata to improve compilation time and query optimization. This helps with operations like MIN, MAX, COUNT, etc., where Snowflake does not need to use a warehouse (and no compute credits are charged), as the information is already available in the metadata cache.

  • The metadata cache information is stored for 64 days
  • The metadata cache contains information about staged files and their ingestion status
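
For example, a query like the following (table and column names are hypothetical) can be answered from metadata alone, without starting a warehouse:

```sql
-- Served entirely from micro-partition metadata; no compute credits are charged
SELECT COUNT(*), MIN(order_date), MAX(order_date)
FROM orders;
```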

Query Result Cache

The Cloud Services Layer accepts SQL requests from users, coordinates queries, manages transactions and results, and also holds the Query Result Cache

  • Holds the results of every query executed in the past 24 hours; each reuse extends the retention by 24 hours, up to a maximum of 31 days from the original execution
  • The query results are available across virtual warehouses, so cached query results returned are available to all users of the system, provided the underlying data has not changed
    • If the data in the Storage Layer changes, the caches are automatically invalidated.
  • No data is scanned from the Storage Layer and no compute is required so no computing credits are charged for cached results
  • Cached query results do NOT incur Storage costs
  • The query must not contain context functions (e.g. CURRENT_TIME()), as their results change between executions
  • The query must be executed with the same role as the query that produced the cached result
  • You can disable the Query Result cache with this command:
    ALTER SESSION SET USE_CACHED_RESULT = FALSE;
    

Warehouse Cache

The actual SQL is executed across the nodes of a Virtual Warehouse. This layer holds a cache of the micro-partition data queried; it is often referred to as the Local Disk cache, although in reality it is implemented using SSD storage.

  • All data in the compute layer is temporary, and only held as long as the virtual warehouse is active.
  • Suspending or resizing a warehouse will clear the data cache
    • A low Auto-Suspend setting (e.g. auto-suspend after 1 minute) minimizes the use of the Warehouse data cache
    • A resized Warehouse (up or down) starts with a cold (clear) cache
  • You can see how much data was read from Local (Warehouse) storage vs. Remote (Storage layer) storage by inspecting the "Percentage scanned from cache" statistic in the query profile.
  • It is recommended to use dedicated Virtual Warehouses for similar workloads. For example, a Virtual Warehouse for BI tasks, another for Data Science, etc. There is a chance that data cached by the warehouse for similar queries can be re-used
  • The larger a warehouse is, the larger the warehouse cache size is

DataClustering

Data Clustering

Clustering

Typically, data stored in tables is sorted along natural dimensions, for example, by date. This process is called clustering. Data that is not sorted/clustered may hurt query performance, particularly on huge tables.

The Cloud Services Layer collects clustering metadata for the micro-partitions in a table:

  • The number of micro-partitions that comprise the table.
  • The number of micro-partitions containing values that overlap with each other.
  • The depth of the overlapping micro-partitions.

Natural Clustering

As data is loaded into a table, micro-partitions are created based on ingestion order. That is natural clustering.

  • Natural clustering occurs when data arrives in a single data load
  • The ingestion order may be highly correlated with the values in one or more columns, e.g. autoincrement IDs, dates, sequential numbers, etc.
  • The source data organization determines what range of values (MIN/MAX) is represented in each micro-partition

Clustering Depth

Clustering depth measures the average number of overlapping micro-partitions for a specific column value.

Automatic Clustering

Automatic Clustering

Snowflake provides periodic & automatic re-clustering to maintain optimal clustering. When the ingestion-order natural clustering is not advantageous or has degraded due to extensive INSERT operations, Snowflake provides a mechanism for its customers to override its natural clustering algorithms by defining clustering keys. This results in like data (by key) being co-located in the same micro-partitions.

Snowflake only reclusters a clustered table if it will benefit from the operation. It monitors and evaluates the tables to determine whether they would benefit from reclustering, and automatically reclusters them, as needed. In most cases, no tasks are required to enable Automatic Clustering for a table.

Automatic clustering can be suspended and resumed:

ALTER TABLE t1 [SUSPEND|RESUME] RECLUSTER;

Clustering Keys

CREATE|ALTER TABLE ...
CLUSTER BY (<column_or_expression>, ...)
  • Use at most three to four columns (or expressions) per clustering key
  • Prefer and order clustering keys from low to high cardinality which will result in fewer, larger sets of micro-partitions being co-located together
  • It is important to choose a clustering key that has appropriate cardinality:
    • A large enough number of distinct values to enable effective pruning on the table.
    • A small enough number of distinct values to allow Snowflake to effectively group rows in the same micro-partitions.
  • Expressions such as DATE() or TRUNC() can be used for the clustering keys on columns with high cardinality, e.g. TIMESTAMP or hash strings, etc.
  • Snowflake maintains and updates the clustering order in the background which incurs compute and storage costs as micropartitions need to be rewritten
  • Typically, clustering keys are added after ingestion, and/or when natural clustering has degraded.
  • Cluster Keys are placed on columns usually used in WHERE, JOIN and ORDER BY clauses
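
A sketch of the guidance above (table and column names are hypothetical):

```sql
-- Order keys from low to high cardinality; wrap a high-cardinality
-- TIMESTAMP in an expression to reduce its effective cardinality
ALTER TABLE events CLUSTER BY (region, TO_DATE(event_ts));
```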

Clustering Keys Considerations

  • Clustering keys are most useful with tables which
    • are very large (multi-terabyte range)
    • have degraded in query performance as the table grows
    • have large clustering depth values
  • Frequently-changing tables will consume more credits for reclustering
  • The SYSTEM$CLUSTERING_INFORMATION() function can be used to get clustering information on a table's column. Good clustering will have:
    • low average_depth values which indicates low micro-partition MIN/MAX values overlap
    • partition_depth_histogram with a high number of low value depths
  • You can cluster materialized views, as well as tables
  • Snowflake recommends a maximum of 3 or 4 columns (or expressions) per key.
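
Clustering quality can be inspected with system functions (the table and key below are hypothetical):

```sql
-- JSON report including average_depth and partition_depth_histogram
SELECT SYSTEM$CLUSTERING_INFORMATION('events', '(region, TO_DATE(event_ts))');

-- Average clustering depth of the table
SELECT SYSTEM$CLUSTERING_DEPTH('events');
```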

Overview

Performance and Tuning Overview

Query Optimizer Workflow

Typical workflow of the Snowflake query optimizer:

  1. Row Operations: Inspect FROM, JOIN and WHERE clauses for partition pruning
    • Static pruning based on the columns in the WHERE clause based on the MIN/MAX micro-partition metadata
    • Dynamic pruning based on the JOIN columns
  2. Inspect GROUP BY and HAVING
  3. Inspect SELECT, DISTINCT, ORDER BY and LIMIT

Query Optimization Tips

  • Take advantage of partition pruning by limiting the initial data set as much as possible.
    • Use appropriate filters as early as possible
    • Filter on columns which would be naturally ordered (or have been clustered to be ordered) in micro-partitions
      • Natural clustering occurs on columns which have high correlation to ingestion order, e.g. date columns
    • When filtering by string value, avoid wildcards at the beginning of the string as the optimizer cannot take advantage of the MIN/MAX micro-partition metadata
    • Prefer uncorrelated over correlated subqueries. A correlated subquery references columns from the parent query, so it may have to execute once per parent row.
  • Join on unique keys
    • Ensure keys are distinct to avoid join explosions
    • Understand relationships between tables before joining
    • Avoid many-to-many joins
    • Avoid unintentional cross joins
  • Prefer low cardinality columns for GROUP BY clauses
  • Add LIMIT to ORDER BY clauses where possible
  • Limit ORDER BY usage to a single instance in the top level SELECT wherever possible
  • Review the query Profile Overview for
    • spilling to disk
    • high network usage
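
Two of the filtering tips above, illustrated with hypothetical tables:

```sql
-- Good: a range filter on a naturally ordered column enables partition
-- pruning via MIN/MAX micro-partition metadata
SELECT * FROM orders WHERE order_date >= '2023-01-01';

-- Poor: a leading wildcard prevents the optimizer from using
-- MIN/MAX micro-partition metadata
SELECT * FROM customers WHERE last_name LIKE '%son';
```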

Altering Query Behaviour

These parameters can be set at the Account, User, Session or Virtual Warehouse level:

  • STATEMENT_TIMEOUT_IN_SECONDS: how long a query can run before being cancelled by the system. Default: 2 days
  • STATEMENT_QUEUED_TIMEOUT_IN_SECONDS: how long a queued query can wait before being cancelled by the system. Default: 0 (wait forever)
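
For example (warehouse name is hypothetical):

```sql
-- Cancel queries that run longer than 1 hour on this warehouse
ALTER WAREHOUSE my_wh SET STATEMENT_TIMEOUT_IN_SECONDS = 3600;

-- Cancel queries queued for more than 5 minutes in this session
ALTER SESSION SET STATEMENT_QUEUED_TIMEOUT_IN_SECONDS = 300;
```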

VirtualWarehouseScaling

Virtual Warehouse Scaling

  • As queries are submitted, resources are calculated and reserved prior to the query execution
  • When there are insufficient resources, queries are queued until previous queries finish and release resources

Scaling Up

Scaling Up: Resizing a warehouse to handle more complex or data-intensive queries.

| Warehouse Size | Servers per Cluster | Clusters | Servers Available |
|----------------|---------------------|----------|-------------------|
| X-Small        | 1                   | 1        | 1                 |
| Small          | 2                   | 1        | 2                 |
| Medium         | 4                   | 1        | 4                 |
| Large          | 8                   | 1        | 8                 |
| X-Large        | 16                  | 1        | 16                |
| 2X-Large       | 32                  | 1        | 32                |
| 3X-Large       | 64                  | 1        | 64                |
| 4X-Large       | 128                 | 1        | 128               |
  • Scaling up/down is done by changing the size of the warehouse which affects the number of servers in the warehouse cluster
  • A warehouse that is too large may waste compute credits if its resources are not fully utilized
  • A warehouse that is too small may become very slow since it may be resource bound, resulting in spilling data to local or remote disk, further degrading performance.
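
Scaling up is a simple resize (warehouse name is hypothetical):

```sql
-- Resize an existing warehouse; takes effect for newly submitted queries
ALTER WAREHOUSE my_wh SET WAREHOUSE_SIZE = 'LARGE';
```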

Scaling Out

Scaling Out: Adding more clusters to the warehouse without changing its size to handle high concurrency loads.

| Warehouse Size | Servers per Cluster | Clusters | Servers Available |
|----------------|---------------------|----------|-------------------|
| Medium         | 4                   | 1        | 4                 |
| Medium         | 4                   | 2        | 8                 |
| Medium         | 4                   | 3        | 12                |
| Medium         | 4                   | 4        | 16                |

Multi-cluster warehouses can work in

  • Maximized Mode
    • the minimum and maximum number of clusters is the same and is more than one
    • useful for predictable periods of sustained high loads
  • Autoscaling Mode
    • the minimum and maximum number of clusters are different
    • The warehouse will automatically scale up and down (auto-scale) depending on load
    • helps to automate the handling of variable number of queries or users over time
    • Scaling policies
      • Standard
        • useful for relatively smooth transitions from low to high load and back
        • a new cluster is added
          • as soon as queries start queueing, or
          • an incoming query will be queued due to shortage of resources
        • a cluster is shut down after 2-3 consecutive checks (one minute apart) consistently determine that the load on the least-busy cluster can be redistributed to other clusters
      • Economy
        • useful for erratic transitions from low to high load and back, reduces scale out/in reactions to short-lived load spikes or drops
        • a new cluster is added
          • after a query has been queued for up to 6 minutes, or
          • when Snowflake determines that the current work load will keep available clusters busy for at least 6 minutes
        • a cluster is shut down after 5-6 consecutive checks (one minute apart) consistently determine that the load on the least-busy cluster can be redistributed to other clusters
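
Auto-scaling mode is configured by giving the warehouse different minimum and maximum cluster counts (the warehouse name below is hypothetical):

```sql
-- Multi-cluster warehouse in Auto-scale mode with the Economy policy
CREATE WAREHOUSE reporting_wh
  WAREHOUSE_SIZE = 'MEDIUM'
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 4
  SCALING_POLICY = 'ECONOMY'
  AUTO_SUSPEND = 60
  AUTO_RESUME = TRUE;
```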

Scaling Considerations

  • Query processing cannot cross clusters, so even though the total number of servers is the same, a complex query will perform differently on a 4-cluster multi-cluster warehouse made of X-Small clusters (4 servers) than on a single-cluster Medium warehouse (also 4 servers).

LoadAndUnloadSemiStructuredData

Load and Unload Semi-Structured Data

Create File Format

CREATE FILE FORMAT

CREATE FILE FORMAT <file_format_name>
TYPE = { JSON | AVRO | ORC | PARQUET | XML }
[<format_options>];

JSON Data

Some JSON Format Options

  • COMPRESSION: Specifies compression algorithm
    • Supported: GZIP, BZ2, BROTLI, ZSTD, DEFLATE, RAW_DEFLATE, NONE
      • the BROTLI algorithm cannot be detected by AUTO mode and must be specified explicitly
    • Default for loading: AUTO
    • Default for unloading: GZIP
  • FILE_EXTENSION: Specifies file extension to use
    • Default: .JSON
    • Only used for unloading
  • ALLOW_DUPLICATE: Allows duplicate object field names
    • Default: FALSE
    • Only the last field value will be preserved
    • Only used for loading
  • STRIP_OUTER_ARRAY: If TRUE, JSON parser will strip outer array brackets [] so the file is not treated as a single JSON entry
    • Default: FALSE
    • Only used for loading
  • STRIP_NULL_VALUES: If TRUE, JSON parser will remove object fields or array elements with NULL values
    • Default: FALSE
    • Only used for loading
  • DATE_FORMAT: Defines the date format in strings
    • Default: AUTO
    • Used only for loading JSON data into separate columns
  • TIME_FORMAT: Defines the time format in strings
    • Default: AUTO
    • Used only for loading JSON data into separate columns
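
A sketch combining a few of these options (the format name is hypothetical):

```sql
-- JSON file format for a file containing one outer array of objects
CREATE FILE FORMAT my_json_format
  TYPE = JSON
  STRIP_OUTER_ARRAY = TRUE
  STRIP_NULL_VALUES = TRUE
  COMPRESSION = AUTO;
```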

Load Options for JSON Data

Load the file with no parsing or transformations into a VARIANT column:

COPY INTO table_name
FROM <file(s) in stage>
FILE_FORMAT = (FORMAT_NAME = <JSON format with options>);

Transform the file into columns and load into a table with pre-defined columns:

COPY INTO table_name
FROM (
    SELECT <keys>
    FROM <file(s) in stage>
)
FILE_FORMAT = (FORMAT_NAME = <JSON format with options>);

Parquet Data

Some Parquet Format Options

  • COMPRESSION: Specifies compression algorithm
    • Supported: LZO, SNAPPY, NONE
    • Default for loading: AUTO
    • Default for unloading: SNAPPY
  • SNAPPY_COMPRESSION:
    • Default: TRUE
    • Only used for unloading
  • BINARY_AS_TEXT: Whether to interpret columns with no defined logical data type as UTF-8 text. When set to FALSE, Snowflake interprets these columns as binary data.
    • Default: TRUE
    • Only used for loading
  • TRIM_SPACE: Whether to remove leading and trailing white space from strings
    • Default: FALSE
    • Only used for loading Parquet data into separate columns
  • NULL_IF: Replace matching strings with a SQL NULL value
    • Default: \\N
    • Only used for loading Parquet data into separate columns

Load Options for Parquet Data

Load the file with no parsing or transformations into a VARIANT column:

COPY INTO table_name
FROM <file(s) in stage>
FILE_FORMAT = (FORMAT_NAME = <parquet format with options>);

Transform the file into columns and load into a table with pre-defined columns:

COPY INTO table_name
FROM (
    SELECT
      $1:<column_name>::<data_type>,
      $1:<column_name>::<data_type>,
      ...
    FROM <file(s) in stage>
)
FILE_FORMAT = (FORMAT_NAME = <parquet format with options>);

Loading MATCH_BY_COLUMN_NAME Option

  • Load semi-structured data directly into table columns whose names match the top-level keys present in the data
  • Supported for JSON, AVRO, ORC and Parquet file formats
COPY INTO table_name
FROM <file(s) in stage>
FILE_FORMAT = (FORMAT_NAME = <file format>)
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;

Unloading Data

  • Supported formats
    • JSON: Unload from a VARIANT column or reconstruct a JSON format
    • Parquet: Use a SELECT to unload the table columns into a Parquet file
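
A sketch of both unload paths (stage, table and column names are hypothetical):

```sql
-- Unload to JSON by reconstructing objects from relational columns
COPY INTO @my_stage/json/
FROM (SELECT OBJECT_CONSTRUCT('id', id, 'name', name) FROM my_table)
FILE_FORMAT = (TYPE = JSON);

-- Unload selected columns to Parquet
COPY INTO @my_stage/parquet/
FROM (SELECT id, name FROM my_table)
FILE_FORMAT = (TYPE = PARQUET);
```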

Overview

Semi-Structured Data Overview

Snowflake supports the following semi-structured data file formats:

  • JSON
  • Avro - originally developed for Apache Hadoop
  • ORC (Optimized Row Columnar) - used for Hive data
  • Parquet - binary format used in the Hadoop ecosystem
  • XML (Extensible Markup Language)

Semi-structured data is a first-class citizen in Snowflake:

  • ingested in the same way as structured data, with micropartitions, metadata, compression and encryption, to enable quicker query access and partition pruning
  • all SQL operations are supported with native syntax for accessing data even in semi-structured fields

Semi-structured data is supported using the VARIANT data type:

  • holds standard SQL data types as well as arrays and objects
  • Objects - collection of key-value pairs where the values are of VARIANT type
  • Arrays - collection of values where each value is a VARIANT
  • non-native values (e.g. dates, timestamps) are stored as strings in VARIANT columns
  • operations can be slower than when stored as the corresponding data type

QuerySemiStructuredData

Query Semi-Structured Data

-- Sample "weather" VARIANT column data
{ "humidity": 45.6, "cloud_cover": { "density": 25.0, "description": "scattered clouds" }}
  • To access a value in a VARIANT column under a given key, you can use:
    • colon (:) notation: SELECT weather:humidity
    • square brackets ([]) notation: SELECT weather['humidity']
  • To access values of nested objects, you can chain the key names of each nested object with a dot (.) to reach the value: SELECT weather:cloud_cover.density
  • VARIANT data can be cast to SQL data types using the double colon :: syntax:
SELECT
    weather:humidity::NUMBER(10, 2) AS humidity_percent,
    weather:cloud_cover.description::VARCHAR as cloud_description

Semi-structured Data Functions

Semi-structured Data Functions

  • ARRAY_AGG: array creation and manipulation
  • OBJECT_AGG: object creation and manipulation
  • PARSE_JSON: parses JSON text into a VARIANT (XML is parsed with PARSE_XML)
  • TO_ARRAY: conversion to array
  • FLATTEN: a table function that takes a VARIANT, OBJECT, or ARRAY column and explodes compound values into multiple rows. Often used in Lateral Joins. Returns a table with a defined set of columns:
    • SEQ: unique sequence number associated with the input record. Not guaranteed to be continuous or ordered
    • KEY: for maps or objects, the key of the exploded value; NULL for arrays
    • PATH: path to the data structure element that is being flattened
    • INDEX: with arrays, index of the element; NULL otherwise
    • VALUE: value of the flattened element
    • THIS: the element being flattened; useful for recursive flattening with RECURSIVE => TRUE
  • IS_<object_type>: type checking
  • AS_<object_type>: type casting
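
A sketch of FLATTEN in a lateral join (the table and keys below are hypothetical):

```sql
-- Explode a JSON array into one row per element
SELECT f.index,
       f.value:name::VARCHAR AS item_name
FROM my_table,
     LATERAL FLATTEN(input => my_table.json_col:items) f;
```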