Snowflake Bulk vs. Once Per Record Query Difference

Hello everyone,

I would greatly appreciate your assistance in clarifying when to utilize bulk queries versus once per record queries.

My flow is supposed to run every 30 minutes, and I may encounter scenarios where there are more than 50 records or none at all. I'm unsure about the criteria for deciding which type of query to use in such cases.

Additionally, I'm curious if the selection of query type is determined solely by the quantity of data or if it depends on other factors. For instance, should we use the once per record query for custom queries and the bulk query for automatically generated queries?

Thank you in advance for your guidance and insights.


Comments

3 comments
  • Yash Kumar you've just about hit the nail on the head. Bulk insert is great when you have thousands or millions of records to push into Snowflake: it's more efficient (up to 1 MB of data per insert) and thus consumes fewer credits in the long run, you don't have to write any sql, and you can use the mapper that you're used to. However, with bulk insert you most likely need an additional flow to run your merge query so that you can dedup your data.

    Outside of the bulk insert option, you have "run once per page of records" and "run once per record". These two options require you to write some sql, but they let you avoid stringing two flows together, since the sql can just be a merge query that inserts and updates your data. Of those two, running once per page is preferred because you can run the query once for the entire page of data (the default is 20 records per page) instead of having to run a query for each individual record.

    Bulk insert
    Pros:

    • Supports dropdown style selection for table names
    • Mapper support with drop down of available columns for the chosen table
    • No sql required for inserting
    • More performant for getting data into Snowflake quickly, thus reducing warehouse compute time. Can insert up to 1 MB of data per query.

    Cons:

    • You most likely need another flow to run your merge query so you can dedup your data.
    • If an error occurs, it will fail the entire batch sent to Snowflake, since the entire query has to succeed for Snowflake to accept it (this is typical for any database, though)

    Typical setup:

    merge into {{{settings.flowGrouping.productionDatabase}}}.{{{settings.flowGrouping.productionSchema}}}.{{{settings.flowGrouping.productionTable}}} as t using (
        select
            *
        from {{{settings.flowGrouping.stagingDatabase}}}.{{{settings.flowGrouping.stagingSchema}}}.{{{settings.flowGrouping.stagingTable}}} as s
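        -- dedup: keep only the most recent staged row per primarykey/resource pair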
        qualify 
            row_number() over (partition by s.primarykey||s.resource order by _updated_at desc) = 1) as s on s.primarykey = t.primarykey and s.resource = t.resource
    when matched and s._updated_at > t._updated_at then 
        update set 
            t.data = s.data
            ,t._updated_at = s._updated_at
    when not matched then 
        insert 
            (primarykey 
             ,resource
             ,data
             ,_updated_at
             ,_created_at
            ) 
        values 
            (s.primarykey
             ,s.resource
             ,s.data
             ,s._updated_at
             ,s._created_at
            )
    ;
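
    For reference, here is a rough sketch of the staging/production table shape the queries in this comment assume. The names are just the settings placeholders from above, and the variant type for the data column is an assumption based on the parse_json usage further down, so adjust it to your own schema:

    create table if not exists {{{settings.flowGrouping.stagingDatabase}}}.{{{settings.flowGrouping.stagingSchema}}}.{{{settings.flowGrouping.stagingTable}}} (
        primarykey varchar
        ,resource varchar
        ,data variant
        ,_updated_at timestamp_ntz
        ,_created_at timestamp_ntz
    );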



    Run once per page

    Pros:

    • One flow can be made to get data from source to Snowflake
    • One query can be made for the entire page of data rather than for each single record, cutting the number of query calls to Snowflake by a factor of 20 (the default page size); see the rendered example after the query below

    Cons:

    • Requires sql to be written
    • Slower to get a lot of data into Snowflake
    • If an error occurs, it will fail the entire batch sent to Snowflake, since the entire query has to succeed for Snowflake to accept it (this is typical for any database, though)

    Typical setup:

    merge into {{{settings.flowGrouping.productionDatabase}}}.{{{settings.flowGrouping.productionSchema}}}.{{{settings.flowGrouping.productionTable}}} as t using (
        {{#each batch_of_records}}
        {{#if @last}}
        select 
          {{record.id}} as primarykey
          ,{{record.resource}} as resource
          ,parse_json({{jsonSerialize record.data}}) as data
          ,{{timestamp}} as _updated_at
          ,{{timestamp}} as _created_at
        {{else}}
        select 
          {{record.id}} as primarykey
          ,{{record.resource}} as resource
          ,parse_json({{jsonSerialize record.data}}) as data
          ,{{timestamp}} as _updated_at
          ,{{timestamp}} as _created_at
        
        union all
        
        {{/if}}
        {{/each}}
        ) as s on s.primarykey = t.primarykey and s.resource = t.resource
    when matched and s._updated_at > t._updated_at then 
        update set 
            t.data = s.data
            ,t._updated_at = s._updated_at
    when not matched then 
        insert 
            (primarykey 
             ,resource
             ,data
             ,_updated_at
             ,_created_at
            ) 
        values 
            (s.primarykey
             ,s.resource
             ,s.data
             ,s._updated_at
             ,s._created_at
            )
    ;
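
    To make the handlebars above a bit more concrete, this is roughly what the source subquery renders to for a page of two records. The values are made up purely for illustration, and the exact quoting will depend on how your handlebars expressions resolve:

    select 
        101 as primarykey
        ,'order' as resource
        ,parse_json('{"id": 101}') as data
        ,'2024-01-01 00:00:00' as _updated_at
        ,'2024-01-01 00:00:00' as _created_at

    union all

    select 
        102 as primarykey
        ,'order' as resource
        ,parse_json('{"id": 102}') as data
        ,'2024-01-01 00:00:00' as _updated_at
        ,'2024-01-01 00:00:00' as _created_at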


    Run once per record

    Pros:

    • Great to use for single step processing like running merge queries, table creates, table drops, clearing stage data, etc. (see the stage-clearing sketch after the query below)
    • If an error occurs, only that single record fails, rather than the whole page/batch of records

    Cons:

    • Will be the slowest of all the options if you're trying to get a lot of data in
    • Requires sql

    Typical setup:

    Same as run once per page, but with this query.

    merge into {{{settings.flowGrouping.productionDatabase}}}.{{{settings.flowGrouping.productionSchema}}}.{{{settings.flowGrouping.productionTable}}} as t using (
        select 
          {{record.id}} as primarykey
          ,{{record.resource}} as resource
          ,parse_json({{jsonSerialize record.data}}) as data
          ,{{timestamp}} as _updated_at
          ,{{timestamp}} as _created_at
        ) as s on s.primarykey = t.primarykey and s.resource = t.resource
    when matched and s._updated_at > t._updated_at then 
        update set 
            t.data = s.data
            ,t._updated_at = s._updated_at
    when not matched then 
        insert 
            (primarykey 
             ,resource
             ,data
             ,_updated_at
             ,_created_at
            ) 
        values 
            (s.primarykey
             ,s.resource
             ,s.data
             ,s._updated_at
             ,s._created_at
            )
    ;
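
    As an aside, the "clearing stage data" use case mentioned in the pros above is typically just another run once per record step after the merge, for example something along these lines (reusing the same settings placeholders, and assuming your staged data lives in a table rather than a named stage):

    truncate table {{{settings.flowGrouping.stagingDatabase}}}.{{{settings.flowGrouping.stagingSchema}}}.{{{settings.flowGrouping.stagingTable}}};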
  • Thanks a lot Tyler!

    I was also hoping to clarify one more point: if I have to perform both an add and an update on the same table, does that require two separate steps?

  • Yash Kumar, the merge query for Snowflake handles both inserts and updates, so you can use it like I showed above. The recommended approach is to have one flow that bulk inserts data into a staging table, then have another flow triggered to run afterwards that merges the staged data into some final production table.

