When you create integrations that import data into Snowflake, the recommendation is to select Use optimized bulk load for large volumes. However, if you have limited volumes of data or other factors that prevent importing the data in a single bulk load, select Use batch insert SQL query. With batch insert, unlike optimized bulk load, you must create flows that merge the data in Snowflake and then delete the staged data after the import is complete. (If your flows run slowly, consider increasing the page size, batch size, or concurrency.)
The following integrations show how you can use batch insert to update data in Snowflake: the first example outlines a Shopify-Snowflake integration, and the second outlines a NetSuite-Snowflake integration.
If you already have an Amazon Web Services (AWS) account and use S3 buckets, consider building an integration that uploads the data to an Amazon S3 bucket and then bulk loads the data directly from that bucket into Snowflake. Before you build your flows, see Configuring a Snowflake storage integration to access Amazon S3 and complete any pre-work, such as creating a storage integration and copying the data to the Snowflake staging table. The advantage of this approach is that it considerably reduces your Snowflake compute credit consumption.
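Part of that pre-work is creating a storage integration in Snowflake that grants access to your S3 bucket. The statement below is only a minimal sketch; the integration name and IAM role ARN are placeholders, and the full procedure is described in Configuring a Snowflake storage integration to access Amazon S3.
-- Minimal sketch: a storage integration granting Snowflake access to the bucket.
-- The integration name and role ARN are placeholders; replace them with your own values.
create or replace storage integration my_s3_integration
  type = external_stage
  storage_provider = 'S3'
  enabled = true
  storage_aws_role_arn = 'arn:aws:iam::<account-id>:role/<snowflake-access-role>'
  storage_allowed_locations = ('s3://celigo-labs/');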
Consider the following integration with two flows as a model for building your own: the first flow exports data and writes it to an S3 bucket, and the second flow loads the data from the S3 bucket into Snowflake.
Flow 1
This flow (Shopify orders to S3) exports orders from Shopify using a REST API call and writes them to an S3 bucket.
- The Get orders export step uses a REST API call to the /orders.json endpoint to retrieve a list of orders.
- The Transfer orders import step writes the retrieved data to an S3 bucket named "celigo-labs".
Tip
You should create a file format in Snowflake that can handle the files placed in your S3 bucket. In this example, the first flow placed JSON data into the S3 bucket, and all records were wrapped in an array.
create or replace file format myjsonformat type = 'json' strip_outer_array = true;
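If you query the bucket through an external stage (as the second flow below does), you can attach this file format to a stage created with the storage integration from the pre-work. A minimal sketch, with a hypothetical stage name and bucket path:
-- Hypothetical external stage over the "celigo-labs" bucket, reusing the
-- storage integration and file format defined above.
create or replace stage shopify_orders_stage
  url = 's3://celigo-labs/shopify-orders/'
  storage_integration = my_s3_integration
  file_format = myjsonformat;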
Flow 2
This flow (Snowflake staging merge & clear) retrieves the staged records, merges them into the production table, and then clears the staged data.
- The Select external stage files export step uses an SQL query based on the last updated timestamp to retrieve records from the files in the external S3 stage.
- The Merge import step uses an SQL merge statement to merge data from the S3 stage files into a production table.
- The Remove file from external stage import step uses an SQL remove statement to delete the processed files from the external stage (see the combined SQL sketch after this list).
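The exact SQL for these steps depends on your stage, file format, and table design. The following is a minimal sketch, assuming the hypothetical shopify_orders_stage stage and myjsonformat file format from above and a hypothetical orders_production table keyed on order_id:
-- Export step: read records from the staged JSON files, filtered on the last
-- updated timestamp ({{lastExportDateTime}} is a placeholder for the last run time).
select $1:id::number as order_id,
       $1:updated_at::timestamp_ntz as updated_at,
       $1 as payload
from @shopify_orders_stage (file_format => 'myjsonformat')
where $1:updated_at::timestamp_ntz > '{{lastExportDateTime}}';
-- Merge step: upsert the staged data into the production table.
merge into orders_production p
using (
  select $1:id::number as order_id, $1 as payload
  from @shopify_orders_stage (file_format => 'myjsonformat')
) s
on p.order_id = s.order_id
when matched then update set p.payload = s.payload
when not matched then insert (order_id, payload) values (s.order_id, s.payload);
-- Remove step: delete the processed files from the external stage.
remove @shopify_orders_stage pattern = '.*\.json';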
Note: When you build the integration, if you have to set up a connection to Amazon S3, that connection is counted against your license.
The best practice for updating existing data in Snowflake is to use two separate flows that are linked together:
- The first flow should simply import all changes into a temporary staging table in Snowflake (see the staging table sketch after this list).
- The second flow should merge the staging table with the production table, and then delete all rows from the staging table so that the flows can be run again.
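A minimal sketch of such a staging table, assuming a hypothetical netsuite_staging table that mirrors a hypothetical netsuite_production table; a transient table is used here because the staged data is short-lived:
-- Hypothetical staging table mirroring the production table.
create transient table if not exists netsuite_staging like netsuite_production;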
The following two examples illustrate the flows described above.
Flow 1
The first flow (DF03 - NetSuite to Snowflake) exports data from NetSuite and imports it into a single staging table in Snowflake.
DF03 - NetSuite to Snowflake
Flow 2
The second flow (DF03 -- m&d) first checks if any records exist in the staging table. This is done with a simple dummy query: SELECT * FROM netsuite_staging LIMIT 1
If any records are returned, integrator.io runs the flow to merge the tables together, then deletes all rows in the staging table so that the next time the flow runs, only future changes will be synchronized.
DF03 -- m&d
The following image illustrates the MERGE query used in the above flow. Notice that Use SQL query on first page only is selected since we only want to run the MERGE query once.
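The exact MERGE statement depends on your schema. The following minimal sketch assumes hypothetical netsuite_production and netsuite_staging tables keyed on an internal_id column, followed by the delete that clears the staging table:
-- Merge the staging table into the production table (run on the first page only).
-- Table and column names are placeholders; replace them with your own schema.
merge into netsuite_production p
using netsuite_staging s
  on p.internal_id = s.internal_id
when matched then update set p.name = s.name, p.last_modified = s.last_modified
when not matched then insert (internal_id, name, last_modified)
  values (s.internal_id, s.name, s.last_modified);
-- Clear the staging table so the next run only processes future changes.
delete from netsuite_staging;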
The following image illustrates the dummy query that checks whether there are any records in the staging table. If so, integrator.io exports only a single record to trigger the merge. The single record returned is used only to trigger subsequent steps in the flow; its data is not used for anything else.