Can we specify write disposition with the new BigQuery connector?

Answered

Comments

6 comments

  • Official comment
    Bhavik Shah Senior Product Manager

    Jeremy Nelson I can confirm the new connector doesn't support running queries as jobs. That said, we will register this as an enhancement request and update you once it is rolled out.

  • Bhavik Shah Senior Product Manager

    Thanks Jeremy Nelson for posting this. Can you confirm whether you're trying to achieve this with the new BigQuery connector or the existing REST API based connector?

  • Jeremy Nelson

    I'm trying to set this up using the new BigQuery connector.

  • Jeremy Nelson

    Thanks for following up; however, I'm a bit confused. In the connector itself, the option "How would you like the records imported?" lists "Use bulk insert SQL query" as its first choice.

    This implies to me that a batch insert is being done, no?

    If so, then this is the "write disposition" option I am referring to and would like to be able to set: https://cloud.google.com/bigquery/docs/batch-loading-data#appending_to_or_overwriting_a_table
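
    For context, here is roughly what I mean, shown with the plain Node.js client rather than the connector. This is only a sketch, and my_dataset / my_table / rows.json are placeholder names:

        import {BigQuery} from '@google-cloud/bigquery';

        const bigquery = new BigQuery();

        // Batch-load a newline-delimited JSON file and choose the write disposition:
        // WRITE_APPEND adds to the existing table, WRITE_TRUNCATE overwrites it.
        async function loadWithDisposition() {
          const [job] = await bigquery
            .dataset('my_dataset')
            .table('my_table')
            .load('./rows.json', {
              sourceFormat: 'NEWLINE_DELIMITED_JSON',
              writeDisposition: 'WRITE_TRUNCATE',
            });
          console.log(`Load job ${job.id} finished`);
        }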

  • Bhavik Shah Senior Product Manager

    Jeremy Nelson The batch insertion supported today goes through createQueryStream() instead of a query job. This lets you load the bulk data in JSON array format.

    https://googleapis.dev/nodejs/bigquery/latest/RowQueue.html#insert
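
    For reference, here is a rough sketch of what inserting a JSON array of rows looks like with the Node.js client; the dataset, table, and row fields are placeholders:

        import {BigQuery} from '@google-cloud/bigquery';

        const bigquery = new BigQuery();

        // Placeholder rows -- just to illustrate the JSON array shape.
        const rows = [
          {id: 1, name: 'first record'},
          {id: 2, name: 'second record'},
        ];

        async function insertRows() {
          // Row-queue style insert of the whole array.
          await bigquery.dataset('my_dataset').table('my_table').insert(rows);
          console.log(`Inserted ${rows.length} rows`);
        }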

    I think this would be a nice opportunity to understand the use case in detail. Could you log a ticket through Celigo support so we can take it forward from there?

  • Scott Henderson CTO

    Hey Jeremy,

    I am not an expert with BigQuery, but maybe you could solve this via two flows in integrator.io:

    1. First flow deletes all data from the table (i.e. SQL -> "delete from XYZ").
    2. Second flow reloads the table with fresh data.

    You can configure the second flow to run immediately after the first flow. Or, maybe this is no better than the workaround you are already using?
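
    If it helps, here is a rough sketch of the kind of statement the first flow would run, shown through the Node.js client with a placeholder table name (note that BigQuery's DELETE requires a WHERE clause, or you can use TRUNCATE TABLE instead):

        import {BigQuery} from '@google-cloud/bigquery';

        const bigquery = new BigQuery();

        // Flow 1: clear the target table before the reload flow runs.
        // my_dataset.XYZ is a placeholder; WHERE TRUE satisfies BigQuery's
        // requirement that DELETE statements carry a WHERE clause.
        async function clearTable() {
          await bigquery.query({query: 'DELETE FROM my_dataset.XYZ WHERE TRUE'});
        }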

    Also, not sure if it helps, but we use Snowflake + integrator.io internally at Celigo, and here is how we generally do ingestion. If you are syncing really large data sets, there are some cool efficiencies here that should translate to BigQuery too.

    1. The first flow is a 'delta' flow: it gets all the updated data from the different sources being synced and blindly loads all delta/changed data into a temporary staging table in Snowflake.
    2. The second flow does 2 separate steps (i.e. via 2 separate bubbles in the flow). The first step/bubble uses a MERGE query to merge all data in the temporary staging table into the permanent production table, and Snowflake is super fast at merging massive data sets (i.e. millions of changes only take seconds to merge). The second step/bubble deletes all data in the staging table so that it is empty for the next flow run (see the sketch below).
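
    Translating that second flow to BigQuery, the two bubbles would run something along these lines. This is only a sketch: my_dataset, orders, orders_staging, id, and payload are placeholder names.

        import {BigQuery} from '@google-cloud/bigquery';

        const bigquery = new BigQuery();

        async function mergeAndClearStaging() {
          // Bubble 1: merge everything in the staging table into the production table.
          await bigquery.query({
            query: `
              MERGE my_dataset.orders AS target
              USING my_dataset.orders_staging AS source
              ON target.id = source.id
              WHEN MATCHED THEN
                UPDATE SET payload = source.payload
              WHEN NOT MATCHED THEN
                INSERT (id, payload) VALUES (source.id, source.payload)`,
          });

          // Bubble 2: empty the staging table so it is ready for the next flow run.
          await bigquery.query({query: 'TRUNCATE TABLE my_dataset.orders_staging'});
        }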

    It is worth noting that we always sync raw JSON data into Snowflake, and we sync everything into a single table per source system (i.e. NetSuite, Salesforce, MongoDB, etc.). We then use Snowflake views to unpack the JSON into views per record type. This lets us be really dynamic and iterative with views, because we don't ever have to sync the full data sets again. I am pretty sure Google BigQuery supports this same paradigm where you load raw JSON and unpack it via views.
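
    In BigQuery the view-per-record-type idea would look roughly like the sketch below. The raw table (netsuite_raw), its payload column, and the JSON paths are all made-up placeholders:

        import {BigQuery} from '@google-cloud/bigquery';

        const bigquery = new BigQuery();

        // Create a view that unpacks scalar fields from the raw JSON payload,
        // so the full data set never has to be synced again just to reshape it.
        async function createCustomerView() {
          await bigquery.query({
            query: `
              CREATE OR REPLACE VIEW my_dataset.customers AS
              SELECT
                JSON_VALUE(payload, '$.id')    AS id,
                JSON_VALUE(payload, '$.email') AS email,
                JSON_VALUE(payload, '$.name')  AS name
              FROM my_dataset.netsuite_raw
              WHERE JSON_VALUE(payload, '$.recordType') = 'customer'`,
          });
        }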

    Lastly, here is a picture of the merge-and-delete flow if it helps. You can basically use this pattern to run any random SQL in serial order, before or after other flows, etc.

