Can we specify write disposition with the new BigQuery connector? (Answered)
Happy to see the new BigQuery connector available for use.
Is there any way we can specify a write disposition for imports into a table? For example, if I want a batch load job to use the WRITE_TRUNCATE write disposition?
For now I have a workaround: I ingest into a partitioned table, and a scheduled query then creates another table consisting of the latest partition of the former, so that users don't have to worry about dealing with partitions. But this is slightly annoying to maintain, and it'd be great if I could just set a WRITE_TRUNCATE write disposition on the batch load and overwrite the contents of the table on every load.
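For reference, this is the behavior I mean. Here is a minimal sketch of the load-job configuration body the BigQuery REST API accepts (the project, dataset, and table names are placeholders I made up):

```python
# Sketch of a BigQuery REST API load-job configuration (jobs.insert body).
# The project/dataset/table names below are placeholders, not real resources.
def load_job_config(project, dataset, table, write_disposition="WRITE_TRUNCATE"):
    """Build the configuration body for a batch load job."""
    return {
        "configuration": {
            "load": {
                "destinationTable": {
                    "projectId": project,
                    "datasetId": dataset,
                    "tableId": table,
                },
                # WRITE_TRUNCATE overwrites the table contents on every load;
                # the default, WRITE_APPEND, appends to whatever is there.
                "writeDisposition": write_disposition,
                "sourceFormat": "NEWLINE_DELIMITED_JSON",
            }
        }
    }

config = load_job_config("my-project", "analytics", "orders")
```

If the connector exposed just that one `writeDisposition` field, the scheduled-query workaround would be unnecessary.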
Jeremy Nelson I can confirm the new connector doesn't support running queries as jobs. That said, we will register this as an enhancement request and update you once it is rolled out.
Thanks Jeremy Nelson for posting this. Can you confirm whether you're trying to achieve this with the new BigQuery connector or the existing REST API based connector?
I'm trying to set this up using the new BigQuery connector.
Thanks for following up; however, I'm a bit confused. In the connector itself, for the option "How would you like the records imported?", the first option is "Use bulk insert SQL query".
This implies to me a batch insert is done, no?
If so, then this is the "write disposition" option I am referring to that I would like to be able to set: https://cloud.google.com/bigquery/docs/batch-loading-data#appending_to_or_overwriting_a_table
Jeremy Nelson The batch insertion supported today goes through createQueryStream() rather than a query job. This enables loading the bulk data in JSON array format.
I think this would be a nice opportunity to understand the use case in detail. Is it possible to log a ticket through Celigo support so that we can take it forward from there?
I am not an expert with BigQuery, but maybe you could solve this via two flows in integrator.io:
- First flow deletes all data from the table (i.e. SQL -> "DELETE FROM XYZ").
- Second flow reloads the table with fresh data.
You can configure the second flow to run immediately after the first flow. Or, maybe this is no better than the workaround you are already using?
Also, not sure if it helps, but we use Snowflake + integrator.io internally at Celigo, and here is how we generally do ingestion. If you are syncing really large data sets, there are some cool efficiencies here that should translate to BigQuery too.
- First flow is a 'delta' flow, and gets all the updated data from all the different sources being sync'd, and then blindly loads all delta/changed data into a temporary staging table in Snowflake.
- Second flow does 2 separate steps (i.e. via 2 separate bubbles in the flow). The first step/bubble uses a MERGE query to merge all data in the temporary staging table with the permanent production table, and Snowflake is super fast at merging massive data sets (i.e. millions of changes only take seconds to merge). The second step/bubble deletes all data in the staging table so that it is empty for the next flow run.
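To make the merge step concrete, here is a rough sketch of the kind of MERGE the first bubble runs and the cleanup the second bubble runs. The table names, the `id` match key, and the column names are invented for illustration; the real statement depends entirely on your schema:

```python
# Illustrative staging -> production upsert; table/column names and the
# "id" match key are assumptions, not an actual Celigo or Snowflake schema.
def build_merge_sql(prod_table, staging_table, key="id"):
    """Build a MERGE that upserts staged rows into the production table."""
    return f"""
MERGE INTO {prod_table} AS t
USING {staging_table} AS s
ON t.{key} = s.{key}
WHEN MATCHED THEN
  UPDATE SET t.raw_json = s.raw_json, t.updated_at = s.updated_at
WHEN NOT MATCHED THEN
  INSERT ({key}, raw_json, updated_at)
  VALUES (s.{key}, s.raw_json, s.updated_at)
""".strip()

merge_sql = build_merge_sql("prod.netsuite_raw", "staging.netsuite_raw")
# Second bubble: empty the staging table for the next flow run.
cleanup_sql = "DELETE FROM staging.netsuite_raw"
```

The same MERGE syntax works in BigQuery DML, so the pattern should carry over directly.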
It is worth noting that we always sync raw JSON data into Snowflake, and we sync everything into a single table per source system (i.e. NetSuite, Salesforce, MongoDB, etc...), and then we use Snowflake views to unpack the JSON into views per record type. This allows us to be really dynamic and iterative with views, because we don't ever have to sync the full data sets again. I am pretty sure Google BigQuery supports this same paradigm where you load raw JSON and unpack via views.
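In BigQuery terms, the unpack-via-views step might look something like this. This is only a sketch: the view name, source table, and JSON paths are assumptions for illustration, though `JSON_EXTRACT_SCALAR` is a real BigQuery function:

```python
# Sketch of a per-record-type BigQuery view that unpacks raw JSON rows.
# Dataset/table/view names and JSON paths are illustrative only.
create_view_sql = """
CREATE OR REPLACE VIEW analytics.salesforce_accounts AS
SELECT
  JSON_EXTRACT_SCALAR(raw_json, '$.Id')   AS account_id,
  JSON_EXTRACT_SCALAR(raw_json, '$.Name') AS account_name,
  loaded_at
FROM raw.salesforce
WHERE JSON_EXTRACT_SCALAR(raw_json, '$.attributes.type') = 'Account'
""".strip()
```

Because the view is just a query over the raw table, you can revise or add views at any time without re-syncing the source data.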
Lastly, here is a picture of the merge and delete flow if it helps. You can basically use this pattern to run any random SQL stuff in serial order, before or after other flows, etc...