Datalake Salesforce Export
I need to set up flows to run on a nightly basis to feed data from our Salesforce HealthCloud instance to our AWS S3 datalake.
Ideally, I would like to pull the delta on objects (e.g., Accounts, Contacts, etc.) nightly, but I am not sure of the best approach to pulling all fields from each object. I haven't used the real-time export before, but from reading the article in the community it seems like it triggers anytime there is an update to the object, rather than letting me schedule it and pull all updates on a nightly basis.
The second approach seems to be writing a large SOQL query, but that doesn't seem ideal either, and it would be cumbersome to update anytime a new field is added to the object.
What would be the best approach to pulling data from objects on a nightly basis and pushing to AWS S3?
Comments
Hi Dave. With respect to the "nightly pull", I would use the scheduling feature to pick when you want it to run and then the "Delta" processing feature of the export step. That should handle the frequency that you want. The node would look something like this.
Unfortunately, Salesforce imposes an odd limit when you use the FIELDS(ALL) directive: it only allows 200 records. The way around that is to write out the SOQL with the fields listed explicitly. Yes, it is cumbersome, given the way they have configured their SQL-like syntax.
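For example (illustrative queries, not anything specific to your org), FIELDS(ALL) only works when the result set is capped at 200 records:
SELECT FIELDS(ALL) FROM Account LIMIT 200
whereas an explicit field list can be paged and filtered normally:
SELECT Id, Name, Phone, LastModifiedDate, SystemModstamp FROM Account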
Dave Guderian you will have to basically do what you did for HubSpot.
While the FIELDS(ALL) method works, it only returns 200 results and won't page through the next 200-400 results, etc. You have to explicitly state all the fields you want, which is why you need to set it up like you have for HubSpot.
Dave Guderian like in this thread: https://docs.celigo.com/hc/en-us/community/posts/20280849745947/comments/20937245390235
Thank you both for the feedback.
Tyler- I was able to follow the steps to get the information to flow 2, but I don't understand how to update the SOQL query with the fields (e.g., Name) that I am fetching in flow 1.
Do I need to change the Salesforce connection again in step 2, or can I use a standard export?
Here is the export from step 2 below, but I'm not sure how to pass in the names from step 1.
Flow 1 Names:
Flow 1 Steps:
Dave Guderian I'll do you one better. Install this integration zip and it should create as many flows as you need in order to sync all of your Salesforce data over to S3. Basically, it has a flow that creates other flows based on the integration settings you put in. It gives you a drop-down of Salesforce objects to select to sync over, automatically gets all the fields for each object, and then updates each SOQL query. As for scheduling, you just schedule the first flow to run, and all the others will run after it.
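To give a rough idea (the field list below is just an example; the real query lists every field the integration finds on the object), a generated export query for Account would look something like:
SELECT Id, Name, Phone, Industry, CreatedDate, LastModifiedDate, SystemModstamp FROM Account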
Tyler-
Thank you so much for taking the time to put this together. I am trying to get it up and running, but I am getting some errors. I'm not sure what the error is indicating, or if perhaps one of the fields is not populating appropriately before hitting the branching.
Here is the step I am receiving the error on.
Here is the error (it looks like the SF ID isn't being populated):
I put a "stop" in just before the branching so I could catch the payload before the branch:
I think I may have figured it out... I changed "sandbox": true and that seemed to work. Any other changes you can think of offhand that I may need to make?
Dave Guderian I guess you probably installed this in sandbox? I hadn't thought about that, but it makes sense it failed since I made this in production. Nothing else I can think of.
Thanks Tyler. Looks like I am running into a similar issue on the second to last step on flow 1, but I'm not sure where to make the change on this step.
Dave Guderian I made some updates and the link should download the new version for you. I should have mapped sandbox everywhere; by not mapping it, it defaults to production, so there was a mismatch in which environment resources were being created.
Thanks Tyler. I just downloaded it again from the link and now it only has the one flow in the .zip file. Is that correct? Also, the only connection was the integrator.io connection.
Dave Guderian 3rd time is the charm. Try the new version in the link above.
Tyler-
You're amazing! This is totally cool! I do have a couple of follow-up questions (as I'm sure you could have guessed)!
Question 1:
I see my file in AWS (so flow 1 worked), but I don't see that flow 2 actually ran (I set it to pull just the ACCOUNT object). When would you expect to see flow 2 run?
Can you elaborate on what is actually driving the placement of the file in AWS? Is it the file key being set within flow 1, or the actual file key on the AWS connection? I am assuming that since flow 2 didn't run, it must be the file key in flow 1 that is driving the placement of the file in AWS.
Lastly, I need to get the files going to the right overall spot in AWS! Is this the correct syntax to use based on what we discussed in office hours?
raw/source=salesforce-health-cloud/table={{settings.flow.object}}/year={{dateFormat "YYYY" date}}/month={{dateFormat "MM" date}}/day={{dateFormat "DD" date}}/account_{{timestamp "YYYY-MM-DDThhmmss" "UTC"}}Z.json
You're the best, Tyler. Thanks again for all the help on this!
Dave Guderian all the generated flows should run after the first flow. The last step in the first flow triggers the generated flows to run. Maybe run the first flow again and see if it goes now? I've been testing fresh installs and mine are running. Maybe debug that import step and see what the responses are.
The first flow is updating and creating the exports, imports, and flows, so it is driving what the import resource looks like. All it's doing is populating the file key field that you see in the UI on the S3 import step.
For your file path, I would have something like this:
Update import:
{{record._PARENT.s3Path}}/table={{lowercase record._PARENT.object}}/year=\{{timestamp \"YYYY\"}}/month=\{{timestamp \"MM\"}}/day=\{{timestamp \"DD\"}}/{{lowercase record._PARENT.object}}_\{{timestamp \"YYYY-MM-DDThhmmss[Z]\"}}.json
Create import:
{{record.s3Path}}/table={{lowercase record.object}}/year=\{{timestamp \"YYYY\"}}/month=\{{timestamp \"MM\"}}/day=\{{timestamp \"DD\"}}/{{lowercase record.object}}_\{{timestamp \"YYYY-MM-DDThhmmss[Z]\"}}.json
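The unescaped handlebars (like {{record._PARENT.s3Path}} and the object name) should resolve while the first flow is creating or updating the import, while the backslash-escaped ones (like \{{timestamp \"YYYY\"}}) pass through as literals so they resolve each time the generated flow actually runs; the \" sequences are presumably just quote escaping from the request body. For illustration only, a resolved key would end up looking something like raw/source=salesforce-health-cloud/table=account/year=2025/month=07/day=14/account_2025-07-14T013000Z.json.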
Tyler- This did work correctly. I just didn't wait long enough, apparently, to see the additional flow come up (it created a new Account flow that ran as flow 2). I also used the handlebars/syntax you suggested and the files are landing in the appropriate places in S3, so everything appears to be working correctly from that standpoint.
That said, I did have one error pop up when flow 1 ran for one of the objects. Is this because there are too many fields being pulled in the SOQL, and if so, is there a workaround for this?
Dave Guderian give the new version a shot.
Thanks Tyler. That worked great and the flow ran without any errors.
Last question (I think! 😁). What changes (if any) do I need to make to the flows for the initial batch upload (i.e., all current/historical data in the system), and what changes (if any) do I need to make for the incremental nightly loads?
I can't say thanks enough for the time you put into this. I know other customers will benefit from this as well if they are moving data from SF to AWS!
Dave Guderian when the first flow runs, it will run the flows that it created and updated. Each export on each flow is analyzed to determine whether it should be an export type of delta or an export type of all. I treat an export as delta if any of the following fields are on the Salesforce object: ["SystemModstamp","LastModifiedDate","CreatedDate","LoginTime","Timestamp","UpdatedDate"]. If none of those fields are on the object, it gets set to export type = all and will not run on a delta. When a new flow is created for an object that supports a delta export, the flow will run and pull data going back to 2000. On subsequent runs, it will run off of delta data. If you want to trigger a resync of all tables, you can use the new checkbox I added to the integration settings; checking it and saving the form will trigger all flows to run and pull data again back to 2000. To get the new version, use the above link again.
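As a rough illustration (not the literal query the export stores, and the field list is just an example), the initial run for a delta-capable object behaves like a query bounded by that 2000 start date, and each nightly run then swaps in the timestamp of the previous successful run:
SELECT Id, Name, SystemModstamp FROM Account WHERE SystemModstamp > 2000-01-01T00:00:00Z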
Tyler-
I am consistently getting errors with one of my objects. Is the integration looking at all fields on the object, or is there maybe a filter that would need to be applied to only pull a field if it's active? I'm just not clear on what this error is indicating (note it's on the second flow, which imports into AWS S3).
Dave Guderian is that field deprecated in Salesforce? I don't have any field filters now, but it could be easily added. I'm curious what is different about that specific field that makes it ineligible to query.
Tyler - I believe it's because of the child relationship. For example, in order to return values I have to query:
SELECT Id, (SELECT Id FROM RenewedInsurancePolicies) FROM InsurancePolicy
Dave Guderian the screenshot you sent doesn't match up to the error. The error says it's an issue with a field named "CoverageCode".
That's my fault, Tyler. This is the correct error.
What's the API name for that field? I have a similar field set up and it works fine. Here is my field from the describe lookup call. Can you do the describe call for this object and send me what the metadata for that field looks like?
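(For reference, the object describe metadata can be pulled from Salesforce's standard REST endpoint, for example /services/data/v58.0/sobjects/InsurancePolicy/describe; the API version shown is just an example.)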
Will this work? I used a slightly different method... not as pretty, I know. Let me know, and I can do the describe call if not.
Dave Guderian can we cover in office hours next week? I'm not too sure here.