Combined with “external ID” fields, Salesforce’s “ upsert ” data-loading option is a godsend for dumping data from legacy systems into Salesforce. But what do you do if you only want to insert new records or only want to update existing records, using the external ID as an inter-system matching key, considering Salesforce’s “insert” and “update” operations don’t support matching on external IDs?
The Jitterbit ETL tool makes it relatively easy to use a SOQL query ahead of an insert
or update
operation and use data from the query to filter which “upload candidate” records actually get sent to Salesforce. The trick is to write code in Jitterbit’s scripting language.
TL;DR for experienced Jitterbit programmers
This is a step-by-step tutorial showing how to write small bits of code that use the Jitterbit execution environment’s memory to save a hashmap of IDs from both sides of an ETL process that’s about to run, and then to use your hashmap at data-upload-time on a record-by-record basis.
Skip to the parts that mention “<trans>
”, the “Operation Script” sections, and the pro-tips / further considerations sections for the goodies.
Today’s Data
In My Source Database
My “data source” is an ordinary Oracle database whose People table looks like this when I run the following SQL query:
SELECT Id_Column, First_Name, Last_Name, Phone_Number, Email_Address
FROM People
Id_Column | First_Name | Last_Name | Phone_Number | Email_Address |
---|---|---|---|---|
ERP10006 | Darweesh | Daher | 444-444-4444 | 444@example.com |
ERP17685 | Helen | Hopper | 888-888-8888 | 888@example.com |
ERP33014 | Grace | Gao | 777-777-7777 | 777@example.com |
ERP71892 | Cathy | Combs | 333-333-3333 | 333@example.com |
In Salesforce
My “data destination” is a Salesforce org whose Contact table looks like this when I run the following SOQL query:
SELECT Id, Name, My_External_Id__c, Phone, Email
FROM Contact
Id | Name | My_External_Id__c | Phone | |
---|---|---|---|---|
003010101010101 | Anush Amjit | ERP28190 | ||
003020202020202 | Benita Borges | |||
003030303030303 | Cathy Combs | ERP71892 | ||
003040404040404 | Darweesh Daher | ERP10006 | ||
003050505050505 | Ezra Ellis | |||
003060606060606 | Frances Fulvia |
Yes, Someone’s Missing
Note that even though Salesforce believes there’s a record in my “external data source” with ID ERP28190
named Amjit Anush, it seems that Amjit has been deleted from my external database.
Dealing with that data integrity issue is another problem for another Jitterbit post.
ETL Business Rules
Today we’re going to deal with some pretty unrealistic business rules, for example’s sake.
Update
- Whenever someone exists in both databases, we want to feed their Phone Number into Salesforce.
- We’ll leave First Name, Last Name, and E-Mail Address alone upon record updates.
In other words, if we were to do an update right now, we’d expect to populate Cathy’s and Darweesh’s phone numbers but not their e-mail addresses.
Insert
- Whenever someone exists in Oracle but not in Salesforce, we want to feed their First Name, Last Name, External ID, and E-Mail Address into Salesforce.
- We’ll leave Phone Number alone on upon record inserts.
In other words, if we were to do an insert right now, we’d expect to add new records for Grace and Helen to Salesforce (both records having blank phone numbers but non-blank e-mail addresses in Salesforce).
Jitterbit: Creating A Dictionary
What The Dictionary Will Look Like
Ultimately, what we’re going to do today is run the following SOQL query:
SELECT Id, My_External_Id__c
FROM Contact
WHERE My_External_Id__c <> NULL
Which will give us data that looks like this:
Id | My_External_Id__c |
---|---|
003010101010101 | ERP28190 |
003030303030303 | ERP71892 |
003040404040404 | ERP10006 |
We’ll transform and store that data into a data structure called a dictionary for use as a “cache” of values we can reference within Jitterbit without running another SOQL query.
The contents of our dictionary will look like this:
ERP10006
:003040404040404
ERP28190
:003010101010101
ERP71892
:003030303030303
Note: If you’ve ever done Apex programming within Salesforce, you may have written similar code where you looped over the results of a SOQL query and used them to fill a Map<Id,Id>
with values for later reference.
Declare A Variable
First, declare a Jitterbit “global variable.” I’m going to call mine “existingContactIdsByExtId
.”
Tell Jitterbit How To Log Into Salesforce
Before I write any code to populate existingContactIdsByExtId
, I need to tell Jitterbit how to talk to my Salesforce, so I’ll create a “Salesforce Org” object in Jitterbit and give it my login credentials:
Populate The Variable
Now I’ll create a new “Script” and call it “ Lazy-Populate existingContactIdsByExtId.”
Here’s the Jitterbit-script code I’ll put into it:
<trans>
If( IsNull( $existingContactIdsByExtId) || $existingContactIdsByExtId == "" ,
WriteToOperationLog( "$existingContactIdsByExtId was null -- building it" );
arr = SfLookupAll(
"<TAG>Salesforce Orgs/my_salesforce_login@example.com</TAG>"
,"SELECT My_External_Id __c, Id FROM Contact WHERE My_External_Id__ c <> null"
);
$existingContactIdsByExtId = Dict();
i = 0;
While( i < Length(arr) ,
$existingContactIdsByExtId[arr[i][0]] = arr[i][1];
i = i + 1;
);
);
</trans>
Code Notes
Lazy Code
Note that this script is “lazy.”
That means that if we have already put data into existingContactIdsByExtId
during the runtime of a Jitterbit operation, a 2nd+ run of this script would skip running all code within the If()
statement.
What Our Data Looks Like
Although Jitterbit doesn’t show us what the contents of the variable arr
look like when we test-run our script, here’s what it would look like if we could get it to display (I have my tricks…):
{{ERP28190, 003010101010101}, {ERP71892, 003030303030303}, {ERP10006, 003040404040404}}
And existingContactIdsByExtId
looks like this:
[ERP10006=>"003040404040404",ERP28190=>"003010101010101",ERP71892=>"003030303030303"]
Jitterbit: Updating Salesforce
Update Operation
Next, I used Jitterbit to create an ordinary Salesforce Update operation that queries my Oracle database as a source.
I’ll trust that you know how to do that in Jitterbit (or are comfortable figuring out how to do it … or work with someone who is) and are here to learn my “caching” trick.
I named my operation “ Update Contacts By External ID.”
Transformation
Mapping: Phone
Once you’ve set that up, in your Salesforce operation details screen, click the “Edit” button by “Mappings” to bring up the transformation involved in the operation (for me, it’s called “ Update Contacts By External ID Request “).
Drag Phone_Number
at left onto Phone
at right.
Condition (execute vs. skip record)
Now we have to tell Jitterbit to detect that Grace and Helen aren’t in Salesforce and skip over trying to update their Salesforce records.
Right-click the [E*] Contact
folder at top and click “ Add Condition
.”
A new node will appear at right just below the “Contact” folder called “ Condition
.”
Double-click it. A “Formula Builder” window will appear for editing the node named “ root$transaction.body$update$Contact.Condition
.”
At left, between the <trans>
and </trans>
tags, type:
HasKey($existingContactIdsByExtId, ID_COLUMN);
(Note that my Oracle database upper-cases column names when queried, so what I was previously calling Id_Column
needs to be referred to in this “Jitterbit transformation formula” as ID_COLUMN
.)
If I click the “Test” button, I can see that the result is 0
(false), indicating that ID ERP10006
(Darweesh Daher) is in existingContactIdsByExtId
.
Hmmm. That’s not right. Darweesh actually is in my Salesforce Contact table with ERP10006
on file.
But the Jitterbit “Test” button doesn’t know that. As far as it’s concerned, existingContactIdsByExtId
is just an empty variable.
To test our HasKey()
code and ensure that we wrote it correctly, we can add 2 lines to the “Condition” formula so that it runs our script and populates existingContactIdsByExtId
with data:
$existingContactIdsByExtId = '';
RunScript("<TAG>Scripts/Lazy-Populate existingContactIdsByExtId</TAG>");
HasKey($existingContactIdsByExtId, ID_COLUMN);
That’s better: now the answer is 1
(true). This means that that Darweesh’s record will be processed when we run our Salesforce Update operation.
Just keep in mind that we don’t actually want to leave those extra $existingContactIdsByExtId = ''
or RunScript()
lines of code in our condition once we’re done testing it.
It’s wasteful of Jitterbit’s CPU time (and hence our real-world time) – particularly if we have a lot of rows in our database query.
There’s no real-world need to try to “populate existingContactIdsByExtId
” every time we try to load a new row of data from our database into Salesforce. It’s much better off being done just before the entire “Update” operation kicks off as a whole.
Next, in the upper-right area, I’m going to click the little “1 of 4” next to the “Manual SQL” folder under the “Source” folder, which should flip it to say “2 of 4” and show ID ERP17685
’s (Helen Hopper’s) data instead.
I’ll hit “Test” once again to validate:
Satisfied that my condition lets Darweesh through but skips over Helen, I’ll delete my “testing-only” lines of code so that my formula simply reads as follows:
<trans>
HasKey($existingContactIdsByExtId, ID_COLUMN);
</trans>
I click “OK” to close the formula editor, then immediately click the “Save” button at top right area of my “Transformation” editor pane.
Mapping: ID
I have one more job to do before I can save, deploy, and close my “Transformation”: since this is an Update operation, I need to specify a Contact ID for each row that I’m updating in Salesforce.
My Oracle database has no idea what the answer to this question is (it just thinks of Darweesh as ERP10006
), but luckily, Jitterbit knows.
It just has to ask existingContactIdsByExtId
what the Salesforce Contact ID for ERP10006
is (in this case, 003040404040404
).
I double-click on “Id” at right to bring up its formula editor, setting the code between the <trans>
and </trans>
tags to be:
$existingContactIdsByExtId[ID_COLUMN];
(I clicked “# of 4” repeatedly until it looped back around to “1 of 4” again so that I could see test data with Darweesh, but as you can see, Jitterbit isn’t happy, because it’s convinced that the contents of existingContactIdsByExtId
is a bunch of blank text _(a “string”).
As before, I can add 2 quick $existingContactIdsByExtId = '';
and RunScript("<TAG>Scripts/Lazy-Populate existingContactIdsByExtId</TAG>");
lines of code beforehand that I won’t keep in production to make sure things work.
Operation Script
We need to add one more thing to our Salesforce Update Operation before it will work: we need to ensure that every time it runs, the first thing it does is execute the “ Lazy-Populate existingContactIdsByExtId ” script.
From your Salesforce Update Operation’s configuration screen, click the “Edit” button next to “Operation.”
Right-click on the node of that Operation representing your data source (mine is called “My Oracle Database” because I’m creative like that) and click “ Insert Before This ,” next clicking “ Script.”
Right-click on the new leftmost node called “Script” and click “Select Existing Script.”
Choose “ Lazy-Populate existingContactIdsByExtId ” and click “OK.”
Drag the leftmost node around until things look pretty, then use the “Save” and “Deploy” icons in your upper-right corner of this operation’s editor tab to ensure that your work isn’t lost. It should look something like this:
Validation
Let’s try running our Operation. Find it at left under “ Jitterbit Connect ” -> “ Salesforce ” -> “ Salesforce Updates ” -> (whatever you named it), right-click on it, and click “ Run Update.”
Hooray – it works!
When I run the following SOQL query:
SELECT Id, Name, My_External_Id__c, Phone, Email
FROM Contact
My data looks like this:
Id | Name | My_External_Id__c | Phone | |
---|---|---|---|---|
003010101010101 | Anush Amjit | ERP28190 | ||
003020202020202 | Benita Borges | |||
003030303030303 | Cathy Combs | ERP71892 | 333-333-3333 | |
003040404040404 | Darweesh Daher | ERP10006 | 444-444-4444 | |
003050505050505 | Ezra Ellis | |||
003060606060606 | Frances Fulvia |
Jitterbit: Inserting Into Salesforce
Insert Operation
Time to try inserting.
I used Jitterbit to create an ordinary Salesforce Insert operation that queries my Oracle database as a source.
Again, I’ll trust that you know how to do this in Jitterbit (or have a plan to figure it out).
I named my operation “ Insert Contacts By External ID.”
Transformation
Mapping: Name, Phone, External ID
Once you’ve set that up, in your Salesforce operation details screen, click the “Edit” button by “Mappings” to bring up the transformation involved in the operation (for me, it’s called “ Insert Contacts By External ID Request “).
I dragged ID_COLUMN
onto My_External_Id__c
, FIRST_NAME
onto FirstName
, LAST_NAME
onto LastName
, and EMAIL_ADDRESS
onto Email
. No surprises there.
Condition (execute vs. skip record)
Now I have to do the same kind of “ Add Condition
” steps that I did above for my Update transformation, only the formula is going to be different.
In this case, we want:
<trans>
!HasKey($existingContactIdsByExtId, ID_COLUMN);
</trans>
The difference is the exclamation point before the HasKey
– it serves as a “not” and flips the true/false results of HasKey
.
Of course, we can test with the following script:
<trans>
$existingContactIdsByExtId = '';
RunScript("<TAG>Scripts/Lazy-Populate existingContactIdsByExtId</TAG>");
!HasKey($existingContactIdsByExtId, ID_COLUMN);
</trans>
Now we can see that Darweesh returns 0 (false) and Helen returns 1 (true). That’s perfect: we want to insert Helen but not Darweesh.
We’ll save and deploy our transformation (making sure we took any test lines of code out of our “Condition” formula).
(There’s no ID mapping)
Note that we don’t have to play any games with the Salesforce Id
field on an Insert the way we did on an Update
operation. We’ll just leave it un-mapped.
Operation Script
As with Update, need to add one more thing to our Salesforce Insert Operation before it will work: we need to ensure that every time it runs, the first thing it does is execute the “ Lazy-Populate existingContactIdsByExtId ” script.
From your Salesforce Insert Operation’s configuration screen, click the “Edit” button next to “Operation.”
Right-click on the node of that Operation representing your data source and click “ Insert Before This ,” next clicking “ Script.”
Right-click on the new leftmost node called “Script” and click “Select Existing Script.”
Choose “ Lazy-Populate existingContactIdsByExtId ” and click “OK.”
Drag the leftmost node around until things look pretty, then use the “Save” and “Deploy” icons in your upper-right corner of this operation’s editor tab to ensure that your work isn’t lost. It should look something like this:
Validation
Let’s try running our Operation. Find it at left under “ Jitterbit Connect ” -> “ Salesforce ” -> “ Salesforce Inserts ” -> (whatever you named it), right-click on it, and click “ Run Insert.”
It works again. Yay.
When I run the following SOQL query:
SELECT Id, Name, My_External_Id__c, Phone, Email
FROM Contact
My data looks like this:
Id | Name | My_External_Id__c | Phone | |
---|---|---|---|---|
003010101010101 | Anush Amjit | ERP28190 | ||
003020202020202 | Benita Borges | |||
003030303030303 | Cathy Combs | ERP71892 | 333-333-3333 | |
003040404040404 | Darweesh Daher | ERP10006 | 444-444-4444 | |
003050505050505 | Ezra Ellis | |||
003060606060606 | Frances Fulvia | |||
003050505050505 | Grace Gao | ERP33014 | 777@example.com | |
003060606060606 | Hopper Helen | ERP17685 | 888@example.com |
Fix A Typo
Oh wait – there’s a typo.
My database had Helen’s first and last name flipped.
(Did you all notice this whole time? Am I the only person who didn’t notice in any of my previous screenshots about her? 😳)
Can I Just Update?
All right – I just fixed her data in my source database.
In the real world, I’d probably have my UPDATE on a daily schedule, so Grace’s and Helen’s phone numbers are likely to end up in Salesforce by tomorrow, anyway.
How about I just manually run another update now?
…*Drumroll*…
Id | Name | My_External_Id__c | Phone | |
---|---|---|---|---|
003010101010101 | Anush Amjit | ERP28190 | ||
003020202020202 | Benita Borges | |||
003030303030303 | Cathy Combs | ERP71892 | 333-333-3333 | |
003040404040404 | Darweesh Daher | ERP10006 | 444-444-4444 | |
003050505050505 | Ezra Ellis | |||
003060606060606 | Frances Fulvia | |||
003050505050505 | Grace Gao | ERP33014 | 777-777-7777 | 777@example.com |
003060606060606 | Hopper Helen | ERP17685 | 888-888-8888 | 888@example.com |
*Womp Womp*
Grace and Helen now have phone numbers, but Helen’s name isn’t fixed … because my business rules I implemented said not to change anything but “phone number” in Salesforce upon update.
2 Options To Actually Fix It
At this point, I can either:
- Fix Helen manually in Salesforce, or
- Delete her from Salesforce and run the INSERT again.
I’m going to go with the latter for funsies.
…*Drumroll*…
Id | Name | My_External_Id__c | Phone | |
---|---|---|---|---|
003010101010101 | Anush Amjit | ERP28190 | ||
003020202020202 | Benita Borges | |||
003030303030303 | Cathy Combs | ERP71892 | 333-333-3333 | |
003040404040404 | Darweesh Daher | ERP10006 | 444-444-4444 | |
003050505050505 | Ezra Ellis | |||
003060606060606 | Frances Fulvia | |||
003050505050505 | Grace Gao | ERP33014 | 777-777-7777 | 777@example.com |
003060606060606 | Helen Hopper | ERP17685 | 888@example.com |
That’s better.
Lesson Learned?
- Q: Is there a moral to this mistake?
-
A: Maybe …
- Don’t double your work and break out a business process into “insert” and “update” operations where “upsert” will do?
- (That is, if your field-by-field mappings are the same for each process.)
I really painted myself into a corner by deciding I needed a special phone-only update process as a “business rule.”
Consolidation Pro Tip
I recommend that you kick off every individual operation with an execution of the “ Lazy-Populate existingContactIdsByExtId ” Jitterbit script so that it’s easy to test or run the components individually.
Then, if you’re planning to run them back-to-back in actual scheduling, make a new Script-only operation called “ Run Me ” with a script called “ Execution Controller ” that has, say, the following contents:
<trans>
RunScript("<TAG>Scripts/Lazy-Populate existingContactIdsByExtId</TAG>");
RunOperation("<TAG>Operations/Jitterbit Connect™/Salesforce Inserts/Insert Contacts By External ID</TAG>");
RunOperation("<TAG>Operations/Jitterbit Connect™/Salesforce Updates/Update Contacts By External ID</TAG>");
</trans>
Technically, when “ Run Me ” runs, “ Lazy-Populate existingContactIdsByExtId ” will run 3 times (once because of “Execution Controller” and twice more because of the sub-operations), but because it’s “lazy,” it won’t actually do much the 2nd and 3rd times it runs.
I think it’s worthwhile to keep the “lazy” script as part of each sub-operation so that each sub-operation remains independently testable.
After all, it’s not like I’m running the script for every row that gets data-loaded (ahem … you did take that “test code” out of your row-by-row transformation logic, right?).
Further Considerations
I hope this post has helped give you an architecture to handle business cases when you do need to have special “insert by external ID” vs. “update by external ID” business processes.
One thing that’s really fun about this architecture is that Jitterbit couldn’t care less if you’re using a proper Salesforce external ID as the “matching” field.
I’ve done this with, say, a concatenation of FirstName
+ "|"
+ LastName
+ "|"
+ Email
as the “key” to my dictionary (keep in mind that if multiple Salesforce Contacts share a name+email, you’d only have one result stored in your “Jitterbit cache” dictionary variable, so think through how you want to handle that, business-logic-wise).
Note that I don’t typically load data directly on “soft” cache-matches like that. Instead, I’m looking to surgically fill in My_External_Id __c
in Salesforce before doing a normal “upsert” against My_External_Id__ c
.
I’ll save that architecture for another post … so stay tuned for more Jitterbit tips!