Boy have I been on a database integration adventure these last few weeks. In this series, I'll review some of the more interesting hoops I jumped through with Jitterbit to load data in CSV files provided by an external admissions application partner (think common app ... only this was a much smaller organization) into tables of a Salesforce org.
Intro
Before I launch into a bunch of "how" party tricks, I'd like to spend this post talking about what business purpose I was trying to accomplish.
As I mentioned, I am building a scheduled daily data-load job to fetch 3 CSV files, plus a number of PDF/PNG/etc. files, from a vendor's server and selectively load their contents into Salesforce.
Source Data Structure
The 3 CSV files are:
-
students.csv
: 1 student per row, containing details like their first name, last name, date of birth, etc.- Each row contains a unique record-identifying value in the
STUDENT_ID
column.
- Each row contains a unique record-identifying value in the
-
applications.csv
: 1 admissions application per row, containing details like the level to which they are applying (e.g. "undergraduate"), the type of application they are submitting (e.g. "first-year / transfer"), what they'd like to major in, in which semester they'd like to start, etc.- Each row contains a unique record-identifying value in the
APPLICATION_ID
column. - Each row also contains a column with a cross-reference to a
STUDENT_ID
value.
- Each row contains a unique record-identifying value in the
-
previous_schools.csv
: 1 "previously attended school" per row, containing details like the name of the school they attended, whether it was a high school or a university, the date on which they started attending the school, the date on which they stopped attending the school, their GPA at that school, etc.- There is no unique record-identifying column in this data set.
- Each row contains a column with a cross-reference to a
APPLICATION_ID
value. - Each row contains a column with a cross-reference to a
STUDENT_ID
value.
Furthermore, the data I fetch includes a number of PDF, PNG, etc. photographs of transcripts, test scores, and other supplemental documents to support the admissions application. These are named with a strict pattern as follows:
-
STUDENT_ID
, followed by an underscore -
APPLICATION_ID
, followed by an underscore - Free text describing what kind of document it is (e.g. "High School transcript 1")
- A period to separate the filename from the file extension
- A file extension (e.g.
pdf
,png
, etc.)
To the extent that a collection of files in the same folder can have an "entity-relationship diagram," here's how I interpret the data to be "structured:"
Note that in this blog series, I am using the word "student" to describe a person -- a human being -- because that's how the vendor named our files and my mind was on this particular project when I began writing.
Attentive reader Luke pointed out that the term "student" is often context-specific. For example, working with Salesforce's Education Data Architecture ("EDA"), he uses the word "student" to mean the junction between a Salesforce
Contact
and a SalesforceAccount
(or "school" within an institution).My use of the word "student" in this series is simply a synonym for "person."
Destination Data Structure
Luckily, our senior admin managed to convince the 3rd-party application provider to make sure that their data feed of CSVs broken up into the same object (table) structure and foreign-key/primary-key relationships as our Salesforce org.
(That is, they negotiated to ensure that the CSV files from the vendor would be normalized the way we normalize our Salesforce org.)
In plain English, this means that 1 row of source-CSV-file data should map to 1 row of Salesforce data.
The ERD of our Salesforce org looks almost identical to the one I provided for the "files," and that's no accident:
The two main differences are that:
- There's an actual table connected to
Contact
calledAttachment
that we will be inserting 1 record per "loose PDF/PNG/etc." into, loading the file itself into theBody
field. - There's a circular lookup to
Application__c
fromContact
calledActive_Application__c
. Even thoughContact
is a parent table toApplication__c
... think of this as a "favorite child" sort of field.
There are other lookups, particularly on Application__c
, (e.g. "in which term someone would like to start" is a lookup to `Termc`), but I've tried to keep the basic inter-table relationships simple.
Within Salesforce, Contact
and Application__c
have unique external ID-typed fields for storing the CSV-side values of STUDENT_ID
and APPLICATION_ID
, respectively.
What Needs To Go Where
It would be lovely if I could simply map similar tables to each other with a basic UPSERT
on external ID, but that isn't the case.
Let's take a look at each table in Salesforce and talk about the business rules for loading data into it.
People
Let's get this out of the way: selling university seats to prospective students is a business-to-customer endeavor, not a business-to-business one.
We don't bother with Salesforce's native Lead
object. Everyone's a Contact
.
However, to facilitate deduplication, we also have a table called Contact_Temp__c
with some Visualforce apps built on top of its data.
Contact_Temp__c
is a great place to load potentially-unclean batches of prospective students and sort out who's who, comparing them against actual Contact
records, before updating or inserting records in the Contact
table.
Because humans are in charge of actually putting data into Contact
, I get off pretty easy when it comes to loading data into Contact_Temp__c
.
I'll show my cards later, but suffice it to say that it's not complicated and is almost a straightforward "dump each row of the CSV file into a Salesforce record" mapping.
Admissions Applications
Then we have the heart of the matter: data representing an "application for admission" to a program at the university.
I was told that I don't have to keep Application__c
records up-to-date once they've been initially loaded (even if any details change on the CSV-file side at a later date), so you'd think the business process would go like this:
-
Make sure a
Contact
record exists with a value inExternal_Student_Id__c
that matchesSTUDENT_ID
fromapplications.csv
. - Make sure an
Application__c
record does not yet exist with a value inExternal_Application_Id__c
matchingAPPLICATION_ID
fromapplications.csv
. -
If you clear those two preconditions,
INSERT
a newApplication__c
record into Salesforce cross-referencing theContact
in question and fill in the details as appropriate.
Better Salesforce Insert/Update Operations with Jitterbit Caching
Katie ・ Oct 28 '19 ・ 14 min read
But that'd be easy.
We can't have that!
It turns out that during the human-driven process of creating Contact
records, thanks to the magic of Salesforce automations that people generally find useful to our business processes, placeholder Application__c
records get created with almost no details filled in but a Status__c
of "Placeholder."
So it's actually this:
- Make sure a
Contact
record exists with a value inExternal_Student_Id__c
that matchesSTUDENT_ID
fromapplications.csv
. - Make sure an
Application__c
record does not yet exist with a value inExternal_Application_Id__c
matchingAPPLICATION_ID
fromapplications.csv
. -
If you clear those two preconditions, check if the
Contact
in question has any existing placeholderApplication__c
records.- If you find 1 or more, pick the oldest one and
UPDATE
it, filling in the details as appropriate, including settingStatus__c
to "Started." - If you don't find any,
INSERT
a newApplication__c
record into Salesforce cross-referencing theContact
in question and fill in the details as appropriate, including settingStatus__c
to "Started."
- If you find 1 or more, pick the oldest one and
Got it??
Admissions Applications: Bonus Round
We had to set Status__c
to "Started" because our Salesforce org has a bazillion automations dependent upon that value.
However, as soon as they've had a moment to run, we need to go back through the Application__c
records we just UPSERT
ed and set Status__c
to "Submitted."
Still with me?
Contacts: Bonus Round
Finally, we have to make sure that Active_Application__c
on Contact
points to the Application__c
record we just UPSERT
ed.
(Exception: If we upserted 2 applications for the same parent Contact
, we'll arbitrarily pick 1 to do the honors.)
Lookup Errors: Bonus Round
Earlier, I mentioned that I hadn't shown lookup fields within Application__c
to objects such as Term__c
, Major__c
, etc. in the Salesforce entity-relationship diagram above.
However, it is still important to populate these fields correctly.
If applications.csv
has data in START_TERM_NAME
or MAJOR_NAME
that Jitterbit can't easily match to records within Salesforce's Term__c
or Major__c
tables via SOQL query, then I need to e-mail out an error log so our data integrity team can contact the 3rd-party vendor.
Flowchart
Here's a flowchart to recap how data needs to flow out of applications.csv
:
Previous Schools Attended
Loading "previous schools attended" follows similar architecture to loading "admissions applications," mercifully without "bonus steps:"
- We don't want to continue to update the Salesforce records corresponding to any CSV records we've already data-loaded once.
- We have to do a tricky "take over the first placeholder you find, if applicable" operation because we have Salesforce automations on
Application__c
in place that create placeholderPrevious_School__c
records wheneverLevel__c
is undergraduate:- 1 high school placeholder
- + 1 college placeholder for transfers
Of course, the first requirement is rendered tricky because previous_schools.csv
doesn't have any sort of obvious "unique record ID" that facilitates matching CSV records against existing Previous_School__c
Salesforce records.
Since the business logic for "yes, it's a match" was simply "do nothing," I decided there wasn't severe harm in a slight chance of false positives. I made my matching criteria as broad as seemed reasonable:
- A match on name counts as a match.
- A match on date-range also counts as a match (in case someone "tidied up" the name in Salesforce later on).
- After all, what are the odds of attending two schools over the exact same timespan?
Therefore, our business process for data-loading Previous_School__c
records is as follows:
- Make sure a
Contact
record exists with a value inExternal_Student_Id__c
that matchesSTUDENT_ID
fromprevious_schools.csv
. - Make sure an
Application__c
record exists with a value inExternal_Application_Id__c
that matchesAPPLICATION_ID
fromprevious_schools.csv
. - Make sure a
Previous_School__c
record does not yet exist withSCHOOL TYPE
matchingRecordType.DeveloperName
and with either of the following "matches" to data fromprevious_schools.csv
:- A
SCHOOL NAME
value (CSV) matchingName
(Salesforce) - A
START DATE
andEND DATE
value pair (CSV) matchingStart_Date__c
andEnd_Date__c
(Salesforce)
- A
-
If you clear those three preconditions, check if the
Application__c
in question has any existing placeholderPrevious_School__c
records of the sameSCHOOL TYPE
. For blogging purposes, we'll say that we can recognize a placeholder because itsName
is "~~~PLACEHOLDER~~~
."- If you find 1 or more, pick the oldest one and
UPDATE
it, filling in the details as appropriate. - If you don't find any,
INSERT
a newPrevious_School__c
record into Salesforce cross-referencing theApplication__c
in question and fill in the details as appropriate.
- If you find 1 or more, pick the oldest one and
Flowchart
Supplemental Document Attachments
Here's our business process for loading PDF/PNG/etc. files from the vendor into Attachment
in Salesforce:
- Make sure a
Contact
record exists with a value inExternal_Student_Id__c
that matchesSTUDENT_ID
(available by parsing the filename). - Make sure an
Attachment
record does not yet exist for thatContact
with a value inName
matching matching the filename. -
If you clear those two preconditions,
INSERT
a newAttachment
record into Salesforce cross-referencing theContact
in question (filename toName
, contents toBody
).
Contacts: Bonus Round
- If we just
INSERT
ed anyAttachment
records for aContact
, we have toUPDATE
theContact
'sLast_Attachment_Upload_Date__c
field to the current date and time. - If we just
INSERT
ed anyAttachment
records for aContact
whose filenames contain the phrase "transcript
," we have toUPDATE
theContact
'sLast_Transcript_Upload_Date__c
field to the current date and time.
Flowchart
Up Next: How I Did It
For the rest of this series, I'll break these business rules into bite-sized pieces and talk about the architecture I chose in Jitterbit to implement them.
I'm sure we'll hit a few fun Jitterbit-specific tricks along the way.
For the most part, I suspect I'll be covering architectural design concepts that should be transferable to any schedule-based ETL tool, so stay tuned, adminelopers, devs, and architects.