I was working on a project recently that uses SQL Server as its primary database. This wasn't my first rodeo with SQL Server and in fact, I have a side project that makes heavy use of the .NET stack. But, in this project, I encountered a major performance problem that wasn't very clear and was very easy to miss.
So in the spirit of learning in public, I thought id write up my experience with this problem. We can dive into the problem and the solution so that you can avoid this scar tissue in the future.
The Project
First, let's set the stage for the problem by getting into some of the background of the project. For the purposes of this blog post, I am going to use an example that mirrors the actual project.
The project is a web API that is built using dotnet core
. The main endpoint in the API takes in an array of ids and checks a table in the database for those ids. In terms of APIs, this one is rather straightforward.
In developing this API, we had access to the database but not the original schema. That wasn't a problem as we made use of Entity Framework code first to represent the table we want to query. We looked at the schema that is on the real database and mirrored those columns with their types into our Entity Framework model.
Here is what that model looked like after that initial phase.
namespace my_api.Data
{
public class FieldTable
{
public Int64 Id { get; set; }
public string FieldId { get; set; }
public string Description { get; set; }
public DateTime DateAdded { get; set; }
public bool Available { get; set; }
}
}
The FieldId
is the column we query in the API to see if the array of ids that were passed in match any FieldId
values in the database. The ones that match we return to the client. Here is the API logic code that does that.
public IEnumerable<MatchedField> GetMatches(IEnumerable<string> fieldIds)
{
return _dataContext.FieldTable.Where(f => fieldIds.Contains(f.FieldId))
.Select(f =>
new MatchedField()
{
FirstSeen = f.DateAdded,
FieldId = f.FieldId.ToLower(),
Available = f.Known
}
).ToList();
}
Entity Framework to raw SQL
At a high-level Entity Framework is going to map our code above into a raw SQL query that is going to look like this.
select DateAdded, FieldId, Known
from dbo.FieldData
where FieldId in (N'1', N'2', N'3', N'4', N'etc')
This query looks innocent enough right? It is doing a rather straightforward lookup on the table. But it's actually doing a bit more than that, notice the N
character in front of each id value.
This character in SQL is declaring the type of that string as nvarchar
. If you're not familiar with the nvarchar
type, it allows you to store any Unicode value in a column. This is different from a varchar
data type which only allows you to store ASCII.
This is where the differences between nvarchar
and varchar
become important. When we defined our FieldTable
class up above, we didn't specify the SQL types of the columns on the table. So what does that mean? It means that Entity Framework is going to use it's default SQL types, for .NET strings the default type is nvarchar
.
This is why we see the N
character in front of each of our values that the raw SQL query is looking up.
Is that a problem? Not if the column on our table is of type nvarchar
. Entity Framework is sending nvarchar
ids in the query and our column has that type so were all good.
But what if the column is of type varchar
instead?
If FieldId
is actually of type varchar
but Entity Framework sends a nvarchar
set of ids, now we have a type mismatch. When this happens a conversion now has to happen at query time.
This is a subtle nuance that can often go unlooked. But, the performance impact can be huge if our query is looking up hundreds of values.
The Lesson Learned
This subtle difference when looking up one FieldId
value wasn't all that noticeable. It seemed a bit slower than it should have been but not to bad overall.
But, querying for 500 FieldId
values was not performant at all, it was on the order of 45-60 seconds. This led to the investigation laid out above. Looking at the raw query Entity Framework was generating we saw that the id values were being prefixed with N'1', N'2', N'3'
. We assumed that those columns were in fact nvarchar
so that shouldn't be the performance bottleneck.
But, then we ran the same query but instead of prefixing the id values with N
we looked up normal varchar
strings.
select DateAdded, FieldId, Known
from dbo.FieldData
where FieldId in ('1', '2', '3', '4', 'etc')
The results came back in less than 500 milliseconds.
Looking at the FieldId
column we were able to confirm that it was actually of type varchar
and not nvarchar
. Performance bottleneck found ✅.
This wasn't Entity Frameworks fault or even the fault of the database. It was a small bug in the data model that we created and was very easy to overlook. When we defined the table, FieldTable
, in code we specified that the FieldId
was of type string.
namespace my_api.Data
{
public class FieldTable
{
public Int64 Id { get; set; }
public string FieldId { get; set; }
public string Description { get; set; }
public DateTime DateAdded { get; set; }
public bool Available { get; set; }
}
}
When it came time for EF to query that table it did what it does best, translate your code into a SQL query. But, it added to the WHERE IN
clause the N
prefix for each id. It operated under the assumption that the FieldId
column was of type nvarchar
. That is because nvarchar
is the default SQL type for the .NET type string
in Entity Framework.
select DateAdded, FieldId, Known
from dbo.FieldData
where FieldId in (N'1', N'2', N'3', N'4', N'etc')
Bada bing, major performance problem. But why is that? Because now SQL Server has to convert each id that is declared as nvarchar
in the query to a varchar
type to query the column. That conversion with 500+ ids to lookup was very costly.
Be explicit when necessary
The fix was very straightforward to put in place. We needed to be explicit with our properties defined for FieldTable
. This meant adding an attribute to each property that tells Entity Framework the exact SQL data type this represents.
namespace my_api.Data
{
public class FieldTable
{
public Int64 Id { get; set; }
[Column(TypeName="varchar(50)")]
public string FieldId { get; set; }
[Column(TypeName="varchar(500)")]
public string Description { get; set; }
public DateTime DateAdded { get; set; }
public bool Available { get; set; }
}
}
The change was to add the [Column(TypeName="varchar(50)")]
to the FieldId
property. This tells Entity Framework the exact SQL data type of this column. By doing that, Entity Framework now generates the appropriate SQL query, one with out nvarchar
strings.
select DateAdded, FieldId, Known
from dbo.FieldData
where FieldId in ('1', '2', '3', '4', 'etc')
The result? Looking up 500+ ids at a time can now be done in 200-400 milliseconds instead of 45-60 seconds.
Conclusion
Building solutions where you don't have access to all the pieces in play can be challenging. Duplicating a schema manually into your own code is error-prone as we have seen.
ORMs like Entity Framework are fantastic at hiding complexities around database access. Most of the time this is what we want. However, as we have seen, sometimes that hiding can introduce nuances that are easy to overlook. Use ORMs when needed but make sure you have a solid footing in the implicit decisions they may or may not make.
Are you hungry to learn even more about Amazon Web Services?
I have spent pretty much my entire career working within an AWS ecosystem. But even I can find learning new AWS services daunting at times. That's why I created a course that teaches you how to master AWS the same way I do, by actually using it.
If you are looking to begin your AWS journey but feel lost on where to start, consider checking out my course. We focus on hosting, securing, and deploying static websites on AWS. Allowing us to learn over 6 different AWS services as we are using them. After you have mastered the basics there we can then dive into two bonus chapters to cover more advanced topics like Infrastructure as Code and Continuous Deployment.