Py's Sty: Maximizing Performance with Entity Framework & MVC / Web API

You've just started on a contract and the manager or lead developer comes up to you and says: "Ok, we were using Entity Framework for our data layer but it's too slow so we need to replace it with {Dapper/NHibernate/EF-Core/NoSQL/ADO+Sprocs/etc.}"

When I hear this (and I have heard it more than a couple times) it sends the warning bells ringing, and presents an opportunity to save clients a whole lot of money. 99.5% of the time the issue isn't that EF is too slow; The issue is that they just don't know how to use it, and even switching to another technology stack they will likely encounter the same or worse performance issue.

Whenever I face the task of improving performance with EF, there are always issues and smells to investigate and eliminate. In this article I'll be outlining some of the major ones, and what can be done about them. Some of these items can involve a fair amount of re-factoring to implement, but the performance impact of these changes will be evident quite clearly to justify the changes.

Culprit #1 - Code-First

I'll get it out there right now. I *hate* EF code-first. I mean I really hate code-first. It is in the list of technologies that I wish were never invented, right up there with "remember password" and "hide file extensions". The road to Hell was paved in good intentions, and code-first accounts for more than a few bricks on that path. Code-first and Migrations make up now the majority of EF related issues on Stack Overflow. They are great tools for whipping up Alpha 0.1 and getting something up and running fast, but beyond that they should simply be turned off. Code-first databases *can* be set up efficiently, but I'd say that every instance of a code-first database implementation that I've seen in a production system makes my inner-DBA shrivel and die a little. I am not a DBA, but I have worked with enough to have picked up on what makes for efficient and effective database design. Out of the box, code-first offers none of this. My recommendation is that if you're using it and not experiencing performance issues, then you shouldn't be reading this article. If you are reading this article and using code-first, seriously consider turning it off unless you know this beast well enough to have addressed these following smells:

a) Inefficient or missing indexes. Indexes are something that in EF systems should be added and maintained as systems mature. Looking at real-world usage and performance you should be constantly evaluating and implementing indexes. This isn't something that should be set up "first" (premature optimization), and it shouldn't be forgotten/ignored.
b) Poor schema design. Simple things like using GUIDs for PKs with NewId() instead of NewSequentialId(), or using meaningful string PKs/FKs, not to mention probably a dozen sins that a DBA will point out about generated schemas.
c) Time wasted attempting to resolve issues with code first generation and migrations, and the inevitable work-around/hacks to keep it happy. How many columns are left as inefficient data types, confusing names, or otherwise polluting your schema, indexes, and entities because you couldn't get EF migrations to clean them up properly without failures or side-effects?
d) Generation is no substitute for a good DBA. More than a couple teams that I have joined have foregone a DBA thinking that developers using code-first negate the need for a DBA. (Frankly, if I was a DBA faced with a team planning to use code first, I'd expect I'd either get frustrated and quit, or have been forced out.) If you're using code-first without access to a DBA to tune it you are asking for trouble. If you do still have a in the team DBA, give them a hug for sticking around and putting up with you!

No, I don't like code-first at all, and it's responsible in part or whole for at least a few of the issues below. I will always recommend using a database first approach with EF and manage the schema manually, preferably with access to a DBA.

Culprit #2 - Schema related issues.

Even if code first wasn't responsible for the schema, there are a few schema related bugbears to investigate and potentially improve. These often aren't the primary performance bottleneck, but they are one of the quickest wins to squeeze a bit of performance out without code changes, and these steps will help identify bigger wins in re-factoring for further culprit items.

The first thing that every EF developer needs to be familiar with is a profiler. This can be an EF profiler, or any SQL profiler for your database of choice will work equally well. A common one I use for SQL Server is ExpressProfiler (https://github.com/OleksiiKovalov/expressprofiler) which is a quick and simple profiling tool. Using a profiler against an isolated database that you can test against will help reveal slow, expensive queries that you can copy and run against the database with an execution plan. This can highlight missing indexes for example.

The key things to check in the schema:
a) Correctly typed keys with efficient identity/defaults. For example, if you use UniqueIdentifier PKs, are you using NewSequentialId() instead of NewId()? Are you using composite PKs where a dedicated meaningless PK could be used? Are you using meaningful string PKs instead of meaningless numeric ones?
b) Are there missing indexes, or indexes that can be improved?
c) Do you have an index maintenance schedule?
d) Do you have Transaction Log maintenance scheduled as part of your backup strategy? (SQL Server)
e) In general, has a DBA gone over your database and maintenance scheduling? If no, get someone to help look at it.

Culprit #3 - Exposing entities instead of View Models / DTOs.

In terms of performance impacts on a system, this culprit is far an above the most significant issue to track down and eliminate. By now you should have a profiler installed, and the profiler is instrumental in demonstrating the pain that passing entities around can cause. The first problem that you can encounter with returning entities, especially in the case of MVC controllers is unexpected lazy load calls. These show up in a profiler when you trigger a call against the controller then see a whole stream of database hits go through. If you put a breakpoint at the end of the call, then run to that point, and then resume to exit the method, these extra DB calls will appear after exiting the method. What is happening is that ASP.Net is attempting to serialize the entities to return to the view. The serializer iterates over the properties and by doing so, trips the lazy loaded property proxy methods, triggering a load from the database.

What is worse is that when you're faced with returning a collection of entities such as search results, each set of lazy loaded dependencies will be queried against the database *one.. by.. one*. Ouch! So lets say I query a page of order results. Orders have a customer reference and a delivery address reference. Customers have an address and contact details reference. If you returned a page of 20 orders, the lazy load hits will include 20x "SELECT FROM tblCustomer WHERE CustomerID = x" plus 40x "SELECT FROM tblAddress WHERE AddressID = x" (20 for the order delivery address, and 20 for the customer address) plus 20x "SELECT FROM tblContact WHERE CustomerID = x", etc. etc. for every reference of Order, and every reference of that reference and so-on.

These lazy loads can be mitigated by eager-loading the data using .Include() statements, however this still presents performance issues as the database server needs to retrieve all columns of all related tables for the applicable rows. This data then needs to be transmitted and allocated memory on the application server, before being serialized and sent across the wire to the client. That is almost always amounting to a lot more data than is needed, and doesn't present opportunities to take full advantage of indexing.

Sending entities also most importantly exposes systems to revealing more information than users should be able to access. Even if you don't add controls to your view for that data, a simple F12 and inspection in the debugging tools can expose that information to clever viewers. Passing entities back to the server and being "clever" by attaching them to a context and saving is a cardinal sin. Those entities expose all FKs etc. and could have been modified in ways that your client doesn't support, but a clever debugger could modify.

Lastly, dealing with detached entities adds complexity and potential errors within your server code. You need to test whether entities aren't already associated to a DbContext before reattaching them, and ensuring that references are updated before being saved.

Using POCO view models you can improve performance considerably by letting EF query just the columns and rows you need from the database, and pass that smaller payload across the wire to the client. Tools like Automapper now support IQueryable extensions via .ProjectTo which means you can relatively easily integrate mapping an entity query and related data into a view model that is optimized for the particular view.

I believe the biggest argument for passing entities (besides being lazy about defining a view model which looks similar to an entity) is to magically avoid having to re-read the data when performing the update. I call B.S. on that "optimization" on the grounds that systems generally do a lot more reads than writes, and updates typically involve fetching a relatively small amount of data by PK which EF performs ridiculously fast. Time also passes from when that data is read and when it is written so systems still need to handle the fact that the server data could have been updated, so the code still needs to be prepared to re-load the entity anyways. Trading off the security and performance for avoiding defining an extra set of simple POCO classes and trying to avoid a very fast read is a bad deal.

Culprit #4 - Async, async everywhere.

When developing large, complex systems, you need to know and understand async/await and how it works with EF. However, many developers are a bit mistaken about what it does, and where it should or should not be used. Generally, I find that once teams decide to use it, they decide to use it *everywhere* for consistency. This is a mistake.

Async does not make queries faster. It makes them slower. What async does do is make servers more responsive in that you can kick off a fairly expensive query for one request, then free up that request thread to start processing another request. When that first query finishes, it will resume to finish off that first request on a new worker thread. When you implement async code you add an overhead to set up the code to effectively suspend and set up a continuation point for resumption later. That overhead is both in terms of server resources, and thread resource synchronization. If you use it for big, expensive requests, those requests will not hold up server processing waiting for those queries. However, if you use it for *all* queries, every operation, including the short & sweet ones will be participating in that park & resume scheduling game. My recommendation when it comes to async is to start with everything running synchronously, then as you identify heavy queries, update these to use async operations. Typical candidates include:

Searches
Retrieving entity graphs for updates
Retrieving large graphs for reports or detail views.

Removing unnecessary async code is a relatively easy re-factor and something I suggest when I see systems where async/await have been lathered throughout a system.

Culprit #5 - Large DbContexts.

As systems grow larger, the number of tables and entities can get quite large. The larger a DbContext gets, the longer it takes to spin one up. IMO this is a major drawback of most Code-First implementations. Using bounded contexts (Contexts geared towards specific areas of an application with just those entities and their direct dependencies) can noticeably improve the performance of EF operations. Entities used as a read-only reference in one context can be greatly simplified so that they can be loaded and associated more efficiently. For example a Customer may have 40+ columns, many non-nullable, and the act of creating an Order needs a reference to a customer but rather than mapping an entire Customer table entity for an OrderContext, you can simplify it to just the relevant details you would need for a customer as far as an order is concerned. Smaller entity definition + fewer entity definitions = faster context spin-up and simpler read/write code. There is probably a way to get bounded contexts cooperating against a single DB schema /w migrations, but frankly I find code-first is more hassle than it's worth.

Introducing bounded contexts with a DB-first configuration is a relatively simple task to approach on a screen by screen basis. Often this is an approach I use to introduce EF into legacy systems, and a similar approach can be employed to separate a set of pages from an existing global context into using a new purpose-built context. An absolute must for this step is to ensure that Culprit #3 isn't rearing it's head. To have this level of flexibility you cannot be passing entities around.

Culprit #6 - Entity Framework 4

This item deserves special mention. In more than a few instances I've seen criticisms of EF performance from teams that are still using Entity Framework 4. I've used EF since EF 4 was first introduced (technically EF v2) and unless I *had* to use it, I didn't. I'd previously been an NHibernate advocate. I'd tinkered with EF 5, but I didn't take EF seriously until EF 6 was released. From a performance perspective, EF 4 is a dog's breakfast, and any project still using it should immediately be looking at upgrading to EF 6. The migration path from 4 to 6 is pretty simple. Unfortunately code written around EF 4 will not be initially taking advantage of many of the improvements introduced with 6 such as deferred execution, but those can be re-factored in over time to net significant performance improvements. Simply replacing EF 4 with EF 6 would net 10~25% performance improvements without any other changes.

Conclusion

Hopefully these six culprits can help you with identifying performance problems with EF-backed systems and some ideas on how to improve them rather than giving into believing EF needs to be replaced for something else. If you have any comments or questions about coaxing some performance out of EF flick me a comment or post your question on StackOverflow tagged under "entity-framework" as I tend to follow that tag.

Py's Sty

Wednesday, December 12, 2018

Maximizing Performance with Entity Framework & MVC / Web API