Friday, December 14, 2018

2 very good reasons why you should never pass entities to the client.

So you are building a web application with Entity Framework on the back end. You've got the code to load the entities, and now you need to get that data to the client. Then the question enters your head: "I've already loaded this data, why not simply return the entity?" Why bother creating a view model when it's just going to contain many of the fields you've already got here? Well, here are 2 very good reasons why you shouldn't, and if you already have, you should seriously consider changing it.

1. Security

If you are passing an entity to the client, and critically, if you are accepting that entity back from the client, attaching it to a context, and saving it, you have opened a massive, gaping security hole in your application.

As an example, let's say I have a system that a user logs into which displays a list of the debts that user currently has, and allows them to do something innocent like update a status or record a new note. The debt entity might look something like:
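
Something along these lines (a sketch; the exact property names are illustrative assumptions):

    using System.Collections.Generic;

    public class Debt
    {
        public int DebtId { get; set; }
        public decimal Amount { get; set; }
        public bool IsAccepted { get; set; }
        public string Status { get; set; }

        public int CustomerId { get; set; }
        public virtual Customer Customer { get; set; }
        public virtual ICollection<DebtNote> Notes { get; set; } = new List<DebtNote>();
    }

    public class DebtNote
    {
        public int DebtNoteId { get; set; }
        public string Note { get; set; }        // internal collections notes, not something a customer should see
        public string CreatedBy { get; set; }
    }

    public class Customer
    {
        public int CustomerId { get; set; }
        public string Name { get; set; }
    }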


We have a model which contains a list of debts and a view that lets a user tick off a debt to mark it as accepted. We might only display the debt amount and maybe a couple of extra details on the screen; however, look in the debugger anywhere the model is exposed, such as our action to accept the debt:
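
In the browser's debug tools the posted model looks something like this (a mock-up following the sketch above; the values are illustrative):

    {
      "DebtId": 12,
      "Amount": 5000.00,
      "IsAccepted": false,
      "Status": "Pending",
      "CustomerId": 1,
      "Customer": { "CustomerId": 1, "Name": "John Smith" },
      "Notes": [
        { "DebtNoteId": 3, "Note": "Customer disputed the amount, escalate to collections.", "CreatedBy": "collections_officer" }
      ]
    }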

Since we passed the entity, and our serializer diligently serialized all of its properties and children, we expose all of that data to the client. So I might have notes that aren't meant to be seen by the client, yet here they are.

The action handler looks something like:
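
A minimal sketch, assuming a Web API-style controller action that binds the whole posted entity and hands it to the data layer (the controller and repository names are illustrative):

    public class DebtsController : ApiController
    {
        private readonly IDebtRepository _debtRepository;   // illustrative data-layer dependency

        public DebtsController(IDebtRepository debtRepository)
        {
            _debtRepository = debtRepository;
        }

        [HttpPost]
        public IHttpActionResult Accept(Debt debt)
        {
            // Model binding has happily rebuilt the entire Debt graph from whatever JSON the client sent.
            debt.IsAccepted = true;
            _debtRepository.UpdateDebt(debt);
            return Ok();
        }
    }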

Now this is a very basic example. In reality we could have a view that allows for a more meaningful edit against an entity. When we click the Accept button the record gets updated as accepted, and all seems good. Except...

I don't want to accept a debt of $5000. I want to make that $1.

Using the debug tools, with a breakpoint before that call to UpdateDebt, I can edit the debt payload to change the Amount from "5000" to "1", then resume execution.

Suddenly my data shows that the debt was accepted, but now has an amount of $1?! 

Heck, I don't even want to pay $1. Let's make it someone else's problem. Using the same breakpoint we can see that the debt is assigned to a Customer with ID = 1. Let's change those customer ID references to 2, maybe update the customer name to "Fred" and anything else that might identify me. Presto, that debt is now someone else's problem. What happened?

Let's have a look at the UpdateDebt method:
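
Something like this attach-and-save pattern (a sketch of the approach being criticized, assuming an injected EF6 DbContext):

    using System.Data.Entity;

    public void UpdateDebt(Debt debt)
    {
        _context.Debts.Attach(debt);
        _context.Entry(debt).State = EntityState.Modified;   // trust whatever the client sent and mark it all as changed
        _context.SaveChanges();
    }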

Part of the attraction I fear developers have with Entity Framework and reattaching entities is that it saves an extra read from the database. Just attach it and save. It's also only 3 lines of code! Win. There are libraries such as GraphDiff that even help dive through related entities to attach them and mark them as modified. Doing this violates the cardinal rule of client-server software development: "Trust nothing that the client sends you!"

Now you might argue that it doesn't make sense to have a client update a single field like that, and that an "accept" action would simply take a debt ID and do the rest behind the scenes. True, but any update action in a system that takes an entity and attaches it is vulnerable to this tampering, and importantly, so is any entity associated with that entity via its references.

Even using a view model is not immune to this form of tampering. If you use a tool like Automapper to map entities to view models, but then use Automapper to map those view models back onto entities, you can expose yourself to the same unintended changes, though to a lesser degree provided the view models limit the data visible.

In any case, *never* trust the data coming back from the client. Even if it's an action that just accepts a debt ID, verify that the debt ID actually belongs to the currently authenticated user. *Always.* If it doesn't, log the user out immediately, and log the attempt to notify administrators. You either have a bug that should be fixed, or someone actively trying to circumvent or compromise the system.
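
As a sketch, a safer version of that Accept action might look like this (the helper names, logger, and sign-out step are illustrative assumptions):

    [HttpPost]
    public IHttpActionResult Accept(int debtId)
    {
        var customerId = GetCurrentCustomerId();   // resolved from the authenticated session, never from the payload

        var debt = _context.Debts
            .SingleOrDefault(d => d.DebtId == debtId && d.CustomerId == customerId);

        if (debt == null)
        {
            _logger.Warn($"Customer {customerId} attempted to accept debt {debtId} that is not theirs.");
            // sign the user out and alert administrators here
            return Unauthorized();
        }

        debt.IsAccepted = true;
        _context.SaveChanges();
        return Ok();
    }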

2. Performance

Now point #1 should be enough, but if you've chosen to pass entities anyway, and are validating everything coming back from the client, the second reason to reconsider is that sending entities incurs a performance cost and is easily susceptible to serious performance issues. One issue I commonly see tripping up Entity Framework users who pass entities to the client is lazy loading. When MVC / Web API goes to send entities back to the client, the serializer touches each property on the entity. Where that property is a lazy-loading proxy, that touch triggers a lazy-load call. Each of those loaded entities then has each of its properties touched to be serialized, triggering more lazy loads. What is worse is that where you might have loaded 20 rows for search results, the lazy loads for each of those 20 entities and their children are performed one by one, not using the criteria you used to load the 20 parent entities, as would have been the case if you had used .Include() to eager load them. The simple work-around is to eager load everything, but that means eager loading absolutely everything associated with that entity, whether you intend to display it or not.

Even in this case we are faced with a situation where we are loading dozens of columns of data when we might only want to display a handful. That means extra work and memory for your database to provide those unnecessary columns, little to no opportunity for the database to optimize with indexes, extra data across the wire to your app server, then extra data across the wire to the client, and extra memory needed on both the app server and the client. How much faster would it respond if it just queried and returned that handful of columns? It's noticeable.

Addressing these issues

The simple answer to avoid these issues? Use view models. A view model is a POCO (Plain Old C# Object) that contains just the data you want your view to see, plus the IDs you will need to perform any action from that view. Automapper now provides IQueryable and LINQ integration via the .ProjectTo() method, which passes the Select criteria through to the query so the database returns just what is needed for the view model. The client doesn't see anything it shouldn't, and performance is optimal because nothing that isn't needed gets sent.
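
A sketch of what that looks like (the view model and mapping configuration names are illustrative; _mapperConfig is the Automapper IConfigurationProvider):

    using AutoMapper.QueryableExtensions;

    public class DebtSummaryViewModel
    {
        public int DebtId { get; set; }
        public decimal Amount { get; set; }
        public bool IsAccepted { get; set; }
    }

    public List<DebtSummaryViewModel> GetDebts(int customerId)
    {
        return _context.Debts
            .Where(d => d.CustomerId == customerId)
            .ProjectTo<DebtSummaryViewModel>(_mapperConfig)   // Automapper builds the Select() for just these columns
            .ToList();
    }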

When it comes to getting those view models back: Trust nothing. Every operation and update absolutely must be authenticated and have all applicable IDs resolved back to the user. Any deviation should be logged and generate an alert to administrators.

Conclusion


Hopefully this helps convince developers to reconsider passing entities between their controllers and views / API consumers. If you are working on a system that does pass entities around, at a minimum be sure to test your client with the debugger tools thoroughly to see exactly what data is being exposed, and ensure that every screen and action properly guards against JSON data tampering.

Wednesday, December 12, 2018

Maximizing Performance with Entity Framework & MVC / Web API

You've just started on a contract and the manager or lead developer comes up to you and says: "Ok, we were using Entity Framework for our data layer but it's too slow so we need to replace it with {Dapper/NHibernate/EF-Core/NoSQL/ADO+Sprocs/etc.}"

When I hear this (and I have heard it more than a couple of times) it sends the warning bells ringing, and presents an opportunity to save clients a whole lot of money. 99.5% of the time the issue isn't that EF is too slow; the issue is that they just don't know how to use it, and even after switching to another technology stack they will likely encounter the same or worse performance issues.

Whenever I face the task of improving performance with EF, there are always issues and smells to investigate and eliminate. In this article I'll outline some of the major ones, and what can be done about them. Some of these items can involve a fair amount of re-factoring to implement, but the performance impact of the changes will be clear enough to justify them.

Culprit #1 - Code-First

I'll get it out there right now. I *hate* EF code-first. I mean I really hate code-first. It is on the list of technologies that I wish were never invented, right up there with "remember password" and "hide file extensions". The road to Hell was paved with good intentions, and code-first accounts for more than a few bricks on that path. Code-first and Migrations now make up the majority of EF-related issues on Stack Overflow. They are great tools for whipping up Alpha 0.1 and getting something up and running fast, but beyond that they should simply be turned off. Code-first databases *can* be set up efficiently, but I'd say every instance of a code-first database implementation that I've seen in a production system makes my inner DBA shrivel and die a little. I am not a DBA, but I have worked with enough of them to have picked up on what makes for efficient and effective database design. Out of the box, code-first offers none of this. My recommendation: if you're using it and not experiencing performance issues, then you shouldn't be reading this article. If you are reading this article and using code-first, seriously consider turning it off unless you know this beast well enough to have addressed the following smells:

a) Inefficient or missing indexes. Indexes are something that should be added and maintained as an EF system matures. Looking at real-world usage and performance, you should be constantly evaluating and implementing indexes. This isn't something that should be set up "first" (premature optimization), and it shouldn't be forgotten or ignored.
b) Poor schema design. Simple things like using GUIDs for PKs with NewId() instead of NewSequentialId(), or using meaningful string PKs/FKs, not to mention probably a dozen sins that a DBA will point out about generated schemas.
c) Time wasted attempting to resolve issues with code first generation and migrations, and the inevitable work-around/hacks to keep it happy. How many columns are left as inefficient data types, confusing names, or otherwise polluting your schema, indexes, and entities because you couldn't get EF migrations to clean them up properly without failures or side-effects?
d) Generation is no substitute for a good DBA. More than a couple of teams that I have joined have foregone a DBA, thinking that developers using code-first negate the need for one. (Frankly, if I were a DBA faced with a team planning to use code-first, I expect I'd either get frustrated and quit, or be forced out.) If you're using code-first without access to a DBA to tune the result, you are asking for trouble. If you do still have a DBA on the team, give them a hug for sticking around and putting up with you!

No, I don't like code-first at all, and it's responsible in part or in whole for at least a few of the issues below. I will always recommend using a database-first approach with EF and managing the schema manually, preferably with access to a DBA.

Culprit #2 - Schema related issues.

Even if code-first wasn't responsible for the schema, there are a few schema-related bugbears to investigate and potentially improve. These often aren't the primary performance bottleneck, but they are one of the quickest wins for squeezing out a bit of performance without code changes, and these steps will help identify bigger wins when re-factoring for the later culprits.

The first thing that every EF developer needs to be familiar with is a profiler. This can be an EF profiler, or any SQL profiler for your database of choice; either will work equally well. A common one I use for SQL Server is ExpressProfiler (https://github.com/OleksiiKovalov/expressprofiler), which is a quick and simple profiling tool. Using a profiler against an isolated database that you can test against will help reveal slow, expensive queries, which you can then copy and run against the database with an execution plan. This can highlight missing indexes, for example.

The key things to check in the schema:
a) Correctly typed keys with efficient identity/defaults. For example, if you use UniqueIdentifier PKs, are you using NewSequentialId() instead of NewId()? Are you using composite PKs where a dedicated meaningless PK could be used? Are you using meaningful string PKs instead of meaningless numeric ones?
b) Are there missing indexes, or indexes that can be improved?
c) Do you have an index maintenance schedule?
d) Do you have Transaction Log maintenance scheduled as part of your backup strategy? (SQL Server)
e) In general, has a DBA gone over your database and maintenance scheduling? If no, get someone to help look at it.

Culprit #3 - Exposing entities instead of View Models / DTOs.

In terms of performance impact on a system, this culprit is far and away the most significant issue to track down and eliminate. By now you should have a profiler installed, and the profiler is instrumental in demonstrating the pain that passing entities around can cause. The first problem you can encounter when returning entities, especially in the case of MVC controllers, is unexpected lazy-load calls. These show up in a profiler when you trigger a call against the controller and then see a whole stream of database hits go through. If you put a breakpoint at the end of the call, run to that point, and then resume to exit the method, these extra DB calls will appear after exiting the method. What is happening is that ASP.Net is attempting to serialize the entities to return to the view. The serializer iterates over the properties and, by doing so, trips the lazy-load proxies, triggering loads from the database.

What is worse is that when you're faced with returning a collection of entities, such as search results, each set of lazy-loaded dependencies will be queried against the database *one.. by.. one*. Ouch! So let's say I query a page of order results. Orders have a customer reference and a delivery address reference. Customers have an address and a contact details reference. If you return a page of 20 orders, the lazy-load hits will include 20x "SELECT FROM tblCustomer WHERE CustomerID = x", plus 40x "SELECT FROM tblAddress WHERE AddressID = x" (20 for the order delivery address and 20 for the customer address), plus 20x "SELECT FROM tblContact WHERE CustomerID = x", and so on for every reference of Order, and every reference of that reference, and so on.

These lazy loads can be mitigated by eager-loading the data using .Include() statements; however, this still presents performance issues, as the database server needs to retrieve all columns of all related tables for the applicable rows. That data then needs to be transmitted and allocated memory on the application server before being serialized and sent across the wire to the client. That almost always amounts to a lot more data than is needed, and doesn't present opportunities to take full advantage of indexing.
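
For example, eager-loading the order graph described above might look like this (a sketch assuming the entities from that example; .Include() needs the System.Data.Entity namespace in EF6):

    var orders = context.Orders
        .Include(o => o.Customer)
        .Include(o => o.Customer.Address)
        .Include(o => o.Customer.Contact)
        .Include(o => o.DeliveryAddress)
        .OrderByDescending(o => o.OrderDate)
        .Take(20)
        .ToList();   // one query instead of dozens, but every column of every included table comes back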

Most importantly, sending entities also exposes systems to revealing more information than users should be able to access. Even if you don't add controls to your view for that data, a simple F12 and some inspection in the debugging tools can expose that information to clever viewers. Passing entities back to the server and being "clever" by attaching them to a context and saving is a cardinal sin. Those entities expose all of their FKs and so on, and could have been modified in ways that your client doesn't support, but that a clever user with the debugger can exploit.

Lastly, dealing with detached entities adds complexity and potential errors to your server code. You need to check whether entities are already associated with a DbContext before reattaching them, and ensure that references are updated before being saved.

Using POCO view models you can improve performance considerably by letting EF query just the columns and rows you need from the database, and passing that smaller payload across the wire to the client. Tools like Automapper now support IQueryable extensions via .ProjectTo(), which means you can relatively easily map an entity query and its related data into a view model that is optimized for the particular view.
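
Under the covers this essentially boils down to a projection like the following (a hand-rolled sketch using the order example from above; the view model and its properties are illustrative):

    var orders = context.Orders
        .OrderByDescending(o => o.OrderDate)
        .Take(20)
        .Select(o => new OrderSummaryViewModel
        {
            OrderId = o.OrderId,
            OrderDate = o.OrderDate,
            CustomerName = o.Customer.Name,
            DeliveryCity = o.DeliveryAddress.City
        })
        .ToList();   // a single SELECT of just these columns - no lazy loads, no Includes needed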

I believe the biggest argument for passing entities (besides being lazy about defining a view model which looks similar to an entity) is to magically avoid having to re-read the data when performing the update. I call B.S. on that "optimization" on the grounds that systems generally do a lot more reads than writes, and updates typically involve fetching a relatively small amount of data by PK, which EF performs ridiculously fast. Time also passes between when that data is read and when it is written, so systems still need to handle the fact that the server data could have changed in the meantime, meaning the code still needs to be prepared to re-load the entity anyway. Trading away security and performance just to avoid defining an extra set of simple POCO classes and to skip a very fast read is a bad deal.

Culprit #4 - Async, async everywhere.

When developing large, complex systems, you need to know and understand async/await and how it works with EF. However, many developers are a bit mistaken about what it does, and where it should or should not be used. Generally, I find that once teams decide to use it, they decide to use it *everywhere* for consistency. This is a mistake.

Async does not make queries faster; if anything, it makes them slightly slower. What async does do is make servers more responsive, in that you can kick off a fairly expensive query for one request, then free up that request thread to start processing another request. When that first query finishes, it will resume to finish off the first request on a new worker thread. When you implement async code you add overhead to effectively suspend the code and set up a continuation point for resumption later. That overhead is in terms of both server resources and thread synchronization. If you use it for big, expensive requests, those requests will not hold up server processing while waiting on their queries. However, if you use it for *all* queries, every operation, including the short & sweet ones, will be participating in that park & resume scheduling game. My recommendation when it comes to async is to start with everything running synchronously, then as you identify heavy queries, update those to use async operations; a sketch follows the list below. Typical candidates include:

  • Searches
  • Retrieving entity graphs for updates
  • Retrieving large graphs for reports or detail views.
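
As a sketch of that split (the names are illustrative; ToListAsync comes from EF6's System.Data.Entity namespace):

    // An expensive search is a good async candidate:
    public async Task<List<OrderSummaryViewModel>> SearchOrdersAsync(DateTime from, DateTime to)
    {
        return await _context.Orders
            .Where(o => o.OrderDate >= from && o.OrderDate <= to)
            .Select(o => new OrderSummaryViewModel
            {
                OrderId = o.OrderId,
                OrderDate = o.OrderDate,
                CustomerName = o.Customer.Name
            })
            .ToListAsync();   // the request thread is freed while the database does the heavy lifting
    }

    // A cheap key lookup can stay synchronous and skip the park & resume overhead:
    public Customer GetCustomer(int customerId)
    {
        return _context.Customers.Find(customerId);
    }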

Removing unnecessary async code is a relatively easy re-factor and something I suggest when I see systems where async/await have been lathered throughout a system.

Culprit #5 - Large DbContexts.

As systems grow larger, the number of tables and entities can get quite large, and the larger a DbContext gets, the longer it takes to spin one up. IMO this is a major drawback of most code-first implementations. Using bounded contexts (contexts geared towards specific areas of an application, containing just those entities and their direct dependencies) can noticeably improve the performance of EF operations. Entities used as a read-only reference in one context can be greatly simplified so that they can be loaded and associated more efficiently. For example, a Customer may have 40+ columns, many non-nullable, and the act of creating an Order needs a reference to a customer; rather than mapping the entire Customer table entity into an OrderContext, you can simplify it to just the details an order actually needs from a customer. Smaller entity definitions + fewer entity definitions = faster context spin-up and simpler read/write code. There is probably a way to get bounded contexts cooperating against a single DB schema with migrations, but frankly I find code-first is more hassle than it's worth.
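
A sketch of what a bounded context for order entry might look like (class and table names are illustrative, and it assumes an Order entity like the one discussed above):

    using System.Data.Entity;

    // A slimmed-down customer containing only what order entry cares about,
    // mapped to the same table the full Customer entity uses elsewhere.
    public class OrderCustomer
    {
        public int CustomerId { get; set; }
        public string Name { get; set; }
    }

    public class OrderContext : DbContext
    {
        public DbSet<Order> Orders { get; set; }
        public DbSet<OrderCustomer> Customers { get; set; }

        protected override void OnModelCreating(DbModelBuilder modelBuilder)
        {
            modelBuilder.Entity<OrderCustomer>().ToTable("tblCustomer");
            modelBuilder.Entity<OrderCustomer>().HasKey(c => c.CustomerId);
            modelBuilder.Entity<Order>().ToTable("tblOrder");
            base.OnModelCreating(modelBuilder);
        }
    }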

Introducing bounded contexts with a DB-first configuration is a relatively simple task to approach on a screen-by-screen basis. This is often an approach I use to introduce EF into legacy systems, and a similar approach can be employed to move a set of pages from an existing global context onto a new purpose-built context. An absolute must for this step is to ensure that Culprit #3 isn't rearing its head. To have this level of flexibility you cannot be passing entities around.

Culprit #6 - Entity Framework 4

This item deserves special mention. In more than a few instances I've seen criticisms of EF performance from teams that are still using Entity Framework 4. I've used EF since EF 4 was first introduced (technically EF v2) and, unless I *had* to use it, I didn't; I'd previously been an NHibernate advocate. I'd tinkered with EF 5, but I didn't take EF seriously until EF 6 was released. From a performance perspective, EF 4 is a dog's breakfast, and any project still using it should immediately be looking at upgrading to EF 6. The migration path from 4 to 6 is pretty simple. Unfortunately, code written around EF 4 will not initially take advantage of many of the improvements introduced with 6, but those can be re-factored in over time to net significant performance improvements. Simply replacing EF 4 with EF 6 can net 10~25% performance improvements without any other changes.

Conclusion

Hopefully these six culprits help you identify performance problems in EF-backed systems, and give you some ideas on how to improve them rather than giving in to believing EF needs to be replaced with something else. If you have any comments or questions about coaxing some performance out of EF, flick me a comment or post your question on StackOverflow tagged "entity-framework", as I tend to follow that tag.