Thursday, October 16, 2025

Shrinking JSON for Data Storage

Lately I have been working on a pet project to help visualize how an offset account impacts a mortgage. It has mainly been a foray into building apps in Blazor. The goal is to have a version of the app that can run in a fully offline mode leveraging local storage, which I have used Blazored.LocalStorage to assist with. Behind the scenes it serializes objects to JSON to be stored in local storage, which works just fine.

However, this did raise a concern: the data being stored would accumulate over years' worth of entries tracking transaction amounts to compute the offset impact. Each record is reasonably small, but there will potentially be a fair number of them, and I'd like the option to include things like descriptive names. JSON itself adds to the storage footprint since the object structure and all values are represented as text. With databases in modern systems, storage and indexing space is cheap. With local storage we are limited to roughly 5MB. This means we need to be a bit more cautious with our data. It turned into a rather interesting problem of finding a balance between ease of use and storage space.

The first thing I looked at was ensuring that I was only storing info that was absolutely needed. With a relational database a Transaction record might look something like:

TransactionId, TransactionDate, AccountId, Amount, Description

With this offline store, it looks more like a document store where the transactions fall under an account document. We still want an Id for the transaction within the document, but we can get rid of the foreign key. When it comes to the description, many rows will be null, but for the ones that people fill in, there is a good chance that many will repeat. For this I decided to normalize by adding a TransactionDescription table. This replaces Description with a TransactionDescriptionId, an integer rather than a string in each row. As users start typing we can look up common descriptions and offer to link to an existing one, or create a new row for new values.
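A rough sketch of that lookup might look like the following. The class and member names here are just for illustration, not the final model:

    using System;
    using System.Collections.Generic;
    using System.Linq;

    // Hypothetical sketch of the normalized description lookup.
    public class TransactionDescription
    {
        public int TransactionDescriptionId { get; init; }
        public string Description { get; init; } = string.Empty;
    }

    public class TransactionDescriptionStore
    {
        private readonly List<TransactionDescription> _descriptions = new();

        // Re-use an existing description where one matches, otherwise create a new row.
        public int FindOrAdd(string description)
        {
            var existing = _descriptions.FirstOrDefault(d =>
                string.Equals(d.Description, description, StringComparison.OrdinalIgnoreCase));
            if (existing is not null) return existing.TransactionDescriptionId;

            var added = new TransactionDescription
            {
                TransactionDescriptionId = _descriptions.Count + 1,
                Description = description
            };
            _descriptions.Add(added);
            return added.TransactionDescriptionId;
        }
    }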

When it comes to dates I decided to treat these as raw values with an unmapped translation to a C# date, so dates are stored in a short-form ISO format of "yyyyMMdd".

For dollar amounts I elected to store these as whole cents, letting the DTO expose a decimal currency value.
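As a quick illustration of the cents approach (the names here are just for the example; the real Transaction DTO follows further down):

    // Illustrative only: the stored value is whole cents, the DTO exposes dollars.
    public class AmountExample
    {
        // e.g. 1545 == $15.45
        public long AmountCents { get; init; }

        public decimal Amount => AmountCents / 100m;
    }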

So a typical record would start to look like:

{ "TransactionId":1, "TransactionDate": "20251012", "Amount": 1545, "DescriptionId": 32 }

... where if the transaction had no description:

{ "TransactionId":1, "TransactionDate": "20251012", "Amount": 1545, "DescriptionId": null }
91 characters.

This is already a bit more compact than if I had taken a default, denormalized C# object and serialized it to JSON. Though we can do better. For a start, when serializing to JSON there is an option for how null values are handled: included or excluded. For data contracts it makes sense to include them so consumers know the value exists in the destination object. For data storage though, we can ignore them. In the second case with no description this becomes:

{ "TransactionId":1, "TransactionDate": "20251012", "Amount": 1545 }
68 characters.
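With Newtonsoft.Json, which I use for the serialization here, that's a single serializer setting ("transaction" stands in for whatever DTO is being stored):

    using Newtonsoft.Json;

    // Omit null values from the serialized output to save space.
    var settings = new JsonSerializerSettings { NullValueHandling = NullValueHandling.Ignore };
    string json = JsonConvert.SerializeObject(transaction, settings);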

Next, we can use the [DataContract] and [DataMember] attributes. Rather than the default opt-out serialization behaviour, where we would need [JsonIgnore] to exclude properties we don't want serialized, with [DataContract] we opt in the properties to include using [DataMember], and we can also specify a shorter name for each value. For instance, with my simple transaction above, the DTO looks something like:

    [DataContract]
    public class Transaction : IDto
    {
        [DataMember(Name = "id")]
        public int TransactionId { get; init; }

        // The raw "yyyyMMdd" value is what gets stored; the DateTime is computed, not serialized.
        [DataMember(Name = "dt")]
        public int TransactionDateRaw { get; init; }
        public DateTime TransactionDate => DateTime.ParseExact(TransactionDateRaw.ToString(), "yyyyMMdd", null);

        // Whole cents.
        [DataMember(Name = "am")]
        public long Amount { get; init; }

        [DataMember(Name = "td")]
        public int? TransactionDescriptionId { get; init; }
    }


The JSON now looks like:

{ "id":1, "dt": "20251012", "am": 1545 }
40 characters.

We have gone from 91 characters to less than half at 40 characters. 

The next change I made was around the date. I was going to want to provide daily figures for things like the balance, while supporting multiple transactions on a given day. When it comes to loans with an offset account, the general recommendation is to use the interest-free period on credit cards for daily transactions and pay the card in full each month. That reduces the number of individual transactions to record, but we'll still have things like bill payments and other transactions falling on the same day. The structure I ended up with pulled the date value up into an AccountDay class containing the starting and final balance along with a collection of transactions. This pulled the repeated date out of the transaction rows.

The offset account manages a SortedList<int, AccountDay> collection of days indexed by the date (yyyyMMdd format). The resulting JSON looks like:

{"na":"Test Account","dz":{"20251001":{"ib":5000,"tx":[],"ba":5000},"20251002":{"ib":5000,"tx":[{"id":1,"am":44400},{"id":2,"am":29153},{"id":3,"am":17166}],"ba":95719},"20251003":{"ib":95719,"tx":[{"id":1,"am":39177},{"id":2,"am":39521},{"id":3,"am":27103},{"id":4,"am":41982}],"ba":243502},"20251004":{"ib":243502,"tx":[{"id":1,"am":43598}],"ba":287100},"20251005":{"ib":287100,"tx":[],"ba":287100},"20251006":{"ib":287100,"tx":[{"id":1,"am":12655},{"id":2,"am":9537},{"id":3,"am":38402}],"ba":347694},"20251007":{"ib":347694,"tx":[{"id":1,"am":46109},{"id":2,"am":43684}],"ba":437487},"20251008":{"ib":437487,"tx":[{"id":1,"am":11688},{"id":2,"am":2481}],"ba":451656},"20251009":{"ib":451656,"tx":[{"id":1,"am":40970}],"ba":492626},"20251010":{"ib":492626,"tx":[{"id":1,"am":32517},{"id":2,"am":20656},{"id":3,"am":17146}],"ba":562945},"20251011":{"ib":562945,"tx":[{"id":1,"am":1350},{"id":2,"am":21341}],"ba":585636},"20251012":{"ib":585636,"tx":[],"ba":585636},"20251013":{"ib":585636,"tx":[],"ba":585636},"20251014":{"ib":585636,"tx":[{"id":1,"am":41260},{"id":2,"am":31053}],"ba":657949},"20251015":{"ib":657949,"tx":[{"id":1,"am":47606}],"ba":705555}}}
It isn't very readable by any stretch, but it is compact, and it deserializes back into the object model where all of the work and formatting is done.
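Working backwards from that JSON, the account and day DTOs look roughly like the sketch below. The class names, property names, and types beyond the short serialized names are my guesses at the shape, not the exact implementation:

    using System.Collections.Generic;
    using System.Runtime.Serialization;

    [DataContract]
    public class OffsetAccount : IDto
    {
        [DataMember(Name = "na")]
        public string Name { get; init; } = string.Empty;

        // Days keyed by date as yyyyMMdd, e.g. 20251012.
        [DataMember(Name = "dz")]
        public SortedList<int, AccountDay> Days { get; init; } = new();
    }

    [DataContract]
    public class AccountDay : IDto
    {
        // Starting balance for the day, in cents.
        [DataMember(Name = "ib")]
        public long InitialBalance { get; init; }

        [DataMember(Name = "tx")]
        public List<Transaction> Transactions { get; init; } = new();

        // Final balance for the day, in cents.
        [DataMember(Name = "ba")]
        public long Balance { get; init; }
    }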

My thoughts are that I'll load and save account data in parcels by month or by year, depending on the final data size, trying to ensure this tool remains device friendly. For instance, if data is automatically parcelled and serialized by "yyyyMM", then the "days" key shrinks from "yyyyMMdd" to just "dd", saving even more space.
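A rough sketch of what such a month "packet" might look like; this is entirely hypothetical at this point:

    // Hypothetical month packet: the yyyyMM portion would live in the storage key
    // (e.g. "account-1-202510"), so days only need to be keyed by the dd part.
    [DataContract]
    public class AccountMonth : IDto
    {
        // Keyed by day of month, 1-31.
        [DataMember(Name = "dz")]
        public SortedList<int, AccountDay> Days { get; init; } = new();
    }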

I could have left it at this and let Blazored.LocalStorage serialize this structure, and I'd likely have enough local storage for decades of data for the typical user, but Blazored.LocalStorage can also store string values, which got me thinking: what if I serialized and compressed the data? Compression should significantly shrink the JSON, but then I'd need to Base64 encode the result, adding about a third to the compressed size. From here I wanted to see what a typical use case might look like, so I generated around 15 years' worth of transactions. The test generator built anywhere from around 35,000 to 50,000 records. With the original JSON objects, this amounted to around 1.3MB. The shortened JSON properties dropped that to around 545kB. The last step was to see what compression might do.

    using System;
    using System.IO;
    using System.IO.Compression;
    using System.Text;
    using Newtonsoft.Json;

    public static string? Serialize(this IDto dto)
    {
        try
        {
            // Serialize without null values, then DEFLATE-compress and Base64-encode the result
            // so it can be stored as a plain string in local storage.
            string json = JsonConvert.SerializeObject(dto, new JsonSerializerSettings { NullValueHandling = NullValueHandling.Ignore });
            var buffer = Encoding.UTF8.GetBytes(json);
            using var fromStream = new MemoryStream(buffer);
            using var toStream = new MemoryStream();

            // The DeflateStream must be closed before reading the output stream,
            // otherwise the final compressed block is never written.
            using (var zipStream = new DeflateStream(toStream, CompressionLevel.Optimal, leaveOpen: true))
            {
                fromStream.CopyTo(zipStream);
            }

            return Convert.ToBase64String(toStream.ToArray());
        }
        catch (Exception)
        {
            return null;
        }
    }

    public static T? Deserialize<T>(string dataBase64) where T : IDto
    {
        if (string.IsNullOrEmpty(dataBase64)) return default;
        try
        {
            // Reverse the process: Base64-decode, decompress, then deserialize the JSON.
            var data = Convert.FromBase64String(dataBase64);
            using var fromStream = new MemoryStream(data);
            using var toStream = new MemoryStream();
            using var zipStream = new DeflateStream(fromStream, CompressionMode.Decompress);

            zipStream.CopyTo(toStream);
            string json = Encoding.UTF8.GetString(toStream.ToArray());
            return JsonConvert.DeserializeObject<T>(json);
        }
        catch (Exception)
        {
            return default;
        }
    }

With this I was able to get the data size down to 98kB using DEFLATE, including the Base64 conversion. This also means I can easily package up the local storage data and offer to save a backup on request, and re-import it into local storage later.
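Hooking that into Blazored.LocalStorage then comes down to its string-based methods. A rough sketch, where AccountStorageService, the storage key, and the OffsetAccount type are assumptions for the example:

    using System.Threading.Tasks;
    using Blazored.LocalStorage;

    // Hypothetical service wrapping the compressed save/load; "offset-account" is a made-up key.
    public class AccountStorageService(ILocalStorageService localStorage)
    {
        public async Task SaveAccountAsync(OffsetAccount account)
        {
            var packed = account.Serialize();  // extension method from above
            if (packed is not null)
                await localStorage.SetItemAsStringAsync("offset-account", packed);
        }

        public async Task<OffsetAccount?> LoadAccountAsync()
        {
            var packed = await localStorage.GetItemAsStringAsync("offset-account");
            // Deserialize<T> is the static helper from above, assumed to be in scope.
            return string.IsNullOrEmpty(packed) ? null : Deserialize<OffsetAccount>(packed);
        }
    }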

The next question was the total time needed to serialize/deserialize with and without compression. Times did vary between runs, but the general impact was that compression added around 20% to the read and write times. This was while running on a PC, so I'm far more cautious about this impact when running on a device once I have a suitable build ready for testing. However, this was reading and writing 15 years' worth of data in one hit, and even then it added around 100ms to a 500ms operation. The goal is to parcel data by month, or by year at worst, so the complete data set doesn't need to be loaded.

In any case, this provided a lot of food for thought about JSON-serialized data for storage, as opposed to a more contractual or otherwise visible medium for communication. When we want to package data tightly for storage or transport, we can leverage options in the JSON serialization as well as compression when we have compute available on the consumer. The trade-offs to balance, besides size and performance, also include the complexity of working with the data and the potential for bugs (for instance, if I move the "yyyyMM" aspect out of the AccountDay and rely on the loaded month "packet" of data to supply it).

Friday, August 29, 2025

When will AI's "Monsanto Moment" come?

I cannot help but see parallels between the introduction of AI into software development and the rise of gene modification in crops. Crops have had their genes modified through domestication for centuries, but the rise of corporations like Monsanto popularizing lab-based manipulation led to the patent system being used to protect that investment. Back in the early 2000s there was a big push to scare consumers about supposed risks of GMO crops, and a lot of that fear carries over today, with consumers demanding non-GMO produce on the assumption that it is healthier. The reality behind that push wasn't about health and safety; it was about money and the use of these patents to force farmers to change their practices. Prior to GMO crops, farmers would buy seed stock, and if they had a good harvest they could opt to save some of it as seed for the next year. When they bought GMO seed stock, stuff resistant to insects or herbicides, they signed a contract that barred them from keeping seed stock; each year they would need to buy the full allotment of GMO seed. The company would even go so far as to sue adjacent farmers that might benefit from cross-pollination from GMO fields. This rather heavy-handed treatment of farmers wasn't likely to garner much sympathy from a general public that only stood to benefit from the increased yields. Still, to strike back at these corporations, the farmers' supporters attacked GMO crops any way they could.

Today, AI tools are becoming increasingly available, at low cost or even free. People have started asking questions around ownership, both in terms of the licensing/copyright of the code these AI tools have been trained on, and the code they in turn generate. For now, companies like Microsoft will claim that if you use an AI tool and in turn get challenged by a copyright holder for a violation, they've got your back, while at the same time the code the tool generates is your IP. But for how long? Monsanto didn't start from scratch with the gene makeup of the crops it improved. Generations of botanists' and farmers' experience and cross-breeding had supplied the current base. Much of that was done for little more than recognition, out in the public domain, with the expectation that future improvements would remain freely available. That is, until big industry found a way to patent its work, commercialize it, and take ownership of it. When it comes to business, it is a lot like fishing. When you first sense a fish is biting at your bait, you need to resist the urge to yank the line or you pull the hook out of its mouth. Fish can be clever, grabbing the edge of the bait and waiting for a free meal. No, instead you wait patiently until the fish is committed, then pull and set the hook. Once they could convince the patent office and the courts, the hook was firmly set.

Today, companies like Microsoft have invested a good deal of money into developing and marketing these AI tools. They are putting their lines out in the water with tasty bait, offering to help companies and development teams produce better quality code and products faster than ever. They are being patient so as not to spook the fish. Today you own what the tool generates, but soon, I'd wager, companies like Microsoft will set that hook and, like Monsanto, demand their share of the value of the yield you produce directly or indirectly from their seed, by force. How exactly that shapes up, we'll have to see. Perhaps terms that once you start development with AI assistance you cannot "opt out"? Or will they demand part ownership on the basis that their tools generated a share of the IP in the end product?

Thursday, August 21, 2025

Why I don't Grok AI

I'm a bit of a dinosaur when it comes to software development. I've been on the rollercoaster chasing the highs from working with a new language or new toolset. I've ridden through the lows when a technology I happen to really enjoy working with ends up getting abandoned. (Silverlight, don't get me started. For a website? Never. For intranet web apps? Chef's kiss.)

I'm honestly not that worried about AI tools in software development. As an extension to existing development tools, if it makes your life simpler, all the more power to you. Personally I don't see myself ever using it, for a few reasons. One reason is the same as why I don't use tools like ReSharper. You know, the 50,000 different hotkey combinations that can insert templated code, etc. etc. The reason I don't use tools like that is that, for me, coding is about 75% thinking and 25% actual writing. I don't need, nor want, to write code faster. Often in thinking about code I realize better ways to do it, or in some cases, that I don't actually need that code at all. Having moar code fast can be overwhelming. Sure, AI tools are trained (hopefully) on best practices and should theoretically produce better code the first time around, not needing as much refactoring, but the time to think and tweak is valuable to me. It's a bit like the tortoise and the hare. Someone with AI assistance will probably produce a solution far faster than someone without, but at the end of the day, what good is speed if you're zipping along producing the wrong solution? Call me selfish, but I also think any developer should see the writing on the wall: if a tool saves them 50% of their time, employer expectations are going to push for 100% more work out of them in a day.

The second main reason I don't see myself using AI is that when it comes to stuff I don't know, or need to brush back up on, I want to be sure I fully understand the code I am responsible for, not just request something from an LLM. Issues like "impostor syndrome" are already a problem in many professions. I don't see the situation getting anything but worse when a growing portion of what you consider "employment" is feeding and changing the diapers on a bot. I have the experience behind me to be able to look at the code an LLM generates and determine whether it's fit for purpose, or whether the model's been puffing green dragon. What somewhat scares me is the idea of "vibe coding", where people that don't really understand coding use LLMs in a form of trial and error to get a solution done. Building a prototype? Great idea. Something you're going to convince people or businesses to actually use with sensitive data or decisions with consequences? Bad, bad idea.

Personally I see the value of LLM-based code generation plateauing rather quickly. It will get better, to a point, as it continues to learn from samples and corrections written and reviewed by experienced software developers. However, as GitHub starts to fill with AI-generated code, and sites like StackOverflow die off with the new generation of developers consulting LLMs for "get this working for me" rather than "explain why this doesn't work", the overall quality of generated code will start to slip. With luck it will be noticeable before major employers dive all-in, give up on training new developers to understand code and solve problems, and all of us dinosaurs retire.

Until then I look forward to lucrative contracts sorting out messes that greenhorns powered by ChatGPT get themselves into. ;)

Autofac and Lazy Dependency Injection: 2025 edition

Thank you, C#! I can't believe it's been two years since I last posted about my lazy dependency implementation. Since that time there have been a few updates to the C# language, in particular around auto-properties, that have greatly simplified the use of lazy dependencies and their property overrides for unit testing. I also make use of primary constructors, which are ideally suited to the pattern since, unlike regular constructor injection, the assertions happen in the property accessors, not the constructor.

The primary goal of this pattern is still to leverage lazy dependency injection while making it easier to swap in testing mocks. Classes like controllers can have a number of dependencies, but depending on the action and state passed in, many of those dependencies don't actually get used in all situations. Lazy loading sees dependencies initialized/provided only if they are needed, however this adds a layer of abstraction to access the dependency when it's needed, and makes mocking the dependency out for unit tests a tad more ugly.

The solution, which I call lazy dependencies + properties, mitigates these two issues. The property accessor handles unwrapping the lazy proxy to expose the dependency for the class to consume, and it also allows a substitute (such as a mock) to be injected. Each lazy dependency in the constructor is optional. If the IoC container doesn't provide a dependency, or a test does not mock a referenced dependency, the property accessor throws a purpose-built DependencyMissingException to note that the dependency was not provided.

Updated pattern:

 using System.Diagnostics.CodeAnalysis;

 public class SomeClass(Lazy<ISomeDependency>? _lazySomeDependency = null)
 {
     // The backing field starts null and is populated on first access, or by a test via the init setter.
     [field: MaybeNull]
     public ISomeDependency SomeDependency
     {
         // Unwrap the lazy dependency on first use, or fail fast if nothing was provided.
         protected get => field ??= _lazySomeDependency?.Value ?? throw new DependencyMissingException(nameof(SomeDependency));
         init;
     }
 }

This is considerably simpler than the original implementation. We can use the primary constructor syntax since we do not need to assert whether a dependency was injected. Under normal circumstances all lazy dependencies will be injected by the container, but asserting them falls to the property accessor. No code, save the accessor property, should attempt to access dependencies through the lazy wrapper. The auto-property syntax new to C# gives us access to the field keyword. We also leverage a public init setter so that our tests can inject mocks for any dependencies they will use, while the getter remains protected (or private) for accessing the dependency within the class. The dependency property will look for an initialized instance, then check the lazy injected source, before raising an exception if the dependency has not been provided.
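The exception itself isn't shown here, but a minimal version could look something like this (my sketch, not the original implementation):

// Minimal sketch of the purpose-built exception referenced above.
public class DependencyMissingException(string dependencyName)
    : Exception($"Dependency '{dependencyName}' was not provided by the container or the test.")
{
    public string DependencyName { get; } = dependencyName;
}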

Unit tests provide a mock through the init setter rather than trying to mock a lazy dependency:

Mock<ISomeDependency> mockDependency = new();
mockDependency.Setup(x => /* set up mocked scenario */);

var classUnderTest = new SomeClass
{
    SomeDependency = mockDependency.Object
};

// ... Test behaviour, assert mocks.

In this simple example it may not look particularly effective, but in controllers that have several, maybe a dozen, dependencies, this can significantly simplify test initialization. If a test scenario is expected to touch 3 out of 10 dependencies, then you only need to provide mocks for those 3 rather than always mocking all 10 for every test. If internal code is updated to touch a 4th dependency then the tests will break until they are updated with suitable mocks for the extra dependency. This lets you mock only what you need to, and avoids silent or confusing failures where catch-all defaulted mocks respond to scenarios they were never intended to handle.
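For completeness, here is a minimal sketch of the Autofac wiring. Autofac provides Lazy<T> implicitly for any registered service, so nothing special is needed beyond registering the types; SomeDependency is a stand-in concrete implementation for this example:

using Autofac;

var builder = new ContainerBuilder();
builder.RegisterType<SomeDependency>().As<ISomeDependency>();
builder.RegisterType<SomeClass>();

using var container = builder.Build();

// Autofac supplies the Lazy<ISomeDependency> constructor parameter automatically;
// the dependency is only constructed when the property is first accessed.
var someClass = container.Resolve<SomeClass>();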