Thursday, October 16, 2025

Shrinking JSON for Data Storage

Lately I have been working on a pet project to help visualize how an offset account impacts a mortgage. This has mainly been a foray into building apps in Blazor. The goal is a version of the app that can run fully offline using local storage, which I'm doing with the help of Blazored.LocalStorage. Behind the scenes it serializes objects to JSON for local storage, which works just fine.

However, this did raise a concern: the data being stored will be entered over years, tracking transaction amounts to compute the offset impact. Each record is reasonably small, but there will potentially be a fair number of them, and I'd like the option to include things like descriptive names. JSON adds its own overhead, since the structure and all values are represented as text. With databases in modern systems, space for storage and indexing is cheap. With local storage we are limited to around 5MB, so we need to be a bit more careful with our data. This turned into a rather interesting problem of finding a balance between ease of use and storage space.

The first thing I looked at was ensuring that I was only storing info that was absolutely needed. With a relational database a Transaction record might look something like:

TransactionId, TransactionDate, AccountId, Amount, Description

With this offline store, the data looks more like a document store, where transactions sit under an account document. We still want an Id for the transaction within the document, but we can drop the foreign key. As for the description, many rows will be null, but of the ones people do fill in, there is a good chance many will repeat. For this I decided to normalize by adding a TransactionDescription table. This replaces Description with TransactionDescriptionId, an integer rather than a string in each row. As users start typing we can look up common descriptions and offer to link to an existing one, or create a new row for new values.
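
A rough sketch of how that lookup could work is below. The class shape and method name are my illustration of the idea rather than the app's actual code:

    // Illustrative only: a normalized description row plus a lookup-or-create helper.
    // Requires: using System; using System.Collections.Generic; using System.Linq;
    public class TransactionDescription
    {
        public int TransactionDescriptionId { get; init; }
        public string Text { get; init; } = string.Empty;
    }

    public static int GetOrCreateDescriptionId(IList<TransactionDescription> descriptions, string text)
    {
        // Reuse an existing description when the text matches (case-insensitive).
        var existing = descriptions.FirstOrDefault(d =>
            string.Equals(d.Text, text, StringComparison.OrdinalIgnoreCase));
        if (existing is not null)
            return existing.TransactionDescriptionId;

        // Otherwise create a new row with the next available id.
        var created = new TransactionDescription
        {
            TransactionDescriptionId = descriptions.Count == 0
                ? 1
                : descriptions.Max(d => d.TransactionDescriptionId) + 1,
            Text = text
        };
        descriptions.Add(created);
        return created.TransactionDescriptionId;
    }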

When it comes to dates I decided to treat these as raw values with an unmapped translation to a C# date, so dates are stored in a short-form ISO style of "yyyyMMdd".

For dollar amounts I elected to store whole cents, letting the DTO expose a derived currency value.
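
Keeping those conversions in one place makes the raw values painless to work with. A minimal sketch, assuming the amount is exposed as a decimal (the helper names here are mine, not the app's):

    // Requires: using System; using System.Globalization;
    public static class RawValueConversions
    {
        // Dates stored as a raw yyyyMMdd value, e.g. 20251012.
        public static int ToRawDate(DateTime date) => int.Parse(date.ToString("yyyyMMdd"));

        public static DateTime FromRawDate(int raw) =>
            DateTime.ParseExact(raw.ToString(), "yyyyMMdd", CultureInfo.InvariantCulture);

        // Amounts stored as whole cents, exposed as dollars.
        public static decimal ToDollars(long cents) => cents / 100m;

        public static long ToCents(decimal dollars) => (long)Math.Round(dollars * 100m);
    }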

So a typical record would start to look like:

{ "TransactionId":1, "TransactionDate": "20251012", "Amount": 1545, "DescriptionId": 32 }

... where if the transaction had no description:

{ "TransactionId":1, "TransactionDate": "20251012", "Amount": 1545, "DescriptionId": null }
91 characters.

This is already a bit more compact than serializing a default, denormalized C# object to JSON, but we can do better. For a start, JSON serializers have an option for how null values are handled: included or excluded. For data contracts it makes sense to include them so consumers know the value exists on the destination object. For data storage we can simply omit them. The second case, with no description, becomes:

{ "TransactionId":1, "TransactionDate": "20251012", "Amount": 1545 }
68 characters.
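
In my case that's just a serializer setting. With Newtonsoft.Json (which I use for the storage path further down) it looks something like the following, and System.Text.Json has an equivalent option:

    // Omit null-valued properties from the stored JSON.
    // 'transaction' here stands in for whatever DTO instance is being stored.
    var settings = new JsonSerializerSettings { NullValueHandling = NullValueHandling.Ignore };
    string json = JsonConvert.SerializeObject(transaction, settings);

    // System.Text.Json equivalent:
    // new JsonSerializerOptions { DefaultIgnoreCondition = JsonIgnoreCondition.WhenWritingNull }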

Next, we can use the [DataContract] and [DataMember] attributes. Rather than the default opt-out serialization behaviour, where we would need [JsonIgnore] to exclude properties we don't want serialized, with [DataContract] we opt in the properties to include using [DataMember], and we can also specify a shorter name for each value. For instance, with my simple transaction above, the DTO looks something like:

    // Requires: using System.Runtime.Serialization; using System.Globalization;
    [DataContract]
    public class Transaction : IDto
    {
        [DataMember(Name = "id")]
        public int TransactionId { get; init; }

        // The raw yyyyMMdd value is what gets stored; the DateTime is derived on read.
        [DataMember(Name = "dt")]
        public int TransactionDateRaw { get; init; }
        public DateTime TransactionDate => DateTime.ParseExact(TransactionDateRaw.ToString(), "yyyyMMdd", CultureInfo.InvariantCulture);

        [DataMember(Name = "am")]
        public long Amount { get; init; }

        [DataMember(Name = "td")]
        public int? TransactionDescriptionId { get; init; }
    }


The JSON now looks like:

{ "id":1, "dt": "20251012", "am": 1545 }
40 characters.

We have gone from 91 characters to less than half at 40 characters. 

The next change I made was around the date. I want to provide daily figures for things like the balance while supporting multiple transactions on a given day. When it comes to loans with an offset account, the general recommendation is to use the interest-free period on a credit card for daily spending and pay the card off in full each month. That reduces per-transaction reporting, but we'll still have things like bill payments and other transactions falling on the same day. The structure I ended up with pulls the date up into an AccountDay class containing the starting and final balance along with a collection of transactions, removing the repeated date from the transaction rows.

The offset account manages a SortedList<int, AccountDay> collection of days indexed by the date (yyyyMMdd format). The resulting JSON looks like:

{"na":"Test Account","dz":{"20251001":{"ib":5000,"tx":[],"ba":5000},"20251002":{"ib":5000,"tx":[{"id":1,"am":44400},{"id":2,"am":29153},{"id":3,"am":17166}],"ba":95719},"20251003":{"ib":95719,"tx":[{"id":1,"am":39177},{"id":2,"am":39521},{"id":3,"am":27103},{"id":4,"am":41982}],"ba":243502},"20251004":{"ib":243502,"tx":[{"id":1,"am":43598}],"ba":287100},"20251005":{"ib":287100,"tx":[],"ba":287100},"20251006":{"ib":287100,"tx":[{"id":1,"am":12655},{"id":2,"am":9537},{"id":3,"am":38402}],"ba":347694},"20251007":{"ib":347694,"tx":[{"id":1,"am":46109},{"id":2,"am":43684}],"ba":437487},"20251008":{"ib":437487,"tx":[{"id":1,"am":11688},{"id":2,"am":2481}],"ba":451656},"20251009":{"ib":451656,"tx":[{"id":1,"am":40970}],"ba":492626},"20251010":{"ib":492626,"tx":[{"id":1,"am":32517},{"id":2,"am":20656},{"id":3,"am":17146}],"ba":562945},"20251011":{"ib":562945,"tx":[{"id":1,"am":1350},{"id":2,"am":21341}],"ba":585636},"20251012":{"ib":585636,"tx":[],"ba":585636},"20251013":{"ib":585636,"tx":[],"ba":585636},"20251014":{"ib":585636,"tx":[{"id":1,"am":41260},{"id":2,"am":31053}],"ba":657949},"20251015":{"ib":657949,"tx":[{"id":1,"am":47606}],"ba":705555}}}
It isn't very readable by any stretch, but it is compact, and it deserializes back into the object model where all of the work and formatting is done.
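
For reference, the shapes behind that output look roughly like the sketch below. The class and property names are inferred from the JSON keys ("na" = name, "dz" = days, "ib"/"ba" = opening and closing balance, "tx" = transactions) rather than lifted from the app:

    [DataContract]
    public class AccountDay : IDto
    {
        [DataMember(Name = "ib")]
        public long InitialBalance { get; init; }

        [DataMember(Name = "tx")]
        public List<Transaction> Transactions { get; init; } = new();

        [DataMember(Name = "ba")]
        public long Balance { get; init; }
    }

    [DataContract]
    public class OffsetAccount : IDto
    {
        [DataMember(Name = "na")]
        public string Name { get; init; } = string.Empty;

        // Keyed by the raw yyyyMMdd date, so each date appears once as the
        // dictionary key rather than being repeated on every transaction.
        [DataMember(Name = "dz")]
        public SortedList<int, AccountDay> Days { get; init; } = new();
    }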

My thinking is that I'll load and save account data in parcels by month or by year, depending on the final data size, to help the tool stay device friendly. For instance, if data is automatically parcelled and serialized by "yyyyMM", then the "days" key shrinks from "yyyyMMdd" to just "dd", saving even more space.
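
A sketch of that splitting idea, purely to illustrate (none of this exists in the app yet):

    // Group a full day list into per-month parcels keyed by yyyyMM, with each
    // day stored under its two-digit dd key only.
    public static Dictionary<int, SortedList<int, AccountDay>> ParcelByMonth(SortedList<int, AccountDay> days)
    {
        var parcels = new Dictionary<int, SortedList<int, AccountDay>>();
        foreach (var (rawDate, day) in days)
        {
            int month = rawDate / 100;      // 20251012 -> 202510
            int dayOfMonth = rawDate % 100; // 20251012 -> 12
            if (!parcels.TryGetValue(month, out var parcel))
                parcels[month] = parcel = new SortedList<int, AccountDay>();
            parcel[dayOfMonth] = day;
        }
        return parcels;
    }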

I could have left it at this and let Blazored.LocalStorage serialize the structure, and I'd likely have enough local storage for decades of data for a typical user. But Blazored.LocalStorage can also store raw string values, which got me thinking: what if I serialized and compressed the data? Compression should significantly shrink the JSON, though I'd then need to Base64-encode the result, adding roughly a third to the compressed size. To see what a typical use case might look like, I generated around 15 years' worth of transactions; the test generator produced anywhere from roughly 35,000 to 50,000 records. With the original JSON objects this amounted to around 1.3MB. The shortened JSON properties dropped that to around 545kB. The last step was to see what compression could do.

    // Requires: using System.IO; using System.IO.Compression; using System.Text;
    // and using Newtonsoft.Json;
    public static string? Serialize(this IDto dto)
    {
        try
        {
            string json = JsonConvert.SerializeObject(dto, new JsonSerializerSettings { NullValueHandling = NullValueHandling.Ignore });
            var buffer = Encoding.UTF8.GetBytes(json);
            using var fromStream = new MemoryStream(buffer);
            using var toStream = new MemoryStream();

            // Dispose the DeflateStream before reading toStream so the final
            // compressed block is written; leaveOpen keeps toStream usable.
            using (var zipStream = new DeflateStream(toStream, CompressionLevel.Optimal, leaveOpen: true))
            {
                fromStream.CopyTo(zipStream);
            }

            return Convert.ToBase64String(toStream.ToArray());
        }
        catch (Exception)
        {
            // Treat any serialization/compression failure as "nothing to store".
            return null;
        }
    }

    public static T? Deserialize<T>(string dataBase64) where T : IDto
    {
        if (string.IsNullOrEmpty(dataBase64)) return default;
        try
        {
            var data = Convert.FromBase64String(dataBase64);
            using var fromStream = new MemoryStream(data);
            using var zipStream = new DeflateStream(fromStream, CompressionMode.Decompress);
            using var toStream = new MemoryStream();

            zipStream.CopyTo(toStream);
            string json = Encoding.UTF8.GetString(toStream.ToArray());
            return JsonConvert.DeserializeObject<T>(json);
        }
        catch (Exception)
        {
            return default;
        }
    }
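
Hooking that into storage is then a couple of calls. Assuming Blazored.LocalStorage's string-based methods (SetItemAsStringAsync / GetItemAsStringAsync) so the library's own JSON serialization doesn't touch the payload, and using the OffsetAccount sketch from earlier plus a placeholder key, usage looks roughly like:

    // _localStorage is an injected Blazored.LocalStorage ILocalStorageService,
    // 'account' is an OffsetAccount instance, and this runs inside an async method.
    var payload = account.Serialize();
    if (payload is not null)
        await _localStorage.SetItemAsStringAsync("account-2025", payload);

    var stored = await _localStorage.GetItemAsStringAsync("account-2025");
    // Assumes the static helper class above is in scope (e.g. via using static).
    var restored = string.IsNullOrEmpty(stored) ? null : Deserialize<OffsetAccount>(stored);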

With this I was able to get the data size down to 98kB using DEFLATE compression, including the Base64 conversion. It also means I can easily package up the data and offer to save a backup from local storage on request, and re-import it later.

The next question was the total time needed to serialize/deserialize with and without compression. Times varied between runs, but in general compression added around 20% to the read and write times. This was while running on a PC, so I'm far more cautious about the impact on a device once I have a suitable build ready for testing. However, this was reading and writing 15 years' worth of data in one hit, and even then it was adding around 100ms to a 500ms operation. The goal is to parcel data by month, or by year at worst, so the complete data set never needs to be loaded.

In any case this provided a lot of food for thought about JSON serialized for storage or transport, as opposed to a more contractual or otherwise visible medium. When we want to package data tightly we can lean on options in the JSON serializer as well as compression, provided compute is available on the consumer. The trade-offs to balance, besides size and performance, include the complexity of working with the data and the potential for bugs (for instance, if I move the "yyyyMM" aspect out of AccountDay and rely on the loaded month "packet" of data).
