Friday, January 29, 2010

Append-Only Data Models

An append-only data model follows that when a data entity changes, each change is represented as an insert against the relevant table. There are no updates or deletes recorded against the data. The domain model and any adopted ORM strategy needs to reflect this behaviour difference. Soft-delete is a term used when records are marked inactive rather than permanently deleted from the database. This is an integral part of append-only models, but can be easily employed separately from append-only models.

Advantages of Append-OnlyAppend-only models offer several advantages to applications. The most significant is a built-in audit trail for reviewing all changes to records and a snapshot of a record at any point in time. This audit trail can be used for reporting, undo-chains, and review/approval behaviour.

Challenges of Append-OnlyAppend-only models automatically increase the amount of storage space needed to represent data historically. This proves to raise additional challenges when dealing with ORMs to ensure references correctly point to the latest version when applicable, and avoiding new associations with stale references.

Structuring Data for Append-OnlyThere are 3 conventional approaches for adopting an append-only data model. Each addresses the fact that for any given data entity there can be one or more record in the database to represent it over time.

Key + Version Number
The first option is to introduce a version number and form a concatenated key between the record ID and Version. New records receive the incremented version number. An optional table can be introduced to help indicate the most current version number for a data entity rather than performing index scans against the table.

Advantages
  1. ID is preserved between versions. (Suitable for meaningful IDs)
  2. FKs by ID are preserved. (Though not unique by themselves so caution is needed when querying.)
  3. Ancestor records are “pure”. (Unmodified when new descendants are created.)
  4. Old records can be archived/deleted easily.

Disadvantages
  1. Concatenated PK.
  2. Not intuitive to map relationships that are versioned.

Timestamps can be added in place of version numbers, or complimenting version numbers to provide functionality to assess a record’s state a particular point in time.

Ancestor Reference + Version Number
The second option is to use a version number in combination with a reference (FK) to the record’s ancestor. This approach utilizes a meaningless PK for the record so each new version has a unique ID plus a reference to the previous version. (The ancestor.) The version number is tracked as well as a reference to identify the most current version of the record.



At some point we may want to archive or discard older records. This can be a bit of a problem with the FK relationship between record and ancestor when the ancestor is removed from the production data.


Setting a revision’s ancestor reference to null is fine if the ancestor is deleted, however when archiving, special care will be needed to re-attach the reference when the revision itself is moved to the archive.

Advantages
  1. Unique IDs suitable for FK relationship. (No concatenation needed.)
  2. Ancestor records are “pure”. (Unmodified when new descendants are created.)
  3. Maps easier to ORM solutions, though care is needed to ensure stale references are avoided.

Disadvantages
  1. External FK relationships must be updated to the new version.
  2. May require index scans like above solution to keep references fresh.
  3. Old records cannot be easily removed due to FK relationships.

Descendant Reference
The third option is to use a reference (FK) to the record’s descendant. This is similar to the second approach however instead of the current object holding a reference to its elder, where each elder is modified to contain a reference to its descendant. The advantage in this model is that the version number is not required, as the current instance will always have a null descendant ID. This makes it simple to detect when a reference has been made stale.

It also facilitates deleting and archiving better than the ancestor model since records can be deleted without invalidating FKs. The archived records, not remaining production records, merely require a placeholder proxy for the cut off point descendant.



Advantages
  1. Unique IDs suitable for FK relationship. (No concatenation needed.)
  2. Simpler detection for current instance and stale references.
  3. Old records can easily be removed or archived.

Disadvantages
  1. External FK relationships must be updated to the new version.
  2. Ancestor records are not “pure”. Ancestor records must be updated with their descendant reference.

Snapshots
A snapshot refers to a model where a snapshot of the data is taken and recorded before each and every data modification. This is not append-only by definition because it serves only to provide a picture of historical data, but it is not geared towards making versioned data accessible to references. Snapshots can be recorded in the same table, or in separate history tables. Snapshots can also be configured to be performed automatically by the database. This can be beneficial in situations where data can be manipulated by more than one application or process.

What Version Do I Point To?
Regardless of the approach taken, it is important that the domain model for the application is designed with append-only behaviour in mind. The most significant challenge in adopting this model is that you need to be explicit when dealing with the relationships between data. Certain design and behaviour decisions are needed within the application. If a revision is made to an object that is referenced by other objects, should those references be automatically updated to the latest revision, or is it valid that they can remain referenced to the version that was current when the association was made?

For example, if we make a change to a Doctor entity to update contact details or other relevant information, it would probably be beneficial to ensure that any appointments associated to that Dr., past, present, or future, should be updated to the current revision of the Doctor’s record. Alternatively, if we were to adjust a billing rate applicable for a service, we’d likely want to ensure that past, and possibly present appointments refer to the previous version of the record, while new and future appointments reflect the updated rate. (Possibly prompting users to ask whether or not present and future appointments should be updated or not.)

This is an important consideration when designing domain and data models around this kind of behaviour. For instance, having a service such as an Interpreter in the system, we’d want to be sure that information such as contact details would be kept separate from figures such as billing charges. Some data we may want to ensure is always pointing to the current revision, while others may be more selective.

In the case of selective association, this is one scenario where option 1 outlined above becomes a clumsy choice. Any foreign key association where the revision can be selective requires both the ID and version number. In the case of an appointment to service provider billing information, the appointment will likely have a revision number, as well as the ID to the billing information and the version of the billing information.

No comments:

Post a Comment