If data never changed or needed to be updated our lives as software designers might be a little easier, we could just store the data once and be done with it. But data changes, it gets edited and deleted, and we designers cringe when we hear a user utter the words, “Can I undo that?” Its what Einstein did to Physics, he messed with the time dimension and everything came unraveled. Fortunately, for most of us we don’t have to work at the messy quantum level (yet). As it is with web software designers and data versioning, you might design twenty applications without having to worry much about it and then one day it hits you. “Can I undo?” Desktop application software designers have had to deal with the “undo” problem for many years, and now its time that web and IT software designers jump in. To make matters more complicated we have to figure out how to solve this problem for vast amounts of data typically stored in databases. Let’s call it Web 2.0 with the undo feature!
Of course undo is only one feature related to data versioning. There are many other reasons you might want to track changes. Here is a list of some typical reasons:
- Ability to undo changes
- Reporting and anaylsis on historical changes to data
- Auditing who made changes to data (and what changes they made)
- Taking a snapshot of data for contractual purposes. (e.g. taking a snapshot when an invoice is printed.)
How versioning should be modeled in your application depends on what features you are trying to achieve. While it may be academically rewarding trying to solve for all of these features, it may be unnecessarily expensive since most systems (such as relational databases) don’t have a good native way to deal with versioning.
Time itself is not absolute. You don’t have to be traveling at the speed of light to see this. Try hooking a desktop application in Finland to a server hosted in England. If the combined system depends on time as registered by both computers then the server may think your PC in Finland is operating one or two hours in the future (depending on the time of year and daylight savings time adjustments.) Even within the same timezone the slight amount of time that the clocks of a server and a PC differ may make all the difference. The moral of the story: choose a standard for where to get the time and keep time zones in mind.
What about dates? They are fairly fixed, aren’t they? Once you get around the time zone thing and unless you are dealing with large ranges of dates, in practice it seems so. Computers have been programmed to handle the fact that a year is actually a little under 365.25 days. Don’t lookup the term “year” on wikipedia unless you want the ugly truth about how all this works. Suffice it to say that humans haven’t always used our modern calendar and so dates in the past have been recorded using many different schemes including basing them on the cycles of the moon. The Date class in Java represents a specific moment in time to the millisecond, but the methods for converting Date to a month, day, or year have been deprecated. Why? Because there are different calendars to choose from. The GregorianCalendar class allows you to a way to convert a Date into the days in our modern calendar. How very flexible! Even still our modern handling of time is still not exactly precise, but it is close enough for most applications.
Logging Changes
One approach to tracking changes is to log them in a separate column or table in the database. Let’s say that we want to analyze the changes that an online user makes to their shopping cart which is stored in relational tables:
Shopping Cart Table NUMBER cart_id NUMBER user_id VARCHAR(1000) description ... other data ... VARCHAR(32767) change_log
Shopping Cart Lines Table NUMBER line_id NUMBER cart_id NUMBER product_name NUMBER quantity NUMBER price ... other data ...
The change_log column in the table could contain a comma delimited list of the changes made by the user that looks something like this:
01-JAN-2007 created, 01-JAN-2007 added part #8839 to cart, 02-JAN-2007 added part #6746 to cart, 03-JAN-2007 removed part #8839
Don’t be too horrified. I have seen many systems designed like this and they work as a quick and dirty solution for limited troubleshooting purposes. Just remember to concatenate the log string down to 32767 characters when you update the table and at least your application won’t crash. Extending this methodology gets a little trickier. A change_log column could be added to the lines table, where you could have a little more room for quantity and price details but that won’t work for deletes. My opinion on this methodology is that if you choose the quick and dirty, keep it simple and abandon it altogether if your needs get more complex.
The next step up in fancy in terms of design options is to pull the change log out into a separate table (or tables):
Shopping Cart Change Log Table NUMBER cart_id VARCHAR(2000) change DATE change_date
Here you are not limited in space for the change text that is logged. Every change can have it’s own row in the change log table, and the date column makes changes easier to query. This solution costs a little more since now you have an extra table to manage and all programmers writing inserts need to know to hit multiple tables. The improved querying capabilities gives you easier reporting and troubleshooting, but you are still reduced to parsing the text for more specific queries. Now, what if we expand the change column into multiple columns that can keep the details of the change? There are several approaches to this, but before we get to those we will need an extra change log table for the lines:
Shopping Cart Lines Change Log Table NUMBER line_id ... change columns ... DATE change_date
Now we can attach information on changes directly to the individual lines in the shopping cart. One more sophisticated way of capturing the change information is to copy the rows from the original tables into the change log tables so that we can capture the exact set of data that was in the original table before or after the change was made. We also need a change_type column for tracking whether this was an update, add or delete.
Shopping Cart Change Log Table NUMBER cart_id VARCHAR(1000) description ... other cart data ... VARCHAR(15) change_type (add/delete/update) DATE change_date Shopping Cart Lines Change Log Table NUMBER line_id NUMBER product_name NUMBER quantity NUMBER price ... other line data ... VARCHAR(15) change_type (add/delete/update) DATE change_date
This solution approaches another more elegant solution which we will cover later so we won’t analyze this too much. The other approach is to capture changes on a column by column basis so that we insert a row into the change log table for each value that was changed. For this approach the two change log tables would look like this:
Shopping Cart Change Log Table NUMBER cart_id column_name before_value after_value DATE change_date
Shopping Cart Lines Change Log Table NUMBER line_id column_name before_value after_value DATE change_date
This approach captures updates well, but doesn’t work for updates and deletes. A full history of how data in each column changed is now available. Another minor issue is that for a single update to a row in the shopping cart lines table there may be multiple lines in the log table for changes to product, quantity, and price. This multiple rows make it harder to query and force you to find a way to group the multiple rows back into one change most likely by grouping on the change_date column which isn’t a perfect solution. Another option for this solution is to combine the two tables by genericizing the foreign key id and adding another column that tells us which table to join back to. This means you could reduce your whole change logging system down to a single table that tracks changes for all tables in the database, but at the cost of having a very large table to manage, dealing with the associated performance problems, and having to write complex and unintuitive queries.
An Elegant (and expensive) way of Versioning
The previous line of approaches led us further down into a hole where things seemed to be getting more and more complex while adding marginal value. That is a sure sign that either: 1) you are stuck in a design rut; or 2) your about to make a big breakthrough in design. Should we sell our stock or hold on for the big payoff? Let’s sell. Let’s go back to the original tables themselves and try another approach. Here we will add two new columns for versioning: a from_date, and a to_date column and a column to indicate if the cart line is active or has been deleted. For our example we will work solely with the lines table:
Shopping Cart Lines Table NUMBER line_id NUMBER cart_id NUMBER product_name NUMBER quantity NUMBER price DATE from_date DATE to_date VARCHAR active_or_deleted
Each row in the table now represents how the date looked between the two dates. Here is what the data for a single line item would look like if we created the line with a quantity of 1 and then changed it 2 days later to a quantity of two. Note that there are two rows in the table to represent the historical changes:
1, 1, Ink Jet Printer X11, 1, $150, 01-JAN-2007, 03-JAN-2007 active 1, 1, Ink Jet Printer X11, 2, $150, 03-JAN-2007, 31-DEC-9999 active
The first row shows what the data looked like between 01-JAN-2007 and 03-JAN-2007 and the second row shows what the data looked like from 03-JAN-2007 to 31-DEC-9999. This does make querying tougher even if it is just to get the latest data. The reason we don’t use null is that it upsets the way the database handles date comparisons. If we did use null then getting the current row would be easier, but it would be harder find the row effective for a certain date. Another issue is that we must be careful about how we compare dates. The to_date on the first row matches the from_date on the second row. We must compare one exclusively (less-than) and on inclusively (greater-than-or-equal) to make this work. Of course to implement this approach we must also change the way that we update, insert and delete rows in the table so that we never update or delete. Instead we only insert. This rule alone suggest that we are keeping history since we never go back and change historical data.
Here is how a deleted item would look:
1, 1, Ink Jet Printer X11, 1, $150, 01-JAN-2007, 03-JAN-2007 active
1, 1, Ink Jet Printer X11, 2, $150, 03-JAN-2007, 05-JAN-2007 active
1, 1, Ink Jet Printer X11, 2, $150, 05-JAN-2007, 31-DEC-9999 deleted
And if we undelete the item we add another row:
1, 1, Ink Jet Printer X11, 1, $150, 01-JAN-2007, 03-JAN-2007 active 1, 1, Ink Jet Printer X11, 2, $150, 03-JAN-2007, 05-JAN-2007 active 1, 1, Ink Jet Printer X11, 2, $150, 05-JAN-2007, 07-JAN-2007 deleted 1, 1, Ink Jet Printer X11, 2, $150, 07-JAN-2007, 31-DEC-9999 active
Notice also that we are breaking our rule slightly about updating rows. When we insert a row to make a change to it, we also go back and update the previous row to end-date the previous change. This adds quite a bit of complexity to the code for updating the table and would typically warrant creating a wrapper API to encapsulate the code.
Let’s make one more modification before calling the design final. Let’s change the dates to version numbers like this:
Shopping Cart Lines Table NUMBER line_id NUMBER cart_id NUMBER product_name NUMBER quantity NUMBER price DATE from_version DATE to_version VARCHAR active_or_deleted
Here is some example data with versions instead of dates:
1, 1, Ink Jet Printer X11, 1, $150, 1, 2 1, 1, Ink Jet Printer X11, 2, $150, 2, 9999999
Then let’s have a version table that keeps track of the version numbers and can include all sorts of other meta-information:
Version Table NUMBER version_number DATE update_date VARCHAR(200) updated_by VARCHAR(2000) reason_for_update
Joining it all together we can write a query that shows information about a particular version of the data:
SELECT v.*, i.* FROM version_table v, line_table i WHERE v.version_number = 1 AND v.version_number >= i.from_version AND v.version_number < i.to_version
Which would return a row like this:
1, 01-JAN-2007, mtruchard, adding a part, 1, 1, Ink Jet Printer X11, 1, $150, 1, 2
This is a fairly flexible design and can allow for changes to data to be tracked, changes to be undone, and even undo’s to be redone and a full history of the undo’s and redo’s tracked. This approach is similar to the way source code revision software operates. The major downside is the complexity. What used to be a simple and intuitive query to get the most recent data has been complicated and the updates are best left to an API built by a programmer who knows how the scheme works. And then there is the pesky “999999″, change that to null and historical queries gets messy and slow, but don’t change it and we have nightmares of Y2K all over again. Overall, however, this solution is far more elegant than any of the previous ones.
Change Capture Built-in to the Database
An even better solution would be to have the database handle all this for you. I haven’t yet seen a scheme like this built into a relational database, but that doesn’t mean there isn’t one out there. What I have seen are several mechanisms for logging or capturing changes on a database table. Oracle databases have what is called redo logs where all changes made to a set of tables can be automatically written out to a file on the system. These logs can then be played back in the database to perform the operations over again. Oracle also has a technology called CDC where changes can be captured on a table as they are happening and written to a queue. The queue can either be queried directly for changes or can be used to insert the data into other tables with a structure that allows for tracking of history like the ones discussed previously. This is one method for populating a datawarehouse with historical change data while keeping the original tables simple. Used alone, or in combination with the approaches described above built in database features are the best way to go if they can meet your needs.