Arguably the biggest part of data warehousing is tying data from multiple sources together to create a single version of the truth. Sometimes lookup tables or data conversions are needed to join data, and in these instances we rely on cross-platform data consistency.
So what does this mean? In short: if some form of ID field is shared between two or more systems, that field should either match exactly across those systems or be resolvable with a simple lookup.
Good Cross-Platform Data Consistency
Take the example of separate databases for payroll and for holidays/sickness: these two databases should share a common employee ID, allowing relationships to be created easily. This is the simplest matchup, and in a data warehousing setting it lets us join the data between the two systems directly.
Cross-Platform Data Lookups
In situations where different systems refer to the same entities in different ways, a simple lookup table is the best way to match the data. Take airports, for example: several codes can be used to refer to an airport, the most common being the IATA code (used for timetables and baggage) and the ICAO code (used for flight tracking). If we want to join our timetables to our flight tracking data, we’ll most likely need a lookup table similar to the example below.
| icao | iata | airport_name            |
| ---- | ---- | ----------------------- |
| EGLL | LHR  | Heathrow Airport        |
| EGKK | LGW  | Gatwick Airport         |
| EGBB | BHX  | Birmingham Airport      |
| EGCC | MAN  | Manchester Airport      |
| EGSS | STN  | London Stansted Airport |
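As a minimal sketch of how such a lookup bridges the two systems, the following uses SQLite with illustrative table and column names (the `timetable` and `flight_tracking` schemas are assumptions, not taken from any real system):

```python
import sqlite3

# In-memory sketch: join timetable rows (keyed by IATA) to flight-tracking
# rows (keyed by ICAO) via the lookup table above.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE airport_lookup (icao TEXT, iata TEXT, airport_name TEXT);
CREATE TABLE timetable (iata TEXT, scheduled TEXT);
CREATE TABLE flight_tracking (icao TEXT, actual TEXT);
INSERT INTO airport_lookup VALUES ('EGLL', 'LHR', 'Heathrow Airport');
INSERT INTO timetable VALUES ('LHR', '2021-11-17 20:00:00');
INSERT INTO flight_tracking VALUES ('EGLL', '2021-11-17 20:12:05');
""")

# The lookup table sits between the two sources and resolves IATA <-> ICAO.
rows = con.execute("""
    SELECT l.airport_name, t.scheduled, f.actual
    FROM timetable t
    JOIN airport_lookup l ON l.iata = t.iata
    JOIN flight_tracking f ON f.icao = l.icao
""").fetchall()
print(rows)  # [('Heathrow Airport', '2021-11-17 20:00:00', '2021-11-17 20:12:05')]
```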
An end result of this could be a data summary showing how course deviation (from flight tracking) affects timetables (with a little data science), seamlessly joining multiple data sources thanks to good cross-platform data consistency.
Poor Cross-Platform Data Consistency
Recently I’ve been working with a new platform from a third party, with the requirement of reconciling its data with an older platform. Due to the volume of data, and the age and design of the older platform, there is no easy ID field that we can match between the two systems. We in fact need to match across multiple fields to reconcile the data, which is sometimes unavoidable in real-world production settings.
The heart of this system is a set of electronic devices, each of which keeps its own per-transaction counter. There are thousands of these devices, which belong to different companies, and the counter on each device resets after 10,000 transactions, so we also need a timestamp to pin a counter value to a date range. We therefore need to join on several fields, which look like this (note: this is example data):
| Field Description     | System A   | System B            |
| --------------------- | ---------- | ------------------- |
| Company               | london     | 1280265796          |
| Device Serial Number  | 19776526   | 012dc40e            |
| Device Counter        | 7364       | 7364                |
| Transaction Timestamp | 1637179325 | 2021-11-17 20:12:05 |
The only field that matches between the two systems in this example is the device counter, so we can immediately make a connection here; but that doesn’t help us much, as we know the counter resets after 10,000 transactions and every device has its own counter. The other fields are more complicated to match, and a lack of solid documentation and design meant they needed to be identified through a long (and costly) discovery process.
Transaction Timestamp
The transaction timestamp in system A is immediately recognizable as a Unix epoch timestamp; however, when converting it to a readable timestamp like system B’s, we discovered that they don’t always match. The two timestamps can in fact be several minutes apart, and during daylight saving time they were an hour apart: one system uses UTC, while the other uses UK local time (GMT/BST).
As such, after converting system A’s epoch integer to a datetime field, we also need to apply a fuzzy match of around 2 hours to give leeway for timestamp variations.
Device Serial Number
Again, an immediately recognizable value appears in system B: the serial number as a hex string. This was a simple conversion from hex to decimal, after which the join can be made between these two fields.
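Using the serial numbers from the example table, the conversion is a one-liner:

```python
# System B's hex serial decodes straight to System A's decimal value.
serial_b = "012dc40e"
serial_a = int(serial_b, 16)
print(serial_a)  # 19776526
```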
Company
With nothing recognizable around these two fields, a long period of investigation and discovery ensued, eventually revealing a hidden code and a strangely unnecessary data conversion. On the devices, the company name of “london” is accompanied by a company code of “LOND”; however, this code is not visible in system A. Furthermore, the code is not simply transmitted to system B: it is instead converted to ASCII, then to binary, then to an integer.
The full conversion is as follows:
Company Code: L O N D
ASCII Conversion: 076 079 078 068
Binary Conversion: 01001100 01001111 01001110 01000100
Decimal Conversion: 1280265796
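The chain above can be reproduced in a few lines; the intermediate steps are shown explicitly, though the whole conversion collapses to a single big-endian byte interpretation:

```python
# Reproducing the full conversion for the company code "LOND".
code = "LOND"
ascii_values = [ord(c) for c in code]                 # [76, 79, 78, 68]
binary = "".join(f"{v:08b}" for v in ascii_values)    # '01001100010011110100111001000100'
decimal = int(binary, 2)

# Equivalent one-liner: interpret the 4 ASCII bytes as one big-endian integer.
assert decimal == int.from_bytes(code.encode("ascii"), "big")
print(decimal)  # 1280265796
```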
As such, we need to use a combination of a lookup table and two mathematical conversions in order to join these fields. The argument for this method was that a long integer takes up only 8 bytes of storage, ignoring the fact that “LOND” is a 4-byte string (the source system only allows single-byte characters, with a maximum length of four).
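Putting the pieces together, a hypothetical normalisation step might translate a System B record into System A’s terms before joining. The field names and the company lookup table here are assumptions for illustration, not the real schema:

```python
from datetime import datetime, timezone

# Assumed lookup table: integer-encoded code -> System A company name.
COMPANY_LOOKUP = {"LOND": "london"}

def normalise_b(record_b):
    """Translate a System B record into System A's representation."""
    # 1280265796 -> b"LOND" -> "LOND" (reverse of the ASCII/binary/integer chain).
    company_code = record_b["company"].to_bytes(4, "big").decode("ascii")
    # Treated as UTC for this November example; BST handling omitted.
    ts = datetime.strptime(record_b["timestamp"], "%Y-%m-%d %H:%M:%S").replace(
        tzinfo=timezone.utc
    )
    return {
        "company": COMPANY_LOOKUP[company_code],
        "serial": int(record_b["serial"], 16),   # hex -> decimal
        "counter": record_b["counter"],          # already matches directly
        "timestamp": ts,  # still needs the ~2-hour fuzzy match against System A
    }

record_b = {"company": 1280265796, "serial": "012dc40e",
            "counter": 7364, "timestamp": "2021-11-17 20:12:05"}
print(normalise_b(record_b))
```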
Conclusion
There’s no excuse for poor cross-platform data consistency if you control the data inside every platform; all it takes is some forward thinking and good architectural design. A data element should not need to be converted more than once to create a join between two systems, and even then, a clear definition of the data types, formats, and conversions should be provided.