Importance of Cross-Platform Data Consistency

Arguably the biggest part of data warehousing is tying data from multiple sources together to create a single version of the truth. Sometimes, lookup tables or data conversions are needed to join data. In these instances we often rely on cross-platform data consistency.

So what does this mean? In short, if an ID field is shared between two or more systems, that ID field should either match across those systems or be resolvable through a simple lookup.

Good Cross-Platform Data Consistency

Take the example of separate databases for payroll and for holidays/sickness: these two databases should share a common employee ID so that relationships can be created easily. This is the simplest matchup, and in a data warehousing setting it allows us to join the data between the two systems directly.
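
A minimal sketch of what this looks like in practice, assuming pandas and invented table and column names:

import pandas as pd

# Hypothetical extracts from the two systems, sharing an employee_id key
payroll = pd.DataFrame({
    "employee_id": [101, 102, 103],
    "monthly_salary": [2500, 3100, 2800],
})
absence = pd.DataFrame({
    "employee_id": [101, 103],
    "sick_days_ytd": [2, 5],
})

# Because both systems share the same ID, the join is a single direct merge
combined = payroll.merge(absence, on="employee_id", how="left")
print(combined)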

Cross-Platform Data Lookups

In situations where different systems refer to the same entities in different ways, a simple lookup table is the best way to match the data. Take airports, for example: there are several codes that can be used to refer to an airport, the most common being the IATA code (used for timetables and baggage) and the ICAO code (used for flight tracking). If we want to join our timetables to our flight tracking, we'll most likely need a lookup table similar to the example below.

icao   iata   airport_name
EGLL   LHR    Heathrow Airport
EGKK   LGW    Gatwick Airport
EGBB   BHX    Birmingham Airport
EGCC   MAN    Manchester Airport
EGSS   STN    London Stansted Airport
Five Largest Commercial Airports in England by Runway Length

The end result could be a data summary showing, with a little data science, how course deviation (from flight tracking) affects timetables, seamlessly joining multiple data sources thanks to good cross-platform data consistency.
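
To make that concrete, here is a minimal sketch (again pandas, with invented sample rows) of routing an IATA-keyed timetable to ICAO-keyed flight tracking through the lookup table above:

import pandas as pd

# Lookup table bridging the two airport code schemes
airport_lookup = pd.DataFrame({
    "icao": ["EGLL", "EGKK", "EGCC"],
    "iata": ["LHR", "LGW", "MAN"],
})

# Timetables are keyed on IATA, flight tracking on ICAO (invented rows)
timetable = pd.DataFrame({
    "iata": ["LHR", "MAN"],
    "scheduled_departure": ["2021-11-17 09:00", "2021-11-17 10:30"],
})
tracking = pd.DataFrame({
    "icao": ["EGLL", "EGCC"],
    "actual_departure": ["2021-11-17 09:12", "2021-11-17 10:31"],
})

# Route the join through the lookup table: IATA -> ICAO -> tracking data
joined = (
    timetable
    .merge(airport_lookup, on="iata")
    .merge(tracking, on="icao")
)
print(joined)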

Poor Cross-Platform Data Consistency

Recently I’ve been working with a new third-party platform, with the requirement of reconciling its data against an older platform. Due to the volume of data, and the age and design of the older platform, there is no easy ID field that we can match between the two systems. We in fact need to match across multiple fields to reconcile the data, which is unavoidable in real-world production settings.

The heart of this system is a set of electronic devices, each of which maintains its own per-transaction counter. There are thousands of these devices belonging to different companies, and each device’s counter resets after 10,000 transactions, so we also need a timestamp to place a counter value within a date range. We therefore need to join on several fields, which look like this (note this is example data):

Field Description       System A      System B
Company                 london        1280265796
Device Serial Number    19776526      012dc40e
Device Counter          7364          7364
Transaction Timestamp   1637179325    2021-11-17 20:12:05
Poor Cross-Platform Data Consistency

The only field that matches directly between the two systems in this example is the device counter, so we can make an immediate connection there, but it doesn’t help us much: the counter resets after 10,000 transactions, and every device has its own counter. The other fields are more complicated to match, and a lack of solid documentation and design meant the matches needed to be identified through a long (and costly) discovery process.

Transaction Timestamp

The transaction timestamp in system A is immediately recognizable as a Unix epoch timestamp; however, when we converted it to a readable timestamp like system B’s, we discovered that the two don’t always match. The timestamps can in fact be several minutes apart, and during daylight saving time they were an hour apart: one system uses UTC, while the other uses UK local time (GMT/BST).

As such, after converting system A’s epoch integer to a datetime field, we also need to apply a fuzzy match window of around two hours to allow for these timestamp variations.
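
A minimal sketch of that conversion and the tolerance check in Python (the sample values are taken from the example table above; for simplicity the system B string is treated as UTC, with the wide window absorbing the GMT/BST offset):

from datetime import datetime, timedelta, timezone

epoch_a = 1637179325                   # System A: Unix epoch integer
timestamp_b = "2021-11-17 20:12:05"    # System B: readable timestamp

# Convert the epoch integer to a timezone-aware UTC datetime
dt_a = datetime.fromtimestamp(epoch_a, tz=timezone.utc)
dt_b = datetime.strptime(timestamp_b, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)

# Fuzzy match: treat the timestamps as equal if they fall within ~2 hours
def timestamps_match(a, b, tolerance=timedelta(hours=2)):
    return abs(a - b) <= tolerance

print(timestamps_match(dt_a, dt_b))    # True for this example pair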

Device Serial Number

Again, the value in system B is immediately recognizable, this time as a hexadecimal string. This was a simple conversion from hex to decimal, after which the join can be made between the two fields.
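
In Python terms this is a one-liner; the sample values below are the ones from the example table:

# System B stores the serial as hex, System A as the equivalent decimal
serial_b_hex = "012dc40e"
serial_a_decimal = 19776526

# Parse the hex string as a base-16 integer and compare
assert int(serial_b_hex, 16) == serial_a_decimal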

Company

With nothing recognizable about these two fields, a long period of investigation and discovery ensued, eventually revealing a hidden code and a strangely unnecessary data conversion. On the devices, the company name london is accompanied by a company code of LOND; however, this code is not visible in system A. Furthermore, the code is not simply transmitted to system B: it is instead converted to ASCII, then to binary, then to an integer.

The full conversion is as follows:

Company Code:           L        O        N        D
ASCII Conversion:      076      079      078      068
Binary Conversion:   01001100 01001111 01001110 01000100
Decimal Conversion:  1280265796

As such, we need a combination of a lookup table and two mathematical conversions in order to join these fields. The argument for this method was that a long integer takes up only 8 bytes of storage, ignoring the fact that “LOND” is a 4-byte string (the source system only allows single-byte characters, with a maximum length of four).
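
As a sketch, the whole ASCII-to-binary-to-integer chain is equivalent to reading the four company code bytes as one big-endian integer, so the join key can be reproduced (and reversed) in a couple of lines of Python, with a hypothetical dictionary standing in for the company lookup table:

# Hypothetical lookup: company name as seen in System A -> device company code
company_codes = {"london": "LOND"}

def code_to_system_b(code: str) -> int:
    # ASCII -> binary -> integer, i.e. the 4 bytes read as one big-endian value
    return int.from_bytes(code.encode("ascii"), "big")

def system_b_to_code(value: int) -> str:
    # Reverse the conversion to recover the 4-character company code
    return value.to_bytes(4, "big").decode("ascii")

print(code_to_system_b(company_codes["london"]))  # 1280265796
print(system_b_to_code(1280265796))               # LOND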

Conclusion

There’s no excuse for poor cross-platform data consistency if you control the data inside every platform; all it takes is some forward thinking and good architectural design. A data element should not need to be converted more than once in order to create a join between two systems, and even then, a clear definition of the data types, formats, and conversions should be provided.
