Everything is ‘big data’ these days; if you’re not using Hadoop on data lakes then you’re yesterday’s news. (Or tomorrow’s!) Over the last few years we’ve spent some quality time with large datasets in the oil & gas and petrochemicals sectors, and here’s a challenge that’s not often mentioned:
Lots of data doesn’t make it ‘big’
Let’s say you have a large petrochemicals plant, with tens or hundreds of thousands of high frequency measurements. So you have lots of data. You’d think you were in the ‘big data’ club, no problem. Perhaps not. For example, the historian sampling frequency might be significantly higher than the instrument measurement frequency, or vice versa, or the values might be averaged into fewer readings at the historian or the instrument.
Terabytes and terabytes of the same value, repeated
Or there might be very frequent measurements that don’t change that much; if a furnace has been at 850°C for three weeks, those readings carry little new information and can easily be downsampled. Downsampling discards unchanged data to effectively reduce the frequency of measurement on a single time series; it reduces the amount of data for a single variable.
(Downsampling can produce a 10 times reduction in the data storage needed, depending on the nature of the original signal and how accurate the downsampled data has to be.)
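To make that concrete, here’s a minimal sketch of one common downsampling approach: a deadband filter, which keeps a reading only when it moves meaningfully away from the last kept value. (The function name, tolerance value, and furnace data below are illustrative assumptions, not anyone’s production code; real historians use more sophisticated variants of this idea.)

```python
import numpy as np

def deadband_downsample(times, values, tolerance):
    """Keep a reading only when it deviates from the last kept
    value by more than `tolerance` (a simple deadband filter)."""
    kept_t, kept_v = [times[0]], [values[0]]
    for t, v in zip(times[1:], values[1:]):
        if abs(v - kept_v[-1]) > tolerance:
            kept_t.append(t)
            kept_v.append(v)
    return kept_t, kept_v

# Hypothetical furnace holding ~850°C with a little sensor noise:
# almost every reading is redundant within a 1-degree tolerance.
rng = np.random.default_rng(0)
temps = 850 + rng.normal(0, 0.1, 10_000)
t, v = deadband_downsample(list(range(10_000)), temps, tolerance=1.0)
```

With a steady signal like this the 10,000 raw readings collapse to a handful of kept points; the reduction you actually get depends entirely on how much the signal moves relative to the tolerance you can accept.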
Dimensional reduction might also be useful and might take a terabyte of data down to a more manageable gigabyte. Where downsampling reduces the frequency of measurement, dimensional reduction reduces the number of variables; redundant time series are combined rather than eliminated. So if you have 3 signals and signal #3 is unchanging, it can be combined with another signal without losing any information.
(Dimensional reduction can take 500 sensors down to, say, 10 sensors. That’s a big difference.)
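One standard way to do this (not necessarily the method used here, but representative) is principal component analysis: if 500 sensors are really driven by a handful of underlying process variables, a few components capture nearly all the variance. The synthetic data below is an assumption made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical plant data: 500 sensors over 2000 timestamps, secretly
# driven by only 5 underlying process variables plus a little noise.
latent = rng.normal(size=(2000, 5))
mixing = rng.normal(size=(5, 500))
readings = latent @ mixing + rng.normal(0, 0.01, size=(2000, 500))

# PCA via SVD: centre the data, decompose, and keep just enough
# components to explain 99% of the variance.
centred = readings - readings.mean(axis=0)
U, s, Vt = np.linalg.svd(centred, full_matrices=False)
explained = (s ** 2) / (s ** 2).sum()
k = int(np.searchsorted(np.cumsum(explained), 0.99)) + 1
reduced = centred @ Vt[:k].T  # 2000 x k instead of 2000 x 500
```

Here `k` comes out as a single-digit number, so each timestamp is described by a few component scores instead of 500 raw readings. The loss is the residual variance you chose to discard, which is exactly the ‘lossy’ trade-off discussed next.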
And the impact is really the key. These methods are ‘lossy’; there are parallels with image compression in this regard, in that JPEGs are ‘lossy’ compared to a high quality original. Downsampling and dimensional reduction are also ‘lossy’ in that some of the original information has been removed, but as with JPEGs, the question is whether the data loss makes any real difference. Perhaps not.
So your data might not be ‘big’ enough.
It’s always curiously unsettling to tell a customer they don’t have enough data – almost as if they, or their organisation, have been under-performing. But some data simply isn’t ‘big’. This is particularly prevalent in the industrial world, where there really is a lot of data available but much of it is structured and repetitive.
All isn’t lost even if the dataset isn’t particularly large. We’ve had success with a few thousand lines of data though it’s not enough for any of the predictive algorithms to give reliable results (assuming a reasonably complex system).
And we have an approach to engagement that will work pretty well for you even so.