Everything is ‘big data’ these days; if you’re not using Hadoop on datalakes then you’re yesterday’s news. (Or tomorrow’s!)
Over the last few years we’ve spent some quality time with large datasets in the oil & gas and petrochemicals sectors, and here’s a challenge that’s not often mentioned:
Context is king
So let’s say we have data of sufficient quantity and quality. Without context, how do we know what represents ‘good’ and ‘bad’ performance?
In other sectors (e.g., marketing) this need for context has given rise to ‘tag management’ software. In the oil & gas, petrochemicals and manufacturing sectors this hasn’t caught on, partly because of the richness of the data already available and partly because the market is not yet mature enough to demand better tag management.
So how do we deal with it? How do we add context to data?
Let’s split this to keep it simple: automatic, semi-automatic and manual methods.
Manual methods are the easiest to understand. As you might expect, this is where a subject matter expert flags portions of data (often time-series, in manufacturing or oil & gas) as being valuable for some reason. We have the following manual methods in play right now:
- Sabisu Publisher, which uploads MS Excel documents, is a key entry point for many users and allows us to persist and historise the core data, including metadata. It’s a key unstructured data source in its own right, as well as a source of metadata for other process data.
- Process engineers or operators can flag up certain data as indicating a particular context, e.g., a particular running mode, or related to a particular incident.
- We have various UX improvements planned this year to allow a high quality of metadata to be captured in real-time from end-users during data acquisition, e.g., operators.
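To make the manual case concrete, here’s a minimal sketch of expert flagging. The schema, tag name and label are hypothetical for illustration (this is not the Sabisu API): an engineer marks a window of a time-series as belonging to a particular incident.

```python
import pandas as pd

# Hypothetical process data: one reading per minute for a single tag.
idx = pd.date_range("2024-01-01 00:00", periods=6, freq="min")
data = pd.DataFrame(
    {"flow_rate": [10.1, 10.3, 55.0, 54.8, 10.2, 10.0]}, index=idx
)

def flag_window(df, start, end, label):
    """Attach a context label to every row in the window [start, end]."""
    df.loc[start:end, "context"] = label
    return df

# An engineer flags two minutes of data as relating to an incident.
flagged = flag_window(
    data, "2024-01-01 00:02", "2024-01-01 00:03", "high-flow incident"
)
```

Rows outside the flagged window simply carry no context label, so the metadata can be queried later without touching the underlying readings.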
Semi-automatic methods are initiated by a data scientist or developer to assign metadata to a significant quantity of data. This is useful when applying rules derived from an unstructured source, e.g., a Standard Operating Procedure. For example:
- Where data has been aggregated already, a single query can apply metadata to that aggregation as a bulk update, e.g., to indicate that all the data for a certain time period reflected a certain plant running mode. In terms of storage requirements it’s a winner, but it does have the constraint that if the data is re-aggregated under different conditions the metadata has to be reapplied.
- That challenge can be circumvented by a bulk update applied post-collection but pre-aggregation, typically directly to the NoSQL data (in our case, Shika), using distributed processing for very large datasets (a simple MapReduce problem).
- The obvious step from there is to apply metadata automatically on data acquisition, preferably as part of the real-time write operation to memory (or for higher latency applications, disk).
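A minimal sketch of the pre-aggregation bulk update described above. The record format, metadata key and mode name are illustrative assumptions (not the Shika schema): every raw record in a time window is tagged with a running mode before any aggregation happens.

```python
from datetime import datetime

# Hypothetical raw records as they might sit in a NoSQL store,
# each with a timestamp, a value and a metadata dict.
records = [
    {"ts": datetime(2024, 1, 1, 0), "value": 101.0, "meta": {}},
    {"ts": datetime(2024, 1, 1, 6), "value": 98.5,  "meta": {}},
    {"ts": datetime(2024, 1, 2, 0), "value": 40.2,  "meta": {}},
]

def bulk_tag(recs, start, end, key, value):
    """Apply one metadata key/value to every record in [start, end)."""
    for r in recs:
        if start <= r["ts"] < end:
            r["meta"][key] = value
    return recs

# Tag one day's worth of records with the plant running mode.
bulk_tag(records, datetime(2024, 1, 1), datetime(2024, 1, 2),
         "running_mode", "full-rate")
```

Because each record is tagged independently, this is a map-only pass over the data, which is why it parallelises so naturally across a distributed store for very large datasets.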
Automatic methods are very useful for manufacturing process data, particularly where the plant is well instrumented. However, we can still get valuable metadata automatically from project data. Here’s how we see automatic contextualisation:
- Many of the process variables captured automatically by existing systems actually provide metadata which describes plant behaviour, e.g., running modes, feed types, alarm system status, etc.
- Algorithms can ‘recognise’ previous behaviour and assign metadata to it. Clearly there is a risk here, as metadata assigned this way is regarded as lower quality than that assigned by an expert end-user.
- Project data is usually unstructured and metadata-rich.
So that’s how we add context at the moment. Most of this is implemented initially by our developers and data scientists, with the customer’s knowledge and assistance. Over time, the customer takes on more responsibility for the maturity of their metadata, as and when they feel ready.
We hope that’s interesting and shows how we do things. As always if you have any questions or suggestions, head over to our LinkedIn group and share your thoughts.