Data Collection and Sharing: StAT Pattern
In research, if you don't have good data, you have not done the work. The following is a summary of what I have learned about capturing research data.
Peter Coad's Universal Model started me down this path years ago, and led to the development of what has become known as the StAT pattern (Standard Activity Template). It is a pattern for capturing data that allows easy aggregation and summarization for different reporting needs (i.e. data warehousing). I used it to standardize data from multiple research systems into consistent reporting for an entire research pipeline. For those familiar with data warehousing, this follows the 'star/snowflake' pattern, where a central fact (transaction) table is surrounded by standardized dimension tables.
It assumes that activities tie everything together, whether the system is a 7-Eleven register selling grape Slurpees (Activity: sale; Subject: Slurpee; Where: store, register; Who: clerk, customer; ...), a gene transformation, or growing plants.
The activity table has the following core columns:
Subject_Id: This is usually a reference to a plant or a part of the environment (air, water). It is a reference in the same way that a firmware_module has an id reference to its firmware_module_type.
Activity_Id: A reference to the activity being done - planting, observing, temperature_up, ...
Activity_Name: a denormalized convenience field, a recognizable name that corresponds to the id.
Start_Date: Some activities are 'punctiliar' (momentary) and will have only one date. Other processes span a period of time and will have a start date and an end date. This field is always populated, often when the record is created.
End_Date: If the activity spans a period of time, this is the time it finished; otherwise it is left blank.
Participant: Who did the work. This may be a person, or it may be a piece of equipment (sensor, Raspberry Pi, ...).
Current_Status: This is one of three values - Success, Failure, Cancel. It is a high-level activity summary that allows for cross-activity summaries (process timelines).
Status_Qualifier: Each activity type will have a standardized set of qualifiers, usually a list of why the activity failed. If the activity was germination, it is important to know why the plant might have died: fungus, eaten by insects, grower dropped the tray, ... This information is important for process improvement inquiries.
Comment: At an operational level, sometimes the status qualifier is not enough information to understand what happened. Comments allow for any free text that might be useful in the future ("Mice chewed through the water pipe and drained the system").
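The core columns above can be sketched as a table definition. This is a minimal, illustrative sketch using SQLite; the column names follow the list above, but the types and the sample row are assumptions, not a prescribed schema.

```python
import sqlite3

# Minimal sketch of the core StAT activity table (types are assumptions).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE activity (
        subject_id       INTEGER NOT NULL,  -- plant, air, water, ...
        activity_id      INTEGER NOT NULL,  -- planting, observing, ...
        activity_name    TEXT,              -- denormalized convenience field
        start_date       TEXT NOT NULL,     -- always populated
        end_date         TEXT,              -- NULL for momentary activities
        participant      TEXT,              -- person or equipment
        current_status   TEXT CHECK (current_status IN
                             ('Success', 'Failure', 'Cancel')),
        status_qualifier TEXT,              -- activity-specific reason code
        comment          TEXT               -- free text for the unexpected
    )
""")

# A hypothetical germination record that failed due to fungus.
conn.execute(
    "INSERT INTO activity VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
    (101, 7, "germination", "2024-03-01", "2024-03-08",
     "grower_1", "Failure", "fungus", None))
conn.commit()
```

Keeping the status and qualifier in fixed columns, rather than burying them in free text, is what makes the cross-activity summaries described above possible.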
Often a separate table is created for each activity, with additional fields added to support it (inputs/materials and outputs/products). An observation will have not just the subject (lettuce plant) but possibly the plant part (leaf), attribute (length), unit (inches), and value (2.5). Sometimes this may be further detailed with the protocol used and the equipment (i.e. the model of pH sensor). These detailed tables support the operational needs.

In a small, single food computer it may not be necessary to tag each record with common knowledge (i.e. the variety of lettuce being grown, or who performed the activity), but it is important to add this information when the data is aggregated and combined with other projects (i.e. handed off to MIT). Such 'common knowledge' needs to be captured somewhere and made available.

In the rush to 'move on', I often saw end dates ignored (not captured). From an operational perspective this is not important, but it made it impossible to determine performance and efficiency later (i.e. did Task B start when Task A ended, or was there a delay between the two?).
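As a sketch of the observation detail and the 'common knowledge' tagging described above, the following shows a detail record and the step of stamping in shared context before handoff. All field names and values are illustrative assumptions.

```python
# A hypothetical observation detail record: subject plus the
# measurement-specific fields (part, attribute, unit, value).
observation = {
    "subject_id": 101,            # a lettuce plant
    "plant_part": "leaf",
    "attribute": "length",
    "unit": "inches",             # record the unit explicitly
    "value": 2.5,
    "protocol": "manual_ruler",   # how the measurement was taken
    "start_date": "2024-03-10",
}

# Inside a single food computer these fields are 'common knowledge' and
# often omitted, but they must be filled in before data is shared.
common_knowledge = {
    "variety": "butterhead lettuce",   # assumed example
    "participant": "grower_1",
    "project": "food_computer_1",
}

def tag_for_handoff(record: dict, context: dict) -> dict:
    """Return a copy of the record with common-knowledge fields added.
    Fields already present on the record win on any conflict."""
    return {**context, **record}

shared = tag_for_handoff(observation, common_knowledge)
```

The point is not the mechanism (a dict merge, a join, a map-reduce step all work) but that the context lives somewhere it can be applied uniformly at aggregation time.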
Queries or map-reduce jobs are used to strip out common fields for summary reporting: What is the average time to complete this task? How many tasks have finished this step? What is the relationship between pH and leaf length? What is the success rate of fertilizer A compared to fertilizer B?
Reporting is what drives data collection: if you don't use data, there is no point in collecting it. Think through the types of reports (operational, administrative, analytic) and the particulars of each before defining what is to be collected. The 'big data' mantra of 'collect it all and figure it out later' does not work. You are wasting time and money if you fail to collect data, or collect it in incompatible formats (i.e. trying to compare light lux readings to photosynthetically active radiation). It will never be perfect, but having a plan and thinking about this from the start will pay dividends later.