Sensor Data Modeling


#1

Discussion of data structures for storing sensor descriptions and readings


#2

@rbaynes
The strength and weakness of CouchDB is that relational data can be 'flattened' (de-normalized) for fast reporting without the need for multi-table joins. Capturing the sensor reading is not that difficult; the problem is figuring out how much (redundant) information about the sensor should be stored in the record. I have recently been confronted by this issue while working with camera images.

At first I just dumped images into a directory ('pictures'), naming each image with a timestamp. With only one camera on the MVP, this was sufficient: the directory told me the category (camera), the file name gave me the date, and the file itself was the value. The bill-of-materials for the MVP serves as a relational table of attributes about the camera (brand, model, features), and the construction plans can tell me the location/placement. The timestamp serves double duty as a unique image id; with only one camera, there is no need for the camera information - it can be assumed (or may not be important).

Things got complicated when I installed more cameras and wanted to coordinate them for stereoscopic image processing.

In a relational world, I would tag the pictures with a unique camera identifier (as the image directory name?) and put the camera attributes in a separate table. There are, in addition, two other relationships: location and the camera-to-camera relationship. With stereoscopic processing, there is also the need to know the relationship between the cameras (identifying the 'Right' and 'Left' cameras, and possibly calibration information). Here is where semantics get tricky; 'Right' and 'Left' are relative terms - do we mean on the right or left side of the MVP, or do we mean right and left in the relationship of the cameras to each other? Technically these are different relationships, even though practically they may be the same values.

An additional complication is that I am now moving the images to CouchDB, as attachments to a record. Should each image be in a separate record, or should there be one stereoscopic record with a ‘Right’ and ‘Left’ attachment?

To really throw a monkey wrench into the works - I actually have three cameras running in my MVP. However, since a Raspberry Pi can have only one Raspberry Pi camera (though multiple USB cameras), I had to put my Raspberry NOIR camera on a separate Raspberry Pi Zero. I have no simple way to coordinate the image storage on my 'brain', though it is possible to push the images from both Raspberries to the same cloud CouchDB table.

The practical compromise I am currently considering is:
  • One record per image
  • Attributes: timestamp, file_name, source handle (Raspberry, USB, NOIR) and relative_location (Right, Left)
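
For example, one such image record might look like this (field names follow the list above; the values are only illustrative):

{
  "timestamp": "2018-03-14T12:00:00Z",
  "file_name": "2018-03-14T12:00:00Z.jpg",
  "source": "USB",
  "relative_location": "Right"
}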

Attached is a rough data model sketch:


#3

All great questions with no easy answers @webbhm.

I like your proposed data model. Since CouchDB builds its B-tree indexes based on the mapping function(s) you add to a database, the headache I have is figuring out what types of queries you will need to run on that data.

Thanks,
Rob Baynes


#4

@rbaynes
I have used the following query for general sensor charting:

"map": "function(doc) {
  if (doc.value && doc.attribute && doc.status == 'Success') {
    emit([doc.attribute, doc.name, doc.timestamp], doc);
  }
}"

Attribute is temperature, humidity, etc., and the name is the particular sensor (needed when I have multiple temperature probes). I often pivot the data on the name.
Further filtering is done in the code (e.g. keeping only the most recent dates).
I am looking at making a change to include the MAC address, so all MVPs can be in the same table, and I can select either an individual box, or do comparisons.
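
As an illustration of the kind of query that view supports (the database, design document and view names here are hypothetical), all readings for one attribute can be pulled with a key-range request and then pivoted by sensor name:

import requests

# Hypothetical names - substitute your own database and design document.
VIEW = "http://localhost:5984/mvp_data/_design/sensors/_view/by_attribute"

# Keys are [attribute, name, timestamp], so a range on the first element
# returns every reading for that attribute, across all sensor names.
params = {
    "startkey": '["temperature"]',
    "endkey": '["temperature", {}]',  # {} collates after any string in CouchDB
}
rows = requests.get(VIEW, params=params).json()["rows"]

# Pivot on the sensor name, e.g. to chart multiple temperature probes separately.
by_name = {}
for row in rows:
    attribute, name, timestamp = row["key"]
    by_name.setdefault(name, []).append((timestamp, row["value"]["value"]))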

Phenotype data models almost exactly like the sensors, but getting the semantics correct is what concerns me.

This stuff never has easy answers, but they are a lot easier if there are clearly defined research/query goals.


#5

I'd like to take a step back and ask some basic questions:

  1. Time Series Data
  • pH, EC, TDS, Sal, Temp, Humidity, …, images

  2. Visual Data
    @webbhm Any luck with volume estimation from an array of cameras? Another idea would be using a CNN to train a model on crop type and stage of growth. This information would have to be structured within the data model.

  3. Data Architecture

@webbhm This depends on the device architecture. I noticed the PFCs were designed to do a number of tasks locally, with the analysis results perhaps streamed to a dashboard? plantOS is designed to stream data directly to our cloud platform, with some pre-processing for data fidelity (bad data = garbage analysis) before the analysis is done in the cloud. We primarily use TensorFlow on Python for local Raspberry Pi analysis.

@webbhm Another consideration for your data architecture is how endpoint devices are managed and secured. It did impact our architecture, so it's better to think that through after locking down the functional requirements.

  4. Data Models
    Note: the current reference would be a JSON file with thresholds for sensor values over time. Perhaps it should be structured based on the outcomes to achieve:
  • Plant Recipe
    • Time series analysis of sensor parameters to automatically determine thresholds based on plant type
  • Visual Identifier
    • CNN to visually identify plant type and stage of growth. Could look into disease detection based on plant type in the future
  • Volume estimation
    • Does OpenAg have an alternative to PlantCV? I noticed it on the OpenAg site

#6

@devtar I would also encourage you to read through these posts. It’s all too easy to start collecting data and forget what actually matters to the plant science community. It was many of these discussions that led to the development of the MVP so that those out there who want to contribute (@wsnook @_ic @webbhm) but don’t want to spend $3500 on a V2 could participate on the software/data side of things as well. I think this is critical to the growth of the community, you can’t be open source if you don’t have a community (not just partners and customers).


#7

@rbaynes, @Webb.Peter
Looking at the data in rbaynes_pfc_env-data-pt_spaghet-a-bot_b8-27-eb-a9-11-79, two attributes jumped out at me: your "is_" values. This is a common and useful pattern; boolean values make it easy to sort categories of data. But in my experience there is a better format. Boolean values make it difficult to add new categories, and they hide details that are likely useful in other contexts. As I will show, I am as guilty as anyone of summarizing data, but I have also been burned by it (especially by the boolean summary).
'is_manual' hides the data source - who did the data entry. I prefer storing a user_id or the like. With this I can go back and ask administrative questions like 'Who entered the most data?' or 'What is the data accuracy by user?' These are things a teacher might want from student data (for grades or awards). I can also envision the other side, where I want to know what algorithm calculated a value: was plant size determined by a touch sensor probing the plant, or extracted from image analysis? My preference would be an attribute of 'data_source' (user_id, image_analysis, touch_sensor) and, for CouchDB, a category (person, automation).
'is_desired' implies secondary, derived data; the value is being compared to something (not specified in the record) and evaluated (i.e. the temperature is above a set limit). We cannot avoid summaries: a temperature reading is derived from a voltage (or resistance), and my water level is derived from the 'time of flight' of a laser beam. I am recording that my reservoir is 'Full', 'Empty' or 'OK', all based on comparing the laser's distance to the water against stored values. The actual distance is meaningless outside my system; my reservoir is not the stock bin, so the raw value is useless for comparisons with another system. Where it is needed is when I don't trust the sensor and have to validate how the value was derived. In cases like this, I like to put the raw data in the comments. Another option is to store the raw data and create another record with the secondary/derived value, marking the source with the algorithm name (category='derived') and possibly putting a formula or other helpful information in the comment.
Our data records are getting bigger, but that is the nature of NoSQL databases.
At this time I don’t have ‘source’ or ‘source_category’, but will definitely add them in the next round.
{
  "env": {"enterprise": , "farm": , "mac": },
  "experiment_id": ,
  "timestamp": ,
  "source": {"name": , "category": },
  "status": ,
  "attribute": ,
  "value": ,
  "comment":
}
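
A filled-in reading under that skeleton might look like this (every value below is made up for illustration):

{
  "env": {"enterprise": "my_school", "farm": "mvp_01", "mac": "b8-27-eb-xx-xx-xx"},
  "experiment_id": "lettuce_2018_03",
  "timestamp": "2018-03-14T12:00:00Z",
  "source": {"name": "si7021_top", "category": "automation"},
  "status": "Success",
  "attribute": "temperature",
  "value": 22.5,
  "comment": ""
}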

Looking at this, I am beginning to wonder if all/most of the top-level attributes should be lists, even if we currently have only one item in each list; it would help for growth. For example: "timestamp": [[date], [month], [day], [hour], [minute], [second]]


#8

@rbaynes, @Webb.Peter
Subject
Looking over my last post, I realized I left off one of the most significant values: the subject, what the measurement is taken from. Thinking it through, I may want to change some of what I have been collecting; I am guilty of the error of 'proxy naming'.
I keep going back to Barry Smith's definition that a measurement records an attribute of a substance. The substance is the 'thing' of interest: plant, air, water. It seems obvious, but it can get complicated. There is a constellation of information here that gets jumbled together; all the parts are needed, but they are easy to confuse. The constellation consists of the subject, where it is located (a discriminator when there are multiple potential subjects) and the instrument used to capture the measurement.
I think I am making a mistake when I talk about a ‘camera image’. The image is really an image of a plant (or collection of plants), the camera is the instrument used to take the image. However, when there are multiple images from different instruments, the camera used becomes the easiest thing by which to call the image: the ‘NOIR Camera’ -vs- ‘Left side camera’ (and even here we are mixing type and location as discriminators). It is common to have the sensor be the ‘proxy’ for the subject.
There is a similar problem in taking temperature. The temperature is of air, water or possibly a leaf; the sensor (si7021) is the instrument. Confusion comes when talking about location: is location an attribute of the sensor, the subject, or both? I am recording 'reservoir temperature', when it is actually the 'temperature of the water located in the reservoir'. Similarly, I have 'si7021_top', which is shorthand for 'temperature of the air at the top of the chamber recorded with the si7021 sensor'. In common discussion we abbreviate and use proxies, as it is easy for most of us to assume the context and fill in the missing pieces - something computers struggle with.
I am inclined to avoid this proxy confusion by having two fields: subject and instrument. I would then have location be a part of the subject (sensor location could be found in the context/environment document). I will reserve discussion of the details for a further post.
{
  "subject": {"name": , "location": },
  "instrument": {"name": , "type": , "location": , "protocol": }
}

example:

{
  "subject": {"name": "water", "location": "reservoir"},
  "instrument": {"name": "mcp9808"},
  "attribute": "temperature",
  "value": 22
}
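
Combining this with the source field from the earlier skeleton, a single observation might read as follows (illustrative values only, and 'sensor_loop' is just a made-up recorder name):

{
  "timestamp": "2018-03-14T12:00:00Z",
  "source": {"name": "sensor_loop", "category": "automation"},
  "subject": {"name": "air", "location": "top"},
  "instrument": {"name": "si7021"},
  "attribute": "temperature",
  "value": 22.5
}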

As a side note, fun comes when we have human observations. We have already captured the person information in the 'user', but are they also the instrument (technically 'yes': the person has two roles in this observation, recorder and instrument)? If I am manually recording EC, the instrument is both the EC meter and my eye; but if I am recording taste, do I want/need to record my tongue as the instrument? My leaning is to not have an instrument in this situation. This is the crazy level to which I think we need to define our data issues.


#9

@rbaynes, @Webb.Peter
Location
I think we need a 'controlled vocabulary' of locations. In a relational database this would be a 'lookup table' that constrains the entries, but we are not so lucky. The best we can do is agree on what we put in the code, or on lists controlling the UI input:
Sensor placement for the MVP and PFC is probably fairly simple: left, right, center, top, reservoir, ambient (outside the box). I think that about covers it.
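As a sketch of what 'agreeing in the code' could look like (a suggestion only, not an agreed standard), the vocabulary could live in one shared constant that both the recording code and any UI pick-lists use:

# Hypothetical controlled vocabulary for MVP/PFC sensor placement.
MVP_SENSOR_LOCATIONS = {"left", "right", "center", "top", "reservoir", "ambient"}

def validate_location(location):
    """Reject a reading whose location is not in the agreed vocabulary."""
    if location not in MVP_SENSOR_LOCATIONS:
        raise ValueError("Unknown sensor location: %r" % location)
    return location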
Plant location gets tricky. I have six holes in my reservoir and was considering numbering them in clockwise order (following GIS polygon practice). However, I know that @Webb.Peter has a reservoir with 12 holes - should we go with a matrix grid (good for Python processing!)? And what do we do if there is a hexagonal pattern? I may be over-working this, but my background was with field plots that needed to be gridded off so that placement randomization could be calculated. I am willing to let this go and leave the numbering up to the individual, but I just want us to be aware of where this could go.
Once we move outside a simple box, all bets are off. Do we need to identify racks and shelves? I know of some research greenhouses where the plants are on conveyors, so there is no fixed location (other than the greenhouse and the conveyor).
For the most part we are using location as a data discriminator, and the actual value/meaning is irrelevant; it becomes important only when it is used for calculations/comparisons (randomization). We may find ourselves going down the road of multiple location standards (by environment category). Let's agree to limit ourselves to the MVP/PFC category for now, and we can keep it simple.


#10

@webbhm Back in high school, my chemistry and physics teachers made us turn in lab reports that included a diagram of our equipment, a full step by step description of the procedure we followed, our raw data, and then finally our calculations and analysis of the results. During the actual experiments, we recorded those details in lab notebooks, then later we typed it up. The point was to document what we did sufficiently that someone else would have a good shot at reproducing our results or identifying our errors.

I don’t think the problem you’re contemplating in the above comments is something that can be solved just by picking good field names in a data record. The deeper issue is that, if you want to meaningfully compare data sets, you need to know details of the equipment and procedures used to collect the data.


#11

@wsnook
Totally agree with you. At this point I am defining the observations. You are emphasizing the environment/context and procedure/protocol (recipe?).
While I recognize the need for these (and have them referenced in the observation data): 1) they are a much more difficult part to standardize (so I want us to get some experience with something relatively easier first), and 2) a lot of this information already exists in the bill-of-materials or the Cornell Lettuce Handbook - places that are humanly readable, though not understandable to a computer.


#12

Perhaps I did a bad job of explaining myself–sorry about that. I’m not trying to create any dichotomies. Rather, I meant to advocate for the benefits of the scientific method, reproducibility, and using good old fashioned prose.

The lack of standard hardware and procedures is getting in the way of what you’re trying to do. But, the current number of operational food computers is so small that manual data cleanup ought to be reasonable as an intermediate step. Perhaps it would help if you anticipate custom hardware and provide a way for people to submit a description of their equipment and procedures along with their measurements–like a lab report.


#13

I agree with the goals; the issue is the use of prose. If we write up the environment in prose, it is readable for people but almost impossible to search or use for configuration. If we have both prose and formatted data (JSON or XML), then there is the problem of keeping the two documents in agreement (we already have that problem with build documents).
This is something that I hope we can quantify to some degree in the future, running parallel experiments to see what differences the different builds produce. At this time I agree, theoretically, that different builds will give different outcomes, but whether the difference is significant or gets lost in the 'noise' I cannot say.
This is a definite problem in the scientific community; as the joke goes, the definition of reproducibility is itself not reproducible:


The problem is most significant when the analysis relies on statistical probability, but I don't think we are pushing things that far at this time. This is a problem we need to keep an eye on, but when the larger scientific community does not have a solution, I want to tread carefully here and focus on things we have more control over and that likely have a bigger impact.


#14

@rbaynes, @Webb.Peter
Image Analysis
I am playing around with doing image analysis from the cameras I have in my MVP, and this is raising some interesting data definition issues: What exactly are we collecting data on - a plant, or an image? The image is a proxy for the plant, but it is not the plant itself. In some ways this is no different from temperature, where we are measuring the resistance of a thermistor and not the air 'directly'. In another way it seems more removed: we are not recording the display of a digital caliper when we derive a leaf length from an image. However, if our camera and code are good, the final result may be just as good.
I had been thinking of putting the picture name in the comment section of the data log, but I am beginning to re-think that decision. Especially with JSON, I am moving toward extending the 'subject' to include the image information. With stereo image analysis, there may actually be the need for a more complex structure that identifies the 'left' and 'right' images. Possibly something like:
{
  "subject": "plant", "location": 1,
  "image": [{"name": "xxxxxxx", "view": "right"}, {"name": "yyyyyy", "view": "left"}]
}
Should 'name' be a file path (not much help if it just says the file is on a Raspberry Pi that is not on the network), or should it reference an image saved to CouchDB as an attachment (I think the latter)?
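
For what it's worth, here is a rough sketch of the attachment route using CouchDB's HTTP API; the database name, document ID scheme and file names are made up for illustration:

import requests

DB = "http://localhost:5984/mvp_images"  # hypothetical database

def save_stereo_observation(doc_id, left_path, right_path):
    """Create one plant observation record and attach the left/right images to it."""
    doc = {
        "subject": "plant",
        "location": 1,
        "image": [
            {"name": "right.jpg", "view": "right"},
            {"name": "left.jpg", "view": "left"},
        ],
    }
    # Create the record first so there is a revision to attach against.
    rev = requests.put("%s/%s" % (DB, doc_id), json=doc).json()["rev"]

    # Attach each image; CouchDB returns a new revision after every attachment.
    for path, name in ((right_path, "right.jpg"), (left_path, "left.jpg")):
        with open(path, "rb") as f:
            rev = requests.put(
                "%s/%s/%s" % (DB, doc_id, name),
                params={"rev": rev},
                data=f.read(),
                headers={"Content-Type": "image/jpeg"},
            ).json()["rev"]
    return rev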


#15

Location:
For the large scale plant research we are doing in the shipping containers,
we use a locator in a format similar to an IP address:
warehouse.container.rack.level.X.Y. This is specific to our research,
but for a smaller device, a similar approach could be used. Say you are
running an experiment on 3 devices; your locator would be DeviceMAC.X.Y.
The X & Y are the coordinates of the plant in a tray / raft with the
origin in the upper left.

Values:
I prefer keeping a list of Value objects, where each contains the location,
variable name, type, value and a link back to a parent object.

Thanks,
Rob Baynes


#16

@rbaynes
Could you give me an example of your list of value objects? I am not sure I understand what you mean by a link back to a parent object, and how that would be expressed in JSON.
The problem with location is always how many hierarchy layers to record (Planet: Earth, Continent: North America, …); what can be assumed, and what is needed at the data record level for query discrimination. Systems that skip hierarchy layers (e.g. the MVP without racks and levels) tend to give me heartburn. I need to think on this and consider whether there needs to be a location descriptor in the meta-data that states how many levels a data set uses (and what they are).
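For example, a data set's meta-data might carry something like the following (the field name is invented, just to illustrate the idea):

{"location_levels": ["device_mac", "x", "y"]}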


#17

Sorry for the confusion @webbhm. I am just thinking about data
organization on the back end at the moment.

There will be some contextual information for when we are communicating the
JSON objects from the device to the backend. For example, when I am receiving
temperature values from a PFC, I will know the device ID (MAC) and hence
the recipe/program it is running.

Thanks,
Rob Baynes


#18

Hello, I would like to hear people's opinions on the role of a "photocell" sensor and on archiving its data.

I'm not convinced this is the right thread for the question, but in my database I collect brightness readings from these tiny analogue photocell sensors (using their voltage value).


I think this photocell value is the only sensor data I have for tracking the effectiveness of the LED bulb (or for comparing various types of bulbs).

I know it only detects the brightness of the bulbs. The other problem is how to use this sensor data to track plant status.

Today I read the photocell datasheet (https://cdn-learn.adafruit.com/downloads/pdf/photocells.pdf).

It mentions that this photocell is most sensitive around 550nm (green). Unfortunately, photosynthesis prefers other wavelength ranges, as shown in the absorption spectra here:


(from http://hyperphysics.phy-astr.gsu.edu/hbase/Biology/ligabs.html)

My question is: is this photocell's brightness data useful for PFC data modeling, even though the sensor is most sensitive at 550nm (green)?

  1. What do you think the role of the photocell should be?
  2. How could this sensor data be used to track plant status or the environment in a PFC?
  3. Any other ideas for using this photocell in a PFC?

This has probably been discussed before, but I couldn't find any thread or comment about it.

I look forward to your answers and opinions.

Thank you.


#19

@house You’re on the right path. I suggest you check out these threads (and the links on them) and take your discussion there to the experts. What you want is a Quantum meter that measures only photosynthetically active radiation. There may be a correlation and use for these cheap photocells down the line, but as of right now they aren’t a good indication of light intensity.


#22

@rbaynes, @Webb.Peter
Participant
This model is getting more complicated than I like, but it is where the use cases are driving me.
We want to say that a person or a sensor can have the role of a data recorder - and we want to capture who records the observation. Thus I have a 'Participant' hanging off of the observations. "Participant" is not the best term (compared to "Observer"), but it is the generic name used by BPM (business process modeling) for the person or thing participating in an activity. I am not sure we will ever be integrating with BPM systems, but it never hurts to follow naming standards and be open to future integration.
MySQL (or MariaDB) does a poor job of modeling inheritance (sub-typing); the best it can do is a common 'parent' table with multiple 'child' tables that capture the sub-types. Hence the 'Sensor_Participant' and 'Person_Participant' tables, which represent the role of being an observation recorder. When we get into administration, we will see that people can have other (security) roles. I am following Peter Coad's Universal Model, where a role is defined as 'stereotypical behavior in a context'. A role is what a person 'does', not what they 'are'; a person can have multiple roles (driver, doctor, patient, pilot, student, employee, …) and a role may have attributes associated with it (licenses, qualifications, …). Roles are assigned/related to people, places and things.
Sensor roles are a bit more complex: I want to say that a sensor in a location has a role, not the sensor by itself. The use case I am thinking of is where a sensor is moved (from the top of the MVP to the reservoir) and then changes its role. At least for the MVP, I see the location being part of the usage description (though I can think of counter-examples). I don't plan to implement it (at least not until there is a requirement), but Sensor_Location could have additional data like date installed and date removed, and the Sensor could be associated with maintenance activity (calibrations, repairs).


There are two interesting counter-examples to my Sensor_Location argument. One is computer vision, where multiple cameras in different locations are used as a single sensor (stereo vision), or one camera in one location is used for multiple purposes (plant observations, light observations).
The other counter-example would be an IR temperature gun mounted on a servo. It is located in a fixed position, but can be aimed at different locations to record a temperature reading (I think of my veterinarian cousin who used to check reptile temperatures at the Indianapolis Zoo by standing outside an exhibit and aiming his 'gun' at different lizards).
We need to define our use cases or we will go crazy with the possibilities.