How to move forward on sharing data? - Plant Recipes


#9

Data Collection & Sharing: Categories

Data tends to cluster in groups, with relationships (often activities) that connect the different groups. There are two main clusters:

  • Plan: what you intend to do. This is documented before you start growing (ie. the recipe)
  • Actual: data collected in the process of growing

The main categories are:

  1. Context
  2. Genetics
  3. Agronomics
    3a. Planting, Transplanting, Pruning, Harvesting, …
    3b. Treatments
  4. Observations
    4a. Environment
    4b. Phenotype

Context is the things that are assumed and often overlooked when thinking of data. For OpenAg this is relatively simple, as the primary context is the Food Computer itself - what sensors are installed, where it is located, type of growing (agroponic, hydroponic, raft, deep water, …). This can be simple, or go off into some complex detail (ie. sensors may have calibration requirements, drift rates, maintenance history, certification, …). This data can often be overlooked, but becomes critical when discrepancies are found (“but I thought you …”). The hardware_fixture and hardware_fixture_type would be pieces of information in this category.
Genetics: This is a part of context, but a significant one. It can be as simple as Bently Buttercrunch Bibb lettuce, or as complex as breeding history and DNA sequencing. I would recommend including a Latin name (though this has its own problems). Where you got the seed from (Bently, Burpee, …) is provenance, not a direct attribute of the seed. I would suggest starting simple (Latin name, common name, supplier) for most work, unless you are getting into seed breeding or DNA/phenotype relationships. There are interesting things in the seed business, where a company will sell different genotypes in different markets under the same name; and with heirloom, open-pollinated seeds there is no assurance that the genetics are consistent between seeds.
Agronomics: These are the farming activities - actions performed on the plants as a part of their life-cycle, or on their context (ie. plowing in traditional farming). These things are usually part of a ‘handbook’ of standard practices. Agronomic data will come as two parts: the plan of when and how to do something, and the actual of who did what when. A third part is derived data, comparing the actual to the plan. These comparisons are the bread-and-butter of administrative analysis, and irrelevant to research (assuming the plan was followed).
Treatment: applications of fertilizer, pesticides and pro-biotics. Where there is irrigation or ‘ebb - flood’ watering, this would also be considered a treatment. A treatment may be a regularly scheduled part of a recipe, or it may be an interrupt/exception such as a fungicide or insecticide treatment. pH up and down dosing is a treatment activity.
Observations: these will become the real ‘meat’ of OpenAg, as these are the variables that get measured.
Environment: this is what most of the sensors are picking up: temperature, humidity, light, conductivity, pH. This is the easiest to collect, most abundant, and likely will end up having the least value (to be explained later). Unlike field agriculture, climate for the food computer should be a controlled variable.
Phenotype: this is what Caleb and others see as the future for OpenAg, yet nobody has defined how it will be collected and recorded. Phenotype data is the big variable that we want to watch as it responds to different controlled variables. In the phrasing of OBO, these are measures of the quality of a substance (leaf length). It is a personal bias, but I am going to push hard to use the OBO ontologies for this data.

Until we pin down how to collect phenotype observations, there will be little significant data to share.
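
The plan/actual split described under Agronomics can be sketched in a few lines. This is only an illustration, not an OpenAg schema: the class names, field names and dates are all made up for the example. It shows the "derived data" idea, where the comparison is computed from the plan and the actual rather than stored by hand.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class PlannedStep:
    """One line of the recipe: what should happen and when (hypothetical shape)."""
    activity: str
    planned_date: date


@dataclass
class ActualStep:
    """What was really done, by whom, and when (hypothetical shape)."""
    activity: str
    actual_date: date
    participant: str


def days_late(plan: PlannedStep, actual: Optional[ActualStep]) -> Optional[int]:
    """Derived data: days between plan and actual (negative means early).

    Returns None when the step was never recorded as done.
    """
    if actual is None:
        return None
    return (actual.actual_date - plan.planned_date).days


# Illustrative values only.
plan = PlannedStep("transplant", date(2017, 3, 10))
actual = ActualStep("transplant", date(2017, 3, 12), "grower_1")
print(days_late(plan, actual))  # 2
```

An administrative report would aggregate `days_late` over many steps; a researcher would mostly check that it stays near zero.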

Next Up: Data Levels


#10

Data Collection and Sharing: Data Levels
Not all data has the same value to all people. In my experience there are three broad groups of data users:

  • Operations
  • Administration
  • Analytics

These three groups will work with the same data, but with different perspectives and at different summary levels.
Operations: This is the moment-to-moment running of the business (ie. what the Food Computer does). Whether the goal is research or food manufacturing, operations are driven by the data. The questions being asked are of the nature of: “what do I need to know in order to decide what to do next?” Plans and configuration data set the context; operational data determines the action. This is the temperature sensor data that is used to determine if the thermostat adjusts the temperature up or down. These are the observations that decide if the lettuce head is big enough, and healthy enough, to be sent to market. Normally it is this operational level that determines the ‘finest grained’ level of data. I am a strong believer in creating a ‘business process model’ that defines the action steps: observing, deciding, acting. Based on this model you then know what data to collect and how it will be used.
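
As a rough sketch of the observe, decide, act steps, using the thermostat example: the function names, the 24 C target and the 1 C tolerance are assumptions for illustration, and `act` only records the action instead of driving real hardware.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    sensor: str
    value: float  # degrees Celsius in this sketch


def decide(obs: Observation, target: float, tolerance: float) -> str:
    """Evaluate the observation against the plan and pick an action name."""
    if obs.value < target - tolerance:
        return "heat_on"
    if obs.value > target + tolerance:
        return "heat_off"
    return "no_change"


def act(action: str, log: list) -> None:
    """Stand-in for driving hardware: here we only record the action taken."""
    log.append(action)


log = []
for reading in (21.0, 24.2, 26.5):  # illustrative readings
    action = decide(Observation("air_temp", reading), target=24.0, tolerance=1.0)
    act(action, log)

print(log)  # ['heat_on', 'no_change', 'heat_off']
```

The point of the model is visible in the signatures: the plan supplies `target` and `tolerance`, the sensor supplies the observation, and the log of actions is itself data for later evaluation.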

Administration: Administration tends to look at trending data to determine performance and efficiency. Is the germination rate increasing or decreasing, does it differ by time, location, or who did the planting? Does a different light save energy, how does its performance change over time? Is one food computer performing better than another? These are business questions (or scientific questions) based on aggregate, summary data. The operational data is summed/averaged/aggregated by one or more categories (day, week, month, batch, growth chamber, genetic strain, recipe, equipment, …), then compared against similar sets. There will likely be standard reports (weekly production reports, …) as well as ad-hoc inquiries. In many businesses, the operational data is summarized and moved to a data warehouse for reporting and analysis (ie the map reduce jobs of CouchDB). The detailed operational data is not important to these people, unless they are ‘drilling down’ to find out why something doesn’t look right.
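
A minimal sketch of that summarizing step, assuming operational readings arrive as simple tuples. The chamber names, dates and temperatures are invented; in practice this would more likely be a database view or a map reduce job than a Python loop.

```python
from collections import defaultdict
from statistics import mean

# Illustrative operational readings: (chamber, day, temperature in C).
readings = [
    ("fc_1", "2017-03-01", 23.5),
    ("fc_1", "2017-03-01", 24.5),
    ("fc_1", "2017-03-02", 25.0),
    ("fc_2", "2017-03-01", 22.0),
]


def summarize(rows, key_index):
    """Average the reading value, grouped by the chosen category column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_index]].append(row[2])
    return {key: mean(values) for key, values in groups.items()}


by_chamber = summarize(readings, 0)  # is one food computer running warmer?
by_day = summarize(readings, 1)      # is there a trend over time?
print(by_chamber)
print(by_day)
```

The same detailed rows feed both summaries, which is why the category columns (chamber, day, recipe, …) need to be captured with every operational record.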

Analytics: This is usually another level of summary and abstraction above the administration data. In sales, this is the marketing trends and forecasting. Often these summaries are based on random sample subsets of operational data (‘how much nutrient does the average plant use?’). This is where the ‘big data’ analytics come in; where OpenAg will take the results of multiple growth cycles of multiple food computers to look at correlations between recipes. Often the context data (what brand of fertilizer, lights, …) becomes important here, where it is assumed knowledge at the lower levels (the operations person knows what bag they got the fertilizer from!).

Three take-away points here:

  • In designing data capture and storage, it is important to consider all users (and potential users).
  • If done with some foresight, it is possible to set up the operational systems to capture data that can easily be transformed for multiple uses (I will propose a standard data template in a future post).
  • Failure to capture key information (in a usable format) can turn a terabyte data warehouse into a data landfill.

#11

Thinking about this question, I stumbled upon the Open Ag Data Alliance. It is not clear what they are doing, and I am still digging in. The open source section is somewhat active, with an API definition draft.

Perhaps there are some interesting things to see there.


#12

I was peripheral to Open Ag Data Alliance when I worked at Climate/Monsanto. At one level this is helpful, and we share similar goals; but at another level we are in two different worlds.

At one level we are all growing genetic material, with actions of preparation, planting, treatments, observations and harvest; but scratch below the surface and things get context specific quickly. The focus here is yield data, soil data, seed varieties and to some degree planting data.
Farming of corn and soybeans is fairly standardized (row widths, seeds per foot), and while manufacturers may vary in how they store planting and harvest information, it is very similar - but quite different from our world. We do not measure by the acre! We think of plants/fruit harvested rather than seed harvest (bushels per acre).
Their recipes center around seeding density and fertilization. We are strong on controlling the environment, yet weather data is not a part of their scope.
I like their methodology of starting with use cases, and think we should do the same thing. Looking at this use case, it is obvious that the data exchange activities they are concerned with (standardizing combine yield monitor formats) are quite different from our activities. Their use case focuses on data exchange, not agricultural activities.
My other big complaint with the Open Ag Data Alliance is that they are starting with an interchange standard (infrastructure definition) rather than starting with a data definition. They are focusing on how to exchange data rather than on what the data is in and of itself. You have to get to the how, but it should not be the first step.


#13

Data Collection and Sharing: Plan, Act, Evaluate & Decide
There is a pattern of fractal cycles in data, decisions within decisions. Is hydroponics my best business plan? Is this recipe the most profitable? How many plants should I start? Do I need to adjust the temperature? All the data we work with falls into these four uses, and frequently the decisions from one level become the basis for evaluations at a higher level.
Plan data is often in PowerPoint presentations needed for meetings, videos or text documents. The information exists, but it is not usable by an automated system (ie it is not convenient to compare the actual activities to the planned activities).
Action: Within action data I am including both recordings of activities (the fan turned on) and observations (the plant is 100cm tall). Action data is usually not a problem: how many widgets did we make? What was the temperature in the Food Computer? When did the fan turn on? These are the things that are easy to wire up with sensors and record.
Evaluations are there, but often hidden. At a low level, these comparisons are often buried in code as ‘IF’ statements. The decision logic is documented in code, but is often hard to find as data. Sometimes the evaluations are complex equations, at other times they are gut reactions. I have seen situations where a ‘scientific decision’ has pages of data and elaborate statistical/geospatial analysis, only to have the final decision be someone’s opinion (I know my project is the best!). It is difficult to document which outliers of a data set were discarded or which assumptions were made for setting calculation parameters; yet if this information is not available we easily end up with irreproducible research.
It is also easy to confuse an evaluation with a decision (which makes changing processes, or later evaluations, difficult to impossible). There is a big difference between saying “the head of lettuce weighs 1kg” and saying “the lettuce is ready for harvest”. 1kg may be an observation, and it may be a decision criterion, but it is not a decision*.
Decision: Some decisions are easy to document: the fan turned on, the heater turned off. Others may be well known, but less documented (“Let’s buy another Food Computer!!”). It is important to capture these decisions, as they are the input data for higher level evaluations (“is the fan working?”).
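
One way to keep the observation, the criterion and the decision from collapsing into a single ‘IF’ statement is to store each as its own record. The sketch below is only an illustration of that separation: the class names, the `lettuce_head_7` subject and the 1 kg threshold are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    subject: str
    attribute: str
    value: float
    unit: str


@dataclass(frozen=True)
class Criterion:
    attribute: str
    minimum: float
    unit: str


@dataclass(frozen=True)
class Decision:
    subject: str
    action: str
    based_on: Observation   # what was seen
    criterion: Criterion    # what it was compared against


def evaluate(obs: Observation, rule: Criterion) -> bool:
    """The evaluation: does the observation meet the criterion?"""
    if obs.attribute != rule.attribute or obs.unit != rule.unit:
        raise ValueError("observation and criterion are not comparable")
    return obs.value >= rule.minimum


def decide(obs: Observation, rule: Criterion) -> Decision:
    """The decision, recorded together with the data that justified it."""
    action = "harvest" if evaluate(obs, rule) else "keep_growing"
    return Decision(obs.subject, action, obs, rule)


weight = Observation("lettuce_head_7", "weight", 1.0, "kg")
ready_rule = Criterion("weight", 1.0, "kg")
decision = decide(weight, ready_rule)
print(decision.action)  # harvest
```

Because the criterion is data rather than code, changing the harvest weight later does not rewrite history: old decisions still point at the rule that was in force when they were made.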

All the data we collect is there for a reason, to be used for a decision - either immediate or as part of a future question. In spite of the mantra of ‘big data’ to “collect everything”, we don’t. OpenAg is not interested in the fact that I had two slices of cold pizza and a handful of M&Ms for breakfast, or that my two mismatched socks have not been changed for a week - it has no statistically significant correlation to how the plants in the food computer are growing.

A major difficulty is in getting all the data into a consistent, usable format. My last forum post in this series will be on the StAT pattern (Standard Activity Template), a standardized data format that I have used for data capture that simplifies statistical summary and analysis.

*I am in strong agreement with Barry Smith that scientific ontologies must be based on what is ‘out there’ and not on what is ‘in our heads’. There is a big difference between saying ‘the lettuce weighs 1kg’ and saying ‘I am of the opinion that the lettuce weighs 1kg’. The former may be right or wrong, while the latter just is (and cannot be verified or disputed).


#14

Thank you for your feedback on OADA. Good to know that it is related, yet out of scope for now.

The use case methodology has served in many situations, and I think it is a strong step forward to address the question raised in this thread.

Any suggestions on how to get started? I am wondering about a discussion format more appropriate than here, as well as avoiding over-engineering. Perhaps a new repository on GitHub?


#15

For whatever it’s worth, I’m hoping for further discussion of the goals which are motivating people to come here and participate. It would be interesting to hear from any educators who might be following this thread.

Categories of motivation

So far here on the forums I’ve noticed four major categories of apparent motivation:

  1. Education: Caleb’s group at MIT has worked with schools, and teachers and parents have posted about approaching OpenAg as an educational project. Caleb has mentioned in talks–or perhaps here in a forum post–speaking with at least one leader who was interested in renewing interest in farming among young people in his country. Also, see this thread.

  2. Maker project: Several people have posted here about their enthusiasm for building a food computer in a way that I interpret as being part of–or aligned with–the Maker movement. They seem to be excited about the build and not too concerned about how much food they end up growing–at least that’s not the part they talk about most.

  3. Open source commercial research: @webbhm, from the language you use, it seems like you’re giving lots of thought to how the OpenAg project could be relevant to applied research as part of a business.

  4. Growing food: Looking through old posts, I noticed at least a couple people who came here wanting to discuss technical details of growing food hydroponically. These folks generally seem to get frustrated and leave when they realize the conversations here are more about software and building devices rather than about experience gained from actually growing food.


#16

Thank you for all the great ideas @wsnook and @webbhm . I’m going to start putting together plans for the back end that will manage data. All this input has been very helpful.


#17

Data Collection and Sharing: StAT Pattern
In research, if you don’t have good data, you have not done the work. The following is a summary of what I have learned about capturing research data.
Peter Coad’s Universal Model started me down the path of this thinking years ago, and led to the development of what has become known as the StAT pattern (Standard Activity Template). It is a pattern for capturing data that allows easy aggregation and summaries for different reporting needs (ie Data Warehousing). It was used to standardize data from multiple research systems into consistent reporting for an entire research pipeline. For those familiar with data warehousing, this follows the ‘star/snowflake’ pattern where there are transaction tables surrounded by standardized ‘facts’.
It assumes that activities tie everything together, whether it is a 7-11 retail system selling grape slurpees (Activity: sale; Subject: slurpee; Where: store, register; Who: clerk, customer; …), gene transformation or growing plants.
The activity table has the following core columns:

  • Subject_Id
  • Activity_Id
  • Activity_Name
  • Start_Date
  • End_Date
  • Participant
  • Current_Status
  • Status_Qualifier
  • Status_Qualifier_Reason
  • Comment

Subject_Id: This is usually a reference to a plant or part of the environment (air, water). This is a reference, in the same way that the firmware_module has an id reference to the firmware_module_type.
Activity_Id: A reference to the activity being done - planting, observing, temperature_up, …
Activity_Name: a denormalized convenience field, a recognizable name that corresponds to the id.
Start_Date: Some activities are ‘punctiliar’ or momentary and will have only one date. Other processes span a period of time and will have a start date and an end date. This field will always be populated, often when the record is created.
End_Date: If the activity spans a period of time, this is the time it finished; otherwise it is left blank.
Participant: Who did the work. This may be a person, or it may be a piece of equipment (sensor, RaspberryPi, …).
Current_Status: This is one of three values - Success, Failure, Cancel. This is a high level activity summary that allows for cross activity summaries (process timelines).
Status_Qualifier: Each activity type will have a standardized set of qualifiers, usually a list of why the activity failed. If the activity was germination, it is important to know why the plant might have died: fungus, eaten by insects, grower dropped the tray, … This information is important for process improvement inquiries.
Comment: At an operational level, sometimes the status qualifier is not enough information to understand what happened. Comments allow for any free text that might be useful in the future (“Mice chewed through the water pipe and drained the system”).
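
As a sketch, the core columns above could be expressed as a Python dataclass. The column names come from the list; the types, the `Status` enum, the date check and the `duration_seconds` helper are assumptions added for illustration, not part of the original pattern definition.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Optional


class Status(Enum):
    SUCCESS = "Success"
    FAILURE = "Failure"
    CANCEL = "Cancel"


@dataclass
class Activity:
    subject_id: str
    activity_id: str
    activity_name: str
    start_date: datetime            # always populated
    participant: str                # person or piece of equipment
    current_status: Status
    end_date: Optional[datetime] = None   # blank for momentary activities
    status_qualifier: Optional[str] = None
    status_qualifier_reason: Optional[str] = None
    comment: str = ""

    def __post_init__(self):
        # Assumed sanity check: an activity cannot end before it starts.
        if self.end_date is not None and self.end_date < self.start_date:
            raise ValueError("end_date cannot be before start_date")

    def duration_seconds(self) -> Optional[float]:
        """Elapsed time, or None when no end date was captured."""
        if self.end_date is None:
            return None
        return (self.end_date - self.start_date).total_seconds()


# Illustrative record: a germination that failed after three days.
germination = Activity(
    subject_id="plant_42",
    activity_id="germination",
    activity_name="Germination",
    start_date=datetime(2017, 3, 1, 8, 0),
    participant="grower_1",
    current_status=Status.FAILURE,
    end_date=datetime(2017, 3, 4, 8, 0),
    status_qualifier="fungus",
)
print(germination.duration_seconds())  # 259200.0
```

Restricting `current_status` to three values is what makes cross-activity summaries possible; the free-text detail stays in the qualifier and comment fields.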

Often a separate table is created for each activity, and additional fields are added to support the activity (inputs/materials and outputs/products). An observation will have not just the subject (lettuce plant) but possibly the plant part (leaf), attribute (length), unit (inches) and value (2.5). Sometimes this may be detailed with the protocol used and equipment (ie model of pH sensor). These detailed tables support the operational needs.
In a small, single food computer it may not be necessary to tag each record with common knowledge (ie the variety of lettuce being grown, who performed the activity), but it is important to add this information in when the data is aggregated and combined with other projects (ie handed off to MIT). Such ‘common knowledge’ needs to be captured somewhere and made available. In the rush to ‘move on’ I often saw end dates ignored (not captured). From an operational perspective this is not important, but it made it impossible to determine performance and efficiency at a later time (ie did Task A end when Task B started, or was there a delay between when Task A finished and Task B started?).
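
To make the end-date point concrete, here is a small sketch of the Task A / Task B question. The function name and the timestamps are hypothetical; the only claim is that the calculation is impossible when the end date was never recorded.

```python
from datetime import datetime
from typing import Optional


def handoff_delay_hours(task_a_end: Optional[datetime],
                        task_b_start: datetime) -> Optional[float]:
    """Hours between the end of Task A and the start of Task B.

    Returns None when Task A's end date was never captured, which is
    exactly the case where the question cannot be answered later.
    """
    if task_a_end is None:
        return None
    return (task_b_start - task_a_end).total_seconds() / 3600


# Illustrative timestamps only.
print(handoff_delay_hours(datetime(2017, 3, 1, 17, 0), datetime(2017, 3, 2, 9, 0)))  # 16.0
print(handoff_delay_hours(None, datetime(2017, 3, 2, 9, 0)))  # None
```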
Queries or map reduce jobs are used to strip out common fields for summary reporting: what is the average time to complete this task? How many tasks have finished this step? What is the relationship between pH and leaf length? What is the success rate of fertilizer A compared to fertilizer B?
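
Two of those questions, sketched in plain Python over a handful of made-up records. The dictionary keys and values are assumptions for the example; a real system would run the equivalent as a SQL query or a map reduce view.

```python
from collections import defaultdict
from datetime import datetime

# Illustrative activity records reduced to a few common fields.
records = [
    {"activity": "germination", "fertilizer": "A", "status": "Success",
     "start": datetime(2017, 3, 1), "end": datetime(2017, 3, 5)},
    {"activity": "germination", "fertilizer": "A", "status": "Failure",
     "start": datetime(2017, 3, 1), "end": datetime(2017, 3, 3)},
    {"activity": "germination", "fertilizer": "B", "status": "Success",
     "start": datetime(2017, 3, 2), "end": datetime(2017, 3, 8)},
]


def average_days(rows, activity):
    """Average time to complete a task, skipping rows with no end date."""
    durations = [(r["end"] - r["start"]).days
                 for r in rows
                 if r["activity"] == activity and r["end"] is not None]
    return sum(durations) / len(durations) if durations else None


def success_rate_by(rows, field):
    """Share of 'Success' records for each value of the given field."""
    counts = defaultdict(lambda: [0, 0])  # field value -> [successes, total]
    for r in rows:
        counts[r[field]][1] += 1
        if r["status"] == "Success":
            counts[r[field]][0] += 1
    return {key: ok / total for key, (ok, total) in counts.items()}


print(average_days(records, "germination"))    # 4.0
print(success_rate_by(records, "fertilizer"))  # {'A': 0.5, 'B': 1.0}
```

Both functions touch only the shared fields, so the same code works no matter which activity-specific table the rows came from.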

Reporting is what drives data collection. If you don’t use data, there is no point in collecting it. There is a need to think through the types of reports (operational, administrative, analytic) and the particulars of reports before defining what is to be collected. The ‘big data’ mantra of ‘collect it all and figure it out later’ does not work. You are wasting time and money if you fail to collect the data you need, or collect it in incompatible formats (ie trying to compare light LUX readings to photosynthetically active radiation). It will never be perfect, but having a plan and thinking about this from the start will pay dividends later.


#18

@webbhm I like your StAT pattern description.

I can only speak for myself, but I suspect many others on these forums may be in the same boat: I can see why the formality you describe will be important for the data to be useful. But, I don’t feel like I’m at all prepared to contribute at that level. For me, it would be a significant accomplishment to learn how to grow plants consistently without killing most of them.

Much of the appeal of indoor hydroponics for me boils down to, “Hey, maybe I could have affordable, super-fresh salads all year long.” That leads me back to one of your earlier observations:

what do you hope to achieve by sharing data? I could share my entire CouchDB (20k data points at the moment), but if you don’t know what plant I am growing, the recipe, or the plant health (phenotype data); the temperature and LUX are of no value by themselves. On the other hand, you can get started just knowing a seed (bib lettuce), light regiment (16 hr on, 8 hr off), fertilizer (1 teaspoon Jacks 20/20/20) and temperature (24 C). Different goals require different data.

To me, collecting scientifically valid data about phenotype expression isn’t motivating at all–it’s not a goal I care about. Rather, I’m excited about learning to grow fresh salad even when the weather isn’t cooperating. I want to eat good food, and I want to help others eat good food.

My current thinking on “How to move forward on sharing data?” is that, as a community, we’re not ready to do research yet. We could dream up some sort of distributed CouchDB replication scheme, but what would be the point? Whose data would we replicate? What reports would we make from that? If there were a bunch of people here who were already experienced in growing food hydroponically, formalizing research methods and data collection might make sense. But, it seems like only a few people are actually growing food.

The goal I’m excited about now–one that seems within reach–is helping people learn the basics of growing food hydroponically.


#19

@wsnook
I totally agree with you, if you are not into research, much of this is not significant; however, you can still provide useful data, but at a higher (and simpler) level.
All you need is a recipe that has a good track record. That, and the operational data to know that your food computer is keeping the right temperature, light and nutrients is all you need to produce a nice salad.
Sharing the details of temperature or even phenotype has minimal value for a known recipe, but it is still very valuable to know how many people tried the recipe, and the success rate you have.
At that point, a single record (along this pattern) would be fine:
Subject: RecipeXYZ
Activity: Growth Cycle
Start_Date: mm/dd/yy
End_Date: mm/dd/yy
Participant: your name
Current_Status: Success
Comments: Tastes great!! (especially with just vinegar and oil)
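
For sharing, that record could be written as a JSON document, and a pile of such records is enough to compute a recipe's track record. This is a sketch only: the key names follow the record above, the `mm/dd/yy` placeholders are kept as-is, and `recipe_success_rate` and the second participant are hypothetical.

```python
import json

record = {
    "Subject": "RecipeXYZ",
    "Activity": "Growth Cycle",
    "Start_Date": "mm/dd/yy",   # placeholder kept from the record above
    "End_Date": "mm/dd/yy",     # placeholder kept from the record above
    "Participant": "your name",
    "Current_Status": "Success",
    "Comments": "Tastes great!! (especially with just vinegar and oil)",
}
doc = json.dumps(record, indent=2)  # what would be stored or posted


def recipe_success_rate(records, recipe):
    """Fraction of shared runs of a recipe that ended in 'Success'."""
    runs = [r for r in records if r["Subject"] == recipe]
    if not runs:
        return None
    wins = sum(1 for r in runs if r["Current_Status"] == "Success")
    return wins / len(runs)


# Two illustrative runs of the same recipe by different people.
shared = [record, dict(record, Participant="someone else", Current_Status="Failure")]
print(recipe_success_rate(shared, "RecipeXYZ"))  # 0.5
```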

Thanks for the feedback.


#20

@webbhm What do you think about the idea of approaching all this from the angle of matching suitable crops to available microclimates? I’ve been thinking about something like this:

  1. Design a cheap data logger–perhaps based on Arduino or the new Raspberry Pi Zero W–that people can use to monitor microclimates inside their homes. This could be a fun and cheap Maker project.

  2. Write a program or guide that can help match their available microclimate profiles to suitable crop varieties. [edit: this information is available in seed catalogs]

  3. Write guides for how to start seedlings and grow crops according to some semi-standardized procedures that would help with comparing results (e.g. if you want to go super cheap, do Kratky method like this…, or if you want to spend a little more, do DWC like this…).

  4. Build tools to help log procedures, timing, and results in a semi-standardized way. I’m thinking something like taking one of the new Raspberry Pi Zero Ws with the camera top case, adding a simple temp & humidity sensor, zip-tying it next to the plants on a wire shelf, and taking time lapse photos with overlaid timestamps and temp/humidity measurements. There could also be a little web GUI that people could log into with a phone, laptop, iPad, or whatever to start an experiment and take notes.

  5. Write a tool to help people share their progress from step 4 on social media as a way of building community and spreading knowledge.
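
To give a feel for how small the logging part of steps 1 and 4 could be, here is a rough sketch that appends timestamped temperature and humidity rows in CSV form. `read_sensor` is a hypothetical stand-in that returns fixed values; on a real Pi or Arduino setup it would be replaced by whatever driver the chosen sensor needs. The column names and the `kitchen_window` location are also just examples.

```python
import csv
import io
from datetime import datetime, timezone


def read_sensor():
    """Hypothetical stand-in for a real temperature/humidity driver."""
    return {"temp_c": 22.5, "humidity_pct": 48.0}


def log_reading(writer, location, now=None):
    """Append one timestamped row; `now` can be injected for testing."""
    now = now or datetime.now(timezone.utc)
    reading = read_sensor()
    writer.writerow([now.isoformat(), location,
                     reading["temp_c"], reading["humidity_pct"]])


# An in-memory buffer keeps the sketch self-contained; on a device this
# would be a file opened in append mode with newline="".
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["timestamp_utc", "location", "temp_c", "humidity_pct"])
log_reading(writer, "kitchen_window",
            datetime(2017, 3, 1, 12, 0, tzinfo=timezone.utc))
print(buffer.getvalue())
```

Recording the location with every row is what would let readings from different shelves or rooms be compared as separate microclimates later.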

What I’m getting at is that building climate control chambers seems like an expensive and inefficient way for people to get started learning to grow food indoors–hydroponics doesn’t have to be that complicated.


#21

As someone whose day job involves analytics and modeling, I’m very interested in where this conversation will lead. To @wsnook’s point, those interested in producing food can find much cheaper and easier ways of doing so. Instead, I see this project primarily as a networked experimental interface. I agree with @webbhm, this project has the potential to create a data repository of optimized recipes for growth that lead to success. Since each unit can be treated as roughly identical, our experimental sample sizes grow along with our community. Even if all we gain are incidental insights from our logged inputs/outputs, that could still generate hypotheses that could lead to larger academic testing. Ultimately I see the insights learned here from makers/hackers/students as being applied to optimize larger growing operations that are done at scale. The sooner we can get a standardized data warehouse, the sooner we can view this project as one big network of controlled experiments.


#22

I think of the food computer as just another appliance. I want one so I can run tested recipes that have a high probability of succeeding.

I envision people finding a recipe website with ratings, comments, success rates and animations of the plants growing. Then asking “what is this food computer?” And learning about it, purchasing one in kit or finished form (appliance) then just running it.

I agree with @wsnook, @webbhm and @will_codd that we need an open data format / plan. We will want as much data as we can collect, so it can be shared and optimized. With enough data we can use an evolutionary algorithm to find optimizations on existing recipes more quickly than doing a real life run.


#23

To throw something out to get started…
This is a BPM/Data model hybrid of a generic growing process. It may be too abstract for what we need, as well as too detailed (ie I don’t think we will be dealing with plants that need USDA permitting to transport and plant). The rectangles are ‘fact tables’ (to use a Data Warehouse term) and the rounded boxes are activities.

Questions for going forward:

  • Do we have standard protocols for planting, harvesting, making phenotypic observations (measurements)?
  • What cycles and growth stages do we track (lettuce is different than corn)?
  • How much detail do we want around plans, observations, etc?
  • What do we have as a UI for collecting this information? Personally I think it should be in the database (JSON formats?)

As an exercise, I propose we take the Cornell Hydroponic Lettuce Handbook, and the context of the Food Computer v2.0, and see what we can agree upon for data examples and models. I will start a new topic in Data for this discussion.


#24

To clarify my position a bit, I don’t feel qualified to spend my time on data formats now. But, if other people want to work on it, that’s great.

I started this thread because I hoped to inspire more collective reflection on what we’re ultimately trying to accomplish and on what steps are required to make that happen. I wasn’t sure where I stood, and the conversation here has helped a lot in clarifying my thoughts. Thanks for all the great feedback.

I’ve concluded that I’m not ready to contribute to work on data formats because I’ve not yet grown any food hydroponically. For me, making recommendations on data formats would be talking about things I don’t understand. First, I need to meet my prerequisites–grow food.

My main goals for now are:

  1. Gather and share information about simple hydroponic methods to start growing food
  2. Start growing food and sharing updates on my progress
  3. Help other people start growing food

[edit: I don’t want to give the impression that I’m abandoning this topic. Rather, I think that experience growing food will help me discuss this intelligently. Right now I don’t understand clearly what data might actually matter.]


#25

Great to read this piece. I am working on this part only, actually, although not with GAs. This will require as much data as possible to get convincing results.

Following @wsnook , collecting data on a clear scenario (e.g. hydroponics food growth) for optimization would be a terrific first stage. This would already give momentum data for algorithms to chew on. This would also check whether the community is on a good track to get more.

This is my approach with respect to optimization at this stage. But the whole thread and its breadth is really interesting. I hope such a first step can go toward the fully-fledged version envisioned by @webbhm.


#26

For folks who haven’t seen it, the Plant Growth Chamber Handbook goes on at great length about how to record and report measurements for scientific studies involving plants in growth chambers.

I found the link on the USDA’s NCERA-101 Committee on Controlled Environment Technology and Use publications page. They also link to the International Lighting in Controlled Environments Workshop which might be useful if people want to look at recipes that involve variations in lighting.


#27

Also, you might want to take a look at NCERA-101’s Guidelines for Measuring and Reporting Environmental Parameters for Plant Experiments in Controlled Environments. This is their background blurb from the main page:

Conditions in controlled environment plant growth rooms & chambers, greenhouses and tissue culture facilities should be reported in detail to allow for comparison of results and duplication of experiments. The guidelines presented here, including additional explanations and reporting examples, should help meet these aims and describe what is deemed the minimum required amount of information that should be gathered and reported. They also highlight parameters that could be important, but that may not have been considered for measurement and reporting.


#28

@wsnook I don’t know how I’ve never run into NCERA-101’s publications before. I’ve seen the Plant Growth Chamber Handbook but didn’t know they had so much else.

As a note, the link you gave in the last post is returning a 404 error. I think this is the new link:

@jimbell @ferguman This is a pretty complete summary of everything to track.