Ambient Findability is the title of a book by Peter Morville. It is a good read and a springboard for some of my data ranting.
Data is not the OpenAg problem. It is not that difficult to collect data; I have 183K data points, and counting, in my CouchDB at the moment. The real problem we must deal with is the findability of data, and of the data relationships that are meaningful to a particular need. Findability is the difference between a meaningful data warehouse and a data landfill. The key to this is taxonomy and ontology - the creation of meaningful labels and relationships/structures.
There are two basic types of labels: folksonomies and taxonomies. Folksonomies are quick and easy taggings, like Twitter hash tags; they lack formal discipline, but are quick and useful. Their problem is that they are often not unique or distinct. If I tag something as “#NewYork”, does that refer to New York City or New York State? The tag by itself does not help me here. There are similar problems with tags like “#MyVacation” or “#Micky”.
The other kind of tagging is a formal taxonomy. This is a disciplined, often hierarchical, list of terms that is managed and controlled. Formal definitions are given so that each label has exactly one unambiguous meaning. Thus, I may have “#NewYorkState” and “#NewYorkCity”.
Some large systems may actually use both tagging systems at the same time: one intended for informal, personal use, and the other as the more formal, corporate, official tagging.
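The difference between the two kinds of tags can be sketched in a few lines of Python. This is only an illustration: the dictionaries, the definitions in them, and the `resolve` and `is_ambiguous` helpers are all made up for this example and are not part of any OpenAg code.

```python
# Hypothetical controlled vocabulary: each formal term has exactly one definition.
TAXONOMY = {
    "NewYorkState": "The U.S. state of New York",
    "NewYorkCity": "The city of New York, in New York State",
}

# Hypothetical record of what an informal tag might mean.
# A folksonomy tag can point at several formal terms, or at none.
FOLKSONOMY_CANDIDATES = {
    "NewYork": ["NewYorkState", "NewYorkCity"],
}


def resolve(tag):
    """Return the list of formal terms that a tag could refer to."""
    name = tag.lstrip("#")
    if name in TAXONOMY:
        # A formal term always resolves to itself and nothing else.
        return [name]
    return FOLKSONOMY_CANDIDATES.get(name, [])


def is_ambiguous(tag):
    """A tag is ambiguous when it could mean more than one formal term."""
    return len(resolve(tag)) > 1


if __name__ == "__main__":
    for tag in ("#NewYork", "#NewYorkCity", "#MyVacation"):
        print(tag, resolve(tag), "ambiguous" if is_ambiguous(tag) else "")
```

The informal tag stays quick to type, but the system only knows what it means once somebody maps it onto the managed list.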
Tagging and labels only get us so far. I still have the problem of distinguishing Bloomington, Indiana from Bloomington, Illinois. This is where ontologies and data structures are needed: a way to say that one label belongs to (or is related to) another label, for example that the city name (Bloomington) is related to a particular state (Indiana). For OpenAg, these are the questions of how “plants” (lettuce) are related to “experiments”, “trials” and “recipes”. How do I search for all experiments that used lettuce and PAR 50 lights? How do I find a good recipe for high calcium lettuce?
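One simple way to picture such relationships is as (subject, relationship, object) triples. The sketch below is an assumption-laden toy, not the OpenAg data model: the identifiers (`exp_001`, `PAR_50`, `uses_plant`, and so on) and the sample records are invented purely to show how a relationship-based search could work.

```python
# Hypothetical facts, each stored as a (subject, relationship, object) triple.
TRIPLES = [
    ("Bloomington_IN", "located_in", "Indiana"),
    ("Bloomington_IL", "located_in", "Illinois"),
    ("exp_001", "is_a", "Experiment"),
    ("exp_001", "uses_plant", "lettuce"),
    ("exp_001", "uses_light", "PAR_50"),
    ("exp_002", "is_a", "Experiment"),
    ("exp_002", "uses_plant", "lettuce"),
    ("exp_002", "uses_light", "PAR_100"),
    ("exp_003", "is_a", "Experiment"),
    ("exp_003", "uses_plant", "basil"),
    ("exp_003", "uses_light", "PAR_50"),
]


def subjects(relationship, obj):
    """Return every subject that has the given relationship to the given object."""
    return {s for s, r, o in TRIPLES if r == relationship and o == obj}


def experiments_with(plant, light):
    """Find experiments that used both the given plant and the given light."""
    matches = (
        subjects("is_a", "Experiment")
        & subjects("uses_plant", plant)
        & subjects("uses_light", light)
    )
    return sorted(matches)


if __name__ == "__main__":
    print(experiments_with("lettuce", "PAR_50"))
    print(subjects("located_in", "Indiana"))
```

The two Bloomingtons share a name but not a `located_in` relationship, and the lettuce question becomes an intersection of three simple lookups.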
The place to start (data modeling methodology) is not with the data, but with the questions. What are the typical research questions that we expect people to ask of the data? We need to collect a dozen ‘use cases’ and tear them apart (good old English sentence diagramming!!) to look at the words and word relationships, and let this drive the data structuring.
Having said all of this, I have two caveats:
Google, in the early days, did not care about ontology and the meaning of words for searching or translation - they just crunched the numbers. If you know that a country like Switzerland publishes its official documents in French, German and Italian, you don’t need to know the meaning of the words in those documents; you just need to find the correlation of words between the three official versions of the same document. With enough documents you get a fairly good translation. This can get you so far, but then it hits a wall. If Google could walk away with 80% of the web market share, they were willing (at that time) to leave 20% on the table.
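To make the "just crunch the numbers" idea concrete, here is a toy sketch of word correlation across parallel text. It is not a description of what Google actually did: the four sentence pairs are invented (English and German, to keep it short), and the Dice coefficient is simply one easy way to score how strongly two words go together.

```python
from collections import Counter
from itertools import product

# Invented parallel sentences: the same text in two languages.
PAIRS = [
    ("the house", "das haus"),
    ("the garden", "der garten"),
    ("a house", "ein haus"),
    ("a garden", "ein garten"),
]

src_count = Counter()  # number of sentences containing each source word
tgt_count = Counter()  # number of sentences containing each target word
cooc = Counter()       # number of sentence pairs containing both words

for src, tgt in PAIRS:
    s_words, t_words = set(src.split()), set(tgt.split())
    src_count.update(s_words)
    tgt_count.update(t_words)
    cooc.update(product(s_words, t_words))


def best_match(word):
    """Pick the target word most strongly correlated with a source word."""
    scores = {
        t: 2 * cooc[(word, t)] / (src_count[word] + tgt_count[t])
        for t in tgt_count
        if cooc[(word, t)]
    }
    return max(scores, key=scores.get) if scores else None


if __name__ == "__main__":
    for word in ("house", "garden"):
        print(word, "->", best_match(word))
```

Nothing in the code knows what a house is; the pairing falls out of the counts alone, which is both the strength and the limit of the approach.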
The other problem is that tagging and structures only help to answer the questions you plan for. I read once that in the early days of NASA they were looking for geographical features they could use for optical targeting from space. The requirement was for discrete land forms that had high albedo - relatively small bright objects. They ended up going to the library to ask the research librarian (this was in the pre-Google, pre-web days!!). She immediately told them to go to the ornithology section and look for ‘birds of the Pacific’. She had given them the exact answer they needed, but it was one they realized they would never have thought of: bird rookeries are often isolated islands or atolls covered with white guano.