Data Identity: Ownership


#1

Discussion of what should be required to identify record ‘ownership’. CouchDB supplies an _id as a document identifier, but how do we associate data to a particular device or organization?


#2

@rbaynes, @Webb.Peter
The problem if identity comes up when we want to aggregate data from multiple sources into a single database. The context that can be assumed at the source needs to be explicitly stated in the aggregate. If I have a database of sensor readings and name the database mvp_sensor_ag-7e-55-ae-1e, I might know that this is sensor data from the Raspberry Pi with a particular MAC address. I don’t need to label each document with the MAC address, as the context gives me that information. It is when I combine data from multiple Raspberries that it is necessary to add the MAC address to each record.
Things get a bit more complicated. In a relational database, I can add another table that that associates the MAC address with other ownership data - what group of PFCs this is a part of (a classroom), or add another table with even higher ownership (a school). This will be important, not only for reporting but also security; we need identifiers so we can say a particular user can access/modify data at different levels (a student may be limited to one box, but the teacher needs to access the whole classroom or school). With relational systems there is a flexibility to modify and add data-tables as needed to expand relationships. However, with CouchDB it is more like a denormalized reporting system - shove everything you may possibly want into every document; however, updating of data gets expensive, and there are no relationships - we need to put all this information into a document when it is created. Thus, getting this ‘right’ from the start is more critical.
I am inclined to say we want ‘Enterprise’, ‘Farm’ and ‘Field’ (school, classroom and MAC) on each record, but this may will get us into trouble if there is ever a need to move a box from one classroom to another. I think we can get by assuming the relationships are static (historical) with no need to change the record .
The other solution is to do multiple queries (Enterprise, Farm, Field in one database, sensor data in another) and combine things at the program level. This may actually involve pulling the sensor data from CouchDB while pulling the higher level relationship from a relational database. Or managing the dynamic aspects in a relational database and pulling a static copy to CouchDB for appending historical data.
Another solution (which I have not tested) is to have multiple document types in the sensor database and combine this information via the view, either at query time, or at the time data is moved off of the Raspberry and pushed up to the cloud.
Proposal: We stamp each document with ‘Enterprise, Farm and Field’ at the time it is created.


#3

All excellent points, Howard @webbhm.

I am just finishing up our data dictionary - everything we know about the
data we will use and how I will track it.

I want to separate the “knowledge about the data” from the back end
database it is stored in, from the way it is communicated.

Regarding ownership, I am also drawing up a relationship diagram so I can
clearly picture how all the entities are connected. I envision storing the
data on the backend in a relational database.

When I am done, I will share it with you and look forward to your review /
comments.

Thanks,
Rob Baynes


#4

Looking forward to seeing the results of your guys work. I’m just about done bringing a food computer online and would like to move the Couchdb to the cloud for this particular food computer and hope to leverage your work.

Regarding the use of MAC addresses: Please think about using globally unique random hashes instead. This would make every source a first class citizen when it comes to inserting data into the database. One RP for example could host multiple food computers.

I do think we need to create a relational database that bakes in the schema that you guys are cooking up. I know that relational dbs are grandpa sh** but believe it or not they may be making a comeback.


#5

@ferguman
How do you envision multiple sources working with the code on a single Raspberry? I envision a major code rework to do that. If you have a trick I am not thinking of, I would like to know it. If we can, then definitely the MAC is not a solution for unique identity of a PFC.
98% of my career was with Oracle. Relational is the only way to go for data integrity and flexibility; but dies on reporting performance and scalability. Pick your poison.


#6

HI Howard,

I won’t be of much use until I get further down the road with the existing
FC V2 code base and the Couchdb setup in particular. I come to the table
with background in relational databases and not much experience with no SQL
databases.

Joe


#7

Welcome to the club. My background is relational, and the NoSQL stuff reminds me more of a throw-back to COBOL days. It is fast and scalable, but data integrity and data flexibility are not its strength.
My primary concerns are to get a good (normalized) data model and pin down the naming and semantics. Sounds like these are something you have experience with.


#8

@webbhm @ferguman
I think the NoSQL DB on the device is a good temporary storage mechanism for time series data and can handle the cache / offline feature.

For the backend, I’m testing Google’s BigQuery data warehouse. It has a lot going for it:

  • Integrated into the google cloud platform, so easy to access from any language and any of their VM offerings.
  • Has command line, console and various language bindings to its REST API.
  • Massively scalable.
  • Can make a dataset public. That means OpenAg pays for the storage space, but any user can make all the queries they want against the data, the user pays for the query processing time.
  • Data in BQ can be a source for their analytic / charting services.
  • Any public data or query results can be downloaded as JSON or CSV.
  • Easy command line or python scriptable schema creation / modification and data upload. I need this to upload all our current research data.

I’m still figuring out the device side of the solution. It will most likely be a public key access local event publisher to a PubSub service I host on a google app engine that does the validation and insertion into the BQ public dataset.

https://cloud.google.com/bigquery/what-is-bigquery


#9

@rbaynes, @ferguman
I don’t know enough about Big Query to give an answer, but this does sound like it has more features and flexibility than CouchDB.
My two key feature concerns are the ability to set up user dashboards to monitor plant growth, and the ease of query and extract for analytics.

Rather than client side upload scheduling and transfer management, how about CouchDB replication to Google, then a CouchDB to BigQuery move in the cloud? From my days of data warehouse work, there are advantages in having a ‘staging’ platform where data clean-up and validation can be performed before moving data to a ‘high quality’ trusted system.


#10

@webbhm, great point about the staging / validation step. I still need to
think about how to implement that part. I’m thinking it should happen
locally on the device, so the user gets immediate feedback. It also needs
to happen on the backend before the data is inserted, to keep our DB clean.

Thanks,
Rob Baynes


#11

@rbaynes
There are multiple definitions of ‘validate’ going on here.

  1. The immediate feedback of manually entered data, this should be via edit checks to prevent anything other than valid data being entered (value range checks, numeric value check, …)
  2. Data semantic check. This is part of the work of the staging platform. Make sure ‘temperature’ is spelled correctly to the standard (not: tmp, temparature, Temperature, temperature_C, air_temperature). This may be some data cleansing to correct legacy naming to the current standard.
  3. Data set checking. All the needed records for an experiment are present; ie, if we require height, length and width of the plant on a weekly basis then all three data types are present with the required frequency. If this standard is not passed, the data may be useful for the user, but it is not useful for larger scale analytics. This is another place where a staging platform may be useful, to create derived data (daily, weekly, experiment summaries; convert dates to duration (time since planting/germination)

#12

Thanks Rob,

I’m hoping to get 2 V2 FC’s in shape to publish to your Google cloud db.
Let me know if you are game for that.


#13

I’ll let you (and the community) know when we have a backend ready to hold all the data. It’s my full time focus until it’s done (but it’s only me working on it).


#14

@Webb.Peter
I am interested in what other parts of the Google stack you are looking into, and if you need any help exploring them. We have CouchDB up an running on the cloud, and it is as easy to pull reports from there as it is from the local Raspberry.
In particular, I am interested in:

  1. Dashboard reporting and charting
  2. User management (logins, user profiles)
  3. Messaging (how close are you to having something here?)
  4. Back end tools (python or Java script on the server side)

Let me know what I can do to help you out.