Future cloud thoughts: EC2, CouchDB, Google App Engine

Stephen O’Grady adeptly summarizes the distinction between the ‘fabric’ (e.g. Google App Engine) and ‘instance’ styles of cloud computing.

Fabric-style cloud services may carry the long-term advantage for many problem domains, since developers can largely forget about hardware concerns, OS and web server security patches, scaling/load balancing, etc. At present, though, fabric-style cloud services (Google App Engine in particular) have a long way to go before wide adoption is a possibility. First of all, the service is still in ‘preview’ status, meaning you can’t pay to raise your usage quotas (though you can ‘request’ quota extensions). Second, the query language available to App Engine apps, GQL, lacks support for complex WHERE clauses and table joins and offers little to compensate, which is my perfect segue to CouchDB.

CouchDB is a document-oriented database under development at Apache. Unlike Google Datastore, CouchDB is not column-oriented. In fact, there really is no schema in CouchDB. Each JSON document has its own structure that may or may not relate to the structure of any other document. There may be large groups of documents with identical structures, or with only slightly different structures. This is referred to as semi-structured data.
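
To make ‘semi-structured’ a bit more concrete, here is a minimal Python sketch of two documents that could sit side by side in the same database. The field names and values are made up for illustration (only `_id` is something CouchDB itself cares about), and plain dicts stand in for the stored JSON.

```python
import json

# Two hypothetical user documents. Nothing forces them to share a structure:
# one carries an email, the other an embedded address.
user_a = {"_id": "user-1", "type": "user", "name": "Alice",
          "email": "alice@example.com"}
user_b = {"_id": "user-2", "type": "user", "name": "Bob",
          "address": {"city": "Denver", "zip": "80202"}}

docs = [user_a, user_b]

def shared_fields(documents):
    """Return the top-level keys that every document happens to have."""
    return set.intersection(*(set(d) for d in documents))

def distinct_fields(documents):
    """Return the top-level keys that appear in some documents but not all."""
    all_keys = set.union(*(set(d) for d in documents))
    return all_keys - shared_fields(documents)

if __name__ == "__main__":
    print(sorted(shared_fields(docs)))
    print(sorted(distinct_fields(docs)))
    # Each document serializes to JSON on its own terms.
    print(json.dumps(user_b, indent=2))
```

The overlap between the documents is a convention the application keeps, not something the datastore enforces.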

In order for the data to be query-able, user-defined views (and view indexes) provide a window into the semi-structured data. The views consist of JavaScript that is stored in design documents (‘documents’, meaning they are stored and replicated like any other CouchDB data). The views are applied to documents in a map/reduce fashion (each view in a design document consists of a map function and an optional reduce function), and CouchDB employs some pretty smart architecture to keep the views indexed and up-to-date without redundant data crunching or superfluous disk seeking. (You’ll see some mentions in the CouchDB docs that the design documents serve as only the ‘map’ portion in the map/reduce system; however, ‘reduce’ functions can be defined in the same design documents.)
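
The map/reduce shape of a view can be sketched in a few lines of Python. This is a toy model, not CouchDB’s implementation: real views are JavaScript functions stored in a design document, and CouchDB updates its indexes incrementally rather than recomputing them as this sketch does. The names `map_users_by_city`, `reduce_count`, and `build_view` are hypothetical.

```python
from collections import defaultdict

# Hypothetical documents; the "note" has no city and is skipped by the map step.
docs = [
    {"_id": "u1", "type": "user", "city": "Denver"},
    {"_id": "u2", "type": "user", "city": "Boulder"},
    {"_id": "u3", "type": "user", "city": "Denver"},
    {"_id": "n1", "type": "note", "text": "no city here"},
]

def map_users_by_city(doc):
    """Map step: emit (key, value) rows for the documents this view cares about."""
    if doc.get("type") == "user" and "city" in doc:
        yield doc["city"], 1

def reduce_count(values):
    """Optional reduce step: collapse the values emitted for one key."""
    return sum(values)

def build_view(documents, map_fn, reduce_fn=None):
    """Run map over every document, sort rows by key, optionally reduce per key."""
    rows = []
    for doc in documents:
        rows.extend(map_fn(doc))
    rows.sort(key=lambda row: row[0])
    if reduce_fn is None:
        return rows
    grouped = defaultdict(list)
    for key, value in rows:
        grouped[key].append(value)
    return {key: reduce_fn(values) for key, values in grouped.items()}

if __name__ == "__main__":
    print(build_view(docs, map_users_by_city))
    print(build_view(docs, map_users_by_city, reduce_count))
```

Because the map function decides what to emit per document, documents with unrelated structures simply contribute no rows to the view.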

The query interface into CouchDB view indexes is a RESTful interface with some JSON thrown in there. While view indexes are pretty much pre-run queries, the GET arguments to a URL query allow you to set limits, grouping, and ordering. Unlike GQL, CouchDB has no set limit on the number of records you can retrieve at a time. Like GQL, CouchDB doesn’t allow anything like table joins. However, with the document-based, schema-free model, information that would be spread across many tables in the relational model can be stored directly in a single document. For instance, in a standard RDBMS, one might join the ‘users’ table to the ‘addresses’ table using ‘user_id’. In the CouchDB world, a user’s address might be stored directly in the user document. Different views could then be built to query address information vs. name information, or to just grab full user data by id in one shot. Some referential integrity is bound to be lost when the relational model is abandoned, but CouchDB’s document-oriented model allows more information to be stored together without tremendous loss of efficiency, mitigating the effects of denormalization (as compared to what you would experience denormalizing in a traditional RDBMS architecture).
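
As a rough illustration of what ‘GET arguments to a URL query’ look like, the Python sketch below only builds a view URL; it sends nothing over the network. The path layout, the database/design/view names, and the exact parameter names (`key`, `limit`, `descending`) are assumptions on my part: they have varied between CouchDB releases, so check the docs for the version you run. The helper `view_url` is hypothetical.

```python
import json
from urllib.parse import urlencode

def view_url(base, db, design, view, **params):
    """Build a GET URL for a view query.

    Assumption: views live under /<db>/_design/<design>/_view/<view>; older
    releases used a different path. Values are JSON-encoded because view keys
    are JSON values (a string key must be sent with its quotes).
    """
    path = f"{base}/{db}/_design/{design}/_view/{view}"
    query = urlencode({name: json.dumps(value) for name, value in params.items()})
    return f"{path}?{query}" if query else path

if __name__ == "__main__":
    # Hypothetical names: a 'users' database with a 'by_city' view.
    url = view_url("http://localhost:5984", "users", "people", "by_city",
                   key="Denver", limit=10, descending=True)
    print(url)
```

The point is only that the view index already exists on the server, and the query string narrows, orders, or limits what comes back.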

Although Erlang (the environment in which CouchDB lives) virtual machines can run across multiple nodes, CouchDB does not currently support partitioning across them. The long-term mission of CouchDB, however, is to support multi-node instances. The map/reduce methodology employed for the views is certainly well-suited for the transition to multiple nodes, and, indeed, everything about the CouchDB ‘mission’, from what I can gather, is about being massively scalable and reliable.

In the meantime, CouchDB supports master-slave*n replication… enough to get you started with some pretty high-demand applications in the near term. My guess is that the development of CouchDB is going to progress at a very fast clip, and we’ll see some amazingly easy-to-implement distributed configurations.

Returning to the topic of ‘fabric’ vs. ‘instance’ cloud styles, I can see CouchDB kicking some serious ass in the near term in the instance cloud space. A small CouchDB cluster on Amazon EC2, backed up with ElasticDrive or regular S3 backups, could be a great, nearly rock-solid datastore choice for some start-ups.

Down the road, I’m sure we’ll see CouchDB beef up the way map/reduce is leveraged, hopefully allowing easy sharding and partitioning, even between geographically dispersed servers. Once CouchDB supports multi-master replication as well as partitioning in multi-machine-spanning Erlang VMs, I see no reason why CouchDB can’t be offered very conveniently as a ‘fabric’ cloud service. New instances could be created (or instances could expand across additional machines) and replicated without user input, growing to potentially large clusters of physical machines, all aggregated efficiently by CouchDB’s map/reduce functionality.

Certainly, this is not going to happen overnight. I should also add that I have no idea whether ‘sharding’, that is, the parceling up of different sections of the data between several different CouchDB instances, would be necessary once CouchDB’s brand of partitioning (one datastore VM instance, several machines) is implemented. My imagination says truly large projects will leverage both sharding across many instances and partitioning those instances across multiple machines. My imagination also says that, if the whole map/reduce basis of CouchDB views and view indexes is leveraged to its conceptual end, arbitrary trees of CouchDB instances should be able to operate together in a way that looks like just one instance to the consumer.
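
To show why the map/reduce basis makes that last idea plausible, here is a speculative Python sketch of a tree of instances answering one query. Nothing here is an existing CouchDB feature; `Node`, `reduce_sum`, and `query` are hypothetical names. The only property it relies on is that a reduce function can be applied again to its own partial results, which the `rereduce` flag stands for.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    """A hypothetical instance: it holds some emitted values, some children, or both."""
    values: List[int] = field(default_factory=list)
    children: List["Node"] = field(default_factory=list)

def reduce_sum(values, rereduce=False):
    """Summing works the same on raw values and on partial sums."""
    return sum(values)

def query(node, reduce_fn):
    """Reduce each subtree, then reduce the partial results at this node."""
    partials = [query(child, reduce_fn) for child in node.children]
    if node.values:
        partials.append(reduce_fn(node.values, rereduce=False))
    return reduce_fn(partials, rereduce=True)

if __name__ == "__main__":
    tree = Node(children=[
        Node(values=[1, 2, 3]),
        Node(children=[Node(values=[4]), Node(values=[5, 6])]),
    ])
    # The consumer asks the root and gets one answer, as if from one instance.
    print(query(tree, reduce_sum))
```

The consumer only ever talks to the root; how many levels sit underneath is invisible, as long as the reduce function composes this way.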

Even if the current Erlang implementation of CouchDB never gains traction, its interfaces/APIs would make a great platform-agnostic standard for datastore manipulation. Whereas Google has gone the ‘proprietary API’ route, meaning App Engine apps are not very easily portable, a competing fabric service could support CouchDB-style JavaScript views and a JSON document format. Regardless of the implementation underlying such a service, an app built on it could be nearly or completely portable to other CouchDB-style datastore services. By carefully incorporating JavaScript, JSON, and REST standards into a document-oriented database, CouchDB has made a great contribution even if you disregard what looks like a very promising implementation.
