To get a feel for the different ways of interacting with Couch, let’s start with a Hello World. Here’s the boilerplate to set up a Tornado server with two endpoints, /hello and /hi:
import tornado.ioloop
import tornado.web

application = tornado.web.Application([
    (r'/hello/([^/]+)', JumpyHello),
    (r'/hi/([^/]+)', RelaxedHello),
])
application.listen(1920)
tornado.ioloop.IOLoop.instance().start()
Since both of the request handlers (defined below) will need a reference to the database, set up a global for them to share:
from corduroy import Database, NotFound, relax

people_db = Database('people')  # i.e., http://127.0.0.1:5984/people
Our first request handler will parse the url and pull out the last component (which will be treated as a document ID). It will then try to retrieve that doc from the database and print a greeting based on its contents. Here is a handler that uses an explicit callback to make a non-blocking request:
class JumpyHello(tornado.web.RequestHandler):
    @tornado.web.asynchronous
    def get(self, user_id):
        # Request the corresponding user's doc. This will return
        # immediately and control will leave this method. Later on
        # the got_user_doc callback will be invoked with the
        # response and a status object as arguments.
        people_db.get(user_id, callback=self.got_user_doc)

    def got_user_doc(self, doc, status):
        # Generate output based on the db's response
        if status.ok:
            self.write('hello %s %s' % (doc['first'], doc['last']))
        elif status.error is NotFound:
            self.write('hello whoever you are')
        else:
            raise status.exception
        self.finish()
Though it’s gratifying to know that your server process isn’t blocking while waiting for Couch to respond, the resulting code doesn’t feel terribly pythonic. Ideally something as simple as a GET request should be a one-liner. In addition, the use of callbacks means your code is no longer in the call stack should an exception occur during the request. As a result, error handling becomes a C-like process of manual status inspection in lieu of idiomatic try/except blocks.
A particularly nice solution to this problem of twisted async code is provided by the tornado.gen module. As the abbreviated name suggests, its approach is to use python generators to turn request handler methods into coroutines that can be suspended during i/o and then restarted when the response arrives.
Corduroy is happy to work in this style and provides the @relax decorator to make the syntax more transparent (or at least more glazed with sugar). When applied to one of your methods, the decorator allows you to treat the API as if it were blocking and no longer requires explicit callbacks. All your code needs to do to make this possible is place a yield in front of calls to the library. The result of this yield expression will be the data that would ordinarily be passed to your callback function, but it can now be captured through simple assignment. If an error occurs, the decorator handles that as well by raising the exception at the point of the yield in your code:
class RelaxedHello(tornado.web.RequestHandler):
    @relax
    def get(self, user_id):
        try:
            doc = yield people_db.get(user_id)
            self.write('hello %s %s' % (doc['first'], doc['last']))
        except NotFound:
            self.write('hello whoever you are')
        self.finish()
This code looks like ‘normal’ blocking code but will in fact execute asynchronously. At the point of the yield expression, the request handler releases control of the event loop while Corduroy creates a callback behind the scenes. Once this internal callback fires, the request handler’s method is resumed. The handler can then make other asynchronous requests or just return its output if there’s nothing more to be done.
Virtually every method in Corduroy accepts an optional keyword argument called callback. When this argument is omitted (and the @relax decorator is not in effect), the call will use blocking i/o and will not return until the operation is complete (or until an exception is raised).
If you pass a callable object (a.k.a. a function) as the callback argument, the call will return almost immediately, and its return value will not be the final data but a replica of the HTTP request (which can be useful for debugging, e.g., when passing a lot of options). The callback will be invoked moments later, when the server response arrives.
Callbacks should expect two arguments and be of the form:
def mycallback(data, status):
    pass
The first argument will contain the response from the server, either as a unicode string or as a decoded json object.
The second argument allows for error checking and has five attributes of interest:

ok: False if a response code >= 400 was received
code: The numeric HTTP response code
headers: A dictionary of response headers
exception: Either None or the exception that would have been raised were this a non-blocking call. Feel free to raise it yourself.
error: Either None or the class of the exception. This is redundant, but also allows callback functions to use the syntax

    if status.error is Conflict:

rather than

    if isinstance(status.exception, Conflict):
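To make this concrete, here is a toy illustration of a callback that branches on these attributes. The Status class below is only a stand-in for whatever object Corduroy actually passes; the point is the shape of the checks:

```python
# Illustration only: a stand-in status object plus a callback that
# branches on its attributes the way the list above describes.
class Conflict(Exception):
    pass

class Status(object):
    def __init__(self, code, headers=None, exception=None):
        self.code = code
        self.ok = code < 400                      # False for any error response
        self.headers = headers or {}
        self.exception = exception
        self.error = type(exception) if exception else None

def greeting_callback(data, status):
    if status.ok:
        return 'hello %s' % data['name']
    elif status.error is Conflict:                # class check via .error
        return 'conflict: try again'
    else:
        raise status.exception                    # re-raise anything unexpected

result = greeting_callback({'name': 'Hugh'}, Status(200))
```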
When status.ok is False, the contents of the first argument are somewhat variable. Sometimes Couch responds verbosely to error conditions and the data argument will contain a json object. At other times data will simply be None. When in doubt, consult the HTTP API.
When you call the database from within a @relax-decorated function you don’t have to provide a callback; the decorator will do it for you. To give you some idea of what happens when you type yield, the decorator-provided callback’s logic looks something like this:
def relax_callback(data, status):
    if status.exception:
        raise status.exception  # (in the context of your function)
    else:
        return data             # (and assign it to the yield's lvalue)
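As a rough illustration of the mechanics (a toy sketch, not Corduroy’s actual implementation), here is how a decorator can drive a generator: results are sent back in at the yield, and errors are thrown at the yield point. A fake in-memory get stands in for the real non-blocking i/o:

```python
# Toy sketch of a @relax-style decorator. fake_get and FAKE_DB stand in
# for Corduroy's real async machinery.
class NotFound(Exception):
    pass

FAKE_DB = {'lackey-129': {'first': 'Oliver', 'last': 'Reeder'}}

def fake_get(doc_id):
    if doc_id not in FAKE_DB:
        raise NotFound(doc_id)
    return FAKE_DB[doc_id]

def relax(genfunc):
    def wrapper(*args):
        gen = genfunc(*args)
        try:
            doc_id = next(gen)                 # run up to the first yield
            while True:
                try:
                    result = fake_get(doc_id)  # would be non-blocking i/o
                except Exception as exc:
                    doc_id = gen.throw(exc)    # raise at the yield point
                else:
                    doc_id = gen.send(result)  # resume after the yield
        except StopIteration:
            pass
    return wrapper

output = []

@relax
def hello(user_id):
    try:
        doc = yield user_id
        output.append('hello %s %s' % (doc['first'], doc['last']))
    except NotFound:
        output.append('hello whoever you are')

hello('lackey-129')
hello('lackey-999')
```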
The rest of this guide will use the implicit callback style for brevity’s sake. Just keep in mind that anywhere you see a yield before a library call, you could pass a callback argument instead.
When thought of purely as a key/value store, a CouchDB installation is a cascade of json objects with three basic levels of hierarchy. In Couch nomenclature, a single Server can contain many Databases which in turn contain many Documents.
Servers are represented by Couch objects. The constructor takes a url as an argument, but will use the values in corduroy.defaults to construct a default url if none is provided. You can also include login credentials, either inline or as a 2-tuple. If you haven’t overridden the host and port defaults, all of these instantiations should be equivalent:
couchdb = Couch('http://username:pass@127.0.0.1:5984')
couchdb = Couch('http://127.0.0.1:5984', auth=('username','pass'))
couchdb = Couch(auth=('username','pass'))
All three of the above will return with:
<Couch 'http://127.0.0.1:5984'>
Creating a Couch object doesn’t actually connect to the server. As a result it’s safe to call the constructor in an event handler without needing a callback. You can then use the object’s all_dbs method to obtain a list of available databases, monitor server activity with tasks, or read/write configuration options with config. See the Reference docs for other server-level operations.
In all likelihood the only methods you’ll use regularly are db and create, which let you retrieve a reference to a specific database using its name:
couch = Couch(auth=('user','pass'))
try:
    mydb = yield couch.db('mine')
except NotFound:
    mydb = yield couch.create('mine')
Since this pattern is fairly common, the above can be simplified to:
mydb = yield couch.db('mine', create_if_missing=True)
You don’t actually need to create a Couch just to access a database. You can also instantiate one directly by passing the full url of the db to the Database constructor. As with Couch objects, the default server url will be prepended if necessary. Both of these are equivalent:
db = Database('http://127.0.0.1:5984/some_db_name')
db = Database('some_db_name')
<Database 'some_db_name'>
Creating a Database object doesn’t contact the server, but an efficient check for its presence can be performed by calling its exists method. Similarly useful is the info method, which performs a GET on the database’s root url.
CouchDB documents are dict-style json objects with two required keys: _id and _rev. The _id field is a unique-to-that-database string that allows the document to be requested by name. The _rev value is filled in by the server when you create or update a document. It is also the basis of Couch’s transaction-less mechanism for detecting write conflicts (see the Eventual Consistency section for details).
Corduroy will accept any dict-like object you pass and treat it as a document. The objects it returns will by default use the corduroy.Document class (though this can be overridden). The Document class inherits from dict but has some noticeable differences from the stdlib model; as the examples below demonstrate, keys can be accessed as attributes, and the repr includes the doc’s _id and revision.
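The attribute-style access seen in the examples (e.g. ollie.education) can be sketched with a small dict subclass. This is an illustration only, not Corduroy’s actual Document implementation:

```python
# Illustration: a dict subclass whose keys can be read, written, and
# deleted as attributes (not Corduroy's real Document class).
class AttrDict(dict):
    def __getattr__(self, name):
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name)

    def __setattr__(self, name, value):
        self[name] = value

    def __delattr__(self, name):
        del self[name]

doc = AttrDict({'_id': 'lackey-129', 'first': 'Oliver'})
doc.last = 'Reeder'      # equivalent to doc['last'] = 'Reeder'
del doc.first            # equivalent to del doc['first']
```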
To fetch a single document, call the Database object’s get method with the desired _id string. The Document object it returns can be used just like a dictionary:
db = Database('underlings')
ollie = yield db.get('lackey-129')

print ollie
print 'Mr. %(first)s %(last)s can be found at %(office)s.' % ollie
<Document lackey-129[1] {first:"Oliver", last:"Reeder", office:"Richmond Terrace"}>
Mr. Oliver Reeder can be found at Richmond Terrace.
After making local changes, a new version of the document can be written to the db by calling the save method:
del ollie.office
ollie.education = u'Oxbridge'
yield db.save(ollie)
print ollie
<Document lackey-129[2] {first:"Oliver", last:"Reeder", education:"Oxbridge"}>
Note that the number in brackets incremented as a result of the save. This number is the first portion of the _rev value (the full value looks more like “2-0348cd6cc49cb4cacdb9b94c87c83808”). The _rev will change on every successful update. Also notice that we ignored the return value, since the save method updates its argument as a side effect.
To remove a document from the db, pass a current version of the doc to the delete method. If your argument’s _rev value doesn’t match the server copy, a conflict exception will be raised.
try:
    yield db.delete(ollie)
except Conflict:
    print 'Our doc is stale. Need to refetch and try again.'
To create a new document, save a dict with a valid _id string. Its _rev will be set in the process:
newdoc = {'_id':'lackey-130', 'first':'Angela', 'last':'Heaney'}
yield db.save(newdoc)
print newdoc._rev
1-d0a259b5b8e71a3c0b0bc7facbb690d5
If you don’t specify an _id, one will be chosen for you. Corduroy keeps a cache of identifiers collected from the couch server’s _uuids API. Relying on server-provided IDs can purportedly improve performance since the generated identifiers are semi-sequential in a way that is friendly to b-tree traversal. I take no definite stance on this.
anon = {'first':'Julius', 'last':'Nicholson'}
yield db.save(anon)
print anon._id, anon._rev
6efcdf33df6c82bfc53d7416c660ef5a 1-967a00dff5e02add41819138abb3284d
Both the get and save methods can accept either a single value or a list as the first argument. To fetch multiple documents in a single request, pass a list of ID strings. The return value is a list of Document objects in the same order as the IDs list:
db = Database('backbench')
doc_ids = ['ballentine', 'holhurst', 'swain']
docs = yield db.get(doc_ids)
print docs[0]
<Document ballentine[9] {first:"Claire", last:"Ballentine", highly_regarded:True}>
Whereas the single-doc get call will raise a NotFound exception should the requested doc not exist, a batch get flags missing documents by including a None at the corresponding position in the results list.
To update multiple documents in a single request, pass your list of updated docs to save. To delete one or more docs in the batch, add a _deleted key to each such doc before submitting the request:
claire, geoff, ben = docs
claire.standing = u'not standing'
geoff._deleted = True
ben.newsnight = {'paxman':1, 'swain':0}
yield db.save([claire, geoff, ben])
One of the fundamental differences between CouchDB and traditional RDBMSs is the way data integrity and transactional semantics are handled. The tl;dr version is that Couch abandons nearly all SQL-ish guarantees and pushes responsibility for conflict resolution to the client.
This might sound like a mis-feature, but in the best Worse-is-Better tradition, the lack of abstraction over what-gets-written-when can force you to improve the way you structure your code. This may just be Stockholm syndrome talking, but there’s an argument to be made that the client can do a better job of deciding how to deal with conflicts than a one-size-fits-all transaction could.
Regardless of how they’re rationalized, update conflicts are a common enough occurrence when dealing with documents that your code should generally consider them the norm rather than an exceptional condition.
The basic rule Couch uses to determine whether an update should succeed is quite simple: the new version of a doc must have the same _rev value as the copy currently in the database.
To see how this rule, on some level, solves the entire problem of lost data, consider the scenario where two clients simultaneously download the same copy of a doc whose revision is currently 1. Both clients will modify their local copy of the doc and attempt to save it back to the database. The first client to connect will succeed, since the _rev of its modified doc matches the value in the database. The server’s copy of the doc is then updated and its _rev is incremented to 2.
When the second client’s save attempt is handled, the client’s _rev (1) no longer matches the server’s copy (now 2). As a result the save will fail and raise a Conflict exception. The second client must now fetch the newly updated doc from the server, re-apply its modifications, then attempt the save again.
The end result of this _rev-matching rule is that any writes attempted by your code should use the following algorithm:

1. Modify your local copy of the doc.
2. Attempt to save it to the database.
3. If the save fails with a conflict, fetch the current version of the doc from the server.
4. At minimum, copy the _rev from the newly-retrieved copy of the doc to your local copy. Ideally do something clever that merges the two docs without losing any edits.
5. GOTO 2
It’s worth acknowledging that doing things ‘correctly’ is a fair amount of (fairly repetitive) work. Thus the temptation to perform blindfolded writes and just hope for the best can be dangerously strong.
To try to make it easier to be responsible in this context, Corduroy treats every write as a potentially conflict-inducing operation. In previous examples of the save method, the return value was ignored. Let’s now take a look at the ConflictResolution object that save returns:
# create a pair of new docs
docs = [{'_id':'first', 'n':1}, {'_id':'second', 'n':2}]
conflicts = yield db.save(docs)
print conflicts
<Success: 2 docs updated>
# create a conflict by deleting the rev and re-saving
del docs[1]._rev
conflicts = yield db.save(docs)
print conflicts
<Conflict: second>
From the repr strings you can see whether the write was successful and, if not, which IDs were in conflict. To access this information from your code, the conflicts variable contains a pair of attributes to be inspected:
pending: A dictionary (keyed by _id) containing a reference to each doc whose write attempt was unsuccessful.
resolved: A dictionary of all the successfully written docs.

You could use the values in pending to plan a fetch request, merge the results with your local edits, then resubmit them. But since this is such a common pattern, the ConflictResolution object provides a method called resolve to handle all of this in one shot.
Since all parts of the Standard Recipe are identical except for step 4 (merging the local and fetched copies of the doc), Corduroy allows you to encapsulate your merge logic in a function and pass that to resolve, which will orchestrate the required HTTP traffic.
def mergefn(local_doc, server_doc):
    # just copy over the rev (a.k.a. not a real strategy)
    local_doc._rev = server_doc._rev
    return local_doc

conflicts = yield db.save(docs)
if conflicts.pending:
    print 'pre-merge: ', len(conflicts.pending)
    yield conflicts.resolve(mergefn)
    print 'post-merge:', len(conflicts.pending)
pre-merge:  1
post-merge: 0
The ConflictResolution object fetches the current versions of all the pending docs, then repeatedly calls your merge function with the local and remote copies of each. The merge function should return a dictionary that merges the data in the divergent copies. These returned dictionaries will then be sent to the server in a batch write attempt. If the merge function returns None, no attempt to write that doc will be made.
When the resolve call completes, the pending and resolved dictionaries will be updated to reflect the new state.
The same merge functions that the resolve method accepts can also be passed to the Database.save method directly. If any conflicts occur, a resolution will automatically be attempted using your merge function.
def mergefn(local_doc, server_doc):
    local_doc._rev = server_doc._rev
    return local_doc

yield db.save(docs, merge=mergefn)
Despite this particular merge function’s obvious unsuitability for use in production, a forced overwrite is occasionally just the thing you need (especially during development). As a further shorthand, the above can be rewritten as:
yield db.save(docs, force=True)
Beyond accessing documents individually, Couch provides a mechanism for building named indexes called ‘views’ that can aggregate data across documents. Views are defined by javascript functions on the server that update the index every time a document is added or modified. From the client’s perspective they are a series of ‘rows’ with three attributes:
key: A string or other json-serializable object (often a list)
id: The _id of the document this row represents
value: An arbitrary value returned by the serverside function

The mapping of rows to documents is not one-to-one, so it’s quite possible for one doc to be represented by multiple rows while another is omitted altogether.
Similarly, key values may be unique between rows but often will not be. In fact, one of the more useful properties of views is that multiple documents can be grouped together on the basis of having the same key.
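The row-generation model can be mimicked in a few lines of Python. Couch’s real map functions are javascript and run serverside; this is just an illustration of the key/id/value shape and of docs emitting zero or more rows:

```python
# Illustration of a view's 'map' step: each doc emits zero or more
# (key, id, value) rows, and the view is the key-sorted union of them.
def map_byname(doc):
    if 'last' in doc:                  # docs without the field emit nothing
        yield (doc['last'].lower(), doc['_id'], None)

docs = [
    {'_id': 'emp-1', 'first': 'Hugh', 'last': 'Abbott'},
    {'_id': 'emp-2', 'first': 'Glenn', 'last': 'Cullen'},
    {'_id': 'memo-7', 'subject': 'no last name, so no row'},
]

rows = sorted(row for doc in docs for row in map_byname(doc))
```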
To retrieve all of the rows from a view, call the database object’s view method with the ‘name’ of that view as an argument. Views are named according to the design documents they live in, so for a view called byname in a design document called _design/employees, its ‘name’ would be "employees/byname".
The result of a call to view is an iterable list of Row objects. Here we grab all of the rows from a view and begin printing them out:
db = Database('dosac')
rows = yield db.view('employees/byname')
print rows
<employees/byname: 17/17 rows>
for row in rows:
    print 'key:%s\t| id:%s' % (row.key, row.id)
key:abbott   | id:emp-2c9792a3
key:coverley | id:emp-fad5095d
key:cullen   | id:emp-687e8224
⋮
key:reeder   | id:emp-c110c0a8
Since every row in a view has an associated key, you can selectively query the view only for rows matching a particular value. For instance, to grab just the first row, the query would be:
yield db.view('employees/byname', key='abbott')
Queries with multiple keys are also possible:
yield db.view('employees/byname', keys=['coverley', 'murray'])
Part of what makes this a useful feature is that the documents associated with rows can also be included in the response. As a result, views can be used to define aliases to documents independent of their potentially unwieldy _id values:
rows = yield db.view('employees/byname', key='abbott', include_docs=True)
print rows[0].doc
<Document emp-2c9792a3[5932] {first:"Hugh", last:"Abbott", locale:"Unknown"}>
Since the rows in a view are sorted in ascending order based on their keys, you can also request all rows within a range by specifying startkey and endkey values. To select all the “C” names, the query would be:
rows = yield db.view('employees/byname', inclusive_end=False,
                     startkey='c', endkey='d')
print [row.key for row in rows]
[u"coverley", u"cullen"]
If it’s not immediately apparent why the above query looks the way it does, I highly recommend the CouchDB Guide’s chapter on the subject.
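The observable behavior of such a range query can be sketched in Python. Note that real Couch collation follows ICU rules rather than simple string comparison, so treat this only as an approximation of the selection logic:

```python
# Approximation of startkey/endkey selection over a key-sorted view.
def select_range(sorted_keys, startkey, endkey, inclusive_end=True):
    if inclusive_end:
        return [k for k in sorted_keys if startkey <= k <= endkey]
    return [k for k in sorted_keys if startkey <= k < endkey]

names = ['abbott', 'coverley', 'cullen', 'murray', 'reeder']
c_names = select_range(names, 'c', 'd', inclusive_end=False)
```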
For the most part views as described so far – an ordered set of key/id/value rows – provide all the data-access flexibility you need for reading out the state of your database. But Couch also allows for a serverside pre-processing step to be associated with each view through the use of a reduce function.
Couch’s semantics would seem to encourage using reduce as a way to create denormalized copies of your view data (e.g., creating a list of all the values for a given key). But in practice this sort of usage is to be avoided due to an unfortunate trade-off in Couch’s b-tree-based internal design.
Instead, reduce should be thought of as a way to convert rows into a handful of numeric values – often only one. Accepting that limitation, it frequently makes sense to use one of Couch’s built-in reductions instead of writing your own. In combination with View Collation this becomes a surprisingly powerful technique.
Here is a view that uses the _count built-in reduction on rows keyed by date. Views with reduce functions will return the output of that reduction by default. The pre-reduction rows can still be accessed with the query:
for row in (yield db.view('chronological/counts', reduce=False)):
    print row.key
["2007","11","28"]
["2007","11","29"]
["2007","11","29"]
⋮
["2007","12","02"]
The group and group_level keyword arguments allow you to control whether the reduce function is applied uniformly to the rows (resulting in a single value), or to groupings based on shared key values.
for row in (yield db.view('chronological/counts', group=True)):
    print "%(key)s → %(value)i" % row
["2007","11","28"] → 1
["2007","11","29"] → 17
["2007","11","30"] → 3
["2007","12","01"] → 92
["2007","12","02"] → 50
for row in (yield db.view('chronological/counts', group_level=2)):
    print "%(key)s → %(value)i" % row
["2007","11"] → 20
["2007","12"] → 142
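The grouping behavior can be mimicked in Python: group_level=n effectively buckets rows by the first n elements of their keys before reducing. This is a sketch of the observable behavior of the _count reduction, not of Couch’s b-tree internals:

```python
# Sketch of _count with group_level: bucket rows by a key prefix of the
# given length, then count the rows in each bucket.
def count_by_level(keys, group_level):
    counts = {}
    for key in keys:
        prefix = tuple(key[:group_level])
        counts[prefix] = counts.get(prefix, 0) + 1
    return counts

keys = [["2007", "11", "28"],
        ["2007", "11", "29"], ["2007", "11", "29"],
        ["2007", "12", "02"]]

by_month = count_by_level(keys, 2)   # like group_level=2
by_year = count_by_level(keys, 1)    # like group_level=1
```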
Couch is not limited to presenting your documents and views as json-formatted text. It allows you to define special handlers to transform the raw data into HTML, BibTeX, WAD or whatever is appropriate for your application.
These custom formatters are defined in a design document specific to your database and are referred to as ‘lists’ (which process the output of views) and ‘shows’ (which process the content of documents). Their client APIs are quite similar: in both cases you provide the name of the formatter along with the view or document it will be applied to. In response, an object is returned with two attributes of interest:
headers: A dictionary of response headers
body: A bytestring (or a decoded json object if a mime-type of application/json was present in the headers)

‘Show’ functions are reusable javascript routines that have access to a specified document as well as any query arguments passed in the request. Here the function csv in the design doc records is being used to format a document. The include_titles argument instructs this particular ‘show’ to output a header row in addition to the document data:
db = Database('mannion')
response = yield db.show('records/csv', '1978-wotw', include_titles=True)
print response.headers['Content-Type']
text/csv
print response.body
artist,title,review,doc_id
Jeff Wayne,War of the Worlds,Stimulating,1978-wotw
‘List’ functions use similar syntax but apply to a view instead of a single document. In addition to query arguments, the same slicing operations that apply to views can be included to filter out selected rows before formatting:
db = Database('tucker')
response = yield db.list('agenda/html', 'nomfup', key='n', limit=3, descending=True)
print response.body
<ul><li>Nicola Murray</li><li>Newcastle</li><li>National Trust</li></ul>
If your list function is general enough, you can apply it to other views as well, even views in other design documents:
response = yield db.list('agenda/html', 'bollockings/byday', key='2007-10-13') print response.body
<ul><li>Steve Fleming 13:00-13:07</li><li>Julius Nicholson 13:30–</li></ul>
Reliable, master-less replication is one of Couch’s marquee features. A replication involves a source and a target database and propagates changes from the former to the latter. To copy an existing database, pass the source and target database names (or urls) to a Couch object:
couch = Couch()
yield couch.replicate('olddb', 'newdb', create_target=True)
Database objects can also replicate themselves through their push and pull methods. Their behavior is analogous to the eponymous commands used by distributed version control systems. These operations are one-way, meaning a full synchronization of two databases (each with local edits) requires a reciprocal push and pull:
local_db = Database('localdb')
remote_db = Database('http://elsewhere.com:5984/remotedb')

yield local_db.push(remote_db)
yield local_db.pull(remote_db)
In addition to one-off, ‘anonymous’ replications, Couch maintains a system-controlled database called _replicator to track named replications over time. This database is exposed by the Couch object and behaves like any other (supporting get, save, and friends).
The _replicator database has the special property that when docs of a certain form are created, they will trigger a new replication. The replication’s ongoing status can then be monitored by polling the document:
import json

couch = Couch()
repl = {
    "_id": "local-to-remote",
    "source": "localdb",
    "target": "http://elsewhere.com:5984/remotedb",
    "continuous": True
}
yield couch.replicator.save(repl)

repl_doc = yield couch.replicator.get('local-to-remote')
print json.dumps(repl_doc, indent=2)
{
  "_id": "local-to-remote",
  "source": "localdb",
  "target": "http://elsewhere.com:5984/remotedb",
  "continuous": true,
  "_replication_id": "c0ebe9256695ff083347cbf95f93e280",
  "_replication_state": "triggered",
  "_replication_state_time": 1297974122
}
As a convenience, all of the replication methods accept a keyword argument called _id. If included, a new document will be created in the _replicator database using the arguments for its fields. Thus the previous example could be rewritten as:
local_db = Database('localdb')
yield local_db.push('http://elsewhere.com:5984/remotedb',
                    continuous=True, _id='local-to-remote')
Or if you want to get fancy:
repl = {
    "_id": "local-to-remote",
    "target": "http://elsewhere.com:5984/remotedb",
    "continuous": True
}
yield local_db.push(**repl)
At the filesystem level, a couch database is stored as something more akin to a journal file than a snapshot of the current state. Every document modification is appended to this journal and the ‘live’ database can be thought of as the result of playing back all these modification records in order.
Beyond being just an implementation detail, this alternative method of representing a database (as a sequence of changesets) is exposed through the _changes API. Your application code can make use of this data in two main ways: by requesting a list of the changes that have occurred since a given point in time, or by subscribing to a live feed of change notifications as they happen. The former can be useful for reinventing differently-sized wheels (see also Replication), while the latter can be used to trigger scripted events such as cache invalidation or summary statistics generation.
When requesting a list of changes, you can either specify no arguments (in which case somewhere between 0 and update_seq change records will be returned), or you can bracket the time period in order to limit the response size:
db = Database('watched')
info = yield db.info()
print info.update_seq
11519
changes = yield db.changes()
print "seq: %i (%i changes)" % (changes.last_seq, len(changes.results))
seq: 11519 (210 changes)
changes = yield db.changes(since=11500, limit=5)
print "seq: %i (%i changes)" % (changes.last_seq, len(changes.results))
seq: 11506 (5 changes)
Each element of the changes.results list corresponds to a document and contains a delta relative to its previous state. For all the details see the CouchDB Book’s chapter on the subject.
Changes can also be accessed as a ‘feed’, in which an HTTP connection is left open and Couch will write individual change notifications to it as they occur. To listen for changes in this manner, request a continuous feed and pass an explicit callback (i.e., do not use yield this time):
def listener(last_seq, changes):
    pass

feed = db.changes(since=11500, feed='continuous', callback=listener)
The return value of a feed request is a ChangesFeed object. It will keep the connection alive and handle the invocation of your callback at regular intervals (see the latency parameter). The feed will continue listening until it is explicitly stopped:
feed.stop()
The overall documentation scene for Couch isn’t quite as focused as it could be. There’s quite a bit of good information out there, but it’s scattered all over the net. Here are some resources I’ve found useful: