iPhylo: database

Tuesday, July 28, 2015

Modelling taxonomic names in databases

Quick notes on modelling taxonomic names in databases, as part of an ongoing discussion elsewhere about this topic.

Simple model

One model that is widely used (e.g., ITIS, WoRMS) and which is explicit in Darwin Core Archive is something like this:

We have a table for taxa and we don't distinguish between taxa and their names. The taxonomic hierarchy is represented by the parentID field, which points to your parent. If you don't have a (non-NULL) value for parentID you are not an accepted taxon (i.e., you are a synonym), and the field acceptedID points to the accepted taxon. Simple, fits in a single database table (or, let's be honest, an Excel spreadsheet).

The tradeoff is that you conflate names and taxa, so you can't easily describe name-only relationships (e.g., homonyms, nomenclatural synonyms) without inventing "taxa" for each name.
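As a rough sketch of the single-table model described above (not the actual ITIS, WoRMS, or Darwin Core schema), here is a toy version using Python's built-in sqlite3 module. The table name `taxa`, the helper functions, and the integer ids are assumptions for illustration; only the parentID and acceptedID fields come from the description above. Note that the root of the tree also lacks a parent, so the sketch identifies synonyms by a non-NULL acceptedID.

```python
import sqlite3

# Hypothetical single-table layout; only parentID and acceptedID come from the post.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE taxa (
        id         INTEGER PRIMARY KEY,
        name       TEXT NOT NULL,
        parentID   INTEGER REFERENCES taxa(id),  -- NULL for synonyms (and for the root)
        acceptedID INTEGER REFERENCES taxa(id)   -- set only for synonyms
    )
""")
conn.executemany("INSERT INTO taxa VALUES (?, ?, ?, ?)", [
    (1, "Poissonia", None, None),            # root of this toy tree
    (2, "Poissonia heterantha", 1, None),    # accepted taxon
    (3, "Coursetia heterantha", None, 2),    # synonym row, no place in the hierarchy
    (4, "Tephrosia heterantha", None, 2),    # synonym row, no place in the hierarchy
])

def synonyms_of(conn, taxon_id):
    """Names of all rows whose acceptedID points at the given taxon."""
    rows = conn.execute(
        "SELECT name FROM taxa WHERE acceptedID = ? ORDER BY name", (taxon_id,)
    )
    return [r[0] for r in rows]

def accepted_name(conn, taxon_id):
    """Follow acceptedID if present, otherwise return the row's own name."""
    row = conn.execute("""
        SELECT COALESCE(a.name, t.name)
        FROM taxa t LEFT JOIN taxa a ON t.acceptedID = a.id
        WHERE t.id = ?
    """, (taxon_id,)).fetchone()
    return row[0] if row else None

print(synonyms_of(conn, 2))
print(accepted_name(conn, 3))
```

Every synonym needs its own row in the taxa table, which is exactly the conflation of names and taxa that this model trades for simplicity.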

Separating names and taxa

The next model, which I've drawn rather clunkily below as if you were doing this in a relational database, is based on the TDWG LSID vocabularies. One day someone will explain why the biodiversity informatics community basically ignored this work, despite the fact that all the key nomenclators use it.

In this model we separate out names as first-class objects with globally unique identifiers. The taxa table refers to the names table when it mentions a name. Any relationships between names are handled separately from taxa, so we can easily handle things like replacement names for homonyms, basionyms, etc. Note that we can also remove a lot of extraneous stuff from the taxa table. For example, if we decide that Poissonia heterantha is the accepted name for a taxon, we don't need to create taxa for Coursetia heterantha or Tephrosia heterantha, because by definition those names are synonyms of Poissonia heterantha.

The other great advantage of this model is that it enables us to take the work the nomenclators have done straight without having to first shoe-horn it into the Darwin Core format, which assumes that everything is a taxon.
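To make the separation concrete, here is a minimal sketch in Python with sqlite3. It is not the TDWG LSID vocabulary itself: the table names, the `hasBasionym` label, the short ids such as `n2` (standing in for globally unique identifiers), and the helper function are all assumptions. The basionym links are likewise assumed for illustration and should be checked against a nomenclator.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE names (
        id   TEXT PRIMARY KEY,   -- stands in for a globally unique identifier
        name TEXT NOT NULL
    );
    CREATE TABLE name_relationships (   -- name-to-name links, independent of taxa
        subjectID TEXT NOT NULL REFERENCES names(id),
        relation  TEXT NOT NULL,
        objectID  TEXT NOT NULL REFERENCES names(id)
    );
    CREATE TABLE taxa (
        id       INTEGER PRIMARY KEY,
        nameID   TEXT NOT NULL REFERENCES names(id),  -- the accepted name
        parentID INTEGER REFERENCES taxa(id)
    );
""")
conn.executemany("INSERT INTO names VALUES (?, ?)", [
    ("n1", "Poissonia"),
    ("n2", "Poissonia heterantha"),
    ("n3", "Coursetia heterantha"),
    ("n4", "Tephrosia heterantha"),
])
# Assumed for illustration: both combinations share Tephrosia heterantha as basionym.
conn.executemany("INSERT INTO name_relationships VALUES (?, ?, ?)", [
    ("n2", "hasBasionym", "n4"),
    ("n3", "hasBasionym", "n4"),
])
# Only two taxa are needed; the synonyms live purely in the names tables.
conn.executemany("INSERT INTO taxa VALUES (?, ?, ?)", [
    (1, "n1", None),
    (2, "n2", 1),
])

def nomenclatural_synonyms(conn, taxon_id):
    """Other names tied to the taxon's accepted name through a shared basionym."""
    rows = conn.execute("""
        WITH accepted AS (SELECT nameID FROM taxa WHERE id = ?),
        basionym AS (
            SELECT objectID AS id FROM name_relationships
            WHERE relation = 'hasBasionym'
              AND subjectID IN (SELECT nameID FROM accepted)
        )
        SELECT n.name FROM names n
        WHERE n.id NOT IN (SELECT nameID FROM accepted)
          AND (n.id IN (SELECT id FROM basionym)
               OR n.id IN (SELECT subjectID FROM name_relationships
                           WHERE relation = 'hasBasionym'
                             AND objectID IN (SELECT id FROM basionym)))
        ORDER BY n.name
    """, (taxon_id,))
    return [r[0] for r in rows]

print(nomenclatural_synonyms(conn, 2))
```

The synonyms fall out of the name relationships alone, so the taxa table holds two rows rather than four.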

Friday, March 15, 2013

BioNames: yet another taxonomic database

Yet another taxonomic database, this time I can't blame anyone else because I'm the one building it (with some help, as I'll explain below).

BioNames was my entry in EOL's Computable Data Challenge (you can see the proposal here: http://dx.doi.org/10.6084/m9.figshare.92091). In that proposal I outlined my goal:

BioNames aims to create a biodiversity “dashboard” where at a glance we can see a summary of the taxonomic and phylogenetic information we have for a given taxon, and that information is seamlessly linked together in one place. It combines classifications from EOL with animal taxonomic names from ION, and bibliographic data from multiple sources including BHL, CrossRef, and Mendeley. The goal is to create a database where the user can drill down from a taxonomic name to see the original description, track the fate of that name through successive revisions, and see other related literature. Publications that are freely available will be displayed in situ. If the taxon has been sequenced, the user can see one or more phylogenetic trees for those sequences, where each sequence is in turn linked to the publication that made those sequences available. For a biologist the site provides a quick answer to the basic question “what is this taxon?”, coupled with graphical displays of the relevant bibliographic and genomic information.

The bulk of the funding from EOL is going into interface work by Ryan Schenk (@ryanschenk), author of synynyms among other cool things. EOL's Chief Scientist Cyndy Parr (@cydparr) is providing adult supervision ("Chief Scientist", why can't I have a title like that?).

Development of BioNames is taking place in the open as much as we can, so there are some places you can see things unfold:

Key features and milestones are on Trello
Design details are on GitHub
Database is hosted by Cloudant
There is a (currently private) design document in Google Docs. I've posted a snapshot on FigShare (http://dx.doi.org/10.6084/m9.figshare.652203).

I've lots of terrible code scattered around which I am in the process of organising into something usable, which I'll then post on GitHub. Working with Ryan is forcing me to be a lot more thoughtful about coding this project, which is a good thing. Currently I'm focussing on building an API that will support the kinds of things we want to do. I'm hoping to make this public shortly.

The original proposal was a tad ambitious (no, really). Most of what I hope to do exists in one form or another, but making it robust and usable is a whole other matter.

As the project takes shape I hope to post updates here. If you have any suggestions feel free to make them. The current target is to have this "out the door" by the end of May.

Friday, October 19, 2012

The failure of phylogeny databases

Only 4% of all published phylogenies have data in #treebase or @datadryad
— Karen Cranston (@kcranstn) October 18, 2012

It is well known that phylogeny databases such as TreeBASE capture a small fraction of the published phylogenies. This raises the question of how to increase the number of trees that get archived. One approach is compulsion:

@kcranstn @theatavism @rdmpage Wellcome Trust withhold 10% of grant if not #OA. Why not similar policy for data too? #HitThemInThePocket
— Ross Mounce (@rmounce) October 19, 2012

In other words:

Databasing trees is the Right Thing™ to do
Few people are doing the Right Thing™
This is because those people are bad/misguided and must be made to see the light

I want to suggest an alternative explanation:

It is not at all obvious that databasing trees is useful
The databases we have suck
There's no obvious incentive for the people producing trees to database them

Why do we need a database of trees?

That we don't have a decent, widely used database of trees suggests that the argument still has to be made. Way back in the mid-1990s, when TreeBASE was first starting, I was at Oxford University and Paul Harvey (coauthor of The Comparative Method in Evolutionary Biology) was sceptical of its merits. Given that the comparative method depends on phylogenies, and people like Andy Purvis were in the Harvey lab building supertrees (http://dx.doi.org/10.1098/rstb.1995.0078), this may seem odd (it certainly did to me), but Paul shared the view of many systematists. Phylogenies are labile: they change with increased data and taxon sampling, hence individual trees have a short life span.

Data, in contrast, is long-lived. You'd happily reuse GenBank sequences published a decade ago, you probably wouldn't use a decade-old phylogeny. I made this point in an earlier post about the data archive Dryad (Data matters but do data sets?). A problem facing packages of data (such as papers, data sets, and phylogenies) is that the package itself may be of limited interest, beyond reproducing earlier results and benchmarking. In the case of phylogenies, if someone has a tree ((a,b),c) and someone else has a tree ((d,e),f), it's not obvious that we can combine these. But if we have sequences for the same gene from the same six taxa we can build a larger tree, say (((a,d),(b,e)),(c,f)).

I think this is part of the reason why GenBank works. Yes, there is compulsion (it's very hard to publish on sequences if you haven't deposited the data in GenBank), but there are clear benefits of depositing data. As the database grows we can do bigger analyses. If you are trying to identify a species based on its DNA, the chances are that the nearest sequence will have been deposited by somebody else. By depositing data your work also lasts longer than if people just had the paper (your tree is likely to be outdated, but that sequence from a rare, hard to obtain species might be used for decades to come).

Note that I'm not saying a database of trees isn't a good idea, but there seems to be an assumption that it is so obvious that it doesn't need justification. Demonstrably this isn't the case. Maybe we should figure out what we'd want to do with such a database, then tackle how we'd make that possible. For example, I'd want to query a phylogeny database geographically (show me trees from this part of the globe), by ecological association (find the trees for any parasites on this clade), by temporal period (what clades originated in the Miocene?), by data (what trees used this sequence which we now know is chimeric?), by topology (have we settled on the sister group to snakes yet?), and so on. I would also argue that much of this is doable, but might not actually require archiving published phylogenies. Personally I think anybody tackling these questions would do well to use PhyLoTA as their starting point.

TreeBASE sucks

Yes, I'm as sick of saying this as you are of reading it. But it doesn't change the fact that just about everything about TreeBASE, from the complexity of the underlying data model, the choice of programming language, the use of a Java applet to display trees, and the Byzantine search interface to the voluminous XML output, makes TreeBASE a bag of hurt. None of this would matter much if it was an indispensable part of people's research toolkit, but this isn't the case. If you are trying to convince people of the benefits of sharing trees you really want a tool that makes it seem a no-brainer. We aren't there yet.

The "fuck this" point

In a great post on the piracy threshold, Matt Gemmell argues that piracy is largely the fault of content providers because they make being honest too difficult. How many times have you wanted to buy something such as a book or a movie only to discover that the content provider doesn't sell it in your part of the world (e.g., in the iBooks store in the US but not the UK) or doesn't provide it in the media you want (e.g., DVD but not online)? To top it off, every time you go to the movies you are subjected to emotional blackmail or threats of unlimited fines should you copy the movie you already paid to watch.

I think databases have the same "fuck this" threshold. If you are asking people to submit data you want to make it as easy as possible. And you want at least some of the benefits to be immediate and obvious. Otherwise you are left with coercing people, and that's being, at best, lazy.

If you want an example of how to do it right, look at Mendeley's model. They want to build a public cloud of academic papers, a laudable goal, the Right Thing™ to do. But they sell the idea not as a public good, not as the Right Thing™, nor by trying to compel people (they can't, they're a private company). Instead they address a major point of pain - where the hell did I put that PDF? - and make it trivial to organise your collection of articles. Then they make it possible to back them up to the cloud, to view them on multiple devices, to share them, and voilà, we get a huge database of publications. The sociology works. So, my question is, what would the equivalent be for phylogenetics?

Friday, August 07, 2009

Wikispecies is not a database

This post was prompted by Stephen Thorpe's post on TAXACOM about Wikispecies in which he wrote (in a thread discussing Roger Hyam's recent blog post) that

[i]f it [Wikispecies] isn't a true database, then it is BETTER than a database. It can do anything a database can do, and more, if you know how it works properly.

I beg to differ. Wikispecies runs on a database (the Mediawiki software uses a database to store the wiki), and Mediawiki can be thought of as a database of semi-structured text, but it lacks a lot of the functionality database users would expect. For example, in Wikispecies there's no way to perform basic queries such as how many descendants a given taxon has or what names a particular author has published, or to find out in which geographic region most new names are being described. Much of this information is in Wikispecies, it just isn't in a form that we can readily use.
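For comparison, here is what the "how many descendants does this taxon have" query might look like once the same information sits in a relational table. This is a sketch using Python's sqlite3 and a recursive common table expression; the `taxa` table, its columns, and the placeholder taxon names are assumptions, not anything Wikispecies or Mediawiki provides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE taxa (id INTEGER PRIMARY KEY, name TEXT, parentID INTEGER)")
# Placeholder hierarchy, invented purely for the example.
conn.executemany("INSERT INTO taxa VALUES (?, ?, ?)", [
    (1, "family A", None),
    (2, "genus B", 1),
    (3, "genus C", 1),
    (4, "species B1", 2),
    (5, "species B2", 2),
    (6, "species C1", 3),
])

def count_descendants(conn, taxon_id):
    """Count every row below taxon_id by walking parentID links recursively."""
    row = conn.execute("""
        WITH RECURSIVE subtree(id) AS (
            SELECT id FROM taxa WHERE parentID = ?
            UNION ALL
            SELECT t.id FROM taxa t JOIN subtree s ON t.parentID = s.id
        )
        SELECT COUNT(*) FROM subtree
    """, (taxon_id,)).fetchone()
    return row[0]

print(count_descendants(conn, 1))
```

Answering the same question from wiki pages would mean parsing templates and links on every page, which is the sense in which the information is present but not usable.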

These limitations are mostly due to the underlying software (Mediawiki), which fortunately can be extended to address these issues using Semantic Mediawiki. I've explored these ideas earlier. With some restructuring, Wikispecies could become a database, but it would require some serious work.

But this raises the real issue with Wikispecies, namely what is it for? Wikipedia is much more informative for many taxa, and the two wikis are very poorly linked (surely we'd want Wikipedia pages linked to the corresponding Wikispecies pages?). Given that Wikipedia is the basis for some core efforts in linked data (e.g., DBPedia), it seems a no brainer that we would want our information stored in Wikipedia, rather than Wikispecies.

It seems to me that the split between Wikipedia and Wikispecies parallels that between "taxonomic concepts" and "taxonomic names". Wikipedia provides the former, in that it provides one (consensus) view of what a taxon is. Wikispecies would be ideally placed to be a nomenclatural database (and a great place to put all the synonyms that we've accumulated over time, but which would swamp Wikipedia). But Wikispecies also seems to want to provide a classification, which strikes me as unnecessary (and raises the issue of how this relates to the classification in Wikipedia).

I don't wish to denigrate the efforts of Wikispecies contributors (they are doing some neat things, such as harvesting new names from Zookeys), and by clever use of templates they avoid some of the serious problems with classification in Wikipedia, but it's not a taxonomic database, at least, not yet.

Friday, June 05, 2009

Taxonomy on a hard disk

This post is likely to seem somewhat off the wall, given the rush to get everything into the cloud, but it's Friday, so let's give it a whirl.

One idea I've been toying with is dispensing with relational databases, wikis, etc. and just storing taxonomic data using files and folders on a disk. There are several reasons for this:

The file system naturally enforces hierarchy

There are existing systems for putting files and folders under version control (e.g., CVS, Subversion, git)

Native text and image editors handily beat web-based ones

Some file systems have great tools for searching on metadata (e.g., "smart folders" and Spotlight on Mac OS X)

Some of the visualisations that we would like for classifications (such as treemaps) already exist in very polished form for viewing file systems

By way of background, I've been prompted to think along these lines by David Shorthouse's observation that we could place a taxonomic hierarchy under version control (e.g., github) and deal with changes/multiple versions that way. I've also been inspired by tools such as couchdb, which is a schema-less database that one can talk to directly via HTTP. This reflects a trend where people are starting to exploit the untapped power in some basic, well-known technologies (such as the HTTP protocol), avoiding the need for lots of middle-ware in the process (why write stuff in Java/Ruby/PHP, etc. when HTTP GET, PUT, POST, etc. cover the bases?). Another inspiration is Dropbox, which enables replication of files across multiple machines and the web. The web interface to Dropbox is very clean, and essentially mirrors the local folder structure.

So, in some ways this probably sounds silly, and closely resembles the naive way many of us started making digital versions of taxonomies, and it will have many database people rolling their eyes and muttering about "data consistency" and "queries". But, a key thing to remember is that the file system is a database that resides under a graphical user interface, and it maintains some forms of consistency that classical relational databases are poor at handling. For example, file systems enforce hierarchical consistency (if I move a folder to another folder, all the files and folders below that folder move as well). Of course, we can program this with a relational database, but our track record in doing this is pretty miserable. I've found inconsistencies in versions of ITIS (haven't checked recently), and last year's Catalogue of Life database had all sorts of orphans lurking in the tree table.
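The kind of orphan mentioned above is easy to describe in code: a row whose parent pointer refers to a record that does not exist. The following is a small sketch in Python with sqlite3, using an assumed `taxa` table and invented rows; it is not the actual ITIS or Catalogue of Life schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE taxa (id INTEGER PRIMARY KEY, name TEXT, parentID INTEGER)")
# Invented rows: taxon 3 points at a parent (99) that is not in the table.
conn.executemany("INSERT INTO taxa VALUES (?, ?, ?)", [
    (1, "root", None),
    (2, "child of root", 1),
    (3, "orphan", 99),
])

def find_orphans(conn):
    """Rows with a non-NULL parentID that matches no existing id."""
    rows = conn.execute("""
        SELECT child.id, child.name
        FROM taxa AS child
        LEFT JOIN taxa AS parent ON child.parentID = parent.id
        WHERE child.parentID IS NOT NULL AND parent.id IS NULL
        ORDER BY child.id
    """)
    return rows.fetchall()

print(find_orphans(conn))
```

Nothing in the table definition stops such a row from being written unless foreign-key enforcement is switched on, whereas a folder cannot sit inside a parent folder that does not exist.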

Then there's the GUI. If I write a taxonomic database in the classical way, I need to write code to talk to the database, edit records, support user authentication, data versioning, etc. If I use the file system, I get this pretty much for free. Authentication? It's called the login screen. Versioning? I put it in a public repository like Google Code or github, and that takes care of that (plus I get online authentication for free). Editing? Well, I can drag and drop items onto folders, and I can open them in native editors.

What I envisage is replicating a taxonomic hierarchy on disk, and representing key-value pairs of attributes (such as taxon name authorship, bibliographic details) as text files where the name of the file is the key (e.g., publishedIn) and the content of the file is the value (e.g., doi:10.1590/S0101-81752005000300004). I could add images and PDFs, and the neat thing is that they have lots of useful metadata embedded inside (where, arguably, it belongs).
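A minimal sketch of the key-value idea in Python, assuming plain UTF-8 text files and a throwaway temporary directory. The helper names and the `ExampleGenus/examplespecies` folder are hypothetical; the `publishedIn` key and its DOI value are the example given above.

```python
import tempfile
from pathlib import Path

def write_attribute(taxon_dir, key, value):
    """Store one attribute: the file name is the key, the file content the value."""
    taxon_dir.mkdir(parents=True, exist_ok=True)
    (taxon_dir / key).write_text(value + "\n", encoding="utf-8")

def read_attributes(taxon_dir):
    """Read every regular file in the folder back into a dict of key -> value."""
    return {
        p.name: p.read_text(encoding="utf-8").strip()
        for p in sorted(taxon_dir.iterdir())
        if p.is_file()
    }

# Hypothetical folder names; the hierarchy is just nested directories.
root = Path(tempfile.mkdtemp())
taxon = root / "ExampleGenus" / "examplespecies"
write_attribute(taxon, "publishedIn", "doi:10.1590/S0101-81752005000300004")

attributes = read_attributes(taxon)
print(attributes)
```

This sketch would choke on binary files such as images and PDFs, so a real version would need to treat those separately from the text attributes.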

I'm also toying with the idea of using symbolic links (Windows users, look away now) to represent relationships such as basionym links to original names.

This is all a bit half-baked at present, but it seems worth pursuing. One could argue that having a full taxonomic hierarchy is overkill (and raises the issue of which one to use), but binomial names are themselves hierarchical (species epithet nested inside genus name), so we need some degree of hierarchy anyway. I like the idea that copying a folder called "behreae" in the folder "Pinnixa" and placing the copy under "Austinixa", then within Austinixa/behreae adding a symbolic link to Pinnixa/behreae pretty much takes care of synonymy. I also like the idea that one could download an entire taxonomy, and using just the native tools on your computer, edit and annotate it, then merge changes with a remote copy. It makes managing the data little different from writing code.
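The copy-then-link step can be sketched with Python's standard library. The link name `basionym`, the helper function, and the placeholder file content are assumptions; the folder names are the ones used above. It relies on symbolic links, so it is meant for Unix-like systems (on Windows creating symlinks may need extra privileges).

```python
import os
import shutil
import tempfile
from pathlib import Path

root = Path(tempfile.mkdtemp())

# The original name, with one placeholder attribute file.
old = root / "Pinnixa" / "behreae"
old.mkdir(parents=True)
(old / "publishedIn").write_text("placeholder citation\n", encoding="utf-8")

# Copy the folder under the new genus (copytree creates missing parent folders).
new = root / "Austinixa" / "behreae"
shutil.copytree(old, new)

# Relative link, so the whole tree can be moved or cloned without breaking it.
# "basionym" is a hypothetical name for the link.
os.symlink(os.path.relpath(old, start=new), new / "basionym", target_is_directory=True)

def basionym_of(taxon_dir, root):
    """Return the linked original name as a path relative to root, or None."""
    link = taxon_dir / "basionym"
    if not link.is_symlink():
        return None
    return link.resolve().relative_to(root.resolve()).as_posix()

print(basionym_of(new, root))
```

One caveat with this sketch: whether a version control system or sync service preserves symbolic links varies, so that would need checking before relying on them.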

In practice we'll want to add some things. It would be nice to have a web interface for browsing, but this could be as trivial as having a script that reads the contents of a folder, displays folders as HTML links, and lists the files (keys) and their contents (values) in the web page.
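Such a script could be as small as the following Python sketch, which turns one folder into an HTML fragment. The function name, the markup, and the `rank` key are assumptions, and it only handles text files; serving the result over HTTP is left out.

```python
import html
import tempfile
from pathlib import Path
from urllib.parse import quote

def render_folder(folder):
    """Return HTML listing subfolders as links and files as key/value pairs."""
    folder = Path(folder)
    entries = sorted(folder.iterdir(), key=lambda p: p.name)
    lines = [f"<h1>{html.escape(folder.name)}</h1>", "<ul>"]
    for p in entries:
        if p.is_dir():
            href = html.escape(quote(p.name) + "/")
            lines.append(f'<li><a href="{href}">{html.escape(p.name)}</a></li>')
    lines.append("</ul>")
    lines.append("<dl>")
    for p in entries:
        if p.is_file():
            # Assumes text attributes; undecodable bytes are replaced, not fatal.
            value = p.read_text(encoding="utf-8", errors="replace").strip()
            lines.append(f"<dt>{html.escape(p.name)}</dt><dd>{html.escape(value)}</dd>")
    lines.append("</dl>")
    return "\n".join(lines)

# Small demonstration tree in a temporary directory.
demo = Path(tempfile.mkdtemp()) / "Pinnixa"
(demo / "behreae").mkdir(parents=True)
(demo / "rank").write_text("genus\n", encoding="utf-8")  # hypothetical key
page = render_folder(demo)
print(page)
```

Because each subfolder is rendered as a relative link, the same function can be applied at every level of the hierarchy.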

Perhaps this is a little silly, but I like the idea of having data on my machine that is trivially easy to edit. I also like the idea of getting functionality for free, rather than having to invent it from scratch.