iPhylo: users


Tuesday, May 25, 2010

TreeBASE II makes me pull my hair out

I've been playing a little with TreeBASE II, and the more I do the more I want to pull my hair out.

Broken URLs
The old TreeBASE had a URL API, which databases such as NCBI made use of. For example, the NCBI page for Amphibolurus nobbi has a link to this taxon in TreeBASE. The link is http://www.treebase.org/cgi-bin/treebase.pl?TaxonID=T31183&Submit=Taxon+ID. Now, this is a fragile-looking link to a Perl CGI script, and sure enough, it's broken. Click on it and you get a 404. In moving to the new TreeBASE II, all these inward links have been severed. At a stroke TreeBASE has cut itself off from an obvious source of traffic from probably the most important database in biology. Please, please, throw in some mod_rewrite and redirect these CGI calls to TreeBASE II.
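
To make the redirect idea concrete, here is a rough Python sketch of the mapping a rewrite rule would need to perform: take the legacy CGI URL, pull out the TaxonID, and build a new target. The target address (NEW_BASE) and the legacyTaxonId parameter are made up for illustration; this post doesn't say what the real TreeBASE II URL scheme looks like, and at present TreeBASE II has no legacy TaxonID lookup to point at.

```python
from urllib.parse import urlparse, parse_qs, urlencode

# Hypothetical target: the real TreeBASE II URL scheme is an assumption here.
NEW_BASE = "http://example.org/treebase2/taxon"


def redirect_target(legacy_url):
    """Return a redirect URL for a legacy treebase.pl link, or None if unrecognised."""
    parts = urlparse(legacy_url)
    # Only handle the old Perl CGI script.
    if not parts.path.endswith("/cgi-bin/treebase.pl"):
        return None
    params = parse_qs(parts.query)
    taxon_ids = params.get("TaxonID")
    if not taxon_ids:
        return None
    # "legacyTaxonId" is an invented parameter name, used only for illustration.
    return NEW_BASE + "?" + urlencode({"legacyTaxonId": taxon_ids[0]})


if __name__ == "__main__":
    old = "http://www.treebase.org/cgi-bin/treebase.pl?TaxonID=T31183&Submit=Taxon+ID"
    print(redirect_target(old))
```

In practice this would live in the web server configuration rather than in Python, but the logic is the same: recognise the old script, keep the identifier, and send the visitor somewhere useful instead of a 404.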

New identifiers
All the TreeBASE studies and taxa have new identifiers. Why? Imagine if GenBank decided to trash all the accession numbers and start again from scratch. TreeBASE II does support "legacy" StudyIDs, so you can find a study using the old identifier (you know, the one people have cited in their papers). But there's no support for legacy TaxonIDs (such as T31183 for Amphibolurus nobbi). I have to search by taxon name. Why no support for legacy taxon IDs?

Dumb search
Which brings me to search. The search interface for taxa in TreeBASE is gloriously awful:

So, I have to tell the computer what I'm looking for. I have to tell it whether I'm looking for an identifier or doing a text search, then within those categories I need to be more specific: do I want a TreeBASE taxon ID (new ones of course, because the old ones have gone), NCBI id, or uBio? And this is just the "simple" search, because there's an option for "Advanced search" below.

Maybe it's just me, but I get really annoyed when I'm asked to do something that a computer can figure out. I shouldn't have to tell a computer that I'm searching for a number or some text, nor should I have to tell it what that number or text means. Computers are pretty good at figuring that stuff out. I want one search box, into which I can type "Amphibolurus nobbi", or "Tx1294" or "T31183" or "206552" or "6457215" or "urn:lsid:ubio.org:namebank:6457215" (or a DOI, or a text string, or pretty much anything) and the computer does the rest. I don't ever want to see this:

Computers are dumb, but they're not so dumb that they can't figure out if something is a number or not. What I want is something close to this:

Is this really too much to ask? Can we have a search interface that figures out what the user is searching for?

Note to self: Given that TreeBASE has an API, I wonder how hard it would be to knock up a tool that took a search query, ran some regular expressions to figure out what the user might be interested in, then hit the API with that search, and returned the results?
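
As a first stab at the regular expression step, something like the following Python sketch might do. It only classifies the query; it doesn't call the TreeBASE API. The labels are my own invention, and the identifier shapes ("Tx" plus digits for new TreeBASE taxon IDs, "T" plus digits for legacy ones) are guessed from the examples above rather than taken from any specification.

```python
import re

# Ordered (label, pattern) pairs; the first match wins.
# Labels and identifier shapes are assumptions based on the examples in this post.
PATTERNS = [
    ("ubio_lsid", re.compile(r"^urn:lsid:ubio\.org:namebank:(\d+)$", re.I)),
    ("doi", re.compile(r"^(?:doi:)?(10\.\d{4,9}/\S+)$", re.I)),
    ("treebase2_taxon_id", re.compile(r"^(Tx\d+)$")),
    ("legacy_treebase_taxon_id", re.compile(r"^(T\d+)$")),
    ("numeric_id", re.compile(r"^(\d+)$")),
]


def classify(query):
    """Guess what kind of thing the user typed; fall back to a text search."""
    q = query.strip()
    for label, pattern in PATTERNS:
        m = pattern.match(q)
        if m:
            return label, m.group(1)
    return "text", q


if __name__ == "__main__":
    for example in ["Amphibolurus nobbi", "Tx1294", "T31183", "206552",
                    "urn:lsid:ubio.org:namebank:6457215"]:
        print(example, "->", classify(example))
```

A bare number is the one ambiguous case: it could be an NCBI id or a uBio id, so a real tool would probably have to try it against both and show whatever comes back.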

My concern here is that TreeBASE II is important, very important. Which means it's important to make it usable, which means don't break existing URLs, don't make old identifiers disappear, and don't have a search interface that makes me want to pull my hair out.

Friday, March 19, 2010

Where next for BHL?

You can't just ask customers what they want and then try to give that to them. By the time you get it built, they'll want something new. - Steve Jobs

It's Friday, so time for either a folly or a rant. BHL have put another user survey into the field (http://www.surveymonkey.com/s/BHLsurvey). I loathe user surveys. They don't ask the questions I would ask, and when you see the results, the most interesting suggestions are often ignored (see the Evaluation of the User Requirement Survey Oct-Nov 2009). And we've been here before, with EDIT (see this TAXACOM message about the moribund Virtual Taxonomic Library). Why go to the trouble of asking users if you aren't going to deliver?

I suspect surveys exist not to genuinely help figure out what to do, but as an internal organisational tool to convince programmers of what needs to be done, especially in large, multinational consortia where the programmers might be in a different institution and have no particular vested interest in the project (if they did, they wouldn't need user surveys; they'd be too busy making stuff to change the world).

So, what should BHL be doing? There are lots of things to do, but for me the core challenges are findability and linkage. BHL needs to make its content more findable, both in terms of bibliographic metadata and search terms (e.g., taxa, geographic places). It also needs to be much more strongly linked, both internally (e.g., cross referencing between articles where one BHL article cites another BHL article), and externally (to the non-BHL literature, for example, and to nomenclators), and the external links need to be reciprocal (BHL should link to nomenclators, and nomenclators should point back to BHL).

There are immediate benefits from improved linkage. Users could navigate within BHL content by citation links, for example, in the same way we can in the recent literature. If BHL cleaned up its metadata and had a robust article-level OpenURL resolver it could offer services to publishers to add additional links to their content, driving traffic to BHL itself. Better findability leads to better links.

One major impediment to improving things is the quality of the OCR text extracted from BHL scans. There have been various automated attempts to extract metadata from OCR scans (e.g., "A metadata generation system for scanned scientific volumes" doi:10.1145/1378889.1378918), but these have met with mixed success. There's a lot of scope for improving this, but I suspect a series of grad student theses on this topic may not be the way forward (grad students rarely go all the way and develop something that can be deployed). Which leaves crowdsourcing. Given the tools already available for correcting Internet Archive-derived book scans (e.g., Wikisource, discussed in an earlier post), it seems to me the logical next move for BHL is to dump all their content into a Wikisource-style environment, polish the tools and interface a bit, and encourage the community to have at it. Forming and nurturing that community will be a challenge, but provided BHL can demonstrate some clear benefits (e.g., generating clean pages with new taxon names, annotated illustrations, OpenURL tools for publishers to use), I think the task isn't insurmountable. It just needs some creativity (e.g., why not engage EOL users who land on BHL content to go one step further and clean it up, or link with Wikipedia and Wikispecies to attract users interested in actively contributing?).

I doubt any of this will be in any user survey...