iPhylo: programming

Thursday, February 03, 2011

Web Hooks and OpenURL: making databases editable

For me one of the most frustrating things about online databases is that they often can't be edited. For example, I've recently created a version of the Australian Faunal Directory on CouchDB, which contains a list of all animals in Australia, and a fairly comprehensive bibliography of taxonomic publications on those animals. What I'd like to do is locate those publications online. Using various scripts I've found DOIs for some 2,500 articles, and located nearly 4,900 articles in BHL, and added these to the database, but browsing the database (using, say, the quantum treemap interface) makes it clear there are lots of publications that I've missed.

It would be great if I could go to the Australian Faunal Directory on CouchDB and edit these on that site, but that would require making the data editable, and that means adding a user interface. And that's potentially a lot of work. Then, if I go to another database (say, my CouchDB version of the Catalogue of Life) and want to make that editable then I have to add an interface to that database as well. I could switch to using a wiki, which I've done for some projects (such as the NCBI to Wikipedia mapping), but wikis have their own issues (in particular, they don't easily support the kinds of queries I want to do).

There is, as they say, a third way: web hooks. I first came across web hooks when I discovered Post-Commit Web Hooks in Google Code. The idea is you can create a web service that gets called every time you commit code to the Google Code repository. For example, each time you commit code you can call a web hook that uses the Twitter API to tweet details of what you just committed (I tried this for a while, until some of my Twitter followers got seriously pissed off by the volume of tweets this was generating).

What has this to do with making databases editable? Well, imagine the following scenario. A web page displays a publication, but no DOI. However, the web page embeds an OpenURL in the form of a COinS (in other words, a URL with key-value pairs describing the publication). If you use a tool such as the OpenURL Referrer in Firefox you can use an OpenURL resolver to find that publication. Examples of OpenURL resolvers include bioGUID and BioStor. Let's say you find the publication, and it has a DOI. How do you tell the database about this? Well, you can try and find an email address of someone running the database so you can send them the information, but this is a hassle. What if the OpenURL resolver that you used to find the DOI could automatically tell the source database that it's found the DOI? That's the idea behind web hooks.

I've started to experiment with this, and have most of the pieces working. Publication pages in Australian Faunal Directory on CouchDB have COinS that include two additional pieces of information: (1) the database identifier for the publication (in this case a UUID; in the hideously complex jargon of OpenURL this is the "Referring Entity Identifier"), and (2) the URL of the web hook. The idea is that an OpenURL resolver can take the OpenURL and try to locate the article. If it succeeds it will call the web hook URL supplied by the database and tell it "hey, I've found this DOI for the publication with this database identifier". The database can then update its data, so the next time a user visits the page for that publication in the database, the user will see the DOI. Compared with tools that just modify the web page on the fly, such as David Shorthouse's reference parser, this has the huge advantage of persistence: the database itself is updated, not just the web page.
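As a rough illustration of the database side, here is a minimal PHP sketch of a COinS span carrying the two extra values. The `rfe_id` key is the standard OpenURL key for the Referring Entity Identifier; the `webhook` key name, the function name, the array layout, and the example.org address are all assumptions made for this sketch, since the post does not say how the hook URL is actually passed.

```php
<?php
// Build a COinS span (an empty <span class="Z3988"> whose title attribute
// holds the OpenURL key-value pairs) for one publication record.
// ASSUMPTION: 'webhook' is a made-up key name for the hook URL.
function make_coins(array $pub, string $uuid, string $webhook_url): string
{
    $kev = [
        'ctx_ver'     => 'Z39.88-2004',
        'rft_val_fmt' => 'info:ofi/fmt:kev:mtx:journal',
        'rft.atitle'  => $pub['title'],
        'rft.jtitle'  => $pub['journal'],
        'rft.volume'  => $pub['volume'],
        'rft.spage'   => $pub['spage'],
        'rft.date'    => $pub['year'],
        'rfe_id'      => $uuid,         // database identifier for the publication
        'webhook'     => $webhook_url,  // where the resolver should report back
    ];

    // Percent-encode the pairs, then escape the '&' separators for HTML
    $query = http_build_query($kev, '', '&', PHP_QUERY_RFC3986);
    return '<span class="Z3988" title="' . htmlspecialchars($query) . '"></span>';
}

// Hypothetical record, identifier and hook URL
echo make_coins(
    ['title' => 'A title', 'journal' => 'A journal',
     'volume' => '58', 'spage' => '1', 'year' => '2003'],
    'abc-123',
    'http://example.org/hook.php'
), "\n";
```

Anything that already reads COinS will ignore the unknown `webhook` key, so adding it should not break existing tools.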

In order to make this work, all the database needs to do is have a web hook, namely a URL that accepts POST requests. The heavy lifting of searching for the publication, or enabling users to correct and edit the data, can be devolved to a single place, namely the OpenURL resolver. As a first step I'm building an OpenURL resolver that displays a form in which the user can edit bibliographic details, and launch searches in CrossRef (and soon BioStor). When the user is done they can close the form, which is when it calls the web hook with the edited data. The database can then choose to accept or reject the update.
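To give a feel for how little the database has to do, here is a minimal PHP sketch of the accept-or-reject step behind such a web hook. The field names `id` and `doi`, the function name, the in-memory array standing in for the database, and the DOI pattern are all assumptions for illustration; the post does not describe the actual payload.

```php
<?php
// Decide whether to accept an update POSTed to the web hook.
// $post  : the POSTed fields (in a real hook this would be $_POST)
// $store : stand-in for the database, keyed by publication identifier
// Returns an HTTP status code: 200 accepted, 400 bad request, 404 unknown record.
function handle_webhook(array $post, array &$store): int
{
    $id  = $post['id']  ?? null;
    $doi = $post['doi'] ?? null;

    // Both fields must be present and be plain strings
    if (!is_string($id) || !is_string($doi)) {
        return 400;
    }
    // Only update records the database already knows about
    if (!isset($store[$id])) {
        return 404;
    }
    // Reject anything that does not look like a DOI (a loose check, assumed here)
    $doi = trim($doi);
    if (!preg_match('~^10\.\d{4,9}/\S+$~', $doi)) {
        return 400;
    }

    $store[$id]['doi'] = $doi;
    return 200;
}

// In a deployed hook the last line might be something like:
// http_response_code(handle_webhook($_POST, $store));
```

Keeping the decision in one small function means each database can apply its own rules about which updates to trust.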

Given that it's easy to create the web hook, and trivial to get a database to output an OpenURL with its internal identifier and the URL of the web hook, this seems like a light-weight way of making databases editable.
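The resolver's half of the exchange could be sketched along these lines in PHP: pull the record identifier and hook URL out of the incoming OpenURL, then prepare a POST back to the database. As before, the `webhook` key and the `id` and `doi` field names are assumptions, not something the post specifies.

```php
<?php
// Given the query string of an incoming OpenURL and a DOI the resolver has
// found, prepare the POST that reports the DOI back to the source database.
// Returns null if the OpenURL carries no identifier or no hook URL.
function build_webhook_call(string $openurl_query, string $doi): ?array
{
    // Split by hand: parse_str() would rewrite dots in keys such as
    // "rft.jtitle" to underscores.
    $params = [];
    foreach (explode('&', $openurl_query) as $pair) {
        if ($pair === '') {
            continue;
        }
        $parts = explode('=', $pair, 2);
        $params[urldecode($parts[0])] = urldecode($parts[1] ?? '');
    }

    if (empty($params['rfe_id']) || empty($params['webhook'])) {
        return null;
    }

    return [
        'url'     => $params['webhook'],
        'context' => [
            'http' => [
                'method'  => 'POST',
                'header'  => "Content-Type: application/x-www-form-urlencoded\r\n",
                'content' => http_build_query(
                    ['id' => $params['rfe_id'], 'doi' => $doi], '', '&'
                ),
            ],
        ],
    ];
}

// Sending the request (not run here, since it needs a live hook):
// $call = build_webhook_call($query, '10.1234/example');
// if ($call !== null) {
//     file_get_contents($call['url'], false, stream_context_create($call['context']));
// }
```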

Friday, August 29, 2008

Turning Japanese: EUC-JP, UTF-8, and percent-encoding

In case I forget how to do this, and as an example of how easy it is to get sucked into a black hole of programming micro-details, I spent an hour or more trying to figure out how to handle Japanese characters.

I'm building a database of publications linked to taxonomic names, and I'm interested in linking to electronic versions of those publications. CrossRef and JSTOR provide a lot of references, as does BHL (once they get an OpenURL resolver in place), but there are numerous other sources to be harvested. One is CiNii, the Japanese National Institute of Informatics Scholarly and Academic Information Navigator, which has an OpenURL resolver. For example, I can query CiNii for an article using this URL:
http://ci.nii.ac.jp/openurl/query?ctx_ver=Z39.88-2004&url_ver=Z39.88-2004&ctx_enc=info%3aofi%2fenc%3aUTF-8&rft.date=2003&rft.volume=58&rft.spage=1&rft.epage=6&rft.jtitle=Entomological%20Review%20of%20Japan.
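A query URL like this can be assembled from its key-value pairs rather than typed by hand. The PHP sketch below rebuilds the URL above; the function name is made up for illustration, and the only difference from the URL shown is that PHP writes hex digits in upper case (`%3A` rather than `%3a`), which is equivalent.

```php
<?php
// Assemble an OpenURL query from key-value pairs.
function build_openurl(string $base, array $kev): string
{
    // PHP_QUERY_RFC3986 encodes spaces as %20, as in the URL above
    return $base . '?' . http_build_query($kev, '', '&', PHP_QUERY_RFC3986);
}

$url = build_openurl('http://ci.nii.ac.jp/openurl/query', [
    'ctx_ver'    => 'Z39.88-2004',
    'url_ver'    => 'Z39.88-2004',
    'ctx_enc'    => 'info:ofi/enc:UTF-8',
    'rft.date'   => '2003',
    'rft.volume' => '58',
    'rft.spage'  => '1',
    'rft.epage'  => '6',
    'rft.jtitle' => 'Entomological Review of Japan',
]);
echo $url, "\n";
```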

If I want to harvest bibliographic metadata, I can parse the resulting HTML. I could follow the links to formats such as BibTeX, but there's enough information in the link itself. For example, there's a link to the BibTeX format that looks like this:


http://ci.nii.ac.jp/openurl/servlet/createData?type=bib
&ca=@article
&au=%B7%A6%CC%DA+%B4%B4%C9%D7
&title=%A5%AB%A5%DF%A5%AD%A5%EA%A5%E0%A5%B7%B2%CAPidonia%C2%B0%A4%CE%BF%B7%B0%A1%C2%B0%A4%CB%A4%C4%A4%A4%A4%C6
&jtitle=%BA%AB%EA%B5%D5%DC%C9%BE%CF%C0+%3D+The+entomological+review+of+Japan
&year=20030430
&vol=00058
&num=00001
&spage=1-6
&id=10011061577
&lang=jp
&issn=02869810
&publish=%C6%FC%CB%DC%B9%C3%C3%EE%B3%D8%B2%F1
&perm_link=http%3A%2F%2Fci.nii.ac.jp%2Fnaid%2F10011061577%2F

Note the percent-encoded fields, such as %B7%A6%CC%DA+%B4%B4%C9%D7. This string represents the author's name, 窪木幹夫. It took me a little while to figure out how to convert %B7%A6%CC%DA+%B4%B4%C9%D7 to 窪木幹夫. Eventually I discovered this table, which shows that there are a number of ways to represent Japanese characters, including JIS, SJIS, and EUC-JP. Given that C9D7 = 夫, the string is EUC-JP encoded. What I want is UTF-8. After some fussing, it turns out that all I need to do (in PHP) is:


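 // Undo the percent-encoding; the result is a string of raw EUC-JP bytes
 $decoded_str = rawurldecode($str);
 // Pure ASCII values (such as the volume number) need no conversion
 if (mb_detect_encoding($decoded_str) != 'ASCII')
 {
    // Convert the EUC-JP bytes to UTF-8
    $decoded_str = mb_convert_encoding($decoded_str, 'UTF-8', 'EUC-JP');
 }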

rawurldecode decodes the percent-encoding to EUC-JP, then mb_convert_encoding gives me UTF-8.
As an example, here is the above reference displayed by the bioGUID OpenURL resolver. A small victory, but it is nice to display the Japanese title. The English title of this article is "A New Subgenus of the Genus Pidonia MULSANT (Coleoptera: Cerambycidae)". It's perhaps the major triumph of Linnean taxonomy that even though I can't read a word of Japanese, I know the paper is about Pidonia.
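The same conversion can be applied to every field in a link like the one above. The PHP sketch below is one way this might look, assuming the mbstring extension is available. The function name is made up, and it makes two substitutions of its own: `urldecode` instead of `rawurldecode`, so that the `+` between the family and given names becomes a space, and `mb_check_encoding` as the ASCII test.

```php
<?php
// Split a CiNii createData link into its fields and convert any
// EUC-JP values to UTF-8. Returns an array of field name => value.
function decode_cinii_link(string $url): array
{
    $fields = [];
    $query = parse_url($url, PHP_URL_QUERY);
    if (!is_string($query)) {
        return $fields;
    }

    foreach (explode('&', $query) as $pair) {
        $parts = explode('=', $pair, 2);
        // urldecode turns '+' into a space and %XX into raw bytes
        $value = urldecode($parts[1] ?? '');
        // Anything that is not plain ASCII is assumed to be EUC-JP
        if (!mb_check_encoding($value, 'ASCII')) {
            $value = mb_convert_encoding($value, 'UTF-8', 'EUC-JP');
        }
        $fields[urldecode($parts[0])] = $value;
    }
    return $fields;
}

// A shortened version of the link above
$link = 'http://ci.nii.ac.jp/openurl/servlet/createData?type=bib'
      . '&au=%B7%A6%CC%DA+%B4%B4%C9%D7'
      . '&vol=00058&spage=1-6'
      . '&perm_link=http%3A%2F%2Fci.nii.ac.jp%2Fnaid%2F10011061577%2F';

$fields = decode_cinii_link($link);
echo $fields['au'], "\n";
```

Treating every non-ASCII value as EUC-JP only holds for links from this one source; another resolver may well use a different encoding.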