
Wednesday, December 04, 2013

Towards BioStor articles marked up using Journal Archiving Tag Set

A while ago I posted BHL to PDF workflow, which sketched a workflow for generating clean, searchable PDFs from Biodiversity Heritage Library (BHL) content:

[Figure: the BHL-to-PDF workflow diagram]
I've made some progress on putting this together, and have also expanded the goal somewhat. In fact, there are several goals:
  1. BioStor articles need to be archived somewhere. At the moment they live on my server, and metadata is also served by BHL (as the "parts" you see in a scanned volume). In the long term PubMed Central may be a possibility (BHL essentially becomes a publisher). Imagine PubMed Central becoming the primary archival repository for biodiversity literature.
  2. BioStor articles could be more useful if the OCR text was cleaned up and marked up (e.g., highlighting taxon names and localities, extracting citations, etc.).
  3. If BioStor articles were marked up to the same extent as ZooKeys then we could use tools developed for ZooKeys (see Towards an interactive taxonomic article: displaying an article from ZooKeys) for a richer reading experience.
  4. Cleaned OCR text could also be used to generate searchable PDFs, which are still the most popular way for people to read articles (see Why do scientists tend to prefer PDF documents over HTML when reading scientific journals?). BioStor already generates PDFs, but these are simply made by wrapping page images in a PDF. Searchable PDFs would be much friendlier.

For BioStor articles to be archived in PubMed Central they would need to be marked up using the Journal Archiving and Interchange Tag Suite (JATS, formerly the NLM DTDs). This is the markup used by many publishers, and also the tag suite that TaxPub builds upon.

The idea of having BioStor marked up in JATS is appealing, but on the face of it impossible, because all we have is page scans and some pretty ropey OCR. But the NLM has also been heavily involved in scanning the historical literature, so it is used to dealing with scanned content, and JATS can accommodate articles ranging from bare scans to fully marked-up text. For example, take a look at the article "Microsporidian encephalitis of farmed Atlantic salmon (Salmo salar) in British Columbia", which is in PubMed Central (PMC1687123). PMC has basic metadata for the article, scans of the pages, and two images extracted from those pages. This is pretty much what BioStor already has (minus the extracted images).
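To give a flavour of what this might look like, here is a skeletal, hand-written JATS document for a scan-level article (the element values below are placeholders, not taken from a real BioStor record): the front matter carries the bibliographic metadata, and the body stays empty until cleaned, marked-up text is available.


<article xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Example Journal of Zoology</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">0000-0000</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>An example article known only from page scans</article-title>
      </title-group>
      <volume>42</volume>
      <fpage>1</fpage>
      <lpage>10</lpage>
    </article-meta>
  </front>
  <body>
    <!-- empty until marked-up OCR text is added -->
  </body>
</article>
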

With this in mind, I dusted off some old code, put it together, and created an example of the first baby steps towards BioStor articles in JATS. The code is on GitHub, and there is a live example here.

[Figure: screenshot of the live JATS example]
The example takes BioStor article 65706, converts the metadata to JATS, links in the page scans, and also extracts images from the page scans based on data in the ABBYY OCR files. I've also generated HTML from the DjVu files, and this HTML includes hOCR tags that embed information about the OCR text. This format can be edited by tools such as Jim Garrison's moz-hocr-edit (discussed in Correcting OCR using hOCR in Firefox). This HTML can be processed to output a PDF that includes the page scans but also has the OCR text as "hidden text", so the reader can search for phrases, or copy and paste the text (try the PDF for article 65706).
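For the curious, hOCR is just HTML with class and title attributes carrying the OCR information. A minimal fragment looks something like this (the bounding-box coordinates here are invented for illustration): each bbox ties a line or word of recognised text to a rectangle on the page scan, which is what makes it possible to lay hidden, searchable text over the page images in the PDF.


<div class="ocr_page" title="bbox 0 0 2400 3300">
  <span class="ocr_line" title="bbox 350 420 2050 480">
    <span class="ocrx_word" title="bbox 350 420 610 480">Microsporidian</span>
    <span class="ocrx_word" title="bbox 630 420 1050 480">encephalitis</span>
  </span>
</div>
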

I've put the HTML (and all the XML and images) on GitHub, so one obvious model for working on an article is to keep it in a GitHub repository, commit any edits, and push the result to a web server that displays the articles.

There are still a lot of rough edges, and I think we can build nicer interfaces than moz-hocr-edit (e.g., using the "contenteditable" attribute in HTML; see the sketch below), although moz-hocr-edit has the nice feature of being able to save the edits straight back to the HTML file (saving edited HTML to disk is a non-trivial task in most web browsers). I also need to add the code for building the initial JATS file (currently this is hidden on the BioStor server). There are also issues of PDF quality: at the moment I output black and white PNGs, which look nice and clean but can mangle plates and photos, so I need to tweak that aspect of the process.
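As a sketch of the contenteditable idea (reusing the hypothetical hOCR fragment from above), the OCR text becomes editable in any modern browser simply by flagging each word:


<span class="ocrx_word" title="bbox 350 420 610 480" contenteditable="true">Microsporidian</span>


The hard part, as noted, is then persisting those edits back to a file.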

One application of these tools would be to take a single journal and convert all its BioStor articles into JATS, then make them available for people to further clean and mark up as needed. There is an extraordinary amount of information locked away in this literature; it would be nice if we made better use of that treasure trove.

Monday, December 12, 2011

Exporting data from Australian Faunal Directory on CouchDB

Quick note to self about exporting data from my Australian Faunal Directory on CouchDB project. To export data from a CouchDB view you can use a list function (see Formatting with Show and List). Following the example on the Kanapes IDE blog, I created the following list function:

{
  "_id": "_design/publication",
  "_rev": "14-467dee8248e97d874f1141411f536848",
  "language": "javascript",
  "lists": {
    "tsv": "function(head, req) {
      var row;
      start({
        'headers': {
          'Content-Type': 'text/tsv'
        }
      });
      while (row = getRow()) {
        send(row.value + '\\t' + row.key + '\\n');
      }
    }"
  },
  "views": {
    .
    .
    .
  }
}


I can use this function with the view below, which lists Australian Faunal Directory publications by UUID ("value"), indexed by DOI ("key").

[Figure: screenshot of the CouchDB "doi" view, showing DOI keys and UUID values]
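I haven't shown the view itself here, but a map function along these lines would produce those key/value pairs (a sketch, assuming each publication document stores its DOI in a doi field and uses its UUID as the CouchDB _id):


function(doc) {
  // emit the DOI as the key and the document's UUID as the value
  if (doc.doi) {
    emit(doc.doi, doc._id);
  }
}
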

I can get the tab-delimited dump from http://localhost:5984/afd/_design/publication/_list/tsv/doi. Note the URL pattern: where /afd/_design/publication/_view/doi would return the raw view, /afd/_design/publication/_list/tsv/doi runs the same view through the tsv list function to produce the tab-delimited dump.

I've created files listing DOIs and BioStor ids for publications in the Australian Faunal Directory. I'll play with lists a bit more, especially as I would like to extract the mapping from the Australian Faunal Directory on CouchDB project and add it to the iTaxon project.

Thursday, August 26, 2010

Mendeley API PHP client

Following on from my earlier post about the Mendeley API, I've bundled up my code for OAuth access to the Mendeley API for anyone who's interested in playing with the API using PHP. You can browse the code on Google Code, or grab a tarball here. You'll need a consumer key and a consumer secret from Mendeley for the demos to work, and if you're behind an HTTP proxy you'll have to tweak the code (this is explained in the ReadMe.txt file that comes with the code).

The code is pretty rough, and doesn't use all the Mendeley API calls, but I've other things to do, and it felt like a case of either bundling this up now or letting it get lost among a host of other projects. The Mendeley API still feels woefully under-developed. I'd be more interested in developing this client further if the API were powerful enough to do the kinds of things I'd like to do.

Friday, October 24, 2008

Google Books and Mediawiki

Following on from the previous post, I wrote a simple Mediawiki extension to insert a Google Book into a wiki page. Written in a few minutes, not tested much, etc.

To use this, copy the code below and save it in a file called googlebook.php in the extensions directory of your Mediawiki installation.


<?php
# rdmp

# Google Book extension
# Embed a Google Book into Mediawiki
#
# Usage:
# <googlebook id="OCLC:4208784" />
#
# To install it put this file in the extensions directory
# To activate the extension, include it from your LocalSettings.php
# with: require("extensions/googlebook.php");

$wgExtensionFunctions[] = "wfGoogleBook";

function wfGoogleBook() {
global $wgParser;
# registers the <googlebook> extension with the WikiText parser
$wgParser->setHook( "googlebook", "renderGoogleBook" );
}

# The callback function for converting the input text to HTML output
function renderGoogleBook( $input, $argv )
{
$width = 425;
$height = 400;

if (isset($argv["width"]))
{
$width = $argv["width"];
}
if (isset($argv["height"]))
{
$height = $argv["height"];
}

$output = '<script type="text/javascript"
src="http://books.google.com/books/previewlib.js">
</script>';
$output .= '<script type="text/javascript">
GBS_insertEmbeddedViewer(\''
. $argv["id"] . '\','. $width . ',' . $height . ');
</script>';

return $output;
}
?>


In your LocalSettings.php file, add the line:


require("extensions/googlebook.php");


Now you can add a Google book to a wiki page by adding a <googlebook> tag. For example:


<googlebook id="OCLC:4208784" />


The id attribute gives the book identifier, such as an OCLC number or an ISBN (you need to include the identifier prefix). By default, the book will appear in a viewer 425 × 400 pixels in size. You can add optional width and height attributes to adjust this, as in the example below.
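
For example, to display the same book in a larger 600 × 500 pixel viewer (the dimensions here are just for illustration):


<googlebook id="OCLC:4208784" width="600" height="500" />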