
Friday, June 21, 2019

Messages from Melbourne: Towards linking all the things

I'm doing some work with Nicole Kearney (@nicolekearney) at the Melbourne Museum on the general theme of "linking all the things". It's the end of the first full week we've had, so here's a quick update of what we've been up to.

Brainstorming

The things we want to do are being captured as a project on GitHub. This is where we come up with ideas, comment on them, then try to figure out which ones can be done. So far there are three things we've made a serious start on.

Unpaywall

Unpaywall is a project by Impactstory. It is sort of a Sci-Hub without the legal issues (for the record, I think Alexandra Elbakyan's work on Sci-Hub is nothing short of heroic). Unpaywall scans open access archives for legal, freely available versions of articles and makes them easy to find. If you have Firefox or Chrome you can get a plugin that lights up if the paywall article you're looking at has a free version somewhere else.
Nicole has long wanted the BHL to provide data to Unpaywall, because BHL has open access versions of many papers relevant to taxonomy and biodiversity more broadly defined. After a bit of digging we figured out that Unpaywall didn't have access to BHL's data, so we've set about fixing that. We've got the data harvested, but we're still waiting for Unpaywall to process that data. So, for now, we're still waiting for the little green light to appear on pages such as this one: https://doi.org/10.1080/00222932208632640.
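The plugin's "little green light" is driven by a simple lookup: you ask Unpaywall for a DOI and it tells you whether a free copy exists. A minimal sketch, using the Unpaywall v2 API's request shape and the `is_oa`/`best_oa_location` response fields; the sample response below is illustrative, not real data:

```python
# Sketch: ask Unpaywall whether a free copy of an article exists.
# The v2 API takes a DOI plus a contact email; the response fields used
# here (is_oa, best_oa_location) follow Unpaywall's documented schema.

def unpaywall_url(doi, email):
    """Build the Unpaywall v2 API request URL for a DOI."""
    return "https://api.unpaywall.org/v2/{}?email={}".format(doi, email)

def free_copy(response):
    """Return the URL of a free copy, or None if none is known."""
    if response.get("is_oa") and response.get("best_oa_location"):
        return response["best_oa_location"].get("url")
    return None

# Illustrative response for a paper BHL has scanned (not real data):
sample = {
    "doi": "10.1080/00222932208632640",
    "is_oa": True,
    "best_oa_location": {"url": "https://www.biodiversitylibrary.org/part/..."},
}

print(unpaywall_url("10.1080/00222932208632640", "me@example.org"))
print(free_copy(sample))
```

Once BHL's holdings are in Unpaywall's index, a lookup like this for the DOI above should start returning a BHL URL instead of `None`.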


Adding taxonomic literature to Atlas of Living Australia

Part of "linking all the things" is making the taxonomic literature a first class citizen of biodiversity databases. It is frankly embarrassing to see how much better the scientific literature is handled by projects such as Wikipedia than scientific databases such as GBIF and the ALA. We've decided to try and do something about this by showing how easily the literature could be embedded into the existing ALA web site. Nicole crafted a mockup of the ALA names tab, and I wrote some code to make it "live". For example, if you click on this link you will see a list of publications for Pauropsalta herveyensis Owen & Moulds, 2016. Note that we have DOIs and links to BHL where ever possible (and we use Unpaywall's API to flag whether an article with a DOI is freely available). We want this literature (the primary evidence for what we know about a species) to be visible and accessible. The demo is powered by my Ozymandias project, but we hope to work out a mechanism for delivering the mapping between taxa and literature to ALA (and, indeed, anyone else) as a dataset.
Because Ozymandias only has data for animals, we've had to exclude plants from this demo. I'm frantically trying to figure out how to work with data in Australia's plant name databases to resolve this. I'm discovering that never mind having more than one name for the same species, taxonomists also delight in having many different ways of representing taxonomic information in their databases. So, plants will be a challenge.


Mapping taxonomists to ORCID and Wikidata

One reason for adding literature to taxonomic databases is to make the work of taxonomists more visible. One way to do this is to move beyond using only "dumb strings" as people's names, and instead link taxonomists to their ORCIDs and to entries in Wikidata (this is something I touched on in Ozymandias, and David Shorthouse is doing on an epic scale in Bloodhound). We're playing with the idea of being able to generate a list of active taxonomists in Australia, linked to their identifiers and publications, solely by querying Wikidata. The first step is to automate the initial mapping between taxonomists and Wikidata as much as possible; we've only just started looking at this.
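The kind of query we have in mind could be sent to the Wikidata Query Service. A sketch: the property IDs for ORCID iD (P496), occupation (P106), and country of citizenship (P27), and the item for Australia (Q408), are standard Wikidata identifiers, but the QID used for "taxonomist" below is an assumption to verify before use:

```python
# Sketch: a SPARQL query for the Wikidata Query Service listing people
# with an ORCID iD and Australian citizenship whose occupation is
# taxonomist. Verify the taxonomist QID against Wikidata before use.

SPARQL = """
SELECT ?person ?personLabel ?orcid WHERE {
  ?person wdt:P106 wd:Q1138447 ;   # occupation: taxonomist (QID assumed)
          wdt:P27  wd:Q408 ;       # country of citizenship: Australia
          wdt:P496 ?orcid .        # ORCID iD
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""

def query_params(sparql):
    """Parameters for a GET request to https://query.wikidata.org/sparql."""
    return {"query": sparql, "format": "json"}

params = query_params(SPARQL)
print(params["format"])
```

Each result row would pair a Wikidata item with an ORCID, which is exactly the mapping we want to bootstrap.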

Summary

It is early days, and we're still identifying things we could work on. As always, there are so many things which could be done; we're hoping we can make progress on at least some of these in the next few weeks.

Wednesday, December 09, 2015

Visualising the difference between two taxonomic classifications

It's a nice feeling when work that one did ages ago seems relevant again. Markus Döring has been working on a new backbone classification of all the species which occur in taxonomic checklists harvested by GBIF. After building a new classification the obvious question arises: "how does this compare to the previous GBIF classification?" A simple question; answering it, however, is a little tricky. It's relatively easy to compare two text files -- this functionality appears in places such as Wikipedia and GitHub -- but comparing trees is trickier. Ordering in trees is less meaningful than in text files, which have a single linear order. In other words, as text strings "(a,b,c)" and "(c,b,a)" are different, but as trees they are the same.

Classifications can be modelled as a particular kind of tree where (unlike, say, phylogenies) every node has a unique label. For example, the tips may be species and the internal nodes may be higher taxa such as genera, families, etc. So, what we need is a way of comparing two rooted, labelled trees and finding the differences. Turns out, this is exactly what Gabriel Valiente and I worked on in this paper doi:10.1186/1471-2105-6-208. The code for that paper (available on GitHub) computes an "edit script" that gives a set of operations to convert one fully labelled tree into another. So I brushed up my rusty C++ skills (I'm using "skills" loosely here) and wrote some code to take two trees and the edit script, and create a simple web page that shows the two trees and their differences. Below is a screen shot showing a comparison between the classification of whales in Mammal Species of the World, and one from GBIF (you can see a live version here).

Treediff

The display uses colours to show whether a node has been deleted from the first tree, inserted into the second tree, or moved to a different position. Clicking on a node in one tree scrolls the corresponding node in the other tree (if it exists) into view. Most of the differences between the two trees are due to the absence of fossils from Mammal Species of the World, but there are other issues such as GBIF ignoring tribes, and a few taxa that are duplicated due to spelling typos, etc.
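Because every node is uniquely labelled, the essence of the comparison can be sketched very simply: insertions and deletions fall out of comparing label sets, and moves fall out of comparing parent pointers. This is only a toy illustration of the idea (the real edit script algorithm in the paper does more), with made-up whale taxa:

```python
# Sketch: diff two fully labelled classifications. Each tree is a dict
# mapping node label -> parent label (root maps to None).

def diff(tree1, tree2):
    """Return (deleted, inserted, moved) node label sets."""
    n1, n2 = set(tree1), set(tree2)
    deleted = n1 - n2
    inserted = n2 - n1
    # A shared node has "moved" if its parent differs between the trees.
    moved = {n for n in n1 & n2 if tree1[n] != tree2[n]}
    return deleted, inserted, moved

# Toy classifications: a genus moved up a level, a fossil family added.
t1 = {"Cetacea": None, "Delphinidae": "Cetacea", "Orcinus": "Delphinidae"}
t2 = {"Cetacea": None, "Delphinidae": "Cetacea", "Orcinus": "Cetacea",
      "Basilosauridae": "Cetacea"}

print(diff(t1, t2))  # -> (set(), {'Basilosauridae'}, {'Orcinus'})
```

The three sets map directly onto the three colours in the display.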

Wednesday, June 24, 2015

Thoughts on ReCon 15: DOIs, GitHub, ORCID, altmetric, and transitive credit

I spent last Friday and Saturday at ReCon 15 (Research in the 21st Century: Data, Analytics and Impact, hashtag #ReCon_15) in Edinburgh. Friday 19th was conference day, followed by a hackday at CodeBase. There's a Storify archive of the tweets so you can get a sense of the meeting.

Sitting in the audience a few things struck me.

  1. No identifier wars, DOIs have won and are everywhere.
  2. GitHub is influencing the way we do science, but we've much still to learn.
  3. ORCIDs are gaining traction.
  4. Nobody really understands "impact".

GitHub

GitHub is becoming more and more important, not only as a repository of scientific code and data, but as a useful model of the sorts of things we need to be doing. Arfon Smith gave a fascinating talk on GitHub. Apart from the obvious things such as version control, Arfon discussed the tools and mindset of open source programmers, and how that could be applied to scientific data. For example, software on GitHub is often automatically tested for bugs (and GitHub displays a badge saying whether things are OK). Imagine doing this for a data set, having it automatically checked for errors and/or internal consistency. Reproducibility is a big topic in science, but open source software has to be reproducible by default in the sense that it has to be able to be downloaded and compiled on a user's computer. These are just a couple of the things Arfon covered; see his slides for more.
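What would "continuous integration for data" look like? A minimal sketch: a test suite that runs over an occurrence table on every commit and reports problems (an empty report being the green badge). The field names are standard Darwin Core terms; the rows are illustrative:

```python
# Sketch: automated checks on a tiny occurrence table, the data
# equivalent of a CI test suite with a pass/fail badge.

def check_rows(rows):
    """Return (row_index, problem) tuples; an empty list means 'badge green'."""
    problems = []
    for i, row in enumerate(rows):
        lat, lon = row.get("decimalLatitude"), row.get("decimalLongitude")
        if lat is not None and not -90 <= lat <= 90:
            problems.append((i, "latitude out of range"))
        if lon is not None and not -180 <= lon <= 180:
            problems.append((i, "longitude out of range"))
        if not row.get("scientificName"):
            problems.append((i, "missing scientific name"))
    return problems

rows = [
    {"scientificName": "Banza nihoa", "decimalLatitude": 23.06, "decimalLongitude": -161.92},
    {"scientificName": "", "decimalLatitude": 95.0, "decimalLongitude": 10.0},
]
print(check_rows(rows))  # -> [(1, 'latitude out of range'), (1, 'missing scientific name')]
```

Real checks would go further (dates, taxonomy, internal consistency), but the workflow is the same: every change to the dataset re-runs the tests.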

Transitive Credit

One idea which particularly struck me was that of "transitive credit":

Katz, D. S. (2014, February 10). Transitive Credit as a Means to Address Social and Technological Concerns Stemming from Citation and Attribution of Digital Products. JORS. Ubiquity Press, Ltd. http://doi.org/10.5334/jors.be

From the above paper:

The idea of transitive credit is as follows: The credit map for product A, which is used by product B, feeds into the credit map for product B. For example, product A is a software package equally written by two authors and its credit map is that 50 percent of the credit for this should go the lead developer, 20 percent to the second developer, and 10 percent to the third developer. In addition, 5 percent should go to each of the four libraries that are needed to run the code. When this product is created and registered, this credit map is registered along with it. Product B is a paper that obtains new science results, and it depended on Product A. The person who registers the publication also registers its credit map, in this case 75 percent to her/himself, and 25 percent to the software code previous mentioned. Credit is now transitive, in that the lead software developer of the code can be given credit for 12.5 percent of the paper. If another paper is later written that extends the product B paper and gives 10% credit to that paper, the lead software package developer will also have 1.25% credit for the new paper.
The idea of being able to track credit across derived products is interesting, and is especially relevant to projects such as GBIF, where users can download large datasets that are themselves aggregations of data from numerous different providers (making it easy to calculate the relative contributions of each provider). If we then track citations of that data (and citations of those citations) we could give data providers a better estimate of the actual impact of their data.
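The arithmetic of transitive credit is just multiplication down the chain of credit maps. A toy sketch (the weights are simplified from the quoted example, which is internally inconsistent about the number of developers):

```python
# Sketch: flatten a product's credit map by expanding nested products,
# multiplying fractions along the way.

def transitive_credit(credit_maps, product, depth=10):
    """Return {contributor: total fraction} for a product."""
    flat = {}
    def walk(name, weight, d):
        if d == 0:
            return
        for part, frac in credit_maps.get(name, {name: 1.0}).items():
            if part in credit_maps and part != name:
                walk(part, weight * frac, d - 1)  # expand nested product
            else:
                flat[part] = flat.get(part, 0.0) + weight * frac
    walk(product, 1.0, depth)
    return flat

credit_maps = {
    "paper_B": {"author": 0.75, "software_A": 0.25},
    "software_A": {"lead_dev": 0.5, "second_dev": 0.3, "libraries": 0.2},
}
print(transitive_credit(credit_maps, "paper_B")["lead_dev"])  # -> 0.125
```

The lead developer's 12.5 percent of the paper is 50 percent of the software times the software's 25 percent of the paper, matching Katz's worked example.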

Impact

Euan Adie of Altmetric talked about "impact", and remarked on an example of a paper being cited in a policy document; this was picked up by Altmetric and seen by the authors of the paper, who had no idea that their work had influenced a policy document. This raises some intriguing possibilities, related to the idea of "transitive credit" above.

In building BioNames I've added the ability to show Altmetric "donuts", and I'm struck by examples like this one (see also reference in BioNames):

JENKINS, P. D., & ROBINSON, M. F. (2002, June). Another variation on the gymnure theme: description of a new species of Hylomys (Lipotyphla, Erinaceidae, Galericinae). Bulletin of The Natural History Museum. Zoology Series. Cambridge University Press (CUP) doi:10.1017/S0968047002000018

This paper has no recent "buzz" (e.g., Twitter, Facebook, Mendeley) but is cited on three Wikipedia pages. So, this paper has impact, albeit not in social media. Many papers like this will slip below the social media radar but will be used by various databases and may contribute to subsequent work. Perhaps we could expand altmetrics sources of information to include some of those databases. For example, if a paper has been aggregated/cited by a major database (such as GBIF) then it would be nice to see that on the Altmetric donut. For authors this gives them another example of the impact of their work, but for the databases it's also an opportunity to increase engagement (if people have relevant work that doesn't appear in the donut they can take steps to have that work included in the aggregation). Obviously there are issues about what databases to count as providing signal for altmetrics, but there's scope here to broaden and quantify our notion of impact.

Hackday

The ReCon hackday was a pretty informal event held at CodeBase just down from Edinburgh Castle, apparently the largest start-up incubator in the European tech scene. It was a pretty amazing place, and a great venue for a hackday. I spent the day looking at the ORCID API and seeing if I could create some mashups with Journal Map and my own BioNames. One goal was to see if we could generate a map of a researcher's study sites starting with their ORCID, using ORCID's API to retrieve a list of their publications, then talking to the Journal Map API to get point localities for those papers. The code worked, but the results were a little disappointing because Jim Caryl and I were focussing on University of Glasgow researchers, and they had few papers in Journal Map. The code, such as it is, is on GitHub.
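The first half of that pipeline can be sketched without any network access: build the request for the ORCID public API, then pull DOIs out of the works summary it returns. The endpoint and response structure follow ORCID's public v3.0 API as I understand it; the Journal Map step is omitted since its API details aren't shown here, and the sample response below is illustrative (the ORCID iD is ORCID's documented example account):

```python
# Sketch of the hackday pipeline: ORCID iD -> list of DOIs.

def orcid_works_request(orcid):
    """URL and headers for fetching a researcher's works from ORCID."""
    return ("https://pub.orcid.org/v3.0/{}/works".format(orcid),
            {"Accept": "application/json"})

def extract_dois(works_json):
    """Pull DOIs out of an ORCID works summary structure."""
    dois = []
    for group in works_json.get("group", []):
        for eid in group.get("external-ids", {}).get("external-id", []):
            if eid.get("external-id-type") == "doi":
                dois.append(eid.get("external-id-value"))
    return dois

# Illustrative works summary with a single DOI:
sample = {"group": [{"external-ids": {"external-id": [
    {"external-id-type": "doi",
     "external-id-value": "10.1371/currents.tol.8f3c6526c49b136b98ec28e00b570a1e"}]}}]}

url, headers = orcid_works_request("0000-0002-1825-0097")
print(url)
print(extract_dois(sample))
```

Each DOI would then be looked up in Journal Map (or BioStor) for point localities to plot.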

My original idea was to focus on BioNames, and see how many authors of taxonomic papers had ORCIDs. Initial experiments seemed promising (see GitHub for code and data). Time was limited, so I got as far as building lists of DOIs from BioNames and discovering the associated ORCIDs. The next steps would be (a) providing ORCID login to BioNames, and (b) using ORCID to help cluster author name strings in BioNames. Still much to do.

I've not been to many hackdays/hackathons, but I find them much more rewarding than simply sitting in a lecture theatre and listening to people talk. Combining both types of meeting is great, and I look forward to similar events in the future.

Visualising Geophylogenies in Web Maps Using GeoJSON

I've published a short note on my work on geophylogenies and GeoJSON in PLoS Currents Tree of Life:

Page R. Visualising Geophylogenies in Web Maps Using GeoJSON. PLOS Currents Tree of Life. 2015 Jun 23 . Edition 1. doi:10.1371/currents.tol.8f3c6526c49b136b98ec28e00b570a1e.
At the time of writing the DOI hasn't registered, so the direct link is here. There is a GitHub repository for the manuscript and code.

I chose PLoS Currents Tree of Life because it is (supposedly) quick and cheap. Unfortunately a perfect storm of delays in reviewing together with licensing issues resulted in the paper taking nearly three months to appear. The licensing issues were a headache. PLoS uses the Creative Commons CC-BY license for all its content. Unfortunately, the original submission included maps from Google Maps and Open Street Map (OSM), to show that the GeoJSON produced by my tool could work with either. Google Maps tile imagery is not freely available, so I had to replace that in order for PLoS to be able to publish my figures. At first I simply replaced the tiles Google Maps displays with ones from OSM, but those tiles are CC-BY-SA, which is incompatible with PLoS's use of CC-BY. Argh! I got stroppy about this on Twitter:

Eventually I discovered maps from CartoDB that have CC-BY licenses, and so could be used in the PLoS Currents article. After replacing Google's and OSM tiles with these maps (and trimming off the "Google" logo) the figures were acceptable to PLoS. Increasingly I think Creative Commons has resulted in a mess of mutually incompatible licenses that make mashing up things hard. The idea was great ("skip the intermediaries" by declaring that your content can be used), but the outcome is messy and frustrating.

But, enough grumbling. The article is out, the code is in GitHub. Now to think about how to use it.

Wednesday, January 28, 2015

Annotating GBIF, from datasets to nanopublications

Below I sketch what I believe is a straightforward way GBIF could tackle the issue of annotating and cleaning its data. It continues a series of posts Annotating GBIF: some thoughts, Rethinking annotating biodiversity data, and More on annotating biodiversity data: beyond sticky notes and wikis on this topic.

Let's simplify things a little and state that GBIF at present is essentially an aggregation of Darwin Core Archive files. These are for the most part simply CSV tables (spreadsheets) with some associated administrivia (AKA metadata). GBIF consumes Darwin Core Archives, does some post-processing to clean things up a little, then indexes the contents on key fields such as catalogue number, taxon name, and geographic coordinates.

What I'm proposing is that we make use of this infrastructure, in that any annotation is itself a Darwin Core Archive file that GBIF ingests. I envisage three typical use cases:

  1. A user downloads some GBIF data, cleans it for their purposes (e.g., by updating taxonomic names, adding some georeferencing, etc.) then uploads the edited data to GBIF as a Darwin Core Archive. This edited file gets a DOI (unless the user has got one already, say by storing the data in a digital archive like Zenodo).
  2. A user takes some GBIF data and enhances it by adding links to, for example, sequences in GenBank for which the GBIF occurrences are voucher specimens, or references which cite those occurrences. The enhanced data set is uploaded to GBIF as a Darwin Core Archive and, as above, gets a DOI.
  3. A user edits an individual GBIF record, say using an interface like this. The result is stored as a Darwin Core Archive with a single row (corresponding to the edited occurrence), and gets a DOI (this is a nanopublication, of which more later).
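Use case 3 is easy to picture as data: a one-row table carrying just the occurrence identifier plus the edited fields. A minimal sketch, using standard Darwin Core term names with illustrative values (a real archive would also bundle a meta.xml describing the columns):

```python
# Sketch: serialise a single-record annotation as a one-row CSV
# "archive" carrying only the occurrence ID and the edited fields.

import csv
import io

def annotation_row(occurrence_id, **edits):
    """Return CSV text: a header row plus one data row."""
    fields = ["occurrenceID"] + sorted(edits)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fields)
    writer.writeheader()
    writer.writerow(dict({"occurrenceID": occurrence_id}, **edits))
    return out.getvalue()

# Illustrative: a georeferencing edit to a hypothetical record.
print(annotation_row("12345",
                     decimalLatitude="-17.44",
                     decimalLongitude="145.863"))
```

Bundled with metadata and given a DOI, that single row is the nanopublication discussed below.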

Note that I'm ignoring the other type of annotation, which is to simply say "there is a problem with this record". This annotation doesn't add data, but instead flags an issue. GBIF has a mechanism for doing this already, albeit one that is deeply unsatisfactory and isn't integrated with the portal (you can't tell whether anyone has raised an issue for a record).

Note also that at this stage we've done nothing that GBIF doesn't already do, or isn't about to do (e.g., minting DOIs for datasets). Now, there is one inevitable consequence of this approach, namely that we will have more than one record for the same occurrence, the original one in GBIF, and the edited record. But, we are in this situation already. GBIF has duplicate records, lots of them.

Duplication

As an example, consider the following two occurrences for Psilogramma menephron:

occurrence   taxon                                longitude  latitude  catalogue number  sequence
887386322    Psilogramma menephron Cramer, 1780   145.86301  -17.44    BC ZSM Lep 01337
1009633027   Psilogramma menephron Cramer, 1780   145.86     -17.44    KJ168695          KJ168695

These two occurrences come from the Zoologische Staatssammlung Muenchen - International Barcode of Life (iBOL) - Barcode of Life Project Specimen Data and Geographically tagged INSDC sequences data sets, respectively. They are for the same occurrence (you can verify this by looking at the metadata for the sequence KJ168695, where the specimen_voucher field is "BC ZSM Lep 01337").

What do we do about this? One approach would be to group all such occurrences into clusters that represent the same thing. We are then in a position to do some interesting things, such as compare different estimates of the same values. In the example above, there is clearly a difference in precision of geographic locality between the two datasets. There are some nice techniques available for synthesising multiple estimates of the same value (e.g., Bayesian belief networks), so we could provide for each cluster a summary of the possible values for each field. We can also use these methods to build up a picture of the reliability of different sources of annotation.
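The clustering step itself is straightforward once records share a key. A sketch, using the shared voucher string from the example above as the grouping key (real clustering would need fuzzier matching across several fields):

```python
# Sketch: group occurrence records that share a specimen identifier,
# then compare field values across each cluster.

from collections import defaultdict

def cluster_by_voucher(records):
    """Map voucher string -> list of records claiming that specimen."""
    clusters = defaultdict(list)
    for r in records:
        clusters[r["voucher"]].append(r)
    return clusters

records = [
    {"id": "887386322",  "voucher": "BC ZSM Lep 01337", "longitude": 145.86301},
    {"id": "1009633027", "voucher": "BC ZSM Lep 01337", "longitude": 145.86},
]
clusters = cluster_by_voucher(records)

# Two estimates of the same value, with different precision:
print({r["id"]: r["longitude"] for r in clusters["BC ZSM Lep 01337"]})
```

Each cluster then becomes the unit over which we can synthesise a summary value per field and score the reliability of different sources.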

In a sense, we can regard one record (1009633027) as adding an annotation to the other (887386322), namely adding the DNA sequence KJ168695 (in Darwin Core parlance, "associatedSequences=[KJ168695]").

But the key point here is that GBIF will have to at some point address the issue of massive duplication of data, and in doing so it will create an opportunity to solve the annotation problem as well.

Github and DOIs

In terms of practicalities, it's worth noting that we could use GitHub to manage editing GBIF data, as I've explored in GBIF and Github: fixing broken Darwin Core Archives. Although GitHub might not be ideal (there are some very cool alternatives being developed, such as dat, see also interview with Max Ogden) it has the nice feature that you can publish a release and get a DOI via its integration with Zenodo. So people can work on datasets and create citable identifiers at the same time.

Nanopublications

If we consider that a Darwin Core Archive is basically a set of rows of data, then the minimal unit is a single row (corresponding to a single occurrence). This is the level at which some users will operate. They will see an error in GBIF and be able to edit the record (e.g., by adding georeferencing, an identification, etc.). One challenge is how to create incentives for doing this. One approach is to think in terms of nanopublications, which are:
A nanopublication is the smallest unit of publishable information: an assertion about anything that can be uniquely identified and attributed to its author.
A nanopublication comprises three elements:
  1. The assertion: In this context the Darwin Core record would be the assertion. It might be a minimal record in that, say, it only listed the fields relevant to the annotation.
  2. The provenance: the evidence for the assertion. This might be the DOI of a publication that supports the annotation.
  3. The publication information: metadata for the nanopublication, including a way to cite the nanopublication (such as a DOI), and information on the author of the nanopublication. For example, the ORCID of the person annotating the GBIF record.

As an example, consider GBIF occurrence 668534424 for specimen FMNH 235034, which according to GBIF is a specimen of Rhacophorus reinwardtii. In a recent paper

Matsui, M., Shimada, T., & Sudin, A. (2013, August). A New Gliding Frog of the Genus Rhacophorus from Borneo . Current Herpetology. Herpetological Society of Japan. doi:10.5358/hsj.32.112
Matsui et al. assert that FMNH 235034 is actually Rhacophorus borneensis based on a phylogenetic analysis of a sequence (GQ204713) derived from that specimen. In which case, we could have something like this:

The nanopublication standard is evolving, and has a lot of RDF baggage that we'd need to simplify to make it fit the Darwin Core model of a flat row of data, but you could imagine having a nanopublication which is a Darwin Core Archive that includes the provenance and publication information, and gets a citable identifier so that the person who created the nanopublication (in the example above I am the author of the nanopublication) can get credit for the work involved in creating the annotation. Using citable DOIs and ORCIDs to identify the nanopublication and its author embeds the nanopublication in the wider citation graph.
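Stripped of the RDF, the FMNH 235034 example might flatten to three parts like this. A sketch only: the assertion and provenance values come from the example above, while the annotator's ORCID and the nanopublication's own DOI are left as placeholders:

```python
# Sketch: the three parts of a nanopublication, flattened to fit the
# "one Darwin Core row" framing. Placeholder identifiers are marked.

nanopub = {
    "assertion": {            # the (minimal) Darwin Core record
        "occurrenceID": "668534424",
        "scientificName": "Rhacophorus borneensis",
    },
    "provenance": {           # evidence for the assertion
        "supportingPublication": "doi:10.5358/hsj.32.112",
    },
    "publicationInfo": {      # metadata about the nanopublication itself
        "author": "orcid:...",      # placeholder: annotator's ORCID
        "identifier": "doi:...",    # placeholder: citable DOI for this nanopub
    },
}
print(sorted(nanopub))
```

The assertion carries the new identification, the provenance points at Matsui et al., and the publication info is what makes the annotation citable and creditable.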

Note that nanopublications are not really any different from larger datasets; indeed, we can think of a dataset of, say, 1000 rows as simply an aggregation of nanopublications. However, one difference is that I think GBIF would have to set up the infrastructure to manage the creation of nanopublications (which is basically: collect the user's input, add a user id, save, and mint a DOI). Whereas users working with large datasets may well be happy to work with those on, say, GitHub or some other data editing environment, people willing to edit single records are unlikely to want to mess with that complexity.

What about the original providers?

Under this model, the original data provider's contribution to GBIF isn't touched. A user's annotation amounts to adding a copy of the record with some differences (corresponding to the user's edits). Now, the data provider may choose to accept those edits, in which case they can edit their own database using whatever system they have in place, and then the next time GBIF re-harvests the data, the original record in GBIF gets updated with the new data (this assumes that data providers have stable ids for their records). Under this approach we free ourselves from thinking about complicated messaging protocols between providers and aggregators, and we also free ourselves from having to wait until an edit is "approved" by a provider. Any annotation is available instantly.

Summary

My goal here is to sketch out what I think is a straightforward way to tackle annotation that makes use of what GBIF is already doing (aggregating Darwin Core Archives) or will have to do real soon now (cluster duplicates). The annotated and cleaned data can, of course, live anywhere (and I'm suggesting that it could live on GitHub and be archived on Zenodo), so people who clean and edit data are not simply doing it for the good of GBIF, they are creating data sets that can be used independently and be cited independently. Likewise, even if somebody goes to the trouble of fixing a single record in GBIF, they get a citable unit of work that will be linked to their academic profile (via ORCID).

Another aspect of this approach is that we don't actually need to wait for GBIF to do this. If we adopt Darwin Core Archive as the format for annotations, we can create annotations, mint DOIs, and build our own database of annotated data, with a view to being able to move that work to GBIF if and when GBIF is ready.

Tuesday, September 23, 2014

Exploring the chameleon dataset: broken GBIF links and lack of georeferencing

Following on from the discussion of the African chameleon data, I've started to explore Angelique Hjarding's data in more detail. The data is available from figshare (doi:10.6084/m9.figshare.1141858), so I've grabbed a copy and put it in github. Several things are immediately apparent.

  1. There is a lot of ungeoreferenced data. With a little work this could be geotagged and hence placed on a map.
  2. There are some errors with the georeferenced data (chameleons in South America or off the coast, a locality in Tanzania that is now in Ethiopia, etc.).
  3. Rather alarmingly, most of the URLs to GBIF records that Angelique gives in the dataset no longer resolve.

The last point is worrying, and reflects the fact that at present you can't trust GBIF occurrence URLs to be stable over time. Most of the specimens in Angelique's data are probably still in GBIF, but the GBIF occurrenceID (and hence URL) will have changed. This pretty much kills any notion of reproducibility, and it will require some fussing to be able to find the new URLs for these records.

That the GBIF occurrenceIDs are no longer valid also makes it very difficult to make use of any data cleaning I or anyone else attempts with this data. If I georeference some of the specimens, I can't simply tell GBIF that I've got improved data. Nor is it obvious how I would give this information to the original providers using, say, VertNet's GitHub repositories. All in all a mess, and a sad reflection on our inability to have persistent identifiers for occurrences.

To help explore the data I've created some GeoJSON files to get a sense of the distribution of the data. Here are the point localities, a few of which have clearly got issues.
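Generating those files is mostly a matter of mapping rows to GeoJSON features. A minimal sketch (remembering that GeoJSON coordinate order is [longitude, latitude]; the row below is illustrative, not from the dataset):

```python
# Sketch: turn georeferenced rows into a GeoJSON FeatureCollection
# of points, ready to drop onto a web map.

import json

def to_geojson(rows):
    features = [{
        "type": "Feature",
        "geometry": {"type": "Point",
                     "coordinates": [r["longitude"], r["latitude"]]},  # lon, lat!
        "properties": {"taxon": r["taxon"]},
    } for r in rows]
    return {"type": "FeatureCollection", "features": features}

rows = [{"taxon": "Trioceros", "latitude": -6.8, "longitude": 37.7}]
print(json.dumps(to_geojson(rows), indent=2))
```

Getting the coordinate order backwards is exactly the kind of error that puts chameleons in South America or off the coast.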



I also drew some polygons around points for the same taxon, to get a sense of their distributions.

Taxa represented by fewer than three distinct localities are shown by place markers, the rest by polygons.

I'll keep playing with this data as time allows, and try to get a sense of how hard it would be to go from what GBIF provides to what is actually going to be useful.

Monday, August 25, 2014

Geotagging stats for BioStor

Note to self for upcoming discussion with JournalMap.

As of Monday August 25th, BioStor has 106,617 articles comprising 1,484,050 BHL pages. From the full text for these articles, I have extracted 45,452 distinct localities (i.e., geotagged with latitude and longitude). 15,860 BHL pages in BioStor have at least one geotag; these pages belong to 5,675 BioStor articles.

In summary, BioStor has 5,675 full-text articles that are geotagged. The largest number of geotags for an article is 2,421, for Distribución geográfica de la fauna de anfibios del Uruguay (doi:10.5479/si.23317515.134.1).

The SQL for the queries is here.

Tuesday, May 06, 2014

Very large phylogeny viewer

As announced on phylobabble I've started to revisit visualising large phylogenies, building on some work I did a couple of years ago (my how time flies). This time, there is actual code (see https://github.com/rdmpage/deep-tree) as well as a live demo http://iphylo.org/~rpage/deep-tree/demo/.

You can see the amphibian tree below at http://iphylo.org/~rpage/deep-tree/demo/show.php?id=5369171e32b7a:


You can upload or paste a tree (for now in NEXUS format), or paste in a URL to a NEXUS file (e.g., from TreeBASE). I'll add more formats when I get the chance. The viewer uses the same approach as Google Maps, breaking the image of the tree into "tiles" of a fixed size, so even if the tree image is huge, the web browser only ever displays the same number of tiles. You can zoom in to see individual taxa, or zoom out for an overview. One reason I'm building this is to display DNA barcoding trees alongside the million DNA barcode map.
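The tiling arithmetic behind this is simple. A sketch (256-pixel tiles, as in Google Maps; the viewport numbers are illustrative): the browser works out which tile indices intersect the current viewport and fetches only those, so the work is bounded by screen size, not tree size:

```python
# Sketch: which tiles does a viewport at pixel offset (x, y) need?

TILE = 256  # tile edge in pixels

def visible_tiles(x, y, width, height):
    """Return (column, row) tile indices covering the viewport."""
    cols = range(x // TILE, (x + width - 1) // TILE + 1)
    rows = range(y // TILE, (y + height - 1) // TILE + 1)
    return [(c, r) for r in rows for c in cols]

# A 1024x768 viewport at this offset needs 5 x 4 = 20 tiles,
# no matter how large the full tree image is:
print(len(visible_tiles(300, 1000, 1024, 768)))  # -> 20
```

Panning or zooming just changes the offset (or tile set for that zoom level), and the browser requests the handful of new tiles that scrolled into view.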

As ever this is a crude first attempt, but feel free to try it and let me know how you get on.

Thursday, March 13, 2014

Publishing biodiversity data directly from GitHub to GBIF

Today I managed to publish some data from a GitHub repository directly to GBIF. Within a few minutes (and with Tim Robertson on hand via Skype to debug a few glitches) the data was automatically indexed by GBIF and its maps updated. You can see the data I uploaded here.

The data I uploaded came from this paper:

Shapiro, L. H., Strazanac, J. S., & Roderick, G. K. (2006, October). Molecular phylogeny of Banza (Orthoptera: Tettigoniidae), the endemic katydids of the Hawaiian Archipelago. Molecular Phylogenetics and Evolution. Elsevier BV. doi:10.1016/j.ympev.2006.04.006
This is the data I used to build the geophylogeny for Banza using Google Earth. Prior to uploading this data, GBIF had no georeferenced localities for these katydids, now it has 21 occurrences:

How it works

I give details of how I did this in the GitHub repository for the data. In brief, I took data from the appendix in the Shapiro et al. paper and created a Darwin Core Archive in a repository in GitHub. Mostly this involved messing with Excel to format the data. I used GBIF's registry API to create a dataset record, pointed it at the GitHub repository, and let GBIF do the rest. There were a few little hiccups, such as needing to tweak the meta.xml file that describes the data, and GBIF's assumption that specimens are identified by the infamous "Darwin Core Triplet" meant I had to invent one for each occurrence, but other than that it was pretty straightforward.
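The registry step above boils down to two API calls: create a dataset record, then attach an endpoint pointing at the archive. A sketch of what those request payloads might carry; the field names and endpoint types here are assumptions to check against the current GBIF registry API documentation, and the GitHub URL is a hypothetical example:

```python
# Sketch: payloads for registering a dataset with GBIF's registry API
# and pointing it at a Darwin Core Archive hosted on GitHub.
# Field names and the endpoint type are assumptions; verify against
# the GBIF registry API docs.

import json

def dataset_payload(title, publishing_org_key, installation_key):
    """JSON body for creating the dataset record."""
    return {
        "title": title,
        "type": "OCCURRENCE",
        "publishingOrganizationKey": publishing_org_key,  # UUID from GBIF
        "installationKey": installation_key,              # UUID from GBIF
    }

def endpoint_payload(dwca_url):
    """JSON body pointing the dataset at the archive to harvest."""
    return {"type": "DWC_ARCHIVE", "url": dwca_url}

payload = dataset_payload("Banza localities from Shapiro et al. 2006",
                          "<org-uuid>", "<installation-uuid>")
print(json.dumps(endpoint_payload(
    "https://github.com/example/banza-dwca/raw/master/dwca.zip")))  # hypothetical URL
```

Once the endpoint is registered, GBIF's crawler fetches the archive from GitHub and indexes it; pushing a new commit and re-crawling is how edits flow through.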

I've talked about using GitHub to help clean up Darwin Core Archives from GBIF, and VertNet are using GitHub as an issue tracker, but what I've done here differs in one crucial way. I'm not just grabbing a file from GBIF and showing that it is broken (with no way to get those fixes to GBIF), nor am I posting bug reports for data hosted elsewhere and hoping that someone will fix it (like VertNet), what I'm doing here is putting data on GitHub and having GBIF harvest that data directly from GitHub. This means I can edit the data, rebuild the Darwin Core Archive file, push it to GitHub, and GBIF will reindex it and update the data on the GBIF portal.

Beyond nodes

GBIF's default publishing model is a federated one. Data providers in countries (such as museums and herbaria) digitise their data and make it available to national aggregators ("nodes"), which typically host a portal with information about the biodiversity of that nation (the Atlas of Living Australia is perhaps the most impressive example). These nodes then make the data available to GBIF, which provides a global portal to the world's biodiversity data (as opposed to national-level access provided by nodes).

This works well if you assume that most biodiversity data is held by national natural history collections, but this is debatable. There are other projects, some of them large and not necessarily "national", that have valuable data. These projects can join GBIF and publish their data. But what about all the data that is held in other databases (perhaps not conventionally thought of as biodiversity databases), or the huge amount of information in the published literature? How does that get into GBIF? People like me data mine the literature for information on specimens and localities, such as this map of localities mentioned in articles in BioStor. How do we get that data into GBIF?

Data publishing

Being able to publish data directly to GBIF makes the effort of publishing data seem less onerous, because I can see it appear in GBIF within minutes. Putting up 21 records of katydids is clearly a drop in the ocean, but there is potentially vastly more data waiting to be mined. Managing the data on GitHub also makes the whole process of data cleaning and editing transparent. As ever, there are a couple of things that still need to be tackled.

It's who you know

I've been able to do this because I have links with GBIF, and they have made the (hopefully reasonable) assumption that I'm not going to publish just any old crap to GBIF. I still had to get "endorsed" by the UK node (being the chair of the GBIF Science Committee probably helped), and I'm lucky that Tim Robertson was online at the time and guided me through the process. None of this is terribly scalable. It would be nice if we had a way to open up GBIF to direct publishing, but with a review process built in (even if it's a post-publication review, so that data may have to be pulled if it becomes clear it's problematic). Perhaps this could be managed via GitHub: for example, data could be uploaded and managed there, GBIF could then choose to pull that repository, and the data would appear on GBIF. Another model is something like the Biodiversity Data Journal, but that doesn't (as far as I know) have a direct feed into GBIF.

Whichever approach we take, we need a simple, frictionless way to get data into GBIF, especially if we want to tackle the obvious geographic and taxonomic biases in the data GBIF currently has.

DOIs please

It would be great if I could get a DOI for this dataset. I had toyed with putting it on Figshare, which would give me a DOI, but that just puts an additional layer between GitHub and GBIF. Ideally, instead of (or as well as) the UUID I get from GBIF to identify the dataset, I'd get a DOI that others can cite, and which would appear on my ORCID profile. I'd also want a way to link the data DOI to the DOI for the source paper (doi:10.1016/j.ympev.2006.04.006), so that citations of the data can pass some of that "link love" to the original authors. So, GBIF needs to mint DOIs for datasets.

Summary

The ability to upload data to GitHub and then have that harvested by GBIF is really exciting. We get great tools for managing changes in data, with a simple publication process (OK, simple if you know Tim, and can speak REST to the GBIF API). But we are getting closer to easy publishing and, just as importantly, easy editing and correcting data.




Friday, March 07, 2014

GBIF data overlayed on Google Maps

As part of a project exploring GBIF data I've been playing with displaying GBIF data on Google Maps. The GBIF portal doesn't use Google Maps, which is a pity because Google's terrain and satellite layers are much nicer than the layers used by GBIF (I gather the level of traffic GBIF receives is above the threshold at which Google starts charging for access).

But because the GBIF developers have a nice API it's pretty easy to put GBIF data on Google maps, like this (the map is live):



The source code for this map is available as a gist, and you can see it live above, and at http://bl.ocks.org/rdmpage/9411457.

Friday, January 24, 2014

VertNet starts issue tracking using GitHub

VertNet has announced that they have implemented issue tracking using GitHub. This is a really interesting development, as figuring out how to capture and make use of annotations in biodiversity databases is a problem that's attracting a lot of attention. VertNet have decided to use GitHub to handle annotations, but in a way that hides most of GitHub from users (developers tend to love things like GitHub, regular folks, not so much, see The Two Cultures of Computing).

The VertNet blog has a detailed walk through of how it works. I've made some comments on that blog, but I'll repeat them here.

At the moment the VertNet interface doesn't show any evidence of issue tracking (there's a link to add an issue, but you can't see if there are any issues). For example, visiting an example record, CUMV Amphibian 1766, I don't see any evidence on that page that there is an issue for this record (there is one, see https://github.com/cumv-vertnet/cumv-amph/issues/1). I think it's important that people see evidence of interaction (that way you might encourage others to participate). This would also enable people to gauge how active collection managers are in resolving issues ("gee, they fixed this problem in a couple of days, cool").

Likewise, it would be nice to have a collection-level summary in the portal. For example, looking at CUMV Amphibian 1766 I'm not able to click through to a page for CUMV Amphibians (there needs to be a way to get to the collection from a record) to see how many issues there are for the whole collection, and how fast they are being closed.

I think the approach VertNet are using has a lot of potential, although it sidesteps some of the most compelling features of GitHub, namely forking and merging code and other documents. I can't, for example, take a record, edit it, and have those edits merged into the data. It's still a fairly passive "hey, there's a problem here", which means that the burden is still on curators to fix the issue. This raises the whole question of what to do with user-supplied edits. There's a nice paper regarding validating user input into Freebase that is relevant here, see "Trust, but Verify: Predicting Contribution Quality for Knowledge Base Construction and Curation" (http://dx.doi.org/10.1145/2556195.2556227 [not live yet], PDF here).

Thursday, November 21, 2013

GBIF, GitHub, and taxonomy (again)

Quick notes on yet another attempt to marry the task of editing a taxonomic classification with versioning it in GitHub.

The idea of dumping the whole GBIF classification into GitHub as a series of nested folders looks untenable. So, maybe there's another way to tackle the problem.

Let's imagine that we dump the GBIF classification down to, say, family level as a series of nested folders (i.e., we recreate the classification on disk). For each family we then create a bunch of files and store them in that folder. For example, we could have the classification in Darwin Core Archive format (basically, delimited text). Let's also create a graph that corresponds to that classification, using a format for which we have tools for visualising and editing.

For example, I've created a Graph Modelling Language (GML) file for the Pinnotheridae here. Using software such as yEd I can load this file, display it, and edit it. For example, below is a compact tree layout of the graph:

Pinnotheridae

This image is a bitmap, if you opened the GML file in yEd it would be interactive, and you could zoom in, alter the layout, edit the graph, etc.
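Generating GML like this is straightforward. Here's a hypothetical helper (not the code I actually used) that writes a classification as nodes and edges in GML's bracketed syntax:

```python
def to_gml(labels, edges):
    """Write a graph in Graph Modelling Language (GML).

    labels: list of node labels (taxon names); node ids are list indices.
    edges:  list of (parent_index, child_index) pairs.
    """
    lines = ["graph ["]
    for i, label in enumerate(labels):
        lines.append(f'  node [ id {i} label "{label}" ]')
    for parent, child in edges:
        lines.append(f"  edge [ source {parent} target {child} ]")
    lines.append("]")
    return "\n".join(lines)
```

For example, to_gml(["Pinnotheridae", "Pinnixa"], [(0, 1)]) produces a two-node graph that yEd will happily open.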

Looking at the graph there are a few oddities, such as "orphan" genera that lack any species, and some names that appear very similar. For example, there is an orphan genus Glassella, and a similar genus Glassellia (note the "i") with a single species Glassellia costaricana. A little digging in BioNames shows that Glassellia is a misspelling of Glassella. The original description appears in:

E Campos, M K Wicksten (1997) A New Genus For The Central American Crab Pinnixa costaricana Wicksten, 1982 (Crustacea: Brachyura: Pinnotheridae). Proceedings of the Biological Society of Washington 110(1): 69–73. http://biostor.org/reference/81137
So, we have one genus that appears twice due to a typo. Furthermore, there are nodes in the graph for the taxa Glassellia costaricana and Pinnixa costaricana, but these are the same thing (the names are synonyms, albeit with the genus misspelt in Glassellia costaricana). So, we could delete Pinnixa costaricana, delete the misspelling Glassellia, fix the spelling of Glassellia costaricana, and move it to the correctly spelt Glassella. There are other problems with this classification, but let's leave them for the moment.

Now, imagine that after editing I use the graph to regenerate the DWCA file, which now has the edited classification. I then commit the changes to GitHub, and anyone else (including GBIF) could grab the DWCA and, for example, replace their Pinnotheridae classification with the edited version.

We could also go further, and add what I think is a missing component of the GBIF classification, namely links to the nomenclators. For example, in an ideal world we would have each name in the classification linked to a stable identifier for the name provided by a nomenclator, and that nomenclator would know, for example, that Pinnixa costaricana and Glassella costaricana were objective synonyms. If we had those links then we could automatically detect cases such as this, where logically you can have either Pinnixa costaricana or Glassella costaricana in the same classification, but not both.

There are some wrinkles to figure out; for example, it would be nice to compute the difference between the original and edited graphs in terms of graph operations (not simply the difference as text files) so we could do things like list nodes that have been moved or deleted. I did some work on this a while back (Page, R. D., & Valiente, G. (2005). BMC Bioinformatics, 6(1), 208. doi:10.1186/1471-2105-6-208); something like that tool might do the trick.

There is an element here of trying to coerce a problem into a form that existing tools can solve, but in a way that's what makes it attractive. If we can use things that already exist then we can move from talking about it to actually doing it.

Wednesday, November 06, 2013

ZooKeys, GBIF, and GitHub: fixing Darwin Core Archives part 2

Here's another example of a Darwin Core Archive that is "broken" such that GBIF is missing some information. GBIF data set A checklist to the wasps of Peru (Hymenoptera, Aculeata) comes from Pensoft, and corresponds to the paper:
Rasmussen, C., & Asenjo, A. (2009). A checklist to the wasps of Peru (Hymenoptera, Aculeata). ZooKeys, 15(0). doi:10.3897/zookeys.15.196

As with the previous example GBIF says there are 0 georeferenced records in this dataset. This is odd, because the ZooKeys page for this article lists three supplementary files, including KML files for Google Earth. I've used one to create the image below:

GoogleEarth Image

So, clearly there is georeferenced data here. Looking at the Darwin Core Archive (which I've put on GitHub) there are a bunch of issues with this data. The occurrence.txt file has decimal latitude and longitude values with a comma rather than a decimal point, the file has some character encoding issues, and the columns with latitude and longitude data are labelled as "verbatim" fields rather than "decimal" fields. All of this means GBIF lacks all the point data for this dataset (over 2000 records). If we fix these problems, we get a map like this:



This illustrates one problem with publishing data, namely that data is rarely checked in the same way a manuscript is. "Peer review of data" is a phrase that has always struck me as odd, because you can only really evaluate a dataset by using it. In other words, data almost demands post- rather than pre-publication review. It's only when people start trying to use the data that problems emerge.

At the same time, we could improve checking of data prior to publication. In the case of the Darwin Core Archives I've looked at so far, it would have been easier to find the problems if we had a simple tool that could take a Darwin Core Archive, extract the information, and display it in various ways. If, for example, we have georeferenced records but we don't get a map, we would immediately wonder why, and could figure out what the problem was. At the moment it seems easy to send data to GBIF thinking you are contributing important information, whereas in fact that information never makes it onto a GBIF map.
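The comma-for-decimal-point problem, at least, is easy to detect and fix mechanically. A minimal sketch (the column names are the standard Darwin Core terms; this is illustrative, not the code I ran):

```python
def clean_coordinate(value):
    """Parse a latitude or longitude string, tolerating a comma used as
    the decimal separator (e.g. "-12,05" -> -12.05). Returns None on failure."""
    if not value:
        return None
    try:
        return float(value.strip().replace(",", "."))
    except ValueError:
        return None

def count_georeferenced(rows):
    """Count rows (dicts keyed by Darwin Core terms) with a usable coordinate pair."""
    count = 0
    for row in rows:
        lat = clean_coordinate(row.get("decimalLatitude"))
        lon = clean_coordinate(row.get("decimalLongitude"))
        if (lat is not None and lon is not None
                and -90 <= lat <= 90 and -180 <= lon <= 180):
            count += 1
    return count
```

A pre-publication check as simple as "does count_georeferenced match what you expect?" would have flagged this dataset straight away.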

GBIF and Github: fixing broken Darwin Core Archives

Following on from Annotating and cleaning GBIF data: Darwin Core Archive, GitHub, ORCID, and DataCite here's a quick and dirty example of using GitHub to help clean up a Darwin Core Archive.

The dataset 3i - Cicadellinae Database has 2,152 species and 4,749 taxa, but GBIF says it has no georeferenced data. As a result, the map for this dataset looks like this:

Gbif 3i


I downloaded the Darwin Core Archive and was puzzled because the occurrence.txt file contained in the archive has latitude and longitude pairs for some of the records. How come there is no map? After a bit of fussing I discovered that the meta.xml file that describes the data is broken. It lists a column which doesn't appear in the data file, so everything after that column gets shifted along and hence the column headings for latitude and longitude are out of alignment with the data.
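A simple consistency check would have caught this: compare the number of columns declared in meta.xml with the number actually present in the data file. A rough sketch (assuming the standard Darwin Core text namespace):

```python
import xml.etree.ElementTree as ET

DWC = "{http://rs.tdwg.org/dwc/text/}"

def declared_column_count(meta_xml):
    """Number of columns the core file claims to have, from the highest
    index attribute on the <id> and <field> elements."""
    core = ET.fromstring(meta_xml).find(DWC + "core")
    elements = core.findall(DWC + "field") + core.findall(DWC + "id")
    indexes = [int(el.get("index")) for el in elements
               if el.get("index") is not None]
    return max(indexes) + 1

def actual_column_count(header_line, delimiter="\t"):
    """Number of columns actually present in the data file's first row."""
    return len(header_line.rstrip("\r\n").split(delimiter))
```

If the two counts disagree, everything after the missing column gets shifted, which is exactly the failure seen here.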

So, I loaded the Darwin Core Archive into GitHub (you can see it here), then fixed the error, and then for fun extracted the latitude and longitude pairs as a GeoJSON file. GitHub can display this on a map:


Note that we now have a fairly extensive set of georeferenced data points for these insects, and this data hasn't made it onto a GBIF map because of a simple error in the metadata. I keep finding cases like this, which suggests that GBIF has more georeferenced data than it realises.
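Extracting the coordinates as GeoJSON is almost a one-liner. A sketch (note that GeoJSON uses [longitude, latitude] order, the reverse of the usual lat/long convention):

```python
import json

def to_geojson(points):
    """Turn an iterable of (longitude, latitude) pairs into a GeoJSON
    FeatureCollection, which GitHub renders as an interactive map."""
    return {
        "type": "FeatureCollection",
        "features": [
            {
                "type": "Feature",
                "geometry": {"type": "Point", "coordinates": [lon, lat]},
                "properties": {},
            }
            for lon, lat in points
        ],
    }

def write_geojson(points, path):
    """Serialise the FeatureCollection to a .geojson file for the repository."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(to_geojson(points), f)
```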

Friday, November 01, 2013

Annotating and cleaning GBIF data: Darwin Core Archive, GitHub, ORCID, and DataCite

This is a quick sketch of a way to combine existing tools to help clean and annotate data in GBIF, particularly (but not exclusively) occurrence data.

GitHub


The data provider puts a Darwin Core Archive (expanded, not zipped) into a GitHub repository. GBIF forks the repository, cleans the data, and uploads that to GBIF to populate the database behind the portal.

DOI


When GBIF first loads the repository it assigns it a DOI (using, say, DataCite). Actually, we assign two DOIs: one for this version of the data (e.g., 10.1234/data.v1) and one for all versions of the data, say 10.1234/data. The data is considered to be published, and authorship is determined by the provider, which may be an individual, a project, an institution, etc.

Big scale annotation and cleaning


Anyone familiar with GitHub can fork the repository of data and do their own cleaning (e.g., fixing dates, latitudes and longitudes, links to taxon names, etc.).

Small scale, casual annotation


Anyone visiting the GBIF portal and noticing an error (or something that they want to comment on) does so on the portal. Behind the scenes these comments are stored as issues on the GBIF repository in GitHub. To do this GBIF can either (a) enable users with an existing GitHub account to link that to their GBIF user account, or (b) create a GitHub account for the user. The user need not actually interact directly with GitHub (a similar approach is described by Mark Holder for the social curation of phylogenetic studies).
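The GitHub side of this is almost trivial. A sketch of filing an issue through the REST API (the repository and token are placeholders, and this is not GBIF's or VertNet's actual code):

```python
import json
import urllib.request

def issue_payload(title, body):
    """JSON body for a new GitHub issue."""
    return {"title": title, "body": body}

def create_issue(owner, repo, token, title, body):
    """File an issue against owner/repo; returns the created issue as JSON.
    A form on the portal could build title/body from the occurrence record."""
    req = urllib.request.Request(
        f"https://api.github.com/repos/{owner}/{repo}/issues",
        data=json.dumps(issue_payload(title, body)).encode("utf-8"),
        headers={"Authorization": "token " + token,
                 "Accept": "application/vnd.github+json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```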

This means all annotation, big or small, is in the open and on GitHub. There is very little programming to do: GBIF simply talks to GitHub using GitHub's API. GBIF could display known "issues" for a dataset, so portal users immediately know if any data has been flagged as problematic.

All the annotations belong to the "community", in the sense that each annotation is linked to a GitHub user (even if the user might not ever actually go to GitHub). This also means that the provider can, at any point, pull in those annotations to update their own data (and hence gain direct benefit from exposing it in the first place).

Updating


When GBIF decides that enough annotations have been made and resolved, the latest version of the repository is loaded into GBIF and gets a new DOI (e.g., 10.1234/data.v2). This means an analysis based on that version is citable. We add a link to the overall DOI so someone who doesn't care about versions can still cite the data.

Authorship and credit


Now we come to the fun part. The revision will include input from a bunch of people. This will be recorded on GitHub, but that will only mean something to the handful of geeks who think GitHub is awesome. But, let's imagine that we do the following:

  1. Anyone with a GBIF account can link that to their ORCID (if you are a researcher you really should have one of these).
  2. Anyone contributing to this version of the repository gets authorship (appended to the end of the list, so the original provider is first author).
  3. GBIF uses the ORCID API to automatically load the DOI of the new version of the dataset onto the list of works for each contributor. They instantly get credit as a co-author of a citable dataset, and this appears on their ORCID profile.

Benefits



This approach has a number of benefits:
  1. It creates citable data
  2. It gives credit in a way many people will recognise (authorship of a citable work that has a DOI)
  3. The annotations are freely available, there is a complete version history, anyone can contribute at whatever scale suits them.
  4. Anyone can grab the repo at any time and load it into their own system, including the original provider, who can see what people have added to their original data.
  5. There is virtually no programming to do, no new domain-specific protocols; everything is pretty much in place. GitHub does versioning, DataCite does citable identifiers, ORCID handles identity and credit.

Caveats



There are a couple of potential issues. Darwin Core Archive data files can be large, and GitHub can be less effective with large files (although it is ideally suited to the delimited-text files that Darwin Core Archive uses, see Git (and Github) for Data). One approach is to impose a limit on the size of an individual "occurrence.txt" file in the archive, so we may have multiple files, none of which is too big. Another task will be linking issues to specific occurrences (if they concern just one occurrence), as GitHub issues will be at the level of the complete file. This could be handled by a form-based interface on GBIF that sent the occurrenceID as part of the issue report.

Summary


The key point of this proposal is that everything is in place already. The ducks are lining up, and serious, credible projects are handling the things we need (versioning, identifiers, credit). Sometimes the smart thing is to do nothing and wait until someone else solves the problems you face. I think the waiting may be over.

Wednesday, August 14, 2013

Cluster maps, papaya plots, and the trouble with GBIF taxonomy

Continuing the theme of the failings of the GBIF classification I've been playing further with cluster maps to visualise the problem (see this earlier post for an introduction).

Browsing through bats in GBIF I keep finding the same species appearing more than once, albeit in different genera. As discussed in the gibbon example, GBIF merges several competing classifications for mammals, and these often don't agree on the "accepted name" for a species. In the absence of a decent database of taxonomic synonyms, GBIF ends up duplicating species, and each duplicate is often associated with different occurrence data. If you are trying to get the distribution of a species this can be a disaster.

To get a sense of the scale of the problem I put together a simple tool to create cluster maps. The code is on GitHub and there is a live service at http://iphylo.org/~rpage/cluster-map/. The service takes a simple tab-delimited file that lists sets and their members, computes the overlap between the sets, calls Graphviz to lay out a graph in SVG, then draws in the members of each cluster (phew).
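The heart of the computation is just pairwise set intersection. Here's a hypothetical sketch that emits Graphviz DOT (the real tool drives Graphviz to SVG and then draws in the members, but the overlap step is the same idea):

```python
from itertools import combinations

def overlap_graph(sets):
    """sets: dict mapping a set name (e.g. a genus) to a set of members
    (e.g. specific epithets). Returns Graphviz DOT in which an edge joins
    any two sets sharing members, labelled with the size of the overlap."""
    lines = ["graph overlaps {"]
    for name in sets:
        lines.append(f'  "{name}";')
    for a, b in combinations(sets, 2):
        shared = sets[a] & sets[b]
        if shared:
            lines.append(f'  "{a}" -- "{b}" [label="{len(shared)}"];')
    lines.append("}")
    return "\n".join(lines)
```

Feeding the DOT output to Graphviz's neato or fdp then gives the clusters their layout.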

The input file looks something like this:

Molossops aequatorianus
Chaerephon aloysiisabaudiae
Tadarida aloysiisabaudiae
Chaerephon ansorgei
Tadarida ansorgei
Molossus ater
Mormopterus petrophilus
Sauromys petrophilus


What can we do with this tool? Well, I created a quick list of all the species of bat in the family Molossidae according to GBIF. The sets are the bat genera, the members are the species (you can see the file here). I then ran this through the cluster map, and got something like this (this is only part of the cluster map):

Bats

(now can you see why I call these "papaya plots"?). Note that there are species names (i.e., specific epithets) shared by more than one genus. Some of these may be perfectly OK (it's not unusual for the same epithet to be used in different genera, e.g. "major", etc.). But in many cases these bat species turn out to be the same species, just in different genera in different classifications. For example, GBIF has both Cynomops greenhalli and Molossops greenhalli. These are the same thing. Species in the genus Mormopterus may also occur in other genera. In some cases the issue is competing classifications, sometimes it is conflict over whether a species is a species or merely a subspecies, and some generic conflicts are because some genera are relegated to subgeneric status in some classifications. In short, it's an unholy mess.

Does this matter? Well, consider Mormopterus petrophilus and Sauromys petrophilus, both of which GBIF regards as valid species (they're the same thing). Here are the distributions for the two different names in GBIF:

Mormopterus / Sauromys


Depending on which name you use you'll get a very different picture of the distribution of this bat.

The next step is to figure out how to fix this. Is there a way we can automate fixing the GBIF classification so that it is not riddled with spurious duplicates like these?

Wednesday, July 17, 2013

Augmenting ZooKeys bibliographic data to flesh out the citation graph

In a previous post (Learning from eLife: GitHub as an article repository) I discussed the advantages of an Open Access journal putting its article XML in a version-controlled repository like GitHub. In response to that post Pensoft (the publisher of ZooKeys) did exactly that, and the XML is available at https://github.com/pensoft/ZooKeys-xml.

OK, "now what?" I hear you ask. Originally I'd used the example of incorrect bibliographic data for citations as the motivation, but there are other things we can do as well. For example, when reading a ZooKeys article (say, using my eLife Lens-inspired viewer) I notice references that should have a DOI but which don't. With the XML available I could add this. This adds another link in the citation graph (in this case connecting the ZooKeys paper with the article it cites). If Pensoft were to use that XML to regenerate the HTML version of the article on their web site then the reader will be able to click on the DOI and read the cited article (instead of the "cut-and-paste-and-Google-it" dance). Furthermore, Pensoft could update the metadata they've submitted to CrossRef, so that CrossRef knows that the reference with the newly added DOI has been cited by the ZooKeys paper.

To experiment with this I've written some scripts that take ZooKeys XML, extract each citation from the list of literature cited, and look up a DOI for each reference that lacks one (using the CrossRef metadata search API). If a DOI is found then I insert it into the original XML. I then push this XML to my fork of Pensoft's repository (https://github.com/rdmpage/ZooKeys-xml). I can then ask Pensoft to update their repository (by issuing a "pull request"), and if Pensoft like what they see, they can accept my edits.
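For illustration, here's roughly how such a DOI lookup can be done against CrossRef's REST API (a sketch, not my actual script; the score threshold is an arbitrary cut-off you'd tune by inspection):

```python
import json
import urllib.parse
import urllib.request

def query_url(citation, rows=1):
    """Build a CrossRef works query URL for a free-text citation string."""
    return ("https://api.crossref.org/works?rows=%d&query.bibliographic=%s"
            % (rows, urllib.parse.quote(citation)))

def best_doi(citation, min_score=75.0):
    """Return the top-scoring DOI for a citation, or None if the best match
    scores below min_score (to avoid inserting spurious DOIs into the XML)."""
    with urllib.request.urlopen(query_url(citation)) as resp:
        items = json.load(resp)["message"]["items"]
    if items and items[0].get("score", 0.0) >= min_score:
        return items[0]["DOI"]
    return None
```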

Automating the process makes this much more scalable, although manual editing will still be useful in some cases, especially where the original references haven't been correctly atomised into title, journal, etc.

So that the output is visible independently of Pensoft deciding whether to accept it, I've updated my Zookeys article viewer to fetch the XML not from the ZooKeys web site, but from my GitHub repository. This means you get the latest version of the XML, complete with additional DOIs (if any have been added).

Initial experiments are encouraging, but it's also apparent that lots of citations lack DOIs. However, this doesn't mean that they aren't online. Indeed, a growing number of articles are available through my BioStor repository, and through BioNames. Both of these sites have an API, so the next step is to add them to the script that augments the XML. This brings us a little closer to the ultimate goal of having every taxonomic paper online and linked to every paper that either cites, or is cited by, that paper.

Friday, July 12, 2013

Learning from eLife: GitHub as an article repository

Playing with my eLife Lens-inspired article viewer and some recent articles from ZooKeys I regularly come across articles that are incorrectly marked up. As a quick reminder, my viewer takes the DOI for a ZooKeys article (just append it to http://bionames.org/labs/zookeys-viewer/?doi=, e.g. http://bionames.org/labs/zookeys-viewer/?doi=10.3897/zookeys.316.5132), fetches the corresponding XML and displays the article.

Taking the article above as an example, I was browsing the list of literature cited and trying to find those articles in BioNames or BioStor. Sometimes an article that should have been found wasn't, and on closer investigation the problem was that the ZooKeys XML had mangled the citation. To illustrate, take the following XML:

<ref id="B112"><mixed-citation xlink:type="simple"><person-group><name name-style="western"><surname>Tschorsnig</surname> <given-names>HP</given-names></name><name name-style="western"><surname>Herting</surname> <given-names>B</given-names></name></person-group> (<year>1994</year>) <article-title>Die Raupenfliegen (Diptera: Tachinidae) Mitteleuropas: Bestimmungstabellen und Angaben zur Verbreitung und Ökologie der einzelnen Arten. Stuttgarter Beiträge zur Naturkunde.</article-title> <source>Serie A (Biologie)</source> <volume>506</volume>: <fpage>1</fpage>-<lpage>170</lpage>.</mixed-citation></ref>

I've highlighted the contents of the article-title (title) and source (journal) tags, respectively. Unfortunately the journal name has been split across the two tags; the title and journal should actually look like this:

<ref id="B112"><mixed-citation xlink:type="simple"><person-group><name name-style="western"><surname>Tschorsnig</surname> <given-names>HP</given-names></name><name name-style="western"><surname>Herting</surname> <given-names>B</given-names></name></person-group> (<year>1994</year>) <article-title>Die Raupenfliegen (Diptera: Tachinidae) Mitteleuropas: Bestimmungstabellen und Angaben zur Verbreitung und Ökologie der einzelnen Arten.</article-title> <source>Stuttgarter Beiträge zur Naturkunde. Serie A (Biologie)</source> <volume>506</volume>: <fpage>1</fpage>-<lpage>170</lpage>.</mixed-citation></ref>

Tools that rely on accurately parsed metadata to find articles, such as OpenURL resolvers, will fail in cases like this. Of course, we could use tools that don't have this requirement, but we could also fix the XML so that OpenURL resolution succeeds.

This is where the example of the journal eLife comes in. They deposit article XML in GitHub where anyone can grab it and mess with it. Let's imagine we did the same for ZooKeys, created a GitHub repository for the XML, and then edited it in cases where the article metadata is clearly broken. A viewer like mine could then fetch the XML, not from ZooKeys, but from GitHub, and thus take advantage of any corrections made.

We could imagine this as part of a broader workflow that would also incorporate other sources of articles, such as BHL. We could envisage workflows that take BHL scans, convert them to editable XML, then repurpose that content (see BHL to PDF workflow for a sketch). I like the idea that there is considerable overlap between the most recent publishing ventures (such as eLife and ZooKeys) and the goal of bringing biodiversity legacy literature to life.

Sunday, July 01, 2012

Using orthographic projections to map organism distributions

For a project I'm currently working on I show organism distributions using data from GBIF, and I display that data on a map that uses the equirectangular projection. I've recently started to create a series of base maps using the GBIF colour scheme, which is simple but effective:

  • #666698 for the sea
  • #003333 for the land
  • #006600 for borders
  • yellow for localities


The distribution map is created by overlaying points on a bitmap background using SVG (see SVG specimen maps from SPARQL results for details). SVG is ideally suited to this because you can take the points, plot them in the x,y plane (where x is longitude and y is latitude) then use SVG transformations to move them to the proper place on the map.
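For the record, the lon/lat to x,y mapping for an equirectangular base map is just a linear rescaling (assuming a 360×180 image; y increases downwards, as in SVG):

```python
def equirectangular(lon, lat, width=360, height=180):
    """Map (longitude, latitude) in degrees to (x, y) pixel coordinates
    on an equirectangular base map of the given size."""
    x = (lon + 180.0) * width / 360.0
    y = (90.0 - lat) * height / 180.0
    return x, y
```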

For the base maps themselves I've also started to use SVG, partly because it's possible to edit them with a text editor (for example if you want to change the colours). I then use Inkscape to export the SVG to a PNG to use on the web site.

Gbif360x180

One thing that has bothered me about the equirectangular projection is that, although it is familiar and easy to work with, it gives a distorted view of the world:



This is particularly evident for organisms that have a circumpolar distribution. For example, Kerguelen's petrel Aphrodroma has a distribution that looks like this using the equirectangular projection:

A1

This long, thin distribution looks rather different if we display it on a polar projection:
A2

Likewise, classic Gondwanic distributions such as that of Gripopterygidae become clearer on a polar projection.

g

Computing the polar coordinates for a set of localities is straightforward (see for example this page), and using SVG to lay out the points also helps, because it's trivial to rotate them so that they match the orientation of the map. Ultimately it would be nice to have an embedded, rotatable 3D globe (like the Google Earth plugin, or a JavaScript+SVG approach like this). But for now it's nice to have the option of different projections to help display distributions more faithfully.
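To give the flavour of the calculation, here's a sketch of an azimuthal equidistant projection centred on the South Pole (orientation and scaling are my own choices, not necessarily those used for the maps above):

```python
import math

def south_polar(lon, lat):
    """Project (longitude, latitude) in degrees onto a plane centred on the
    South Pole: radius is proportional to angular distance from the pole,
    angle is the longitude. Returns (x, y) with the pole at the origin."""
    r = 90.0 + lat            # 0 at the South Pole, 180 at the North Pole
    theta = math.radians(lon)
    return r * math.sin(theta), -r * math.cos(theta)
```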

The bitmap maps and their SVG sources are available on GitHub.

Thursday, June 28, 2012

Where is the "crowd" in crowdsourcing? Mapping EOL Flickr photos

In any discussion of data gathering or data cleaning the term "crowdsourcing" inevitably comes up. An example where this approach has been successful is the Encyclopedia of Life's Flickr pool, where Flickr users upload images that are harvested by EOL.

Given that many Flickr photos are taken with cameras that have built-in GPS (such as the iPhone, the most common camera on Flickr) we could potentially use the Flickr photos not only as a source of images of living things, but to supplement existing distributional data. For example, Flickr has enough data to fairly accurately construct outlines of countries, cities, and neighbourhoods, see The Shape of Alpha, so what about organismal distribution?

This question is part of a Masters project by Jonathan McLatchie here at Glasgow, comparing distributions of taxa in GBIF with those based on Flickr photos. As part of that project the question arose "where are the Flickr photos being taken?" If most of the photos are being taken in the developed world, then there are at least two problems. The first is the obvious bias against organisms that live elsewhere (i.e., typically many photos won't be taken in those regions where you'd actually like to get more data). Secondly, the presence of zoos, wildlife parks, and botanical gardens means you are likely to get images of organisms well outside their natural range.

Jonathan suggested a "heatmap" of the Flickr photos would help, so to create this I wrote a script to grab metadata for the photos from the Encyclopedia of Life's Flickr pool, extract latitude and longitude, and draw the resulting locations on a map. I aggregated the points into 1°×1° squares, and generated a GBIF-style map of the photos:

Screenshot

Lots of photos from North America, Europe, and Australasia, as one might expect. Coverage of the rest of the globe is somewhat patchy. I guess the key question is to what extent the "crowd" (Flickr users in this case) is simply replicating the sampling biases already present in projects like GBIF, which aggregate data from museum collections (most of which are in the developed world).
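The aggregation step itself is simple; a sketch (in Python for brevity, whereas the actual script is PHP):

```python
import math
from collections import Counter

def bin_points(points):
    """Aggregate (longitude, latitude) pairs into counts per 1-degree by
    1-degree cell, keyed by the cell's south-west corner."""
    counts = Counter()
    for lon, lat in points:
        counts[(math.floor(lon), math.floor(lat))] += 1
    return counts
```

Each cell can then be drawn as a 1°×1° square coloured by its count, GBIF-style.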

The PHP code to fetch the photo data and create the map is available on GitHub. You'll need a Flickr API key to run the script. The GitHub repository has an SVG version of the map (with a bitmap background). A bitmap copy of the map is available on FigShare: http://dx.doi.org/10.6084/m9.figshare.92668.