iPhylo: GeoJSON


Wednesday, June 24, 2015

Visualising Geophylogenies in Web Maps Using GeoJSON

I've published a short note on my work on geophylogenies and GeoJSON in PLoS Currents Tree of Life:

Page R. Visualising Geophylogenies in Web Maps Using GeoJSON. PLOS Currents Tree of Life. 2015 Jun 23. Edition 1. doi:10.1371/currents.tol.8f3c6526c49b136b98ec28e00b570a1e.

At the time of writing the DOI hasn't registered, so the direct link is here. There is a GitHub repository for the manuscript and code.

I chose PLoS Currents Tree of Life because it is (supposedly) quick and cheap. Unfortunately a perfect storm of delays in reviewing together with licensing issues resulted in the paper taking nearly three months to appear. The licensing issues were a headache. PLoS uses the Creative Commons CC-BY license for all its content. Unfortunately, the original submission included maps from Google Maps and OpenStreetMap (OSM), to show that the GeoJSON produced by my tool could work with either. Google Maps tile imagery is not freely available, so I had to replace that in order for PLoS to be able to publish my figures. At first I simply replaced the tiles Google Maps displays with ones from OSM, but those tiles are CC-BY-SA, which is incompatible with PLoS's use of CC-BY. Argh! I got stroppy about this on Twitter:

FFS. So it appears I can't use either Google Maps or Open Street Map in a @PLOSCurrents article. Open licensing somehow feels worse than ©
— Roderic Page (@rdmpage) June 16, 2015

Eventually I discovered maps from CartoDB that have CC-BY licenses, and so could be used in the PLoS Currents article. After replacing the Google and OSM tiles with these maps (and trimming off the "Google" logo) the figures were acceptable to PLoS. Increasingly I think Creative Commons has resulted in a mess of mutually incompatible licenses that make mashing things up hard. The idea was great ("skip the intermediaries" by declaring that your content can be used), but the outcome is messy and frustrating.

But, enough grumbling. The article is out, the code is in GitHub. Now to think about how to use it.

Thursday, January 22, 2015

GeoJSON and geophylogenies

For the last few weeks I've been working on a little project to display phylogenies on web-based maps such as OpenStreetMap and Google Maps. Below I'll sketch out the rationale, but if you're in a hurry you can see a live demo here: http://iphylo.org/~rpage/geojson-phylogeny-demo/, and some examples below.

The first is the well-known example of Banza katydids from doi:10.1016/j.ympev.2006.04.006, which I used in 2007 when playing with Google Earth.

The second example shows DNA barcodes similar to ABFG379-10 for Proechimys guyannensis and its relatives.

Background

People have been putting phylogenies on computer-based maps for a while, but in most cases these have required stand-alone software, such as Google Earth, or GeoJSON for encoding geographic information. Despite the obvious appeal of placing trees in maps, and calls for large-scale geophylogeny databases (e.g., doi:10.1093/sysbio/syq043), drawing trees on maps by computer has remained a bit of a niche activity. I think there are several reasons for this:

Drawing trees on maps needs both a tree and geographic localities for the nodes in the tree. The latter are not always readily available, or may be in different databases to the source of phylogenetic data.
There's no accepted standard for encoding geographic information associated with the leaves in a tree, so everyone pretty much invents their own format.
To draw the tree we typically need standalone software. This means users have to download software, instead of working on the web (which is where all the data is).
Geographic formats such as KML (used by Google Earth) are not particularly easy to store and index in databases.

So there are a number of obstacles to making this easy. The increasing availability of geotagged sequences in GenBank (see Guest post: response to "Putting GenBank Data on the Map"), especially DNA barcodes, helps. For the demo I created a simple pipeline to take a DNA barcode, query BOLD for similar sequences, retrieve those, align them, build a neighbour joining tree, annotate the tree with latitudes and longitudes, and encode that information in a NEXUS file.

To lay out the tree on a map (say OpenStreetMap using Leaflet or Google Maps) I convert the NEXUS file to GeoJSON. There are a couple of problems to solve when doing this. Typically when drawing a phylogeny we compute x and y coordinates for a device such as a computer screen or printer where these coordinates have equal units and are linear in both horizontal and vertical dimensions. In web maps coordinates are expressed in terms of latitude and longitude, and in the widely-used Web Mercator projection the vertical axis (latitude) is non-linear. Furthermore, on a web map the user can zoom in and out, so pixel-based coordinates only make sense with respect to a given zoom level.

To tackle this I compute the layout of the tree in pixels at zoom level 0, when the web map comprises a single "tile".
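As a rough sketch of the arithmetic involved (not the code from my tool), here is how pixel positions on the zoom level 0 tile map to and from latitude and longitude. It assumes the usual 256 pixel tile and the standard Web Mercator formulas; the function names and `TILE_SIZE` are made up for this illustration.

```python
import math

TILE_SIZE = 256  # assumed width/height in pixels of the single zoom-0 tile


def latlon_to_pixel(lat, lon):
    """Project latitude/longitude (degrees) to pixel x, y on the zoom-0 tile."""
    x = (lon + 180.0) / 360.0 * TILE_SIZE
    sin_lat = math.sin(math.radians(lat))
    # Mercator stretches latitude non-linearly; y is measured down from the top edge.
    y = (0.5 - math.log((1 + sin_lat) / (1 - sin_lat)) / (4 * math.pi)) * TILE_SIZE
    return x, y


def pixel_to_latlon(x, y):
    """Inverse mapping: pixel x, y on the zoom-0 tile back to degrees."""
    lon = x / TILE_SIZE * 360.0 - 180.0
    n = math.pi * (1 - 2 * y / TILE_SIZE)
    lat = math.degrees(math.atan(math.sinh(n)))
    return lat, lon
```

With something like this, a tree can be laid out in ordinary pixel coordinates and each node passed through `pixel_to_latlon` before being written out. Equal steps in latitude do not give equal steps in `y`, which is the non-linearity mentioned above.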

The tile coordinates are then converted to latitude and longitude, so that they can be placed on the map. The map applications take care of zooming in and out, so the tree scales appropriately. The actual sampling localities are simply markers on the map. Another problem is to reduce the visual clutter that results from criss-crossing lines connecting the tips of the tree and the associated sampling localities. To make the diagram more comprehensible, I adopt the approach used by GenGIS to reorder the nodes in the tree to minimise the crossings (see algorithm in doi:10.7155/jgaa.00088). The tree and the lines connecting it to the localities are encoded as "LineString" objects in the GeoJSON file.
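To give a feel for what the resulting file looks like, here is a minimal sketch that builds a GeoJSON FeatureCollection with one tree branch, one connector from a tip to its locality, and the locality marker. The coordinates, the `kind` and `label` properties, and the helper names are all hypothetical; the real output of my tool may be organised differently. The only fixed rule relied on here is that GeoJSON positions are written longitude first, then latitude.

```python
import json


def line_feature(coords, kind):
    """Wrap a list of (lon, lat) pairs as a GeoJSON LineString feature."""
    return {
        "type": "Feature",
        "geometry": {
            "type": "LineString",
            "coordinates": [list(c) for c in coords],
        },
        # "kind" is a hypothetical property, used only to tell lines apart.
        "properties": {"kind": kind},
    }


def locality_feature(lon, lat, label):
    """Wrap a sampling locality as a GeoJSON Point feature (a map marker)."""
    return {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [lon, lat]},
        "properties": {"label": label},
    }


# Hypothetical positions, given as (lon, lat).
parent = (166.0, -19.0)    # internal node of the tree
tip = (166.0, -20.0)       # leaf of the tree
locality = (165.5, -21.5)  # where the specimen was sampled

collection = {
    "type": "FeatureCollection",
    "features": [
        line_feature([parent, tip], "branch"),
        line_feature([tip, locality], "connector"),
        locality_feature(locality[0], locality[1], "sample 1"),
    ],
}

print(json.dumps(collection, indent=2))
```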

There are a couple of things which could be done with this kind of tool. The first is to add it as a visualisation to a set of phylogenies or occurrence data. For example, imagine my "million barcode map" having the ability to display a geophylogeny for any barcode you click on.

Another use would be to create a geographically indexed database of phylogenies. There are databases such as CouchDB that store JSON as a native format, and it would be fairly straightforward to consume GeoJSON for a geophylogeny, ignore the bits that draw the tree on the map, and index the localities. We could then search for trees in a given region, and render them on a map.
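A toy version of that idea, with an in-memory dictionary standing in for the database: pull out the Point features (the localities), ignore the LineStrings that draw the tree, and test the points against a bounding box. The tree identifiers, coordinates, and function names are invented for the example, and the simple box test assumes the region does not straddle the antimeridian.

```python
def locality_points(geojson):
    """Return (lon, lat) for every Point feature, skipping the tree's lines."""
    points = []
    for feature in geojson.get("features", []):
        geometry = feature.get("geometry") or {}
        if geometry.get("type") == "Point":
            lon, lat = geometry["coordinates"][:2]
            points.append((lon, lat))
    return points


def trees_in_region(trees, west, south, east, north):
    """Return ids of trees with at least one locality inside the bounding box."""
    hits = []
    for tree_id, geojson in trees.items():
        if any(west <= lon <= east and south <= lat <= north
               for lon, lat in locality_points(geojson)):
            hits.append(tree_id)
    return hits


# Two hypothetical geophylogenies keyed by made-up identifiers.
trees = {
    "crickets": {
        "type": "FeatureCollection",
        "features": [
            {"type": "Feature", "properties": {},
             "geometry": {"type": "LineString",
                          "coordinates": [[166.0, -20.0], [165.5, -21.5]]}},
            {"type": "Feature", "properties": {},
             "geometry": {"type": "Point", "coordinates": [165.5, -21.5]}},
        ],
    },
    "katydids": {
        "type": "FeatureCollection",
        "features": [
            {"type": "Feature", "properties": {},
             "geometry": {"type": "Point", "coordinates": [-157.8, 21.4]}},
        ],
    },
}

print(trees_in_region(trees, west=160, south=-25, east=170, north=-18))
```

A real database would use a proper spatial index rather than a linear scan, but the division of labour is the same: index the points, keep the lines only for display.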

There's still some work to do (I need to make the orientation of the tree optional and there are some edge cases that need to be handled), but it's starting to reach the point when it's fun just to explore some examples, such as these microendemic Agnotecous crickets in New Caledonia (data from doi:10.1371/journal.pone.0048047 and GBIF).

Thursday, May 02, 2013

GBIF data quality: visualising Mesibov's millipedes

Bob Mesibov (who has been a guest author on this blog) recently published a paper on data quality in ZooKeys:

Mesibov, R. (2013). A specialist’s audit of aggregated occurrence records. ZooKeys, 293(0), 1–18. doi:10.3897/zookeys.293.5111

In this paper Bob documents some significant discrepancies between data in his Millipedes of Australia (MoA) database and the equivalent data in the Atlas of Living Australia and GBIF (disclosure, I was a reviewer of the paper, and also sit on GBIF's science committee). This paper spawned a thread on TAXACOM, and also came up at the GBIF meeting I was at earlier this week.

One thing lacking from the discussion is a clear sense of just how big the discrepancies between GBIF and MoA data are, so I grabbed the data provided by Bob (http://dx.doi.org/10.3897/zookeys.293.5111.app) and extracted the records where GBIF and MoA disagreed. I converted these to GeoJSON and threw them on Google Maps:


You can see a live version here http://bl.ocks.org/rdmpage/raw/5501293/ (it can take a little while for the map to appear). I've connected the MoA and GBIF localities for the same occurrence by a straight line, and the MoA records are encircled by an estimate of their uncertainty (for many records the circle is invisible at this scale).
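For anyone wanting to try something similar, here is a small sketch of how a pair of conflicting records could be turned into a GeoJSON line, with the great-circle distance between the two localities stored as a property. This is not the script I used for the map; the record id, the coordinates, and the property names are invented, and the distance uses the haversine formula with a mean Earth radius of 6371 km.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def discrepancy_feature(record_id, moa, gbif):
    """Build a LineString joining the MoA and GBIF localities of one record.

    moa and gbif are (lat, lon) tuples; GeoJSON wants lon, lat, so the
    order is swapped when the coordinates are written out.
    """
    return {
        "type": "Feature",
        "geometry": {
            "type": "LineString",
            "coordinates": [[moa[1], moa[0]], [gbif[1], gbif[0]]],
        },
        "properties": {
            "id": record_id,  # hypothetical property names
            "distance_km": haversine_km(moa[0], moa[1], gbif[0], gbif[1]),
        },
    }


# Made-up example record: the same occurrence placed in two different spots.
feature = discrepancy_feature("example-1", moa=(-41.5, 146.5), gbif=(-41.6, 146.7))
print(feature["properties"]["distance_km"])
```

Sorting the features by `distance_km` would be one quick way to separate the spectacular discrepancies from the small displacements.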

There are some fairly spectacular discrepancies, and a lot of relatively small scale displacements of records. Does this matter? The answer to this question will depend on what people want to do with the data. You may regard the discrepancies as serious (certainly it's interesting that there are so many differences between the two data sets), or minor given the geographic scale. But visualising them at least makes it possible to form a judgement.