
Friday, April 15, 2016

The Zika virus, GBIF, and the missing mosquitoes

One of GBIF's goals is to provide up-to-date, comprehensive data on the distribution of species. Although GBIF's taxonomic and geographic scope is global, not all species are equal, in the sense that the need for information on some species is potentially much more pressing. An example is the mosquito genus Aedes, which includes the species A. aegypti and A. albopictus that spread the Zika virus.

Over the last few days I discovered how poor GBIF's coverage of these two vectors is, and a way to fix that gap quickly. Like many things I work on, I stumbled across the problem by accident. GBIF has released a report on whether GBIF data are fit for modeling species distributions. The publicity material included a psychedelic image showing a map for Aedes aegypti from a recent eLife paper by Kraemer et al. (The global distribution of the arbovirus vectors Aedes aegypti and Ae. albopictus, http://doi.org/10.7554/elife.08347).

[Figure: detail of the global Aedes aegypti distribution map from Kraemer et al. 2015]

Curious, I read the paper and the phrase "GBIF" occurs only once in the text:

we selected 10,000 occurrence records of Aedes species from the Global Biodiversity Information Facility (http://www.gbif.org), omitting all records of Ae. aegypti and Ae. albopictus. This dataset is intended to reflect biases in mosquito reporting in areas which are suitable for Aedes mosquitoes.

So, GBIF data on these two mosquitoes wasn't used. A quick look at what GBIF had for Aedes albopictus shows why GBIF data played such a small role:

[Map of GBIF occurrence records for Aedes albopictus (GBIF taxon 1651430), showing sparse coverage]

Compare this with the data shown in the Scientific Data paper (http://doi.org/10.1038/sdata.2015.35) describing the data that underpins the eLife paper.

[Figure 3 from the Scientific Data paper: global map of the compiled Aedes occurrence records]

Note the striking lack of any GBIF records from Brazil. Fortunately the data collected by Kraemer et al. are freely available in Dryad (http://doi.org/10.5061/dryad.47v3c), so I grabbed the files, fussed about with them a bit (https://github.com/rdmpage/global-distribution-arbovirus-vectors) to get them into the format required by GBIF, and uploaded them. Below is the data for Aedes albopictus in GBIF:

[Map of GBIF occurrence records for Aedes albopictus after the upload, showing much denser, global coverage]

This is looking more like it! If you are more interested in Aedes aegypti then that data is also available.
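
For the curious, the "fussing about" amounted to reshaping the Dryad CSV into a simple Darwin Core occurrence table. Below is a minimal sketch of that kind of conversion in JavaScript (Node); the column names are assumptions for illustration only, and the actual scripts are in the GitHub repository linked above.

```javascript
// Sketch of the reshaping involved, not the actual conversion script.
// Column names such as VECTOR, X, and Y are assumptions; check the Dryad files.
var fs = require('fs');

var rows = fs.readFileSync('aegypti_albopictus.csv', 'utf8')
  .trim().split('\n')
  .map(function (line) { return line.split(','); }); // naive CSV parsing
var header = rows.shift();
function col(name) { return header.indexOf(name); }

// The minimal Darwin Core occurrence terms needed to put dots on a GBIF map
var out = ['occurrenceID\tscientificName\tdecimalLatitude\tdecimalLongitude\tbasisOfRecord'];
rows.forEach(function (row, i) {
  out.push([
    'kraemer-' + (i + 1), // made-up local record identifier
    row[col('VECTOR')] === 'aegypti' ? 'Aedes aegypti' : 'Aedes albopictus',
    row[col('Y')],        // latitude
    row[col('X')],        // longitude
    'HUMAN_OBSERVATION'
  ].join('\t'));
});
fs.writeFileSync('occurrence.txt', out.join('\n'));
```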

Questions

This example raises a number of questions:

  1. How come GBIF had such poor data to start with? If GBIF is going to be relevant to people who need biodiversity data, in some cases urgently, then there's an argument to be made that GBIF should be targeting species such as disease vectors that are likely to be in demand in the future.
  2. Why wasn't the latest data in GBIF? One reason GBIF's data was poor is that the relevant data was widely scattered in the literature (Kraemer et al. list over 1000 papers that they looked at, not including the unpublished sources). This clearly requires a lot of effort to assemble. But once assembled, why wasn't it deposited in GBIF? Is it a case of researchers not thinking this would be a useful thing to do, or not knowing how to do it?
  3. What about all the other data out there? This particular example was prompted by me wondering what that hideous image on the GBIF post was, reading the eLife article, wondering where the data was, and having sufficient access to GBIF to simply upload the data. This is clearly not a scalable approach. How can we improve this process? Can we automate harvesting relevant data from repositories such as Dryad so that this data gets fed into GBIF automatically? Should GBIF become a data repository itself so authors can store their data there? And how do we retrospectively harvest all the rest of the data languishing in the scientific literature?

Side note

One aspect of the Kraemer et al. data I've not focussed on is that it is derived from the literature, most of it unpublished, but some is in the primary literature (the list of papers is missing from the Dryad repository, but I obtained a copy from Moritz Kraemer (@MOUGK) and it's now on GitHub). This means we can link individual occurrence records back to the evidence for that occurrence (i.e., the paper that made the assertion that this species of mosquito is found at this locality), which lets us (a) provide provenance for the data, and (b) provide credit to the authors of that observation. I hope to explore this topic in a subsequent blog post.

References

Kraemer, M. U. G., Sinka, M. E., Duda, K. A., Mylne, A., Shearer, F. M., Brady, O. J., … Hay, S. I. (2015, July 7). The global compendium of Aedes aegypti and Ae. albopictus occurrence. Scientific Data. Nature Publishing Group. http://doi.org/10.1038/sdata.2015.35

Kraemer, Moritz U. G., Sinka, Marianne E., Duda, Kirsten A., Mylne, Adrian, Shearer, Freya M., Brady, Oliver J., … Hay, Simon I. (2015). Data from: The global compendium of Aedes aegypti and Ae. albopictus occurrence. Dryad Digital Repository. http://doi.org/10.5061/dryad.47v3c

Kraemer, M. U., Sinka, M. E., Duda, K. A., Mylne, A. Q., Shearer, F. M., Barker, C. M., … Hay, S. I. (2015, June 30). The global distribution of the arbovirus vectors Aedes aegypti and Ae. albopictus. eLife. eLife Sciences Organisation, Ltd. http://doi.org/10.7554/elife.08347

Wednesday, December 17, 2014

The Natural History Museum launches their data portal

The Natural History Museum has released their data portal (http://data.nhm.ac.uk/). As of now it contains 2,439,827 of the Museum's 80 million specimens, so it's still early days. I gather that soon this data will also appear in GBIF, ending the unfortunate situation where data from one of the premier natural history collections in the world was conspicuous by its absence.

I've not had a chance to explore it in much detail, but one thing I'm keen to do is see whether I can link citations of NHM specimens in the literature (e.g., articles in BioStor) with records in the NHM portal. Being able to do this would enable all sorts of cool things, such as being able to track what researchers have said about particular specimens, as well as develop citation metrics for the collection.

[Screenshot of the NHM data portal]
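
As a rough illustration of what that linking might involve, here is a minimal sketch: scan article text for strings that look like NHM registration numbers and match them against portal records. The regex covers just one plausible format and is a guess, not the portal's actual identifier scheme.

```javascript
// Scan article text (e.g., OCR text from BioStor) for candidate NHM
// registration numbers such as "BMNH 1901.2.3.4". The pattern below is an
// assumption covering one common format, not a complete grammar.
var text = 'The holotype (BMNH 1901.2.3.4) was collected in ...';
var nhmCode = /BMNH\s*\d{4}(?:\.\d+)+/g;
var matches = text.match(nhmCode) || [];
console.log(matches); // ["BMNH 1901.2.3.4"]
// Each match could then be searched for among the portal's specimen records,
// giving literature-to-specimen links and, in aggregate, citation metrics.
```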

Thursday, August 28, 2014

BioNames database can be downloaded

My BioNames project has been going for over a year now, but I hadn't gotten around to providing bulk access to the data I've been collecting and cleaning. I've gone some way towards fixing this. You can now grab a snapshot of the BioNames database as a Darwin Core Archive here. This snapshot was generated on the 22nd August, so it is already a little out of date (BioNames is edited almost daily as I clean and annotate it when I should be doing other things).

The data dump doesn't capture all the information in BioNames as I've tried to keep it simple, and Darwin Core is a bit of a pain to deal with. The actual database is in CouchDB, which is (mostly) an absolute joy to work with. I replicate the database to Cloudant, which means there's a copy "in the cloud". A number of my other CouchDB projects are also in Cloudant; in the case of the Australian Faunal Directory and the BOL DNA Barcode Map the data is also served directly from Cloudant.
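
Replication is a standard CouchDB feature, so pushing a database to Cloudant is a single HTTP call to the _replicate endpoint. A sketch (the account, database names, and credentials are placeholders):

```javascript
// Replicate a local CouchDB database to Cloudant via the standard
// _replicate API. Run as an ES module (or wrap in an async function).
var resp = await fetch('http://localhost:5984/_replicate', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    source: 'bionames',
    target: 'https://USER:PASSWORD@ACCOUNT.cloudant.com/bionames',
    create_target: true // create the remote database if it doesn't exist yet
  })
});
console.log(await resp.json()); // replication status/history document
```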

Tuesday, August 19, 2014

Guest post: Response to the discussion on Red List assessments of East African chameleons

This is a guest post by Angelique Hjarding in response to discussion on this blog about the paper below.
Hjarding, A., Tolley, K. A., & Burgess, N. D. (2014, July 10). Red List assessments of East African chameleons: a case study of why we need experts. Oryx. Cambridge University Press (CUP). doi:10.1017/s0030605313001427
Thank you for highlighting our recent publication and for the very interesting comments. We wanted to take the opportunity to address some of the issues brought up in both your review and from reader comments.

One of the most important issues that has been raised is the sharing of cleaned and vetted datasets. It has been suggested that the datasets used in our study be uploaded to a repository that can be cited and shared. This is possible for data that was downloaded from GBIF as they have already done the legwork to obtain data sharing agreements with the contributing organizations. So as long as credit is properly given to the source of the data, publicly sharing data accessed through GBIF should be acceptable. At the time the manuscript was submitted for publication, we were unaware of sites such as http://figshare.com where the data could be stored and shared with no additional cost to the contributor. The dataset used in the study that used GBIF data has now been made available in this way.
Angelique Hjarding. (2014). Endemic Chameleons of Kenya and Tanzania. Figshare. doi:10.6084/m9.figshare.1141858


It starts to get tricky with doing the same for the expert vetted data. This dataset consists primarily of data gathered by the expert from museum records and published literature. So in this case it is not a question of why the expert doesn’t share. The question is why the museum data and any additional literature records are not on GBIF already. As has been pointed out in our analysis (and confirmed by Rod) most of these museums do not currently have data sharing agreements with GBIF. Therefore, the expert who compiled the data does not have the permission of the museums to share their data second hand. Bottom line, all of the data used in this study that was not accessed through GBIF is currently available from the sources directly. That is, for anyone who wants to take the time to contact the museums for permission to use their data for research and to compile it. We also do not believe there is blame on museums that have not yet shared their data with forums such as GBIF. Mobilisation of data is an enormous task, and near impossible if funds and staff are not available. With regards to the particular comment regarding the lack of data sharing by NHML and other museums, we need to recognise what the task at hand would mean, and rather address ways such a monumental, and valuable, collection could be mobilised. A further issue should be raised around literature records that are not necessarily encapsulated in museum collections, but are buried in old and obscure manuscripts. To our knowledge, there is no way to mobilise those records either, because they are not attached to a specimen. Further, because there are no specimens, extreme care must be taken if such records were to be mobilised in order to ensure quality control. Again, assistance of expert knowledge would be highly beneficial, yet these things take time and require funds.

Another issue that was raised is why didn’t we go directly to GBIF to fix the records? The point of our research was not to clean and update GBIF/museum data but to evaluate the effect of expert vetting and museum data mobilization in an applied conservation setting. As it has been pointed out, the lead author was working at GBIF during the course of the research. An effort was made to provide a checklist of the updated taxonomy to GBIF at the time, but there was no GBIF mechanism for providing updates. This appears to still be the case. In addition, two GBIF staff provided comments on the paper and were acknowledged for their input. We are happy to provide an updated taxonomy to help improve the data quality, should some submission tool for updates be made available.

Finally we would like to address the question, why use GBIF data if we know it needs some work before it can be used? We believe this is a very important debate for at least two reasons. First, when data is made public, we believe there are many researchers who work under the assumption that the data is ready for use with minimal further work. We believe they assume that the taxonomy is up to date; that the records are in the right place; and that the records provided relate to the name that is attached to those records. Many of the papers that have used GBIF data have undertaken broad scale macroecological analyses where, perhaps, the errors we have shown matter little. But some of these synthetic studies have also proposed that their results can be used for decision making by companies, which starts to raise concerns especially if the company wants to know the exact species that its activities could impact. As we have shown, for chameleons at least, such advice would be hard to provide using the raw GBIF data.

Second, we are aware that there is another group of researchers using GBIF data who "know that to use GBIF's data you need to do a certain amount of previous work and run some tests, and if the data does not pass the tests, you don't use it." We are not sure of the tests that are run, and it would be useful to have these spelled out for broader debate and potentially the development of some agreed protocols for data cleaning for various uses.

Our underlying reason for writing the paper was not to enter into debate of which data are best between GBIF and an expert compiled dataset. We are extremely pleased that GBIF data exist, and are freely available for the use of all. This certainly has to be part of the future of 'better data for better decisions', but we are concerned that we should not just accept that the data is the best we can get, but should instead look for ways to improve it, for all kinds of purposes. As such, we would like to suggest that the discussion focuses some energy on ways to address the shortcomings of the present system, but also that the community who would benefit from the data address ways to assist the dataholders to mobilise their information in terms of accessing the resources required to digitise and make data available, and maintain updated taxonomy for their holdings. In an era of declining funding for Museum-based taxonomy in many parts of the world this is certainly a challenge that needs to be addressed.

We welcome further discussion as this is a very important topic, not only for conservation but also in terms of improved access to biodiversity knowledge, which is critical for many reasons.

Angelique Hjarding http://orcid.org/0000-0002-9279-4893
Krystal Tolley
Neil Burgess

Monday, March 31, 2014

Rethinking annotating biodiversity data

TL;DR By using bookmarklets and a central annotation store, we can build a system to annotate any biodiversity database, and display those annotations on those databases.

A couple of weeks ago I was at a GBIF meeting in Copenhagen, and there was a discussion about adding a new feature to the GBIF portal. The conversation went something like this:

Advisor: "We really need this feature, now!"

Developer: "OK, but which of these other things you've told us we need to do should we stop doing, so we can add this new feature?"

Resources are limited, and adding new features to a project can be difficult. This got me thinking about the issue of annotating data, in GBIF and other biodiversity projects. There have been a number of recent papers on annotating biodiversity data, such as:

Morris, R. A., Dou, L., Hanken, J., Kelly, M., Lowery, D. B., Ludäscher, B., Macklin, J. A., et al. (2013, November 4). Semantic Annotation of Mutable Data. (I. N. Sarkar, Ed.) PLoS ONE. Public Library of Science (PLoS). doi:10.1371/journal.pone.0076093
Tschöpe, O., Macklin, J. A., Morris, R. A., Suhrbier, L., & Berendsohn, W. G. (2013, December 20). Annotating biodiversity data via the Internet. Taxon. International Association for Plant Taxonomy (IAPT). doi:10.12705/626.4

It seems to me that these potentially suffer from the assumption that data aggregators such as GBIF, and data providers such as natural history collections, have sufficient resources in place to (a) implement such systems, and (b) process the annotations made by the community and update their records. What if neither assumption holds true?

Everyone is busy


Any system which requires a project to add another feature is going to have to compete with other priorities. I ran into this with my BioNames project, which was partly funded by EOL. BioNames links taxonomic names for animals (obtained from ION) to the primary literature; for example, Pinnotheres atrinicola was published in the following paper:

Page, R. D. M. (1983). Description of a new species of Pinnotheres, and redescription of P. novaezelandiae (Brachyura: Pinnotheridae). New Zealand Journal of Zoology, 10(2), 151–162. doi:10.1080/03014223.1983.10423904

Ideally, all the links between names and publications that I'd assembled in BioNames would have been added to EOL, so that (wherever possible) users of EOL could see the original description of a taxon in EOL. But this didn't happen. In order to get BioNames into EOL I had to export the data in Darwin Core format, which is poorly suited to this kind of data. It also became clear that BioNames and EOL had rather different data models when it came to taxa, names, and publications. This meant it was going to be a challenge to provide the data in a way that was usable by EOL. Plus, EOL was pretty busy doing other things, such as developing TraitBank™ (yes, that's a "™" after TraitBank). So, I never did get BioNames content into EOL.

But there's another way to do this.

The Web means never having to ask for permission


It occurred to me (around about the time that I was at the pro-iBiosphere hackathon at Leiden) that there's another way to tackle this, a way which uses bookmarklets. Bookmarklets are little snippets of Javascript that can be stored as bookmarks in your web browser, and they can add extra functionality to an existing web page. You may well have come across these already, such as Save to Mendeley, or Altmetric it.

How does this help us with annotation? Well, with a little programming, you can add features that you think are "missing" from a web page, and you don't need to ask anyone's permission to do it. So, I could negotiate with EOL about how to get data from BioNames into EOL, or I can simply do this:

[Screenshot: the bookmarklet popup on an EOL page, showing the original publication of the displayed taxon]

What I've done here is create a bookmarklet that recognises that you are looking at an EOL page; it then calls the BioNames API and displays the original publication of the taxon displayed on the page (in this case, Pinnotheres atrinicola). So, I've added the information from BioNames to the EOL page, without needing EOL to do anything.
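
To give a flavour of how little code is involved, here is a stripped-down sketch of the idea (a real bookmarklet is the same thing minified onto a single line). The BioNames endpoint shown is hypothetical; see the actual bookmarklet linked below for the API it really calls.

```javascript
javascript: (function () {
  // EOL taxon pages have URLs like http://eol.org/pages/<id>/overview
  var m = window.location.href.match(/eol\.org\/pages\/(\d+)/);
  if (!m) { alert('Not an EOL taxon page'); return; }
  // Use a JSONP-style callback, since the bookmarklet runs inside the host page
  window.showPublication = function (data) {
    alert(JSON.stringify(data, null, 2)); // the real version renders a styled popup
  };
  var script = document.createElement('script');
  script.src = 'http://bionames.org/api/eol/' + m[1] + '?callback=showPublication'; // hypothetical URL
  document.body.appendChild(script);
})();
```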

But it gets better. We can do this with pretty much any web page. The example above displays the original publication of a taxon name, but imagine we are looking at the publisher's page for that article (you can see it here: http://dx.doi.org/10.1080/03014223.1983.10423904). Wouldn't it be nice if the publisher knew that this paper described a new species of crab? We could negotiate with the publisher about how to give them that information, and how they could display it, or we can just add it:

[Screenshot: the bookmarklet popup on the publisher's page, listing the taxonomic names published in the article]

This time the bookmarklet recognises that the web page has a DOI, then asks BioNames whether any names have been published in the paper with that DOI; if it finds any, they are displayed in the popup.
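
Detecting the DOI is straightforward because many publisher pages embed it in standard meta tags (the Highwire/Google Scholar "citation_doi" tag, or a Dublin Core identifier). For example (the subsequent BioNames lookup is again hypothetical):

```javascript
// Pull the DOI out of the page's metadata, if present
var meta = document.querySelector('meta[name="citation_doi"]') ||
           document.querySelector('meta[name="DC.Identifier"]');
var doi = meta ? meta.getAttribute('content') : null;
if (doi) {
  // ...then ask BioNames whether any taxonomic names were published
  // in the paper with this DOI (endpoint hypothetical)
  console.log('Found DOI:', doi);
}
```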

Bookmarklets enable you to enhance a web page with any information you like. This makes them ideal for displaying annotations on a page. If you want to try it yourself, you can grab the bookmarklet from here.

Making annotations visible


Bookmarklets can be used to solve one part of the annotation problem, namely showing existing annotations. I have lots of examples of errors in datasets: I blog about some of these, I store some in Evernote for future reference, some end up in unfinished manuscripts, and so on. The problem is that these annotations are of little use to anyone else, because if you go to GBIF you don't see my annotations (or, indeed, anyone else's). But we can use a bookmarklet to display these, without having to pester GBIF themselves to add this feature! Imagine a bookmarklet that you could click on to see whether anyone has queried the identification of a specimen, or its location.

Where do the annotations come from?


Of course, all this presupposes that we have annotations to start with. I think there are at least two classes of annotations. The first, most obvious annotations are ones that change or add attributes to an object. For example, adding latitude and longitude coordinates to a specimen. These are annotations we would want to display just on the corresponding page in the source database (e.g., displaying a map in the annotation popup on GBIF for a record we've georeferenced).

The second class comprises cross-links between data sets. For example, linking a species in EOL to the DOI of the publication that first described that species. Or linking a specimen in GBIF to the sequences in GenBank that were obtained from that specimen. These annotations are different in that we might want to display them on multiple web pages (e.g., pages served by both a biodiversity database and an academic publisher). From this perspective, a database like BioNames is essentially a big store of annotations.

But we need more than this, we need to be able to annotate any class of data that is relevant to biodiversity. We need to be able to edit erroneous GBIF records, flag GenBank sequences that have been misidentified, document taxonomic names that are entirely spurious, and so on. And we need to make these annotations available via APIs so that anyone can access them. To me, it seems obvious that we need a single, centralised annotation store.

A global annotation store


One way to implement an annotation store would be to create a wiki-style database that the community could edit. This database gets populated with data that can then be edited, refined, and discussed. For example, imagine a GBIF user spots an occurrence that is clearly wrong (a frog in the middle of the ocean). They could have a bookmarklet that they click on, and it displays any existing annotations of that record. If there aren't any, let's imagine there is a link to the annotation store. Clicking on that creates a record for that occurrence, and the user then edits that. Perhaps they discover that the latitude and longitude have been swapped, so they swap them back, and save the record. The next person to go to that page in GBIF clicks on their bookmarklet and discovers that there is a potential issue with that record (the popup displayed by the bookmarklet will have a "warning symbol", and an updated map).
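
What might a record in the annotation store look like? Here is one possibility, loosely modelled on the W3C Web Annotation idea of a target (the thing annotated) and a body (what we say about it). Every field name below is an assumption, not an existing GBIF or annotation-store API.

```javascript
// A possible annotation record for the "frog in the ocean" example
var annotation = {
  target: 'http://www.gbif.org/occurrence/123456789', // the dubious record
  motivation: 'editing',
  body: {
    comment: 'Latitude and longitude appear transposed (frog plotted in the ocean)',
    original: { decimalLatitude: 145.2, decimalLongitude: -37.5 }, // as published
    proposed: { decimalLatitude: -37.5, decimalLongitude: 145.2 }  // swapped back
  },
  creator: 'http://orcid.org/0000-0000-0000-0000', // placeholder annotator identity
  created: '2014-03-31T17:05:00Z'
};
```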

Some annotations will be simple, some may require some analysis. For example, a claim that a GenBank sequence has been misidentified would be stronger if it was backed up by a BLAST analysis demonstrating that the sequence clusters with taxa that you would not expect based on its putative identification.

We can also annotate in bulk, and upload these annotations directly to the annotation store. For example, we could map GBIF taxa to taxonomic name identifiers from nomenclators such as ION, ZooBank, IPNI, Index Fungorum, etc., then map those identifiers to the primary literature, and upload all of that data to the annotation store, making it available to anyone visiting GBIF (or, indeed, the nomenclators). We could BLAST DNA barcode sequences and suggest potential identifications. We could add lists of publications that cite museum specimen codes, and display those on the GBIF page that corresponds to each code. There is almost no limit to the richness of annotations we could add to existing webpages.

Filtered push


One aspect of annotation that I've glossed over is how the annotations get back to the primary data providers. There has been some work on this (see papers cited at the start), but in a sense I don't think this is the most pressing problem (in part because I suspect most providers are in no position to undertake the kind of data editing and cleaning required). My concern is at the other end of the process. Users of biodiversity data are frequently presented with data that is demonstrably erroneous, and it inconveniences them, as well as hurting the reputation of aggregators such as GBIF, or databases such as GenBank. Anyone doing an analysis of these sorts of data will spend some time cleaning and correcting the data; we desperately need mechanisms to capture these annotations and make them available to other users. The extent to which these annotations filter back to the primary data providers is, in my view, a less pressing issue.

That said, a central annotation store would have lots of advantages for primary providers. It's one place to go to get annotations. The fate of a user's edits could help develop metrics of reliability of annotations, and so on.

Summary


The reason I find this approach attractive is that it frees us from having to wait for projects like GBIF and GenBank to support annotations. We don't need to wait, we can simply do it ourselves right now. We can add overlays that augment existing data (e.g., adding original publications to EOL web pages), or flag errors. Take the example bookmarklet from here for a spin and see what it can do. It's very crude, but I think it gives an indication of the potential of this approach.

So, "all" we need is a centralised, editable, database of annotations that we can hook the bookmarklet into. Simples.

Wednesday, February 19, 2014

Five Stages of Data Grief

There is a great post by Jeni Tennison on the Open Data Institute blog entitled Five Stages of Data Grief. It resonates so much with my experience working with biodiversity data (such as building BioNames, or exploring data errors in GBIF) that I've decided to reproduce it here.

Five Stages of Data Grief

by Jeni Tennison (@JeniT)

As organisations come to recognise how important and useful data could be, they start to think about using the data that they have been collecting in new ways. Often data has been collected over many years as a matter of routine, to drive specific processes or sometimes just for the sake of it. Suddenly that data is repurposed. It is probed, analysed and visualised in ways that haven’t been tried before.

Data analysts have a maxim:

If you don’t think you have a quality problem with your data, you haven’t looked at it yet.

Every dataset has its quirks, whether it’s data that has been wrongly entered in the first place, automated processing that has introduced errors, irregularities that come from combining datasets into a consistent structure or simply missing information. Anyone who works with data knows that far more time is needed to clean data into something that can be analysed, and to understand what to leave out, than in actually performing the analysis itself. They also know that analysis and visualisation of data will often reveal bugs that you simply can’t see by staring at a spreadsheet.

But for the people who have collected and maintained such data — or more frequently their managers, who don’t work with the data directly — this realisation can be a bit of a shock. In our last ODI Board meeting, Sir Tim Berners-Lee suggested that what data curators need to go through is something like the five stages of grief described by the Kübler-Ross model.

So here is an outline of what that looks like.

Denial

This can’t be right: there’s nothing wrong with our data! Your analysis/code/visualisation must be doing something wrong.

At this stage data custodians can’t believe what they are seeing. Maybe they have been using the data themselves but never run into issues with it because they were only using it in limited ways. Maybe they had only ever been collecting the data, and not actually using it at all. Or maybe they had been viewing it in a form where the issues with data quality were never surfaced (it’s hard to spot additional spaces, or even zeros, when you just look at a spreadsheet in Excel, for example).

So the first reason that they reach for is that there must be something wrong with the analysis or code that seems to reveal issues with the data. There may follow a wild goose chase that tries to track down the non-existent bug. Take heart: this exercise is useful in that it can pinpoint the precise records that are causing the problems in the first place, which forces the curators to stop denying them.

Anger

Who is responsible for these errors? Why haven’t they been spotted before?

As the fact that there are errors in the data comes to be understood, the focus can come to rest on the people who collect and maintain the data. This is the phase that the maintainers of data dread (and can be a reason for resisting sharing the data in the first place), because they get blamed for the poor quality.

This painful phase should eventually result in an evaluation of where errors occur — an evaluation that is incredibly useful, and should be documented and kept for the Acceptance phase of the process — and what might be done to prevent them in future. Sometimes that might result in better systems for data collection but more often than not it will be recognised that some of the errors are legacy issues or simply unavoidable without massively increasing the maintenance burden.

Bargaining

What about if we ignore these bits here? Can you tweak the visualisation to hide that?

And so the focus switches again to the analysis and visualisations that reveal the problems in the data, this time with an acceptance that the errors are real, but a desire to hide the problems so that they’re less noticeable.

This phase puts the burden on the analysts who are trying to create views over the data. They may be asked to add some special cases, or tweak a few calculations. Areas of functionality may be dropped in their entirety or radically changed as a compromise is reached between utility of the analysis and low quality data to feed it.

Depression

This whole dataset is worthless. There’s no point even trying to capture this data any more.

As the number of exceptions and compromises grows, and a realisation sinks in that those compromises undermine the utility of the analysis or visualisation as a whole, a kind of despair sets in. The barriers to fixing the data or collecting it more effectively may seem insurmountable, and the data curators may feel like giving up trying.

This phase can lead to a re-examination of the reasons for collecting and maintaining the data in the first place. Hopefully, this process can aid everyone in reasserting why the data is useful, regardless of some aspects that are lower quality than others.

Acceptance

We know there are some problems with the data. We’ll document them for anyone who wants to use it, and describe the limitations of the analysis.

In the final stage, all those involved recognise that there are some data quality problems, but that these do not render the data worthless. They will understand the limits of analyses and interpretations that they make based on the data, and they try to document them to avoid other people being misled.

The benefits of the previous stages are also recognised. Denial led to double-checking the calculations behind the analyses, making them more reliable. Anger led to re-examination of how the data was collected and maintained, and documentation that helps everyone understand the limits of the data better. Bargaining forced analyses and visualisations to be focused and explicit about what they do and don’t show. Depression helped everyone focus on the user needs from the data. Each stage makes for a better end product.


Of course doing data analysis isn’t actually like being diagnosed with a chronic illness or losing a loved one. There are things that you can do to remedy the situation. So I think we need to add a sixth stage to the five stages of data grief described above:

Hope

This could help us spot errors in the data and fix them!

Providing visualisations and analysis provides people with a clearer view about what data has been captured and can make it easier to spot mistakes, such as outliers caused by using the wrong units when entering a value, or new categories created by spelling mistakes. When data gets used to make decisions by the people who capture the data, they have a strong motivation to get the data right. As Francis Irving outlined in his recent Friday Lunchtime Lecture at ODI, Burn the Digital Paper, these feedback loops can radically change how people think about data, and use computers within their organisations.

Making data open for other people to look at provides lots more opportunities for people to spot errors. This can be terrifying — who wants people to know that they are running their organisation based on bad-quality data? — but those who have progressed through the five stages of data grief find hope in another developer maxim:

Given enough eyeballs, all bugs are shallow.

Linus’s Law, The Cathedral and the Bazaar by Eric Raymond

The more people look at your data, the more likely they are to find the problems within it. The secret is to build in feedback mechanisms which allow those errors to be corrected, so that you can benefit from those eyes and increase your data quality to what you thought it was in the first place.


Thursday, September 05, 2013

"Lost Branches on the Tree of Life" - why must the answer be enforcing behaviour?

Bryan Drew and colleagues have published a piece in PLoS Biology bemoaning the lack of databased phylogenies:

Drew, B. T., Gazis, R., Cabezas, P., Swithers, K. S., Deng, J., Rodriguez, R., Katz, L. A., et al. (2013). Lost Branches on the Tree of Life. PLoS Biology, 11(9), e1001636. doi:10.1371/journal.pbio.1001636 (see also blog post Dude, Where’s My Data?)

This is an old problem (see for example "Towards a Taxonomically Intelligent Phylogenetic Database" doi:10.1038/npre.2007.1028.1), but alas the solution proposed by Drew et al. is also old:

Optimally, all peer-reviewed journals that publish phylogenetic datasets should require deposition (and activation for public access) of alignments and trees prior to publication, and these trees and alignments will include the same characters and taxa (and taxon names) as in the published study.

In my opinion, as soon as you start demanding people do something you've lost the argument, and you're relying on power ("you don't get to publish with us unless you do 'x'"). This is also lazy. In a talk I gave to the NSF AVATOL meeting I argued that this is the wrong approach: when building shared resources, carrots are better than sticks.


In that talk I used the example of Mendeley, where they built an incredibly valuable resource (a bibliography of academic research in the cloud, which they sold for $US 100M) by providing a service that met people's needs ("where's that damn PDF again?"). No brow beating, no "you must do this", just clever social engineering.

So, my challenge to the phylogenetics community (and the authors of "Lost Branches on the Tree of Life" in particular) is to stop resorting to bullying people, and ask instead how you could make it a no brainer for people to share their trees. In other words, build something people actually need and will be inspired to contribute to.

Wednesday, November 21, 2012

Species wait 21 years to be described - show me the data

Benoît Fontaine et al. recently published a study concluding that the average lag time between a species being discovered and subsequently described is 21 years.

Fontaine, B., Perrard, A., & Bouchet, P. (2012). 21 years of shelf life between discovery and description of new species. Current Biology, 22(22), R943–R944. doi:10.1016/j.cub.2012.10.029

The paper concludes:

With a biodiversity crisis that predicts massive extinctions and a shelf life that will continue to reach several decades, taxonomists will increasingly be describing from museum collections species that are already extinct in the wild, just as astronomers observe stars that vanished thousands of years ago.

This is a conclusion that merits more investigation, especially as the title of the paper suggests there is an appalling lack of efficiency (or resources) in the way we describe biodiversity. So, with interest I looked at the Supplemental Information for the data:

I was hoping to see the list of the 600 species chosen at random, the publication containing their original description, and the date of their first collection. Instead, all we have is a description of the methods for data collection and analysis. Where is the data? Without the data I have no way of exploring the conclusions, or asking additional questions. For example, what is the distribution of specimen collection dates within each species? One could imagine situations where a number of specimens are recently collected, prompting recognition and description of a new species, and as part of that process rummaging through the collections turns up older, unrecognised members of that species. Indeed, if it takes a certain number of specimens to describe a species (people tend to frown upon descriptions based on single specimens), perhaps what we are seeing is the outcome of a sampling process where specimens of new species are rare, take a while to accumulate in collections, and so the distribution of collection dates has a long tail.
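
To make this concrete, here is a sketch of the kind of exploration the raw data would support; the input format is made up, since the actual 600-species dataset was not published.

```javascript
// Hypothetical records: one row per specimen, with collection year and the
// year the species was eventually described
var records = [
  { species: 'Aus bus', collected: 1975, described: 1996 },
  { species: 'Aus bus', collected: 1994, described: 1996 },
  { species: 'Cus dus', collected: 1990, described: 1992 }
];
// Gather the full distribution of collection-to-description lags per species,
// not just the lag of the oldest specimen (the "shelf life")
var lags = {};
records.forEach(function (r) {
  (lags[r.species] = lags[r.species] || []).push(r.described - r.collected);
});
Object.keys(lags).forEach(function (sp) {
  console.log(sp, 'shelf life:', Math.max.apply(null, lags[sp]), 'all lags:', lags[sp]);
});
// A long right tail of lags within species would support the sampling
// explanation sketched above.
```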

These are the sorts of questions we could ask if we had the data, but the authors don't provide it. The worrying thing is that we are seeing a number of high-visibility papers that potentially have major implications for how we view the field of taxonomy but which don't publish their data. Another recent example is:

Joppa, L. N., Roberts, D. L., & Pimm, S. L. (2011). The population ecology and social behaviour of taxonomists. Trends in Ecology & Evolution, 26(11), 551–553. doi:10.1016/j.tree.2011.07.010

Biodiversity is a big data science; it's time we insisted on that data being made available.

Thursday, April 05, 2012

EOL Computable Data Challenge community

Now we are awash in challenges! EOL has announced its Computable Data Challenge:
We invite ideas for scientific research projects that use EOL, including the Biodiversity Heritage Library (BHL), to answer questions in biology. The specific field of biological interest for the challenge is open; projects in ecology, evolution, behavior, conservation biology, developmental biology, or systematics may be most appropriate. Projects advancing informatics alone may be less competitive. EOL may be used as a source of biological information, to establish a sampling strategy, to assist the retrieval of computable data by mapping identifiers across sources (e.g. to accomplish name resolution), and/or in other innovative ways. Projects involving data or text or image mining of EOL or BHL content are encouraged. Current EOL data and API shall be used; suggestions for modification of content or the API could be a deliverable of the project. We encourage the use of data not yet in EOL for analyses. In all cases projects must honor terms of use and licensing as appropriate.

Some $US 50,000 is on offer. "Challenge" is perhaps a misnomer, as EOL is offering this money not as a prize at the end, but rather to fund one or more proposals (submitted by 22 May) that are accepted. So, it's essentially a grant competition (with a pleasingly minimal amount of administrivia). There is also a Computable Data Challenge community to discuss the challenge.

It's great to see EOL trying different strategies to engage with developers. Of the different challenges EOL is running this one is perhaps the most appealing to me, because one of my biggest complaints about EOL is that it's hard to envisage "doing science" with it. For example, we can download GenBank and cluster sequences into gene families, or grab data from GBIF and model species distributions, but what could we do with EOL? This challenge will be a chance to explore the extent to which EOL can support science, which I would argue will be a key part of its long term future.

Friday, April 01, 2011

Data matters but do data sets?

Interest in archiving data and data publication is growing, as evidenced by projects such as Dryad, and earlier tools such as TreeBASE. But I can't help wondering whether this is a little misguided. I think the issues are granularity and reuse.

Taking the second issue first, how much re-use do data sets get? I suspect the answer is "not much". I think there are two clear use cases, repeatability of a study, and benchmarks. Repeatability is a worthy goal, but difficult to achieve given the complexity of many analyses and the constant problem of "bit rot" as software becomes harder to run the older it gets. Furthermore, despite the growing availability of cheap cloud computing, it simply may not be feasible to repeat some analyses.

Methodological fields often rely on benchmarks to evaluate new methods, and this is an obvious case where a dataset may get reused ("I ran my new method on your dataset, and my method is the business — yours, not so much").

But I suspect the real issue here is granularity. Take DNA sequences, for example. New studies rarely reuse (or cite) previous data sets, such as a TreeBASE alignment or a GenBank Popset. Instead they cite individual sequences by accession number. I think in part this is because the rate of accumulation of new sequences is so great that any subsequent study would need to add these new sequences to be taken seriously. Similarly, in taxonomic work the citable data unit is often a single museum specimen, rather than a data set made up of specimens.

To me, citing data sets makes almost as much sense as citing journal volumes - the level of granularity is wrong. Journal volumes are largely arbitrary collections of articles, it's the articles that are the typical unit of citation. Likewise I think sequences will be cited more often than alignments.

It might be argued that there are disciplines where the dataset is the sensible unit, such as an ecological study of a particular species. Such a data set may lack obvious subsets, and hence it makes sense to cite it as a unit. But my expectation here is that such datasets will see limited re-use, for the very reason that they can't be easily partitioned and mashed up. Data sets such as alignments, which are built from smaller, reusable units of data (i.e., sequences), can be recombined, trimmed, or merged, and hence can be readily re-used. Monolithic datasets with largely unique content can't be easily mashed up with other data.

Hence, my suspicion is that many data sets in digital archives will gather digital dust, and anyone submitting a data set in the expectation that it will be cited may turn out to be disappointed.

Wednesday, December 29, 2010

The Plant List: nice data, shame it's not open

The Plant List (http://www.theplantlist.org/) has been released today, complete with glowing press releases. The list includes some 1,040,426 names. I eagerly looked for the Download button, but none is to be found. You can download individual search results (say, at family level), but not the whole data set.

OK, so that makes getting the complete data set a little tedious (there are 620 plant families in the data set), but we can still do it without too much hassle (in fact, I've grabbed the complete data set while writing this blog post). Then I see that the data is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs (CC BY-NC-ND) license. Creative Commons is good, right? In this case, not so much. The CC BY-NC-ND license includes the clause:
You may not alter, transform, or build upon this work.
So, you can look but not touch. You can't take this data (properly attributed, of course) and build your own list, for example with references linked to DOIs, or to the Biodiversity Heritage Library (which is, of course, exactly what I plan to do). That's a derivative work, and the creators of the Plant List don't want you to do that. Despite this, the Plant List want us to use the data:
Use of the content (such as the classification, synonymised species checklist, and scientific names) for publications and databases by individuals and organizations for not-for-profit usage is encouraged, on condition that full and precise credit is given to The Plant List and the conditions of the Creative Commons Licence are observed.
Great, but you've pretty much killed that by using BY-NC-ND. Then there's this:
If you wish to use the content on a public portal or webpage you are required to contact The Plant List editors at editors@theplantlist.org to request written permission and to ensure that credits are properly made.
Really? The whole point of Creative Commons is that the permissions are explicit in the license. So, actually I don't need your permission to use the data on a public portal, CC BY-NC-ND gives me permission (but with the crippling limitation that I can't make a derivative work).

So, instead of writing a post congratulating the Royal Botanic Gardens, Kew and the Missouri Botanical Garden (MOBOT) for releasing this data, I'm left spluttering in disbelief that they would hamstring its use through such a poor choice of license. Kew and MOBOT could have made the Plant List available as open data using one of the licenses listed on the Open Definition web site, for example by putting the data in the public domain using a Creative Commons CC0 license. Instead, they've chosen a restrictive license which makes the data closed, effectively killing the possibility for people to build upon the effort they've put into creating the list. Why do biodiversity data providers seem determined to cling to data for dear life, rather than open it up and let people realise its potential?

Friday, May 22, 2009

Dryad, DOIs, and why data matters more than journal articles


For the last two days I've been participating in a NESCent meeting on Dryad, a "repository of data underlying scientific publications, with an initial focus on evolutionary biology and related fields". The aim of Dryad is to provide a durable home for the kinds of data that don't get captured by existing databases such as GenBank and TreeBASE (for example, the Excel spreadsheets, Word files, and tarballs of data that, if they are lucky, make it on to a journal's web site as supplementary material, like this example). These data have an alarming tendency to disappear (see "Unavailability of online supplementary scientific information from articles published in major journals" doi:10.1096/fj.05-4784lsf).

Perhaps it was because I was participating virtually (via Adobe Connect, which worked very well), but at times I felt seriously out of step with many of the participants. I got the sense that they regard the scientific article as primary, data as secondary, and weren't entirely convinced that data needed to be treated in the same way as a publication. I was arguing that Dryad should assign DOIs to data sets, join CrossRef, and ensure data sets were cited in the same way as papers. For me this is a no brainer -- by making data equivalent to a publication, journals don't need to do anything special, and publishers know how to handle DOIs, with fewer qualms than they have about URLs, which have a nasty tendency to break (see "Going, Going, Gone: Lost Internet References" doi:10.1126/science.1088234).

Furthermore, being part of CrossRef would bring other benefits. Their cited-by linking service enables publishers to display lists of articles that cite a given paper -- imagine being able to do this for data sets. Dryad could display not just the paper associated with publication of the data set, but all subsequent citations. As an author, I'd love to see this. It would enable me to see what others had done with my data, and provide an incentive to submit my data to Dryad (providing incentives to authors to archive data is a big issue, see Mark Costello's recent paper doi:10.1525/bio.2009.59.5.9).

Not everyone saw things this way, and it's often a "reality check" to discover that things one takes for granted are not at all obvious to others (leading to mutual incomprehension). Many editors, understandably, think of the journal article as primary, and data as something else (some even struggle to see why one would want to cite data). There's also (to my mind) a ridiculous level of concern about whether ISI would index the data. In the age of Google, who cares? Partly these concerns may reflect the diversity of the participants. Some subjects, such as phylogenetics, are built on reuse of previous data, and it's this reuse that makes data citation both important and potentially powerful (for more on this see my papers hdl:10101/npre.2009.3173.1 and doi:10.1093/bib/bbn022). In many ways, the data is more important than the publication. If I look at a phylogenetics paper published, say, 5 or more years ago, the methods may be outmoded, the software obsolete (I might not be able to run it on a modern machine), and the results likely to be outdated (additional data and/or taxa changing the tree). So, the paper might be virtually useless, but the data continues to be of value. Furthermore, the great thing about data (especially sequence data) is that it can be used in all sorts of unexpected ways. In disciplines such as phylogenetics, data reuse is very common. In other areas in evolution and ecology, this might not be the case.

It will be clear from this that I buy the idea articulated by Philip Bourne (doi:10.1371/journal.pcbi.0010034) that there's really no difference between a database and a journal article, and that the two are converging (I've argued for a long time that the best thing that could happen to phylogenetics would be if Molecular Phylogenetics and Evolution and TreeBASE were to merge and become one entity). Data submission would equal publication. In the age of Google, where data is unreasonably effective (doi:10.1109/mis.2009.36, PDF here), privileging articles at the expense of data strikes me as archaic.

So, whither Dryad? I wish it every success, and I'm sure it will be a great start. There are some very clever people behind it, and it takes a lot of work to bring a community on board. However, I think Dryad's use of Handles is a mistake (they are the obvious choice of identifier given Dryad is based on DSpace), as this presents publishers with another identifier to deal with, and has none of the benefits of DOIs. Indeed, I would go further and say that the use of Handles + DSpace marks Dryad as being basically yet another digital library project, which is fine, but it puts it outside the mainstream of science publishing, and I think that is a strategic mistake. An example of how to do things better is Nature Precedings, which assigns DOIs to manuscripts, reports, and presentations. I think the use of DOIs in this context demonstrated that Nature was serious, and valued these sorts of resource. Personally, I'd argue that Dryad should be more ambitious, and see itself as a publisher, not a repository. In fact, it could think of itself as a journal publisher. Ironically, maybe the editors at the NESCent meeting were well advised to be wary: what they could be witnessing is the formation of a new kind of publication, where data is the article, and the article is data.