iPhylo: CSL

Showing posts with label CSL. Show all posts

Thursday, July 22, 2021

Towards a WikiCite search engine

I've released a simple search engine for publications in Wikidata. Wikicite Search takes its name from the WikiCite project, which was an initiative to create a bibliographic database in Wikidata. Since bibliographic data is a core component of taxonomic research (arguably taxonomy is mostly tracing the fate of the "tags" we call taxonomic names) I've spent some time getting taxonomic literature into Wikidata. Since there are bots already adding articles by harvesting sources such as CrossRef and PubMed, I've focussed on literature that is harder to add, such as articles with non-CrossRef DOIs, or those without DOIs at all.

Once you have a big database, you are then faced with the challenge of finding things in that database. Wikidata supports generic search, but I wanted something more closely geared to bibliographic data. Hence Wikicite Search. Over the last few years I've made several attempts at a bibliographic search engine, for this project I've finally settled on some basic ideas:

The core data structure is CSL-JSON, a simple but rich JSON format for expressing bibliographic data.
The search engine is Elasticsearch. The documents I upload include the CSL-JSON for an article, but also a simple text representation of the article. This text representation may include multiple languages if, for example, the article has a title in more than one language. This means that if an article has both English and Chinese titles you can find it searching in either language.
The web interface is very simple: search for a term, get results. If the search term is a Wikidata identifier you get just the corresponding article, e.g. Q98715368.
There is a reconciliation API to help match articles to the database. Paste in one citation per line and you get back matches (if found) for each citation.
Where possible I display a link to a PDF of the article, which is typically stored in the Internet Archive or accessible via the Wayback Machine.

There are millions of publications in Wikidata, currently less than half a million are in my search engine. My focus is narrowly on eukaryote taxonomy and related topics. I will be adding more articles as time permits. I also periodically reload existing articles to capture updates to the metadata made by the Wikidata community - being a wiki the data in Wikidata is constantly evolving.

My goal is to have a simple search tool that focusses on matching citation strings. In other words, it is designed to find a reference you are looking for, rather than be a tool to search the taxonomic literature. If that sounds contradictory, consider that my tool will only find a paper about a taxon if it is explicitly named in the title. A more sophisticated search engine would support things like synonym resolution, etc.

The other reason I built this is to provide an API for accessing Wikidata items and displaying them in other formats. For example, an article in the WikiCite search engine can be retrieved in CSL-JSON format, or in RDF as JSON-LD.

As always, it's very early days. But I don't think it's unreasonable to imagine that as Wikidata grows we could envisage having a search engine that includes the bulk of the taxonomic literature.

Citation parsing tool released

Quick note on a tool I've been working on to parse citations, that is to take a series of strings such as:

Möllendorff O (1894) On a collection of land-shells from the Samui Islands, Gulf of Siam. Proceedings of the Zoological Society of London, 1894: 146–156.
de Morgan J (1885) Mollusques terrestres & fluviatiles du royaume de Pérak et des pays voisins (Presqúile Malaise). Bulletin de la Société Zoologique de France, 10: 353–249.
Morlet L (1889) Catalogue des coquilles recueillies, par M. Pavie dans le Cambodge et le Royaume de Siam, et description ďespèces nouvelles (1). Journal de Conchyliologie, 37: 121–199.
Naggs F (1997) William Benson and the early study of land snails in British India and Ceylon. Archives of Natural History, 24:37–88.

and return structured data. This is an old problem, and pretty much a "solved" problem. See for example AnyStyle. I've played with AnyStyle and it's great, but I had to install it on my computer rather than simply use it as a web service. I also wanted to explore the approach a bit more as a possible a model for finding citations of specimens.

Loving Sylvester Keil's AnyStyle reference parser https://t.co/pcbvctr5vf, but not loving the whole Ruby experience (whadda mean I need to upgrade Ruby on my Mac to install it?). Oh for a Docker version of a web service... still, very cool tool.
— Roderic Page (@rdmpage) July 7, 2020

After trying to install the underlying conditional random fields (CRF) engine used by AnyStyle and running into a bunch of errors, I switched to a tool I could get working, namely CRF++. After figuring out how to compiling a C++ application to run on Heroku I started to wonder how to use this as the basis of a citation parser. Fortunately, I had used the Perl-based ParsCit years ago, and managed to convert the relevant bits to PHP and build a simple web service around it.

Although I've abandoned the Ruby-based AnyStyle I do use AnyStyle's XML format for the training data. I also built a crude editor to create small training data sets that uses a technique published by the author of the blogging tool I'm using to write this post (see MarsEdit Live Source Preview). Typically I use this to correctly annotate examples where the parser failed. Over time I add these to the training data and the performance gets better.

This is pretty much a side project or a side project, but ultimately the goal is to employ it to help extract citation data from publications, both to generate data to populate (BioStor), and also start to flesh out the citation graph for publications in Wikidata.

If you want to play with the tool it is at https://citation-parser.herokuapp.com. At the moment it takes some citation strings and returns the result in CSL-JSON, which is becoming the default way to represent structured bibliographic data. Code is on GitHub.