--- http_interactions: - request: method: get uri: https://rogue-scholar.org/api/blogs/tyfqw20 body: encoding: UTF-8 string: '' headers: Connection: - close Host: - rogue-scholar.org User-Agent: - http.rb/5.1.1 response: status: code: 200 message: OK headers: Age: - '0' Cache-Control: - public, max-age=0, must-revalidate Content-Length: - '49530' Content-Type: - application/json; charset=utf-8 Date: - Sun, 18 Jun 2023 15:24:19 GMT Etag: - '"xv42bhvvc21253"' Server: - Vercel Strict-Transport-Security: - max-age=63072000 X-Matched-Path: - "/api/blogs/[slug]" X-Vercel-Cache: - MISS X-Vercel-Id: - fra1::iad1::6gh26-1687101859043-0548b5ea306e Connection: - close body: encoding: UTF-8 string: '{"id":"tyfqw20","title":"iPhylo","description":"Rants, raves (and occasionally considered opinions) on phyloinformatics, taxonomy, and biodiversity informatics. For more ranty and less considered opinions, see my Twitter feed.
ISSN 2051-8188. Written content on this site is licensed under a Creative Commons Attribution 4.0 International license.","language":"en","favicon":null,"feed_url":"https://iphylo.blogspot.com/feeds/posts/default","feed_format":"application/atom+xml","home_page_url":"https://iphylo.blogspot.com/","indexed_at":"2023-02-06","modified_at":"2023-05-31T17:26:00+00:00","license":"https://creativecommons.org/licenses/by/4.0/legalcode","generator":"Blogger 7.00","category":"Natural Sciences","backlog":true,"prefix":"10.59350","items":[{"id":"https://doi.org/10.59350/ymc6x-rx659","uuid":"0807f515-f31d-4e2c-9e6f-78c3a9668b9d","url":"https://iphylo.blogspot.com/2022/09/dna-barcoding-as-intergenerational.html","title":"DNA barcoding as intergenerational transfer of taxonomic knowledge","summary":"I tweeted about this but want to bookmark it for later as well. The paper “A molecular-based identification resource for the arthropods of Finland” doi:10.1111/1755-0998.13510 contains the following: …the annotated barcode records assembled by FinBOL participants represent a tremendous intergenerational transfer of taxonomic knowledge … the time contributed by current taxonomists in identifying and contributing voucher specimens represents a great gift to future generations who will benefit...","date_published":"2022-09-14T10:12:00Z","date_modified":"2022-09-29T13:57:30Z","date_indexed":"1909-06-16T11:02:21+00:00","authors":[{"url":null,"name":"Roderic Page"}],"image":null,"content_html":"

I tweeted about this but want to bookmark it for later as well. The paper “A molecular-based identification resource for the arthropods of Finland” doi:10.1111/1755-0998.13510 contains the following:

\n
…the annotated barcode records assembled by FinBOL participants represent a tremendous intergenerational transfer of taxonomic knowledge … the time contributed by current taxonomists in identifying and contributing voucher specimens represents a great gift to future generations who will benefit from their expertise when they are no longer able to process new material.
\n

I think this is a very clever way to characterise the project. In an age of machine learning this may be commonest way to share knowledge , namely as expert-labelled training data used to build tools for others. Of course, this means the expertise itself may be lost, which has implications for updating the models if the data isn’t complete. But it speaks to Charles Godfrey’s theme of “Taxonomy as information science”.

Note that the knowledge is also transformed in the sense that the underlying expertise of interpreting morphology, ecology, behaviour, genomics, and the past literature is not what is being passed on. Instead it is probabilities that a DNA sequence belongs to a particular taxon.

This feels is different to, say iNaturalist, where there is a machine learning model to identify images. In that case, the model is built on something the community itself has created, and continues to create. Yes, the underlying idea is that same: “experts” have labelled the data, a model is trained, the model is used. But the benefits of the iNaturalist model are immediately applicable to the people whose data built the model. In the case of barcoding, because the technology itself is still not in the hands of many (relative to, say, digital imaging), the benefits are perhaps less tangible. Obviously researchers working with environmental DNA will find it very useful, but broader impact may await the arrival of citizen science DNA barcoding.

The other consideration is whether the barcoding helps taxonomists. Is it to be used to help prioritise future work (“we are getting lots of unknown sequences in these taxa, lets do some taxonomy there”), or is it simply capturing the knowledge of a generation that won’t be replaced:

\n
The need to capture such knowledge is essential because there are, for example, no young Finnish taxonomists who can critically identify species in many key groups of ar- thropods (e.g., aphids, chewing lice, chalcid wasps, gall midges, most mite lineages).
\n

The cycle of collect data, test and refine model, collect more data, rinse and repeat that happens with iNaturalist creates a feedback loop. It’s not clear that a similar cycle exists for DNA barcoding.

\n
Written with StackEdit.
\n

","tags":[],"language":"en","references":[]},{"id":"https://doi.org/10.59350/d3dc0-7an69","uuid":"545c177f-cea5-4b79-b554-3ccae9c789d7","url":"https://iphylo.blogspot.com/2021/10/reflections-on-macroscope-tool-for-21st.html","title":"Reflections on \"The Macroscope\" - a tool for the 21st Century?","summary":"This is a guest post by Tony Rees. It would be difficult to encounter a scientist, or anyone interested in science, who is not familiar with the microscope, a tool for making objects visible that are otherwise too small to be properly seen by the unaided eye, or to reveal otherwise invisible fine detail in larger objects. A select few with a particular interest in microscopy may also have encountered the Wild-Leica \"Macroscope\", a specialised type of benchtop microscope optimised for...","date_published":"2021-10-07T12:38:00Z","date_modified":"2021-10-08T10:26:22Z","date_indexed":"1909-06-16T10:02:25+00:00","authors":[{"url":null,"name":"Roderic Page"}],"image":null,"content_html":"

$\"YtNkVT2U\"$ This is a guest post by Tony Rees.

\n\n

It would be difficult to encounter a scientist, or anyone interested in science, who is not familiar with the microscope, a tool for making objects visible that are otherwise too small to be properly seen by the unaided eye, or to reveal otherwise invisible fine detail in larger objects. A select few with a particular interest in microscopy may also have encountered the Wild-Leica \"Macroscope\", a specialised type of benchtop microscope optimised for low-power macro-photography. However in this overview I discuss the \"Macroscope\" in a different sense, which is that of the antithesis to the microscope: namely a method for visualizing subjects too large to be encompassed by a single field of vision, such as the Earth or some subset of its phenomena (the biosphere, for example), or conceptually, the universe.

\n\n

$\"\"$

My introduction to the term was via addresses given by Jesse Ausubel in the formative years of the 2001-2010 Census of Marine Life, for which he was a key proponent. In Ausubel''s view, the Census would perform the function of a macroscope, permitting a view of everything that lives in the global ocean (or at least, that subset which could realistically be sampled in the time frame available) as opposed to more limited subsets available via previous data collection efforts. My view (which could, of course, be wrong) was that his thinking had been informed by a work entitled \"Le macroscope, vers une vision globale\" published in 1975 by the French thinker Joël de Rosnay, who had expressed such a concept as being globally applicable in many fields, including the physical and natural worlds but also extending to human society, the growth of cities, and more. Yet again, some ecologists may also have encountered the term, sometimes in the guise of \"Odum''s macroscope\", as an approach for obtaining \"big picture\" analyses of macroecological processes suitable for mathematical modelling, typically by elimination of fine detail so that only the larger patterns remain, as initially advocated by Howard T. Odum in his 1971 book \"Environment, Power, and Society\".

\n\n

From the standpoint of the 21st century, it seems that we are closer to achieving a \"macroscope\" (or possibly, multiple such tools) than ever before, based on the availability of existing and continuing new data streams, improved technology for data assembly and storage, and advanced ways to query and combine these large streams of data to produce new visualizations, data products, and analytical findings. I devote the remainder of this article to examples where either particular workers have employed \"macroscope\" terminology to describe their activities, or where potentially equivalent actions are taking place without the explicit \"macroscope\" association, but are equally worthy of consideration. To save space here, references cited here (most or all) can be found via a Wikipedia article entitled \"Macroscope (science concept)\" that I authored on the subject around a year ago, and have continued to add to on occasion as new thoughts or information come to hand (see edit history for the article).

\n\n

First, one can ask, what constitutes a macroscope, in the present context? In the Wikipedia article I point to a book \"Big Data - Related Technologies, Challenges and Future Prospects\" by Chen et al. (2014) (doi:10.1007/978-3-319-06245-7), in which the \"value chain of big data\" is characterised as divisible into four phases, namely data generation, data acquisition (aka data assembly), data storage, and data analysis. To my mind, data generation (which others may term acquisition, differently from the usage by Chen et al.) is obviously the first step, but does not in itself constitute the macroscope, except in rare cases - such as Landsat imagery, perhaps - where on its own, a single co-ordinated data stream is sufficient to meet the need for a particular type of \"global view\". A variant of this might be a coordinated data collection program - such as that of the ten year Census of Marine Life - which might produce the data required for the desired global view; but again, in reality, such data are collected in a series of discrete chunks, in many and often disparate data formats, and must be \"wrangled\" into a more coherent whole before any meaningful \"macroscope\" functionality becomes available.

\n\n

Here we come to what, in my view, constitutes the heart of the \"macroscope\": an intelligently organized (i.e. indexable and searchable), coherent data store or repository (where \"data\" may include imagery and other non numeric data forms, but much else besides). Taking the Census of Marine Life example, the data repository for that project''s data (plus other available sources as inputs) is the Ocean Biodiversity Information System or OBIS (previously the Ocean Biogeographic Information System), which according to this view forms the \"macroscope\" for which the Census data is a feed. (For non habitat-specific biodiversity data, GBIF is an equivalent, and more extensive, operation). Other planetary scale \"macroscopes\", by this definition (which may or may not have an explicit geographic, i.e. spatial, component) would include inventories of biological taxa such as the Catalogue of Life and so on, all the way back to the pioneering compendia published by Linnaeus in the eighteenth century; while for cartography and topographic imagery, the current \"blockbuster\" of Google Earth and its predecessors also come well into public consciousness.

\n\n

In the view of some workers and/or operations, both of these phases are precursors to the real \"work\" of the macroscope which is to reveal previously unseen portions of the \"big picture\" by means either of the availability of large, synoptic datasets, or fusion between different data streams to produce novel insights. Companies such as IBM and Microsoft have used phraseology such as:

\n\n

By 2022 we will use machine-learning algorithms and software to help us organize information about the physical world, helping bring the vast and complex data gathered by billions of devices within the range of our vision and understanding. We call this a \"macroscope\" – but unlike the microscope to see the very small, or the telescope that can see far away, it is a system of software and algorithms to bring all of Earth''s complex data together to analyze it by space and time for meaning.\" (IBM)

\n\n

As the Earth becomes increasingly instrumented with low-cost, high-bandwidth sensors, we will gain a better understanding of our environment via a virtual, distributed whole-Earth \"macroscope\"... Massive-scale data analytics will enable real-time tracking of disease and targeted responses to potential pandemics. Our virtual \"macroscope\" can now be used on ourselves, as well as on our planet.\" (Microsoft) (references available via the Wikipedia article cited above).

\n\n

Whether or not the analytical capabilities described here are viewed as being an integral part of the \"macroscope\" concept, or are maybe an add-on, is ultimately a question of semantics and perhaps, personal opinion. Continuing the Census of Marine Life/OBIS example, OBIS offers some (arguably rather basic) visualization and summary tools, but also makes its data available for download to users wishing to analyse it further according to their own particular interests; using OBIS data in this manner, Mark Costello et al. in 2017 were able to demarcate a finite number of data-supported marine biogeographic realms for the first time (Costello et al. 2017: Nature Communications. 8: 1057. doi:10.1038/s41467-017-01121-2), a project which I was able to assist in a small way in an advisory capacity. In a case such as this, perhaps the final function of the macroscope, namely data visualization and analysis, was outsourced to the authors'' own research institution. Similarly at an earlier phase, \"data aggregation\" can also be virtual rather than actual, i.e. avoiding using a single physical system to hold all the data, enabled by open web mapping standards WMS (web map service) and WFS (web feature service) to access a set of distributed data stores, e.g. as implemented on the portal for the Australian Ocean Data Network.

\n\n

So, as we pass through the third decade of the twenty first century, what developments await us in the \"macroscope\" area\"? In the biodiversity space, one can reasonably presume that the existing \"macroscopic\" data assembly projects such as OBIS and GBIF will continue, and hopefully slowly fill current gaps in their coverage - although in the marine area, strategic new data collection exercises may be required (Census 2020, or 2025, anyone?), while (again hopefully), the Catalogue of Life will continue its progress towards a \"complete\" species inventory for the biosphere. The Landsat project, with imagery dating back to 1972, continues with the launch of its latest satellite Landsat 9 just this year (21 September 2021) with a planned mission duration for the next 5 years, so the \"macroscope\" functionality of that project seems set to continue for the medium term at least. Meanwhile the ongoing development of sensor networks, both on land and in the ocean, offers an exciting new method of \"instrumenting the earth\" to obtain much more real time data than has ever been available in the past, offering scope for many more, use case-specific \"macroscopes\" to be constructed that can fuse (e.g.) satellite imagery with much more that is happening at a local level.

\n\n

So, the \"macroscope\" concept appears to be alive and well, even though the nomenclature can change from time to time (IBM''s \"Macroscope\", foreshadowed in 2017, became the \"IBM Pairs Geoscope\" on implementation, and is now simply the \"Geospatial Analytics component within the IBM Environmental Intelligence Suite\" according to available IBM publicity materials). In reality this illustrates a new dichotomy: even if \"everyone\" in principle has access to huge quantities of publicly available data, maybe only a few well funded entities now have the computational ability to make sense of it, and can charge clients a good fee for their services...

\n\n

I present this account partly to give a brief picture of \"macroscope\" concepts today and in the past, for those who may be interested, and partly to present a few personal views which would be out of scope in a \"neutral point of view\" article such as is required on Wikipedia; also to see if readers of this blog would like to contribute further to discussion of any of the concepts traversed herein.

","tags":["guest post","macroscope"],"language":"en","references":[]},{"id":"https://doi.org/10.59350/gf1dw-n1v47","uuid":"a41163e0-9c9a-41e0-a141-f772663f2f32","url":"https://iphylo.blogspot.com/2023/03/dugald-stuart-page-1936-2022.html","title":"Dugald Stuart Page 1936-2022","summary":"My dad died last weekend. Below is a notice in today''s New Zealand Herald. I''m in New Zealand for his funeral. Don''t really have the words for this right now.","date_published":"2023-03-14T03:00:00Z","date_modified":"2023-03-22T07:25:56Z","date_indexed":"1909-06-16T10:41:55+00:00","authors":[{"url":null,"name":"Roderic Page"}],"image":null,"content_html":"

$\"\"$

\n\nMy dad died last weekend. Below is a notice in today''s New Zealand Herald. I''m in New Zealand for his funeral. Don''t really have the words for this right now.\n\n

$\"\"$

","tags":[],"language":"en","references":[]},{"id":"https://doi.org/10.59350/cbzgz-p8428","uuid":"a93134aa-8b33-4dc7-8cd4-76cdf64732f4","url":"https://iphylo.blogspot.com/2023/04/library-interfaces-knowledge-graphs-and.html","title":"Library interfaces, knowledge graphs, and Miller columns","summary":"Some quick notes on interface ideas for digital libraries and/or knowledge graphs. Recently there’s been something of an explosion in bibliographic tools to explore the literature. Examples include: Elicit which uses AI to search for and summarise papers _scite which uses AI to do sentiment analysis on citations (does paper A cite paper B favourably or not?) ResearchRabbit which uses lists, networks, and timelines to discover related research Scispace which navigates connections between...","date_published":"2023-04-25T13:01:00Z","date_modified":"2023-04-27T14:51:08Z","date_indexed":"1909-06-16T11:25:14+00:00","authors":[{"url":null,"name":"Roderic Page"}],"image":null,"content_html":"

Some quick notes on interface ideas for digital libraries and/or knowledge graphs.

Recently there’s been something of an explosion in bibliographic tools to explore the literature. Examples include:

Elicit which uses AI to search for and summarise papers
_scite which uses AI to do sentiment analysis on citations (does paper A cite paper B favourably or not?)
ResearchRabbit which uses lists, networks, and timelines to discover related research
Scispace which navigates connections between papers, authors, topics, etc., and provides AI summaries.

As an aside, I think these (and similar tools) are a great example of how bibliographic data such as abstracts, the citation graph and - to a lesser extent - full text - have become commodities. That is, what was once proprietary information is now free to anyone, which in turns means a whole ecosystem of new tools can emerge. If I was clever I’d be building a Wardley map to explore this. Note that a decade or so ago reference managers like Zotero were made possible by publishers exposing basic bibliographic data on their articles. As we move to open citations we are seeing the next generation of tools.

Back to my main topic. As usual, rather than focus on what these tools do I’m more interested in how they look. I have history here, when the iPad came out I was intrigued by the possibilities it offered for displaying academic articles, as discussed here, here, here, here, and here. ResearchRabbit looks like this:

Scispace’s “trace” view looks like this:

What is interesting about both is that they display content from left to right in vertical columns, rather than the more common horizontal rows. This sort of display is sometimes called Miller columns or a cascading list.

\n\n

$\"\"$

\n\n

By Gürkan Sengün (talk) - Own work, Public Domain, https://commons.wikimedia.org/w/index.php?curid=594715

I’ve always found displaying a knowledge graph to be a challenge, as discussed elsewhere on this blog and in my paper on Ozymandias. Miller columns enable one to drill down in increasing depth, but it doesn’t need to be a tree, it can be a path within a network. What I like about ResearchRabbit and the original Scispace interface is that they present the current item together with a list of possible connections (e.g., authors, citations) that you can drill down on. Clicking on these will result in a new column being appended to the right, with a view (typically a list) of the next candidates to visit. In graph terms, these are adjacent nodes to the original item. The clickable badges on each item can be thought of as sets of edges that have the same label (e.g., “authored by”, “cites”, “funded”, “is about”, etc.). Each of these nodes itself becomes a starting point for further exploration. Note that the original starting point isn’t privileged, other than being the starting point. That is, each time we drill down we are seeing the same type of information displayed in the same way. Note also that the navigation can be though of as a card for a node, with buttons grouping the adjacent nodes. When we click on an individual button, it expands into a list in the next column. This can be thought of as a preview for each adjacent node. Clicking on an element in the list generates a new card (we are viewing a single node) and we get another set of buttons corresponding to the adjacent nodes.

One important behaviour in a Miller column interface is that the current path can be pruned at any point. If we go back (i.e., scroll to the left) and click on another tab on an item, everything downstream of that item (i.e., to the right) gets deleted and replaced by a new set of nodes. This could make retrieving a particular history of browsing a bit tricky, but encourages exploration. Both Scispace and ResearchRabbit have the ability to add items to a collection, so you can keep track of things you discover.

Lots of food for thought, I’m assuming that there is some user interface/experience research on Miller columns. One thing to remember is that Miller columns are most often associated with trees, but in this case we are exploring a network. That means that potentially there is no limit to the number of columns being generated as we wander through the graph. It will be interesting to think about what the average depth is likely to be, in other words, how deep down the rabbit hole will be go?

\n\n

Update

Should add link to David Regev''s explorations of Flow Browser.\n\n

\n
Written with StackEdit.
\n

","tags":["cards","flow","Knowledge Graph","Miller column","RabbitResearch"],"language":"en","references":[]},{"id":"https://doi.org/10.59350/t6fb9-4fn44","uuid":"8bc3fea6-cb86-4344-8dad-f312fbf58041","url":"https://iphylo.blogspot.com/2021/12/the-business-of-extracting-knowledge.html","title":"The Business of Extracting Knowledge from Academic Publications","summary":"Markus Strasser (@mkstra write a fascinating article entitled \"The Business of Extracting Knowledge from Academic Publications\". I spent months working on domain-specific search engines and knowledge discovery apps for biomedicine and eventually figured that synthesizing \"insights\" or building knowledge graphs by machine-reading the academic literature (papers) is *barely useful* :https://t.co/eciOg30Odc— Markus Strasser (@mkstra) December 7, 2021 His TL;DR: TL;DR: I worked on biomedical...","date_published":"2021-12-11T00:01:00Z","date_modified":"2021-12-11T00:01:21Z","date_indexed":"1909-06-16T11:32:09+00:00","authors":[{"url":null,"name":"Roderic Page"}],"image":null,"content_html":"

Markus Strasser (@mkstra write a fascinating article entitled \"The Business of Extracting Knowledge from Academic Publications\".

\n\n

I spent months working on domain-specific search engines and knowledge discovery apps for biomedicine and eventually figured that synthesizing "insights" or building knowledge graphs by machine-reading the academic literature (papers) is *barely useful* :https://t.co/eciOg30Odc
— Markus Strasser (@mkstra) December 7, 2021

\n\n

His TL;DR:

\n\n

\nTL;DR: I worked on biomedical literature search, discovery and recommender web applications for many months and concluded that extracting, structuring or synthesizing \"insights\" from academic publications (papers) or building knowledge bases from a domain corpus of literature has negligible value in industry.
\n\n
Close to nothing of what makes science actually work is published as text on the web.\n

\n\n

After recounting the many problems of knowledge extraction - including a swipe at nanopubs which \"are ... dead in my view (without admitting it)\" - he concludes:

\n\n

\nI’ve been flirting with this entire cluster of ideas including open source web annotation, semantic search and semantic web, public knowledge graphs, nano-publications, knowledge maps, interoperable protocols and structured data, serendipitous discovery apps, knowledge organization, communal sense making and academic literature/publishing toolchains for a few years on and off ... nothing of it will go anywhere.
\n\n
Don’t take that as a challenge. Take it as a red flag and run. Run towards better problems.\n

\n\n

Well worth a read, and much food for thought.

","tags":["ai","business model","text mining"],"language":"en","references":[]},{"id":"https://doi.org/10.59350/463yw-pbj26","uuid":"dc829ab3-f0f1-40a4-b16d-a36dc0e34166","url":"https://iphylo.blogspot.com/2022/12/david-remsen.html","title":"David Remsen","summary":"I heard yesterday from Martin Kalfatovic (BHL) that David Remsen has died. Very sad news. It''s starting to feel like iPhylo might end up being a list of obituaries of people working on biodiversity informatics (e.g., Scott Federhen). I spent several happy visits at MBL at Woods Hole talking to Dave at the height of the uBio project, which really kickstarted large scale indexing of taxonomic names, and the use of taxonomic name finding tools to index the literature. His work on uBio with David...","date_published":"2022-12-16T17:54:00Z","date_modified":"2022-12-17T08:12:23Z","date_indexed":"1909-06-16T11:41:39+00:00","authors":[{"url":null,"name":"Roderic Page"}],"image":null,"content_html":"

I heard yesterday from Martin Kalfatovic (BHL) that David Remsen has died. Very sad news. It''s starting to feel like iPhylo might end up being a list of obituaries of people working on biodiversity informatics (e.g., Scott Federhen).

\n\n

I spent several happy visits at MBL at Woods Hole talking to Dave at the height of the uBio project, which really kickstarted large scale indexing of taxonomic names, and the use of taxonomic name finding tools to index the literature. His work on uBio with David (\"Paddy\") Patterson led to the Encyclopedia of Life (EOL).

\n\n

A number of the things I''m currently working on are things Dave started. For example, I recently uploaded a version of his dataset for Nomenclator Zoologicus[1] to ChecklistBank where I''m working on augmenting that original dataset by adding links to the taxonomic literature. My BioRSS project is essentially an attempt to revive uBioRSS[2] (see Revisiting RSS to monitor the latest taxonomic research).

\n\n

I have fond memories of those visits to Woods Hole. A very sad day indeed.

\n\n

Update: The David Remsen Memorial Fund has been set up on GoFundMe.

\n\n

1. Remsen, D. P., Norton, C., & Patterson, D. J. (2006). Taxonomic Informatics Tools for the Electronic Nomenclator Zoologicus. The Biological Bulletin, 210(1), 18–24. https://doi.org/10.2307/4134533

\n\n

2. Patrick R. Leary, David P. Remsen, Catherine N. Norton, David J. Patterson, Indra Neil Sarkar, uBioRSS: Tracking taxonomic literature using RSS, Bioinformatics, Volume 23, Issue 11, June 2007, Pages 1434–1436, https://doi.org/10.1093/bioinformatics/btm109

","tags":["David Remsen","obituary","uBio"],"language":"en","references":[]},{"id":"https://doi.org/10.59350/3s376-6bm21","uuid":"62e7b438-67a3-44ac-a66d-3f5c278c949e","url":"https://iphylo.blogspot.com/2022/02/deduplicating-bibliographic-data.html","title":"Deduplicating bibliographic data","summary":"There are several instances where I have a collection of references that I want to deduplicate and merge. For example, in Zootaxa has no impact factor I describe a dataset of the literature cited by articles in the journal Zootaxa. This data is available on Figshare (https://doi.org/10.6084/m9.figshare.c.5054372.v4), as is the equivalent dataset for Phytotaxa (https://doi.org/10.6084/m9.figshare.c.5525901.v1). Given that the same articles may be cited many times, these datasets have lots of...","date_published":"2022-02-03T15:09:00Z","date_modified":"2022-02-03T15:11:29Z","date_indexed":"1909-06-16T10:22:30+00:00","authors":[{"url":null,"name":"Roderic Page"}],"image":null,"content_html":"

There are several instances where I have a collection of references that I want to deduplicate and merge. For example, in Zootaxa has no impact factor I describe a dataset of the literature cited by articles in the journal Zootaxa. This data is available on Figshare (https://doi.org/10.6084/m9.figshare.c.5054372.v4), as is the equivalent dataset for Phytotaxa (https://doi.org/10.6084/m9.figshare.c.5525901.v1). Given that the same articles may be cited many times, these datasets have lots of duplicates. Similarly, articles in Wikispecies often have extensive lists of references cited, and the same reference may appear on multiple pages (for an initial attempt to extract these references see https://doi.org/10.5281/zenodo.5801661 and https://github.com/rdmpage/wikispecies-parser).

\n\n

There are several reasons I want to merge these references. If I want to build a citation graph for Zootaxa or Phytotaxa I need to merge references that are the same so that I can accurate count citations. I am also interested in harvesting the metadata to help find those articles in the Biodiversity Heritage Library (BHL), and the literature cited section of scientific articles is a potential goldmine of bibliographic metadata, as is Wikispecies.

\n\n

After various experiments and false starts I''ve created a repository https://github.com/rdmpage/bib-dedup to host a series of PHP scripts to deduplicate bibliographics data. I''ve settled on using CSL-JSON as the format for bibliographic data. Because deduplication relies on comparing pairs of references, the standard format for most of the scripts is a JSON array containing a pair of CSL-JSON objects to compare. Below are the steps the code takes.

\n\n

Generating pairs to compare

\n\n

The first step is to take a list of references and generate the pairs that will be compared. I started with this approach as I wanted to explore machine learning and wanted a simple format for training data, such as an array of two CSL-JSON objects and an integer flag representing whether the two references were the same of different.

\n\n

There are various ways to generate CSL-JSON for a reference. I use a tool I wrote (see Citation parsing tool released) that has a simple API where you parse one or more references and it returns that reference as structured data in CSL-JSON.

\n\n

Attempting to do all possible pairwise comparisons rapidly gets impractical as the number of references increases, so we need some way to restrict the number of comparisons we make. One approach I''ve explored is the “sorted neighbourhood method” where we sort the references 9for example by their title) then move a sliding window down the list of references, comparing all references within that window. This greatly reduces the number of pairwise comparisons. So the first step is to sort the references, then run a sliding window over them, output all the pairs in each window (ignoring in pairwise comparisons already made in a previous window). Other methods of \"blocking\" could also be used, such as only including references in a particular year, or a particular journal.

\n\n

So, the output of this step is a set of JSON arrays, each with a pair of references in CSL-JSON format. Each array is stored on a single line in the same file in line-delimited JSON (JSONL).

\n\n

Comparing pairs

\n\n

The next step is to compare each pair of references and decide whether they are a match or not. Initially I explored a machine learning approach used in the following paper:

\n\n

\nWilson DR. 2011. Beyond probabilistic record linkage: Using neural networks and complex features to improve genealogical record linkage. In: The 2011 International Joint Conference on Neural Networks. 9–14. DOI: 10.1109/IJCNN.2011.6033192\n

\n\n

Initial experiments using https://github.com/jtet/Perceptron were promising and I want to play with this further, but I deciding to skip this for now and just use simple string comparison. So for each CSL-JSON object I generate a citation string in the same format using CiteProc, then compute the Levenshtein distance between the two strings. By normalising this distance by the length of the two strings being compared I can use an arbitrary threshold to decide if the references are the same or not.

\n\n

Clustering

\n\n

For this step we read the JSONL file produced above and record whether the two references are a match or not. Assuming each reference has a unique identifier (needs only be unique within the file) then we can use those identifier to record the clusters each reference belongs to. I do this using a Disjoint-set data structure. For each reference start with a graph where each node represents a reference, and each node has a pointer to a parent node. Initially the reference is its own parent. A simple implementation is to have an array index by reference identifiers and where the value of each cell in the array is the node''s parent.

\n\n

As we discover pairs we update the parents of the nodes to reflect this, such that once all the comparisons are done we have a one or more sets of clusters corresponding to the references that we think are the same. Another way to think of this is that we are getting the components of a graph where each node is a reference and pair of references that match are connected by an edge.

\n\n

In the code I''m using I write this graph in Trivial Graph Format (TGF) which can be visualised using a tools such as yEd.

\n\n

Merging

\n\n

Now that we have a graph representing the sets of references that we think are the same we need to merge them. This is where things get interesting as the references are similar (by definition) but may differ in some details. The paper below describes a simple Bayesian approach for merging records:

\n\n

\nCouncill IG, Li H, Zhuang Z, Debnath S, Bolelli L, Lee WC, Sivasubramaniam A, Giles CL. 2006. Learning Metadata from the Evidence in an On-line Citation Matching Scheme. In: Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries. JCDL ’06. New York, NY, USA: ACM, 276–285. DOI: 10.1145/1141753.1141817.\n

\n\n

So the next step is to read the graph with the clusters, generate the sets of bibliographic references that correspond to each cluster, then use the method described in Councill et al. to produce a single bibliographic record for that cluster. These records could then be used to, say locate the corresponding article in BHL, or populate Wikidata with missing references.

\n\n

Obviously there is always the potential for errors, such as trying to merge references that are not the same. As a quick and dirty check I flag as dubious any cluster where the page numbers vary among members of the cluster. More sophisticated checks are possible, especially if I go down the ML route (i.e., I would have evidence for the probability that the same reference can disagree on some aspects of metadata).

\n\n

Summary

\n\n

At this stage the code is working well enough for me to play with and explore some example datasets. The focus is on structured bibliographic metadata, but I may simplify things and have a version that handles simple string matching, for example to cluster together different abbreviations of the same journal name.

","tags":["data cleaning","deduplication","Phytotaxa","Wikispecies","Zootaxa"],"language":"en","references":[]},{"id":"https://doi.org/10.59350/c79vq-7rr11","uuid":"3cb94422-5506-4e24-a41c-a250bb521ee0","url":"https://iphylo.blogspot.com/2021/12/graphql-for-wikidata-wikicite.html","title":"GraphQL for WikiData (WikiCite)","summary":"I''ve released a very crude GraphQL endpoint for WikiData. More precisely, the endpoint is for a subset of the entities that are of interest to WikiCite, such as scholarly articles, people, and journals. There is a crude demo at https://wikicite-graphql.herokuapp.com. The endpoint itself is at https://wikicite-graphql.herokuapp.com/gql.php. There are various ways to interact with the endpoint, personally I like the Altair GraphQL Client by Samuel Imolorhe. As I''ve mentioned earlier it''s taken...","date_published":"2021-12-20T13:16:00Z","date_modified":"2021-12-20T13:20:05Z","date_indexed":"1909-06-16T10:52:00+00:00","authors":[{"url":null,"name":"Roderic Page"}],"image":null,"content_html":"

$\"\"$

I''ve released a very crude GraphQL endpoint for WikiData. More precisely, the endpoint is for a subset of the entities that are of interest to WikiCite, such as scholarly articles, people, and journals. There is a crude demo at https://wikicite-graphql.herokuapp.com. The endpoint itself is at https://wikicite-graphql.herokuapp.com/gql.php. There are various ways to interact with the endpoint, personally I like the Altair GraphQL Client by Samuel Imolorhe.

\n\n

As I''ve mentioned earlier it''s taken me a while to see the point of GraphQL. But it is clear it is gaining traction in the biodiversity world (see for example the GBIF Hosted Portals) so it''s worth exploring. My take on GraphQL is that it is a way to create a self-describing API that someone developing a web site can use without them having to bury themselves in the gory details of how data is internally modelled. For example, WikiData''s query interface uses SPARQL, a powerful language that has a steep learning curve (in part because of the administrative overhead brought by RDF namespaces, etc.). In my previous SPARQL-based projects such as Ozymandias and ALEC I have either returned SPARQL results directly (Ozymandias) or formatted SPARQL results as schema.org DataFeeds (equivalent to RSS feeds) (ALEC). Both approaches work, but they are project-specific and if anyone else tried to build based on these projects they might struggle for figure out what was going on. I certainly struggle, and I wrote them!

\n\n

So it seems worthwhile to explore this approach a little further and see if I can develop a GraphQL interface that can be used to build the sort of rich apps that I want to see. The demo I''ve created uses SPARQL under the hood to provide responses to the GraphQL queries. So in this sense it''s not replacing SPARQL, it''s simply providing a (hopefully) simpler overlay on top of SPARQL so that we can retrieve the data we want without having to learn the intricacies of SPARQL, nor how Wikidata models publications and people.

","tags":["GraphQL","SPARQL","WikiCite","Wikidata"],"language":"en","references":[]}]}' recorded_at: Sun, 18 Jun 2023 15:24:19 GMT recorded_with: VCR 6.1.0