Herbadrop Use Case
GEF has been selected for the EUDAT Herbadrop Data Pilot to improve the data workflow by linking the data infrastructure and the HPC infrastructure. One important aspect of the pilot is the analysis of the data quality of the ingested herbaria. The main requirement is that operations must be strictly reproducible, independently of the computing infrastructure used. The GEF is a suitable approach for this because it allows images to be created of the overall processes, tools and libraries.
For several centuries, Natural History Collection (NHC) institutes (i.e. museums and botanical gardens) across the world have been responsible for preserving the physical copies of herbaria. These herbaria are collections of plants mounted on sheets with annotations that describe each specimen. Each herbarium specimen has been collected, carefully prepared and annotated by botanists.
Herbaria hold large numbers of collections: approximately 22 million herbarium specimens exist as botanical reference objects in Germany, 20 million in France and about 500 million worldwide. Through the years, these collections have been studied and enriched by taxonomists. Altogether, they represent a very precious yet fragile scientific foundation. Therefore, in the digital era, it has become an obvious step for the NHC institutes to undertake a vast digitization campaign of their herbaria and to limit the handling of the physical copies.
Nevertheless, the responsibility of preserving the digital versions is a new challenge. High-resolution images of herbarium specimens require substantial bandwidth and disk space. Moreover, ensuring long-term preservation of digital objects is not a straightforward task for organizations that cannot afford to manage high data volumes or to acquire the necessary knowledge of data formats that will still be readable many years from now.
Another significant challenge is the new prospect of performing all kinds of image analysis using intensive computing and post-processing techniques. Again, this requires computing skills that are not always available in NHC institutes.
Since data storage and image processing are not core competencies of NHC institutes, some of them decided to rely on a third party that can provide a Trusted Digital Repository (TDR) and shared access to the data for the whole community. Thus, all demanding tasks in terms of operational constraints, compliance with formal OAIS processes, etc. are delegated.
The two core objectives of the Herbadrop data pilot are:
- long-term preservation of scientific natural heritage: collections of digitized herbaria are transferred from several European museums and botanical gardens to a TDR.
- extraction of written information from these images through Optical Character Recognition (OCR) analysis using intensive computing.
The long-term preservation is an ongoing task that mainly concerns operational aspects. Therefore, for the purpose of this paper, the focus is on reporting the analysis of the OCR results.
Initially, the consortium was formed by five NHC institutes from Finland, France, Germany, the Netherlands and Scotland. Their common objective was to share their herbaria for future research projects. Making the specimen images and data from different institutes available online allows cross-domain research and data analysis for botanists and researchers with diverse interests (e.g. ecology, social and cultural history, climate change).
BGBM (De.): The Botanischer Garten und Botanisches Museum (BGBM) of Berlin is to a large extent based on its scientific plant collections. A central element of its activities is taxonomic research, through which plants are identified, described, named and classified.
MNHN (Fr.): The Muséum National d'Histoire Naturelle (MNHN) of Paris is in charge of the main collection of botanical and zoological specimens in France. Between 2008 and 2012, it completed a massive digitization program of the herbarium specimens, putting nearly 6 million images online. It will greatly benefit from the pilot for both long-term preservation of image files and extraction of the label information.
RBGE (Sco.): The Royal Botanic Garden Edinburgh (RBGE) has a very active herbarium of 3 million specimens and a living collection of around 64,000 plants. All of the living collection records, including more than 40,000 linked images, are online, and 300,000 of the herbarium specimens have been imaged at high resolution and are available online. RBGE has incorporated OCR technology into the digitisation workflow and is currently testing Handwritten Text Recognition.
Digitarium (Fin.): Digitarium is the digitisation centre of the Finnish Museum of Natural History and the University of Eastern Finland. In 2014, Digitarium coordinated an H2020 proposal for designing a European distributed digitisation infrastructure for natural heritage (acronym: ICEDIG).
Naturalis (NL): Naturalis Biodiversity Center (Naturalis, Leiden, The Netherlands) is the merger of the National Museum of Natural History, the Zoological Museum of Amsterdam and the National Herbarium of the Netherlands. Naturalis has just finished its mass digitisation project, in which 4.2 million higher plants were scanned, databased and published. Naturalis decided not to extend its membership after the first phase of Herbadrop, but will come back since it is a partner of the forthcoming ICEDIG project.
A new partner, Botanic Garden Meise, Belgium, joined the consortium in 2017. The herbarium of Botanic Garden Meise houses around 4 million specimens. The Vascular Plant Herbarium contains three main collections: the General Herbarium with more than one million specimens; the Belgian Herbarium with about 200,000 specimens; and the African Herbarium comprising at least one million specimens (of which over half are from central Africa). The 800,000 specimens in the Cryptogam Herbarium consist of mosses, lichens, algae, fungi and myxomycetes.
There are a number of benefits that contribute to the acceptance of GEF by both communities and service providers.
From the community's point of view, every operation is under control and reproducible.
From the service provider's point of view, the service is generic: there are no user-specific tools and libraries to install for every use case. This has a positive impact on operational costs.
However, another important requirement for accepting the GEF is security. Even though discussions with security experts are promising, since Docker technology has made tremendous improvements, there is still reluctance to deploy such solutions in a production environment. Hopefully, the GEF approach will offer more security guarantees than installing specific tools.