Callosciurus erythraeus subsp. thaiwanensis (Bonhote, 1901), red-bellied tree squirrel (赤腹松鼠). Photo by 尹若宇, via iNaturalist (CC BY-NC 4.0)

Chapter 3: Prepare datasets for publishing


3.1 Overview of publishing biodiversity datasets through GBIF

A dataset consists of metadata and raw data; the simplest dataset contains metadata only. GBIF currently supports four classes of datasets: 1) metadata-only, 2) checklist, 3) occurrence, and 4) sampling-event datasets, listed in order of increasing richness of content and complexity of structure. To preserve as much information as possible, we suggest that data providers publish the richest class of dataset their data can support.

Publishing a dataset on the web is much like releasing a product, and the metadata is the product's specification and manual. It tells users what the data contain, where the data came from, how to use them, and whom to contact with questions. A product is a kind of resource, so a dataset is likewise a resource on the web; in the context of GBIF, a dataset is one kind of resource. To publish a dataset (or resource) through GBIF, you prepare the resource metadata together with checklist, occurrence, or sampling-event data, and you use terms from Darwin Core and its extensions as the standardized column names of your data tables.

The simplest way to prepare a dataset is to fill in the metadata and raw data in an Excel template with a predefined format. GBIF provides such templates, which use Darwin Core terms as the standardized column names, so your data end up in a structure that GBIF accepts.

However, a very large table, for example one with more than ten thousand records, is hard to handle in Excel. In that situation we suggest using R, an open-source environment for statistics and data processing. If you are new to R, we encourage you to learn it, because it is one of the most popular tools for data analysis. In this chapter we show how to prepare a sampling-event dataset using an Excel template and R scripts.

3.2 What is sampling-event data?

Datasets can provide far more information than where and when a species was observed. They can also reveal the community composition of a broader taxonomic group, or the abundance of species at different times and places. Such datasets typically come from standardized protocols for measuring and monitoring biodiversity, such as vegetation transects, bird survey programs, and freshwater or marine sampling. By documenting the sampling methods, the sampling events, and the relative abundance of the species recorded in each sample, these datasets make it easier to compare data collected at different times and places under the same protocol, and sometimes allow researchers to infer the absence of a species from a site. (Original reference: http://www.gbif.org/publishing-data/summary)

3.2.1 Selected core terms for Sampling Event + Location

(** = required)
eventID**                          minimumElevationInMeters
parentEventID                      maximumElevationInMeters
eventDate**                        minimumDepthInMeters
habitat                            maximumDepthInMeters
samplingProtocol**                 minimumDistanceAboveSurfaceInMeters
sampleSizeValue**                  maximumDistanceAboveSurfaceInMeters
sampleSizeUnit**                   continent
samplingEffort                     country
locationID                         countryCode
decimalLatitude                    stateProvince
decimalLongitude                   county
geodeticDatum                      municipality
coordinateUncertaintyInMeters**    locality
coordinatePrecision**

Reference: http://tools.gbif.org/dwca-validator/extension.do?id=dwc:Event
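Before filling in a template, it can help to check each Event row for the terms marked ** in the list above. The following is a minimal Python sketch of such a check, not part of any GBIF tool. The function name `missing_required` and all example values are hypothetical, and the list of required terms simply copies the ** markers from this section.

```python
# Terms marked ** (required) in the Event + Location list above.
REQUIRED_EVENT_TERMS = [
    "eventID",
    "eventDate",
    "samplingProtocol",
    "sampleSizeValue",
    "sampleSizeUnit",
    "coordinateUncertaintyInMeters",
    "coordinatePrecision",
]


def missing_required(row, required=REQUIRED_EVENT_TERMS):
    """Return the required terms that are absent or blank in one Event row."""
    missing = []
    for term in required:
        value = row.get(term)
        # A term counts as missing when the key is absent or the cell is blank.
        if value is None or not str(value).strip():
            missing.append(term)
    return missing


# Hypothetical Event rows; every value is invented for illustration.
complete_event = {
    "eventID": "plot-001",
    "eventDate": "2017-05-12",
    "samplingProtocol": "10 m x 10 m vegetation plot",
    "sampleSizeValue": "100",
    "sampleSizeUnit": "square metre",
    "coordinateUncertaintyInMeters": "30",
    "coordinatePrecision": "0.0001",
}
# Same row with a new ID and a blank sampleSizeUnit cell.
incomplete_event = {**complete_event, "eventID": "plot-002", "sampleSizeUnit": " "}

for event in (complete_event, incomplete_event):
    print(event["eventID"], missing_required(event))
```

A row that returns an empty list has a value for every required term; the check says nothing about whether those values are valid.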

3.2.2 MeasurementOrFact Core Terms

  • eventID**
  • measurementID
  • measurementType
  • measurementValue
  • measurementAccuracy
  • measurementUnit
  • measurementDeterminedBy
  • measurementDeterminedDate
  • measurementMethod
  • measurementRemarks

Reference: http://tools.gbif.org/dwca-validator/extension.do?id=dwc:MeasurementOrFact
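Field sheets often store measurements as one column per variable, whereas the MeasurementOrFact terms above describe one row per measured value. The Python sketch below shows one possible way to reshape such data. The column names `canopy_cover` and `slope`, the measurement types and units, and the `measurementID` pattern are all assumptions made for illustration.

```python
# Hypothetical mapping: original column -> (measurementType, measurementUnit).
MEASURE_COLUMNS = {
    "canopy_cover": ("canopy cover", "%"),
    "slope": ("slope", "degrees"),
}


def to_measurements(plot_rows, measure_columns=MEASURE_COLUMNS):
    """Turn wide measurement columns into one MeasurementOrFact row per value."""
    records = []
    for row in plot_rows:
        for column, (measurement_type, unit) in measure_columns.items():
            value = row.get(column)
            if value is None or str(value).strip() == "":
                continue  # skip blank cells rather than emit empty measurements
            records.append({
                "eventID": row["eventID"],  # link back to the Event table
                "measurementID": f"{row['eventID']}:{column}",
                "measurementType": measurement_type,
                "measurementValue": value,
                "measurementUnit": unit,
            })
    return records


# Hypothetical plot rows; values are invented for illustration.
plots = [
    {"eventID": "plot-001", "canopy_cover": "65", "slope": "12"},
    {"eventID": "plot-002", "canopy_cover": "", "slope": "8"},
]

for record in to_measurements(plots):
    print(record)
```

Every output row carries the eventID, so each measurement can be traced back to the sampling event it describes.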

3.2.3 Selected core terms for Occurrence + Taxon


eventID**            individualCount
occurrenceID**       organismQuantity
scientificNameID     organismQuantityType
scientificName       sex
taxonID              lifeStage
vernacularName       reproductiveCondition
taxonRank            behavior
kingdom              occurrenceStatus
phylum               acceptedNameUsageID
class                acceptedNameUsage
order                specificEpithet
family               infraspecificEpithet
genus                scientificNameAuthorship
recordedBy

Reference: http://tools.gbif.org/dwca-validator/extension.do?id=dwc:Occurrence
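To illustrate how these terms fit together, the Python sketch below builds Occurrence rows for one sampling event from a table of counts. It is only an illustration under stated assumptions: the `occurrenceID` pattern is hypothetical, the counts are invented, and a count of zero is recorded with occurrenceStatus "absent", which is how a sampling-event dataset can document that a species was looked for but not found.

```python
def build_occurrences(event_id, counts):
    """Create one Occurrence row per species counted in a single event."""
    rows = []
    # Sort by name so that occurrenceIDs are assigned in a reproducible order.
    for index, (name, count) in enumerate(sorted(counts.items()), start=1):
        rows.append({
            "eventID": event_id,
            "occurrenceID": f"{event_id}:occ-{index:03d}",  # hypothetical pattern
            "scientificName": name,
            "individualCount": count,
            "occurrenceStatus": "present" if count > 0 else "absent",
        })
    return rows


# Hypothetical counts for one plot; values are invented for illustration.
plot_counts = {"Mikania micrantha": 0, "Bidens pilosa": 12}

for row in build_occurrences("plot-001", plot_counts):
    print(row)
```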

3.3 Darwin Core - Sampling-event Star Schema

3.3.1 Star schema

Structure your core table with column names taken from Darwin Core terms. Data that do not belong in the core can be included as extensions, which act as supplementary descriptions of the core records. Every extension table must contain a field that also appears in the core table, ideally a unique identifier, so that each extension row can be traced back to its core record. You can picture this as an Excel workbook: the first sheet holds the core table, and the other sheets hold extension tables whose rows all relate to rows on the first sheet.
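The link between core and extension tables can be checked mechanically. The Python sketch below assumes that eventID is the shared identifier and that each table has been loaded as a list of dictionaries. The function names and example rows are hypothetical.

```python
from collections import Counter


def duplicate_ids(core_rows, key="eventID"):
    """Return identifiers that occur more than once in the core table."""
    counts = Counter(row[key] for row in core_rows)
    return sorted(identifier for identifier, n in counts.items() if n > 1)


def orphan_rows(core_rows, extension_rows, key="eventID"):
    """Return extension rows whose identifier has no match in the core table."""
    core_ids = {row[key] for row in core_rows}
    return [row for row in extension_rows if row.get(key) not in core_ids]


# Hypothetical tables; values are invented for illustration.
events = [{"eventID": "plot-001"}, {"eventID": "plot-002"}]
occurrences = [
    {"eventID": "plot-001", "occurrenceID": "occ-1"},
    {"eventID": "plot-003", "occurrenceID": "occ-2"},  # no such event in the core
]

print(duplicate_ids(events))
print(orphan_rows(events, occurrences))
```

An empty result from both functions means that every core identifier is unique and every extension row points at an existing core record.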

3.3.2 How to publish data?

  1. Prepare raw data
  2. Check and clean the data (correct scientific names and other errors)
  3. Adopt terms from the Darwin Core Event core and the MeasurementOrFact, Occurrence, … extensions as the column names of your data tables
  4. Transform the data into separate tables that fit the Darwin Core star schema, and save the tables as UTF-8 encoded CSV text files
  5. Prepare metadata
  6. Register an account on the IPT hosted by the node (such as TaiBIF) that endorses your organization as a data-publishing organization
  7. Log into the IPT, create a new resource, upload your data tables, and map your columns to Darwin Core terms
  8. Provide metadata
  9. Publish the data
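Step 4 above can be sketched in Python as follows. This is a minimal illustration, not the workflow used in the rest of this chapter: the flat input table, the selection of terms per table, and the function names are assumptions, and the example values are invented.

```python
import csv
import io

# Hypothetical selection of terms for each table.
EVENT_TERMS = ["eventID", "eventDate", "samplingProtocol"]
OCCURRENCE_TERMS = ["eventID", "occurrenceID", "scientificName", "individualCount"]


def split_star(flat_rows):
    """Split flat survey rows into a de-duplicated Event table and an Occurrence table."""
    events = {}
    occurrences = []
    for row in flat_rows:
        # Keep one Event row per eventID; the first occurrence of the ID wins.
        events.setdefault(row["eventID"], {t: row[t] for t in EVENT_TERMS})
        occurrences.append({t: row[t] for t in OCCURRENCE_TERMS})
    return list(events.values()), occurrences


def to_csv_text(rows, fieldnames):
    """Serialize rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def save_utf8(path, text):
    """Write CSV text to disk as UTF-8, leaving line endings untouched."""
    with open(path, "w", encoding="utf-8", newline="") as handle:
        handle.write(text)


# Hypothetical flat survey table; values are invented for illustration.
flat = [
    {"eventID": "plot-001", "eventDate": "2017-05-12", "samplingProtocol": "plot survey",
     "occurrenceID": "occ-1", "scientificName": "Bidens pilosa", "individualCount": "12"},
    {"eventID": "plot-001", "eventDate": "2017-05-12", "samplingProtocol": "plot survey",
     "occurrenceID": "occ-2", "scientificName": "Mikania micrantha", "individualCount": "3"},
    {"eventID": "plot-002", "eventDate": "2017-05-13", "samplingProtocol": "plot survey",
     "occurrenceID": "occ-3", "scientificName": "Bidens pilosa", "individualCount": "5"},
]

event_rows, occurrence_rows = split_star(flat)
print(to_csv_text(event_rows, EVENT_TERMS))
print(to_csv_text(occurrence_rows, OCCURRENCE_TERMS))
```

Calling `save_utf8("event.csv", to_csv_text(event_rows, EVENT_TERMS))` would then write the Event table to a file; the file name is only an example.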


How to publish sampling-event data?
http://github.com/gbif/ipt/wiki/samplingEventData#exemplar-datasets

3.4 Prepare sampling-event data using Excel template

Example dataset and tools for preparing data

  • Data file: 1_AlienPlantSurvey.xlsx
  • MS Excel / LibreOffice Calc
  • Text editor: Notepad++ / PSPad / …
  • TaiBIF Scientific names matching service (Nomenmatch): http://match.taibif.tw

Example data file

  • 1_AlienPlantSurvey.xlsx
  • 7 tables
    1. 1_partialSurveyPlot
    2. 2_partialPlantOccurrence
    3. 3_partialMatchingData
    4. 4_partialCleanedData
    5. 5_partialEvent
    6. 6_partialMeasurement
    7. 7_partialOccurrence


Nomenmatch


Data cleaning and transformation workflow

  1. Inspect the original data and understand its structure
  2. Check the spelling of the scientific names with the matching service, and download the matching results as a CSV text file
  3. Open the matching results file in Notepad++, copy all of the text, and paste it into a spreadsheet table
  4. Clean the names based on the matching results, and take taxonRank and the names of the higher taxonomic ranks from the results
  5. Create the Event, MeasurementOrFact, and Occurrence tables
  6. Map the original column names to Darwin Core terms
  7. Copy the data of the Event, MeasurementOrFact, and Occurrence tables into Notepad++ one table at a time, set the encoding to UTF-8, and save each table as a CSV text file
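For readers who prefer scripting to copying and pasting, steps 4 and 6 of this workflow can be sketched in Python. The sketch makes simplifying assumptions: the matching results are reduced to a plain dictionary from each submitted name to its matched name, which is not the actual output format of the matching service, and the original column names `plot`, `species`, and `count` are hypothetical.

```python
# Hypothetical mapping from original column names to Darwin Core terms.
COLUMN_MAP = {
    "plot": "eventID",
    "species": "scientificName",
    "count": "individualCount",
}


def clean_names(records, matches, name_column="species"):
    """Replace each name with its matched name; collect names that have no match."""
    cleaned, unmatched = [], []
    for record in records:
        original = record[name_column].strip()
        matched = matches.get(original)
        if matched is None:
            unmatched.append(original)  # set aside for manual review
            continue
        cleaned.append({**record, name_column: matched})
    return cleaned, unmatched


def rename_columns(records, mapping=COLUMN_MAP):
    """Rename columns to Darwin Core terms; unmapped columns keep their names."""
    return [{mapping.get(key, key): value for key, value in record.items()}
            for record in records]


# Hypothetical inputs; values are invented for illustration.
matches = {
    "Bidens pilosa": "Bidens pilosa",
    "Mikania micranta": "Mikania micrantha",  # misspelling corrected by the match
}
records = [
    {"plot": "plot-001", "species": "Bidens pilosa", "count": "12"},
    {"plot": "plot-001", "species": "Mikania micranta ", "count": "3"},
    {"plot": "plot-002", "species": "Unknown sp.", "count": "1"},
]

cleaned, unmatched = clean_names(records, matches)
print(rename_columns(cleaned))
print("Names needing manual review:", unmatched)
```

Rows with unmatched names are left out of the cleaned table on purpose, so that they are reviewed by a person instead of being published with an unverified name.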





Congratulations! You are ready to publish the dataset using IPT.