Cookbook 2 - 資料標準化

2.1 Use common raw data format

As described in the previous chapter, biodiversity data are everywhere, and in different format in different media. They may be fixed in notebooks, in PC as Excel or Word files, or already printed on papers with available PDF.

To avoid confusion, let’s define some words. As you see, a set of biodiversity data is a combination of data to show what species occurred in where and when, and how it was recognized. A single description of such observation or description is termed “record”. If you go to a field research multiple records can be obtained by a certain methodology. These set of record are termed “dataset” (e.g. “tree species observed in Mt. MM by line transect”).

Imagine that if you get a data table, maybe an Excel file or a text file with tens of columns and hundreds of rows filled with text-strings or numbers. What will you do first to understand the meaning of this data file? If there is no extra information to explain the data, probably you will try to read the first row, the header, of this data table.

Table 2.1 An example of dataset

ID	Plot	Zone	Common Name	Sp	x	y
1	4094	北部	鵝兒腸	Stellaria aquatica (L.) Scop.	120.9785	24.68604
2	4094	北部	藿香薊	Ageratum conyzoides L.	120.9785	24.68604
3	4094	北部	大花咸豐草	Bidens pilosa L. var. radiata(Sch. Bip.) J. A. Schmidt	120.9785	24.68604
4	4094	北部	野茼蒿	Conyza sumatrensis(Retz.) E. Walker	120.9785	24.68604
5	4094	北部	粗毛小米菊	Galinsoga quadriradiataRuiz & Pav.	120.9785	24.68604

For example, what are the meaning of “id”, “plot”, “zone”, “common name”, “sp”, “x”, “y”? If you are a botanist or a plant ecologist and you also can read Chinese characters, you may easily guess this probably a data file about a field survey of plant, with record number, plot number, survey area, Chinese common name of plant, scientific name, coordinates of plot location.Different people may use different ways to name the column titles of their data table. If you get a data table that you don’t understand the meaning of its header it will wastes you a lot of time to guess. Even worse, if you have to integrate many data tables from different people without common way or standardization. Soon you will be crazy. Therefore, to facilitate data exchange and integration, why don’t we have a standard for documenting the data to save everybody’s life? The first step is to adopt standardized column names!Take a look at Fig. 2.1, representing a concept of merging data based on literature, observation, and specimens. The three data sets obtained from each of these data source all includes taxonomy (name of the species), when (date, time), where (location), and additional information. However, it is not possible just appending one record after another, because of the following reason.

Fig 2.1 Consolidation concept of document, observation, and specimen data. The information obtained from each data source in these three datasets includes taxonomy (species name), when (date, time), where (location) and other information, but cannot be directly appended to the record for the following reasons:

Observation Table A

Name	When	Where
麻雀	上午7:00, 2015.04.01	台北，台灣

Specimen Table B

Genus	Species	Author	Date	Where	Collector	Identified	by
Dasyscyphella	Longistipitata	Hosoya	2015-iv-1	日本茨城，筑波，植物園	T. Hosoya	T. Hosoya	日本山毛櫸殼斗

Literature Table C

Name	Date	Where	Latitute	Longitude
Dasyscyphella longistipitata	2015/04/01	青森，八甲田山	40.523591	140.970943

The property of each record in the cell is indicated differently. For instance, the name of the species is shown under “Name” written in common name (House sparrow) in Table 1, while species name is given in scientific name (Latin name) and separated into three cells (Genus, Species, and Author) in Table B, but in one cell in Table 3. It should be noted that the same word “Name” shows different meaning in Table A and Table C (Name in Table 1 indicates common name while the same word indicates the scientific name) .
The definition of each column is different from each other.
The format of the data is not following the same rule. For example, The locality in Table C (indicated as “Where”) is not accompanied with the name of the country, while the other two are. The same date (April 1, 2015) is indicated differently in the three tables).

From above analyses, we recognize the following caution must be drawn to merge the table.

Each column must be consistent in terminology, and appropriately defined.
The format of the data must follow a standardized rule.

Then, is there any standardized data format? Actually, there is! It is called “Darwin Core” (DwC).

2.2 Use Darwin Core as the main standard

Darwin core was developed by a group of scientist (Taxonomic Databases Working Group, or TDWG, now known as Biodiversity Information Standards). In a simple explanation, you may say it’s a collection of vocabularies of terms with predefined meaning and suggested usage used to document biodiversity data. It was originally developed as an extension of Dublin Core for literature, and now world widely known as stable standard reference for sharing information on biodiversity. DwC allows us to make a stable dataset based on more than 150 items (termed “fields” in database jargon). Each field is strictly defined and follow a rule in format. Some of the representative fields are listed in Table #. Since DwC is consisted of wide range of fields, some fields are grouped based on their common nature. This group is called “Class” (Note they are different from the word “Class” used in taxonomy). DwC are maintained and being revised and expanded by TDWG.

Below is the list of classes and some examples of the fields in DwC. For full descriptions, visit http://rs.tdwg.org/dwc/terms/ or http://tools.gbif.org/dwca-assistant/.

The Class Record-level Terms provides the information to manage each data. This fields in this class may provide links with data in other databases.
The Class Occurrence provides the fundamental information when and where the data were recorded.
The Class Organism provides the information regarding the nature of the given organisms, such as textual names or labels, whether the organism was recognized as a clone or colony, etc. This class is distinguished from the Class Taxon (see below) in describing more informal taxonomic information.
The Classes MaterialSample, LivingSpecimen, PreservedSpecimen may be chosen based on the nature of the basis of the record. It gives distinction whether the record is based on sampled materials or living specimens or preserved specimens.
The Classes Event, HumanObservation, MachineObservation gives the information whether the record was obtained by any kind of sampling activity or observation made by human or machines. Like the previous class, one of these should be chosen on necessary basis.
The Class Location provides information regarding the location where the record was obtained.
The Class GeologicalContext provides the geological information, such as stratigraphy, that qualifies a region or place. Since this class is in DwC, fossil information can be incorporated.
The Class Identification provides information for taxonomic determination, such as who and when identified, references for the identification, etc.
The Class Taxon provides taxonomic information, such as scientific names, higher taxonomic ranks, etc.
[Auxiliary Terms] The Class MeasurementOrFact provides data if the record describes the measurment or the fact.
The Class ResourceRelationship provides XXXXXXXXXXXX

To simplify choosing suitable terms as the column names of tables for documenting your biodiversity data, Global Biodiversity Information Facility (GBIF) kindly prepares Excel templates for documenting checklists and occurrence of organisms as well as extra detailed information about the data table, called “metadata”. We will talk about “metadata” later. You can download the Excel templates for documenting species checklist data （https://github.com/gbif/ipt/wiki/checklistData）, species occurrences data （https://github.com/gbif/ipt/wiki/occurrenceData）, and sampling-event data （https://github.com/gbif/ipt/wiki/sampleEventData）. But you have to notice that some of the fields (columns) are mandatory, some may be optional. If you need more optional terms to describe your data you can look up the extended vocabularies, called Darwin Core extensions （http://tools.gbif.org/dwca-validator/extensions.do）, and pick up suitable terms.

BOX1: Converting data between sexagesimal and decimal system in coordinates

Many of the readers may have used longitude and latitude to indicate the coordinates. They are usually indicated by degrees, minutes, and seconds. Degrees, minutes, and seconds follow sexagesimal system, but seconds are sometimes indicated to 1/100. Once getting used to it, the system is convenient to indicate a certain point on the map. However, it is complicated because we need a set of three figures to indicate latitude/longitude. Recently a number of map drawing software adopt decimal systems. Both systems is compatible based on the calculations indicated in Fig. X. Since only a set of two figures are required to indicate a point, decimal systems is more increasingly applied to indicate longitude and latitude.

From the above experience, we can learn an important lesson. We should design the table beforehand based on DwC. This increases the efficiency and avoid confusion and fluctuation in description of the data.

2.3 Use plain text instead of Excel file

If you have any experience in gathering data of any kind, the first thing that comes in your mind may be using “Excel”. Excel is a nice spreadsheet software that allows not only the accumulation of data, but also analysis and exploitation.

The next question is to save the data. We now have a merged digitized data in a single table, which enables us to analyze. To save the data, we need to select a file format. Most of the readers may use Microsoft Excel or other spreadsheet software as a standard software. It is therefore natural to choose Excel format (.xls or .xlsx), but there is a certain limitation for the numbers of records in Excel format. Another disadvantage may be that Excel format file is not directly incorporated in database software with higher capability. Besides, Excel format sometimes causes garbling up of the data. Therefore, it is recommended for you to save the file as a plain text format (.csv or .tsv). In the former format known as Comma-separated Values Format (CSV), comma are inserted between the field, and line break (CR) is inserted between the data. Therefore, when the excel data looked like below

Column1	Column2	Column3	Column4
1	A	B	C
2	D	E	F

may be seen like below under editor software
Column1 [,/tab] Column2 [,/tab] Column3 [,/tab] 欄位4 [CR]
1 [,/tab] A [,/tab] B [,/tab] C [CR]
2 [,/tab] D [,/tab] E [,/tab] F [CR]
where [,/tab] is a comma (,) or tab code, respectively in CSV and TSV, and [CR] is a line break code.
The difference of Tab-separated Value Format (TSV) is that TSV files use [Tab] instead of comma [,]. Both file format are readable in editors, and usually supported by various database software.

2.4 Adopt UTF-8 character encoding

Finally, one more thing to remember as a tip when saving the data. Use UTF-8 character encoding when you deal with the data. In the present textbook, detailed reason for this will not be described because it is highly technical and informatician directed.

Simply, UTF-8 is a universally applicable character encoding that avoids garble up for almost all the languages, and to ensure data readability in almost any computers. It is currently taken as a global standard character encoding, and recent websites generally adopt UTF-8. Excel does not support UTF-8, but usually editors support UTF-8 (Usually you can choose UTF-8 from “character encoding” menu). So, it is advisable to save the data in plain text (CSV or TSV) format with UTF-8 encoding, if you are using letters other than alphabets and digits.

BOX2: How to generate unique identifiers?

In the context of biodiversity informatics, a number of data are to be distinguished by identifiers (ID). IDs are supposed to be given to any data that should be uniquely distinguished. Naturally, an ID must be uniquely given to a specific data in the given dataset. For instance, if “101” is given to a scientific name “X”, 101 should be always used when X is used throughout the given database, thus distinguishing X from other scientific name. In the database of occurrence, occurrence ID may be given. In that case, given ID must appear only once to distinguish the data.

Duplication of the IDs with other database may be acceptable, if they mean the same data. However, in case of occurrence ID, it should be globally unique and resolvable. Here, “resolvable” means the ID also works as a link to enable a data user (or a machine) to find the useful information about the thing that is identified.

To generate such an ID, UUID (Universally Unique Identifier) is applicable. UUID is a strings (a line of digits and letters) generated at random (e.g. 8612280c-f762-11e1-a439-00145eb45e92). The methodology of generating UUID is established, and can be obtained through various ways. For example, “uuidgen command” is available for LINUX and macOS (OS X); Microsoft GUID Generator can be downloaded for windows; and online UUID generator services are also available such as UUID Generator.

Data Mobilization and Application Workshop

Chapter 2: Standardize data