2.1 Use common raw data format
As described in the previous chapter, biodiversity data are everywhere, and in different format in different media. They may be fixed in notebooks, in PC as Excel or Word files, or already printed on papers with available PDF.
To avoid confusion, let’s define some words. As you see, a set of biodiversity data is a combination of data to show what species occurred in where and when, and how it was recognized. A single description of such observation or description is termed “record”. If you go to a field research multiple records can be obtained by a certain methodology. These set of record are termed “dataset” (e.g. “tree species observed in Mt. MM by line transect”).
Imagine that if you get a data table, maybe an Excel file or a text file with tens of columns and hundreds of rows filled with text-strings or numbers. What will you do first to understand the meaning of this data file? If there is no extra information to explain the data, probably you will try to read the first row, the header, of this data table.
Table 2.1 An example of dataset
For example, what are the meaning of “id”, “plot”, “zone”, “common name”, “sp”, “x”, “y”? If you are a botanist or a plant ecologist and you also can read Chinese characters, you may easily guess this probably a data file about a field survey of plant, with record number, plot number, survey area, Chinese common name of plant, scientific name, coordinates of plot location.
Different people may use different ways to name the column titles of their data table. If you get a data table that you don’t understand the meaning of its header it will wastes you a lot of time to guess. Even worse, if you have to integrate many data tables from different people without common way or standardization. Soon you will be crazy. Therefore, to facilitate data exchange and integration, why don’t we have a standard for documenting the data to save everybody’s life? The first step is to adopt standardized column names!
Take a look at Fig. 2.1, representing a concept of merging data based on literature, observation, and specimens. The three data sets obtained from each of these data source all includes taxonomy (name of the species), when (date, time), where (location), and additional information. However, it is not possible just appending one record after another, because of the following reason.
Fig 2.1 Consolidation concept of document, observation, and specimen data. The information obtained from each data source in these three datasets includes taxonomy (species name), when (date, time), where (location) and other information, but cannot be directly appended to the record for the following reasons:
Observation Table A
|Sparrow||7:00 am, April 1, 2015||Taipei,Taiwan|
Specimen Table B
|Dasyscyphella||Longistipitata||Hosoya||2015-iv-1||Botanical Garden, Tsukuba, Ibaraki, Japan||T. Hosoya||T. Hosoya||On Fagus crenata cupule|
Literature Table C
- The property of each record in the cell is indicated differently. For instance, the name of the species is shown under “Name” written in common name (House sparrow) in Table 1, while species name is given in scientific name (Latin name) and separated into three cells (Genus, Species, and Author) in Table B, but in one cell in Table 3. It should be noted that the same word “Name” shows different meaning in Table A and Table C (Name in Table 1 indicates common name while the same word indicates the scientific name) .
- The definition of each column is different from each other.
- The format of the data is not following the same rule. For example, The locality in Table C (indicated as “Where”) is not accompanied with the name of the country, while the other two are. The same date (April 1, 2015) is indicated differently in the three tables).
From above analyses, we recognize the following caution must be drawn to merge the table.
- Each column must be consistent in terminology, and appropriately defined.
- The format of the data must follow a standardized rule.
2.2 Use Darwin Core as the main standard
Darwin core was developed by a group of scientist (Taxonomic Databases Working Group, or TDWG (http://www.tdwg.org/), now known as Biodiversity Information Standards). In a simple explanation, you may say it’s a collection of vocabularies of terms with predefined meaning and suggested usage used to document biodiversity data. It was originally developed as an extension of Dublin Core for literature, and now world widely known as stable standard reference for sharing information on biodiversity. DwC allows us to make a stable dataset based on more than 150 items (termed “fields” in database jargon). Each field is strictly defined and follow a rule in format. Some of the representative fields are listed in Table #. Since DwC is consisted of wide range of fields, some fields are grouped based on their common nature. This group is called “Class” (Note they are different from the word “Class” used in taxonomy). DwC are maintained and being revised and expanded by TDWG.
The Class Record-level Terms provides the information to manage each data. This fields in this class may provide links with data in other databases.
To simplify choosing suitable terms as the column names of tables for documenting your biodiversity data, Global Biodiversity Information Facility (GBIF) kindly prepares Excel templates for documenting checklists and occurrence of organisms as well as extra detailed information about the data table, called “metadata”. We will talk about “metadata” later. You can download the Excel templates for documenting species checklist data (https://github.com/gbif/ipt/wiki/checklistData), species occurrences data (https://github.com/gbif/ipt/wiki/occurrenceData), and sampling-event data (https://github.com/gbif/ipt/wiki/sampleEventData). But you have to notice that some of the fields (columns) are mandatory, some may be optional. If you need more optional terms to describe your data you can look up the extended vocabularies, called Darwin Core extensions ( http://tools.gbif.org/dwca-validator/extensions.do), and pick up suitable terms.
為了簡化適用於記錄生物多樣性資料的欄位名稱，全球生物多樣性資訊機構（GBIF）提供了Excel模板，用於記錄物種名錄和出現紀錄，以及任何有關於此資料表的其他資訊細節（亦稱為「詮釋資料」）。 我們之後會討論「詮釋資料」。 你可以從這裡（https://github.com/gbif/ipt/wiki/checklistData）下載用於記錄物種名錄資料的Excel模板，從這裡下載物種出現紀錄資料模板（https://github.com/gbif/ipt/wiki/occurrenceData）， 以及從這裡下載調查活動資料模板（https://github.com/gbif/ipt/wiki/sampleEventData）。 但要注意的是，有些欄位是必填的，而有些是選擇性提供即可。 若你需要更多的選擇性欄位詞彙來描述你的資料，你可以從達爾文核心延伸集找到更多的選項（http://tools.gbif.org/dwca-validator/extensions.do），再從中選取適合的詞彙。
- Darwin Core: http://rs.tdwg.org/dwc/terms/
- Doublin Core:http://dublincore.org/
BOX1: Converting data between sexagesimal and decimal system in coordinates
Many of the readers may have used longitude and latitude to indicate the coordinates. They are usually indicated by degrees, minutes, and seconds. Degrees, minutes, and seconds follow sexagesimal system, but seconds are sometimes indicated to 1/100. Once getting used to it, the system is convenient to indicate a certain point on the map. However, it is complicated because we need a set of three figures to indicate latitude/longitude. Recently a number of map drawing software adopt decimal systems. Both systems is compatible based on the calculations indicated in Fig. X. Since only a set of two figures are required to indicate a point, decimal systems is more increasingly applied to indicate longitude and latitude.
Since the data structure is now being standardized, they are ready to merge. The merged result is Table #. Note we may be able to fill up some of the fields (yellow cells) based on taxonomic information, or add more information to latitude and longitude. We may be also able to add some more fields in the Class of Location (e.g. continent, country, countryCode, etc.) as given in DwC.
From the above experience, we can learn an important lesson. We should design the table beforehand based on DwC. This increases the efficiency and avoid confusion and fluctuation in description of the data.
2.3 Use plain text instead of Excel file
If you have any experience in gathering data of any kind, the first thing that comes in your mind may be using “Excel”. Excel is a nice spreadsheet software that allows not only the accumulation of data, but also analysis and exploitation.
The next question is to save the data. We now have a merged digitized data in a single table, which enables us to analyze. To save the data, we need to select a file format. Most of the readers may use Microsoft Excel or other spreadsheet software as a standard software. It is therefore natural to choose Excel format (.xls or .xlsx), but there is a certain limitation for the numbers of records in Excel format. Another disadvantage may be that Excel format file is not directly incorporated in database software with higher capability. Besides, Excel format sometimes causes garbling up of the data. Therefore, it is recommended for you to save the file as a plain text format (.csv or .tsv). In the former format known as Comma-separated Values Format (CSV), comma are inserted between the field, and line break (CR) is inserted between the data. Therefore, when the excel data looked like below
may be seen like below under editor software
Field_1 [,/tab] Field_2 [,/tab] Field_3 [,/tab] Field_4 [CR]
1 [,/tab] A [,/tab] B [,/tab] C [CR]
2 [,/tab] D [,/tab] E [,/tab] F [CR]
where [,/tab] is a comma (,) or tab code, respectively in CSV and TSV, and [CR] is a line break code.
The difference of Tab-separated Value Format (TSV) is that TSV files use [Tab] instead of comma [,]. Both file format are readable in editors, and usually supported by various database software.
2.4 Adopt UTF-8 character encoding
Finally, one more thing to remember as a tip when saving the data. Use UTF-8 character encoding when you deal with the data. In the present textbook, detailed reason for this will not be described because it is highly technical and informatician directed.
Simply, UTF-8 is a universally applicable character encoding that avoids garble up for almost all the languages, and to ensure data readability in almost any computers. It is currently taken as a global standard character encoding, and recent websites generally adopt UTF-8. Excel does not support UTF-8, but usually editors support UTF-8 (Usually you can choose UTF-8 from “character encoding” menu). So, it is advisable to save the data in plain text (CSV or TSV) format with UTF-8 encoding, if you are using letters other than alphabets and digits.
BOX2: How to generate unique identifiers?
In the context of biodiversity informatics, a number of data are to be distinguished by identifiers (ID). IDs are supposed to be given to any data that should be uniquely distinguished. Naturally, an ID must be uniquely given to a specific data in the given dataset. For instance, if “101” is given to a scientific name “X”, 101 should be always used when X is used throughout the given database, thus distinguishing X from other scientific name. In the database of occurrence, occurrence ID may be given. In that case, given ID must appear only once to distinguish the data.
Duplication of the IDs with other database may be acceptable, if they mean the same data. However, in case of occurrence ID, it should be globally unique and resolvable. Here, “resolvable” means the ID also works as a link to enable a data user (or a machine) to find the useful information about the thing that is identified.
To generate such an ID, UUID (Universally Unique Identifier) is applicable. UUID is a strings (a line of digits and letters) generated at random (e.g. 8612280c-f762-11e1-a439-00145eb45e92). The methodology of generating UUID is established, and can be obtained through various ways. For example, “uuidgen command” is available for LINUX and macOS (OS X); Microsoft GUID Generator can be downloaded for windows; and online UUID generator services are also available such as UUID Generator.