Changes

643 bytes added, 17:36, 20 February 2017
== TSG suggested file format for experiment data ==
The TSG suggests a common file format for storing experimental data. Adhering to this format whenever practical makes it easier to re-use files and tools. The file is plain text for easy inspection and manipulation. The format is a tab-separated values (tsv) file with the following specifications:
    
===== File =====
* File encoding is ASCII or UTF-8.
* The file contains no byte order mark (BOM) or other magic number. Without a BOM, a UTF-8 file that contains only ASCII characters is byte-for-byte identical to an ASCII file.
    
===== Lines =====
* Lines are separated by the '''\r\n''' line delimiter for better compatibility between operating systems.
* The line delimiter should also be added after the last line. This simplifies stream reading, since every record (line) is terminated, and allows a readline() function to be used to read each line.
* The first line contains a header with column/field names.
    
===== Fields =====
* Fields are separated by the '''tab''' field delimiter, because tabs rarely occur in text. This allows commas and semicolons to be used in sentences without an escape character.
* The field delimiter should '''not''' be added after each line's last field. This allows a split() function to be used for parsing a line.
* The last field in a line must not be empty, because a non-empty last field shows parsers that the previous rule was obeyed.
* Fields are never surrounded by a quoting character.
* White space before or after a field delimiter is considered part of the field.
* There is no defined escape character. If your data can contain tabs or newlines, use a different field delimiter or file format.
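The line and field rules above can be sketched in Python. This is a minimal illustration, not a reference parser: the function names parse_line and read_records and the sample text are assumptions made for this example. It reads terminated lines with readline() and splits them on tabs, rejecting a line whose last field is empty.

```python
import io


def parse_line(line):
    """Split one terminated record into its fields (hypothetical helper)."""
    if not line.endswith("\r\n"):
        raise ValueError("record is not terminated by \\r\\n")
    # Drop the two-character line delimiter, then split on the tab delimiter.
    fields = line[:-2].split("\t")
    # An empty last field means a trailing tab or an empty value.
    if fields[-1] == "":
        raise ValueError("the last field in a line must not be empty")
    return fields


def read_records(stream):
    """Yield one list of fields per line, using readline() on the stream."""
    while True:
        line = stream.readline()
        if line == "":  # end of stream
            break
        yield parse_line(line)


# newline="" keeps the \r\n delimiters untouched when reading.
sample = "User ID\tHair color\r\n1\tbrown\r\n2\t blond \r\n"
rows = list(read_records(io.StringIO(sample, newline="")))
```

White space around a value is kept as part of the field, so the last row here still contains the spaces around "blond".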
    
===== Data =====
* For numbers the decimal separator is a dot, not a comma. There is no thousands separator.
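As a rough sketch of how a program might write a file that follows the specifications above, the Python below joins fields with tabs, terminates every line with \r\n and refuses fields that contain tabs or newlines. The helper names format_record and write_table, the sample rows and the temporary file location are hypothetical, chosen only for illustration.

```python
import os
import tempfile


def format_record(fields):
    """Join fields into one terminated line (hypothetical helper)."""
    cells = []
    for value in fields:
        # str() of a Python float uses a dot and never a thousands separator.
        text = str(value)
        if "\t" in text or "\r" in text or "\n" in text:
            raise ValueError(f"field contains a tab or newline: {text!r}")
        cells.append(text)
    if not cells or cells[-1] == "":
        raise ValueError("the last field in a line must not be empty")
    # No tab after the last field; every line ends with \r\n.
    return "\t".join(cells) + "\r\n"


def write_table(path, header, rows):
    # newline="" stops Python from translating the explicit \r\n,
    # and the "utf-8" codec writes no byte order mark.
    with open(path, "w", encoding="utf-8", newline="") as handle:
        handle.write(format_record(header))
        for row in rows:
            handle.write(format_record(row))


# Usage: write a small table to a temporary file and read the raw bytes back.
path = os.path.join(tempfile.mkdtemp(), "example.tsv")
write_table(path,
            ["User ID", "Hair color", "Response time"],
            [[1, "brown", 1.4], [2, "blond", 1230.434]])
with open(path, "rb") as handle:
    raw = handle.read()
```

Checking the fields before writing is a design choice for this sketch: since the format defines no escape character, a tab or newline inside a value cannot be represented and is better rejected early.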
== Example ==
An example of what a file in this format may look like:
<pre>
User ID&#9;Hair color&#9;Response time
1&#9;brown&#9;1.4
2&#9;blond&#9;1230.434
3&#9;brown&#9;0.399
</pre>
An example file can be downloaded here [[File:Example.zip|thumb]] (sorry, it is zipped).
      
== Parsing ==
Importing such files can be done in many languages:
 
=== Python Standard Library ===
 <nowiki>
Line 90: Line 66:
d <- read.csv("example.tsv", header=TRUE, sep = "\t")
</nowiki>
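A further hedged sketch, assuming Python's standard csv module is acceptable for reading the format: quoting is switched off so that quote characters stay part of the field, and numbers are converted with float(), which expects a dot as decimal separator. The function name read_tsv and the inline sample text are assumptions for this example.

```python
import csv
import io


def read_tsv(stream):
    """Return the data rows as dictionaries keyed by the header names."""
    # QUOTE_NONE: quote characters get no special treatment.
    reader = csv.DictReader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


# With a real file, the stream could come from:
#   open("example.tsv", encoding="utf-8", newline="")
text = ("User ID\tHair color\tResponse time\r\n"
        "1\tbrown\t1.4\r\n"
        "2\tblond\t1230.434\r\n")
rows = read_tsv(io.StringIO(text, newline=""))

# Every field arrives as a string; convert numeric columns explicitly.
times = [float(row["Response time"]) for row in rows]
```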
== Alternatives ==
Data can be saved in a lot of file formats. If there is no reason to do otherwise, we prefer delimited files with the options shown in bold. Alternative options are also shown.
{| class="wikitable"
|-
| File Extension || '''tsv''', csv, '''dat''', txt
|-
| File Encoding || '''ASCII''', '''UTF-8''', UTF-16BE, UTF-16LE, UCS-4/UTF-32
|-
| [[wikipedia:Magic_number_(programming)|Magic Number]] || '''None''', [[wikipedia:Byte_order_mark|BOM]]
|-
| Line Delimiter || \n, \r, '''\r\n'''
|-
| Line Delimiter after Last Line || '''Yes''', No
|-
| Field Delimiter || '''<tab>''', <comma>, <semicolon>
|-
| Field Delimiter after Last Field || Yes, '''No'''
|-
| Quoting Character || '''None''', ', "
|-
| Escape QC by doubling || Yes, No
|-
| Escape Character || '''None''', \
|-
| First Line Contains: || '''Header''', Data
|-
| Empty Last Field in Line || Allowed, '''Not Allowed'''
|-
| Whitespace Following Delimiter || '''Part of Field''', Excluded
|-
| Decimal Separator || '''<dot>''', <comma>
|-
| Thousands Separator || '''None''', <dot>, <space>, U+2009
|}
Note that tab characters and newlines cannot be present in field content.
