Changes

2,073 bytes added ,  17:36, 20 February 2017
Line 1: Line 1: −
== File Format ==
+
== TSG suggested file format for experiment data ==
Data can be saved in a lot of file formats. If there is no reason to do otherwise, we prefer delimited files with the options shown in bold. Alternative options are also shown.
+
The TSG suggests a common file format for storing experimental data. Adhering to this format whenever practical makes it easier to re-use files and tools. The file is plain text for easy inspection and manipulation. The file format is a tab-separated values (tsv) file with the following specifications:
{| class="wikitable"
+
 
|-
+
===== File =====
| file extension || '''tsv''' || csv || '''dat'''
+
* File encoding is ASCII or UTF-8.
|-
+
* The file contains no byte order mark (BOM) or other magic number. This makes it ASCII compatible.
| file extension || '''ascii''' || '''UTF-8''' || UTF-16BE || UTF-16LE || UCS-4/UTF-32
+
 
|-
+
===== Lines =====
| magic number || '''None''' || <BOM>
+
* Lines are separated by the '''\r\n''' line delimiter for better compatibility between operating systems.
|-
+
* The line delimiter should also be added after the last line. This simplifies stream reading since all records (lines) are terminated. This allows for the use of a readline() function for acquiring a line.
| line delimiter || \n || \r || '''\r\n'''
+
* The first line contains a header with column/field names.
|-
+
 
| line delimiter after last line || no || '''yes'''
+
===== Fields =====
|-
+
* Fields are separated by the '''tab''' field delimiter, because tabs rarely occur in text. This allows commas and semicolons to be used in sentences without an escape character.
| field delimiter || '''<tab>''' || , || ;
+
* The field delimiter should '''not''' be added after each line's last field. This allows for the use of a split() function for parsing a line.
|-
+
* The last field in a line must not be empty, so that a parser can tell that the previous rule was obeyed (a line ending in a field delimiter would otherwise be ambiguous).
| field delimiter after last field || '''no''' || yes
+
* Fields are never surrounded by a quoting character.
|-
+
* White space before or after a field delimiter is considered part of a field.
| quoting character || '''None''' || " || '
+
* There is no defined escape character. If your data can contain tabs or newlines, use a different field delimiter or file format.
|-
+
 
| escape qc by doubling || no || yes
+
===== Data =====
|-
+
* For numbers the decimal separator is a dot, not a comma. There is no thousands separator.
| escape character || '''none''' || \
+
 
|-
+
== Example ==
| first line || '''contains header''' || contains data
+
 
|-
+
An example of what a file in this format may look like:
| last field in line || '''must not be empty''' || may be empty
+
<pre>
|-
+
User ID&#9;Hair color&#9;Response time
| whitespace following delimiter || '''part of field''' || not part of field
+
1&#9;brown&#9;1.4
|-
+
2&#9;blond&#9;1230.434
| decimal separator || '''.''' || ,
+
3&#9;brown&#9;0.399
|-
+
 
| thousands separator || '''none''' || . || ␣ || U+2009
+
</pre>
|}
+
An example file can be downloaded here [[File:Example.zip|thumb]] (sorry, it is zipped).
Note that tab characters and newlines cannot be present in field content.
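The rules above can be sketched in Python as a small writer and reader. This is a minimal illustration, not a reference implementation: the function names <code>format_tsv</code> and <code>parse_tsv</code> are hypothetical, and the sketch assumes that all values can be converted with <code>str()</code> (which writes floats with a dot as decimal separator).

```python
import io

LINE_DELIMITER = "\r\n"
FIELD_DELIMITER = "\t"


def format_tsv(header, rows):
    """Serialize a header and data rows to text (hypothetical helper)."""
    lines = []
    for record in [header, *rows]:
        fields = [str(value) for value in record]
        for field in fields:
            # There is no quoting or escaping, so these characters cannot be stored.
            if any(ch in field for ch in "\t\r\n"):
                raise ValueError(f"field contains a tab or newline: {field!r}")
        if not fields or fields[-1] == "":
            raise ValueError("the last field in a line must not be empty")
        # join() puts a tab between fields, never after the last one.
        lines.append(FIELD_DELIMITER.join(fields))
    # Every line, including the last one, ends with the line delimiter.
    return "".join(line + LINE_DELIMITER for line in lines)


def parse_tsv(text):
    """Read the text back with readline() and split() (hypothetical helper)."""
    stream = io.StringIO(text, newline="")  # newline="" keeps \r\n untranslated
    records = []
    while True:
        line = stream.readline()
        if not line:
            break
        if not line.endswith(LINE_DELIMITER):
            raise ValueError("line is not terminated by \\r\\n")
        records.append(line[: -len(LINE_DELIMITER)].split(FIELD_DELIMITER))
    if not records:
        raise ValueError("missing header line")
    return records[0], records[1:]


text = format_tsv(
    ["User ID", "Hair color", "Response time"],
    [[1, "brown", 1.4], [2, "blond", 1230.434]],
)
header, rows = parse_tsv(text)
```

Because every line is terminated and no line ends in a tab, the reader needs nothing beyond <code>readline()</code> and <code>split()</code>; all parsed fields are returned as strings.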
+
 
 
== Parsing ==
Importing these files can be done in many languages:
+
Importing such files can be done in many languages:
 
=== Python Standard Library===
 
  <nowiki>
Line 66: Line 66:  
d <- read.csv("example.tsv", head=TRUE, sep = "\t")
 
</nowiki>
 +
 +
== Alternatives ==
 +
Data can be saved in many file formats. Unless there is a reason to do otherwise, we prefer delimited files with the options shown in bold. The alternative options are listed alongside.
 +
{| class="wikitable"
 +
|-
 +
| File Extension || '''tsv''', csv, '''dat''', txt
 +
|-
 +
| File Encoding || '''ASCII''', '''UTF-8''', UTF-16BE, UTF-16LE, UCS-4/UTF-32
 +
|-
 +
| [[wikipedia:Magic_number_(programming)|Magic Number]] || '''None''', [[wikipedia:Byte_order_mark|BOM]]
 +
|-
 +
| Line Delimiter || \n, \r, '''\r\n'''
 +
|-
 +
| Line Delimiter after Last Line || '''Yes''', No
 +
|-
 +
| Field Delimiter || '''<tab>''', <comma>, <semicolon>
 +
|-
 +
| Field Delimiter after Last Field || Yes, '''No'''
 +
|-
 +
| Quoting Character || '''None''', ', "
 +
|-
 +
| Escape QC by doubling || Yes, No
 +
|-
 +
| Escape Character || '''None''', \
 +
|-
 +
| First Line Contains: || '''Header''', Data
 +
|-
 +
| Empty Last Field in Line || Allowed, '''Not Allowed'''
 +
|-
 +
| Whitespace Following Delimiter || '''Part of Field''', Excluded
 +
|-
 +
| Decimal Separator || '''<dot>''', <comma>
 +
|-
 +
| Thousands Separator || '''None''', <dot>, <space>, U+2009
 +
|}
 +
Note that tab characters and newlines cannot be present in field content.
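As a sketch of how the preferred (bold) options in the table might map onto parser settings, the following uses the <code>csv</code> module from the Python standard library. The function name <code>read_tsg_tsv</code> and the sample data are assumptions made for illustration. For a real file, the stream would typically be opened with <code>open("example.tsv", newline="", encoding="utf-8")</code>.

```python
import csv
import io


def read_tsg_tsv(stream):
    """Read a tab-separated stream into a list of dicts (hypothetical helper)."""
    reader = csv.DictReader(
        stream,
        delimiter="\t",          # field delimiter: tab
        quoting=csv.QUOTE_NONE,  # no quoting character: quotes are ordinary text
        skipinitialspace=False,  # whitespace after a delimiter is part of the field
    )
    # The first line is consumed as the header and supplies the dict keys.
    return list(reader)


sample = (
    "User ID\tHair color\tResponse time\r\n"
    "1\tbrown\t1.4\r\n"
    '2\t "blond"\t1230.434\r\n'
)
records = read_tsg_tsv(io.StringIO(sample, newline=""))

# Numbers use a dot as decimal separator and no thousands separator,
# so float() can convert them directly.
response_times = [float(record["Response time"]) for record in records]
```

With quoting disabled, the quotation marks and the leading space in the second record are kept verbatim as part of the field.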
