Difference between revisions of "Data Files"


Revision as of 11:59, 20 February 2017

TSG suggested file format for experiment data

The TSG suggests a common file format for storing experimental data. Adhering to this format whenever practical makes it easier to re-use files and tools. The file is plain text for easy inspection and manipulation. The file format is a tab-separated values (tsv) file with the following specifications:

File
  • File encoding is ASCII or UTF-8.
  • The file contains no byte order mark (BOM) or other magic number.
Lines
  • Lines are separated by the \r\n line delimiter for better compatibility between operating systems.
  • The line delimiter should also be added after the last line, because...
  • The first line contains a header with column/field names.
Fields
  • Fields are separated by the tab field delimiter, because tabs rarely occur in text and therefore require no escaping.
  • The field delimiter should also be added after each line's last field, because...
  • The last field in a line must not be empty, because... (open question: what should be done if there is no value?)
  • Fields are not surrounded by a quoting character.
  • Whitespace between field delimiters is considered part of the field.
  • There is no defined escape character. If your data can contain tabs, use a different field delimiter or file format.
Data
  • For numbers the decimal separator is a dot, not a comma. There is no thousands separator.
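The rules above can be checked mechanically. The following is a minimal sketch, not an official TSG tool: the function name check_tsv and the wording of the messages are made up for illustration, and only the rules that can be tested on the raw bytes are covered (BOM, encoding, line delimiters, trailing field delimiter, empty last field).

```python
def check_tsv(data):
    """Return a list of violations of the format rules found in the raw bytes."""
    problems = []
    # UTF-8, UTF-16LE and UTF-16BE byte order marks.
    if data.startswith((b"\xef\xbb\xbf", b"\xff\xfe", b"\xfe\xff")):
        problems.append("file starts with a byte order mark")
    try:
        text = data.decode("utf-8")  # ASCII is a subset of UTF-8
    except UnicodeDecodeError:
        return problems + ["file is not valid ASCII/UTF-8"]
    if not text.endswith("\r\n"):
        problems.append("last line is not terminated by \\r\\n")
    lines = text.split("\r\n")
    if lines and lines[-1] == "":
        lines.pop()  # remainder after the final line delimiter
    for number, line in enumerate(lines, start=1):
        if "\n" in line or "\r" in line:
            problems.append("line %d: bare \\n or \\r instead of \\r\\n" % number)
        if not line.endswith("\t"):
            problems.append("line %d: no tab after the last field" % number)
        elif line == "\t" or line.endswith("\t\t"):
            problems.append("line %d: last field is empty" % number)
    return problems


good = b"User ID\tHair color\tResponse time\t\r\n1\tbrown\t1.4\t\r\n"
bad = b"\xef\xbb\xbfUser ID\tHair color\n1\tbrown\t\n"
print(check_tsv(good))
print(check_tsv(bad))
```

An empty list means that no violation was found. Rules about the content of fields, such as the decimal separator, are not checked here.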

Example

This is what the file format looks like:

User ID	Hair color	Response time	
1	brown	1.4	
2	blond	1230.434	
3	brown	0.399	

An example file can be downloaded here File:Example.zip (sorry, it is zipped).
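A file like the example can be produced with a few lines of Python 3. This is only a sketch under the rules listed above: the helper name format_tsv is hypothetical, and the output file name example.tsv is borrowed from the parsing examples. It refuses field content with tabs or newlines, since the format defines no escape character.

```python
def format_tsv(header, rows):
    """Build the text of a file in the suggested format."""
    lines = []
    for record in [header] + list(rows):
        # str() always writes floats with a dot as decimal separator.
        fields = [str(value) for value in record]
        for field in fields:
            if "\t" in field or "\r" in field or "\n" in field:
                raise ValueError("field contains a tab or newline: %r" % field)
        # Every field, including the last one, is followed by a tab.
        lines.append("".join(field + "\t" for field in fields))
    # Every line, including the last one, is followed by \r\n.
    return "".join(line + "\r\n" for line in lines)


text = format_tsv(
    ["User ID", "Hair color", "Response time"],
    [[1, "brown", 1.4], [2, "blond", 1230.434], [3, "brown", 0.399]],
)

# newline="" stops Python from translating the \r\n that is already in the text;
# the utf-8 codec writes no byte order mark.
with open("example.tsv", "w", encoding="utf-8", newline="") as tsvfile:
    tsvfile.write(text)
```

Building the whole text first keeps the formatting rules in one place and makes the result easy to inspect before anything is written to disk.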

Parsing

Importing such files can be done in many languages:

Python Standard Library

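import csv

# Python 3: the csv module needs a text-mode file opened with newline=''.
# In Python 2, use open('example.tsv', 'rb') instead.
with open('example.tsv', 'r', newline='', encoding='utf-8') as csvfile:
    reader = csv.reader(csvfile, delimiter='\t', quoting=csv.QUOTE_NONE)
    for row in reader:
        print(', '.join(row))

or with header extraction

import csv

# Python 3: the csv module needs a text-mode file opened with newline=''.
with open('example.tsv', 'r', newline='', encoding='utf-8') as csvfile:
    reader = csv.DictReader(csvfile, delimiter='\t', quoting=csv.QUOTE_NONE)
    print(', '.join(reader.fieldnames))  # print header
    for row in reader:
        print(', '.join([row[key] for key in reader.fieldnames]))

Note that in Python 2 the field content remains UTF-8 encoded (type str). In Python 3 the fields are Unicode strings (also type str). Because every line ends with a field delimiter, each row also contains an empty trailing field.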
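Because the format puts a field delimiter after the last field of every line, the csv module reports one extra, empty field per row. The sketch below (Python 3) shows one way to drop it; the generator name read_records is an assumption for illustration, and the sample text stands in for a real file.

```python
import csv
import io


def read_records(tsvfile):
    """Yield one dict per data line, ignoring the empty trailing field."""
    reader = csv.reader(tsvfile, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader, None)
    if header is None:
        return  # empty file
    if header and header[-1] == "":
        header = header[:-1]  # drop the name of the empty trailing field
    for row in reader:
        if not row:
            continue  # skip blank lines
        # zip() stops at the shorter input, so the trailing field is dropped.
        yield dict(zip(header, row))


sample = io.StringIO(
    "User ID\tHair color\tResponse time\t\r\n"
    "1\tbrown\t1.4\t\r\n"
    "2\tblond\t1230.434\t\r\n",
    newline="",
)
records = list(read_records(sample))
# All fields arrive as strings; numbers have to be converted explicitly.
response_times = [float(record["Response time"]) for record in records]
print(records)
print(response_times)
```

For a real file, pass an object opened with open('example.tsv', 'r', newline='', encoding='utf-8') instead of the io.StringIO sample.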

Python Pandas

Pandas infers column types from the data. The file format itself does not record column types, so if you need specific types you will have to store them separately or hardcode them.

import csv  # needed for csv.QUOTE_NONE
import pandas as pd

d = pd.read_csv('example.tsv', delimiter='\t', skip_blank_lines=False, quoting=csv.QUOTE_NONE)

GNU R

d <- read.csv("example.tsv", header = TRUE, sep = "\t", quote = "")

Alternatives

Data can be saved in many file formats. If there is no reason to do otherwise, we prefer delimited files with the options listed as preferred below, which follow the specification above. Alternative options are also shown.

Option                                Preferred           Alternatives
file extension                        tsv                 csv  dat  txt
file encoding                         ASCII or UTF-8      UTF-16BE  UTF-16LE  UCS-4/UTF-32
magic number                          none                <BOM>
line delimiter                        \r\n                \n  \r
line delimiter after last line        yes                 no
field delimiter                       <tab>               ,  ;
field delimiter after last field      yes                 no
quoting character                     none                "  '
escape quoting character by doubling  no                  yes
escape character                      none                \
first line                            contains header     contains data
last field in line                    must not be empty   may be empty
whitespace following delimiter        part of field       not part of field
decimal separator                     .                   ,
thousands separator                   none                .  U+2009

Note that tab characters and newlines cannot be present in field content.