Difference between revisions of "Data Files"

Revision as of 11:29, 20 February 2017

TSG suggested file format for experiment data

The TSG suggests the following file format for data storage for most experiments.

Lines

Lines are separated by the \r\n line delimiter for better compatibility between operating systems.
The line delimiter should also be added after the last line, because...
The first line contains a header with column/field names.

Fields

Fields are separated by the tab field delimiter, because tabs rarely occur in text and therefore require no escaping.
The field delimiter should also be added after each line's last field, because...
The last field in a line must not be empty, because... (open question: what to do if there is no value?)
Fields are not surrounded by a quoting character.
Whitespace between field delimiters is considered part of the field.
There is no defined escape character. If your data can contain tabs, use a different field delimiter or file format.
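The field rules above can be illustrated with a minimal sketch that splits one line by hand. The function name `parse_line` is hypothetical and not part of any TSG tooling. It assumes the line follows the suggested format, including the field delimiter after the last field.

```python
def parse_line(line):
    """Split one line of the suggested format into its fields (illustrative sketch)."""
    # Drop the \r\n line delimiter if it is still attached.
    if line.endswith("\r\n"):
        line = line[:-2]
    # No quoting and no escape character: every tab is a field delimiter.
    fields = line.split("\t")
    # The delimiter after the last field leaves one empty trailing item; drop it.
    if fields and fields[-1] == "":
        fields.pop()
    return fields


# Whitespace and quote characters are kept as part of the field content.
example_fields = parse_line(' left\t"right" \t\r\n')
```

Because nothing is stripped or unquoted, a field such as `"right" ` keeps both its quote characters and its trailing space.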

File

The file is encoded as ASCII or UTF-8.
The file contains no byte order mark (BOM) or other magic number.

Last field must not be empty (what if no data?)

Explanation for these choices (also nice for ourselves). In progress..
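As a rough illustration of the rules listed above, the following sketch writes a file in the suggested format using only the Python standard library. The names `format_rows` and `write_data_file` are hypothetical. Rejecting an empty last field with an error is an assumption, since the text above leaves that case open.

```python
def format_rows(header, rows):
    """Serialize a header and data rows to the suggested format (illustrative sketch)."""
    lines = []
    for record in [list(header)] + [list(row) for row in rows]:
        fields = [str(value) for value in record]
        for field in fields:
            # There is no escape character, so delimiters cannot appear in content.
            if "\t" in field or "\r" in field or "\n" in field:
                raise ValueError("field contains a tab or newline: %r" % field)
        # Assumption: an empty last field is treated as an error.
        if not fields or fields[-1] == "":
            raise ValueError("last field in a line must not be empty")
        # Field delimiter after every field (including the last), then \r\n.
        lines.append("".join(field + "\t" for field in fields) + "\r\n")
    return "".join(lines)


def write_data_file(path, header, rows):
    """Write the serialized text as UTF-8 without a byte order mark."""
    # newline="" stops Python from translating the explicit \r\n delimiters.
    with open(path, "w", encoding="utf-8", newline="") as data_file:
        data_file.write(format_rows(header, rows))
```

A call such as `write_data_file("example.tsv", ["id", "rt"], [[1, 0.52]])` would produce a two-line file with a header and one data row.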

File Format

Data can be saved in a lot of file formats. If there is no reason to do otherwise, we prefer delimited files with the options shown in bold. Alternative options are also shown.

file extension	tsv	csv	dat	txt
file encoding	ASCII	UTF-8	UTF-16BE	UTF-16LE	UCS-4/UTF-32
magic number	None	<BOM>
line delimiter	\n	\r	\r\n
line delimiter after last line	no	yes
field delimiter	<tab>	,	;
field delimiter after last field	no	yes
quoting character	None	"	'
escape qc by doubling	no	yes
escape character	none	\
first line	contains header	contains data
last field in line	must not be empty	may be empty
whitespace following delimiter	part of field	not part of field
decimal separator	.	,
thousands separator	none	.	␣	U+2009

Note that tab characters and newlines cannot be present in field content.
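A file can be checked against the preferred options with a short script. The sketch below is only one possible approach: the function name `check_data_file` and the wording of the messages are hypothetical, and it covers only the encoding, byte order mark, line delimiter, and trailing field delimiter rules.

```python
def check_data_file(raw):
    """Return a list of problems found in the raw bytes of a data file (sketch)."""
    problems = []
    if raw.startswith(b"\xef\xbb\xbf"):
        problems.append("file starts with a UTF-8 byte order mark")
    try:
        text = raw.decode("utf-8")  # ASCII is a subset of UTF-8
    except UnicodeDecodeError:
        return problems + ["file is not valid ASCII or UTF-8"]
    if not text.endswith("\r\n"):
        problems.append("no \\r\\n line delimiter after the last line")
    lines = text.split("\r\n")
    # A correctly terminated file leaves one empty item after the final delimiter.
    if lines and lines[-1] == "":
        lines.pop()
    for number, line in enumerate(lines, start=1):
        if "\r" in line or "\n" in line:
            problems.append("line %d: stray \\r or \\n character" % number)
        if not line.endswith("\t"):
            problems.append("line %d: no field delimiter after last field" % number)
        elif line == "\t" or line.endswith("\t\t"):
            problems.append("line %d: last field is empty" % number)
    return problems
```

An empty list means none of the checked rules were violated. It does not prove that the file is otherwise well formed, for example that every line has the same number of fields.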

Parsing

Here is an example file: File:Example.zip (zipped). Files in this format can be imported in many languages:

Python Standard Library

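import csv
# Python 3: the csv module needs a text-mode file opened with newline=''.
# (Python 2: use open('example.tsv', 'rb') instead.)
with open('example.tsv', newline='', encoding='utf-8') as csvfile:
    reader = csv.reader(csvfile, delimiter='\t', quoting=csv.QUOTE_NONE)
    for row in reader:
        print(', '.join(row))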

or with header extraction

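import csv
# Python 3: the csv module needs a text-mode file opened with newline=''.
# (Python 2: use open('example.tsv', 'rb') instead.)
with open('example.tsv', newline='', encoding='utf-8') as csvfile:
    reader = csv.DictReader(csvfile, delimiter='\t', quoting=csv.QUOTE_NONE)
    print(', '.join(reader.fieldnames))  # print header
    for row in reader:
        print(', '.join([row[key] for key in reader.fieldnames]))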

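Note that in Python 2 the file must be opened in binary mode ('rb') and the field content remains UTF-8 encoded (type str). In Python 3 the file must be opened in text mode with newline='' and the fields are Unicode strings (type str).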
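One detail worth knowing when reading the suggested format with the csv module: because a field delimiter follows the last field, csv.reader returns one extra empty string at the end of every row. The sketch below (Python 3, with the hypothetical helper name `read_rows`) drops that trailing item. An in-memory sample stands in for the example file.

```python
import csv
import io


def read_rows(text_stream):
    """Yield rows from a text stream, dropping the item after the trailing tab."""
    reader = csv.reader(text_stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # The delimiter after the last field yields an empty final item.
        if row and row[-1] == "":
            row = row[:-1]
        yield row


# In-memory stand-in for a file opened with newline="" (sample data is made up).
sample = io.StringIO("name\tscore\t\r\nalice\t1.5\t\r\n", newline="")
rows = list(read_rows(sample))
```

All values come back as strings. Converting `"1.5"` to a number is left to the caller, since the file format carries no type information.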

Python Pandas

Pandas can infer column types, but the file itself does not store them. If you need specific types, you will have to store them separately or hardcode them.

import csv  # needed for csv.QUOTE_NONE
import pandas as pd

d = pd.read_csv('example.tsv', delimiter='\t', skip_blank_lines=False, quoting=csv.QUOTE_NONE)

GNU R

d <- read.csv("example.tsv", header = TRUE, sep = "\t", quote = "")

@@ Line 1: / Line 1: @@
-== TSG suggested File Format For Dummies (WIP by Arvind for extra readability!) ==
+== TSG suggested file format for experiment data ==
 The TSG suggests the following file format for data storage for most experiments.
-* Field delimiter: tab (also after last field)
+===== Lines =====
-* Line delimiter: \r\n (also after last line)
+* Lines are separated by the '''\r\n''' line delimiter for better compatibility between operating systems.
-* Quoting character: none
+* The line delimiter should also be added after the last line, because...
-* File extension: "tsv"
+* The first line contains a header with column/field names.
-* File encoding: ASCII / UTF-8
-* Magic number: none
+===== Fields =====
-* First line contains header, not data
+* Fields are separated by the '''tab''' field delimiter, because tabs rarely occur in text and therefore require no escaping.
+* The field delimiter should also be added after each line's last field, because...
+* The last field in a line must not be empty, because... (open question: what to do if there is no value?)
+* Fields are not surrounded by a quoting character.
+* Whitespace between field delimiters is considered part of the field.
+* There is no defined escape character. If your data can contain tabs, use a different field delimiter or file format.
+===== File =====
+* The file is encoded as ASCII or UTF-8.
+* The file contains no byte order mark (BOM) or other magic number.
 * Last field must not be empty (what if no data?)
 Explanation for these choices (also nice for ourselves). In progress..
+#
 == File Format ==