Difference between revisions of "Data Files"

Latest revision as of 17:36, 20 February 2017

TSG suggested file format for experiment data

The TSG suggests a common file format for storing experimental data. Adhering to this format whenever practical makes it easier to re-use files and tools. The file is plain text for easy inspection and manipulation. The file format is a tab-separated values (tsv) file with the following specifications:

File

File encoding is ASCII or UTF-8.
The file contains no byte order mark (BOM) or other magic number. This makes it ASCII compatible.

Lines

Lines are separated by the \r\n line delimiter for better compatibility between operating systems.
The line delimiter should also be added after the last line. This simplifies stream reading since all records (lines) are terminated. This allows for the use of a readline() function for acquiring a line.
The first line contains a header with column/field names.

Fields

Fields are separated by the tab field delimiter, because they rarely occur in texts. This allows for the use of comma's and semicolons in sentences without using an escape character.
The field delimiter should not be added after each line's last field. This allows for the use of a split() function for parsing a line.
The last field in a line must not be empty, because it will show to parsers that the previous rule was obeyed.
Fields are never surrounded by a quoting character.
White space before or after field delimiters are considered part of a field.
There is no defined escape character. If your data can contain tabs or newlines, use a different field delimiter or file format.

Data

For numbers the decimal separator is a dot, not a comma. There is no thousands separator.

Example

An example of what a file in this format may look like:

User ID	Hair color	Response time	
1	brown	1.4	
2	blond	1230.434	
3	brown	0.399

An example file can be downloaded here File:Example.zip (sorry, it is zipped).

Parsing

Importing such files can be done in many languages:

Python Standard Library

import csv
with open('example.tsv', 'rb') as csvfile:
    reader = csv.reader(csvfile, delimiter='\t', quoting=csv.QUOTE_NONE)
    for row in reader:
        print(', '.join(row))

or with header extraction

import csv
with open('example.tsv', 'rb') as csvfile:
    reader = csv.DictReader(csvfile, delimiter='\t', quoting=csv.QUOTE_NONE)
    print(', '.join(reader.fieldnames)) # print header
    for row in reader:
        print(', '.join([row[key] for key in reader.fieldnames]))

Note that when using Python 2 the field content will remain UTF-8 encoded (type=str). In Python3 strings will unicode (type=string).

Python Pandas

Pandas can interpret column type. You will have to store it separately or hardcode it.

import pandas as pd

d = pd.read_csv('example.tsv', delimiter='\t', skip_blank_lines=False, quoting=csv.QUOTE_NONE)

GNU R

d <- read.csv("example.tsv", head=TRUE, sep = "\t")

Alternatives

Data can be saved in a lot of file formats. If there is no reason to do otherwise, we prefer delimited files with the options shown in bold. Alternative options are also shown.

File Extension	tsv, csv, dat, txt
File Encoding	ASCII, UTF-8, UTF-16BE, UTF-16LE, UCS-4/UTF-32
Magic Number	None, BOM
Line Delimiter	\n, \r, \r\n
Line Delimiter after Last Line	Yes, No
Field Delimiter	<tab>, <comma> , <semicolon>
Field Delimiter after Last Field	Yes, No
Quoting Character	None, ', "
Escape QC by doubling	Yes, No
Escape Character	None, \
First Line Contains:	Header, Data
Empty Last Field in Line	Allowed, Not Allowed
Whitespace Following Delimiter	Part of Field, Excluded
Decimal Separator	<dot>, <comma>
Thousands Separator	None, <dot>, <space>, U+2009

Note that tab characters and newlines cannot be present in field content.

@@ Line 1: / Line 1: @@
-== File Format ==
+== TSG suggested file format for experiment data ==
-Data can be saved in a lot of file formats. If there is no reason to do otherwise, we prefer delimited files with the options shown in bold. Alternative options are also shown.
+The TSG suggests a common file format for storing experimental data. Adhering to this format whenever practical makes it easier to re-use files and tools. The file is plain text for easy inspection and manipulation. The file format is a tab-separated values (tsv) file with the following specifications:
-{| class="wikitable"
-|-
+===== File =====
-| file extension || '''tsv''' || csv || '''dat'''
+* File encoding is ASCII or UTF-8.
-|-
+* The file contains no byte order mark (BOM) or other magic number. This makes it ASCII compatible.
-| file extension || '''ascii''' || '''UTF-8''' || UTF-16BE || UTF-16LE || UCS-4/UTF-32
-|-
+===== Lines =====
-| magic number || '''None''' || <BOM>
+* Lines are separated by the '''\r\n''' line delimiter for better compatibility between operating systems.
-|-
+* The line delimiter should also be added after the last line. This simplifies stream reading since all records (lines) are terminated. This allows for the use of a readline() function for acquiring a line.
-| line delimiter || \n || \r || '''\r\n'''
+* The first line contains a header with column/field names.
-|-
-| line delimiter after last line || no || '''yes'''
+===== Fields =====
-|-
+* Fields are separated by the '''tab''' field delimiter, because they rarely occur in texts. This allows for the use of comma's and semicolons in sentences without using an escape character.
-| field delimiter || '''<tab>''' || , || ;
+* The field delimiter should '''not''' be added after each line's last field. This allows for the use of a split() function for parsing a line.
-|-
+* The last field in a line must not be empty, because it will show to parsers that the previous rule was obeyed.
-| field delimiter after last field || '''no''' || yes
+* Fields are never surrounded by a quoting character.
-|-
+* White space before or after field delimiters are considered part of a field.
-| quoting character || '''None''' || " || '
+* There is no defined escape character. If your data can contain tabs or newlines, use a different field delimiter or file format.
-|-
-| escape qc by doubling || no || yes
+===== Data =====
-|-
+* For numbers the decimal separator is a dot, not a comma. There is no thousands separator.
-| escape character || '''none''' || \
-|-
+== Example ==
-| first line || '''contains header''' || contains data
-|-
+An example of what a file in this format may look like:
-| last field in line || '''must not be empty''' || may be empty
+<pre>
-|-
+User ID&#9;Hair color&#9;Response time&#9;
-| whitespace following delimiter || '''part of field''' || not part of field
+&#9;brown&#9;1.4&#9;
-|-
+&#9;blond&#9;1230.434&#9;
-| decimal separator || '''.''' || ,
+&#9;brown&#9;0.399&#9;
-|-
-| thousands separator || '''none''' || . || ␣ || U+2009
+</pre>
-|}
+An example file can be downloaded here [[File:Example.zip|thumb]] (sorry, it is zipped).
-Note that tab characters and newlines cannot be present in field content.
 == Parsing ==
-Importing these files can be done in many languages:
+Importing such files can be done in many languages:
 === Python Standard Library===
   <nowiki>
@@ Line 66: / Line 66: @@
 d <- read.csv("example.tsv", head=TRUE, sep = "\t")
 </nowiki>
+== Alternatives ==
+Data can be saved in a lot of file formats. If there is no reason to do otherwise, we prefer delimited files with the options shown in bold. Alternative options are also shown.
+{| class="wikitable"
+|-
+| File Extension || '''tsv''', csv, '''dat''', txt
+|-
+| File Encoding || '''ASCII''', '''UTF-8''', UTF-16BE, UTF-16LE, UCS-4/UTF-32
+|-
+| [[wikipedia:Magic_number_(programming)|Magic Number]] || '''None''', [[wikipedia:Byte_order_mark|BOM]]
+|-
+| Line Delimiter || \n, \r, '''\r\n'''
+|-
+| Line Delimiter after Last Line || '''Yes''', No
+|-
+| Field Delimiter || '''<tab>''', <comma> , <semicolon>
+|-
+| Field Delimiter after Last Field || Yes, '''No'''
+|-
+| Quoting Character || '''None''', ', "
+|-
+| Escape QC by doubling || Yes, No
+|-
+| Escape Character || '''None''', \
+|-
+| First Line Contains: || '''Header''', Data
+|-
+| Empty Last Field in Line || Allowed, '''Not Allowed'''
+|-
+| Whitespace Following Delimiter || '''Part of Field''', Excluded
+|-
+| Decimal Separator || '''<dot>''', <comma>
+|-
+| Thousands Separator || '''None''', <dot>, <space>, U+2009
+|}
+Note that tab characters and newlines cannot be present in field content.