TSG suggested file format for experiment data
The TSG suggests a common file format for storing experimental data. Adhering to this format whenever practical makes it easier to re-use files and tools. The file is plain text for easy inspection and manipulation. The file format is a tab-separated values (tsv) file with the following specifications:
- File encoding is ASCII or UTF-8.
- The file contains no byte order mark (BOM) or other magic number. This makes it ASCII compatible.
- Lines are separated by the \r\n line delimiter for better compatibility between operating systems.
- The line delimiter should also be added after the last line. This simplifies stream reading since all records (lines) are terminated. This allows for the use of a readline() function for acquiring a line.
- The first line contains a header with column/field names.
- Fields are separated by the tab field delimiter, because they rarely occur in texts. This allows for the use of comma's and semicolons in sentences without using an escape character.
- The field delimiter should not be added after each line's last field. This allows for the use of a split() function for parsing a line.
- The last field in a line must not be empty, because it will show to parsers that the previous rule was obeyed.
- Fields are never surrounded by a quoting character.
- White space before or after field delimiters are considered part of a field.
- There is no defined escape character. If your data can contain tabs or newlines, use a different field delimiter or file format.
- For numbers the decimal separator is a dot, not a comma. There is no thousands separator.
An example of what a file in this format may look like:
User ID Hair color Response time 1 brown 1.4 2 blond 1230.434 3 brown 0.399
An example file can be downloaded here File:Example.zip (sorry, it is zipped).
Importing such files can be done in many languages:
Python Standard Library
import csv with open('example.tsv', 'rb') as csvfile: reader = csv.reader(csvfile, delimiter='\t', quoting=csv.QUOTE_NONE) for row in reader: print(', '.join(row))
or with header extraction
import csv with open('example.tsv', 'rb') as csvfile: reader = csv.DictReader(csvfile, delimiter='\t', quoting=csv.QUOTE_NONE) print(', '.join(reader.fieldnames)) # print header for row in reader: print(', '.join([row[key] for key in reader.fieldnames]))
Note that when using Python 2 the field content will remain UTF-8 encoded (type=str). In Python3 strings will unicode (type=string).
Pandas can interpret column type. You will have to store it separately or hardcode it.
import pandas as pd d = pd.read_csv('example.tsv', delimiter='\t', skip_blank_lines=False, quoting=csv.QUOTE_NONE)
d <- read.csv("example.tsv", head=TRUE, sep = "\t")
Data can be saved in a lot of file formats. If there is no reason to do otherwise, we prefer delimited files with the options shown in bold. Alternative options are also shown.
|File Extension||tsv, csv, dat, txt|
|File Encoding||ASCII, UTF-8, UTF-16BE, UTF-16LE, UCS-4/UTF-32|
|Magic Number||None, BOM|
|Line Delimiter||\n, \r, \r\n|
|Line Delimiter after Last Line||Yes, No|
|Field Delimiter||<tab>, <comma> , <semicolon>|
|Field Delimiter after Last Field||Yes, No|
|Quoting Character||None, ', "|
|Escape QC by doubling||Yes, No|
|Escape Character||None, \|
|First Line Contains:||Header, Data|
|Empty Last Field in Line||Allowed, Not Allowed|
|Whitespace Following Delimiter||Part of Field, Excluded|
|Decimal Separator||<dot>, <comma>|
|Thousands Separator||None, <dot>, <space>, U+2009|
Note that tab characters and newlines cannot be present in field content.