Line 1: |
Line 1: |
| + | == TSG suggested file format for experiment data == |
| + | The TSG suggests a common file format for storing experimental data. Adhering to this format whenever practical makes it easier to re-use files and tools. The file is plain text for easy inspection and manipulation. The file format is a tab-separated values (tsv) file with the following specifications: |
| + | |
| + | ===== File ===== |
| + | * File encoding is ASCII or UTF-8. |
| + | * The file contains no byte order mark (BOM) or other magic number. This makes it ASCII compatible. |
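The two rules above can be checked mechanically. The following Python sketch is only an illustration: the function name `check_encoding` is made up for this example, and it tests only for the UTF-8, UTF-16 and UTF-32 byte order marks defined in the standard `codecs` module.

```python
import codecs

# Byte order marks this sketch looks for (constants from the standard library)
BOMS = (
    codecs.BOM_UTF8,
    codecs.BOM_UTF16_LE,
    codecs.BOM_UTF16_BE,
    codecs.BOM_UTF32_LE,
    codecs.BOM_UTF32_BE,
)

def check_encoding(raw):
    """Return True if raw bytes carry no BOM and decode as UTF-8 (hypothetical helper)."""
    if raw.startswith(BOMS):
        return False
    try:
        raw.decode('utf-8')  # plain ASCII is valid UTF-8 as well
    except UnicodeDecodeError:
        return False
    return True
```

The function expects the raw bytes of a file, for example as read with `open('example.tsv', 'rb').read()`.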
| + | |
| + | ===== Lines ===== |
| + | * Lines are separated by the '''\r\n''' line delimiter for better compatibility between operating systems. |
| + | * The line delimiter should also be added after the last line. This simplifies stream reading since all records (lines) are terminated. This allows for the use of a readline() function for acquiring a line. |
| + | * The first line contains a header with column/field names. |
| + | |
| + | ===== Fields ===== |
| + | * Fields are separated by the '''tab''' field delimiter, because tabs rarely occur in text. This allows commas and semicolons to be used in sentences without an escape character. |
| + | * The field delimiter should '''not''' be added after each line's last field. This allows for the use of a split() function for parsing a line. |
| + | * The last field in a line must not be empty. A non-empty last field shows parsers that the previous rule was obeyed, since a trailing tab would otherwise be indistinguishable from an empty last field. |
| + | * Fields are never surrounded by a quoting character. |
| + | * White space before or after a field delimiter is considered part of the field. |
| + | * There is no defined escape character. If your data can contain tabs or newlines, use a different field delimiter or file format. |
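The line and field rules above mean that a record can be parsed with plain string operations. The sketch below is one possible reading of those rules, not a reference implementation; `parse_tsv` and its error messages are invented for this example.

```python
import io

def parse_tsv(stream):
    """Yield each record of a text stream as a list of fields (hypothetical helper).

    The stream should be opened with newline='' so line endings are not translated.
    """
    while True:
        line = stream.readline()
        if line == '':  # end of stream
            break
        if not line.endswith('\r\n'):
            raise ValueError('record is not terminated by CR LF')
        # Remove only the line delimiter; all other white space belongs to the fields
        fields = line[:-2].split('\t')
        if fields[-1] == '':
            raise ValueError('last field is empty (trailing tab?)')
        yield fields

# Usage with an in-memory stream standing in for a file
sample = io.StringIO('User ID\tHair color\tResponse time\r\n1\tbrown\t1.4\r\n', newline='')
rows = list(parse_tsv(sample))
header, records = rows[0], rows[1:]
```

Because nothing is quoted or escaped, `split()` is all the field parsing that is needed; every value is returned as text and still has to be converted to a number where appropriate.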
| + | |
| + | ===== Data ===== |
| + | * For numbers the decimal separator is a dot, not a comma. There is no thousands separator. |
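To show how the rules combine when writing a file, here is a minimal sketch of a serializer. The name `format_record` is hypothetical, and formatting floats with `repr()` is just one locale-independent way to avoid a decimal comma or a thousands separator.

```python
def format_record(values):
    """Format one record as a line in the suggested format (hypothetical helper)."""
    fields = []
    for value in values:
        # repr() of a float is locale independent: no decimal comma, no thousands separator
        text = repr(value) if isinstance(value, float) else str(value)
        if '\t' in text or '\r' in text or '\n' in text:
            raise ValueError('fields cannot contain tabs or newlines')
        fields.append(text)
    if not fields or fields[-1] == '':
        raise ValueError('the last field must not be empty')
    # No tab after the last field; every line, including the last, ends with CR LF
    return '\t'.join(fields) + '\r\n'

lines = [
    format_record(['User ID', 'Hair color', 'Response time']),
    format_record([1, 'brown', 1.4]),
    format_record([2, 'blond', 1230.434]),
]
content = ''.join(lines)
```

To store the result, the file could be opened with `open('example.tsv', 'w', newline='', encoding='utf-8')`, so that Python neither translates the line delimiters nor writes a BOM.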
| + | |
| + | == Example == |
| + | |
| + | An example of what a file in this format may look like: |
| + | <pre> |
| + | User ID	Hair color	Response time |
| + | 1	brown	1.4 |
| + | 2	blond	1230.434 |
| + | 3	brown	0.399 |
| + | |
| + | </pre> |
| + | An example file can be downloaded here [[File:Example.zip|thumb]] (sorry, it is zipped). |
| + | |
| + | == Parsing == |
| + | Importing such files can be done in many languages: |
| + | === Python Standard Library === |
| + | <nowiki> |
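| + | import csv |
| + | # Python 3: text mode; newline='' leaves the \r\n line delimiters to the csv module |
| + | with open('example.tsv', newline='', encoding='utf-8') as csvfile: |
| + |     reader = csv.reader(csvfile, delimiter='\t', quoting=csv.QUOTE_NONE) |
| + |     for row in reader: |
| + |         print(', '.join(row)) |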
| + | </nowiki> |
| + | Or, with header extraction: |
| + | <nowiki> |
| + | import csv |
| + | # Python 3: text mode; newline='' leaves the \r\n line delimiters to the csv module |
| + | with open('example.tsv', newline='', encoding='utf-8') as csvfile: |
| + |     reader = csv.DictReader(csvfile, delimiter='\t', quoting=csv.QUOTE_NONE) |
| + |     print(', '.join(reader.fieldnames))  # print header |
| + |     for row in reader: |
| + |         print(', '.join(row[key] for key in reader.fieldnames)) |
| + | </nowiki> |
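| + | Note that in Python 2, where the file must be opened in 'rb' mode, the field content remains UTF-8 encoded (type str). In Python 3 the csv module needs a file opened in text mode (for example with newline=<nowiki>''</nowiki> and encoding='utf-8'), and fields are returned as Unicode strings (type str). |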
| + | |
| + | === Python Pandas === |
| + | Pandas can infer column types from the data. If you need specific types, you will have to store them separately or hardcode them, because the file itself does not record them. |
| + | <nowiki> |
| + | import csv  # needed for csv.QUOTE_NONE |
| + | import pandas as pd |
| + | |
| + | d = pd.read_csv('example.tsv', delimiter='\t', skip_blank_lines=False, quoting=csv.QUOTE_NONE) |
| + | </nowiki> |
| + | === GNU R === |
| + | <nowiki> |
| + | d <- read.csv("example.tsv", header = TRUE, sep = "\t", quote = "") |
| + | </nowiki> |
| + | |
| + | == Alternatives == |
| Data can be saved in a lot of file formats. If there is no reason to do otherwise, we prefer delimited files with the options shown in bold. Alternative options are also shown. | | Data can be saved in a lot of file formats. If there is no reason to do otherwise, we prefer delimited files with the options shown in bold. Alternative options are also shown. |
| {| class="wikitable" | | {| class="wikitable" |
| |- | | |- |
− | | file extension || '''tsv''' || csv || '''dat''' | + | | File Extension || '''tsv''', csv, '''dat''', txt |
| |- | | |- |
− | | file extension || '''ascii''' || '''UTF-8''' || UTF-16BE || UTF-16LE || UCS-4/UTF-32 | + | | File Encoding || '''ASCII''', '''UTF-8''', UTF-16BE, UTF-16LE, UCS-4/UTF-32 |
| |- | | |- |
− | | magic number || '''None''' || <BOM> | + | | [[wikipedia:Magic_number_(programming)|Magic Number]] || '''None''', [[wikipedia:Byte_order_mark|BOM]] |
| |- | | |- |
− | | line delimiter || \n || \r || '''\r\n''' | + | | Line Delimiter || \n, \r, '''\r\n''' |
| |- | | |- |
− | | line delimiter after last line || no || '''yes''' | + | | Line Delimiter after Last Line || '''Yes''', No |
| |- | | |- |
− | field delimiter || '''<tab>''' || , || ; | + | | Field Delimiter || '''<tab>''', <comma>, <semicolon> |
| |- | | |- |
− | | field delimiter after last field || '''no''' || yes | + | | Field Delimiter after Last Field || Yes, '''No''' |
| |- | | |- |
− | | Quoting character || '''None''' || " || ' | + | | Quoting Character || '''None''', ', " |
| |- | | |- |
− | Escape character || '''None''' || \ | + | | Escape Quoting Character by Doubling || Yes, No |
| |- | | |- |
− | | First line || '''Contains header''' || Contains data | + | | Escape Character || '''None''', \ |
| |- | | |- |
− | | Last field in line || '''Must not be empty''' || May be empty | + | | First Line Contains: || '''Header''', Data |
| |- | | |- |
− | | Whitespace following delimiter || '''Part of field''' || Not part of field | + | | Empty Last Field in Line || Allowed, '''Not Allowed''' |
| |- | | |- |
| + | | Whitespace Following Delimiter || '''Part of Field''', Excluded |
| + | |- |
| + | | Decimal Separator || '''<dot>''', <comma> |
| + | |- |
| + | | Thousands Separator || '''None''', <dot>, <space>, U+2009 |
| |} | | |} |
| + | Note that tab characters and newlines cannot be present in field content. |