Changes

3,386 bytes added , 17:36, 20 February 2017

Line 1: Line 1: +

== TSG suggested file format for experiment data ==

+

The TSG suggests a common file format for storing experimental data. Adhering to this format whenever practical makes it easier to re-use files and tools. The file is plain text for easy inspection and manipulation. The file format is a tab-separated values (tsv) file with the following specifications:

+

===== File =====

+

* File encoding is ASCII or UTF-8.

+

* The file contains no byte order mark (BOM) or other magic number. This makes it ASCII compatible.

+

===== Lines =====

+

* Lines are separated by the '''\r\n''' line delimiter for better compatibility between operating systems.

+

* The line delimiter should also be added after the last line. This simplifies stream reading since all records (lines) are terminated. This allows for the use of a readline() function for acquiring a line.

+

* The first line contains a header with column/field names.

+

===== Fields =====

+

* Fields are separated by the '''tab''' field delimiter, because they rarely occur in texts. This allows for the use of comma's and semicolons in sentences without using an escape character.

+

* The field delimiter should '''not''' be added after each line's last field. This allows for the use of a split() function for parsing a line.

+

* The last field in a line must not be empty, because it will show to parsers that the previous rule was obeyed.

+

* Fields are never surrounded by a quoting character.

+

* White space before or after field delimiters are considered part of a field.

+

* There is no defined escape character. If your data can contain tabs or newlines, use a different field delimiter or file format.

+

===== Data =====

+

* For numbers the decimal separator is a dot, not a comma. There is no thousands separator.

+

== Example ==

+

An example of what a file in this format may look like:

+

<pre>

+

User ID	Hair color	Response time	

+

1	brown	1.4	

+

2	blond	1230.434	

+

3	brown	0.399	

+

</pre>

+

An example file can be downloaded here [[File:Example.zip|thumb]] (sorry, it is zipped).

+

== Parsing ==

+

Importing such files can be done in many languages:

+

=== Python Standard Library===

+

+

import csv

+

with open('example.tsv', 'rb') as csvfile:

+

reader = csv.reader(csvfile, delimiter='\t', quoting=csv.QUOTE_NONE)

+

for row in reader:

+

print(', '.join(row))

+

</nowiki>

+

or with header extraction

+

+

import csv

+

with open('example.tsv', 'rb') as csvfile:

+

reader = csv.DictReader(csvfile, delimiter='\t', quoting=csv.QUOTE_NONE)

+

print(', '.join(reader.fieldnames)) # print header

+

for row in reader:

+

print(', '.join([row[key] for key in reader.fieldnames]))

+

</nowiki>

+

Note that when using Python 2 the field content will remain UTF-8 encoded (type=str). In Python3 strings will unicode (type=string).

+

=== Python Pandas ===

+

Pandas can interpret column type. You will have to store it separately or hardcode it.

+

+

import pandas as pd

+

d = pd.read_csv('example.tsv', delimiter='\t', skip_blank_lines=False, quoting=csv.QUOTE_NONE)

+

</nowiki>

+

=== GNU R ===

+

+

d <- read.csv("example.tsv", head=TRUE, sep = "\t")

+

</nowiki>

+

== Alternatives ==

Data can be saved in a lot of file formats. If there is no reason to do otherwise, we prefer delimited files with the options shown in bold. Alternative options are also shown.

{| class="wikitable"

|-

−

| ~~file extension~~ || '''tsv''' || csv || '''dat'''

+

| File Extension || '''tsv''', csv, '''dat''', txt

|-

−

| ~~file extension~~ || '''~~ascii~~''' || '''UTF-8''' || UTF-16BE || UTF-16LE || UCS-4/UTF-32

+

| File Encoding || '''ASCII''', '''UTF-8''', UTF-16BE, UTF-16LE, UCS-4/UTF-32

|-

−

| ~~magic number~~ || '''None''' |~~| <~~BOM>

+

| [[wikipedia:Magic_number_(programming)|Magic Number]] || '''None''', [[wikipedia:Byte_order_mark|BOM]]

|-

−

| ~~line delimiter~~ || \n || \r || '''\r\n'''

+

| Line Delimiter || \n, \r, '''\r\n'''

|-

−

| ~~line delimiter~~ after ~~last line || no~~ || '''~~yes~~'''

+

| Line Delimiter after Last Line || '''Yes''', No

|-

−

| ~~field delimiter~~ || '''<tab>''' || , ~~|| ;~~

+

| Field Delimiter || '''<tab>''', <comma> , <semicolon>

|-

−

| ~~field delimiter~~ after ~~last field~~ || '''no''' ~~|| yes~~

+

| Field Delimiter after Last Field || Yes, '''No'''

|-

−

| Quoting ~~character~~ || '''None''' || " ~~|| '~~

+

| Quoting Character || '''None''', ', "

|-

−

| Escape ~~character~~ || ~~'''None''' || \~~

+

| Escape QC by doubling || Yes, No

|-

−

| ~~First line~~ || '''~~Contains header~~''' ~~|| Contains data~~

+

| Escape Character || '''None''', \

|-

−

| ~~Last field in line~~ || '''~~Must not be empty~~''' ~~|| May be empty~~

+

| First Line Contains: || '''Header''', Data

|-

−

| ~~Whitespace following delimiter~~ || '''~~Part of field~~''' ~~|| Not part of field~~

+

| Empty Last Field in Line || Allowed, '''Not Allowed'''

|-

+

| Whitespace Following Delimiter || '''Part of Field''', Excluded

+

|-

+

| Decimal Separator || '''<dot>''', <comma>

+

|-

+

| Thousands Separator || '''None''', <dot>, <space>, U+2009

|}

+

Note that tab characters and newlines cannot be present in field content.

E.vandenberge

Bureaucrats, Administrators

1,344

edits

Changes

Data Files (view source)

Revision as of 17:36, 20 February 2017

Navigation menu

Search