Changes

2,073 bytes added, 17:36, 20 February 2017

Line 1: Line 1: −

== ~~File Format~~ ==

+

== TSG suggested file format for experiment data ==

−

~~Data can be saved in~~ a ~~lot of~~ file ~~formats~~. ~~If there is no reason~~ to ~~do otherwise, we prefer delimited~~ files ~~with the options shown in bold~~. ~~Alternative options are also shown~~.

+

The TSG suggests a common file format for storing experimental data. Adhering to this format whenever practical makes it easier to re-use files and tools. The file is plain text, so it is easy to inspect and manipulate. The format is tab-separated values (TSV) with the following specifications:

−

~~{| class="wikitable"~~

+

−

|-

+

===== File =====

−

~~| file extension || '''tsv''' || csv || '''dat'''~~

+

* File encoding is ASCII or UTF-8.

−

|-

+

* The file contains no byte order mark (BOM) or other magic number. A file that holds only ASCII characters therefore stays valid ASCII.
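
The two file rules above can be checked before any parsing is done. The following is a minimal Python sketch, not an official TSG tool; the function name <code>check_encoding</code> is hypothetical, and the list of rejected byte order marks (UTF-8, UTF-16, UTF-32) is an assumption about what "BOM" should cover.

```python
import codecs

# Byte order marks that must not appear at the start of a file.
# Assumption: the UTF-8, UTF-16 and UTF-32 marks are the ones worth rejecting.
_BOMS = (
    codecs.BOM_UTF8,
    codecs.BOM_UTF16_LE,
    codecs.BOM_UTF16_BE,
    codecs.BOM_UTF32_LE,
    codecs.BOM_UTF32_BE,
)


def check_encoding(raw: bytes) -> str:
    """Return the decoded text, or raise ValueError if a file rule is broken."""
    if raw.startswith(_BOMS):
        raise ValueError("file starts with a byte order mark")
    try:
        # ASCII is a subset of UTF-8, so one decode covers both encodings.
        return raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        raise ValueError("file is not valid ASCII or UTF-8") from exc
```

Reading the file in binary mode (<code>open(path, "rb")</code>) and passing the bytes to this function keeps the check independent of the platform's default encoding.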

−

~~| file extension || '''ascii''' || '''~~UTF-8~~''' || UTF-16BE || UTF-16LE || UCS-4/UTF-32~~

+

−

|-

+

===== Lines =====

−

| magic number ~~|| '''None''' || <BOM>~~

+

* Lines are separated by the '''\r\n''' line delimiter for better compatibility between operating systems.

−

|-

+

* The line delimiter should also be added after the last line, so that every record (line) is terminated. This simplifies stream reading and allows a readline() function to be used for acquiring each line.

−

~~| line delimiter || \n || \r ||~~ '''\r\n'''

+

* The first line contains a header with column/field names.
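
As an illustration of the line rules above, here is a minimal Python sketch that splits file content into records and rejects content that is not terminated as described. The function name <code>split_records</code> is hypothetical, and treating a bare CR or LF inside a line as an error is an assumption drawn from the rule that field content cannot contain newlines.

```python
def split_records(text: str) -> list[str]:
    """Split file content into lines, requiring CR LF after every line."""
    if text and not text.endswith("\r\n"):
        raise ValueError("last line is not terminated by CR LF")
    # The piece after the final delimiter is always empty, so drop it.
    records = text.split("\r\n")[:-1]
    for record in records:
        # Assumption: a lone CR or LF means the file used another delimiter.
        if "\r" in record or "\n" in record:
            raise ValueError("bare CR or LF inside a line")
    return records


# The first record is the header, the rest are data lines.
header, *rows = split_records("User ID\tHair color\r\n1\tbrown\r\n")
```

When reading from disk in Python, opening the file with <code>open(path, newline="")</code> keeps the '''\r\n''' delimiters intact instead of translating them.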

−

|-

+

−

| line delimiter after last line ~~|| no || '''yes'''~~

+

===== Fields =====

−

|-

+

* Fields are separated by the '''tab''' field delimiter, because tabs rarely occur in text. This allows commas and semicolons to be used in sentences without an escape character.

−

~~| field delimiter ||~~ '''<tab>''' ~~|| , || ;~~

+

* The field delimiter should '''not''' be added after each line's last field. This allows for the use of a split() function for parsing a line.

−

|-

+

* The last field in a line must not be empty. A non-empty last field shows parsers that the previous rule was obeyed, because a trailing tab would otherwise look the same as an empty last field.

−

| field delimiter ~~after last field ||~~ '~~''no''' || yes~~

+

* Fields are never surrounded by a quoting character.

−

|-

+

* White space before or after a field delimiter is considered part of the field.

−

~~| quoting~~ character ~~|| '''None''' || " || '~~

+

* There is no defined escape character. If your data can contain tabs or newlines, use a different field delimiter or file format.
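
The field rules above can be sketched in a few lines of Python. This is only an illustration: the function name <code>parse_line</code> and the extra check that every line has as many fields as the header are assumptions, not part of the specification.

```python
def parse_line(line: str, n_fields: int) -> list[str]:
    """Split one line (without its line delimiter) into fields."""
    # No quoting and no escape character, so a plain split is enough.
    fields = line.split("\t")
    if fields[-1] == "":
        raise ValueError("last field is empty (trailing tab?)")
    # Assumption: every line should have as many fields as the header.
    if len(fields) != n_fields:
        raise ValueError(f"expected {n_fields} fields, got {len(fields)}")
    return fields


header = parse_line("User ID\tHair color\tResponse time", 3)
row = parse_line("1\t brown \t1.4", 3)  # the spaces around "brown" are kept
```

Because white space is part of the field, the sketch deliberately does not call <code>strip()</code> on the fields.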

−

|-

+

−

~~| escape qc by doubling || no || yes~~

+

===== Data =====

−

|-

+

* For numbers the decimal separator is a dot, not a comma. There is no thousands separator.
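
A strict reader for numeric fields could look like the Python sketch below. The function name <code>parse_number</code> is hypothetical, and the accepted pattern (an optional sign, digits, and an optional dot followed by digits, with no exponent) is an assumption; the rule above only fixes the decimal separator and forbids a thousands separator.

```python
import re

# Assumption: optional sign, ASCII digits, optional fraction, no exponent.
_NUMBER = re.compile(r"[+-]?[0-9]+(?:\.[0-9]+)?")


def parse_number(field: str) -> float:
    """Convert a field to float, rejecting commas and thousands separators."""
    if not _NUMBER.fullmatch(field):
        raise ValueError(f"not a plain dot-decimal number: {field!r}")
    return float(field)


response_time = parse_number("1230.434")
```

Python's <code>float()</code> and <code>str()</code> use the dot regardless of the system locale, so values written with <code>str()</code> already follow this rule.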

−

~~| escape character ||~~ '''~~none~~''' ~~|| \~~

+

−

|-

+

== Example ==

−

~~| first~~ line || '~~''contains header''' || contains data~~

+

−

|-

+

An example of what a file in this format may look like:

−

| last field in line ~~|| '''~~must not be empty~~''' || may be empty~~

+

<pre>

−

|-

+

User ID	Hair color	Response time

−

~~| whitespace following delimiter || '''~~part of field~~''' || not part of~~ field

+

1	brown	1.4

−

|-

+

2	blond	1230.434

−

| decimal separator ~~|| '''~~.~~''' || ,~~

+

3	brown	0.399

−

|-

+

−

~~| thousands separator || '''none''' ||~~ . ~~|| ␣ || U+2009~~

+

</pre>

−

|}

+

An example file can be downloaded here: [[File:Example.zip|thumb]] (zipped).
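
A file like the example can be produced with the <code>csv</code> module from the Python standard library. The sketch below is one possible way to do it, not a prescribed TSG method; the function name <code>write_tsv</code> and the in-memory buffer are only for illustration.

```python
import csv
import io

rows = [
    ["User ID", "Hair color", "Response time"],
    [1, "brown", 1.4],
    [2, "blond", 1230.434],
    [3, "brown", 0.399],
]


def write_tsv(stream, rows):
    """Write rows as tab-separated values with CR LF after every line."""
    writer = csv.writer(
        stream,
        delimiter="\t",
        lineterminator="\r\n",
        quoting=csv.QUOTE_NONE,  # fields are never quoted
        quotechar=None,          # so double quotes in data are ordinary text
    )
    # With no escape character set, a field that contains a tab or a
    # newline makes the writer raise csv.Error instead of writing bad data.
    writer.writerows(rows)


buffer = io.StringIO(newline="")  # newline="" prevents newline translation
write_tsv(buffer, rows)
```

To write to disk instead, open the target with <code>open(path, "w", newline="", encoding="utf-8")</code> and pass that file object as <code>stream</code>.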

−

~~Note that tab characters and newlines cannot~~ be ~~present in field content~~.

+

== Parsing ==

−

Importing ~~these~~ files can be done in many languages:

+

Importing such files can be done in many languages:

=== Python Standard Library ===

Line 66: Line 66:

d <- read.csv("example.tsv", header = TRUE, sep = "\t")

</nowiki>

+

== Alternatives ==

+

Data can be saved in many file formats. Unless there is a reason to do otherwise, we prefer delimited files with the options shown in bold. The alternative options are listed next to them.

+

{| class="wikitable"

+

|-

+

| File Extension || '''tsv''', csv, '''dat''', txt

+

|-

+

| File Encoding || '''ASCII''', '''UTF-8''', UTF-16BE, UTF-16LE, UCS-4/UTF-32

+

|-

+

| [[wikipedia:Magic_number_(programming)|Magic Number]] || '''None''', [[wikipedia:Byte_order_mark|BOM]]

+

|-

+

| Line Delimiter || \n, \r, '''\r\n'''

+

|-

+

| Line Delimiter after Last Line || '''Yes''', No

+

|-

+

| Field Delimiter || '''<tab>''', <comma>, <semicolon>

+

|-

+

| Field Delimiter after Last Field || Yes, '''No'''

+

|-

+

| Quoting Character || '''None''', ', "

+

|-

+

| Escape Quoting Character by Doubling || Yes, No

+

|-

+

| Escape Character || '''None''', \

+

|-

+

| First Line Contains: || '''Header''', Data

+

|-

+

| Empty Last Field in Line || Allowed, '''Not Allowed'''

+

|-

+

| Whitespace Following Delimiter || '''Part of Field''', Excluded

+

|-

+

| Decimal Separator || '''<dot>''', <comma>

+

|-

+

| Thousands Separator || '''None''', <dot>, <space>, U+2009

+

|}

+

Note that tab characters and newlines cannot be present in field content.

E.vandenberge


Data Files (view source)

Revision as of 17:36, 20 February 2017
