Difference between revisions of "Data Files"
A.datadien (talk | contribs) |
|||
(10 intermediate revisions by 2 users not shown) | |||
Line 4: | Line 4: | ||
===== File ===== | ===== File ===== | ||
* File encoding is ASCII or UTF-8. | * File encoding is ASCII or UTF-8. | ||
− | * The file contains no byte order mark (BOM) or other magic number. | + | * The file contains no byte order mark (BOM) or other magic number. This makes it ASCII compatible. |
===== Lines ===== | ===== Lines ===== | ||
* Lines are separated by the '''\r\n''' line delimiter for better compatibility between operating systems. | * Lines are separated by the '''\r\n''' line delimiter for better compatibility between operating systems. | ||
− | * The line delimiter should also be added after the last line | + | * The line delimiter should also be added after the last line. This simplifies stream reading since all records (lines) are terminated. This allows for the use of a readline() function for acquiring a line. |
* The first line contains a header with column/field names. | * The first line contains a header with column/field names. | ||
===== Fields ===== | ===== Fields ===== | ||
− | * | + | * Fields are separated by the '''tab''' field delimiter, because they rarely occur in texts. This allows for the use of comma's and semicolons in sentences without using an escape character. |
− | * The field delimiter should | + | * The field delimiter should '''not''' be added after each line's last field. This allows for the use of a split() function for parsing a line. |
− | * The last field in a line must not be empty, because | + | * The last field in a line must not be empty, because it will show to parsers that the previous rule was obeyed. |
− | * Fields are | + | * Fields are never surrounded by a quoting character. |
− | * White space | + | * White space before or after field delimiters are considered part of a field. |
− | * There is no defined escape character. If your data can contain tabs, use a different field delimiter or file format. | + | * There is no defined escape character. If your data can contain tabs or newlines, use a different field delimiter or file format. |
===== Data ===== | ===== Data ===== | ||
* For numbers the decimal separator is a dot, not a comma. There is no thousands separator. | * For numbers the decimal separator is a dot, not a comma. There is no thousands separator. | ||
− | == | + | == Example == |
− | + | ||
− | + | An example of what a file in this format may look like: | |
− | + | <pre> | |
− | + | User ID	Hair color	Response time	 | |
− | + | 1	brown	1.4	 | |
− | + | 2	blond	1230.434	 | |
− | + | 3	brown	0.399	 | |
− | + | ||
− | + | </pre> | |
− | + | An example file can be downloaded here [[File:Example.zip|thumb]] (sorry, it is zipped). | |
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
== Parsing == | == Parsing == | ||
− | + | Importing such files can be done in many languages: | |
=== Python Standard Library=== | === Python Standard Library=== | ||
<nowiki> | <nowiki> | ||
Line 90: | Line 66: | ||
d <- read.csv("example.tsv", head=TRUE, sep = "\t") | d <- read.csv("example.tsv", head=TRUE, sep = "\t") | ||
</nowiki> | </nowiki> | ||
+ | |||
+ | == Alternatives == | ||
+ | Data can be saved in a lot of file formats. If there is no reason to do otherwise, we prefer delimited files with the options shown in bold. Alternative options are also shown. | ||
+ | {| class="wikitable" | ||
+ | |- | ||
+ | | File Extension || '''tsv''', csv, '''dat''', txt | ||
+ | |- | ||
+ | | File Encoding || '''ASCII''', '''UTF-8''', UTF-16BE, UTF-16LE, UCS-4/UTF-32 | ||
+ | |- | ||
+ | | [[wikipedia:Magic_number_(programming)|Magic Number]] || '''None''', [[wikipedia:Byte_order_mark|BOM]] | ||
+ | |- | ||
+ | | Line Delimiter || \n, \r, '''\r\n''' | ||
+ | |- | ||
+ | | Line Delimiter after Last Line || '''Yes''', No | ||
+ | |- | ||
+ | | Field Delimiter || '''<tab>''', <comma> , <semicolon> | ||
+ | |- | ||
+ | | Field Delimiter after Last Field || Yes, '''No''' | ||
+ | |- | ||
+ | | Quoting Character || '''None''', ', " | ||
+ | |- | ||
+ | | Escape QC by doubling || Yes, No | ||
+ | |- | ||
+ | | Escape Character || '''None''', \ | ||
+ | |- | ||
+ | | First Line Contains: || '''Header''', Data | ||
+ | |- | ||
+ | | Empty Last Field in Line || Allowed, '''Not Allowed''' | ||
+ | |- | ||
+ | | Whitespace Following Delimiter || '''Part of Field''', Excluded | ||
+ | |- | ||
+ | | Decimal Separator || '''<dot>''', <comma> | ||
+ | |- | ||
+ | | Thousands Separator || '''None''', <dot>, <space>, U+2009 | ||
+ | |} | ||
+ | Note that tab characters and newlines cannot be present in field content. |
Latest revision as of 16:36, 20 February 2017
TSG suggested file format for experiment data
The TSG suggests a common file format for storing experimental data. Adhering to this format whenever practical makes it easier to re-use files and tools. The file is plain text for easy inspection and manipulation. The file format is a tab-separated values (tsv) file with the following specifications:
File
- File encoding is ASCII or UTF-8.
- The file contains no byte order mark (BOM) or other magic number. This makes it ASCII compatible.
Lines
- Lines are separated by the \r\n line delimiter for better compatibility between operating systems.
- The line delimiter should also be added after the last line. This simplifies stream reading since all records (lines) are terminated. This allows for the use of a readline() function for acquiring a line.
- The first line contains a header with column/field names.
Fields
- Fields are separated by the tab field delimiter, because they rarely occur in texts. This allows for the use of comma's and semicolons in sentences without using an escape character.
- The field delimiter should not be added after each line's last field. This allows for the use of a split() function for parsing a line.
- The last field in a line must not be empty, because it will show to parsers that the previous rule was obeyed.
- Fields are never surrounded by a quoting character.
- White space before or after field delimiters are considered part of a field.
- There is no defined escape character. If your data can contain tabs or newlines, use a different field delimiter or file format.
Data
- For numbers the decimal separator is a dot, not a comma. There is no thousands separator.
Example
An example of what a file in this format may look like:
User ID Hair color Response time 1 brown 1.4 2 blond 1230.434 3 brown 0.399
An example file can be downloaded here File:Example.zip (sorry, it is zipped).
Parsing
Importing such files can be done in many languages:
Python Standard Library
import csv with open('example.tsv', 'rb') as csvfile: reader = csv.reader(csvfile, delimiter='\t', quoting=csv.QUOTE_NONE) for row in reader: print(', '.join(row))
or with header extraction
import csv with open('example.tsv', 'rb') as csvfile: reader = csv.DictReader(csvfile, delimiter='\t', quoting=csv.QUOTE_NONE) print(', '.join(reader.fieldnames)) # print header for row in reader: print(', '.join([row[key] for key in reader.fieldnames]))
Note that when using Python 2 the field content will remain UTF-8 encoded (type=str). In Python3 strings will unicode (type=string).
Python Pandas
Pandas can interpret column type. You will have to store it separately or hardcode it.
import pandas as pd d = pd.read_csv('example.tsv', delimiter='\t', skip_blank_lines=False, quoting=csv.QUOTE_NONE)
GNU R
d <- read.csv("example.tsv", head=TRUE, sep = "\t")
Alternatives
Data can be saved in a lot of file formats. If there is no reason to do otherwise, we prefer delimited files with the options shown in bold. Alternative options are also shown.
File Extension | tsv, csv, dat, txt |
File Encoding | ASCII, UTF-8, UTF-16BE, UTF-16LE, UCS-4/UTF-32 |
Magic Number | None, BOM |
Line Delimiter | \n, \r, \r\n |
Line Delimiter after Last Line | Yes, No |
Field Delimiter | <tab>, <comma> , <semicolon> |
Field Delimiter after Last Field | Yes, No |
Quoting Character | None, ', " |
Escape QC by doubling | Yes, No |
Escape Character | None, \ |
First Line Contains: | Header, Data |
Empty Last Field in Line | Allowed, Not Allowed |
Whitespace Following Delimiter | Part of Field, Excluded |
Decimal Separator | <dot>, <comma> |
Thousands Separator | None, <dot>, <space>, U+2009 |
Note that tab characters and newlines cannot be present in field content.