Difference between revisions of "Data Files"
A.datadien (talk | contribs) (start of simpler text (WIP)) |
A.datadien (talk | contribs) (in progress save) |
Revision as of 10:29, 20 February 2017
TSG suggested file format for experiment data
The TSG suggests the following file format for data storage for most experiments.
Lines
- Lines are separated by the \r\n line delimiter for better compatibility between operating systems.
- The line delimiter should also be added after the last line, because...
- The first line contains a header with column/field names.
Fields
- Fields are separated by the tab field delimiter, because tabs rarely occur in text and therefore require no escaping.
- The field delimiter should also be added after each line's last field, because...
- The last field in a line must not be empty, because... (open question: what to do if there is no value?)
- Fields are not surrounded by a quoting character.
- Whitespace between field delimiters is considered part of the field.
- There is no defined escape character. If your data can contain tabs, use a different field delimiter or file format.
File
- The file is encoded as ASCII or UTF-8.
- The file contains no byte order mark (BOM) or other magic number.
- Last field must not be empty (what if no data?)
Explanation for these choices (also nice for ourselves). In progress..
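The suggestions above can be sketched as a small writer. This is only an illustration, not a TSG tool: the names format_tsv and write_tsv are hypothetical, and the choice to raise an error on an empty last field is an assumption, since the text leaves open what to do when there is no value.

```python
def format_tsv(header, rows):
    """Return the file content as text, following the suggestions above."""
    def format_line(fields):
        fields = [str(field) for field in fields]
        for field in fields:
            # There is no escape character, so delimiters may not occur in data.
            if '\t' in field or '\r' in field or '\n' in field:
                raise ValueError('field contains a delimiter: %r' % field)
        # Assumption: reject an empty last field instead of inventing a placeholder.
        if fields and fields[-1] == '':
            raise ValueError('last field in a line must not be empty')
        # Field delimiter after every field (including the last), then \r\n.
        return ''.join(field + '\t' for field in fields) + '\r\n'

    return ''.join(format_line(line) for line in [header] + list(rows))


def write_tsv(path, header, rows):
    # newline='' stops Python from translating '\r\n';
    # the 'utf-8' codec does not write a byte order mark.
    with open(path, 'w', encoding='utf-8', newline='') as f:
        f.write(format_tsv(header, rows))
```

For example, write_tsv('example.tsv', ['id', 'rt'], [[1, 0.5], [2, 0.7]]) would produce a header line followed by two data lines, each ending in a tab and \r\n.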
File Format
Data can be saved in a lot of file formats. If there is no reason to do otherwise, we prefer delimited files with the options shown in bold. Alternative options are also shown.
file extension | tsv | csv | dat | txt
file encoding | ASCII | UTF-8 | UTF-16BE | UTF-16LE | UCS-4/UTF-32
magic number | none | <BOM>
line delimiter | \n | \r | \r\n
line delimiter after last line | no | yes
field delimiter | <tab> | , | ;
field delimiter after last field | no | yes
quoting character | none | " | '
escape quoting character by doubling | no | yes
escape character | none | \
first line | contains header | contains data
last field in line | must not be empty | may be empty
whitespace following delimiter | part of field | not part of field
decimal separator | . | ,
thousands separator | none | . | ␣ | U+2009
Note that tab characters and newlines cannot be present in field content.
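A rough sketch of how a file could be checked against some of these options. It assumes the preferred options are the ones listed in the suggested format (UTF-8 or ASCII, no BOM, \r\n after every line, a tab after every field); the function name check_preferred_format is hypothetical.

```python
def check_preferred_format(data):
    """Return a list of problems found in the raw bytes of a data file."""
    problems = []
    if data.startswith(b'\xef\xbb\xbf'):
        problems.append('file starts with a UTF-8 byte order mark')
    try:
        text = data.decode('utf-8')  # ASCII is a subset of UTF-8
    except UnicodeDecodeError:
        return problems + ['file is not valid ASCII or UTF-8']
    if not text.endswith('\r\n'):
        problems.append('no line delimiter after the last line')
    # After removing every \r\n, no stray \r or \n may remain.
    rest = text.replace('\r\n', '')
    if '\r' in rest or '\n' in rest:
        problems.append('line delimiter other than \\r\\n found')
    # The final element of the split is the text after the last \r\n; skip it.
    for number, line in enumerate(text.split('\r\n')[:-1], start=1):
        if not line.endswith('\t'):
            problems.append('line %d: no field delimiter after last field' % number)
        elif line == '\t' or line.endswith('\t\t'):
            problems.append('line %d: last field is empty' % number)
    return problems
```

Reading the file with open('example.tsv', 'rb').read() and passing the result to this function returns an empty list when none of these problems are found.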
Parsing
An example file is available as File:Example.zip (zipped). Such files can be imported in many languages:
Python Standard Library
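    import csv

    # Python 3: open in text mode; newline='' lets the csv module handle \r\n itself
    with open('example.tsv', newline='', encoding='utf-8') as csvfile:
        reader = csv.reader(csvfile, delimiter='\t', quoting=csv.QUOTE_NONE)
        for row in reader:
            print(', '.join(row))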
or with header extraction
    import csv

    # Python 3: open in text mode; newline='' lets the csv module handle \r\n itself
    with open('example.tsv', newline='', encoding='utf-8') as csvfile:
        reader = csv.DictReader(csvfile, delimiter='\t', quoting=csv.QUOTE_NONE)
        print(', '.join(reader.fieldnames))  # print header
        for row in reader:
            print(', '.join([row[key] for key in reader.fieldnames]))
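Note that in Python 2 the file should be opened in binary mode ('rb') and the field content remains UTF-8 encoded (type str). In Python 3 the file should be opened in text mode with newline='' and the fields are Unicode strings (type str).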
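If the file follows the suggestion to add a field delimiter after the last field, the csv module returns one extra empty string at the end of every row. A minimal sketch of dropping it is shown below; the helper name read_rows is hypothetical, and it relies on the rule that the last real field is never empty.

```python
import csv
import io


def read_rows(textfile):
    """Yield rows from a tab-delimited text file, without the trailing empty field."""
    reader = csv.reader(textfile, delimiter='\t', quoting=csv.QUOTE_NONE)
    for row in reader:
        # A trailing tab produces one empty string at the end of the row.
        if row and row[-1] == '':
            row = row[:-1]
        yield row


# In-memory sample standing in for a real file (illustrative data only).
sample = io.StringIO('id\trt\t\r\n1\t0.5\t\r\n', newline='')
rows = list(read_rows(sample))
```

With a real file, pass the object returned by open('example.tsv', newline='', encoding='utf-8') instead of the in-memory sample.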
Python Pandas
Pandas can infer column types from the data. The file itself does not store type information, so if you need specific types you will have to store them separately or hardcode them.
    import csv  # needed for csv.QUOTE_NONE
    import pandas as pd

    d = pd.read_csv('example.tsv', delimiter='\t', skip_blank_lines=False, quoting=csv.QUOTE_NONE)
GNU R
    d <- read.csv("example.tsv", header = TRUE, sep = "\t", quote = "")