Difference between revisions of "Data Files"
A.datadien (start of simpler text (WIP))
Revision as of 17:17, 16 February 2017
TSG suggested File Format For Dummies (WIP by Arvind for extra readability!)
The TSG suggests the following file format for data storage for most experiments.
- Field delimiter: tab (also after last field)
- Line delimiter: \r\n (also after last line)
- Quoting character: none
- File extension: "tsv"
- File encoding: ASCII / UTF-8
- Magic number: none
- First line contains header, not data
- Last field must not be empty (what if no data?)
Explanation for these choices (also useful for ourselves): in progress.
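The conventions listed above can be sketched as a small writer. This is only an illustration under stated assumptions: the function names `format_line` and `write_tsv`, the output file name, and the column names are hypothetical and not part of the TSG recommendation.

```python
FIELD_DELIMITER = "\t"   # tab, also written after the last field
LINE_DELIMITER = "\r\n"  # also written after the last line


def format_line(fields):
    # Every field is followed by a tab, so the line ends in tab + CR LF.
    # No quoting or escaping is applied.
    return "".join(str(field) + FIELD_DELIMITER for field in fields) + LINE_DELIMITER


def write_tsv(path, header, rows):
    # newline="" stops Python from translating the line endings we write ourselves.
    with open(path, "w", encoding="utf-8", newline="") as handle:
        handle.write(format_line(header))  # first line contains the header
        for row in rows:
            handle.write(format_line(row))


if __name__ == "__main__":
    # Hypothetical experiment data: subject number and reaction time.
    write_tsv("example_out.tsv", ["subject", "rt_ms"], [[1, 523], [2, 611]])
```

Opening the file without a byte order mark and with the extension "tsv" covers the "magic number" and "file extension" items as well.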
File Format
Data can be saved in many file formats. Unless there is a reason to do otherwise, we prefer delimited files with the options shown in bold. Alternative options are also shown.
| file extension | tsv | csv | dat | txt |
| file encoding | ASCII | UTF-8 | UTF-16BE | UTF-16LE | UCS-4/UTF-32 |
| magic number | none | <BOM> |
| line delimiter | \n | \r | \r\n |
| line delimiter after last line | no | yes |
| field delimiter | <tab> | , | ; |
| field delimiter after last field | no | yes |
| quoting character | none | " | ' |
| escape quoting character by doubling | no | yes |
| escape character | none | \ |
| first line | contains header | contains data |
| last field in line | must not be empty | may be empty |
| whitespace following delimiter | part of field | not part of field |
| decimal separator | . | , |
| thousands separator | none | . | ␣ | U+2009 |
Note that tab characters and newlines cannot be present in field content.
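Because the preferred format has no quoting or escape character, a writer has to reject such content rather than encode it. A minimal sketch of that check follows; the name `check_field` is hypothetical, and treating both `\r` and `\n` as "newlines" is an assumption.

```python
# Characters that would be read back as a field or line delimiter.
FORBIDDEN = ("\t", "\r", "\n")


def check_field(value):
    """Return the field as text, or raise ValueError if it cannot be stored."""
    text = str(value)
    for char in FORBIDDEN:
        if char in text:
            raise ValueError(
                "field contains forbidden character %r: %r" % (char, text)
            )
    return text
```

Calling `check_field` on every value before writing makes the problem visible at save time instead of producing a file with shifted columns.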
Parsing
Here is an example file: File:Example.zip (sorry, it is zipped). Such files can be imported in many languages:
Python Standard Library
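import csv

# Python 3: open in text mode with newline='' so csv handles the \r\n line endings.
# (In Python 2, use open('example.tsv', 'rb') instead.)
with open('example.tsv', newline='', encoding='utf-8') as csvfile:
    reader = csv.reader(csvfile, delimiter='\t', quoting=csv.QUOTE_NONE)
    for row in reader:
        print(', '.join(row))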
or with header extraction
import csv

# Python 3; in Python 2, use open('example.tsv', 'rb') instead.
with open('example.tsv', newline='', encoding='utf-8') as csvfile:
    reader = csv.DictReader(csvfile, delimiter='\t', quoting=csv.QUOTE_NONE)
    print(', '.join(reader.fieldnames))  # print header
    for row in reader:
        print(', '.join(row[key] for key in reader.fieldnames))
Note that in Python 2 the field content remains UTF-8 encoded bytes (type str). In Python 3 the fields are Unicode strings (type str).
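One detail of the suggested format is worth handling explicitly: with a tab after the last field, the csv module reports one extra empty string at the end of every row. The sketch below drops it. It assumes Python 3 and relies on the rule that the last real field is never empty; the helper name `read_rows` and the sample data are hypothetical.

```python
import csv
import io


def read_rows(handle):
    """Yield each line as a list of fields, without the trailing empty field."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # The tab after the last field shows up as one extra empty string.
        if row and row[-1] == "":
            row = row[:-1]
        yield row


# In-memory stand-in for a file written in the suggested format.
sample = io.StringIO("name\tscore\t\r\nanna\t12\t\r\nbob\t7\t\r\n", newline="")
rows = list(read_rows(sample))
header, data = rows[0], rows[1:]
```

For a file on disk, the same function can be given a handle opened with `newline=''`.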
Python Pandas
Pandas can infer column types. The file itself does not record them, so if you need specific types you will have to store them separately or hardcode them.
import csv  # needed for csv.QUOTE_NONE
import pandas as pd

d = pd.read_csv('example.tsv', delimiter='\t', skip_blank_lines=False, quoting=csv.QUOTE_NONE)
GNU R
# quote = "" disables quoting, matching the "quoting character: none" choice
d <- read.csv("example.tsv", header = TRUE, sep = "\t", quote = "")