Difference between revisions of "Data Files"
Jump to navigation
Jump to search
(Created page with "Data can be saved in a lot of file formats. If there is no reason to do otherwise, we prefer delimited files with the options shown in bold. Alternative options are also shown...") |
|||
Line 1: | Line 1: | ||
+ | == File Format == | ||
Data can be saved in a lot of file formats. If there is no reason to do otherwise, we prefer delimited files with the options shown in bold. Alternative options are also shown. | Data can be saved in a lot of file formats. If there is no reason to do otherwise, we prefer delimited files with the options shown in bold. Alternative options are also shown. | ||
{| class="wikitable" | {| class="wikitable" | ||
Line 16: | Line 17: | ||
| field delimiter after last field || '''no''' || yes | | field delimiter after last field || '''no''' || yes | ||
|- | |- | ||
− | | | + | | quoting character || '''None''' || " || ' |
|- | |- | ||
− | | | + | | escape qc by doubling || no || yes |
|- | |- | ||
− | | | + | | escape character || '''none''' || \ |
|- | |- | ||
− | | | + | | first line || '''contains header''' || contains data |
|- | |- | ||
− | | | + | | last field in line || '''must not be empty''' || may be empty |
|- | |- | ||
+ | | whitespace following delimiter || '''part of field''' || not part of field | ||
+ | |- | ||
+ | | decimal separator || '''.''' || , | ||
+ | |- | ||
+ | | thousands separator || '''none''' || . || ␣ || U+2009 | ||
|} | |} | ||
+ | Note that tab characters and newlines cannot be present in field content. | ||
+ | == Parsing == | ||
+ | Importing these files can be done in many languages: | ||
+ | === Python Standard Library=== | ||
+ | <nowiki> | ||
+ | import csv | ||
+ | with open('example.tsv', 'rb') as csvfile: | ||
+ | reader = csv.reader(csvfile, delimiter='\t', quoting=csv.QUOTE_NONE) | ||
+ | for row in reader: | ||
+ | print(', '.join(row)) | ||
+ | </nowiki> | ||
+ | or with header extraction | ||
+ | <nowiki> | ||
+ | import csv | ||
+ | with open('example.tsv', 'rb') as csvfile: | ||
+ | reader = csv.DictReader(csvfile, delimiter='\t', quoting=csv.QUOTE_NONE) | ||
+ | print(', '.join(reader.fieldnames)) # print header | ||
+ | for row in reader: | ||
+ | print(', '.join([row[key] for key in reader.fieldnames])) | ||
+ | </nowiki> | ||
+ | Note that when using Python 2 the field content will remain UTF-8 encoded (type=str). In Python3 strings will unicode (type=string). | ||
+ | |||
+ | === Python Pandas === | ||
+ | Pandas can interpret column type. You will have to store it separately or hardcode it. | ||
+ | <nowiki> | ||
+ | import pandas as pd | ||
+ | |||
+ | d = pd.read_csv('example.tsv', delimiter='\t', skip_blank_lines=False, quoting=csv.QUOTE_NONE) | ||
+ | </nowiki> | ||
+ | === GNU R === | ||
+ | <nowiki> | ||
+ | d <- read.csv("example.tsv", head=TRUE, sep = "\t") | ||
+ | </nowiki> |
Revision as of 16:32, 13 February 2017
File Format
Data can be saved in a lot of file formats. If there is no reason to do otherwise, we prefer delimited files with the options shown in bold. Alternative options are also shown.
file extension | tsv | csv | dat | ||
file extension | ascii | UTF-8 | UTF-16BE | UTF-16LE | UCS-4/UTF-32 |
magic number | None | <BOM> | |||
line delimiter | \n | \r | \r\n | ||
line delimiter after last line | no | yes | |||
field delimiter | <tab> | , | ; | ||
field delimiter after last field | no | yes | |||
quoting character | None | " | ' | ||
escape qc by doubling | no | yes | |||
escape character | none | \ | |||
first line | contains header | contains data | |||
last field in line | must not be empty | may be empty | |||
whitespace following delimiter | part of field | not part of field | |||
decimal separator | . | , | |||
thousands separator | none | . | ␣ | U+2009 |
Note that tab characters and newlines cannot be present in field content.
Parsing
Importing these files can be done in many languages:
Python Standard Library
import csv with open('example.tsv', 'rb') as csvfile: reader = csv.reader(csvfile, delimiter='\t', quoting=csv.QUOTE_NONE) for row in reader: print(', '.join(row))
or with header extraction
import csv with open('example.tsv', 'rb') as csvfile: reader = csv.DictReader(csvfile, delimiter='\t', quoting=csv.QUOTE_NONE) print(', '.join(reader.fieldnames)) # print header for row in reader: print(', '.join([row[key] for key in reader.fieldnames]))
Note that when using Python 2 the field content will remain UTF-8 encoded (type=str). In Python3 strings will unicode (type=string).
Python Pandas
Pandas can interpret column type. You will have to store it separately or hardcode it.
import pandas as pd d = pd.read_csv('example.tsv', delimiter='\t', skip_blank_lines=False, quoting=csv.QUOTE_NONE)
GNU R
d <- read.csv("example.tsv", head=TRUE, sep = "\t")