Difference between revisions of "Data Files"

From TSG Doc
Jump to navigation Jump to search
Line 3: Line 3:
 
{| class="wikitable"
 
{| class="wikitable"
 
|-
 
|-
| file extension || '''tsv''' || csv || '''dat'''
+
| file extension || '''tsv''' || csv || '''dat''' || txt
 
|-
 
|-
 
| file extension || '''ascii''' || '''UTF-8''' || UTF-16BE || UTF-16LE || UCS-4/UTF-32
 
| file extension || '''ascii''' || '''UTF-8''' || UTF-16BE || UTF-16LE || UCS-4/UTF-32
Line 34: Line 34:
 
|}
 
|}
 
Note that tab characters and newlines cannot be present in field content.
 
Note that tab characters and newlines cannot be present in field content.
 +
 
== Parsing ==
 
== Parsing ==
 
Here is an example [[File:Example.zip|thumb]] file. Sorry, it is zipped. Importing such files can be done in many languages:
 
Here is an example [[File:Example.zip|thumb]] file. Sorry, it is zipped. Importing such files can be done in many languages:

Revision as of 11:19, 14 February 2017

File Format

Data can be saved in a lot of file formats. If there is no reason to do otherwise, we prefer delimited files with the options shown in bold. Alternative options are also shown.

file extension tsv csv dat txt
file extension ascii UTF-8 UTF-16BE UTF-16LE UCS-4/UTF-32
magic number None <BOM>
line delimiter \n \r \r\n
line delimiter after last line no yes
field delimiter <tab> , ;
field delimiter after last field no yes
quoting character None " '
escape qc by doubling no yes
escape character none \
first line contains header contains data
last field in line must not be empty may be empty
whitespace following delimiter part of field not part of field
decimal separator . ,
thousands separator none . U+2009

Note that tab characters and newlines cannot be present in field content.

Parsing

Here is an example File:Example.zip file. Sorry, it is zipped. Importing such files can be done in many languages:

Python Standard Library

import csv
with open('example.tsv', 'rb') as csvfile:
    reader = csv.reader(csvfile, delimiter='\t', quoting=csv.QUOTE_NONE)
    for row in reader:
        print(', '.join(row))

or with header extraction

import csv
with open('example.tsv', 'rb') as csvfile:
    reader = csv.DictReader(csvfile, delimiter='\t', quoting=csv.QUOTE_NONE)
    print(', '.join(reader.fieldnames)) # print header
    for row in reader:
        print(', '.join([row[key] for key in reader.fieldnames]))

Note that when using Python 2 the field content will remain UTF-8 encoded (type=str). In Python3 strings will unicode (type=string).

Python Pandas

Pandas can interpret column type. You will have to store it separately or hardcode it.

import pandas as pd

d = pd.read_csv('example.tsv', delimiter='\t', skip_blank_lines=False, quoting=csv.QUOTE_NONE)

GNU R

d <- read.csv("example.tsv", head=TRUE, sep = "\t")