Changes

Jump to navigation Jump to search
1,313 bytes added ,  17:32, 13 February 2017
no edit summary
Line 1: Line 1:  +
== File Format ==
 
Data can be saved in a lot of file formats. If there is no reason to do otherwise, we prefer delimited files with the options shown in bold. Alternative options are also shown.
 
Data can be saved in a lot of file formats. If there is no reason to do otherwise, we prefer delimited files with the options shown in bold. Alternative options are also shown.
 
{| class="wikitable"
 
{| class="wikitable"
Line 16: Line 17:  
| field delimiter after last field || '''no''' || yes  
 
| field delimiter after last field || '''no''' || yes  
 
|-
 
|-
| Quoting character || '''None''' || " || '
+
| quoting character || '''None''' || " || '
 
|-
 
|-
| Escape character || '''None''' || \
+
| escape qc by doubling || no || yes
 
|-
 
|-
| First line || '''Contains header''' || Contains data
+
| escape character || '''none''' || \
 
|-
 
|-
| Last field in line || '''Must not be empty''' || May be empty
+
| first line || '''contains header''' || contains data
 
|-
 
|-
| Whitespace following delimiter || '''Part of field''' || Not part of field
+
| last field in line || '''must not be empty''' || may be empty
 
|-
 
|-
 +
| whitespace following delimiter || '''part of field''' || not part of field
 +
|-
 +
| decimal separator || '''.''' || ,
 +
|-
 +
| thousands separator || '''none''' || . || ␣ || U+2009
 
|}
 
|}
 +
Note that tab characters and newlines cannot be present in field content.
 +
== Parsing ==
 +
Importing these files can be done in many languages:
 +
=== Python Standard Library===
 +
<nowiki>
 +
import csv
 +
with open('example.tsv', 'rb') as csvfile:
 +
    reader = csv.reader(csvfile, delimiter='\t', quoting=csv.QUOTE_NONE)
 +
    for row in reader:
 +
        print(', '.join(row))
 +
</nowiki>
 +
or with header extraction
 +
<nowiki>
 +
import csv
 +
with open('example.tsv', 'rb') as csvfile:
 +
    reader = csv.DictReader(csvfile, delimiter='\t', quoting=csv.QUOTE_NONE)
 +
    print(', '.join(reader.fieldnames)) # print header
 +
    for row in reader:
 +
        print(', '.join([row[key] for key in reader.fieldnames]))
 +
</nowiki>
 +
Note that when using Python 2 the field content will remain UTF-8 encoded (type=str). In Python3 strings will unicode (type=string).
 +
 +
=== Python Pandas ===
 +
Pandas can interpret column type. You will have to store it separately or hardcode it.
 +
<nowiki>
 +
import pandas as pd
 +
 +
d = pd.read_csv('example.tsv', delimiter='\t', skip_blank_lines=False, quoting=csv.QUOTE_NONE)
 +
</nowiki>
 +
=== GNU R ===
 +
<nowiki>
 +
d <- read.csv("example.tsv", head=TRUE, sep = "\t")
 +
</nowiki>

Navigation menu