Line 1: |
Line 1: |
| + | == File Format == |
| Data can be saved in a lot of file formats. If there is no reason to do otherwise, we prefer delimited files with the options shown in bold. Alternative options are also shown. | | Data can be saved in a lot of file formats. If there is no reason to do otherwise, we prefer delimited files with the options shown in bold. Alternative options are also shown. |
| {| class="wikitable" | | {| class="wikitable" |
Line 16: |
Line 17: |
| | field delimiter after last field || '''no''' || yes | | | field delimiter after last field || '''no''' || yes |
| |- | | |- |
− | | Quoting character || '''None''' || " || ' | + | | quoting character || '''None''' || " || ' |
| |- | | |- |
− | | Escape character || '''None''' || \ | + | | escape qc by doubling || no || yes |
| |- | | |- |
− | | First line || '''Contains header''' || Contains data | + | | escape character || '''none''' || \ |
| |- | | |- |
− | | Last field in line || '''Must not be empty''' || May be empty | + | | first line || '''contains header''' || contains data |
| |- | | |- |
− | | Whitespace following delimiter || '''Part of field''' || Not part of field | + | | last field in line || '''must not be empty''' || may be empty |
| |- | | |- |
| + | | whitespace following delimiter || '''part of field''' || not part of field |
| + | |- |
| + | | decimal separator || '''.''' || , |
| + | |- |
| + | | thousands separator || '''none''' || . || ␣ || U+2009 |
| |} | | |} |
| + | Note that tab characters and newlines cannot be present in field content. |
| + | == Parsing == |
| + | Importing these files can be done in many languages: |
| + | === Python Standard Library=== |
| + | <nowiki> |
| + | import csv |
| + | with open('example.tsv', 'rb') as csvfile: |
| + | reader = csv.reader(csvfile, delimiter='\t', quoting=csv.QUOTE_NONE) |
| + | for row in reader: |
| + | print(', '.join(row)) |
| + | </nowiki> |
| + | or with header extraction |
| + | <nowiki> |
| + | import csv |
| + | with open('example.tsv', 'rb') as csvfile: |
| + | reader = csv.DictReader(csvfile, delimiter='\t', quoting=csv.QUOTE_NONE) |
| + | print(', '.join(reader.fieldnames)) # print header |
| + | for row in reader: |
| + | print(', '.join([row[key] for key in reader.fieldnames])) |
| + | </nowiki> |
| + | Note that when using Python 2 the field content will remain UTF-8 encoded (type=str). In Python3 strings will unicode (type=string). |
| + | |
| + | === Python Pandas === |
| + | Pandas can interpret column type. You will have to store it separately or hardcode it. |
| + | <nowiki> |
| + | import pandas as pd |
| + | |
| + | d = pd.read_csv('example.tsv', delimiter='\t', skip_blank_lines=False, quoting=csv.QUOTE_NONE) |
| + | </nowiki> |
| + | === GNU R === |
| + | <nowiki> |
| + | d <- read.csv("example.tsv", head=TRUE, sep = "\t") |
| + | </nowiki> |