Sequence formats and compression
All commands accept input in different sequence formats, optionally compressed, and can write to a different sequence and compression format. The input and output formats are inferred automatically from the file extensions.
Note: Currently, formats are recognized from the file extension only; there is no auto-recognition based on the file content.
The following pass command reads a GZIP compressed FASTQ file and converts it to uncompressed FASTA.

    st pass input.fastq.gz -o output.fasta
    # or the equivalent shorthand:
    st . input.fastq.gz -o output.fasta

If receiving from STDIN or writing to STDOUT, the format has to be specified unless it is FASTA (which is the default):
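
A minimal sketch of such a pipe. It assumes that the format string `fastq.gz` follows the same `<format>.<compression>` pattern as the FASTA strings listed under "Compression formats"; the sample record and the file names are made up for illustration.

```shell
# Create a tiny GZIP-compressed FASTQ file (illustrative sample data).
printf '@id1\nACGT\n+\nIIII\n' | gzip > input.fastq.gz

# STDIN carries no file extension, so the input format is given with --fmt;
# STDOUT has none either, so the output format is given with --to.
cat input.fastq.gz | st pass --fmt fastq.gz --to fasta > output.fasta
```
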
The output format is always assumed to be the same as the input format unless specified otherwise using `--to <format>` or `-o <path>.<extension>`.
Shorthand notations such as `--to-fa` also exist (see the shortcut columns in the table below).
Recognized formats
The following extensions and format strings are auto-recognized:

sequence format | recognized extensions | format string | shortcut (in) | shortcut (out) |
---|---|---|---|---|
FASTA | `.fasta`, `.fa`, `.fna`, `.fsa` | `fasta`, `fa` | `--fa` | `--to-fa` |
FASTQ | `.fastq`, `.fq` | `fastq`, `fq`, `fq-illumina`, `fq-solexa` | `--fq` | `--to-fq` |
CSV (`,` delimited) | `.csv` | `csv` | `--csv FIELDS` | `--to-csv FIELDS` |
TSV (tab delimited) | `.tsv` | `tsv` | `--tsv FIELDS` | `--to-tsv FIELDS` |

Note: Multiline FASTA is parsed and written (`--wrap`), but only single-line FASTQ is parsed and written.
Besides FASTQ, quality scores can also be parsed from / written to 454 (Roche) style QUAL files using `--qual <file>` and `--to-qual <file>`.
Compression formats
No shortcuts are available for compression formats, therefore always use the long form: `--fmt <input_format>` / `--to <output_format>`.

format | recognized extensions | format string (FASTA) |
---|---|---|
GZIP | `.gzip`, `.gz` | `fasta.gz` |
BZIP2 | `.bzip2`, `.bz2` | `fasta.bz2` |
LZ4 | `.lz4` | `fasta.lz4` |
ZSTD | `.zst` | `fasta.zst` |

Delimited text (CSV, TSV, ...)
Comma / tab / ... delimited input and output can be configured by providing the `--fields` / `--outfields` argument, or directly with the shortcuts `--csv` / `--to-csv` or `--tsv` / `--to-tsv`, which take the field list as their argument. The delimiter is configured with `--delim <delim>`.
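
The long form and its shortcut can be sketched as follows. The field names `id` and `seq` are the ones used in the `ST_FORMAT` example further down; the input file is made-up sample data, and the two commands are expected (not verified here) to write identical output.

```shell
# Two FASTA records as illustrative sample input.
printf '>id1\nACGT\n>id2\nTTGA\n' > input.fasta

# Long form: output format via --to, output fields via --outfields.
st pass input.fasta --to tsv --outfields id,seq > long_form.tsv

# Shortcut: format and field list given in a single option.
st pass input.fasta --to-tsv id,seq > shortcut.tsv
```
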
Variables/functions can also be included in the field list.
Setting default format via environment variable
The `ST_FORMAT` environment variable can be used to set a default format other than FASTA. This is especially useful when connecting many commands via pipes, since it saves specifying `--fq` / `--tsv <fields>` / ... repeatedly.
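
A sketch of this usage with FASTQ as the default. The format string `fastq` comes from the table of recognized formats; the file names and the sample record are assumptions made for illustration.

```shell
# One FASTQ record as illustrative sample input.
printf '@id1\nACGTACGT\n+\nIIIIIIII\n' > input.fastq

# Make FASTQ the default for STDIN/STDOUT in all following st commands.
export ST_FORMAT=fastq

# No --fq is needed on either side of the pipe.
st trim ':4' input.fastq | st revcomp > trimmed_revcomp.fastq
```
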
For delimited files (CSV or TSV), the input fields can additionally be configured after a colon (`:`):

    export ST_FORMAT=tsv:id,seq
    ## Input file:
    # id1 ACGT...
    # id2 ACGT...
    # ...
    st trim ':4' input.txt | st revcomp > trimmed_revcomp.txt
    ## Output:
    # id1 ACGT...
    # id2 ACGT...
    # ...

Quality scores
Quality scores can be read from several sources. FASTQ files are assumed to be in the Sanger/Illumina 1.8+ format (ASCII offset of 33). Older formats (Illumina 1.3+ and Solexa) with an offset of 64 can be read and written using `--fmt` / `--to` with `fq-illumina` or `fq-solexa`. These formats cannot be recognized automatically without ambiguity, so they have to be specified explicitly. Invalid characters generate an error during conversion.
Note: If no conversion is done (e.g. both input and output in Sanger/Illumina 1.8+ format), scores are not automatically checked for errors.
Quality scores can be visualized using the view command.
The following example converts a legacy Illumina 1.3+ file to the Sanger/Illumina 1.8+ format:
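
A sketch of such a conversion, built only from the format strings named above (`fq-illumina` for the input, `fastq` for the Sanger/Illumina 1.8+ output). The file names and the sample record are assumptions; the quality character `h` (ASCII 104) encodes a score of 40 at offset 64, which corresponds to `I` (ASCII 73) at offset 33.

```shell
# One legacy Illumina 1.3+ record (offset 64) as illustrative sample input.
printf '@id1\nACGT\n+\nhhhh\n' > illumina_1_3.fastq

# The file extension cannot reveal the legacy encoding, so it is stated with --fmt.
st pass --fmt fq-illumina --to fastq illumina_1_3.fastq > sanger.fastq
```
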
The `exp_err` variable uses the quality scores to calculate the total number of expected sequencing errors (see filter command). To calculate the value of `exp_err` correctly, it is vital that the format is correctly specified.
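
To illustrate why the offset matters, the following sketch computes an expected error count for a single quality string. It is not the tool's implementation: it assumes the standard Phred relation, in which a score Q corresponds to an error probability of 10^(-Q/10), and sums these probabilities over all bases. `expected_errors` is a hypothetical helper name, and Solexa scores (which use a different mapping) are not handled.

```shell
# Hypothetical helper: sum of per-base error probabilities for one quality string.
# $1: quality string, $2: ASCII offset (33 or 64)
expected_errors() {
  QUAL="$1" OFFSET="$2" awk 'BEGIN {
    qual = ENVIRON["QUAL"]; offset = ENVIRON["OFFSET"]
    # Lookup table from each printable ASCII character to its code.
    for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i
    total = 0
    for (i = 1; i <= length(qual); i++) {
      q = ord[substr(qual, i, 1)] - offset   # Phred score of this base
      total += 10 ^ (-q / 10)                # error probability of this base
    }
    printf "%.4f\n", total
  }'
}

expected_errors 'IIII' 33   # read as offset 33 (Q40 per base): 0.0004
expected_errors 'IIII' 64   # same string misread as offset 64 (Q9 per base): 0.5036
```

The same four characters yield an estimate that differs by more than three orders of magnitude depending on the assumed offset.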