filter
Keep/exclude sequences based on different properties with a mathematical (JavaScript) expression
Usage: st filter [OPTIONS] <EXPRESSION> [INPUT]...

Options:
  -h, --help  Print help

'Filter' command options:
  -d, --dropped <FILE>  Output file for sequences that were removed by
                        filtering. The format is auto-recognized from the
                        extension
  <EXPRESSION>          Filter expression
Examples
Removing sequences shorter than 100 bp:
Removing DNA sequences with more than 10% of ambiguous bases:
Quick and easy way to select certain sequences:
st filter "id == 'id1' " input.fasta > filtered.fasta
st filter "['id1', 'id2', 'id3'].includes(id)" input.fasta > filtered.fasta
Note: this may not be the most efficient way; for longer ID lists, consider using a text file with an ID list instead.
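As an illustration of the two criteria named in the first examples (minimum length and share of ambiguous bases), here is a minimal Python sketch. It is not seqtool code and does not show seqtool's expression syntax; the function names are hypothetical, and "ambiguous" is assumed to mean any character other than A, C, G or T.

```python
# Illustration only: plain Python, not seqtool code.
# Assumption: every character other than A, C, G, T counts as ambiguous.
UNAMBIGUOUS = set("ACGT")


def long_enough(seq: str, min_len: int = 100) -> bool:
    """Return True if the sequence has at least `min_len` characters."""
    return len(seq) >= min_len


def ambiguous_fraction(seq: str) -> float:
    """Return the fraction of characters that are not A, C, G or T."""
    if not seq:
        return 0.0
    n_ambiguous = sum(1 for base in seq.upper() if base not in UNAMBIGUOUS)
    return n_ambiguous / len(seq)


def few_ambiguous(seq: str, max_fraction: float = 0.1) -> bool:
    """Return True if at most `max_fraction` of the bases are ambiguous."""
    return ambiguous_fraction(seq) <= max_fraction
```

A sequence passes both filters when `long_enough(seq) and few_ambiguous(seq)` is true.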
Quality filtering
The exp_err statistics variable represents the total expected number of errors in a sequence, as provided by the quality scores. By default, the Sanger / Illumina 1.8+ format (with ASCII offset 33) is assumed. See here for more information.
This example removes sequences with more than one expected error. The output is the same as for fastq_filter of USEARCH or VSEARCH.
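To make the quantity behind exp_err concrete, the following Python sketch computes the expected number of errors from a FASTQ quality string. It uses the standard Phred relation P = 10^(-Q/10) with the ASCII offset of 33 mentioned above. The function names are hypothetical, and this is an illustration of the calculation, not seqtool's implementation.

```python
# Illustration only: plain Python, not seqtool code.
def expected_errors(quality: str, offset: int = 33) -> float:
    """Sum the per-base error probabilities encoded in a quality string.

    Each character encodes a Phred score Q = ord(char) - offset, and the
    error probability of that base is 10 ** (-Q / 10).
    """
    total = 0.0
    for char in quality:
        phred = ord(char) - offset
        total += 10 ** (-phred / 10)
    return total


def keep_read(quality: str, max_ee: float = 1.0) -> bool:
    """Keep a read if its expected number of errors does not exceed max_ee."""
    return expected_errors(quality) <= max_ee
```

For instance, the character `I` (ASCII 73) encodes Q40, so each such base contributes an error probability of 0.0001 to the sum.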
Normalization by sequence length is easily possible with a math formula that divides exp_err by the sequence length (corresponding to -fastq_maxee_rate in USEARCH).
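The length-normalized variant can be sketched the same way: divide the expected number of errors by the read length and compare the result against a per-base threshold. This is again an illustrative Python sketch with hypothetical function names, and the threshold of 0.01 is an arbitrary example value, not a recommendation from this documentation.

```python
# Illustration only: plain Python, not seqtool code.
def expected_error_rate(quality: str, offset: int = 33) -> float:
    """Return the expected errors per base for a FASTQ quality string."""
    if not quality:
        return 0.0
    total = sum(10 ** (-(ord(char) - offset) / 10) for char in quality)
    return total / len(quality)


def keep_read_by_rate(quality: str, max_rate: float = 0.01) -> bool:
    """Keep a read if its expected errors per base do not exceed max_rate.

    max_rate = 0.01 is an arbitrary example threshold.
    """
    return expected_error_rate(quality) <= max_rate
```

Unlike an absolute threshold, this criterion does not penalize long reads merely for containing more bases.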
More
This page lists examples with execution times compared to other tools.