filter¶

Keep/exclude sequences based on different properties with a mathematical (JavaScript) expression

Usage: st filter [OPTIONS] <EXPRESSION> [INPUT]...

Options:
  -h, --help  Print help

'Filter' command options:
  -d, --dropped <FILE>  Output file for sequences that were removed by
                        filtering. The format is auto-recognized from the
                        extension
  <EXPRESSION>          Filter expression

See this page for the options common to all commands.

Examples¶

Removing sequences shorter than 100 bp:

st filter "seqlen >= 100" input.fasta > filtered.fasta

Removing DNA sequences with more than 10% of ambiguous bases:

st filter "charcount(ACGT) / seqlen >= 0.9" input.fasta > filtered.fasta
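
The arithmetic behind this expression can be sketched in Python. This is only an illustration of the ratio being tested, not the tool's implementation; the function names are hypothetical, and the sketch assumes upper-case sequences (it makes no claim about how `charcount` treats lower-case letters).

```python
def unambiguous_fraction(seq: str, allowed: str = "ACGT") -> float:
    """Fraction of characters in `seq` that belong to `allowed`."""
    if not seq:
        # Choice made for this sketch: an empty sequence has no valid bases.
        return 0.0
    allowed_set = set(allowed)
    n_allowed = sum(1 for base in seq if base in allowed_set)
    return n_allowed / len(seq)


def keep_sequence(seq: str, min_fraction: float = 0.9) -> bool:
    """Mirror of the filter: keep if at least 90% of the bases are A, C, G or T."""
    return unambiguous_fraction(seq) >= min_fraction


# One ambiguous base out of ten: exactly at the threshold, so it is kept.
print(keep_sequence("ACGTACGTAN"))
# Six ambiguous bases out of ten: removed.
print(keep_sequence("ACGTNNNNNN"))
```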

Quick and easy way to select certain sequences:

st filter "id == 'id1' " input.fasta > filtered.fasta

st filter "['id1', 'id2', 'id3'].includes(id)" input.fasta > filtered.fasta

Note: this may not be the most efficient way. For many IDs, consider a text file with an ID list instead.

Quality filtering¶

The exp_err statistics variable represents the total expected number of errors in a sequence, as estimated from the quality scores. By default, the Sanger / Illumina 1.8+ format (with ASCII offset 33) is assumed. See here for more information.
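
As a rough illustration of what such a value represents, the expected number of errors is conventionally the sum of the per-base error probabilities, where a Phred score Q corresponds to an error probability of 10^(-Q/10). The Python sketch below follows that standard definition; the function name is hypothetical and this is not the tool's own code.

```python
def expected_errors(quality: str, offset: int = 33) -> float:
    """Sum of per-base error probabilities for one FASTQ quality string.

    `offset` is the ASCII offset of the encoding (33 for Sanger / Illumina 1.8+).
    """
    total = 0.0
    for ch in quality:
        q = ord(ch) - offset        # Phred quality score of this base
        total += 10 ** (-q / 10)    # error probability P = 10^(-Q/10)
    return total


# 'I' encodes Q40 (P = 0.0001), so 100 such bases give about 0.01 expected errors.
print(expected_errors("I" * 100))
# '+' encodes Q10 (P = 0.1) and '5' encodes Q20 (P = 0.01): about 0.11 in total.
print(expected_errors("+5"))
```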

This example removes sequences with more than one expected error, keeping those with at most one. The output is the same as for fastq_filter of USEARCH or VSEARCH.

st filter 'exp_err <= 1' input.fastq -o filtered.fasta

Dividing by the sequence length normalizes the value to an expected error rate per base (corresponding to -fastq_maxee_rate in USEARCH).

st filter 'exp_err / seqlen <= 0.002' input.fastq -o filtered.fasta
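
The length normalization can be sketched the same way: divide the expected errors by the read length and keep the read if the resulting per-base rate does not exceed the maximum. This is a minimal Python sketch under the standard Phred definition; the function names and the handling of empty reads are assumptions made for illustration only.

```python
def expected_errors(quality: str, offset: int = 33) -> float:
    """Sum of per-base error probabilities, P = 10^(-Q/10), for one read."""
    return sum(10 ** (-(ord(ch) - offset) / 10) for ch in quality)


def passes_max_error_rate(quality: str, max_rate: float = 0.002,
                          offset: int = 33) -> bool:
    """Keep a read if expected errors per base are at most `max_rate`."""
    if not quality:
        # Choice made for this sketch: an empty read has no expected errors.
        return True
    return expected_errors(quality, offset) / len(quality) <= max_rate


# Q40 at every position: rate of about 0.0001 per base, so the read is kept.
print(passes_max_error_rate("I" * 100))
# Q20 at every position: rate of about 0.01 per base, so the read is removed.
print(passes_max_error_rate("5" * 100))
```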

More¶

This page lists examples with execution times compared to other tools.