find¶
Search for pattern(s) in sequences or sequene headers for record filtering, pattern replacement or passing hits to next command
There are different search modes:
- Exact search
- Regular expressions (
-r/--regex
) - DNA or protein patterns with ambiguous letters
- Approximate matching up to a given edit distance
(
-D/--diffs
or-R/--max-diff-rate
)
Search results can be used in three different ways:
- Keeping (
-f/--filter
) or excluding (-e/--exclude
) matched sequences - Pattern replacement (
--rep
) with ambiguous/approximate matching (for exact/regex replacement, use the 'replace' command) - Passing the search results to the output in sequence
headers (
-a/--attr
) or TSV/CSV fields (--to-tsv/--to-csv
); seest find --help-vars
for all possible variables/ functions
Usage: st find [OPTIONS] <PATTERNS> [INPUT]...
Arguments:
<PATTERNS> Pattern string or 'file:<patterns.fasta>'
Options:
-h, --help Print help
Where to search (default: sequence):
-i, --id Search / replace in IDs instead of sequences
-d, --desc Search / replace in descriptions
Search options:
-D, --max-diffs <N> Return pattern matches up to a given maximum edit
distance of N differences (= substitutions,
insertions or deletions). Residues that go beyond the
sequence (partial matches) are always counted as
differences. [default: pefect match]
-R, --max-diff-rate <R> Return of matches up to a given maximum rate of
differences, that is the fraction of divergences
(edit distance = substitutions, insertions or
deletions) divided by the pattern length. If
searching a 20bp pattern at a difference rate of 0.2,
matches with up to 4 differences (see also
`-D/--max-diffs`) are returned. [default: pefect
match]
-r, --regex Interpret pattern(s) as regular expression(s). All
*non-overlapping* matches in are searched in headers
or sequences. The regex engine lacks some advanced
syntax features such as look-around and
backreferences (see https://docs.rs/regex). Capture
groups can be extracted by functions such as
`match_group(number)`, or `match_group(name)` if
named: `(?<name>)` (see also `st find --help-vars`)
--in-order Report hits in the order of their occurrence instead
of sorting by distance. Note that this option only
has an effect with `-D/--max-dist` > 0, otherwise
matches are always reported in the order of their
occurrence
-t, --threads <N> Number of threads to use for searching [default: 1]
--no-ambig Don't interpret DNA ambiguity (IUPAC) characters
--algo <NAME> Override decision of algorithm for testing
(regex/exact/myers/auto) [default: auto]
--gap-penalty <N> Gap penalty to use for selecting the the optimal
alignment among multiple alignments with the same
starting position and the same edit distance. The
default penalty of 2 selects hits that don't have too
InDels in the alignment. A penalty of 0 is not
recommended; due to how the algorithm works, the
alignment with the leftmost end position is chosen
among the candidates, which often shows deletions in
the pattern. Penalties >2 will further shift the
preference towards hits with substitutions instead of
InDels, but the selection is always done among hits
with the same (lowest) edit distance, so raising the
gap penalty will not help in trying to enfoce
ungapped alignments (there is currently no way to do
that) [default: 2]
Search range:
--rng <RANGE> Search within the given range ('start:end',
'start:' or ':end'). Using variables is not
possible
--max-shift-start <N> Consider only matches with a maximum of <N> letters
preceding the start of the match (relative to the
sequence start or the start of the range `--rng`)
--max-shift-end <N> Consider only matches with a maximum of <N> letters
following the end of the match (relative to the
sequence end or the end of the range `--rng`)
Search command actions:
-f, --filter Keep only matching sequences
-e, --exclude Exclude sequences that matched
--dropped <FILE> Output file for sequences that were removed by
filtering. The output format is (currently) the same as
for the main output, regardless of the file extension
--rep <BY> Replace by a string, which may also contain
{variables/functions}
Searching in headers¶
Specify -i/--id
to search in sequence IDs (everything before the first space)
or -d/--desc
to search in the description part (everything after the space).
Example: selectively return sequences that have label
in their description
(filtering with the -f/--filter
flag):
Note: use
--dropped <not_matched_out>
to write unmatched sequences to another file.
To match a certain pattern, use a regular expression (-r/--regex
).
The following example extracts Genbank accessions from sequence headers that follow
the old-style Genbank format:
You can use online tools such as https://regex101.com to build and debug your regular expression
Note: You could also replace the whole header with the accession using the replace command. This might be faster, but the original header will not be retained.
Searching in sequences¶
Without the -i
or -d
flag, the default mode is to search in the sequence.
The pattern type is automatically recognized and usually reported to avoid
problems:
Note: the sequence type of the pattern was determined as 'dna' (with ambiguous letters). If incorrect, please provide the correct type with `--seqtype`. Use `-q/--quiet` to suppress this message.
R
stands for A
or G
. Seqtool recognizes the IUPAC ambiguity codes for
DNA/RNA and
proteins
(with the exception of U = Selenocysteine).
⚠ Matching is asymmetric: R
in a search pattern matches [A
, G
, R
]
in sequences, but R
in a sequence will only match ambiguities sharing the same
set of bases (R
, V
, D
, N
) in the pattern. This should prevent false
positive matches in sequences with many ambiguous characters.
Approximate matching¶
Seqtool can find patterns with mismatches or insertions/deletions
(up to a given edit distance)
using the -D/--diffs
argument. Alternatively, use -R/--diff-rate
to
specify a distance limit relative to the length of the pattern
(in other words, an "error rate").
In this example, the edit distance and range of the best match are saved
into header attributes (or undefined
if not found):
In case of multiple hits, the second best hit can be returned by using
{match_diffs(2)}
or {match_range(2)}
, etc.
Use --in-order
to report hits in order from left to right instead.
Note: Approximative matching is done using Myers bit-parallel algorithm, which is very fast with short patterns and reasonably short sequences. It may not be the fastest solution if searching in large genomes.
Recognizing adapter or primers should be very fast. Further speedups can be achieved by multithreading (
-t
) and restricting the search range (--rng
).Note 2: To report all hits below the given distance threshold in order of occurrence instead of decreasing distance, specify
--in-order
.
Multiple patterns¶
The find command supports searching for several patterns at once.
They have to be supplied in a separate FASTA file (file:path
).
The best matching pattern with the smallest edit distance is always reported first.
The following example de-multiplexes sequences amplified with different forward primers and then uses trim to remove the primers, and finally distributes the sequences into different files named by the forward primer (split).
primers.fasta |
---|
st find file:primers.fasta -a primer='{pattern_name}' -a end='{match_end}' sequences.fasta |
st trim -e '{attr(end)}:' |
st split -o '{attr(primer)}'
prA.fasta | prB.fasta | undefined.fasta |
---|---|---|
> Note: no primer, sequence not trimmed since end=undefined (see ranges).
|
Selecting other hits¶
The find command is very versatile thanks to the large number of variables/functions that provide information about the search results (see variable reference).
For instance, the best hit from the second best matching pattern can be selected using
{match_range(1, 2)}
.
It is also possible to return a comma-delimited list of matches, e.g.:
{match_range(all)}
. See the mask command for an example on how this could be useful.
Replacing matches¶
Hits can be replaced by other text (--repl
). Variables are allowed
as well (in contrast to the replace command). Backreferences to regex groups
(e.g. $1
) are not supported like the replace command does.
Instead, they can be accessed using variables (match_group()
, etc.)
More¶
This page lists more examples with execution times and comparisons with other tools.
Variables/functions provided by the 'find' command¶
see also
st find --help-vars
The find command provides many variables/functions to obtain information about the pattern matches. These are either written to header attributes (-a/--attr
) or CSV/TSV fields (e.g. --to-tsv ...
). See also examples section below.
Examples¶
Find a primer sequence with up to 2 mismatches (-d/--dist
) and write the match range and the mismatches ('dist') to the header as attributes. The result will be 'undefined' (=undefined in JavaScript) if there are > 2 mismatches:
>id1 rng=2:21 dist=1
SEQUENCE
>id2 rng=1:20 dist=0
SEQUENCE
>id3 rng=undefined dist=undefined
SEQUENCE
(...)