unique
De-replicate by sequence and/or other properties, returning only unique records
The unique key can be 'seq' or any variable/function, expression, or
text containing them (see `st unique --help-vars`).
The order of the records is the same as in the input unless the memory limit
is exceeded, in which case temporary files are used and all remaining records
are sorted by the unique key. Use `-s/--sort` to always sort the output
by key.
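As a rough mental model of the behavior described above, the following Python sketch keeps the first record seen for each key and preserves input order unless sorting is requested. It is not seqtool's implementation: `Record` and `unique_records` are hypothetical names used only for illustration, and spilling to temporary files when the memory limit is exceeded is not modeled.

```python
from typing import Callable, Hashable, Iterable, List, NamedTuple


class Record(NamedTuple):
    """Hypothetical minimal sequence record (ID and sequence only)."""
    id: str
    seq: str


def unique_records(
    records: Iterable[Record],
    key: Callable[[Record], Hashable],
    sort: bool = False,
) -> List[Record]:
    """Keep the first record per key (illustrative sketch only).

    With sort=False the retained records stay in input order;
    with sort=True they are ordered by their key instead.
    """
    first_seen = {}
    for rec in records:
        # setdefault stores a record only if its key is new,
        # so later records with the same key are discarded.
        first_seen.setdefault(key(rec), rec)
    items = list(first_seen.items())  # dicts preserve insertion order
    if sort:
        items.sort(key=lambda kv: kv[0])
    return [rec for _, rec in items]


records = [
    Record("r1", "TTGA"),
    Record("r2", "ACGT"),
    Record("r3", "TTGA"),  # same sequence as r1, so it is dropped
]

in_input_order = unique_records(records, key=lambda r: r.seq)
sorted_by_key = unique_records(records, key=lambda r: r.seq, sort=True)
print([r.id for r in in_input_order])
print([r.id for r in sorted_by_key])
```

In this toy input, `r3` duplicates `r1`, so only `r1` and `r2` are retained; the `sort=True` call orders them by sequence rather than by position in the input.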
Usage: st unique [OPTIONS] <KEY> [INPUT]...
Options:
-h, --help Print help (see more with '--help')
'Unique' command options:
-s, --sort Sort the output by key. Without this option, the
records are in input order if the memory limit is
*not* exceeded, but are sorted by key otherwise
--map-out <MAP_OUT> Write a map of all duplicate sequence IDs to the
given file (or '-' for stdout). The (optional)
compression format is auto-recognized from the
extension. By default, a two-column mapping of
sequence ID -> unique reference record ID is
written (`long` format). More formats can be
selected with `--map-fmt`
--map-fmt <MAP_FMT> Column format for the duplicate map `--map-out`
(use `--help` for details) [default: long]
[possible values: long, long-star, wide,
wide-comma, wide-key]
-M, --max-mem <SIZE> Maximum amount of memory (approximate) to use for
de-duplicating. Either a plain number (bytes) or a
number with unit (K, M, G, T) based on powers of 2
[default: 5G]
--temp-dir <PATH> Path to temporary directory (only if memory limit
is exceeded)
--temp-file-limit <N> Maximum number of temporary files allowed [default:
1000]
<KEY> The key used to determine which records are
unique. The key can be a single variable/function
such as 'seq', a composed string such as
'{attr(a)}_{attr(b)}', or a comma-delimited list of
multiple variables/functions, whose values are all
taken into account, e.g. 'seq,num(attr(a))'. In
case of identical sequences, records are still
de-replicated by the header attribute 'a'. The
'num()' function turns text values into numbers,
which can speed up the de-replication. For each
key, the *first* encountered record is returned,
and all remaining ones with the same key are
discarded
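The comma-delimited key and the two-column `long` duplicate map described above can be pictured with a small Python sketch. This is an illustration under stated assumptions, not seqtool's code: `dereplicate_with_map` and the tuple-based record layout are hypothetical, `float()` merely stands in for the documented `num()` function, and listing the retained record itself in the map is an assumption about the `long` layout.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical record layout: (record ID, sequence, header attributes)
Rec = Tuple[str, str, Dict[str, str]]
KeyFunc = Callable[[str, Dict[str, str]], object]


def dereplicate_with_map(
    records: Sequence[Rec],
    key_funcs: Sequence[KeyFunc],
) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Return IDs of retained records and a two-column ID map (sketch)."""
    reference_id = {}  # composite key -> ID of the first record with it
    unique_ids = []
    long_map = []
    for rec_id, seq, attrs in records:
        # All key parts are taken into account together, as a tuple.
        key = tuple(f(seq, attrs) for f in key_funcs)
        if key not in reference_id:
            reference_id[key] = rec_id
            unique_ids.append(rec_id)
        # Assumption: every record, including the reference, is listed.
        long_map.append((rec_id, reference_id[key]))
    return unique_ids, long_map


# Rough analogue of the key 'seq,num(attr(a))'
key_funcs = [
    lambda seq, attrs: seq,
    lambda seq, attrs: float(attrs["a"]),  # stand-in for num()
]

records = [
    ("r1", "ACGT", {"a": "1"}),
    ("r2", "ACGT", {"a": "1.0"}),  # same sequence, same number as r1
    ("r3", "ACGT", {"a": "2"}),    # same sequence, different attribute
]

unique_ids, long_map = dereplicate_with_map(records, key_funcs)
print(unique_ids)
for rec_id, ref_id in long_map:
    print(f"{rec_id}\t{ref_id}")
```

Here `r3` survives despite sharing its sequence with `r1`, because the second key part differs, while `r2` is mapped to `r1`.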
More
This page lists examples with execution times compared to other tools.
Variables/functions provided by the 'unique' command
See also `st unique --help-vars`.
Examples
De-replicate sequences using the sequence hash (faster than using the sequence `seq` itself), and also store the number of duplicates (including the unique sequence itself) in the sequence header: