count¶
Count all records in the input (total or categorized by variables/functions)
The overall record count is returned for all input files collectively.
Optionally, grouping categories (text or numeric) can be specified using -k/--key. The tab-delimited output is sorted by the categories.
Usage: st count [OPTIONS] [INPUT]...
Options:
-h, --help Print help
'Count' command options:
-k, --key <KEY>
Count sequences for each unique value of the given category. Can be a
single variable/function such as 'filename', 'desc' or 'attr(name)',
or a composed key such as '{filename}_{meta(species)}'. The `-k/--key`
argument can be specified multiple times, in which case there will be
multiple category columns, one per key
-l, --category-limit <CATEGORY_LIMIT>
Maximum number of categories to count before aborting with an error.
This limit is a safety measure to prevent memory exhaustion. A very
large number of categories could unintentionally occur with a
continuous numeric key (e.g. `gc_percent`). These can be grouped into
regular intervals using `bin(<variable>, <interval>)` [default:
1000000]
Counting the overall record number¶
By default, the count command returns the overall number of records in all of the input (even if multiple files are provided):
Categorized counting¶
Print record counts per input file:
If the record count should be listed for each file separately, use the path or filename variable:
To print the sequence length distribution:
Multiple keys¶
It is possible to use multiple keys.
Consider an example similar to the primer finding example,
but in addition we also store the number of primer mismatches (edit distance)
in the diffs header attribute.
After trimming, we can visualize the mismatch distribution for each primer:
st find file:primers.fasta -a primer='{pattern_name}' -a end='{match_end}' -a diffs='{match_diffs}' sequences.fasta |
st trim -e '{attr(end)}:' > trimmed.fasta
st count -k 'attr(primer)' -k 'attr(diffs)' trimmed.fasta
primer1 0 249640
primer1 1 23831
primer1 2 2940
primer1 3 123
primer1 4 36
primer1 5 2
primer2 0 448703
primer2 1 60373
primer2 2 8996
primer2 3 691
primer2 4 34
primer2 5 7
primer2 6 1
undefined undefined 5029
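Since the output is tab-delimited, it can be post-processed with standard tools. The following is a minimal sketch, not part of seqtool: it takes a few rows copied from the output above and sums the counts per primer with awk. The variable names counts and totals are chosen only for this illustration.

```sh
# A few rows copied from the output above, tab-delimited: primer, diffs, count.
counts=$(printf '%s\t%s\t%s\n' \
  primer1 0 249640 \
  primer1 1 23831 \
  primer2 0 448703 \
  primer2 1 60373)

# Sum the count column (3rd field) for each primer (1st field).
totals=$(printf '%s\n' "$counts" |
  awk -F '\t' '{ sum[$1] += $3 } END { for (p in sum) print p "\t" sum[p] }' |
  LC_ALL=C sort)

echo "$totals"
```

In practice, the output of st count could be piped into the awk command directly instead of using the copied rows.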
Expressions as keys¶
Assuming that we need to trim both the forward and reverse primer from a FASTQ file, we might categorize by the sum of the forward and reverse mismatches using an expression.
# first, search and trim
st find file:f_primers.fasta sequences.fastq \
-a f_primer='{pattern_name}' -a f_end='{match_end}' -a f_diffs='{match_diffs}' |
st find --fq file:r_primers.fasta \
-a r_primer='{pattern_name}' -a r_start='{match_start}' -a r_diffs='{match_diffs}' |
st trim --fq -e '{attr(f_end)}:{attr(r_start)}' > trimmed.fastq
# then count
st count -k 'attr(f_primer)' -k 'attr(r_primer)' \
-k '{ num(attr("f_diffs")) + num(attr("r_diffs")) }' trimmed.fastq
f_primer1 r_primer1 0 3457490
f_primer1 r_primer1 1 491811
f_primer1 r_primer1 2 6374
f_primer1 r_primer1 3 420
f_primer1 r_primer1 4 10
(...)
A few important points:
⚠ JavaScript expressions always need to be enclosed in {curly braces}, while simple variables/functions only require this in some cases.
⚠ Within expressions, attribute names need to be in double or single quotes: attr("f_diffs").
⚠ The f_diffs and r_diffs attributes are numeric, but seqtool doesn't know that (see below), and the JavaScript expression would simply concatenate them as strings instead of adding the numbers up. Therefore, the num function is required for conversion to numeric.
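The difference between concatenating and adding can be reproduced outside of seqtool. The following shell sketch is only an analogy: it calls neither seqtool nor JavaScript, and the values 1 and 2 are made up for illustration.

```sh
# Hypothetical attribute values as they would be read from a header: plain text.
f_diffs="1"
r_diffs="2"

# Text handling: the two values are simply joined one after the other.
as_text="${f_diffs}${r_diffs}"

# Numeric handling: the values are interpreted as integers and added.
as_number=$((f_diffs + r_diffs))

echo "text: $as_text"
echo "numeric: $as_number"
```

The first result is the string 12, the second is the number 3. The latter corresponds to what the num conversion is meant to achieve in the expression above.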
Numeric keys¶
With numeric keys, it is possible to summarize over intervals using the bin(number, interval) function. Example summarizing the GC content:
(10, 15] 2
(15, 20] 9
(20, 25] 357
(25, 30] 1397
(30, 35] 3438
(35, 40] 2080
(40, 45] 1212
(45, 50] 1424
(50, 55] 81
The intervals (start,end] are open at the start and closed at the end, meaning that
start < value <= end.
Numbers stored as text¶
Header attributes attr(name) and values from an associated list meta(column) are always interpreted as text by default. Wrapping them in the num(...) function converts them to numbers, which makes sure that the categories are correctly sorted:
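Why the conversion matters for sorting can be illustrated with the standard sort utility. This is a sketch by analogy, not seqtool output, and the category values are made up.

```sh
# Made-up category values, one per line.
values='10
9
2
100'

# Text (byte-wise) ordering compares the values character by character.
text_sorted=$(printf '%s\n' "$values" | LC_ALL=C sort)

# Numeric ordering compares the values as numbers.
num_sorted=$(printf '%s\n' "$values" | LC_ALL=C sort -n)

echo "as text:"; echo "$text_sorted"
echo "as numbers:"; echo "$num_sorted"
```

Sorted as text, the order is 10, 100, 2, 9. Sorted numerically, it is 2, 9, 10, 100.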
More¶
This page lists examples with execution times compared to other tools.