Variables/functions: full reference¶

This list can also be viewed in the terminal by running st command --help-vars

General properties of sequence records and input files¶


id	Record ID (in FASTA/FASTQ: everything before first space) return type: text
desc	Record description (everything after first space) return type: text
seq	Record sequence return type: text
upper_seq	Record sequence in uppercase letters return type: text
lower_seq	Record sequence in lowercase letters return type: text
seqhash seqhash(ignorecase)	Calculates a hash value from the sequence using the XXH3 algorithm. A hash is an integer number representing the sequence. In very rare cases, different sequences may lead to the same hash value. Using 'seqhash' instead of 'seq' speeds up de-replication ('unique' command) and requires less memory, at a very small risk of wrongly recognizing two different sequences as duplicates. The returned numbers can be negative. return type: number
seqhash_rev seqhash_rev(ignorecase)	The hash value of the reverse-complemented sequence return type: number
seqhash_both seqhash_both(ignorecase)	The sum of the hashes from the forward and reverse sequences. The result is always the same irrespective of the sequence orientation, which is useful when de-replicating sequences with potentially different orientations. [side note: to be precise it is a wrapping addition to prevent integer overflow] return type: number
seq_num seq_num(reset)	Sequence number (n-th sequence in the input), starting from 1. The numbering continues across all provided sequence files unless `reset` is `true`, in which case the numbering re-starts from 1 for each new sequence file. Note that the output order can vary with multithreaded processing. return type: number
seq_idx seq_idx(reset)	Sequence index, starting from 0. The index is incremented across all provided sequence files unless `reset` is `true`, in which case the index is reset to 0 at the start of each new sequence file. Note that the output order can vary with multithreaded processing. return type: number
path	Path to the current input file (or '-' if reading from STDIN) return type: text
filename	Name of the current input file with extension (or '-') return type: text
filestem	Name of the current input file without extension (or '-') return type: text
extension	Extension of the current input file (or '') return type: text
dirname	Name of the base directory of the current file (or '') return type: text
default_ext	Default file extension for the configured output format (e.g. 'fasta' or 'fastq') return type: text

Examples¶

Add the sequence number to the ID:

st set -i {id}_{seq_num}

>A_1
SEQUENCE
>B_2
SEQUENCE
>C_3
SEQUENCE
(...)

Count the number of records per file in the input:

st count -k path *.fasta

file1.fasta 1224818
file2.fasta 573
file3.fasta 99186
(...)

Remove records with duplicate sequences from the input:

st unique seq input.fasta

Remove duplicate records irrespective of the sequence orientation and whether letters are uppercase or lowercase:

st unique 'seqhash_both(true)' input.fasta
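
The orientation-independence of 'seqhash_both' follows from addition being commutative: the forward and reverse-complement hashes are summed, so either orientation yields the same total. The following Python sketch illustrates the idea only. It assumes BLAKE2b from the standard library as a stand-in for XXH3, uses unsigned 64-bit values (the values reported by the tool can be negative), and all function names are hypothetical:

```python
import hashlib

MASK64 = (1 << 64) - 1  # keep sums within 64 bits (wrapping addition)
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def revcomp(seq: str) -> str:
    """Reverse complement of a DNA sequence (unambiguous bases only)."""
    return seq.translate(COMPLEMENT)[::-1]


def seq_hash(seq: str, ignorecase: bool = False) -> int:
    """64-bit hash of the sequence. BLAKE2b is a stand-in for XXH3 here."""
    if ignorecase:
        seq = seq.upper()
    digest = hashlib.blake2b(seq.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "little")


def seq_hash_both(seq: str, ignorecase: bool = False) -> int:
    """Wrapping sum of the forward and reverse-complement hashes."""
    forward = seq_hash(seq, ignorecase)
    reverse = seq_hash(revcomp(seq), ignorecase)
    return (forward + reverse) & MASK64
```

With this sketch, `seq_hash_both("ACGGT")` and `seq_hash_both("ACCGT")` are equal, because each sequence is the reverse complement of the other.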

Sequence statistics¶


seqlen	Sequence length return type: number
ungapped_seqlen	Ungapped sequence length (without gap characters `-`) return type: number
gc	GC content as fraction (0-1) of total bases. Lowercase (=masked) letters or characters other than ACGTU are not taken into account. return type: number
gc_percent	GC content as percentage of total bases. Lowercase (=masked) letters or characters other than ACGTU are not taken into account. return type: number
charcount(characters)	Count the occurrences of one or more single characters, which are supplied as a string return type: number
exp_err	Total number of errors expected in the sequence, calculated from the quality scores as the sum of all error probabilities. For FASTQ, make sure to specify the correct format (--fmt) in case the scores are not in the Sanger/Illumina 1.8+ format. return type: number

Examples¶

List the GC content (in %) for every sequence:

st stat gc_percent input.fa

seq1    33.3333
seq2    47.2652
seq3    47.3684

Remove DNA sequences with more than 1% ambiguous bases:

st filter 'charcount("ACGT") / seqlen >= 0.99' input.fa
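
As an illustration of how 'exp_err' and 'gc' are defined above, here is a minimal Python sketch. It assumes Phred scores with an ASCII offset of 33 (Sanger/Illumina 1.8+) and interprets the GC description as counting only uppercase A, C, G, T and U in the denominator. The function names and the handling of sequences without countable bases are assumptions, not the tool's implementation:

```python
def expected_errors(quality: str, offset: int = 33) -> float:
    """Sum of per-base error probabilities, p = 10^(-Q/10)."""
    return sum(10 ** (-(ord(ch) - offset) / 10) for ch in quality)


def gc_fraction(seq: str) -> float:
    """GC fraction over uppercase A/C/G/T/U; other characters are ignored."""
    gc = sum(seq.count(base) for base in "GC")
    total = sum(seq.count(base) for base in "ACGTU")
    # Assumption: return NaN if no countable bases are present
    return gc / total if total else float("nan")
```

For example, a read of ten bases that all have quality 'I' (Q40) has an expected error count of 10 x 0.0001 = 0.001.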

Header attributes¶

Attributes stored in FASTA/FASTQ headers. The expected pattern is ' key=value', but other patterns can be specified with --attr-fmt.


attr(name)	Obtain an attribute of given name (must be present in all sequences) return type: text
opt_attr(name)	Obtain an attribute value, or 'undefined' if missing (=undefined in JavaScript expressions) return type: text
attr_del(name)	Obtain an attribute (must be present), simultaneously removing it from the header. return type: text
opt_attr_del(name)	Obtain an attribute (may be missing), simultaneously removing it from the header. return type: text
has_attr(name)	Returns `true` if the given attribute is present, otherwise returns `false`. Especially useful with the `filter` command; equivalent to the expression `opt_attr(name) != undefined`. return type: boolean

Examples¶

Count the number of sequences for each unique value of an 'abund' attribute in the FASTA headers (e.g. >id abund=3), which could be the number of duplicates obtained by the unique command (see st unique -V/--help-vars):

st count -k 'attr(abund)' seqs.fa

Summarize over an 'abund' attribute directly appended to the sequence ID like this >id;abund=3:

st count -k 'attr(abund)' --attr-fmt ';key=value' seqs.fa

Summarize over an attribute 'a', which may be 'undefined' (=missing) in some headers:

st count -k 'opt_attr(a)' seqs.fa

value1  6042
value2  1012
undefined   9566
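
To make the ' key=value' and ';key=value' patterns more concrete, the following Python sketch extracts an attribute value from a header line. It is a simplified illustration: the function name, the `delim` parameter and the rule that a value ends at the next whitespace or delimiter are assumptions, and the actual parsing done by the tool may differ:

```python
import re
from typing import Optional


def opt_attr(header: str, name: str, delim: str = " ") -> Optional[str]:
    """Return the value of `name` in a '<delim>key=value' pattern, or None."""
    # The value is assumed to end at the next whitespace or delimiter
    pattern = (
        re.escape(delim) + re.escape(name) + "=([^\\s" + re.escape(delim) + "]*)"
    )
    match = re.search(pattern, header)
    return match.group(1) if match else None
```

Here, `opt_attr("id abund=3", "abund")` and `opt_attr("id;abund=3", "abund", ";")` both return `"3"`, while a header without the attribute gives `None` (comparable to 'undefined' above).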

Access metadata from delimited text files¶

The following functions allow accessing associated metadata from plain delimited text files (optionally compressed, extension auto-recognized). Metadata files must always contain a column with the sequence ID (default: 1st column; change with --meta-idcol).

The column delimiter is guessed from the extension or can be specified with --meta-delim. .csv is interpreted as comma(,)-delimited, .tsv/.txt or other (unknown) extensions are assumed to be tab-delimited. The first line is implicitly assumed to contain column names if a non-numeric field name is requested, e.g. meta(fieldname). Use --meta-header to explicitly enable header lines even if column names are all numeric.

Multiple metadata files can be supplied (-m file1 -m file2 -m file3 ...) and are addressed via file-num (see function descriptions). For maximum performance, provide metadata records in the same order as sequence records. Note: Specify --dup-ids if the sequence input is expected to contain duplicate IDs (which is rather unusual). See the help page (-h/--help) for more information.


meta(column) meta(column, file_number)	Obtain a value from an associated delimited text file supplied with `-m` or `--meta`. Individual columns from entries with matching record IDs are selected by number (1, 2, 3, etc.) or by their name according to the column names in the first row. Missing entries are not allowed. Column names can be in 'single' or "double" quotes (but quoting is only required in JavaScript expressions). If there are multiple metadata files supplied with -m/--meta (`-m file1 -m file2 -m file3, ...`), the specific file can be referenced by supplying `<file-number>` (1, 2, 3, ...) as first argument, followed by the column number or name. This is not necessary if only a single file is supplied. return type: text
opt_meta(column) opt_meta(column, file_number)	Like `meta(...)`, but metadata entries can be missing, i.e. not every sequence record ID needs a matching metadata entry. Missing values will result in 'undefined' if written to the output (= undefined in JavaScript expressions). return type: text
has_meta has_meta(file_number)	Returns `true` if the given record has a metadata entry with the same ID in the given file. In case of multiple files, the file number must be supplied as an argument. return type: boolean

Examples¶

Add taxonomic lineages to the FASTA headers (after a space). The taxonomy is stored in a GZIP-compressed TSV file (column no. 2):

st set -m taxonomy.tsv.gz -d '{meta(2)}' input.fa > output.fa

>id1 k__Fungi,p__Ascomycota,c__Sordariomycetes,(...),s__Trichoderma_atroviride
SEQUENCE
>id2 k__Fungi,p__Ascomycota,c__Eurotiomycetes,(...),s__Penicillium_aurantiocandidum
SEQUENCE
(...)

Add metadata from an Excel-generated CSV file (semicolon delimiter) to sequence headers as attributes (-a/--attr):

st pass -m metadata.csv --meta-delim ';' -a 'info={meta("column name")}' input.fa > output.fa

>id1 info=some_value
SEQUENCE
>id2 info=other_value
SEQUENCE
(...)

Extract subsequences given a set of coordinates stored in a BED file (equivalent to bedtools getfasta):

st trim -m coordinates.bed -0 {meta(2)}..{meta(3)} input.fa > output.fa

Filter sequences by ID, retaining only those present in the given text file:

st filter -m selected_ids.txt 'has_meta()' input.fa > output.fa
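
Conceptually, the metadata functions look up the row whose ID column matches the current record ID and then pick a column by number or name. The Python sketch below mimics this with an in-memory dictionary. All names are hypothetical, only a single metadata file is handled, and the tool itself may read the file differently (the note on record order above suggests it does not simply load everything up front):

```python
import csv
import io


def load_meta(text, delimiter="\t", id_col=1, header=True):
    """Parse delimited text into (column names, {record ID: row})."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    names = rows[0] if header else None
    records = rows[1:] if header else rows
    return names, {row[id_col - 1]: row for row in records}


def has_meta(table, seq_id):
    """True if a metadata row exists for this record ID."""
    return seq_id in table


def opt_meta(names, table, seq_id, column):
    """Value of a column (1-based number or name), or None if the ID is missing."""
    row = table.get(seq_id)
    if row is None:
        return None
    index = column - 1 if isinstance(column, int) else names.index(column)
    return row[index]


def meta(names, table, seq_id, column):
    """Like opt_meta, but a missing entry is an error."""
    value = opt_meta(names, table, seq_id, column)
    if value is None:
        raise KeyError(f"no metadata entry for record '{seq_id}'")
    return value
```

Given a two-column table with the header `id` and `taxonomy`, `meta(names, table, "id1", 2)` and `meta(names, table, "id1", "taxonomy")` return the same value, while `opt_meta` returns `None` for an ID that is not listed.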

Expressions (JavaScript)¶

Expressions with variables, from simple mathematical operations to arbitrarily complex JavaScript code. Expressions are always enclosed in { curly brackets }. These brackets are optional for simple variables/functions in some cases, but mandatory for expressions. In addition, the 'filter' command takes an expression (without { brackets }).

Instead of JavaScript code, it is possible to refer to a source file using 'file:path.js'.

Returned value: For simple one-liner expressions, the value is directly used. More complex scripts with multiple statements (if/else, loops, etc.) explicitly require a return statement to return the value.

Examples¶

Calculate the number of ambiguous bases in a set of DNA sequences and add the result as an attribute (ambig=...) to the header:

st pass -a ambig='{seqlen - charcount("ACGT")}' seqs.fasta

>id1 ambig=3
TCNTTAWTAACCTGATTAN
>id2 ambig=0
GGAGGATCCGAGCG
(...)

Discard sequences with >1% ambiguous bases or sequences shorter than 100bp:

st filter 'charcount("ACGT") / seqlen >= 0.99 && seqlen >= 100' seqs.fasta

Distribute sequences into different files by a slightly complicated condition. Note that the 'return' statements are necessary here, since this is not a simple expression. With even longer code, consider using an extra script and supplying -o "outdir/{file:code.js}.fasta" instead:

st split -po "outdir/{ if (id.startsWith('some_prefix_')) { return 'file_1' } return 'file_2' }.fasta" input.fasta

There should be two files now (`ls file_*.fasta`):
file_1.fasta
file_2.fasta

Data conversion and transformation¶


num(expression)	Converts any expression or value to a decimal number. Missing (undefined/null) values are left as-is. return type: number
bin(expression) bin(expression, interval)	Groups a continuous numeric value into discrete bins with a given interval. The intervals are represented as '(start, end]', whereby start <= value < end; the intervals are thus open on the left as indicated by '(', and closed on the right, as indicated by ']'. If no interval is given, a default width of 1 is assumed. return type: text

Examples¶

Summarize by a numeric header attribute in the form '>id n=3':

st count -k 'num(attr("n"))' seqs.fa

Summarize the distribution of the GC content in a set of DNA sequences in 5% intervals:

st count -k 'bin(gc_percent, 5)' seqs.fa

(15, 20]    73
(20, 25]    3443
(25, 30]    14138
(30, 35]    34829
(35, 40]    20354
(40, 45]    12142
(45, 50]    14019
(50, 55]    968
(55, 60]    8
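
The binning itself can be sketched in a few lines of Python. This is an illustration under assumptions: the function name is hypothetical, the bin start is computed by rounding down to the nearest multiple of the interval (following the 'start <= value < end' rule quoted in the description), and the number formatting is chosen only to resemble the output above:

```python
import math


def bin_label(value: float, interval: float = 1) -> str:
    """Label of the bin containing `value`, written as '(start, end]'."""
    # Assumption: bins start at multiples of the interval
    start = math.floor(value / interval) * interval
    end = start + interval
    return f"({start:g}, {end:g}]"
```

A GC content of 33.3333 % with an interval of 5 falls into the bin labelled `(30, 35]`.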