FASTA/FASTQ header attributes
Attributes are key-value annotations stored in the FASTA/FASTQ definition line.
In a nutshell
- Adding attributes: -a/--attr key='{variables/functions...}' or -A/--attr-append (multiple possible), results in headers like this: >id description attribute=value
- Accessing attributes: attr(name), or opt_attr(name) if some records have missing/undefined attributes. To simultaneously delete the accessed value, use attr_del(name) or opt_attr_del(name)
- To change the default format of recognized and inserted attributes, use --attr-fmt or the ST_ATTR_FORMAT environment variable. Example: st count -k 'attr(abund)' --attr-fmt ';key=value'
See also variable reference and detailed description of command-line options
Adding attributes to headers
Attributes are added in any command by using the -a/--attr option:
Output:
The -a/--attr option can be used multiple times:
Attributes become useful when using variables/functions:
>id1 num=1 gc_content=54.3046357615894
SEQUENCE
>id2 num=2 gc_content=42.019867549668874
SEQUENCE
(...)
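To make the num and gc_content values above concrete, here is a minimal Python sketch of how such headers could be produced. It is not seqtool's implementation: the helper names gc_percent and annotate are hypothetical, and the GC definition used (share of G and C among all bases, as a percentage) is an assumption.

```python
def gc_percent(seq):
    """Percentage of G and C bases in seq (assumed definition)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def annotate(records):
    """Build header lines with num=... and gc_content=... attributes.

    records is a list of (id, sequence) tuples; num is the 1-based
    position of the record in the input.
    """
    headers = []
    for num, (seq_id, seq) in enumerate(records, start=1):
        headers.append(f">{seq_id} num={num} gc_content={gc_percent(seq)}")
    return headers


headers = annotate([("id1", "ATGC"), ("id2", "GGGC")])
for line in headers:
    print(line)
```

Each attribute is simply a space-separated key=value pair appended to the definition line, which is why several of them can be combined in one header.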
Performance optimization
In the standard workflow with -a/--attr, seqtool has to check whether an attribute with the same name is already present. To omit this check, use -A/--attr-append.
However, this carries the risk of duplicating an attribute name, in which case the newly appended attribute is ignored when accessed with attr(...) (see below).
The user thus needs to be sure that an attribute with the same name is not already present.
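The trade-off can be pictured with a small Python model. This is only a sketch under assumptions, not seqtool's code: set_attr, append_attr and get_attr are hypothetical helpers, set_attr assumes that the check leads to the existing value being replaced, and get_attr assumes that lookup returns the first matching attribute.

```python
import re


def get_attr(header, key):
    """Return the value of the first ' key=value' attribute, or None."""
    m = re.search(r" " + re.escape(key) + r"=(\S*)", header)
    return m.group(1) if m else None


def set_attr(header, key, value):
    """Model of -a/--attr: look for an existing attribute first (assumed to be replaced)."""
    pattern = r"( " + re.escape(key) + r"=)\S*"
    if re.search(pattern, header):
        return re.sub(pattern, lambda m: m.group(1) + value, header, count=1)
    return f"{header} {key}={value}"


def append_attr(header, key, value):
    """Model of -A/--attr-append: append blindly, without any check."""
    return f"{header} {key}={value}"


header = "id1 num=1"
checked = set_attr(header, "num", "7")
blind = append_attr(header, "num", "7")

print(checked)                 # id1 num=7
print(blind)                   # id1 num=1 num=7
print(get_attr(blind, "num"))  # 1 (the appended duplicate is not seen)
```

Skipping the search saves work per record, but as the last line shows, a duplicate name leaves the old value in front of the new one.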
Accessing attributes
Attributes in the sequence headers are accessed using the internal function attr(name) at any place where variables/functions can be used, that is:
In a multitude of commands: count, stat, sort, unique, filter, split, set, trim, mask, find, replace. Examples, assuming an attribute in the headers such as >id1 attribute=value1:
st sort 'attr(attribute)' seqs.fasta
st split seqs.fasta -o '{attr(attribute)}.fasta' # -> value1.fasta, value2.fasta, etc.
st find PRIMERSEQUENCE -a pos='{match_start}' seqs.fasta | # e.g. >id1 pos=2
st count -k 'attr(pos)'
When setting new attributes:
# seqs.fasta: >id1 key=value
st pass -a new_key='{attr(key)}_with_suffix' seqs.fasta
# output: >id1 key=value new_key=value_with_suffix
In delimited text output:
# seqs.fasta: >id1 key=value
st pass seqs.fasta --to-tsv 'id,attr(key)'
# id1 value
# id2 value2
# (...)
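For readers who want to see what the delimited output amounts to, the following Python sketch reproduces the id/attribute table from FASTA text. It is an illustration only: fasta_headers and id_attr_rows are hypothetical helpers, and the sketch assumes the default space-delimited key=value format.

```python
import re


def fasta_headers(text):
    """Yield header lines (without the leading '>') from FASTA text."""
    for line in text.splitlines():
        if line.startswith(">"):
            yield line[1:]


def id_attr_rows(text, key):
    """Return 'id<TAB>value' rows, failing if the attribute is absent."""
    rows = []
    for header in fasta_headers(text):
        seq_id = header.split(" ", 1)[0]
        m = re.search(r" " + re.escape(key) + r"=(\S*)", header)
        if m is None:
            raise KeyError(f"Attribute '{key}' not found in record '{seq_id}'")
        rows.append(f"{seq_id}\t{m.group(1)}")
    return rows


fasta = ">id1 key=value\nSEQUENCE\n>id2 key=value2\nSEQUENCE\n"
rows = id_attr_rows(fasta, "key")
print("\n".join(rows))
```

The record ID is taken as everything up to the first space, and the attribute value as everything after key= up to the next whitespace.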
Interacting with other software (different attribute formats)
Some other programs also use some form of key=value attributes in headers. For instance, USEARCH and VSEARCH indicate the size (number of sequences) of clusters like this:
In this case, the size attribute is appended to the sequence ID (without space) and preceded by a semicolon. In order to recognize the attribute, we need to set the format:
Extract cluster ids and sizes into a tab delimited output
Instead of writing --attr-fmt in every command, we can also define the format as an environment variable (assuming it does not change too often):
export ST_ATTR_FORMAT=";key=value"
st . --to-tsv 'id,attr(size)' clusters.fasta
# to override just once given headers like this: >id;size=5 another_attr=somevalue
st . --to-tsv 'id,attr(another_attr)' --attr-fmt ' key=value' clusters.fasta
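One way to think about the format string is as a template with a delimiter before the key and a separator between key and value. The Python sketch below turns such a template into a regular expression. It is a conceptual model with hypothetical helper names (attr_regex, get_attr), not a description of how seqtool parses headers, and it assumes that a value ends at whitespace or at the next delimiter.

```python
import re


def attr_regex(fmt, key):
    """Build a regex for key from a template like ' key=value' or ';key=value'."""
    prefix, rest = fmt.split("key", 1)      # delimiter before the key
    sep = rest.split("value", 1)[0]         # separator between key and value
    # Assumption: a value stops at whitespace or at the next delimiter
    stop = r"\s" + re.escape(prefix.strip())
    return re.compile(
        re.escape(prefix) + re.escape(key) + re.escape(sep) + "([^" + stop + "]*)"
    )


def get_attr(header, key, fmt=" key=value"):
    """Return the attribute value under the given format, or None."""
    m = attr_regex(fmt, key).search(header)
    return m.group(1) if m else None


header = "id;size=5 another_attr=somevalue"
print(get_attr(header, "size", ";key=value"))          # 5
print(get_attr(header, "another_attr", " key=value"))  # somevalue
print(get_attr(header, "size", " key=value"))          # None
```

The last call shows why the format has to be set: with the default space delimiter, the size attribute glued to the ID is simply not found.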
More complicated header annotations
Advanced patterns not following the simple key=value format can be parsed and converted to standard header annotations using the -r/--regex search feature of the find command.
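The idea of such a conversion can be sketched in Python with the standard re module. Everything specific here is hypothetical: the bracketed header layout, the pattern and the to_key_value helper are invented for illustration and do not correspond to any seqtool option or output.

```python
import re

# Hypothetical non-standard annotation: "id1 [len:150|sample:A]"
PATTERN = re.compile(r"\[len:(\d+)\|sample:(\w+)\]")


def to_key_value(header):
    """Rewrite the bracketed annotation as standard key=value attributes."""
    m = PATTERN.search(header)
    if m is None:
        return header  # leave headers without the annotation untouched
    # Remove the original annotation, then append the captured groups
    cleaned = (header[:m.start()] + header[m.end():]).strip()
    return f"{cleaned} len={m.group(1)} sample={m.group(2)}"


print(to_key_value("id1 [len:150|sample:A]"))  # id1 len=150 sample=A
```

The capture groups play the role of the pieces that would be extracted by a regex search and written back as attributes.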
Missing/undefined attributes
Attributes should normally not be missing. In the following seqs.fasta, the attribute a is missing in one record and undefined in another:
Attribute 'a' not found in record 'id2'. Use 'opt_attr()' if attributes may be missing in some records.
Set the correct attribute format with --attr-format.
Instead, use opt_attr() to avoid the error:
undefined is a special keyword that indicates missing data, so id3 is treated as missing, like id2.
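The difference between the strict and the optional lookup can be modelled in a few lines of Python. This is a sketch under assumptions rather than seqtool's behaviour in every detail: the functions only mimic the names attr and opt_attr, the example values are made up, and how the strict lookup treats the undefined keyword is not stated in the text, so only the missing case is modelled for it.

```python
import re


def _lookup(header, key):
    """Return the raw value of ' key=value' in header, or None if absent."""
    m = re.search(r" " + re.escape(key) + r"=(\S*)", header)
    return m.group(1) if m else None


def attr(header, key):
    """Strict lookup: fail if the attribute is absent."""
    value = _lookup(header, key)
    if value is None:
        record_id = header.split()[0]
        raise KeyError(f"Attribute '{key}' not found in record '{record_id}'")
    # Note: the handling of 'undefined' by the strict lookup is not modelled
    return value


def opt_attr(header, key):
    """Optional lookup: absent and 'undefined' both count as missing."""
    value = _lookup(header, key)
    return None if value in (None, "undefined") else value


for header in ["id1 a=1", "id2", "id3 a=undefined"]:
    print(header.split()[0], opt_attr(header, "a"))
```

With the optional lookup, id2 and id3 both yield a missing value instead of stopping the run with an error.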
The has_attr() function is useful for filtering or other checks:
Deleting attributes
Attributes can be deleted using attr_del() and opt_attr_del() if they only serve for transient message passing between commands. In this case, the intermediate output has pos=start:end annotations in the headers:
Alternatively, use the del command: