
Metadata from delimited files

In all seqtool commands, it is possible to integrate external metadata from delimited text files created manually or using another program.

Files are specified with the -m/--meta option. Their contents are accessed with the functions meta(column), opt_meta(column) (which tolerates missing data) or has_meta(column) (which checks whether the metadata is present). The column argument is either the number or the header name of the column.

See also the variable reference and the detailed description of command-line options.

By default, files are assumed to be tab-delimited, and the first column should contain the ID. However, this can be changed with --meta-delim and --id-col.
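As an illustration of these two options, the sketch below builds a small comma-delimited file whose ID sits in the second column and passes it to st set. The file names and sample records are made up, and it is an assumption that --meta-delim takes the delimiter character and --id-col a 1-based column number; check the command-line help for the exact syntax.

```shell
# Hypothetical sample data: comma-delimited, with the ID in the second column.
printf 'genus,id\nActinomyces,seq1\nAmycolatopsis,seq2\n' > genus.csv
printf '>seq1\nACGT\n>seq2\nTTGA\n' > input.fasta

# Run the command only if seqtool's `st` binary is installed.
# Assumption: --meta-delim takes the delimiter character,
# and --id-col takes a 1-based column number.
if command -v st >/dev/null 2>&1; then
  st set -m genus.csv --meta-delim ',' --id-col 2 \
    -d '{meta(genus)}' input.fasta > with_genus.fasta
fi
```

The genus column is still addressed by its header name; only the delimiter and the position of the ID column differ from the defaults.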

Examples

Consider this list containing taxonomic information about sequences (genus.tsv):

id  genus
seq1  Actinomyces
seq2  Amycolatopsis
(...)

The genus name can be added to the FASTA header using this command:

st set --meta genus.tsv --desc '{meta(genus)}' input.fasta > with_genus.fasta
# short:
st set -m genus.tsv -d '{meta(genus)}' input.fasta > with_genus.fasta
>seq1 Actinomyces
SEQUENCE
>seq2 Amycolatopsis
SEQUENCE
(...)

If any sequence ID is not found in the metadata, the command fails with an error. If missing data is expected, use opt_meta instead; missing entries are then reported as undefined:

st set -m genus.tsv --desc '{opt_meta(genus)}' input.fasta > with_genus.fasta
>seq1 Actinomyces
SEQUENCE
>seq2 Amycolatopsis
SEQUENCE
>seq3 undefined
SEQUENCE
(...)
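To decide between meta and opt_meta before running a command, it can help to list the sequence IDs that have no metadata entry. The sketch below does this with standard Unix tools only, not with seqtool. It assumes a tab-delimited file with a header line and the ID in the first column, and it takes the FASTA ID to be the header text up to the first whitespace; the sample files are made up.

```shell
# Hypothetical sample data: seq3 has no entry in genus.tsv.
printf 'id\tgenus\nseq1\tActinomyces\nseq2\tAmycolatopsis\n' > genus.tsv
printf '>seq1\nACGT\n>seq2\nTTGA\n>seq3\nGGCC\n' > input.fasta

# IDs in the FASTA file: header line without '>' and without the description.
grep '^>' input.fasta | sed 's/^>//; s/[[:space:]].*//' | sort -u > fasta_ids.txt

# IDs in the metadata file: first column, skipping the header line.
tail -n +2 genus.tsv | cut -f 1 | sort -u > meta_ids.txt

# Lines only in the first file: IDs present in the FASTA but absent from the metadata.
comm -23 fasta_ids.txt meta_ids.txt > missing_ids.txt
cat missing_ids.txt
```

With this sample data, missing_ids.txt contains only seq3. An empty file means that every sequence has a metadata entry.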

Filtering by ID

Sometimes it is necessary to select all sequence records whose IDs appear in a list of sequence IDs. This can be achieved with the following command:

st filter -m id_list.txt 'has_meta()' seqs.fasta > in_list.fasta

Multiple metadata sources

Several sources can be used in the same command by repeating the option, as in -m file1 -m file2 -m file3...:

st filter -m source1.txt -m source2.txt 'meta("column", 1) == "value" && has_meta(2)' seqs.fasta > in_list.fasta

Sources are referenced using meta(column, file_number) or has_meta(file_number); see also the variable reference.
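A more concrete version of the command above might look like the following sketch. The file names, the sample records and the genus value are made up, and it is an assumption that file numbers start at 1 and follow the order of the -m options.

```shell
# Hypothetical sample data: a genus table and a separate list of IDs to keep.
printf 'id\tgenus\nseq1\tActinomyces\nseq2\tAmycolatopsis\nseq3\tActinomyces\n' > genus.tsv
printf 'seq1\nseq2\n' > keep_ids.txt
printf '>seq1\nACGT\n>seq2\nTTGA\n>seq3\nGGCC\n' > seqs.fasta

# Run the command only if seqtool's `st` binary is installed.
# Assumption: source 1 is genus.tsv and source 2 is keep_ids.txt,
# i.e. sources are numbered in the order in which they are given.
if command -v st >/dev/null 2>&1; then
  st filter -m genus.tsv -m keep_ids.txt \
    'meta("genus", 1) == "Actinomyces" && has_meta(2)' \
    seqs.fasta > selected.fasta
fi
```

Every sequence ID is listed in genus.tsv here, because meta fails on IDs that are missing from the file it refers to.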