Seqtool is a fast and flexible command line program for dealing with
large amounts of biological sequences.
It provides different subcommands for converting, inspecting
and modifying sequences.
The standalone binary (5-7 MB) is simply named st
to save some typing.
Note: this page describes the development version 0.4-beta. The older stable version (v0.3.0) is documented here.
Downloads
📥 download beta release (v0.4.0-beta.3)
This release should be safe to use despite the considerable refactoring, but approximate matching (find command) has not yet been fully tested.
📥 download stable release (v0.3.0)
⚠ Note: v0.3.0 currently has a few unfixed bugs that affect reading GZIP files and searching/replacing; see the CHANGELOG for v0.4.0-beta.
Feature overview
File formats
Reads and writes FASTA, FASTQ and CSV/TSV, optionally compressed with GZIP, BZIP2, or the faster and more modern Zstandard or LZ4 formats
Example: compressed FASTQ to FASTA
Combine multiple compressed FASTQ files, converting them to FASTA, using pass.
Note: almost every command can read multiple input files and convert between formats,
but pass does nothing other than reading and writing, while other commands perform specific actions.
Example: FASTA to tab-separated list
Aside from ID and sequence, any variable/function such as the sequence length (seqlen) can be written to delimited text.
Highly versatile thanks to variables/functions
See also variables/functions for more details.
Example: count sequences in a large set of FASTQ files
data/sample1.fastq.gz 30601
data/sample2.fastq.gz 15702
data/sample3.fastq.gz 264965
data/sample4.fastq.gz 1120
data/sample5.fastq.gz 7021
(...)
In count, one or several categorical variables/functions can be specified with -k/--key.
Example: summarize the GC content in 10% intervals
The function bin(variable, interval) groups continuous numeric values into intervals.
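To make the grouping concrete, here is a minimal Python sketch of interval binning. It is an illustration only, not seqtool's implementation: the helper names `bin_interval` and `summarize` are hypothetical, the GC values are made up, and the half-open `[lower, upper)` convention is an assumption, so seqtool's boundary handling and output labels may differ.

```python
import math
from collections import Counter


def bin_interval(value, interval):
    """Return the (lower, upper) bounds of the bin containing `value`.

    Assumption: bins are half-open, [lower, upper), and start at
    multiples of `interval`.
    """
    lower = math.floor(value / interval) * interval
    return (lower, lower + interval)


def summarize(values, interval):
    """Count how many values fall into each bin."""
    return Counter(bin_interval(v, interval) for v in values)


# Hypothetical GC percentages of five sequences
gc_percent = [12.5, 18.0, 41.3, 47.9, 50.0]
counts = summarize(gc_percent, 10)

# Print one tab-separated line per occupied bin
for (lower, upper), n in sorted(counts.items()):
    print(f"[{lower}, {upper})\t{n}")
```

Using the bin bounds as a dictionary key mirrors how a categorical key groups records in a count: every value in the same interval lands in the same category.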
Example: Assign new sequence IDs
Example: De-replicate by description and sequence
seqs.fasta with a 'group' annotation in the header:
Expressions
From simple math to complicated filter expressions, the tiny integrated JavaScript engine (QuickJS) offers countless possibilities for customized sequence processing.
Example: filter FASTQ sequences by quality and length
This filter command removes sequencing reads with more than one expected sequencing error (like USEARCH can do) or a sequence length below 100 bp.
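The expected number of errors in a read is commonly defined as the sum of the per-base error probabilities implied by the Phred quality scores, p = 10^(-Q/10). The Python sketch below illustrates the filter criterion under that definition. It is not seqtool's code: the function names are hypothetical, the reads are synthetic, and a Phred+33 quality encoding is assumed.

```python
def expected_errors(quality, offset=33):
    """Sum of per-base error probabilities, p = 10^(-Q/10).

    `quality` is the FASTQ quality line; offset=33 assumes Phred+33.
    """
    return sum(10 ** (-(ord(ch) - offset) / 10) for ch in quality)


def keep_read(seq, quality, max_errors=1.0, min_len=100):
    """Keep reads with at most `max_errors` expected errors
    and a length of at least `min_len` bases."""
    return len(seq) >= min_len and expected_errors(quality) <= max_errors


# 'I' encodes Q40 (p = 0.0001); '5' encodes Q20 (p = 0.01)
good = ("A" * 150, "I" * 150)    # about 0.015 expected errors
noisy = ("A" * 150, "5" * 150)   # about 1.5 expected errors
short = ("A" * 50, "I" * 50)     # high quality, but below 100 bp

for name, read in [("good", good), ("noisy", noisy), ("short", short)]:
    print(name, keep_read(*read))
```

Only the first read passes: the second exceeds one expected error and the third is shorter than 100 bp.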
Header attributes for metadata storage
key=value header attributes allow storing and passing on all kinds of information.
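As a rough illustration of the idea, the following Python sketch separates key=value attributes from the rest of a header line. It assumes that attributes are whitespace-separated `key=value` tokens following the sequence ID; the function name `parse_header` and the example header are hypothetical, and seqtool's actual attribute syntax and parsing rules may differ.

```python
def parse_header(header):
    """Split a FASTA/FASTQ header into (id, description, attributes).

    Assumption: attributes are whitespace-separated key=value tokens
    after the ID; all other tokens are treated as description text.
    """
    parts = header.lstrip(">@").split()
    seq_id, rest = parts[0], parts[1:]
    attrs, desc = {}, []
    for token in rest:
        key, sep, value = token.partition("=")
        if sep and key:
            attrs[key] = value
        else:
            desc.append(token)
    return seq_id, " ".join(desc), attrs


# Hypothetical header carrying two attributes
seq_id, desc, attrs = parse_header(">seq1 some description sample=A abund=12")
print(seq_id)
print(desc)
print(attrs)
```

Because the attributes travel with the record itself, a later step in a pipeline can read values that an earlier step stored.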
Example: De-replicate by sequence (seq variable) and/or other properties
The unique command returns all unique sequences and annotates the number of records with the same sequence in the header:
It is also possible to de-replicate by multiple keys, e.g. by sequence, but grouped by a sample attribute in the header:
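The logic of de-replicating by one or several keys can be sketched in Python as follows. This is a simplified illustration rather than seqtool's algorithm: the function `dereplicate`, the record layout `(id, sequence, attributes)` and the example records are all hypothetical.

```python
def dereplicate(records, key=lambda rec: rec[1]):
    """Keep the first record per key and count records sharing that key.

    `records` is an iterable of (id, sequence, attrs) tuples; by default
    the key is the sequence alone.
    """
    groups = {}  # key -> [first record, count]; dicts keep insertion order
    for rec in records:
        k = key(rec)
        if k in groups:
            groups[k][1] += 1
        else:
            groups[k] = [rec, 1]
    return [(rec, n) for rec, n in groups.values()]


# Hypothetical records with a 'sample' attribute
records = [
    ("s1", "ACGT", {"sample": "A"}),
    ("s2", "ACGT", {"sample": "A"}),
    ("s3", "ACGT", {"sample": "B"}),
    ("s4", "TTGA", {"sample": "B"}),
]

# By sequence only
by_seq = dereplicate(records)
# By sequence, but grouped by the sample attribute
by_seq_and_sample = dereplicate(records, key=lambda r: (r[1], r[2]["sample"]))

for rec, n in by_seq_and_sample:
    print(rec[0], rec[1], rec[2]["sample"], n)
```

Combining the sequence and the sample into one tuple key means that identical sequences are only merged within the same sample.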
Example: pre-processing of mixed multi-marker amplicon sequences (primer trimming, grouping by amplicon)
These steps could be part of an amplicon pipeline that de-multiplexes multi-marker amplicons. find searches for a set of primers, which are removed by trim, and finally split distributes the sequences into different files named by the forward primer.
primers.fasta
Command for searching/trimming
st find file:primers.fasta -a primer='{pattern_name}' -a end='{match_end}' sequences.fasta |
st trim -e '{attr(end)}..' |
st split -o '{attr(primer)}'
Output files: prA.fasta, prB.fasta and undefined.fasta
Note: where no primer is found, the sequence is not trimmed since end=undefined (see ranges).
Integration of external metadata
Integration of sequence metadata sources in the form of delimited text
Example: Add Genus names from a separate tab-separated list
Input files: input.fasta and genus.tsv
Using -m/--meta to include genus.tsv as metadata source:
Output file: with_genus.fasta
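Conceptually, integrating a delimited text file amounts to a lookup by sequence ID. The Python sketch below shows this under stated assumptions: the first TSV column holds the sequence ID, the genus is stored as a `genus=` header attribute, and the records and genus names are invented for illustration. The names `load_metadata` and `annotate` are hypothetical and do not describe how seqtool reads metadata internally.

```python
import csv
import io


def load_metadata(tsv_text):
    """Map the sequence ID (first column) to the remaining TSV columns."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    return {row[0]: row[1:] for row in reader if row}


def annotate(records, metadata, attr="genus", column=0):
    """Yield (header, sequence), appending attr=value when the ID is known."""
    for seq_id, seq in records:
        fields = metadata.get(seq_id)
        if fields is None:
            yield seq_id, seq  # no metadata entry: header left unchanged
        else:
            yield f"{seq_id} {attr}={fields[column]}", seq


# Hypothetical inputs standing in for genus.tsv and input.fasta
genus_tsv = "seq1\tActinomyces\nseq2\tPrevotella\n"
records = [("seq1", "ACGT"), ("seq2", "TTGA"), ("seq3", "GGCC")]

annotated = list(annotate(records, load_metadata(genus_tsv)))
for header, seq in annotated:
    print(f">{header}\n{seq}")
```

The same lookup table could also drive a subset operation, keeping only records whose ID occurs in the list.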
Example: Choose specific sequences given a separate file with an ID list
Input files: input.fasta and id_list.txt
Output file: subset.fasta
Commands
Basic conversion/editing
- pass: Directly pass input to output without any processing, useful for converting and attribute setting
Information about sequences
- view: View biological sequences, colored by base / amino acid, or by sequence quality
- count: Count all records in the input (total or categorized by variables/functions)
- stat: Return per-sequence statistics as a tab-delimited list
Subset/shuffle
- sort: Sort records by sequence or any other criterion
- unique: De-replicate by sequence and/or other properties, returning only unique records
- filter: Keep/exclude sequences based on different properties with a mathematical (JavaScript) expression
- split: Distribute sequences into multiple files based on a variable/function or advanced expression
- sample: Get a random subset of sequences; either a fixed number or an approximate fraction of the input
- slice: Return a range of sequence records from the input
- head: Return the first N sequences
- tail: Return the last N sequences
- interleave: Interleave records of all files in the input
Search and replace
- find: Search for pattern(s) in sequences or sequence headers for record filtering, pattern replacement or passing hits to the next command
- replace: Fast and simple pattern replacement in sequences or headers
Modifying commands
- del: Delete header ID/description and/or attributes
- set: Replace the header, header attributes or sequence with new content
- trim: Trim sequences on the left and/or right (single range) or extract and concatenate several ranges
- mask: Soft or hard mask sequence ranges
- upper: Convert sequences to uppercase
- lower: Convert sequences to lowercase
- revcomp: Reverse-complement DNA or RNA sequences
- concat: Concatenate sequences/alignments from different files
Comparison with other tools
Other tools with a similar focus include Seqtk, SeqKit and the FASTX-Toolkit; the more specialized USEARCH and VSEARCH also offer some of the same functions.
Seqtool performs well compared to these tools on a selection of diverse tasks: