Comparison of tools

In the following list, we show the execution time, memory footprint and CPU usage of seqtool v0.4.0-beta on a selection of tasks, compared with the following tools:

Seqtk v1.4
SeqKit v2.7.0
FASTX-Toolkit
USEARCH v11.0.667
VSEARCH v2.28.1
Cutadapt v4.6

Details on the approach are found here. The input file is a FASTQ or FASTA file containing 2.6 M reads (Illumina MiSeq, 300 bp). The comparison was run on a Ryzen 4750U CPU with frequency boost disabled, writing files to RAM instead of to disk.

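The fastest and the most memory-efficient command of each task are marked with '🏆', followed by a factor indicating how many times faster it is (or how many times less memory it uses) than the second-ranked command. For each task, the seqtool command is listed first, the alternative commands follow, and the final unlabeled time and memory values are those of seqtool. To show more details, click on the alternative commands list.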
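
The factors shown next to '🏆' appear to be plain ratios between the second-best and the best value. The following minimal Python sketch recomputes them for the first `pass` benchmark (SeqKit vs. seqtool). The function name `trophy_factor` is made up for illustration, and because the published values are rounded, a recomputed factor can differ in the last digit.

```python
def trophy_factor(values):
    """Return (index of the best value, second_best / best).

    Assumes 'lower is better' (seconds or MiB) and at least two values.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    best, second = order[0], order[1]
    return best, values[second] / values[best]


# Values taken from the first 'pass' benchmark: [SeqKit, seqtool]
times = [2.2, 2.6]     # run time in seconds
memory = [18.0, 7.1]   # peak memory in MiB

best_time, time_factor = trophy_factor(times)
best_mem, mem_factor = trophy_factor(memory)

print(f"fastest: index {best_time}, factor {time_factor:.1f}x")
print(f"least memory: index {best_mem}, factor {mem_factor:.2f}x")
```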

pass

Do nothing, just read and write FASTA

st pass input.fasta > output.fasta

SeqKit 🕓 2.2 s 🏆 (1.2x)

SeqKit

seqkit seq  input.fasta > output.fasta

🕓 2.2 s 🏆 (1.2x) 106% CPU
📈 18.0 MiB

🕓 2.6 s
📈 7.1 MiB 🏆 (2.53x)

Convert FASTQ to FASTA

st pass --to-fa input.fastq > output.fasta

FASTX-Toolkit 🕓 287.9 s ❙ Seqtk 🕓 4.3 s ❙ SeqKit 🕓 3.1 s

FASTX-Toolkit	`fastq_to_fasta -Q33 -i input.fastq > output.fasta`	🕓 287.9 s 📈 3.5 MiB 🏆 (1.00x)
Seqtk	`seqtk seq -A input.fastq > output.fasta`	🕓 4.3 s 📈 3.5 MiB
SeqKit	`seqkit fq2fa input.fastq > output.fasta`	🕓 3.1 s 📈 18.4 MiB

🕓 3.1 s 🏆 (1.0x)
📈 7.1 MiB

Convert FASTQ quality scores

st pass --to fastq-illumina input.fastq > output.fastq

VSEARCH 🕓 12.9 s ❙ SeqKit 🕓 48.8 s

VSEARCH

vsearch --fastq_convert input.fastq --fastq_asciiout 64 --fastqout output.fastq

messages

vsearch v2.28.1_linux_x86_64, 30.6GB RAM, 16 cores
https://github.com/torognes/vsearch
Reading FASTQ file 100%

🕓 12.9 s
📈 4.2 MiB 🏆 (1.65x)

SeqKit

seqkit convert --from 'Sanger' --to 'Illumina-1.3+' input.fastq > output.fastq

messages

[INFO] converting Sanger -> Illumina-1.3+

🕓 48.8 s
📈 47.9 MiB

🕓 7.4 s 🏆 (1.8x)
📈 7.0 MiB
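
The quality-score conversion benchmarked above comes down to changing the ASCII offset of the quality line: the Sanger format stores Phred scores with an offset of 33, Illumina 1.3+ with an offset of 64. A minimal Python sketch of that re-encoding follows. The function name is illustrative, and nothing here claims to mirror how the benchmarked tools implement it.

```python
def sanger_to_illumina13(qual: str) -> str:
    """Re-encode a Phred+33 quality string as Phred+64."""
    out = []
    for ch in qual:
        phred = ord(ch) - 33          # decode Sanger (offset 33)
        if phred < 0:
            raise ValueError(f"invalid Phred+33 character: {ch!r}")
        out.append(chr(phred + 64))   # encode Illumina 1.3+ (offset 64)
    return "".join(out)


# 'I' = Q40, '5' = Q20, '!' = Q0
print(sanger_to_illumina13("II5!"))  # prints hhT@
```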

Write compressed FASTQ files in GZIP format

st pass input.fastq -o output.fastq.gz

SeqKit 🕓 30.3 s 🏆 (1.3x) ❙ seqtool | gzip 🕓 159.1 s ❙ gzip directly 🕓 158.6 s ❙ pigz directly (4 threads) 🕓 39.0 s

SeqKit	`seqkit seq input.fastq -o output.fastq.gz`	🕓 30.3 s 🏆 (1.3x) 📈 37.5 MiB
seqtool \| gzip	`st pass input.fastq \| gzip -c > output.fastq.gz`	🕓 159.1 s 📈 7.2 MiB
gzip directly	`gzip -kf input.fastq`	🕓 158.6 s 📈 3.5 MiB 🏆 (1.21x)
pigz directly (4 threads)	`pigz -p4 -kf input.fastq`	🕓 39.0 s 405% CPU 📈 4.2 MiB

🕓 55.8 s
📈 27.5 MiB

Write compressed FASTQ files in Zstandard format

st pass input.fastq -o output.fastq.zst

seqtool | zstd piped 🕓 12.8 s 🏆 (1.2x)

seqtool | zstd piped

st pass input.fastq | zstd -c > output.fastq.zst

🕓 12.8 s 🏆 (1.2x) 147% CPU
📈 38.8 MiB

🕓 15.5 s 114% CPU
📈 11.0 MiB 🏆 (3.52x)

Write compressed FASTQ files in Lz4 format

st pass input.fastq -o output.fastq.lz4

seqtool | lz4 piped 🕓 9.9 s

seqtool | lz4 piped

st pass input.fastq | lz4 -c > output.fastq.lz4

🕓 9.9 s 116% CPU
📈 7.4 MiB 🏆 (3.75x)

🕓 9.4 s 🏆 (1.1x) 116% CPU
📈 27.6 MiB

count

Count the number of FASTQ sequences in the input

st count input.fastq

🟦 output

Seqtk 🕓 0.7 s

Seqtk

seqtk size input.fasta

🟦 output

2610480 712939424

🕓 0.7 s
📈 3.4 MiB 🏆 (2.11x)

🕓 0.6 s 🏆 (1.2x)
📈 7.1 MiB

Count the number of FASTQ sequences, grouped by GC content (in 10% intervals)

st count -k 'bin(gc_percent, 10)' input.fastq

🟦 output

(10, 20]    16
(20, 30]    3004
(30, 40]    51945
(40, 50]    1149946
(50, 60]    1248702
(60, 70]    20439
(70, 80]    120
(80, 90]    63
(90, 100]   37
(100, 110]  11
(NaN, NaN]  136197

st with math expression 🕓 7.0 s

st with math expression

st count -k '{bin(gc_percent/100*100, 10)}' input.fastq

🟦 output

(10, 20]    16
(20, 30]    3004
(30, 40]    51945
(40, 50]    1149946
(50, 60]    1248702
(60, 70]    20439
(70, 80]    120
(80, 90]    63
(90, 100]   37
(100, 110]  11
(NaN, NaN]  136197

🕓 7.0 s
📈 86.0 MiB

🕓 4.2 s 🏆 (1.6x)
📈 7.4 MiB 🏆 (11.66x)
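
To illustrate what grouping by GC content in 10% intervals means, here is a small Python sketch. It is only an approximation of the idea: the interval convention (closed at the lower edge) and the helper names are assumptions, and it does not try to reproduce the exact labels or the NaN row of the output above.

```python
from collections import Counter


def gc_percent(seq: str) -> float:
    """Percentage of G and C bases in a non-empty sequence."""
    s = seq.upper()
    return 100.0 * (s.count("G") + s.count("C")) / len(s)


def bin_start(value: float, width: float = 10.0) -> float:
    """Lower edge of the fixed-width interval containing value (assumed convention)."""
    return (value // width) * width


seqs = ["ATAT", "ATGC", "GGCC", "GGGCA"]
counts = Counter(bin_start(gc_percent(s)) for s in seqs)

for start in sorted(counts):
    print(f"[{start:g}, {start + 10:g})\t{counts[start]}")
```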

sort

Sort by sequence

st sort seq input.fasta > output.fasta

SeqKit 🕓 42.3 s

SeqKit

seqkit sort -s  input.fasta > output.fasta

messages

[INFO] read sequences ...
[INFO] 2610480 sequences loaded
[INFO] sorting ...
[INFO] output ...

🕓 42.3 s
📈 4595.1 MiB

🕓 13.6 s 🏆 (3.1x)
📈 1771.4 MiB 🏆 (2.59x)

Sort by sequence with ~ 50 MiB memory limit

st sort seq input.fasta -M 50M > output.fasta

messages

Memory limit reached after 78050 records, writing to temporary file(s). Consider raising the limit (-M/--max-mem) to speed up sorting. Use -q/--quiet to silence this message.

100 MiB memory limit 🕓 20.6 s

100 MiB memory limit

st sort seq input.fasta -M 100M > output.fasta

messages

Memory limit reached after 155392 records, writing to temporary file(s). Consider raising the limit (-M/--max-mem) to speed up sorting. Use -q/--quiet to silence this message.

🕓 20.6 s
📈 108.7 MiB

🕓 20.3 s 🏆 (1.0x)
📈 58.5 MiB 🏆 (1.86x)
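
The messages above show that sorting with a memory limit spills to temporary files. One common way to do this is an external merge sort: sort bounded chunks, write each to a temporary file, then merge the sorted files lazily. The Python sketch below shows that general technique only. Limiting by item count instead of bytes is a simplification, and whether seqtool merges its temporary files in this way is an assumption.

```python
import heapq
import tempfile


def external_sort(items, max_in_memory):
    """Yield strings in sorted order, holding at most max_in_memory items per chunk."""
    chunk_files = []
    chunk = []

    def flush():
        # Sort the current chunk and spill it to a temporary file
        chunk.sort()
        f = tempfile.TemporaryFile(mode="w+")
        f.writelines(line + "\n" for line in chunk)
        f.seek(0)
        chunk_files.append(f)
        chunk.clear()

    for item in items:
        chunk.append(item)
        if len(chunk) >= max_in_memory:
            flush()
    if chunk:
        flush()

    # heapq.merge reads the sorted chunk files lazily, one line at a time
    streams = [(line.rstrip("\n") for line in f) for f in chunk_files]
    try:
        yield from heapq.merge(*streams)
    finally:
        for f in chunk_files:
            f.close()


seqs = ["GGA", "ACT", "TTG", "CAT", "AAC"]
print(list(external_sort(seqs, max_in_memory=2)))
```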

Sort by record ID

st sort id input.fasta > output.fasta

SeqKit 🕓 34.2 s

SeqKit

seqkit sort  input.fasta > output.fasta

messages

[INFO] read sequences ...
[INFO] 2610480 sequences loaded
[INFO] sorting ...
[INFO] output ...

🕓 34.2 s
📈 4436.4 MiB

🕓 6.5 s 🏆 (5.3x)
📈 1119.2 MiB 🏆 (3.96x)

Sort by sequence length

st sort seqlen input.fasta > output.fasta

SeqKit 🕓 33.7 s ❙ VSEARCH 🕓 9.4 s

SeqKit

seqkit sort -l  input.fasta > output.fasta

messages

[INFO] read sequences ...
[INFO] 2610480 sequences loaded
[INFO] sorting ...
[INFO] output ...

🕓 33.7 s
📈 4153.5 MiB

VSEARCH

vsearch --sortbylength input.fasta --output output.fasta

messages

vsearch v2.28.1_linux_x86_64, 30.6GB RAM, 16 cores
https://github.com/torognes/vsearch
Reading file input.fasta 100%
712939424 nt in 2610480 seqs, min 35, max 301, avg 273
Getting lengths 100%
Sorting 100%
Median length: 301
Writing output 100%

🕓 9.4 s
📈 891.4 MiB 🏆 (1.17x)

🕓 5.9 s 🏆 (1.6x)
📈 1042.4 MiB

Sort sequences by USEARCH/VSEARCH-style abundance annotations

ST_ATTR_FMT=';key=value' st unique seq -a size={n_duplicates} input.fasta |
  st sort '{-attr("size")}' > output.fasta

VSEARCH 🕓 20.4 s

VSEARCH

vsearch --derep_fulllength input.fasta --output - --sizeout |   vsearch --sortbysize - --output output.fasta

messages

vsearch v2.28.1_linux_x86_64, 30.6GB RAM, 16 cores
https://github.com/torognes/vsearch
vsearch v2.28.1_linux_x86_64, 30.6GB RAM, 16 cores
https://github.com/torognes/vsearch
Dereplicating file input.fasta 100%
712939424 nt in 2610480 seqs, min 35, max 301, avg 273
Sorting 100%
2134929 unique sequences, avg cluster 1.2, median 1, max 136182
Writing FASTA output fileReading file - 100%
 100%
606287856 nt in 2134929 seqs, min 35, max 301, avg 284
Getting sizes 100%
Sorting 100%
Median abundance: 1
Writing output 100%

🕓 20.4 s 113% CPU
📈 1345.8 MiB 🏆 (1.19x)

🕓 13.3 s 🏆 (1.5x) 110% CPU
📈 1606.5 MiB

unique

Remove duplicate sequences using sequence hashes. This is more memory efficient and usually faster than keeping the whole sequence around.

st unique seqhash input.fasta > output.fasta

SeqKit 🕓 3.3 s 🏆 (1.2x)

SeqKit

seqkit rmdup -sP  input.fasta > output.fasta

messages

[INFO] 475551 duplicated records removed

🕓 3.3 s 🏆 (1.2x)
📈 180.1 MiB

🕓 4.2 s
📈 117.1 MiB 🏆 (1.54x)
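
Hash-based de-duplication saves memory because only a short digest per distinct sequence is kept instead of the sequence itself. A minimal Python sketch of the idea follows. The choice of BLAKE2 with an 8-byte digest is an assumption made for this example, not the hash used by any of the benchmarked tools. With any fixed-size digest there is a small chance that two different sequences collide and one of them is dropped.

```python
import hashlib


def unique_by_hash(records):
    """Yield (id, seq) records whose sequence digest has not been seen before."""
    seen = set()
    for rec_id, seq in records:
        # Only the 8-byte digest is stored, not the full sequence
        digest = hashlib.blake2b(seq.encode("ascii"), digest_size=8).digest()
        if digest not in seen:
            seen.add(digest)
            yield rec_id, seq


records = [("r1", "ACGT"), ("r2", "TTGA"), ("r3", "ACGT")]
print(list(unique_by_hash(records)))  # prints [('r1', 'ACGT'), ('r2', 'TTGA')]
```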

Remove duplicate sequences using sequence hashes (case-insensitive).

st unique 'seqhash(true)' input.fasta > output.fasta

VSEARCH 🕓 12.1 s ❙ SeqKit 🕓 6.2 s

VSEARCH

vsearch --derep_smallmem input.fasta --fastaout output.fasta

messages

vsearch v2.28.1_linux_x86_64, 30.6GB RAM, 16 cores
https://github.com/torognes/vsearch
Dereplicating file input.fasta 100%
712939424 nt in 2610480 seqs, min 35, max 301, avg 273
2134929 unique sequences, avg cluster 1.2, median 1, max 136182
Writing FASTA output file 100%

🕓 12.1 s
📈 90.7 MiB 🏆 (1.29x)

SeqKit

seqkit rmdup -sPi  input.fasta > output.fasta

messages

[INFO] 475551 duplicated records removed

🕓 6.2 s
📈 289.8 MiB

🕓 4.3 s 🏆 (1.4x)
📈 117.2 MiB

Remove duplicate sequences that are exactly identical (case-insensitive), comparing full sequences instead of hashes (requires more memory). VSEARCH additionally treats 'T' and 'U' in the same way (seqtool doesn't).

st unique upper_seq input.fasta > output.fasta

seqtool (sorted by sequence) 🕓 13.5 s ❙ VSEARCH 🕓 15.8 s

seqtool (sorted by sequence)

st unique -s upper_seq input.fasta > output.fasta

🕓 13.5 s
📈 1640.7 MiB

VSEARCH

vsearch --derep_fulllength input.fasta --output output.fasta

messages

vsearch v2.28.1_linux_x86_64, 30.6GB RAM, 16 cores
https://github.com/torognes/vsearch
Dereplicating file input.fasta 100%
712939424 nt in 2610480 seqs, min 35, max 301, avg 273
Sorting 100%
2134929 unique sequences, avg cluster 1.2, median 1, max 136182
Writing FASTA output file 100%

🕓 15.8 s
📈 1345.7 MiB

🕓 5.4 s 🏆 (2.5x)
📈 729.0 MiB 🏆 (1.85x)

Remove duplicate sequences (exact mode) with a memory limit of ~50 MiB

st unique seq -M 50M input.fasta > output.fasta

messages

Memory limit reached after 151512 records, writing to temporary file(s). Consider raising the limit (-M/--max-mem) to speed up de-duplicating. Use -q/--quiet to silence this message.

🕓 19.5 s
📈 56.6 MiB

Remove duplicate sequences, checking both strands

st unique seqhash_both input.fasta > output.fasta

SeqKit 🕓 14.8 s

SeqKit

seqkit rmdup -s  input.fasta > output.fasta

messages

[INFO] 475687 duplicated records removed

🕓 14.8 s
📈 293.6 MiB

🕓 7.5 s 🏆 (2.0x)
📈 117.1 MiB 🏆 (2.51x)
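
Checking both strands means that a sequence and its reverse complement count as duplicates. One simple way to achieve this is to de-duplicate on a strand-neutral key, for example the lexicographically smaller of the two orientations. The Python sketch below shows this approach for plain A/C/G/T sequences. The key definition and function names are assumptions for illustration, not a description of seqtool's `seqhash_both`.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence (A/C/G/T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def strand_neutral_key(seq: str) -> str:
    """Same key for a sequence and its reverse complement."""
    return min(seq, reverse_complement(seq))


def unique_both_strands(seqs):
    """Yield the first occurrence of each sequence, ignoring orientation."""
    seen = set()
    for seq in seqs:
        key = strand_neutral_key(seq)
        if key not in seen:
            seen.add(key)
            yield seq


# "CGTT" is the reverse complement of "AACG" and is therefore dropped
print(list(unique_both_strands(["AACG", "CGTT", "GGGA"])))  # prints ['AACG', 'GGGA']
```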

Remove duplicate sequences, appending USEARCH/VSEARCH-style abundance annotations to the headers: >id;size=NN

st unique seq -a size={n_duplicates} --attr-fmt ';key=value' input.fasta > output.fasta

VSEARCH 🕓 16.1 s

VSEARCH

vsearch --derep_fulllength input.fasta --sizeout --output output.fasta

messages

vsearch v2.28.1_linux_x86_64, 30.6GB RAM, 16 cores
https://github.com/torognes/vsearch
Dereplicating file input.fasta 100%
712939424 nt in 2610480 seqs, min 35, max 301, avg 273
Sorting 100%
2134929 unique sequences, avg cluster 1.2, median 1, max 136182
Writing FASTA output file 100%

🕓 16.1 s
📈 1345.9 MiB 🏆 (1.19x)

🕓 9.3 s 🏆 (1.7x)
📈 1606.2 MiB
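
The `>id;size=NN` header format used above can be illustrated with a few lines of Python: count how often each sequence occurs and append the count to the ID of its first occurrence. This is a sketch of the output format only. Unlike the hash-based variant it keeps every distinct sequence in memory, and the function name is hypothetical.

```python
from collections import Counter


def dereplicate_with_size(records):
    """Return FASTA lines for unique sequences, headers ending in ';size=N'."""
    counts = Counter()
    first_id = {}
    for rec_id, seq in records:
        counts[seq] += 1
        first_id.setdefault(seq, rec_id)  # remember the first ID per sequence

    lines = []
    for seq, n in counts.items():         # insertion order = first occurrence
        lines.append(f">{first_id[seq]};size={n}")
        lines.append(seq)
    return lines


records = [("r1", "ACGT"), ("r2", "TTGA"), ("r3", "ACGT")]
print("\n".join(dereplicate_with_size(records)))
```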

De-replicate both by sequence and record ID (the part before the first space in the header). The given benchmark actually has unique sequence IDs, so the result is the same as de-replication by sequence.

st unique id,seq input.fasta > output.fasta

VSEARCH 🕓 17.7 s

VSEARCH

vsearch --derep_id input.fasta --output output.fasta

messages

vsearch v2.28.1_linux_x86_64, 30.6GB RAM, 16 cores
https://github.com/torognes/vsearch
Dereplicating file input.fasta 100%
712939424 nt in 2610480 seqs, min 35, max 301, avg 273
Sorting 100%
2610480 unique sequences, avg cluster 1.0, median 1, max 1
Writing FASTA output file 100%

🕓 17.7 s
📈 1364.4 MiB

🕓 7.5 s 🏆 (2.3x)
📈 1090.6 MiB 🏆 (1.25x)

filter

Filter sequences by length

st filter 'seqlen >= 100' input.fastq > output.fastq

Seqtk 🕓 6.5 s ❙ SeqKit 🕓 4.1 s 🏆 (1.3x)

Seqtk	`seqtk seq -L 100 input.fastq > output.fastq`	🕓 6.5 s 📈 3.5 MiB 🏆 (2.07x)
SeqKit	`seqkit seq -m 100 input.fastq > output.fastq` messages `[WARN] you may switch on flag -g/--remove-gaps to remove spaces`	🕓 4.1 s 🏆 (1.3x) 📈 28.1 MiB

🕓 5.4 s
📈 7.2 MiB

Filter sequences by the total expected error as calculated from the quality scores

st filter 'exp_err <= 1' input.fastq --to-fa > output.fasta

VSEARCH 🕓 32.9 s ❙ USEARCH 🕓 16.0 s 🏆 (1.7x)

VSEARCH

vsearch --fastq_filter input.fastq --fastq_maxee 1 --fastaout output.fasta

messages

vsearch v2.28.1_linux_x86_64, 30.6GB RAM, 16 cores
https://github.com/torognes/vsearch
Reading input file 100%
1408755 sequences kept (of which 0 truncated), 1201725 sequences discarded.

🕓 32.9 s
📈 4.4 MiB 🏆 (1.66x)

USEARCH

usearch -fastq_filter input.fastq -fastq_maxee 1 -fastaout output.fasta

🟦 output

usearch v11.0.667_i86linux32, 4.0Gb RAM (32.1Gb total), 16 cores
(C) Copyright 2013-18 Robert C. Edgar, all rights reserved.
https://drive5.com/usearch
License: personal use only

messages

00:00 4.2Mb  FASTQ base 33 for file input.fastq
00:00 38Mb   CPU has 16 cores, defaulting to 10 threads
00:00 115Mb     0.1% Filtering
00:01 123Mb     1.0% Filtering, 31.4% passed
00:02 123Mb     8.7% Filtering, 31.5% passed
00:03 123Mb    16.4% Filtering, 31.8% passed
00:04 123Mb    22.1% Filtering, 40.1% passed
00:05 123Mb    26.7% Filtering, 47.6% passed
00:06 123Mb    31.5% Filtering, 52.6% passed
00:07 123Mb    36.4% Filtering, 56.2% passed
00:08 123Mb    41.3% Filtering, 59.1% passed
00:09 123Mb    47.2% Filtering, 60.1% passed
00:10 123Mb    53.5% Filtering, 60.1% passed
00:11 123Mb    61.1% Filtering, 56.6% passed
00:12 123Mb    68.7% Filtering, 53.5% passed
00:13 123Mb    75.4% Filtering, 53.7% passed
00:14 123Mb    83.4% Filtering, 51.4% passed
00:15 123Mb    89.4% Filtering, 52.2% passed
00:16 123Mb    95.1% Filtering, 53.2% passed
00:16 90Mb    100.0% Filtering, 54.0% passed
   2610480  Reads (2.6M)
   1201725  Discarded reads with expected errs > 1.00
   1408755  Filtered reads (1.4M, 54.0%)

🕓 16.0 s 🏆 (1.7x) 997% CPU
📈 34.9 MiB

🕓 27.9 s
📈 7.2 MiB
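
The total expected error of a read is the sum of the per-base error probabilities implied by its quality scores, with p = 10^(-Q/10) for a Phred score Q. A minimal Python sketch of the filter criterion used above, assuming Phred+33 encoded quality strings (function names are illustrative):

```python
def expected_errors(qual: str, offset: int = 33) -> float:
    """Sum of per-base error probabilities, p = 10 ** (-Q / 10)."""
    return sum(10 ** (-(ord(ch) - offset) / 10) for ch in qual)


def passes_max_ee(qual: str, max_ee: float = 1.0) -> bool:
    """True if the read's expected number of errors does not exceed max_ee."""
    return expected_errors(qual) <= max_ee


print(passes_max_ee("IIII"))  # four bases at Q40: expected errors of about 0.0004
print(passes_max_ee("!!"))    # two bases at Q0: expected errors of 2.0
```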

Select records from a large set of sequences given a list of 1000 sequence IDs

st filter -m ids_list.txt 'has_meta()' input.fasta > output.fasta

VSEARCH 🕓 28.1 s ❙ SeqKit 🕓 1.0 s 🏆 (1.6x)

VSEARCH

vsearch --fastx_getseqs input.fasta --labels ids_list.txt --fastaout output.fasta

messages

vsearch v2.28.1_linux_x86_64, 30.6GB RAM, 16 cores
https://github.com/torognes/vsearch
Reading labels 100%
Extracting sequences 100%
1000 of 2610480 sequences extracted (0.0%)

🕓 28.1 s
📈 4.2 MiB 🏆 (1.85x)

SeqKit

seqkit grep -f ids_list.txt input.fasta > output.fasta

messages

[INFO] 1000 patterns loaded from file

🕓 1.0 s 🏆 (1.6x)
📈 21.8 MiB

🕓 1.6 s
📈 7.9 MiB

sample

Random subsampling to 1000 sequences

st sample -n 1000 input.fasta > output.fasta

VSEARCH 🕓 4.3 s ❙ Seqtk 🕓 0.8 s ❙ SeqKit 🕓 11.5 s

VSEARCH

vsearch --fastx_subsample input.fasta --sample_size 1000 --fastaout output.fasta

messages

vsearch v2.28.1_linux_x86_64, 30.6GB RAM, 16 cores
https://github.com/torognes/vsearch
Reading file input.fasta 100%
712939424 nt in 2610480 seqs, min 35, max 301, avg 273
Got 2610480 reads from 2610480 amplicons
Subsampling 100%
Writing output 100%
Subsampled 1000 reads from 1000 amplicons

🕓 4.3 s
📈 841.5 MiB

Seqtk

seqtk sample input.fasta 1000 > output.fasta

🕓 0.8 s
📈 3.5 MiB 🏆 (2.07x)

SeqKit

seqkit sample -n 1000 input.fasta > output.fasta

messages

[INFO] sample by number
[INFO] loading all sequences into memory...
[INFO] 1000 sequences outputted

🕓 11.5 s
📈 3112.7 MiB

🕓 0.5 s 🏆 (1.4x)
📈 7.2 MiB
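
Drawing a fixed number of records does not require loading the whole input: reservoir sampling selects k items uniformly at random in a single pass while holding only k items in memory. The Python sketch below shows the classic variant (Algorithm R). It illustrates the technique in general; which strategy each benchmarked tool uses is not stated here.

```python
import random


def reservoir_sample(items, k, seed=None):
    """Return k items chosen uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(items):
        if i < k:
            reservoir.append(item)    # fill the reservoir first
        else:
            j = rng.randint(0, i)     # inclusive on both ends
            if j < k:
                reservoir[j] = item   # replace with probability k / (i + 1)
    return reservoir


sample = reservoir_sample(range(10_000), k=1000, seed=1)
print(len(sample))  # prints 1000
```

If the input holds fewer than k items, all of them are returned.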

Random subsampling to ~10% of sequences

st sample -p 0.1 input.fasta > output.fasta

Seqtk 🕓 1.7 s ❙ SeqKit 🕓 2.0 s

Seqtk	`seqtk sample input.fastq 0.1 > output.fasta`	🕓 1.7 s 📈 3.5 MiB 🏆 (2.04x)
SeqKit	`seqkit sample -p 0.1 input.fastq > output.fasta` messages `[INFO] sample by proportion [INFO] 260463 sequences outputted`	🕓 2.0 s 📈 27.6 MiB

🕓 0.8 s 🏆 (2.2x)
📈 7.1 MiB
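
Sampling by proportion is typically done by keeping each record independently with the given probability, which is why the result is only approximately 10% of the input (SeqKit reports 260463 of 2610480 records above). A minimal Python sketch of that approach, with a hypothetical function name and an optional seed for reproducibility:

```python
import random


def sample_fraction(items, p, seed=None):
    """Keep each item independently with probability p."""
    rng = random.Random(seed)
    return [item for item in items if rng.random() < p]


kept = sample_fraction(range(100_000), p=0.1, seed=42)
print(len(kept))  # close to, but usually not exactly, 10000
```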

find

Find the forward primer location in the input reads with up to 4 mismatches

st find -D4 file:primers.fasta input.fastq -a primer={pattern_name} -a rng={match_range} > output.fastq

messages

Note: the sequence type of the pattern 'ITS4' was determined as 'dna'. If incorrect, please provide the correct type with `--seqtype`. Use `-q/--quiet` to suppress this message.

st (4 threads) 🕓 6.0 s 🏆 (3.5x) ❙ st (max. mismatches = 2) 🕓 21.1 s ❙ st (max. mismatches = 8) 🕓 26.7 s

st (4 threads)

st find -t4 -D4 file:primers.fasta input.fastq -a primer={pattern_name} -a rng={match_range} > output.fastq

messages

Note: the sequence type of the pattern 'ITS4' was determined as 'dna'. If incorrect, please provide the correct type with `--seqtype`. Use `-q/--quiet` to suppress this message.

🕓 6.0 s 🏆 (3.5x) 402% CPU
📈 17.6 MiB

st (max. mismatches = 2)

st find -D2 file:primers.fasta input.fastq -a primer={pattern_name} -a rng={match_range} > output.fastq

messages

Note: the sequence type of the pattern 'ITS4' was determined as 'dna'. If incorrect, please provide the correct type with `--seqtype`. Use `-q/--quiet` to suppress this message.

🕓 21.1 s
📈 7.5 MiB

st (max. mismatches = 8)

st find -D8 file:primers.fasta input.fastq -a primer={pattern_name} -a rng={match_range} > output.fastq

messages

Note: the sequence type of the pattern 'ITS4' was determined as 'dna'. If incorrect, please provide the correct type with `--seqtype`. Use `-q/--quiet` to suppress this message.

🕓 26.7 s
📈 7.4 MiB

🕓 21.3 s
📈 7.4 MiB 🏆 (1.00x)
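
Approximate pattern search with a maximum edit distance can be expressed as a small dynamic program in which a match may start at any position of the read (often attributed to Sellers). The Python sketch below shows that textbook formulation. It is not seqtool's implementation, which is not described here, and the function name and return convention are assumptions.

```python
def best_match(pattern: str, text: str, max_dist: int):
    """Return (distance, end) of the best approximate match, or None.

    end is the exclusive 0-based position in text where the match ends.
    Among equally good matches, the leftmost end position is reported.
    """
    m = len(pattern)
    # col[i]: edit distance between pattern[:i] and the best-matching
    # substring of text that ends at the current text position
    col = list(range(m + 1))
    best = None
    for j, t in enumerate(text, start=1):
        prev_diag = col[0]
        col[0] = 0  # a match may start anywhere in the text
        for i in range(1, m + 1):
            cur = col[i]
            cost = 0 if pattern[i - 1] == t else 1
            col[i] = min(prev_diag + cost,  # match or substitution
                         cur + 1,           # extra character in the text
                         col[i - 1] + 1)    # pattern character missing from the text
            prev_diag = cur
        if col[m] <= max_dist and (best is None or col[m] < best[0]):
            best = (col[m], j)
    return best


print(best_match("GTCC", "AAGTACAA", max_dist=1))  # prints (1, 6)
```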

Find and trim the forward primer up to an error rate (edit distance) of 20%, discarding unmatched reads. Note: Unlike Cutadapt, seqtool currently does not offer ungapped alignments (--no-indels).

st find -f file:primers.fasta -R 0.2 input.fastq -a primer={pattern_name} -a end={match_end} |
  st trim -e '{attr(end)}:' --fq > output.fastq

messages

Note: the sequence type of the pattern 'ITS4' was determined as 'dna'. If incorrect, please provide the correct type with `--seqtype`. Use `-q/--quiet` to suppress this message.

Cutadapt 🕓 67.1 s

Cutadapt

cutadapt -g 'file:primers.fasta;min_overlap=15' input.fastq -e 0.2 --rename '{id} primer={adapter_name}' --discard-untrimmed > output.fastq

messages

This is cutadapt 4.6 with Python 3.12.2
Command line parameters: -g file:primers.fasta;min_overlap=15 input.fastq -e 0.2 --rename {id} primer={adapter_name} --discard-untrimmed
Processing single-end reads on 1 core ...
Finished in 66.906 s (25.630 µs/read; 2.34 M reads/minute).
=== Summary ===
Total reads processed:               2,610,480
Reads with adapters:                   828,740 (31.7%)
== Read fate breakdown ==
Reads discarded as untrimmed:        1,781,740 (68.3%)
Reads written (passing filters):       828,740 (31.7%)
Total basepairs processed:   712,939,424 bp
Total written (filtered):    209,047,405 bp (29.3%)
=== Adapter ITS4 ===
Sequence: GTCCTCCGCTTATTGATATGC; Type: regular 5'; Length: 21; Trimmed: 828740 times
Minimum overlap: 15
No. of allowed errors:
1-4 bp: 0; 5-9 bp: 1; 10-14 bp: 2; 15-19 bp: 3; 20-21 bp: 4
Overview of removed sequences
length  count   expect  max.err error counts
15  8   0.0 3   3 1 3 1
16  12  0.0 3   1 3 4 4
17  7   0.0 3   3 0 0 4
18  11  0.0 3   2 6 1 2
19  12  0.0 3   1 2 6 1 2
20  15  0.0 4   3 5 3 2 2
21  29  0.0 4   2 11 4 2 10
22  73  0.0 4   5 23 8 15 22
23  221 0.0 4   10 46 39 53 73
24  723 0.0 4   27 96 180 381 39
25  8858    0.0 4   439 2961 4797 468 193
26  816649  0.0 4   202089 581641 27831 3348 1740
27  1926    0.0 4   184 840 797 74 31
28  33  0.0 4   4 22 2 3 2
29  15  0.0 4   1 11 1 1 1
30  4   0.0 4   1 3
31  1   0.0 4   1
32  3   0.0 4   2 1
33  1   0.0 4   1
34  1   0.0 4   1
35  2   0.0 4   0 2
40  2   0.0 4   0 2
41  2   0.0 4   0 2
42  3   0.0 4   1 2
45  1   0.0 4   0 1
47  1   0.0 4   0 0 0 0 1
48  1   0.0 4   1
51  6   0.0 4   0 0 0 0 6
54  1   0.0 4   0 0 0 0 1
58  16  0.0 4   0 0 0 0 16
59  2   0.0 4   0 1 0 0 1
60  2   0.0 4   0 0 0 0 2
61  20  0.0 4   0 1 0 0 19
62  1   0.0 4   0 0 0 0 1
63  12  0.0 4   0 1 0 1 10
64  2   0.0 4   0 0 0 0 2
66  2   0.0 4   0 0 0 1 1
67  24  0.0 4   0 0 3 5 16
68  4   0.0 4   0 0 1 0 3
69  1   0.0 4   0 0 0 0 1
85  2   0.0 4   0 2
86  5   0.0 4   1 3 0 0 1
105 4   0.0 4   0 0 0 0 4
138 1   0.0 4   0 0 0 0 1
190 2   0.0 4   0 0 0 0 2
203 1   0.0 4   0 0 0 0 1
226 2   0.0 4   0 0 0 0 2
227 1   0.0 4   0 0 0 0 1
228 3   0.0 4   0 0 0 0 3
230 1   0.0 4   0 0 0 0 1
247 1   0.0 4   0 0 0 0 1
249 1   0.0 4   0 0 0 0 1
251 5   0.0 4   0 0 0 0 5
252 1   0.0 4   0 0 0 0 1
255 1   0.0 4   0 0 0 0 1
258 1   0.0 4   0 0 0 0 1
290 1   0.0 4   0 0 0 0 1

🕓 67.1 s
📈 20.9 MiB

🕓 16.9 s 🏆 (4.0x) 120% CPU
📈 7.4 MiB 🏆 (2.83x)
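
The "No. of allowed errors" table in the Cutadapt log above is consistent with rounding down the product of match length and maximum error rate (20% here). A short Python sketch that reproduces the table for the 21 bp ITS4 primer, using integer arithmetic to avoid floating-point rounding (the function name is made up for this example):

```python
def allowed_errors(overlap_len: int, max_error_rate_percent: int = 20) -> int:
    """Largest whole number of errors not exceeding the rate for this length."""
    return overlap_len * max_error_rate_percent // 100


# Group match lengths 1..21 by their error allowance
groups = {}
for n in range(1, 22):
    groups.setdefault(allowed_errors(n), []).append(n)

for errors, lengths in groups.items():
    print(f"{lengths[0]}-{lengths[-1]} bp: {errors}")
```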

Find and trim the forward primer in parallel using 4 threads (cores).

st find -f file:primers.fasta -R 0.2 -t4 input.fastq -a primer={pattern_name} -a end={match_end} |
  st trim -e '{attr(end)}:' --fq > output.fastq

messages

Note: the sequence type of the pattern 'ITS4' was determined as 'dna'. If incorrect, please provide the correct type with `--seqtype`. Use `-q/--quiet` to suppress this message.

Cutadapt 🕓 18.1 s

Cutadapt

cutadapt -j4 -g 'file:primers.fasta;min_overlap=15' input.fastq -e 0.2 --rename '{id} primer={adapter_name}' --discard-untrimmed > output.fastq

messages

This is cutadapt 4.6 with Python 3.12.2
Command line parameters: -j4 -g file:primers.fasta;min_overlap=15 input.fastq -e 0.2 --rename {id} primer={adapter_name} --discard-untrimmed
Processing single-end reads on 4 cores ...
Finished in 17.956 s (6.878 µs/read; 8.72 M reads/minute).
=== Summary ===
Total reads processed:               2,610,480
Reads with adapters:                   828,740 (31.7%)
== Read fate breakdown ==
Reads discarded as untrimmed:        1,781,740 (68.3%)
Reads written (passing filters):       828,740 (31.7%)
Total basepairs processed:   712,939,424 bp
Total written (filtered):    209,047,405 bp (29.3%)
=== Adapter ITS4 ===
Sequence: GTCCTCCGCTTATTGATATGC; Type: regular 5'; Length: 21; Trimmed: 828740 times
Minimum overlap: 15
No. of allowed errors:
1-4 bp: 0; 5-9 bp: 1; 10-14 bp: 2; 15-19 bp: 3; 20-21 bp: 4
Overview of removed sequences
length  count   expect  max.err error counts
15  8   0.0 3   3 1 3 1
16  12  0.0 3   1 3 4 4
17  7   0.0 3   3 0 0 4
18  11  0.0 3   2 6 1 2
19  12  0.0 3   1 2 6 1 2
20  15  0.0 4   3 5 3 2 2
21  29  0.0 4   2 11 4 2 10
22  73  0.0 4   5 23 8 15 22
23  221 0.0 4   10 46 39 53 73
24  723 0.0 4   27 96 180 381 39
25  8858    0.0 4   439 2961 4797 468 193
26  816649  0.0 4   202089 581641 27831 3348 1740
27  1926    0.0 4   184 840 797 74 31
28  33  0.0 4   4 22 2 3 2
29  15  0.0 4   1 11 1 1 1
30  4   0.0 4   1 3
31  1   0.0 4   1
32  3   0.0 4   2 1
33  1   0.0 4   1
34  1   0.0 4   1
35  2   0.0 4   0 2
40  2   0.0 4   0 2
41  2   0.0 4   0 2
42  3   0.0 4   1 2
45  1   0.0 4   0 1
47  1   0.0 4   0 0 0 0 1
48  1   0.0 4   1
51  6   0.0 4   0 0 0 0 6
54  1   0.0 4   0 0 0 0 1
58  16  0.0 4   0 0 0 0 16
59  2   0.0 4   0 1 0 0 1
60  2   0.0 4   0 0 0 0 2
61  20  0.0 4   0 1 0 0 19
62  1   0.0 4   0 0 0 0 1
63  12  0.0 4   0 1 0 1 10
64  2   0.0 4   0 0 0 0 2
66  2   0.0 4   0 0 0 1 1
67  24  0.0 4   0 0 3 5 16
68  4   0.0 4   0 0 1 0 3
69  1   0.0 4   0 0 0 0 1
85  2   0.0 4   0 2
86  5   0.0 4   1 3 0 0 1
105 4   0.0 4   0 0 0 0 4
138 1   0.0 4   0 0 0 0 1
190 2   0.0 4   0 0 0 0 2
203 1   0.0 4   0 0 0 0 1
226 2   0.0 4   0 0 0 0 2
227 1   0.0 4   0 0 0 0 1
228 3   0.0 4   0 0 0 0 3
230 1   0.0 4   0 0 0 0 1
247 1   0.0 4   0 0 0 0 1
249 1   0.0 4   0 0 0 0 1
251 5   0.0 4   0 0 0 0 5
252 1   0.0 4   0 0 0 0 1
255 1   0.0 4   0 0 0 0 1
258 1   0.0 4   0 0 0 0 1
290 1   0.0 4   0 0 0 0 1

🕓 18.1 s 413% CPU
📈 39.4 MiB

🕓 4.9 s 🏆 (3.7x) 448% CPU
📈 17.8 MiB 🏆 (2.22x)

replace

Convert DNA to RNA using the replace command

st replace T U input.fasta > output.fasta

st find 🕓 14.3 s ❙ SeqKit 🕓 4.8 s 🏆 (2.1x) ❙ FASTX-Toolkit 🕓 283.5 s

st find

st find T --rep U input.fasta > output.fasta

messages

Note: the sequence type of the pattern was determined as 'dna'. If incorrect, please provide the correct type with `--seqtype`. Use `-q/--quiet` to suppress this message.

🕓 14.3 s
📈 7.2 MiB

SeqKit

seqkit seq --dna2rna  input.fasta > output.fasta

🕓 4.8 s 🏆 (2.1x)
📈 27.3 MiB

FASTX-Toolkit

fasta_nucleotide_changer -r -i input.fasta > output.fasta

🕓 283.5 s
📈 3.5 MiB 🏆 (2.07x)

🕓 10.1 s
📈 7.2 MiB

Convert DNA to RNA using 4 threads

st replace -t4 T U input.fasta > output.fasta

st find 🕓 8.4 s

st find

st find -t4 T --rep U input.fasta > output.fasta

messages

Note: the sequence type of the pattern was determined as 'dna'. If incorrect, please provide the correct type with `--seqtype`. Use `-q/--quiet` to suppress this message.

🕓 8.4 s 282% CPU
📈 24.6 MiB

🕓 2.7 s 🏆 (3.1x) 418% CPU
📈 9.0 MiB 🏆 (2.74x)

trim

Trim the leading 99 bp from the sequences

st trim 100: input.fasta > output.fasta

SeqKit (creates FASTA index) 🕓 44.8 s

SeqKit (creates FASTA index)

seqkit subseq -r '100:-1'  input.fasta > output.fasta

messages

[INFO] create or read FASTA index ...
[INFO] create FASTA index for input.fasta
[INFO]   2610480 records loaded from input.fasta.seqkit.fai

🕓 44.8 s
📈 1254.5 MiB

🕓 2.8 s 🏆 (16.0x)
📈 7.4 MiB 🏆 (170.10x)
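
The range `100:` is 1-based, so it keeps everything from position 100 onwards and removes the first 99 bases. In a 0-based language such as Python, the equivalent slice starts at index 99. A tiny sketch of that coordinate conversion, with a hypothetical helper name:

```python
def trim_from(seq: str, start: int) -> str:
    """Keep the sequence from the 1-based position start to the end."""
    if start < 1:
        raise ValueError("start is 1-based and must be >= 1")
    return seq[start - 1:]


read = "A" * 99 + "C" * 202           # a 301 bp read
trimmed = trim_from(read, 100)
print(len(trimmed), set(trimmed))     # prints 202 {'C'}
```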

upper

Convert sequences to uppercase

st upper input.fasta > output.fasta

Seqtk 🕓 5.2 s ❙ SeqKit 🕓 4.2 s

Seqtk	`seqtk seq -U input.fasta > output.fasta`	🕓 5.2 s 📈 3.5 MiB 🏆 (2.11x)
SeqKit	`seqkit seq -u input.fasta > output.fasta`	🕓 4.2 s 📈 62.2 MiB

🕓 3.0 s 🏆 (1.4x)
📈 7.4 MiB

revcomp

Reverse complement sequences

st revcomp input.fasta > output.fasta

Seqtk 🕓 5.3 s 🏆 (1.1x) ❙ VSEARCH 🕓 7.7 s ❙ SeqKit 🕓 7.8 s

Seqtk

seqtk seq -r input.fasta > output.fasta

🕓 5.3 s 🏆 (1.1x)
📈 3.5 MiB 🏆 (1.21x)

VSEARCH

vsearch --fastx_revcomp input.fasta --fastaout output.fasta

messages

vsearch v2.28.1_linux_x86_64, 30.6GB RAM, 16 cores
https://github.com/torognes/vsearch
Reading FASTA file 100%

🕓 7.7 s
📈 4.2 MiB

SeqKit

seqkit seq -rp  input.fasta > output.fasta

messages

[WARN] flag -t (--seq-type) (DNA/RNA) is recommended for computing complement sequences

🕓 7.8 s
📈 28.1 MiB

🕓 6.0 s
📈 7.2 MiB

concat

Concatenate sequences, adding an NNNNN spacer in between

st concat -s 5 -c N file1.fastq file2.fastq > output.fastq

VSEARCH 🕓 20.5 s

VSEARCH

vsearch --fastq_join file1.fastq --reverse file2.fastq --join_padgap NNNNN --fastqout output.fastq

messages

vsearch v2.28.1_linux_x86_64, 30.6GB RAM, 16 cores
https://github.com/torognes/vsearch
Joining reads 100%
2610480 pairs joined

🕓 20.5 s
📈 4.2 MiB 🏆 (1.74x)

🕓 9.9 s 🏆 (2.1x)
📈 7.4 MiB