Comparison of tools

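The following list shows the execution time, memory footprint and CPU usage of seqtool v0.4.0-beta on a selection of tasks, compared with other tools: SeqKit, Seqtk, FASTX-Toolkit, VSEARCH (v2.28.1), USEARCH (v11.0.667) and Cutadapt (4.6), as well as gzip, pigz, zstd and lz4 for compression.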

Details on the approach are found here. The input file is a FASTQ or FASTA file containing 2.6 M reads (Illumina MiSeq, 300 bp). The comparison was run on a Ryzen 4750U CPU with frequency boost disabled, writing files to RAM instead of the disk.

The fastest and most memory-efficient commands are marked with '🏆', followed by a factor that indicates how many times faster they are, or how many times less memory they use, than the second-ranked command. To show more details, click on the alternative commands list.

pass

Do nothing, just read and write FASTA
st pass input.fasta > output.fasta
SeqKit 🕓 2.2 s 🏆 (1.2x)
SeqKit
seqkit seq  input.fasta > output.fasta
🕓 2.2 s 🏆 (1.2x) 106% CPU
📈 18.0 MiB
🕓 2.6 s
📈 7.1 MiB 🏆 (2.53x)
Convert FASTQ to FASTA
st pass --to-fa input.fastq > output.fasta
FASTX-Toolkit 🕓 287.9 s  ❙ Seqtk 🕓 4.3 s  ❙ SeqKit 🕓 3.1 s
FASTX-Toolkit
fastq_to_fasta -Q33 -i input.fastq > output.fasta
🕓 287.9 s
📈 3.5 MiB 🏆 (1.00x)
Seqtk
seqtk seq -A input.fastq > output.fasta
🕓 4.3 s
📈 3.5 MiB
SeqKit
seqkit fq2fa input.fastq > output.fasta
🕓 3.1 s
📈 18.4 MiB
🕓 3.1 s 🏆 (1.0x)
📈 7.1 MiB
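For orientation, the conversion itself is conceptually simple: a FASTQ record consists of four lines (header, sequence, separator, qualities), and FASTA keeps only the first two. The following Python sketch illustrates this under the assumption of strict four-line records without wrapped sequences. The function name `fastq_to_fasta` is hypothetical, and the sketch does not reflect how any of the benchmarked tools is implemented.

```python
from itertools import islice


def fastq_to_fasta(lines):
    """Yield FASTA lines from an iterable of FASTQ lines.

    Assumes strict four-line records (no wrapped sequences).
    """
    it = iter(lines)
    while True:
        record = list(islice(it, 4))
        if not record:
            break
        if len(record) < 4 or not record[0].startswith("@"):
            raise ValueError("malformed FASTQ record")
        header = record[0].rstrip("\n")
        seq = record[1].rstrip("\n")
        yield ">" + header[1:]  # replace the leading '@' with '>'
        yield seq               # separator and quality lines are dropped


fastq = ["@read1 desc\n", "ACGT\n", "+\n", "IIII\n",
         "@read2\n", "GGCC\n", "+\n", "!!!!\n"]
fasta = list(fastq_to_fasta(fastq))
print("\n".join(fasta))
```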
Convert FASTQ quality scores
st pass --to fastq-illumina input.fastq > output.fastq
VSEARCH 🕓 12.9 s  ❙ SeqKit 🕓 48.8 s
VSEARCH
vsearch --fastq_convert input.fastq --fastq_asciiout 64 --fastqout output.fastq
 messages
vsearch v2.28.1_linux_x86_64, 30.6GB RAM, 16 cores
https://github.com/torognes/vsearch
Reading FASTQ file 100%
🕓 12.9 s
📈 4.2 MiB 🏆 (1.65x)
SeqKit
seqkit convert --from 'Sanger' --to 'Illumina-1.3+' input.fastq > output.fastq
 messages
[INFO] converting Sanger -> Illumina-1.3+
🕓 48.8 s
📈 47.9 MiB
🕓 7.4 s 🏆 (1.8x)
📈 7.0 MiB
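The conversion benchmarked here re-encodes Phred scores from the Sanger ASCII offset (33) to the Illumina 1.3+ offset (64), matching the `--fastq_asciiout 64` option in the VSEARCH command. A minimal Python sketch of that re-encoding follows. The function name `convert_quality` is hypothetical, and real tools may additionally validate or cap the score range, which this sketch does not attempt.

```python
def convert_quality(qual, from_offset=33, to_offset=64):
    """Re-encode a FASTQ quality string from one ASCII offset to another."""
    scores = [ord(c) - from_offset for c in qual]
    if any(q < 0 for q in scores):
        raise ValueError("quality character below the source offset")
    return "".join(chr(q + to_offset) for q in scores)


# '!' is Q0, '5' is Q20 and 'I' is Q40 in the Sanger (Phred+33) encoding
converted = convert_quality("!5I")
print(converted)
```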
Write compressed FASTQ files in GZIP format
st pass input.fastq -o output.fastq.gz
SeqKit 🕓 30.3 s 🏆 (1.3x)  ❙ seqtool | gzip 🕓 159.1 s  ❙ gzip directly 🕓 158.6 s  ❙ pigz directly (4 threads) 🕓 39.0 s
SeqKit
seqkit seq input.fastq -o output.fastq.gz
🕓 30.3 s 🏆 (1.3x)
📈 37.5 MiB
seqtool | gzip
st pass input.fastq | gzip -c > output.fastq.gz
🕓 159.1 s
📈 7.2 MiB
gzip directly
gzip -kf input.fastq
🕓 158.6 s
📈 3.5 MiB 🏆 (1.21x)
pigz directly (4 threads)
pigz -p4 -kf input.fastq
🕓 39.0 s 405% CPU
📈 4.2 MiB
🕓 55.8 s
📈 27.5 MiB
Write compressed FASTQ files in Zstandard format
st pass input.fastq -o output.fastq.zst
seqtool | zstd piped 🕓 12.8 s 🏆 (1.2x)
seqtool | zstd piped
st pass input.fastq | zstd -c > output.fastq.zst
🕓 12.8 s 🏆 (1.2x) 147% CPU
📈 38.8 MiB
🕓 15.5 s 114% CPU
📈 11.0 MiB 🏆 (3.52x)
Write compressed FASTQ files in Lz4 format
st pass input.fastq -o output.fastq.lz4
seqtool | lz4 piped 🕓 9.9 s
seqtool | lz4 piped
st pass input.fastq | lz4 -c > output.fastq.lz4
🕓 9.9 s 116% CPU
📈 7.4 MiB 🏆 (3.75x)
🕓 9.4 s 🏆 (1.1x) 116% CPU
📈 27.6 MiB

count

Count the number of FASTQ sequences in the input
st count input.fastq
🟦 output
2610480
Seqtk 🕓 0.7 s
Seqtk
seqtk size input.fasta
🟦 output
2610480 712939424
🕓 0.7 s
📈 3.4 MiB 🏆 (2.11x)
🕓 0.6 s 🏆 (1.2x)
📈 7.1 MiB
Count the number of FASTQ sequences, grouped by GC content (in 10% intervals)
st count -k 'bin(gc_percent, 10)' input.fastq
🟦 output
(10, 20]    16
(20, 30]    3004
(30, 40]    51945
(40, 50]    1149946
(50, 60]    1248702
(60, 70]    20439
(70, 80]    120
(80, 90]    63
(90, 100]   37
(100, 110]  11
(NaN, NaN]  136197
st with math expression 🕓 7.0 s
st with math expression
st count -k '{bin(gc_percent/100*100, 10)}' input.fastq
🟦 output
(10, 20]    16
(20, 30]    3004
(30, 40]    51945
(40, 50]    1149946
(50, 60]    1248702
(60, 70]    20439
(70, 80]    120
(80, 90]    63
(90, 100]   37
(100, 110]  11
(NaN, NaN]  136197
🕓 7.0 s
📈 86.0 MiB
🕓 4.2 s 🏆 (1.6x)
📈 7.4 MiB 🏆 (11.66x)
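The grouping above can be pictured as two steps: compute the GC percentage of each sequence, then map it to an interval of width 10. The Python sketch below shows one way to do this. The helper names are hypothetical, and both the handling of ambiguous bases and the exact interval boundaries are assumptions of the sketch, not a description of how seqtool defines `gc_percent` or `bin`.

```python
from collections import Counter


def gc_percent(seq):
    """Percentage of G and C among all bases; NaN for an empty sequence."""
    if not seq:
        return float("nan")
    s = seq.upper()
    return 100.0 * (s.count("G") + s.count("C")) / len(s)


def bin_start(value, width=10):
    """Lower bound of the interval of the given width that contains value."""
    if value != value:  # NaN is the only value not equal to itself
        return None
    return int(value // width) * width


seqs = ["ATAT", "ACGT", "GGCC", "GCAT", ""]
counts = Counter(bin_start(gc_percent(s)) for s in seqs)
for start, n in counts.items():
    print(start, n)
```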

sort

Sort by sequence
st sort seq input.fasta > output.fasta
SeqKit 🕓 42.3 s
SeqKit
seqkit sort -s  input.fasta > output.fasta
 messages
[INFO] read sequences ...
[INFO] 2610480 sequences loaded
[INFO] sorting ...
[INFO] output ...
🕓 42.3 s
📈 4595.1 MiB
🕓 13.6 s 🏆 (3.1x)
📈 1771.4 MiB 🏆 (2.59x)
Sort by sequence with ~ 50 MiB memory limit
st sort seq input.fasta -M 50M > output.fasta
 messages
Memory limit reached after 78050 records, writing to temporary file(s). Consider raising the limit (-M/--max-mem) to speed up sorting. Use -q/--quiet to silence this message.
100 MiB memory limit 🕓 20.6 s
100 MiB memory limit
st sort seq input.fasta -M 100M > output.fasta
 messages
Memory limit reached after 155392 records, writing to temporary file(s). Consider raising the limit (-M/--max-mem) to speed up sorting. Use -q/--quiet to silence this message.
🕓 20.6 s
📈 108.7 MiB
🕓 20.3 s 🏆 (1.0x)
📈 58.5 MiB 🏆 (1.86x)
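The messages above indicate that seqtool writes to temporary files once the memory limit is reached. The general technique behind such behaviour is an external merge sort: sort chunks that fit in memory, spill each to a temporary file, then merge the sorted files lazily. The Python sketch below shows the generic technique only. How seqtool implements it is not described here, the function name `external_sort` is hypothetical, and the limit is counted in items rather than bytes for simplicity.

```python
import heapq
import tempfile


def external_sort(items, max_items_in_memory):
    """Sort an iterable of newline-free strings using bounded memory."""
    chunk_files = []
    chunk = []

    def spill():
        # Sort the in-memory chunk and write it to a temporary file.
        chunk.sort()
        f = tempfile.TemporaryFile(mode="w+")
        f.writelines(item + "\n" for item in chunk)
        f.seek(0)
        chunk_files.append(f)
        chunk.clear()

    for item in items:
        chunk.append(item)
        if len(chunk) >= max_items_in_memory:
            spill()
    if chunk:
        spill()

    # Merge the sorted files lazily; only one line per file is held in memory.
    streams = [(line.rstrip("\n") for line in f) for f in chunk_files]
    try:
        yield from heapq.merge(*streams)
    finally:
        for f in chunk_files:
            f.close()


seqs = ["TTGA", "ACGT", "GGCC", "ACGA", "CCAT"]
result = list(external_sort(seqs, max_items_in_memory=2))
print(result)
```

The trade-off visible in the timings is typical for this approach: a lower limit means more temporary files and more merging work, in exchange for a smaller memory footprint.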
Sort by record ID
st sort id input.fasta > output.fasta
SeqKit 🕓 34.2 s
SeqKit
seqkit sort  input.fasta > output.fasta
 messages
[INFO] read sequences ...
[INFO] 2610480 sequences loaded
[INFO] sorting ...
[INFO] output ...
🕓 34.2 s
📈 4436.4 MiB
🕓 6.5 s 🏆 (5.3x)
📈 1119.2 MiB 🏆 (3.96x)
Sort by sequence length
st sort seqlen input.fasta > output.fasta
SeqKit 🕓 33.7 s  ❙ VSEARCH 🕓 9.4 s
SeqKit
seqkit sort -l  input.fasta > output.fasta
 messages
[INFO] read sequences ...
[INFO] 2610480 sequences loaded
[INFO] sorting ...
[INFO] output ...
🕓 33.7 s
📈 4153.5 MiB
VSEARCH
vsearch --sortbylength input.fasta --output output.fasta 
 messages
vsearch v2.28.1_linux_x86_64, 30.6GB RAM, 16 cores
https://github.com/torognes/vsearch
Reading file input.fasta 100%
712939424 nt in 2610480 seqs, min 35, max 301, avg 273
Getting lengths 100%
Sorting 100%
Median length: 301
Writing output 100%
🕓 9.4 s
📈 891.4 MiB 🏆 (1.17x)
🕓 5.9 s 🏆 (1.6x)
📈 1042.4 MiB
Sort sequences by USEARCH/VSEARCH-style abundance annotations
ST_ATTR_FMT=';key=value' st unique seq -a size={n_duplicates} input.fasta |
  st sort '{-attr("size")}' > output.fasta
VSEARCH 🕓 20.4 s
VSEARCH
vsearch --derep_fulllength input.fasta --output - --sizeout |   vsearch --sortbysize - --output output.fasta  
 messages
vsearch v2.28.1_linux_x86_64, 30.6GB RAM, 16 cores
https://github.com/torognes/vsearch
vsearch v2.28.1_linux_x86_64, 30.6GB RAM, 16 cores
https://github.com/torognes/vsearch
Dereplicating file input.fasta 100%
712939424 nt in 2610480 seqs, min 35, max 301, avg 273
Sorting 100%
2134929 unique sequences, avg cluster 1.2, median 1, max 136182
Writing FASTA output fileReading file - 100%
 100%
606287856 nt in 2134929 seqs, min 35, max 301, avg 284
Getting sizes 100%
Sorting 100%
Median abundance: 1
Writing output 100%
🕓 20.4 s 113% CPU
📈 1345.8 MiB 🏆 (1.19x)
🕓 13.3 s 🏆 (1.5x) 110% CPU
📈 1606.5 MiB

unique

Remove duplicate sequences using sequence hashes. This is more memory-efficient and usually faster than keeping the whole sequences in memory.
st unique seqhash input.fasta > output.fasta
SeqKit 🕓 3.3 s 🏆 (1.2x)
SeqKit
seqkit rmdup -sP  input.fasta > output.fasta
 messages
[INFO] 475551 duplicated records removed
🕓 3.3 s 🏆 (1.2x)
📈 180.1 MiB
🕓 4.2 s
📈 117.1 MiB 🏆 (1.54x)
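The memory saving comes from storing a small fixed-size digest per distinct sequence instead of the sequence itself. The Python sketch below illustrates the idea with a 64-bit BLAKE2b digest. The choice of hash function and the function name `unique_by_hash` are assumptions made for illustration, since the text does not state which hash seqtool uses. With short digests, rare collisions can in principle cause a distinct sequence to be dropped.

```python
import hashlib


def unique_by_hash(records, ignore_case=False):
    """Yield (id, seq) records whose sequence digest was not seen before."""
    seen = set()
    for rec_id, seq in records:
        key = seq.upper() if ignore_case else seq
        # Only the 8-byte digest is stored, not the sequence.
        digest = hashlib.blake2b(key.encode("ascii"), digest_size=8).digest()
        if digest in seen:
            continue
        seen.add(digest)
        yield rec_id, seq


records = [("r1", "ACGT"), ("r2", "acgt"), ("r3", "ACGT"), ("r4", "GGCC")]
exact = [rec_id for rec_id, _ in unique_by_hash(records)]
relaxed = [rec_id for rec_id, _ in unique_by_hash(records, ignore_case=True)]
print(exact, relaxed)
```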
Remove duplicate sequences using sequence hashes (case-insensitive).
st unique 'seqhash(true)' input.fasta > output.fasta
VSEARCH 🕓 12.1 s  ❙ SeqKit 🕓 6.2 s
VSEARCH
vsearch --derep_smallmem input.fasta --fastaout output.fasta 
 messages
vsearch v2.28.1_linux_x86_64, 30.6GB RAM, 16 cores
https://github.com/torognes/vsearch
Dereplicating file input.fasta 100%
712939424 nt in 2610480 seqs, min 35, max 301, avg 273
2134929 unique sequences, avg cluster 1.2, median 1, max 136182
Writing FASTA output file 100%
🕓 12.1 s
📈 90.7 MiB 🏆 (1.29x)
SeqKit
seqkit rmdup -sPi  input.fasta > output.fasta
 messages
[INFO] 475551 duplicated records removed
🕓 6.2 s
📈 289.8 MiB
🕓 4.3 s 🏆 (1.4x)
📈 117.2 MiB
Remove duplicate sequences that are exactly identical (case-insensitive), comparing full sequences instead of hashes (requires more memory). VSEARCH additionally treats 'T' and 'U' in the same way (seqtool doesn't).
st unique upper_seq input.fasta > output.fasta
seqtool (sorted by sequence) 🕓 13.5 s  ❙ VSEARCH 🕓 15.8 s
seqtool (sorted by sequence)
st unique -s upper_seq input.fasta > output.fasta
🕓 13.5 s
📈 1640.7 MiB
VSEARCH
vsearch --derep_fulllength input.fasta --output output.fasta 
 messages
vsearch v2.28.1_linux_x86_64, 30.6GB RAM, 16 cores
https://github.com/torognes/vsearch
Dereplicating file input.fasta 100%
712939424 nt in 2610480 seqs, min 35, max 301, avg 273
Sorting 100%
2134929 unique sequences, avg cluster 1.2, median 1, max 136182
Writing FASTA output file 100%
🕓 15.8 s
📈 1345.7 MiB
🕓 5.4 s 🏆 (2.5x)
📈 729.0 MiB 🏆 (1.85x)
Remove duplicate sequences (exact mode) with a memory limit of ~50 MiB
st unique seq -M 50M input.fasta > output.fasta
 messages
Memory limit reached after 151512 records, writing to temporary file(s). Consider raising the limit (-M/--max-mem) to speed up de-duplicating. Use -q/--quiet to silence this message.
🕓 19.5 s
📈 56.6 MiB
Remove duplicate sequences, checking both strands
st unique seqhash_both input.fasta > output.fasta
SeqKit 🕓 14.8 s
SeqKit
seqkit rmdup -s  input.fasta > output.fasta
 messages
[INFO] 475687 duplicated records removed
🕓 14.8 s
📈 293.6 MiB
🕓 7.5 s 🏆 (2.0x)
📈 117.1 MiB 🏆 (2.51x)
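One common way to treat both strands as equivalent is to reduce every sequence to a canonical form, for example the lexicographically smaller of the sequence and its reverse complement, and to de-duplicate on that key. The Python sketch below shows this approach. It is an illustration of the general idea, not a description of what `seqhash_both` does internally, and the function names are hypothetical. Only unambiguous DNA bases are handled.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def revcomp(seq):
    """Reverse complement of an unambiguous DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical(seq):
    """Strand-independent key: the smaller of seq and its reverse complement."""
    return min(seq, revcomp(seq))


def unique_both_strands(records):
    """Yield (id, seq) records not yet seen on either strand."""
    seen = set()
    for rec_id, seq in records:
        key = canonical(seq)
        if key not in seen:
            seen.add(key)
            yield rec_id, seq


# "CGTT" is the reverse complement of "AACG"
records = [("a", "AACG"), ("b", "CGTT"), ("c", "AACC")]
kept = [rec_id for rec_id, _ in unique_both_strands(records)]
print(kept)
```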
Remove duplicate sequences, appending USEARCH/VSEARCH-style abundance annotations to the headers: >id;size=NN
st unique seq -a size={n_duplicates} --attr-fmt ';key=value' input.fasta > output.fasta
VSEARCH 🕓 16.1 s
VSEARCH
vsearch --derep_fulllength input.fasta --sizeout --output output.fasta 
 messages
vsearch v2.28.1_linux_x86_64, 30.6GB RAM, 16 cores
https://github.com/torognes/vsearch
Dereplicating file input.fasta 100%
712939424 nt in 2610480 seqs, min 35, max 301, avg 273
Sorting 100%
2134929 unique sequences, avg cluster 1.2, median 1, max 136182
Writing FASTA output file 100%
🕓 16.1 s
📈 1345.9 MiB 🏆 (1.19x)
🕓 9.3 s 🏆 (1.7x)
📈 1606.2 MiB
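To make the annotation format concrete, the Python sketch below collapses identical sequences and appends `;size=N` to the ID of the first record seen for each sequence. It keeps all distinct sequences in memory, which is in line with the higher memory use of exact de-replication. The function name `dereplicate_with_size` is hypothetical, and details such as where exactly the annotation is placed in headers with descriptions are assumptions.

```python
def dereplicate_with_size(records):
    """Collapse identical sequences into FASTA lines annotated with ';size=N'."""
    counts = {}    # sequence -> number of occurrences
    first_id = {}  # sequence -> ID of the first record carrying it
    for rec_id, seq in records:
        if seq not in counts:
            counts[seq] = 0
            first_id[seq] = rec_id
        counts[seq] += 1
    lines = []
    for seq, n in counts.items():  # dicts preserve insertion order
        lines.append(f">{first_id[seq]};size={n}")
        lines.append(seq)
    return lines


records = [("r1", "ACGT"), ("r2", "GGCC"), ("r3", "ACGT")]
fasta_lines = dereplicate_with_size(records)
print("\n".join(fasta_lines))
```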
De-replicate by both sequence and record ID (the part before the first space in the header). The benchmark input actually has unique sequence IDs, so the result is the same as de-replication by sequence.
st unique id,seq input.fasta > output.fasta
VSEARCH 🕓 17.7 s
VSEARCH
vsearch --derep_id input.fasta --output output.fasta
 messages
vsearch v2.28.1_linux_x86_64, 30.6GB RAM, 16 cores
https://github.com/torognes/vsearch
Dereplicating file input.fasta 100%
712939424 nt in 2610480 seqs, min 35, max 301, avg 273
Sorting 100%
2610480 unique sequences, avg cluster 1.0, median 1, max 1
Writing FASTA output file 100%
🕓 17.7 s
📈 1364.4 MiB
🕓 7.5 s 🏆 (2.3x)
📈 1090.6 MiB 🏆 (1.25x)

filter

Filter sequences by length
st filter 'seqlen >= 100' input.fastq > output.fastq
Seqtk 🕓 6.5 s  ❙ SeqKit 🕓 4.1 s 🏆 (1.3x)
Seqtk
seqtk seq -L 100 input.fastq > output.fastq
🕓 6.5 s
📈 3.5 MiB 🏆 (2.07x)
SeqKit
seqkit seq -m 100 input.fastq > output.fastq
 messages
[WARN] you may switch on flag -g/--remove-gaps to remove spaces
🕓 4.1 s 🏆 (1.3x)
📈 28.1 MiB
🕓 5.4 s
📈 7.2 MiB
Filter sequences by the total expected error as calculated from the quality scores
st filter 'exp_err <= 1' input.fastq --to-fa > output.fasta
VSEARCH 🕓 32.9 s  ❙ USEARCH 🕓 16.0 s 🏆 (1.7x)
VSEARCH
vsearch --fastq_filter input.fastq --fastq_maxee 1 --fastaout output.fasta 
 messages
vsearch v2.28.1_linux_x86_64, 30.6GB RAM, 16 cores
https://github.com/torognes/vsearch
Reading input file 100%
1408755 sequences kept (of which 0 truncated), 1201725 sequences discarded.
🕓 32.9 s
📈 4.4 MiB 🏆 (1.66x)
USEARCH
usearch -fastq_filter input.fastq -fastq_maxee 1 -fastaout output.fasta
🟦 output
usearch v11.0.667_i86linux32, 4.0Gb RAM (32.1Gb total), 16 cores
(C) Copyright 2013-18 Robert C. Edgar, all rights reserved.
https://drive5.com/usearch
License: personal use only
 messages
00:00 4.2Mb  FASTQ base 33 for file input.fastq
00:00 38Mb   CPU has 16 cores, defaulting to 10 threads
00:00 115Mb     0.1% Filtering
00:01 123Mb     1.0% Filtering, 31.4% passed
00:02 123Mb     8.7% Filtering, 31.5% passed
00:03 123Mb    16.4% Filtering, 31.8% passed
00:04 123Mb    22.1% Filtering, 40.1% passed
00:05 123Mb    26.7% Filtering, 47.6% passed
00:06 123Mb    31.5% Filtering, 52.6% passed
00:07 123Mb    36.4% Filtering, 56.2% passed
00:08 123Mb    41.3% Filtering, 59.1% passed
00:09 123Mb    47.2% Filtering, 60.1% passed
00:10 123Mb    53.5% Filtering, 60.1% passed
00:11 123Mb    61.1% Filtering, 56.6% passed
00:12 123Mb    68.7% Filtering, 53.5% passed
00:13 123Mb    75.4% Filtering, 53.7% passed
00:14 123Mb    83.4% Filtering, 51.4% passed
00:15 123Mb    89.4% Filtering, 52.2% passed
00:16 123Mb    95.1% Filtering, 53.2% passed
00:16 90Mb    100.0% Filtering, 54.0% passed
   2610480  Reads (2.6M)
   1201725  Discarded reads with expected errs > 1.00
   1408755  Filtered reads (1.4M, 54.0%)
🕓 16.0 s 🏆 (1.7x) 997% CPU
📈 34.9 MiB
🕓 27.9 s
📈 7.2 MiB
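The total expected error of a read is the sum of the per-base error probabilities, where a Phred score Q corresponds to an error probability of 10^(-Q/10). The Python sketch below computes it from a quality string, assuming Phred+33 encoding. The function name `expected_errors` is hypothetical.

```python
def expected_errors(qual, offset=33):
    """Sum of per-base error probabilities for a FASTQ quality string."""
    return sum(10 ** (-(ord(c) - offset) / 10) for c in qual)


reads = {
    "good": "IIII",  # four bases at Q40
    "fair": "++++",  # four bases at Q10
    "bad": "!!",     # two bases at Q0
}
passing = [name for name, qual in reads.items() if expected_errors(qual) <= 1]
print(passing)
```

A filter such as `exp_err <= 1` then keeps only reads whose summed error probability does not exceed one.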
Select records from a large set of sequences given a list of 1000 sequence IDs
st filter -m ids_list.txt 'has_meta()' input.fasta > output.fasta
VSEARCH 🕓 28.1 s  ❙ SeqKit 🕓 1.0 s 🏆 (1.6x)
VSEARCH
vsearch --fastx_getseqs input.fasta --labels ids_list.txt --fastaout output.fasta
 messages
vsearch v2.28.1_linux_x86_64, 30.6GB RAM, 16 cores
https://github.com/torognes/vsearch
Reading labels 100%
Extracting sequences 100%
1000 of 2610480 sequences extracted (0.0%)
🕓 28.1 s
📈 4.2 MiB 🏆 (1.85x)
SeqKit
seqkit grep -f ids_list.txt input.fasta > output.fasta
 messages
[INFO] 1000 patterns loaded from file
🕓 1.0 s 🏆 (1.6x)
📈 21.8 MiB
🕓 1.6 s
📈 7.9 MiB
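Selecting records by an ID list can be done in a single pass when the IDs are held in a hash set, so that each lookup takes roughly constant time regardless of the list length. The Python sketch below shows this. The function name `filter_by_ids` is hypothetical, the record ID is taken as the part of the header before the first space, and no claim is made about how the benchmarked tools implement the lookup.

```python
def filter_by_ids(records, wanted_ids):
    """Keep (header, seq) records whose ID is contained in wanted_ids."""
    wanted = set(wanted_ids)
    kept = []
    for header, seq in records:
        rec_id = header.split(" ", 1)[0]  # ID = text before the first space
        if rec_id in wanted:
            kept.append((header, seq))
    return kept


records = [("read1 sample=A", "ACGT"), ("read2", "GGCC"), ("read3 x", "TTAA")]
selected = filter_by_ids(records, ["read3", "read1", "missing"])
print(selected)
```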

sample

Random subsampling to 1000 sequences
st sample -n 1000 input.fasta > output.fasta
VSEARCH 🕓 4.3 s  ❙ Seqtk 🕓 0.8 s  ❙ SeqKit 🕓 11.5 s
VSEARCH
vsearch --fastx_subsample input.fasta --sample_size 1000 --fastaout output.fasta
 messages
vsearch v2.28.1_linux_x86_64, 30.6GB RAM, 16 cores
https://github.com/torognes/vsearch
Reading file input.fasta 100%
712939424 nt in 2610480 seqs, min 35, max 301, avg 273
Got 2610480 reads from 2610480 amplicons
Subsampling 100%
Writing output 100%
Subsampled 1000 reads from 1000 amplicons
🕓 4.3 s
📈 841.5 MiB
Seqtk
seqtk sample input.fasta 1000 > output.fasta
🕓 0.8 s
📈 3.5 MiB 🏆 (2.07x)
SeqKit
seqkit sample -n 1000 input.fasta > output.fasta
 messages
[INFO] sample by number
[INFO] loading all sequences into memory...
[INFO] 1000 sequences outputted
🕓 11.5 s
📈 3112.7 MiB
🕓 0.5 s 🏆 (1.4x)
📈 7.2 MiB
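Drawing a fixed number of records does not require loading the whole input. Reservoir sampling (Algorithm R) keeps only the current sample in memory while reading the input once. The Python sketch below shows this generic technique. Whether any of the benchmarked tools uses it is not stated here, and the function name `reservoir_sample` is hypothetical.

```python
import random


def reservoir_sample(items, k, seed=None):
    """Return k items chosen uniformly at random from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(items):
        if i < k:
            reservoir.append(item)  # fill the reservoir first
        else:
            j = rng.randint(0, i)   # both bounds inclusive
            if j < k:
                reservoir[j] = item  # replace with probability k / (i + 1)
    return reservoir


stream = (f"read{i}" for i in range(10_000))
sample = reservoir_sample(stream, k=1000, seed=42)
print(len(sample))
```

Note that the sample is not returned in input order. Restoring that order would need an extra sorting step on the record index.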
Random subsampling to ~10% of sequences
st sample -p 0.1 input.fasta > output.fasta
Seqtk 🕓 1.7 s  ❙ SeqKit 🕓 2.0 s
Seqtk
seqtk sample input.fastq 0.1 > output.fasta
🕓 1.7 s
📈 3.5 MiB 🏆 (2.04x)
SeqKit
seqkit sample -p 0.1 input.fastq > output.fasta
 messages
[INFO] sample by proportion
[INFO] 260463 sequences outputted
🕓 2.0 s
📈 27.6 MiB
🕓 0.8 s 🏆 (2.2x)
📈 7.1 MiB

find

Find the forward primer location in the input reads with up to 4 mismatches
st find -D4 file:primers.fasta input.fastq -a primer={pattern_name} -a rng={match_range} > output.fastq
 messages
Note: the sequence type of the pattern 'ITS4' was determined as 'dna'. If incorrect, please provide the correct type with `--seqtype`. Use `-q/--quiet` to suppress this message.
st (4 threads) 🕓 6.0 s 🏆 (3.5x)  ❙ st (max. mismatches = 2) 🕓 21.1 s  ❙ st (max. mismatches = 8) 🕓 26.7 s
st (4 threads)
st find -t4 -D4 file:primers.fasta input.fastq -a primer={pattern_name} -a rng={match_range} > output.fastq
 messages
Note: the sequence type of the pattern 'ITS4' was determined as 'dna'. If incorrect, please provide the correct type with `--seqtype`. Use `-q/--quiet` to suppress this message.
🕓 6.0 s 🏆 (3.5x) 402% CPU
📈 17.6 MiB
st (max. mismatches = 2)
st find -D2 file:primers.fasta input.fastq -a primer={pattern_name} -a rng={match_range} > output.fastq
 messages
Note: the sequence type of the pattern 'ITS4' was determined as 'dna'. If incorrect, please provide the correct type with `--seqtype`. Use `-q/--quiet` to suppress this message.
🕓 21.1 s
📈 7.5 MiB
st (max. mismatches = 8)
st find -D8 file:primers.fasta input.fastq -a primer={pattern_name} -a rng={match_range} > output.fastq
 messages
Note: the sequence type of the pattern 'ITS4' was determined as 'dna'. If incorrect, please provide the correct type with `--seqtype`. Use `-q/--quiet` to suppress this message.
🕓 26.7 s
📈 7.4 MiB
🕓 21.3 s
📈 7.4 MiB 🏆 (1.00x)
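Approximate pattern search of this kind is classically solved with Sellers' algorithm, an edit-distance dynamic program in which a match may start at any position of the text. The Python sketch below is a plain, unoptimized version of that algorithm. It assumes edit distance as the error measure and is not seqtool's implementation, which is likely far more optimized. The function name `find_approx` is hypothetical.

```python
def find_approx(pattern, text, max_dist):
    """Return (end, distance) of the best match of pattern in text, or None.

    `end` is the exclusive end index of the match in `text`. Among matches
    with the same distance, the leftmost end position is reported.
    """
    m = len(pattern)
    # col[i] = minimum edit distance between pattern[:i] and any substring
    # of the text ending at the current text position
    col = list(range(m + 1))
    best = None
    for j, ch in enumerate(text, start=1):
        prev_diag = col[0]
        col[0] = 0  # a match may start anywhere in the text
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            new = min(prev_diag + cost,  # match or substitution
                      col[i] + 1,        # extra character in the text
                      col[i - 1] + 1)    # missing pattern character
            prev_diag = col[i]
            col[i] = new
        if col[m] <= max_dist and (best is None or col[m] < best[1]):
            best = (j, col[m])
    return best


exact_hit = find_approx("GATTACA", "TTTGATTACATTT", max_dist=0)
fuzzy_hit = find_approx("GATTACA", "CCGATCACACC", max_dist=1)
print(exact_hit, fuzzy_hit)
```

The running time of this version is proportional to pattern length times text length and does not depend on the allowed distance.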
Find and trim the forward primer up to an error rate (edit distance) of 20%, discarding unmatched reads. Note: Unlike Cutadapt, seqtool currently does not offer ungapped alignments (--no-indels).
st find -f file:primers.fasta -R 0.2 input.fastq -a primer={pattern_name} -a end={match_end} |
  st trim -e '{attr(end)}:' --fq > output.fastq
 messages
Note: the sequence type of the pattern 'ITS4' was determined as 'dna'. If incorrect, please provide the correct type with `--seqtype`. Use `-q/--quiet` to suppress this message.
Cutadapt 🕓 67.1 s
Cutadapt
cutadapt -g 'file:primers.fasta;min_overlap=15' input.fastq -e 0.2 --rename '{id} primer={adapter_name}' --discard-untrimmed > output.fastq 
 messages
This is cutadapt 4.6 with Python 3.12.2
Command line parameters: -g file:primers.fasta;min_overlap=15 input.fastq -e 0.2 --rename {id} primer={adapter_name} --discard-untrimmed
Processing single-end reads on 1 core ...
Finished in 66.906 s (25.630 µs/read; 2.34 M reads/minute).
=== Summary ===
Total reads processed:               2,610,480
Reads with adapters:                   828,740 (31.7%)
== Read fate breakdown ==
Reads discarded as untrimmed:        1,781,740 (68.3%)
Reads written (passing filters):       828,740 (31.7%)
Total basepairs processed:   712,939,424 bp
Total written (filtered):    209,047,405 bp (29.3%)
=== Adapter ITS4 ===
Sequence: GTCCTCCGCTTATTGATATGC; Type: regular 5'; Length: 21; Trimmed: 828740 times
Minimum overlap: 15
No. of allowed errors:
1-4 bp: 0; 5-9 bp: 1; 10-14 bp: 2; 15-19 bp: 3; 20-21 bp: 4
Overview of removed sequences
length  count   expect  max.err error counts
15  8   0.0 3   3 1 3 1
16  12  0.0 3   1 3 4 4
17  7   0.0 3   3 0 0 4
18  11  0.0 3   2 6 1 2
19  12  0.0 3   1 2 6 1 2
20  15  0.0 4   3 5 3 2 2
21  29  0.0 4   2 11 4 2 10
22  73  0.0 4   5 23 8 15 22
23  221 0.0 4   10 46 39 53 73
24  723 0.0 4   27 96 180 381 39
25  8858    0.0 4   439 2961 4797 468 193
26  816649  0.0 4   202089 581641 27831 3348 1740
27  1926    0.0 4   184 840 797 74 31
28  33  0.0 4   4 22 2 3 2
29  15  0.0 4   1 11 1 1 1
30  4   0.0 4   1 3
31  1   0.0 4   1
32  3   0.0 4   2 1
33  1   0.0 4   1
34  1   0.0 4   1
35  2   0.0 4   0 2
40  2   0.0 4   0 2
41  2   0.0 4   0 2
42  3   0.0 4   1 2
45  1   0.0 4   0 1
47  1   0.0 4   0 0 0 0 1
48  1   0.0 4   1
51  6   0.0 4   0 0 0 0 6
54  1   0.0 4   0 0 0 0 1
58  16  0.0 4   0 0 0 0 16
59  2   0.0 4   0 1 0 0 1
60  2   0.0 4   0 0 0 0 2
61  20  0.0 4   0 1 0 0 19
62  1   0.0 4   0 0 0 0 1
63  12  0.0 4   0 1 0 1 10
64  2   0.0 4   0 0 0 0 2
66  2   0.0 4   0 0 0 1 1
67  24  0.0 4   0 0 3 5 16
68  4   0.0 4   0 0 1 0 3
69  1   0.0 4   0 0 0 0 1
85  2   0.0 4   0 2
86  5   0.0 4   1 3 0 0 1
105 4   0.0 4   0 0 0 0 4
138 1   0.0 4   0 0 0 0 1
190 2   0.0 4   0 0 0 0 2
203 1   0.0 4   0 0 0 0 1
226 2   0.0 4   0 0 0 0 2
227 1   0.0 4   0 0 0 0 1
228 3   0.0 4   0 0 0 0 3
230 1   0.0 4   0 0 0 0 1
247 1   0.0 4   0 0 0 0 1
249 1   0.0 4   0 0 0 0 1
251 5   0.0 4   0 0 0 0 5
252 1   0.0 4   0 0 0 0 1
255 1   0.0 4   0 0 0 0 1
258 1   0.0 4   0 0 0 0 1
290 1   0.0 4   0 0 0 0 1
🕓 67.1 s
📈 20.9 MiB
🕓 16.9 s 🏆 (4.0x) 120% CPU
📈 7.4 MiB 🏆 (2.83x)
Find and trim the forward primer in parallel using 4 threads (cores).
st find -f file:primers.fasta -R 0.2 -t4 input.fastq -a primer={pattern_name} -a end={match_end} |
  st trim -e '{attr(end)}:' --fq > output.fastq
 messages
Note: the sequence type of the pattern 'ITS4' was determined as 'dna'. If incorrect, please provide the correct type with `--seqtype`. Use `-q/--quiet` to suppress this message.
Cutadapt 🕓 18.1 s
Cutadapt
cutadapt -j4 -g 'file:primers.fasta;min_overlap=15' input.fastq -e 0.2 --rename '{id} primer={adapter_name}' --discard-untrimmed > output.fastq 
 messages
This is cutadapt 4.6 with Python 3.12.2
Command line parameters: -j4 -g file:primers.fasta;min_overlap=15 input.fastq -e 0.2 --rename {id} primer={adapter_name} --discard-untrimmed
Processing single-end reads on 4 cores ...
Finished in 17.956 s (6.878 µs/read; 8.72 M reads/minute).
=== Summary ===
Total reads processed:               2,610,480
Reads with adapters:                   828,740 (31.7%)
== Read fate breakdown ==
Reads discarded as untrimmed:        1,781,740 (68.3%)
Reads written (passing filters):       828,740 (31.7%)
Total basepairs processed:   712,939,424 bp
Total written (filtered):    209,047,405 bp (29.3%)
=== Adapter ITS4 ===
Sequence: GTCCTCCGCTTATTGATATGC; Type: regular 5'; Length: 21; Trimmed: 828740 times
Minimum overlap: 15
No. of allowed errors:
1-4 bp: 0; 5-9 bp: 1; 10-14 bp: 2; 15-19 bp: 3; 20-21 bp: 4
Overview of removed sequences
length  count   expect  max.err error counts
15  8   0.0 3   3 1 3 1
16  12  0.0 3   1 3 4 4
17  7   0.0 3   3 0 0 4
18  11  0.0 3   2 6 1 2
19  12  0.0 3   1 2 6 1 2
20  15  0.0 4   3 5 3 2 2
21  29  0.0 4   2 11 4 2 10
22  73  0.0 4   5 23 8 15 22
23  221 0.0 4   10 46 39 53 73
24  723 0.0 4   27 96 180 381 39
25  8858    0.0 4   439 2961 4797 468 193
26  816649  0.0 4   202089 581641 27831 3348 1740
27  1926    0.0 4   184 840 797 74 31
28  33  0.0 4   4 22 2 3 2
29  15  0.0 4   1 11 1 1 1
30  4   0.0 4   1 3
31  1   0.0 4   1
32  3   0.0 4   2 1
33  1   0.0 4   1
34  1   0.0 4   1
35  2   0.0 4   0 2
40  2   0.0 4   0 2
41  2   0.0 4   0 2
42  3   0.0 4   1 2
45  1   0.0 4   0 1
47  1   0.0 4   0 0 0 0 1
48  1   0.0 4   1
51  6   0.0 4   0 0 0 0 6
54  1   0.0 4   0 0 0 0 1
58  16  0.0 4   0 0 0 0 16
59  2   0.0 4   0 1 0 0 1
60  2   0.0 4   0 0 0 0 2
61  20  0.0 4   0 1 0 0 19
62  1   0.0 4   0 0 0 0 1
63  12  0.0 4   0 1 0 1 10
64  2   0.0 4   0 0 0 0 2
66  2   0.0 4   0 0 0 1 1
67  24  0.0 4   0 0 3 5 16
68  4   0.0 4   0 0 1 0 3
69  1   0.0 4   0 0 0 0 1
85  2   0.0 4   0 2
86  5   0.0 4   1 3 0 0 1
105 4   0.0 4   0 0 0 0 4
138 1   0.0 4   0 0 0 0 1
190 2   0.0 4   0 0 0 0 2
203 1   0.0 4   0 0 0 0 1
226 2   0.0 4   0 0 0 0 2
227 1   0.0 4   0 0 0 0 1
228 3   0.0 4   0 0 0 0 3
230 1   0.0 4   0 0 0 0 1
247 1   0.0 4   0 0 0 0 1
249 1   0.0 4   0 0 0 0 1
251 5   0.0 4   0 0 0 0 5
252 1   0.0 4   0 0 0 0 1
255 1   0.0 4   0 0 0 0 1
258 1   0.0 4   0 0 0 0 1
290 1   0.0 4   0 0 0 0 1
🕓 18.1 s 413% CPU
📈 39.4 MiB
🕓 4.9 s 🏆 (3.7x) 448% CPU
📈 17.8 MiB 🏆 (2.22x)

replace

Convert DNA to RNA using the replace command
st replace T U input.fasta > output.fasta
st find 🕓 14.3 s  ❙ SeqKit 🕓 4.8 s 🏆 (2.1x)  ❙ FASTX-Toolkit 🕓 283.5 s
st find
st find T --rep U input.fasta > output.fasta
 messages
Note: the sequence type of the pattern was determined as 'dna'. If incorrect, please provide the correct type with `--seqtype`. Use `-q/--quiet` to suppress this message.
🕓 14.3 s
📈 7.2 MiB
SeqKit
seqkit seq --dna2rna  input.fasta > output.fasta
🕓 4.8 s 🏆 (2.1x)
📈 27.3 MiB
FASTX-Toolkit
fasta_nucleotide_changer -r -i input.fasta > output.fasta
🕓 283.5 s
📈 3.5 MiB 🏆 (2.07x)
🕓 10.1 s
📈 7.2 MiB
Convert DNA to RNA using 4 threads
st replace -t4 T U input.fasta > output.fasta
st find 🕓 8.4 s
st find
st find -t4 T --rep U input.fasta > output.fasta
 messages
Note: the sequence type of the pattern was determined as 'dna'. If incorrect, please provide the correct type with `--seqtype`. Use `-q/--quiet` to suppress this message.
🕓 8.4 s 282% CPU
📈 24.6 MiB
🕓 2.7 s 🏆 (3.1x) 418% CPU
📈 9.0 MiB 🏆 (2.74x)

trim

Trim the leading 99 bp from the sequences
st trim 100: input.fasta > output.fasta
SeqKit (creates FASTA index) 🕓 44.8 s
SeqKit (creates FASTA index)
seqkit subseq -r '100:-1'  input.fasta > output.fasta
 messages
[INFO] create or read FASTA index ...
[INFO] create FASTA index for input.fasta
[INFO]   2610480 records loaded from input.fasta.seqkit.fai
🕓 44.8 s
📈 1254.5 MiB
🕓 2.8 s 🏆 (16.0x)
📈 7.4 MiB 🏆 (170.10x)
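The range `100:` uses 1-based, inclusive coordinates, which is why it removes the leading 99 bases. The Python sketch below shows how such a range maps to a zero-based, end-exclusive slice. It handles only positive coordinates and open ends, which is a simplifying assumption, and the function name `trim` is hypothetical.

```python
def trim(seq, start=None, end=None):
    """Keep the 1-based, inclusive range start..end of seq (None = open end)."""
    lo = 0 if start is None else start - 1  # 1-based start -> 0-based index
    hi = len(seq) if end is None else end   # inclusive end -> exclusive end
    return seq[lo:hi]


seq = "ABCDEFGHIJ"
tail = trim(seq, start=4)          # drops the leading 3 characters
middle = trim(seq, start=2, end=5)
print(tail, middle)
```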

upper

Convert sequences to uppercase
st upper input.fasta > output.fasta
Seqtk 🕓 5.2 s  ❙ SeqKit 🕓 4.2 s
Seqtk
seqtk seq -U input.fasta > output.fasta
🕓 5.2 s
📈 3.5 MiB 🏆 (2.11x)
SeqKit
seqkit seq -u  input.fasta > output.fasta
🕓 4.2 s
📈 62.2 MiB
🕓 3.0 s 🏆 (1.4x)
📈 7.4 MiB

revcomp

Reverse complement sequences
st revcomp input.fasta > output.fasta
Seqtk 🕓 5.3 s 🏆 (1.1x)  ❙ VSEARCH 🕓 7.7 s  ❙ SeqKit 🕓 7.8 s
Seqtk
seqtk seq -r input.fasta > output.fasta
🕓 5.3 s 🏆 (1.1x)
📈 3.5 MiB 🏆 (1.21x)
VSEARCH
vsearch --fastx_revcomp input.fasta --fastaout output.fasta 
 messages
vsearch v2.28.1_linux_x86_64, 30.6GB RAM, 16 cores
https://github.com/torognes/vsearch
Reading FASTA file 100%
🕓 7.7 s
📈 4.2 MiB
SeqKit
seqkit seq -rp  input.fasta > output.fasta
 messages
[WARN] flag -t (--seq-type) (DNA/RNA) is recommended for computing complement sequences
🕓 7.8 s
📈 28.1 MiB
🕓 6.0 s
📈 7.2 MiB
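Reverse complementing amounts to a per-character lookup followed by a reversal. The Python sketch below uses a translation table that also covers the IUPAC ambiguity codes. How each benchmarked tool treats ambiguity codes, RNA or lowercase letters is not stated here, so the table is an assumption of the sketch, and the function name `revcomp` is hypothetical.

```python
_IUPAC = "ACGTRYSWKMBDHVN"
_IUPAC_COMP = "TGCAYRSWMKVHDBN"
# Map uppercase and lowercase letters, preserving case.
_TABLE = str.maketrans(_IUPAC + _IUPAC.lower(), _IUPAC_COMP + _IUPAC_COMP.lower())


def revcomp(seq):
    """Reverse complement of a DNA sequence, including IUPAC ambiguity codes."""
    return seq.translate(_TABLE)[::-1]


rc = revcomp("ACGTRN")
print(rc)
```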

concat

Concatenate sequences, adding an NNNNN spacer in between
st concat -s 5 -c N file1.fastq file2.fastq > output.fastq
VSEARCH 🕓 20.5 s
VSEARCH
vsearch --fastq_join file1.fastq --reverse file2.fastq --join_padgap NNNNN --fastqout output.fastq
 messages
vsearch v2.28.1_linux_x86_64, 30.6GB RAM, 16 cores
https://github.com/torognes/vsearch
Joining reads 100%
2610480 pairs joined
🕓 20.5 s
📈 4.2 MiB 🏆 (1.74x)
🕓 9.9 s 🏆 (2.1x)
📈 7.4 MiB
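Conceptually, the task pairs up the records of the two files and joins each pair with a spacer. For FASTQ, the spacer also needs quality characters so that sequence and quality strings keep the same length. The Python sketch below illustrates this for a single pair. The function name `concat_records` is hypothetical, and the spacer quality character is an arbitrary placeholder, since the text does not say which quality value the tools assign to the spacer.

```python
def concat_records(rec1, rec2, spacer_len=5, spacer_char="N", spacer_qual="I"):
    """Join two (id, seq, qual) records with a spacer, keeping the first ID."""
    id1, seq1, qual1 = rec1
    _, seq2, qual2 = rec2
    seq = seq1 + spacer_char * spacer_len + seq2
    # spacer_qual is a placeholder value chosen for this sketch
    qual = qual1 + spacer_qual * spacer_len + qual2
    return id1, seq, qual


joined = concat_records(("read1", "ACGT", "FFFF"), ("read1", "GGCC", "EEEE"))
print(joined)
```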