Comparison of tools

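The following list shows the execution time, memory footprint and CPU usage of seqtool v0.4.0-beta on a selection of tasks, compared with other tools where available: SeqKit, Seqtk, FASTX-Toolkit, VSEARCH, USEARCH and Cutadapt, plus general-purpose compressors (gzip, pigz, zstd, lz4).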

Details on the approach are found here. The input file is a FASTQ or FASTA file containing 2.6 M reads (Illumina MiSeq, 300 bp). The comparison was run on a Ryzen 4750U CPU with frequency boost disabled, writing files to RAM instead of to disk.

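The fastest and most memory-efficient commands are marked with '🏆', followed by a factor indicating how many times faster they are, or how much less memory they use, than the second-ranked command. For each task, the seqtool command is listed first, followed by the alternative commands with their results; the time and memory usage of the seqtool command itself appear last.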
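As a cross-check, the factor printed next to the trophy symbol can be recomputed from the listed values. The following is a minimal sketch, assuming the factor is simply the second-best value divided by the best one; `winner_factor` is a hypothetical helper name and not part of seqtool.

```python
def winner_factor(measurements):
    """Return (winner, factor), where factor = second-best value / best value.

    `measurements` maps a tool name to a time in seconds or a peak memory
    in MiB; lower is better.
    """
    ranked = sorted(measurements.items(), key=lambda item: item[1])
    if len(ranked) < 2:
        raise ValueError("need at least two measurements to compare")
    (best_name, best), (_, second) = ranked[0], ranked[1]
    return best_name, second / best


# Timings of the 'trim' task in this list: seqtool 2.8 s, SeqKit 44.8 s
name, factor = winner_factor({"seqtool": 2.8, "SeqKit": 44.8})
print(name, f"{factor:.1f}x")
```

The listed factors were presumably computed from unrounded measurements, so recomputing them from the rounded values shown here can differ in the last digit.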

pass

Do nothing, just read and write FASTA
st pass input.fasta > output.fasta
SeqKit 🕓 2.2 s 🏆 (1.2x)
SeqKit
seqkit seq  input.fasta > output.fasta
🕓 2.2 s 🏆 (1.2x) 106% CPU
📈 18.0 MiB
🕓 2.6 s
📈 7.1 MiB 🏆 (2.53x)
Convert FASTQ to FASTA
st pass --to-fa input.fastq > output.fasta
FASTX-Toolkit 🕓 287.9 s; Seqtk 🕓 4.3 s; SeqKit 🕓 3.1 s
FASTX-Toolkit
fastq_to_fasta -Q33 -i input.fastq > output.fasta
🕓 287.9 s
📈 3.5 MiB 🏆 (1.00x)
Seqtk
seqtk seq -A input.fastq > output.fasta
🕓 4.3 s
📈 3.5 MiB
SeqKit
seqkit fq2fa input.fastq > output.fasta
🕓 3.1 s
📈 18.4 MiB
🕓 3.1 s 🏆 (1.0x)
📈 7.1 MiB
Convert FASTQ quality scores
st pass --to fastq-illumina input.fastq > output.fastq
VSEARCH 🕓 12.9 s; SeqKit 🕓 48.8 s
VSEARCH
vsearch --fastq_convert input.fastq --fastq_asciiout 64 --fastqout output.fastq
 messages
vsearch v2.28.1_linux_x86_64, 30.6GB RAM, 16 cores
https://github.com/torognes/vsearch
Reading FASTQ file 100%
🕓 12.9 s
📈 4.2 MiB 🏆 (1.65x)
SeqKit
seqkit convert --from 'Sanger' --to 'Illumina-1.3+' input.fastq > output.fastq
 messages
[INFO] converting Sanger -> Illumina-1.3+
🕓 48.8 s
📈 47.9 MiB
🕓 7.4 s 🏆 (1.8x)
📈 7.0 MiB
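The quality score conversion benchmarked above (Sanger to Illumina 1.3+) amounts to shifting the ASCII offset of each quality character from 33 to 64. A minimal sketch of that re-encoding follows; the function name is made up for illustration, and capping scores at 62 (so that the result stays within printable ASCII) is an assumption of this sketch, not documented behaviour of any of the tools.

```python
def sanger_to_illumina13(qual: str) -> str:
    """Re-encode a Phred+33 (Sanger) quality string as Phred+64 (Illumina 1.3+)."""
    out = []
    for ch in qual:
        phred = ord(ch) - 33          # decode using the Sanger offset
        if phred < 0:
            raise ValueError(f"invalid Phred+33 character: {ch!r}")
        phred = min(phred, 62)        # assumption: cap so that chr(phred + 64) <= '~'
        out.append(chr(phred + 64))   # encode using the Illumina 1.3+ offset
    return "".join(out)


# '!' is Q0 and 'I' is Q40 in Phred+33; they become '@' and 'h' in Phred+64
print(sanger_to_illumina13("!I"))
```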
Write compressed FASTQ files in GZIP format
st pass input.fastq -o output.fastq.gz
SeqKit 🕓 30.3 s 🏆 (1.3x); seqtool | gzip 🕓 159.1 s; gzip directly 🕓 158.6 s; pigz directly (4 threads) 🕓 39.0 s
SeqKit
seqkit seq input.fastq -o output.fastq.gz
🕓 30.3 s 🏆 (1.3x)
📈 37.5 MiB
seqtool | gzip
st pass input.fastq | gzip -c > output.fastq.gz
🕓 159.1 s
📈 7.2 MiB
gzip directly
gzip -kf input.fastq
🕓 158.6 s
📈 3.5 MiB 🏆 (1.21x)
pigz directly (4 threads)
pigz -p4 -kf input.fastq
🕓 39.0 s 405% CPU
📈 4.2 MiB
🕓 55.8 s
📈 27.5 MiB
Write compressed FASTQ files in Zstandard format
st pass input.fastq -o output.fastq.zst
seqtool | zstd piped 🕓 12.8 s 🏆 (1.2x)
seqtool | zstd piped
st pass input.fastq | zstd -c > output.fastq.zst
🕓 12.8 s 🏆 (1.2x) 147% CPU
📈 38.8 MiB
🕓 15.5 s 114% CPU
📈 11.0 MiB 🏆 (3.52x)
Write compressed FASTQ files in Lz4 format
st pass input.fastq -o output.fastq.lz4
seqtool | lz4 piped 🕓 9.9 s
seqtool | lz4 piped
st pass input.fastq | lz4 -c > output.fastq.lz4
🕓 9.9 s 116% CPU
📈 7.4 MiB 🏆 (3.75x)
🕓 9.4 s 🏆 (1.1x) 116% CPU
📈 27.6 MiB

count

Count the number of FASTQ sequences in the input
st count input.fastq
🟦 output
2610480
Seqtk 🕓 0.7 s
Seqtk
seqtk size input.fasta
🟦 output
2610480 712939424
🕓 0.7 s
📈 3.4 MiB 🏆 (2.11x)
🕓 0.6 s 🏆 (1.2x)
📈 7.1 MiB
Count the number of FASTQ sequences, grouped by GC content (in 10% intervals)
st count -k 'bin(gc_percent, 10)' input.fastq
🟦 output
(10, 20]    16
(20, 30]    3004
(30, 40]    51945
(40, 50]    1149946
(50, 60]    1248702
(60, 70]    20439
(70, 80]    120
(80, 90]    63
(90, 100]   37
(100, 110]  11
(NaN, NaN]  136197
st with math expression 🕓 7.0 s
st with math expression
st count -k '{bin(gc_percent/100*100, 10)}' input.fastq
🟦 output
(10, 20]    16
(20, 30]    3004
(30, 40]    51945
(40, 50]    1149946
(50, 60]    1248702
(60, 70]    20439
(70, 80]    120
(80, 90]    63
(90, 100]   37
(100, 110]  11
(NaN, NaN]  136197
🕓 7.0 s
📈 86.0 MiB
🕓 4.2 s 🏆 (1.6x)
📈 7.4 MiB 🏆 (11.66x)
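Counting by binned GC content, as in the task above, can be sketched as follows. This is only an illustration of the idea: the function names are hypothetical, and the bin labels and boundary handling here are assumptions that do not reproduce seqtool's exact interval notation.

```python
import math
from collections import Counter


def gc_percent(seq):
    """Percentage of G and C among all bases; NaN for an empty sequence."""
    if not seq:
        return math.nan
    s = seq.upper()
    return 100.0 * (s.count("G") + s.count("C")) / len(s)


def bin_start(value, width):
    """Lower bound of the fixed-width interval containing `value` (NaN stays NaN)."""
    if math.isnan(value):
        return math.nan
    return math.floor(value / width) * width


def count_by_gc(seqs, width=10):
    """Count sequences per GC-content bin; keys are 'start-end' labels or 'NaN'."""
    counts = Counter()
    for seq in seqs:
        start = bin_start(gc_percent(seq), width)
        key = "NaN" if math.isnan(start) else f"{start}-{start + width}"
        counts[key] += 1
    return counts


print(count_by_gc(["ACGT", "GGCC", "ATAT", ""]))
```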

sort

Sort by sequence
st sort seq input.fasta > output.fasta
SeqKit 🕓 42.3 s
SeqKit
seqkit sort -s  input.fasta > output.fasta
 messages
[INFO] read sequences ...
[INFO] 2610480 sequences loaded
[INFO] sorting ...
[INFO] output ...
🕓 42.3 s
📈 4595.1 MiB
🕓 13.6 s 🏆 (3.1x)
📈 1771.4 MiB 🏆 (2.59x)
Sort by sequence with ~ 50 MiB memory limit
st sort seq input.fasta -M 50M > output.fasta
 messages
Memory limit reached after 78050 records, writing to temporary file(s). Consider raising the limit (-M/--max-mem) to speed up sorting. Use -q/--quiet to silence this message.
100 MiB memory limit 🕓 20.6 s
100 MiB memory limit
st sort seq input.fasta -M 100M > output.fasta
 messages
Memory limit reached after 155392 records, writing to temporary file(s). Consider raising the limit (-M/--max-mem) to speed up sorting. Use -q/--quiet to silence this message.
🕓 20.6 s
📈 108.7 MiB
🕓 20.3 s 🏆 (1.0x)
📈 58.5 MiB 🏆 (1.86x)
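Spilling to temporary files once a memory limit is reached, as reported in the messages above, is the general idea behind an external merge sort. The sketch below illustrates that technique only and is not seqtool's implementation; it assumes the limit is given as a number of records rather than bytes, and `external_sort` is a hypothetical name.

```python
import heapq
import tempfile


def external_sort(items, max_in_memory):
    """Sort an iterable of strings, holding at most `max_in_memory` items at once.

    Sorted chunks are written to temporary files and merged lazily.
    Items must not contain newlines (one item per line in the chunk files).
    """
    chunk_files = []
    chunk = []

    def spill():
        # Write the current chunk in sorted order and rewind the file for reading
        f = tempfile.TemporaryFile(mode="w+")
        f.writelines(item + "\n" for item in sorted(chunk))
        f.seek(0)
        chunk_files.append(f)
        chunk.clear()

    for item in items:
        chunk.append(item)
        if len(chunk) >= max_in_memory:
            spill()

    if not chunk_files:               # everything fitted into memory
        yield from sorted(chunk)
        return
    if chunk:
        spill()
    readers = [(line.rstrip("\n") for line in f) for f in chunk_files]
    try:
        yield from heapq.merge(*readers)   # k-way merge of the sorted chunks
    finally:
        for f in chunk_files:
            f.close()


print(list(external_sort(["TTG", "ACG", "GGA", "CAT", "AAA"], max_in_memory=2)))
```

A lower limit produces more, smaller chunks and therefore more merge work, which matches the advice in the message to raise the limit for faster sorting.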
Sort by record ID
st sort id input.fasta > output.fasta
SeqKit 🕓 34.2 s
SeqKit
seqkit sort  input.fasta > output.fasta
 messages
[INFO] read sequences ...
[INFO] 2610480 sequences loaded
[INFO] sorting ...
[INFO] output ...
🕓 34.2 s
📈 4436.4 MiB
🕓 6.5 s 🏆 (5.3x)
📈 1119.2 MiB 🏆 (3.96x)
Sort by sequence length
st sort seqlen input.fasta > output.fasta
SeqKit 🕓 33.7 s; VSEARCH 🕓 9.4 s
SeqKit
seqkit sort -l  input.fasta > output.fasta
 messages
[INFO] read sequences ...
[INFO] 2610480 sequences loaded
[INFO] sorting ...
[INFO] output ...
🕓 33.7 s
📈 4153.5 MiB
VSEARCH
vsearch --sortbylength input.fasta --output output.fasta 
 messages
vsearch v2.28.1_linux_x86_64, 30.6GB RAM, 16 cores
https://github.com/torognes/vsearch
Reading file input.fasta 100%
712939424 nt in 2610480 seqs, min 35, max 301, avg 273
Getting lengths 100%
Sorting 100%
Median length: 301
Writing output 100%
🕓 9.4 s
📈 891.4 MiB 🏆 (1.17x)
🕓 5.9 s 🏆 (1.6x)
📈 1042.4 MiB
Sort sequences by USEARCH/VSEARCH-style abundance annotations
ST_ATTR_FMT=';key=value' st unique seq -a size={n_duplicates} input.fasta |
  st sort '{-attr("size")}' > output.fasta
VSEARCH 🕓 20.4 s
VSEARCH
vsearch --derep_fulllength input.fasta --output - --sizeout |   vsearch --sortbysize - --output output.fasta  
 messages
vsearch v2.28.1_linux_x86_64, 30.6GB RAM, 16 cores
https://github.com/torognes/vsearch
vsearch v2.28.1_linux_x86_64, 30.6GB RAM, 16 cores
https://github.com/torognes/vsearch
Dereplicating file input.fasta 100%
712939424 nt in 2610480 seqs, min 35, max 301, avg 273
Sorting 100%
2134929 unique sequences, avg cluster 1.2, median 1, max 136182
Writing FASTA output fileReading file - 100%
 100%
606287856 nt in 2134929 seqs, min 35, max 301, avg 284
Getting sizes 100%
Sorting 100%
Median abundance: 1
Writing output 100%
🕓 20.4 s 113% CPU
📈 1345.8 MiB 🏆 (1.19x)
🕓 13.3 s 🏆 (1.5x) 110% CPU
📈 1606.5 MiB

unique

Remove duplicate sequences using sequence hashes. This is more memory efficient and usually faster than keeping the whole sequence around.
st unique seqhash input.fasta > output.fasta
SeqKit 🕓 3.3 s 🏆 (1.2x)
SeqKit
seqkit rmdup -sP  input.fasta > output.fasta
 messages
[INFO] 475551 duplicated records removed
🕓 3.3 s 🏆 (1.2x)
📈 180.1 MiB
🕓 4.2 s
📈 117.1 MiB 🏆 (1.54x)
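The memory saving of hash-based de-duplication comes from storing a short digest per sequence instead of the sequence itself. A minimal sketch under stated assumptions: the 64-bit BLAKE2b digest is chosen here for illustration only (the hash functions actually used by seqtool or SeqKit are not stated in this list), and `unique_by_hash` is a hypothetical name.

```python
import hashlib


def unique_by_hash(records, ignore_case=False):
    """Yield (id, seq) records whose sequence digest has not been seen before."""
    seen = set()
    for rec_id, seq in records:
        key_seq = seq.upper() if ignore_case else seq
        # 8-byte digest: far smaller than a 300 bp read, at the cost of a
        # (very small) chance that two different sequences collide
        digest = hashlib.blake2b(key_seq.encode("ascii"), digest_size=8).digest()
        if digest not in seen:
            seen.add(digest)
            yield rec_id, seq


records = [("a", "ACGT"), ("b", "acgt"), ("c", "ACGT"), ("d", "TTTT")]
print([rec_id for rec_id, _ in unique_by_hash(records)])
print([rec_id for rec_id, _ in unique_by_hash(records, ignore_case=True)])
```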
Remove duplicate sequences using sequence hashes (case-insensitive).
st unique 'seqhash(true)' input.fasta > output.fasta
VSEARCH 🕓 12.1 s; SeqKit 🕓 6.2 s
VSEARCH
vsearch --derep_smallmem input.fasta --fastaout output.fasta 
 messages
vsearch v2.28.1_linux_x86_64, 30.6GB RAM, 16 cores
https://github.com/torognes/vsearch
Dereplicating file input.fasta 100%
712939424 nt in 2610480 seqs, min 35, max 301, avg 273
2134929 unique sequences, avg cluster 1.2, median 1, max 136182
Writing FASTA output file 100%
🕓 12.1 s
📈 90.7 MiB 🏆 (1.29x)
SeqKit
seqkit rmdup -sPi  input.fasta > output.fasta
 messages
[INFO] 475551 duplicated records removed
🕓 6.2 s
📈 289.8 MiB
🕓 4.3 s 🏆 (1.4x)
📈 117.2 MiB
Remove duplicate sequences that are exactly identical (case-insensitive), comparing full sequences instead of hashes (requires more memory). VSEARCH additionally treats 'T' and 'U' as equivalent (seqtool doesn't).
st unique upper_seq input.fasta > output.fasta
seqtool (sorted by sequence) 🕓 13.5 s; VSEARCH 🕓 15.8 s
seqtool (sorted by sequence)
st unique -s upper_seq input.fasta > output.fasta
🕓 13.5 s
📈 1640.7 MiB
VSEARCH
vsearch --derep_fulllength input.fasta --output output.fasta 
 messages
vsearch v2.28.1_linux_x86_64, 30.6GB RAM, 16 cores
https://github.com/torognes/vsearch
Dereplicating file input.fasta 100%
712939424 nt in 2610480 seqs, min 35, max 301, avg 273
Sorting 100%
2134929 unique sequences, avg cluster 1.2, median 1, max 136182
Writing FASTA output file 100%
🕓 15.8 s
📈 1345.7 MiB
🕓 5.4 s 🏆 (2.5x)
📈 729.0 MiB 🏆 (1.85x)
Remove duplicate sequences (exact mode) with a memory limit of ~50 MiB
st unique seq -M 50M input.fasta > output.fasta
 messages
Memory limit reached after 151512 records, writing to temporary file(s). Consider raising the limit (-M/--max-mem) to speed up de-duplicating. Use -q/--quiet to silence this message.
🕓 19.5 s
📈 56.6 MiB
Remove duplicate sequences, checking both strands
st unique seqhash_both input.fasta > output.fasta
SeqKit 🕓 14.8 s
SeqKit
seqkit rmdup -s  input.fasta > output.fasta
 messages
[INFO] 475687 duplicated records removed
🕓 14.8 s
📈 293.6 MiB
🕓 7.5 s 🏆 (2.0x)
📈 117.1 MiB 🏆 (2.51x)
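Checking both strands means that a read and its reverse complement count as the same sequence. One common way to achieve this, sketched below as an assumption rather than a description of how seqtool or SeqKit do it, is to compare a strand-independent key: the lexicographically smaller of the sequence and its reverse complement. The sketch handles only A, C, G and T; all names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence (A/C/G/T only in this sketch)."""
    return seq.translate(_COMPLEMENT)[::-1]


def strand_independent_key(seq):
    """Same key for a sequence and its reverse complement."""
    return min(seq, reverse_complement(seq))


def unique_both_strands(records):
    """Yield the first (id, seq) record for each strand-independent key."""
    seen = set()
    for rec_id, seq in records:
        key = strand_independent_key(seq)
        if key not in seen:
            seen.add(key)
            yield rec_id, seq


# 'CGTT' is the reverse complement of 'AACG', so record 'b' is dropped
records = [("a", "AACG"), ("b", "CGTT"), ("c", "AACC")]
print(list(unique_both_strands(records)))
```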
Remove duplicate sequences, appending USEARCH/VSEARCH-style abundance annotations to the headers: >id;size=NN
st unique seq -a size={n_duplicates} --attr-fmt ';key=value' input.fasta > output.fasta
VSEARCH 🕓 16.1 s
VSEARCH
vsearch --derep_fulllength input.fasta --sizeout --output output.fasta 
 messages
vsearch v2.28.1_linux_x86_64, 30.6GB RAM, 16 cores
https://github.com/torognes/vsearch
Dereplicating file input.fasta 100%
712939424 nt in 2610480 seqs, min 35, max 301, avg 273
Sorting 100%
2134929 unique sequences, avg cluster 1.2, median 1, max 136182
Writing FASTA output file 100%
🕓 16.1 s
📈 1345.9 MiB 🏆 (1.19x)
🕓 9.3 s 🏆 (1.7x)
📈 1606.2 MiB
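The abundance annotation described above (`>id;size=NN`) can be illustrated with a small sketch that counts exact duplicates and appends the count to the ID of the first occurrence. The function name and the in-memory dictionary approach are assumptions made for illustration.

```python
def dereplicate_with_size(records):
    """Collapse identical sequences; yield ('id;size=N', seq) per unique sequence.

    The ID of the first occurrence is kept, and records are emitted in order
    of first occurrence.
    """
    first_id = {}
    counts = {}
    for rec_id, seq in records:
        if seq not in counts:
            first_id[seq] = rec_id
            counts[seq] = 0
        counts[seq] += 1
    for seq, n in counts.items():
        yield f"{first_id[seq]};size={n}", seq


records = [("r1", "ACGT"), ("r2", "TTTT"), ("r3", "ACGT")]
for header, seq in dereplicate_with_size(records):
    print(f">{header}\n{seq}")
```

Keeping every full sequence as a dictionary key is what makes this exact mode more memory-hungry than the hash-based variants.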
De-replicate both by sequence and record ID (the part before the first space in the header). The given benchmark actually has unique sequence IDs, so the result is the same as de-replication by sequence.
st unique id,seq input.fasta > output.fasta
VSEARCH 🕓 17.7 s
VSEARCH
vsearch --derep_id input.fasta --output output.fasta
 messages
vsearch v2.28.1_linux_x86_64, 30.6GB RAM, 16 cores
https://github.com/torognes/vsearch
Dereplicating file input.fasta 100%
712939424 nt in 2610480 seqs, min 35, max 301, avg 273
Sorting 100%
2610480 unique sequences, avg cluster 1.0, median 1, max 1
Writing FASTA output file 100%
🕓 17.7 s
📈 1364.4 MiB
🕓 7.5 s 🏆 (2.3x)
📈 1090.6 MiB 🏆 (1.25x)

filter

Filter sequences by length
st filter 'seqlen >= 100' input.fastq > output.fastq
Seqtk 🕓 6.5 s; SeqKit 🕓 4.1 s 🏆 (1.3x)
Seqtk
seqtk seq -L 100 input.fastq > output.fastq
🕓 6.5 s
📈 3.5 MiB 🏆 (2.07x)
SeqKit
seqkit seq -m 100 input.fastq > output.fastq
 messages
[WARN] you may switch on flag -g/--remove-gaps to remove spaces
🕓 4.1 s 🏆 (1.3x)
📈 28.1 MiB
🕓 5.4 s
📈 7.2 MiB
Filter sequences by the total expected error as calculated from the quality scores
st filter 'exp_err <= 1' input.fastq --to-fa > output.fasta
VSEARCH 🕓 32.9 s; USEARCH 🕓 16.0 s 🏆 (1.7x)
VSEARCH
vsearch --fastq_filter input.fastq --fastq_maxee 1 --fastaout output.fasta 
 messages
vsearch v2.28.1_linux_x86_64, 30.6GB RAM, 16 cores
https://github.com/torognes/vsearch
Reading input file 100%
1408755 sequences kept (of which 0 truncated), 1201725 sequences discarded.
🕓 32.9 s
📈 4.4 MiB 🏆 (1.66x)
USEARCH
usearch -fastq_filter input.fastq -fastq_maxee 1 -fastaout output.fasta
🟦 output
usearch v11.0.667_i86linux32, 4.0Gb RAM (32.1Gb total), 16 cores
(C) Copyright 2013-18 Robert C. Edgar, all rights reserved.
https://drive5.com/usearch
License: personal use only
 messages
00:00 4.2Mb  FASTQ base 33 for file input.fastq
00:00 38Mb   CPU has 16 cores, defaulting to 10 threads
00:00 115Mb     0.1% Filtering
00:01 123Mb     1.0% Filtering, 31.4% passed
00:02 123Mb     8.7% Filtering, 31.5% passed
00:03 123Mb    16.4% Filtering, 31.8% passed
00:04 123Mb    22.1% Filtering, 40.1% passed
00:05 123Mb    26.7% Filtering, 47.6% passed
00:06 123Mb    31.5% Filtering, 52.6% passed
00:07 123Mb    36.4% Filtering, 56.2% passed
00:08 123Mb    41.3% Filtering, 59.1% passed
00:09 123Mb    47.2% Filtering, 60.1% passed
00:10 123Mb    53.5% Filtering, 60.1% passed
00:11 123Mb    61.1% Filtering, 56.6% passed
00:12 123Mb    68.7% Filtering, 53.5% passed
00:13 123Mb    75.4% Filtering, 53.7% passed
00:14 123Mb    83.4% Filtering, 51.4% passed
00:15 123Mb    89.4% Filtering, 52.2% passed
00:16 123Mb    95.1% Filtering, 53.2% passed
00:16 90Mb    100.0% Filtering, 54.0% passed
   2610480  Reads (2.6M)
   1201725  Discarded reads with expected errs > 1.00
   1408755  Filtered reads (1.4M, 54.0%)
🕓 16.0 s 🏆 (1.7x) 997% CPU
📈 34.9 MiB
🕓 27.9 s
📈 7.2 MiB
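The total expected error used by the filter above is the sum of the per-base error probabilities, where a Phred score Q corresponds to an error probability of 10^(-Q/10). A minimal sketch, assuming Phred+33 encoded quality strings (consistent with the "FASTQ base 33" message above); the function names are hypothetical.

```python
def expected_errors(qual, offset=33):
    """Sum of per-base error probabilities for a FASTQ quality string."""
    return sum(10 ** (-(ord(ch) - offset) / 10) for ch in qual)


def filter_by_expected_errors(records, max_ee=1.0):
    """Yield (id, seq, qual) records whose expected error is <= max_ee."""
    for rec_id, seq, qual in records:
        if expected_errors(qual) <= max_ee:
            yield rec_id, seq, qual


records = [
    ("good", "ACGT", "IIII"),                  # four bases at Q40
    ("bad", "ACGTACGTACGT", "++++++++++++"),   # twelve bases at Q10, about 1.2 expected errors
]
print([rec_id for rec_id, _, _ in filter_by_expected_errors(records)])
```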
Select records from a large set of sequences given a list of 1000 sequence IDs
st filter -m ids_list.txt 'has_meta()' input.fasta > output.fasta
VSEARCH 🕓 28.1 s; SeqKit 🕓 1.0 s 🏆 (1.6x)
VSEARCH
vsearch --fastx_getseqs input.fasta --labels ids_list.txt --fastaout output.fasta
 messages
vsearch v2.28.1_linux_x86_64, 30.6GB RAM, 16 cores
https://github.com/torognes/vsearch
Reading labels 100%
Extracting sequences 100%
1000 of 2610480 sequences extracted (0.0%)
🕓 28.1 s
📈 4.2 MiB 🏆 (1.85x)
SeqKit
seqkit grep -f ids_list.txt input.fasta > output.fasta
 messages
[INFO] 1000 patterns loaded from file
🕓 1.0 s 🏆 (1.6x)
📈 21.8 MiB
🕓 1.6 s
📈 7.9 MiB
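Selecting records by a list of IDs, as in the task above, is conceptually a set lookup per record: the ID list is loaded once, and the sequence file is streamed. A minimal sketch with hypothetical names, not tied to any of the benchmarked tools:

```python
def select_by_id(records, id_lines):
    """Yield (id, seq) records whose ID appears in `id_lines` (one ID per line)."""
    wanted = {line.strip() for line in id_lines if line.strip()}
    for rec_id, seq in records:
        if rec_id in wanted:          # average O(1) lookup per record
            yield rec_id, seq


records = [("r1", "ACGT"), ("r2", "TTTT"), ("r3", "GGCC")]
print(list(select_by_id(records, ["r3\n", "r1\n", "\n"])))
```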

sample

Random subsampling to 1000 sequences
st sample -n 1000 input.fasta > output.fasta
VSEARCH 🕓 4.3 s; Seqtk 🕓 0.8 s; SeqKit 🕓 11.5 s
VSEARCH
vsearch --fastx_subsample input.fasta --sample_size 1000 --fastaout output.fasta
 messages
vsearch v2.28.1_linux_x86_64, 30.6GB RAM, 16 cores
https://github.com/torognes/vsearch
Reading file input.fasta 100%
712939424 nt in 2610480 seqs, min 35, max 301, avg 273
Got 2610480 reads from 2610480 amplicons
Subsampling 100%
Writing output 100%
Subsampled 1000 reads from 1000 amplicons
🕓 4.3 s
📈 841.5 MiB
Seqtk
seqtk sample input.fasta 1000 > output.fasta
🕓 0.8 s
📈 3.5 MiB 🏆 (2.07x)
SeqKit
seqkit sample -n 1000 input.fasta > output.fasta
 messages
[INFO] sample by number
[INFO] loading all sequences into memory...
[INFO] 1000 sequences outputted
🕓 11.5 s
📈 3112.7 MiB
🕓 0.5 s 🏆 (1.4x)
📈 7.2 MiB
Random subsampling to ~10% of sequences
st sample -p 0.1 input.fasta > output.fasta
Seqtk 🕓 1.7 s; SeqKit 🕓 2.0 s
Seqtk
seqtk sample input.fastq 0.1 > output.fasta
🕓 1.7 s
📈 3.5 MiB 🏆 (2.04x)
SeqKit
seqkit sample -p 0.1 input.fastq > output.fasta
 messages
[INFO] sample by proportion
[INFO] 260463 sequences outputted
🕓 2.0 s
📈 27.6 MiB
🕓 0.8 s 🏆 (2.2x)
📈 7.1 MiB
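The two sampling tasks above differ in kind: sampling an exact number of records and sampling each record with a fixed probability. Both can be done in a single pass with constant or small memory. The sketch below shows reservoir sampling (Algorithm R) for the first case and independent coin flips for the second; whether any of the benchmarked tools use these particular algorithms is an assumption, and the function names are hypothetical.

```python
import random


def sample_fixed(records, n, rng):
    """Reservoir sampling: a uniform random sample of n records in one pass."""
    reservoir = []
    for i, rec in enumerate(records):
        if i < n:
            reservoir.append(rec)
        else:
            j = rng.randint(0, i)     # inclusive bounds
            if j < n:
                reservoir[j] = rec    # replace with probability n / (i + 1)
    return reservoir


def sample_proportion(records, p, rng):
    """Keep each record independently with probability p (output size varies)."""
    for rec in records:
        if rng.random() < p:
            yield rec


rng = random.Random(42)               # fixed seed for reproducibility
print(len(sample_fixed(range(100000), 1000, rng)))
print(len(list(sample_proportion(range(100000), 0.1, rng))))
```

Because each record is kept independently, proportional sampling returns only approximately the requested fraction, which is why the task is described as sampling to ~10%.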

find

Find the forward primer location in the input reads with up to 4 mismatches
st find -D4 file:primers.fasta input.fastq -a primer={pattern_name} -a rng={match_range} > output.fastq
 messages
Note: the sequence type of the pattern 'ITS4' was determined as 'dna'. If incorrect, please provide the correct type with `--seqtype`. Use `-q/--quiet` to suppress this message.
st (4 threads) 🕓 6.0 s 🏆 (3.5x); st (max. mismatches = 2) 🕓 21.1 s; st (max. mismatches = 8) 🕓 26.7 s
st (4 threads)
st find -t4 -D4 file:primers.fasta input.fastq -a primer={pattern_name} -a rng={match_range} > output.fastq
 messages
Note: the sequence type of the pattern 'ITS4' was determined as 'dna'. If incorrect, please provide the correct type with `--seqtype`. Use `-q/--quiet` to suppress this message.
🕓 6.0 s 🏆 (3.5x) 402% CPU
📈 17.6 MiB
st (max. mismatches = 2)
st find -D2 file:primers.fasta input.fastq -a primer={pattern_name} -a rng={match_range} > output.fastq
 messages
Note: the sequence type of the pattern 'ITS4' was determined as 'dna'. If incorrect, please provide the correct type with `--seqtype`. Use `-q/--quiet` to suppress this message.
🕓 21.1 s
📈 7.5 MiB
st (max. mismatches = 8)
st find -D8 file:primers.fasta input.fastq -a primer={pattern_name} -a rng={match_range} > output.fastq
 messages
Note: the sequence type of the pattern 'ITS4' was determined as 'dna'. If incorrect, please provide the correct type with `--seqtype`. Use `-q/--quiet` to suppress this message.
🕓 26.7 s
📈 7.4 MiB
🕓 21.3 s
📈 7.4 MiB 🏆 (1.00x)
Find and trim the forward primer up to an error rate (edit distance) of 20%, discarding unmatched reads. Note: Unlike Cutadapt, seqtool currently does not offer ungapped alignments (--no-indels).
st find -f file:primers.fasta -R 0.2 input.fastq -a primer={pattern_name} -a end={match_end} |
  st trim -e '{attr(end)}:' --fq > output.fastq
 messages
Note: the sequence type of the pattern 'ITS4' was determined as 'dna'. If incorrect, please provide the correct type with `--seqtype`. Use `-q/--quiet` to suppress this message.
Cutadapt 🕓 67.1 s
Cutadapt
cutadapt -g 'file:primers.fasta;min_overlap=15' input.fastq -e 0.2 --rename '{id} primer={adapter_name}' --discard-untrimmed > output.fastq 
 messages
This is cutadapt 4.6 with Python 3.12.2
Command line parameters: -g file:primers.fasta;min_overlap=15 input.fastq -e 0.2 --rename {id} primer={adapter_name} --discard-untrimmed
Processing single-end reads on 1 core ...
Finished in 66.906 s (25.630 µs/read; 2.34 M reads/minute).
=== Summary ===
Total reads processed:               2,610,480
Reads with adapters:                   828,740 (31.7%)
== Read fate breakdown ==
Reads discarded as untrimmed:        1,781,740 (68.3%)
Reads written (passing filters):       828,740 (31.7%)
Total basepairs processed:   712,939,424 bp
Total written (filtered):    209,047,405 bp (29.3%)
=== Adapter ITS4 ===
Sequence: GTCCTCCGCTTATTGATATGC; Type: regular 5'; Length: 21; Trimmed: 828740 times
Minimum overlap: 15
No. of allowed errors:
1-4 bp: 0; 5-9 bp: 1; 10-14 bp: 2; 15-19 bp: 3; 20-21 bp: 4
Overview of removed sequences
length  count   expect  max.err error counts
15  8   0.0 3   3 1 3 1
16  12  0.0 3   1 3 4 4
17  7   0.0 3   3 0 0 4
18  11  0.0 3   2 6 1 2
19  12  0.0 3   1 2 6 1 2
20  15  0.0 4   3 5 3 2 2
21  29  0.0 4   2 11 4 2 10
22  73  0.0 4   5 23 8 15 22
23  221 0.0 4   10 46 39 53 73
24  723 0.0 4   27 96 180 381 39
25  8858    0.0 4   439 2961 4797 468 193
26  816649  0.0 4   202089 581641 27831 3348 1740
27  1926    0.0 4   184 840 797 74 31
28  33  0.0 4   4 22 2 3 2
29  15  0.0 4   1 11 1 1 1
30  4   0.0 4   1 3
31  1   0.0 4   1
32  3   0.0 4   2 1
33  1   0.0 4   1
34  1   0.0 4   1
35  2   0.0 4   0 2
40  2   0.0 4   0 2
41  2   0.0 4   0 2
42  3   0.0 4   1 2
45  1   0.0 4   0 1
47  1   0.0 4   0 0 0 0 1
48  1   0.0 4   1
51  6   0.0 4   0 0 0 0 6
54  1   0.0 4   0 0 0 0 1
58  16  0.0 4   0 0 0 0 16
59  2   0.0 4   0 1 0 0 1
60  2   0.0 4   0 0 0 0 2
61  20  0.0 4   0 1 0 0 19
62  1   0.0 4   0 0 0 0 1
63  12  0.0 4   0 1 0 1 10
64  2   0.0 4   0 0 0 0 2
66  2   0.0 4   0 0 0 1 1
67  24  0.0 4   0 0 3 5 16
68  4   0.0 4   0 0 1 0 3
69  1   0.0 4   0 0 0 0 1
85  2   0.0 4   0 2
86  5   0.0 4   1 3 0 0 1
105 4   0.0 4   0 0 0 0 4
138 1   0.0 4   0 0 0 0 1
190 2   0.0 4   0 0 0 0 2
203 1   0.0 4   0 0 0 0 1
226 2   0.0 4   0 0 0 0 2
227 1   0.0 4   0 0 0 0 1
228 3   0.0 4   0 0 0 0 3
230 1   0.0 4   0 0 0 0 1
247 1   0.0 4   0 0 0 0 1
249 1   0.0 4   0 0 0 0 1
251 5   0.0 4   0 0 0 0 5
252 1   0.0 4   0 0 0 0 1
255 1   0.0 4   0 0 0 0 1
258 1   0.0 4   0 0 0 0 1
290 1   0.0 4   0 0 0 0 1
🕓 67.1 s
📈 20.9 MiB
🕓 16.9 s 🏆 (4.0x) 120% CPU
📈 7.4 MiB 🏆 (2.83x)
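Finding a primer up to a given edit distance and trimming everything up to the end of the match, as benchmarked above, can be illustrated with a textbook dynamic-programming search. This is a minimal sketch and not the algorithm used by seqtool or Cutadapt; the function name, the tie-breaking rule and the rounding of the error rate to a maximum distance are assumptions.

```python
def find_approx(pattern, text, max_dist):
    """Best approximate occurrence of `pattern` anywhere in `text`.

    Returns (end, distance), where `end` is the exclusive 0-based end of the
    match and `distance` its edit distance, or None if no occurrence has a
    distance <= max_dist. Ties are resolved in favour of the leftmost end.
    """
    m = len(pattern)
    col = list(range(m + 1))          # distances against the empty text prefix
    best = (0, col[m]) if col[m] <= max_dist else None
    for j, ch in enumerate(text, start=1):
        new = [0] * (m + 1)           # new[0] = 0: a match may start anywhere
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            new[i] = min(col[i - 1] + cost,   # match or substitution
                         col[i] + 1,          # extra base in the text
                         new[i - 1] + 1)      # pattern base missing from the text
        col = new
        if col[m] <= max_dist and (best is None or col[m] < best[1]):
            best = (j, col[m])
    return best


primer = "GTCCTCCGCTTATTGATATGC"      # ITS4, as shown in the Cutadapt log above
max_dist = int(0.2 * len(primer))     # assumption: 20% error rate, rounded down
read = primer + "AAAA"
hit = find_approx(primer, read, max_dist)
if hit is not None:
    end, dist = hit
    print(read[end:], dist)           # the read with the primer trimmed off
```

The search runs in time proportional to the pattern length times the read length.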
Find and trim the forward primer in parallel using 4 threads (cores).
st find -f file:primers.fasta -R 0.2 -t4 input.fastq -a primer={pattern_name} -a end={match_end} |
  st trim -e '{attr(end)}:' --fq > output.fastq
 messages
Note: the sequence type of the pattern 'ITS4' was determined as 'dna'. If incorrect, please provide the correct type with `--seqtype`. Use `-q/--quiet` to suppress this message.
Cutadapt 🕓 18.1 s
Cutadapt
cutadapt -j4 -g 'file:primers.fasta;min_overlap=15' input.fastq -e 0.2 --rename '{id} primer={adapter_name}' --discard-untrimmed > output.fastq 
 messages
This is cutadapt 4.6 with Python 3.12.2
Command line parameters: -j4 -g file:primers.fasta;min_overlap=15 input.fastq -e 0.2 --rename {id} primer={adapter_name} --discard-untrimmed
Processing single-end reads on 4 cores ...
Finished in 17.956 s (6.878 µs/read; 8.72 M reads/minute).
=== Summary ===
Total reads processed:               2,610,480
Reads with adapters:                   828,740 (31.7%)
== Read fate breakdown ==
Reads discarded as untrimmed:        1,781,740 (68.3%)
Reads written (passing filters):       828,740 (31.7%)
Total basepairs processed:   712,939,424 bp
Total written (filtered):    209,047,405 bp (29.3%)
=== Adapter ITS4 ===
Sequence: GTCCTCCGCTTATTGATATGC; Type: regular 5'; Length: 21; Trimmed: 828740 times
Minimum overlap: 15
No. of allowed errors:
1-4 bp: 0; 5-9 bp: 1; 10-14 bp: 2; 15-19 bp: 3; 20-21 bp: 4
Overview of removed sequences
length  count   expect  max.err error counts
15  8   0.0 3   3 1 3 1
16  12  0.0 3   1 3 4 4
17  7   0.0 3   3 0 0 4
18  11  0.0 3   2 6 1 2
19  12  0.0 3   1 2 6 1 2
20  15  0.0 4   3 5 3 2 2
21  29  0.0 4   2 11 4 2 10
22  73  0.0 4   5 23 8 15 22
23  221 0.0 4   10 46 39 53 73
24  723 0.0 4   27 96 180 381 39
25  8858    0.0 4   439 2961 4797 468 193
26  816649  0.0 4   202089 581641 27831 3348 1740
27  1926    0.0 4   184 840 797 74 31
28  33  0.0 4   4 22 2 3 2
29  15  0.0 4   1 11 1 1 1
30  4   0.0 4   1 3
31  1   0.0 4   1
32  3   0.0 4   2 1
33  1   0.0 4   1
34  1   0.0 4   1
35  2   0.0 4   0 2
40  2   0.0 4   0 2
41  2   0.0 4   0 2
42  3   0.0 4   1 2
45  1   0.0 4   0 1
47  1   0.0 4   0 0 0 0 1
48  1   0.0 4   1
51  6   0.0 4   0 0 0 0 6
54  1   0.0 4   0 0 0 0 1
58  16  0.0 4   0 0 0 0 16
59  2   0.0 4   0 1 0 0 1
60  2   0.0 4   0 0 0 0 2
61  20  0.0 4   0 1 0 0 19
62  1   0.0 4   0 0 0 0 1
63  12  0.0 4   0 1 0 1 10
64  2   0.0 4   0 0 0 0 2
66  2   0.0 4   0 0 0 1 1
67  24  0.0 4   0 0 3 5 16
68  4   0.0 4   0 0 1 0 3
69  1   0.0 4   0 0 0 0 1
85  2   0.0 4   0 2
86  5   0.0 4   1 3 0 0 1
105 4   0.0 4   0 0 0 0 4
138 1   0.0 4   0 0 0 0 1
190 2   0.0 4   0 0 0 0 2
203 1   0.0 4   0 0 0 0 1
226 2   0.0 4   0 0 0 0 2
227 1   0.0 4   0 0 0 0 1
228 3   0.0 4   0 0 0 0 3
230 1   0.0 4   0 0 0 0 1
247 1   0.0 4   0 0 0 0 1
249 1   0.0 4   0 0 0 0 1
251 5   0.0 4   0 0 0 0 5
252 1   0.0 4   0 0 0 0 1
255 1   0.0 4   0 0 0 0 1
258 1   0.0 4   0 0 0 0 1
290 1   0.0 4   0 0 0 0 1
🕓 18.1 s 413% CPU
📈 39.4 MiB
🕓 4.9 s 🏆 (3.7x) 448% CPU
📈 17.8 MiB 🏆 (2.22x)

replace

Convert DNA to RNA using the replace command
st replace T U input.fasta > output.fasta
st find 🕓 14.3 s; SeqKit 🕓 4.8 s 🏆 (2.1x); FASTX-Toolkit 🕓 283.5 s
st find
st find T --rep U input.fasta > output.fasta
 messages
Note: the sequence type of the pattern was determined as 'dna'. If incorrect, please provide the correct type with `--seqtype`. Use `-q/--quiet` to suppress this message.
🕓 14.3 s
📈 7.2 MiB
SeqKit
seqkit seq --dna2rna  input.fasta > output.fasta
🕓 4.8 s 🏆 (2.1x)
📈 27.3 MiB
FASTX-Toolkit
fasta_nucleotide_changer -r -i input.fasta > output.fasta
🕓 283.5 s
📈 3.5 MiB 🏆 (2.07x)
🕓 10.1 s
📈 7.2 MiB
Convert DNA to RNA using 4 threads
st replace -t4 T U input.fasta > output.fasta
st find 🕓 8.4 s
st find
st find -t4 T --rep U input.fasta > output.fasta
 messages
Note: the sequence type of the pattern was determined as 'dna'. If incorrect, please provide the correct type with `--seqtype`. Use `-q/--quiet` to suppress this message.
🕓 8.4 s 282% CPU
📈 24.6 MiB
🕓 2.7 s 🏆 (3.1x) 418% CPU
📈 9.0 MiB 🏆 (2.74x)

trim

Trim the leading 99 bp from the sequences
st trim 100: input.fasta > output.fasta
SeqKit (creates FASTA index) 🕓 44.8 s
SeqKit (creates FASTA index)
seqkit subseq -r '100:-1'  input.fasta > output.fasta
 messages
[INFO] create or read FASTA index ...
[INFO] create FASTA index for input.fasta
[INFO]   2610480 records loaded from input.fasta.seqkit.fai
🕓 44.8 s
📈 1254.5 MiB
🕓 2.8 s 🏆 (16.0x)
📈 7.4 MiB 🏆 (170.10x)

upper

Convert sequences to uppercase
st upper input.fasta > output.fasta
Seqtk 🕓 5.2 s; SeqKit 🕓 4.2 s
Seqtk
seqtk seq -U input.fasta > output.fasta
🕓 5.2 s
📈 3.5 MiB 🏆 (2.11x)
SeqKit
seqkit seq -u  input.fasta > output.fasta
🕓 4.2 s
📈 62.2 MiB
🕓 3.0 s 🏆 (1.4x)
📈 7.4 MiB

revcomp

Reverse complement sequences
st revcomp input.fasta > output.fasta
Seqtk 🕓 5.3 s 🏆 (1.1x); VSEARCH 🕓 7.7 s; SeqKit 🕓 7.8 s
Seqtk
seqtk seq -r input.fasta > output.fasta
🕓 5.3 s 🏆 (1.1x)
📈 3.5 MiB 🏆 (1.21x)
VSEARCH
vsearch --fastx_revcomp input.fasta --fastaout output.fasta 
 messages
vsearch v2.28.1_linux_x86_64, 30.6GB RAM, 16 cores
https://github.com/torognes/vsearch
Reading FASTA file 100%
🕓 7.7 s
📈 4.2 MiB
SeqKit
seqkit seq -rp  input.fasta > output.fasta
 messages
[WARN] flag -t (--seq-type) (DNA/RNA) is recommended for computing complement sequences
🕓 7.8 s
📈 28.1 MiB
🕓 6.0 s
📈 7.2 MiB

concat

Concatenate sequences, adding an NNNNN spacer in between
st concat -s 5 -c N file1.fastq file2.fastq > output.fastq
VSEARCH 🕓 20.5 s
VSEARCH
vsearch --fastq_join file1.fastq --reverse file2.fastq --join_padgap NNNNN --fastqout output.fastq
 messages
vsearch v2.28.1_linux_x86_64, 30.6GB RAM, 16 cores
https://github.com/torognes/vsearch
Joining reads 100%
2610480 pairs joined
🕓 20.5 s
📈 4.2 MiB 🏆 (1.74x)
🕓 9.9 s 🏆 (2.1x)
📈 7.4 MiB