Explanation of ranges¶
Ranges in seqtool are used or produced by commands like trim, find, mask, and slice.
In a nutshell¶
- Ranges in the form start:end include both the start and end position, unless 0-based coordinates are configured.
- Negative coordinates (e.g. -5:-1) indicate offsets from the end of the sequence.
- Unbounded ranges (start: or :end) include everything from start to the sequence end, or from the sequence beginning to end, respectively. "undefined" is equivalent to a missing coordinate.
- If ranges are interpreted as exclusive, the actual start and end positions are not included in the range.
Overview¶
Ranges look like this: start:end.
The start and end positions are always part of the range, unless you explicitly switch to 0-based coordinates.
It is also possible to use negative numbers: -1 references the last character in the sequence, -2 the second to last, and so on.
| <—————————————> |
sequence:     A  T  G  C  A  T  G  C
base number:  1  2  3  4  5  6  7  8
from end:    -8 -7 -6 -5 -4 -3 -2 -1
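The coordinate mapping in the diagram can be sketched in Python. This is only an illustration of the rules described above, not seqtool's actual implementation: the helper name `trim_inclusive` is hypothetical, and the sketch assumes nonzero coordinates in the default 1-based, inclusive mode.

```python
def trim_inclusive(seq: str, start: int, end: int) -> str:
    """Return positions start..end of seq (1-based, both ends included).

    Negative coordinates count from the sequence end (-1 is the last character).
    Hypothetical helper for illustration; assumes start and end are nonzero.
    """
    n = len(seq)
    # Translate each coordinate to a positive 1-based position.
    first = start if start > 0 else n + start + 1
    last = end if end > 0 else n + end + 1
    # Clamp to the sequence and treat start > end as an empty range.
    first = max(first, 1)
    last = min(last, n)
    if first > last:
        return ""
    # Python slices are 0-based and end-exclusive.
    return seq[first - 1:last]


seq = "ATGCATGC"
print(trim_inclusive(seq, 3, 6))    # GCAT
print(trim_inclusive(seq, -6, -3))  # GCAT
print(trim_inclusive(seq, 3, -3))   # GCAT
```

All three calls address the same positions, once with positive coordinates, once with negative ones, and once with a mix of both.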
The following commands all trim sequences to the blue range, resulting in the same output:
Empty ranges¶
Ranges of zero length are only possible if the start is greater than the end (e.g. 5:4).
seqtool interprets all ranges where start > end as empty.
0-based ranges are an exception: in this mode, 5:5 would result in an empty range.
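The difference between the two modes comes down to how the range length is counted. A minimal sketch, assuming positive coordinates only (the function `range_length` is a hypothetical name used for illustration and is not part of seqtool):

```python
def range_length(start: int, end: int, zero_based: bool = False) -> int:
    """Number of positions covered by start:end (positive coordinates only)."""
    if zero_based:
        # 0-based mode: the end position is not part of the range.
        return max(0, end - start)
    # Default mode: both start and end are part of the range.
    return max(0, end - start + 1)


print(range_length(5, 5))                   # 1
print(range_length(5, 4))                   # 0
print(range_length(6, 2))                   # 0
print(range_length(5, 5, zero_based=True))  # 0
```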
Unbounded ranges: start: or :end¶
The start or end position can be omitted, in which case the range covers the whole sequence up to, or starting from, the given position.
No end¶
The following retains all positions from 5 to the end:
| <——————————>
sequence:     A  T  G  C  A  T  G  C
base number:  1  2  3  4  5  6  7  8
from end:    -8 -7 -6 -5 -4 -3 -2 -1
The sequence ends at position 8, so 5: is equivalent to 5:8 or 5:-1.
However, if sequence lengths differ, only 5: or 5:-1 will include everything from position 5 onward, while 5:8 would still only return these fixed positions:
ATGCATGC ATGCATGCMORE
⚠️ 5: is equivalent to 5:-1 here, but results can differ with exclusive ranges. Usually, you will want the unbounded start: range, which always extends to the end of the sequence.
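The effect of differing sequence lengths can be reproduced with a small Python sketch. It models only the inclusive 1-based rules described in this section, with `None` standing for a missing bound. The name `trim_range` is hypothetical, nonzero coordinates are assumed, and this is not seqtool's own code.

```python
from typing import Optional


def trim_range(seq: str, start: Optional[int] = None, end: Optional[int] = None) -> str:
    """Inclusive 1-based trimming; None means the bound is missing."""
    n = len(seq)
    # A missing start means "from the beginning", a missing end "to the end".
    first = 1 if start is None else (start if start > 0 else n + start + 1)
    last = n if end is None else (end if end > 0 else n + end + 1)
    first, last = max(first, 1), min(last, n)
    return seq[first - 1:last] if first <= last else ""


for seq in ("ATGCATGC", "ATGCATGCMORE"):
    # Columns: input, 5:, 5:-1, 5:8
    print(seq, trim_range(seq, 5), trim_range(seq, 5, -1), trim_range(seq, 5, 8))
```

For the shorter sequence all three ranges give `ATGC`; for the longer one, `5:` and `5:-1` give `ATGCMORE`, while `5:8` still gives `ATGC`.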
No start¶
It is also possible to omit the start position to return all positions up to a given position:
ATGCATGC ATGCATGCMORE
⚠️ Again, 0:3 is equivalent to :3, but only if not using exclusive ranges.
No bounds at all¶
The following will retain the whole sequence, resulting in no trimming at all:
ATGCATGC
ATGCATGCMORE
undefined¶
The special keyword undefined stands for missing data, so undefined:undefined is equivalent to the unbounded range : (no start and no end).
undefined may be returned by functions such as opt_attr() and opt_meta().
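To make the equivalence concrete, here is a hedged Python sketch of how a range string could be split into two optional bounds. The function `parse_range` is a hypothetical name and a simplification for illustration; seqtool's real parser may accept more syntax than is handled here.

```python
from typing import Optional, Tuple


def parse_range(text: str) -> Tuple[Optional[int], Optional[int]]:
    """Split 'start:end' into two bounds; None marks a missing bound."""
    start_txt, sep, end_txt = text.partition(":")
    if not sep:
        raise ValueError(f"not a range (no ':' found): {text!r}")

    def parse_bound(part: str) -> Optional[int]:
        part = part.strip()
        # An empty bound and the keyword 'undefined' both mean "missing".
        if part in ("", "undefined"):
            return None
        return int(part)

    return parse_bound(start_txt), parse_bound(end_txt)


print(parse_range("undefined:undefined"))  # (None, None)
print(parse_range(":"))                    # (None, None)
print(parse_range("5:undefined"))          # (5, None)
print(parse_range("-5:-1"))                # (-5, -1)
```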
Exclusive ranges (-e/--exclusive)¶
The trim and mask commands also accept an -e/--exclusive argument that excludes start and end coordinates from the range.
The following commands trim to positions 3-5 (blue) without the range bounds 2 and 6 themselves (red).
| <——————> |
sequence:     A  T  G  C  A  T  G  C
base number:  1  2  3  4  5  6  7  8
from end:    -8 -7 -6 -5 -4 -3 -2 -1
Unbounded ranges are an important corner case.
If a bound is missing, the range is not trimmed or masked on that side; it still extends to the start or end of the sequence, just as it would without -e/--exclusive:
| <——————>
sequence:     A  T  G  C  A  T  G  C
base number:  1  2  3  4  5  6  7  8
from end:    -8 -7 -6 -5 -4 -3 -2 -1
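The exclusive rules, including the unbounded corner case, can be sketched as follows. This is an illustrative Python model of the behaviour described above, not seqtool's implementation; `trim_exclusive` is a hypothetical name, `None` stands for a missing bound, and nonzero coordinates are assumed.

```python
from typing import Optional


def trim_exclusive(seq: str, start: Optional[int] = None, end: Optional[int] = None) -> str:
    """Trim to the positions strictly between start and end (1-based)."""
    n = len(seq)
    if start is None:
        first = 1  # missing bound: nothing is excluded on this side
    else:
        # Resolve negative coordinates, then skip the bound itself.
        first = (start if start > 0 else n + start + 1) + 1
    if end is None:
        last = n   # missing bound: nothing is excluded on this side
    else:
        last = (end if end > 0 else n + end + 1) - 1
    first, last = max(first, 1), min(last, n)
    return seq[first - 1:last] if first <= last else ""


seq = "ATGCATGC"
print(trim_exclusive(seq, 2, 6))   # GCA (positions 3-5)
print(trim_exclusive(seq, 5))      # TGC (position 5 excluded, end kept)
print(trim_exclusive(seq, 5, -1))  # TG  (positions 5 and 8 both excluded)
```

The last two calls show why `5:` and `5:-1` stop being interchangeable once ranges are exclusive: an explicit `-1` is a bound that gets excluded, whereas a missing end is not.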
0-based coordinates (-0)¶
If you prefer 0-based ranges common to many programming languages, specify the -0 argument.
These are less intuitive, but have the advantage that empty slices can be more easily obtained (e.g. st trim -0 1:1).
The range indices start with 0 instead of 1, and the range end (green) is not included in the slice. Negative indices are also possible and work exactly as in Python.
| <—————————————> |
sequence:       A  T  G  C  A  T  G  C
base number:    1  2  3  4  5  6  7  8
0-based start:  0  1  2  3  4  5  6  7
from end:      -8 -7 -6 -5 -4 -3 -2 -1
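Since the text states that 0-based ranges behave like Python indices, plain Python slicing serves as a reference for what to expect. The wrapper `trim_zero_based` below is a hypothetical helper for illustration only, not a seqtool function.

```python
from typing import Optional


def trim_zero_based(seq: str, start: Optional[int] = None, end: Optional[int] = None) -> str:
    """0-based, end-exclusive trimming, delegated to Python slicing."""
    return seq[start:end]


seq = "ATGCATGC"
print(repr(trim_zero_based(seq, 1, 1)))  # '' (empty slice)
print(trim_zero_based(seq, 2, 6))        # GCAT (same bases as 1-based 3:6)
print(trim_zero_based(seq, -6, -2))      # GCAT (negative indices, as in Python)
print(trim_zero_based(seq, 4))           # ATGC (no end: up to the sequence end)
```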