ALTREP 101
Note: In all the altrepr vignettes, footnotes [1] are used to point to relevant information in the R source code. A casual reader need not follow any of these links.
ALTREP, short for alternate representation, is a mechanism that lets base R and package authors write alternative versions of some of R’s base data types, with a different layout in memory.
There are many possible reasons why it might be advantageous to do this:
- Data can be represented more compactly (as discussed in this vignette)
- Functions can reduce computation by returning lazy ALTREP results that only evaluate entries if they are needed (see vignette("deferred_string"))
- Metadata can be added to vectors that is invisible and un-editable by users (see vignette("wrapper_objects"))
- The storage mechanism of a vector can be abstracted away, allowing data which might not even exist in memory (see vignette("memmap"))
At present, the types that can have an ALTREP include [2]:
- Integer vectors
- Double vectors
- Logical vectors
- String vectors
- Complex vectors
- Raw vectors
- Lists
Some more advanced data types such as environments currently can’t have ALTREPs.
Although we consider ALTREPs to have a different memory representation, in the end they are actually stored using R’s normal SEXPREC structures like lists, environments and vectors. The difference is that authors have the flexibility to, for example, use a list to represent a vector, or use several vectors to represent a single vector.
Currently some of the behaviours that can be overridden in ALTREP include [3]:
- How the length is calculated (as returned by length())
- How the data is converted to an array of data in C
- How elements are extracted from a vector in R (using [ or [[)
- If the array is considered sorted
- How the minimum or maximum values of a vector are calculated
You might wonder why all of this couldn’t be done with one of R’s class systems, like S3 or S4, since they support overriding the behaviour of length, [, max etc.
Some reasons include:
- ALTREP allows editing behaviours that cannot be edited in S3, for example making a custom DATAPTR, the pointer to the vector data.
- Custom ALTREP state is hidden from the user (although altrepr makes this less true)
- Authors can use hacks that would break if exposed to R, such as using a character vector containing null pointers [4]
- Presumably method dispatch is faster in ALTREP because it doesn’t involve searching the S3 method table. But this should be verified by benchmarks.
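To make the comparison with S3 concrete, here is a minimal sketch of a range-like S3 class. The names new_s3_range and s3_range are hypothetical and exist only for this illustration. The class overrides length(), but its internal state stays fully visible to the user through unclass(), which is one of the differences listed above.

```r
# Hypothetical S3 class: a range stored as a start value and a length
new_s3_range <- function(start, n) {
  structure(list(start = start, n = n), class = "s3_range")
}

# S3 method: length() dispatches here for objects of class "s3_range"
length.s3_range <- function(x) unclass(x)$n

r <- new_s3_range(start = 1, n = 5)
length(r)        # uses the S3 method defined above
unclass(r)       # the internal state is plainly visible to any user
```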
Background
Let’s get hands-on with altrepr by looking at the “compact sequences”, perhaps the most famous ALTREP class in R. Compact sequences were introduced in R 3.5.0. They are simply integer (or double) vectors that are represented as a “range”: they store the start, length, and step of a sequence, rather than storing every individual element in that range, which can save a lot of memory. Compact sequences are created by the : operator, and by the seq family of functions. Let’s investigate!
The Tip of the Iceberg
To start we need to install and load the altrepr package:
remotes::install_github("multimeric/altrepr")
library(altrepr)
Firstly, we can use the is_altrep() function to distinguish between ALTREP and ordinary vectors:
is_altrep(1:3)
#> [1] TRUE
Next, we can get some details about an ALTREP’s data using alt_details():
alt_details(1:3)
#> $class_name
#> [1] "compact_intseq"
#>
#> $pkg_name
#> [1] "base"
#>
#> $base_type
#> [1] "integer"
#>
#> $data1
#> [1] 3 1 1
#>
#> $data2
#> NULL
- class_name is the human-readable class name
- pkg_name is the package where the ALTREP was defined
- base_type is the R data type that the ALTREP is representing
We’ll go into a bit more detail about the data fields later on.
If we want, instead of using alt_details(), we can use some more targeted functions to access these same fields:
alt_classname(1:3)
#> [1] "compact_intseq"
alt_pkgname(1:3)
#> [1] "base"
Compact Real
There is actually another class related to compact_intseq, which you can only get by converting an intseq to double:
x <- as.double(1:5)
alt_details(x)
#> $class_name
#> [1] "compact_realseq"
#>
#> $pkg_name
#> [1] "base"
#>
#> $base_type
#> [1] "double"
#>
#> $data1
#> [1] 5 1 1
#>
#> $data2
#> NULL
This has almost all of the same properties as compact_intseq, but of course it is treated as a numeric/real/double by R.
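As a quick sanity check that needs only base R, the converted sequence has type double and holds the same values as a vector typed out by hand. This sketch says nothing about the internal representation; it only confirms that the type and values seen from R are the ordinary ones.

```r
x <- as.double(1:5)

typeof(x)                          # "double"
identical(x, c(1, 2, 3, 4, 5))     # TRUE: same type and same values
```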
Function Naming Scheme
Functions in altrepr starting with alt_ or is_alt_ relate to any ALTREP class. More specific ALTREP classes have their own utility functions. For compact vectors, this prefix is compact_ and is_compact.
The first example we will see of this is a simple check for compact vectors:
is_compact_vec(1:3)
#> [1] TRUE
is_compact_vec(c(1, 2))
#> [1] FALSE
Coercing to a Standard Integer Vector
The ALTREP API doesn’t currently provide a function for forcing an ALTREP vector to be converted to its standard representation, but compact vectors can be expanded using any operation that clones/duplicates the vector.
The simplest way to copy a vector is to use an empty index on it ([]). We can use this feature to prove the importance of the compact sequence class, by comparing the memory usage of a traditional integer vector and the compact version.
Let’s start with an enormous vector:
x <- 1:10^9
is_altrep(x)
#> [1] TRUE
But even with 1 billion elements, it’s actually not very big!
lobstr::obj_size(x)
#> 680 B
Now let’s force the vector to expand. You will notice that this seemingly simple operation actually takes suspiciously long. This is of course because a 1 billion element standard representation vector is being generated behind the scenes:
system.time({
y <- x[]
})
#>    user  system elapsed
#>   0.865   1.183   2.060
is_altrep(y)
#> [1] FALSE
lobstr::obj_size(y)
#> 4.00 GB
There we go. It’s no longer ALTREP, and we’ve worked out that we saved about 4 GB by using ALTREP!
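The 4 GB figure follows from simple arithmetic: R stores each integer in 4 bytes, so a billion of them need about 4 billion bytes, plus a small vector header. A rough sketch of the calculation, assuming decimal gigabytes (which is what the output above appears to use):

```r
n_elements    <- 10^9   # length of the vector
bytes_per_int <- 4      # R integers are 32-bit

expanded_bytes <- n_elements * bytes_per_int
expanded_bytes / 1e9    # size in decimal GB: 4
```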
For reference, altrepr also provides a utility function for this use case:
y <- compact_to_standard(x)
is_altrep(y)
#> [1] FALSE
lobstr::obj_size(y)
#> 4.00 GB
Reading ALTREP Data
ALTREP classes have two “slots” for storing data (not the same as S4 slots, but it’s a helpful analogy). These are respectively called data1 and data2. These slots can store any type of R object, including recursive types like lists, so this isn’t really a restriction.
data1
In the case of compact sequences, data1 is used to store the parameters of the sequence as a double (not integer!) vector:
- The first entry contains the length of the sequence
- The second entry contains the start value of the sequence
- The third entry contains the step, which is currently always 1 or -1
We can prove all this using the alt_data1() function:
alt_data1(1:3)
#> [1] 3 1 1
alt_data1(2:3)
#> [1] 2 2 1
alt_data1(3:2)
#> [1] 2 3 -1
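The layout can also be sketched in plain R. The helper below is hypothetical (it is not part of altrepr or of R itself); it takes a length/start/step triple in the same order as data1 and rebuilds the sequence that the triple describes.

```r
# Hypothetical helper: rebuild a sequence from c(length, start, step)
expand_params <- function(params) {
  n     <- params[1]
  start <- params[2]
  step  <- params[3]
  # element i is start + step * (i - 1)
  start + step * (seq_len(n) - 1)
}

expand_params(c(3, 1, 1))    # same values as 1:3
expand_params(c(2, 3, -1))   # same values as 3:2
```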
Actually altrepr has a utility function for finding this information, specifically for compact vectors:
compact_details(4:2)
#> $length
#> [1] 3
#>
#> $start
#> [1] 4
#>
#> $step
#> [1] -1
#>
#> $expanded
#> NULL
data2 and the Expanded Form
Compact seqs are considered to start in “compact” form, which we can see at the very end of the output from alt_inspect(). This function prints some internal information about the ALTREP vector which often seems to be informative when dealing with built-in ALTREP types:
x <- 1:3
alt_inspect(x)
#> @562ce7edea58 13 INTSXP g0c0 [REF(65535)] 1 : 3 (compact)
“compact” means that data2 is NULL, which is its initial value:
alt_data2(x)
#> NULL
A shortcut method to check for this is compact_is_expanded:
compact_is_expanded(x)
#> [1] FALSE
When the DATAPTR of the sequence is accessed (which is a pointer to the array of data in the vector), it forces the compact sequence to expand. altrepr has a special built-in function that forces a compact vector to expand without any other side effects:
compact_expand(x)
alt_inspect(x)
#> @562ce7edea58 13 INTSXP g0c0 [REF(65535)] 1 : 3 (expanded)
compact_is_expanded(x)
#> [1] TRUE
Notably, x is still ALTREP; it hasn’t been coerced into a standard representation vector:
is_altrep(x)
#> [1] TRUE
But the data2 value, which is linked to the expanded form, is now set to a full vector of integer data:
alt_data2(x)
#> [1] 1 2 3
Modifying ALTREP Data
Note: improperly modifying ALTREP data is dangerous and risks crashing or corrupting your R session!
altrepr also provides set_alt_data1 and set_alt_data2 for modifying the ALTREP data. As an example, let’s create a compact range and then modify it. This is safe because we’re replacing a double vector with another double vector that follows the same layout as described above.
x <- 1:5
set_alt_data1(x, c(2, 3, 1))
x
#> [1] 3 4
Firstly, note that we’ve taken the range 1:5 and replaced it with 3:4, because we set the start of the range to be 3, and the length of it to be 2.
Also, importantly, note that x has been modified in-place. This means that any shallow copy of the compact sequence (e.g. y <- x below) will also be modified. As a result, set_alt_data1() gives us no way to make a modified copy of x, as we could in normal R:
x <- 1:5
y <- x
set_alt_data1(x, c(2, 3, 1))
y
#> [1] 3 4
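For contrast, this is the copy-on-modify behaviour that ordinary R assignment gives you, using only base R. Modifying x through the usual replacement syntax leaves y untouched, which is the guarantee that set_alt_data1() bypasses.

```r
x <- c(1L, 2L, 3L, 4L, 5L)
y <- x          # y and x initially refer to the same data
x[1] <- 10L     # R copies x before modifying it

x               # first element is now 10
y               # unchanged: still 1 to 5
```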