Skip to contents

ALTREP 101

Note: In all the altrepr vignettes, footnotes1 are used to point to relevant information in the R source code. A casual reader need not follow any of these links.

ALTREP, short for alternate representation, is a mechanism where base R and package authors can write alternative versions of some of R’s base data types, with a different layout in memory.

There are many possible reasons why it might be advantageous to do this:

  • Data can be represented more compactly (as discussed in this vignette)
  • Functions can reduce computation by returning lazy ALTREP results that only evaluate entries if they are needed (see vignette("deferred_string"))
  • Metadata can be added to vectors that is invisible and un-editable by users (see vignette("wrapper_objects"))
  • The storage mechanism of a vector can be abstracted away, allowing data which might not even exist in memory (see vignette("memmap"))

At present, types that can be ALTREPPED include2:

  • Integer vectors
  • Double vectors
  • Logical vectors
  • String vectors
  • Complex vectors
  • Raw vectors
  • Lists

Some more advanced data types such as environments currently can’t have ALTREPs.

Although we consider ALTREPs to have a different memory representation, in the end they are actually stored using R’s normal SEXPREC structures like lists, environments and vectors. The difference is that authors have the flexibility, to, for example, use a list to represent a vector, or use several vectors to represent a single vector.

Currently some of the behaviours that can be overridden in ALTREP include3:

  • How the length is calculated (as returned by length())
  • How the data is converted to an array of data in C
  • How elements are extracted from a vector in R (using [ or [[)
  • If the array is considered sorted
  • How the minimum or maximum values of a vector are calculated

You might wonder why all of this couldn’t be done with one of R’s class systems, like S3 or S4, since they support overriding the behaviour of length, [, max etc. Some reasons include:

  • ALTREP allows editing behaviours that cannot be edited in S3, for example making a custom DATAPTR, the pointer to the vector data.
  • Custom ALTREP state is hidden from the user (although altrepr makes this less true)
  • Authors can use hacks that would break if exposed to R, such as using a character vector containing null pointers4
  • Presumably method dispatch is faster in ALTREP because it doesn’t involve searching the S3 method table. But this should be verified by benchmarks.

Background

Let’s get hands-on with altrepr by looking at the “compact sequences”, perhaps the most famous ALTREP class in R.

Compact sequences were introduced in R 3.5.0. They are simply integer (or double) vectors that are represented as a “range”: they store the start, length, and step of a sequence, rather than storing every individual element in that range, which can save a lot of memory. Compact sequences are created by the : operator, and by the seq family of functions. Let’s investigate!

The Tip of the Iceberg

To start we need to install and load the altrepr package:

remotes::install_github("multimeric/altrepr")

Firstly, we can use the is_altrep() function to distinguish between ALTREP and ordinary vectors:

is_altrep(c(1, 2, 3))
#> [1] FALSE
is_altrep(1:3)
#> [1] TRUE
is_altrep(seq(1, 3))
#> [1] TRUE

Next, we can get some details about an ALTREP’s data using alt_details():

alt_details(1:3)
#> $class_name
#> [1] "compact_intseq"
#> 
#> $pkg_name
#> [1] "base"
#> 
#> $base_type
#> [1] "integer"
#> 
#> $data1
#> [1] 3 1 1
#> 
#> $data2
#> NULL
  • class_name is the human-readable class name
  • pkg_name is the package where the ALTREP was defined
  • base_type is the R data type that the ALTREP is representing

We’ll go into a bit more detail about the data fields later on.

If we want, instead of using alt_details(), we can used some more targeted functions to access these same fields:

alt_classname(1:3)
#> [1] "compact_intseq"
alt_pkgname(1:3)
#> [1] "base"

Compact Real

There is actually another related class to compact_intseq, which you can only get by converting an intseq to double:

x <- as.double(1:5)
alt_details(x)
#> $class_name
#> [1] "compact_realseq"
#> 
#> $pkg_name
#> [1] "base"
#> 
#> $base_type
#> [1] "double"
#> 
#> $data1
#> [1] 5 1 1
#> 
#> $data2
#> NULL

This has almost all of the same properties as compact_intseq, but of course it is treated as a numeric/real/double by R.

Function Naming Scheme

Functions in altrepr starting with alt_ or is_alt_ relate to any ALTREP class. More specific ALTREP classes have their own utility functions. For compact vectors, this prefix is compact_ and is_compact.

The first example we will see of this is a simple check for compact vectors:

is_compact_vec(1:3)
#> [1] TRUE
is_compact_vec(c(1, 2))
#> [1] FALSE

Coercing to a Standard Integer Vector

The ALTREP API doesn’t currently provide a function for forcing an ALTREP vector to be converted to it’s standard representation form, but compact vectors can be expanded using any operation that clones/duplicates the vector.

The simplest way to copy a vector is to use an empty index on it ([]). We can use this feature to prove the importance of the compact sequence class, by comparing the memory usage of a traditional integer vector, and the compact version:

Let’s start with an enormous vector:

x <- 1:10^9
is_altrep(x)
#> [1] TRUE

But even with 1 billion elements, it’s actually not very big!

lobstr::obj_size(x)
#> 680 B

Now let’s force the vector to expand. You will notice that this seemingly simple operation actually takes suspiciously long. This is of course because a 1 billion element standard representation vector is being generated behind the scenes:

system.time({
  y <- x[]
})
#>    user  system elapsed 
#>   0.865   1.183   2.060
is_altrep(y)
#> [1] FALSE
lobstr::obj_size(y)
#> 4.00 GB

There we go. It’s no longer ALTREP, and we’ve worked out that we saved about 4 GB by using ALTREP!

For reference, altrepr also provides a utility function for this use case:

y <- compact_to_standard(x)
is_altrep(y)
#> [1] FALSE
lobstr::obj_size(y)
#> 4.00 GB

Reading ALTREP Data

ALTREP classes have two “slots” for storing data (not the same as S4 slots, but it’s a helpful analogy). These are respectively called data1 and data2. These slots can store any type of R object, including recursive types like lists, so this isn’t really a restriction.

data1

In the case of compact sequences, data1 is used to store the parameters of the sequence as a double (not integer!) vector

  • The first entry contains the length of the sequence
  • The second entry contains the start value of the sequence
  • The third entry contains the step, which is currently always 1 or -1

We can prove all this using the alt_data1() function:

alt_data1(1:3)
#> [1] 3 1 1
alt_data1(2:3)
#> [1] 2 2 1
alt_data1(3:2)
#> [1]  2  3 -1

Actually altrepr has a utility function for finding this information, specifically for compact vectors:

compact_details(4:2)
#> $length
#> [1] 3
#> 
#> $start
#> [1] 4
#> 
#> $step
#> [1] -1
#> 
#> $expanded
#> NULL

data2 and the Expanded Form

Compact seqs are considered to start in “compact” form, which we can see at the very end of the output from alt_inspect(). This function prints some internal information about the altrep vector which often seems to be informative when dealing with built-in ALTREP types:

x <- 1:3
alt_inspect(x)
#> @562ce7edea58 13 INTSXP g0c0 [REF(65535)]  1 : 3 (compact)

“compact” means that data2 is NULL, which is its initial value:

alt_data2(x)
#> NULL

A shortcut method to check for this is compact_is_expanded:

compact_is_expanded(x)
#> [1] FALSE

When the DATAPTR of the sequence is accessed (which is a pointer to the array of data in the vector), it forces the compact sequence to expand. altrepr has a special built-in function that forces a compact vector to expand without any other side effects:

compact_expand(x)
alt_inspect(x)
#> @562ce7edea58 13 INTSXP g0c0 [REF(65535)]  1 : 3 (expanded)
compact_is_expanded(x)
#> [1] TRUE

Notably x is still ALTREP, it hasn’t been coerced into a standard representation vector:

is_altrep(x)
#> [1] TRUE

But the data2 value, which is linked to the expanded form, is now set to a full vector of integer data:

alt_data2(x)
#> [1] 1 2 3

Modifying ALTREP Data

Note: improperly modifying ALTREP data can be potentially very dangerous and will risk crashing or corrupting your R session if done incorrectly!.

altrepr also provides set_alt_data1 and set_alt_data2 for modifying the ALTREP data. As an example, let’s create a compact range and then modify it. This is safe because we’re replacing a double vector with another double vector that follows the same layout as described above.

x <- 1:5
set_alt_data1(x, c(2, 3, 1))
x
#> [1] 3 4

Firstly note that we’ve taken the range 1:5 and replaced it with 3:4, because we set the start of the range to be 3, and the length of it to be 2.

Also, importantly, note that x has been modified in-place. This means that any shallow copy of the compact sequence (e.g. y <- x below) will also be modified. This means that there’s actually no way to make a modified copy of x as in normal R:

x <- 1:5
y <- x
set_alt_data1(x, c(2, 3, 1))
y
#> [1] 3 4