{box}

A way-too-deep dive into the {box} package
Categories: R, code
Published: January 26, 2023

Because I’m a Masochist

Well, I haven’t written a blog post in going on 5 years now, so what better way to get back into the swing of things than to write an entire series about a highly technical topic I’m not all that great at? 😃 This is definitely a case of blogging about something so I can use it as a reference when I forget it a few months (OK, years) from now. At any rate, I’m going to be shedding some digital ink on the {box} package while building up a set of functions to analyze my running data.

Programming is Hard

You know how if you’ve never written code in any other language than R before and you try to pick up something like Python or C++ or…just about anything…and you suddenly see statements like these?

```{cpp}
#include <vector>
using namespace std; // Don't do this, by the way
```
```{python}
from pandas import * # Don't do this either
from plotnine import ggplot, aes, geom_point, facet_wrap
import numpy as np
```

On the flip side, let’s hypothetically say you’ve just updated the {tidytable} package in R (whether you have a background in R or not) and you get this error:

```{r}
library(tidytable)

Error in fifelse(condition, args$true, args$false, args$missing) : 
'yes' is of type NULL but 'no' is of type character. Please make sure both arguments have the same type.
Calls: .main ... run_hook_plot -> hook -> ifelse -> if_else -> fifelse
```

Hypothetically, you get that error only when rendering a {quarto} document; it’s not reproducible otherwise. You then, hypothetically, spend the better part of a day and a half researching the error and reading the {quarto} source code ’til you finally find the root cause (a namespace conflict between {tidytable} and {base} that’s causing a non-namespaced ifelse() to be overwritten by the {tidytable} variant). Hypothetically, you then have to open GitHub issues in both packages to figure out a solution that doesn’t break your entire reporting framework at your job…hypothetically (/r/suspiciouslyspecific).

It’s worth mentioning that errors like our hypothetical one happen all the time. Both developers/developer groups in the above not-so-hypothetical are rock stars who responded extremely quickly, and the error has been patched. Still, the song and dance I just described is absolutely bewildering to folks coming from languages with stricter scoping and namespacing rules.

A Brief Interlude on Scoping

We’re not going to get super deep into a scoping discussion, so if you’re interested in diving deeper, check this out. At any rate, scoping has to do with how object names are resolved in a program. What the hell does that mean, right? For a very simple example…

x <- 5

x
[1] 5

…we’ve created a variable with the name x in our global environment (more on that in a second). Because R is the way it is (highly scientific description, I know), we can immediately redefine x:

x <- 15

x
[1] 15

We’re not going to get into the hows and whys of what’s happening under the hood, just know we’ve redefined the value of our x variable to 15. Where scoping comes into play is in the following two examples:

# R's shorthand for functions...you probably won't see me 
# do this all that often because old habits die hard
double_x <- \(x) x * 2

double_x(5)
[1] 10

hidden_x <- \(y) y * 2 + x

hidden_x(5)
[1] 25

In the first example, we provided the value 5 to the function argument x, which was doubled and returned. Makes sense. So how did the second hidden_x() function know to add 15 to y * 2? Through the magic of R’s scoping rules!

Three Environments to Rule Them All

In R, you essentially have three environments: the local environment, the global environment, and the package/library/namespace environment. Think of the local environment as the inside of a function; that isn’t quite right, but it’s close enough for us. Variables and functions in the local environment are only available in said local environment.

The next step up–the global environment–is everything in the Environment panel if you’re using the default layout for RStudio. Variables and functions in the global environment are accessible and can be called from anywhere in your R process.

So when you save needlessly_long_named_dataframe_final_v2 through the assignment operator (<-) and it appears in the global environment panel, needlessly_long_named_dataframe_final_v2 is now available in your global environment and can be called inside other functions.
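Here's a minimal sketch of the local/global split in base R. The names (`global_offset`, `make_total`, `running_total`) are made up for illustration and aren't part of the running example.

```r
global_offset <- 100  # created at the top level, so it lives in the global environment

make_total <- function(values) {
  running_total <- sum(values)    # local: only exists while the function runs
  running_total + global_offset   # the global variable is visible from inside
}

make_total(c(1, 2, 3))
# [1] 106

# The local variable never leaked out of the function
exists("running_total")
# [1] FALSE
```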

Finally, we have the package/library environment, which contains the namespaces of the currently loaded packages (library(whatever)). Most people access package namespaces through calling library() and calling the newly-available functions:

library(tidytable)

.df <- mutate.(.df, new_col = col * 5)

But it’s also possible to namespace specific functions:

.df <- tidytable::mutate.(.df, new_col = col * 5)
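To see that `::` reaches into a package without attaching it, here's a small base-R sketch. I'm using {tools} only because it ships with R and normally isn't attached in a fresh session; that choice is mine, not something from the example above.

```r
# Is {tools} attached? In a fresh session this is usually FALSE.
"package:tools" %in% search()

# The namespaced call works either way
tools::file_ext("report.qmd")
# [1] "qmd"

# Calling through :: loads the namespace but does not attach the package,
# so this gives the same answer as the first check.
"package:tools" %in% search()
```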

…OK?

So who cares, right? You should care. R’s scoping rules (and its evaluation rules, but those are beyond what we care about here) are why you’ve probably gotten some really wonky results before when doing something like this:

example_tidytable <- tidytable(name = c("Bob", "Cindy", "Bill", "Marsha"))

name <- "Bob"

filter.(example_tidytable, name == name)
Warning: `filter.()` was deprecated in tidytable v0.10.0.
ℹ Please use `filter()` instead.
ℹ Please note that all `verb.()` syntax has now been deprecated.
# A tidytable: 4 × 1
  name  
  <chr> 
1 Bob   
2 Cindy 
3 Bill  
4 Marsha

Likewise, it’s why our second function above (hidden_x()) is able to add x to the user-provided value without having to add an x argument to the function.

hidden_x(25)
[1] 65

Just in case you were wondering, we can solve our toy example problem via…

# OR name == !!name
filter.(example_tidytable, name == {{name}})
# A tidytable: 1 × 1
  name 
  <chr>
1 Bob  

Anyway, this is because–when searching for a variable (name, x, etc.)–R searches the environments in the order I described above: local > global > namespace. In the filtering example, since name exists as a field in the tidytable (the innermost local environment), R stops its search there and uses the column on both sides of the equality. Since the vector is equal to itself (if it isn’t, oh boy, do we have problems), you get back the entire tidytable. Our solution in the last chunk essentially tells R to use name from one step up the scoping chain, although we could have just as easily solved the problem by using a different variable name.

my_name <- "Bob"

filter.(example_tidytable, name == my_name)
# A tidytable: 1 × 1
  name 
  <chr>
1 Bob  
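The same shadowing can be reproduced in base R, which may make the lookup order easier to see. Treat this as an analogy built on `eval()`, not a description of how {tidytable} implements its filtering; `people` and `target` are names I made up.

```r
people <- data.frame(name = c("Bob", "Cindy", "Bill", "Marsha"))
name <- "Bob"

# Evaluate the expression with the data frame's columns searched first.
# Both sides of == resolve to the column, so every row matches.
eval(quote(name == name), people)
# [1] TRUE TRUE TRUE TRUE

# A name that is not a column falls through to the calling environment.
target <- "Bob"
eval(quote(name == target), people)
# [1]  TRUE FALSE FALSE FALSE
```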

At any rate, in the case of hidden_x(), R doesn’t find x declared inside the local/function environment, so it steps up to the global environment to see if x exists there. Since we defined x <- 15 earlier, it uses that value of x inside the function. Hopefully you see where we’re going with this.
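Two quick experiments make the step-up search a little more concrete. `hidden_x()` and `x` are redefined here so the chunk runs on its own; `shadowed_x()` is a hypothetical addition.

```r
x <- 15
hidden_x <- function(y) y * 2 + x

# The lookup happens when the function runs, not when it is defined,
# so changing the outer x changes the result.
x <- 100
hidden_x(5)
# [1] 110

# A local x is found first, so the outer one is never consulted.
shadowed_x <- function(y) {
  x <- 1
  y * 2 + x
}
shadowed_x(5)
# [1] 11

# The outer x is untouched by the assignment inside shadowed_x()
x
# [1] 100
```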

I Got 99 Problems and a Namespace is One

Let’s circle all the way back to the workflow-breaking problem I mentioned at the start: loading {tidytable} while rendering a {quarto} document was throwing an error. This has to do with both scoping and how packages are loaded in R. When calling a function, R performs the same step-up search approach I described for variables: local functions > global functions > namespace functions. This lets you define your own local or global functions that may share names with functions from packages. The rub is how R handles packages that share functions with the same name.

When starting a new R process, the {base} package is loaded first, followed by a few additional support packages ({stats}, {utils}, etc.). This essentially creates a list of functions that are available for the user to call based on what packages have been loaded. What’s really important to know is each newly-loaded package is moved to the front of the list. This is fine and dandy in cases like {base}, {stats}, etc. because they don’t share any overlapping functions. When you load a package that has a function with the same name as a previously-loaded package, however, it creates a naming conflict. R handles these naming conflicts by masking the previously loaded function:

```{r}
library(tidytable)

...

The following objects are masked from ‘package:stats’:

    dt, filter, lag

The following object is masked from ‘package:base’:

    %in%
```

So now, if you call dt(), filter(), lag(), or %in%, you’re going to get the {tidytable} variants. You’ve probably seen something similar when calling library(tidyverse). In some cases, this masking is no big deal. In others–like our not-so-hypothetical–it can break everything.
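You can watch masking happen without installing anything. The sketch below uses base R's `attach()` to put a stand-in "package" on the search path; `fake_pkg` and its `filter()` are invented for illustration and only mimic what attaching a real package does.

```r
# The search path starts at the global environment and ends at {base}
head(search(), 1)
# [1] ".GlobalEnv"
tail(search(), 1)
# [1] "package:base"

# Stand-in for a package that exports its own filter().
# R reports that filter is now masked from package:stats.
attach(list(filter = function(...) "stand-in filter"), name = "fake_pkg")

# The newcomer sits right behind the global environment...
search()[2]
# [1] "fake_pkg"

# ...so an unqualified call finds it before the {stats} version
masked_result <- filter(1:3)
masked_result
# [1] "stand-in filter"

# The original is still reachable through its namespace
original_result <- as.numeric(stats::filter(1:3, 1))

# Clean up; filter() resolves to the {stats} version again
detach("fake_pkg", character.only = TRUE)
identical(filter, stats::filter)
# [1] TRUE
```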

Solving Our Namespace Problem

So how do you handle situations where multiple packages have a function with the same name AND you need functionality from both versions (e.g. filter from {stats} and {dplyr})? One method is explicitly calling the package namespace each time you invoke a function.

condition <- "trained"
ratio <- 0.25

.df |> 
  dplyr::filter(condition == {{condition}}) |> 
  dplyr::summarize(filtered_values = stats::filter(value * ratio, 1 - ratio, "recursive", init = value[1]))
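If the `stats::filter()` call looks opaque, here's a base-R sketch of what that recursive form computes, using toy numbers I picked (the `value` vector isn't from any real data). Each output is `ratio` times the current value plus `1 - ratio` times the previous output, seeded with the first value.

```r
value <- c(10, 20, 30)
ratio <- 0.25

# Same call shape as above, fully namespaced
smoothed <- stats::filter(value * ratio, 1 - ratio, "recursive", init = value[1])
as.numeric(smoothed)
# [1] 10.000 12.500 16.875

# The same recursion written out by hand
by_hand <- numeric(length(value))
previous <- value[1]
for (i in seq_along(value)) {
  by_hand[i] <- ratio * value[i] + (1 - ratio) * previous
  previous <- by_hand[i]
}

all.equal(as.numeric(smoothed), by_hand)
# [1] TRUE
```

Either way, the point of the pipeline above is that every call carries its package prefix, so there's no question which filter() you're getting.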

While that’s generally the recommended approach when building packages, it gets pretty cumbersome for day-to-day programming. Another option is to use {conflicted}.

library(tidytable)
library(tidyverse) # Don't actually do this

conflicted::conflicts_prefer(
  stats::filter,
  tidytable::lag,
  tidytable::map_vec
)

# You get the idea

That’s definitely better, but it can still cause you to overlook namespace conflicts until you get an error (or not) when running a report.

Thank all that’s holy, we’ve finally made it to why {box} is a thing.

A Case for {box}