The nest/unnest programming pattern can be a useful way to organize your data analyses in R. The idea is to store a series of objects (e.g., datasets, models, summaries) in a “list-column.” A list-column is simply a column in your data frame where each entry is a list. Each of the lists in your list-column can itself contain any object you want, including other datasets or statistical model objects. The benefit of a list columns is that analysts can operate on them using standard tools such as lapply or purrr::map.
One common use-case is to split a dataset into subgroups, and to “nest” those subgroups of the data into a list-column. Then, we can estimate models and summarize the data easily by operating on that list column.
I could not find an easy introductory tutorial for nesting list-columns in data.table, so I decided to write this notebook. In it, I compare how list-columns can be created in both data.table and tidyverse.
To begin, we load libraries and download the Guerry data from Rdatasets:
library(data.table)
Warning: package 'data.table' was built under R version 4.5.2
library(tidyverse)
Warning: package 'ggplot2' was built under R version 4.5.2
Warning: package 'tibble' was built under R version 4.5.2
Warning: package 'tidyr' was built under R version 4.5.2
Warning: package 'readr' was built under R version 4.5.2
Warning: package 'purrr' was built under R version 4.5.2
Warning: package 'dplyr' was built under R version 4.5.2
Warning: package 'lubridate' was built under R version 4.5.2
library(broom)
Warning: package 'broom' was built under R version 4.5.2
In data.table, we can split our dataset using the by argument, and return any summary of the data we want inside a list object. For instance, we calculate the standard deviation of the “Clergy” variable for each region by typing:
dat[, list(clergy_sd=sd(Clergy)), by=Region]
Region clergy_sd
<char> <num>
1: E 19.04754
2: N 24.09845
3: C 25.03527
4: S 25.76706
5: W 20.29181
6: NA
To create a list-column, we use the same syntax, but instead of returning a single numeric value (sd(Clergy)), we return a list with the “Subset of Data” special object (.SD) inside:
dat = dat[, list(data=list(.SD)), by=Region]
The result is a data table with a new list-column called data. Each element of the data column is now a list, with a data table inside, and each of those data tables has 17 rows and 23 columns:
dat
Region data
<char> <list>
1: E <data.table[17x23]>
2: N <data.table[17x23]>
3: C <data.table[17x23]>
4: S <data.table[17x23]>
5: W <data.table[17x23]>
6: <data.table[1x23]>
This is a convenient data structure, because it allows us to operate on the data subsets with standard commands like lapply or purrr::map. For instance, this estimates a separate linear regression model in each subset of the data:
dat[, model :=lapply(data, function(x) lm(Donations ~ Clergy, x))]dat
Region data model
<char> <list> <list>
1: E <data.table[17x23]> <lm[12]>
2: N <data.table[17x23]> <lm[12]>
3: C <data.table[17x23]> <lm[12]>
4: S <data.table[17x23]> <lm[12]>
5: W <data.table[17x23]> <lm[12]>
6: <data.table[1x23]> <lm[12]>
Above, we see that the dataset now includes a new list-column called “model”. Each element of that column is a list, with a lm object inside. We can do cool things with this list-column. For example, we can pass it directly to modelsummary:
modelsummary(dat$model)
Warning: Model has zero degrees of freedom!
Warning: Model has zero degrees of freedom!
Model matrix is rank deficient. Parameters `Clergy` were not estimable.
(1)
(2)
(3)
(4)
(5)
(6)
(Intercept)
3459.206
12616.594
1906.380
4792.878
4646.248
37015.000
(1698.357)
(2708.688)
(1719.969)
(869.711)
(4255.843)
Clergy
43.731
-110.731
107.241
-38.621
91.106
(43.830)
(53.525)
(29.720)
(21.940)
(70.121)
Num.Obs.
17
17
17
17
17
1
R2
0.062
0.222
0.465
0.171
0.101
0.000
R2 Adj.
-0.000
0.170
0.429
0.116
0.041
0.000
AIC
328.0
342.8
324.1
314.7
346.1
BIC
330.5
345.3
326.6
317.2
348.6
Log.Lik.
-160.988
-168.384
-159.031
-154.361
-170.052
RMSE
3136.83
4846.46
2795.64
2124.14
5346.24
0.00
We could also use the broom::tidy function to extract model estimates in a tidy format:
dat[, result :=lapply(model, tidy)]dat
Region data model result
<char> <list> <list> <list>
1: E <data.table[17x23]> <lm[12]> <tbl_df[2x5]>
2: N <data.table[17x23]> <lm[12]> <tbl_df[2x5]>
3: C <data.table[17x23]> <lm[12]> <tbl_df[2x5]>
4: S <data.table[17x23]> <lm[12]> <tbl_df[2x5]>
5: W <data.table[17x23]> <lm[12]> <tbl_df[2x5]>
6: <data.table[1x23]> <lm[12]> <tbl_df[2x5]>
In this new “result” list-column, each entry is a list with a data frame-like object inside (a tibble). This data frame includes information about coefficient estimates, standard errors, etc. To extract that information from the list-column we move to the final step of today’s demonstration: unnesting.
Since each element of the result column is a list with a single data frame inside, we simply extract that single element with brackets [[1]]:
dat[, result[[1]], by=Region]
Region term estimate std.error statistic p.value
<char> <char> <num> <num> <num> <num>
1: E (Intercept) 3459.20615 1698.35697 2.0367957 5.972055e-02
2: E Clergy 43.73143 43.82991 0.9977531 3.342234e-01
3: N (Intercept) 12616.59395 2708.68810 4.6578246 3.094474e-04
4: N Clergy -110.73145 53.52470 -2.0687918 5.625602e-02
5: C (Intercept) 1906.38003 1719.96865 1.1083807 2.851535e-01
6: C Clergy 107.24137 29.71992 3.6083997 2.580823e-03
7: S (Intercept) 4792.87834 869.71107 5.5108857 5.982135e-05
8: S Clergy -38.62128 21.94008 -1.7603077 9.872506e-02
9: W (Intercept) 4646.24822 4255.84350 1.0917338 2.921758e-01
10: W Clergy 91.10633 70.12076 1.2992775 2.134648e-01
11: (Intercept) 37015.00000 NaN NaN NaN
12: Clergy NA NA NA NA
Putting everything together, we have:
# Nestdat = dat[, list(data=list(.SD)), by=Region]# Operationsdat[, model :=lapply(data, function(x) lm(Donations ~ Clergy, x))]dat[, result :=lapply(model, tidy)]# Unnestdat[, result[[1]], by=Region]
tidyverse
The same result can be achieved in a tidyverse style:
# A tibble: 10 × 8
# Groups: Region [5]
Region data model term estimate std.error statistic p.value
<chr> <list<tibble[,23]>> <list> <chr> <dbl> <dbl> <dbl> <dbl>
1 C [17 × 23] <lm> (Inte… 1906. 1720. 1.11 2.85e-1
2 C [17 × 23] <lm> Clergy 107. 29.7 3.61 2.58e-3
3 E [17 × 23] <lm> (Inte… 3459. 1698. 2.04 5.97e-2
4 E [17 × 23] <lm> Clergy 43.7 43.8 0.998 3.34e-1
5 N [17 × 23] <lm> (Inte… 12617. 2709. 4.66 3.09e-4
6 N [17 × 23] <lm> Clergy -111. 53.5 -2.07 5.63e-2
7 S [17 × 23] <lm> (Inte… 4793. 870. 5.51 5.98e-5
8 S [17 × 23] <lm> Clergy -38.6 21.9 -1.76 9.87e-2
9 W [17 × 23] <lm> (Inte… 4646. 4256. 1.09 2.92e-1
10 W [17 × 23] <lm> Clergy 91.1 70.1 1.30 2.13e-1
Benchmarking
Admittedly, benchmarks are pretty useless in this context, because unnesting is unlikely to be your major bottleneck unless you are dealing with a ton of subgroups. Still, it is interesting to note that unnesting data.table is about 8x faster than tidyverse in my naive tests:
Warning in microbenchmark(unnest_dt(), unnest_tb(), times = 5): less accurate
nanosecond times to avoid potential integer overflows
Unit: milliseconds
expr min lq mean median uq max neval
unnest_dt() 3.920789 4.044978 4.362129 4.449074 4.603644 4.792162 5
unnest_tb() 10.950157 11.021333 11.917716 11.041177 13.060345 13.515568 5
cld
a
b
Note that in the code above, I merged the original dataset back into data.table for a fairer comparison (tidyr::unnest returns all the original data).