The nest/unnest programming pattern can be a useful way to organize your data analyses in R. The idea is to store a series of objects (e.g., datasets, models, summaries) in a “list-column.” A list-column is simply a column in your data frame where each entry is a list. Each of the lists in your list-column can itself contain any object you want, including other datasets or statistical model objects. The benefit of a list columns is that analysts can operate on them using standard tools such as lapply or purrr::map.
One common use-case is to split a dataset into subgroups, and to “nest” those subgroups of the data into a list-column. Then, we can estimate models and summarize the data easily by operating on that list column.
I could not find an easy introductory tutorial for nesting list-columns in data.table, so I decided to write this notebook. In it, I compare how list-columns can be created in both data.table and tidyverse.
To begin, we load libraries and download the Guerry data from Rdatasets:
library(data.table)library(tidyverse)
Warning in system("timedatectl", intern = TRUE): running command 'timedatectl'
had status 1
In data.table, we can split our dataset using the by argument, and return any summary of the data we want inside a list object. For instance, we calculate the standard deviation of the “Clergy” variable for each region by typing:
dat[, list(clergy_sd=sd(Clergy)), by=Region]
Region clergy_sd
1: E 19.04754
2: N 24.09845
3: C 25.03527
4: S 25.76706
5: W 20.29181
To create a list-column, we use the same syntax, but instead of returning a single numeric value (sd(Clergy)), we return a list with the “Subset of Data” special object (.SD) inside:
dat = dat[, list(data=list(.SD)), by=Region]
The result is a data table with a new list-column called data. Each element of the data column is now a list, with a data table inside, and each of those data tables has 17 rows and 23 columns:
dat
Region data
1: E <data.table[17x23]>
2: N <data.table[17x23]>
3: C <data.table[17x23]>
4: S <data.table[17x23]>
5: W <data.table[17x23]>
This is a convenient data structure, because it allows us to operate on the data subsets with standard commands like lapply or purrr::map. For instance, this estimates a separate linear regression model in each subset of the data:
dat[, model :=lapply(data, function(x) lm(Donations ~ Clergy, x))]dat
Region data model
1: E <data.table[17x23]> <lm[12]>
2: N <data.table[17x23]> <lm[12]>
3: C <data.table[17x23]> <lm[12]>
4: S <data.table[17x23]> <lm[12]>
5: W <data.table[17x23]> <lm[12]>
Above, we see that the dataset now includes a new list-column called “model”. Each element of that column is a list, with a lm object inside. We can do cool things with this list-column. For example, we can pass it directly to modelsummary:
modelsummary(dat$model)
(1)
(2)
(3)
(4)
(5)
(Intercept)
3459.206
12616.594
1906.380
4792.878
4646.248
(1698.357)
(2708.688)
(1719.969)
(869.711)
(4255.843)
Clergy
43.731
−110.731
107.241
−38.621
91.106
(43.830)
(53.525)
(29.720)
(21.940)
(70.121)
Num.Obs.
17
17
17
17
17
R2
0.062
0.222
0.465
0.171
0.101
R2 Adj.
0.000
0.170
0.429
0.116
0.041
AIC
328.0
342.8
324.1
314.7
346.1
BIC
330.5
345.3
326.6
317.2
348.6
Log.Lik.
−160.988
−168.384
−159.031
−154.361
−170.052
RMSE
3136.83
4846.46
2795.64
2124.14
5346.24
We could also use the broom::tidy function to extract model estimates in a tidy format:
dat[, result :=lapply(model, tidy)]dat
Region data model result
1: E <data.table[17x23]> <lm[12]> <tbl_df[2x5]>
2: N <data.table[17x23]> <lm[12]> <tbl_df[2x5]>
3: C <data.table[17x23]> <lm[12]> <tbl_df[2x5]>
4: S <data.table[17x23]> <lm[12]> <tbl_df[2x5]>
5: W <data.table[17x23]> <lm[12]> <tbl_df[2x5]>
In this new “result” list-column, each entry is a list with a data frame-like object inside (a tibble). This data frame includes information about coefficient estimates, standard errors, etc. To extract that information from the list-column we move to the final step of today’s demonstration: unnesting.
Since each element of the result column is a list with a single data frame inside, we simply extract that single element with brackets [[1]]:
dat[, result[[1]], by=Region]
Region term estimate std.error statistic p.value
1: E (Intercept) 3459.20615 1698.35697 2.0367957 5.972055e-02
2: E Clergy 43.73143 43.82991 0.9977531 3.342234e-01
3: N (Intercept) 12616.59395 2708.68810 4.6578246 3.094474e-04
4: N Clergy -110.73145 53.52470 -2.0687918 5.625602e-02
5: C (Intercept) 1906.38003 1719.96865 1.1083807 2.851535e-01
6: C Clergy 107.24137 29.71992 3.6083997 2.580823e-03
7: S (Intercept) 4792.87834 869.71107 5.5108857 5.982135e-05
8: S Clergy -38.62128 21.94008 -1.7603077 9.872506e-02
9: W (Intercept) 4646.24822 4255.84350 1.0917338 2.921758e-01
10: W Clergy 91.10633 70.12076 1.2992775 2.134648e-01
Putting everything together, we have:
# Nestdat = dat[, list(data=list(.SD)), by=Region]# Operationsdat[, model :=lapply(data, function(x) lm(Donations ~ Clergy, x))]dat[, result :=lapply(model, tidy)]# Unnestdat[, result[[1]], by=Region]
tidyverse
The same result can be achieved in a tidyverse style:
# A tibble: 10 × 8
# Groups: Region [5]
Region data model term estimate std.error statistic p.value
<chr> <list<tibble[,23]>> <list> <chr> <dbl> <dbl> <dbl> <dbl>
1 C [17 × 23] <lm> (Inte… 1906. 1720. 1.11 2.85e-1
2 C [17 × 23] <lm> Clergy 107. 29.7 3.61 2.58e-3
3 E [17 × 23] <lm> (Inte… 3459. 1698. 2.04 5.97e-2
4 E [17 × 23] <lm> Clergy 43.7 43.8 0.998 3.34e-1
5 N [17 × 23] <lm> (Inte… 12617. 2709. 4.66 3.09e-4
6 N [17 × 23] <lm> Clergy -111. 53.5 -2.07 5.63e-2
7 S [17 × 23] <lm> (Inte… 4793. 870. 5.51 5.98e-5
8 S [17 × 23] <lm> Clergy -38.6 21.9 -1.76 9.87e-2
9 W [17 × 23] <lm> (Inte… 4646. 4256. 1.09 2.92e-1
10 W [17 × 23] <lm> Clergy 91.1 70.1 1.30 2.13e-1
Benchmarking
Admittedly, benchmarks are pretty useless in this context, because unnesting is unlikely to be your major bottleneck unless you are dealing with a ton of subgroups. Still, it is interesting to note that unnesting data.table is about 8x faster than tidyverse in my naive tests:
Unit: milliseconds
expr min lq mean median uq max neval cld
unnest_dt() 11.78367 11.98941 14.17848 13.84509 16.08075 17.19348 5 a
unnest_tb() 34.83939 35.19415 37.41286 36.53685 39.98738 40.50653 5 b
Note that in the code above, I merged the original dataset back into data.table for a fairer comparison (tidyr::unnest returns all the original data).