How to Easily Create Descriptive Summary Statistics Tables in R Studio – By Group

August 20, 2018 By Pascal Schmidt R Statistics

Summary statistics tables are one of the most common ways to get familiar with a data set during an exploratory data analysis, largely because they are quick and easy to create. In this blog post, I am going to show you how to create descriptive summary statistics tables in R. Almost all of the packages below can create both a plain descriptive summary statistics table and one stratified by groups. That means we can choose a factor column and compute the statistics separately for each of its levels (very useful!). Moreover, the results can easily be knitted to HTML, PDF, or Word, which makes it simple to use these tables in a report or presentation.

Let’s get started with a quick look at the packages we are going to present:

  • arsenal
  • qwraps2
  • Amisc
  • table1
  • tangram
  • furniture
  • tableone
  • compareGroups
  • Gmisc (together with htmlTable)

Choosing our Data Set to Create Descriptive Summary Statistics Tables in R

For all of these packages, I am providing some code that shows the basics behind the tables and their functionality. For additional information, there is a link to the corresponding vignette which has even more examples and code snippets. In order for you to follow my code, I used the gapminder data set from the gapminder package.

In the code below, I am modifying the gapminder data set a little bit. I transformed the gdpPercap column to a factor variable with two levels. High is for countries with gdpPercap higher than the median gdpPercap and low for lower than the median gdpPercap. After that, I divided the population by one million to make the table more readable. In addition to that I also randomly introduced missing values in the data. I did that because in the real world we rarely experience data sets without any NA values. Therefore, it is important to know how different packages deal with missing values.

library(tidyverse)
library(gapminder)
data(gapminder)

median_gdp <- median(gapminder$gdpPercap)
gapminder %>%
  select(-country) %>%
  mutate(gdpPercap = ifelse(gdpPercap > median_gdp, "high", "low")) %>%
  mutate(gdpPercap = factor(gdpPercap)) %>%
  mutate(pop = pop / 1000000) -> gapminder

# lapply() returns a list, so convert the result back to a data frame
gapminder <- as.data.frame(lapply(gapminder, function(x) x[sample(c(TRUE, NA),
    prob = c(0.9, 0.1),
    size = length(x),
    replace = TRUE
  )]))
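The way the missing values are introduced can look a bit cryptic, so here is a minimal sketch of the indexing trick on a toy vector. The values and the seed are made up for illustration; the only idea taken from the code above is that indexing with NA returns NA.

```r
# Toy vector; the values are made up for illustration.
x <- c(10, 20, 30, 40, 50)

# A logical index: TRUE keeps the element, NA turns it into NA.
keep <- c(TRUE, NA, TRUE, TRUE, NA)
x_with_na <- x[keep]
x_with_na
# 10 NA 30 40 NA

# The same idea with a random mask, as in the gapminder code.
set.seed(123) # assumption: any seed works, it only makes the draw repeatable
mask <- sample(c(TRUE, NA), size = 1000, replace = TRUE, prob = c(0.9, 0.1))
y <- seq_len(1000)[mask]
share_missing <- mean(is.na(y)) # should land close to 0.1
share_missing
```

Note that the vector keeps its original length; only the values at the NA positions are replaced.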

Let’s start and create descriptive summary statistics tables in R.

Create Descriptive Summary Statistics Tables in R with arsenal

Arsenal is my favorite package. It has so much functionality that we could essentially stop right here. We can customize almost anything, and the best part about the package is that it requires very little code.

In the code block below, we are displaying how to create a table with the tableby() function and only two lines of code.

library(arsenal) 
table_one <- tableby(continent ~ ., data = gapminder) 
summary(table_one, title = "Gapminder Data")

Obviously, this table is far from perfect but especially when we are dealing with large data sets, these two lines are very powerful.

In the next code block, we are customizing our table. We are now adding a median with first and third quartiles and are also changing the order in which the statistics are displayed. The statistic “Nmiss2” shows the number of missing values and displays 0 if there are none. If you use “Nmiss” instead and there are no missing values, the table won’t display a line for missing values at all. Moreover, we can display the missing values not only as counts but also as percentages (more examples in the vignette).

For the p-values, the control below uses a chi-squared test for categorical variables (cat.test = "chisq") and a Kruskal-Wallis test for numerical variables (numeric.test = "kwt"). Other tests are available as well, for example an ANOVA F-test (numeric.test = "anova"). In fact, we can add our own p-values if we would like (more in the vignette).

We can also label our columns with more appropriate names and add a title to our table.

my_controls <- tableby.control(
  test = TRUE,
  total = TRUE,
  numeric.test = "kwt", cat.test = "chisq",
  numeric.stats = c("meansd", "medianq1q3", "range", "Nmiss2"),
  cat.stats = c("countpct", "Nmiss2"),
  stats.labels = list(
    meansd = "Mean (SD)",
    medianq1q3 = "Median (Q1, Q3)",
    range = "Min - Max",
    Nmiss2 = "Missing"
  )
)


my_labels <- list(
  lifeExp = "Life Expectancy",
  pop = "Population (million)",
  gdpPercap = "GDP per capita",
  year = "Year"
)


table_two <- tableby(continent ~ .,
  data = gapminder,
  control = my_controls
)

summary(table_two,
  labelTranslations = my_labels,
  title = "Summary Statistic of Gapminder Data"
)
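To get a feeling for what the p-value column reports, the tests named in the control can be run by hand in base R. This is only a sketch on made-up toy data; my understanding is that "kwt", "chisq", and "anova" correspond to the base R tests below, but arsenal may apply its own defaults (for example a continuity correction setting), so treat the match as an assumption.

```r
set.seed(1) # toy data, made up for illustration
toy <- data.frame(
  group = factor(rep(c("A", "B", "C"), each = 40)),
  value = c(rnorm(40, 50), rnorm(40, 52), rnorm(40, 55)),
  level = factor(sample(c("high", "low"), 120, replace = TRUE))
)

# Numeric variable across groups: Kruskal-Wallis rank sum test
kw <- kruskal.test(value ~ group, data = toy)

# Categorical variable across groups: chi-squared test on the cross table
chi <- chisq.test(table(toy$level, toy$group))

# Alternative for numeric variables: one-way ANOVA F-test
aov_p <- anova(lm(value ~ group, data = toy))[["Pr(>F)"]][1]

c(kruskal = kw$p.value, chisq = chi$p.value, anova = aov_p)
```

With three groups, both the Kruskal-Wallis test and the chi-squared test on a two-level variable have two degrees of freedom.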

Another nice feature of this package is that we can stratify our table by more than one grouping variable. Here, we group by continent and gdpPercap.

table_three <- tableby(interaction(continent, gdpPercap) ~ .,
  data = gapminder,
  control = my_controls
)

summary(table_three,
  labelTranslations = my_labels,
  title = "Summary Statistic of Gapminder Data"
)
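The grouping by two variables relies on base R's interaction(), which builds one factor out of every combination of levels. A small sketch on toy factors (the values are made up) shows what the resulting column headers look like:

```r
continent <- factor(c("Asia", "Asia", "Europe", "Europe"))
gdp <- factor(c("high", "low", "high", "low"))

combined <- interaction(continent, gdp)
as.character(combined)
# "Asia.high" "Asia.low" "Europe.high" "Europe.low"

levels(combined) # the first factor varies fastest
# "Asia.high" "Europe.high" "Asia.low" "Europe.low"

# A missing value in either variable makes the combination missing as well.
with_missing <- interaction(factor(c("Asia", NA)), factor(c("high", "low")))
is.na(with_missing)
# FALSE TRUE
```

Since the gapminder data now contains NA values in both columns, rows with a missing continent or a missing gdpPercap do not belong to any of the combined groups.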

And of course, we can also only create a very simple table without any groupings.

table_four <- tableby(~year + continent + lifeExp + gdpPercap + pop, data = gapminder) 
summary(table_four)

I only covered the most essential parts of the package, so there is a lot more to discover. If you want to customize your tables even more, check out the vignette for the package, which shows more in-depth examples.

Create Descriptive Summary Statistics Tables in R with qwraps2

Another great package is the qwraps2 package. It is very flexible, but that flexibility comes at a price: lots of lines of code, especially if we have a large data set with many columns and levels.

This package uses a nested list and the function summary_table() to create the statistics table.

library(qwraps2)
options(qwraps2_markup = "markdown")
gapminder <- as.data.frame(gapminder)
summary_statistics <-
  list(
    "Life Expectancy" =
      list(
        "mean (sd)" = ~qwraps2::mean_sd(lifeExp, na_rm = TRUE),
        "median (Q1, Q3)" = ~qwraps2::median_iqr(lifeExp, na_rm = TRUE),
        "min" = ~min(lifeExp, na.rm = TRUE),
        "max" = ~max(lifeExp, na.rm = TRUE),
        "Missing" = ~sum(is.na(lifeExp))
      ),
    "Population" =
      list(
        "mean (sd)" = ~qwraps2::mean_sd(pop, na_rm = TRUE),
        "median (Q1, Q3)" = ~qwraps2::median_iqr(pop, na_rm = TRUE),
        "min" = ~min(pop, na.rm = TRUE),
        "max" = ~max(pop, na.rm = TRUE),
        "Missing" = ~sum(is.na(pop))
      ),
    "GDP per Capita" =
      list(
        "High GDP per Capita" = ~qwraps2::n_perc(na.omit(gdpPercap) %in% "high"),
        "Low GDP per Capita" = ~qwraps2::n_perc(na.omit(gdpPercap) %in% "low"),
        "Missing" = ~sum(is.na(gdpPercap))
      )
  )

summary_table(gapminder, summary_statistics)
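If the one-sided formulas in the list look unusual: a formula such as ~ mean(lifeExp) stores an expression that can be evaluated later against a data frame. The sketch below illustrates that idea in base R. It is not how qwraps2 is implemented internally; eval_rhs() is a hypothetical helper and the toy data is made up.

```r
toy <- data.frame(lifeExp = c(60, 70, NA, 80))

# One-sided formulas, written like the entries of summary_statistics above
stats <- list(
  "mean"    = ~ mean(lifeExp, na.rm = TRUE),
  "Missing" = ~ sum(is.na(lifeExp))
)

# Hypothetical helper: evaluate the right-hand side of a one-sided formula
# with the columns of the data frame in scope.
eval_rhs <- function(f, data) {
  as.numeric(eval(f[[2]], envir = data, enclos = environment(f)))
}

result <- vapply(stats, eval_rhs, numeric(1), data = toy)
result
#    mean Missing
#      70       1
```

This is why every row of the table can be any R expression we like, as long as it refers to columns of the data.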

As you can see, it is way more lines of code than the previous package. However, it has the great flexibility to customize every single line of our summary table. This is awesome!

Now, we are going to show how to display a table stratified by a grouping. The way to do that is with the group_by function from the dplyr package.

print(qwraps2::summary_table(
  dplyr::group_by(gapminder, continent),
  summary_statistics
),
rtitle = "Summary Statistics Table for the Gapminder Data Set"
)

Again, more functionality and examples can be found in the vignette.

Create Descriptive Summary Statistics Tables in R with Amisc

Amisc is a great package for summary statistics tables. Notice, however, that this package can only produce tables with groupings. If it has to build a simple summary statistics table, it will fail. Another point worth mentioning is that you can get this package from GitHub. It is currently not on CRAN. Let’s jump to the code.

library(Amisc)
library(pander)
pander::pandoc.table(Amisc::describeBy(
  data = gapminder,
  var.names = c("lifeExp", "pop", "gdpPercap"),
  by1 = "continent",
  dispersion = "sd", Missing = TRUE,
  stats = "non-parametric"
),
split.tables = Inf
)

The table is very simple but informative. It shows the mean, the median, and the interquartile range, and it reports the missing values as counts rather than percentages. The package uses the pandoc.table() function from the pander package to display a nice looking table. Overall, I really like the simplicity of the table. Unfortunately, there is not much documentation about this package.

Create Descriptive Summary Statistics Tables in R with table1

The next summary statistics package which creates a beautiful table is table1. In the code below, we are first relabelling our columns for aesthetics. Then we are creating the table with only one line of code. We again created a table by groupings.

library(table1)
table1::label(gapminder$lifeExp) <- "Life Expectancy"
table1::label(gapminder$pop) <- "Population"
table1::label(gapminder$gdpPercap) <- "Gdp Per Capita"

table1::table1(~lifeExp + pop + gdpPercap | continent, data = gapminder)

Here, the missing values are displayed as percentages. I prefer to have them displayed only as counts: more often than not, I am interested in the percentages of a factor variable's levels calculated without the NA values. Unfortunately, this package only offers the option to show the missing values as percentages, so in the gdpPercap column “Missing” essentially acts as a third level next to high and low. If you do not mind having the missing values displayed like that, then this package is for you.
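The difference between the two ways of calculating percentages is easy to see in base R. This is a small sketch on a made-up factor, not output from table1:

```r
# Toy factor with 3 x high, 3 x low and 2 missing values
gdp <- factor(c("high", "high", "low", NA, "low", "high", NA, "low"))

# Missing values treated like an extra level: high and low shrink
with_na <- prop.table(table(gdp, useNA = "ifany"))
as.vector(with_na)
# 0.375 0.375 0.250

# Missing values excluded from the denominator
without_na <- prop.table(table(gdp))
as.vector(without_na)
# 0.5 0.5

n_missing <- sum(is.na(gdp)) # missing values as a plain count
n_missing
# 2
```

In the first version the share of "high" drops from 50% to 37.5% only because the missing values are part of the denominator.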

In the code below, we are showing how to create a table without stratification by any group.

table1::table1(~lifeExp + pop + gdpPercap, data = gapminder)

Again, many more things are possible with this package. For example, you can create subgroupings. In addition to that, it is also possible to put p-values as a separate column at the end of the table. If you are interested, check out the vignette.

Now let’s switch the data set. It is becoming a bit boring to see the same data again and again. For the remaining tables, we are using the mtcars data set. Again, a bit modified and with the introduction of missing values.

data(mtcars)
mtcars %>%
  mutate(cylinder = factor(cyl), transmission = factor(am), weight = wt, milesPergallon = mpg) %>%
  select(cylinder, transmission, weight, milesPergallon) -> mtcars
# non-syntactic names such as 4 must be wrapped in backticks inside recode()
mtcars$cylinder <- recode(mtcars$cylinder, `4` = "4 cylinders", `6` = "6 cylinders", `8` = "8 cylinders")

mtcars <- lapply(mtcars, function(x) x[sample(c(TRUE, NA),
    prob = c(0.8, 0.2),
    size = length(x),
    replace = TRUE
  )])
mtcars <- as.data.frame(mtcars)

 

Create Descriptive Summary Statistics Tables in R with tangram

I really really like the next package. The design is very beautiful and the code is also very short. The only drawback of this package is that it only knits to HTML. You can’t compile it to word :(. Another (tiny) drawback is that this table does not show the missing values by default. However, the package includes a function called insert_row(), where you can insert missing values or any other values (confidence interval for the mean, etc.) that you have calculated.

library(tangram)
tan <- tangram::tangram("cylinder ~ transmission + weight + milesPergallon",
  data = mtcars,
  msd = TRUE,
  quant = seq(0, 1, 0.25)
)
html5(tan, fragment = TRUE, 
           inline = "hmisc.css", 
           caption = "Summary Statistics of mtcars Data Set", 
           id = "tbl2")

In the next code block, I am showing you how to insert missing values. For the first three lines, I am using the purrrlyr package. This package is a combination of the dplyr and purrr packages. So what I am doing is separating the levels of the column I want to group by. In this case cylinders. After that, I am calculating the missing values of each cylinder group (4, 6, and 8) for every column.

Then we drop the last row of our tibble, which holds the counts for the rows where cylinder itself is missing, and keep only the columns that appear in the table. Next, we calculate the total number of missing values for each column. Finally, we rbind the totals on top of the per-group counts and remove the column names.

library(purrrlyr)
mtcars %>%
  slice_rows("cylinder") %>%
  dmap(~sum(is.na(.))) -> by_cyl

# make sure variables are in the same order they appear in the tangram() function above
by_cyl <- select(by_cyl[-4, ], transmission, weight, milesPergallon) 
column_sums <- colSums(by_cyl)
by_cyl <- rbind(column_sums, by_cyl)
names(by_cyl) <- NULL

### This is how the insert_row() function works ###
# each call passes the total first, then one value per cylinder group
# tan <- insert_row(tan, 3, "Missing", by_cyl[1, 1], by_cyl[2, 1], by_cyl[3, 1], by_cyl[4, 1])
# tan <- insert_row(tan, 5, "Missing", by_cyl[1, 2], by_cyl[2, 2], by_cyl[3, 2], by_cyl[4, 2])
# tan <- insert_row(tan, 7, "Missing", by_cyl[1, 3], by_cyl[2, 3], by_cyl[3, 3], by_cyl[4, 3])

The commented-out section shows how the insert_row() function works. The first argument is the tan object that we created in the code block above. The next argument is the row position at which you want to insert the new row. Then we specify how we want to name the row; in our case, we are naming it “Missing”. The remaining four arguments are the values that we want to insert in the row: first the total number of missing values for the corresponding column, then the missing values for that column in each cylinder group.

We do not necessarily have to insert the missing values. We can insert any number we want, for example a trimmed mean.
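As a rough sketch of how such values could be prepared, the snippet below computes a trimmed mean and the number of missing values per group in base R. The toy data frame is made up and only mimics the cylinder and weight columns; the 20% trimming is an arbitrary choice.

```r
toy <- data.frame(
  cylinder = factor(rep(c("4", "6"), times = c(5, 6))),
  weight = c(2.0, 2.2, 2.4, 2.6, 9.0, 3.0, 3.2, NA, 3.4, 3.6, 3.8)
)

# Trimmed mean per group: drops the lowest and highest 20% before averaging
trimmed_by_group <- tapply(toy$weight, toy$cylinder, mean, trim = 0.2, na.rm = TRUE)

# Trimmed mean over all rows, for the total column
trimmed_total <- mean(toy$weight, trim = 0.2, na.rm = TRUE)

# Number of missing values per group
missing_by_group <- tapply(is.na(toy$weight), toy$cylinder, sum)

trimmed_by_group # about 2.4 for group 4 and 3.4 for group 6
missing_by_group # 0 for group 4 and 1 for group 6
```

The outlier of 9.0 in the first group pulls the ordinary mean up to 3.64, while the trimmed mean stays at about 2.4.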

If you have a lot of rows to insert, this method becomes tedious and you have to write a lot of code. To make your lives easier, I created a generic loop that takes care of almost everything. The only part that needs specification is the position at which each row should be inserted in the table, which you set in the row_numbers vector. The argument_number object specifies how many arguments the insert_row() function takes. The first three arguments are reserved for the table (tan), the row number, and the row label. Then you have an argument for the total number of missing values. After that, the number of arguments depends on how many levels the column you want to group by has. The code below shows the generic loop.

split_by <- mtcars$cylinder
row_numbers <- c(3, 5, 7)
argument_number <- nlevels(split_by) + 1 + 3 # plus 1 refers to total column in table
# plus 3 refers to the first 3 args of insert_row

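j <- 1
for (row_position in row_numbers) {
  args <- vector("list", argument_number) # one slot per insert_row() argument
  args[[1]] <- tan
  args[[2]] <- row_position
  args[[3]] <- "Missing"
  for (i in 4:argument_number) {
    args[[i]] <- by_cyl[i - 3, j] # total first, then one value per cylinder group
  }
  tan <- do.call(tangram::insert_row, args)
  j <- j + 1 # move on to the next variable (column of by_cyl)
}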

html5(tan, fragment = TRUE, 
           inline = "hmisc.css", 
           caption = "Summary Statistics of mtcars Data Set", 
           id = "tbl2")
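The loop above relies on do.call(), which calls a function with an argument list that is built at run time. Here is a minimal sketch of that pattern without tangram. make_row() is a hypothetical stand-in for insert_row(), and the counts are made up.

```r
# Hypothetical stand-in for insert_row(): takes a label followed by
# any number of cell values and pastes them into one string.
make_row <- function(label, ...) {
  paste(label, paste(..., sep = " | "), sep = ": ")
}

# Made-up counts: total first, then one value per group
values <- c(total = 5, four = 2, six = 1, eight = 2)

# Build the argument list programmatically
args <- vector("list", length(values) + 1)
args[[1]] <- "Missing"
for (i in seq_along(values)) {
  args[[i + 1]] <- values[[i]]
}

# Equivalent to make_row("Missing", 5, 2, 1, 2)
row <- do.call(make_row, args)
row
# "Missing: 5 | 2 | 1 | 2"
```

The advantage is that the number of arguments no longer has to be known when writing the code, so the same loop works for a grouping column with any number of levels.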

This package has way more functionality than we have shown. The documentation is very long but not very detailed: the vignette does not show many more examples, and when it does, the code behind them is hard to understand. Overall it is a good package, but if you want to customize your table heavily, I would suggest using another package.

Create Descriptive Summary Statistics Tables in R with furniture

Our next package will be the furniture package. It is an okay package in my opinion. Missing values are only displayed for categorical variables and only as percentages again. The overall look of the table is very simple.

library(furniture)
library(knitr)

furniture::table1(mtcars,
  "Miles per US gallon" = milesPergallon, "Transmission" = transmission, "Weight 1000 lbs" = weight,
  splitby = ~cylinder,
  test = TRUE,
  na.rm = FALSE,
  format_number = TRUE
) -> tab11

kable(tab11)

There is nothing much more to say and if you are interested you can find the vignette here.

Create Descriptive Summary Statistics Tables in R with tableone

The tableone package is more aesthetic than the furniture package. However, it does not display missing values. If you want to display missing values, you must print them out in a separate table with the summary() function.

library(tableone)
library(pander)

factor_variables <- c("transmission")
variable_list <- c("milesPergallon", "weight", "transmission")

table_one <- CreateTableOne(
  vars = variable_list,
  strata = "cylinder",
  data = mtcars,
  factorVars = factor_variables
)

table_one_matrix <- print(table_one,
  includeNA = TRUE,
  showAllLevels = TRUE
)

pandoc_table <- pandoc.table(table_one_matrix,
  split.tables = Inf,
  style = "rmarkdown",
  caption = "mtcars summary statistics table"
)

summary(table_one)

As with the furniture package, there is nothing more to add and the vignette can be found here.

Create Descriptive Summary Statistics Tables in R with compareGroups

compareGroups is another great package that can stratify our table by groups. It is very simple to use. One drawback, however, is that it does not display missing values by default. To add missing values, we must include the argument include.miss = TRUE. The missing values are only displayed as percentages. As with the tableone package, we can display missing values in a separate table.

By default, the compareGroups package displays five or fewer groups. When we have a grouping column with more than five levels, we can pass the max.ylev argument to the compareGroups function to specify the number of groups. Thanks, Isaac, for commenting and pointing it out!

For the mtcars data set we only have three groupings (4 cylinders, 6 cylinders, and 8 cylinders). So we don’t have to use the max.ylev argument in the compareGroups function.

library(compareGroups)

table <- compareGroups(cylinder ~ ., data = mtcars)
pvals <- getResults(table, "p.overall")
p.adjust(pvals, method = "BH")
export_table <- createTable(table)
export2word(export_table, file = "table.docx")
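The code above also adjusts the overall p-values for multiple testing with the Benjamini-Hochberg ("BH") method. Here is a small sketch of what p.adjust() does, on made-up p-values rather than the ones from the table:

```r
# Made-up p-values for three hypothetical variables
pvals_demo <- c(var_a = 0.01, var_b = 0.04, var_c = 0.03)

adjusted <- p.adjust(pvals_demo, method = "BH")
adjusted
# var_a var_b var_c
#  0.03  0.04  0.04
```

Each p-value is multiplied by the number of tests divided by its rank (smallest first), and the results are then forced to be non-decreasing in rank order. That is why var_c ends up at 0.04 instead of 0.03 * 3 / 2 = 0.045.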

There is a lot more to discover for this package in the vignette.

Create Descriptive Summary Statistics Tables in R with Gmisc

The Gmisc package is another great package which will create an awesome looking summary statistics table for you. Relabelling variables is very easy and the table looks really beautiful. The only drawback is that the table can only be created in an HTML file. It, unfortunately, cannot be knitted to a word document.

library(Gmisc)

getT1Stat <- function(varname, digits = 0) {
  getDescriptionStatsBy(mtcars[, varname],
    mtcars$cylinder,
    add_total_col = TRUE,
    show_all_values = TRUE,
    hrzl_prop = TRUE,
    statistics = FALSE,
    html = TRUE,
    digits = digits
  )
}

table_data <- list()

table_data[["Miles/(US) gallon"]] <- getT1Stat("milesPergallon")
table_data[["Weight (1000 lbs)"]] <- getT1Stat("weight")
table_data[["Transmission (0 = automatic, 1 = manual)"]] <- getT1Stat("transmission")

rgroup <- c()
n.rgroup <- c()
output_data <- NULL
for (varlabel in names(table_data)) {
  output_data <- rbind(
    output_data,
    table_data[[varlabel]]
  )
  rgroup <- c(
    rgroup,
    varlabel
  )
  n.rgroup <- c(
    n.rgroup,
    nrow(table_data[[varlabel]])
  )
}


htmlTable(output_data,
  align = "rrrr",
  rgroup = rgroup, n.rgroup = n.rgroup,
  rgroupCSSseparator = "",
  rowlabel = "",
  caption = "Summary Statistics",
  ctable = TRUE
)

For more information and examples have a look at the vignette.

I discovered all of these packages during my data science internship. If you want to know what else I had to do and what I learned from this data science internship then you can read about it here.

I hope you all have enjoyed this post and that you have found a package that suits your needs. If there are any other packages you know of that I have missed in this blog post please let me know in the comments below.

If you liked this blog post, you might also like a collection of other packages that can create tables for you (ANOVA tables, linear models output tables, and more descriptive summary statistics tables). The ones I presented are the best ones for descriptive summary tables I believe.

Lastly, this is another great blog post that presents how to easily summarise data in R.

Comments (20)

  1. Very interesting post.
    Regarding compareGroups package, you can change the max.ylev argument from compareGroups function to build tables with more than 5 groups.
    In the package vignette you can find an explanation of this option and many others with real data examples.

    1. Hi Isaac,

      I am glad you found the post interesting and hopefully helpful 🙂 I had a look at the vignette, updated the post, and included the max.ylev argument. Good eye, thank you for your feedback!

  2. wow, I spent hours trying to find a simple way of doing a summary table and arsenal::tableby was exactly what I was looking for! easy to mix categorical and numerical, and without confusing variables into weird parent-child relations (this, I’ve found, was hard to solve in some other packages I tested).

    thank you for taking the time to make this tutorial!

    1. Glad it helped Martin! I also spent several hours evaluating different packages and arsenal is the most flexible library I have used so far. Enjoy!

  3. Dear Pascal,

    THANK YOU! You are a lifesaver – this is clear, concise, and very beginner-friendly. I will be consulting this page extensively for my senior thesis project in sociology 🙂

  4. What package is the best for calculating t.test on a dichotomous dependent variable, rather than default ANOVA? I noticed that ‘arsenal’ only allows you to pick between ANOVA and KWT to calculate the p-value. The package instructions read “when LHS variable has two levels, equivalent to two-sample t-test.” However, I am not convinced this is appropriate, since using an ANOVA with a dichotomous variable will only be accurate under certain strict conditions (https://www.jstor.org/stable/1434469?seq=1). Please help!

    1. Hi Stephen,

      Have a look at the vignette at point 17 where it says that you can create your own p-values. I calculated the p-values with a t-test in the example below.


      library(gapminder)
      library(arsenal)
      library(dplyr) # provides %>% and vars()

      gapminder %>%
      dplyr::filter(continent == "Asia" | continent == "Europe") %>%
      dplyr::mutate(continent = forcats::fct_drop(continent)) %>%
      dplyr::select(lifeExp, gdpPercap, continent) %>%
      dplyr::summarise_at(vars(lifeExp, gdpPercap), ~ list(t.test(. ~ continent))) %>%
      purrr::map(~ purrr::pluck(., 1, "p.value")) %>%
      purrr::flatten_chr() -> my_p_values

      # the p-values must be in the same order as the variables they belong to
      mypval <- data.frame(
        byvar = "continent",
        variable = c("lifeExp", "gdpPercap"),
        adj.pvalue = my_p_values
      )
      tab1 <- tableby(continent ~ gdpPercap + lifeExp, data = gapminder)
      tab2 <- modpval.tableby(tab1, mypval, use.pname = TRUE)
      summary(tab2)

      Here is a link to the vignette: https://cran.r-project.org/web/packages/arsenal/vignettes/tableby.html#create-your-own-p-value-and-add-it-to-the-table

      I hope that helps!

  5. Hi,
    This is a great tutorial and arsenal does everything clinical research needs. Many thanks!
    My table currently outputs to the console though. Is there a way to output it to the RStudio viewer like the images you have shown, or is this only done through the “write2” function?

    Thanks!

    1. Hi,

      Thanks for your comment! I am not sure if there is a way to see the table in the viewer pane. I always use an Rmarkdown file and then knit to pdf, word or html.

      So what I do is:

      ```{r, results = "asis"}
      my_table_output
      ```

      and then knit it to html/word/pdf. I have never used the write2 function before.

  6. Hi Pascal,
    Thank you so much for this great article!
    Many interesting packages and clear explanations. It helped me a lot!

    I have a question: when making a table without a grouping variable, is there a way to produce a table with the summary statistics as columns, so that the table is smaller and easier to read? I like to produce tables with qwraps2 and I tried to transpose the result, but it doesn’t work.
    I would like to produce a table like this:
    for continuous variables:
                 Min   Max   Mean(sd)   Median[Q1-Q3]   Missing
    variable1
    variable2
    variable3

    for categorical variables:
                 Freq(%)   Missing
    variable1
      choice1
      choice2
    variable2
      choice1
      choice2

    All the best,
    Ana

    1. Hi Ana,

      Thank you for your comment, I am glad it helped out. You can look at a package such as skimr on CRAN.

      Alternatively, I coded something up that might help you and that you can modify and tweak it to your need with the gapminder package below.


      library(gapminder)
      library(tidyverse)

      # for numerical variables
      gapminder %>%
      dplyr::summarize_if(is.numeric, list(mean = ~mean(., na.rm = TRUE),
      median = ~median(., na.rm = TRUE),
      max = ~max(., na.rm = TRUE),
      min = ~min(., na.rm = TRUE))) %>%
      tidyr::pivot_longer(-NULL, names_to = "var") %>%
      tidyr::separate(var, into = c("var", "statistic")) %>%
      tidyr::pivot_wider(names_from = statistic, values_from = value)

      # for factors
      gapminder %>%
      purrr::map_dfr(~ (is.factor(.) | is.character(.))) %>%
      purrr::pmap(~ c(...)) %>%
      purrr::flatten_lgl() %>%
      unname() %>%
      gapminder[, .] -> gapminder

      gapminder %>%
      dplyr::mutate_all(.funs = ~ ifelse(is.na(.), "Missing", as.character(.))) %>%
      tidyr::pivot_longer(-NULL) %>%
      dplyr::group_by_all() %>%
      dplyr::summarise(n = dplyr::n()) %>%
      dplyr::ungroup() %>%
      dplyr::group_by(name) %>%
      dplyr::mutate(freq = n / sum(n))

      Let me know if you have more questions.

  7. Hi,
    Is it possible to export table to Word (or Excel), if table was made with “table1”? if yes, could you please provide a code for that?

    I tried a few options, but non of them worked.

  8. Is there any package for adding more than one variables in row other than pivottabler? or any other way? pivottabler makes it hard to include percentage along with count.

  9. Thank you very much for this helpful article!
    I chose to build my tables with the arsenal package since I mostly wanted to illustrate descriptive statistics.
    When I am knitting the pdf document, the tables are always in the center position, but I want them to be aligned left. I couldn’t solve this problem yet. Hope that you can help me.

    Best regards and thanks in advance!

    1. Thanks for your comment. Maybe you can try some css if you knit to HTML such as

      Your Rmarkdown code chunk with code to produce table

      Let me know if that worked.

  10. well, I’ve spent the morning working through the first two and got neither to work. I’ve been working in R for more than 5 years, so it’s not my first rodeo. Dumb as I am, I didn’t try to reproduce your gapminder examples, but use your code on my own data. I get lots of gibberish in the output, way more than cleaning it would be worth

  11. What a joy to read through your article, Pascal. It is well-researched and easy to follow. Thank you so much! I am investigating into inserting one of the summary statistics created from the table1 command into a ggplot with facet. I looked into annotate and add_summary, but neither worked. I’d love to hear any suggestions or pointers.
