Build a Histogram in R From Scratch – Resembling the hist() Function in R

February 26, 2020 By Pascal Schmidt programming R

In this tutorial, we will create a histogram in R from scratch, without the base hist() function, without ggplot2's geom_histogram(), and without any other plotting library. We will draw it using only the plot() and lines() functions in base R.

For our histogram, we will be developing two functions. One function creates the desired intervals or bins and our second function counts how many values fall into a bin and then draws the histogram.

Let’s start and build a histogram from scratch in R.

First, we need to calculate the range of our vector. Then, we need to know how wide the intervals or bins should be. We do that with the following code.

x <- seq(0, 10)
n <- 5

range_x <- range(x)
diff <- diff(range_x)
bin_width <- diff / n

In the case above, 10 – 0 equals 10 and we want to have 5 bins in total. This gives us a bin width of 2 for each interval.
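The same arithmetic also yields every bin edge in one step. The short sketch below is an illustration added alongside the walkthrough, not a step the later code depends on; the name break_points_direct is made up for this example.

```r
x <- seq(0, 10)
n <- 5

# Width of each bin: total range divided by the number of bins
bin_width <- diff(range(x)) / n

# n bins need n + 1 edges, starting at the minimum of x
break_points_direct <- min(x) + bin_width * (0:n)
break_points_direct
```

For the example vector this gives the edges 0, 2, 4, 6, 8 and 10.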

Next, we want to specify if the intervals should be right open or left open. In our case, we want to have our intervals right open. When we reach our last interval, we want that interval to be a closed interval so we include the maximum number in our vector as well when we draw our histogram.
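To make the difference between a right-open and a closed interval concrete, here is a small sketch that tests a few values against the edges 2 and 4. The variable names are chosen only for this illustration.

```r
lower <- 2
upper <- 4
values <- c(2, 3, 4)

# Right-open [2, 4): the lower edge is included, the upper edge is not
in_right_open <- values >= lower & values < upper

# Closed [2, 4]: both edges are included
in_closed <- values >= lower & values <= upper

in_right_open
in_closed
```

The value 4 is excluded from the right-open interval and would be counted in the next bin instead. Only the last bin is closed, so the maximum of the vector is not lost.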

Histogram From Scratch in R – Interval Function

intervals <- vector(mode = "character", length = n)
for(i in 1:n) {
  
  if(i == 1) {
    
    intervals[[i]] <- paste0("[", range_x[1], 
                             ", ", 
                             range_x[1] + (bin_width * i), ")")
    
  }
  else if(i == n) {
    
    intervals[[i]] <- paste0("[", range_x[1] + (bin_width * (i - 1)), 
                             ", ", 
                             range_x[1] + bin_width * i, "]")
    
  }
  
  else {
    
    intervals[[i]] <- paste0("[", range_x[1] + (bin_width * (i - 1)), 
                             ", ", 
                             range_x[1] + bin_width * i, ")")
    
  }
  
}

intervals
[1] "[0, 2)"  "[2, 4)"  "[4, 6)"  "[6, 8)"  "[8, 10]"

Wrapped in a function, the interval code looks like this:

intervals <- function(x, n) {
  
  range_x <- range(x)
  diff <- diff(range_x)
  bin_width <- diff / n
  
  intervals <- vector(mode = "character", length = n)
  for(i in 1:n) {
    
    if(i == 1) {
      
      intervals[[i]] <- paste0("[", range_x[1], 
                               ", ", 
                               range_x[1] + (bin_width * i), ")")
      
    }
    else if(i == n) {
      
      intervals[[i]] <- paste0("[", range_x[1] + (bin_width * (i - 1)), 
                               ", ", 
                               range_x[1] + bin_width * i, "]")
      
    }
    
    else {
      
      intervals[[i]] <- paste0("[", range_x[1] + (bin_width * (i - 1)), 
                               ", ", 
                               range_x[1] + bin_width * i, ")")
      
    }
    
  }
  
  return(intervals)
  
}

x <- seq(0, 10)
n <- 5

intervals(x, n)
[1] "[0, 2)"  "[2, 4)"  "[4, 6)"  "[6, 8)"  "[8, 10]"
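Because paste0() is vectorized, the same labels can also be built without a loop. The following is a sketch of that alternative, assuming n is at least 1; interval_labels is a hypothetical name and not part of the original code.

```r
interval_labels <- function(x, n) {
  bin_width <- diff(range(x)) / n

  # Lower and upper edge of every bin, computed in one step
  lower <- min(x) + bin_width * (0:(n - 1))
  upper <- min(x) + bin_width * (1:n)

  # Every bin is right-open except the last one, which is closed
  closing <- c(rep(")", n - 1), "]")

  paste0("[", lower, ", ", upper, closing)
}

interval_labels(seq(0, 10), 5)
```

For the example vector this returns the same five labels as the loop version.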

Histogram From Scratch in R – Count Function

Next, we want to calculate how many values of our vector fall inside each interval. To do that, we first turn the interval labels back into numbers. We use str_split() from the stringr package to split each label at the comma, and parse_number() from the readr package to drop the brackets and return doubles. The result is a list with one element per interval, each holding the lower and upper bound.

library(magrittr)  # provides the %>% pipe used below

bins <- intervals(x, n)
stringr::str_split(bins, pattern = ",") %>%
    purrr::map(~readr::parse_number(.)) -> ranges
ranges

[[1]]
[1] 0 2

[[2]]
[1] 2 4

[[3]]
[1] 4 6

[[4]]
[1] 6 8

[[5]]
[1]  8 10
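If you would rather avoid the extra packages, the labels can be parsed with base R alone. This is only a sketch of one possible approach, assuming the labels have exactly the format produced above (one bracket on each side and ", " as the separator); ranges_base is a name made up for this example.

```r
bins <- c("[0, 2)", "[2, 4)", "[4, 6)", "[6, 8)", "[8, 10]")

# Drop the first and last character (the brackets) of every label
inner <- substr(bins, 2, nchar(bins) - 1)

# Split at the separator and convert both parts to numbers
ranges_base <- lapply(strsplit(inner, ", ", fixed = TRUE), as.numeric)
ranges_base[[5]]
```

Each list element again holds the lower and upper bound of one interval.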

Then we use a for loop from 1 to n and count how many values fall into each interval. This could be done with a nested for loop, but the inner loop is replaced here by a vectorized comparison.

Consider the first interval:

start <- 0
end <- 2
x >= start & x < end

[1] TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

The code above returns a logical vector that tells us which values fall into the desired interval (TRUE) and which ones do not (FALSE).

In arithmetic, R treats each TRUE as 1 and each FALSE as 0. Hence, we can simply sum up the logical vector to get the count, like this:

sum(x >= start & x < end)
[1] 2
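The coercion from logical to numeric can be made visible with as.integer(). The sketch below also uses mean() on the same logical vector, which gives the share of values in the bin; that second step is an addition for illustration and is not needed for the histogram.

```r
x <- seq(0, 10)
in_first_bin <- x >= 0 & x < 2

# TRUE becomes 1 and FALSE becomes 0
as.integer(in_first_bin)

# Summing the logical vector counts the TRUE values
count_first <- sum(in_first_bin)

# The mean of a logical vector is the proportion of TRUE values
share_first <- mean(in_first_bin)
```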

Below, you can find the code for calculating the counts for each interval:

bins <- intervals(x, n)

y <- vector(mode = "numeric", length = n)
stringr::str_split(bins, pattern = ",") %>%
    purrr::map(~readr::parse_number(.)) -> ranges

for(i in seq_along(bins)) {
  
  if(i == 1) {
    
    y[i] <- sum(x < ranges[[i]][2])
    
  }
  
  else if(i == length(bins)) {
    
    y[i] <- sum(x >= ranges[[i]][1])
    
  } else {
    
    y[i] <- sum(x >= ranges[[i]][1] & x < ranges[[i]][2])
    
  }
  
}

y
[1] 2 2 2 2 3
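As a cross-check, the counts can be reproduced with the base R functions findInterval() and tabulate(). This is a different technique from the loop above and is shown only as a sanity check; edges, bin_index and counts_check are names chosen for this sketch.

```r
x <- seq(0, 10)
n <- 5

# n + 1 equally spaced bin edges from the minimum to the maximum of x
edges <- seq(min(x), max(x), length.out = n + 1)

# Index of the right-open bin each value falls into;
# rightmost.closed = TRUE puts the maximum into the last bin
bin_index <- findInterval(x, edges, rightmost.closed = TRUE)

# Count how many values landed in each of the n bins
counts_check <- tabulate(bin_index, nbins = n)
counts_check
```

For the example vector this gives 2, 2, 2, 2 and 3, matching the loop.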

Now, we just have to draw the histogram with the plot() and lines() functions. With the code below, we draw an empty plot and remove all the axes. Then, we add an x-axis and a y-axis to it. For the x-axis, we specify the break points.

plot(1,
     type = "n",
     xlab = "x",
     ylab = "Frequency",
     xlim = c(min(x), max(x)),
     ylim = c(0, max(y)),
     main = "Histogram of x",
     axes = FALSE,
     xaxt = "n",
     yaxt = "n")

ranges %>%
  purrr::flatten_dbl() %>%
  unique() -> break_points

axis(side = 1, at = round(break_points, 0))
axis(side = 2)

To draw the bars, we use a for loop from 1 to n. For each interval, lines() draws a vertical line where the interval starts, a second vertical line where the interval ends, and a horizontal line on top of those two lines. Empty bins are skipped. The last lines() call closes every bar from below.

for(i in 1:n) {
  
  if(y[i] == 0) {
    
    next
    
  }
  
  else {
    
    lines(c(ranges[[i]][1], ranges[[i]][1]), c(0, y[i]))
    lines(c(ranges[[i]][2], ranges[[i]][2]), c(0, y[i]))
    lines(c(ranges[[i]][1], ranges[[i]][2]), c(y[i], y[i]))
    
  }
  
}

lines(c(ranges[[1]][1], ranges[[n]][2]), c(0, 0))
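Since lines() connects all the points it is given, the three sides of a bar can also be drawn with a single call. The sketch below assumes the edges and counts from the running example and uses a hypothetical helper, bar_outline(). It draws to a null PDF device so that it runs anywhere; remove the pdf(NULL) and dev.off() lines to see the plot on screen.

```r
# Corner points of one bar: up the left side, across the top, down the right side
bar_outline <- function(left, right, height) {
  list(x = c(left, left, right, right),
       y = c(0, height, height, 0))
}

edges  <- c(0, 2, 4, 6, 8, 10)  # bin edges from the example
counts <- c(2, 2, 2, 2, 3)      # counts from the example

pdf(NULL)  # null device: nothing is written to disk
plot(1, type = "n", xlim = range(edges), ylim = c(0, max(counts)),
     xlab = "x", ylab = "Frequency", main = "Histogram of x")

for (i in seq_along(counts)) {
  if (counts[i] == 0) next  # skip empty bins
  outline <- bar_outline(edges[i], edges[i + 1], counts[i])
  lines(outline$x, outline$y)
}

# Baseline that closes every bar from below
lines(range(edges), c(0, 0))
dev.off()
```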

Putting everything together, the histogram function looks like this. It relies on the intervals() function from above and on the %>% pipe from the magrittr package.

histo <- function(x, n) {
  
  bins <- intervals(x, n)
  
  y <- vector(mode = "numeric", length = n)
  stringr::str_split(bins, pattern = ",") %>%
      purrr::map(~readr::parse_number(.)) -> ranges
  for(i in seq_along(bins)) {
    
    if(i == 1) {
      
      y[i] <- sum(x < ranges[[i]][2])
      
    }
    
    else if(i == length(bins)) {
      
      y[i] <- sum(x >= ranges[[i]][1])
      
    } else {
      
      y[i] <- sum(x >= ranges[[i]][1] & x < ranges[[i]][2])
      
    }
    
  }
  
  plot(1,
       type = "n",
       xlab = "x",
       ylab = "Frequency",
       xlim = c(min(x), max(x)),
       ylim = c(0, max(y)),
       main = "Histogram of x",
       axes = FALSE,
       xaxt = "n",
       yaxt = "n")
  
  ranges %>%
    purrr::flatten_dbl() %>%
    unique() -> break_points
  
  axis(side = 1, at = round(break_points, 0))
  axis(side = 2)
  
  for(i in 1:n) {
    
    if(y[i] == 0) {
      
      next
      
    }
    
    else {
      
      lines(c(ranges[[i]][1], ranges[[i]][1]), c(0, y[i]))
      lines(c(ranges[[i]][2], ranges[[i]][2]), c(0, y[i]))
      lines(c(ranges[[i]][1], ranges[[i]][2]), c(y[i], y[i]))
      
    }
    
  }
  
  lines(c(ranges[[1]][1], ranges[[n]][2]), c(0, 0))
  
}

Now we can test and compare our histogram function against the hist() function in R.

x <- rnorm(1000)
n <- 15
histo(x, n)

bins <- intervals(x, n)
stringr::str_split(bins, pattern = ",") %>%
      purrr::map(~readr::parse_number(.)) -> ranges
ranges %>%
  purrr::flatten_dbl() %>%
  unique() -> break_points

hist(x, breaks = break_points, right = FALSE)

First, our own histogram:

[Figure: histogram of x drawn with histo()]

Next, we used the built-in function hist() to draw the histogram:

[Figure: histogram of x drawn with hist()]

Both histograms look identical. When we change the number of bins n, the only visible difference is how the x-axis is labeled: our function places the ticks at the rounded break points, whereas hist() uses the default axis ticks.
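Besides comparing the two plots by eye, the bin counts can be compared numerically. The sketch below is one way to do that under a few assumptions: it computes the bin edges directly instead of parsing them from the labels, it counts with findInterval() and tabulate() rather than with the loop inside histo(), and it fixes an arbitrary seed so that the run is repeatable.

```r
set.seed(42)  # arbitrary seed, only to make the sketch repeatable
x <- rnorm(1000)
n <- 15

# n + 1 equally spaced edges spanning the data
edges <- seq(min(x), max(x), length.out = n + 1)
edges[n + 1] <- max(x)  # guard against floating-point drift in the last edge

# Our own counts: right-open bins, last bin closed
own_counts <- tabulate(findInterval(x, edges, rightmost.closed = TRUE),
                       nbins = n)

# Counts from hist() with the same edges, without drawing anything
hist_counts <- hist(x, breaks = edges, right = FALSE, plot = FALSE)$counts

# Reports TRUE if both approaches agree
all.equal(own_counts, hist_counts)
```

Setting plot = FALSE makes hist() return its result object without opening a graphics device, which is convenient for checks like this one.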

I hope you have enjoyed this blog post on how to build a histogram from scratch in R. If you have any questions, let me know in the comments below.
