ggplot2 – Introduction to Histograms

Continuing our tutorial series on plotting graphs in R with ggplot2, today we will discuss histograms. If you want to go through the previous article in this series, refer to ggplot2 – An Introduction to Bar Plots.

In the previous article, we created bar plots for discrete variables such as cut and clarity in the diamonds dataset.

So a question arises here: does the type of variable influence the choice of graph type? The answer is yes, and it makes a lot of sense.

For example, the values of an ordered categorical variable may be the ratings you give in a satisfaction survey, with 0 meaning highly dissatisfied and 5 meaning highly satisfied. In such scenarios it would not be completely correct to create a boxplot of the survey ratings, since a boxplot treats the ratings as if they were continuous measurements.

The following table gives an overview for selecting an appropriate graph type:

| X-axis              | Y-axis              | Common graph name       |
|---------------------|---------------------|-------------------------|
| Nominal variable    | Count or values     | Bar graph               |
| Continuous variable | Count               | Histogram               |
| Continuous variable | Value               | Bar graph               |
| Ordinal variable    | Count or values     | Line graph or bar graph |
| Continuous variable | Continuous variable | Scatter plot            |

A scatter plot is used to show the relationship between two continuous variables.

What we saw in the previous tutorial was the first type: cut and clarity are discrete variables, and we counted their values and represented the counts using bar plots.

In the current tutorial we will create graphs for data where histograms are the most suitable choice.

Histograms are generally used for visualizing the distribution of a continuous variable in the sample dataset. To build one, you divide the whole range of values into equal intervals called ‘bins’ and then count the number of values, called the ‘frequency’, falling into each interval. The bins are equally sized and adjacent to each other on the graph. A rectangle is erected over each bin, and its height represents the frequency.
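The binning step described above can be sketched in base R, without any plotting. This is only an illustration: the prices vector is made-up sample data, and the bin edges (0 to 2000 in steps of 500) are an assumption chosen to keep the example small.

```r
# Hypothetical sample of prices, made up purely for illustration
prices <- c(120, 340, 380, 510, 560, 590, 880, 1250, 1400, 1990)

# Equal-width bin edges: 0, 500, 1000, 1500, 2000
bin_edges <- seq(0, 2000, by = 500)

# Assign each value to a bin; right = FALSE makes each interval [lower, upper)
bins <- cut(prices, breaks = bin_edges, right = FALSE, dig.lab = 4)

# The frequency of each bin is the height of its rectangle in the histogram
frequency <- table(bins)
print(frequency)
```

Each entry of frequency corresponds to one bar; geom_histogram() performs an equivalent count internally before drawing.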

Let’s get started with the diamonds dataset.

Before you start working with ggplot2, you need to load the ggplot2 package into the environment. Following the same naming convention as the previous tutorial, let’s load the diamonds dataset into a variable named data_set.

library(ggplot2)  # load the package; the diamonds dataset ships with it
data_set = diamonds

Let’s have a look at the dataset to find the continuous variables that can be plotted as a histogram.

head(data_set, n = 10)
##    carat       cut color clarity depth table price    x    y    z
## 1   0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
## 2   0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
## 3   0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
## 4   0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
## 5   0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
## 6   0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48
## 7   0.24 Very Good     I    VVS1  62.3    57   336 3.95 3.98 2.47
## 8   0.26 Very Good     H     SI1  61.9    55   337 4.07 4.11 2.53
## 9   0.22      Fair     E     VS2  65.1    61   337 3.87 3.78 2.49
## 10  0.23 Very Good     H     VS1  59.4    61   338 4.00 4.05 2.39

Looking at the variables (features) of the dataset, we can see that there are several continuous variables, for example price, x, y and z.
We will use the price variable to illustrate the histogram.
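Instead of reading the column types off the printed rows, one possible check is to ask R which columns are numeric. This is a small sketch, not part of the original walkthrough; numeric_cols is a name introduced here for illustration.

```r
library(ggplot2)  # provides the diamonds dataset

# TRUE for numeric columns, FALSE for the ordered factors (cut, color, clarity)
is_numeric <- sapply(diamonds, is.numeric)

# Names of the columns that are candidates for a histogram
numeric_cols <- names(diamonds)[is_numeric]
print(numeric_cols)
```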

  • Create a histogram of price for all the diamonds in the dataset.
p = ggplot(data_set, aes(x = price))


The ggplot object p is stored in the global environment. It holds the data and the aesthetic mapping that form the structure of the plot; no layer has been added yet.

q = p + geom_histogram(color = "darkgreen", fill = "red", binwidth = 500) 


A histogram layer is added to the previous structure, with the bars filled in red and outlined in dark green. The bin width is fixed at 500.
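To see what a bin width of 500 does to the data, one option is to inspect the bins that ggplot2 computed by using ggplot_build(). The sketch below rebuilds the same histogram under a new name (hist_plot, introduced here) and looks at the first layer's data; treat it as an exploratory aid rather than a required step.

```r
library(ggplot2)

hist_plot <- ggplot(diamonds, aes(x = price)) +
  geom_histogram(binwidth = 500)

# Data frame of computed bins for the first (and only) layer
bin_data <- ggplot_build(hist_plot)$data[[1]]

# Each row is one bar: its lower edge, upper edge and frequency
head(bin_data[, c("xmin", "xmax", "count")])

# Every diamond lands in exactly one bin
sum(bin_data$count) == nrow(diamonds)
```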

r = q + scale_x_continuous( breaks = seq(0, 20000, 1000))


Here, tick marks are placed on the x-axis at intervals of 1000, starting at 0 and ending at 20000. This changes only the axis labelling; the bin width is still 500.
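The breaks argument simply receives a numeric vector, so it can help to look at what seq() produces on its own. A minimal sketch (axis_breaks is a name introduced here):

```r
# Tick positions from 0 to 20000 in steps of 1000
axis_breaks <- seq(0, 20000, by = 1000)

length(axis_breaks)  # 21 tick positions
range(axis_breaks)   # smallest and largest tick
```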

s = r + theme(axis.text.x = element_text(angle = 90))

Here, the tick labels on the x-axis are rotated by 90 degrees so that they run perpendicular to the axis and do not overlap.

t = s + xlab("Price of diamond") + ylab("Count of prices within a particular interval")


Here, labels are added to both axes.

The histogram we see here is skewed to the right. That means that if we draw a curve along the tops of the bars, its long, thin tail extends toward the right side of the plot.
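A quick numeric check that often accompanies a right-skewed histogram is to compare the mean with the median: a long right tail pulls the mean above the median. The sketch below is a supplementary check, not a formal skewness test.

```r
library(ggplot2)  # provides the diamonds dataset

mean_price   <- mean(diamonds$price)
median_price <- median(diamonds$price)

# For a right-skewed distribution the mean typically exceeds the median
mean_price > median_price
```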

  • Limit the previous plot to the price range 0 to 2000.
u = t + coord_cartesian(xlim = c(0, 2000))
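coord_cartesian() zooms into the plot without discarding data, so the bins are still computed from every diamond. Setting limits on the x scale instead would drop the out-of-range rows before binning. The sketch below checks the first point with ggplot_build(); zoomed and zoomed_counts are names introduced here.

```r
library(ggplot2)

zoomed <- ggplot(diamonds, aes(x = price)) +
  geom_histogram(binwidth = 500) +
  coord_cartesian(xlim = c(0, 2000))  # zoom only; no rows are removed

# Bin counts as computed by the histogram layer
zoomed_counts <- ggplot_build(zoomed)$data[[1]]$count

# All diamonds are still counted, even those priced above 2000
sum(zoomed_counts) == nrow(diamonds)
```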


To make this graph look better, we adjust it for the range 0 to 2000: the axis ticks are placed every 100 and the bin width is reduced to 25.

x = ggplot(data_set, aes(x = price)) + 
  geom_histogram(color = "darkgreen", fill = "red", binwidth = 25) + 
  scale_x_continuous( breaks = seq(0, 2000, 100)) + 
  theme(axis.text.x = element_text(angle = 90)) + 
  coord_cartesian(xlim = c(0, 2000)) +
  xlab("Price") + ylab("Count")

This graph gives us a more detailed, drilled-down view of the prices between 0 and 2000.

  • Break down the histograms of price by cut.
x + facet_grid(cut~.)

In the same plot, five histograms are created, one for each type of cut.

x + facet_grid(cut~., scales = 'free')

In this plot, each panel is given its own scale for better representation. You will notice that the count axis differs from one panel to the next.
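One reason free scales help here is that the cut categories contain very different numbers of diamonds, so with a shared count axis the smaller groups are squashed. A quick way to see this, as a supplementary sketch (cut_counts is a name introduced here):

```r
library(ggplot2)  # provides the diamonds dataset

# Number of diamonds in each cut category
cut_counts <- table(diamonds$cut)
print(cut_counts)

# The largest group is many times the size of the smallest
max(cut_counts) / min(cut_counts)
```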

Keshav Kumar

Currently working as a Datastage Developer at Tata Consultancy Services Limited, Keshav wants to become a Data Analyst in the near future. He is teaching himself the tools and techniques of analytics from online resources. Keshav loves to share his knowledge and skills with other people. He offered internships in analytics to 7 students while studying at Guru Nanak Dev Engineering College, Ludhiana. He built a student placement prediction system for his college and published the white paper at an IEEE conference. He also delivered a talk to his teammates at TCS about the use of R in the insurance domain.
