# ggplot2 - Introduction to Histograms

Continuing our tutorial on **graph plotting in R using ggplot**, today we will discuss histograms. If you want to go through the previous article in this series, refer to "ggplot2 – An Introduction to Bar Plots".

In the previous article, we created bar plots for discrete variables like **cut** and **clarity** in the diamonds dataset.

So a question arises here: does the type of variable influence the choice of graph type? The answer is yes, and it makes a lot of sense.

For example, the values of an ordered categorical variable may be the ratings you provide in a satisfaction survey, with 0 being highly dissatisfied and 5 being highly satisfied. In such scenarios, a boxplot of the survey ratings would not be entirely appropriate, because the ratings are ordered categories rather than continuous measurements.

The following table gives an overview for selecting an appropriate graph type:

| X-axis | Y-axis | Common graph name |
| --- | --- | --- |
| Nominal variable | Count or values | Bar graph |
| Continuous variable | Count | Histogram |
| Continuous variable | Value | Bar graph |
| Ordinal variable | Count or values | Line graph or bar graph |
| Continuous variable | Continuous variable | Scatter plot |

A scatter plot is used to show the relationship between two continuous variables.

What we saw in the previous tutorial was the first type, where cut and clarity were discrete variables and we represented their counts using bar plots.

In the current tutorial, we will create graphs for cases where histograms are the most suitable choice.

Histograms are generally used for visualizing the distribution of a continuous variable in a sample dataset. To build one, you divide the whole range of values into equal intervals called ‘**bins**’ and then count the number of values, called the ‘**frequency**’, that fall into each interval. The bins are equally sized and adjacent to each other on the graph. A rectangle is drawn over each bin, and its height represents the frequency.
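The binning idea can be sketched in base R before any plotting. This is only a minimal illustration on a made-up vector `values` (hypothetical data, not taken from the diamonds dataset), using `cut()` to assign each value to an equal-width bin and `table()` to count the frequencies.

```r
# Hypothetical sample values, used only to illustrate binning
values <- c(1, 3, 4, 7, 8, 8, 12, 15, 18, 19)

# Equal-width bin edges covering the whole range: 0-5, 5-10, 10-15, 15-20
bin_edges <- seq(0, 20, by = 5)

# Assign each value to a bin; right = FALSE makes each interval [a, b)
bins <- cut(values, breaks = bin_edges, right = FALSE)

# Frequency of each bin: this is the height of the corresponding bar
frequency <- table(bins)
frequency
```

With this sample, the four bins hold 3, 3, 1 and 3 values, so a histogram of `values` would show four adjacent bars of those heights.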

Let’s get started with the diamonds dataset.

Before you start working with ggplot2, you need to load the ggplot2 package into the environment. Following the same naming convention as the previous tutorial, let’s load the diamonds dataset into a variable named data_set.

```
library(ggplot2)
```

```
## Stackoverflow is a great place to get help:
## http://stackoverflow.com/tags/ggplot2.
```

```
data_set = diamonds
```

Let’s have a look at the dataset to find the continuous variables that can be plotted as a histogram.

```
head(data_set, n = 10)
```

```
## carat cut color clarity depth table price x y z
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
## 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
## 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
## 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
## 10 0.23 Very Good H VS1 59.4 61 338 4.00 4.05 2.39
```

Looking at the variables (features) of the dataset, we can see that there are many continuous variables, for example price, x, y and z.
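Rather than reading the column types off the printed rows, one way to list the candidates is to ask R which columns are stored as numbers. This is a rough sketch: `numeric_columns` is a name introduced here for illustration, and a numeric column is only a candidate for a histogram, not a guarantee that the variable is truly continuous.

```r
library(ggplot2)  # provides the diamonds dataset

# TRUE for columns stored as numbers, FALSE for factors such as cut
numeric_columns <- sapply(diamonds, is.numeric)

# Names of the columns that could be plotted as a histogram
names(diamonds)[numeric_columns]
```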

We will use the price feature of the diamonds to illustrate the histogram.

- Create a histogram of price for all the diamonds in the dataset

```
p = ggplot(data_set, aes(x = price))
p
```

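A ggplot object p is stored in the global environment. It creates the structure for the plot: the data and the x-axis mapping are set, but no layer has been added yet, so printing p shows only an empty panel with the price axis.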

```
q = p + geom_histogram(color = "darkgreen", fill = "red", binwidth = 500)
q
```

A histogram is plotted over the previous structure, with its bins filled in red and their borders drawn in dark green. The bin width is fixed at 500.
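To see what the histogram layer actually computed, the bin counts can be pulled out of the plot with `layer_data()` from ggplot2. The sketch below rebuilds the same histogram so that it runs on its own; `hist_plot` and `bin_table` are names introduced here for illustration.

```r
library(ggplot2)

# Same histogram as above, rebuilt so the snippet is self-contained
hist_plot <- ggplot(diamonds, aes(x = price)) +
  geom_histogram(color = "darkgreen", fill = "red", binwidth = 500)

# layer_data() returns the data computed for a layer:
# one row per bin, with its count and its left and right edges
bin_table <- layer_data(hist_plot, 1)
head(bin_table[, c("xmin", "xmax", "count")])
```

Each row is one bar of the histogram, and the counts over all rows add up to the number of diamonds in the dataset.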

```
r = q + scale_x_continuous( breaks = seq(0, 20000, 1000))
r
```

Here, tick marks are placed on the x-axis at intervals of 1000, starting from 0 and ending at 20000. This changes only the axis labels, not the bin width.
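The `breaks` argument simply receives a numeric vector, so it can help to look at that vector next to the range of the data. A small sketch, with `tick_positions` as an illustrative name:

```r
library(ggplot2)  # provides the diamonds dataset

# The vector passed to breaks: 0, 1000, 2000, ..., 20000
tick_positions <- seq(0, 20000, by = 1000)
length(tick_positions)

# Smallest and largest price, to confirm the ticks cover the data
range(diamonds$price)
```

If the data extended beyond 20000, the upper end of `seq()` would need to be raised for the far right of the axis to be labelled.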

```
s = r + theme(axis.text.x = element_text(angle = 90))
s
```

Here, the tick labels on the x-axis are rotated by 90 degrees so that they are perpendicular to the axis and do not overlap.

```
t = s + xlab("Price of diamond") + ylab("Count of prices within a particular interval")
t
```

Here, labels are added to both axes.

The histogram we see here is skewed to the right. That means if we draw a curve following the tops of the bins, the curve will have a long, thin tail towards the right side of the plot.
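The skew can also be checked numerically. As a rough rule of thumb rather than a formal test, the long right tail of a right-skewed distribution pulls the mean above the median. `price_mean` and `price_median` are names introduced here for illustration.

```r
library(ggplot2)  # provides the diamonds dataset

price_mean <- mean(diamonds$price)
price_median <- median(diamonds$price)

# TRUE when the mean lies above the median, as expected for a right skew
price_mean > price_median
```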

- Try to limit the previous plot to the range of 0 to 2000 only.

```
# Zoom the x-axis to 0-2000 without dropping any data
u = t + coord_cartesian(xlim = c(0, 2000))
u
```
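It is worth knowing why `coord_cartesian()` is used here instead of setting limits on the scale. `coord_cartesian()` only zooms the view, while `xlim()` discards values outside the limits before the bins are computed. The sketch below compares the two on a plain histogram; `base_hist`, `zoomed` and `clipped` are illustrative names.

```r
library(ggplot2)

base_hist <- ggplot(diamonds, aes(x = price)) +
  geom_histogram(binwidth = 500)

# Zoom only: every diamond is still binned, the view is just cropped
zoomed <- base_hist + coord_cartesian(xlim = c(0, 2000))

# Scale limits: prices outside 0-2000 are dropped before binning
# (ggplot2 emits a warning about the removed rows)
clipped <- base_hist + xlim(0, 2000)

# Total count across all bins for each approach
sum(layer_data(zoomed)$count)
sum(layer_data(clipped)$count)
```

The first total equals the number of rows in the dataset, while the second is smaller because the more expensive diamonds were removed. For a histogram this matters because bins near the edge of the range can change or disappear when data is dropped.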

To make this graph look better, we will change the axis intervals within the range of 0 to 2000, placing a tick mark every 100. We also reduce the bin width to 25 so that the bars show more detail at this zoom level.

```
x = ggplot(data_set, aes(x = price)) +
  geom_histogram(color = "darkgreen", fill = "red", binwidth = 25) +  # narrower bins
  scale_x_continuous(breaks = seq(0, 2000, 100)) +                    # tick every 100
  theme(axis.text.x = element_text(angle = 90)) +
  coord_cartesian(xlim = c(0, 2000)) +                                # zoom, keep all data
  xlab("Price") + ylab("Count")
x
```

This graph gives us a more detailed, drilled-down view of the prices in this range.

- Break down the histogram of price by cut.

```
x + facet_grid(cut~.)
```

In the same plot, five histograms are created, one for each type of cut.

```
x + facet_grid(cut ~ ., scales = "free")
```

In this plot, we have given the individual graphs free scales for a better representation. You will notice a different count scale for each graph.
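The reason free scales help becomes clearer when you look at how many diamonds fall into each cut. A quick sketch, with `cut_counts` as an illustrative name:

```r
library(ggplot2)  # provides the diamonds dataset

# Number of diamonds in each cut category
cut_counts <- table(diamonds$cut)
cut_counts

# Ratio between the most and the least common cut
max(cut_counts) / min(cut_counts)
```

When the group sizes differ this much, a shared count axis flattens the bars of the smaller groups, so letting each panel choose its own y-axis makes every distribution readable. The trade-off is that bar heights can no longer be compared directly across panels.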