# Aggregate Functions in R

Aggregation of data means grouping your data and applying some function over groups.

Let say we have following data from a retail store:

```
retail_data<-data.frame(CID=c(1,1,1,2,2,3,3,3),mon=c("jan","jan","feb","jan","feb","jan","jan","feb"),spend=c(100,100,200,100,300,50,100,250))
retail_data
```

```
## CID mon spend
## 1 1 jan 100
## 2 1 jan 100
## 3 1 feb 200
## 4 2 jan 100
## 5 2 feb 300
## 6 3 jan 50
## 7 3 jan 100
## 8 3 feb 250
```

Where CID is the customer ID of a particular customer, mon represents the month in which the transaction happenned and spend is the amount spent by the customer at a particular retail store.

Now let say, we want to know the amount spent by each customer. Then it will be something like

```
## CID total_spend
## 1 1 400
## 2 2 400
## 3 3 400
```

If you are familiar with SQL then you can get the result using following query

```
select CID, sum(spend) as total_spend from reatail_data group by CID
```

The same can be done in R using aggregate() function:

```
agg_data<-aggregate(retail_data$spend,by=list(retail_data$CID),FUN=sum)
colnames(agg_data)<-c("CID","Total_Spend")
agg_data
```

```
## CID Total_Spend
## 1 1 400
## 2 2 400
## 3 3 400
```

In the function call above, we told R that group “spend” variable by “CID” and apply “sum” function on it.

Now, let say we want to calculate the total spend by each customer each month. This can be done as follows:

```
agg_data1<-aggregate(retail_data$spend,by=list(retail_data$CID,retail_data$mon),FUN=sum)
colnames(agg_data1)<-c("CID","Month","Total_Spend")
agg_data1
```

```
## CID Month Total_Spend
## 1 1 feb 200
## 2 2 feb 300
## 3 3 feb 250
## 4 1 jan 200
## 5 2 jan 100
## 6 3 jan 150
```

- FUN Argument – FUN can take any function which can be used in aggregation, in the examples above, we have used only “sum” but you could also use other functions like “mean”,“length”,“sd” etc.

### Using User Defined Functions in Aggregate

FUN argument in aggregate is not only restricted to predefined R functions. You can also use the functions defined by you.

For example, consider the following dataset mtcars

```
data(mtcars)
head(mtcars)
```

```
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
```

Let’s assume, we want to know how many distinct type of ‘cyl’ are presents for each of the ‘gear’ observation.

There is no predefined function in R, which can calculate the distinct number of elements.

Let’s create our own function to calculate the same.

```
myfunc<-function(x)
{
return(length(unique(x)))
}
```

Let’s see how does this function works.

Assume a vector v1 which contain certain numbers:

```
v1<-c(1,2,3,1,2,3,4,5,6,4)
v1
```

```
## [1] 1 2 3 1 2 3 4 5 6 4
```

And we are interested in calculating the distinct number of elements in v1

```
myfunc(v1)
```

```
## [1] 6
```

Let’s use this function to solve the problem of mtcars

```
agg_mtcars<-aggregate(mtcars$cyl,by=list(mtcars$gear),myfunc)
colnames(agg_mtcars)<-c("gear","no_of_cyl_entries")
agg_mtcars
```

```
## gear no_of_cyl_entries
## 1 3 3
## 2 4 2
## 3 5 3
```

Instead of creating a function, we can define the function within the aggregate() itself

```
agg_mtcars2<-aggregate(mtcars$cyl,by=list(mtcars$gear),function(x) length(unique(x)))
colnames(agg_mtcars2)<-c("gear","no_of_cyl_entries")
agg_mtcars2
```

`## gear no_of_cyl_entries `

`## 1 3 3 `

`## 2 4 2 `

`## 3 5 3`

### Formula method to call aggregate()

There is an another way to call aggregate function which is known as Formula Method.

It is easier to write and more intuitive compared to the previously discussed method.

Let’s recreate the last example using formula method

```
agg_mtcars3<-aggregate(cyl~gear,mtcars,function(x) length(unique(x)))
colnames(agg_mtcars3)<-c("gear","no_of_cyl_entries")
agg_mtcars3
```

```
## gear no_of_cyl_entries
## 1 3 3
## 2 4 2
## 3 5 3
```

variable before “~” sign are the ones on which you want to apply your function and variable after “~” sign are the grouping variables.

Let’s re create the spend by CID & mon using retail_data

```
data_out1<-aggregate(spend~CID+mon, retail_data, FUN=sum)
data_out1
```

```
## CID mon spend
## 1 1 feb 200
## 2 2 feb 300
## 3 3 feb 250
## 4 1 jan 200
## 5 2 jan 100
## 6 3 jan 150
```

### Using aggregate on multiple columns simultaneously

Let say we want to calculate average “hp” and average “drat” for a given “cyl” number.

One solution is to calculate averages separately and merge them –

```
#Calculating average hp for each cylinder
mtcars_hp<-aggregate(hp~cyl,mtcars,mean)
colnames(mtcars_hp) <- c("cyl","average_hp")
#Calculating average drat for each cylinder
mtcars_drat<-aggregate(drat~cyl,mtcars,mean)
colnames(mtcars_drat) <- c("cyl","average_drat")
#Merging average hp and average drat
mtcars_hp_drat<-merge(mtcars_hp,mtcars_drat,all=T)
mtcars_hp_drat
```

```
## cyl average_hp average_drat
## 1 4 82.63636 4.070909
## 2 6 122.28571 3.585714
## 3 8 209.21429 3.229286
```

Another solution can be to apply aggregate function on hp & drat together –

```
mtcars_hp_drat1<-aggregate(cbind(hp,drat)~cyl,mtcars,mean)
colnames(mtcars_hp_drat1)<-c("cyl","average_hp","average_drat")
mtcars_hp_drat1
```

```
## cyl average_hp average_drat
## 1 4 82.63636 4.070909
## 2 6 122.28571 3.585714
## 3 8 209.21429 3.229286
```

### Using multiple functions in aggregate

If you want to use multiple functions in a single aggregate call, it is possible using with() function in r.

Let say we want to calculate Mean and Standard Deviation of hp for all the cyl types

```
data_out2<-with(mtcars, aggregate(hp, list(cyl), function(x) { c(MEAN=mean(x) , SD=sd(x) )}))
data_out2
```

```
## Group.1 x.MEAN x.SD
## 1 4 82.63636 20.93453
## 2 6 122.28571 24.26049
## 3 8 209.21429 50.97689
```

```
str(data_out2)
```

```
## 'data.frame': 3 obs. of 2 variables:
## $ Group.1: num 4 6 8
## $ x : num [1:3, 1:2] 82.6 122.3 209.2 20.9 24.3 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : NULL
## .. ..$ : chr "MEAN" "SD"
```

This is not something we wanted, let’s modify the structue of output data

```
data_out3<-data.frame(cyl=data_out2[,1],mean_hp=data_out2$x[,1],sd_hp=data_out2$x[,2])
data_out3
```

```
## cyl mean_hp sd_hp
## 1 4 82.63636 20.93453
## 2 6 122.28571 24.26049
## 3 8 209.21429 50.97689
```

```
str(data_out3)
```

```
## 'data.frame': 3 obs. of 3 variables:
## $ cyl : num 4 6 8
## $ mean_hp: num 82.6 122.3 209.2
## $ sd_hp : num 20.9 24.3 51
```

#### analyticsfreak

#### Latest posts by analyticsfreak (see all)

- Few interesting questions related to correlation - July 22, 2016
- How to make a reproducible example to share? - July 21, 2016
- Few random questions on Random Forest - July 20, 2016