# Few random questions on Random Forest

Random forest is one of the stepping stones for anyone entering the field of machine learning. In this article we will cover a few frequently asked questions about Random Forest.

#### How to build random forests in R with missing (NA) values?

In such scenarios you have two options:

a. If only a small number of cases have missing values, you can set `na.action = na.omit` to drop those cases.

b. Use the `rfImpute()` function in the randomForest package to impute the missing values.
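A minimal sketch of both options, using a made-up data frame `df` with a factor response `y` (the data here is purely illustrative):

```
library(randomForest)

# Toy data with a few NAs injected into one predictor
set.seed(1)
df <- data.frame(x1 = runif(100), x2 = runif(100))
df$y <- factor(ifelse(df$x1 + df$x2 > 1, "a", "b"))
df$x1[sample(100, 5)] <- NA

# Option a: drop incomplete cases
rf1 <- randomForest(y ~ ., data = df, na.action = na.omit)

# Option b: impute the NAs with rfImpute(), then fit on the completed data
df.imputed <- rfImpute(y ~ ., data = df)
rf2 <- randomForest(y ~ ., data = df.imputed)
```

Note that `rfImpute()` needs a complete response; only the predictors may contain NAs.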

#### What is out of bag error in Random Forests?

In a random forest, each tree is trained on a bootstrap sample containing roughly 2/3 of the training observations. As the forest is built, each tree is tested on the remaining ~1/3 of observations that were left out of its bootstrap sample.

The average prediction error on these left-out ("out-of-bag") observations is known as the out-of-bag (OOB) error.
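You can read the OOB error straight off a fitted forest; a small sketch on the built-in `iris` data:

```
library(randomForest)

set.seed(1)
rf <- randomForest(Species ~ ., data = iris, ntree = 500)

# err.rate has one row per tree; the "OOB" column of the last row
# is the overall OOB error estimate for the full forest
oob <- rf$err.rate[rf$ntree, "OOB"]
oob

# OOB predictions for each observation are also available directly
head(rf$predicted)
```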

#### My randomForest is taking too much time to build, what should I do?

How many times have you faced a slow randomForest build? Here are a few of the factors you might consider looking into:

- **mtry** – the number of variables tried at each split
- **ntree** – the number of trees grown
- **number of features or variables** in the data – reported to have a quadratic impact on speed
- **number of observations/rows**
- **ncores** – as you can guess, more cores are faster, but make sure parallel processing is actually enabled
- some performance boost comes from setting `importance = FALSE` and `proximity = FALSE` (don't compute the proximity matrix)
- **never ever use the insane default** `nodesize = 1` – it's a killer. If your data has 4.2e+5 rows and you don't want any node smaller than ~0.01% of the data, try `nodesize = 42`. (First try `nodesize = 4200` (1%), see how fast it is, then rerun, adjusting nodesize down; empirically determine a good nodesize for your dataset.) Runtime grows roughly as 2^D_max, where D_max is the maximum tree depth, and deeper trees come from smaller nodesize
- optionally you can also speed things up by subsampling; see the `strata` and `sampsize` arguments
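A quick way to see the `nodesize` effect for yourself is to time two fits that differ only in that parameter; a sketch on synthetic data (the sizes here are arbitrary):

```
library(randomForest)

set.seed(1)
n <- 20000
X <- data.frame(x1 = runif(n), x2 = runif(n), x3 = runif(n))
y <- factor(ifelse(X$x1 + rnorm(n, sd = 0.3) > 0.5, "a", "b"))

# Default nodesize = 1 grows trees all the way down
t1 <- system.time(randomForest(X, y, ntree = 25, nodesize = 1))

# Larger terminal nodes -> shallower trees -> faster build
t2 <- system.time(randomForest(X, y, ntree = 25, nodesize = 200))

t1["elapsed"]
t2["elapsed"]
```

The exact speedup depends on the data, but the larger nodesize should be noticeably faster.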

#### Can I run randomForest in parallel?

Yes. You can use the foreach and doSNOW packages to handle the parallel processing for you:

```
library("foreach")
library("doSNOW")
registerDoSNOW(makeCluster(4, type="SOCK"))
x <- matrix(runif(500), 100)
y <- gl(2, 50)
rf <- foreach(ntree = rep(250, 4), .combine = combine,
              .packages = "randomForest") %dopar%
  randomForest(x, y, ntree = ntree)
rf
```

```
Call:
randomForest(x = x, y = y, ntree = ntree)
Type of random forest: classification
Number of trees: 1000
```

#### Is there any predefined formula for selecting parameters like mtry, ntree etc.?

No, there is not.

For ntree, you could start with a smaller number like 501 and keep increasing it until the OOB error stabilizes; if your variable set is huge, consider starting with a higher number like 1501. (An odd number is preferable, to break ties in classification votes.)

Next, you can use the tuneRF function in the randomForest package to optimize the mtry parameter.
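A minimal `tuneRF` sketch on the built-in `iris` data (the `stepFactor` and `improve` values are just reasonable starting points):

```
library(randomForest)

set.seed(1)
# tuneRF starts from the default mtry and multiplies/divides it by
# stepFactor as long as the OOB error improves by at least `improve`
tuned <- tuneRF(iris[, -5], iris$Species,
                stepFactor = 1.5, improve = 0.01,
                ntreeTry = 501, trace = FALSE, plot = FALSE)
tuned  # a matrix of the mtry values tried and their OOB errors
```

Pick the mtry with the lowest OOB error, or pass `doBest = TRUE` to have `tuneRF` return the fitted forest directly.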

#### My dataset contains n variables, so why shouldn't I set mtry=n? Wouldn't it be genius to use all n variables?

randomForest achieves randomness in two ways: through bootstrap samples of the observations, and through selecting a random subset of variables at each split.

This stochasticity improves the out-of-sample fit. Selecting m variables at random out of n makes each individual tree a poorer classifier than it would be with all variables available, but it also de-correlates the trees; averaging many de-correlated trees is how the random forest achieves lower error than a single tree or plain bagging.

#### What is proximity matrix in randomForest?

Proximity between two data points is the proportion of trees in which they end up in the same terminal (leaf) node.
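To get the proximity matrix you have to ask for it at fit time; a short sketch on `iris`:

```
library(randomForest)

set.seed(1)
rf <- randomForest(Species ~ ., data = iris, proximity = TRUE)

# An N x N matrix: entry [i, j] is the fraction of trees in which
# observations i and j land in the same terminal node
dim(rf$proximity)
rf$proximity[1:3, 1:3]  # the diagonal is 1: a point always shares a leaf with itself
```

Remember that this matrix is N x N, which is exactly why `proximity = FALSE` is one of the speed/memory tips above for large datasets.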

#### analyticsfreak

Few random questions on Random Forest - July 20, 2016