# Is Random Forest the best substitute for OLS Linear Regression?

Random forest is probably one of the most widely used machine learning techniques, and people often apply it to regression problems as well. But does it provide a better solution than linear regression in every case?

Let’s look at an example:

```
# Simulate a purely linear relationship and fit both models
set.seed(1)
beta <- runif(5)
x <- matrix(rnorm(500), ncol = 5)   # 100 observations, 5 predictors
y <- drop(x %*% beta)               # exact linear response, no noise
dat <- data.frame(y = y, x1 = x[, 1], x2 = x[, 2], x3 = x[, 3],
                  x4 = x[, 4], x5 = x[, 5])

library(randomForest)
model1 <- lm(y ~ ., data = dat)
model2 <- randomForest(y ~ ., data = dat)

pred1 <- predict(model1, dat)
pred2 <- predict(model2, dat)

plot(y, pred1)                      # OLS predictions in black
points(y, pred2, col = "blue")      # random forest predictions in blue
```

In the resulting plot, the points in black are the output of linear regression, while the points in blue are the output of the random forest.

In this example, OLS has proven to be better than Random Forest.
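To put a number on the difference, here is a self-contained sketch that refits both models and compares their in-sample mean squared errors (the seed and variable names are illustrative, not from the original post):

```r
library(randomForest)

# Same setup as above: y is an exact linear function of x, with no noise
set.seed(1)
beta <- runif(5)
x <- matrix(rnorm(500), ncol = 5)
y <- drop(x %*% beta)
dat <- data.frame(y = y, x)

mse_lm <- mean((y - predict(lm(y ~ ., data = dat), dat))^2)
mse_rf <- mean((y - predict(randomForest(y ~ ., data = dat), dat))^2)

c(OLS = mse_lm, RF = mse_rf)  # OLS is essentially zero; RF is clearly larger
```

Since the true relationship is exactly linear, OLS recovers it up to machine precision, while the forest's piecewise-constant trees cannot.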

But is this down to chance, or is there some theory behind it? Let’s have a look.

A random forest is essentially a collection of windows, within each of which the average is assumed to represent the system.

Let’s say you have a two-leaf CART tree. Your data will be split into two regions, and the (constant) output of each region will be its mean.

Now let’s do it 1000 times with random subsets of the data. You will still have discontinuous regions whose outputs are averages. The winner in an RF is the most frequent outcome, which only “fuzzies” the border between regions.

Example of the piecewise-constant output of a CART tree:

Let us say, for instance, that our function is y = 0.5*x + 2, a straight line.

We can model it with a single two-leaf regression tree. The first step is to identify the best split point, then split the data there.

The mean of the output variable within each leaf is assigned as that leaf’s prediction.

We can increase the number of leaves and obtain a better (though still piecewise-constant) approximation.
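A minimal sketch of this staircase behaviour, assuming the rpart package (which is not used elsewhere in this post; any CART implementation would do):

```r
library(rpart)

# The target line from above: y = 0.5*x + 2
x <- seq(0, 10, by = 0.1)
y <- 0.5 * x + 2
dat <- data.frame(x, y)

# A stump: one split, two leaves; each leaf predicts its own mean
stump <- rpart(y ~ x, data = dat,
               control = rpart.control(maxdepth = 1, minsplit = 2, cp = 0))
unique(predict(stump))   # exactly two constant levels, one per leaf

# More leaves give a finer staircase approximation of the line
deep <- rpart(y ~ x, data = dat,
              control = rpart.control(maxdepth = 5, minsplit = 2, cp = 0))
plot(x, y, type = "l")
points(x, predict(deep), col = "blue", pch = 20)
```

The deep tree's predictions hug the line, but always as a staircase of flat steps, never as a slope.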

### Why CART forests?

You can see that, in the limit of infinitely many leaves, the CART tree becomes an acceptable approximator.

The problem is that the real world is noisy. We like to think in means, but the world has both a central tendency (the mean) and a tendency to vary (the standard deviation). There is noise.

The very thing that gives a CART tree its great strength, its ability to handle discontinuity, also makes it vulnerable to modeling noise as if it were signal.

So Leo Breiman made a simple but powerful proposition: use ensemble methods to make classification and regression trees robust. He takes random subsets (a cousin of bootstrap resampling) and uses them to train a forest of CART trees. When you ask a question of the forest, the whole forest speaks, and the most common answer is taken as the output. If you are dealing with numeric data, the expectation (the average of the trees’ outputs) is taken instead.

Now think about modeling the step function above with a random forest. Each tree sees a random subset of the data, which means the location of the “best” split point will vary from tree to tree. If you plot the output of the random forest, then as you approach the discontinuity, first a few trees indicate a jump, then many. The mean value in that region therefore traverses a smooth, sigmoid-like path: bootstrapping acts like convolving with a Gaussian, and a Gaussian blur applied to a step function yields a sigmoid.
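You can see this blurring directly by fitting a forest to a noisy step function (a self-contained sketch; the data and seed are made up for illustration):

```r
library(randomForest)

set.seed(42)
x <- runif(300, -1, 1)
y <- ifelse(x > 0, 1, 0) + rnorm(300, sd = 0.1)  # noisy step at x = 0
rf <- randomForest(y ~ x, data = data.frame(x, y))

grid <- data.frame(x = seq(-1, 1, length.out = 200))
p <- predict(rf, grid)
plot(grid$x, p, type = "l", xlab = "x", ylab = "forest prediction")
# Near x = 0 the averaged trees trace a smooth, sigmoid-like transition
# rather than a sharp jump, because each tree places its split at a
# slightly different point.
```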

### Bottom lines:

You need a lot of branches per tree to get a good approximation to even a simple linear function.

There are many “dials” that you could change to impact the answer, and it is unlikely that you have set them all to the proper values.

The problem is that bagged estimators like random forests, which are built from bootstrap samples of your data set, tend to perform badly at the extremes. Because there is not much data at the extremes, predictions there tend to get smoothed out.

In more detail, recall that a random forest for regression averages the predictions of a large number of trees. If you have a single point far from the others, many of the trees will not see it, and those trees will essentially be making an out-of-sample prediction, which might not be very good. These out-of-sample predictions tend to pull the prediction for that point towards the overall mean.

If you use a single decision tree, you won’t have the same problem with extreme values, but the fitted regression won’t be very linear either.
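This shrinkage at the extremes is easy to demonstrate on synthetic data (a sketch; the slope, noise level, and query points are illustrative assumptions):

```r
library(randomForest)

set.seed(7)
x <- rnorm(200)
y <- 2 * x + rnorm(200, sd = 0.2)   # linear signal with mild noise
dat <- data.frame(x, y)

fit_lm <- lm(y ~ x, data = dat)
fit_rf <- randomForest(y ~ x, data = dat)

extreme <- data.frame(x = c(-4, 4))  # well outside most of the training data
predict(fit_lm, extreme)  # approximately -8 and 8, following the line
predict(fit_rf, extreme)  # pulled back towards the range seen in training
```

The forest cannot predict beyond the range of responses in its leaves, so its output flattens out exactly where the linear model keeps extrapolating.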

#### analyticsfreak
