# Missing values and type conversion in R

This article covers two topics in R: missing values and type conversion. Let's start with missing values and how to deal with them.

## Missing values in R

Each programming language has its own way to represent missing values. R represents them with the identifier 'NA'.

Wherever a value is missing from the data, R puts an NA in its place.

Let's create a vector which contains the integers from 1 to 5.

```
> v1<-seq(1,5)
> v1
```

```
## [1] 1 2 3 4 5
```

```
> class(v1)
```

```
## [1] "integer"
```

Let's introduce a missing value at the 3rd position.

```
> v1[3]<-NA
> v1
```

```
## [1] 1 2 NA 4 5
```

```
> class(v1)
```

```
## [1] "integer"
```

You see, the class of vector v1 is still integer. NAs, or missing values, do not change the class of your object.
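
The class survives because NA itself has a type. The short sketch below (the variable names are made up for illustration) shows that a bare NA is logical, that R also provides typed constants such as NA_integer_, and that NA takes on the type of the vector it is placed in.

```r
# A bare NA is a logical constant.
na_class <- class(NA)

# Typed missing-value constants exist for the other atomic types.
typed <- c(
  integer   = class(NA_integer_),
  numeric   = class(NA_real_),
  character = class(NA_character_)
)

# Inside a vector, NA is coerced to the type of its neighbours.
ints  <- c(1L, NA, 3L)
chars <- c("a", NA, "c")

print(na_class)       # "logical"
print(typed)
print(class(ints))    # "integer"
print(class(chars))   # "character"
```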

Now let's see the summary of v1.

```
> summary(v1)
```

```
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.00 1.75 3.00 3.00 4.25 5.00 1
```

What do you think the following line of code will return: 3, 2.4, or something else?

```
> mean(v1)
```

Here you go:

```
## [1] NA
```

Because vector v1 contains a missing value, R cannot know what the mean is: the result depends on a value that is unknown, so R returns NA.

To ignore the NAs and compute the result from the remaining values, we have to add 'na.rm=TRUE' to the call. Let's try that with 'mean(v1,na.rm=TRUE)':

```
## [1] 3
```

By specifying na.rm=TRUE we have told R to ignore the NAs and compute the result from the other observations.

Many summary functions in R, such as sum(), mean(), median(), min(), max() and sd(), accept 'na.rm=TRUE'. Prefer spelling out TRUE: T is an ordinary variable that can be overwritten, while TRUE is a reserved word.

Remember that na.rm=TRUE doesn't delete missing values from your data; it just ignores the NAs while calculating.
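
As a small sketch of both points, the snippet below uses a hypothetical vector x (not one of the vectors from this article) to show na.rm=TRUE with a few other functions, and checks that x still holds its NA afterwards.

```r
# Hypothetical vector with one missing value.
x <- c(1L, 2L, NA, 4L, 5L)

total_with_na <- sum(x)                    # NA: the total is unknown
total         <- sum(x, na.rm = TRUE)      # 12
middle        <- median(x, na.rm = TRUE)   # 3
largest       <- max(x, na.rm = TRUE)      # 5

# x itself is untouched: the NA is still at position 3.
still_missing <- is.na(x[3])

print(c(total = total, middle = middle, largest = largest))
print(still_missing)                       # TRUE
```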

## Checking for missing values

Let's see which values are missing in v1. A first attempt might be a comparison with NA:

```
> v1==NA
```

```
## [1] NA NA NA NA NA
```

So NA cannot be used in a comparison with '=='. A missing value is unknown, so R cannot say whether it equals anything, and every comparison with NA returns NA.
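
The same rule applies to other operators. The sketch below (variable names are illustrative only) shows a few comparisons, plus the logical operators, which do return a definite value when the answer does not depend on the unknown one.

```r
# Comparisons with an unknown value are themselves unknown.
eq_number <- NA == 1     # NA
eq_na     <- NA == NA    # NA: two unknown values may or may not be equal
greater   <- NA > 0      # NA

# Logical operators can still decide when the NA cannot change the answer.
and_false <- NA & FALSE  # FALSE, whatever the NA stands for
or_true   <- NA | TRUE   # TRUE, whatever the NA stands for
and_true  <- NA & TRUE   # NA: the answer depends on the unknown value

print(c(eq_number, eq_na, greater))
print(c(and_false, or_true, and_true))
```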

Using the is.na() function, you can identify whether a value is missing or not, e.g.,

```
> is.na(v1)
```

```
## [1] FALSE FALSE TRUE FALSE FALSE
```

```
> which(is.na(v1))
```

```
## [1] 3
```

So we have a missing value at index 3.

is.na() is a very useful function and works with a variety of objects such as vectors, arrays, matrices and data frames.
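
For instance, here is a minimal sketch on a matrix and a data frame. The objects m and df are made up for this example; colSums() is used to count the missing values per column.

```r
# Hypothetical matrix and data frame, each with some missing cells.
m  <- matrix(c(1, NA, 3, 4), nrow = 2)
df <- data.frame(x = c(1, NA, 3), y = c("a", "b", NA))

m_missing  <- is.na(m)    # logical matrix with the same shape as m
df_missing <- is.na(df)   # logical matrix, one column per variable

# Count the missing values per column, and in total.
per_column <- colSums(df_missing)
total      <- sum(df_missing)

print(m_missing)
print(per_column)
print(total)
```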

Let's create one more variable v2.

```
> v2<-c(1:4,NA,NA,2:3,NA)
> v2
```

```
## [1] 1 2 3 4 NA NA 2 3 NA
```

You might ask how to tell whether a vector has any missing values in it.

You can use the any() function for that. any() checks whether at least one of the logical values passed to it is TRUE. For example,

```
> any(TRUE,FALSE,TRUE)
```

```
## [1] TRUE
```

```
> any(FALSE,FALSE)
```

```
## [1] FALSE
```

We can use any() and is.na() together to answer the question:

```
> any(is.na(v2))
```

```
## [1] TRUE
```

Basically, is.na() returned a logical vector of TRUE and FALSE values based on whether each value is missing, and any() told us, “Yeah, there is at least one value which is TRUE”, and hence there is at least one missing value in vector v2.
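
Base R also has a shortcut, anyNA(), that answers the same question in one call. The sketch below rebuilds the values of v2 under a different name (v, chosen only for this example) and also counts the missing values, since TRUE counts as 1 when summed.

```r
v <- c(1:4, NA, NA, 2:3, NA)      # same values as v2

has_missing   <- anyNA(v)         # same answer as any(is.na(v))
n_missing     <- sum(is.na(v))    # number of missing values
share_missing <- mean(is.na(v))   # proportion of missing values

print(has_missing)     # TRUE
print(n_missing)       # 3
print(share_missing)
```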

You might then ask how to check whether all of the values are missing.

The all() function comes to the rescue.

```
> all(is.na(v2))
```

```
## [1] FALSE
```

So we got FALSE in return, which means that not all the values are missing. There are a few values which are not missing.

One question for you: say we want to replace all the missing values in vector v2 with the mean of the remaining values. How would you do that?

```
> mn<-mean(v2,na.rm=TRUE)
> v2[is.na(v2)]<-mn
> v2
```

```
## [1] 1.0 2.0 3.0 4.0 2.5 2.5 2.0 3.0 2.5
```
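
The same replacement can be wrapped in a small helper. The function name impute_mean is made up for this sketch; it indexes with the logical vector from is.na() directly and leaves the input vector unchanged, returning a filled copy instead.

```r
# Hypothetical helper: replace each NA with the mean of the non-missing values.
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

original <- c(1:4, NA, NA, 2:3, NA)
filled   <- impute_mean(original)

print(filled)
print(anyNA(original))   # TRUE: the input was not modified
print(anyNA(filled))     # FALSE
```

Mean imputation is only one possible choice; whether it is appropriate depends on the data.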

## Missing values and Dataframe

Missing values behave similarly in data frames. Let's use our favorite iris dataset to see how.

We are going to introduce a few missing values into a copy of the iris dataset.

```
> iris_1<-iris
> iris_1[c(1,2,3,4,5),c(2,3)]<-NA
> nrow(iris_1)
```

```
## [1] 150
```

```
> summary(iris_1)
```

```
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.00 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.80 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.00 Median :4.400 Median :1.300
## Mean :5.843 Mean :3.05 Mean :3.839 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.30 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.40 Max. :6.900 Max. :2.500
## NA's :5 NA's :5
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
##
```

iris_1 is a copy of iris with a few missing values introduced.

Let's try a few functions on this newly created data frame iris_1.

```
> iris_2<-na.omit(iris_1)
> nrow(iris_2)
```

```
## [1] 145
```

```
> summary(iris_2)
```

```
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.00 Min. :1.000 Min. :0.100
## 1st Qu.:5.200 1st Qu.:2.80 1st Qu.:1.600 1st Qu.:0.400
## Median :5.800 Median :3.00 Median :4.400 Median :1.300
## Mean :5.877 Mean :3.05 Mean :3.839 Mean :1.234
## 3rd Qu.:6.400 3rd Qu.:3.30 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.40 Max. :6.900 Max. :2.500
## Species
## setosa :45
## versicolor:50
## virginica :50
##
##
##
```

na.omit() returns the object with every row that contains at least one missing value removed, which is why 5 of the 150 rows are gone.

```
> iris_3<-iris_1[complete.cases(iris_1),]
> nrow(iris_3)
```

```
## [1] 145
```

```
> summary(iris_3)
```

```
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.00 Min. :1.000 Min. :0.100
## 1st Qu.:5.200 1st Qu.:2.80 1st Qu.:1.600 1st Qu.:0.400
## Median :5.800 Median :3.00 Median :4.400 Median :1.300
## Mean :5.877 Mean :3.05 Mean :3.839 Mean :1.234
## 3rd Qu.:6.400 3rd Qu.:3.30 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.40 Max. :6.900 Max. :2.500
## Species
## setosa :45
## versicolor:50
## virginica :50
##
##
##
```

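complete.cases() returns a logical vector with one element per row: TRUE when the row has no missing values and FALSE otherwise. Indexing iris_1 with it keeps only the complete rows, so we get the same 145 rows that na.omit() returned.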
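
One practical difference is that complete.cases() can be applied to a subset of columns, so rows are dropped only when the columns you care about are missing. The sketch below uses a made-up data frame called people to show this, and also compares the two approaches.

```r
# Hypothetical data frame with missing values in two different columns.
people <- data.frame(
  id     = 1:4,
  height = c(1.7, NA, 1.8, 1.6),
  weight = c(60, 70, NA, 55)
)

# Rows complete in every column, and rows complete in id and height only.
keep_all    <- complete.cases(people)
keep_height <- complete.cases(people[, c("id", "height")])

by_index <- people[keep_all, ]
by_omit  <- na.omit(people)

# Both approaches keep the same rows; na.omit() also records what it dropped.
same_rows <- identical(rownames(by_index), rownames(by_omit))
dropped   <- attr(by_omit, "na.action")

print(keep_all)      # TRUE FALSE FALSE TRUE
print(keep_height)   # TRUE FALSE TRUE TRUE
print(same_rows)     # TRUE
print(dropped)
```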

## Type conversion in R

Now, let's talk about type conversion in R.

By type conversion we mean converting a value from one data type to another. Let's have a look.

Let's create a variable 'a' which contains the number 5.

```
> a<-5
> a
```

```
## [1] 5
```

```
> class(a)
```

```
## [1] "numeric"
```

Now, let's convert 'a' to a character variable 'b'.

```
> b<-as.character(a)
> b
```

```
## [1] "5"
```

```
> class(b)
```

```
## [1] "character"
```

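Let's convert 'b' to an integer 'c'. (R still finds the function c() when it is called, but reusing the name of a built-in function is best avoided in real code.)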

```
> c<-as.integer(b)
> c
```

```
## [1] 5
```

```
> class(c)
```

```
## [1] "integer"
```
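
The other as.*() functions follow the same pattern. The sketch below collects a few conversions whose results are easy to get wrong; the variable names are illustrative only.

```r
num      <- as.numeric("3.14")   # 3.14
int_pos  <- as.integer(3.9)      # 3: the fractional part is dropped, not rounded
int_neg  <- as.integer(-3.9)     # -3: truncation is towards zero
lgl_ok   <- as.logical("TRUE")   # TRUE
lgl_bad  <- as.logical("yes")    # NA: not a spelling R recognises
lgl_zero <- as.logical(0)        # FALSE: zero is FALSE, other numbers are TRUE
chr      <- as.character(TRUE)   # "TRUE"

print(c(int_pos, int_neg))
print(c(lgl_ok, lgl_bad, lgl_zero))
print(chr)
```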

Now, let's create a vector 'd' from a mix of numbers and letters. A vector can hold only one type, so c() converts every element to character.

```
> d<-c(1,2,3,"a",4,"b")
> d
```

```
## [1] "1" "2" "3" "a" "4" "b"
```
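
This implicit conversion always moves towards the more general type: logical, then integer, then numeric, then character. A quick sketch (the variable names are made up for this example):

```r
# c() picks the most general type among its arguments.
lgl_int <- class(c(TRUE, 2L))    # "integer"
int_dbl <- class(c(2L, 2.5))     # "numeric"
dbl_chr <- class(c(2.5, "a"))    # "character"

# The same rule is why logical values can be summed: TRUE becomes 1.
n_true <- sum(c(TRUE, FALSE, TRUE))

print(c(lgl_int, int_dbl, dbl_chr))
print(n_true)                    # 2
```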

Now, what will happen if we convert d to a numeric vector? Let's try it out.

```
> e<-as.numeric(d)
```

```
## Warning: NAs introduced by coercion
```

```
> e
```

```
## [1] 1 2 3 NA 4 NA
```

```
> class(e)
```

```
## [1] "numeric"
```

As R couldn't convert 'a' and 'b' to numeric values, it returned NA for those two elements and issued a warning.
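
When the input is long, it helps to list exactly which entries failed to convert. One possible approach, sketched below with illustrative names, is to compare the NAs after conversion with the NAs that were already there before it. suppressWarnings() is used only because the warning is expected here.

```r
raw <- c("1", "2", "3", "a", "4", "b")

# The coercion warning is expected, so it is silenced for this call only.
converted <- suppressWarnings(as.numeric(raw))

# Entries that became NA during conversion (and were not NA to begin with).
failed_at <- which(is.na(converted) & !is.na(raw))
failed    <- raw[failed_at]

print(failed_at)   # 4 6
print(failed)      # "a" "b"
```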

Let's create a matrix and convert that to a dataframe.

```
> mt<-matrix(1:9,3,3)
> mt
```

```
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
```

```
> df<-as.data.frame(mt)
> df
```

```
## V1 V2 V3
## 1 1 4 7
## 2 2 5 8
## 3 3 6 9
```

But why are the variables named V1, V2 and V3? Because no column names were given when we created the matrix mt, so as.data.frame() made up default names.

```
> mt1<-matrix(1:9,3,3,dimnames = list(c("row1","row2","row3"), c("col1","col2","col3")))
> mt1
```

```
## col1 col2 col3
## row1 1 4 7
## row2 2 5 8
## row3 3 6 9
```

```
> df1<-as.data.frame(mt1)
> df1
```

```
## col1 col2 col3
## row1 1 4 7
## row2 2 5 8
## row3 3 6 9
```

```
> summary(df1)
```

```
## col1 col2 col3
## Min. :1.0 Min. :4.0 Min. :7.0
## 1st Qu.:1.5 1st Qu.:4.5 1st Qu.:7.5
## Median :2.0 Median :5.0 Median :8.0
## Mean :2.0 Mean :5.0 Mean :8.0
## 3rd Qu.:2.5 3rd Qu.:5.5 3rd Qu.:8.5
## Max. :3.0 Max. :6.0 Max. :9.0
```

Here we gave row names and column names while creating the matrix mt1. Remember that the dimnames argument expects a list, with the row names first and the column names second.
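
If the matrix already exists, the names can also be set afterwards with rownames() and colnames(). A minimal sketch, using the made-up names m and named_df:

```r
m <- matrix(1:9, 3, 3)

# Assign the names after the matrix has been created.
rownames(m) <- c("row1", "row2", "row3")
colnames(m) <- c("col1", "col2", "col3")

# The names carry over to the data frame.
named_df <- as.data.frame(m)

print(names(named_df))      # "col1" "col2" "col3"
print(rownames(named_df))   # "row1" "row2" "row3"
```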

Let's try converting df1 back to a matrix.

```
> mt2<-as.matrix(df1)
> mt2
```

```
## col1 col2 col3
## row1 1 4 7
## row2 2 5 8
## row3 3 6 9
```

```
> identical(mt1,mt2)
```

```
## [1] TRUE
```
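
The round trip is lossless here because every column of df1 has the same type. A matrix can hold only one type, so a data frame with mixed columns behaves differently, as this sketch with a hypothetical data frame called mixed suggests:

```r
# Hypothetical data frame with one integer and one character column.
mixed <- data.frame(n = 1:2, s = c("x", "y"))

# With a non-numeric column present, the whole matrix becomes character.
m_mixed <- as.matrix(mixed)
type_mixed <- typeof(m_mixed)

# Converting only the numeric column keeps its type.
m_numeric <- as.matrix(mixed["n"])
type_numeric <- typeof(m_numeric)

print(type_mixed)     # "character"
print(type_numeric)   # "integer"
```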

That's it. Type conversion is simple, but you have to stay vigilant about the warnings it produces.

#### analyticsfreak
