Aims

  • To demonstrate the other R data structures like data frames, lists etc
  • To demonstrate the R markdown

The structure of data in R continues

  • R has many data structures. These include
    • Atomic vector
    • factors
    • matrix
    • data frame
    • list

Factors

  • A factor is very similar to a vector
  • Factors are used to represent categorical data such as , sex (male or female), tumour stage (stage 1, stage 2 …)
  • Factors are useful for statistical analysis and plotting
x <- c("male", "female", "female", "male")
x
## [1] "male"   "female" "female" "male"
sex <- factor(x) # convert vector x into a factor
sex
## [1] male   female female male  
## Levels: female male
  • Once created, factors can only contain a pre-defined set values, known as levels.
levels(sex) # what are the levels in sex object
## [1] "female" "male"
nlevels(sex) # how many levels in sex object
## [1] 2
  • In R’s memory, these factors are represented by integers (1,2,3,4 …)
x <- c("low", "high", "medium", "high", "low", "medium", "high")
typeof(x) 
## [1] "character"
food <- factor(x) # convert x into a factor
levels(food) # get the levels
## [1] "high"   "low"    "medium"
typeof(sex) 
## [1] "integer"
food_relevel <- factor(x, levels = c("low", "medium", "high")) # re-level the levels
food_relevel
## [1] low    high   medium high   low    medium high  
## Levels: low medium high
levels(food_relevel)
## [1] "low"    "medium" "high"
nlevels(food_relevel)
## [1] 3
typeof(food_relevel)
## [1] "integer"
  • “is” family functions

    • Is function start with “is.” like is.vector(), is.character()

    • Output of is family of functions is a logical value TRUE or FALSE

      • What type of data does the object contain?
        • typeof(): Indicates the type of data contained in the object
        • is.character(): to test if the object holds character type data
        • is.double(): to test if the object holds decimal type data
        • is.numeric(): to test if the object holds numeric type data
        • is.integer(): to test if the object holds integer type data
        • is.logical(): to test if the object holds logical type data
      • What type of data structure does the object represent?
        • class(): Describes the type of data structure that the object represents
        • is.vector(): To determine if the object is a vector
        • is.factor(): To determine if the object is a factor
        • is.matrix(): To determine if the object is a matrix
        • is.data.frame(): To determine if the object is a data frame
x <- 1:100
typeof(x)
## [1] "integer"
is.integer(x)
## [1] TRUE
is.vector(x)
## [1] TRUE
is.factor(x)
## [1] FALSE
  • “as” family functions

    • For converting one data type to another data type or one data structure to another data structure

    • As function start with “as.” like as.vector(), as.character()

      • Convert one data type to another data type
        • as.character(): convert to character type data
        • as.numeric()
        • as.integer()
        • as.logical()
      • Convert one type of data structure to another
        • as.vector()
        • as.matrix()
        • as.factor()
        • as.data.frame()
x <- 1:100
typeof(x)
## [1] "integer"
y <- as.character(x)
typeof(y)
## [1] "character"
is.vector(x)
## [1] TRUE
z <- as.matrix(x)
is.matrix(z)
## [1] TRUE

Matrices

  • In R matrices are an extension of the numeric or character vectors.
  • As with atomic vectors, the elements of a matrix must be of the same data type.
  • The matrix can be viewed as a collection of vectors
    • Matrix rows and columns are vectors
    • All the vectors hold identical data type
    • The vectors must have the same length
x <- c(1:9)
is.vector(x)
## [1] TRUE
typeof(x)
## [1] "integer"
y <- matrix(data=1:9, nrow = 3, ncol = 3) # create a matrix 
class(y)
## [1] "matrix" "array"
typeof(y)
## [1] "integer"
  • In contrast to vectors, matrices and data frames are two dimensional data structures
    • First dimension: rows
    • Second dimension: columns
  • dim(): Provides the dimensions of an object
  • nrow(): gives number of rows
  • ncol(): gives number of columns
dim(y) # get dimensions of an object
## [1] 3 3
nrow(y) # gives number of rows
## [1] 3
ncol(y) # gives number of columns
## [1] 3
  • Unlike vectors, sub-setting requires both rows and columns to be provided
  • Like vector subscript operator “[]” is used to access the values
  • Unlike vector, matrix is a two dimensional data structure
  • general syntax: matrix[ rows , columns ]
y[] # gives all the values in a matrix
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
y[1,2] # get the value from 1st row and 2 columns
## [1] 4
y[1,] # get all the values from 1st row
## [1] 1 4 7
y[,3] # get all the values from 3rd column
## [1] 7 8 9
y[c(1,3), c(2,3)] # get 1st and 3rd values from 2nd and 3rd column
##      [,1] [,2]
## [1,]    4    7
## [2,]    6    9
  • Like vector matrix can hold only one type of data, if we try to mix two are more different data type, R implicitly convert into one type
typeof(y)
## [1] "integer"
y
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
y[2,3]
## [1] 8
y[2,3] <- "A" # replace 
y
##      [,1] [,2] [,3]
## [1,] "1"  "4"  "7" 
## [2,] "2"  "5"  "A" 
## [3,] "3"  "6"  "9"
typeof(y)
## [1] "character"

Lists

  • R’s simplest structure that combines data of different types is a list.
  • A list is a collection of any data structures
  • List can be of different types and different lengths.
  • A list can be a collection of lists.
  • Lists are very flexible data structures that can hold anything.
my_first_list <- list(1:10, c("a", "b", "c"), 
                      c(TRUE, FALSE), 100, 
                      c(1.3, 2.2, 0.75, 3.8))
my_first_list
## [[1]]
##  [1]  1  2  3  4  5  6  7  8  9 10
## 
## [[2]]
## [1] "a" "b" "c"
## 
## [[3]]
## [1]  TRUE FALSE
## 
## [[4]]
## [1] 100
## 
## [[5]]
## [1] 1.30 2.20 0.75 3.80

my_first_list has five elements and when printed out like this looks quite strange at first sight. Note how each of the elements of a list is referred to by an index within 2 sets of square brackets or one set of square. This gives a clue to how you can access individual elements in the list.

  • How to access the elements from the list?
    • Using any one of the following three operator …
      1. “[]”: Standard subscript operator, works like in a vector. The output is still a list.
      2. “[[]]”: This operator can only take one index number at a time and the output will be the data structure for that element.
      3. “$”: to extract an element from a list or a column from a data frame by name.
length(my_first_list)
## [1] 5
my_first_list[1] # get element one from the list
## [[1]]
##  [1]  1  2  3  4  5  6  7  8  9 10
my_first_list[1:3] # get element one from the list
## [[1]]
##  [1]  1  2  3  4  5  6  7  8  9 10
## 
## [[2]]
## [1] "a" "b" "c"
## 
## [[3]]
## [1]  TRUE FALSE
length(my_first_list)
## [1] 5
my_first_list[[1]] # get element one from the list
##  [1]  1  2  3  4  5  6  7  8  9 10
my_first_list[[5]] # get element one from the list
## [1] 1.30 2.20 0.75 3.80
  • Elements in lists are normally named
named_list <- list( 
  city_name=c("Cambridge", "London", "Oxford"),
  population=c(1.62, 8.9, 1.5 ) )

named_list
## $city_name
## [1] "Cambridge" "London"    "Oxford"   
## 
## $population
## [1] 1.62 8.90 1.50
named_list$city_name # equivalent to named_list[[1]]
## [1] "Cambridge" "London"    "Oxford"
named_list$population # equivalent to named_list[[2]]
## [1] 1.62 8.90 1.50
  • You can modify lists either by adding addition elements or modifying existing ones.
named_list$city_name[2] <- "LONDON"
named_list
## $city_name
## [1] "Cambridge" "LONDON"    "Oxford"   
## 
## $population
## [1] 1.62 8.90 1.50

Lists can be thought of as a ragbag collection of things without a very clear structure. You probably won’t find yourself creating list objects of the kind we’ve seen above when analyzing your own data. However, the list provides the basic underlying structure to the data frame that we’ll be using throughout the rest of this course.

The other area where you’ll come across lists is as the return value for many of the statistical tests and procedures such as linear regression that you can carry out in R.

To demonstrate, we’ll run a t-test comparing two sets of samples drawn from subtly different normal distributions. We’ve already come across the rnorm() function for creating random numbers based on a normal distribution.

sample_1 <- rnorm(n = 100, mean=5, sd=1)
sample_2 <- rnorm(n = 100, mean=7, sd=1)

result <- t.test(x=sample_1, y=sample_2)

is.list(result) # test if object is a list
## [1] TRUE
result
## 
##  Welch Two Sample t-test
## 
## data:  sample_1 and sample_2
## t = -15.787, df = 197.74, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.394985 -1.863080
## sample estimates:
## mean of x mean of y 
##  5.005948  7.134981
names(result) # list names of the object
##  [1] "statistic"   "parameter"   "p.value"     "conf.int"    "estimate"   
##  [6] "null.value"  "stderr"      "alternative" "method"      "data.name"
result$statistic
##         t 
## -15.78674
result$p.value
## [1] 7.260099e-37

Data frames

  • A much more useful data structure and the one we will mostly be using for the rest of the course is the data frame.
  • This is actually a special type of list in which all the elements are vectors of the same length.
df <- data.frame( 
  city_name=c("Cambridge", "London", "Oxford"),
  population=c(1.62, 8.9, 1.5 ))
df
##   city_name population
## 1 Cambridge       1.62
## 2    London       8.90
## 3    Oxford       1.50
class(df)
## [1] "data.frame"
dim(df)
## [1] 3 2
  • R provides many built-in data sets, most of which are represented as data frames.
  • data(): lists all the avilable built-in data sets
  • One popular data set is iris
  • To know more about any data set, one can use help() or ? to get help
  • iris:
    • data frame with 150 observations (rows)
    • 5 variables (columns)
      • Sepal.Length
      • Sepal.Width
      • Petal.Length
      • Petal.Width
      • Species
  • To bring one of these internal data sets to the fore, you can just start using it by name.
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
  • You can also get help for a data set such as iris in the usual way.
?iris # get help for iris data
  • This reveals that iris is a rather famous old data set of measurements taken by the esteemed British statistician and geneticist, Ronald Fisher (he of Fisher’s exact test fame).

  • A data frame is a special type of list so you can access its columns in the same way as we saw previously for lists.

names(iris)
## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"
iris$Petal.Width # or equivalently iris[["Petal.Width"]] or iris[[4]]
##   [1] 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 0.2 0.2 0.1 0.1 0.2 0.4 0.4 0.3
##  [19] 0.3 0.3 0.2 0.4 0.2 0.5 0.2 0.2 0.4 0.2 0.2 0.2 0.2 0.4 0.1 0.2 0.2 0.2
##  [37] 0.2 0.1 0.2 0.2 0.3 0.3 0.2 0.6 0.4 0.3 0.2 0.2 0.2 0.2 1.4 1.5 1.5 1.3
##  [55] 1.5 1.3 1.6 1.0 1.3 1.4 1.0 1.5 1.0 1.4 1.3 1.4 1.5 1.0 1.5 1.1 1.8 1.3
##  [73] 1.5 1.2 1.3 1.4 1.4 1.7 1.5 1.0 1.1 1.0 1.2 1.6 1.5 1.6 1.5 1.3 1.3 1.3
##  [91] 1.2 1.4 1.2 1.0 1.3 1.2 1.3 1.3 1.1 1.3 2.5 1.9 2.1 1.8 2.2 2.1 1.7 1.8
## [109] 1.8 2.5 2.0 1.9 2.1 2.0 2.4 2.3 1.8 2.2 2.3 1.5 2.3 2.0 2.0 1.8 2.1 1.8
## [127] 1.8 1.8 2.1 1.6 1.9 2.0 2.2 1.5 1.4 2.3 2.4 1.8 1.8 2.1 2.4 2.3 1.9 2.3
## [145] 2.5 2.3 1.9 2.0 2.3 1.8
  • Viewing data frames is made easier with these functions
    • head(): Shows first 6 lines of a data frame
    • tail(): Shows first 6 lines of a data frame
    • View(): Shows data frame in excel like format
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
tail(iris)
##     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
## 145          6.7         3.3          5.7         2.5 virginica
## 146          6.7         3.0          5.2         2.3 virginica
## 147          6.3         2.5          5.0         1.9 virginica
## 148          6.5         3.0          5.2         2.0 virginica
## 149          6.2         3.4          5.4         2.3 virginica
## 150          5.9         3.0          5.1         1.8 virginica
View(iris)
  • In that last example we extracted the Petal.Width column which itself is a vector. We can further subset the values in that column to, say, return the first 10 values only.
iris$Petal.Length[1:10]
##  [1] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5
  • Matrix-like syntax can be used to access both rows and columns.
  • df[ row index, column index]
# df[ row index, column index]
iris[ 1:4, 1:3] # rows from 1 to 4 and columns from 1 to 3
##   Sepal.Length Sepal.Width Petal.Length
## 1          5.1         3.5          1.4
## 2          4.9         3.0          1.4
## 3          4.7         3.2          1.3
## 4          4.6         3.1          1.5
iris[c(1,4,7), c(2,4)] #  rows from1,4 and 7 and columns from 2 and 4
##   Sepal.Width Petal.Width
## 1         3.5         0.2
## 4         3.1         0.2
## 7         3.4         0.3
  • To extract values from a data frame, row/column names may also be used
iris[c(1,4,7), c("Sepal.Width","Species" )] # iris[c(1,4,7), c(2,5)] 
##   Sepal.Width Species
## 1         3.5  setosa
## 4         3.1  setosa
## 7         3.4  setosa
  • We can also use conditional sub-setting to extract the rows that meet certain conditions, e.g. all the rows with Sepal.Length of 5.
iris[iris$Sepal.Length == 5, ]
##    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 5             5         3.6          1.4         0.2     setosa
## 8             5         3.4          1.5         0.2     setosa
## 26            5         3.0          1.6         0.2     setosa
## 27            5         3.4          1.6         0.4     setosa
## 36            5         3.2          1.2         0.2     setosa
## 41            5         3.5          1.3         0.3     setosa
## 44            5         3.5          1.6         0.6     setosa
## 50            5         3.3          1.4         0.2     setosa
## 61            5         2.0          3.5         1.0 versicolor
## 94            5         2.3          3.3         1.0 versicolor
  • One can use & (and), | (or) and ! (not) logical operators for complex sub-setting.
iris[iris$Sepal.Length == 5 & iris$Species == "setosa", ]
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 5             5         3.6          1.4         0.2  setosa
## 8             5         3.4          1.5         0.2  setosa
## 26            5         3.0          1.6         0.2  setosa
## 27            5         3.4          1.6         0.4  setosa
## 36            5         3.2          1.2         0.2  setosa
## 41            5         3.5          1.3         0.3  setosa
## 44            5         3.5          1.6         0.6  setosa
## 50            5         3.3          1.4         0.2  setosa

A data frame is the most common data structure you will encounter as a beginner. As you can see from the examples above, the syntax is tricky and not intuitive at all. In order to overcome this problem, R has a package called tidyverse. The tidyverse package combines eight other packages into one. Two important packages in tidyverse are dplyr and ggplot2

  • dplyr: Makes working with data frames fun and easy
  • ggplot2: Creates beautiful plots with ease.

From next week you will be working with tidyverse.

  • summary() is a very useful function that summarises each column in a data frame
summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 

R Markdown: a brief introduction

  • Documents created with R Markdown can serve as a neat record of your analyses.

  • Reproducible research aims to make your analysis easy to understand by other researchers

  • Using RMarkdown, you can view your code along with its output (graphs, tables, etc.) with conventional text to explain it, a bit like a notebook.

  • Once RMarkdown file (.Rmd) is created, one can easily knit into different formats

    • html file
    • pdf file
    • doc file
    • slides
  • An R Markdown file is made up of 3 basic components:

    • header: contains title, author, date etc
    • markdown: contains free text, images etc
    • R code chunks: contains actual R code
  • Header:

    • The markdown document starts with an optional header in YAML format known as the YAML metadata.
    • In the example below the title, author and date are specified in the header.
  • R code chunks

    • This contains actual R code just like code in the R script
    • A comment can be added to a code by using “#”
    • below is the example R code chunk
  • Markdown

    • The text following the header in an R Markdown file is in Markdown syntax.
    • This is the syntax that gets converted to HTML format once we click on the Knit button
    • The philosophy behind Markdown is that it should be easy to write and easy to read.
    • Within markdown, one can add text, add images, links etc
  • Headings:

  • Heading can be created by stating the line with “#” symbol

    • # Header 1

    • ## Header 2

    • ### Header 3

  • Text type

    • This is bold text or This is bold text
    • This is Italic text or This is Italic text
  • Create links

    • general syntax to create a link to a webpage use: [text of link](web address)
    • Example: [This is link to bitesize R course](https://bioinformatics-core-shared-training.github.io/Bitesize-R/) this is rendered as This is link to bitesize R course
  • Insert images

    • general syntax: ![Figure caption](path to image)
    • Example: ![R Logo](img/Rlogo.png)
  • Lists

    * Item 1
    * Item 2
    * Item 3

After the R markdown file has been prepared, save it and knit it to the desired document like shown below.

Credit

These instructions were adapted from following sources.

Exercises

  1. Using mtcars dataset answer the following.

    1. How many rows and columns in the mtcars data set?
    2. How many cars in the mtcars data set have 8 cylinders?
    3. How many cars in the mtcars data set have 6 cylinders and more than 3 gears?
Answer
# 1 a
nrow(mtcars)
## [1] 32
ncol(mtcars)
## [1] 11
# 1 b
sum(mtcars$cyl == 8)
## [1] 14
# 1 c
sum(mtcars$cyl== 6 & mtcars$gear > 3)
## [1] 5
  1. Extract 3rd and 5th rows with 1st and 3rd columns from mtcars data set.
Answer
mtcars[c(3,5), c(1,3)]
##                    mpg disp
## Datsun 710        22.8  108
## Hornet Sportabout 18.7  360
  1. From airquality data set
    1. Identify the data structure of the first row by extracting it?
    2. Identify the data structure of the first column by extracting it?
Answer
# 3 a
class(airquality[1,])
## [1] "data.frame"
# 3 b
class(airquality[,1])
## [1] "integer"
is.vector(airquality[,1])
## [1] TRUE
is.data.frame(airquality[,1])
## [1] FALSE
  1. Using airquality data set
    1. Show first 10 rows
    2. Show last 2 rows
    3. list all the column names available
Answer
# 4 a
head(airquality, n=10)
##    Ozone Solar.R Wind Temp Month Day
## 1     41     190  7.4   67     5   1
## 2     36     118  8.0   72     5   2
## 3     12     149 12.6   74     5   3
## 4     18     313 11.5   62     5   4
## 5     NA      NA 14.3   56     5   5
## 6     28      NA 14.9   66     5   6
## 7     23     299  8.6   65     5   7
## 8     19      99 13.8   59     5   8
## 9      8      19 20.1   61     5   9
## 10    NA     194  8.6   69     5  10
# 4 b
tail(airquality, n=2)
##     Ozone Solar.R Wind Temp Month Day
## 152    18     131  8.0   76     9  29
## 153    20     223 11.5   68     9  30
# 4 c
names(airquality)
## [1] "Ozone"   "Solar.R" "Wind"    "Temp"    "Month"   "Day"
colnames(airquality)
## [1] "Ozone"   "Solar.R" "Wind"    "Temp"    "Month"   "Day"
# 3 b
class(airquality[,1])
## [1] "integer"
is.vector(airquality[,1])
## [1] TRUE
is.data.frame(airquality[,1])
## [1] FALSE
  1. Using airquality data set, can you test Ozone production and solar radiation are correlated and answer the following. Hint: Usecor.test function.
    1. Get correlation coefficient confidence interval
    2. what is the p-value?
    3. What is the estimated correlation coefficient?
Answer
# 1 a
results <- cor.test(airquality$Ozone, airquality$Solar.R)
results$conf.int
## [1] 0.173194 0.502132
## attr(,"conf.level")
## [1] 0.95
# 1 b
result$p.value
## [1] 7.260099e-37
# 1 c
results$estimate
##       cor 
## 0.3483417