A short introduction to R

Outline

In this session, we review some of the fundamentals of the R language. It should be a useful refresher prior to the intermediate-level Data Analysis and Visualisation using R course.

Topics covered include:-

  • Creating variables
  • Using Functions
  • Vectors
  • Data frame
  • Subsetting data, the base R way
  • Plotting
  • Statistical testing
  • How to get help

For a more detailed introduction, we suggest the following free resources

R basics

Advantages of R

The R programming language is now recognised beyond the academic community as an effect solution for data analysis and visualisation. Notable users of R include:-

Key features

  • Open-source
  • Cross-platform
  • Access to existing visualisation / statistical tools
  • Flexibility
  • Visualisation and interactivity
  • Add-ons for many fields of research
  • Facilitating Reproducible Research
duke-scandal

duke-scandal

Two Biostatiscians (later termed ‘Forensic Bioinformaticians’) from M.D. Anderson used R extensively during their re-analysis and investigation of a Clinical Prognostication paper from Duke. The subsequent scandal put Reproducible Research at the forefront of everyone’s mind.

Keith Baggerly’s talk on the subject is highy-recommended.

Support for R

  • Online forums such as Stack Overflow regularly feature R
  • Blogs
  • Local user groups
  • e.g. Sheffield-R
  • Documentation via ? or help.start()
  • Documentation for packages is found via the Packages tab in the bottom-right of RStudio.
  • Packages analyse all kinds of Genomic data (>800)
  • Compulsory documentation (vignettes) for each package
  • 6-month release cycle
  • Course Materials
  • Example data and workflows
  • Common, re-usable framework and functionality
  • Available Support
    • Often you will be able to interact with the package maintainers / developers and other power-users of the project software

RStudio

  • Rstudio is a free environment for R
  • Convenient menus to access scripts, display plots
  • Still need to use command-line to get things done
  • Developed by some of the leading R programmers
  • Used by beginners, and experienced users alike

To get started, you will need to install the latest version of R and RStudio Desktop; both of which are free.

Once installed, you should be able to launch RStudio by clicking on its icon:-

Entering commands in R

  • The traditional way to enter R commands is via the Terminal, or using the console in RStudio (bottom-left panel when RStudio opens for first time).
  • this doesn’t automatically keep track of the steps you did
  • Alternative, an R script can be used to keep a record of the commands you used.
  • For this course we will use a relatively new feature called R markdown.
  • An R markdown mixes plain text with R code
  • The R code can be run from inside the document and the results are displayed directly underneath
  • Each chunk of R code looks something like this.
  • Each line of R can be executed by clicking on the line and pressing CTRL and ENTER
  • Or you can press the green triangle on the right-hand side to run everything in the chunk
  • Try this now!
print("Hello World")
[1] "Hello World"
  • You can add R chunks by pressing CRTL + ALT + I
    • or using the Insert menu option
    • (can also include code from other languages such as Python or bash)
    • try and avoid adding code chunks manually

Getting started

At a basic level, we can use R as a calculator to compute simple sums with the +, -, * (for multiplication) and / (for division) symbols.

2 + 2
[1] 4
2 - 2
[1] 0
4 * 3
[1] 12
10 / 2
[1] 5

The answer is displayed at the console with a [1] in front of it. The 1 inside the square brackets is a place-holder to signify how many values were in the answer (in this case only one). We will talk about dealing with lists of numbers shortly…

In the case of expressions involving multiple operations, R respects the BODMAS system to decide the order in which operations should be performed.

2 + 2 *3
[1] 8
2 + (2 * 3)
[1] 8
(2 + 2) * 3
[1] 12

R is capable of more complicated arithmetic such as trigonometry and logarithms; like you would find on a fancy scientific calculator. Of course, R also has a plethora of statistical operations as we will see.

pi
[1] 3.141593
sin (pi/2)
[1] 1
cos(pi)
[1] -1
tan(2)
[1] -2.18504
log(1)
[1] 0

We can only go so far with performing simple calculations like this. Eventually we will need to store our results for later use. For this, we need to make use of variables.

Variables

A variable is a letter or word which takes (or contains) a value. We use the assignment ‘operator’, <- to create a variable and store some value in it.

x <- 10
x
[1] 10
myNumber <- 25
myNumber
[1] 25

We also can perform arithmetic on variables using functions:

sqrt(myNumber)
[1] 5

We can add variables together:

x + myNumber
[1] 35

We can change the value of an existing variable:

x <- 21
x
[1] 21
  • We can set one variable to equal the value of another variable:
x <- myNumber
x
[1] 25
  • We can modify the contents of a variable:
myNumber <- myNumber + sqrt(16)
myNumber
[1] 29

When we are feeling lazy we might give our variables short names (x, y, i…etc), but a better practice would be to give them meaningful names. There are some restrictions on creating variable names. They cannot start with a number or contain characters such as ., _, ‘-’. Naming variables the same as in-built functions in R, such as c, T, mean should also be avoided.

Naming variables is a matter of taste. Some conventions exist such as a separating words with - or using camelCaps. Whatever convention you decided, stick with it!

Functions

Functions in R perform operations on arguments (the inputs(s) to the function). We have already used:

sin(x)
[1] -0.1323518

this returns the sine of x. In this case the function has one argument: x. Arguments are always contained in parentheses – curved brackets, () – separated by commas.

Arguments can be named or unnamed, but if they are unnamed they must be ordered (we will see later how to find the right order). The names of the arguments are determined by the author of the function and can be found in the help page for the function. When testing code, it is easier and safer to name the arguments. seq is a function for generating a numeric sequence from and to particular numbers. Type ?seq to get the help page for this function.

seq(from = 3, to = 20, by = 4)
[1]  3  7 11 15 19
seq(3, 20, 4)
[1]  3  7 11 15 19

Arguments can have default values, meaning we do not need to specify values for these in order to run the function.

rnorm is a function that will generate a series of values from a normal distribution. In order to use the function, we need to tell R how many values we want

## this will produce a random set of numbers, so everyone will get a different set of numbers
rnorm(n=10)
 [1]  1.07428564 -1.42125894  0.07065709  1.64828501 -1.43106387  0.87397931 -1.03087092 -0.69340405  0.50653689
[10]  0.85978011

The normal distribution is defined by a mean (average) and standard deviation (spread). However, in the above example we didn’t tell R what mean and standard deviation we wanted. So how does R know what to do? All arguments to a function and their default values are listed in the help page

(N.B sometimes help pages can describe more than one function)

?rnorm

In this case, we see that the defaults for mean and standard deviation are 0 and 1. We can change the function to generate values from a distribution with a different mean and standard deviation using the mean and sd arguments. It is important that we get the spelling of these arguments exactly right, otherwise R will an error message, or (worse?) do something unexpected.

rnorm(n=10, mean=2,sd=3)
 [1]  5.7418172  0.1905547 -3.8939599 -1.3003433  3.4895211  1.9100285  1.5689795 -0.1027642 -0.5908338  5.3734122
rnorm(10, 2, 3)
 [1] -5.6024243  1.6679810  5.1216143  4.5250889  2.8038471  6.0292955  1.3821019 -0.1843767 -3.7224487  4.5168876

In the examples above, seq and rnorm were both outputting a series of numbers, which is called a vector in R and is the most-fundamental data-type.




Exercise

  • What is the value of pi to 3 decimal places?
    • see the help for round ?round
  • How can we a create a sequence from 2 to 20 comprised of 5 equally-spaced numbers?
    • check the help page for seq ?seq
  • Create a variable containing 1000 random numbers with a mean of 2 and a standard deviation of 3
    • what is the maximum and minimum of these numbers?
    • what is the average?
    • HINT: see the help pages for functions min, max and `mean



Vectors

  • The basic data structure in R is a vector – an ordered collection of values.
  • R treats even single values as 1-element vectors.
  • The function c combines its arguments into a vector:
  • As c is a function, we specify it’s arguments in curved brackets(...)
x <- c(3,4,5,6)
x
[1] 3 4 5 6

The seq function we saw before was another example of how to create a sequence of values. A useful shortcut is to use the : symbol.

x <- 3:6
x
[1] 3 4 5 6

The square brackets [] indicate the position within the vector (the index). We can extract individual elements by using the [] notation:

x[1]
[1] 3
x[4]
[1] 6

We can even put a vector inside the square brackets: (vector indexing)




Exercise

Without using R!

  • If y <- 2:4, what would x[y] give?
    • [1] 3 5
    • [1] 2 4
    • [1] 4 5 6



When applying all standard arithmetic operations to vectors, application is element-wise. Thus, we say that R supports vectorised operations.

x <- 1:10
y <- x*2
y
 [1]  2  4  6  8 10 12 14 16 18 20
z <- x^2
x + y
 [1]  3  6  9 12 15 18 21 24 27 30

Vectorised operations are extremely powerful. Operations that would require a for loop (or similar) in other languages such as C, Python, can be performed in a single line of R code.

A vector can also contain text; called a character vector. Such a vector can also be constructed using the c function.

x <- c("A","B","C","D")

The quote marks are crucial. Why?

##try and ignore the eval=FALSE on the line above...we'll explain what this means later
x <- c(A, B, C, D)
Error in try(x <- c(A, B, C, D), silent = TRUE) : object 'A' not found

Another useful type of data that we will see is the logical or boolean which can take either the values of TRUE or FALSE

x <- c(TRUE,TRUE,FALSE)

Logical values are useful when we want to create subsets of our data. We can use comparison operators; ==, >, <, != to check if values are equal, greater than, less than, or not equal.

x <- c("A","A", "B","B","C")
x == "A"
[1]  TRUE  TRUE FALSE FALSE FALSE
x != "A"
[1] FALSE FALSE  TRUE  TRUE  TRUE
x <- rnorm(10)
x > 0
 [1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE

However, all items in the vector must be the same type. If you attempt anything else, R will convert all values to the same (most basic) type.

x <- c(1, 2, "three")
x
[1] "1"     "2"     "three"

Packages in R

So far we have used functions that are available with the base distribution of R; the functions you get with a clean install of R. The open-source nature of R encourages others to write their own functions for their particular data-type or analyses.

Packages are distributed through repositories. The most-common ones are CRAN and Bioconductor. CRAN alone has many thousands of packages.

The Packages tab in the bottom-right panel of RStudio lists all packages that you currently have installed. Clicking on a package name will show a list of functions that available once that package has been loaded. The library function is used to load a package and make it’s functions / data available in your current R session. You need to do this every time you load a new RStudio session.

## RColorBrewer is a package for making nice colours
library(RColorBrewer)

There are functions for installing packages within R. If your package is part of the main CRAN repository, you can use install.packages

We will be using the gapminder R package in this practical. To install it, we would do.

install.packages("RColorBrewer")

A package may have several dependancies; other R packages from which it uses functions or data types (re-using code from other packages is strongly-encouraged). If this is the case, the other R packages will be located and installed too.

So long as you stick with the same version of R, you won’t need to repeat this install process.

Dealing with data

We are going to explore some of the basic features of R using data from the gapminder project, which have been bundled into an R package. These data give various indicator variables for different countries around the world (life expectancy, population and Gross Domestic Product). We have saved these data as a .csv file to demonstrate how to import data into R.

You can download these data here. Right-click the link and save to somewhere on your computer that you wish to work from.

The working directory

Like other software (Word, Excel, Photoshop….), R has a default location where it will save files to and import data from. This is known as the working directory in R. You can query what R currently considers its working directory by doing:-

getwd()

N.B. Here, a set of open and closed brackets () is used to call the getwd function with no arguments.

We can also list the files in this directory with:-

list.files()

Any .csv file in the working directory can be imported into R by supplying the name of the file to the read.csv function and creating a new variable to store the result. A useful sanity check is the file.exists function which will print TRUE is the file can be found in the working directory.

file.exists("gapminder.csv")
[1] TRUE

If the file we want to read is not in the current working directory, we will have to write the path to the file; either relevant to the current working directory (e.g. the directory “up” from the current working directory, or in a sub-folder), or the full path. In an interactive session, you can do use file.choose to open a dialogue box. The path to the the file will then be displayed in R.

file.choose()

Assuming the file can be found, we can use read.csv to import. Other functions can be used to read tab-delimited files (read.delim) or a generic read.table function. A data frame object is created.

gapminder <- read.csv("gapminder.csv")

The data frame object in R allows us to work with “tabular” data, like we might be used to dealing with in Excel, where our data can be thought of having rows and columns. The values in each column have to all be of the same type (i.e. all numbers or all text).




Exercise

  • What are the dimensions of the data frame?
  • What columns are available?
  • HINT: see the dim, ncol, nrow and colnames functions



In Rstudio , you can view the contents of the data frame we have just created. This is useful for interactive exploration of the data, but not so useful for automation and scripting and analyses.

View(gapminder)

We should always check the data frame that we have created. Sometimes R will happily read data using an inappropriate function and create an object without raising an error. However, the data might be unsuable. Consider:-

test <- read.delim("gapminder.csv")
head(test)
dim(test)
[1] 1704    1

We can access the columns of a data frame by knowing the column name. TIP Use auto-complete with the TAB key to get the name of the column correct

gapminder$country

A vector (1-dimensional) is returned, the length of which is the same as the number of rows in the data frame. The vector could be stored as a variable and itself be subset or used in further calculations

The summary function is a useful way of summarising the data containing in each column. It will give information about the type of data (remember, data frames can have a mixture of numeric and character columns) and also an appropriate summary. For numeric columns, it will report some stats about the distribution of the data. For categorical data, it will report the different levels.

summary(gapminder)
        country        continent        year         lifeExp           pop              gdpPercap       
 Afghanistan:  12   Africa  :624   Min.   :1952   Min.   :23.60   Min.   :6.001e+04   Min.   :   241.2  
 Albania    :  12   Americas:300   1st Qu.:1966   1st Qu.:48.20   1st Qu.:2.794e+06   1st Qu.:  1202.1  
 Algeria    :  12   Asia    :396   Median :1980   Median :60.71   Median :7.024e+06   Median :  3531.8  
 Angola     :  12   Europe  :360   Mean   :1980   Mean   :59.47   Mean   :2.960e+07   Mean   :  7215.3  
 Argentina  :  12   Oceania : 24   3rd Qu.:1993   3rd Qu.:70.85   3rd Qu.:1.959e+07   3rd Qu.:  9325.5  
 Australia  :  12                  Max.   :2007   Max.   :82.60   Max.   :1.319e+09   Max.   :113523.1  
 (Other)    :1632                                                                                       



Exercise

  • Save the life expectancy and population as variables
    • what is the maximum life expectancy?
    • what is the smallest population?
    • create a new vector containing the life expectancy to the nearest whole number
    • create a new vector containing the populations expressed as millions of people
    • HINT:- min, max, round…..



Subsetting

A data frame can be subset using square brackes[] placed after the name of the data frame. As a data frame is a two-dimensional object, you need a row and column index, or vector indices.

gapminder[1,2]
gapminder[2,1]
gapminder[c(1,2,3),1]
gapminder[c(1,2,3),c(1,2,3)]

Note that the data frame is not altered we are just seeing what a subset of the data looks like and not changing the underlying data. If we wanted to do this, we would need to create a new variale.

gapminder

Should we wish to see all rows, or all columns, we can neglect either the row or column index

gapminder[1,]
gapminder[,1]

Just like subsetting a vector, the indices can be vectors containing multiple values

gapminder[1:3,1:2]
gapminder[seq(1,1704,length.out = 10),1:4]

A common shortcut is head which prints the first six rows of a data frame.

head(gapminder)

When subsetting entire rows you need to remember the , after the row indices. If you fail to do so, R may still return a result. However, it probably won’t be what you expected. Look what happens if you wanted to the first three rows but typed the following command

gapminder[1:3]

Rather than selecting rows based on their numeric index (as in the previous example) we can use what we call a logical test. This is a test that gives either a TRUE or FALSE result. When applied to subsetting, only rows with a TRUE result get returned.

For example we could compare the lifeExp variable to 40. The result is a vector of TRUE or FALSE; one for each row in the data frame

gapminder$lifeExp < 40

This R code can be put inside the square brackets to select rows of interest (those observations where the life expectancy variable is less than 40).

gapminder[gapminder$lifeExp < 40, ]

The , is important as this tells R to display all columns. If we wanted a subset of the columns we would put their indices after the ,

gapminder[gapminder$lifeExp < 40, 1:4]

Testing for equality can be done using ==. This will only give TRUE for entries that are exactly the same as the test string.

gapminder[gapminder$country == "Zambia",]

N.B. For partial matches, the grep function and / or regular expressions (if you know them) can be used.

gapminder[grep("land", gapminder$country),]

There are a couple of ways of testing for more than one text value. The first uses an or | statement. i.e. testing if the value of country is Zambia or the value is Zimbabwe.

The %in% function is a convenient function for testing which items in a vector correspond to a defined set of values.

gapminder[gapminder$country == "Zambia" | gapminder$country == "Zimbabwe",]
gapminder[gapminder$country %in% c("Zambia","Zimbabwe"),]

Similar to or, we can require that both tests are TRUE by using an and & operation. e.g. which years in Zambia had a life expectancy less than 40

gapminder[gapminder$country == "Zambia" & gapminder$lifeExp < 40,]

Finally, we have != for testing if something is not equal

gapminder[gapminder$continent != "Europe",]

You will sometimes see the which function used during subsetting. This is used to convert the logical (TRUE/FALSE) vector into a numeric vector that gives the positon of the TRUE values only

  • this is useful if you think your vector might have NA (missing) values
  • there are no NAvalues in gapminder, so we’ll fake an example
tmp <- gapminder
tmp$pop[1:10] <- NA
tmp[tmp$pop < 1e6,]
tmp[which(tmp$pop < 1e6),]



Exercise

  • Create a subset of the data where the population less than a million in the year 2002
  • Create a subset of the data where the life expectancy is less than 45 in the year 2007
    • which continents are these countries located in?
  • Which rows in the data frame correspond to the maximum life expectancy and smallest population that you found in the previous exercise?



Columns in the data frame can be modified, or new columns created

  • can use the $ operator to assign a value to a column that doesn’t currently exist
    • the vector on the right of the assignment <- must that a length equal to the number of rows in the data frame
  • can also overwrite existing columns
gapminder_extra <- gapminder
gapminder_extra$PopInMillions <- gapminder_extra$pop / 1000000
gapminder_extra$lifeExp <- round(gapminder_extra$lifeExp)
gapminder_extra

Ordering and sorting

A vector can be returned in sorted form using the sort function.

sort(countries)
sort(countries,decreasing = TRUE)

However, if we want to sort an entire data frame a different approach is needed. The trick is to use order. Rather than giving a sorted set of values, it will give sorted indices. These indices can then be used for a subset operation.

leastPop <- gapminder[order(gapminder$pop),]
head(leastPop)

We can even order by more than one condition

gapminder[order(gapminder$year, gapminder$country),]

A final point on data frames is that we can export them out of R once we have done our data processing.

byWealth <- gapminder[order(gapminder$gdpPercap,decreasing = TRUE),]
head(byWealth)
write.csv(byWealth, file="dataOrderedByWealth.csv")

Plotting and stats

All your favourite types of plot can be created in R

  • Simple plots are supported in the base distribution of R (what you get automatically when you download R).
    • boxplot, hist, barplot,… all of which are extensions of the basic plot function
  • Many different customisations are possible
    • colour, overlay points / text, legends, multi-panel figures
  • You need to think about how best to visualise your data
  • References..

Plots can be constructed from vectors of numeric data, such as the data we get from a particular column in a data frame

hist(gapminder$lifeExp)

Scatter plots of two variables require two arguments; one for the x and one for the y axis.

plot(gapminder$pop,gapminder$lifeExp)

Barplots are commonly-used for counts of categorical data

barplot(table(gapminder$continent))

Boxplots are good for visualising and comparing distributions. Here the ~ symbol sets up a formula, the effect of which is to put the categorical variable on the x axis and continuous variable on the y axis.

boxplot(gapminder$gdpPercap ~ gapminder$continent)

Lots of customisations are possible to enhance the appaerance of our plots. Not for the faint-hearted, the help pages ?plot and ?par give the full details. In short,

  • Axis labels, and titles can be specified as character strings.

  • R recognises many preset names as colours. To get a full list use colours()
  • Plotting characters can be specified using a pre-defined number:-

Putting it all together.

plot(gapminder$pop,gapminder$lifeExp,pch=16,
     col="red",ylab="Life Expectancy",
     xlab="Population",main="Life Expectancy trend with population")

The same customisations can be used for various plots:-

boxplot(gapminder$gdpPercap ~ gapminder$continent,
        col=c("red","orange","green","blue","purple"),
        main="GDP per-continent",
        xlab="Continent",
        ylab="GDP")

an also save plots to a file calling the pdf or png functions before executing the code to create the plot.

pdf("myLittlePlot.pdf")
barplot(table(gapminder$continent))
dev.off()
null device 
          1 

Any plots created in-between the pdf(..) and dev.off() lines will get saved to the named file. The dev.off() line is very important; without it you will not be able to view the plot you have created. pdf files are useful because you can create documents with multiple pages. Moreover, they can be imported into tools such as Adobe Illustrator to be incorporated with other graphics.

The canvas model

It is important to realise that base graphics in R uses a “canvas model” to create graphics. We can only overlay extra information on-top of an exising plot and cannot “undo” what is already drawn.

Let’s suppose we want to visualise and life expectancy and population of countries in Europe and Africa. First, create two datasets to represent European and African countries in the year 2002

euroData <- gapminder[gapminder$continent == "Europe" & gapminder$year == 2002,]
dim(euroData)
[1] 30  6
afrData <- gapminder[gapminder$continent == "Africa" & gapminder$year == 2002,]
dim(afrData)
[1] 52  6

We can start by plotting the life expectancy of the European countries as red dots.

plot(euroData$pop, euroData$lifeExp,col="red",
     pch=16,
     xlab="Population",
     ylab="Life Expectancy")

The points function can be used put extra points corresponding to African countries on the existing plot.

points(afrData$pop, afrData$lifeExp,col="blue",pch=16)

Wait, how many African countries did we have?

nrow(afrData)
[1] 52

The problem here is that the initial limits of the y axis were defined using the life expectancy range of the European data. We can only add points to the existing plotting window, so any African countries with life expectancy outside this range will not get displayed.

We can define the axes when we create the plot using xlim and ylim.

plot(euroData$pop, euroData$lifeExp,col="red",
     pch=16,
     xlab="Population",
     ylab="Life Expectancy",
     xlim=c(0,8e7),ylim=c(30,90))
points(afrData$pop, afrData$lifeExp,col="blue",pch=16)

Other useful functions for adding features to an existing plot include text, abline, grid, legend among others

Another useful trick for plotting is to take advantage of pre-existing colour palettes in R. The RColorBrewer package is a useful package for such palettes; many of which are friendly to those with visual impairments.

library(RColorBrewer)
display.brewer.all(colorblindFriendly = TRUE)

The brewer.pal function can return the names of n colours from one of the pre-defined palettes to be used as a col argument to a plotting function.

boxplot(gapminder$gdpPercap ~ gapminder$continent,col=brewer.pal(5,"Set2"),
        main="GDP per-continent",
        xlab="Continent",
        ylab="GDP")




Exercise

  • The plot of population versus GDP shows and obvious set of outliers
  • which countries do the points represent?
  • can you plot these points in a different colour?
  • use the abline function to draw a vertical line to separate the outliers from the rest of the plot
    • see below for what the graph is intended to look like



The col argument to plot doesn’t have to contain only one colour. We can use a vector that assigns a colour to each point to be plotted. In the following code we:-

  • create a subset of the data describing European and African countries in 2002
  • create a dummy vector that is just red repeated as many times as rows in the data frame
subset_afr_eur <- gapminder[gapminder$continent %in% c("Europe", "Africa") & gapminder$year == 2002,]
mycol <- rep("red",nrow(subset_afr_eur))
plot(subset_afr_eur$pop, subset_afr_eur$lifeExp,
     col=mycol,
     pch=16)

We can replace the items in the colour vector that correspond to an African country with a different value

  • we can now create the same plot as before, but only using the plot function once
mycol[which(subset_afr_eur$continent == "Africa")] <- "blue"
plot(subset_afr_eur$pop, subset_afr_eur$lifeExp,
     col=mycol,
     pch=16,
     xlab="Population",
     ylab="Life Expenctancy")

If we preferred, we could plot the data for European and African countries in separate plots side-by-side.

  • here we specify a layout of the plot using the mfrow plotting parameter
  • c(1,2) corresponds to 1 row and 2 columns
  • make sure you run the entire code chunk at once by pressing the green run button, or highlighting all the lines
  • running the code line-by-line won’t have the correct effect
par(mfrow=c(1,2))
plot(euroData$pop, euroData$lifeExp,col="red",
     pch=16,
     xlab="Population",
     ylab="Life Expectancy",
     main="European Countries",
    xlim=c(0,8e7),ylim=c(30,90))
plot(afrData$pop, afrData$lifeExp,col="blue",
     pch=16,
     xlab="Population",
     ylab="Life Expectancy",
     main="African Countries",
    xlim=c(0,8e7),ylim=c(30,90))

We can’t really have a run-through of the R language without at least mentioning statistics! However, like plotting it is a vast field. The main challenges are putting your data in the correct format (which we have covered here), and deciding which test to use (which R will not advise you on!)

The t.test function is probably the most fundamental statistical testing function in R, and can be adapted to many different situations. Full details are given in the help page ?t.test.

The life expectancy looks to be different between Europe and Africa

boxplot(gapminder$lifeExp ~ gapminder$continent)

The output from t.test can be used to judge if there is a statistically-significant difference in means

  • need to specify two vectors, which are the values we want to test for difference in means
    • they don’t need to contain the same number of values
t.test(euroData$lifeExp,afrData$lifeExp)

    Welch Two Sample t-test

data:  euroData$lifeExp and afrData$lifeExp
t = 16.318, df = 65.751, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 20.51515 26.23558
sample estimates:
mean of x mean of y 
 76.70060  53.32523 

Were our data not normally-distributed we could use wilcox.test, for example. Fortunately, most statistical tests can be accessed in a similar manner, so it is easy to switch between using different tests provided your data are in the correct format. To re-iterate, the skill is in choosing which test is appropriate.

R also has functionality for assessing correlation between sets of values. For example, there seems to be a relationship between a country’s wealth and life expenctancy:-

plot(euroData$gdpPercap,euroData$lifeExp)

This can be quantified by calculating the correlation. Commonly the \(R^2\) value is reported

cor(euroData$gdpPercap,euroData$lifeExp)
[1] 0.8545089
cor(euroData$gdpPercap,euroData$lifeExp)^2
[1] 0.7301854

A linear model can be used to assess the relationship, which we can overlay on the plot using the function abline. The coefficients of the model give the slope and intercept of the straight line

model <- lm(euroData$lifeExp~euroData$gdpPercap)
plot(euroData$gdpPercap,euroData$lifeExp)
abline(model)

model$coefficients
       (Intercept) euroData$gdpPercap 
      7.185885e+01       2.230016e-04 
---
title: "R Crash Course"
author: "Mark Dunning"
date: '`r format(Sys.time(), "Last modified: %d %b %Y")`'
output: 
  html_notebook: 
    toc: yes
    toc_float: yes
---


# A short introduction to R

## Outline

In this session, we review some of the fundamentals of the R language. It should be a useful refresher prior to the intermediate-level [Data Analysis and Visualisation using R](http://bioinformatics-core-shared-training.github.io/r-intermediate/) course. 

Topics covered include:- 

- Creating variables
- Using Functions
- Vectors
- Data frame
- Subsetting data, the base R way
- Plotting 
- Statistical testing
- How to get help

For a more detailed introduction, we suggest the following ***free*** resources

- [Solving Biological Problems with R](http://cambiotraining.github.io/r-intro/)
- [Introduction to Data Science with R](http://shop.oreilly.com/product/0636920034834.do)
- [Coursera course in R](http://blog.revolutionanalytics.com/2012/12/coursera-videos.html)
- [Beginners Introduction to R Statistical Software](http://bitesizebio.com/webinar/20600/beginners-introduction-to-r-statistical-software/)
- [R programming wiki](https://en.wikibooks.org/wiki/R_Programming)
- [Quick R](http://www.statmethods.net/)

# R basics

## Advantages of R

![](images/NYTimes_R_Article.png)

The R programming language is now recognised beyond the academic community as an effect solution for data analysis and visualisation. [Notable users of R](http://www.revolutionanalytics.com/companies-using-r) include:- 

- [Facebook](http://blog.revolutionanalytics.com/2010/12/analysis-of-facebook-status-updates.html),
- [google](http://blog.revolutionanalytics.com/2009/05/google-using-r-to-analyze-effectiveness-of-tv-ads.html),
- [Microsoft](http://blog.revolutionanalytics.com/2014/05/microsoft-uses-r-for-xbox-matchmaking.html) (who recently [invested](http://blogs.microsoft.com/blog/2015/01/23/microsoft-acquire-revolution-analytics-help-customers-find-big-data-value-advanced-statistical-analysis/) in a commerical provider of R)
- The [New York Times](http://blog.revolutionanalytics.com/2011/03/how-the-new-york-times-uses-r-for-data-visualization.html).
- [Buzzfeed](http://blog.revolutionanalytics.com/2015/12/buzzfeed-uses-r-for-data-journalism.html) use R for some of their serious articles and have made the code [publically available](https://buzzfeednews.github.io/2016-04-federal-surveillance-planes/analysis.html)
- The [New Zealand Tourist Board](https://mbienz.shinyapps.io/tourism_dashboard_prod/) have R running in the background of their website
- The BBC makes code available for some of their stories (e.g. [gender bias in music festivals](https://github.com/BBC-Data-Unit/music-festivals))
- [Airbnb](https://medium.com/airbnb-engineering/using-r-packages-and-education-to-scale-data-science-at-airbnb-906faa58e12d)
![](https://cdn-images-2.medium.com/max/1200/1*BTMbVFh_hziJJcaQ7TBRwg.png)

## Key features

- Open-source
- Cross-platform
- Access to existing visualisation / statistical tools
- Flexibility
- Visualisation and interactivity
- Add-ons for many fields of research
- Facilitating ***Reproducible Research***

![duke-scandal](images/rep-research-nyt.png)

Two Biostatiscians (later termed '*Forensic Bioinformaticians*') from M.D. Anderson used R extensively during their re-analysis and investigation of a Clinical Prognostication paper from Duke. The subsequent [scandal](https://www.youtube.com/watch?v=W5sZTNPMQRM) put Reproducible Research at the forefront of everyone's mind.

Keith Baggerly's talk on the subject is highy-recommended.

<iframe width="420" height="315" src="https://www.youtube.com/embed/7gYIs7uYbMo" frameborder="0" allowfullscreen></iframe>

## Support for R

- Online forums such as [Stack Overflow](http://stackoverflow.com/questions/tagged/r) regularly feature R
- [Blogs](http://www.r-bloggers.com/)
- Local user groups
  + e.g. [Sheffield-R](https://www.meetup.com/SheffieldR-Sheffield-R-Users-Group/)
- Documentation via `?` or `help.start()`
- Documentation for packages is found via the Packages tab in the bottom-right of RStudio.


![](images/logo_bioconductor.png)

-  Packages analyse all kinds of Genomic data (>800)
- Compulsory documentation (*vignettes*) for each package
- 6-month release cycle
- [Course Materials](http://bioconductor.org/help/course-materials/)
- [Example data](http://bioconductor.org/packages/release/BiocViews.html#___ExperimentData) and [workflows](http://bioconductor.org/help/workflows/)
- Common, re-usable framework and functionality
- [Available Support](https://support.bioconductor.org/)
    + Often you will be able to interact with the package maintainers / developers and other power-users of the project software

  
## RStudio



- Rstudio is a free environment for R
- Convenient menus to access scripts, display plots
- Still need to use *command-line* to get things done
- Developed by some of the leading R programmers
- Used by beginners, and experienced users alike

To get started, you will need to install the [latest version of R](https://cran.r-project.org/) and [RStudio Desktop](https://www.rstudio.com/products/rstudio/download3/); both of which are ***free***. 

Once installed, you should be able to launch RStudio by clicking on its icon:-


![](http://www.rstudio.com/wp-content/uploads/2014/03/blue-125.png)

## Entering commands in R

- The traditional way to enter R commands is via the Terminal, or using the console in RStudio (bottom-left panel when RStudio opens for first time).
  + this doesn't automatically keep track of the steps you did
- Alternative, an *R script* can be used to keep a record of the commands you used.
- For this course we will use a relatively new feature called *R markdown*.
- An R markdown mixes plain text with R code
- The R code can be run from inside the document and the results are displayed directly underneath
- Each chunk of R code looks something like this.


```{r}

```


- Each line of R can be executed by clicking on the line and pressing CTRL and ENTER
- Or you can press the green triangle on the right-hand side to run everything in the chunk
- Try this now!

```{r}
print("Hello World")
```

- You can add R chunks by pressing CRTL + ALT + I
    + or using the Insert menu option
    + (can also include code from other languages such as Python or bash)
    + **try and avoid adding code chunks manually**


# Getting started

![](images/128px-SHARP_ELSIMATE_EL-W221.jpg)

At a basic level, we can use R as a calculator to compute simple sums with the `+`, `-`, `*` (for multiplication) and `/` (for division) symbols. 

```{r}
2 + 2
2 - 2
4 * 3
10 / 2
```

The answer is displayed at the console with a `[1]` in front of it. The `1` inside the square brackets is a place-holder to signify how many values were in the answer (in this case only one). We will talk about dealing with lists of numbers shortly...

In the case of expressions involving multiple operations, R respects the [BODMAS](https://en.wikipedia.org/wiki/Order_of_operations#Mnemonics) system to decide the order in which operations should be performed.

```{r}
2 + 2 *3
2 + (2 * 3)
(2 + 2) * 3
```

R is capable of more complicated arithmetic such as trigonometry and logarithms; like you would find on a fancy scientific calculator. Of course, R also has a plethora of statistical operations as we will see.

![](images/128px-Casio-fx115ES-5564.jpg)

```{r}
pi
sin (pi/2)
cos(pi)
tan(2)
log(1)
```

We can only go so far with performing simple calculations like this. Eventually we will need to store our results for later use. For this, we need to make use of *variables*.

## Variables

A variable is a letter or word which takes (or contains) a value. We
use the assignment 'operator', `<-` to create a variable and store some value in it. 

```{r}
x <- 10
x
myNumber <- 25
myNumber
```
We also can perform arithmetic on variables using functions:

```{r}
sqrt(myNumber)
```

We can add variables together:
```{r}
x + myNumber
```


We can change the value of an existing variable:

```{r}
x <- 21
x
```

- We can set one variable to equal the value of another variable:

```{r}
x <- myNumber
x
```

- We can modify the contents of a variable:

```{r}
myNumber <- myNumber + sqrt(16)
myNumber
```

When we are feeling lazy we might give our variables short names (`x`, `y`, `i`...etc), but a better practice would be to give them meaningful names. There are some restrictions on creating variable names. They cannot start with a number or contain characters such as `.`, `_`, '-'. Naming variables the same as in-built functions in R, such as `c`, `T`, `mean` should also be avoided.

Naming variables is a matter of taste. Some [conventions](http://adv-r.had.co.nz/Style.html) exist such as a separating words with `-` or using *c*amel*C*aps. Whatever convention you decided, stick with it!

## Functions

**Functions** in R perform operations on **arguments** (the inputs(s) to the function). We have already used:

```{r}
sin(x)
```

this returns the sine of x. In this case the function has one argument: **x**. Arguments are always contained in parentheses -- curved brackets, **()** -- separated by commas.


Arguments can be named or unnamed, but if they are unnamed they must be ordered (we will see later how to find the right order). The names of the arguments are determined by the author of the function and can be found in the help page for the function. When testing code, it is easier and safer to name the arguments. `seq` is a function for generating a numeric sequence *from* and *to* particular numbers. Type `?seq` to get the help page for this function.

```{r}
seq(from = 3, to = 20, by = 4)
seq(3, 20, 4)
```

Arguments can have *default* values, meaning we do not need to specify values for these in order to run the function.

`rnorm` is a function that will generate a series of values from a *normal distribution*. In order to use the function, we need to tell R how many values we want

```{r}
## this will produce a random set of numbers, so everyone will get a different set of numbers
rnorm(n=10)
```

The normal distribution is defined by a *mean* (average) and *standard deviation* (spread). However, in the above example we didn't tell R what mean and standard deviation we wanted. So how does R know what to do? All arguments to a function and their default values are listed in the help page

(*N.B sometimes help pages can describe more than one function*)

```{r}
?rnorm
```

In this case, we see that the defaults for mean and standard deviation are 0 and 1. We can change the function to generate values from a distribution with a different mean and standard deviation using the `mean` and `sd` *arguments*. It is important that we get the spelling of these arguments exactly right, otherwise R will an error message, or (worse?) do something unexpected.

```{r}
rnorm(n=10, mean=2,sd=3)
rnorm(10, 2, 3)
```

In the examples above, `seq` and `rnorm` were both outputting a series of numbers, which is called a *vector* in R and is the most-fundamental data-type.


******
******
******



### Exercise


  - What is the value of `pi` to 3 decimal places?
    + see the help for `round` `?round`
  - How can we a create a sequence from 2 to 20 comprised of 5 equally-spaced numbers?
    + check the help page for seq `?seq`
  - Create a *variable* containing 1000 random numbers with a *mean* of 2 and a *standard deviation* of 3
    + what is the maximum and minimum of these numbers?
    + what is the average?
    + HINT: see the help pages for functions `min`, `max` and `mean
    
```{r}


```
    
    
******
******
******

## Vectors

- The basic data structure in R is a **vector** -- an ordered collection of values. 
- R treats even single values as 1-element vectors. 
- The function `c` *combines* its arguments into a vector:
- As `c` is a function, we specify it's arguments in curved brackets`(...)`

```{r}
x <- c(3,4,5,6)
x
```


The `seq` function we saw before was another example of how to create a sequence of values. A useful shortcut is to use the `:` symbol. 

```{r}
x <- 3:6
x
```


The square brackets `[]` indicate the position within the vector (the ***index***). We can extract individual elements by using the `[]` notation:

```{r}
x[1]
x[4]
```

We can even put a vector inside the square brackets: (*vector indexing*)


******
******
******

### Exercise
**Without using R!**

  - If `y <- 2:4`, what would `x[y]` give?
    + `[1]  3  5 `
    + `[1]  2  4`
    + `[1]  4  5  6`

******
******
******


When applying all standard arithmetic operations to vectors, application is element-wise. Thus, we say that R supports *vectorised* operations.

```{r}
x <- 1:10
y <- x*2
y
z <- x^2

x + y
```

Vectorised operations are extremely powerful. Operations that would require a *for* loop (or similar) in other languages such as **C**, **Python**, can be performed in a single line of R code.

A vector can also contain text; called a character vector. Such a vector can also be constructed using the `c` function. 

```{r}
x <- c("A","B","C","D")
```

The quote marks are crucial. Why?

```{r eval=FALSE}
##try and ignore the eval=FALSE on the line above...we'll explain what this means later
x <- c(A, B, C, D)
```

```{r echo=FALSE}
### you don't need to run this code
tt <- try(x <- c(A, B, C, D),silent = TRUE)
cat(tt)
```


Another useful type of data that we will see is the *logical* or *boolean* which can take either the values of `TRUE` or `FALSE`

```{r}
x <- c(TRUE,TRUE,FALSE)
```

Logical values are useful when we want to create subsets of our data. We can use *comparison* operators; `==`, `>`, `<`, `!=` to check if values are equal, greater than, less than, or not equal.

```{r}
x <- c("A","A", "B","B","C")
x == "A"
x != "A"
x <- rnorm(10)
x > 0
```

However, all items in the vector **must** be the same type. If you attempt anything else, R will convert all values to the same (most basic) type.

```{r}
x <- c(1, 2, "three")
x
```





## Packages in R

So far we have used functions that are available with the *base* distribution of R; the functions you get with a clean install of R. The open-source nature of R encourages others to write their own functions for their particular data-type or analyses.

Packages are distributed through *repositories*. The most-common ones are CRAN and Bioconductor. CRAN alone has many thousands of packages.

The **Packages** tab in the bottom-right panel of RStudio lists all packages that you currently have installed. Clicking on a package name will show a list of functions that available once that package has been loaded. The `library` function is used to load a package and make it's functions / data available in your current R session. *You need to do this every time you load a new RStudio session*. 


```{r eval=FALSE}
## RColorBrewer is a package for making nice colours
library(RColorBrewer)
```

There are functions for installing packages within R. If your package is part of the main **CRAN** repository, you can use `install.packages`

We will be using the `gapminder` R package in this practical. To install it, we would do.

```{r eval=FALSE}
install.packages("RColorBrewer")
```


A package may have several *dependancies*; other R packages from which it uses functions or data types (re-using code from other packages is strongly-encouraged). If this is the case, the other R packages will be located and installed too.

**So long as you stick with the same version of R, you won't need to repeat this install process.**

# Dealing with data

We are going to explore some of the basic features of R using data from the [gapminder](https://www.gapminder.org/data/) project, which have been bundled into an [R package](https://github.com/jennybc/gapminder). These data give various indicator variables for different countries around the world (life expectancy, population and Gross Domestic Product). We have saved these data as a `.csv` file to demonstrate how to import data into R.

You can download these data [here](https://raw.githubusercontent.com/bioinformatics-core-shared-training/r-intermediate/master/gapminder.csv). Right-click the link and save to somewhere on your computer that you wish to work from.

## The working directory

Like other software (Word, Excel, Photoshop....), R has a default location where it will save files to and import data from. This is known as the *working directory* in R. You can query what R currently considers its working directory by doing:-

```{r eval=FALSE}
getwd()
```

*N.B. Here, a set of open and closed brackets `()` is used to call the `getwd` function with no arguments.*

We can also list the files in this directory with:-

```{}
list.files()
```

Any `.csv` file in the working directory can be imported into R by supplying the name of the file to the `read.csv` function and creating a new variable to store the result. A useful sanity check is the `file.exists` function which will print `TRUE` is the file can be found in the working directory.

```{r}
file.exists("gapminder.csv")
```

If the file we want to read is not in the current working directory, we will have to write the path to the file; either *relevant* to the current working directory (e.g. the directory "up" from the current working directory, or in a sub-folder), or the full path. In an interactive session, you can do use `file.choose` to open a dialogue box. The path to the the file will then be displayed in R.

```{r eval=FALSE}
file.choose()
```

Assuming the file can be found, we can use `read.csv` to import. Other functions can be used to read tab-delimited files (`read.delim`) or a generic `read.table` function. A data frame object is created.

```{r}
gapminder <- read.csv("gapminder.csv")
```

The data frame object in R allows us to work with "tabular" data, like we might be used to dealing with in Excel, where our data can be thought of having rows and columns. The values in each column have to all be of the same type (i.e. all numbers or all text).


******
******
******



### Exercise

- What are the dimensions of the data frame?
- What columns are available?
  + HINT: see the `dim`, `ncol`, `nrow` and `colnames` functions
  
```{r}



```
  
 
******
******
******



In Rstudio , you can view the contents of the data frame we have just created. This is useful for interactive exploration of the data, but not so useful for automation and scripting and analyses.

```{r eval=FALSE}
View(gapminder)
```

We should always check the data frame that we have created. Sometimes R will happily read data using an inappropriate function and create an object without raising an error. However, the data might be unsuable. Consider:-

```{r}
test <- read.delim("gapminder.csv")
head(test)
dim(test)
```


We can access the columns of a data frame by knowing the column name. 
***TIP*** Use auto-complete with the ***TAB*** key to get the name of the column correct

```{r eval=FALSE}
gapminder$country


```


A vector (1-dimensional) is returned, the length of which is the same as the number of rows in the data frame. The vector could be stored as a variable and itself be subset or used in further calculations

```{r, echo=FALSE}
countries <- gapminder$country
```


The `summary` function is a useful way of summarising the data containing in each column. It will give information about the *type* of data (remember, data frames can have a mixture of numeric and character columns) and also an appropriate summary. For numeric columns, it will report some stats about the distribution of the data. For categorical data, it will report the different *levels*.

```{r}
summary(gapminder)
```


******
******
******

### Exercise

- Save the life expectancy and population as variables
    + what is the maximum life expectancy?
    + what is the smallest population?
    + create a new vector containing the life expectancy to the nearest whole number
    + create a new vector containing the populations expressed as millions of people
    + HINT:- `min`, `max`, `round`.....
    
```{r}

```
    
    
******
******
******



## Subsetting

A data frame can be subset using square brackes`[]` placed after the name of the data frame. As a data frame is a two-dimensional object, you need a *row* and *column* index, or vector indices. 




```{r eval=FALSE}
gapminder[1,2]
gapminder[2,1]
gapminder[c(1,2,3),1]
gapminder[c(1,2,3),c(1,2,3)]
```

***Note that the data frame is not altered*** we are just seeing what a subset of the data looks like and not changing the underlying data. If we wanted to do this, we would need to create a new variale.

```{r eval=FALSE}
gapminder
```



Should we wish to see all rows, or all columns, we can neglect either the row or column index

```{r eval=FALSE}
gapminder[1,]
gapminder[,1]
```

Just like subsetting a vector, the indices can be vectors containing multiple values

```{r eval=FALSE}
gapminder[1:3,1:2]
gapminder[seq(1,1704,length.out = 10),1:4]
```


A common shortcut is `head` which prints the first six rows of a data frame.
   
```{r}
head(gapminder)
```

When subsetting entire rows ***you need to remember the , after the row indices***. If you fail to do so, R may still return a result. However, it probably won't be what you expected. Look what happens if you wanted to the first three rows but typed the following command

```{r eval=FALSE}
gapminder[1:3]
```


Rather than selecting rows based on their *numeric* index (as in the previous example) we can use what we call a *logical test*. This is a test that gives either a `TRUE` or `FALSE` result. When applied to subsetting, only rows with a `TRUE` result get returned.

For example we could compare the `lifeExp` variable to 40. The result is a *vector* of `TRUE` or `FALSE`; one for each row in the data frame

```{r eval=FALSE}
gapminder$lifeExp < 40
```

This R code can be put inside the square brackets to select rows of interest (those observations where the life expectancy variable is less than 40). 

```{r eval=FALSE}
gapminder[gapminder$lifeExp < 40, ]
```


The `,` is important as this tells R to display all columns. If we wanted a subset of the columns we would put their indices after the `,`

```{r eval=FALSE}
gapminder[gapminder$lifeExp < 40, 1:4]
```


Testing for equality can be done using `==`. This will only give `TRUE` for entries that are *exactly* the same as the test string. 

```{r}
gapminder[gapminder$country == "Zambia",]

```

N.B. For partial matches, the `grep` function and / or *regular expressions* (if you know them) can be used.

```{r eval=FALSE}
gapminder[grep("land", gapminder$country),]
```


There are a couple of ways of testing for more than one text value. The first uses an *or* `|` statement. i.e. testing if the value of `country` is `Zambia` *or* the value is `Zimbabwe`.

The `%in%` function is a convenient function for testing which items in a vector correspond to a defined set of values.

```{r eval=FALSE}
gapminder[gapminder$country == "Zambia" | gapminder$country == "Zimbabwe",]
gapminder[gapminder$country %in% c("Zambia","Zimbabwe"),]
```

```{r echo=FALSE}
head(gapminder[gapminder$country == "Zambia" | gapminder$country == "Zimbabwe",])
```


Similar to *or*, we can require that both tests are `TRUE` by using an *and* `&` operation. e.g. which years in Zambia had a life expectancy less than 40

```{r}
gapminder[gapminder$country == "Zambia" & gapminder$lifeExp < 40,]
```

Finally, we have `!=` for testing if something is *not* equal

```{r}
gapminder[gapminder$continent != "Europe",]
```

You will sometimes see the `which` function used during subsetting. This is used to convert the logical (`TRUE/FALSE`) vector into a numeric vector that gives the positon of the `TRUE` values only

- this is useful if you think your vector might have `NA` (*missing*) values 
- there are no `NA`values in `gapminder`, so we'll fake an example

```{r}
tmp <- gapminder
tmp$pop[1:10] <- NA
tmp[tmp$pop < 1e6,]
tmp[which(tmp$pop < 1e6),]
```


******
******
******

### Exercise

- Create a subset of the data where the population less than a million in the year 2002
- Create a subset of the data where the life expectancy is less than 45 in the year 2007
    + which continents are these countries located in?
- Which rows in the data frame correspond to the *maximum life expectancy* and *smallest population* that you found in the previous exercise?

```{r}


```



******
******
******


Columns in the data frame can be modified, or new columns created

- can use the `$` operator to assign a value to a column that doesn't currently exist
    + the vector on the right of the assignment `<-` must that a length equal to the number of rows in the data frame
- can also overwrite existing columns


```{r}
gapminder_extra <- gapminder
gapminder_extra$PopInMillions <- gapminder_extra$pop / 1000000
gapminder_extra$lifeExp <- round(gapminder_extra$lifeExp)
gapminder_extra
```


## Ordering and sorting

A vector can be returned in sorted form using the `sort` function.

```{r eval=FALSE}
sort(countries)
sort(countries,decreasing = TRUE)
```

However, if we want to sort an entire data frame a different approach is needed. The trick is to use `order`. Rather than giving a sorted set of *values*, it will give sorted *indices*. These indices can then be used for a subset operation.

```{r}
leastPop <- gapminder[order(gapminder$pop),]
head(leastPop)
```


We can even order by more than one condition

```{r eval=FALSE}
gapminder[order(gapminder$year, gapminder$country),]
```

```{r echo=FALSE}
head(gapminder[order(gapminder$year, gapminder$lifeExp),])
```

A final point on data frames is that we can export them out of R once we have done our data processing. 

```{r}
byWealth <- gapminder[order(gapminder$gdpPercap,decreasing = TRUE),]
head(byWealth)
write.csv(byWealth, file="dataOrderedByWealth.csv")
```

# Plotting and stats 

All your favourite types of plot can be created in R

```{r echo=FALSE}

par(mfrow=c(2,2))
barplot(VADeaths, beside = TRUE,
        col = c("lightblue", "mistyrose", "lightcyan",
                "lavender", "cornsilk"), ylim = c(0, 100))
boxplot(len ~ dose, data = ToothGrowth,
        boxwex = 0.25, at = 1:3 - 0.2,
        subset = supp == "VC", col = "yellow",
        main = "Guinea Pigs' Tooth Growth",
        xlab = "Vitamin C dose mg",
        ylab = "tooth length",
        xlim = c(0.5, 3.5), ylim = c(0, 35), yaxs = "i")
boxplot(len ~ dose, data = ToothGrowth, add = TRUE,boxwex = 0.25, at = 1:3 + 0.2,subset = supp == "OJ", col = "orange")
legend(2, 9, c("Ascorbic acid", "Orange juice"),fill = c("yellow", "orange"))
set.seed(14)
x <- rchisq(100, df = 4)
## if you really insist on using hist() ... :
hist(x, freq = FALSE, ylim = c(0, 0.2))
curve(dchisq(x, df = 4), col = 2, lty = 2, lwd = 2, add = TRUE)
pie(c(Sky = 78, "Sunny side of pyramid" = 17, "Shady side of pyramid" = 5),
    init.angle = 315, col = c("deepskyblue", "yellow", "yellow3"), border = FALSE)
```


- Simple plots are supported in the *base* distribution of R (what you get automatically when you download R). 
    + `boxplot`, `hist`, `barplot`,... all of which are extensions of the basic `plot` function
- Many different customisations are possible
    + colour, overlay points / text, legends, multi-panel figures
- ***You need to think about how best to visualise your data*** 
    + http://www.bioinformatics.babraham.ac.uk/training.html#figuredesign
- References..
    + [Quick-R](http://www.statmethods.net/graphs/index.html)
    + [R graph gallery](https://www.r-graph-gallery.com/)
    
Plots can be constructed from vectors of numeric data, such as the data we get from a particular column in a data frame

```{r}
hist(gapminder$lifeExp)
```

Scatter plots of two variables require two arguments; one for the `x` and one for the `y` axis.

```{r}
plot(gapminder$pop,gapminder$lifeExp)
```


Barplots are commonly-used for counts of categorical data

```{r}
barplot(table(gapminder$continent))
```

Boxplots are good for visualising and comparing distributions. Here the `~` symbol sets up a formula, the effect of which is to put the categorical variable on the `x` axis and continuous variable on the `y` axis.

```{r}
boxplot(gapminder$gdpPercap ~ gapminder$continent)
``` 

*Lots* of customisations are possible to enhance the appaerance of our plots. Not for the faint-hearted, the help pages `?plot` and `?par` give the full details. In short,

- Axis labels, and titles can be specified as character strings. 

- R recognises many preset names as colours. To get a full list use `colours()`
    + check this [online reference](http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf) to see what they correspond to
- Plotting characters can be specified using a pre-defined number:-

```{r echo=FALSE}
par(mar=c(0.1,0.1,0.1,0.1))
i <- 0:24

x <- floor(i /5) + 1
y <- i %%5

plot(1:10, type="n", xlim = c(1,5), ylim=c(-1,5),axes=F,xlab="",ylab="")
points(x,y,pch=i+1, cex=2)
text(x,y-0.3,i+1)
```

Putting it all together.

```{r}
plot(gapminder$pop,gapminder$lifeExp,pch=16,
     col="red",ylab="Life Expectancy",
     xlab="Population",main="Life Expectancy trend with population")
```


The same customisations can be used for various plots:-

```{r}
boxplot(gapminder$gdpPercap ~ gapminder$continent,
        col=c("red","orange","green","blue","purple"),
        main="GDP per-continent",
        xlab="Continent",
        ylab="GDP")
```

an also save plots to a file calling the `pdf` or `png` functions before executing the code to create the plot. 
 
 
```{r}
pdf("myLittlePlot.pdf")
barplot(table(gapminder$continent))
dev.off()
```

Any plots created in-between the `pdf(..)` and `dev.off()` lines will get saved to the named file. The `dev.off()` line is very important; without it you will not be able to view the plot you have created. `pdf` files are useful because you can create documents with multiple pages. Moreover, they can be imported into tools such as Adobe Illustrator to be incorporated with other graphics. 

## The canvas model

It is important to realise that base graphics in R uses a *"canvas model"* to create graphics. We can only overlay extra information on-top of an exising plot and cannot "undo" what is already drawn.

Let's suppose we want to visualise and life expectancy and population of countries in Europe and Africa. First, create two datasets to represent European and African countries in the year 2002

```{r}
euroData <- gapminder[gapminder$continent == "Europe" & gapminder$year == 2002,]
dim(euroData)
```

```{r}
afrData <- gapminder[gapminder$continent == "Africa" & gapminder$year == 2002,]
dim(afrData)
```

We can start by plotting the life expectancy of the European countries as red dots.

```{r}
plot(euroData$pop, euroData$lifeExp,col="red",
     pch=16,
     xlab="Population",
     ylab="Life Expectancy")
```

The `points` function can be used put extra points corresponding to African countries on the existing plot. 

```{r eval=FALSE}
points(afrData$pop, afrData$lifeExp,col="blue",pch=16)
```

```{r echo=FALSE}
plot(euroData$pop, euroData$lifeExp,col="red",
     pch=16,
     xlab="Population",
     ylab="Life Expectancy")
points(afrData$pop, afrData$lifeExp,col="blue",pch=16)
```

Wait, how many African countries did we have?

```{r}
nrow(afrData)
```

The problem here is that the initial limits of the y axis were defined using the life expectancy range of the European data. We can only add points to the existing plotting window, so any African countries with life expectancy outside this range will not get displayed.

We can define the axes when we create the plot using `xlim` and `ylim`.

```{r}
plot(euroData$pop, euroData$lifeExp,col="red",
     pch=16,
     xlab="Population",
     ylab="Life Expectancy",
     xlim=c(0,8e7),ylim=c(30,90))
points(afrData$pop, afrData$lifeExp,col="blue",pch=16)
```




Other useful functions for adding features to an existing plot include `text`, `abline`, `grid`, `legend` among others

Another useful trick for plotting is to take advantage of pre-existing colour palettes in R. The `RColorBrewer` package is a useful package for such palettes; many of which are friendly to those with visual impairments.

```{r}
library(RColorBrewer)
display.brewer.all(colorblindFriendly = TRUE)

```

The `brewer.pal` function can return the names of `n` colours from one of the pre-defined palettes to be used as a `col` argument to a plotting function.

```{r}
boxplot(gapminder$gdpPercap ~ gapminder$continent,col=brewer.pal(5,"Set2"),
        main="GDP per-continent",
        xlab="Continent",
        ylab="GDP")
```

******
******
******

### Exercise

- The plot of population versus GDP shows and obvious set of outliers
  + which countries do the points represent?
  + can you plot these points in a different colour?
  + use the `abline` function to draw a vertical line to separate the outliers from the rest of the plot
    + see below for what the graph is intended to look like
![](images/plotting_exercise_target.png)
  
```{r}


```
  

******
******
******

The `col` argument to `plot` doesn't have to contain only one colour. We can use a vector that assigns a colour to each point to be plotted. In the following code we:-

- create a subset of the data describing European and African countries in 2002
- create a *dummy* vector that is just `red` repeated as many times as rows in the data frame

```{r}
subset_afr_eur <- gapminder[gapminder$continent %in% c("Europe", "Africa") & gapminder$year == 2002,]
mycol <- rep("red",nrow(subset_afr_eur))

plot(subset_afr_eur$pop, subset_afr_eur$lifeExp,
     col=mycol,
     pch=16)

```
 
We can replace the items in the colour vector that correspond to an African country with a different value

- we can now create the same plot as before, but only using the `plot` function once
 
```{r}
mycol[which(subset_afr_eur$continent == "Africa")] <- "blue"
plot(subset_afr_eur$pop, subset_afr_eur$lifeExp,
     col=mycol,
     pch=16,
     xlab="Population",
     ylab="Life Expenctancy")
```


If we preferred, we could plot the data for European and African countries in separate plots side-by-side.

- here we specify a *layout* of the plot using the `mfrow` plotting parameter
- `c(1,2)` corresponds to 1 *row* and 2 *columns*
- **make sure you run the entire code chunk at once by pressing the green run button, or highlighting all the lines**
  + running the code line-by-line won't have the correct effect

```{r}
par(mfrow=c(1,2))

plot(euroData$pop, euroData$lifeExp,col="red",
     pch=16,
     xlab="Population",
     ylab="Life Expectancy",
     main="European Countries",
    xlim=c(0,8e7),ylim=c(30,90))

plot(afrData$pop, afrData$lifeExp,col="blue",
     pch=16,
     xlab="Population",
     ylab="Life Expectancy",
     main="African Countries",
    xlim=c(0,8e7),ylim=c(30,90))


```



We can't really have a run-through of the R language without at least *mentioning* statistics! However, like plotting it is a vast field. The main challenges are putting your data in the correct format (which we have covered here), and deciding which test to use (**which R will not advise you on!**) 

- If you have some background in statistics you can see this course from the [Babraham Institute Bioinformatics Core](http://www.bioinformatics.babraham.ac.uk/training/R_Statistics/Introduction%20to%20Statistics%20with%20R.pdf) about how to perform statistical testing in R.
- If you need a more basic grounding in which statistical test to use, you can see this course from [CRUK Cambridge Institute](http://bioinformatics-core-shared-training.github.io/IntroductionToStats/)
  
The `t.test` function is probably the most fundamental statistical testing function in R, and can be adapted to many different situations. Full details are given in the help page `?t.test`.


The life expectancy looks to be different between Europe and Africa

```{r}
boxplot(gapminder$lifeExp ~ gapminder$continent)
```

The output from `t.test` can be used to judge if there is a statistically-significant difference in means

- need to specify two *vectors*, which are the values we want to test for difference in means
    + they don't need to contain the same number of values

```{r}
t.test(euroData$lifeExp,afrData$lifeExp)
```



Were our data not normally-distributed we could use `wilcox.test`, for example.  Fortunately, most statistical tests can be accessed in a similar manner, so it is easy to switch between using different tests provided your data are in the correct format. To re-iterate, the skill is in choosing which test is appropriate.

R also has functionality for assessing correlation between sets of values. For example, there seems to be a relationship between a country's wealth and life expenctancy:-

```{r}
plot(euroData$gdpPercap,euroData$lifeExp)
```

This can be quantified by calculating the *correlation*. Commonly the $R^2$ value is reported

```{r}
cor(euroData$gdpPercap,euroData$lifeExp)
cor(euroData$gdpPercap,euroData$lifeExp)^2
```

A *linear model* can be used to assess the relationship, which we can overlay on the plot using the function `abline`. The *coefficients* of the model give the *slope* and *intercept* of the straight line

```{r}
model <- lm(euroData$lifeExp~euroData$gdpPercap)
plot(euroData$gdpPercap,euroData$lifeExp)
abline(model)
model$coefficients

```


