In this session, we review some of the fundamentals of the R language. It should be a useful refresher prior to the intermediate-level Data Analysis and Visualisation using R course.
Topics covered include:-
For a more detailed introduction, we suggest the following free resources
The R programming language is now recognised beyond the academic community as an effect solution for data analysis and visualisation. Notable users of R include:-
Two Biostatiscians (later termed ‘Forensic Bioinformaticians’) from M.D. Anderson used R extensively during their re-analysis and investigation of a Clinical Prognostication paper from Duke. The subsequent scandal put Reproducible Research at the forefront of everyone’s mind.
Keith Baggerly’s talk on the subject is highy-recommended.
?
or help.start()
To get started, you will need to install the latest version of R and RStudio Desktop; both of which are free.
Once installed, you should be able to launch RStudio by clicking on its icon:-
print("Hello World")
[1] "Hello World"
At a basic level, we can use R as a calculator to compute simple sums with the +
, -
, *
(for multiplication) and /
(for division) symbols.
2 + 2
[1] 4
2 - 2
[1] 0
4 * 3
[1] 12
10 / 2
[1] 5
The answer is displayed at the console with a [1]
in front of it. The 1
inside the square brackets is a place-holder to signify how many values were in the answer (in this case only one). We will talk about dealing with lists of numbers shortly…
In the case of expressions involving multiple operations, R respects the BODMAS system to decide the order in which operations should be performed.
2 + 2 *3
[1] 8
2 + (2 * 3)
[1] 8
(2 + 2) * 3
[1] 12
R is capable of more complicated arithmetic such as trigonometry and logarithms; like you would find on a fancy scientific calculator. Of course, R also has a plethora of statistical operations as we will see.
pi
[1] 3.141593
sin (pi/2)
[1] 1
cos(pi)
[1] -1
tan(2)
[1] -2.18504
log(1)
[1] 0
We can only go so far with performing simple calculations like this. Eventually we will need to store our results for later use. For this, we need to make use of variables.
A variable is a letter or word which takes (or contains) a value. We use the assignment ‘operator’, <-
to create a variable and store some value in it.
x <- 10
x
[1] 10
myNumber <- 25
myNumber
[1] 25
We also can perform arithmetic on variables using functions:
sqrt(myNumber)
[1] 5
We can add variables together:
x + myNumber
[1] 35
We can change the value of an existing variable:
x <- 21
x
[1] 21
x <- myNumber
x
[1] 25
myNumber <- myNumber + sqrt(16)
myNumber
[1] 29
When we are feeling lazy we might give our variables short names (x
, y
, i
…etc), but a better practice would be to give them meaningful names. There are some restrictions on creating variable names. They cannot start with a number or contain characters such as .
, _
, ‘-’. Naming variables the same as in-built functions in R, such as c
, T
, mean
should also be avoided.
Naming variables is a matter of taste. Some conventions exist such as a separating words with -
or using camelCaps. Whatever convention you decided, stick with it!
Functions in R perform operations on arguments (the inputs(s) to the function). We have already used:
sin(x)
[1] -0.1323518
this returns the sine of x. In this case the function has one argument: x. Arguments are always contained in parentheses – curved brackets, () – separated by commas.
Arguments can be named or unnamed, but if they are unnamed they must be ordered (we will see later how to find the right order). The names of the arguments are determined by the author of the function and can be found in the help page for the function. When testing code, it is easier and safer to name the arguments. seq
is a function for generating a numeric sequence from and to particular numbers. Type ?seq
to get the help page for this function.
seq(from = 3, to = 20, by = 4)
[1] 3 7 11 15 19
seq(3, 20, 4)
[1] 3 7 11 15 19
Arguments can have default values, meaning we do not need to specify values for these in order to run the function.
rnorm
is a function that will generate a series of values from a normal distribution. In order to use the function, we need to tell R how many values we want
## this will produce a random set of numbers, so everyone will get a different set of numbers
rnorm(n=10)
[1] 1.07428564 -1.42125894 0.07065709 1.64828501 -1.43106387 0.87397931 -1.03087092 -0.69340405 0.50653689
[10] 0.85978011
The normal distribution is defined by a mean (average) and standard deviation (spread). However, in the above example we didn’t tell R what mean and standard deviation we wanted. So how does R know what to do? All arguments to a function and their default values are listed in the help page
(N.B sometimes help pages can describe more than one function)
?rnorm
In this case, we see that the defaults for mean and standard deviation are 0 and 1. We can change the function to generate values from a distribution with a different mean and standard deviation using the mean
and sd
arguments. It is important that we get the spelling of these arguments exactly right, otherwise R will an error message, or (worse?) do something unexpected.
rnorm(n=10, mean=2,sd=3)
[1] 5.7418172 0.1905547 -3.8939599 -1.3003433 3.4895211 1.9100285 1.5689795 -0.1027642 -0.5908338 5.3734122
rnorm(10, 2, 3)
[1] -5.6024243 1.6679810 5.1216143 4.5250889 2.8038471 6.0292955 1.3821019 -0.1843767 -3.7224487 4.5168876
In the examples above, seq
and rnorm
were both outputting a series of numbers, which is called a vector in R and is the most-fundamental data-type.
pi
to 3 decimal places?
round
?round
?seq
min
, max
and `meanc
combines its arguments into a vector:c
is a function, we specify it’s arguments in curved brackets(...)
x <- c(3,4,5,6)
x
[1] 3 4 5 6
The seq
function we saw before was another example of how to create a sequence of values. A useful shortcut is to use the :
symbol.
x <- 3:6
x
[1] 3 4 5 6
The square brackets []
indicate the position within the vector (the index). We can extract individual elements by using the []
notation:
x[1]
[1] 3
x[4]
[1] 6
We can even put a vector inside the square brackets: (vector indexing)
Without using R!
y <- 2:4
, what would x[y]
give?
[1] 3 5
[1] 2 4
[1] 4 5 6
When applying all standard arithmetic operations to vectors, application is element-wise. Thus, we say that R supports vectorised operations.
x <- 1:10
y <- x*2
y
[1] 2 4 6 8 10 12 14 16 18 20
z <- x^2
x + y
[1] 3 6 9 12 15 18 21 24 27 30
Vectorised operations are extremely powerful. Operations that would require a for loop (or similar) in other languages such as C, Python, can be performed in a single line of R code.
A vector can also contain text; called a character vector. Such a vector can also be constructed using the c
function.
x <- c("A","B","C","D")
The quote marks are crucial. Why?
##try and ignore the eval=FALSE on the line above...we'll explain what this means later
x <- c(A, B, C, D)
Error in try(x <- c(A, B, C, D), silent = TRUE) : object 'A' not found
Another useful type of data that we will see is the logical or boolean which can take either the values of TRUE
or FALSE
x <- c(TRUE,TRUE,FALSE)
Logical values are useful when we want to create subsets of our data. We can use comparison operators; ==
, >
, <
, !=
to check if values are equal, greater than, less than, or not equal.
x <- c("A","A", "B","B","C")
x == "A"
[1] TRUE TRUE FALSE FALSE FALSE
x != "A"
[1] FALSE FALSE TRUE TRUE TRUE
x <- rnorm(10)
x > 0
[1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE FALSE
However, all items in the vector must be the same type. If you attempt anything else, R will convert all values to the same (most basic) type.
x <- c(1, 2, "three")
x
[1] "1" "2" "three"
So far we have used functions that are available with the base distribution of R; the functions you get with a clean install of R. The open-source nature of R encourages others to write their own functions for their particular data-type or analyses.
Packages are distributed through repositories. The most-common ones are CRAN and Bioconductor. CRAN alone has many thousands of packages.
The Packages tab in the bottom-right panel of RStudio lists all packages that you currently have installed. Clicking on a package name will show a list of functions that available once that package has been loaded. The library
function is used to load a package and make it’s functions / data available in your current R session. You need to do this every time you load a new RStudio session.
## RColorBrewer is a package for making nice colours
library(RColorBrewer)
There are functions for installing packages within R. If your package is part of the main CRAN repository, you can use install.packages
We will be using the gapminder
R package in this practical. To install it, we would do.
install.packages("RColorBrewer")
A package may have several dependancies; other R packages from which it uses functions or data types (re-using code from other packages is strongly-encouraged). If this is the case, the other R packages will be located and installed too.
So long as you stick with the same version of R, you won’t need to repeat this install process.
We are going to explore some of the basic features of R using data from the gapminder project, which have been bundled into an R package. These data give various indicator variables for different countries around the world (life expectancy, population and Gross Domestic Product). We have saved these data as a .csv
file to demonstrate how to import data into R.
You can download these data here. Right-click the link and save to somewhere on your computer that you wish to work from.
Like other software (Word, Excel, Photoshop….), R has a default location where it will save files to and import data from. This is known as the working directory in R. You can query what R currently considers its working directory by doing:-
getwd()
N.B. Here, a set of open and closed brackets ()
is used to call the getwd
function with no arguments.
We can also list the files in this directory with:-
list.files()
Any .csv
file in the working directory can be imported into R by supplying the name of the file to the read.csv
function and creating a new variable to store the result. A useful sanity check is the file.exists
function which will print TRUE
is the file can be found in the working directory.
file.exists("gapminder.csv")
[1] TRUE
If the file we want to read is not in the current working directory, we will have to write the path to the file; either relevant to the current working directory (e.g. the directory “up” from the current working directory, or in a sub-folder), or the full path. In an interactive session, you can do use file.choose
to open a dialogue box. The path to the the file will then be displayed in R.
file.choose()
Assuming the file can be found, we can use read.csv
to import. Other functions can be used to read tab-delimited files (read.delim
) or a generic read.table
function. A data frame object is created.
gapminder <- read.csv("gapminder.csv")
The data frame object in R allows us to work with “tabular” data, like we might be used to dealing with in Excel, where our data can be thought of having rows and columns. The values in each column have to all be of the same type (i.e. all numbers or all text).
dim
, ncol
, nrow
and colnames
functionsIn Rstudio , you can view the contents of the data frame we have just created. This is useful for interactive exploration of the data, but not so useful for automation and scripting and analyses.
View(gapminder)
We should always check the data frame that we have created. Sometimes R will happily read data using an inappropriate function and create an object without raising an error. However, the data might be unsuable. Consider:-
test <- read.delim("gapminder.csv")
head(test)
dim(test)
[1] 1704 1
We can access the columns of a data frame by knowing the column name. TIP Use auto-complete with the TAB key to get the name of the column correct
gapminder$country
A vector (1-dimensional) is returned, the length of which is the same as the number of rows in the data frame. The vector could be stored as a variable and itself be subset or used in further calculations
The summary
function is a useful way of summarising the data containing in each column. It will give information about the type of data (remember, data frames can have a mixture of numeric and character columns) and also an appropriate summary. For numeric columns, it will report some stats about the distribution of the data. For categorical data, it will report the different levels.
summary(gapminder)
country continent year lifeExp pop gdpPercap
Afghanistan: 12 Africa :624 Min. :1952 Min. :23.60 Min. :6.001e+04 Min. : 241.2
Albania : 12 Americas:300 1st Qu.:1966 1st Qu.:48.20 1st Qu.:2.794e+06 1st Qu.: 1202.1
Algeria : 12 Asia :396 Median :1980 Median :60.71 Median :7.024e+06 Median : 3531.8
Angola : 12 Europe :360 Mean :1980 Mean :59.47 Mean :2.960e+07 Mean : 7215.3
Argentina : 12 Oceania : 24 3rd Qu.:1993 3rd Qu.:70.85 3rd Qu.:1.959e+07 3rd Qu.: 9325.5
Australia : 12 Max. :2007 Max. :82.60 Max. :1.319e+09 Max. :113523.1
(Other) :1632
min
, max
, round
…..A data frame can be subset using square brackes[]
placed after the name of the data frame. As a data frame is a two-dimensional object, you need a row and column index, or vector indices.
gapminder[1,2]
gapminder[2,1]
gapminder[c(1,2,3),1]
gapminder[c(1,2,3),c(1,2,3)]
Note that the data frame is not altered we are just seeing what a subset of the data looks like and not changing the underlying data. If we wanted to do this, we would need to create a new variale.
gapminder
Should we wish to see all rows, or all columns, we can neglect either the row or column index
gapminder[1,]
gapminder[,1]
Just like subsetting a vector, the indices can be vectors containing multiple values
gapminder[1:3,1:2]
gapminder[seq(1,1704,length.out = 10),1:4]
A common shortcut is head
which prints the first six rows of a data frame.
head(gapminder)
When subsetting entire rows you need to remember the , after the row indices. If you fail to do so, R may still return a result. However, it probably won’t be what you expected. Look what happens if you wanted to the first three rows but typed the following command
gapminder[1:3]
Rather than selecting rows based on their numeric index (as in the previous example) we can use what we call a logical test. This is a test that gives either a TRUE
or FALSE
result. When applied to subsetting, only rows with a TRUE
result get returned.
For example we could compare the lifeExp
variable to 40. The result is a vector of TRUE
or FALSE
; one for each row in the data frame
gapminder$lifeExp < 40
This R code can be put inside the square brackets to select rows of interest (those observations where the life expectancy variable is less than 40).
gapminder[gapminder$lifeExp < 40, ]
The ,
is important as this tells R to display all columns. If we wanted a subset of the columns we would put their indices after the ,
gapminder[gapminder$lifeExp < 40, 1:4]
Testing for equality can be done using ==
. This will only give TRUE
for entries that are exactly the same as the test string.
gapminder[gapminder$country == "Zambia",]
N.B. For partial matches, the grep
function and / or regular expressions (if you know them) can be used.
gapminder[grep("land", gapminder$country),]
There are a couple of ways of testing for more than one text value. The first uses an or |
statement. i.e. testing if the value of country
is Zambia
or the value is Zimbabwe
.
The %in%
function is a convenient function for testing which items in a vector correspond to a defined set of values.
gapminder[gapminder$country == "Zambia" | gapminder$country == "Zimbabwe",]
gapminder[gapminder$country %in% c("Zambia","Zimbabwe"),]
Similar to or, we can require that both tests are TRUE
by using an and &
operation. e.g. which years in Zambia had a life expectancy less than 40
gapminder[gapminder$country == "Zambia" & gapminder$lifeExp < 40,]
Finally, we have !=
for testing if something is not equal
gapminder[gapminder$continent != "Europe",]
You will sometimes see the which
function used during subsetting. This is used to convert the logical (TRUE/FALSE
) vector into a numeric vector that gives the positon of the TRUE
values only
NA
(missing) valuesNA
values in gapminder
, so we’ll fake an exampletmp <- gapminder
tmp$pop[1:10] <- NA
tmp[tmp$pop < 1e6,]
tmp[which(tmp$pop < 1e6),]
Columns in the data frame can be modified, or new columns created
$
operator to assign a value to a column that doesn’t currently exist
<-
must that a length equal to the number of rows in the data framegapminder_extra <- gapminder
gapminder_extra$PopInMillions <- gapminder_extra$pop / 1000000
gapminder_extra$lifeExp <- round(gapminder_extra$lifeExp)
gapminder_extra
A vector can be returned in sorted form using the sort
function.
sort(countries)
sort(countries,decreasing = TRUE)
However, if we want to sort an entire data frame a different approach is needed. The trick is to use order
. Rather than giving a sorted set of values, it will give sorted indices. These indices can then be used for a subset operation.
leastPop <- gapminder[order(gapminder$pop),]
head(leastPop)
We can even order by more than one condition
gapminder[order(gapminder$year, gapminder$country),]
A final point on data frames is that we can export them out of R once we have done our data processing.
byWealth <- gapminder[order(gapminder$gdpPercap,decreasing = TRUE),]
head(byWealth)
write.csv(byWealth, file="dataOrderedByWealth.csv")
All your favourite types of plot can be created in R
boxplot
, hist
, barplot
,… all of which are extensions of the basic plot
functionPlots can be constructed from vectors of numeric data, such as the data we get from a particular column in a data frame
hist(gapminder$lifeExp)
Scatter plots of two variables require two arguments; one for the x
and one for the y
axis.
plot(gapminder$pop,gapminder$lifeExp)
Barplots are commonly-used for counts of categorical data
barplot(table(gapminder$continent))
Boxplots are good for visualising and comparing distributions. Here the ~
symbol sets up a formula, the effect of which is to put the categorical variable on the x
axis and continuous variable on the y
axis.
boxplot(gapminder$gdpPercap ~ gapminder$continent)
Lots of customisations are possible to enhance the appaerance of our plots. Not for the faint-hearted, the help pages ?plot
and ?par
give the full details. In short,
Axis labels, and titles can be specified as character strings.
colours()
Plotting characters can be specified using a pre-defined number:-
Putting it all together.
plot(gapminder$pop,gapminder$lifeExp,pch=16,
col="red",ylab="Life Expectancy",
xlab="Population",main="Life Expectancy trend with population")
The same customisations can be used for various plots:-
boxplot(gapminder$gdpPercap ~ gapminder$continent,
col=c("red","orange","green","blue","purple"),
main="GDP per-continent",
xlab="Continent",
ylab="GDP")
an also save plots to a file calling the pdf
or png
functions before executing the code to create the plot.
pdf("myLittlePlot.pdf")
barplot(table(gapminder$continent))
dev.off()
null device
1
Any plots created in-between the pdf(..)
and dev.off()
lines will get saved to the named file. The dev.off()
line is very important; without it you will not be able to view the plot you have created. pdf
files are useful because you can create documents with multiple pages. Moreover, they can be imported into tools such as Adobe Illustrator to be incorporated with other graphics.
It is important to realise that base graphics in R uses a “canvas model” to create graphics. We can only overlay extra information on-top of an exising plot and cannot “undo” what is already drawn.
Let’s suppose we want to visualise and life expectancy and population of countries in Europe and Africa. First, create two datasets to represent European and African countries in the year 2002
euroData <- gapminder[gapminder$continent == "Europe" & gapminder$year == 2002,]
dim(euroData)
[1] 30 6
afrData <- gapminder[gapminder$continent == "Africa" & gapminder$year == 2002,]
dim(afrData)
[1] 52 6
We can start by plotting the life expectancy of the European countries as red dots.
plot(euroData$pop, euroData$lifeExp,col="red",
pch=16,
xlab="Population",
ylab="Life Expectancy")
The points
function can be used put extra points corresponding to African countries on the existing plot.
points(afrData$pop, afrData$lifeExp,col="blue",pch=16)
Wait, how many African countries did we have?
nrow(afrData)
[1] 52
The problem here is that the initial limits of the y axis were defined using the life expectancy range of the European data. We can only add points to the existing plotting window, so any African countries with life expectancy outside this range will not get displayed.
We can define the axes when we create the plot using xlim
and ylim
.
plot(euroData$pop, euroData$lifeExp,col="red",
pch=16,
xlab="Population",
ylab="Life Expectancy",
xlim=c(0,8e7),ylim=c(30,90))
points(afrData$pop, afrData$lifeExp,col="blue",pch=16)
Other useful functions for adding features to an existing plot include text
, abline
, grid
, legend
among others
Another useful trick for plotting is to take advantage of pre-existing colour palettes in R. The RColorBrewer
package is a useful package for such palettes; many of which are friendly to those with visual impairments.
library(RColorBrewer)
display.brewer.all(colorblindFriendly = TRUE)
The brewer.pal
function can return the names of n
colours from one of the pre-defined palettes to be used as a col
argument to a plotting function.
boxplot(gapminder$gdpPercap ~ gapminder$continent,col=brewer.pal(5,"Set2"),
main="GDP per-continent",
xlab="Continent",
ylab="GDP")
abline
function to draw a vertical line to separate the outliers from the rest of the plot
The col
argument to plot
doesn’t have to contain only one colour. We can use a vector that assigns a colour to each point to be plotted. In the following code we:-
red
repeated as many times as rows in the data framesubset_afr_eur <- gapminder[gapminder$continent %in% c("Europe", "Africa") & gapminder$year == 2002,]
mycol <- rep("red",nrow(subset_afr_eur))
plot(subset_afr_eur$pop, subset_afr_eur$lifeExp,
col=mycol,
pch=16)
We can replace the items in the colour vector that correspond to an African country with a different value
plot
function oncemycol[which(subset_afr_eur$continent == "Africa")] <- "blue"
plot(subset_afr_eur$pop, subset_afr_eur$lifeExp,
col=mycol,
pch=16,
xlab="Population",
ylab="Life Expenctancy")
If we preferred, we could plot the data for European and African countries in separate plots side-by-side.
mfrow
plotting parameterc(1,2)
corresponds to 1 row and 2 columnspar(mfrow=c(1,2))
plot(euroData$pop, euroData$lifeExp,col="red",
pch=16,
xlab="Population",
ylab="Life Expectancy",
main="European Countries",
xlim=c(0,8e7),ylim=c(30,90))
plot(afrData$pop, afrData$lifeExp,col="blue",
pch=16,
xlab="Population",
ylab="Life Expectancy",
main="African Countries",
xlim=c(0,8e7),ylim=c(30,90))
We can’t really have a run-through of the R language without at least mentioning statistics! However, like plotting it is a vast field. The main challenges are putting your data in the correct format (which we have covered here), and deciding which test to use (which R will not advise you on!)
The t.test
function is probably the most fundamental statistical testing function in R, and can be adapted to many different situations. Full details are given in the help page ?t.test
.
The life expectancy looks to be different between Europe and Africa
boxplot(gapminder$lifeExp ~ gapminder$continent)
The output from t.test
can be used to judge if there is a statistically-significant difference in means
t.test(euroData$lifeExp,afrData$lifeExp)
Welch Two Sample t-test
data: euroData$lifeExp and afrData$lifeExp
t = 16.318, df = 65.751, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
20.51515 26.23558
sample estimates:
mean of x mean of y
76.70060 53.32523
Were our data not normally-distributed we could use wilcox.test
, for example. Fortunately, most statistical tests can be accessed in a similar manner, so it is easy to switch between using different tests provided your data are in the correct format. To re-iterate, the skill is in choosing which test is appropriate.
R also has functionality for assessing correlation between sets of values. For example, there seems to be a relationship between a country’s wealth and life expenctancy:-
plot(euroData$gdpPercap,euroData$lifeExp)
This can be quantified by calculating the correlation. Commonly the \(R^2\) value is reported
cor(euroData$gdpPercap,euroData$lifeExp)
[1] 0.8545089
cor(euroData$gdpPercap,euroData$lifeExp)^2
[1] 0.7301854
A linear model can be used to assess the relationship, which we can overlay on the plot using the function abline
. The coefficients of the model give the slope and intercept of the straight line
model <- lm(euroData$lifeExp~euroData$gdpPercap)
plot(euroData$gdpPercap,euroData$lifeExp)
abline(model)
model$coefficients
(Intercept) euroData$gdpPercap
7.185885e+01 2.230016e-04