Learning objectives
- Install R and RStudio
- Install the tidyverse collection of R packages
- Introduce RStudio, which we will use to write R scripts in this course
- Introduce data types and data structures
Before starting this course you will need to ensure that your computer is set up with the required software.
If you have any difficulty installing any of this software then please contact one of the trainers for help.
R and RStudio are separate downloads and installations.
R is the underlying statistical computing environment. The base R system and a very large collection of packages that give you access to a huge range of statistical and analytical functionality are available from CRAN, the Comprehensive R Archive Network.
However, using R alone is no fun. RStudio is a graphical integrated development environment (IDE) that makes using R much easier and more interactive. You need to install R before you install RStudio.
On this course we will be making use of a brilliant collection of packages designed for data science called the tidyverse that make it much easier and more fun to work with your data. After installing R and RStudio, follow the instructions at the bottom of this page to install the tidyverse.
To check which version of R you are using, start RStudio and the first thing that appears in the console indicates the version of R you are running. Alternatively, you can type sessionInfo(), which will also display which version of R you are running. Go on the CRAN website and check whether a more recent version is available. If so, please download and install it. You can check here for more information on how to remove old versions from your system if you wish to do so.
.exe file that was just downloaded
To check the version of R you are using, start RStudio and the first thing that appears on the terminal indicates the version of R you are running. Alternatively, you can type sessionInfo(), which will also display which version of R you are running. Go on the CRAN website and check whether a more recent version is available. If so, please download and install it.
.pkg file for the latest R version
sudo apt-get install r-base, and for Fedora sudo yum install R), but we don’t recommend this approach as the versions provided by this are usually out of date. In any case, make sure you have at least R 3.3.1.
sudo dpkg -i RSTUDIO-2023.xx.y-zzz-AMD64.DEB at the terminal).
After installing R and RStudio, please install the tidyverse packages.
After starting RStudio, at the console type:
install.packages("tidyverse")
(look for the ‘Console’ tab and type at the > prompt)
You can also do this by going to Tools -> Install Packages and typing the names of the packages separated by a comma.
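Once everything is installed, a quick check from the console can confirm the setup. The lines below are a minimal sketch: the variable names r_is_recent and tidyverse_installed are purely illustrative, and the comparison assumes the 3.3.1 minimum version mentioned in the installation notes.

```r
# Version string of the running R session (the same text shown at start-up)
R.version.string

# TRUE if the running version of R is at least 3.3.1
r_is_recent <- getRversion() >= "3.3.1"
r_is_recent

# TRUE if the tidyverse package is installed (checked without attaching it)
tidyverse_installed <- requireNamespace("tidyverse", quietly = TRUE)
tidyverse_installed
```

If tidyverse_installed is FALSE, the install.packages("tidyverse") step has probably not completed yet.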
R involves creating and using scripts, which makes the steps of your analysis clear and lets someone else inspect them for feedback and error-checking.
R code is great for reproducibility. An increasing number of journals and funding agencies expect analyses to be reproducible, so knowing R will give you an edge with these requirements.
R integrates with other tools to generate manuscripts from your code. This document (an RMarkdown .Rmd file) is a case in point.
R is interdisciplinary and extensible and has thousands of installable packages to extend its capabilities. R has packages for image analysis, GIS, time series, population genetics, and a lot more.
R scales well to work on data of all shapes and sizes.
R can connect to spreadsheets, databases, and many other data formats, on your computer or on the web.
R produces high-quality graphics suitable for publication in journals or the web.
R has a large and welcoming community: thousands of people use R daily, and many of them are willing to help you through websites such as Stack Overflow or the RStudio community.
Not only is R free, but it is also open-source and cross-platform.
RStudio provides us with a friendly interface to the R statistical programming language. It consists of four main “Panes”. These can be resized and moved around to suit how you like to work.
By default the top left-hand pane is one for creating, editing & running R scripts.
A script is an R program that you have written. A good practice is for that script to perform only one role in your analysis workflow and so you may have several R scripts which you call, in a particular sequence, to analyse your data.
As you will see, a script is basically a text file that contains R commands and (ideally) comments that explain what the code does (as a form of documentation).
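To make this concrete, here is a sketch of what a very small script might contain. The file name and the measurements are made up for illustration; lines starting with # are comments that R ignores.

```r
# example_script.R (hypothetical file name)
# Purpose: summarise a small set of made-up measurements

# Store the measurements in a variable
measurements <- c(12, 15, 11, 14)

# Calculate the average and print it to the console
average <- mean(measurements)
print(average)
## [1] 13
```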
As well as R scripts, there are many types of RStudio document, including Markdown files, which we will use in the teaching of this course. These can produce interactive workbooks, PDFs and web documents, to name but a few possible outcomes.
Coming down the screen to the bottom left-hand pane we find the console window. This is where we can find output produced by running our R scripts.
We can also try out snippets of R code here. Those of you who have only used graphical interfaces like Windows or macOS, where you click on commands using a mouse, may find this aspect of R somewhat different. We type in commands to R using the command line.
This area can also be used like a calculator. Let’s just type in something like 23 + 45 followed by the return key and see what happens. You should get the following:
> 23 + 45
[1] 68
Now 68 is clearly the answer but what is that 1 in brackets?
Here is another example to explain. If we type 1:36 and press enter, what happens? R generates output counting from 1 to 36 but cannot fit all the output on one line and so starts another like this:
> 1:36
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[26] 26 27 28 29 30 31 32 33 34 35 36
Now we have two lines beginning with a number in square brackets. Note that the number of values displayed on each line may differ on your computer; it largely depends on the width of your console pane and the font size. Try creating a larger sequence of numbers, e.g. 1:100, if all 36 numbers fit on a single line in your case.
This is just R helping us to keep tabs on which number we are looking at. [1] denotes that the line starts with the first result and the [26] denotes that this line starts with the 26th number. Let’s try another one and generate a sequence incrementing in steps of 2:
> 1:36 * 2
[1] 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50
[26] 52 54 56 58 60 62 64 66 68 70 72
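Another way to get a sequence with a fixed step is the base R function seq(). The sketch below is only an illustration (the variable name evens is made up); it builds the same values as 1:36 * 2 and checks that they match.

```r
# seq() builds a sequence from a start value, an end value and a step size
evens <- seq(from = 2, to = 72, by = 2)

# The result has the same 36 values as multiplying 1:36 by 2
length(evens)
## [1] 36
all(evens == 1:36 * 2)
## [1] TRUE
```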
There are other tabs on this pane but we shall not be covering these on this course.
Next we move to the top right-hand corner pane. Here we have even more tabs, of which we will only consider two: Environment and History.
Environment keeps track of the R variables which we create (more on those later) and their contents. History is like a tally roll of all the R commands we have entered in our session.
Our final bottom right-hand pane also has several tabs. The Files tab is a file explorer that enables us to move around our directories and select which files we wish to work on. We can also change the default working directory that RStudio will use.
The Plots tab is where any graphs that we create in R will appear. We can move through them using the arrow buttons and the export button will convert them to different graphics formats e.g. for publication in a paper or for the web.
The Packages tab shows which R packages are installed (these expand R’s functionality and again will be covered later) and can also install new packages.
The Help tab is a massively useful tab which enables you to search R’s help index for help pages on R functions; these pages provide example code to help you use the functions in your R scripts.
Our overall goal for this course is to give you the ability to import your data into R, select a subset of the data most of interest for a given analysis, carry out an analysis to summarize these data and create visualizations of the data. First though, let us consider “What is Data?”
Data comes in many forms: numeric (integers and decimal values) or alphabetical (characters or lines of text). Clearly a computer (or R) needs a way of representing this wide range of data with its diverse properties.
R has 6 basic data types
The last two data types are rarely used in practice
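The basic type of any value can be inspected with typeof(). The sketch below assumes the six types are the usual ones (character, double, integer, logical, complex and raw) and that complex and raw are the two rarely used ones; the example values are arbitrary.

```r
typeof("hello")      # text
## [1] "character"
typeof(3.14)         # decimal numbers (also called numeric)
## [1] "double"
typeof(3L)           # whole numbers; the L suffix marks an integer
## [1] "integer"
typeof(TRUE)         # TRUE or FALSE
## [1] "logical"
typeof(2+3i)         # complex numbers
## [1] "complex"
typeof(as.raw(255))  # raw bytes
## [1] "raw"
```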
Different types of data are needed in (any) programming for a variety of reasons:
11 + 3 # Operation of addition performed correctly
"11" + 3 # gives error
Mathematical operations such as addition and multiplication are performed using various operators. Here is a list of R’s arithmetic operators.
5*5 # 5 times 5
## [1] 25
7/3 # 7 divided by 3
## [1] 2.333333
7%%3 # remainder of 7 divided by 3
## [1] 1
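A few more arithmetic operators from base R, shown as a short sketch with arbitrary numbers. Parentheses can be used to control the order in which operations are carried out.

```r
9 - 4        # subtraction
## [1] 5
2^3          # 2 to the power of 3
## [1] 8
7 %/% 3      # integer division: how many whole times 3 goes into 7
## [1] 2
(2 + 3) * 4  # parentheses are evaluated first
## [1] 20
```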
Understanding this structure is absolutely essential
x <- 100 # create first vector
object or a variable, both of which are used interchangeably in this course.
<- An assignment operator; one can also use = instead of <-
x <- 100
x = 100
y <- 10
y*y
## [1] 100
y + y
## [1] 20
100 + y
## [1] 110
x <- "Tom"
x
## [1] "Tom"
x <- "Jerry"
x
## [1] "Jerry"
2x <- 100 # gives error
_x <- 100 # gives error
my-name <- "Chandra" # throws error
my_name <- "Chandra" # no error
my.name <- "Chandra" # no error
COUNTRY <- "United Kingdom"
country <- "India"
c() function should be used to create a vector that holds more than one value
c()
x <- 100
x <- c(100) # same as above
x <- 100, 200 # gives error
x <- c(100, 200) # no error
x <- c(1:100) # create values from 1 to 100
x <- c(1,2,3,4)
typeof(x)
## [1] "double"
x <- c(1,"2",3,4)
typeof(x)
## [1] "character"
y <- c(TRUE, FALSE, TRUE, 1L)
y
## [1] 1 0 1 1
typeof(y)
## [1] "integer"
z <- c(TRUE, FALSE, FALSE)
typeof(z)
## [1] "logical"
z <- c(TRUE, FALSE, "FALSE")
z
## [1] "TRUE" "FALSE" "FALSE"
typeof(z)
## [1] "character"
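In the examples above R converts values automatically so that every element of a vector has the same type, moving from logical to integer to double to character as needed. Conversions can also be requested explicitly with the base R as.*() functions; the following is a small sketch with made-up values.

```r
# Text that looks like numbers can be converted to numbers
as.numeric(c("1", "2", "3"))
## [1] 1 2 3

# Numbers can be converted to text
as.character(c(1, 2, 3))
## [1] "1" "2" "3"

# Logical values become 1 (TRUE) and 0 (FALSE)
as.integer(c(TRUE, FALSE))
## [1] 1 0
```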
y <- c(TRUE, FALSE, TRUE)
y
## [1] TRUE FALSE TRUE
sum(y)
## [1] 2
mean(y)
## [1] 0.6666667
x <- c(1,2,3,4,5)
x * 5 # same as x * c(5)
## [1] 5 10 15 20 25
x + 1 # same as x + c(1)
## [1] 2 3 4 5 6
x <- c(1,2,3,4,5,6)
y <- c(1,2,3,4,5,6)
x + y
## [1] 2 4 6 8 10 12
x <- c(1,2,3,4,5)
y <- c(1,2)
x + y
## Warning in x + y: longer object length is not a multiple of shorter object
## length
## [1] 2 4 4 6 6
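One way to see what R is doing in the last example is to stretch the shorter vector by hand. This sketch uses rep() with length.out to repeat y until it is as long as x; the name y_recycled is just illustrative.

```r
x <- c(1, 2, 3, 4, 5)
y <- c(1, 2)

# Repeat the values of y until there are as many as in x
y_recycled <- rep(y, length.out = length(x))
y_recycled
## [1] 1 2 1 2 1

# Adding the stretched vector gives the same values as x + y, without the warning
x + y_recycled
## [1] 2 4 4 6 6
```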
[] subscript operator
[] one can give any of the following
vec <- c(10, 20, 30, 40, 50, 60, 70, 80, 90, 100)
vec
## [1] 10 20 30 40 50 60 70 80 90 100
# extract 4th value from the vector vec
vec[4]
## [1] 40
vec[c(4)] # good idea to use c() function even for single value
## [1] 40
# extract 4th and 7th values from the vector vec
vec[c(4,7)]
## [1] 40 70
# extract all the values except 4th and 7th value
vec[c(-4,-7)]
## [1] 10 20 30 50 60 80 90 100
y <- c( 5, 8, 10)
y[c(FALSE, TRUE, FALSE)] # extract second element
## [1] 8
x <- c(10, 20, 30, 40)
x == 20
## [1] FALSE TRUE FALSE FALSE
x > 20
## [1] FALSE FALSE TRUE TRUE
keep <- x > 20
keep
## [1] FALSE FALSE TRUE TRUE
x[keep]
## [1] 30 40
x[ x > 20 ] # The equivalent of x[keep]
## [1] 30 40
x <- c(10, 20, 30, 40)
x[x > 10] # get all the values > 10
## [1] 20 30 40
x[x < 40] # get all the values < 40
## [1] 10 20 30
x[ x > 10 & x < 40] # get all the values > 10 and < 40
## [1] 20 30
x[ x == 20] # get values that are equal to 20
## [1] 20
x[ !x == 20] # equivalent to x[ x != 20]
## [1] 10 30 40
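Conditions can also be combined with | ("or"), and the base R function which() returns the positions of the matching elements rather than the values themselves. A short sketch using the same vector:

```r
x <- c(10, 20, 30, 40)

# | means "or": keep values that are below 20 or above 30
x[x < 20 | x > 30]
## [1] 10 40

# which() gives the positions of the values that satisfy the condition
which(x > 20)
## [1] 3 4
```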
x <- c(10, 20, 30)
x[2]
## [1] 20
x[2] <- 1000
x
## [1] 10 1000 30
Functions are a fundamental building block of R. Functions are “canned scripts” that automate more complicated sets of commands including operations, assignments, etc. Many functions are predefined, or can be made available by importing R packages (more on that later). A function usually takes one or more inputs called arguments. Functions often (but not always) return a value. A typical example would be the function round(). The input (the argument) must be a number, and the return value (in fact, the output) is the rounded number. Executing a function (‘running it’) is called calling the function. An example of a function call is:
pi <- 3.141593
round(pi)
## [1] 3
round is a function that takes at least one number and returns a number that is rounded to the nearest integer.
()
args() function
? or help() followed by the name of the function in the console; for example, to get help with the round function, type “?round” in the console
?round
help(round) # equivalent to ?round
args() to view the arguments of a function
args(round)
## function (x, digits = 0)
## NULL
round() takes exactly two arguments
round(x=pi, digits = 0)
## [1] 3
round(x=pi, digits = 2)
## [1] 3.14
round(x=pi, digits = 4)
## [1] 3.1416
round( digits = 4, x=pi)
## [1] 3.1416
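The calls above name each argument. Arguments can also be matched by position, and any argument with a default value (such as digits = 0) can be left out. A brief sketch using a literal number rather than the pi variable:

```r
# Matched by position: the first value is x, the second is digits
round(3.14159, 2)
## [1] 3.14

# Matched by name: the order no longer matters
round(digits = 2, x = 3.14159)
## [1] 3.14

# digits left out, so its default value of 0 is used
round(3.14159)
## [1] 3
```

Naming arguments is usually easier to read when a function has more than one or two of them.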
As R was designed to analyze datasets, it includes the concept of missing data (which is uncommon in other programming languages). Missing data are represented in vectors as NA.
When doing operations on numbers, most functions will return NA if the data you are working with include missing values. This feature makes it harder to overlook the cases where you are dealing with missing data. You can add the argument na.rm = TRUE to calculate the result while ignoring the missing values.
heights <- c(2, 4, 4, NA, 6)
mean(heights)
## [1] NA
max(heights)
## [1] NA
mean(heights, na.rm = TRUE)
## [1] 4
max(heights, na.rm = TRUE)
## [1] 6
If your data include missing values, you may want to become familiar with the function is.na(). See below for examples.
## Extract those elements which are not missing values.
heights[!is.na(heights)]
## [1] 2 4 4 6
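Because is.na() returns a logical vector, it can be combined with sum() and which() to count and locate missing values. A small sketch using the same heights vector:

```r
heights <- c(2, 4, 4, NA, 6)

# TRUE marks the positions that hold a missing value
is.na(heights)
## [1] FALSE FALSE FALSE  TRUE FALSE

# Count the missing values (each TRUE counts as 1)
sum(is.na(heights))
## [1] 1

# Find the position of the missing value
which(is.na(heights))
## [1] 4
```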
# Challenge 1 a: How many observations do we have?
tumour_vol <- c(2.1, 1.9, 2.6, 1.8, 3)
length(tumour_vol)
## [1] 5
# Challenge 1 b: What is the mean tumour volume?
mean(tumour_vol)
## [1] 2.28
# Challenge 1 c: How many patients have a tumour volume less than 2?
sum(tumour_vol < 2)
## [1] 2
cor() to get the correlation coefficient
cor() function uses? For help use help() or ?
# Challenge 2 a: use the function `cor()` to get the correlation coefficient
data1 <- c(10, 9, 7, 6, 7, 3, 7, 5, 6, 6)
data2 <- c(5, 2, 10, 7, 2, 5, 1, 5, 3, 4)
cor(x=data1, y=data2)
## [1] -0.1572206
# Challenge 2 b: Can you identify the default correlation method the `cor()` function uses?
?cor
# According to the help file for 'cor', if no method is specified the function uses the 'pearson' method by default.
# Challenge 2 c: Can you get Spearman correlation coefficient for these two vectors?
cor(x=data1, y=data2, method = "spearman")
## [1] -0.2839244
sum(), mean() on this logical vector, if so what is the output of sum() and mean()?
# Challenge 3:
logi_vec <- c(TRUE, FALSE, TRUE, TRUE)
# Mathematical functions can be applied to logical vectors. Internally, the logical values TRUE and FALSE are represented as 1 and 0, respectively.
sum(logi_vec)
## [1] 3
mean(logi_vec)
## [1] 0.75
# Challenge 4 a: c(5, 2, 9, 1, 13) * c(2)
c(5, 2, 9, 1, 13) * c(2)
## [1] 10 4 18 2 26
# Since the shorter vector has only one value, every value of the longer vector is multiplied by the value of the shorter vector.
# Challenge 4 b: c(5, 2, 9, 1, 13) * c(1,2)
c(5, 2, 9, 1, 13) * c(1,2)
## Warning in c(5, 2, 9, 1, 13) * c(1, 2): longer object length is not a multiple
## of shorter object length
## [1] 5 4 9 2 13
# Since the shorter vector has only two values, these two values are sequentially recycled to multiply the longer vector's values.
# Challenge 4 c: c(5, 2, 9, 1, 13) + c(1,2,3,4,5)
c(5, 2, 9, 1, 13) + c(1,2,3,4,5)
## [1] 6 4 12 5 18
# Since both vectors have the same length, values are added together element by element
vec <- c(23, 12, 41, 65, 23, 6)
vec[ vec == 23 | vec < 15]
## [1] 23 12 23 6
# Challenge 6 a: What is the index number of "April" in month.name vector? hint: "which" function may help you.
which(month.name == "April")
## [1] 4
# Challenge 6 b: Extract all the months from April to December
month.name[which(month.name == "April"):length(month.name)]
## [1] "April" "May" "June" "July" "August" "September"
## [7] "October" "November" "December"
These instructions were adapted from Data Carpentry course materials.