Introduction

The purpose of this section is to review some of the key concepts in basic R usage and statistical testing:

  • Reading data into R
  • The data-frame representation of data in R
  • Selecting rows and columns from a data frame
  • Computing numerical summaries
  • Basic plotting
  • Getting help on functions in RStudio

About this tutorial

  • The traditional way to enter R commands is via the Terminal, or using the console in RStudio (the bottom-left panel when RStudio opens for the first time).
  • However, for this course we will use a relatively new feature called R-notebooks.
  • An R-notebook mixes plain text with R code
    • The R code can be run from inside the document and the results are displayed directly underneath
  • Each chunk of R code looks something like this.
  • Each line of R can be executed by clicking on the line and pressing CTRL and ENTER
  • Or you can press the green triangle on the right-hand side to run everything in the chunk
    • Try this now!
print("Hello World")
[1] "Hello World"
  • You can add R chunks by pressing CTRL + ALT + I
    • or using the Insert menu option
    • (can also include code from other languages such as Python or bash)

The document may also contain other formatting options that are used to render the HTML (or PDF, Word) output.

Here is some italic text, but we can also write in bold, or write things

  • in
  • a
  • list
    • which include sub-lists

Example Analysis

We will use a dataset from The University of Sheffield Mathematics and Statistics Help group (MASH; https://www.sheffield.ac.uk/mash/statistics2/anova).

The data set diet.csv contains information on 78 people who undertook one of three diets. There is background information such as age, gender (Female=0, Male=1) and height. The aim of the study was to see which diet was best for losing weight, so the independent variable (group) is diet.

Reading and inspecting the data

Like other software (Word, Excel, Photoshop, ...), R has a default location where it will save files to and import data from. This is known as the working directory in R. You can query what R currently considers its working directory by executing the following R command:-

getwd()
[1] "/Users/coutur01/courses/cruk/LinearModelAndExtensions/20200310/Practicals"

N.B. Here, a set of open and closed brackets () is used to run the getwd function with no arguments.
Note: if you are following this material on a Windows machine, as opposed to a Linux or macOS machine, you will get a path like C:. If you want to use the complementary R command setwd() to set the working directory, you MUST escape each backslash (\) by doubling it, i.e. setwd("C:\\Users\\Fred").
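
As a minimal sketch of how getwd() and setwd() work together, the chunk below records the current working directory, switches to R's session temporary directory (chosen here only because it is guaranteed to exist), and then switches back. The variable names old_wd and new_wd are illustrative and not part of the course material.

```r
# Remember where we started so that we can return afterwards
old_wd <- getwd()

# Move to a directory that is guaranteed to exist (R's session temp directory)
setwd(tempdir())
new_wd <- getwd()

# Return to the original working directory
setwd(old_wd)

# Inside an R string, "\\" stands for ONE backslash character,
# so this counts the characters of the Windows-style path C:\Users\Fred
nchar("C:\\Users\\Fred")
```

On Windows, R also accepts forward slashes in paths (for example "C:/Users/Fred"), which avoids the need for escaping.
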
We can also list the files in a specific directory with:-

list.files("data/")
[1] "amess.csv"                  "Bronchitis.csv"            
[3] "crab.csv"                   "diet.csv"                  
[5] "globalBreastCancerRisk.csv" "myocardialinfarction.csv"  
[7] "OscillationIndex.txt"       "students.csv"              

A useful sanity check is the file.exists function, which will print TRUE if the file can be found at the given path (relative to the working directory).

file.exists("data/diet.csv")
[1] TRUE
  • Assuming the file can be found, we can use the read.csv function to import the data. Other functions handle other formats: read.delim for tab-delimited files, or the more general read.table. A data frame object is created.
  • The file path data/diet.csv is the only argument we give to the function read.csv
    • arguments are listed inside the brackets
    • for functions requiring more than one argument (input), arguments are separated by commas
    • a function may have default values for some arguments; meaning they do not need to be specified
  • The characters <- are used to tell R to create a variable
    • without this, the data are printed to the screen but not stored in memory, and you won't be able to work with them
  • If you get an error saying Error in file(file, "rt") : cannot open the connection..., you might need to change your working directory or make sure the file name is typed correctly (R is case-sensitive)
  • Typing the name of an object will cause R to print the contents to the screen
diet <- read.csv("data/diet.csv")
diet
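
The import steps above can be sketched on a small, self-contained example. The file contents below are hypothetical toy data written to a temporary file (they are not the diet data), and the names toy_file and toy are illustrative. The stringsAsFactors = TRUE argument asks read.csv to store text columns as factors; in R 4.0.0 and later the default is FALSE.

```r
# Hypothetical toy data written to a temporary file, so the example runs anywhere
toy_file <- tempfile(fileext = ".csv")
writeLines(c("id,gender,age",
             "1,Female,22",
             "2,Male,46",
             "3,Female,55"),
           toy_file)

# Sanity check before importing
file.exists(toy_file)

# Arguments are separated by commas; header = TRUE is the default for read.csv
toy <- read.csv(toy_file, header = TRUE, stringsAsFactors = TRUE)

dim(toy)          # number of rows, then number of columns
head(toy, n = 2)  # first two rows only
str(toy)          # the type of each column
```

For large tables, head() and str() are usually more practical than typing the object name, which prints every row.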

A note on importing your own data

If you are trying to read your own data, and encounter an error at this stage, you may need to consider if your data are in the correct form for analysis. Like most programming languages, R will struggle if your spreadsheet has been heavily formatted to include colours, formulas and special formatting.

These references will guide you through some of the pitfalls and common mistakes to avoid when formatting data

diet is an example of a data frame. The data frame object in R allows us to work with "tabular" data, like we might be used to dealing with in Excel, where our data can be thought of as having rows and columns. The values in each column must all be of the same type (i.e. all numbers or all text).

  • the summary function will provide an overview of the contents of each column in the table
    • the type of summary provided depends on the data type in each column
summary(diet)
       id           gender        age            height      diet.type
 Min.   : 1.00   Female:43   Min.   :16.00   Min.   :141.0   A:24     
 1st Qu.:19.75   Male  :33   1st Qu.:32.50   1st Qu.:163.8   B:25     
 Median :40.50               Median :39.00   Median :169.0   C:27     
 Mean   :39.87               Mean   :39.22   Mean   :170.8            
 3rd Qu.:59.25               3rd Qu.:47.25   3rd Qu.:175.2            
 Max.   :78.00               Max.   :60.00   Max.   :201.0            
 initial.weight   final.weight  
 Min.   :58.00   Min.   :53.00  
 1st Qu.:66.00   1st Qu.:61.95  
 Median :72.00   Median :68.95  
 Mean   :72.29   Mean   :68.34  
 3rd Qu.:78.00   3rd Qu.:73.67  
 Max.   :88.00   Max.   :84.50  
  • particular columns can be accessed using the $ operator
    • TIP: RStudio will allow auto-complete using the Tab key
diet$gender
 [1] Female Female Female Female Female Female Female Female Female Female
[11] Female Female Female Female Female Female Female Female Female Female
[21] Female Female Female Female Female Female Female Female Female Female
[31] Female Female Female Female Female Female Female Female Female Female
[41] Female Female Female Male   Male   Male   Male   Male   Male   Male  
[51] Male   Male   Male   Male   Male   Male   Male   Male   Male   Male  
[61] Male   Male   Male   Male   Male   Male   Male   Male   Male   Male  
[71] Male   Male   Male   Male   Male   Male  
Levels: Female Male
diet$age
 [1] 22 46 55 33 50 50 37 28 28 45 60 48 41 37 44 37 41 43 20 51 31 54 50
[24] 48 16 37 30 29 51 35 21 22 36 20 35 45 58 37 31 35 56 48 41 39 31 40
[47] 50 43 25 52 42 39 40 51 38 54 33 45 37 44 40 37 39 31 36 47 29 37 31
[70] 26 40 35 49 28 40 51
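
Once a column has been extracted with $, it is an ordinary vector and can be passed to summary functions. The sketch below uses a small hypothetical data frame called toy (not the diet data) so that it runs on its own.

```r
# Hypothetical toy data frame for illustration
toy <- data.frame(
  gender = factor(c("Female", "Female", "Male", "Male")),
  age    = c(22, 46, 55, 33)
)

mean(toy$age)      # arithmetic mean: 39
median(toy$age)    # middle value: 39.5
range(toy$age)     # smallest and largest values: 22 55
table(toy$gender)  # counts per category

# Mean age within each gender (Female 34, Male 44)
age_by_gender <- tapply(toy$age, toy$gender, mean)
age_by_gender
```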

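We can create new columns based on existing ones. Here weight.loss is the final weight minus the initial weight, so a negative value means that weight was lost.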

diet$weight.loss <- diet$final.weight - diet$initial.weight
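
As a small illustration of the same calculation, the sketch below applies it to a hypothetical three-row data frame (toy, with made-up weights) and adds a second, illustrative column lost.weight that flags the rows where weight went down.

```r
# Hypothetical toy data with the same column names as the diet data
toy <- data.frame(
  initial.weight = c(60, 72, 80),
  final.weight   = c(57.5, 72, 81)
)

# Same definition as in the text: final minus initial
toy$weight.loss <- toy$final.weight - toy$initial.weight
toy$weight.loss   # -2.5  0.0  1.0

# Negative values correspond to weight actually lost
toy$lost.weight <- toy$weight.loss < 0
toy
```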

Subsetting rows and columns is done using the [rows, columns] syntax; where rows and columns are vectors containing the rows and columns you want

  • you can omit either vector to show all rows or all columns. However, you still need to remember the comma (,)
diet[1:5,]
diet[,2:3]
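
The [rows, columns] syntax can be sketched on a small hypothetical data frame (toy, not the diet data). Columns can be selected by name as well as by position, and selecting a single column returns a plain vector unless drop = FALSE is given.

```r
# Hypothetical toy data frame for illustration
toy <- data.frame(
  id     = 1:4,
  gender = c("Female", "Female", "Male", "Male"),
  age    = c(22, 46, 55, 33)
)

rows_2_3  <- toy[2:3, ]               # rows 2 and 3, all columns
two_cols  <- toy[, c("id", "age")]    # all rows, two columns chosen by name
age_vec   <- toy[, "age"]             # a single column comes back as a vector
age_frame <- toy[, "age", drop = FALSE]  # keep it as a one-column data frame

dim(rows_2_3)   # 2 rows, 3 columns
dim(two_cols)   # 4 rows, 2 columns
```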

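Logical tests can be used to select rows, e.g. using ==, < or >. Each test returns one TRUE or FALSE value per row, and only the rows where the value is TRUE are kept.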

diet$diet.type == "A"
 [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[12]  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
[45]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE
[56] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[67] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
dietA <- diet[diet$diet.type == "A",]
dietA
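
Logical tests can also be combined. The sketch below, on a hypothetical toy data frame, uses & (and) to require two conditions at once, %in% to match any of several values, and sum() to count how many rows pass a test (TRUE counts as 1).

```r
# Hypothetical toy data frame for illustration
toy <- data.frame(
  diet.type = c("A", "A", "B", "C"),
  age       = c(22, 46, 55, 33)
)

is_a <- toy$diet.type == "A"
sum(is_a)   # number of rows on diet A: 2

# Both conditions must hold: diet A AND older than 30
a_over_30 <- toy[is_a & toy$age > 30, ]

# Either of several values: diet B or diet C
b_or_c <- toy[toy$diet.type %in% c("B", "C"), ]

nrow(a_over_30)   # 1
nrow(b_or_c)      # 2
```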

Visualisation

All your favourite types of plot can be created in R.

Plots can be constructed from vectors of numeric data, such as the data we get from a particular column in a data frame.

  • a histogram is commonly used to examine the distribution of a particular variable
hist(diet$weight.loss)

  • a boxplot is often used to compare distributions visually
    • if given a data-frame, each column will be shown as a separate box
    • otherwise the formula syntax y ~ x is used to define the variable to plot (y) and the grouping variable (x)
boxplot(diet$weight.loss~diet$diet.type)
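
The same plots can be written a little more readably by using the data argument of boxplot, which lets the formula refer to column names directly, and by adding a title and axis labels. The sketch below uses a hypothetical toy data frame with made-up values; the label text is illustrative.

```r
# Hypothetical toy data frame for illustration
toy <- data.frame(
  diet.type   = factor(c("A", "A", "A", "B", "B", "B")),
  weight.loss = c(-3.1, -2.4, -4.0, -1.2, -2.9, -0.5)
)

# Histogram with a title and an x-axis label
h <- hist(toy$weight.loss,
          main = "Weight change",
          xlab = "final - initial weight")

# Boxplot using the formula y ~ x, with columns looked up in 'toy'
b <- boxplot(weight.loss ~ diet.type, data = toy,
             xlab = "Diet", ylab = "final - initial weight")

# Both functions also return the numbers behind the plot
sum(h$counts)   # total number of observations in the histogram
b$names         # one box per diet type
```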