The purpose of this section is to review some of the key concepts in basic R usage and statistical testing.
print("Hello World")
[1] "Hello World"
The document may also contain other formatting options that are used to render the HTML (or PDF, Word) output.
Here is some italic text, but we can also write in bold, or write things
We will use a dataset from The University of Sheffield Mathematics and Statistics Help group (MASH, https://www.sheffield.ac.uk/mash/statistics2/anova).
The data set diet.csv contains information on 78 people who undertook one of three diets. There is background information such as age, gender (Female=0, Male=1) and height. The aim of the study was to see which diet was best for losing weight, so the independent variable (group) is diet.
Like other software (Word, Excel, Photoshop, ...), R has a default location where it will save files to and import data from. This is known as the working directory in R. You can query what R currently considers its working directory by executing the following R command:
getwd()
[1] "/Users/coutur01/courses/cruk/LinearModelAndExtensions/20200310/Practicals"
N.B. Here, a set of open and closed brackets () is used to run the getwd function with no arguments.
Note: if you are following this material on a Windows machine rather than a Linux or macOS machine, you will get a path beginning with a drive letter such as C:. If you want to use the complementary R command setwd() to set the working directory, you MUST escape each backslash by doubling it, i.e. setwd("C:\\Users\\Fred").
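The escaping rule can be checked without changing your working directory. The sketch below assumes a hypothetical folder C:\Users\Fred (the same illustrative name used in the note) and compares the escaped form with a path built by file.path, which joins its parts with forward slashes. R also accepts forward slashes in Windows paths, which avoids the escaping issue altogether.

```r
# Two ways of writing the same (hypothetical) Windows folder
escaped <- "C:\\Users\\Fred"                  # every backslash is doubled
forward <- file.path("C:", "Users", "Fred")   # file.path joins parts with "/"

# The doubled backslashes are stored as single characters
nchar(escaped)

# Converting backslashes to forward slashes gives the same path
converted <- gsub("\\", "/", escaped, fixed = TRUE)
identical(converted, forward)

# setwd(forward)  # not run: this folder is unlikely to exist on your machine
```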
We can also list the files in a specific directory with:
list.files("data/")
[1] "amess.csv" "Bronchitis.csv"
[3] "crab.csv" "diet.csv"
[5] "globalBreastCancerRisk.csv" "myocardialinfarction.csv"
[7] "OscillationIndex.txt" "students.csv"
A useful sanity check is the file.exists function, which will print TRUE if the file can be found at the given path, relative to the working directory.
file.exists("data/diet.csv")
[1] TRUE
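As a small illustration of these checks, the sketch below works in a temporary folder so that nothing in your own project is touched. The file names toy.csv and missing.csv are made up for the example. It shows that file.exists accepts several paths at once, and that list.files can be restricted with a pattern.

```r
# Create a temporary folder containing one small, made-up file
tmp <- tempfile("demo_data_")
dir.create(tmp)
writeLines("id,age", file.path(tmp, "toy.csv"))

# file.exists returns one TRUE/FALSE per path supplied
paths <- file.path(tmp, c("toy.csv", "missing.csv"))
found <- file.exists(paths)
found

# list.files can be limited to names matching a regular expression
csv_files <- list.files(tmp, pattern = "\\.csv$")
csv_files
```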
Assuming the file can be found, we can use the read.csv function to import the data. Other functions can be used to read tab-delimited files (read.delim), and read.table is a more generic alternative. A data frame object is created.
The path to diet.csv is the only argument to the function read.csv.
The characters <- are used to tell R to create a variable.
If you get an error such as Error in file(file, "rt") : cannot open the connection..., you might need to change your working directory or make sure the file name is typed correctly (R is case-sensitive).
diet <- read.csv("data/diet.csv")
diet
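The import step can be tried on a tiny file before working with real data. The sketch below writes three made-up rows to a temporary CSV file, checks that the file exists, and reads it back. The column names are chosen to resemble the diet data but the values are invented. One assumption worth flagging: the output shown in this section treats gender as a factor, and since R 4.0.0 read.csv only does this when stringsAsFactors = TRUE is supplied.

```r
# Write a tiny, made-up CSV file to a temporary location
toy_path <- tempfile(fileext = ".csv")
writeLines(c("id,gender,age",
             "1,Female,22",
             "2,Male,46",
             "3,Female,55"), toy_path)

# Stop early with a clear message if the file cannot be found
if (!file.exists(toy_path)) stop("File not found: ", toy_path)

# Import; stringsAsFactors = TRUE converts text columns to factors
toy <- read.csv(toy_path, stringsAsFactors = TRUE)

dim(toy)             # number of rows, then number of columns
head(toy, 2)         # first two rows
levels(toy$gender)   # categories found in the gender column
```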
If you are trying to read your own data, and encounter an error at this stage, you may need to consider if your data are in the correct form for analysis. Like most programming languages, R will struggle if your spreadsheet has been heavily formatted to include colours, formulas and special formatting.
These references will guide you through some of the pitfalls and common mistakes to avoid when formatting data
The diet object is an example of a data frame. The data frame object in R allows us to work with “tabular” data, like we might be used to dealing with in Excel, where our data can be thought of as having rows and columns. The values in each column have to all be of the same type (e.g. all numbers or all text).
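The one-type-per-column rule can be seen directly on a small data frame. The values below are made up for illustration. The sketch lists the type of each column and then shows what happens when numbers and text are mixed in a single vector.

```r
# A small, made-up data frame with three columns of different types
toy <- data.frame(
  id     = 1:3,
  gender = c("Female", "Male", "Female"),
  height = c(159, 192, 170),
  stringsAsFactors = FALSE   # keep text as text, whatever the R version
)

# One type per column
col_types <- sapply(toy, class)
col_types

# Mixing numbers and text in one vector converts everything to text
mixed <- c(159, "tall", 170)
class(mixed)
```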
The summary function will provide an overview of the contents of each column in the table:
summary(diet)
       id           gender        age            height      diet.type
 Min.   : 1.00   Female:43   Min.   :16.00   Min.   :141.0   A:24
 1st Qu.:19.75   Male  :33   1st Qu.:32.50   1st Qu.:163.8   B:25
 Median :40.50               Median :39.00   Median :169.0   C:27
 Mean   :39.87               Mean   :39.22   Mean   :170.8
 3rd Qu.:59.25               3rd Qu.:47.25   3rd Qu.:175.2
 Max.   :78.00               Max.   :60.00   Max.   :201.0
 initial.weight   final.weight
 Min.   :58.00   Min.   :53.00
 1st Qu.:66.00   1st Qu.:61.95
 Median :72.00   Median :68.95
 Mean   :72.29   Mean   :68.34
 3rd Qu.:78.00   3rd Qu.:73.67
 Max.   :88.00   Max.   :84.50
Individual columns can be accessed using the $ operator:
diet$gender
[1] Female Female Female Female Female Female Female Female Female Female
[11] Female Female Female Female Female Female Female Female Female Female
[21] Female Female Female Female Female Female Female Female Female Female
[31] Female Female Female Female Female Female Female Female Female Female
[41] Female Female Female Male Male Male Male Male Male Male
[51] Male Male Male Male Male Male Male Male Male Male
[61] Male Male Male Male Male Male Male Male Male Male
[71] Male Male Male Male Male Male
Levels: Female Male
diet$age
[1] 22 46 55 33 50 50 37 28 28 45 60 48 41 37 44 37 41 43 20 51 31 54 50
[24] 48 16 37 30 29 51 35 21 22 36 20 35 45 58 37 31 35 56 48 41 39 31 40
[47] 50 43 25 52 42 39 40 51 38 54 33 45 37 44 40 37 39 31 36 47 29 37 31
[70] 26 40 35 49 28 40 51
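We can create new columns based on existing ones. Note that, with the definition used here (final weight minus initial weight), a negative value of weight.loss means that weight was lost.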
diet$weight.loss <- diet$final.weight - diet$initial.weight
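Column arithmetic is applied row by row, which a few made-up values make easy to verify by hand. The sketch below uses invented weights. The second column, weight.lost, is a hypothetical alternative with the opposite sign, included only to show how the direction of the subtraction changes the interpretation.

```r
# Made-up initial and final weights for three people
toy <- data.frame(initial.weight = c(60, 103, 72),
                  final.weight   = c(54, 103, 74.5))

# Same definition as in the text: negative values mean weight was lost
toy$weight.loss <- toy$final.weight - toy$initial.weight

# Hypothetical alternative: positive values mean weight was lost
toy$weight.lost <- toy$initial.weight - toy$final.weight

toy
```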
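Subsetting rows and columns is done using the [rows, columns] syntax, where rows and columns are vectors containing the rows and columns you want. Leaving either side of the comma empty selects all rows or all columns, but the comma itself must still be included.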
diet[1:5,]
diet[,2:3]
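The [rows, columns] syntax can be explored on a small data frame where the result is easy to check. The values below are made up. The sketch also shows two common variations: selecting columns by name, and the fact that selecting a single column returns a plain vector and not a data frame.

```r
# A small, made-up data frame with six rows and three columns
toy <- data.frame(id     = 1:6,
                  age    = c(22, 46, 55, 33, 50, 37),
                  height = c(159, 192, 170, 171, 170, 201))

first_rows <- toy[1:3, ]                # rows 1 to 3, all columns
two_cols   <- toy[, 2:3]                # all rows, columns 2 and 3
by_name    <- toy[, c("id", "height")]  # columns selected by name
cell       <- toy[2, "age"]             # one value: row 2 of the age column
one_col    <- toy[, "age"]              # a single column comes back as a vector

dim(first_rows)
names(two_cols)
cell
```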
Logical tests can be used to select rows, e.g. using ==, < or >.
diet$diet.type == "A"
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[12] TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
[45] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE
[56] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[67] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
dietA <- diet[diet$diet.type == "A",]
dietA
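Logical tests can also be combined with & (and) and | (or). The sketch below uses a made-up data frame with the same column names as the diet data. Counting the TRUE values with sum is a quick way to see how many rows a test will keep before subsetting.

```r
# Made-up data: two people on each of three diets
toy <- data.frame(diet.type = c("A", "A", "B", "B", "C", "C"),
                  age       = c(22, 46, 55, 33, 50, 37),
                  stringsAsFactors = FALSE)

# A logical vector with one TRUE/FALSE per row
is_A <- toy$diet.type == "A"
sum(is_A)   # TRUE counts as 1, so this is the number of matching rows

# Both conditions must hold: diet A and older than 40
older_A <- toy[toy$diet.type == "A" & toy$age > 40, ]

# Either condition may hold: not diet A, or younger than 30
other_or_young <- toy[toy$diet.type != "A" | toy$age < 30, ]

older_A
nrow(other_or_young)
```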
All your favourite types of plot can be created in R, for example with boxplot, hist, barplot, … all of which are part of R's base graphics, alongside the basic plot function. Plots can be constructed from vectors of numeric data, such as the data we get from a particular column in a data frame.
hist(diet$weight.loss)
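As well as drawing a plot, hist returns an object describing the bins it used. The sketch below uses a short made-up vector and explicit breaks so the counts can be checked by hand. Setting plot = FALSE suppresses the drawing and returns only the object.

```r
# A short, made-up numeric vector
x <- c(1, 2, 2, 3, 3, 3, 4)

# Bins are (0,1], (1,2], (2,3], (3,4]; plot = FALSE skips the drawing
h <- hist(x, breaks = 0:4, plot = FALSE)

h$breaks   # bin boundaries
h$counts   # number of values falling in each bin

# hist(x, breaks = 0:4)  # the same call with the plot drawn
```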
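The ~ symbol is used to define the x and y variables: the variable to be plotted (y) goes on the left and the grouping variable (x) on the right.
boxplot(diet$weight.loss ~ diet$diet.type)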
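The same formula can be written with a data argument, which avoids repeating the data frame name. The sketch below uses made-up values for two diet groups and plot = FALSE, so that the statistics behind the boxplot can be inspected without drawing anything. The tapply line is an independent check of the group medians.

```r
# Made-up weight changes for two diet groups
toy <- data.frame(weight.loss = c(-4, -2, -3, -6, -5, -7),
                  diet.type   = c("A", "A", "A", "C", "C", "C"))

# Formula interface: y ~ group, with columns looked up in 'data'
b <- boxplot(weight.loss ~ diet.type, data = toy, plot = FALSE)

b$names        # one box per group
b$n            # number of observations in each group
b$stats[3, ]   # third row of the statistics holds the medians

# Independent check of the medians
group_medians <- tapply(toy$weight.loss, toy$diet.type, median)
group_medians
```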