The purpose of this section is to review some of the key concepts in basic R usage and statistical testing.
print("Hello World")
[1] "Hello World"
The document may also contain other formatting options that are used to render the HTML (or PDF, Word) output.
Here is some italic text, but we can also write in bold, or write things
We will use a dataset from The University of Sheffield Mathematics and Statistics Help group (MASH, https://www.sheffield.ac.uk/mash/statistics2/anova).
The data set Diet.csv contains information on 78 people who undertook one of three diets. There is background information such as age, gender (Female=0, Male=1) and height. The aim of the study was to see which diet was best for losing weight so the independent variable (group) is diet.
Like other software (Word, Excel, Photoshop, ...), R has a default location where it will save files to and import data from. This is known as the working directory in R. You can query what R currently considers its working directory by executing the following R command:
getwd()
[1] "/Volumes/Files/courses/cruk/LinearModelAndExtensions/git_linear-models-r"
N.B. Here, a set of open and closed brackets () is used to run the getwd function with no arguments.
Note: if you are following this material on a Windows machine, as opposed to a Linux or macOS machine, you will get a path beginning with a drive letter such as C:. If you want to use the complementary R command setwd() to set the working directory, you MUST escape each backslash in the path, i.e. setwd("C:\\Users\\Fred").
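One way to sidestep the escaping issue is to write paths with forward slashes, which R also accepts on Windows, or to build them with file.path(). The following is a minimal sketch: the folder names are hypothetical, and the setwd() call is commented out because the directory may not exist on your machine.

```r
# Build a path from its components; file.path() joins them with "/"
win_path <- file.path("C:", "Users", "Fred")   # hypothetical folder
print(win_path)

# The escaped form: each \\ in the code is a single backslash in the string
escaped <- "C:\\Users\\Fred"
cat(escaped, "\n")

# setwd(win_path)   # uncomment only if this directory exists on your machine
```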
We can also list the files in a specific directory with:-
list.files("data/")
[1] "amess.csv" "Assay.txt" "Bronchitis.csv" "clinicalTrials.txt"
[5] "crab.csv" "diet.csv" "genotypes.txt" "globalBreastCancerRisk.csv"
[9] "lactoferrin.csv" "myocardialinfarction.csv" "OscillationIndex.txt" "pollution.csv"
[13] "protein-expression.csv" "students.csv" "treatments.txt"
A useful sanity check is the file.exists function, which will print TRUE if the file can be found at the given path (relative to the working directory).
file.exists("data/diet.csv")
[1] TRUE
We can use the read.csv function to import the data. Other functions can be used to read tab-delimited files (read.delim), and there is also a generic read.table function. A data frame object is created.
The file name (here data/diet.csv) is the only argument to the function read.csv.
The characters <- are used to tell R to create a variable.
If you get the error Error in file(file, "rt") : cannot open the connection..., you might need to change your working directory or make sure the file name is typed correctly (R is case-sensitive).
diet <- read.csv("data/diet.csv")
diet
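The check-then-read pattern can be tried without the course files. The sketch below writes a tiny CSV to a temporary file and reads it back; the values and the variable names tmp and toy are made up for illustration and are not the real diet data.

```r
# Write a tiny CSV to a temporary file (made-up values, not the real diet data)
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,gender,age",
             "1,Female,22",
             "2,Male,46"), tmp)

# Check the file is there before trying to read it
if (file.exists(tmp)) {
  toy <- read.csv(tmp)
} else {
  stop("File not found: check the working directory and the spelling")
}

class(toy)   # the class of the imported object
dim(toy)     # number of rows, then number of columns
```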
If you are trying to read your own data, and encounter an error at this stage, you may need to consider if your data are in the correct form for analysis. Like most programming languages, R will struggle if your spreadsheet has been heavily formatted to include colours, formulas and special formatting.
These references will guide you through some of the pitfalls and common mistakes to avoid when formatting data
diet is an example of a data frame. The data frame object in R allows us to work with "tabular" data, like we might be used to dealing with in Excel, where our data can be thought of as having rows and columns. The values in each column all have to be of the same type (i.e. all numbers or all text).
The summary function will provide an overview of the contents of each column in the table.
summary(diet)
       id           gender               age            height       diet.type         initial.weight   final.weight
 Min.   : 1.00   Length:76          Min.   :16.00   Min.   :141.0   Length:76          Min.   :58.00   Min.   :53.00
 1st Qu.:19.75   Class :character   1st Qu.:32.50   1st Qu.:163.8   Class :character   1st Qu.:66.00   1st Qu.:61.95
 Median :40.50   Mode  :character   Median :39.00   Median :169.0   Mode  :character   Median :72.00   Median :68.95
 Mean   :39.87                      Mean   :39.22   Mean   :170.8                      Mean   :72.29   Mean   :68.34
 3rd Qu.:59.25                      3rd Qu.:47.25   3rd Qu.:175.2                      3rd Qu.:78.00   3rd Qu.:73.67
 Max.   :78.00                      Max.   :60.00   Max.   :201.0                      Max.   :88.00   Max.   :84.50
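For character columns such as gender and diet.type, summary only reports the length and class. A count of each distinct value is often more informative, and table is one way to get it. The sketch below uses a small made-up data frame (toy) rather than the real diet data.

```r
# A small made-up data frame with the same kinds of column as diet
toy <- data.frame(
  gender    = c("Female", "Female", "Male", "Male", "Male"),
  diet.type = c("A", "B", "A", "C", "C"),
  age       = c(22, 46, 55, 33, 50)
)

str(toy)        # compact description: column names, types and first values
head(toy, 3)    # first three rows only

# Count how many rows fall in each diet type
counts <- table(toy$diet.type)
counts
```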
Individual columns of a data frame can be accessed using the $ operator.
diet$gender
[1] "Female" "Female" "Female" "Female" "Female" "Female" "Female" "Female" "Female" "Female" "Female" "Female" "Female" "Female" "Female"
[16] "Female" "Female" "Female" "Female" "Female" "Female" "Female" "Female" "Female" "Female" "Female" "Female" "Female" "Female" "Female"
[31] "Female" "Female" "Female" "Female" "Female" "Female" "Female" "Female" "Female" "Female" "Female" "Female" "Female" "Male" "Male"
[46] "Male" "Male" "Male" "Male" "Male" "Male" "Male" "Male" "Male" "Male" "Male" "Male" "Male" "Male" "Male"
[61] "Male" "Male" "Male" "Male" "Male" "Male" "Male" "Male" "Male" "Male" "Male" "Male" "Male" "Male" "Male"
[76] "Male"
diet$age
[1] 22 46 55 33 50 50 37 28 28 45 60 48 41 37 44 37 41 43 20 51 31 54 50 48 16 37 30 29 51 35 21 22 36 20 35 45 58 37 31 35 56 48 41 39 31
[46] 40 50 43 25 52 42 39 40 51 38 54 33 45 37 44 40 37 39 31 36 47 29 37 31 26 40 35 49 28 40 51
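We can create new columns based on existing ones. Here the new column is the final weight minus the initial weight, so a negative value indicates that weight was lost.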
diet$weight.loss <- diet$final.weight - diet$initial.weight
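The sign convention is worth checking on a few numbers you can verify by hand. This sketch uses made-up weights; the column names weight.change and weight.lost are illustrative and are not used elsewhere in this material.

```r
# Made-up weights in kg for three people
toy <- data.frame(
  initial.weight = c(60, 72, 85),
  final.weight   = c(57, 70, 86)
)

# Final minus initial: negative when weight was lost
toy$weight.change <- toy$final.weight - toy$initial.weight

# Initial minus final: positive when weight was lost
toy$weight.lost <- toy$initial.weight - toy$final.weight

toy
mean(toy$weight.lost)   # average amount lost across the three people
```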
Subsetting rows and columns is done using the [rows, columns] syntax, where rows and columns are vectors containing the rows and columns you want. Leaving either one blank selects all rows or all columns, but the comma must still be included.
diet[1:5,]
diet[,2:3]
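Columns can also be selected by name instead of position, which tends to be more robust if the column order changes. A short sketch on a made-up data frame (toy):

```r
toy <- data.frame(
  id     = 1:4,
  gender = c("Female", "Female", "Male", "Male"),
  age    = c(22, 46, 55, 33)
)

first_two <- toy[1:2, ]                 # rows 1 and 2, all columns
by_name   <- toy[, c("gender", "age")]  # all rows, two columns chosen by name
one_cell  <- toy[3, "age"]              # a single value: row 3 of the age column
ages      <- toy[, "age"]               # one column comes back as a plain vector

first_two
by_name
one_cell
ages
```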
Logical tests can be used to select rows, e.g. using ==, < or >.
diet$diet.type == "A"
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
[45] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[67] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
dietA <- diet[diet$diet.type == "A",]
dietA
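Logical tests can be combined with & (and) and | (or), negated with !=, or written with %in% to match several values at once. The sketch below uses made-up data, and the object names are illustrative.

```r
toy <- data.frame(
  diet.type = c("A", "A", "B", "C", "C"),
  age       = c(22, 46, 55, 33, 50)
)

# A logical vector: one TRUE or FALSE per row
is_A <- toy$diet.type == "A"
sum(is_A)   # TRUE counts as 1, so this is the number of rows on diet A

# Both conditions must hold
older_A <- toy[toy$diet.type == "A" & toy$age > 40, ]

# Everything except diet A, written in two equivalent ways
not_A  <- toy[toy$diet.type != "A", ]
b_or_c <- toy[toy$diet.type %in% c("B", "C"), ]

older_A
not_A
```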
All your favourite types of plot can be created in R: boxplot, hist, barplot, ... all of which are extensions of the basic plot function.
Plots can be constructed from vectors of numeric data, such as the data we get from a particular column in a data frame.
hist(diet$weight.loss)
The ~ symbol is used to define the x and y variables, written in the form y ~ x.
boxplot(diet$weight.loss~diet$diet.type)
A scatter plot of two numeric vectors can be drawn with the plot function.
plot(diet$age,diet$initial.weight)
Lots of customisations are possible to enhance the appearance of our plots. Although they are not for the faint-hearted, the help pages ?plot and ?par give the full details. In short:
Axis labels and titles can be specified as character strings.
Colours can be specified by name; the full list of recognised names is given by colours(), or check this online reference.
Plotting characters can be specified using a pre-defined number.
boxplot(diet$weight.loss~diet$diet.type,
ylab="Weight Loss",
xlab="Diet Type",
col=c("yellow","blue","red"),
main="Weight Loss According to diet type")
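To keep a plot for a report, one option is to open a file-based graphics device, draw the plot, and then close the device. The sketch below uses pdf() with made-up data; the file name toy_boxplot.pdf and the data frame toy are illustrative. It also passes the data frame through the data argument so the formula can refer to column names directly.

```r
# Made-up weight changes for three diet types
toy <- data.frame(
  weight.loss = c(-3, -2, 1, -4, -5, -1),
  diet.type   = c("A", "A", "B", "B", "C", "C")
)

# Open a PDF device writing to a temporary location
out_file <- file.path(tempdir(), "toy_boxplot.pdf")
pdf(out_file)

boxplot(weight.loss ~ diet.type, data = toy,
        ylab = "Weight Loss",
        xlab = "Diet Type",
        col  = c("yellow", "blue", "red"),
        main = "Weight Loss According to diet type")

# Close the device; the file is only complete after this call
dev.off()

file.exists(out_file)
```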
You can get help on any of the functions that we will be using in this course by using the '?' or 'help()' commands. The help will appear in the help pane (usually the bottom right-hand corner).
?lm
help(lm)
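If you do not remember the exact name of a function, apropos can list the objects whose names match a pattern, and args shows the arguments a function accepts. This is a small sketch of that workflow; the search phrase in the commented line is just an example.

```r
# List every visible object whose name starts with "read."
read_functions <- apropos("^read\\.")
read_functions

# Show the arguments of a function and their defaults
args(read.csv)

# Search the installed help pages by topic rather than by function name
# help.search("linear model")
```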