1. How many cars in the mtcars
data set have 8 cylinders?
Write an R instruction to calculate this.
sum(mtcars$cyl == 8)
## [1] 14
2. Print out a subset of the starwars
tibble containing just the characters with blond hair and blue eyes
library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✔ ggplot2 3.2.1 ✔ purrr 0.3.3
## ✔ tibble 2.1.3 ✔ dplyr 0.8.3
## ✔ tidyr 1.0.0 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ──────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
table(starwars$hair_color)
##
## auburn auburn, grey auburn, white black blond
## 1 1 1 13 3
## blonde brown brown, grey grey none
## 1 18 1 1 37
## unknown white
## 1 4
table(starwars$eye_color)
##
## black blue blue-gray brown dark
## 10 19 1 21 1
## gold green, yellow hazel orange pink
## 1 1 3 8 1
## red red, blue unknown white yellow
## 5 1 3 1 11
starwars[(starwars$hair_color == "blond" | starwars$hair_color == "blonde") & starwars$eye_color == "blue", ]
## # A tibble: 3 x 13
## name height mass hair_color skin_color eye_color birth_year gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 Luke… 172 77 blond fair blue 19 male
## 2 Anak… 188 84 blond fair blue 41.9 male
## 3 Fini… 170 NA blond fair blue 91 male
## # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
3. Convert the values for the heights of characters in the starwars
tibble from centimetres to metres
typeof(starwars)
## [1] "list"
starwars$height <- starwars$height / 100
starwars
## # A tibble: 87 x 13
## name height mass hair_color skin_color eye_color birth_year gender
## <chr> <dbl> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 Luke… 1.72 77 blond fair blue 19 male
## 2 C-3PO 1.67 75 <NA> gold yellow 112 <NA>
## 3 R2-D2 0.96 32 <NA> white, bl… red 33 <NA>
## 4 Dart… 2.02 136 none white yellow 41.9 male
## 5 Leia… 1.5 49 brown light brown 19 female
## 6 Owen… 1.78 120 brown, gr… light blue 52 male
## 7 Beru… 1.65 75 brown light blue 47 female
## 8 R5-D4 0.97 32 <NA> white, red red NA <NA>
## 9 Bigg… 1.83 84 black light brown 24 male
## 10 Obi-… 1.82 77 auburn, w… fair blue-gray 57 male
## # … with 77 more rows, and 5 more variables: homeworld <chr>,
## # species <chr>, films <list>, vehicles <list>, starships <list>
typeof(starwars)
## [1] "list"
What happened to the type of the height column?
Convert the height column back to metres again and see if you can work out how to change it back to its original type.
# type of height column changed from integer to double
# convert back to centimetres
starwars$height <- starwars$height * 100
typeof(starwars)
## [1] "list"
# after converting back the height column is still a double
# converting back to an integer
starwars$height <- as.integer(starwars$height)
starwars
## # A tibble: 87 x 13
## name height mass hair_color skin_color eye_color birth_year gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 Luke… 172 77 blond fair blue 19 male
## 2 C-3PO 167 75 <NA> gold yellow 112 <NA>
## 3 R2-D2 96 32 <NA> white, bl… red 33 <NA>
## 4 Dart… 202 136 none white yellow 41.9 male
## 5 Leia… 150 49 brown light brown 19 female
## 6 Owen… 178 120 brown, gr… light blue 52 male
## 7 Beru… 165 75 brown light blue 47 female
## 8 R5-D4 97 32 <NA> white, red red NA <NA>
## 9 Bigg… 183 84 black light brown 24 male
## 10 Obi-… 182 77 auburn, w… fair blue-gray 57 male
## # … with 77 more rows, and 5 more variables: homeworld <chr>,
## # species <chr>, films <list>, vehicles <list>, starships <list>
typeof(starwars)
## [1] "list"
4. Use the sample()
function to simulate the rolling of a 6-sided die
If you look at the help page for sample
you’ll find there are two functions with the same name, one in the base
package and the other in the dplyr
package. The sample
function in the base
package is the one you need for this exercise.
Count the number of sixes you get for 1000 rolls?
six_sided_die <- 1:6
rolls <- sample(six_sided_die, size = 1000, replace = TRUE)
sum(rolls == 6)
## [1] 166
What number would you expect to get? Calculate this in R.
1000 / 6
## [1] 166.6667
Re-run the simulation a few times by clicking on the green ‘Run current chunk’ button for the code you’ve written. What do you notice?
Re-run the simulation with a much larger sample size. Check that the number of sixes you obtain is much closer to the expected value.
# re-running the simulation gives different numbers of sixes each time
# re-running for larger sample size
number_of_rolls <- 100000
number_of_rolls / 6
## [1] 16666.67
rolls <- sample(1:6, size = number_of_rolls, replace = TRUE)
sum(rolls == 6)
## [1] 16524
5. Simulate the rolling of a weighted six-sided die
Imagine we’re running a casino and we have some weighted dice in which the probability of throwing a six is a bit lower than it should be if all is fair and square. Let’s say the probability of a throwing a six is 0.1 while the other 5 numbers all have equal probability.
Calculate the probability for throwing each of the other numbers and then, referring again at the help page for sample()
, re-run the simulation using these probabilities. See how many sixes you get.
prob_six <- 0.1
prob_not_six <- 1.0 - prob_six
prob_other <- prob_not_six / 5
number_of_rolls <- 100000
number_of_rolls * prob_six
## [1] 10000
# long-winded
#rolls <- sample(1:6, size = number_of_rolls, replace = TRUE, prob = c(prob_other, prob_other, prob_other, prob_other, prob_other, prob_six))
# using rep() instead
rolls <- sample(1:6, size = number_of_rolls, replace = TRUE, prob = c(rep(prob_other, 5), prob_six))
sum(rolls == 6)
## [1] 9956
Hint: use the rep()
function to save some typing.
6. How many samples in the METABRIC dataset have high cellularity but have no recorded classification with the 3-gene classifier?
metabric <- read_tsv("brca_metabric_clinical_data.tsv")
## Parsed with column specification:
## cols(
## .default = col_character(),
## `Mutation Count` = col_double(),
## `Age at Diagnosis` = col_double(),
## Cohort = col_double(),
## `Lymph nodes examined positive` = col_double(),
## `Neoplasm Histologic Grade` = col_double(),
## `Nottingham prognostic index` = col_double()
## )
## See spec(...) for full column specifications.
metabric
## # A tibble: 2,509 x 20
## `Patient ID` `Sample ID` `Cancer Type` `Cancer Type De… `Mutation Count`
## <chr> <chr> <chr> <chr> <dbl>
## 1 MB-0000 MB-0000 Breast Cancer Breast Invasive… NA
## 2 MB-0002 MB-0002 Breast Cancer Breast Invasive… 2
## 3 MB-0005 MB-0005 Breast Cancer Breast Invasive… 2
## 4 MB-0006 MB-0006 Breast Cancer Breast Mixed Du… 1
## 5 MB-0008 MB-0008 Breast Cancer Breast Mixed Du… 2
## 6 MB-0010 MB-0010 Breast Cancer Breast Invasive… 4
## 7 MB-0014 MB-0014 Breast Cancer Breast Invasive… 4
## 8 MB-0020 MB-0020 Breast Cancer Breast Invasive… NA
## 9 MB-0022 MB-0022 Breast Cancer Breast Mixed Du… 1
## 10 MB-0025 MB-0025 Breast Cancer Breast Invasive… 5
## # … with 2,499 more rows, and 15 more variables: `3-Gene classifier
## # subtype` <chr>, `Age at Diagnosis` <dbl>, Cellularity <chr>,
## # Chemotherapy <chr>, Cohort <dbl>, `ER Status` <chr>, `ER status
## # measured by IHC` <chr>, `HER2 Status` <chr>, `HER2 status measured by
## # SNP6` <chr>, `Hormone Therapy` <chr>, `Inferred Menopausal
## # State` <chr>, `Integrative Cluster` <chr>, `Lymph nodes examined
## # positive` <dbl>, `Neoplasm Histologic Grade` <dbl>, `Nottingham
## # prognostic index` <dbl>
table(metabric$Cellularity)
##
## High Low Moderate
## 965 215 737
table(metabric$`3-Gene classifier subtype`)
##
## ER-/HER2- ER+/HER2- High Prolif ER+/HER2- Low Prolif
## 309 617 640
## HER2+
## 198
sum(metabric$Cellularity == "High")
## [1] NA
sum(metabric$Cellularity == "High" & !is.na(metabric$Cellularity))
## [1] 965
sum(metabric$Cellularity == "High", na.rm = TRUE)
## [1] 965
sum(is.na(metabric$`3-Gene classifier subtype`))
## [1] 745
sum(metabric$Cellularity == "High" & !is.na(metabric$Cellularity) & is.na(metabric$`3-Gene classifier subtype`))
## [1] 102
sum(metabric$Cellularity == "High" & is.na(metabric$`3-Gene classifier subtype`), na.rm = TRUE)
## [1] 102
# note dplyr filter that we'll introduce in a couple of weeks is much easier and
# handles the missing value more like how we'd expect
Hint: this is actually quite tricky – you may want to check for missing values in the cellularity column as well as the column for the 3-gene classifier.
7. Read into R one of your own Excel spreadsheets and explore the data using summary()
, table()
and/or plot()
We haven’t yet showed how to read Excel spreadsheets into R. There are more than package providing this capability but for this exercise use the readxl
package from the tidyverse.
readxl
should have been installed as part of the tidyverse but if you find that’s not the case, you will have to install it separately.
readxl
is not loaded as one of the core tidyverse packages when you run library(tidyverse)
so you’ll have to load this separately.
If you don’t have an Excel spreadsheet to hand, you can use the clinical annotation data for the NCI-60 cell line data set (https://bioinformatics-core-shared-training.github.io/r-intro/data/cellline_nci60_clinical_data.xlsx).
library(readxl)
nci60 <- read_excel("cellline_nci60_clinical_data.xlsx", na = "NA")
# need to comment on the na argument when reading data in (blanks assigned to NA
# by default, not "NA" strings)
nci60
## # A tibble: 67 x 20
## `Study ID` `Patient ID` `Sample ID` `Diagnosis Age` `Cancer Type`
## <chr> <chr> <chr> <dbl> <chr>
## 1 cellline_… 786_0 786_0 58 Renal
## 2 cellline_… A498 A498 52 Renal
## 3 cellline_… A549 A549 58 Non-Small Ce…
## 4 cellline_… ACHN ACHN 22 Renal
## 5 cellline_… BT_549 BT_549 72 Breast
## 6 cellline_… CAKI_1 CAKI_1 49 Renal
## 7 cellline_… CCRF_CEM CCRF_CEM 4 Myeloid Neop…
## 8 cellline_… COLO205 COLO205 70 Colon
## 9 cellline_… CSF_268 CSF_268 NA <NA>
## 10 cellline_… CSF_295 CSF_295 NA <NA>
## # … with 57 more rows, and 15 more variables: `Cancer Type
## # Detailed` <chr>, `Category of Sample` <chr>, `Doubling Time
## # (hrs)` <dbl>, Epithelial <chr>, `Fraction Genome Altered` <dbl>,
## # `Tumor Other Histologic Subtype` <chr>, MDR <dbl>, `Mutation
## # Count` <dbl>, `Oncotree Code` <chr>, P53 <chr>, Ploidy <chr>, `Prior
## # Treatment` <chr>, `Number of Samples Per Patient` <dbl>, `Sample
## # Type` <chr>, Sex <chr>
# summary of each variable
summary(nci60)
## Study ID Patient ID Sample ID Diagnosis Age
## Length:67 Length:67 Length:67 Min. : 4.00
## Class :character Class :character Class :character 1st Qu.:43.00
## Mode :character Mode :character Mode :character Median :52.00
## Mean :49.82
## 3rd Qu.:62.00
## Max. :75.00
## NA's :22
## Cancer Type Cancer Type Detailed Category of Sample
## Length:67 Length:67 Length:67
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## Doubling Time (hrs) Epithelial Fraction Genome Altered
## Min. :17.40 Length:67 Min. :0.0117
## 1st Qu.:25.70 Class :character 1st Qu.:0.1815
## Median :32.70 Mode :character Median :0.2665
## Mean :35.24 Mean :0.2729
## 3rd Qu.:41.75 3rd Qu.:0.3563
## Max. :79.50 Max. :0.7609
## NA's :7 NA's :7
## Tumor Other Histologic Subtype MDR Mutation Count
## Length:67 Min. :-86.0 Min. : 117.0
## Class :character 1st Qu.: -4.0 1st Qu.: 189.5
## Mode :character Median : 11.0 Median : 276.0
## Mean : 24.1 Mean : 567.7
## 3rd Qu.: 23.5 3rd Qu.: 439.5
## Max. :414.0 Max. :5551.0
## NA's :8 NA's :7
## Oncotree Code P53 Ploidy
## Length:67 Length:67 Length:67
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## Prior Treatment Number of Samples Per Patient Sample Type
## Length:67 Min. :1 Length:67
## Class :character 1st Qu.:1 Class :character
## Mode :character Median :1 Mode :character
## Mean :1
## 3rd Qu.:1
## Max. :1
##
## Sex
## Length:67
## Class :character
## Mode :character
##
##
##
##
# note that categorical variables are not factors so would need converting
# using as.factor() or for this to be specified when reading the data in
# nevertheless we can run table() on these
table(nci60$`Cancer Type`)
##
## Breast CNS Colon
## 5 6 7
## Melanoma Myeloid Neoplasm Non-Small Cell Lung
## 10 6 9
## Ovarian Prostate Renal
## 7 2 8
# scatter plot to see if there is a relationship between the number of point
# mutations and the fraction of the genome altered
plot(nci60$`Mutation Count`, nci60$`Fraction Genome Altered`)
Click on the ‘Knit’ menu at the top of this file and select either whichever option you prefer to create an HTML, PDF or Word document version of your assignment. This will run all the code chunks and “knit” the resulting results with the surrounding text to produce a report.
Emphasis
Can use either underscores or asterisks and combination of both for both bold and italic emphasis.
*italic* **bold**
_italic_ __bold__
Inline Code
Use of backticks (`).
`mean(starwar$height)`
Horizontal Rule
Three of more astisks or dashes.
*******
-------
Plain Code Blocks
Plain code blocks are displayed in a fixed-width font but not evaluated.
This text is displayed verbatim / preformatted
Have a look at examples of these in the text parts of this R markdown document and how these are rendered when you run the knit operation.
For more features, see the help page in RStudio that can be accessed from the the help menu: Help > Markdown Quick Reference