Working with vectors, data frames and tibbles

1. How many cars in the mtcars data set have 8 cylinders?

Write an R instruction to calculate this.

sum(mtcars$cyl == 8)
## [1] 14

2. Print out a subset of the starwars tibble containing just the characters with blond hair and blue eyes

library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✔ ggplot2 3.2.1     ✔ purrr   0.3.3
## ✔ tibble  2.1.3     ✔ dplyr   0.8.3
## ✔ tidyr   1.0.0     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0
## ── Conflicts ──────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
table(starwars$hair_color)
## 
##        auburn  auburn, grey auburn, white         black         blond 
##             1             1             1            13             3 
##        blonde         brown   brown, grey          grey          none 
##             1            18             1             1            37 
##       unknown         white 
##             1             4
table(starwars$eye_color)
## 
##         black          blue     blue-gray         brown          dark 
##            10            19             1            21             1 
##          gold green, yellow         hazel        orange          pink 
##             1             1             3             8             1 
##           red     red, blue       unknown         white        yellow 
##             5             1             3             1            11
starwars[(starwars$hair_color == "blond" | starwars$hair_color == "blonde") & starwars$eye_color == "blue", ]
## # A tibble: 3 x 13
##   name  height  mass hair_color skin_color eye_color birth_year gender
##   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
## 1 Luke…    172    77 blond      fair       blue            19   male  
## 2 Anak…    188    84 blond      fair       blue            41.9 male  
## 3 Fini…    170    NA blond      fair       blue            91   male  
## # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

3. Convert the values for the heights of characters in the starwars tibble from centimetres to metres

typeof(starwars)
## [1] "list"
starwars$height <- starwars$height / 100
starwars
## # A tibble: 87 x 13
##    name  height  mass hair_color skin_color eye_color birth_year gender
##    <chr>  <dbl> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
##  1 Luke…   1.72    77 blond      fair       blue            19   male  
##  2 C-3PO   1.67    75 <NA>       gold       yellow         112   <NA>  
##  3 R2-D2   0.96    32 <NA>       white, bl… red             33   <NA>  
##  4 Dart…   2.02   136 none       white      yellow          41.9 male  
##  5 Leia…   1.5     49 brown      light      brown           19   female
##  6 Owen…   1.78   120 brown, gr… light      blue            52   male  
##  7 Beru…   1.65    75 brown      light      blue            47   female
##  8 R5-D4   0.97    32 <NA>       white, red red             NA   <NA>  
##  9 Bigg…   1.83    84 black      light      brown           24   male  
## 10 Obi-…   1.82    77 auburn, w… fair       blue-gray       57   male  
## # … with 77 more rows, and 5 more variables: homeworld <chr>,
## #   species <chr>, films <list>, vehicles <list>, starships <list>
typeof(starwars)
## [1] "list"

What happened to the type of the height column?

Convert the height column back to metres again and see if you can work out how to change it back to its original type.

# type of height column changed from integer to double

# convert back to centimetres
starwars$height <- starwars$height * 100
typeof(starwars)
## [1] "list"
# after converting back the height column is still a double
# converting back to an integer

starwars$height <- as.integer(starwars$height)
starwars
## # A tibble: 87 x 13
##    name  height  mass hair_color skin_color eye_color birth_year gender
##    <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
##  1 Luke…    172    77 blond      fair       blue            19   male  
##  2 C-3PO    167    75 <NA>       gold       yellow         112   <NA>  
##  3 R2-D2     96    32 <NA>       white, bl… red             33   <NA>  
##  4 Dart…    202   136 none       white      yellow          41.9 male  
##  5 Leia…    150    49 brown      light      brown           19   female
##  6 Owen…    178   120 brown, gr… light      blue            52   male  
##  7 Beru…    165    75 brown      light      blue            47   female
##  8 R5-D4     97    32 <NA>       white, red red             NA   <NA>  
##  9 Bigg…    183    84 black      light      brown           24   male  
## 10 Obi-…    182    77 auburn, w… fair       blue-gray       57   male  
## # … with 77 more rows, and 5 more variables: homeworld <chr>,
## #   species <chr>, films <list>, vehicles <list>, starships <list>
typeof(starwars)
## [1] "list"

Simulation: rolling 6-sided dice

4. Use the sample() function to simulate the rolling of a 6-sided die

If you look at the help page for sample you’ll find there are two functions with the same name, one in the base package and the other in the dplyr package. The sample function in the base package is the one you need for this exercise.

Count the number of sixes you get for 1000 rolls?

six_sided_die <- 1:6
rolls <- sample(six_sided_die, size = 1000, replace = TRUE)
sum(rolls == 6)
## [1] 166

What number would you expect to get? Calculate this in R.

1000 / 6
## [1] 166.6667

Re-run the simulation a few times by clicking on the green ‘Run current chunk’ button for the code you’ve written. What do you notice?

Re-run the simulation with a much larger sample size. Check that the number of sixes you obtain is much closer to the expected value.

# re-running the simulation gives different numbers of sixes each time

# re-running for larger sample size
number_of_rolls <- 100000
number_of_rolls / 6
## [1] 16666.67
rolls <- sample(1:6, size = number_of_rolls, replace = TRUE)
sum(rolls == 6)
## [1] 16524

5. Simulate the rolling of a weighted six-sided die

Imagine we’re running a casino and we have some weighted dice in which the probability of throwing a six is a bit lower than it should be if all is fair and square. Let’s say the probability of a throwing a six is 0.1 while the other 5 numbers all have equal probability.

Calculate the probability for throwing each of the other numbers and then, referring again at the help page for sample(), re-run the simulation using these probabilities. See how many sixes you get.

prob_six <- 0.1
prob_not_six <- 1.0 - prob_six
prob_other <- prob_not_six / 5

number_of_rolls <- 100000
number_of_rolls * prob_six
## [1] 10000
# long-winded
#rolls <- sample(1:6, size = number_of_rolls, replace = TRUE, prob = c(prob_other, prob_other, prob_other, prob_other, prob_other, prob_six))

# using rep() instead
rolls <- sample(1:6, size = number_of_rolls, replace = TRUE, prob = c(rep(prob_other, 5), prob_six))
sum(rolls == 6)
## [1] 9956

Hint: use the rep() function to save some typing.


Clinical cancer data

6. How many samples in the METABRIC dataset have high cellularity but have no recorded classification with the 3-gene classifier?

metabric <- read_tsv("brca_metabric_clinical_data.tsv")
## Parsed with column specification:
## cols(
##   .default = col_character(),
##   `Mutation Count` = col_double(),
##   `Age at Diagnosis` = col_double(),
##   Cohort = col_double(),
##   `Lymph nodes examined positive` = col_double(),
##   `Neoplasm Histologic Grade` = col_double(),
##   `Nottingham prognostic index` = col_double()
## )
## See spec(...) for full column specifications.
metabric
## # A tibble: 2,509 x 20
##    `Patient ID` `Sample ID` `Cancer Type` `Cancer Type De… `Mutation Count`
##    <chr>        <chr>       <chr>         <chr>                       <dbl>
##  1 MB-0000      MB-0000     Breast Cancer Breast Invasive…               NA
##  2 MB-0002      MB-0002     Breast Cancer Breast Invasive…                2
##  3 MB-0005      MB-0005     Breast Cancer Breast Invasive…                2
##  4 MB-0006      MB-0006     Breast Cancer Breast Mixed Du…                1
##  5 MB-0008      MB-0008     Breast Cancer Breast Mixed Du…                2
##  6 MB-0010      MB-0010     Breast Cancer Breast Invasive…                4
##  7 MB-0014      MB-0014     Breast Cancer Breast Invasive…                4
##  8 MB-0020      MB-0020     Breast Cancer Breast Invasive…               NA
##  9 MB-0022      MB-0022     Breast Cancer Breast Mixed Du…                1
## 10 MB-0025      MB-0025     Breast Cancer Breast Invasive…                5
## # … with 2,499 more rows, and 15 more variables: `3-Gene classifier
## #   subtype` <chr>, `Age at Diagnosis` <dbl>, Cellularity <chr>,
## #   Chemotherapy <chr>, Cohort <dbl>, `ER Status` <chr>, `ER status
## #   measured by IHC` <chr>, `HER2 Status` <chr>, `HER2 status measured by
## #   SNP6` <chr>, `Hormone Therapy` <chr>, `Inferred Menopausal
## #   State` <chr>, `Integrative Cluster` <chr>, `Lymph nodes examined
## #   positive` <dbl>, `Neoplasm Histologic Grade` <dbl>, `Nottingham
## #   prognostic index` <dbl>
table(metabric$Cellularity)
## 
##     High      Low Moderate 
##      965      215      737
table(metabric$`3-Gene classifier subtype`)
## 
##             ER-/HER2- ER+/HER2- High Prolif  ER+/HER2- Low Prolif 
##                   309                   617                   640 
##                 HER2+ 
##                   198
sum(metabric$Cellularity == "High")
## [1] NA
sum(metabric$Cellularity == "High" & !is.na(metabric$Cellularity))
## [1] 965
sum(metabric$Cellularity == "High", na.rm = TRUE)
## [1] 965
sum(is.na(metabric$`3-Gene classifier subtype`))
## [1] 745
sum(metabric$Cellularity == "High" & !is.na(metabric$Cellularity) & is.na(metabric$`3-Gene classifier subtype`))
## [1] 102
sum(metabric$Cellularity == "High" & is.na(metabric$`3-Gene classifier subtype`), na.rm = TRUE)
## [1] 102
# note dplyr filter that we'll introduce in a couple of weeks is much easier and
# handles the missing value more like how we'd expect

Hint: this is actually quite tricky – you may want to check for missing values in the cellularity column as well as the column for the 3-gene classifier.


Working with data from an Excel spreadsheet

7. Read into R one of your own Excel spreadsheets and explore the data using summary(), table() and/or plot()

We haven’t yet showed how to read Excel spreadsheets into R. There are more than package providing this capability but for this exercise use the readxl package from the tidyverse.

readxl should have been installed as part of the tidyverse but if you find that’s not the case, you will have to install it separately.

readxl is not loaded as one of the core tidyverse packages when you run library(tidyverse) so you’ll have to load this separately.

If you don’t have an Excel spreadsheet to hand, you can use the clinical annotation data for the NCI-60 cell line data set (https://bioinformatics-core-shared-training.github.io/r-intro/data/cellline_nci60_clinical_data.xlsx).

library(readxl)

nci60 <- read_excel("cellline_nci60_clinical_data.xlsx", na = "NA")
# need to comment on the na argument when reading data in (blanks assigned to NA
# by default, not "NA" strings)

nci60
## # A tibble: 67 x 20
##    `Study ID` `Patient ID` `Sample ID` `Diagnosis Age` `Cancer Type`
##    <chr>      <chr>        <chr>                 <dbl> <chr>        
##  1 cellline_… 786_0        786_0                    58 Renal        
##  2 cellline_… A498         A498                     52 Renal        
##  3 cellline_… A549         A549                     58 Non-Small Ce…
##  4 cellline_… ACHN         ACHN                     22 Renal        
##  5 cellline_… BT_549       BT_549                   72 Breast       
##  6 cellline_… CAKI_1       CAKI_1                   49 Renal        
##  7 cellline_… CCRF_CEM     CCRF_CEM                  4 Myeloid Neop…
##  8 cellline_… COLO205      COLO205                  70 Colon        
##  9 cellline_… CSF_268      CSF_268                  NA <NA>         
## 10 cellline_… CSF_295      CSF_295                  NA <NA>         
## # … with 57 more rows, and 15 more variables: `Cancer Type
## #   Detailed` <chr>, `Category of Sample` <chr>, `Doubling Time
## #   (hrs)` <dbl>, Epithelial <chr>, `Fraction Genome Altered` <dbl>,
## #   `Tumor Other Histologic Subtype` <chr>, MDR <dbl>, `Mutation
## #   Count` <dbl>, `Oncotree Code` <chr>, P53 <chr>, Ploidy <chr>, `Prior
## #   Treatment` <chr>, `Number of Samples Per Patient` <dbl>, `Sample
## #   Type` <chr>, Sex <chr>
# summary of each variable
summary(nci60)
##    Study ID          Patient ID         Sample ID         Diagnosis Age  
##  Length:67          Length:67          Length:67          Min.   : 4.00  
##  Class :character   Class :character   Class :character   1st Qu.:43.00  
##  Mode  :character   Mode  :character   Mode  :character   Median :52.00  
##                                                           Mean   :49.82  
##                                                           3rd Qu.:62.00  
##                                                           Max.   :75.00  
##                                                           NA's   :22     
##  Cancer Type        Cancer Type Detailed Category of Sample
##  Length:67          Length:67            Length:67         
##  Class :character   Class :character     Class :character  
##  Mode  :character   Mode  :character     Mode  :character  
##                                                            
##                                                            
##                                                            
##                                                            
##  Doubling Time (hrs)  Epithelial        Fraction Genome Altered
##  Min.   :17.40       Length:67          Min.   :0.0117         
##  1st Qu.:25.70       Class :character   1st Qu.:0.1815         
##  Median :32.70       Mode  :character   Median :0.2665         
##  Mean   :35.24                          Mean   :0.2729         
##  3rd Qu.:41.75                          3rd Qu.:0.3563         
##  Max.   :79.50                          Max.   :0.7609         
##  NA's   :7                              NA's   :7              
##  Tumor Other Histologic Subtype      MDR        Mutation Count  
##  Length:67                      Min.   :-86.0   Min.   : 117.0  
##  Class :character               1st Qu.: -4.0   1st Qu.: 189.5  
##  Mode  :character               Median : 11.0   Median : 276.0  
##                                 Mean   : 24.1   Mean   : 567.7  
##                                 3rd Qu.: 23.5   3rd Qu.: 439.5  
##                                 Max.   :414.0   Max.   :5551.0  
##                                 NA's   :8       NA's   :7       
##  Oncotree Code          P53               Ploidy         
##  Length:67          Length:67          Length:67         
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##  Prior Treatment    Number of Samples Per Patient Sample Type       
##  Length:67          Min.   :1                     Length:67         
##  Class :character   1st Qu.:1                     Class :character  
##  Mode  :character   Median :1                     Mode  :character  
##                     Mean   :1                                       
##                     3rd Qu.:1                                       
##                     Max.   :1                                       
##                                                                     
##      Sex           
##  Length:67         
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 
# note that categorical variables are not factors so would need converting
# using as.factor() or for this to be specified when reading the data in
# nevertheless we can run table() on these
table(nci60$`Cancer Type`)
## 
##              Breast                 CNS               Colon 
##                   5                   6                   7 
##            Melanoma    Myeloid Neoplasm Non-Small Cell Lung 
##                  10                   6                   9 
##             Ovarian            Prostate               Renal 
##                   7                   2                   8
# scatter plot to see if there is a relationship between the number of point
# mutations and the fraction of the genome altered
plot(nci60$`Mutation Count`, nci60$`Fraction Genome Altered`)


Create a report for your assignment

Click on the ‘Knit’ menu at the top of this file and select either whichever option you prefer to create an HTML, PDF or Word document version of your assignment. This will run all the code chunks and “knit” the resulting results with the surrounding text to produce a report.

Additional markdown features

Emphasis

Can use either underscores or asterisks and combination of both for both bold and italic emphasis.

*italic*   **bold**
_italic_   __bold__

Inline Code

Use of backticks (`).

`mean(starwar$height)`

Horizontal Rule

Three of more astisks or dashes.

*******

-------

Plain Code Blocks

Plain code blocks are displayed in a fixed-width font but not evaluated.

This text is displayed verbatim / preformatted

Have a look at examples of these in the text parts of this R markdown document and how these are rendered when you run the knit operation.

For more features, see the help page in RStudio that can be accessed from the the help menu: Help > Markdown Quick Reference