Learning objectives
- Learn how to install and use packages that extend R’s basic functionality
- Read data in tabular format into R
- Calculate summary statistics on your tabular data
- Introduce the
tibble
, arguably the most important data structure you will use in R- Learn how R deals with missing values
- Learn the two key dplyr functions for manipulating your data
filter()
for selecting observations based on their values, i.e. selecting rows in your tableselect()
for selecting a subset of variables, i.e. selecting columns in your tablearrange()
for sorting the observations in your table
In this course, we’ll be mostly focusing on a set of R packages specifically designed for the most useful and common operations for interacting with and visualizing data, and particularly data in a tabular form. This collection of packages is known as the tidyverse.
Learning the tidyverse is more than just learning about some helpful functions to get certain tasks done. The tidyverse packages form a coherant system for data manipulation, exploration and visualization that share a common design philosophy. There is a certain elegance to code crafted using the tidyverse approach that is in stark contrast to the strange and often cryptic equivalent using traditional ‘base R’.
We will also start to look at the most important data structure you’ll use with your data, assuming it is in tabular form, the data frame, and its superior tidyverse derivative, the tibble.
Before we move on to the tidyverse version of a data frame, the tibble, we have to load the tidyverse package that provides the tibble functionality. It doesn’t come with the set of packages loaded when you first start R.
There are thousands of R packages that extend R’s core functionality to provide specialized functions for doing all kinds of things. There are R packages that provide various kinds of visualizations and plotting functions, others that let us manipulate data in interesting and useful ways, and still others that give us access to cutting-edge machine learning algorithms.
You need to load a package using the library()
function in order to be able to use the functions and data sets that the package provides. The tibble data structure is part of the tibble
package so to make this available you would run the following command:
library(tibble)
The tibble
package also contains various useful functions for converting other data structures, e.g. data frames, to tibbles and for working with tibble objects.
The tibble
package is one of a collection of packages that make up the tidyverse. It is usually more convenient to load the core tidyverse packages in one go as follows.
library(tidyverse)
## ── Attaching core tidyverse packages ─────────────────────────────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ───────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
The output from this command tells us which packages were loaded, one of which is the tibble
package. Another of these is ggplot2
that provides excellent plotting functions for visualizing our data and will be the main focus next week.
We also got a warning about some conflicts. This tells us that there are functions loaded as part of the dplyr
package that have the same name as functions that are in the stats
package. The stats
package was already loaded, from when we first started our R session, and the functions referred to in the warning are now masked; to use these instead of their dplyr
namesakes you need to specify them fully by adding the package name and ::
as shown in the warning message.
If you got the following error message when trying to load the tidyverse, then you haven’t yet installed it.
> library(tidyverse)
Error in library(tidyverse) : there is no package called ‘tidyverse’
Installing packages should be relatively straightforward using the install.packages()
function.
install.packages("tidyverse")
If you’re using R at a regular console within a terminal window, i.e. not RStudio, you may be asked to select one of several mirrors of the CRAN R package repository hosted at different locations around the world (in the UK these are in Bristol and London).
For some packages you may be asked to choose between a binary and a source package, particularly if the latest version of the package is only available in source form and that package contains code written in another language such as C. This could be a problem if you don’t have a C compiler installed on your computer - generally just choose to install the binary version.
The tidyverse brings us a special type of data frame called the tibble
.
Data frames have some rather quirky and annoying behaviours and the tibble
addresses some of these issues to make life a little easier. We’ll first look at how we can create a tibble and then will have a quick look at some of the advantages of the tibble over the data frame.
Tibbles can be created using the tibble()
function in a similar manner to how we created a data frame earlier.
beatles <- tibble(
name = c("John", "Paul", "Ringo", "George"),
birth_year = c(1940, 1942, 1940, 1943),
instrument = c("guitar", "bass", "drums", "guitar")
)
beatles
## # A tibble: 4 × 3
## name birth_year instrument
## <chr> <dbl> <chr>
## 1 John 1940 guitar
## 2 Paul 1942 bass
## 3 Ringo 1940 drums
## 4 George 1943 guitar
A data frame can be converted into a tibble using the as_tibble()
function.
irisTibble <- as_tibble(iris)
Our iris tibble is still a data frame but it also has acquired some additional types or classes (tbl_df
and tbl
).
class(iris)
## [1] "data.frame"
class(irisTibble)
## [1] "tbl_df" "tbl" "data.frame"
So let’s now have a look at what capabilities and behaviours these additional types confer and what the main differences are between tibbles and data frames.
The first immediate difference from a data.frame
is that a tibble
displays the data type of each column under its name and it only prints as many columns as fit on one screen. Another difference is that printing a tibble
will not print the whole dataset, but just the first 10 rows and only the columns that fit the screen (same as head
but with 10 rows instead of 6). If you would like to print more than the first 10 rows use the print
function.
Since printing tibble
already gives you information about the data structure, the data types of each column and the size of the dataset, the str
function is not as much useful as it was when using data.frame
.
str(irisTibble)
## tibble [150 × 5] (S3: tbl_df/tbl/data.frame)
## $ Sepal.Length: num [1:150] 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num [1:150] 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num [1:150] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num [1:150] 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
Notice that rather than specifing tibble
as the data structure of surveys, the first line of str
‘s output now specifies ‘tbl_df’, ‘tbl’ and ‘data.frame’ which can be a bit confusing. These are the classes tibble
inherts from which in simple terms means that tibble
is a data.frame
with a few modifications. Therefore, most of the functions that were used with data.frame
can also be used with tibble
.
Another of the tidyverse packages loaded when we ran library(tidyverse)
is the readr
package, which provides functions for reading tabular data into a tibble and writing tibbles to tab-delmited and comma-separated value (CSV) files.
For example, the clinical data for the METABRIC breast cancer dataset generated at CRUK CI can be downloaded as a tab-delimited file from cBioPortal. Targeted sequencing was performed for 2509 primary breast tumours, along with 548 matched normals, using a panel of 173 of the most frequently mutated breast cancer genes as part of the METABRIC study (Pereira et al., Nature Communications 7:11479, 2016).
There should be two metabric files in the data directory inside the R_Intro directory. We are first going to load brca_metabric_clinical_data.tsv. The “tsv” suffix tells us that this file contains information in tabular form in which values are separated by tab characters so we’ll use the read_tsv()
to read these data.
metabric <- read_tsv("data/brca_metabric_clinical_data.tsv")
## Rows: 2509 Columns: 20
## ── Column specification ───────────────────────────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (14): Patient ID, Sample ID, Cancer Type, Cancer Type Detailed, 3-Gene c...
## dbl (6): Mutation Count, Age at Diagnosis, Cohort, Lymph nodes examined pos...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
read_tsv()
prints a message telling us which types it has assigned to each of the columns it has read, although in this case the table has quite a lot of columns and it only outputs details about the first few.
If we print the clinical data set we can see that read_tsv()
created a tibble.
metabric
## # A tibble: 2,509 × 20
## `Patient ID` `Sample ID` `Cancer Type` `Cancer Type Detailed`
## <chr> <chr> <chr> <chr>
## 1 MB-0000 MB-0000 Breast Cancer Breast Invasive Ductal Carcinoma
## 2 MB-0002 MB-0002 Breast Cancer Breast Invasive Ductal Carcinoma
## 3 MB-0005 MB-0005 Breast Cancer Breast Invasive Ductal Carcinoma
## 4 MB-0006 MB-0006 Breast Cancer Breast Mixed Ductal and Lobular Carci…
## 5 MB-0008 MB-0008 Breast Cancer Breast Mixed Ductal and Lobular Carci…
## 6 MB-0010 MB-0010 Breast Cancer Breast Invasive Ductal Carcinoma
## 7 MB-0014 MB-0014 Breast Cancer Breast Invasive Ductal Carcinoma
## 8 MB-0020 MB-0020 Breast Cancer Breast Invasive Lobular Carcinoma
## 9 MB-0022 MB-0022 Breast Cancer Breast Mixed Ductal and Lobular Carci…
## 10 MB-0025 MB-0025 Breast Cancer Breast Invasive Ductal Carcinoma
## # ℹ 2,499 more rows
## # ℹ 16 more variables: `Mutation Count` <dbl>,
## # `3-Gene classifier subtype` <chr>, `Age at Diagnosis` <dbl>,
## # Cellularity <chr>, Chemotherapy <chr>, Cohort <dbl>, `ER Status` <chr>,
## # `ER status measured by IHC` <chr>, `HER2 Status` <chr>,
## # `HER2 status measured by SNP6` <chr>, `Hormone Therapy` <chr>,
## # `Inferred Menopausal State` <chr>, `Integrative Cluster` <chr>, …
read_csv()
and read_tsv()
read_csv()
reads data from a comma-separated value (CSV) file into a tibble. read_tsv()
is the equivalent function that works on tab-delmited files.
These functions expect the first line to contain column names and try to make sensible guesses at the type of data in each column. You can change this by specifying various arguments, e.g. to skip comment lines beginning with a specific character (e.g. “#”) or to tell the function what the column types are.
You can see that the column names are surrounded by back ticks. This is because they contain characters, such as spaces and dashes, and in one case starts with a number. R does not natively allow column names to contain special characters other than “_” and “.” or to start with numbers. There are ways around this - using the back ticks - but this becomes cumbersome when writing our code. To create clean column names we can use clean_names()
from janitor
package.
#install.packages("janitor")
library(janitor)
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
metabric <- clean_names(metabric)
We are now in a position to explore these data. For example, we can use the table()
function to count the number of patients in this cohort that had chemotherapy and those that did not.
table(metabric$chemotherapy)
##
## NO YES
## 1568 412
We can summarize the numbers of mutations found per patient.
summary(metabric$mutation_count)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.000 3.000 5.000 5.579 7.000 80.000 152
When reading in the METABRIC dataset we specified the file using a relative path, data/brca_metabric_clinical_data.tsv
. This is relative to the current working directory and is in the data
subdirectory within our working directory. The getwd()
function will tell you what the current working directory is.
getwd()
If you open RStudio by clicking on its icon in the task bar or its entry in the start menu, your working directory will probably be your home folder.
You can specify a full path for the file you want to read, which is also known as an absolute path.
You can explicitly set your working directory using setwd()
.
If you’re running an R script from a terminal window (by typing something like Rscript my_script.R
at the command prompt) your working directory will be the directory in which you run the script (i.e. the one in which you issue the Rscript command). In this case you will probably want to read files from locations that are relative to the directory in which you run the script.
If you’re writing an R script in RStudio you can set the working directory to be the same directory as that which contains your script file using the menu: Session > Set Working Directory > To Source File Location. You can also navigate to the directory of your choosing in the Files tab pane and set this to be the working directory by clicking on the More menu button and selecting Set As Working Directory.
The METABRIC clinical dataset has some missing values. If you look back to the table we printed out earlier you can see 2 missing values for mutation counts within the first 10 rows.
metabric$mutation_count[1:10]
## [1] NA 2 2 1 2 4 4 NA 1 5
Missing values (NA
)
Missing values in R are represented as NA
, which stands for ‘not available’.
dplyr is one of the packages that gets loaded as part of the tidyverse.
dplyr provides many useful functions for manipulating tabular data in data frames or tibbles. We’re going to look at the key functions for filtering our data, modifying the contents and computing summary statistics.
We’ll also introduce the pipe operator, %>%
, for chaining operations together into mini workflows in a way that makes for more readable and maintainable code.
This time we’ll use a METABRIC data set containing clinical data and gene expression data to illustrate how these operations work.
metabric <- read_csv("data/metabric_clinical_and_expression_data.csv")
metabric
## # A tibble: 1,904 × 32
## Patient_ID Cohort Age_at_diagnosis Survival_time Survival_status Vital_status
## <chr> <dbl> <dbl> <dbl> <chr> <chr>
## 1 MB-0000 1 75.6 140. LIVING Living
## 2 MB-0002 1 43.2 84.6 LIVING Living
## 3 MB-0005 1 48.9 164. DECEASED Died of Dis…
## 4 MB-0006 1 47.7 165. LIVING Living
## 5 MB-0008 1 77.0 41.4 DECEASED Died of Dis…
## 6 MB-0010 1 78.8 7.8 DECEASED Died of Dis…
## 7 MB-0014 1 56.4 164. LIVING Living
## 8 MB-0022 1 89.1 99.5 DECEASED Died of Oth…
## 9 MB-0028 1 86.4 36.6 DECEASED Died of Oth…
## 10 MB-0035 1 84.2 36.3 DECEASED Died of Dis…
## # ℹ 1,894 more rows
## # ℹ 26 more variables: Chemotherapy <chr>, Radiotherapy <chr>,
## # Tumour_size <dbl>, Tumour_stage <dbl>, Neoplasm_histologic_grade <dbl>,
## # Lymph_nodes_examined_positive <dbl>, Lymph_node_status <dbl>,
## # Cancer_type <chr>, ER_status <chr>, PR_status <chr>, HER2_status <chr>,
## # HER2_status_measured_by_SNP6 <chr>, PAM50 <chr>, `3-gene_classifier` <chr>,
## # Nottingham_prognostic_index <dbl>, Cellularity <chr>, …
We use the terms ‘observation’ and ‘variable’ a lot in this course. As a reminder from week 2, when we talk about an observation, we’re talking about a set of values measured for the same unit or thing, e.g. a person or a date, and when we talk about a variable we are really talking about the attribute that we are measuring or recording, e.g. height, temperature or expression value.
Observations are represented as rows in our data frames or tibbles, while the columns correspond to variables.
From “Tidy Data” by Hadley Wickham, The Journal of Statistical Software, vol. 59, 2014.
A data set is a collection of values, usually either numbers (if quantitative) or character strings (if qualitative). Values are organised in two ways. Every value belongs to a variable and an observation.
A variable contains all values that measure the same underlying attribute (like height, temperature, duration) across units.
An observation contains all values measured on the same unit (like a person or a day) across attributes.
We will be looking at the following key dplyr functions in this course:
filter()
for filtering rowsselect()
for selecting columnsarrange()
for sorting rowsmutate()
for modifying columns or creating new onessummarise()
for computing summary valuesIn looking at each of these in turn, we’ll be applying these to the entire data set. It is possible to combine these with the group_by()
function to instead operate on separate groups within our data set but this is something we’ll cover in detail later.
The dplyr operations are commonly referred to as “verbs” in a data manipulation grammar. These verbs have a common syntax and work together in a consistent and uniform manner. They all have the following shared behaviours:
The first argument in each function is a data frame (or tibble)
Any additional arguments describe what operation to perform on the data frame
Variable names, i.e. column names, are referred to without using quotes
The result of an operation is a new data frame
filter()
The filter
verb allows you to choose rows from a data frame that match some specified criteria. The criteria are based on values of variables and can make use of comparison operators such as ==
, >
, <
and !=
.
For example, to filter the METABRIC data set so that it only contains observations for those patients who died of breast cancer we can use filter()
as follows.
deceased <- filter(metabric, Vital_status == "Died of Disease")
deceased
## # A tibble: 622 × 32
## Patient_ID Cohort Age_at_diagnosis Survival_time Survival_status Vital_status
## <chr> <dbl> <dbl> <dbl> <chr> <chr>
## 1 MB-0005 1 48.9 164. DECEASED Died of Dis…
## 2 MB-0008 1 77.0 41.4 DECEASED Died of Dis…
## 3 MB-0010 1 78.8 7.8 DECEASED Died of Dis…
## 4 MB-0035 1 84.2 36.3 DECEASED Died of Dis…
## 5 MB-0036 1 85.5 132. DECEASED Died of Dis…
## 6 MB-0079 1 50.4 28.5 DECEASED Died of Dis…
## 7 MB-0083 1 64.8 86.1 DECEASED Died of Dis…
## 8 MB-0100 1 68.7 8.07 DECEASED Died of Dis…
## 9 MB-0102 1 51.4 141. DECEASED Died of Dis…
## 10 MB-0108 1 43.2 42.7 DECEASED Died of Dis…
## # ℹ 612 more rows
## # ℹ 26 more variables: Chemotherapy <chr>, Radiotherapy <chr>,
## # Tumour_size <dbl>, Tumour_stage <dbl>, Neoplasm_histologic_grade <dbl>,
## # Lymph_nodes_examined_positive <dbl>, Lymph_node_status <dbl>,
## # Cancer_type <chr>, ER_status <chr>, PR_status <chr>, HER2_status <chr>,
## # HER2_status_measured_by_SNP6 <chr>, PAM50 <chr>, `3-gene_classifier` <chr>,
## # Nottingham_prognostic_index <dbl>, Cellularity <chr>, …
Remember that the ==
operator tests for equality, i.e. is the value for Vital_status
for each patient (observation) equal to “Died of Disease”.
This filtering operation is equivalent to subsetting the rows based on a logical vector resulting from our comparison of vital status values with “Died of Disease”.
filter(metabric, Vital_status == "Died of Disease")
# is equivalent to
metabric[metabric$Vital_status == "Died of Disease", ]
Both achieve the same result but the dplyr filter method is arguably a little easier to read. We haven’t had to write metabric
twice for one thing; we just referred to the variable name as it is, unquoted and without any fuss.
Let’s have a look at the various categories in the Vital_status
variable.
table(metabric$Vital_status)
##
## Died of Disease Died of Other Causes Living
## 622 480 801
We could use the !=
comparison operator to select all deceased patients regardless of whether they died of the disease or other causes, by filtering for those that don’t have the value “Living”.
deceased <- filter(metabric, Vital_status != "Living")
nrow(deceased)
## [1] 1102
Another way of doing this is to specify which classes we are interested in and use the %in%
operator to test which observations (rows) contain those values.
deceased <- filter(metabric,
Vital_status %in% c("Died of Disease", "Died of Other Causes"))
nrow(deceased)
## [1] 1102
Another of the tidyverse packages, stringr
, contains a set of very useful functions for operating on text or character strings. One such function, str_starts()
could be used to find all Vital_status
values that start with “Died”.
deceased <- filter(metabric, str_starts(Vital_status, "Died"))
nrow(deceased)
## [1] 1102
Note that str_starts()
returns a logical vector - this is important since the filtering condition must evaluate to TRUE
or FALSE
values for each row.
Unsurprisingly there is an equivalent function, str_ends()
, for matching the end of text (character) values and str_detect()
is another useful function that determines whether values match a regular expression. Regular expressions are a language for search patterns used frequently in computer programming and really worth getting to grips with but sadly these are beyond the scope of this course.
Filtering based on a logical variable is the most simple type of filtering of all. We don’t have any logical variables in our METABRIC data set so we’ll create one from the binary Survival_status
variable to use as an example.
# create a new logical variable called 'Deceased'
metabric$Deceased <- metabric$Survival_status == "DECEASED"
#
# filtering based on a logical variable - only selects TRUE values
deceased <- filter(metabric, Deceased)
#
# only display those columns we're interested in
deceased[, c("Patient_ID", "Survival_status", "Vital_status", "Deceased")]
## # A tibble: 1,103 × 4
## Patient_ID Survival_status Vital_status Deceased
## <chr> <chr> <chr> <lgl>
## 1 MB-0005 DECEASED Died of Disease TRUE
## 2 MB-0008 DECEASED Died of Disease TRUE
## 3 MB-0010 DECEASED Died of Disease TRUE
## 4 MB-0022 DECEASED Died of Other Causes TRUE
## 5 MB-0028 DECEASED Died of Other Causes TRUE
## 6 MB-0035 DECEASED Died of Disease TRUE
## 7 MB-0036 DECEASED Died of Disease TRUE
## 8 MB-0046 DECEASED Died of Other Causes TRUE
## 9 MB-0079 DECEASED Died of Disease TRUE
## 10 MB-0083 DECEASED Died of Disease TRUE
## # ℹ 1,093 more rows
We can use the !
operator to filter those patients who are not deceased.
filter(metabric, !Deceased)
The eagle-eyed will have spotted that filtering on our newly created Deceased logical variable gave a slightly different number of observations (patients) who are considered to be deceased, compared with the filtering operations shown above based on the Vital_status
variable. We get one extra row. This is because we have a missing value for the vital status of one of the patients. We can filter for this using the is.na()
function.
missing_vital_status <- filter(metabric, is.na(Vital_status))
missing_vital_status[, c("Patient_ID",
"Survival_status",
"Vital_status",
"Deceased")]
## # A tibble: 1 × 4
## Patient_ID Survival_status Vital_status Deceased
## <chr> <chr> <chr> <lgl>
## 1 MB-5130 DECEASED <NA> TRUE
filter()
only retains rows where the condition if TRUE
; both FALSE
and NA
values are filtered out.
We can apply more than one condition in our filtering operation, for example the patients who were still alive at the time of the METABRIC study and had survived for more than 10 years (120 months).
filter(metabric, Survival_status == "LIVING", Survival_time > 120)
## # A tibble: 545 × 33
## Patient_ID Cohort Age_at_diagnosis Survival_time Survival_status Vital_status
## <chr> <dbl> <dbl> <dbl> <chr> <chr>
## 1 MB-0000 1 75.6 140. LIVING Living
## 2 MB-0006 1 47.7 165. LIVING Living
## 3 MB-0014 1 56.4 164. LIVING Living
## 4 MB-0039 1 70.9 164. LIVING Living
## 5 MB-0045 1 45.3 165. LIVING Living
## 6 MB-0053 1 70.0 161. LIVING Living
## 7 MB-0054 1 66.9 160. LIVING Living
## 8 MB-0060 1 45.4 141. LIVING Living
## 9 MB-0062 1 52.1 154. LIVING Living
## 10 MB-0066 1 61.5 157. LIVING Living
## # ℹ 535 more rows
## # ℹ 27 more variables: Chemotherapy <chr>, Radiotherapy <chr>,
## # Tumour_size <dbl>, Tumour_stage <dbl>, Neoplasm_histologic_grade <dbl>,
## # Lymph_nodes_examined_positive <dbl>, Lymph_node_status <dbl>,
## # Cancer_type <chr>, ER_status <chr>, PR_status <chr>, HER2_status <chr>,
## # HER2_status_measured_by_SNP6 <chr>, PAM50 <chr>, `3-gene_classifier` <chr>,
## # Nottingham_prognostic_index <dbl>, Cellularity <chr>, …
The equivalent using R’s usual subsetting is slightly less readable.
metabric[metabric$Survival_status == "LIVING" & metabric$Survival_time > 120, ]
We can add as many conditions as we like separating each with a comma. Note that filtering using R subsetting gets more unreadable, and more cumbersome to code, the more conditions you add.
Adding conditions in this way is equivalent to combining the conditions using the &
(Boolean AND) operator.
filter(metabric, Survival_status == "LIVING", Survival_time > 120)
Naturally we can also check when either of two conditions holds true by using the |
(Boolean OR) operator. And we can build up more complicated filtering operations using both &
and |
. For example, let’s see which patients have stage 3 or stage 4 tumours that are either estrogen receptor (ER) positive or progesterone receptor (PR) positive.
selected_patients <- filter(metabric,
Tumour_stage >= 3,
ER_status == "Positive" | PR_status == "Positive")
nrow(selected_patients)
## [1] 79
In this case, if you used &
in place of the comma you’d need to be careful about the precedence of the &
and |
operators and use parentheses to make clear what you intended.
filter(metabric, Tumour_stage >= 3 &
(ER_status == "Positive" | PR_status == "Positive"))
select()
Another way of “slicing and dicing”” our tabular data set is to select just the variables or columns we’re interested in. This can be important particularly when the data set contains a very large number of variables as is the case for the METABRIC data. Notice how when we print the METABRIC data frame it is not possible to display all the columns; we only get the first few and then a long list of the additional ones that weren’t displayed.
Using the $
operator is quite convenient for selecting a single column and extracting the values as a vector. Selecting several columns using the []
subsetting operator is a bit more cumbersome. For example, in our look at filtering rows, we considered two different variables in our data set that are concerned with the living/deceased status of patients. When printing out the results we selected just those columns along with the patient identifier and the newly created Deceased
column.
deceased[, c("Patient_ID", "Survival_status", "Vital_status", "Deceased")]
## # A tibble: 1,103 × 4
## Patient_ID Survival_status Vital_status Deceased
## <chr> <chr> <chr> <lgl>
## 1 MB-0005 DECEASED Died of Disease TRUE
## 2 MB-0008 DECEASED Died of Disease TRUE
## 3 MB-0010 DECEASED Died of Disease TRUE
## 4 MB-0022 DECEASED Died of Other Causes TRUE
## 5 MB-0028 DECEASED Died of Other Causes TRUE
## 6 MB-0035 DECEASED Died of Disease TRUE
## 7 MB-0036 DECEASED Died of Disease TRUE
## 8 MB-0046 DECEASED Died of Other Causes TRUE
## 9 MB-0079 DECEASED Died of Disease TRUE
## 10 MB-0083 DECEASED Died of Disease TRUE
## # ℹ 1,093 more rows
The select()
function from dplyr is simpler.
select(metabric, Patient_ID, Survival_status, Vital_status, Deceased)
## # A tibble: 1,904 × 4
## Patient_ID Survival_status Vital_status Deceased
## <chr> <chr> <chr> <lgl>
## 1 MB-0000 LIVING Living FALSE
## 2 MB-0002 LIVING Living FALSE
## 3 MB-0005 DECEASED Died of Disease TRUE
## 4 MB-0006 LIVING Living FALSE
## 5 MB-0008 DECEASED Died of Disease TRUE
## 6 MB-0010 DECEASED Died of Disease TRUE
## 7 MB-0014 LIVING Living FALSE
## 8 MB-0022 DECEASED Died of Other Causes TRUE
## 9 MB-0028 DECEASED Died of Other Causes TRUE
## 10 MB-0035 DECEASED Died of Disease TRUE
## # ℹ 1,894 more rows
Notice the similarities with the filter()
function. The first argument is the data frame we are operating on and the arguments that follow on are specific to the operation in question, in this case, the variables (columns) to select. Note that the variables do not need to be put in quotes, and the returned value is another data frame, even if only one column was selected.
We can alter the order of the variables (columns).
select(metabric, Patient_ID, Vital_status, Survival_status, Deceased)
## # A tibble: 1,904 × 4
## Patient_ID Vital_status Survival_status Deceased
## <chr> <chr> <chr> <lgl>
## 1 MB-0000 Living LIVING FALSE
## 2 MB-0002 Living LIVING FALSE
## 3 MB-0005 Died of Disease DECEASED TRUE
## 4 MB-0006 Living LIVING FALSE
## 5 MB-0008 Died of Disease DECEASED TRUE
## 6 MB-0010 Died of Disease DECEASED TRUE
## 7 MB-0014 Living LIVING FALSE
## 8 MB-0022 Died of Other Causes DECEASED TRUE
## 9 MB-0028 Died of Other Causes DECEASED TRUE
## 10 MB-0035 Died of Disease DECEASED TRUE
## # ℹ 1,894 more rows
We can also select a range of columns using :
, e.g. to select the patient identifier and all the columns between Tumour_size
and Cancer_type
we could run the following select()
command.
select(metabric, Patient_ID, Chemotherapy:Tumour_stage)
## # A tibble: 1,904 × 5
## Patient_ID Chemotherapy Radiotherapy Tumour_size Tumour_stage
## <chr> <chr> <chr> <dbl> <dbl>
## 1 MB-0000 NO YES 22 2
## 2 MB-0002 NO YES 10 1
## 3 MB-0005 YES NO 15 2
## 4 MB-0006 YES YES 25 2
## 5 MB-0008 YES YES 40 2
## 6 MB-0010 NO YES 31 4
## 7 MB-0014 YES YES 10 2
## 8 MB-0022 NO YES 29 2
## 9 MB-0028 NO YES 16 2
## 10 MB-0035 NO NO 28 2
## # ℹ 1,894 more rows
The help page for select
points to some special functions that can be used within select()
. We can find all the columns, for example, that contain the term “status” using contains()
.
select(metabric, contains("status"))
## # A tibble: 1,904 × 7
## Survival_status Vital_status Lymph_node_status ER_status PR_status
## <chr> <chr> <dbl> <chr> <chr>
## 1 LIVING Living 3 Positive Negative
## 2 LIVING Living 1 Positive Positive
## 3 DECEASED Died of Disease 2 Positive Positive
## 4 LIVING Living 2 Positive Positive
## 5 DECEASED Died of Disease 3 Positive Positive
## 6 DECEASED Died of Disease 1 Positive Positive
## 7 LIVING Living 2 Positive Positive
## 8 DECEASED Died of Other Causes 2 Positive Negative
## 9 DECEASED Died of Other Causes 2 Positive Negative
## 10 DECEASED Died of Disease 1 Positive Negative
## # ℹ 1,894 more rows
## # ℹ 2 more variables: HER2_status <chr>, HER2_status_measured_by_SNP6 <chr>
If we only wanted those ending with “status” we could use ends_with()
instead.
select(metabric, ends_with("status"))
## # A tibble: 1,904 × 6
## Survival_status Vital_status Lymph_node_status ER_status PR_status
## <chr> <chr> <dbl> <chr> <chr>
## 1 LIVING Living 3 Positive Negative
## 2 LIVING Living 1 Positive Positive
## 3 DECEASED Died of Disease 2 Positive Positive
## 4 LIVING Living 2 Positive Positive
## 5 DECEASED Died of Disease 3 Positive Positive
## 6 DECEASED Died of Disease 1 Positive Positive
## 7 LIVING Living 2 Positive Positive
## 8 DECEASED Died of Other Causes 2 Positive Negative
## 9 DECEASED Died of Other Causes 2 Positive Negative
## 10 DECEASED Died of Disease 1 Positive Negative
## # ℹ 1,894 more rows
## # ℹ 1 more variable: HER2_status <chr>
We can also select those columns we’re not interested in and that shouldn’t be included by prefixing the columns with -
.
select(metabric, -Cohort)
## # A tibble: 1,904 × 32
## Patient_ID Age_at_diagnosis Survival_time Survival_status Vital_status
## <chr> <dbl> <dbl> <chr> <chr>
## 1 MB-0000 75.6 140. LIVING Living
## 2 MB-0002 43.2 84.6 LIVING Living
## 3 MB-0005 48.9 164. DECEASED Died of Disease
## 4 MB-0006 47.7 165. LIVING Living
## 5 MB-0008 77.0 41.4 DECEASED Died of Disease
## 6 MB-0010 78.8 7.8 DECEASED Died of Disease
## 7 MB-0014 56.4 164. LIVING Living
## 8 MB-0022 89.1 99.5 DECEASED Died of Other Caus…
## 9 MB-0028 86.4 36.6 DECEASED Died of Other Caus…
## 10 MB-0035 84.2 36.3 DECEASED Died of Disease
## # ℹ 1,894 more rows
## # ℹ 27 more variables: Chemotherapy <chr>, Radiotherapy <chr>,
## # Tumour_size <dbl>, Tumour_stage <dbl>, Neoplasm_histologic_grade <dbl>,
## # Lymph_nodes_examined_positive <dbl>, Lymph_node_status <dbl>,
## # Cancer_type <chr>, ER_status <chr>, PR_status <chr>, HER2_status <chr>,
## # HER2_status_measured_by_SNP6 <chr>, PAM50 <chr>, `3-gene_classifier` <chr>,
## # Nottingham_prognostic_index <dbl>, Cellularity <chr>, …
You can use a combination of explicit naming, ranges, helper functions and negation to select the columns of interest.
selected_columns <- select(metabric,
Patient_ID,
starts_with("Tumour_"),
`3-gene_classifier`:Integrative_cluster,
-Cellularity)
selected_columns
## # A tibble: 1,904 × 6
## Patient_ID Tumour_size Tumour_stage `3-gene_classifier`
## <chr> <dbl> <dbl> <chr>
## 1 MB-0000 22 2 ER-/HER2-
## 2 MB-0002 10 1 ER+/HER2- High Prolif
## 3 MB-0005 15 2 <NA>
## 4 MB-0006 25 2 <NA>
## 5 MB-0008 40 2 ER+/HER2- High Prolif
## 6 MB-0010 31 4 ER+/HER2- High Prolif
## 7 MB-0014 10 2 <NA>
## 8 MB-0022 29 2 <NA>
## 9 MB-0028 16 2 ER+/HER2- High Prolif
## 10 MB-0035 28 2 ER+/HER2- High Prolif
## # ℹ 1,894 more rows
## # ℹ 2 more variables: Nottingham_prognostic_index <dbl>,
## # Integrative_cluster <chr>
You can also use select()
to completely reorder the columns so they’re in the order of your choosing. The everything()
helper function is useful in this context, particularly if what you’re really interested in is bringing one or more columns to the left hand side and then including everything else afterwards in whatever order they were already in.
select(metabric, Patient_ID, Survival_status, Tumour_stage, everything())
## # A tibble: 1,904 × 33
## Patient_ID Survival_status Tumour_stage Cohort Age_at_diagnosis Survival_time
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 MB-0000 LIVING 2 1 75.6 140.
## 2 MB-0002 LIVING 1 1 43.2 84.6
## 3 MB-0005 DECEASED 2 1 48.9 164.
## 4 MB-0006 LIVING 2 1 47.7 165.
## 5 MB-0008 DECEASED 2 1 77.0 41.4
## 6 MB-0010 DECEASED 4 1 78.8 7.8
## 7 MB-0014 LIVING 2 1 56.4 164.
## 8 MB-0022 DECEASED 2 1 89.1 99.5
## 9 MB-0028 DECEASED 2 1 86.4 36.6
## 10 MB-0035 DECEASED 2 1 84.2 36.3
## # ℹ 1,894 more rows
## # ℹ 27 more variables: Vital_status <chr>, Chemotherapy <chr>,
## # Radiotherapy <chr>, Tumour_size <dbl>, Neoplasm_histologic_grade <dbl>,
## # Lymph_nodes_examined_positive <dbl>, Lymph_node_status <dbl>,
## # Cancer_type <chr>, ER_status <chr>, PR_status <chr>, HER2_status <chr>,
## # HER2_status_measured_by_SNP6 <chr>, PAM50 <chr>, `3-gene_classifier` <chr>,
## # Nottingham_prognostic_index <dbl>, Cellularity <chr>, …
Note dplyr also provides the relocate()
function to achieve the same goal:
relocate(metabric, Patient_ID, Survival_status, Tumour_stage)
## # A tibble: 1,904 × 33
## Patient_ID Survival_status Tumour_stage Cohort Age_at_diagnosis Survival_time
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 MB-0000 LIVING 2 1 75.6 140.
## 2 MB-0002 LIVING 1 1 43.2 84.6
## 3 MB-0005 DECEASED 2 1 48.9 164.
## 4 MB-0006 LIVING 2 1 47.7 165.
## 5 MB-0008 DECEASED 2 1 77.0 41.4
## 6 MB-0010 DECEASED 4 1 78.8 7.8
## 7 MB-0014 LIVING 2 1 56.4 164.
## 8 MB-0022 DECEASED 2 1 89.1 99.5
## 9 MB-0028 DECEASED 2 1 86.4 36.6
## 10 MB-0035 DECEASED 2 1 84.2 36.3
## # ℹ 1,894 more rows
## # ℹ 27 more variables: Vital_status <chr>, Chemotherapy <chr>,
## # Radiotherapy <chr>, Tumour_size <dbl>, Neoplasm_histologic_grade <dbl>,
## # Lymph_nodes_examined_positive <dbl>, Lymph_node_status <dbl>,
## # Cancer_type <chr>, ER_status <chr>, PR_status <chr>, HER2_status <chr>,
## # HER2_status_measured_by_SNP6 <chr>, PAM50 <chr>, `3-gene_classifier` <chr>,
## # Nottingham_prognostic_index <dbl>, Cellularity <chr>, …
Finally, columns can be renamed as part of the selection process.
select(metabric,
Patient_ID, Classification = `3-gene_classifier`,
NPI = Nottingham_prognostic_index)
## # A tibble: 1,904 × 3
## Patient_ID Classification NPI
## <chr> <chr> <dbl>
## 1 MB-0000 ER-/HER2- 6.04
## 2 MB-0002 ER+/HER2- High Prolif 4.02
## 3 MB-0005 <NA> 4.03
## 4 MB-0006 <NA> 4.05
## 5 MB-0008 ER+/HER2- High Prolif 6.08
## 6 MB-0010 ER+/HER2- High Prolif 4.06
## 7 MB-0014 <NA> 4.02
## 8 MB-0022 <NA> 4.06
## 9 MB-0028 ER+/HER2- High Prolif 5.03
## 10 MB-0035 ER+/HER2- High Prolif 3.06
## # ℹ 1,894 more rows
Note that dplyr provides the rename()
function for when we only want to rename a column and not select a subset of columns.
rename(metabric, PatientID = Patient_ID)
## # A tibble: 1,904 × 33
## PatientID Cohort Age_at_diagnosis Survival_time Survival_status Vital_status
## <chr> <dbl> <dbl> <dbl> <chr> <chr>
## 1 MB-0000 1 75.6 140. LIVING Living
## 2 MB-0002 1 43.2 84.6 LIVING Living
## 3 MB-0005 1 48.9 164. DECEASED Died of Dise…
## 4 MB-0006 1 47.7 165. LIVING Living
## 5 MB-0008 1 77.0 41.4 DECEASED Died of Dise…
## 6 MB-0010 1 78.8 7.8 DECEASED Died of Dise…
## 7 MB-0014 1 56.4 164. LIVING Living
## 8 MB-0022 1 89.1 99.5 DECEASED Died of Othe…
## 9 MB-0028 1 86.4 36.6 DECEASED Died of Othe…
## 10 MB-0035 1 84.2 36.3 DECEASED Died of Dise…
## # ℹ 1,894 more rows
## # ℹ 27 more variables: Chemotherapy <chr>, Radiotherapy <chr>,
## # Tumour_size <dbl>, Tumour_stage <dbl>, Neoplasm_histologic_grade <dbl>,
## # Lymph_nodes_examined_positive <dbl>, Lymph_node_status <dbl>,
## # Cancer_type <chr>, ER_status <chr>, PR_status <chr>, HER2_status <chr>,
## # HER2_status_measured_by_SNP6 <chr>, PAM50 <chr>, `3-gene_classifier` <chr>,
## # Nottingham_prognostic_index <dbl>, Cellularity <chr>, …
arrange()
It is sometimes quite useful to sort our data frame based on the values in one or more of the columns, particularly when displaying the contents or saving them to a file. The arrange()
function in dplyr provides this sorting capability.
For example, we can sort the METABRIC patients into order of increasing survival time.
arrange(metabric, Survival_time)
## # A tibble: 1,904 × 33
## Patient_ID Cohort Age_at_diagnosis Survival_time Survival_status Vital_status
## <chr> <dbl> <dbl> <dbl> <chr> <chr>
## 1 MB-0284 1 51.4 0 LIVING Living
## 2 MB-6229 5 75.3 0.1 DECEASED Died of Dis…
## 3 MB-0627 1 54.1 0.767 LIVING Living
## 4 MB-0880 1 73.6 1.23 LIVING Living
## 5 MB-0125 1 74.0 1.27 LIVING Living
## 6 MB-0374 1 34.7 1.43 LIVING Living
## 7 MB-0148 1 53.2 1.77 LIVING Living
## 8 MB-5525 3 63.2 2 LIVING Living
## 9 MB-6092 5 80.6 2.3 DECEASED Died of Oth…
## 10 MB-0117 1 60.1 2.4 LIVING Living
## # ℹ 1,894 more rows
## # ℹ 27 more variables: Chemotherapy <chr>, Radiotherapy <chr>,
## # Tumour_size <dbl>, Tumour_stage <dbl>, Neoplasm_histologic_grade <dbl>,
## # Lymph_nodes_examined_positive <dbl>, Lymph_node_status <dbl>,
## # Cancer_type <chr>, ER_status <chr>, PR_status <chr>, HER2_status <chr>,
## # HER2_status_measured_by_SNP6 <chr>, PAM50 <chr>, `3-gene_classifier` <chr>,
## # Nottingham_prognostic_index <dbl>, Cellularity <chr>, …
Or we might be more interested in the patients that survived the longest so would need the order to be of decreasing survival time. For that we can use the helper function desc()
that works specifically with arrange()
.
arrange(metabric, desc(Survival_time))
## # A tibble: 1,904 × 33
## Patient_ID Cohort Age_at_diagnosis Survival_time Survival_status Vital_status
## <chr> <dbl> <dbl> <dbl> <chr> <chr>
## 1 MB-4189 3 61.0 355. DECEASED Died of Oth…
## 2 MB-4079 3 63.2 351 DECEASED Died of Oth…
## 3 MB-0270 1 30.0 337. LIVING Living
## 4 MB-4235 3 67.5 336. DECEASED Died of Dis…
## 5 MB-4292 3 58.8 336. DECEASED Died of Oth…
## 6 MB-4212 3 45.5 330. LIVING Living
## 7 MB-4548 3 50.4 323. LIVING Living
## 8 MB-4633 3 67.0 318. LIVING Living
## 9 MB-4332 3 34.4 308. LIVING Living
## 10 MB-4418 3 56.1 308. LIVING Living
## # ℹ 1,894 more rows
## # ℹ 27 more variables: Chemotherapy <chr>, Radiotherapy <chr>,
## # Tumour_size <dbl>, Tumour_stage <dbl>, Neoplasm_histologic_grade <dbl>,
## # Lymph_nodes_examined_positive <dbl>, Lymph_node_status <dbl>,
## # Cancer_type <chr>, ER_status <chr>, PR_status <chr>, HER2_status <chr>,
## # HER2_status_measured_by_SNP6 <chr>, PAM50 <chr>, `3-gene_classifier` <chr>,
## # Nottingham_prognostic_index <dbl>, Cellularity <chr>, …
We can sort by more than one variable by adding more variable arguments to arrange()
.
arrange(metabric, Tumour_stage, Nottingham_prognostic_index)
## # A tibble: 1,904 × 33
## Patient_ID Cohort Age_at_diagnosis Survival_time Survival_status Vital_status
## <chr> <dbl> <dbl> <dbl> <chr> <chr>
## 1 MB-0472 1 59.3 27.4 LIVING Living
## 2 MB-0230 1 42.0 200. LIVING Living
## 3 MB-0522 1 54.1 2.5 LIVING Living
## 4 MB-0573 1 76.2 163. DECEASED Died of Oth…
## 5 MB-5186 3 42.8 203. LIVING Living
## 6 MB-0260 1 78.3 170. DECEASED Died of Oth…
## 7 MB-0245 1 54.8 165. LIVING Living
## 8 MB-2821 2 57.4 136. DECEASED Died of Oth…
## 9 MB-4681 3 60.7 252. DECEASED Died of Dis…
## 10 MB-0318 1 58.0 169. LIVING Living
## # ℹ 1,894 more rows
## # ℹ 27 more variables: Chemotherapy <chr>, Radiotherapy <chr>,
## # Tumour_size <dbl>, Tumour_stage <dbl>, Neoplasm_histologic_grade <dbl>,
## # Lymph_nodes_examined_positive <dbl>, Lymph_node_status <dbl>,
## # Cancer_type <chr>, ER_status <chr>, PR_status <chr>, HER2_status <chr>,
## # HER2_status_measured_by_SNP6 <chr>, PAM50 <chr>, `3-gene_classifier` <chr>,
## # Nottingham_prognostic_index <dbl>, Cellularity <chr>, …
Here we’ve sorted first by tumour stage from lowest to highest value and then by the Nottingham prognostic index for when there are ties, i.e. where the tumour stage is the same.
Sorting is most commonly used in workflows as one of the last steps before printing out a data frame or writing out the table to a file.
We haven’t yet showed how to read Excel spreadsheets into R. There are several packages providing this capability, but for general use the readxl
package from the tidyverse ideal.
readxl
should have been installed as part of the tidyverse but if you find that’s not the case, you will have to install it separately.
readxl
is not loaded as one of the core tidyverse packages when you run library(tidyverse)
so you’ll have to load this separately.
The the main functions are read_excel
, read_xls
and read_xslx
. Use ?read_xls
to find out how specify cell ranges and worksheets to be read.
How many samples in the METABRIC dataset have high cellularity and positive ER and HER2 status?
After reading the data first subset it to retain only Cellularity, ER_status and HER2_status. Then filter the data to answer the question.
selectedPatients <- metabric %>%
select(Cellularity, ER_status, HER2_status) %>%
filter(Cellularity=="High" & ER_status=="Positive" & HER2_status=="Positive")
nrow(selectedPatients)
## [1] 51
Investigate the subset of long-surviving breast cancer patients that didn’t receive chemo or radiotherapy.
First obtain the subset of patients that received neither chemotherapy or radiotherapy and survived for more than 20 years.
What percentage of these patients had negative ER status?
How does this compare to the percentage of patients with ER_negative status in the whole data set?
selectedPatients <- metabric %>%
filter(Chemotherapy=="NO" & Radiotherapy=="NO" & Survival_time > 240)
mean(selectedPatients$ER_status=="Negative")
## [1] 0.1142857
mean(metabric$ER_status=="Negative")
## [1] 0.2337185