If you haven’t learned the basics of R prior to attending this course, you should check out our R crash course for an overview of R’s syntax. It’s also a great refresher if you feel it has been a while since you last worked with R.
Markdown is a very simple way of writing a template to produce a pdf, HTML or word document. For example, the compiled version of this document is available online and is more convenient to browse here.
EVAL = FALSE
or echo = FALSE
print("Hello World")
## [1] "Hello World"
When viewing the R notebooks directly, not the compiled documents, sections may have additional characters so as to format them nicely when compiled. This is markdown, a simple language that formats the text whle, crucially, being still readable in the raw state.
For example:
This will be displayed in italic
This will be displayed in bold
You can also add hyperlinks, images and tables.
Lastly, you can even embed chunks of code written in other programming languages.
a='Wow python'
print(a.split())
## ['Wow', 'python']
More help is available through RStudio Help -> Markdown Quick Reference or you can view a cheat sheet here.
To create markdown files for your own analysis; File -> New File -> R Markdown…
Established in 2001, Bioconductor provided a convenient method to distribute tools for the analysis and comprehension of high-throughput genomic data in R. Initially focused on microarrays, Bioconductor now has packages (read: software) to process data obtained from most modern data sources.
On the Bioconductor website, you will find
For this session, we will introduce the Bioconductor project as a means of analysing high-throughput data
All Bioconductor software packages are listed under
install
command from the BiocManager package
BiocManager::install
forgoes needing to load the package manager into the work enviroment## You don't need to run this, edgeR should already be installed for the course
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("edgeR")
Once installed, a Bioconductor package can be loaded in the usual way with the library
function. All packages are required to have a vignette which gives detailed instructions on how to use the package and the workflow of commands. Some packages such as edgeR
have very comprehensive user guides with lots of use-cases.
library(edgeR)
vignette("edgeR")
edgeRUsersGuide()
Package documentation can also be accessed via the Help tab in RStudio, which can also be invoked in the console using “?”
?edgeR
Complex data structures are used in Bioconductor to represent high-throughput data, but we often have simple functions that we can use to access the data. We will use some example data available via Bioconductor to demonstrate how high-throughput data can be represented, and also to review some basic concepts in data manipulation in R.
## No need to run this - for reference only!
BiocManager::install("breastCancerVDX")
To make the dataset accessible in R, we first need to load the package. If we navigate to the documentation for breastCancerVDX
in RStudio, we find that it provides an object called vdx
which we load into R’s memory using the data
function.
library(breastCancerVDX)
data(vdx)
The object vdx
is a representation of breast cancer dataset that has been converted for use with standard Bioconductor tools. The package authors don’t envisage that we will want to view the entire dataset at once, so have provided a number of ways to interact with the data
vdx
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 22283 features, 344 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: VDX_3 VDX_5 ... VDX_2038 (344 total)
## varLabels: samplename dataset ... e.os (21 total)
## varMetadata: labelDescription
## featureData
## featureNames: 1007_s_at 1053_at ... AFFX-TrpnX-M_at (22283
## total)
## fvarLabels: probe Gene.title ... GO.Component.1 (22 total)
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## pubMedIds: 17420468
## Annotation: hgu133a
The expression values can be obtained by the exprs
function:
<-
is used for assignment to create a new variablematrix
in R
dim
, ncol
, nrow
etc.eValues <- exprs(vdx) # also found at vdx@assayData$exprs
class(eValues)
## [1] "matrix"
dim(eValues)
## [1] 22283 344
ncol(eValues)
## [1] 344
nrow(eValues)
## [1] 22283
[row, column]
notation
c
is used to make a one-dimensional vector:
can used to stand for a sequence of consecutive numberseValues[c(1,2,3),c(1,2,3,4)]
## VDX_3 VDX_5 VDX_6 VDX_7
## 1007_s_at 11.965135 11.798593 11.777625 11.538577
## 1053_at 7.895424 7.885696 7.949535 7.481396
## 117_at 8.259272 7.052025 8.225930 8.382408
eValues[1:3,1:4]
## VDX_3 VDX_5 VDX_6 VDX_7
## 1007_s_at 11.965135 11.798593 11.777625 11.538577
## 1053_at 7.895424 7.885696 7.949535 7.481396
## 117_at 8.259272 7.052025 8.225930 8.382408
-
eValues[1:3,1:4][,2:3]
## VDX_5 VDX_6
## 1007_s_at 11.798593 11.777625
## 1053_at 7.885696 7.949535
## 117_at 7.052025 8.225930
eValues[1:3,1:4][,-(2:3)]
## VDX_3 VDX_7
## 1007_s_at 11.965135 11.538577
## 1053_at 7.895424 7.481396
## 117_at 8.259272 8.382408
The most basic plotting function in R is plot
plot
function with a vector will plot the values of that vector against the index
plot(eValues[1,])
boxplot(eValues,outline=FALSE)
The metadata, or phenotypic data, for the samples is retrieved using the pData
function.
metadata <- pData(vdx) #vdx@phenoData@data
head(metadata)
## samplename dataset series id filename size age er grade pgr
## VDX_3 VDX_3 VDX VDX 3 GSM36793.CEL.gz NA 36 0 NA NA
## VDX_5 VDX_5 VDX VDX 5 GSM36796.CEL.gz NA 47 1 3 NA
## VDX_6 VDX_6 VDX VDX 6 GSM36797.CEL.gz NA 44 0 3 NA
## VDX_7 VDX_7 VDX VDX 7 GSM36798.CEL.gz NA 41 0 3 NA
## VDX_8 VDX_8 VDX VDX 8 GSM36800.CEL.gz NA 70 0 NA NA
## VDX_9 VDX_9 VDX VDX 9 GSM36801.CEL.gz NA 62 1 3 NA
## her2 brca.mutation e.dmfs t.dmfs node t.rfs e.rfs treatment tissue
## VDX_3 NA NA 0 3072 0 NA NA 0 1
## VDX_5 NA NA 0 3589 0 NA NA 0 1
## VDX_6 NA NA 1 274 0 NA NA 0 1
## VDX_7 NA NA 0 3224 0 NA NA 0 1
## VDX_8 NA NA 1 1125 0 NA NA 0 1
## VDX_9 NA NA 0 3802 0 NA NA 0 1
## t.os e.os
## VDX_3 NA NA
## VDX_5 NA NA
## VDX_6 NA NA
## VDX_7 NA NA
## VDX_8 NA NA
## VDX_9 NA NA
head(metadata,10)
$
notation.$
, it should give you a list of possible options.head(metadata$samplename,15)
## [1] "VDX_3" "VDX_5" "VDX_6" "VDX_7" "VDX_8" "VDX_9" "VDX_11"
## [8] "VDX_14" "VDX_15" "VDX_17" "VDX_18" "VDX_19" "VDX_20" "VDX_21"
## [15] "VDX_22"
[]
to print
mean
function
class(metadata)
## [1] "data.frame"
class(eValues)
## [1] "matrix"
metadata[1:10,1:5]
## samplename dataset series id filename
## VDX_3 VDX_3 VDX VDX 3 GSM36793.CEL.gz
## VDX_5 VDX_5 VDX VDX 5 GSM36796.CEL.gz
## VDX_6 VDX_6 VDX VDX 6 GSM36797.CEL.gz
## VDX_7 VDX_7 VDX VDX 7 GSM36798.CEL.gz
## VDX_8 VDX_8 VDX VDX 8 GSM36800.CEL.gz
## VDX_9 VDX_9 VDX VDX 9 GSM36801.CEL.gz
## VDX_11 VDX_11 VDX VDX 11 GSM36834.CEL.gz
## VDX_14 VDX_14 VDX VDX 14 GSM36835.CEL.gz
## VDX_15 VDX_15 VDX VDX 15 GSM36836.CEL.gz
## VDX_17 VDX_17 VDX VDX 17 GSM36837.CEL.gz
metadata[(nrow(metadata)-9):nrow(metadata),7:9]
## age er grade
## VDX_1966 NA 0 NA
## VDX_1967 NA 0 NA
## VDX_1976 NA 0 NA
## VDX_1978 NA 0 NA
## VDX_2001 NA 0 NA
## VDX_2007 NA 0 NA
## VDX_2021 NA 0 NA
## VDX_2024 NA 0 NA
## VDX_2025 NA 0 NA
## VDX_2038 NA 0 NA
tail(metadata,10)[,7:9]
## age er grade
## VDX_1966 NA 0 NA
## VDX_1967 NA 0 NA
## VDX_1976 NA 0 NA
## VDX_1978 NA 0 NA
## VDX_2001 NA 0 NA
## VDX_2007 NA 0 NA
## VDX_2021 NA 0 NA
## VDX_2024 NA 0 NA
## VDX_2025 NA 0 NA
## VDX_2038 NA 0 NA
hist(metadata$age)
mean(metadata$age)
## [1] NA
mean(metadata$age,na.rm=T)
## [1] 53.87762
So far we have been able to print out a subset of our data by specifying a set of numeric indices (e.g. first 10 rows etc). Lets say we’re interested in high-grade tumours, in which case we might not know in advance which rows these correspond to
==
>
, <
, !=
can used to make a logical comparisonTRUE
or FALSE
indicating whether each entry satisfies the test condition, or not.
NA
the resulting logical vector will have NA
at the same positionstable(metadata$grade == 3)
##
## FALSE TRUE
## 49 148
table(metadata$grade == 3,useNA="ifany")
##
## FALSE TRUE <NA>
## 49 148 147
Such a logical vector can then be used for subsetting
which
can be used to make sure there are no NA
values in the logical vectors
TRUE
[]
head(metadata[which(metadata$grade == 3),])
## samplename dataset series id filename size age er grade pgr
## VDX_5 VDX_5 VDX VDX 5 GSM36796.CEL.gz NA 47 1 3 NA
## VDX_6 VDX_6 VDX VDX 6 GSM36797.CEL.gz NA 44 0 3 NA
## VDX_7 VDX_7 VDX VDX 7 GSM36798.CEL.gz NA 41 0 3 NA
## VDX_9 VDX_9 VDX VDX 9 GSM36801.CEL.gz NA 62 1 3 NA
## VDX_11 VDX_11 VDX VDX 11 GSM36834.CEL.gz NA 69 1 3 NA
## VDX_14 VDX_14 VDX VDX 14 GSM36835.CEL.gz NA 68 0 3 NA
## her2 brca.mutation e.dmfs t.dmfs node t.rfs e.rfs treatment tissue
## VDX_5 NA NA 0 3589 0 NA NA 0 1
## VDX_6 NA NA 1 274 0 NA NA 0 1
## VDX_7 NA NA 0 3224 0 NA NA 0 1
## VDX_9 NA NA 0 3802 0 NA NA 0 1
## VDX_11 NA NA 0 3315 0 NA NA 0 1
## VDX_14 NA NA 1 426 0 NA NA 0 1
## t.os e.os
## VDX_5 NA NA
## VDX_6 NA NA
## VDX_7 NA NA
## VDX_9 NA NA
## VDX_11 NA NA
## VDX_14 NA NA
Can use same expression to subset the columns of the expression matrix
head(colnames(eValues))
head(rownames(metadata))
table(colnames(eValues) == rownames(metadata))
eValues[,which(metadata$grade==3)][1:10,1:10]
## VDX_5 VDX_6 VDX_7 VDX_9 VDX_11 VDX_14
## 1007_s_at 11.798593 11.777625 11.538577 12.248698 12.259243 11.695620
## 1053_at 7.885696 7.949535 7.481396 7.708049 8.182891 7.970394
## 117_at 7.052025 8.225930 8.382408 7.731319 8.226412 7.710118
## 121_at 10.666845 10.888819 10.795472 10.886687 10.192909 11.275077
## 1255_g_at 5.452859 6.456149 6.147714 6.551516 6.376777 6.620586
## 1294_at 9.585901 9.422906 9.456970 9.831782 9.404290 9.292552
## 1316_at 7.644577 7.065012 8.121534 7.573647 7.068241 8.377644
## 1320_at 5.066089 4.661065 6.400879 4.153805 3.678072 4.209453
## 1405_i_at 10.043711 9.350276 11.520619 13.012869 9.896030 8.618386
## 1431_at 6.230741 6.263034 5.951868 6.183883 5.741467 6.797013
## VDX_15 VDX_17 VDX_18 VDX_20
## 1007_s_at 11.577193 12.483665 11.899130 12.414157
## 1053_at 6.721099 8.005625 6.815063 7.067165
## 117_at 7.375908 7.488644 7.209453 9.775939
## 121_at 10.614526 10.664625 10.772892 10.623607
## 1255_g_at 6.044394 6.244126 5.311067 6.266787
## 1294_at 8.572511 10.301839 9.430871 8.914684
## 1316_at 7.302867 7.780048 8.203103 7.651769
## 1320_at 4.887525 6.456149 5.385431 6.230741
## 1405_i_at 8.355792 9.875289 7.210428 9.133656
## 1431_at 5.744161 6.407693 6.560715 5.617651
vdx
object by sample subsets if we wishvdx[,which(metadata$grade==3)]
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 22283 features, 148 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: VDX_5 VDX_6 ... VDX_913 (148 total)
## varLabels: samplename dataset ... e.os (21 total)
## varMetadata: labelDescription
## featureData
## featureNames: 1007_s_at 1053_at ... AFFX-TrpnX-M_at (22283
## total)
## fvarLabels: probe Gene.title ... GO.Component.1 (22 total)
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## pubMedIds: 17420468
## Annotation: hgu133a
Previously, we used a boxplot to visualise the expression levels of all genes in a given sample to look for trends across the dataset. Another use for a boxplot is to visualise the expression level of a particular gene with respect to the sample metadata
$
and use the formula syntax
table
in this case will tell us how many observations of each category are presentfactor
type if requiredfac <- metadata$er
table(fac)
## fac
## 0 1
## 135 209
boxplot(eValues[1,] ~ fac,
xlab="ER Status",
ylab="Expression Level",
col=c("steelblue","goldenrod"))
Performing a test to assess significance follows similar syntax
t.test
is the generic function to perform a t-test, and can be adapted to different circumstances
t.test
for more detailst.test(eValues[1,]~fac)
##
## Welch Two Sample t-test
##
## data: eValues[1, ] by fac
## t = -1.7476, df = 244.15, p-value = 0.08179
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.2101311 0.0125555
## sample estimates:
## mean in group 0 mean in group 1
## 11.72515 11.82394
We could be lucky, and the first row in the expression matrix could be our favourite gene of interest! However, this is unlikely to be the case and we will have to figure out which row we want to plot
vdx
object; the feature datafData
function to access these datafeatures <- fData(vdx) #try to figure out the location in memory of this data.frame via @ and $
class(features)
## [1] "data.frame"
head(features[,1:5])
## probe Gene.title
## 1007_s_at 1007_s_at discoidin domain receptor tyrosine kinase 1
## 1053_at 1053_at replication factor C (activator 1) 2, 40kDa
## 117_at 117_at heat shock 70kDa protein 6 (HSP70B')
## 121_at 121_at paired box 8
## 1255_g_at 1255_g_at guanylate cyclase activator 1A (retina)
## 1294_at 1294_at ubiquitin-like modifier activating enzyme 7
## Gene.symbol Gene.ID EntrezGene.ID
## 1007_s_at DDR1 780 780
## 1053_at RFC2 5982 5982
## 117_at HSPA6 3310 3310
## 121_at PAX8 7849 7849
## 1255_g_at GUCA1A 2978 2978
## 1294_at UBA7 7318 7318
colnames(features)
## [1] "probe" "Gene.title"
## [3] "Gene.symbol" "Gene.ID"
## [5] "EntrezGene.ID" "UniGene.title"
## [7] "UniGene.symbol" "UniGene.ID"
## [9] "Nucleotide.Title" "GI"
## [11] "GenBank.Accession" "Platform_CLONEID"
## [13] "Platform_ORF" "Platform_SPOTID"
## [15] "Chromosome.location" "Chromosome.annotation"
## [17] "GO.Function" "GO.Process"
## [19] "GO.Component" "GO.Function.1"
## [21] "GO.Process.1" "GO.Component.1"
As we know, gene symbols (more-common gene names) can be accessed using the $
syntax; returning a vector
head(features$Gene.symbol)
## [1] "DDR1" "RFC2" "HSPA6" "PAX8" "GUCA1A" "UBA7"
We could inspect the data frame manually (e.g. using View(features)
) and identify the row number corresponding to our gene of interest. However, as aspiring R programmers, there is a better way
==
to test for exact matchingmatch
will return the index of the first matchgrep
can be used for partial matcheswhich(features$Gene.symbol == "BRCA1")
## [1] 4058 11245
match("BRCA1",features$Gene.symbol)
## [1] 4058
grep("BRCA1",features$Gene.symbol)
## [1] 4058 11245
grep("BRCA",features$Gene.symbol)
## [1] 4058 7869 11245 14103
ESR1
table(rownames(features) == rownames(eValues))
##
## TRUE
## 22283
which(features$Gene.symbol == "ESR1")
## [1] 4752 10671 10672 10673 11032 14924 14925 16530 16557
match("ESR1",features$Gene.symbol)
## [1] 4752
boxplot(eValues[match("ESR1",features$Gene.symbol),] ~ fac,
xlab="ER Status",
ylab="Expression Level",
col=c("steelblue","goldenrod"))
Later in the course we will see how to execute a differential expression analysis for RNA-seq data, and discuss some of the issues surrounding this. For now we will perform a simple two-sample t-test for all genes in our study, and derive a results table
genefilter
package which has a very convenient function for performing many t tests in parallelrowttests
will run a t-test for each row in a given matrix and produce an output table
statistic
; test statisticdm
; difference in meansp.value
; the p-valuerowttests
expects a factor as the second argument, so we have to use as.factor
?rowttests
library(genefilter)
tstats <- rowttests(eValues, as.factor(metadata$er))
head(tstats)
## statistic dm p.value
## 1007_s_at -1.826397 -0.09878782 6.866210e-02
## 1053_at 6.419587 0.54673000 4.565205e-10
## 117_at 2.787517 0.17342654 5.608148e-03
## 121_at 2.088925 0.08258267 3.745333e-02
## 1255_g_at 1.750130 0.17399402 8.099267e-02
## 1294_at -4.113639 -0.22451339 4.883874e-05
hist(tstats$statistic)
The rows are ordered in the same way as the input matrix
order
functionorder
will return a vector of the same length giving the permutation that rearranges that vector into ascending or descending ordersort
function shortcuts using the output of order
to rearrange the original vector by returning the sorted vectorx <- c(9, 3, 4, 2, 1, 6,5, 10, 8, 7)
x
## [1] 9 3 4 2 1 6 5 10 8 7
order(x)
## [1] 5 4 2 3 7 6 10 9 1 8
x[order(x)]
## [1] 1 2 3 4 5 6 7 8 9 10
sort(x)
## [1] 1 2 3 4 5 6 7 8 9 10
head(tstats[order(tstats$p.value,decreasing = FALSE),])
## statistic dm p.value
## 205225_at -22.58420 -3.762901 9.004544e-70
## 209603_at -18.89942 -3.052348 4.700701e-55
## 209604_s_at -17.53072 -2.431309 1.523875e-49
## 212956_at -17.42552 -2.157435 4.037559e-49
## 202088_at -17.25845 -1.719680 1.895888e-48
## 212496_s_at -16.82538 -1.459843 1.039846e-46
However, the table we get isn’t immediately useful unless we can recognise the manufacturer probe IDs
cbind
) the columns of test statistic with those from the feature matrixtable(rownames(tstats) == rownames(features))
##
## TRUE
## 22283
tstats.annotated <- cbind(features[,c("Gene.symbol","EntrezGene.ID","Chromosome.location")], tstats)
head(tstats.annotated)
## Gene.symbol EntrezGene.ID Chromosome.location statistic
## 1007_s_at DDR1 780 6p21.3 -1.826397
## 1053_at RFC2 5982 7q11.23 6.419587
## 117_at HSPA6 3310 1q23 2.787517
## 121_at PAX8 7849 2q12-q14 2.088925
## 1255_g_at GUCA1A 2978 6p21.1 1.750130
## 1294_at UBA7 7318 3p21 -4.113639
## dm p.value
## 1007_s_at -0.09878782 6.866210e-02
## 1053_at 0.54673000 4.565205e-10
## 117_at 0.17342654 5.608148e-03
## 121_at 0.08258267 3.745333e-02
## 1255_g_at 0.17399402 8.099267e-02
## 1294_at -0.22451339 4.883874e-05
Now when we order by p-value, the extra columns that we just added allow us to interpret the results more easily
tstats.ordered <- tstats.annotated[order(tstats$p.value,decreasing = FALSE),]
head(tstats.ordered)
## Gene.symbol EntrezGene.ID Chromosome.location statistic
## 205225_at ESR1 2099 6q25.1 -22.58420
## 209603_at GATA3 2625 10p15 -18.89942
## 209604_s_at GATA3 2625 10p15 -17.53072
## 212956_at TBC1D9 23158 4q31.21 -17.42552
## 202088_at SLC39A6 25800 18q12.2 -17.25845
## 212496_s_at KDM4B 23030 19p13.3 -16.82538
## dm p.value
## 205225_at -3.762901 9.004544e-70
## 209603_at -3.052348 4.700701e-55
## 209604_s_at -2.431309 1.523875e-49
## 212956_at -2.157435 4.037559e-49
## 202088_at -1.719680 1.895888e-48
## 212496_s_at -1.459843 1.039846e-46
We can also query this table to look for our favourite genes of interest
%in%
is a simplified way to perform matches to multiple items in a vectortstats.ordered[grep("ESR1",tstats.ordered$Gene.symbol),]
## Gene.symbol EntrezGene.ID Chromosome.location statistic
## 205225_at ESR1 2099 6q25.1 -22.5842021
## 215552_s_at ESR1 2099 6q25.1 -7.6049898
## 217190_x_at ESR1 2099 6q25.1 -6.1260692
## 211235_s_at ESR1 2099 6q25.1 -5.1799122
## 211233_x_at ESR1 2099 6q25.1 -4.3214212
## 211234_x_at ESR1 2099 6q25.1 -2.1071684
## 215551_at ESR1 2099 6q25.1 -1.5416740
## 217163_at ESR1 2099 6q25.1 1.4049015
## 211627_x_at ESR1 2099 6q25.1 0.3835679
## dm p.value
## 205225_at -3.76290050 9.004544e-70
## 215552_s_at -0.88635602 2.779492e-13
## 217190_x_at -0.80822337 2.476553e-09
## 211235_s_at -0.65885102 3.799727e-07
## 211233_x_at -0.47809985 2.035049e-05
## 211234_x_at -0.24655890 3.583069e-02
## 215551_at -0.10875680 1.240776e-01
## 217163_at 0.19803578 1.609582e-01
## 211627_x_at 0.05999355 7.015371e-01
tstats.ordered[tstats.ordered$Gene.symbol %in% c("ESR1","GATA3","FOXA1"),]
## Gene.symbol EntrezGene.ID Chromosome.location statistic
## 205225_at ESR1 2099 6q25.1 -22.5842021
## 209603_at GATA3 2625 10p15 -18.8994157
## 209604_s_at GATA3 2625 10p15 -17.5307220
## 209602_s_at GATA3 2625 10p15 -16.3574865
## 204667_at FOXA1 3169 14q12-q13 -14.0650865
## 215552_s_at ESR1 2099 6q25.1 -7.6049898
## 217190_x_at ESR1 2099 6q25.1 -6.1260692
## 211235_s_at ESR1 2099 6q25.1 -5.1799122
## 211233_x_at ESR1 2099 6q25.1 -4.3214212
## 211234_x_at ESR1 2099 6q25.1 -2.1071684
## 215551_at ESR1 2099 6q25.1 -1.5416740
## 217163_at ESR1 2099 6q25.1 1.4049015
## 211627_x_at ESR1 2099 6q25.1 0.3835679
## dm p.value
## 205225_at -3.76290050 9.004544e-70
## 209603_at -3.05234794 4.700701e-55
## 209604_s_at -2.43130913 1.523875e-49
## 209602_s_at -2.92150517 7.790349e-45
## 204667_at -2.11853144 8.982816e-36
## 215552_s_at -0.88635602 2.779492e-13
## 217190_x_at -0.80822337 2.476553e-09
## 211235_s_at -0.65885102 3.799727e-07
## 211233_x_at -0.47809985 2.035049e-05
## 211234_x_at -0.24655890 3.583069e-02
## 215551_at -0.10875680 1.240776e-01
## 217163_at 0.19803578 1.609582e-01
## 211627_x_at 0.05999355 7.015371e-01
csv
file (hint: use write.csv
)
p.adjust
to produce a vector of p-values that are adjusted. Add this as an extra column to your table of results and write as a fileDE.genes <- tstats.annotated[tstats.annotated$p.value<0.05,]
write.csv(DE.genes,file="~/Desktop/DE.gene.csv") #quotations in raw file
tstats.annotated$adj.p.val <- p.adjust(tstats.annotated$p.value)
DE.genes <- tstats.annotated[tstats.annotated$adj.p.val<0.05,]
write.csv(DE.genes,file="~/Desktop/DE.gene.csv") #quotations in raw file
The object vdx
, which contains the data we’ve been using, is an ExpressionSet object (check using the class
function). ExpressionSet’s are used as input in a variety of bioinformatic packages and pipelines. As we found above, an ExpressionSet usually contains the following data:
Let’s generate some data and create our own ExpressionSet from scratch - rnorm
generates normally distrubruted values - matrix
takes a vector and arranges it into the specified number of rows and columns - rep
is used to repeat a vector or its elements - sample
samples the input vector, with or without replacement - AnnotatedDataFrame
is a special container that conveniently stores and manipulates data - Remeber that you can use help (?
) to understand unknown functions
eData <- matrix(rnorm(100000),nrow = 1000,ncol = 100)
mData <- AnnotatedDataFrame(data.frame(Sex=rep(c("M","F"),each=50),biomarker.X=rnorm(100)))
fData <- AnnotatedDataFrame(data.frame(Gene=1:1000,Chr=sample(1:23,1000,replace = T)))
exampleSet <- ExpressionSet(
assayData=eData,
phenoData=mData,
featureData=fData
)
p.adjust
)eData[sample(1:1000,1),1:50] <- rnorm(50) + 1
x <- rowttests(eData,mData$Sex)
x[x$p.value<0.05,]
## statistic dm p.value
## 109 -2.550617 -0.5246087 0.012301189
## 135 -2.267937 -0.4423454 0.025528111
## 167 -2.006354 -0.4051639 0.047574015
## 185 2.172237 0.3601721 0.032249700
## 206 -2.960701 -0.7009906 0.003850057
## 217 2.248770 0.4855720 0.026766214
## 219 2.328365 0.4887082 0.021947872
## 220 3.337707 0.5804885 0.001194962
## 225 -2.130198 -0.3816484 0.035659149
## 253 -2.123878 -0.4044676 0.036197763
## 259 2.037844 0.3540993 0.044260316
## 288 -2.089326 -0.4314059 0.039268872
## 318 -2.453338 -0.4299312 0.015919935
## 360 2.283950 0.4853173 0.024532566
## 380 -3.033601 -0.6114122 0.003093490
## 383 2.753506 0.5116743 0.007027684
## 480 2.003204 0.4129679 0.047916924
## 487 2.375866 0.4459444 0.019452660
## 503 2.195379 0.4100578 0.030496734
## 513 2.277095 0.4555513 0.024954522
## 534 2.187405 0.4400729 0.031091076
## 553 2.120832 0.4476843 0.036459904
## 581 2.171876 0.4961963 0.032277752
## 593 -2.490590 -0.4685386 0.014434732
## 599 3.023485 0.6250911 0.003189536
## 666 1.991613 0.4312513 0.049196602
## 682 2.041925 0.4183467 0.043845756
## 685 -2.323825 -0.4204635 0.022200583
## 703 2.277704 0.4499552 0.024916744
## 728 -2.028053 -0.3871314 0.045268784
## 750 -2.406763 -0.4860006 0.017967910
## 765 2.262630 0.4659998 0.025865803
## 768 1.993142 0.3716602 0.049026181
## 771 -2.103070 -0.4642912 0.038021352
## 788 2.066365 0.3919579 0.041431700
## 810 2.250442 0.4427737 0.026656190
## 817 3.191158 0.6423427 0.001904363
## 842 2.114508 0.4448799 0.037009303
## 856 -2.042689 -0.4059577 0.043768447
## 877 -2.312479 -0.4441756 0.022843342
## 878 -2.254375 -0.4269911 0.026398838
## 923 -2.181983 -0.4187826 0.031501012
## 935 2.178914 0.4334397 0.031735145
## 953 -2.799210 -0.5565519 0.006169991
## 974 2.241139 0.4100998 0.027273631
## 982 -2.777263 -0.5078606 0.006569144
## 986 2.680898 0.5699136 0.008615751
## 990 2.512063 0.4968216 0.013636148
x$adj.p.val <- p.adjust(x$p.value)
x[x$adj.p.val<0.05,]
## [1] statistic dm p.value adj.p.val
## <0 rows> (or 0-length row.names)
eData[sample(1:1000,1),1:50] <- rnorm(50) + 1
x <- rowttests(eData,mData$Sex)
x$adj.p.val <- p.adjust(x$p.value)
x[x$adj.p.val<0.05,]
## statistic dm p.value adj.p.val
## 815 -5.160559 -1.191252 1.288028e-06 0.001288028