If you haven’t learned the basics of R prior to attending this course, you should check out our R crash course for an overview of R’s syntax. It’s also a great refresher if you feel it has been a while since you last worked with R.
Markdown is a very simple way of writing a template to produce a pdf, HTML or word document. For example, the compiled version of this document is available online and is more convenient to browse here.
EVAL = FALSE
or echo = FALSE
print("Hello World")
[1] "Hello World"
When viewing the R notebooks directly, not the compiled documents, sections may have additional characters so as to format them nicely when compiled. For example:
This will be displayed in italic
This will be displayed in bold
You can also add hyperlinks, images and tables.
Lastly, you can even embed chunks of code written in other programming languages.
More help is available through RStudio Help -> Markdown Quick Reference or you can view a cheat sheet here.
To create markdown files for your own analysis; File -> New File -> R Markdown…
Established in 2001, Bioconductor provided a convenient method to distribute tools for the analysis and comprehension of high-throughput genomic data in R. Initially focused on microarrays, Bioconductor now has packages (read: software) to process data obtained from most modern data sources.
On the Bioconductor website, you will find
For this session, we will introduce the Bioconductor project as a means of analysing high-throughput data
All Bioconductor software packages are listed under
biocLite
command
## You don't need to run this, edgeR should already be installed for the course
source("http://www.bioconductor.org/biocLite.R")
biocLite("edgeR")
Once installed, a Bioconductor package can be loaded in the usual way with the library
function. All packages are required to have a vignette which gives detailed instructions on how to use the package and the workflow of commands. Some packages such as edgeR
have very comprehensive user guides with lots of use-cases.
library(edgeR)
vignette("edgeR")
edgeRUsersGuide()
Package documentation can also be accessed via the Help tab in RStudio, which can also be invoked in the console using “?”
?edgeR
Complex data structures are used in Bioconductor to represent high-throughput data, but we often have simple functions that we can use to access the data. We will use some example data available through Bioconductor to demonstrate how high-throughput data can be represented, and also to review some basic concepts in data manipulation in R.
## No need to run this - for reference only!
biocLite("breastCancerVDX")
To make the dataset accessible in R, we first need to load the package. If we navigate to the documentation for breastCancerVDX
in RStudio, we find that it provides an object called vdx
which we load into R’s memory using the data
function.
library(breastCancerVDX)
data(vdx)
The object vdx
is a representation of breast cancer dataset that has been converted for use with standard Bioconductor tools. The package authors don’t envisage that we will want to view the entire dataset at once, so have provided a number of ways to interact with the data
vdx
ExpressionSet (storageMode: lockedEnvironment)
assayData: 22283 features, 344 samples
element names: exprs
protocolData: none
phenoData
sampleNames: VDX_3 VDX_5 ... VDX_2038 (344 total)
varLabels: samplename dataset ... e.os (21 total)
varMetadata: labelDescription
featureData
featureNames: 1007_s_at 1053_at ... AFFX-TrpnX-M_at (22283 total)
fvarLabels: probe Gene.title ... GO.Component.1 (22 total)
fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
pubMedIds: 17420468
Annotation: hgu133a
The expression values can be obtained by the exprs
function:-
<-
is used for assignment to create a new variablematrix
in R
dim
, ncol
, nrow
etc.eValues <- exprs(vdx) # also found at vdx@assayData$exprs
class(eValues)
[1] "matrix"
dim(eValues)
[1] 22283 344
ncol(eValues)
[1] 344
nrow(eValues)
[1] 22283
[row, column]
notation
c
is used to make a one-dimensional vector:
can used to stand for a sequence of consecutive numberseValues[c(1,2,3),c(1,2,3,4)]
VDX_3 VDX_5 VDX_6 VDX_7
1007_s_at 11.965135 11.798593 11.777625 11.538577
1053_at 7.895424 7.885696 7.949535 7.481396
117_at 8.259272 7.052025 8.225930 8.382408
eValues[1:3,1:4]
VDX_3 VDX_5 VDX_6 VDX_7
1007_s_at 11.965135 11.798593 11.777625 11.538577
1053_at 7.895424 7.885696 7.949535 7.481396
117_at 8.259272 7.052025 8.225930 8.382408
-
eValues[1:3,1:4][,2:3]
VDX_5 VDX_6
1007_s_at 11.798593 11.777625
1053_at 7.885696 7.949535
117_at 7.052025 8.225930
eValues[1:3,1:4][,-(2:3)]
VDX_3 VDX_7
1007_s_at 11.965135 11.538577
1053_at 7.895424 7.481396
117_at 8.259272 8.382408
The most basic plotting function in R is plot
plot
function with a vector will plot the values of that vector against the index
plot(eValues[1,])
boxplot(eValues,outline=FALSE)
The metadata, or phenotypic data, for the samples is retrieved using the pData
function.
metadata <- pData(vdx) #vdx@phenoData@data
head(metadata)
samplename dataset series id filename size age er grade pgr her2 brca.mutation e.dmfs t.dmfs node t.rfs e.rfs treatment tissue t.os e.os
VDX_3 VDX_3 VDX VDX 3 GSM36793.CEL.gz NA 36 0 NA NA NA NA 0 3072 0 NA NA 0 1 NA NA
VDX_5 VDX_5 VDX VDX 5 GSM36796.CEL.gz NA 47 1 3 NA NA NA 0 3589 0 NA NA 0 1 NA NA
VDX_6 VDX_6 VDX VDX 6 GSM36797.CEL.gz NA 44 0 3 NA NA NA 1 274 0 NA NA 0 1 NA NA
VDX_7 VDX_7 VDX VDX 7 GSM36798.CEL.gz NA 41 0 3 NA NA NA 0 3224 0 NA NA 0 1 NA NA
VDX_8 VDX_8 VDX VDX 8 GSM36800.CEL.gz NA 70 0 NA NA NA NA 1 1125 0 NA NA 0 1 NA NA
VDX_9 VDX_9 VDX VDX 9 GSM36801.CEL.gz NA 62 1 3 NA NA NA 0 3802 0 NA NA 0 1 NA NA
$
notation.$
, it should give you a list of possible options.head(metadata$samplename)
[1] "VDX_3" "VDX_5" "VDX_6" "VDX_7" "VDX_8" "VDX_9"
[]
to print
mean
function
#FILL IN SOLUTIONS HERE
So far we have been able to print out a subset of our data by specifying a set of numeric indices (e.g. first 10 rows etc). Lets say we’re interested in high-grade tumours, in which case we might not know in advance which rows these correspond to
==
>
, <
, !=
can used to make a logical comparisonTRUE
or FALSE
indicating whether each entry satisfies the test condition, or not.
NA
the resulting logical vector will have NA
at the same positionstable(metadata$grade == 3)
FALSE TRUE
49 148
table(metadata$grade == 3,useNA="ifany")
FALSE TRUE <NA>
49 148 147
Such a logical vector can then be used for subsetting
which
can be used to make sure there are no NA
values in the logical vectors
TRUE
[]
metadata[which(metadata$grade == 3),]
samplename dataset series id filename size age er grade pgr her2 brca.mutation e.dmfs t.dmfs node t.rfs e.rfs treatment tissue t.os e.os
VDX_5 VDX_5 VDX VDX 5 GSM36796.CEL.gz NA 47 1 3 NA NA NA 0 3589 0 NA NA 0 1 NA NA
VDX_6 VDX_6 VDX VDX 6 GSM36797.CEL.gz NA 44 0 3 NA NA NA 1 274 0 NA NA 0 1 NA NA
VDX_7 VDX_7 VDX VDX 7 GSM36798.CEL.gz NA 41 0 3 NA NA NA 0 3224 0 NA NA 0 1 NA NA
VDX_9 VDX_9 VDX VDX 9 GSM36801.CEL.gz NA 62 1 3 NA NA NA 0 3802 0 NA NA 0 1 NA NA
VDX_11 VDX_11 VDX VDX 11 GSM36834.CEL.gz NA 69 1 3 NA NA NA 0 3315 0 NA NA 0 1 NA NA
VDX_14 VDX_14 VDX VDX 14 GSM36835.CEL.gz NA 68 0 3 NA NA NA 1 426 0 NA NA 0 1 NA NA
VDX_15 VDX_15 VDX VDX 15 GSM36836.CEL.gz NA 44 1 3 NA NA NA 0 3011 0 NA NA 0 1 NA NA
VDX_17 VDX_17 VDX VDX 17 GSM36837.CEL.gz NA 51 1 3 NA NA NA 0 4167 0 NA NA 0 1 NA NA
VDX_18 VDX_18 VDX VDX 18 GSM36838.CEL.gz NA 32 1 3 NA NA NA 1 1034 0 NA NA 0 1 NA NA
VDX_20 VDX_20 VDX VDX 20 GSM36855.CEL.gz NA 34 0 3 NA NA NA 0 3893 0 NA NA 0 1 NA NA
VDX_22 VDX_22 VDX VDX 22 GSM36859.CEL.gz NA 32 1 3 NA NA NA 0 3954 0 NA NA 0 1 NA NA
VDX_27 VDX_27 VDX VDX 27 GSM36860.CEL.gz NA 47 1 3 NA NA NA 1 913 0 NA NA 0 1 NA NA
VDX_29 VDX_29 VDX VDX 29 GSM36862.CEL.gz NA 42 0 3 NA NA NA 1 760 0 NA NA 0 1 NA NA
VDX_31 VDX_31 VDX VDX 31 GSM36870.CEL.gz NA 49 1 3 NA NA NA 1 913 0 NA NA 0 1 NA NA
VDX_32 VDX_32 VDX VDX 32 GSM36871.CEL.gz NA 44 1 3 NA NA NA 0 2555 0 NA NA 0 1 NA NA
VDX_37 VDX_37 VDX VDX 37 GSM36875.CEL.gz NA 62 0 3 NA NA NA 1 213 0 NA NA 0 1 NA NA
VDX_39 VDX_39 VDX VDX 39 GSM36877.CEL.gz NA 38 1 3 NA NA NA 1 1308 0 NA NA 0 1 NA NA
VDX_41 VDX_41 VDX VDX 41 GSM36897.CEL.gz NA 46 1 3 NA NA NA 1 760 0 NA NA 0 1 NA NA
VDX_46 VDX_46 VDX VDX 46 GSM36900.CEL.gz NA 60 1 3 NA NA NA 0 3315 0 NA NA 0 1 NA NA
VDX_48 VDX_48 VDX VDX 48 GSM36901.CEL.gz NA 29 1 3 NA NA NA 0 3072 0 NA NA 0 1 NA NA
VDX_49 VDX_49 VDX VDX 49 GSM36902.CEL.gz NA 32 1 3 NA NA NA 1 852 0 NA NA 0 1 NA NA
VDX_55 VDX_55 VDX VDX 55 GSM36918.CEL.gz NA 54 0 3 NA NA NA 1 335 0 NA NA 0 1 NA NA
VDX_57 VDX_57 VDX VDX 57 GSM36920.CEL.gz NA 53 1 3 NA NA NA 1 608 0 NA NA 0 1 NA NA
VDX_61 VDX_61 VDX VDX 61 GSM36937.CEL.gz NA 27 0 3 NA NA NA 1 335 0 NA NA 0 1 NA NA
VDX_64 VDX_64 VDX VDX 64 GSM36940.CEL.gz NA 47 0 3 NA NA NA 0 4076 0 NA NA 0 1 NA NA
VDX_65 VDX_65 VDX VDX 65 GSM36941.CEL.gz NA 58 0 3 NA NA NA 1 274 0 NA NA 0 1 NA NA
VDX_67 VDX_67 VDX VDX 67 GSM36943.CEL.gz NA 32 1 3 NA NA NA 1 578 0 NA NA 0 1 NA NA
VDX_68 VDX_68 VDX VDX 68 GSM36944.CEL.gz NA 49 1 3 NA NA NA 0 3407 0 NA NA 0 1 NA NA
VDX_72 VDX_72 VDX VDX 72 GSM36965.CEL.gz NA 40 1 3 NA NA NA 0 4076 0 NA NA 0 1 NA NA
VDX_74 VDX_74 VDX VDX 74 GSM36966.CEL.gz NA 70 0 3 NA NA NA 0 2738 0 NA NA 0 1 NA NA
VDX_79 VDX_79 VDX VDX 79 GSM36969.CEL.gz NA 41 0 3 NA NA NA 1 973 0 NA NA 0 1 NA NA
VDX_80 VDX_80 VDX VDX 80 GSM36987.CEL.gz NA 52 1 3 NA NA NA 0 2950 0 NA NA 0 1 NA NA
VDX_84 VDX_84 VDX VDX 84 GSM36991.CEL.gz NA 66 0 3 NA NA NA 0 3468 0 NA NA 0 1 NA NA
VDX_86 VDX_86 VDX VDX 86 GSM36992.CEL.gz NA 55 1 3 NA NA NA 0 3163 0 NA NA 0 1 NA NA
VDX_88 VDX_88 VDX VDX 88 GSM36993.CEL.gz NA 55 1 3 NA NA NA 0 2981 0 NA NA 0 1 NA NA
VDX_90 VDX_90 VDX VDX 90 GSM37017.CEL.gz NA 41 0 3 NA NA NA 0 3281 0 NA NA 0 1 NA NA
VDX_92 VDX_92 VDX VDX 92 GSM37019.CEL.gz NA 51 1 3 NA NA NA 0 4441 0 NA NA 0 1 NA NA
VDX_93 VDX_93 VDX VDX 93 GSM37020.CEL.gz NA 37 0 3 NA NA NA 1 1460 0 NA NA 0 1 NA NA
VDX_98 VDX_98 VDX VDX 98 GSM37023.CEL.gz NA 53 0 3 NA NA NA 1 426 0 NA NA 0 1 NA NA
VDX_99 VDX_99 VDX VDX 99 GSM37025.CEL.gz NA 80 1 3 NA NA NA 0 3255 0 NA NA 0 1 NA NA
VDX_103 VDX_103 VDX VDX 103 GSM36880.CEL.gz NA 41 1 3 NA NA NA 0 2950 0 NA NA 0 1 NA NA
VDX_105 VDX_105 VDX VDX 105 GSM36882.CEL.gz NA 50 1 3 NA NA NA 0 3011 0 NA NA 0 1 NA NA
VDX_107 VDX_107 VDX VDX 107 GSM36886.CEL.gz NA 53 0 3 NA NA NA 0 2859 0 NA NA 0 1 NA NA
VDX_109 VDX_109 VDX VDX 109 GSM36891.CEL.gz NA 65 0 3 NA NA NA 0 2646 0 NA NA 0 1 NA NA
VDX_110 VDX_110 VDX VDX 110 GSM36903.CEL.gz NA 73 1 3 NA NA NA 1 243 0 NA NA 0 1 NA NA
VDX_113 VDX_113 VDX VDX 113 GSM36906.CEL.gz NA 47 0 3 NA NA NA 0 3072 0 NA NA 0 1 NA NA
VDX_115 VDX_115 VDX VDX 115 GSM36908.CEL.gz NA 61 1 3 NA NA NA 1 456 0 NA NA 0 1 NA NA
[ reached getOption("max.print") -- omitted 101 rows ]
Can use same expression to subset the columns of the expression matrix
head(colnames(eValues))
head(rownames(metadata))
table(colnames(eValues) == rownames(metadata))
eValues[,which(metadata$grade==3)][1:10,1:10]
VDX_5 VDX_6 VDX_7 VDX_9 VDX_11 VDX_14 VDX_15 VDX_17 VDX_18 VDX_20
1007_s_at 11.798593 11.777625 11.538577 12.248698 12.259243 11.695620 11.577193 12.483665 11.899130 12.414157
1053_at 7.885696 7.949535 7.481396 7.708049 8.182891 7.970394 6.721099 8.005625 6.815063 7.067165
117_at 7.052025 8.225930 8.382408 7.731319 8.226412 7.710118 7.375908 7.488644 7.209453 9.775939
121_at 10.666845 10.888819 10.795472 10.886687 10.192909 11.275077 10.614526 10.664625 10.772892 10.623607
1255_g_at 5.452859 6.456149 6.147714 6.551516 6.376777 6.620586 6.044394 6.244126 5.311067 6.266787
1294_at 9.585901 9.422906 9.456970 9.831782 9.404290 9.292552 8.572511 10.301839 9.430871 8.914684
1316_at 7.644577 7.065012 8.121534 7.573647 7.068241 8.377644 7.302867 7.780048 8.203103 7.651769
1320_at 5.066089 4.661065 6.400879 4.153805 3.678072 4.209453 4.887525 6.456149 5.385431 6.230741
1405_i_at 10.043711 9.350276 11.520619 13.012869 9.896030 8.618386 8.355792 9.875289 7.210428 9.133656
1431_at 6.230741 6.263034 5.951868 6.183883 5.741467 6.797013 5.744161 6.407693 6.560715 5.617651
vdx
object by sample subsets if we wishvdx[,which(metadata$grade==3)]
ExpressionSet (storageMode: lockedEnvironment)
assayData: 22283 features, 148 samples
element names: exprs
protocolData: none
phenoData
sampleNames: VDX_5 VDX_6 ... VDX_913 (148 total)
varLabels: samplename dataset ... e.os (21 total)
varMetadata: labelDescription
featureData
featureNames: 1007_s_at 1053_at ... AFFX-TrpnX-M_at (22283 total)
fvarLabels: probe Gene.title ... GO.Component.1 (22 total)
fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
pubMedIds: 17420468
Annotation: hgu133a
Previously, we used a boxplot to visualise the expression levels of all genes in a given sample to look for trends across the dataset. Another use for a boxplot is to visualise the expression level of a particular gene with respect to the sample metadata
$
and use the formula syntax
table
in this case will tell us how many observations of each category are presentfactor
type if requiredfac <- metadata$er
table(fac)
fac
0 1
135 209
boxplot(eValues[1,] ~ fac,
xlab="ER Status",
ylab="Expression Level",
col=c("steelblue","goldenrod"))
Performing a test to assess significance follows similar syntax
t.test
is the generic function to perform a t-test, and can be adapted to different circumstances
t.test
for more detailst.test(eValues[1,]~fac)
Welch Two Sample t-test
data: eValues[1, ] by fac
t = -1.7476, df = 244.15, p-value = 0.08179
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.2101311 0.0125555
sample estimates:
mean in group 0 mean in group 1
11.72515 11.82394
We could be lucky, and the first row in the expression matrix could be our favourite gene of interest! However, this is unlikely to be the case and we will have to figure out which row we want to plot
nki
object; the feature datafData
function to access these datafeatures <- fData(vdx) #try to figure out the location in memory of this data.frame via @ and $
class(features)
[1] "data.frame"
head(features[,1:5])
probe Gene.title Gene.symbol Gene.ID EntrezGene.ID
1007_s_at 1007_s_at discoidin domain receptor tyrosine kinase 1 DDR1 780 780
1053_at 1053_at replication factor C (activator 1) 2, 40kDa RFC2 5982 5982
117_at 117_at heat shock 70kDa protein 6 (HSP70B') HSPA6 3310 3310
121_at 121_at paired box 8 PAX8 7849 7849
1255_g_at 1255_g_at guanylate cyclase activator 1A (retina) GUCA1A 2978 2978
1294_at 1294_at ubiquitin-like modifier activating enzyme 7 UBA7 7318 7318
colnames(features)
[1] "probe" "Gene.title" "Gene.symbol" "Gene.ID" "EntrezGene.ID" "UniGene.title"
[7] "UniGene.symbol" "UniGene.ID" "Nucleotide.Title" "GI" "GenBank.Accession" "Platform_CLONEID"
[13] "Platform_ORF" "Platform_SPOTID" "Chromosome.location" "Chromosome.annotation" "GO.Function" "GO.Process"
[19] "GO.Component" "GO.Function.1" "GO.Process.1" "GO.Component.1"
As we know, gene symbols (more-common gene names) can be accessed using the $
syntax; returning a vector
head(features$Gene.symbol)
[1] "DDR1" "RFC2" "HSPA6" "PAX8" "GUCA1A" "UBA7"
We could inspect the data frame manually (e.g. using View(features)
) and identify the row number corresponding to our gene of interest. However, as aspiring R programmers, there is a better way
==
to test for exact matchingmatch
will return the index of the first matchgrep
can be used for partial matcheswhich(features$Gene.symbol == "BRCA1")
[1] 4058 11245
match("BRCA1",features$Gene.symbol)
[1] 4058
grep("BRCA1",features$Gene.symbol)
[1] 4058 11245
grep("BRCA",features$Gene.symbol)
[1] 4058 7869 11245 14103
ESR1
#FILL IN SOLUTIONS HERE
Later in the course we will see how to execute a differential expression analysis for RNA-seq data, and discuss some of the issues surrounding this. For now we will perform a simple two-sample t-test for all genes in our study, and derive a results table
genefilter
package which has a very convenient function for performing many t tests in parallelrowttests
will run a t-test for each row in a given matrix and produce an output table
statistic
; test statisticdm
; difference in meansp.value
; the p-valuerowttests
expects a factor as the second argument, so we have to use as.factor
?rowttests
library(genefilter)
tstats <- rowttests(eValues, as.factor(metadata$er))
head(tstats)
statistic dm p.value
1007_s_at -1.826397 -0.09878782 6.866210e-02
1053_at 6.419587 0.54673000 4.565205e-10
117_at 2.787517 0.17342654 5.608148e-03
121_at 2.088925 0.08258267 3.745333e-02
1255_g_at 1.750130 0.17399402 8.099267e-02
1294_at -4.113639 -0.22451339 4.883874e-05
hist(tstats$statistic)
The rows are ordered in the same way as the input matrix
order
functionorder
will return a vector of the same length giving the permutation that rearranges that vector into ascending or descending ordersort
function shortcuts using the output of order
to rearrange the original vector by returning the sorted vectorx <- c(9, 3, 4, 2, 1, 6,5, 10, 8, 7)
x
[1] 9 3 4 2 1 6 5 10 8 7
order(x)
[1] 5 4 2 3 7 6 10 9 1 8
x[order(x)]
[1] 1 2 3 4 5 6 7 8 9 10
sort(x)
[1] 1 2 3 4 5 6 7 8 9 10
head(tstats[order(tstats$p.value,decreasing = FALSE),])
statistic dm p.value
205225_at -22.58420 -3.762901 9.004544e-70
209603_at -18.89942 -3.052348 4.700701e-55
209604_s_at -17.53072 -2.431309 1.523875e-49
212956_at -17.42552 -2.157435 4.037559e-49
202088_at -17.25845 -1.719680 1.895888e-48
212496_s_at -16.82538 -1.459843 1.039846e-46
However, the table we get isn’t immediately useful unless we can recognise the manufacturer probe IDs
cbind
) the columns of test statistic with those from the feature matrixtable(rownames(tstats) == rownames(features))
TRUE
22283
tstats.annotated <- cbind(tstats, features[,c("Gene.symbol","EntrezGene.ID","Chromosome.location")])
head(tstats.annotated)
statistic dm p.value Gene.symbol EntrezGene.ID Chromosome.location
1007_s_at -1.826397 -0.09878782 6.866210e-02 DDR1 780 6p21.3
1053_at 6.419587 0.54673000 4.565205e-10 RFC2 5982 7q11.23
117_at 2.787517 0.17342654 5.608148e-03 HSPA6 3310 1q23
121_at 2.088925 0.08258267 3.745333e-02 PAX8 7849 2q12-q14
1255_g_at 1.750130 0.17399402 8.099267e-02 GUCA1A 2978 6p21.1
1294_at -4.113639 -0.22451339 4.883874e-05 UBA7 7318 3p21
Now when we order by p-value, the extra columns that we just added allow us to interpret the results more easily
tstats.ordered <- tstats.annotated[order(tstats$p.value,decreasing = FALSE),]
head(tstats.ordered)
statistic dm p.value Gene.symbol EntrezGene.ID Chromosome.location
205225_at -22.58420 -3.762901 9.004544e-70 ESR1 2099 6q25.1
209603_at -18.89942 -3.052348 4.700701e-55 GATA3 2625 10p15
209604_s_at -17.53072 -2.431309 1.523875e-49 GATA3 2625 10p15
212956_at -17.42552 -2.157435 4.037559e-49 TBC1D9 23158 4q31.21
202088_at -17.25845 -1.719680 1.895888e-48 SLC39A6 25800 18q12.2
212496_s_at -16.82538 -1.459843 1.039846e-46 KDM4B 23030 19p13.3
We can also query this table to look for our favourite genes of interest
%in%
is a simplified way to perform matches to multiple items in a vectortstats.ordered[grep("ESR1",tstats.ordered$Gene.symbol),]
statistic dm p.value Gene.symbol EntrezGene.ID Chromosome.location
205225_at -22.5842021 -3.76290050 9.004544e-70 ESR1 2099 6q25.1
215552_s_at -7.6049898 -0.88635602 2.779492e-13 ESR1 2099 6q25.1
217190_x_at -6.1260692 -0.80822337 2.476553e-09 ESR1 2099 6q25.1
211235_s_at -5.1799122 -0.65885102 3.799727e-07 ESR1 2099 6q25.1
211233_x_at -4.3214212 -0.47809985 2.035049e-05 ESR1 2099 6q25.1
211234_x_at -2.1071684 -0.24655890 3.583069e-02 ESR1 2099 6q25.1
215551_at -1.5416740 -0.10875680 1.240776e-01 ESR1 2099 6q25.1
217163_at 1.4049015 0.19803578 1.609582e-01 ESR1 2099 6q25.1
211627_x_at 0.3835679 0.05999355 7.015371e-01 ESR1 2099 6q25.1
tstats.ordered[tstats.ordered$Gene.symbol %in% c("ESR1","GATA3","FOXA1"),]
statistic dm p.value Gene.symbol EntrezGene.ID Chromosome.location
205225_at -22.5842021 -3.76290050 9.004544e-70 ESR1 2099 6q25.1
209603_at -18.8994157 -3.05234794 4.700701e-55 GATA3 2625 10p15
209604_s_at -17.5307220 -2.43130913 1.523875e-49 GATA3 2625 10p15
209602_s_at -16.3574865 -2.92150517 7.790349e-45 GATA3 2625 10p15
204667_at -14.0650865 -2.11853144 8.982816e-36 FOXA1 3169 14q12-q13
215552_s_at -7.6049898 -0.88635602 2.779492e-13 ESR1 2099 6q25.1
217190_x_at -6.1260692 -0.80822337 2.476553e-09 ESR1 2099 6q25.1
211235_s_at -5.1799122 -0.65885102 3.799727e-07 ESR1 2099 6q25.1
211233_x_at -4.3214212 -0.47809985 2.035049e-05 ESR1 2099 6q25.1
211234_x_at -2.1071684 -0.24655890 3.583069e-02 ESR1 2099 6q25.1
215551_at -1.5416740 -0.10875680 1.240776e-01 ESR1 2099 6q25.1
217163_at 1.4049015 0.19803578 1.609582e-01 ESR1 2099 6q25.1
211627_x_at 0.3835679 0.05999355 7.015371e-01 ESR1 2099 6q25.1
csv
file (hint: use write.csv
)
p.adjust
to produce a vector of p-values that are adjusted. Add this as an extra column to your table of results and write as a file#FILL IN SOLUTIONS HERE