Docker is an open platform for developers to build and ship applications, whether on laptops, servers in a data center, or the cloud.
Once you have installed Docker using the instructions above, you can open a terminal (Mac) or command prompt (Windows) and run the following to download a container image for the Ubuntu operating system:
docker pull ubuntu
To run any software inside the container we can do the following, which prints Hello World to the terminal and then exits:
docker run ubuntu echo "Hello World"
This runs the echo command within the container's operating system. To use the container in interactive mode we have to specify the -it argument, which basically means that the container doesn't exit straight away, but instead runs bash to give us a terminal prompt.
N.B. the --rm argument means that the container is deleted on exit; otherwise your disk could get clogged up with lots of exited containers.
docker run -it --rm ubuntu
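Once inside, you can try a couple of commands before leaving; for example (any shell commands would do here):
cat /etc/os-release
exit
The first command confirms we are running inside an Ubuntu image; exit returns you to your own machine, and --rm then removes the stopped container.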
The Sanger CGP Docker container packages a complete environment including all the tools needed to run the CGP analysis pipeline, such as ASCAT and CaVEMan.
The container being used is called quay.io/wtsicgp/dockstore-cgpwgs:1.0.8. There are some minimal instructions on its Dockstore page.
Once inside the container we can run any tools that have been installed. However, what is not included is data: by default the file system is completely isolated from our own machine. In order to analyse our own data we mount directories from our own OS so that they are available within the Docker environment. The -v argument to docker run allows chosen directories on your machine to be visible inside the container.
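As a simple illustration (the host directory here is hypothetical), the following mounts /home/user/mydata on your machine as /data inside the container and lists its contents:
docker run --rm -it -v /home/user/mydata:/data ubuntu ls /data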
On Day 3, we ran CaVEMan using the Sanger Docker shortcut. What this shortcut was actually doing is shown below:
docker run \
--rm \
-it --entrypoint=bash \
-v /data:/data \
-v /reference_data:/reference_data \
-v /home/participant/Course_Materials/Day3:/caveman_practical \
quay.io/wtsicgp/dockstore-cgpwgs:1.0.8
However, the intended usage for this container is to run the entire pipeline on a matched tumour and normal sample. The tools mentioned above are run in an automated fashion as a workflow, with, for example, the output of ascat being passed automatically to caveman.
To run the pipeline, we need another piece of software called cwltool, and a configuration file that defines the locations of the input tumour and normal files, the reference data, and where the output is to be stored. Installing cwltool itself requires Python; some notes on this procedure are given below.
To ensure that the pipeline runs in a reasonable time, we have provided a downsampled version of the HCC1143 cell line and its matched normal in the directory /data/downsampled. We have also downloaded the reference files required by the pipeline to /reference_data (the URLs for these files can be found in the script /home/participant/Course_Materials/download-ref-data.sh).
This time, open a Terminal using the icon on the left side of the Desktop and check that the data and reference files are in place:
ls /data/downsampled
ls /reference_data
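If you are curious where the reference files came from, you can also print the download script mentioned above:
cat /home/participant/Course_Materials/download-ref-data.sh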
The config file to run the pipeline can be found in the directory /home/participant/Course_Materials/pipeline_test/
cd /home/participant/Course_Materials/pipeline_test
cat cgpwgs.json
It should look something like this:-
{
"normal": {
"path": "/data/downsampled/HCC1143_BL.bam",
"class": "File"
},
"species": "human",
"tumour": {
"path": "/data/downsampled/HCC1143.bam",
"class": "File"
},
"subcl": {
"path": "/reference_data/SUBCL_ref_GRCh37d5.tar.gz",
"class": "File"
},
"cnv_sv": {
"path": "/reference_data/CNV_SV_ref_GRCh37d5.tar.gz",
"class": "File"
},
"cavereads": 350000,
"annot": {
"path": "/reference_data/VAGrENT_ref_GRCh37d5_ensembl_75.tar.gz",
"class": "File"
},
"skipbb": false,
"reference": {
"path": "/reference_data/core_ref_GRCh37d5.tar.gz",
"class": "File"
},
"snv_indel": {
"path": "/reference_data/SNV_INDEL_ref_GRCh37d5.tar.gz",
"class": "File"
},
"assembly": "GRCh37d5",
"timings": {
"path": "/home/participant/pipeline_test/timings_WGS.tar.gz",
"class": "File"
},
"exclude": "NC_007605,hs37d5,GL%",
"run_params": {
"path": "/home/participant/pipeline_test/params.txt",
"class": "File"
},
"global_time": {
"path": "/home/participant/pipeline_test/global_WGS.time",
"class": "File"
},
"result_archive": {
"path": "/home/participant/pipeline_test/result_WGS.tar.gz",
"class": "File"
},
"sv_cyto": {
"path": "/reference_data/cytoband_GRCh37d5.txt",
"class": "File"
}
}
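If you later edit this file to point at your own samples, it is worth checking that it is still valid JSON. One quick way (assuming Python is available) is Python's built-in json.tool module, which prints the file if it parses and reports an error otherwise:
python -m json.tool cgpwgs.json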
The command to run the pipeline is taken from the Dockstore page for the tool.
cat cgpwgs.sh
mkdir -p tmp
mkdir -p output
cwltool \
--leave-container \
--leave-tmpdir \
--copy-outputs \
--tmpdir-prefix /home/participant/pipeline_test/tmp/ \
--tmp-outdir-prefix /home/participant/pipeline_test/output/ \
--non-strict \
https://www.dockstore.org:8443/api/ga4gh/v1/tools/quay.io%2Fwtsicgp%2Fdockstore-cgpwgs/versions/1.0.8/plain-CWL/descriptor \
cgpwgs.json
You can run this script as below. If all goes well, we should have some results to look at in the morning :tada:
./cgpwgs.sh
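As the results are not expected until the morning, you may prefer to start the script in the background and capture its output to a log file; this is a generic shell pattern rather than anything specific to this pipeline:
nohup ./cgpwgs.sh > cgpwgs.log 2>&1 &
tail -f cgpwgs.log   # Ctrl-C stops following the log, not the pipeline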
To run the pipeline on your own data you will need to install cwltool via pip, modify the json file to include the paths to your tumour and normal files, and then use the ./cgpwgs.sh command to run the pipeline.
Install Python:
brew install python
Install pip, the Python package manager system, by running get-pip.py (downloaded from https://bootstrap.pypa.io/get-pip.py) with Python:
python get-pip.py
Now use pip to install cwltool:
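pip install cwltool
You can verify the installation with:
cwltool --version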
We have collected several useful NGS tools into one environment for the purposes of this course.
To download the CRUK Docker image onto your computer, run the following. It should tell you that it is downloading various layers and show some progress bars:
docker pull markdunning/cancer-genome-toolkit
docker run \
--rm \
-it --entrypoint=bash \
-v /data:/data \
-v /reference_data:/reference_data \
-v /home/participant/Course_Materials:/home/participant/Course_Materials \
-v /home/participant/software:/home/participant/Course_Materials/software \
markdunning/cancer-genome-toolkit
The machines used in Genetics for the course already happen to have a /data/ and a /reference_data/ directory. However, when you run the course on your own machine, you will need to give the paths to where you have stored the data from the course. If you wish to run annovar as described in the class, you will need to download it and extract it into a folder on your laptop, /PATH/TO/YOUR/software.
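For example, assuming the downloaded archive is called annovar.latest.tar.gz (the exact filename may differ), you could extract it with:
mkdir -p /PATH/TO/YOUR/software
tar -xzf annovar.latest.tar.gz -C /PATH/TO/YOUR/software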
docker run \
--rm \
-it --entrypoint=bash \
-v /PATH/TO/YOUR/data:/data \
-v /PATH/TO/YOUR/reference_data:/reference_data \
-v /PATH/TO/YOUR/Course_Materials:/home/participant/Course_Materials \
-v /PATH/TO/YOUR/software:/home/participant/Course_Materials/software \
markdunning/cancer-genome-toolkit
As we have provided a hard drive with the data, the command would look something like this on a Mac:-
docker run \
--rm \
-it --entrypoint=bash \
-v /Volumes/CRUKSummer/data:/data \
-v /Volumes/CRUKSummer/reference_data:/reference_data \
-v /Volumes/CRUKSummer/Course_Materials:/home/participant/Course_Materials \
-v /Volumes/CRUKSummer/software:/home/participant/Course_Materials/software \
markdunning/cancer-genome-toolkit
On Windows, it will depend on which drive letter the external hard drive appears as. For example, if the D drive is used:-
docker run \
--rm \
-it --entrypoint=bash \
-v D:/CRUKSummer/YOUR_USERNAME/data:/data \
-v D:/CRUKSummer/YOUR_USERNAME/reference_data:/reference_data \
-v D:/CRUKSummer/YOUR_USERNAME/Course_Materials:/home/participant/Course_Materials \
-v D:/CRUKSummer/YOUR_USERNAME/software:/home/participant/Course_Materials/software \
markdunning/cancer-genome-toolkit
You don't have to use the container only with the data shipped with the course. If you have other files that you want to play with, you just need to change the -v arguments accordingly.
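For example, to make a different directory of files visible as /data inside the container (the host path here is hypothetical):
docker run \
--rm \
-it --entrypoint=bash \
-v /PATH/TO/OTHER/files:/data \
markdunning/cancer-genome-toolkit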