wget to download files from the internet.
ssh to connect to remote servers.
scp to copy files and directories onto a remote server.

The wget utility is a robust option for downloading files from the internet. It handles most
complex download situations, including large file downloads, recursive downloads,
non-interactive downloads and multiple file downloads.
It retrieves files from the World Wide Web (WWW) using widely used protocols such as HTTP, HTTPS and FTP, and is
designed to work over slow or unstable network connections. If a network problem interrupts a
download, wget can automatically resume it where it left off.
It can also download files recursively, and it will keep trying until the file has been retrieved completely.
The command wget downloads a single file and stores it in the current directory.
It shows the download progress, size, date and time while downloading.
$ wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/ccdsGene.txt.gz
--2015-04-15 09:25:38-- http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/ccdsGene.txt.gz
Resolving hgdownload.soe.ucsc.edu (hgdownload.soe.ucsc.edu)... 128.114.119.163
Connecting to hgdownload.soe.ucsc.edu (hgdownload.soe.ucsc.edu)|128.114.119.163|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2124490 (2.0M) [application/x-gzip]
Saving to: 'ccdsGene.txt.gz'
100%[==============================================================================>] 2,124,490 707KB/s in 2.9s
2015-04-15 09:25:41 (707 KB/s) - 'ccdsGene.txt.gz' saved [2124490/2124490]
The downloaded file is compressed to save space and contains the Consensus Coding DNA Sequence (CCDS)
genes for human (GRCh38/hg38), so to view its contents you will have to use zcat.
$ zcat ccdsGene.txt.gz
585 CCDS30547.1 chr1 + 69090 70008 69090 70008 1 69090, 70008, 0 cmpl cmpl 0,
588 CCDS72675.1 chr1 - 450739 451678 450739 451678 1 450739, 451678, 0 cmpl cmpl 0,
590 CCDS41221.1 chr1 - 685715 686654 685715 686654 1 685715, 686654, 0 cmpl cmpl 0,
...
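The decompressed file is long, so it often helps to look at only the first few lines. The following is a minimal sketch of that idea. It assumes a small stand-in file, example.txt.gz (a hypothetical name, created here so that the block runs without a download), and it uses gzip -dc, which behaves like zcat but is also available on systems where zcat handles .gz files differently.

```shell
# Build a small gzip-compressed stand-in file (hypothetical name and content).
workdir=$(mktemp -d)
printf 'line1\nline2\nline3\nline4\nline5\n' | gzip -c > "$workdir/example.txt.gz"

# Decompress to standard output and keep only the first three lines.
first_lines=$(gzip -dc "$workdir/example.txt.gz" | head -n 3)
echo "$first_lines"

# Count the lines without writing an uncompressed copy to disk.
line_count=$(gzip -dc "$workdir/example.txt.gz" | wc -l | tr -d ' ')
echo "total lines: $line_count"
```

With the real download, the same pattern would be zcat ccdsGene.txt.gz | head -n 3.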
Unless you specify otherwise, the file created in your current directory will have the same name as the
file you are downloading. The -O (uppercase letter 'O') option creates a file with a specified
name instead. Here we have given the file the name mm10_ccdsGene.txt.gz, as shown below.
$ wget -O mm10_ccdsGene.txt.gz http://hgdownload.soe.ucsc.edu/goldenPath/mm10/database/ccdsGene.txt.gz
--2015-04-15 09:41:49-- http://hgdownload.soe.ucsc.edu/goldenPath/mm10/database/ccdsGene.txt.gz
Resolving hgdownload.soe.ucsc.edu (hgdownload.soe.ucsc.edu)... 128.114.119.163
Connecting to hgdownload.soe.ucsc.edu (hgdownload.soe.ucsc.edu)|128.114.119.163|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1940163 (1.8M) [application/x-gzip]
Saving to: 'mm10_ccdsGene.txt.gz'
100%[==============================================================================>] 1,940,163 747KB/s in 2.5s
2015-04-15 09:41:52 (747 KB/s) - 'mm10_ccdsGene.txt.gz' saved [1940163/1940163]
Once again, the downloaded file is compressed. It contains the Consensus Coding DNA Sequence (CCDS)
genes for mouse (GRCm38/mm10), so to view its contents you will have to use zcat.
$ zcat mm10_ccdsGene.txt.gz
76 CCDS14803.1 chr1 - 3216021 3671348 3216021 3671348 3 3216021,3421701,3670551, 3216968,3421901,3671348, 0 cmpl cmpl 1,2,0,
618 CCDS14804.1 chr1 - 4344599 4352825 4344599 4352825 3 4344599,4351909,4352201, 4350091,4352081,4352825, 0 cmpl cmpl 1,0,0,
619 CCDS14805.1 chr1 - 4491715 4493406 4491715 4493406 2 4491715,4493099, 4492668,4493406, 0 cmpl cmpl 1,0,
...
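The human and mouse files are both called ccdsGene.txt.gz on the server, which is why a custom name is useful here. The sketch below shows one way the output name could be derived from the URL itself. The variable names, and the convention of prefixing the assembly name, are assumptions made for illustration. The block only prints the wget command rather than running it.

```shell
url=http://hgdownload.soe.ucsc.edu/goldenPath/mm10/database/ccdsGene.txt.gz

# Default name wget would use: the last component of the URL.
default_name=$(basename "$url")

# Assembly name: the directory two levels above the file (mm10 here).
assembly=$(basename "$(dirname "$(dirname "$url")")")

# Prefix the assembly so files from different genomes do not collide.
custom_name="${assembly}_${default_name}"

# Print the command that would perform the download.
echo wget -O "$custom_name" "$url"
```

Without -O, wget does not overwrite an existing file of the same name; it appends a numeric suffix to the new file instead, which is easy to lose track of.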
It is possible to continue an incomplete download using wget -c.
This is very helpful when a large file download gets interrupted partway through. Instead of
starting the whole download again, you can resume it from where it was interrupted using the -c option.
Here is an example of a download that was interrupted manually by pressing Ctrl-C on the keyboard.
$ wget http://hgdownload.cse.ucsc.edu/goldenPath/eriEur2/database/geneid.txt.gz
--2015-04-17 09:40:23-- http://hgdownload.cse.ucsc.edu/goldenPath/eriEur2/database/geneid.txt.gz
Resolving hgdownload.cse.ucsc.edu (hgdownload.cse.ucsc.edu)... 128.114.119.163
Connecting to hgdownload.cse.ucsc.edu (hgdownload.cse.ucsc.edu)|128.114.119.163|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1955793 (1.9M) [application/x-gzip]
Saving to: 'geneid.txt.gz'
52% [========================================> ] 1,022,856 551KB/s ^C
To continue the download instead of restarting it, use the -c option: the download
of the file then resumes where it stopped, as shown below.
$ wget -c http://hgdownload.cse.ucsc.edu/goldenPath/eriEur2/database/geneid.txt.gz
--2015-04-17 09:40:33-- http://hgdownload.cse.ucsc.edu/goldenPath/eriEur2/database/geneid.txt.gz
Resolving hgdownload.cse.ucsc.edu (hgdownload.cse.ucsc.edu)... 128.114.119.163
Connecting to hgdownload.cse.ucsc.edu (hgdownload.cse.ucsc.edu)|128.114.119.163|:80... connected.
HTTP request sent, awaiting response... 206 Partial Content
Length: 1955793 (1.9M), 819497 (800K) remaining [application/x-gzip]
Saving to: 'geneid.txt.gz'
100%[+++++++++++++++++++++++++++++++++++++++++++++=================================>] 1,955,793 521KB/s in 1.5s
2015-04-17 09:40:35 (521 KB/s) - 'geneid.txt.gz' saved [1955793/1955793]
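wget already retries on its own when the network misbehaves, but it can still give up and exit with an error. For that case, a small wrapper can call wget -c repeatedly so that each attempt resumes the partial file. This is only a sketch: download_with_resume and RETRY_DELAY are hypothetical names, not part of wget, and the default of five attempts is an arbitrary assumption.

```shell
# Retry a download until it succeeds or the attempts run out.
# download_with_resume is a hypothetical helper, not a wget feature.
download_with_resume() {
    url=$1
    max_attempts=${2:-5}
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        # -c continues a partially downloaded file instead of starting over.
        if wget -c "$url"; then
            return 0
        fi
        echo "attempt $attempt of $max_attempts failed" >&2
        attempt=$((attempt + 1))
        # Pause before the next attempt (RETRY_DELAY seconds, default 5).
        if [ "$attempt" -le "$max_attempts" ]; then
            sleep "${RETRY_DELAY:-5}"
        fi
    done
    return 1
}
```

It could be called as download_with_resume http://hgdownload.cse.ucsc.edu/goldenPath/eriEur2/database/geneid.txt.gz 3 to allow up to three attempts.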
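Here we see how to download multiple files using the FTP protocol and wget.
It is the recommended method when downloading a large file or multiple files.
The * wildcard in the URL below matches every file in the directory whose name starts with ccds and ends with .txt.gz.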
$ wget ftp://hgdownload.soe.ucsc.edu/goldenPath/mm10/database/ccds*.txt.gz
--2015-04-15 09:45:48-- ftp://hgdownload.soe.ucsc.edu/goldenPath/mm10/database/ccds*.txt.gz
=> '.listing'
Resolving hgdownload.soe.ucsc.edu (hgdownload.soe.ucsc.edu)... 128.114.119.163
Connecting to hgdownload.soe.ucsc.edu (hgdownload.soe.ucsc.edu)|128.114.119.163|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done. ==> PWD ... done.
==> TYPE I ... done. ==> CWD (1) /goldenPath/mm10/database ... done.
==> PASV ... done. ==> LIST ... done.
[ <=> ] 95,192 132KB/s in 0.7s
2015-04-15 09:45:51 (132 KB/s) - '.listing' saved [95192]
Removed '.listing'.
--2015-04-15 09:45:51-- ftp://hgdownload.soe.ucsc.edu/goldenPath/mm10/database/ccdsGene.txt.gz
=> 'ccdsGene.txt.gz.2'
==> CWD not required.
==> PASV ... done. ==> RETR ccdsGene.txt.gz ... done.
Length: 1940163 (1.8M)
100%[==============================================================================>] 1,940,163 436KB/s in 4.9s
2015-04-15 09:45:56 (388 KB/s) - 'ccdsGene.txt.gz.2' saved [1940163]
--2015-04-15 09:45:56-- ftp://hgdownload.soe.ucsc.edu/goldenPath/mm10/database/ccdsInfo.txt.gz
=> 'ccdsInfo.txt.gz'
==> CWD not required.
==> PASV ... done. ==> RETR ccdsInfo.txt.gz ... done.
Length: 866080 (846K)
100%[==============================================================================>] 866,080 431KB/s in 2.0s
2015-04-15 09:45:59 (431 KB/s) - 'ccdsInfo.txt.gz' saved [866080]
--2015-04-15 09:45:59-- ftp://hgdownload.soe.ucsc.edu/goldenPath/mm10/database/ccdsKgMap.txt.gz
=> 'ccdsKgMap.txt.gz'
==> CWD not required.
==> PASV ... done. ==> RETR ccdsKgMap.txt.gz ... done.
Length: 449427 (439K)
100%[==============================================================================>] 449,427 322KB/s in 1.4s
2015-04-15 09:46:01 (322 KB/s) - 'ccdsKgMap.txt.gz' saved [449427]
--2015-04-15 09:46:01-- ftp://hgdownload.soe.ucsc.edu/goldenPath/mm10/database/ccdsNotes.txt.gz
=> 'ccdsNotes.txt.gz'
==> CWD not required.
==> PASV ... done. ==> RETR ccdsNotes.txt.gz ... done.
Length: 13011 (13K)
100%[==============================================================================>] 13,011 --.-K/s in 0.03s
2015-04-15 09:46:02 (406 KB/s) - 'ccdsNotes.txt.gz' saved [13011]
ftp stands for File Transfer Protocol. A protocol is a set of rules that networked
computers use to talk to one another, and FTP is the language that computers on a TCP/IP network
(such as the internet) use to transfer files to and from each other.
We've just seen how to download multiple files from the UCSC FTP server at
ftp://hgdownload.soe.ucsc.edu/ using wget.
An FTP address looks a lot like an HTTP, or website, address except that it uses the prefix ftp:// instead of http://.
To make an FTP connection you can use a standard web browser (Internet Explorer, Firefox, etc.),
a dedicated FTP software program referred to as an FTP 'client' (e.g. WinSCP on Windows or Cyberduck on Mac), or the command line tool ftp.
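To make the structure of such an address concrete, the sketch below splits one of the URLs used earlier into its prefix, server name and file path using shell parameter expansion, then rebuilds the HTTP form of the same address. The variable names are chosen for illustration only.

```shell
url=ftp://hgdownload.soe.ucsc.edu/goldenPath/mm10/database/ccdsGene.txt.gz

scheme=${url%%://*}      # text before "://"            (ftp)
rest=${url#*://}         # everything after "://"
host=${rest%%/*}         # up to the first "/"          (the server name)
file_path=/${rest#*/}    # from the first "/" onwards   (path on the server)

# The same file addressed over HTTP: only the prefix changes.
http_url="http://${host}${file_path}"

printf 'scheme: %s\nhost:   %s\npath:   %s\nhttp:   %s\n' \
    "$scheme" "$host" "$file_path" "$http_url"
```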
ssh (Secure Shell) is a network protocol that allows secure access to a remote computer over an encrypted connection.
Through an SSH connection you can easily manage your files and folders, modify their permissions, edit files directly on the server,
configure and install your scripts, etc. ssh is used to log in securely to a Linux / UNIX host running the sshd daemon on a reachable network.
We will be accessing a training server, whose IP address is 10.20.222.90, using the ssh command.
Below is the command to use, but each of you will have to use a different account, from training1 to training30,
and enter the password that matches the account name.
$ ssh training1@10.20.222.90
training1@10.20.222.90's password:
Last login: Wed Jun 15 15:36:34 2016 from 10.21.1.103
-bash-4.1$
At this point, you are logged in to the training server, inside your home directory. The directory structure is now completely different from the one you had before, and you can no longer access the files within nelle's file system. However, you can still use the same commands you have seen already. If you type these commands now, the outputs will be specific to you.
$ whoami
training1
$ pwd
/home/training1
$ ls
To return to our own computer as user nelle, you will have to type the exit command.
$ exit
logout
Connection to 10.20.222.90 closed.
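Since every participant uses a different account on the same server, the login target differs only in its number. As a small sketch, the hypothetical helper below (training_target is not a standard command) builds the user@host string for an account number and rejects anything outside the 1 to 30 range given above.

```shell
# Address of the training server, taken from the text above.
TRAINING_HOST=10.20.222.90

# Print the user@host target for a given account number (1 to 30).
# training_target is a hypothetical helper used for illustration.
training_target() {
    n=$1
    case $n in
        ''|*[!0-9]*)
            echo "account number must be a whole number" >&2
            return 1 ;;
    esac
    if [ "$n" -lt 1 ] || [ "$n" -gt 30 ]; then
        echo "account number must be between 1 and 30" >&2
        return 1
    fi
    echo "training${n}@${TRAINING_HOST}"
}

training_target 7
```

The result can be passed straight to ssh, for example ssh "$(training_target 7)". Adding a quoted command at the end, as in ssh "$(training_target 7)" 'whoami; pwd', runs those commands on the server and returns to your own computer without opening an interactive session.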
The ssh protocol can also be used to copy files and directories, using
the same connection method as above, but the command we use is scp.
The command below copies the file notes.txt from its current location within nelle's file system to your home directory
on training. Again, remember to replace training1 with your own username.
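Do not forget the trailing characters :. (a colon followed by a dot); they tell scp to place the file in your home directory on the remote server.
$ scp notes.txt training1@10.20.222.90:.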
training1@10.20.222.90's password:
notes.txt 100% 86 0.1KB/s 00:00
Similarly to the wget command, scp shows the upload progress, size, speed and time while uploading.
It displays 100% when the file is copied.
Let's now connect to the training server to check that the file has really been copied.
$ ssh training1@10.20.222.90
training1@10.20.222.90's password:
$ ls n*
notes.txt
$ cat notes.txt
- finish experiments
- write thesis
- get post-doc position (pref. with Dr. Horrible)
$ exit
Now we are going to copy an entire directory into your home directory on training.
To copy a directory, you will need to use the -r option of the scp command.
To find out more about a command, you can always consult its manual page, for example by typing man scp.
$ scp -r molecules/ training1@10.20.222.90:.
training1@10.20.222.90's password:
pentane.pdb 100% 1226 1.2KB/s 00:00
methane.pdb 100% 422 0.4KB/s 00:00
ethane.pdb 100% 622 0.6KB/s 00:00
propane.pdb 100% 825 0.8KB/s 00:00
cubane.pdb 100% 1158 1.1KB/s 00:00
octane.pdb 100% 1828 1.8KB/s 00:00
These files are pretty small, so they have been copied very quickly. Let's now check that they are actually on our remote server training.
$ ssh training1@10.20.222.90
training1@10.20.222.90's password:
$ ls molecules/
cubane.pdb ethane.pdb lenghts.txt methane.pdb octane.pdb pentane.pdb propane.pdb sorted-lenghts.txt
$ exit
The scp command can be used in three* ways: to copy from your computer to a (remote) server as we've just done, to copy from a (remote) server to your computer, and to copy from a (remote) server to another (remote) server. In the third case, the data is transferred directly between the servers; your own computer will only tell the servers what to do.
*: Actually you can also use it just like the normal cp command, without any ssh connections in it, but that’s quite useless. It requires you to type an extra ‘s’!
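The three forms differ only in which side of the command carries a user@host: prefix. The sketch below is a dry run: it prints each command instead of executing it, so it needs no network connection. The second server, server2.example.org, is a hypothetical name introduced purely for illustration.

```shell
# Dry run: each command is printed with echo rather than executed.
# Remove the echo to run a command for real.
remote=training1@10.20.222.90            # account used in the text above
other=training1@server2.example.org      # hypothetical second server

# 1. Local file to the home directory on the remote server.
to_remote=$(echo scp notes.txt "${remote}:.")

# 2. Remote file to the current directory on your computer.
from_remote=$(echo scp "${remote}:molecules/cubane.pdb" .)

# 3. Remote file on one server to the home directory on another server.
between=$(echo scp "${remote}:notes.txt" "${other}:.")

printf '%s\n' "$to_remote" "$from_remote" "$between"
```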
wget Arguments
Describe in words what the following command does, knowing that the file download-file-list.txt contains this list of URLs:
$ cat download-file-list.txt
http://hgdownload.cse.ucsc.edu/goldenPath/felCat5/database/geneid.txt.gz
http://hgdownload.cse.ucsc.edu/goldenPath/canFam3/database/geneid.txt.gz
http://hgdownload.cse.ucsc.edu/goldenPath/oryCun2/database/geneid.txt.gz
$ wget -i download-file-list.txt
Explain which commands you would use to explore the directory /home on the training server.
scp Command
Explain how to copy the remote file notes.txt from your home directory on the training server into a new directory called remote/ on your local machine.