The Unix Shell

To go back to the home page

Section 3.1:- Transferring Files and Accessing Remote Server

Learning Objectives

  • Use wget to download files from the internet.
  • Use ssh to connect to remote servers.
  • Use scp to copy files and directories onto a remote server.

Downloading Files

The wget utility is the best option to download files from the internet. It can pretty much handle all complex download situations including large file downloads, recursive downloads, non-interactive downloads, multiple file downloads etc. It retrieves files from World Wide Web (WWW) using widely used protocols like HTTP, HTTPS and FTP, and is designed in such way so that it works in slow or unstable network connections. Wget can automatically re-start a download where it was left off in case of network problem. Also downloads file recursively and will keep trying until file has be retrieved completely.

The command wget will download a single file and stores it in the current directory. It shows download progress, size, date and time while downloading.

$ wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/ccdsGene.txt.gz
--2015-04-15 09:25:38--  http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/ccdsGene.txt.gz
Resolving hgdownload.soe.ucsc.edu (hgdownload.soe.ucsc.edu)... 128.114.119.163
Connecting to hgdownload.soe.ucsc.edu (hgdownload.soe.ucsc.edu)|128.114.119.163|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2124490 (2.0M) [application/x-gzip]
Saving to: 'ccdsGene.txt.gz'

100%[==============================================================================>] 2,124,490    707KB/s   in 2.9s

2015-04-15 09:25:41 (707 KB/s) - 'ccdsGene.txt.gz' saved [2124490/2124490]

The downloaded file is compressed to save space and contains the Consensus Coding DNA Sequence (CCDS) Genes for Human (GRCh38/hg38), so to view its contents you will have to use zcat.

$ zcat ccdsGene.txt.gz
585	CCDS30547.1	chr1	+	69090	70008	69090	70008	1	69090,	70008,	0		cmpl	cmpl	0,
588	CCDS72675.1	chr1	-	450739	451678	450739	451678	1	450739,	451678,	0		cmpl	cmpl	0,
590	CCDS41221.1	chr1	-	685715	686654	685715	686654	1	685715,	686654,	0		cmpl	cmpl	0,
...

Unless you specify otherwise, the file created in your current directory will have the same name as the file you are downloading. Using -O (uppercase letter 'O') option creates a file with a specified name. Here we have given mm10_geneid.txt.gz file name as show below.

$ wget -O mm10_ccdsGene.txt.gz http://hgdownload.soe.ucsc.edu/goldenPath/mm10/database/ccdsGene.txt.gz
--2015-04-15 09:41:49--  http://hgdownload.soe.ucsc.edu/goldenPath/mm10/database/ccdsGene.txt.gz
Resolving hgdownload.soe.ucsc.edu (hgdownload.soe.ucsc.edu)... 128.114.119.163
Connecting to hgdownload.soe.ucsc.edu (hgdownload.soe.ucsc.edu)|128.114.119.163|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1940163 (1.8M) [application/x-gzip]
Saving to: 'mm10_ccdsGene.txt.gz'

100%[==============================================================================>] 1,940,163    747KB/s   in 2.5s

2015-04-15 09:41:52 (747 KB/s) - 'mm10_ccdsGene.txt.gz' saved [1940163/1940163]

Once again, the downloaded file is compressed. It contains the Consensus Coding DNA Sequence (CCDS) Genes for Mouse (GRCm38/mm10), so to view its contents you will have to use zcat.

$ zcat mm10_ccdsGene.txt.gz
76	CCDS14803.1	chr1	-	3216021	3671348	3216021	3671348	3	3216021,3421701,3670551,	3216968,3421901,3671348,	0		cmpl	cmpl	1,2,0,
618	CCDS14804.1	chr1	-	4344599	4352825	4344599	4352825	3	4344599,4351909,4352201,	4350091,4352081,4352825,	0		cmpl	cmpl	1,0,0,
619	CCDS14805.1	chr1	-	4491715	4493406	4491715	4493406	2	4491715,4493099,	4492668,4493406,	0		cmpl	cmpl	1,0,
...

It is possible to continue an incomplete download using wget -c. This is very helpful when you have initiated a very big file download which got interrupted in the middle. Instead of starting the whole download again, you can start the download from where it got interrupted using the option -c.

Here is an example of downloading a file which got interrupted manually by using Ctrl-C command on the keyboard.

$ wget http://hgdownload.cse.ucsc.edu/goldenPath/eriEur2/database/geneid.txt.gz
--2015-04-17 09:40:23--  http://hgdownload.cse.ucsc.edu/goldenPath/eriEur2/database/geneid.txt.gz
Resolving hgdownload.cse.ucsc.edu (hgdownload.cse.ucsc.edu)... 128.114.119.163
Connecting to hgdownload.cse.ucsc.edu (hgdownload.cse.ucsc.edu)|128.114.119.163|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1955793 (1.9M) [application/x-gzip]
Saving to: 'geneid.txt.gz'

52% [========================================>                                      ] 1,022,856    551KB/s             ^C

To continue the download instead of restarting it, you can use the option -c, the download of the file will then restart where it did stop, as shown bellow.

$ wget -c http://hgdownload.cse.ucsc.edu/goldenPath/eriEur2/database/geneid.txt.gz
--2015-04-17 09:40:33--  http://hgdownload.cse.ucsc.edu/goldenPath/eriEur2/database/geneid.txt.gz
Resolving hgdownload.cse.ucsc.edu (hgdownload.cse.ucsc.edu)... 128.114.119.163
Connecting to hgdownload.cse.ucsc.edu (hgdownload.cse.ucsc.edu)|128.114.119.163|:80... connected.
HTTP request sent, awaiting response... 206 Partial Content
Length: 1955793 (1.9M), 819497 (800K) remaining [application/x-gzip]
Saving to: 'geneid.txt.gz'

100%[+++++++++++++++++++++++++++++++++++++++++++++=================================>] 1,955,793    521KB/s   in 1.5s

2015-04-17 09:40:35 (521 KB/s) - 'geneid.txt.gz' saved [1955793/1955793]

Here we see how to download multiple files using the FTP protocol and wget. It is the recommended method when downloading a large file or multiple files.

$ wget ftp://hgdownload.soe.ucsc.edu/goldenPath/mm10/database/ccds*.txt.gz
--2015-04-15 09:45:48--  ftp://hgdownload.soe.ucsc.edu/goldenPath/mm10/database/ccds*.txt.gz
           => '.listing'
Resolving hgdownload.soe.ucsc.edu (hgdownload.soe.ucsc.edu)... 128.114.119.163
Connecting to hgdownload.soe.ucsc.edu (hgdownload.soe.ucsc.edu)|128.114.119.163|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /goldenPath/mm10/database ... done.
==> PASV ... done.    ==> LIST ... done.

    [    <=>                                                                                                                                                                                            ] 95,192       132KB/s   in 0.7s

2015-04-15 09:45:51 (132 KB/s) - '.listing' saved [95192]

Removed '.listing'.
--2015-04-15 09:45:51--  ftp://hgdownload.soe.ucsc.edu/goldenPath/mm10/database/ccdsGene.txt.gz
           => 'ccdsGene.txt.gz.2'
==> CWD not required.
==> PASV ... done.    ==> RETR ccdsGene.txt.gz ... done.
Length: 1940163 (1.8M)

100%[==============================================================================>] 1,940,163    436KB/s   in 4.9s

2015-04-15 09:45:56 (388 KB/s) - 'ccdsGene.txt.gz.2' saved [1940163]

--2015-04-15 09:45:56--  ftp://hgdownload.soe.ucsc.edu/goldenPath/mm10/database/ccdsInfo.txt.gz
           => 'ccdsInfo.txt.gz'
==> CWD not required.
==> PASV ... done.    ==> RETR ccdsInfo.txt.gz ... done.
Length: 866080 (846K)

100%[==============================================================================>] 866,080      431KB/s   in 2.0s

2015-04-15 09:45:59 (431 KB/s) - 'ccdsInfo.txt.gz' saved [866080]

--2015-04-15 09:45:59--  ftp://hgdownload.soe.ucsc.edu/goldenPath/mm10/database/ccdsKgMap.txt.gz
           => 'ccdsKgMap.txt.gz'
==> CWD not required.
==> PASV ... done.    ==> RETR ccdsKgMap.txt.gz ... done.
Length: 449427 (439K)

100%[==============================================================================>] 449,427      322KB/s   in 1.4s

2015-04-15 09:46:01 (322 KB/s) - 'ccdsKgMap.txt.gz' saved [449427]

--2015-04-15 09:46:01--  ftp://hgdownload.soe.ucsc.edu/goldenPath/mm10/database/ccdsNotes.txt.gz
           => 'ccdsNotes.txt.gz'
==> CWD not required.
==> PASV ... done.    ==> RETR ccdsNotes.txt.gz ... done.
Length: 13011 (13K)

100%[==============================================================================>] 13,011      --.-K/s   in 0.03s

2015-04-15 09:46:02 (406 KB/s) - 'ccdsNotes.txt.gz' saved [13011]

ftp stands for File Transfer Protocol. A protocol is a set of rules that networked computers use to talk to one another. And FTP is the language that computers on a TCP/IP network (such as the internet) use to transfer files to and from each other.

We've just seen how to download multiple files from the UCSC ftp server at ftp://hgdownload.soe.ucsc.edu/ using wget.

An FTP address looks a lot like an HTTP, or Website, address except it uses the prefix ftp:// instead of http://. To make an FTP connection you can use a standard Web browser (Internet Explorer, Firefox, etc.) or a dedicated FTP software program (e.g. WinSCP on Windows or CyberDuck on Mac), referred to as an FTP ‘Client’, or use the command line tool ftp.

Accessing Remote Server

ssh (Secure Shell) is a network protocol that allows a secure access over an encrypted connection. Through an SSH connection you can easily manage your files and folders, modify their permissions, edit files directly on the server, configure and install your scripts, etc. ssh is used to securely login to a Linux / UNIX host running the sshd daemon on a reachable network.

We will be accessing a training server, which his IP address is 10.20.222.90, using the ssh command. Below is the command to use, but each of you will have to use a different account from training1 to training30 and enter the password that matches the account name.

$ ssh training1@10.20.222.90
training1@10.20.222.90's password:
Last login: Wed Jun 15 15:36:34 2016 from 10.21.1.103
-bash-4.1$ 

At this point, you are physically on the training server inside your home directory. The directory structure is now completely different to the one you had before and you can no-longer access the files within nelle's file system. However, you can still use the same commands you have seen already. If you now type these commands the outputs will now be specific to you.

$ whoami
training1
$ pwd
/home/training1
$ ls

To return to our own computer as user nelle, you will have to type the exit command.

$ exit
 logout
Connection to 10.20.222.90 closed.

Copying Files and Directories

ssh protocol can also be used to copy files & directories, using the same connection method as above but the command we use is scp.

This will copy the file notes.txt from its current location within nelle's file system to your home directory on training. Again, remember to replace training1 with your own username.

$ scp notes.txt training1@10.20.222.90:.

Do not forget the trailing characters :.

 training1@10.20.222.90's password:
notes.txt                                                                                 100%   86     0.1KB/s   00:00

Similarly to the wget command, it shows upload progress, size, speed and time while uploading. It displays a 100% when the file is copied.

Let's now connect to the training server to check that the file has really been copied.

$ ssh training1@10.20.222.90
 training1@10.20.222.90's password:
$ ls n*
notes.txt
$ cat notes.txt
- finish experiments
- write thesis
- get post-doc position (pref. with Dr. Horrible)
$ exit

Now we are going to copy an entire directory into your home directory on training. Here is the command to copy across a directory, you will need to use the option -r of the scp command. To know more about a command, you can always type man scp.

$ scp -r molecules/ training1@10.20.222.90:.
 training1@10.20.222.90's password:
pentane.pdb                                      100% 1226     1.2KB/s   00:00
methane.pdb                                      100%  422     0.4KB/s   00:00
ethane.pdb                                       100%  622     0.6KB/s   00:00
propane.pdb                                      100%  825     0.8KB/s   00:00
cubane.pdb                                       100% 1158     1.1KB/s   00:00
octane.pdb                                       100% 1828     1.8KB/s   00:00

These files are pretty small, so they have been copied very quickly. Let's now check that they are actually on our remote server training.

$ ssh training1@10.20.222.90
training1@10.20.222.90's password:
$ ls molecules/
cubane.pdb  ethane.pdb  lenghts.txt  methane.pdb  octane.pdb  pentane.pdb  propane.pdb  sorted-lenghts.txt
$ exit

The scp command can be used in three* ways: to copy from your computer to a (remote) server as we've just done, to copy from a (remote) server to your computer, and to copy from a (remote) server to another (remote) server. In the third case, the data is transferred directly between the servers; your own computer will only tell the servers what to do.

*: Actually you can also use it just like the normal cp command, without any ssh connections in it, but that’s quite useless. It requires you to type an extra ‘s’!

Exploring more wget Arguments

Describe in words what the following command does knowing that the file download-file-list.txt contains this list of urls

$ cat download-file-list.txt
http://hgdownload.cse.ucsc.edu/goldenPath/felCat5/database/geneid.txt.gz
http://hgdownload.cse.ucsc.edu/goldenPath/canFam3/database/geneid.txt.gz
http://hgdownload.cse.ucsc.edu/goldenPath/oryCun2/database/geneid.txt.gz
$ wget -i download-file-list.txt

Listing Directories and Files on Remote Host

Explain which commands will you use to explore the directory /home on the training server.

Exploring more scp Command

Explain how to copy the remote file notes.txt on training server from your home directory into a new directory called remote/ on your local machine.

Next topic

Loops