Introduction to the Command Line for Genomics: Bonus materials

1.0 Using the Carpentries AWS images - from lesson 01
2.0 Automation using loops and basename- from lesson 04
3.0 File manipulation and pipes - from lesson 04
4.0 Moving files to/from the cloud - from lesson 05.
next_item_here

Using the Carpentries AWS images

You can log-in to the remote server using the instructions here. Your instructor will supply the ip_address and password that you need to login.

Each of you will have a different ip_address. This will prevent us from accidentally changing each other’s files as we work through the exercises. The password will be the same for everyone.

Writing for loops

Loops are key to productivity improvements through automation as they allow us to execute commands repeatedly. Similar to wildcards and tab completion, using loops also reduces the amount of typing (and typing mistakes). Loops are helpful when performing operations on groups of sequencing files, such as unzipping or trimming multiple files. We will use loops for these purposes in subsequent analyses, but will cover the basics of them for now.

When the shell sees the keyword for, it knows to repeat a command (or group of commands) once for each item in a list. Each time the loop runs (called an iteration), an item in the list is assigned in sequence to the variable, and the commands inside the loop are executed, before moving on to the next item in the list. Inside the loop, we call for the variable’s value by putting $ in front of it. The $ tells the shell interpreter to treat the variable as a variable name and substitute its value in its place, rather than treat it as text or an external command. In shell programming, this is usually called “expanding” the variable.

Sometimes, we want to expand a variable without any whitespace to its right. Suppose we have a variable named foo that contains the text abc, and would like to expand foo to create the text abcEFG.

$ foo=abc
$ echo foo is $foo
foo is abc
$ echo foo is $fooEFG      # doesn't work
foo is

The interpreter is trying to expand a variable named fooEFG, which (probably) doesn’t exist. We can avoid this problem by enclosing the variable name in braces ({ and }, sometimes called “squiggle braces”). bash treats the # character as a comment character. Any text on a line after a # is ignored by bash when evaluating the text as code.

$ foo=abc
$ echo foo is $foo
foo is abc
$ echo foo is ${foo}EFG      # now it works!
foo is abcEFG

Let’s write a for loop to show us the first two lines of the fastq files we downloaded earlier. You will notice the shell prompt changes from $ to > and back again as we were typing in our loop. The second prompt, >, is different to remind us that we haven’t finished typing a complete command yet. A semicolon, ;, can be used to separate two commands written on a single line.

$ cd ../untrimmed_fastq/
$ for filename in *.fastq
> do
> head -n 2 ${filename}
> done

The for loop begins with the formula for <variable> in <group to iterate over>. In this case, the word filename is designated as the variable to be used over each iteration. In our case SRR097977.fastq and SRR098026.fastq will be substituted for filename because they fit the pattern of ending with .fastq in the directory we’ve specified. The next line of the for loop is do. The next line is the code that we want to execute. We are telling the loop to print the first two lines of each variable we iterate over. Finally, the word done ends the loop.

After executing the loop, you should see the first two lines of both fastq files printed to the terminal. Let’s create a loop that will save this information to a file.

$ for filename in *.fastq
> do
> head -n 2 ${filename} >> seq_info.txt
> done

When writing a loop, you will not be able to return to previous lines once you have pressed Enter. Remember that we can cancel the current command using

If you notice a mistake that is going to prevent your loop for executing correctly.

Note that we are using >> to append the text to our seq_info.txt file. If we used >, the seq_info.txt file would be rewritten every time the loop iterates, so it would only have text from the last variable used. Instead, >> adds to the end of the file.

Using Basename in for loops

Basename is a function in UNIX that is helpful for removing a uniform part of a name from a list of files. In this case, we will use basename to remove the .fastq extension from the files that we’ve been working with.

$ basename SRR097977.fastq .fastq

We see that this returns just the SRR accession, and no longer has the .fastq file extension on it.

SRR097977

If we try the same thing but use .fasta as the file extension instead, nothing happens. This is because basename only works when it exactly matches a string in the file.

$ basename SRR097977.fastq .fasta
SRR097977.fastq

Basename is really powerful when used in a for loop. It allows to access just the file prefix, which you can use to name things. Let’s try this.

Inside our for loop, we create a new name variable. We call the basename function inside the parenthesis, then give our variable name from the for loop, in this case ${filename}, and finally state that .fastq should be removed from the file name. It’s important to note that we’re not changing the actual files, we’re creating a new variable called name. The line > echo $name will print to the terminal the variable name each time the for loop runs. Because we are iterating over two files, we expect to see two lines of output.

$ for filename in *.fastq
> do
> name=$(basename ${filename} .fastq)
> echo ${name}
> done

Exercise

Print the file prefix of all of the .txt files in our current directory.

Solution

$ for filename in *.txt
> do
> name=$(basename ${filename} .txt)
> echo ${name}
> done

One way this is really useful is to move files. Let’s rename all of our .txt files using mv so that they have the years on them, which will document when we created them.

$ for filename in *.txt
> do
> name=$(basename ${filename} .txt)
> mv ${filename}  ${name}_2019.txt
> done

Exercise

Remove _2019 from all of the .txt files.

Solution

$ for filename in *_2019.txt
> do
> name=$(basename ${filename} _2019.txt)
> mv ${filename} ${name}.txt
> done

File manipulation and more practice with pipes

Let’s use the tools we’ve added to our tool kit so far, along with a few new ones, to example our SRA metadata file. First, let’s navigate to the correct directory.

$ cd
$ cd shell_data/sra_metadata

This file contains a lot of information about the samples that we submitted for sequencing. We took a look at this file in an earlier lesson. Here we’re going to use the information in this file to answer some questions about our samples.

How many of the read libraries are paired end?

The samples that we submitted to the sequencing facility were a mix of single and paired end libraries. We know that we recorded information in our metadata table about which samples used which library preparation method, but we don’t remember exactly where this data is recorded. Let’s start by looking at our column headers to see which column might have this information. Our column headers are in the first row of our data table, so we can use head with a -n flag to look at just the first row of the file.

$ head -n 1 SraRunTable.txt
BioSample_s	InsertSize_l	LibraryLayout_s	Library_Name_s	LoadDate_s	MBases_l	MBytes_l	ReleaseDate_s Run_s SRA_Sample_s Sample_Name_s Assay_Type_s AssemblyName_s BioProject_s Center_Name_s Consent_s Organism_Platform_s SRA_Study_s g1k_analysis_group_s g1k_pop_code_s source_s strain_s

That is only the first line of our file, but because there are a lot of columns, the output likely wraps around your terminal window and appears as multiple lines. Once we figure out which column our data is in, we can use a command called cut to extract the column of interest.

Because this is pretty hard to read, we can look at just a few column header names at a time by combining the | redirect and cut.

$ head -n 1 SraRunTable.txt | cut -f1-4

cut takes a -f flag, which stands for “field”. This flag accepts a list of field numbers, in our case, column numbers. Here we are extracting the first four column names.

BioSample_s InsertSize_l      LibraryLayout_s	Library_Name_s    

The LibraryLayout_s column looks like it should have the information we want. Let’s look at some of the data from that column. We can use cut to extract only the 3rd column from the file and then use the | operator with head to look at just the first few lines of data in that column.

$ cut -f3 SraRunTable.txt | head -n 10
LibraryLayout_s
SINGLE
SINGLE
SINGLE
SINGLE
SINGLE
SINGLE
SINGLE
SINGLE
PAIRED

We can see that there are (at least) two categories, SINGLE and PAIRED. We want to search all entries in this column for just PAIRED and count the number of matches. For this, we will use the | operator twice to combine cut (to extract the column we want), grep (to find matches) and wc (to count matches).

$ cut -f3 SraRunTable.txt | grep PAIRED | wc -l
2

We can see from this that we have only two paired-end libraries in the samples we submitted for sequencing.

Exercise

How many single-end libraries are in our samples?

Solution

$ cut -f3 SraRunTable.txt | grep SINGLE | wc -l
35

How many of each class of library layout are there?

We can extract even more information from our metadata table if we add in some new tools: sort and uniq. The sort command will sort the lines of a text file and the uniq command will filter out repeated neighboring lines in a file. You might expect uniq to extract all of the unique lines in a file. This isn’t what it does, however, for reasons involving computer memory and speed. If we want to extract all unique lines, we can do so by combining uniq with sort. We’ll see how to do this soon.

For example, if we want to know how many samples of each library type are recorded in our table, we can extract the third column (with cut), and pipe that output into sort.

$ cut -f3 SraRunTable.txt | sort

If you look closely, you might see that we have one line that reads “LibraryLayout_s”. This is the header of our column. We can discard this information using the -v flag in grep, which means return all the lines that do not match the search pattern.

$ cut -f3 SraRunTable.txt | grep -v LibraryLayout_s | sort

This command returns a sorted list (too long to show here) of PAIRED and SINGLE values. We can use the uniq command to see a list of all the different categories that are present. If we do this, we see that the only two types of libraries we have present are labelled PAIRED and SINGLE. There aren’t any other types in our file.

$ cut -f3 SraRunTable.txt | grep -v LibraryLayout_s | sort | uniq
PAIRED
SINGLE

If we want to count how many of each we have, we can use the -c (count) flag for uniq.

$ cut -f3 SraRunTable.txt | grep -v LibraryLayout_s | sort | uniq -c
2 PAIRED
35 SINGLE

Exercise

  1. How many different sample load dates are there?
  2. How many samples were loaded on each date?

Solution

  1. There are two different sample load dates.

    cut -f5 SraRunTable.txt | grep -v LoadDate_s | sort | uniq
    
    25-Jul-12
    29-May-14
    
  2. Six samples were loaded on one date and 31 were loaded on the other.

    cut -f5 SraRunTable.txt | grep -v LoadDate_s | sort | uniq -c
    
     6 25-Jul-12
    31 29-May-14
    

Can we sort the file by library layout and save that sorted information to a new file?

We might want to re-order our entire metadata table so that all of the paired-end samples appear together and all of the single-end samples appear together. We can use the -k (key) flag for sort to sort based on a particular column. This is similar to the -f flag for cut.

Let’s sort based on the third column (-k3) and redirect our output to a new file.

$ sort -k3 SraRunTable.txt > SraRunTable_sorted_by_layout.txt

Can we extract only paired-end records into a new file?

We also might want to extract the information for all samples that meet a specific criterion (for example, are paired-end) and put those lines of our table in a new file. First, we need to check to make sure that the pattern we’re searching for (“PAIRED”) only appears in the column where we expect it to occur (column 3). We know from earlier that there are only two paired-end samples in the file, so we can grep for “PAIRED” and see how many results we get.

$ grep PAIRED SraRunTable.txt | wc -l
2

There are only two results, so we can use “PAIRED” as our search term to extract the paired-end samples to a new file.

$ grep PAIRED SraRunTable.txt > SraRunTable_only_paired_end.txt

Exercise

Sort samples by load date and export each of those sets to a new file (one new file per unique load date).

Solution

$ grep 25-Jul-12 SraRunTable.txt > SraRunTable_25-Jul-12.txt
$ grep 29-May-14 SraRunTable.txt > SraRunTable_29-May-14.txt

Making code more customizeable using command line arguments

In Lesson 05 (Writing Scripts) we used the grep command line tool to look for FASTQ records with lots of Ns from all the .fastq files in our current folder using the following code:

$ grep -B1 -A2 -h NNNNNNNNNN *.fastq | grep -v '^--' > scripted_bad_reads.txt

This is very useful, but could be more customizeable. We may want to be able to run this command over and over again without needing to copy and paste it and allow the user to specify exactly which file they want to examine for bad reads.

We can accomplish these goals by including the above command in a script that takes in user input via a command line argument. We can slightly modify our bad-reads-script.sh file to do so. Use c to copy your bad-reads-script.sh into a new script called custom-bad-reads-script.sh. Make the following modifications to custom-bad-reads-script.sh:

filename=$1
grep -B1 -A2 -h NNNNNNNNNN $filename | grep -v '^--' > scripted_bad_reads.txt

$1 is our command line argument. The line filename=$1 tells Bash to take the first thing you type after the name of the script itself and assign that value to a variable called filename.

For example, this script can be run in the following way to output the bad reads just from one file:

bash custom-bad-reads-script.sh SRR098026.fastq

We can then take a look at what the output file currently contains using head scripted_bad_reads.txt:

@SRR098026.1 HWUSI-EAS1599_1:2:1:0:968 length=35
NNNNNNNNNNNNNNNNCNNNNNNNNNNNNNNNNNN
+SRR098026.1 HWUSI-EAS1599_1:2:1:0:968 length=35
!!!!!!!!!!!!!!!!#!!!!!!!!!!!!!!!!!!
@SRR098026.2 HWUSI-EAS1599_1:2:1:0:312 length=35
NNNNNNNNNNNNNNNNANNNNNNNNNNNNNNNNNN
+SRR098026.2 HWUSI-EAS1599_1:2:1:0:312 length=35
!!!!!!!!!!!!!!!!#!!!!!!!!!!!!!!!!!!
@SRR098026.3 HWUSI-EAS1599_1:2:1:0:570 length=35
NNNNNNNNNNNNNNNNANNNNNNNNNNNNNNNNNN

This should be same output as using our original code and manually modifying the original standalone code on the command line to “SRR098026.fastq” on the command line, which should give us the same output as above:

$ grep -B1 -A2 -h NNNNNNNNNN SRR098026.fastq | grep -v '^--' > scripted_bad_reads.txt
head scripted_bad_reads.txt
@SRR098026.1 HWUSI-EAS1599_1:2:1:0:968 length=35
NNNNNNNNNNNNNNNNCNNNNNNNNNNNNNNNNNN
+SRR098026.1 HWUSI-EAS1599_1:2:1:0:968 length=35
!!!!!!!!!!!!!!!!#!!!!!!!!!!!!!!!!!!
@SRR098026.2 HWUSI-EAS1599_1:2:1:0:312 length=35
NNNNNNNNNNNNNNNNANNNNNNNNNNNNNNNNNN
+SRR098026.2 HWUSI-EAS1599_1:2:1:0:312 length=35
!!!!!!!!!!!!!!!!#!!!!!!!!!!!!!!!!!!
@SRR098026.3 HWUSI-EAS1599_1:2:1:0:570 length=35
NNNNNNNNNNNNNNNNANNNNNNNNNNNNNNNNNN


Moving and Downloading Data

So far, we’ve worked with data that is pre-loaded on the instance in the cloud. Usually, however, most analyses begin with moving data onto the instance. Below we’ll show you some commands to download data onto your instance, or to move data between your computer and the cloud.

Getting data from the cloud

There are two programs that will download data from a remote server to your local (or remote) machine: wget and curl. They were designed to do slightly different tasks by default, so you’ll need to give the programs somewhat different options to get the same behaviour, but they are mostly interchangeable.

Which one you need to use mostly depends on your operating system, as most computers will only have one or the other installed by default.

Let’s say you want to download some data from Ensembl. We’re going to download a very small tab-delimited file that just tells us what data is available on the Ensembl bacteria server. Before we can start our download, we need to know whether we’re using curl or wget.

To see which program you have, type:

$ which curl
$ which wget

which is a BASH program that looks through everything you have installed, and tells you what folder it is installed to. If it can’t find the program you asked for, it returns nothing, i.e. gives you no results.

On Mac OSX, you’ll likely get the following output:

$ which curl
/usr/bin/curl
$ which wget
$

This output means that you have curl installed, but not wget.

Once you know whether you have curl or wget, use one of the following commands to download the file:

$ cd
$ wget ftp://ftp.ensemblgenomes.org/pub/release-37/bacteria/species_EnsemblBacteria.txt

or

$ cd
$ curl -O ftp://ftp.ensemblgenomes.org/pub/release-37/bacteria/species_EnsemblBacteria.txt

Since we wanted to download the file rather than just view it, we used wget without any modifiers. With curl however, we had to use the -O flag, which simultaneously tells curl to download the page instead of showing it to us and specifies that it should save the file using the same name it had on the server: species_EnsemblBacteria.txt

It’s important to note that both curl and wget download to the computer that the command line belongs to. So, if you are logged into AWS on the command line and execute the curl command above in the AWS terminal, the file will be downloaded to your AWS machine, not your local one.

Moving files between your laptop and your instance

What if the data you need is on your local computer, but you need to get it into the cloud? There are also several ways to do this, but it’s always easier to start the transfer locally. This means if you’re typing into a terminal, the terminal should not be logged into your instance, it should be showing your local computer. If you’re using a transfer program, it needs to be installed on your local machine, not your instance.

Transferring Data Between your Local Machine and the Cloud

These directions are platform specific, so please follow the instructions for your system:

Please select the platform you wish to use for the exercises:

Uploading Data to your Virtual Machine with scp

scp stands for ‘secure copy protocol’, and is a widely used UNIX tool for moving files between computers. The simplest way to use scp is to run it in your local terminal, and use it to copy a single file:

scp <file I want to move> <where I want to move it>

Note that you are always running scp locally, but that doesn’t mean that you can only move files from your local computer. In order to move a file from your local computer to an AWS instance, the command would look like this:

$ scp <local file> <AWS instance>

To move it back to your local computer, you re-order the to and from fields:

$ scp <AWS instance> <local file>

Uploading Data to your Virtual Machine with scp

Open the terminal and use the scp command to upload a file (e.g. local_file.txt) to the dcuser home directory:

$  scp local_file.txt dcuser@ip.address:/home/dcuser/

Downloading Data from your Virtual Machine with scp

Let’s download a text file from our remote machine. You should have a file that contains bad reads called ~/shell_data/scripted_bad_reads.txt.

Tip: If you are looking for another (or any really) text file in your home directory to use instead, try:

$ find ~ -name *.txt

Download the bad reads file in ~/shell_data/scripted_bad_reads.txt to your home ~/Download directory using the following command (make sure you substitute dcuser@ip.address with your remote login credentials):

$ scp dcuser@ip.address:/home/dcuser/shell_data/untrimmed_fastq/scripted_bad_reads.txt ~/Downloads

Remember that in both instances, the command is run from your local machine, we’ve just flipped the order of the to and from parts of the command.

Uploading Data to your Virtual Machine with PSCP

If you’re using a PC, we recommend you use the PSCP program. This program is from the same suite of tools as the PuTTY program we have been using to connect.

  1. If you haven’t done so, download pscp from http://the.earth.li/~sgtatham/putty/latest/x86/pscp.exe
  2. Make sure the PSCP program is somewhere you know on your computer. In this case, your Downloads folder is appropriate.
  3. Open the windows PowerShell; go to your start menu/search enter the term ‘cmd’; you will be able to start the shell (the shell should start from C:\Users\your-pc-username>).
  4. Change to the Downloads directory:
> cd Downloads
  1. Locate a file on your computer that you wish to upload (be sure you know the path). Then upload it to your remote machine (you will need to know your AMI instance address (which starts with ec2), and login credentials). You will be prompted to enter a password, and then your upload will begin. (make sure you substitute ‘your-pc-username’ for your actual pc username and ‘ec2-54-88-126-85.compute-1.amazonaws.com’ with your AMI instance address)
C:\User\your-pc-username\Downloads> pscp.exe local_file.txt dcuser@ec2-54-88-126-85.compute-1.amazonaws.com:/home/dcuser/

Downloading Data from your Virtual Machine with PSCP

  1. Follow the instructions in the Upload section to download (if needed) and access the PSCP program (steps 1-3)
  2. Download the text file to your current working directory (represented by a .) using the following command (make sure you substitute ‘your-pc-username’ for your actual pc username and ‘ec2-54-88-126-85.compute-1.amazonaws.com’ with your AMI instance address)
C:\User\your-pc-username\Downloads> pscp.exe dcuser@ec2-54-88-126-85.compute-1.amazonaws.com:/home/dcuser/shell_data/untrimmed_fastq/scripted_bad_reads.txt .

C:\User\your-pc-username\Downloads