Usage

This section describes how to install and use the pipeline. It is written assuming you are using a Linux computer, but the same instructions should be valid on a Mac OS X computer as well. The pipeline depends on external software that needs to be installed before attempting to run it; for the test example below this includes FastQC, sickle and Bowtie2, in addition to the Python environments set up next.

Setting up the python environment

Unfortunately, this pipeline needs two different Python environments, since it requires two different versions of Python (2.7 and 3.4). These instructions will set up both environments for you using miniconda from Continuum. If you’re already using miniconda or anaconda, you might want to edit the commands to suit your needs.

First download and install miniconda in your home directory:

cd ~
wget http://repo.continuum.io/miniconda/Miniconda-latest-Linux-x86_64.sh -O miniconda.sh
chmod u+x miniconda.sh
./miniconda.sh -b
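
The -b flag installs miniconda without asking questions, but also without adding it to your PATH, so you will most likely need to do that yourself. Assuming the default install location of ~/miniconda:

export PATH=~/miniconda/bin:$PATH

You might also want to add this line to your ~/.bashrc so that conda is available in new shells.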

Now we are ready to create the two environments:

conda create -n BLUEPRINT_pipeline python=3.4
conda create -n BLUEPRINT_pipeline_2.7 python=2.7

source activate BLUEPRINT_pipeline_2.7
conda install --yes pip numpy scipy pandas
pip install pysam
source deactivate

source activate BLUEPRINT_pipeline
conda install pip snakemake
source deactivate
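
To verify that both environments were created correctly, you can activate each one and check the reported Python version:

source activate BLUEPRINT_pipeline
python --version
source deactivate

The first should report Python 3.4, and the corresponding check for BLUEPRINT_pipeline_2.7 should report Python 2.7.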

This will create one environment named BLUEPRINT_pipeline, which uses Python 3.4 and is the main environment used by the pipeline. The other environment, BLUEPRINT_pipeline_2.7, is used for a custom script that quantifies the open reading frames as a part of the pipeline.

We’re now ready to install the pipeline itself, starting by downloading it from GitHub:

wget https://github.com/EnvGen/BLUEPRINT_pipeline/archive/master.zip
unzip master.zip

cd BLUEPRINT_pipeline-master
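
Alternatively, if you have git installed, you can clone the repository instead of downloading the archive; note that the directory will then be named BLUEPRINT_pipeline rather than BLUEPRINT_pipeline-master:

git clone https://github.com/EnvGen/BLUEPRINT_pipeline.git
cd BLUEPRINT_pipeline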

If these commands are successful, you should now have everything ready to run the small test example below.

Running an example

A small test data example is supplied with the pipeline source code. By running this example, we aim to show how the pipeline works in practice. First, let’s move into the test directory:

cd test

This directory currently contains a few files, as the output of the excellent ‘tree’ command shows:

$ tree
.
├── config.json
├── config_uppmax.json
├── Snakefile
└── test_data
    ├── annotation
    │   └── reference
    │       └── assembly_v1.gff
    ├── references
    │   └── assembly_v1.fna
    └── samples
        ├── after_qc
        │   ├── 120322_R1.fastq
        │   ├── 120322_R2.fastq
        │   ├── 120507_R1.fastq
        │   └── 120507_R2.fastq
        └── raw
            ├── 120322_R1.fastq
            ├── 120322_R2.fastq
            ├── 120507_R1.fastq
            └── 120507_R2.fastq

7 directories, 13 files

The files within the test_data directory are the data that we’ll use to kick off our pipeline. The Snakefile defines what result files we’d like to create and how to create them. The config.json file defines specific configurations we’ll need (in this case, where the python2 executable can be found), and config_uppmax.json is a special config file needed only if we’re running our test on one of the UPPMAX clusters.

We’ll use the snakemake command to run our pipeline. The first command executes a rule in the Snakefile that creates a directory structure suited to the pipeline and puts the needed test_data files in place:

snakemake prepare

Let’s have a look at what this command created:

$ tree
.
├── annotation
│   └── reference
│       └── assembly_v1.gff
├── config.json
├── config_uppmax.json
├── mapping
├── quantification
├── references
│   └── assembly_v1.fna
├── rpkm_for_orfs.py -> ~/repos/BLUEPRINT_pipeline/test/../scripts/rpkm_for_orfs.py
├── samples
│   ├── after_qc
│   │   ├── 120322_R1.fastq
│   │   ├── 120322_R2.fastq
│   │   ├── 120507_R1.fastq
│   │   └── 120507_R2.fastq
│   └── raw
│       ├── 120322_R1.fastq
│       ├── 120322_R2.fastq
│       ├── 120507_R1.fastq
│       └── 120507_R2.fastq
├── Snakefile
└── test_data
    ├── annotation
    │   └── reference
    │       └── assembly_v1.gff
    ├── references
    │   └── assembly_v1.fna
    └── samples
        ├── after_qc
        │   ├── 120322_R1.fastq
        │   ├── 120322_R2.fastq
        │   ├── 120507_R1.fastq
        │   └── 120507_R2.fastq
        └── raw
            ├── 120322_R1.fastq
            ├── 120322_R2.fastq
            ├── 120507_R1.fastq
            └── 120507_R2.fastq

15 directories, 24 files

This shows us that the command has created a directory structure similar to the one present in the test_data directory and copied the files from it. It has also created two new directories, mapping and quantification, where output from the pipeline will be stored, and a link to the script rpkm_for_orfs.py. Now we should check what the pipeline would do if we executed it. By adding the --dryrun argument, snakemake will not execute any commands but only show what it would do:

snakemake --dryrun test_qc

This should output a list of rules with input files and output files connected to them. After going through this list, running the first part of the pipeline should now be as simple as:

snakemake test_qc
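
Snakemake can also run independent jobs in parallel. If you, for example, have four cores available, the same target can be built with:

snakemake -j 4 test_qc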

If everything went all right, you should now have the following files:

$ tree
.
├── annotation
│   └── reference
│       └── assembly_v1.gff
├── config.json
├── config_uppmax.json
├── mapping
├── quantification
├── references
│   └── assembly_v1.fna
├── rpkm_for_orfs.py -> ~/repos/BLUEPRINT_pipeline/test/../scripts/rpkm_for_orfs.py
├── samples
│   ├── after_qc
│   │   ├── 120322_R1.fastq
│   │   ├── 120322_R2.fastq
│   │   ├── 120507_R1.fastq
│   │   └── 120507_R2.fastq
│   ├── fastqc
│   │   ├── 120322
│   │   │   ├── 120322_R1_fastqc.html
│   │   │   └── 120322_R2_fastqc.html
│   │   └── 120507
│   │       ├── 120507_R1_fastqc.html
│   │       └── 120507_R2_fastqc.html
│   ├── raw
│   │   ├── 120322_R1.fastq
│   │   ├── 120322_R2.fastq
│   │   ├── 120507_R1.fastq
│   │   └── 120507_R2.fastq
│   └── sickle
│       ├── 120322.log
│       ├── 120322_R1.fastq
│       ├── 120322_R2.fastq
│       ├── 120322_single.fastq
│       ├── 120507.log
│       ├── 120507_R1.fastq
│       ├── 120507_R2.fastq
│       └── 120507_single.fastq
├── Snakefile
└── test_data
    ├── annotation
    │   └── reference
    │       └── assembly_v1.gff
    ├── references
    │   └── assembly_v1.fna
    └── samples
        ├── after_qc
        │   ├── 120322_R1.fastq
        │   ├── 120322_R2.fastq
        │   ├── 120507_R1.fastq
        │   └── 120507_R2.fastq
        └── raw
            ├── 120322_R1.fastq
            ├── 120322_R2.fastq
            ├── 120507_R1.fastq
            └── 120507_R2.fastq

19 directories, 36 files

At this point, in a real case, you should have a look at the FastQC output files with a .html extension. These are reports on the quality of the input files. Based on them, you have to decide whether the input files are good enough to continue with, in which case you copy them to the after_qc directory, or whether some additional step has to be run first, such as cutting adaptor sequences. For this example we’ve prepared this step already.
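
In a real run, once you’re satisfied with the quality of the trimmed reads, copying them into place could for example look like this (shown here for sample 120322, assuming sickle’s output is what you want to continue with):

cp samples/sickle/120322_R1.fastq samples/sickle/120322_R2.fastq samples/after_qc/

We’re now ready to take the next step. To check what it will execute, run: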

snakemake --dryrun all_from_mapping

and to kick off the last main part of the pipeline, run:

snakemake all_from_mapping

If everything went all right, you should now have the following files:

$ tree
.
├── annotation
│   └── reference
│       └── assembly_v1.gff
├── config.json
├── config_uppmax.json
├── mapping
│   └── bowtie2
│       ├── assembly_v1
│       │   ├── 120322
│       ├── assembly_v1.1.bt2
│       ├── assembly_v1.2.bt2
│       ├── assembly_v1.3.bt2
│       ├── assembly_v1.4.bt2
│       ├── assembly_v1.rev.1.bt2
│       └── assembly_v1.rev.2.bt2
├── quantification
│   └── assembly_v1
│       └── orf
│           ├── 120322
│           │   └── 120322.rpkm
│           └── 120507
│               └── 120507.rpkm
├── references
│   └── assembly_v1.fna
├── rpkm_for_orfs.py -> ~/repos/BLUEPRINT_pipeline/test/../scripts/rpkm_for_orfs.py
├── samples
│   ├── after_qc
│   │   ├── 120322_R1.fastq
│   │   ├── 120322_R2.fastq
│   │   ├── 120507_R1.fastq
│   │   └── 120507_R2.fastq
│   ├── fastqc
│   │   ├── 120322
│   │   │   ├── 120322_R1_fastqc.html
│   │   │   └── 120322_R2_fastqc.html
│   │   └── 120507
│   │       ├── 120507_R1_fastqc.html
│   │       └── 120507_R2_fastqc.html
│   ├── raw
│   │   ├── 120322_R1.fastq
│   │   ├── 120322_R2.fastq
│   │   ├── 120507_R1.fastq
│   │   └── 120507_R2.fastq
│   └── sickle
│       ├── 120322.log
│       ├── 120322_R1.fastq
│       ├── 120322_R2.fastq
│       ├── 120322_single.fastq
│       ├── 120507.log
│       ├── 120507_R1.fastq
│       ├── 120507_R2.fastq
│       └── 120507_single.fastq
├── Snakefile
└── test_data
    ├── annotation
    │   └── reference
    │       └── assembly_v1.gff
    ├── references
    │   └── assembly_v1.fna
    └── samples
        ├── after_qc
        │   ├── 120322_R1.fastq
        │   ├── 120322_R2.fastq
        │   ├── 120507_R1.fastq
        │   └── 120507_R2.fastq
        └── raw
            ├── 120322_R1.fastq
            ├── 120322_R2.fastq
            ├── 120507_R1.fastq
            └── 120507_R2.fastq

27 directories, 50 files

The two most interesting files here are 120322.rpkm and 120507.rpkm. These should contain one row for each open reading frame (ORF) found in the file annotation/reference/assembly_v1.gff. Each row contains the ORF id and an RPKM value, ready to be imported into e.g. a database.
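
For reference, RPKM (reads per kilobase per million mapped reads) is conventionally computed as:

RPKM = (reads mapped to the ORF * 10^9) / (ORF length in bases * total number of mapped reads in the sample)

so each value is normalised both for ORF length and for sequencing depth; we assume here that the pipeline’s script follows this standard definition. You can inspect the output directly, for example:

head quantification/assembly_v1/orf/120322/120322.rpkm

If you’d like to remove all the files created in this exercise in a smooth way, there is a special rule for that as well: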

# Running this will delete all directories created by the prepare command
snakemake clean_up
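
As with the other targets, you can preview what this will do by adding --dryrun before executing it:

snakemake --dryrun clean_up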

Now you’re ready to start all over again with the snakemake prepare command.