Data Analysis Pipelines

Table of Content

1. Overview of data
2. Computing resources
3. Pipelines
4. References

1. Overview of data

Sequence data are generated by the Illumina HiSeq 2000 sequencer. The runs are processed using Illumina's CASAVA 1.8 software. The resulting output are phred33 encoded 100 b.p. paired-end fastq files. Small RNA-seq data are phred33 encoded 50 b.p. single-end fastq files.

2. Computing resources

The fastq files are the starting point of every pipeline and are uploaded to a compute cluster. Two Compute Canada clusters (Guillimin, ~14,000 cores / Mammouth2, ~40,000 cores) are mainly used for data analysis. Depending on cluster usage and/or space, our own internal cluster (Abacus, ~1000 cores) can be used for data analysis. The previously mentioned clusters use Torque or Moab on Torque as schedulers. The amount of ram per node varies from 32GB to 1TB. The bandwidth from the Innovation Center to these clusters is 10Gb/s.

3. Pipelines

hg38 datasets

All hg38 aligned data was processed using GenPipes 3.1.0. Alignments were performed with BWA v0.7.12 except for all ChIP-Seq assays which used BWA v0.7.15. Specific experimental attributes are described in the metadata accompanying each dataset.

hg19 datasets

Parameters for hg19 aligned datasets are described below.

3.1 ChIP-seq

3.1.1 Fastq trimming/clipping

Raw reads are trimmed for quality ( phred33 >= 30 ) and length ( n >= 32 ) using Trimmomatic v. 0.22 [1]. Illumina adapter sequences are also clipped off using the palindrome mode from this software.

Software options

ILLUMINACLIP:adapters.fa:2:30:15 TRAILING:30 MINLEN:32

adapters.fa:

>Prefixtruseq/1

TACACTCTTTCCCTACACGACGCTCTTCCGATCT

>Prefixtruseq/2

GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT

3.1.2 Alignment to reference

The filtered reads are aligned to the hg19 human reference (for human samples) using BWA v. 0.6.1 [2]. The .bam file of BWA sampe is then sorted by coordinates using MergeSamFiles.jar from Picard tools v. 1.70 [3]. Every lane of a library is aligned independently and then merged using MergeSamFiles.jar for computing efficiency.

3.1.3 Marking duplicates

Duplicates are marked in the merged .bam file using Picard's MarkDuplicates.jar.

3.1.4 Peak calling

Peak calls are obtained using MACS2 v. 2.0.10.07132012 [4]. Different parameters are used whether the peaks are defined as narrow or broad and whether there is an input sample or not. All of the peak calls done until now have had an input sample. Human peak calling options shown below.

Software options

Broad peaks with input:

--nomodel --broad -g hs

Broad peaks with input:

--nomodel --broad -g hs

Broad peaks with input:

--nomodel --broad -g hs

Narrow peaks without input:

-g hs

Narrow peaks with input:

-g hs

3.1.5 Fraction of reads in peaks

Fraction of reads in peaks (FrIP) is obtained using samtools view v. 0.1.18 [5] on the called peaks. bed file. Duplicates and non primary alignments are not kept for the analysis.

Software options

-F 260 -c -L peaks.bed

3.1.6 Cross correlation analysis

Normalized strand cross-correlation coefficient (NSC) and relative strand cross-correlation coefficient (RSC) values are obtained using samtools and phantompeakqualtools (spp) v. 1.10.1 [6]. Duplicates and non primary alignments are not kept for the analysis. Only the first mate of the read is analyzed.

Software options

samtools view:

-b -h -f 64 -F 1280

spp:

-s=0:2:400 -rf

3.1.7 Wiggle tracks

Bigwig tracks are generated using bam_to_wiggle.py from the bcbb nextgen software suite [7]. The read counts are normalized to reads per million.

Software options

--normalize

3.2 RNA-seq

3.2.1 Fastq trimming/clipping

Same as procedure 3.1.1

3.2.2 Alignment to reference

The filtered reads are aligned to the hg19 human reference (for human samples) using Tophat v. 1.4.1 [8] and bowtie v. 0.12.8 [9]. The resulting bam file is sorted by coordinate. Every lane of a library is aligned independently and then merged using MergeSamFiles.jar for computing efficiency.

Software options (tophat)

-G hg19.ucsc.gtf (--library-type fr-firststrand (if strand specific library)

3.2.3 Marking duplicates

Same as procedure 3.1.3

3.2.4 Gene raw read counts

Gene raw read count values are obtained using htseq-count v. 0.5.3p9 [10]. The .bam file is initially sorted by read name since the data is paired-end and htseq requires this.

Software options

hg19.gtf (ucsc and ensembl) --stranded=reverse (if strand specific library)

3.2.5 Gene and transcript FPKM

Normalized gene and transcript expression values are obtained using cufflinks v. 2.1.1 [11].

Software options

-G hg19.gtf(ucsc and ensembl) -q --max-bundle-length 100000000000 --max-bundle-frags 100000000000 --library-type fr-firststrand (if strand specific library)

3.2.6 Quality control metrics

Multiple quality control metrics are obtained using RNA-SeQC v. 1.1.7 [12]. Some of those metrics include: rRNA rate, intragenic/intronic/exonic/intergenic rates, transcript coverage, etc. Strand specificity values are obtained using RseQC v. 2.3.1 [13] on splice junction reads only.

Software options

RNA-SeQC:

-n 1000 -t hg19.gtf (ucsc) -BWArRNA

3.2.7 Wiggle tracks

Bigwig tracks are generated by samtools and genomeCoverageBed from BEDTools [14]. If the library is strand specific, reads from each strand are separated with samtools as a first step.

Software options (BEDTools)

-bg -split -scale scale_factor

The scale factor is determined as the number of non duplicates and primary reads aligned divided by 10E6.

3.3 Whole genome bisulfite-seq

3.3.1 Fastq trimming/clipping

Same as procedure 3.1.1

3.3.2 Additional pipeline steps

Additional pipeline steps (up until methylation calls) are the ones described under the 'Basic protocol 3' section of this paper [15]. Since there are about 30 steps, they won't be described in this document. The software versions of bwa and samtools used in this pipeline are the same as the ones reported in the previous pipelines. The software version of nxtgen-utils is 0.12.2 [16].

3.3.3 Genome coverage

Genome coverage is obtained using GATK's DepthOfCoverage v. 1.6-11-g3b2fab9 [17].

Software options

--omitDepthOutputAtEachBase --logging_level ERROR --summaryCoverageThreshold 10 --summaryCoverageThreshold 20 --summaryCoverageThreshold 30 --summaryCoverageThreshold 40 --summaryCoverageThreshold 50 --summaryCoverageThreshold 75 --summaryCoverageThreshold 100 --start 1 --stop 1000 --nBins 999 -dt NONE

3.3.4 Wiggle tracks

Same as procedure 3.1.7

3.4 small RNA-seq

3.4.1 Fastq trimming/clipping

Same as procedure 3.1.1 except:
- Minimum length is n >= 10 since small fragments are sequenced.
- Adapter clipping is with the simple mode (clip off only)

adapters.fa:
>Truseq
TGGAATTCTCGGGTGCCAAGG

3.4.2 Alignment to reference

Same as procedure 3.2.2 but on single-end fastq files.

3.4.3 Marking duplicates

Same as procedure 3.1.3

3.4.4 Gene raw read counts

Same as procedure 3.2.4

3.4.5 Gene and transcript FPKM

Same as procedure 3.2.5

Additional software option

--no-effective-length-correction

3.4.6 Quality control metrics

Same as procedure 3.2.6

3.4.7 Wiggle tracks

Same as procedure 3.2.7

References

[1] Lohse, Marc, et al. "RobiNA: a user-friendly, integrated software solution for RNA-Seq-based
transcriptomics." Nucleic Acids Research 40.W1 (2012): W622-W627.

[2] Li, Heng, and Richard Durbin. "Fast and accurate short read alignment with Burrows–Wheeler
transform." Bioinformatics 25.14 (2009): 1754-1760.

[3] http://picard.sourceforge.net

[4] Feng, Jianxing, et al. "Identifying ChIP-seq enrichment using MACS." Nature Protocols 7.9 (2012): 1728-1740.

[5] Li, Heng, et al. "The sequence alignment/map format and SAMtools." Bioinformatics 25.16 (2009): 2078-2079.

[6] Anshul Kundaje*, Lucy Yungsook Jung*, Peter Kharchenko, Barbara Wold, Arend Sidow, Serafim Batzoglou, Peter Park.
Assessment of ChIP-seq data quality using cross-correlation analysis (Submitted)

[7] https://github.com/chapmanb/bcbb/blob/master/nextgen/scripts/bam_to_wiggle.py

[8] Trapnell, Cole, Lior Pachter, and Steven L. Salzberg. "TopHat: discovering splice junctions with RNA-Seq."
Bioinformatics 25.9 (2009): 1105-1111.

[9] Langmead, Ben, et al. "Ultrafast and memory-efficient alignment of short DNA sequences to the human genome."
Genome Biol 10.3 (2009): R25.

[10] http://www-huber.embl.de/users/anders/HTSeq

[11] Trapnell, Cole, et al. "Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation."
Nature biotechnology 28.5 (2010): 511-515.

[12] DeLuca, David S., et al. "RNA-SeQC: RNA-seq metrics for quality control and process
optimization." Bioinformatics 28.11 (2012): 1530-1532.

[13] Wang, Liguo, Shengqin Wang, and Wei Li. "RSeQC: quality control of RNA-seq experiments.
" Bioinformatics 28.16 (2012): 2184-2185.

[14] Quinlan, Aaron R., and Ira M. Hall. "BEDTools: a flexible suite of utilities for comparing genomic features."
Bioinformatics 26.6 (2010): 841-842.

[15] Johnson, Michelle D., et al. "Single Nucleotide Analysis of Cytosine Methylation by Whole‐Genome Shotgun Bisulfite Sequencing."
Current Protocols in Molecular Biology (2012): 21-23.

[16] http://code.google.com/p/nxtgen-utils/downloads/list

[17] McKenna, Aaron, et al. "The Genome Analysis Toolkit: a MapReduce framework for analyzing
next-generation DNA sequencing data." Genome research 20.9 (2010): 1297-1303.