Sequence data are generated by the Illumina HiSeq 2000 sequencer. The runs are processed using Illumina's CASAVA 1.8 software. The resulting output are phred33 encoded 100 b.p. paired-end fastq files. Small RNA-seq data are phred33 encoded 50 b.p. single-end fastq files.
The fastq files are the starting point of every pipeline and are uploaded to a compute cluster. Two Compute Canada clusters (Guillimin, ~14,000 cores / Mammouth2, ~40,000 cores) are mainly used for data analysis. Depending on cluster usage and/or space, our own internal cluster (Abacus, ~1000 cores) can be used for data analysis. The previously mentioned clusters use Torque or Moab on Torque as schedulers. The amount of ram per node varies from 32GB to 1TB. The bandwidth from the Innovation Center to these clusters is 10Gb/s.
All hg38 aligned data was processed using GenPipes 3.1.0. Alignments were performed with BWA v0.7.12 except for all ChIP-Seq assays which used BWA v0.7.15. Specific experimental attributes are described in the metadata accompanying each dataset.
Parameters for hg19 aligned datasets are described below.
Raw reads are trimmed for quality ( phred33 >= 30 ) and length ( n >= 32 ) using Trimmomatic v. 0.22 [1]. Illumina adapter sequences are also clipped off using the palindrome mode from this software.
The filtered reads are aligned to the hg19 human reference (for human samples) using BWA v. 0.6.1 [2]. The .bam file of BWA sampe is then sorted by coordinates using MergeSamFiles.jar from Picard tools v. 1.70 [3]. Every lane of a library is aligned independently and then merged using MergeSamFiles.jar for computing efficiency.
Duplicates are marked in the merged .bam file using Picard's MarkDuplicates.jar.
Peak calls are obtained using MACS2 v. 2.0.10.07132012 [4]. Different parameters are used whether the peaks are defined as narrow or broad and whether there is an input sample or not. All of the peak calls done until now have had an input sample. Human peak calling options shown below.
Fraction of reads in peaks (FrIP) is obtained using samtools view v. 0.1.18 [5] on the called peaks. bed file. Duplicates and non primary alignments are not kept for the analysis.
Normalized strand cross-correlation coefficient (NSC) and relative strand cross-correlation coefficient (RSC) values are obtained using samtools and phantompeakqualtools (spp) v. 1.10.1 [6]. Duplicates and non primary alignments are not kept for the analysis. Only the first mate of the read is analyzed.
Bigwig tracks are generated using bam_to_wiggle.py from the bcbb nextgen software suite [7]. The read counts are normalized to reads per million.
Same as procedure 3.1.1
The filtered reads are aligned to the hg19 human reference (for human samples) using Tophat v. 1.4.1 [8] and bowtie v. 0.12.8 [9]. The resulting bam file is sorted by coordinate. Every lane of a library is aligned independently and then merged using MergeSamFiles.jar for computing efficiency.
Same as procedure 3.1.3
Gene raw read count values are obtained using htseq-count v. 0.5.3p9 [10]. The .bam file is initially sorted by read name since the data is paired-end and htseq requires this.
Normalized gene and transcript expression values are obtained using cufflinks v. 2.1.1 [11].
Multiple quality control metrics are obtained using RNA-SeQC v. 1.1.7 [12]. Some of those metrics include: rRNA rate, intragenic/intronic/exonic/intergenic rates, transcript coverage, etc. Strand specificity values are obtained using RseQC v. 2.3.1 [13] on splice junction reads only.
Bigwig tracks are generated by samtools and genomeCoverageBed from BEDTools [14]. If the library is strand specific, reads from each strand are separated with samtools as a first step.
The scale factor is determined as the number of non duplicates and primary reads aligned divided by 10E6.
Same as procedure 3.1.1
Additional pipeline steps (up until methylation calls) are the ones described under the 'Basic protocol 3' section of this paper [15]. Since there are about 30 steps, they won't be described in this document. The software versions of bwa and samtools used in this pipeline are the same as the ones reported in the previous pipelines. The software version of nxtgen-utils is 0.12.2 [16].
Genome coverage is obtained using GATK's DepthOfCoverage v. 1.6-11-g3b2fab9 [17].
Same as procedure 3.1.7
Same as procedure 3.1.1 except:
- Minimum length is n >= 10 since small fragments are sequenced.
- Adapter clipping is with the simple mode (clip off only)
adapters.fa:
>Truseq
TGGAATTCTCGGGTGCCAAGG
Same as procedure 3.1.3
Same as procedure 3.2.4
Same as procedure 3.2.5
--no-effective-length-correction
Same as procedure 3.2.6
Same as procedure 3.2.7
[1] Lohse, Marc, et al. "RobiNA: a user-friendly, integrated software solution for
RNA-Seq-based
transcriptomics." Nucleic Acids Research 40.W1 (2012): W622-W627.
[2] Li, Heng, and Richard Durbin. "Fast and accurate short read alignment with Burrows–Wheeler
transform." Bioinformatics 25.14 (2009): 1754-1760.
[3] http://picard.sourceforge.net
[4] Feng, Jianxing, et al. "Identifying ChIP-seq enrichment using MACS." Nature Protocols 7.9 (2012): 1728-1740.
[5] Li, Heng, et al. "The sequence alignment/map format and SAMtools." Bioinformatics 25.16 (2009): 2078-2079.
[6] Anshul Kundaje*, Lucy Yungsook Jung*, Peter Kharchenko, Barbara Wold, Arend Sidow, Serafim
Batzoglou, Peter Park.
Assessment of ChIP-seq data quality using cross-correlation analysis (Submitted)
[7] https://github.com/chapmanb/bcbb/blob/master/nextgen/scripts/bam_to_wiggle.py
[8] Trapnell, Cole, Lior Pachter, and Steven L. Salzberg. "TopHat: discovering splice junctions with
RNA-Seq."
Bioinformatics 25.9 (2009): 1105-1111.
[9] Langmead, Ben, et al. "Ultrafast and memory-efficient alignment of short DNA sequences to the
human genome."
Genome Biol 10.3 (2009): R25.
[10] http://www-huber.embl.de/users/anders/HTSeq
[11] Trapnell, Cole, et al. "Transcript assembly and quantification by RNA-Seq reveals unannotated
transcripts and isoform switching during cell differentiation."
Nature biotechnology 28.5 (2010): 511-515.
[12] DeLuca, David S., et al. "RNA-SeQC: RNA-seq metrics for quality control and process
optimization." Bioinformatics 28.11 (2012): 1530-1532.
[13] Wang, Liguo, Shengqin Wang, and Wei Li. "RSeQC: quality control of RNA-seq experiments.
" Bioinformatics 28.16 (2012): 2184-2185.
[14] Quinlan, Aaron R., and Ira M. Hall. "BEDTools: a flexible suite of utilities for comparing
genomic features."
Bioinformatics 26.6 (2010): 841-842.
[15] Johnson, Michelle D., et al. "Single Nucleotide Analysis of Cytosine Methylation by Whole‐Genome
Shotgun Bisulfite Sequencing."
Current Protocols in Molecular Biology (2012): 21-23.
[16] http://code.google.com/p/nxtgen-utils/downloads/list
[17] McKenna, Aaron, et al. "The Genome Analysis Toolkit: a MapReduce framework for analyzing
next-generation DNA sequencing data." Genome research 20.9 (2010): 1297-1303.