I am working with a fairly dense dataset of 12 samples, each containing 45 million paired-end reads.
I trimmed and quality-filtered these reads using Trimmomatic, which gave me two paired-end files and two files of orphan reads (reads that lost their mate).
Now I'm interested in looking for novel genes and splice sites in canola (a plant and non-model organism with a large, repetitive, poorly characterized genome).
When I align my paired-end reads and my orphans, the alignment summary looks like this:
Left reads:
Input: 42041591
Mapped: 38698515 (92.0% of input)
of these: 20610206 (53.3%) have multiple alignments (9400 have >20)
Right reads:
Input: 37408393
Mapped: 34499559 (92.2% of input)
of these: 18350802 (53.2%) have multiple alignments (6973 have >20)
92.1% overall read alignment rate.
Aligned pairs: 33238603
of these: 12676525 (38.1%) have multiple alignments
and: 817033 ( 2.5%) are discordant alignments
86.7% concordant pair alignment rate.
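As a sanity check on the summary above, the percentages can be recomputed from the raw counts. This is a minimal sketch, and the denominators are my own guesses from what reproduces the printed figures, not something I have confirmed from the aligner's documentation. In particular, using the right-read input as the number of input pairs is an assumption.

```python
# Counts copied from the alignment summary above.
left_input, left_mapped = 42041591, 38698515
right_input, right_mapped = 37408393, 34499559
aligned_pairs = 33238603
multi_pairs = 12676525
discordant_pairs = 817033


def pct(part, whole):
    """Percentage of `whole` made up by `part`, rounded to one decimal."""
    return round(100.0 * part / whole, 1)


# Per-read rate: mapped reads over all input reads (left + right).
overall_rate = pct(left_mapped + right_mapped, left_input + right_input)

# These two are fractions of the ALIGNED pairs.
multi_rate = pct(multi_pairs, aligned_pairs)
discordant_rate = pct(discordant_pairs, aligned_pairs)

# ASSUMPTION: the concordant rate is a fraction of the INPUT pairs, and the
# right-read input is taken as the number of input pairs, because that
# denominator reproduces the printed 86.7%.
concordant_pairs = aligned_pairs - discordant_pairs
concordant_rate = pct(concordant_pairs, right_input)

# Under the same assumption: input pairs that did not align as a pair
# (one mate or neither mate mapped).
not_aligned_as_pair = right_input - aligned_pairs

print("overall read alignment rate:", overall_rate)
print("pairs with multiple alignments:", multi_rate)
print("discordant (of aligned pairs):", discordant_rate)
print("concordant (of input pairs):", concordant_rate)
print("not aligned as a pair (of input pairs):", pct(not_aligned_as_pair, right_input))
```

If that guess is right, the 2.5% is a fraction of aligned pairs while the 86.7% is a fraction of input pairs, so the two would not be expected to sum to 100%. I would still like confirmation from someone who knows how the summary is computed.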
Now I have several questions:
First, is it recommended to use only concordantly aligned pairs for downstream analyses?
Second, should I use only uniquely mapping reads? The paper I am modelling this after uses only uniquely mapping reads, but that seems counterintuitive: this is a very repetitive genome, and even when a read has multiple alignments, the aligner still chooses the best one, right? I was also thinking of using SAMtools to remove PCR duplicates or something similar (it looks like there may be a few highly repetitive elements; duplicates? rRNA?).
Third, why do the concordant pair rate (86.7%) and the discordant alignments (2.5%) not add up to 100%? What else is there?
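To make the first two questions concrete, this is the kind of filtering I have in mind, sketched in plain Python on SAM text lines. It assumes the aligner writes the standard `NH:i:` tag (number of reported alignments for the read) and sets the "properly paired" FLAG bit (0x2) for concordant pairs. The function names `keep_read` and `filter_sam` are just mine for illustration, not part of any tool.

```python
def parse_flag_and_nh(sam_line):
    """Return (FLAG, NH) for one SAM alignment line; NH is None if absent."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])
    nh = None
    # Optional tags start at column 12 (index 11) in the SAM format.
    for tag in fields[11:]:
        if tag.startswith("NH:i:"):
            nh = int(tag[len("NH:i:"):])
    return flag, nh


def keep_read(sam_line, require_proper_pair=True, require_unique=True):
    """Decide whether to keep one alignment record."""
    flag, nh = parse_flag_and_nh(sam_line)
    if flag & 0x4:  # read is unmapped
        return False
    if require_proper_pair and not (flag & 0x2):  # not a concordant pair
        return False
    # ASSUMPTION: a read with no NH tag is treated as not provably unique
    # and is dropped when require_unique is set.
    if require_unique and nh != 1:
        return False
    return True


def filter_sam(lines, **options):
    """Yield header lines unchanged and only the alignments that pass."""
    for line in lines:
        if line.startswith("@") or keep_read(line, **options):
            yield line
```

Calling `filter_sam(lines, require_unique=False)` would keep multi-mapped reads while still requiring concordant pairs, which is the combination I am unsure about for a genome this repetitive.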