  • Too many forward only surviving unpaired reads after trimmomatic (RNA-Seq data)

    This is a different issue than: http://seqanswers.com/forums/showthread.php?t=47604
    FastQC shows all bases are > Q30 on both forward and reverse fastq files. (See attached)

    I have some older 50 bp PE human data from a collaborator. They used CloneTech's SMARTer primers for polyA enrichment, and the data contains a lot of contaminating sequences: Illumina adapters, CloneTech oligo-dT primers, and polyA/polyT reads (all 50 bases Ts or As). After using Trimmomatic to remove all four types of contaminants, only half of my read pairs remain, while ~40% of reads are now unpaired.

    Note: no quality trimming was done. However, prior to trimming, I removed rRNA sequences using SortMeRNA (~2-3% of reads removed).

    Here are the results:
    TrimmomaticPE: Started with arguments: -threads 12 -phred33 <2 inputs> <4 outputs>
    ILLUMINACLIP:/usr/local/bioinf/trimmomatic/adapters/custom_a.fa:2:30:10 MINLEN:36

    Input Read Pairs: 55,723,516
    Both Surviving: 29,553,189 (53.04%)
    Forward Only Surviving: 19,691,236 (35.34%)
    Reverse Only Surviving: 2,350,462 (4.22%)
    Dropped: 4,128,629 (7.41%)

    It seems strange that the reverse-only surviving reads are so few compared to the forward-only surviving reads... If I remove only the CloneTech primers and adapters, leaving the polyA and polyT sequences out of the custom.fa specified for ILLUMINACLIP:

    Input Read Pairs: 55,723,516
    Both Surviving: 29,583,888 (53.09%)
    Forward Only Surviving: 22,506,092 (40.39%)
    Reverse Only Surviving: 3,170,437 (5.69%)
    Dropped: 463,099 (0.83%)

    Fewer reads are dropped and more reads are piled into the forward-only surviving reads, but the paired surviving count remains almost the same.

    What could lead to so many unpaired reads, and mostly on the forward read? I'd hate to toss out 35-40% of my data if I can help it. If I have to, is there any way I could incorporate these unpaired reads into my downstream analysis? Or is there something seriously wrong with my pipeline? Thanks in advance.
    Attached Files

  • #2
    I would say there's a problem with your methodology. You should expect adapters at the same position in read 1 as in read 2... adapter-trimming and contaminant filtering are different operations, and should be done in different steps.

    First do adapter trimming with adapter sequence, trimming to the same position in each read (you can do this with BBDuk using the "tpe" flag). Then remove all reads that contain contaminants, rather than trimming them... you can do that with BBDuk as well. I don't see much point in removing reads that are 100% poly-A or poly-T, though, as they won't map anyway.
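    The two-step workflow described above can be sketched as shell commands. This is a minimal sketch with hypothetical file names (r1.fq, adapters.fa, contaminants.fa, etc. are placeholders, not files from this thread); the flags are standard BBDuk options:

    ```shell
    # Step 1 (sketch): adapter-trim, cutting both reads of a pair to the same position.
    # "tpe" trims pairs to equal lengths; "tbo" also trims by pair overlap.
    bbduk.sh in1=r1.fq in2=r2.fq out1=trim_r1.fq out2=trim_r2.fq \
        ref=adapters.fa ktrim=r k=28 mink=12 hdist=1 tpe tbo minlen=36

    # Step 2 (sketch): filter out (rather than trim) reads matching contaminants;
    # matching reads are diverted to the outm/outm2 streams.
    bbduk.sh in1=trim_r1.fq in2=trim_r2.fq out1=clean_r1.fq out2=clean_r2.fq \
        outm=contam_r1.fq outm2=contam_r2.fq ref=contaminants.fa k=25 hdist=1
    ```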



    • #3
      Originally posted by Brian Bushnell View Post
      I would say there's a problem with your methodology. You should expect adapters at the same position in read 1 as in read 2... adapter-trimming and contaminant filtering are different operations, and should be done in different steps.

      First do adapter trimming with adapter sequence, trimming to the same position in each read (you can do this with BBDuk using the "tpe" flag). Then remove all reads that contain contaminants, rather than trimming them... you can do that with BBDuk as well. I don't see much point in removing reads that are 100% poly-A or poly-T, though, as they won't map anyway.
      Thanks for the suggestions. I'll give it a shot and report back.



      • #4
        Hi Brian, I tried your suggestions. Here are the results in case someone else has a similar issue. First, I separated the adapter-trimming step from the contaminant-removal step. Just as a comparison, I wanted to try BBDuk vs. Trimmomatic for the adapter-trimming step. I cleaned up the results a bit to make them easier to read.

        Trimmomatic adapter trimming:

        time java -Xmx27g -jar $TRIMMOMATICDIR/trimmomatic-0.32.jar PE -threads 12 -phred33 $DATADIR/test_R1.fastq $DATADIR/test_R2.fastq $DATADIR/test_trim/test_trimmomatic_R1.fastq $DATADIR/test_trim/test_trimmomatic_unpaired_R1.fastq $DATADIR/test_trim/test_trimmomatic_R2.fastq $DATADIR/test_trim/test_trimmomatic_unpaired_R2.fastq ILLUMINACLIP:"$TRIMMOMATICDIR/adapters/TruSeq3-PE-2.fa":2:30:10 MINLEN:36

        Input Read Pairs: 55723516
        Both Surviving: 55567661 (99.72%)
        Forward Only Surviving: 2203 (0.00%)
        Reverse Only Surviving: 130983 (0.24%)
        Dropped: 22669 (0.04%)

        real 4m42.470s
        user 16m35.382s
        sys 1m10.948s
        BBDuk Adapter Trimming:

        time $BBMAPDIR/bbduk.sh -Xmx27g in1=$DATADIR/test_R1.fastq in2=$DATADIR/test_R2.fastq \
        out1=$DATADIR/test_trim/test_bbduk_R1.fastq out2=$DATADIR/test_trim/test_bbduk_R2.fastq \
        ref=$BBMAPDIR/resources/truseq.fa.gz ktrim=r k=28 mink=12 hdist=1 tpe=t tbo=t minlength=36

        Input: 111447032 reads
        KTrimmed: 315462 reads (0.28%)
        Trimmed by overlap: 0 reads (0.00%)
        Result: 111135074 reads (99.72%) / 2 = 55567537

        Time: 359.990 seconds.
        Reads Processed: 111m 309.58k reads/sec
        Bases Processed: 5572m 15.48m bases/sec

        real 6m1.432s
        user 19m13.116s
        sys 1m14.501s
        Both seem to have worked great. BBDuk took slightly longer but ended up with a very similar number of reads remaining. Adapters weren't a large proportion of the overrepresented sequences (per FastQC) to begin with.

        Removing the clonetech primer contaminants

        Since Trimmomatic doesn't have an option for removing contaminating sequences, I just used BBDuk for this one.

        clonetech.fa:
        >SMARTerIIAOligonucleotide
        AAGCAGTGGTATCAACGCAGAGTACNNNNN
        >3SMARTCDSPrimerIIA
        AAGCAGTGGTATCAACGCAGAGTACNNNNN
        >3SMARTCDSPrimerIIA_consensus
        AAGCAGTGGTATCAACGCAGAGTAC
        >SmartISPCRPrimer
        AAGCAGTGGTATCAACGCAGAGT
        time $BBMAPDIR/bbduk.sh -Xmx27g in1=$DATADIR/test_trim/test_bbduk_R1.fastq in2=$DATADIR/test_trim/test_bbduk_R2.fastq out1=$DATADIR/test_rm_contam/test_bbduk_unmatched_R1.fastq out2=$DATADIR/test_rm_contam/test_bbduk_unmatched_R2.fastq outm1=$DATADIR/test_rm_contam/test_bbduk_matched_R1.fastq outm2=$DATADIR/test_rm_contam/test_bbduk_matched_R2.fastq ref=$DATADIR/test_rm_contam/clonetech.fa k=25 hdist=1 stats=$SEBADIR/test_rm_contam/rm_contam_stats.txt

        Input: 111135074 reads
        Contaminants: 50448064 reads (45.39%)
        Result: 60687010 reads (54.61%)

        Time: 350.589 seconds.
        Reads Processed: 111m 317.00k reads/sec
        Bases Processed: 5556m 15.85m bases/sec

        Statsfile:
        #Total 111135074
        #Matched 25422697 22.87549%
        #Name Reads ReadsPct
        SMARTerIIAOligonucleotide 25422697 22.87549%

        real 5m52.338s
        user 11m23.119s
        sys 1m44.755s
        I am still losing ~half my reads, but my matched forward & reverse read files are equal in size now (rather than forward-heavy), probably due to using the tpe flag. Another interesting point is that, in the stats file, only half of the removed reads actually matched the contaminants: 22.87% matched vs. 45.39% removed.

        Since these samples are enriched with oligo-dT primers, do you think that maybe only one read of the pair would contain the contaminating primers? In the FastQC report, I found overrepresented sequences that look like this:
        'AAGCAGTGGTATCAACGCAGAGTACTTTTTTTTTTTTTTTTTTTTTTTTT'
        Say these primers only attach to the polyA side of the insert; do you think the read's mate, covering the opposite end of the insert, might contain usable data?



        • #5
          @BrianU: You should be able to provide the clonetech.fa file to Trimmomatic as well. The only difference is that you will also need to include the reverse complements of the sequences in your file (see the relevant section of the Trimmomatic manual).

          Contaminant separation is not supported by Trimmomatic, but it should still be able to trim the CloneTech sequences. Isn't that correct?
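          For what it's worth, the reverse complements can be generated on the command line; a quick sketch using rev and tr on one of the primer sequences from the clonetech.fa above:

          ```shell
          # Reverse-complement a primer: reverse the string, then complement each base.
          printf 'AAGCAGTGGTATCAACGCAGAGT\n' | rev | tr 'ACGT' 'TGCA'
          # prints: ACTCTGCGTTGATACCACTGCTT
          ```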
          Last edited by GenoMax; 05-07-2015, 10:39 AM.



          • #6
            This is quite interesting; I've never seen a situation where BBDuk was slower than Trimmomatic. I imagine it's only using 1 thread or something... you can force it to use 12 threads with the "t=12" flag, but normally it autodetects the number of cores available.

            As for the 22.87% vs. 45.39%: that's because pairs are always kept together, and if one read is flagged as a contaminant, its mate is thrown away as well (the "tpe" flag is ignored in filtering; it's only used in trimming). This is controlled by the flag "removeifeitherbad" (rieb). You can set "rieb=f", and then a pair will only be removed if both reads are contaminants.

            Now that you have the contaminant pairs in the outm stream, you can process them individually (rather than as pairs) to recover just the non-contaminant singletons. But if these contaminants are in fact primers that should theoretically be attached to real genomic sequence, then you should probably be trimming them rather than removing them. BBDuk supports left-trimming and right-trimming, so if you know which part of the read should be genomic when you detect primer at some location, you should be able to salvage it. Can you explain (perhaps with a nice ASCII diagram) exactly what the structure of the molecule/read is when these primer sequences are present?
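            Both options can be sketched as commands (file names are placeholders; rieb and outm are the BBDuk flags discussed above):

            ```shell
            # Option 1 (sketch): only discard a pair if BOTH reads match a contaminant.
            bbduk.sh in1=r1.fq in2=r2.fq out1=clean_r1.fq out2=clean_r2.fq \
                ref=clonetech.fa k=25 hdist=1 rieb=f

            # Option 2 (sketch): re-run the matched (outm) files one at a time, so the
            # non-contaminant mate of each pair is recovered as a singleton.
            bbduk.sh in=matched_r1.fq out=singleton_r1.fq ref=clonetech.fa k=25 hdist=1
            bbduk.sh in=matched_r2.fq out=singleton_r2.fq ref=clonetech.fa k=25 hdist=1
            ```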



            • #7
              I'll give the other suggestions a try, but here is my attempt to explain the enrichment step; maybe it will help. From what I understand, the CloneTech SMARTer II kit is supposed to let you get more complete transcripts (an increased chance of capturing the 5' end). This is how I interpreted the manual:

              Key:
              '...' - just spacers
              Ns - any bases in the transcript
              Xs - the overhang bases (I don't know if they are polyAs too)

              First step:
              5'- ................NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNAAAAAAAAAAA......... -3' Starting mRNA (w/ polyA tail)
              3'- ..............................................TTTTTTTTTTT[PRIMER1] -5' Prime with SMART CDS Primer containing oligo dT

              Second step:
              The kit contains a special reverse transcriptase that tacks on a few extra bases (an overhang) after it's done.
              5'- ................NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNAAAAAAAAAAA......... -3'
              3'- .........XXXXXXXNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTTTTTTTTTTT[PRIMER1] -5' Create 1st Strand + overhang

              Third step:
              5'- [PRIMER2]XXXXXXX.................................................. -3' Prime SMARTer II A oligo using overhang
              3'- .........XXXXXXXNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTTTTTTTTTTT[PRIMER1] -5'

              Fourth step:
              5'- [PRIMER2]XXXXXXXNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNAAAAAAAAAAA[PRIMER1] -3' Complete 2nd Strand
              3'- .........XXXXXXXNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTTTTTTTTTTT[PRIMER1] -5'

              Final step:
              5'- [PRIMER2]XXXXXXXNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNAAAAAAAAAAA[PRIMER1] -3'
              3'- [PRIMER2]XXXXXXXNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTTTTTTTTTTT[PRIMER1] -5' Complete 3rd strand
              This final cDNA product is then sheered into smaller pieces and used in subsequent Illumina library prep steps. So I am not sure where does the overrepresented sequence of primer + polyT comes from?

              'AAGCAGTGGTATCAACGCAGAGTACTTTTTTTTTTTTTTTTTTTTTTTTT'
              Unless the 5' overhang (the Xs) is also polyA. Of all the CloneTech primer sequences provided, only SMARTerIIAOligonucleotide was matched (shown in the stats.txt) by BBDuk. However, the other primer sequences are very similar. I wonder if it matched mainly because it was first in the list...

              Edit: Oops, it makes sense now once you orient the primer properly. The overrepresented sequence is from the 3rd strand illustrated above.

              This: 5' -AAGCAGTGGTATCAACGCAGAGTACTTTTTTTTTTTTTTTTTTTTTTTTT- 3'
              Is actually the same as this: 3' - TTTTTTTTTTT[PRIMER1] -5'
              Last edited by BrianU; 05-08-2015, 09:31 AM. Reason: clarity



              • #8
                BBDuk will match the first sequence in the list if two sequences contain identical kmers. A related program, Seal, will match both of them instead. Seal does not do trimming, but it does do filtering and can create a stats file.

                My impression from your diagram is that when one read in a pair contains primer plus poly-A/T, the other read should probably still be useful. Furthermore, the remainder of the read containing the primer should be useful too, and you should trim it with "ktrim=l" to do a left trim, if I'm reading it correctly (but do NOT include "tpe" or "tbo"; those are just for the adapter-trimming step).



                • #9
                  Originally posted by Brian Bushnell View Post
                  My impression from your diagram is that when one read in a pair contains primer plus poly-A/T, the other read should probably still be useful. Furthermore, the remainder of the read containing the primer should be useful too, and you should trim it with "ktrim=l" to do a left trim, if I'm reading it correctly (but do NOT include "tpe" or "tbo"; those are just for the adapter-trimming step).
                  I think I may have come full circle. In my first post, I used Trimmomatic to trim both the Illumina adapters and the CloneTech primers in one step (granted, by combining both steps, I couldn't diagnose the true problem). Once I trim the CloneTech primers, and if it turns out the singleton reads are usable, what would I do with the unpaired reads? How would I incorporate them into the analysis?



                  • #10
                    If you right-trim adapter sequences, then left-trim clonetech primers in a second step, you should be left with surviving pairs, not singletons (depending on read length). Either way, mixing singletons and pairs for analysis should not really be a problem for most tools, I would think. But it depends on what analysis you are doing.
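                    One common way to mix them, sketched under the assumption that the aligner accepts unpaired input (TopHat2 does) and that samtools is available (all file names here are placeholders): align the pairs and the singletons in separate runs, then merge the resulting BAMs.

                    ```shell
                    # Align pairs and singletons separately, then merge the BAMs (sketch).
                    tophat2 -o pairs_out genome_index paired_R1.fastq paired_R2.fastq
                    tophat2 -o singles_out genome_index unpaired.fastq
                    samtools merge merged.bam pairs_out/accepted_hits.bam singles_out/accepted_hits.bam
                    ```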



                    • #11
                      That makes sense. Thanks, Brian. I'll give it a shot.



                      • #12
                        Brian, in case you were curious: here is the Illumina adapter-trimming speed of BBDuk from above, after forcing it to use 12 threads. Full output included for diagnostic purposes.

                        Original run:
                        time $BBMAPDIR/bbduk.sh -Xmx27g in1=$DATADIR/test_R1.fastq in2=$DATADIR/test_R2.fastq \
                        out1=$DATADIR/test_trim/test_bbduk_R1.fastq out2=$DATADIR/test_trim/test_bbduk_R2.fastq \
                        ref=$BBMAPDIR/resources/truseq.fa.gz ktrim=r k=28 mink=12 hdist=1 tpe=t tbo=t minlength=36

                        BBDuk version 34.92
                        maskMiddle was disabled because useShortKmers=true
                        Initial:
                        Memory: free=27348m, used=435m

                        Added 64770 kmers; time: 0.200 seconds.
                        Memory: free=26333m, used=1450m

                        Input is being processed as paired
                        Started output streams: 0.027 seconds.
                        Processing time: 359.725 seconds.

                        Input: 111447032 reads 5572351600 bases.
                        KTrimmed: 315462 reads (0.28%) 15640564 bases (0.28%)
                        Trimmed by overlap: 0 reads (0.00%) 0 bases (0.00%)
                        Result: 111135074 reads (99.72%) 5556711036 bases (99.72%)

                        Time: 359.990 seconds.
                        Reads Processed: 111m 309.58k reads/sec
                        Bases Processed: 5572m 15.48m bases/sec

                        real 6m1.432s
                        user 19m13.116s
                        sys 1m14.501s
                        Setting t = 12:

                        time $BBMAPDIR/bbduk.sh -Xmx27g t=12 in1=$DATADIR/test_R1.fastq in2=$DATADIR/test_R2.fastq \
                        out1=$DATADIR/test_trim/test_bbduk_R1.fastq out2=$DATADIR/test_trim/test_bbduk_R2.fastq \
                        ref=$BBMAPDIR/resources/truseq.fa.gz ktrim=r k=28 mink=12 hdist=1 tpe=t tbo=t minlength=36


                        BBDuk version 34.92
                        Set threads to 12
                        maskMiddle was disabled because useShortKmers=true
                        Initial:
                        Memory: free=27348m, used=435m

                        Added 64770 kmers; time: 0.575 seconds.
                        Memory: free=26478m, used=1305m

                        Input is being processed as paired
                        Started output streams: 0.197 seconds.
                        Processing time: 310.076 seconds.

                        Input: 111447032 reads 5572351600 bases.
                        KTrimmed: 315462 reads (0.28%) 15640564 bases (0.28%)
                        Trimmed by overlap: 0 reads (0.00%) 0 bases (0.00%)
                        Result: 111135074 reads (99.72%) 5556711036 bases (99.72%)

                        Time: 310.945 seconds.
                        Reads Processed: 111m 358.41k reads/sec
                        Bases Processed: 5572m 17.92m bases/sec

                        real 5m12.763s
                        user 16m1.380s
                        sys 1m9.628s



                        • #13
                          Thanks; looks like it was probably already using (approximately) 12 threads and the time difference was just due to system variability. On my PC, with 4 threads it normally runs at over 60 megabases/s with that kind of command. I suspect that you may be I/O limited.



                          • #14
                            Originally posted by Brian Bushnell View Post
                            I suspect that you may be I/O limited
                            Ah... good point. I have limited local disk space, so all my data is processed on a large mounted NFS drive.



                            • #15
                              Here is my data on trimming the CloneTech primers (rather than treating them as contaminants). I varied the minimum length filter across multiple runs to see if I would get different results. The CloneTech primers are ~25 nucleotides long and the max read length is 50, so I chose to try 25 nt, 30 nt, and 36 nt. I'd like to keep the longer reads if I can, since they will map more uniquely. I made a summary table of the results; SeqAnswers doesn't have BBCode tables enabled, so I just took a screenshot of my Excel sheet (see attached). I also tried two of the three lengths in Trimmomatic for comparison. Here are two examples of the parameters:

                              BBDuk left trim
                              time $BBMAPDIR/bbduk.sh -Xmx27g t=12 \
                              in1=$DATADIR/test_trim/test_bbduk_R1.fastq \
                              in2=$DATADIR/test_trim/test_bbduk_R2.fastq \
                              out1=$DATADIR/ct_test_trim/test_bbduk_ct36_R1.fastq \
                              out2=$DATADIR/ct_test_trim/test_bbduk_ct36_R2.fastq \
                              ref=clonetech.fa ktrim=l k=24 mink=12 hdist=1 minlength=36

                              BBDuk version 34.92
                              Set threads to 12
                              maskMiddle was disabled because useShortKmers=true
                              Initial:
                              Memory: free=27348m, used=435m

                              Added 1430 kmers; time: 0.379 seconds.
                              Memory: free=26333m, used=1450m

                              Input is being processed as paired
                              Started output streams: 0.207 seconds.
                              Processing time: 239.172 seconds.

                              Input: 111135074 reads 5556711036 bases.
                              KTrimmed: 51351140 reads (46.21%) 2566637949 bases (46.19%)
                              Result: 59808372 reads (53.82%) 2990073087 bases (53.81%)

                              Time: 239.844 seconds.
                              Reads Processed: 111m 463.36k reads/sec
                              Bases Processed: 5556m 23.17m bases/sec

                              real 4m1.715s
                              user 12m41.120s
                              sys 1m23.517s
                              Trimmomatic

                              time java -jar $TRIMMOMATICDIR/trimmomatic-0.32.jar PE -threads 12 -phred33 \
                              $DATADIR/test_trim/test_trimmomatic_R1.fastq \
                              $DATADIR/test_trim/test_trimmomatic_R2.fastq \
                              $DATADIR/ct_test_trim/test_trimmomatic_ct36_R1.fastq \
                              $DATADIR/ct_test_trim/test_trimmomatic_ct36_unpaired_R1.fastq \
                              $DATADIR/ct_test_trim/test_trimmomatic_ct36_R2.fastq \
                              $DATADIR/ct_test_trim/test_trimmomatic_ct36_unpaired_R2.fastq \
                              ILLUMINACLIP:"clonetech.fa":2:30:10 MINLEN:36

                              Using Medium Clipping Sequence: 'AAGCAGTGGTATCAACGCAGAGT'
                              Using Long Clipping Sequence: 'AAGCAGTGGTATCAACGCAGAGTACNNNNN'
                              Using Long Clipping Sequence: 'AAGCAGTGGTATCAACGCAGAGTAC'
                              Skipping duplicate Clipping Sequence: 'AAGCAGTGGTATCAACGCAGAGTACNNNNN'
                              ILLUMINACLIP: Using 0 prefix pairs, 3 forward/reverse sequences,
                              0 forward only sequences, 0 reverse only sequences

                              Input Read Pairs: 55,567,661
                              Both Surviving: 30,045,460 (54.07%)
                              Forward Only Surviving: 22,460,833 (40.42%)
                              Reverse Only Surviving: 2,850,609 (5.13%)
                              Dropped: 210,759 (0.38%)
                              TrimmomaticPE: Completed successfully

                              real 3m55.694s
                              user 10m21.251s
                              sys 0m54.051s
                              Some observations:

                              As I expected, I get many more surviving reads after trimming with BBDuk at the more relaxed read-length cutoff of 25 nt, likely due to the CloneTech primer size (~25 nt). Although, checking FastQC, most of those extra surviving trimmed reads were polyTs. At minlen=36 and 30, the out1 (forward) and out2 (reverse) files are small (I lost ~half the reads) but equal in size. However, at minlen=25 both files are close to double in size, and the forward read file is bigger than the reverse.

                              Varying the minimum length between 25 and 36 didn't make a difference in Trimmomatic, but I suspect I may have run it wrong...? The file sizes turned out to be the same as the out1/out2 files from BBDuk when the minlens were stricter, at 30 and 36 nt. But the unpaired reads are both kept by Trimmomatic (the forward unpaired file is huge compared to the reverse).

                              Is a 25nt read long enough for decent mapping?

                              I aligned 3 pairs of files with TopHat2 to see the differences. I'll post that next. Thanks for your help, guys.
                              Attached Files
