Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Properly paired mates

    Hi guys!

    I have a weird problem to solve.
    I have recently run an RNAseq pipeline on human cell line (polyA enriched, 75bp, PE, ~30 mil reads each).

    12 samples out of 16 showed great alignment rates and library QC.
    The remaining 4 instead have a ridiculously low properly mate pairing, like this one:

    64544673 in total
    0 QC failure
    0 duplicates
    64544673 mapped (100.00%)
    64544673 paired in sequencing
    33238851 read1
    31305822 read2
    14676 properly paired (0.02%)
    57821112 with itself and mate mapped
    6723561 singletons (10.42%)
    54777510 with mate mapped to a different chr
    44604000 with mate mapped to a different chr (mapQ>=5)
    Sample_TK1_002_flagstat.txt (END)

    Initially I thought it was a trivial issue, and I have proceeded with differential expression analysis. Unfortunately I have found out that these 4 samples were clustering together although belonging to 3 different cohorts.
    Then I re-checked my work flow and realized that the Genome Center that performed the library prep and sequencing had to fund 2/3 sequencing for these samples, as they were providing a low yield (of reads).
    Other thing to add, when trying to run RNASeQC on Gene Pattern, I cannot Mark Duplicates for these samples too (error mess:

    net.sf.picard.PicardException: Value was put into PairInfoMap more than once. 1: Flowcell 1 Older_P3:BFC08P1:257:C38GUACXX:8:1108:18368:35984
    at net.sf.picard.sam.CoordinateSortedPairInfoMap.ensureSequenceLoaded(CoordinateSortedPairInfoMap.java:124)
    at net.sf.picard.sam.CoordinateSortedPairInfoMap.remove(CoordinateSortedPairInfoMap.java:78)
    at net.sf.picard.sam.DiskReadEndsMap.remove(DiskReadEndsMap.java:61)
    at net.sf.picard.sam.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:294)
    at net.sf.picard.sam.MarkDuplicates.doWork(MarkDuplicates.java:117)
    at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:169)
    at org.genepattern.sam.MarkDuplicatesWrapper.main(MarkDuplicatesWrapper.java:83)


    I am sorry if all this sounds confusing... I am quite desperate in understanding what's going on.
    I know that there is some samtools available for fixing mate pairing...would it be the solution?

    Thanks to all of you!

    Manu

  • #2
    Then I re-checked my work flow and realized that the Genome Center that performed the library prep and sequencing had to fund 2/3 sequencing for these samples, as they were providing a low yield (of reads).

    Sorry there's a typo: the Genome center re-performed the sequencing a couple of times on my samples due to low reads yield!

    Comment


    • #3
      Looks like you are mapping to a highly broken assembly, since most of the reads map to different scaffolds. Or else the read1 and read2 files don't go together. What did you map to, normal HG19 or something else? What mapper did you use? And are you sure the read1 and read2 files go together? Look at the read names and make sure they match.

      Comment


      • #4
        Originally posted by Brian Bushnell View Post
        Looks like you are mapping to a highly broken assembly, since most of the reads map to different scaffolds. Or else the read1 and read2 files don't go together. What did you map to, normal HG19 or something else? What mapper did you use? And are you sure the read1 and read2 files go together? Look at the read names and make sure they match.

        Hi Brian!
        thanks for the reply. I have mapped my samples with TopHat2, using normal hg19. I guess that it is not a broken hg19(got it from iGenomes), as it worked just fine for the others.
        I have checked the read identifiers and they seem to agree...although I have just briefly looked at them, I haven't written any string to check if they were matching. Any suggestions how to?

        Thank you!!!
        Manu

        Comment


        • #5
          So, the weird thing is that HG19 only has 25 normal scaffolds and a handful of small extra scaffolds. You can't have 99.98% of pairs landing on different scaffolds by chance, even if you had mismatched files for read1 and read2. The only way I can see it happening is if, for example, every read1 was good, but every read2 mapped to the same location on a tiny scaffold (like mito).

          I suggest you pileup and see if some tiny scaffold has immense coverage (or you can just visually look at the unsorted sam file and see where things mapped, to see if there's a pattern). Also, look at the read quality profiles and per-base ACGTN counts, and see if they look different than other libraries.

          If you want to examine read names, just use "head" and "tail" on each of the fastq files, too see that the names are identical (except for e.g. " /1" and " /2"). Also use wc to ensure both files have the same number of lines.
          Last edited by Brian Bushnell; 03-14-2014, 01:16 PM.

          Comment


          • #6
            Originally posted by Brian Bushnell View Post
            So, the weird thing is that HG19 only has 25 normal scaffolds and a handful of small extra scaffolds. You can't have 99.98% of pairs landing on different scaffolds by chance, even if you had mismatched files for read1 and read2. The only way I can see it happening is if, for example, every read1 was good, but every read2 mapped to the same location on a tiny scaffold (like mito).

            I suggest you pileup and see if some tiny scaffold has immense coverage (or you can just visually look at the unsorted sam file and see where things mapped, to see if there's a pattern). Also, look at the read quality profiles and per-base ACGTN counts, and see if they look different than other libraries.

            If you want to examine read names, just use "head" and "tail" on each of the fastq files, too see that the names are identical (except for e.g. " /1" and " /2"). Also use wc to ensure both files have the same number of lines.
            Hi Brian!

            I've tried to do what you suggested. I am quite a newbie for this kind of things, so I have tried to visualize the sorted.bam file on IGV. The reads look weird to me...looking into the IGV manual I see that different colors suggest different chromosome mapping...is this correct?
            I have tried to do the same for one of the other samples that I have aligned the same day and that gave a optimal rate of properly paired mates...and in fact they look normal compared to this funny one.
            I have a doubt, but I am not able to see if it makes sense: this weird sample (as well as the other three that gave the same problem) was sequenced twice due to poor yield...could it be that either the reads header was not paired properly? or do you have other suggestions?

            Thanks!!!

            Manu
            Attached Files

            Comment


            • #7
              Originally posted by emolinari View Post
              Hi Brian!

              I've tried to do what you suggested. I am quite a newbie for this kind of things, so I have tried to visualize the sorted.bam file on IGV. The reads look weird to me...looking into the IGV manual I see that different colors suggest different chromosome mapping...is this correct?
              Colors mean different things in different modes with IGV. But those colors on your picture indicate SNPs - the reference does not match the reads, like, anywhere. Could you be using a slightly different reference for mapping and for IGV?

              I have a doubt, but I am not able to see if it makes sense: this weird sample (as well as the other three that gave the same problem) was sequenced twice due to poor yield...could it be that either the reads header was not paired properly? or do you have other suggestions?

              Thanks!!!

              Manu
              Possible. I suggest that you DO write something to verify that the read names all match. But I think your problem likely has to do with the reference. I would try re-indexing HG19, re-mapping, and re-loading everything (including the reference) into IGV.

              Comment


              • #8
                Originally posted by Brian Bushnell View Post

                Possible. I suggest that you DO write something to verify that the read names all match. But I think your problem likely has to do with the reference. I would try re-indexing HG19, re-mapping, and re-loading everything (including the reference) into IGV.
                Thanks Brian,
                I am uploading the reference genome once again, so I will run TopHat+Bowtie and see if they make sense now...

                BTW, checking prep_read.logs I have found out this:

                /panfs/home/bioinfo/software/tophat/tophat-2.0.6.Linux_x86_64/prep_reads: /lib64/libz.so.1: no version information available (required by /panfs/home/bioinfo/software/tophat/tophat-2.0.6.Linux_x86_64/prep_reads)
                prep_reads v2.0.6 (3670M)
                ---------------------------
                WARNING: read pairing issues detected (check prep_reads.log) !
                Pair #1 name mismatch: BFC08P1:257:C38GUACXX:8:1111:14326:35343 vs BFC08P1:257:C38GUACXX:8:1101:1229:2109
                101900 out of 31388352 reads have been filtered out
                67141 out of 31388352 read mates have been filtered out

                ...can you explain?

                Thanks!!!
                Manu

                Comment


                • #9
                  Originally posted by emolinari View Post
                  Thanks Brian,
                  I am uploading the reference genome once again, so I will run TopHat+Bowtie and see if they make sense now...

                  BTW, checking prep_read.logs I have found out this:

                  /panfs/home/bioinfo/software/tophat/tophat-2.0.6.Linux_x86_64/prep_reads: /lib64/libz.so.1: no version information available (required by /panfs/home/bioinfo/software/tophat/tophat-2.0.6.Linux_x86_64/prep_reads)
                  prep_reads v2.0.6 (3670M)
                  ---------------------------
                  WARNING: read pairing issues detected (check prep_reads.log) !
                  Pair #1 name mismatch: BFC08P1:257:C38GUACXX:8:1111:14326:35343 vs BFC08P1:257:C38GUACXX:8:1101:1229:2109
                  101900 out of 31388352 reads have been filtered out
                  67141 out of 31388352 read mates have been filtered out

                  ...can you explain?

                  Thanks!!!
                  Manu
                  Looks like the reads are not properly paired. Probably at some point QC was done on them in a way that was not sensitive to pairing and wrecked the files. There are programs that will re-pair them, but I would suggest acquiring the original raw data before it got messed up.

                  Comment


                  • #10
                    Originally posted by emolinari View Post
                    Hi guys!

                    I have a weird problem to solve.
                    I have recently run an RNAseq pipeline on human cell line (polyA enriched, 75bp, PE, ~30 mil reads each).

                    12 samples out of 16 showed great alignment rates and library QC.
                    The remaining 4 instead have a ridiculously low properly mate pairing, like this one:
                    What is the length of the paired-end linker in Illumina? 30nt or more? So in theory 22.5nt on each side? How could you map accurately 22nt piece to anything? I just don't get the idea. You just need longer reads. If the linker happens to be located not in the exact middle but shifted to a side, of course, it should be treated as a shotgun read because the sequence across the linker is way too short. That's the good thing in you situation.

                    Mapping 30nt sequence with sequencing quality 90% quality, well, can be done, but not uniquely. In human? Not even 100% identity will give you a single target (actually at least two in diploid).

                    Myself when I do de novo assemblies I request at least 80% overlap and 96% identity between reads. [Stupid, newbler cannot constrain by 80nt overlap. Who cares about 80% length overlap of 20nt short reads? Nobody.] Either way, 80%o with 96%i is still too loose, for mammals and other organisms which went through several genome duplication events you have to anticipate each gene is present in the genome in 4-8, maybe even 16 places for some gene families or promiscuous protein domains. All of them are relatively recent copies which did not accumulate yet enough mutations and all are practically >=98% similar ^H^H^H^H^H^H^Hidentical. Mapping even a whole 75nt read to a genome like human is always asking for at least two hits (mother+father) and within each of them for at least 4 places. The read is so short that the sequencing errors will give you a chance to distinguish authentic differences from sequencing errors. And if the data are badly trimmed of sample tags then don't be surprised one of the two halves does not "map".

                    I know you do reference-based mapping but still, those 75nt chunks could map to many places. If you happen to trim badly your reads offending 10nt could dampen your mapping efficiency. Still, I wonder what is the rationale.

                    Why don't you think this is all sign of badly trimmed data, or too bad overall sequencing quality? Why do you want to squeeze something out from crappy data? If it doesn't map then thrash it. Even worse if the sample had to be sequenced several times, maybe low-input DNA for initial PCR steps. All of them introduce new errors and amplify errors at the best. In my opinion, already the info that the "sample" had to be re-sequenced means there is something wrong with it. I am a biologist so I understand that acquiring some samples is a lot of work and low-gain, in terms of ug/ng of DNA but still, you should have your own limits.

                    [Honestly, I had always personal blocks to merge transcripts spanning 10 exons in 2.0kb mRNA while having 5'- and 3'-end reads from Sanger, overlapping 200-400nt (1 or 2 exons). Those were not from same physical clone so there was no evidence these were for sure from same mRNA. This overlap is not a proof that the overall combination of exons exists in real, at all, no matter the overlap is 400nt long and 2 exons out of 10 are "confirmed". That is just guesswork. We have plenty of predictions but we need real evidence we can trust. Working with NGS data? No, never ever under lengths of 454 Titanium. Those 300nt after trimming are already under my comfortable limits but definitely nothing shorter. Please. ;-) Maybe for amplicon sequencing and SNP calling in some crafted target regions the Illumina can be good, but not in general.]

                    Finally, what ploidy has your cell line?

                    Comment


                    • #11
                      Originally posted by martin2 View Post
                      What is the length of the paired-end linker in Illumina? 30nt or more? So in theory 22.5nt on each side? How could you map accurately 22nt piece to anything? I just don't get the idea. You just need longer reads. If the linker happens to be located not in the exact middle but shifted to a side, of course, it should be treated as a shotgun read because the sequence across the linker is way too short. That's the good thing in you situation.

                      Mapping 30nt sequence with sequencing quality 90% quality, well, can be done, but not uniquely. In human? Not even 100% identity will give you a single target (actually at least two in diploid).

                      Myself when I do de novo assemblies I request at least 80% overlap and 96% identity between reads. [Stupid, newbler cannot constrain by 80nt overlap. Who cares about 80% length overlap of 20nt short reads? Nobody.] Either way, 80%o with 96%i is still too loose, for mammals and other organisms which went through several genome duplication events you have to anticipate each gene is present in the genome in 4-8, maybe even 16 places for some gene families or promiscuous protein domains. All of them are relatively recent copies which did not accumulate yet enough mutations and all are practically >=98% similar ^H^H^H^H^H^H^Hidentical. Mapping even a whole 75nt read to a genome like human is always asking for at least two hits (mother+father) and within each of them for at least 4 places. The read is so short that the sequencing errors will give you a chance to distinguish authentic differences from sequencing errors. And if the data are badly trimmed of sample tags then don't be surprised one of the two halves does not "map".

                      I know you do reference-based mapping but still, those 75nt chunks could map to many places. If you happen to trim badly your reads offending 10nt could dampen your mapping efficiency. Still, I wonder what is the rationale.

                      Why don't you think this is all sign of badly trimmed data, or too bad overall sequencing quality? Why do you want to squeeze something out from crappy data? If it doesn't map then thrash it. Even worse if the sample had to be sequenced several times, maybe low-input DNA for initial PCR steps. All of them introduce new errors and amplify errors at the best. In my opinion, already the info that the "sample" had to be re-sequenced means there is something wrong with it. I am a biologist so I understand that acquiring some samples is a lot of work and low-gain, in terms of ug/ng of DNA but still, you should have your own limits.

                      [Honestly, I had always personal blocks to merge transcripts spanning 10 exons in 2.0kb mRNA while having 5'- and 3'-end reads from Sanger, overlapping 200-400nt (1 or 2 exons). Those were not from same physical clone so there was no evidence these were for sure from same mRNA. This overlap is not a proof that the overall combination of exons exists in real, at all, no matter the overlap is 400nt long and 2 exons out of 10 are "confirmed". That is just guesswork. We have plenty of predictions but we need real evidence we can trust. Working with NGS data? No, never ever under lengths of 454 Titanium. Those 300nt after trimming are already under my comfortable limits but definitely nothing shorter. Please. ;-) Maybe for amplicon sequencing and SNP calling in some crafted target regions the Illumina can be good, but not in general.]

                      Finally, what ploidy has your cell line?

                      My reads are 75bp long after adapter trimming.
                      Considering that are paired end, that it's homo sapiens reference transcriptome, that the literature advices this strategy and most importantly that ALL the other samples sequenced are just fine - only 4 samples failed- I think I can safely assume this is a good sequencing strategy for me.
                      I agree anyhow that the problem was created at the source (either PCR step or during barcoding)... trash data is always "easy" but it means re doing everything from scratch that of course takes lots of resources, such as money and most importantly TIME. So I was trying to save both, unless these data cannot be saved at all, as it starts to be evident

                      BTW my cells are diploid!
                      thanks

                      Manu

                      Comment


                      • #12
                        Originally posted by Brian Bushnell View Post
                        Looks like the reads are not properly paired. Probably at some point QC was done on them in a way that was not sensitive to pairing and wrecked the files. There are programs that will re-pair them, but I would suggest acquiring the original raw data before it got messed up.
                        Hi Brian,
                        I've run Trimmomatic on my samples, with ILLUMINACLIP TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36 and my map pairing now is way better. I am waiting for Cuffdiff results...

                        206974197 in total
                        0 QC failure
                        0 duplicates
                        206974197 mapped (100.00%)
                        206974197 paired in sequencing
                        103366809 read1
                        103607388 read2
                        152531082 properly paired (73.70%)
                        204699398 with itself and mate mapped
                        2274799 singletons (1.10%)
                        26308258 with mate mapped to a different chr
                        18725784 with mate mapped to a different chr (mapQ>=5)

                        thanks for the help!!!
                        Manu

                        Comment


                        • #13
                          Originally posted by emolinari View Post
                          Hi Brian,
                          ...
                          thanks for the help!!!
                          Manu
                          Glad I could assist.

                          By the way... I generally recommend aligning against a genome rather than transcriptome, for better objectivity and repeatability, and less bias toward known genes/isoforms.
                          Last edited by Brian Bushnell; 04-14-2014, 06:09 PM.

                          Comment

                          Latest Articles

                          Collapse

                          • seqadmin
                            Strategies for Sequencing Challenging Samples
                            by seqadmin


                            Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                            03-22-2024, 06:39 AM
                          • seqadmin
                            Techniques and Challenges in Conservation Genomics
                            by seqadmin



                            The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                            Avian Conservation
                            Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                            03-08-2024, 10:41 AM

                          ad_right_rmr

                          Collapse

                          News

                          Collapse

                          Topics Statistics Last Post
                          Started by seqadmin, 03-27-2024, 06:37 PM
                          0 responses
                          12 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 03-27-2024, 06:07 PM
                          0 responses
                          11 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 03-22-2024, 10:03 AM
                          0 responses
                          53 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 03-21-2024, 07:32 AM
                          0 responses
                          68 views
                          0 likes
                          Last Post seqadmin  
                          Working...
                          X