  • Why am I losing up to 5 bases at start of reads?

    Hello all, this is my first post. I have been trying for several weeks now to figure out an issue with a dataset. I have discussed this with a number of local experts and am in contact with Illumina support, but no one has come up with an answer yet. My advisor suggested SEQanswers as a good, knowledgeable forum.

    Our reads should start with a 4 base degenerate sequence (which rarely aligns to the genome; to be used to identify PCR duplicates), an invariant C at the 5th base, then genomic sequence.

    For visualization, start of read should be: NNNNC followed by 30 - 80 nt of genomic sequence.

    Before even sending the library to be sequenced, I cloned a bit of library into pBluescript and sequenced 10 clones. All 10 had this correct structure, so we went ahead with sequencing.

    However, after the library was sequenced on an Illumina HiScan SQ, the data that came back showed that only 33% of all reads had a C in the 5th position. Worse, when I randomly selected 30 reads and aligned them manually, anywhere from 0 to 5 of the first 5 bases aligned to the genome, with no obvious pattern. Put another way, about 67% of all reads appear to have lost 1-5 nt from their start.

    I can still work with the data by just aligning it without the first 5 bases and accepting that there will be PCR biases. However, I would prefer to use the degenerate bases to limit PCR biases and thus make the analysis a bit more quantitative.

    Thanks for any help anyone can provide

  • #2
    Could you give us some details about your protocols and the structure of the adapters that you are using? How do you get randomized sequences at the beginning of your reads? In any case, I would suggest running FastQC or a similar program on your data to check for any quality problems.
    Last edited by luc; 09-28-2012, 03:09 PM.


    • #3
      The core facility I collaborate with ran FastQC for me after I posted this, and it showed that quality scores were above 30 for bases 1~55, with the exception of base 5, which had a very low score. The explanation from the core facility computer analyst was that having a C in every read at position 5 is probably confusing the machine. Further analysis showed that 35% of the time C was correctly called, but the other ~65% of the time the machine called the 5th base as N.

      During filtering, we were requiring that our reads have a C in the 5th position, thus we were throwing out a large portion of the data. By simply eliminating that requirement, we were able to include most reads in our data set, and most reads appear to have the correct structure.
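The filtering change described above can be sketched as follows. This is a minimal illustration, not the poster's actual pipeline: the function names `has_expected_prefix` and `split_read` are hypothetical, reads are assumed to be plain sequence strings, and accepting only C or N at position 5 is a slightly stricter variant of simply dropping the requirement.

```python
def has_expected_prefix(seq, allow_n=True):
    """Return True if base 5 is C, or N when allow_n is set (hypothetical helper)."""
    if len(seq) < 6:
        # Too short to hold the 4-base code, the fixed base, and any genomic sequence.
        return False
    fifth = seq[4].upper()
    return fifth == "C" or (allow_n and fifth == "N")


def split_read(seq):
    """Split a read into (4-base code, genomic part), discarding the fixed 5th base."""
    return seq[:4], seq[5:]


# Toy reads: 5th base is C, N, and A respectively.
reads = ["ACGTCGGATTACA", "TTGANGGATTACA", "ACGTAGGATTACA"]

strict = [r for r in reads if has_expected_prefix(r, allow_n=False)]
relaxed = [r for r in reads if has_expected_prefix(r)]

print(len(strict), len(relaxed))
print(split_read(reads[0]))
```

With the strict filter only the first toy read survives; with the relaxed filter the read whose 5th base was called as N is kept as well.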

      I have no explanation why this occurred, since libraries of essentially the same structure were sequenced a year ago and bases were called correctly. It could be a particular software update or machine update. If anyone needs specifics (like software version, etc.) I am sure I could get them.

      Thanks


      • #4
        Hi,

        good that you figured that out.
        Having an identical base at one position in all clusters is obviously a problem for base calling, as you have noted. Such problems are to be expected, and you might merely have been lucky on your first sequencing run. Also, I guess the HiSeq system has gotten considerably better over the last year, meaning we are getting a lot more reads on average. Perhaps denser clusters lead to more problems in parts of the sequence lacking complexity?

        I would have some more questions. Why would you need your 4 degenerate bases to determine PCR duplicates? Are you analyzing a small genome? I would assume that for eukaryotic genomes the first 30 bases (or perhaps better something like bases 12-40) are diverse enough for a good removal of PCR duplicates, especially for paired end data. At least that is our working assumption.

        How did you generate the 4 degenerate bases at the beginning of the read? That sounds interesting.
        What is the resulting base composition of your sequenced first 4 bases?
        Last edited by luc; 10-01-2012, 06:19 PM.


        • #5
          Our library prep strategy has two variables that help identify PCR duplicates. First, our reads are designed to be of various lengths. Second, the RT primer we use has the 4 degenerate bases, which end up at the start of our reads (essentially 256 possible RT primers in the mix).

          A quick calculation gives thousands of possible combinations of read length and 4-base "code" for a given genomic location. Thus, if multiple reads map to exactly the same genomic coordinates and have the same 4-base "code," we treat them as PCR duplicates and collapse them into 1 read.
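As a rough illustration of the collapsing rule described above, here is a minimal sketch. The record layout `(read_id, chrom, strand, start, end, code)` and the function name `collapse_duplicates` are assumptions made for this example, not the poster's actual format. The combination count assumes every read length from 30 to 80 nt (the range given in the first post) can occur.

```python
def collapse_duplicates(alignments):
    """Keep the first read seen for each (chrom, strand, start, end, code) key.

    Reads sharing the same coordinates (and therefore the same length) and the
    same 4-base code are treated as PCR duplicates of one another.
    """
    kept = {}
    for read_id, chrom, strand, start, end, code in alignments:
        key = (chrom, strand, start, end, code)
        kept.setdefault(key, read_id)  # later reads with the same key are dropped
    return kept


# Number of distinguishable (length, code) combinations per start position,
# assuming lengths 30..80 nt and 4 degenerate bases.
n_codes = 4 ** 4                # 256 possible codes
n_lengths = 80 - 30 + 1         # 51 possible read lengths
n_combinations = n_codes * n_lengths

alignments = [
    ("r1", "chr1", "+", 100, 150, "ACGT"),
    ("r2", "chr1", "+", 100, 150, "ACGT"),  # same coordinates and code as r1
    ("r3", "chr1", "+", 100, 150, "TTGA"),  # same coordinates, different code
    ("r4", "chr1", "+", 100, 140, "ACGT"),  # same start and code, different length
]

kept = collapse_duplicates(alignments)
print(sorted(kept.values()))
print(n_combinations)
```

In this toy input only `r2` is collapsed, because it matches `r1` in both coordinates and code.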

          In practice, this works well for all but the most highly expressed genes. Those relatively few genes are so highly expressed in the tissue we study, and yield so many reads, that each combination of length, sequence, and 4-base code is repeated multiple times. We are willing to accept this to limit PCR duplicates throughout the majority of the dataset.

          The base composition of the first 4 bases ended up as 25% A, 25% G, 15% C, and 35% T. Not a perfect 25% each, but OK for our purposes, which are qualitative comparative analyses.
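A base composition like the one quoted above could be tallied roughly as follows. This is a sketch under the assumption that reads are available as plain strings; `prefix_composition` is a hypothetical helper name, and the toy reads are not taken from the dataset discussed here.

```python
from collections import Counter


def prefix_composition(reads, n=4):
    """Return the fraction of A, C, G, T and N among the first n bases of all reads."""
    counts = Counter()
    for seq in reads:
        counts.update(seq[:n].upper())  # count each base of the prefix separately
    total = sum(counts.values())
    if total == 0:
        return {}
    return {base: counts[base] / total for base in "ACGTN"}


# Toy example: two reads, eight prefix bases in total.
composition = prefix_composition(["ACGTCGGATT", "AAGTCTTTGA"])
for base, fraction in composition.items():
    print(base, f"{fraction:.1%}")
```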


          • #6
            Thanks a lot for the details on your protocol! Very interesting.


            • #7
              I'd be interested to know how often you get reads that would look like PCR dupes if you ignored the random RT primer bases, but that actually have different degenerate bases. In other words, are 90% of the "duplicate" reads really duplicates, or is it more like 9%?
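One way to put a number on that question is sketched below. The record layout `(chrom, strand, start, end, code)` and the name `true_duplicate_fraction` are assumptions for illustration. The metric counts, among reads that would be removed by coordinate-only deduplication, the share that also repeats a 4-base code already seen at that position.

```python
from collections import defaultdict


def true_duplicate_fraction(alignments):
    """Fraction of coordinate-level duplicates that also share a 4-base code."""
    codes_by_position = defaultdict(list)
    for chrom, strand, start, end, code in alignments:
        codes_by_position[(chrom, strand, start, end)].append(code)

    apparent = 0   # reads a coordinate-only deduplication would remove
    confirmed = 0  # reads a coordinate-plus-code deduplication would remove
    for codes in codes_by_position.values():
        apparent += len(codes) - 1
        confirmed += len(codes) - len(set(codes))
    return confirmed / apparent if apparent else 0.0


alignments = [
    ("chr1", "+", 100, 150, "ACGT"),
    ("chr1", "+", 100, 150, "ACGT"),  # same position, same code
    ("chr1", "+", 100, 150, "TTGA"),  # same position, different code
    ("chr2", "-", 500, 560, "GGCC"),  # unique position
]

fraction = true_duplicate_fraction(alignments)
print(fraction)
```

In the toy input, two reads look like duplicates by position alone, but only one of them shares a code with an earlier read.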
