  • Alex Clop
    Member
    • Sep 2008
    • 17

    454: Unmapped contigs after Reference Assembly

    Dear all,

    We have sequenced human BAC clones using 454 sequencing technology. During assembly into a consensus sequence (CONS), using two reference sequences in parallel, many reads were not incorporated into the resulting consensus.

    Afterwards, I did a de novo assembly (using no reference sequence) of these unmapped reads, and I am currently analysing the resulting contigs.

    I found two different scenarios: (A) the resulting contigs correspond to the cloning vector or to E. coli DNA (traces of bacterial DNA not eliminated during the maxiprep); (B) some other contigs do map to our human target region.

    (A) This applies to most of the contigs, including the longest ones and those with the deepest read coverage.

    (B) When these contigs are mapped to the reference sequences, some of them behave like paired-end tags (PET), but with orientations or distances between the aligned segments that differ from those expected for PET (3 kb in our case). I do not believe they correspond to structural variation between my template and these references.

    It is worth mentioning that (i) most of the contigs in scenario (B) are around 200–500 bp and none exceeds 1300 bp, and (ii) whilst the coverage in CONS is around 80-fold, the coverage of most of these contigs is between 2- and 3-fold, and few of them exceed 10-fold.

    I would appreciate it if anyone who has observed this kind of read or contig in their 454 analysis could let me know.

    I am also wondering how common this type of read or contig is, and why it occurs. Does anyone know?

    Thank you in advance for your help.

    With kindest regards

    Alex
  • westerman
    Rick Westerman
    • Jun 2008
    • 1104

    #2
    Alex:

    I just finished this type of analysis with a 454 Titanium run on E. coli. However, I started with a de novo assembly of the reads and then mapped the resulting contigs to the E. coli genome, instead of using the Mapper and then assembling the remaining reads as you did. (I did also run the Mapper as a separate trial.) The statistics from the de novo assembly:

    There are 651 "large" (>= 500 bp) contigs.

    Of these, a whopping 545 do not match E. coli W3110. However, none of these non-matching contigs is very long: they range from 500 to 2988 bp. As a comparison, the 106 matching contigs tend to be long and range from 531 bp to 222,307 bp.

    So it is obvious that the non-matching contigs are not very good. Nevertheless, I am curious as to what the non-matching contigs do match.

    Of the 545 contigs:

    36 do not significantly match anything in GenBank.

    137 match many entries in GenBank.

    348 match Bacillus licheniformis genomes.

    3 match a B. licheniformis plasmid.

    9 match P. fluorescens.

    2 match K. pneumoniae.

    The remaining 10 I did not bother to characterize since they did not hit the same GenBank entries.

    ---------------------------------------

    So what conclusions can be, tentatively, drawn?

    A) We did not have wholesale contamination; otherwise the contigs that do not match E. coli would have been long.

    B) Perhaps E. coli is picking up strands of DNA from its environment?

    C) Perhaps stray DNA from the environment is getting into our experiment, either through poor sterile technique in the laboratory or through DNA stuck on new or reused equipment.

    I suspect that NextGen sequencers will uncover a lot of this low-level contamination. We are dealing with so many reads that, in my mind, it seems like some will arise from external sources.

    As to your particular case, you mentioned that in your scenario (B) you were able to map the contigs back to your human reference sequence but that the contigs looked strange. It is possible that you are finding traces of human contamination: either the cells being sequenced had trace rogue DNA in them, or trace DNA 'fell into' the prep during handling. It is an idea.

    I am looking forward to analyzing our next Titanium run.


    • Chuckytah
      Member
      • Mar 2011
      • 65

      #3
      What program/software did you use to obtain those statistics?
      thanks


      • westerman
        Rick Westerman
        • Jun 2008
        • 1104

        #4
        Originally posted by Chuckytah View Post
        What program/software did you use to obtain those statistics?
        thanks
        Hmm, you are making me think about a project done over 2 years ago. That is forever in NGS time! I cannot remember exactly, but I probably used BLAST to get the statistics. E. coli is small enough that BLASTing the contigs against it would not be onerous.
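For readers who want to reproduce this kind of matching versus non-matching tally, here is a minimal sketch, not the method actually used in the thread. It assumes the contigs were searched against the reference with BLAST and the hits saved in the standard 12-column tabular format (`-outfmt 6`). The function names, the 90% identity cutoff, and the 50% aligned-fraction cutoff are all hypothetical choices for illustration.

```python
from collections import defaultdict


def split_by_reference_hit(contig_lengths, blast_lines,
                           min_identity=90.0, min_fraction=0.5):
    """Split contigs into (matching, non_matching) dicts of name -> length.

    blast_lines are rows of BLAST tabular output (outfmt 6):
    qseqid sseqid pident length mismatch gapopen qstart qend
    sstart send evalue bitscore.
    The thresholds are illustrative assumptions, not values from the thread.
    """
    longest_hit = defaultdict(int)  # contig name -> longest aligned query span
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        name = fields[0]
        identity = float(fields[2])
        q_start, q_end = int(fields[6]), int(fields[7])
        if identity < min_identity:
            continue
        span = abs(q_end - q_start) + 1
        longest_hit[name] = max(longest_hit[name], span)

    matching, non_matching = {}, {}
    for name, length in contig_lengths.items():
        if longest_hit.get(name, 0) >= min_fraction * length:
            matching[name] = length
        else:
            non_matching[name] = length
    return matching, non_matching


def length_range(contigs):
    """Return (shortest, longest) contig length, or None if there are none."""
    if not contigs:
        return None
    return min(contigs.values()), max(contigs.values())


if __name__ == "__main__":
    # Toy data, invented purely to exercise the functions.
    lengths = {"contig1": 1000, "contig2": 600, "contig3": 500}
    hits = [
        "contig1\tref\t99.0\t950\t5\t0\t1\t950\t1001\t1950\t0.0\t1700",
        "contig2\tref\t85.0\t400\t60\t0\t1\t400\t5001\t5400\t1e-50\t300",
    ]
    matching, non_matching = split_by_reference_hit(lengths, hits)
    print(len(matching), "matching, range", length_range(matching))
    print(len(non_matching), "non-matching, range", length_range(non_matching))
```

The non-matching contigs could then be searched against GenBank in a second pass to see what they do hit.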


        • Chuckytah
          Member
          • Mar 2011
          • 65

          #5
          Originally posted by westerman View Post
          Hmm, you are making me think about a project done over 2 years ago. That is forever in NGS time! I cannot remember exactly, but I probably used BLAST to get the statistics. E. coli is small enough that BLASTing the contigs against it would not be onerous.
          sorry, I didn't see the dates lol
          ty anyway


          • Jeremy
            Senior Member
            • Nov 2009
            • 190

            #6
            Originally posted by Alex Clop View Post
            It is worth mentioning that (i) most of the contigs in scenario (B) are around 200–500 bp and none exceeds 1300 bp, and (ii) whilst the coverage in CONS is around 80-fold, the coverage of most of these contigs is between 2- and 3-fold, and few of them exceed 10-fold.

            I would appreciate it if anyone who has observed this kind of read or contig in their 454 analysis could let me know.
            Alex
            The process of ligating adapters to the DNA fragments also produces chimeric sequences, where two DNA fragments ligate together. The ratio of primers to DNA is designed to limit this, but it does happen. More often than not, it will be repetitive DNA that ligates. If these chimeric sequences then get the correct primers on each end, they will amplify in the subsequent PCR steps, producing more copies. That is why the sequences you describe have low sequence coverage and behave like paired-end tags: they are an artefact of the ligation process.
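One way to screen for such candidates is sketched below, under assumptions that are not from the thread. Each contig is taken to have been aligned to the reference and reduced to a list of aligned segments. A contig is flagged when consecutive segments switch strand, or sit further apart on the reference than a tolerance. The `Segment` type, the `looks_chimeric` name, and the 100 bp tolerance are hypothetical. A flag only means the contig is not colinear with the reference. On its own it cannot tell a ligation chimera from real structural variation, so the low coverage noted above remains the supporting evidence.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    """One aligned piece of a contig (hypothetical structure).

    Reference coordinates are stored ascending (r_start <= r_end);
    the alignment direction is carried separately in `strand`.
    """
    q_start: int
    q_end: int
    r_start: int
    r_end: int
    strand: str  # "+" or "-"


def looks_chimeric(segments, max_gap=100):
    """Return True if consecutive segments are not colinear on the reference.

    max_gap is an illustrative tolerance in bp, not a value from the thread.
    """
    ordered = sorted(segments, key=lambda s: s.q_start)
    for left, right in zip(ordered, ordered[1:]):
        if left.strand != right.strand:
            return True  # orientation flips part-way through the contig
        if left.strand == "+":
            ref_gap = right.r_start - left.r_end
        else:
            # On the minus strand, moving along the contig moves
            # backwards along the reference.
            ref_gap = left.r_start - right.r_end
        if abs(ref_gap) > max_gap:
            return True  # the two pieces map far apart on the reference
    return False


if __name__ == "__main__":
    # Toy segments, invented for illustration.
    colinear = [Segment(1, 200, 1000, 1199, "+"),
                Segment(201, 400, 1210, 1409, "+")]
    flipped = [Segment(1, 200, 1000, 1199, "+"),
               Segment(201, 400, 1210, 1409, "-")]
    print(looks_chimeric(colinear), looks_chimeric(flipped))
```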
