Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • SNP detection with gsMapper without reference

    Hello,

    I have a transcriptome assembly produced with Newbler, I don't have a reference genome, but I want to detect SNPs in my data. So I thought the simplest way to do so would be to take the isotigs of the transcriptome assembly as reference and then map all reads against that 'reference'.

    But what happens in the case I have two isotigs from one isogroup, which both contain the same contig and there is a SNP in that contig?

    Is that SNP counted twice (once in each isotig)? Are the reads showing that SNP rejected because they match twice in the reference?

  • #2
    What you should do is take the subset of reads that were used in the assembly (i.e. remove all singletons) and then map these back against your contigs. You can then identify which isotig/isogroup the contig comes from and use that to annotate the SNPs. I did a similar thing to get differential expression values from a pair of samples where I made an assembly using both of the samples together. The number of reads listed for each contig from the assembly of both samples was in most cases the same as the combined number of reads for that contig when each sample was individually mapped back against the contigs.

    However, the contig output file is incorrect and appends the sequence of the previous contig (for contigs where status=isotig), so you will need to correct this first. Read below for more information about the contig file problem
    Pyrosequencing in picotiter plates, custom arrays for enrichment/decomplexing. (Roche)


    You may want to consider a difference as being real if the allele frequency range is 20-80% of reads or perhaps 30-70%. I have previously sequenced some samples of gDNA that had been first sequence captured using Nimblegen. These samples had also been processed on a SNP array so I knew which sites were heterozygous. The allele sequence coverage from the 454 for heterozygosity ranged from about 20-80%. I don't know how much of that bias was from sequence capturethough, but keep in mind that its unlikely to get exactly 50:50 sequence depth.
    Last edited by Jeremy; 01-30-2011, 07:00 PM.

    Comment


    • #3
      Thank you for your reply, Jeremy.
      I'm not sure if I understood you correctly: do you mean I should take the contigs which can be found in the isotigs as reference? But this wouldn't solve the problem, as reads can again match several contigs and are therefore rejected (or counted several times) ??
      However, the contig output file is incorrect and appends the sequence of the previous contig (for contigs where status=isotig), so you will need to correct this first. Read below for more information about the contig file problem
      http://seqanswers.com/forums/showthread.php?t=4732
      Looks like Roche fixed this in v2.5p1.

      Comment


      • #4
        Originally posted by dschika View Post
        Thank you for your reply, Jeremy.
        I'm not sure if I understood you correctly: do you mean I should take the contigs which can be found in the isotigs as reference? But this wouldn't solve the problem, as reads can again match several contigs and are therefore rejected (or counted several times) ??
        I mean the 454AllContigs.fna file.
        For the mapping process the long reads are split into smaller fragments called kmers, and these kmers are then mapped against the reference to determine if the whole read maps or not. Reads seem to be rejected if the kmers that they are split into each match multiple sites in the reference. So, although many reads can map to several contigs, the kmers that they are split into (the default is 12 or 16 from memory) should only match a single contig which means that the read will not be flagged as repeat sequence.

        When I mapped the reads against the contigs, the read count per contig was in most cases identical to the number of reads used to produce that contig as listed in the 454AllContigs.fna file.

        Comment


        • #5
          ahh, ok. That sounds interessting. I will try it! Unfortunately I will have to do something else the next weeks, but when I'm coming back to the topic, I will let you know if it worked!

          Comment


          • #6
            Sorry for being away such a long time. But now I'm back to work at that topic.

            I used now the contigs of the assembly as reference and mapped all reads with gsMapper against it.
            Now I see something I'm not completely sure how to interprete:
            Looking at the 454HCDiffs.txt I see some thousand differences where the Var Freq is greater than 90% and even a lot with Var Freq 100% which means (as far as I understood it) that no reads were mapped against the reference with the exact reference sequence. But as the reference sequence is the assembly of the reads used for mapping, does it mean there are sequencing errors? Did you therefore suggest to

            Originally posted by Jeremy
            consider a difference as being real if the allele frequency range is 20-80% of reads or perhaps 30-70%.
            ?

            Btw:
            Originally posted by Jeremy View Post
            When I mapped the reads against the contigs, the read count per contig was in most cases identical to the number of reads used to produce that contig as listed in the 454AllContigs.fna file.
            I can't see this in my data. The number of reads vary greatly.

            Comment


            • #7
              Did you first subset the reads so that you only used the reads that were assembled? If you are using all reads then the reads that weren't assembled will be force aligned somewhere and you will get a lot of differences being identified based on incorrect alignment.

              Some reads will still align incorrectly, but you can filter these out simply by comparing the number of reads mapped compared to the number of reads used to create the contig. The number of reads used to create the contig has to come from the readstatus.txt file not the number listed in the contigs.fna file. This is because the number in the contigs.fna file counts reads considered duplicates as a single read but mapping does not.

              Comment


              • #8
                Thanks again, Jeremy and shame on me! I forgot to remove the not assembled reads
                Now the read numbers are more equal!

                But anyway I wasn't aware that duplicates are marked as assembled in the ReadStatus file. Thanks for that hint!

                One more question: Is there a more direct way of getting the SNP positions in the isotigs than:
                - parsing the 454HCDiffs.txt to get the SNP positions in the contigs
                - next, parse the 454Isotigs of the assembly to get the contigs per isotig
                - finally calculate the SNP positions in an isotig according to the SNP positions in the contigs of that isotig?

                The problem is: I found that there are some discrepancies in the Newbler files:
                - sometimes the length of an isotig and a corresponding contig can vary (nucleotides can be added or removed)
                - sometimes an isotig consists of the reverse complement of a contig. That is in cases where the isotig consists only of one contig and the contig direction is marked '-' in 454Isotigs.txt. (At least I have seen a reversed complement only in that cases until now). Is the reverse complement of contigs used also in other cases?

                Cheers!

                Comment


                • #9
                  Originally posted by dschika View Post
                  One more question: Is there a more direct way of getting the SNP positions in the isotigs than:
                  - parsing the 454HCDiffs.txt to get the SNP positions in the contigs
                  - next, parse the 454Isotigs of the assembly to get the contigs per isotig
                  - finally calculate the SNP positions in an isotig according to the SNP positions in the contigs of that isotig?

                  The problem is: I found that there are some discrepancies in the Newbler files:
                  - sometimes the length of an isotig and a corresponding contig can vary (nucleotides can be added or removed)
                  - sometimes an isotig consists of the reverse complement of a contig. That is in cases where the isotig consists only of one contig and the contig direction is marked '-' in 454Isotigs.txt. (At least I have seen a reversed complement only in that cases until now). Is the reverse complement of contigs used also in other cases?

                  Cheers!
                  I found the same thing and I don't think there is a more direct way than what you stated. What version of Newbler are you using? My answers address version 2.3 mostly, but I have tried 2.5 also.

                  The random nucleotide seems to be a nucleotide at the contig join meaning that sometimes the contigs have a one bp overlap at the join site and sometimes they don't. I am not sure that is true for all cases, but it seemed to be true for the ones that I checked, which should be easy enough to correct (this is true for v2.3, not sure if this is true for v2.5).

                  I haven't seen any cases where the reverse complement has been incorporated, but my samples were first strand cDNA so I would not expect to see any. I do see + and - in the 454Isotigs.txt file, but the contig sequences in the 454contigs.fna file (that i checked) are given in the same orientation as they appear in the 454isotigs.fna file (same thing happens in v2.5) so I don't really know what is going on there, but it looks like the output file for the contigs is automatically fixing the orientation to match the isotigs.

                  Code:
                  for example (not real data)
                  454Isotigs.fna file
                  >isotig00001  gene=isogroup00001  length=8  numContigs=2
                  ATCGAGGT
                  
                  454IsotigsLayout.txt file
                  >isogroup00001  numIsotigs=1  numContigs=2
                  Length :  4        4
                  Contig :  00001  00002
                            <<<<< >>>>>
                  
                  454Contigs.fna file
                  >contig00001  length=4  numreads=lots  gene=isogroup00001  status=isotig
                  ATCG
                  
                  >contig00001  length=4  numreads=lots  gene=isogroup00001  status=isotig
                  AGGT

                  Comment


                  • #10
                    I use version 2.5.

                    I now have seen also contigs with + direction to be in reveresed complement in the isotigs. That makes the identification of SNPs and SNP positions in the isotigs pretty ugly



                    Have you compared the SNPs in the isotigs to those in the contigs?

                    I altered my search: I now extract a small sequence around a SNP in the contig and search that sequence in the isotig, if necessary I search the reverse complemented sequence in the isotig.

                    Comment


                    • #11
                      Originally posted by dschika View Post
                      I now have seen also contigs with + direction to be in reveresed complement in the isotigs. That makes the identification of SNPs and SNP positions in the isotigs pretty ugly
                      I don't know about that. Hopefully someone else has dealt with it.

                      Originally posted by dschika View Post
                      Have you compared the SNPs in the isotigs to those in the contigs?
                      I only mapped against contigs because isotigs within an isogroup would cause mapping problems, so I have nothing to compare to. I have not bothered to calculate isotig positions for all the SNPs

                      Comment

                      Latest Articles

                      Collapse

                      • seqadmin
                        Non-Coding RNA Research and Technologies
                        by seqadmin




                        Non-coding RNAs (ncRNAs) do not code for proteins but play important roles in numerous cellular processes including gene silencing, developmental pathways, and more. There are numerous types including microRNA (miRNA), long ncRNA (lncRNA), circular RNA (circRNA), and more. In this article, we discuss innovative ncRNA research and explore recent technological advancements that improve the study of ncRNAs.

                        Nobel Prize for MicroRNA Discovery
                        This week,...
                        10-07-2024, 08:07 AM
                      • seqadmin
                        Recent Developments in Metagenomics
                        by seqadmin





                        Metagenomics has improved the way researchers study microorganisms across diverse environments. Historically, studying microorganisms relied on culturing them in the lab, a method that limits the investigation of many species since most are unculturable1. Metagenomics overcomes these issues by allowing the study of microorganisms regardless of their ability to be cultured or the environments they inhabit. Over time, the field has evolved, especially with the advent...
                        09-23-2024, 06:35 AM

                      ad_right_rmr

                      Collapse

                      News

                      Collapse

                      Topics Statistics Last Post
                      Started by seqadmin, 10-11-2024, 06:55 AM
                      0 responses
                      11 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 10-02-2024, 04:51 AM
                      0 responses
                      110 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 10-01-2024, 07:10 AM
                      0 responses
                      114 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 09-30-2024, 08:33 AM
                      1 response
                      119 views
                      0 likes
                      Last Post EmiTom
                      by EmiTom
                       
                      Working...
                      X