  • Find the segmental duplicates

    Hello, I have a sequence file with three columns.
    The first is the chromosome, the second is the position, and the third is the sequence.
    For example:
    Code:
    chr10 89646218 TTTTTTGATTGGGGGATAATTGACCAATAAGGCTTTGATAGCCTCTATTGCCCAGGCCCCTCCTCTTCTTTTATGAGAGAAAGGATGAACAGTGACCAGAA
    chr10 89646221 TTTGATTGGGGGATAATTGGCCAATAAAGCTTTGATAGCCTCTATTGCCCAGGCCCCTCGTCTTCTTTTATGAGAGAAAGGATGAACAGTGACCAGAAATA
    chr10 89646225 ATTGGGGGATAATTGGCCAATAAAGCTTTGATAGCCTCTATTGCCCAGGCCCCTCCTCTTCTTTTATGAGAGAAAGGATGAACAGTGACCAGAAATAAAGG
    chr10 89646226 TTGGGGGATAATTGGCCAATAAAGCTTTGATAGCCTCTATTGCCCAGGCCCCTCCTCTTCTTTTATGAGAGAAAGGATGAACAGTGACCAGAAATAAAGGT
    chr10 89646229 GGGGATAATTGGCCAATAAAGCTTTGATAGCCTCTATTGCCCAGGCCCCTCCTCTTCTTTTATGAGAGAAAGGATGAACAGTGACCAGAGATAAAGGAATT
    chr10 89646232 GATAATTGGCCAATAAAGCTTTGATAGCCTCTATTGCCCAG
    chr10 89646237 ATGGCCAATAAAGGTTTGATAGCCTCTATTGCCCAGGCCCCTCCTCTTCTTTTATGAGAGAAAGGATGAACAGTGACCAGAAATAAAGGTATTGTTTTTTT
    chr10 89646238 TGGCCAATAAAGCTTTGATAGCCTCTATTGCCCAGGCCCCTCCTCCTCTTTTTTGTGAGAAAGGATGAACAGTGACCAGAAAAAAAGGGATTGTGTTTTTC
    chr10 89646242 CAATAAAGCTTTGATAGCCTCTATTGCCCAGGCCCCTCCTCTTCTTTTTTGAGAGAAAGGATGAACAGTGACCAGAAATAAAGGGATTGTTTTTTTTTATC
    My question: is there software that can find the segmental duplicates?
    Or do I need to develop my own algorithm/code to find them?

    Also, should a duplicate here be defined as a 100% match, or would an 80% match count?

    Thanks for any hint.
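To make the 100% versus 80% question concrete, here is a minimal sketch that measures how similar two reads are where they overlap. It assumes both reads sit on the reference without gaps, so that base i of a read lies at position pos + i; the function name `overlap_identity` is made up for illustration. It only compares reads piled up at one locus and is not a segmental duplication detector.

```python
def overlap_identity(pos_a, seq_a, pos_b, seq_b):
    """Return (overlap_length, fraction_identical) for two positioned reads.

    Assumption: ungapped placement, i.e. base i of a read lies at pos + i.
    Reads with indels or soft-clipping would need their CIGAR strings.
    """
    start = max(pos_a, pos_b)
    end = min(pos_a + len(seq_a), pos_b + len(seq_b))
    if end <= start:
        # The two reads do not overlap at all.
        return 0, 0.0
    # Slice each read down to the shared reference interval.
    a = seq_a[start - pos_a:end - pos_a]
    b = seq_b[start - pos_b:end - pos_b]
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return end - start, matches / (end - start)
```

A threshold such as 0.8 or 1.0 can then be applied to the returned fraction, depending on how strict a match is wanted.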

  • #2
    The UCSC browser has two relevant tracks: "Segmental Dups" and "DGV Struct Var". You can download the raw data and use it. There are several approaches: 1) load it into MySQL and query it; 2) use awk to filter for the lines you want; 3) load it into memory using C/Java/Perl and interrogate the data for what you want; 4) or just parse out the data using your favorite command-line tool.
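As a rough illustration of the "load and filter" approaches, the sketch below scans a tab-separated track dump and keeps rows that overlap a region of interest. The column positions are assumptions (BED-like: chromosome, start, end in the first three columns, half-open coordinates); a real table download may have a different layout, for example an extra leading column, so check its schema and adjust `chrom_col`, `start_col` and `end_col`. The function name `rows_overlapping` is hypothetical.

```python
import csv
import io


def rows_overlapping(handle, chrom, start, end,
                     chrom_col=0, start_col=1, end_col=2):
    """Yield rows of a tab-separated file that overlap chrom:start-end.

    Assumption: half-open intervals and the column indices given above.
    """
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and header/comment lines
        if row[chrom_col] != chrom:
            continue
        row_start, row_end = int(row[start_col]), int(row[end_col])
        if row_start < end and row_end > start:
            yield row


# Example with made-up rows standing in for a downloaded file.
demo = "#chrom\tstart\tend\nchr10\t100\t200\nchr10\t300\t400\nchr1\t100\t200\n"
for hit in rows_overlapping(io.StringIO(demo), "chr10", 150, 250):
    print("\t".join(hit))
```

For a real file, pass an open file handle instead of the `io.StringIO` demo.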

    Make sure you download the data for the right build (hg18 or hg19).

    You can also "hand check" them if you have just a few using the browser. Try turning these two tracks on (set to "pack").

    Segmental dupes are a pain.
    Last edited by Richard Finney; 11-08-2011, 03:55 PM. Reason: gramerr



    • #3
      Hi,

      I don't want to download duplication data; I have my own data. How do I generate segmental duplicates from my data?

      To be honest, I don't understand the concept of segmental duplicates.
      At the least I need an example and some ideas.



      • #4
        Yeah, okay. "Segmental dupes" means something specific in a genomic context: chunks of genome that appear more than once. In the case of a file of reads, it doesn't mean much unless you are de novo assembling genomic DNA reads and notice that, for instance, there are twice as many reads in a sub-assembly. In that case, there's evidence that you have a genomic duplication or "segmental dupe".

        Is that what you're looking for? Or are you looking for duplicate reads? Are you really looking for small repeated stretches? If you can explain exactly what you're looking for, there are likely good tools already available.



        • #5
          I used samtools to extract data from a BAM file into an output file, out.txt. Then I selected some columns, which gave data like the above. That means I have a lot of chunks of data. However, I found that each chunk only has 100 characters. I want to find the duplicate with the maximum length. Maybe it is a multiple sequence alignment problem. However, I can only produce 100-character sequences, so how can I find real dups if they are longer than 100? So I have two questions: 1) How do I generate a longer sequence from a SAM file? 2) Once I have multiple sequences, how do I align them? Thanks.
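For reference, the three-column file described here can be reproduced from SAM text (for example the output of `samtools view`) with a few lines of Python. This is a sketch, assuming standard SAM records where the reference name, 1-based position and read sequence are fields 3, 4 and 10; the function name `sam_to_columns` is made up. Each record is a single read, which is why every sequence comes out at roughly the read length.

```python
def sam_to_columns(lines):
    """Yield (chromosome, position, sequence) for each mapped SAM record."""
    for line in lines:
        if line.startswith("@"):
            continue  # header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        rname, pos, seq = fields[2], int(fields[3]), fields[9]
        if rname == "*":
            continue  # unmapped read, no chromosome to report
        yield rname, pos, seq


# Example with two made-up records: one mapped, one unmapped.
demo = [
    "@HD\tVN:1.0\n",
    "r1\t0\tchr10\t89646218\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII\n",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n",
]
for chrom, pos, seq in sam_to_columns(demo):
    print(chrom, pos, seq)
```

This only reshapes the reads; it does not join them into longer sequences, which would need an assembly or consensus step.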



          • #6
            The definition of a segmental duplication is:

            sequence identity higher than 90% (or a value you define) and alignment length of 10 kb



            • #8
              I'm guessing that what you're interested in finding are CNVs (copy number variations, which could vary between individuals/mice/specimens) rather than segmental duplications (which would be fixed in a population and require creating a reference genome). You should just google around (or search the forum) for CNV-related software. I recall reading about CNVnator, but can't say I've ever personally looked for CNVs.

              If you actually DO want to find segmental duplications rather than CNVs, you'll need to first assemble a genome from your reads and then run the output through something like DupMasker (which is part of RepeatMasker).



              • #9
                I want to find segmental duplications. Can I use BLAST to compare two sequences,
                where one is a section of sequence and the other is the reference genome?



                • #10
                  Originally posted by ardmore
                  I want to find segmental duplications. Can I use BLAST to compare two sequences?
                  Yes, you can use BLAST to compare sequences. Keep in mind that if you're going to run a LOT of BLAST searches, you should install a local copy and not overly tax the public servers. I would still recommend something like DupMasker, since such programs are actually designed for this sort of task.
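If BLAST is run locally with tabular output (`-outfmt 6`), the hits can be screened with a short script. The sketch below assumes the default 12-column layout (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). The `Hit` type, the function name `parse_blast_tab` and the default thresholds are illustrative only; the 90% and 10 kb values are simply the ones quoted in post #6 and can be changed.

```python
from typing import Iterable, Iterator, NamedTuple


class Hit(NamedTuple):
    qseqid: str
    sseqid: str
    pident: float   # percent identity, 0-100
    length: int     # alignment length in columns
    qstart: int
    qend: int
    sstart: int
    send: int


def parse_blast_tab(lines: Iterable[str],
                    min_pident: float = 90.0,
                    min_length: int = 10_000) -> Iterator[Hit]:
    """Yield hits from BLAST -outfmt 6 text that pass both thresholds."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # blank lines or comment lines
        f = line.rstrip("\n").split("\t")
        hit = Hit(f[0], f[1], float(f[2]), int(f[3]),
                  int(f[6]), int(f[7]), int(f[8]), int(f[9]))
        if hit.pident >= min_pident and hit.length >= min_length:
            yield hit
```

When a region is searched against the genome it came from, expect a full-length 100% hit to its own location; any additional hits passing the thresholds would be the candidates worth a closer look.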



                  • #11
                    I find DupMasker very hard to use. Is there a tutorial?
