Find the segmental duplicates
  • Find the segmental duplicates

    Hello, I have a sequence file with three columns.
    The first is the chromosome, the second is the position, and the third is the sequence.
    For example:
    Code:
    chr10 89646218 TTTTTTGATTGGGGGATAATTGACCAATAAGGCTTTGATAGCCTCTATTGCCCAGGCCCCTCCTCTTCTTTTATGAGAGAAAGGATGAACAGTGACCAGAA
    chr10 89646221 TTTGATTGGGGGATAATTGGCCAATAAAGCTTTGATAGCCTCTATTGCCCAGGCCCCTCGTCTTCTTTTATGAGAGAAAGGATGAACAGTGACCAGAAATA
    chr10 89646225 ATTGGGGGATAATTGGCCAATAAAGCTTTGATAGCCTCTATTGCCCAGGCCCCTCCTCTTCTTTTATGAGAGAAAGGATGAACAGTGACCAGAAATAAAGG
    chr10 89646226 TTGGGGGATAATTGGCCAATAAAGCTTTGATAGCCTCTATTGCCCAGGCCCCTCCTCTTCTTTTATGAGAGAAAGGATGAACAGTGACCAGAAATAAAGGT
    chr10 89646229 GGGGATAATTGGCCAATAAAGCTTTGATAGCCTCTATTGCCCAGGCCCCTCCTCTTCTTTTATGAGAGAAAGGATGAACAGTGACCAGAGATAAAGGAATT
    chr10 89646232 GATAATTGGCCAATAAAGCTTTGATAGCCTCTATTGCCCAG
    chr10 89646237 ATGGCCAATAAAGGTTTGATAGCCTCTATTGCCCAGGCCCCTCCTCTTCTTTTATGAGAGAAAGGATGAACAGTGACCAGAAATAAAGGTATTGTTTTTTT
    chr10 89646238 TGGCCAATAAAGCTTTGATAGCCTCTATTGCCCAGGCCCCTCCTCCTCTTTTTTGTGAGAAAGGATGAACAGTGACCAGAAAAAAAGGGATTGTGTTTTTC
    chr10 89646242 CAATAAAGCTTTGATAGCCTCTATTGCCCAGGCCCCTCCTCTTCTTTTTTGAGAGAAAGGATGAACAGTGACCAGAAATAAAGGGATTGTTTTTTTTTATC
    My question: is there software to find the segmental duplicates?
    Or do I need to develop an algorithm/code to find them?

    Also, should the definition of a duplicate here be a 100% match or an 80% match?

    Thanks for any hints.
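
    The "100% match or 80% match" question can be made concrete by measuring identity over the region where two reads overlap. The following is only a minimal sketch: the function name `overlap_identity` is hypothetical, the toy reads are made up, and it assumes both reads are on the same chromosome, are ungapped (no indels or clipping), and that the position column is the leftmost reference coordinate.

```python
def overlap_identity(pos_a, seq_a, pos_b, seq_b):
    """Return (overlap_length, identity) for two ungapped reads.

    pos_a and pos_b are the leftmost reference coordinates of each read
    (assumption: same chromosome, no indels, no clipping).
    Identity is the fraction of matching bases in the overlapping columns.
    """
    start = max(pos_a, pos_b)
    end = min(pos_a + len(seq_a), pos_b + len(seq_b))  # exclusive end
    if end <= start:
        return 0, 0.0  # the reads do not overlap
    sub_a = seq_a[start - pos_a:end - pos_a]
    sub_b = seq_b[start - pos_b:end - pos_b]
    matches = sum(1 for x, y in zip(sub_a, sub_b) if x == y)
    return end - start, matches / (end - start)


# Toy reads (made up for illustration): 5 overlapping bases, 4 of them equal.
length, identity = overlap_identity(100, "ACGTACGT", 103, "TACGAA")
print(length, identity)
```

    Whether 100% or 80% is the right cutoff depends on what is being looked for; the function only reports the number so that a threshold can be applied afterwards.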

  • #2
    The UCSC browser has two tracks: "Segmental Dups" and "DGV Struct Var". You can download the raw data and use it. There are several approaches: 1) load it into MySQL and query it; 2) use awk to filter for the lines you want; 3) load it into memory using C/Java/Perl and interrogate the data for what you want; 4) or just parse out the data using your favorite command-line tool.

    Make sure you download the data for the right build (hg18 or hg19).

    You can also "hand check" them in the browser if you have just a few. Try turning these two tracks on (set to "pack").
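
    As a rough illustration of the "load into memory and interrogate" approach, here is a minimal Python sketch that scans a tab-separated track file for intervals containing a position. The function name `overlapping_records` and the toy rows are hypothetical. The column indices are assumptions (a BED-like layout with chrom, start, end in the first three columns), so check the schema of the table you actually download and adjust them. It also assumes 0-based, half-open coordinates, which may differ from the positions in your own file.

```python
import csv


def overlapping_records(lines, chrom, pos, chrom_col=0, start_col=1, end_col=2):
    """Yield rows whose [start, end) interval on `chrom` contains `pos`.

    `lines` is any iterable of tab-separated text lines (for example an
    open file). Column indices are assumptions; adjust them to the table.
    """
    for row in csv.reader(lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and header/comment lines
        if row[chrom_col] != chrom:
            continue
        if int(row[start_col]) <= pos < int(row[end_col]):
            yield row


# Toy rows, made up for illustration (not real track data).
toy = [
    "#chrom\tstart\tend\tname",
    "chr10\t100\t200\tdupA",
    "chr10\t300\t400\tdupB",
    "chr1\t100\t200\tdupC",
]
for hit in overlapping_records(toy, "chr10", 150):
    print(hit)
```

    For a real download you would pass an open file handle instead of the toy list.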

    Segmental dupes are a pain.
    Last edited by Richard Finney; 11-08-2011, 03:55 PM. Reason: gramerr


    • #3
      Hi,

      I don't want to download duplication data; I have my own data. How do I generate segmental duplicates from the data?

      To be honest, I don't understand the concept of segmental duplicates.
      At the very least I need an example and some ideas.


      • #4
        Yeah, okay. "Segmental dupes" means something in a genomic context: chunks of genome that appear more than once. In the case of a file of reads, it doesn't mean much unless you are de novo assembling genomic DNA reads and notice that, for instance, there are twice as many reads in a sub-assembly. In that case, there's evidence that you have a genomic duplication or "segmental dupe".

        Is that what you're looking for? Or are you looking for duplicate reads? Are you really looking for small repeated stretches? If you can explain exactly what you're looking for, there are likely good tools already available.


        • #5
          I used samtools to extract data from a BAM file into an output file, out.txt. Then I selected some columns, which look like the data above. That means I have a lot of chunks of data. However, I found that each chunk has only 100 characters. I want to find the duplicates with the maximum length. Maybe it is a multiple sequence alignment problem. However, I can only produce 100-character sequences, so how can I find real dups if they are longer than 100? So I have two questions: 1) How do I generate a longer sequence from a SAM file? 2) After getting multiple sequences, how do I align them? Thanks.
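
          For question 1, one simple way to get a sequence longer than a single read is to stack the reads by position and take a majority vote per reference coordinate. The sketch below is only an illustration under strong assumptions: the function name `merge_reads` is hypothetical, the reads are treated as ungapped (a real SAM/BAM record has a CIGAR string with possible indels and clipping, which is ignored here), and the position is taken as the leftmost coordinate. It is a pileup-style consensus, not a de novo assembly.

```python
from collections import Counter, defaultdict


def merge_reads(reads):
    """Build a majority-vote consensus from (position, sequence) pairs.

    Returns a list of (start, consensus) tuples, one per contiguous block
    of covered coordinates. Ties are broken in favor of the base seen first.
    """
    columns = defaultdict(Counter)
    for pos, seq in reads:
        for offset, base in enumerate(seq):
            columns[pos + offset][base] += 1

    blocks = []
    block_start, block_bases, previous = None, [], None
    for coord in sorted(columns):
        if previous is not None and coord != previous + 1:
            # Coverage gap: close the current block and start a new one.
            blocks.append((block_start, "".join(block_bases)))
            block_start, block_bases = None, []
        if block_start is None:
            block_start = coord
        block_bases.append(columns[coord].most_common(1)[0][0])
        previous = coord
    if block_bases:
        blocks.append((block_start, "".join(block_bases)))
    return blocks


# Toy reads, made up for illustration.
toy_reads = [(10, "ACGT"), (11, "CGAT"), (11, "CGAT"), (20, "CC")]
print(merge_reads(toy_reads))
```

          The merged blocks can then be compared against each other or against a reference, which is where question 2 (alignment) comes in.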


          • #6
            The definition of a segmental duplication is:

            sequence identity higher than 90% (or a value you define) and an alignment length of 10 kb


            • #7
              http://www.cs.brown.edu/people/braph...ra_revised.pdf


              • #8
                I'm guessing that what you're interested in finding are CNVs (copy number variations, which could vary between individuals/mice/specimens) rather than segmental duplications (which would be fixed in a population and require creating a reference genome). You should just google around (or search the forum) for CNV-related software. I recall reading about CNVnator, but can't say I've ever personally looked for CNVs.

                If you actually DO want to find segmental duplications rather than CNVs, you'll need to first assemble a genome from your reads and then run the output through something like DupMasker (which is part of RepeatMasker).


                • #9
                  I want to find segmental duplications. Can I use BLAST to compare two sequences,
                  where one is a section of sequence and the other is the reference genome?


                  • #10
                    Originally posted by ardmore
                    I want to find segmental duplications. Can I use BLAST to compare two sequences?
                    Yes, you can use BLAST to compare sequences. Keep in mind that if you're going to run a LOT of BLAST searches, you should install a local copy and not overly tax the public servers. I would still recommend something like DupMasker, since such programs are actually designed for this sort of task.
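
                    If you do go the BLAST route, the hits still have to be filtered by identity and alignment length. Here is a minimal sketch, assuming BLAST+ tabular output (`-outfmt 6`) with the default 12 columns. The function name `filter_blast_hits` is hypothetical, the default thresholds simply mirror the 90% / 10 kb values mentioned earlier in this thread and should be adjusted, and the example lines are made up.

```python
def filter_blast_hits(lines, min_identity=90.0, min_length=10000):
    """Yield (query, subject, pident, length) for hits passing both cutoffs.

    Expects BLAST+ tabular output (-outfmt 6) with the default columns:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send
    evalue bitscore
    """
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        pident = float(fields[2])  # percent identity, 0-100
        length = int(fields[3])    # alignment length in columns
        if pident >= min_identity and length >= min_length:
            yield fields[0], fields[1], pident, length


# Toy hits, made up for illustration (not real BLAST output).
toy_hits = [
    "seg1\tchr10\t95.20\t12000\t576\t0\t1\t12000\t5000\t17000\t0.0\t20000",
    "seg1\tchr3\t85.00\t15000\t2250\t0\t1\t15000\t100\t15100\t0.0\t9000",
    "seg1\tchr7\t99.00\t500\t5\t0\t1\t500\t900\t1400\t1e-50\t800",
]
for hit in filter_blast_hits(toy_hits):
    print(hit)
```

                    Note that a segment taken from the reference will always hit its own locus, so a self-hit would need to be excluded separately before calling anything a duplication.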


                    • #11
                      I find "DupMasker" very hard to use. Is there a tutorial?
