**Richard Finney** · 11-08-2011, 03:42 PM

The UCSC browser has two relevant tracks: "Segmental Dups" and "DGV Struct Var". You can download the raw data and work with it directly. There are several approaches: 1) load it into MySQL and query it; 2) use awk to filter for the lines you want; 3) load it into memory using C/Java/Perl and interrogate the data for what you want; 4) or just parse out the data using your favorite command-line tool.
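As a rough illustration of approaches 2 and 3, here is a minimal Python sketch that keeps only the rows of a downloaded track file overlapping a region of interest. It assumes a tab-separated file with chromosome, start, and end in the second, third, and fourth columns (after a leading bin column); check the table schema for your download and adjust the column indices. The function name `overlapping_rows` and the example rows are made up for illustration.

```python
import csv
import io


def overlapping_rows(handle, chrom, start, end,
                     chrom_col=1, start_col=2, end_col=3):
    """Yield rows whose interval overlaps the query interval [start, end).

    Column indices are assumptions about the file layout, not a fixed schema.
    """
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and header/comment lines
        if row[chrom_col] != chrom:
            continue
        row_start, row_end = int(row[start_col]), int(row[end_col])
        # Two half-open intervals overlap when each starts before the other ends.
        if row_start < end and row_end > start:
            yield row


# Made-up rows standing in for a downloaded track file.
example = io.StringIO(
    "585\tchr1\t10000\t20000\tchr15:100\n"
    "585\tchr1\t50000\t60000\tchrX:200\n"
    "586\tchr2\t10000\t20000\tchr3:300\n"
)
hits = list(overlapping_rows(example, "chr1", 15000, 55000))
for hit in hits:
    print("\t".join(hit))
```

For a real file, replace the `io.StringIO` object with `open(path, newline="")`.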

Make sure you download for the right build (hg18 or hg19).

You can also "hand check" them if you have just a few using the browser. Try turning these two tracks on (set to "pack").

Segmental dupes are a pain.

**ardmore** · 11-08-2011, 06:06 PM

Hi,

I don't want to download dup data; I have my own data. How do I generate segmental duplicates from my data?

To be honest, I don't understand the concept of segmental duplicates.
At the very least, I need an example and some ideas.

**Richard Finney** · 11-09-2011, 05:41 AM

Yeah, okay. "Segmental Dupes" means something in a genomic context. It means chunks of genome that appear more than once. In the case of a file of reads, it doesn't mean much unless you are de novo assembling genomic DNA reads and notice that, for instance, there are twice as many reads in a sub-assembly. In that case, there's evidence that you have a genomic duplication or "segmental dupe".

Is that what you're looking for? Or are you looking for duplicate reads? Are you really looking for small repeated stretches? If you can explain exactly what you're looking for, there are likely good tools already available.
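To make the "twice as many reads" idea concrete, here is a minimal sketch that flags windows whose read depth is well above the typical depth. It assumes you already have per-window depths as a list of numbers; the function name `flag_high_depth` and the 1.8x cutoff are illustrative choices, not something from a particular tool.

```python
from statistics import median


def flag_high_depth(depths, ratio=1.8):
    """Return indices of windows whose depth is at least `ratio` times the median."""
    baseline = median(depths)
    if baseline == 0:
        return []  # no usable baseline, so nothing can be flagged
    return [i for i, depth in enumerate(depths) if depth / baseline >= ratio]


# Made-up depths: windows 3 and 4 have roughly double the coverage of the rest.
window_depths = [30, 31, 29, 62, 60, 30, 28]
flagged = flag_high_depth(window_depths)
print(flagged)
```

High depth alone is only a hint: it can also come from library bias or mis-mapped repeats, so treat flagged windows as candidates to check by other means.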

**ardmore** · 11-09-2011, 06:39 AM

I used samtools to extract data from a BAM file into an output file, out.txt. Then I selected some columns, like the data above. That means I have a lot of chunks of data. However, I found that each chunk has only 100 characters. I want to find the longest duplicate. Maybe it is a multiple sequence alignment problem. However, I can only produce 100-character sequences, so how can I find real dups if they are longer than 100? So I have two questions: 1) How do I generate a longer sequence from a SAM file? 2) Once I have multiple sequences, how do I align them? Thanks.

**ardmore** · 11-09-2011, 07:23 AM

The definition of a segmental duplication is:

sequence identity higher than 90% (or a value you define) and an alignment length of 10 kb
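The definition above can be sketched as a simple check on a pairwise alignment. This is only an illustration: it assumes the two sequences are already aligned (equal-length strings with `-` for gaps), counts identity over all alignment columns, and treats the 10 kb figure as a minimum length. The function names and defaults are hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of alignment columns where both sequences carry the same base."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = len(aligned_a)
    if columns == 0:
        return 0.0
    matches = sum(1 for a, b in zip(aligned_a, aligned_b) if a == b and a != "-")
    return 100.0 * matches / columns


def meets_definition(aligned_a, aligned_b, min_identity=90.0, min_length=10_000):
    """True if identity is above min_identity and the alignment is long enough."""
    long_enough = len(aligned_a) >= min_length  # assumption: 10 kb is a minimum
    return long_enough and percent_identity(aligned_a, aligned_b) > min_identity


# Made-up 12 kb alignment in which the first 600 columns are gaps in one sequence.
seq_a = "ACGT" * 3000
seq_b = "-" * 600 + seq_a[600:]
print(percent_identity(seq_a, seq_b))   # 11400 matching columns out of 12000
print(meets_definition(seq_a, seq_b))
```

Other tools may compute identity differently (for example, ignoring gap columns), so the exact percentage depends on the convention you pick.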

**ardmore** · 11-09-2011, 07:42 AM

http://www.cs.brown.edu/people/braphael/publications/papers/Kahn_camera_revised.pdf

**dpryan** · 11-09-2011, 04:49 PM

I'm guessing that what you're interested in finding are CNVs (copy number variations, which could vary between individuals/mice/specimens) rather than segmental duplications (which would be fixed in a population and require creating a reference genome). You should just google around (or search the forum) for CNV-related software. I recall reading about CNVnator, but can't say I've ever personally looked for CNVs.

If you actually DO want to find segmental duplications rather than CNVs, you'll need to first assemble a genome from your reads and then run the output through something like DupMasker (which is part of RepeatMasker).

**ardmore** · 11-10-2011, 06:48 AM

I want to find segmental duplications. Can I use BLAST to compare two sequences?
One is a section of sequence, the other is the reference genome.

**dpryan** · 11-10-2011, 07:03 AM

Originally posted by ardmore:

I want to find segmental duplications. Can I use BLAST to compare two sequences?

Yes, you can use BLAST to compare sequences. Keep in mind that if you're going to run a LOT of BLAST searches, you should install a local copy and not overly tax the public servers. I would still recommend something like DupMasker, since such programs are actually designed for this sort of task.
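If you do go the BLAST route, a small script can screen the results for long, high-identity hits. The sketch below assumes BLAST's default 12-column tabular output (`-outfmt 6`: qseqid, sseqid, pident, length, and so on) and reuses the thresholds quoted earlier in the thread; the function name `filter_blast_hits` and the example lines are made up.

```python
def filter_blast_hits(lines, min_identity=90.0, min_length=10_000):
    """Keep tabular BLAST hits with identity above and length at or above the cutoffs."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        qseqid, sseqid = fields[0], fields[1]
        pident, length = float(fields[2]), int(fields[3])
        if pident > min_identity and length >= min_length:
            hits.append((qseqid, sseqid, pident, length))
    return hits


# Made-up hits for one query segment against a reference genome.
example_lines = [
    "seg1\tchr1\t99.80\t12000\t24\t0\t1\t12000\t500001\t512000\t0.0\t22000",
    "seg1\tchr7\t93.10\t10500\t700\t5\t1\t10500\t80001\t90500\t0.0\t15000",
    "seg1\tchr9\t88.00\t15000\t1800\t9\t1\t15000\t1\t15000\t0.0\t14000",
    "seg1\tchr2\t97.00\t800\t24\t0\t1\t800\t3001\t3800\t0.0\t1300",
]
kept = filter_blast_hits(example_lines)
for hit in kept:
    print(hit)
```

A segment taken from the reference will also hit its own location, so it is a second qualifying hit elsewhere that suggests a duplication.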

**ardmore** · 11-10-2011, 07:21 AM

I find DupMasker very hard to use. Is there a tutorial?

Find the segmental duplicates