**dpryan** · 09-30-2013, 01:07 AM

RepeatMasker isn't exactly known for its speed. Since you have a cluster, your best option is to split the FASTA file by chromosome/contig and run the pieces on different nodes. You can then merge the results back together. In fact, I believe this is how the repeat-masked files available from UCSC et al. were produced.
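
A minimal sketch of the splitting step, assuming a multi-FASTA file whose sequence names are unique. The function name `split_fasta` and the output naming scheme are illustrative only and not part of RepeatMasker; each resulting file could then be submitted as a separate cluster job.

```python
import os
import re


def split_fasta(fasta_path, out_dir):
    """Write each record of a multi-FASTA file to its own file; return the paths."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    out = None
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                if out is not None:
                    out.close()
                # Use the first word of the header as the sequence name and
                # replace characters that are awkward in file names.
                # Assumption: names are unique, otherwise files are overwritten.
                fields = line[1:].split()
                raw_name = fields[0] if fields else "unnamed"
                name = re.sub(r"[^A-Za-z0-9_.-]", "_", raw_name)
                path = os.path.join(out_dir, name + ".fa")
                paths.append(path)
                out = open(path, "w")
            # Lines before the first header (if any) are ignored.
            if out is not None:
                out.write(line)
    if out is not None:
        out.close()
    return paths
```

This only makes sense for a file with a manageable number of sequences (chromosomes or contigs); for millions of short reads it would create one file per read.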

**roll** · 09-30-2013, 01:45 AM

Thanks dpryan, but I do not know how to identify the chromosomes or contigs in my data.

The header of the FASTA file looks something like
>HS15_6922:7:2307:21180:13152#6/1

I am not sure how I can partition it using the above information. Can you please advise?

**dpryan** · 09-30-2013, 01:46 AM

That looks like the name of a FASTQ read, not a contig name. Are you sure this file is FASTA?

**roll** · 09-30-2013, 01:50 AM

I converted FASTQ to FASTA myself. The original FASTQ has headers like

@HS15_6922:7:2307:21180:13152#6/1
mySequenceHere
+
CBFFJ=BJIIJKFKHFLIJJLIIAGLCKKKIHKEKJKJ9JEKQJ;MJIJHNKLHKLHJI=KJ5DFCEIB+H?4?A?I

31<FE=>ACG?F?A576;>./

**roll** · 09-30-2013, 02:31 AM

What is the best way to convert FASTQ to FASTA, then, so that I keep the chromosome information?

**dpryan** · 09-30-2013, 02:58 AM

You don't want to repeat mask that file (you could, but the results would be completely useless). What is the actual biological question you're trying to answer? From context, I'm guessing that this is an organism that hasn't been sequenced before and you'd like to determine its repeat structure or something like that. If that's the case, you need to de novo assemble the genome first. That will produce a proper FASTA file that can be meaningfully repeat masked.

**roll** · 09-30-2013, 03:07 AM

Hmm, that is an interesting point. It is mouse data that I am dealing with, so it has definitely been sequenced before.
I am not an expert in the field and am still learning. My boss would like to know whether, and how many, retrotransposons (L1, SINE, etc.) are found in the data that we generated. Maybe there is a better way to analyse this than RepeatMasker?

**dpryan** · 09-30-2013, 03:20 AM

Ah, actually repeat masking the reads won't do what you want then. What you want to do is align your reads to the mouse genome and then download the RepeatMasker output from UCSC. There are then a number of ways to compare your alignments to where the repeats are (e.g., visually inspecting things in IGV, or using bedtools or something similar to intersect the alignments with the repeat annotation).
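
To make the intersect idea concrete, here is a rough pure-Python sketch of what an overlap count computes. It assumes the alignments and repeats have already been reduced to (chrom, start, end) tuples in 0-based, half-open coordinates (as in BED). The function names are hypothetical, and for real data bedtools will be far faster.

```python
from bisect import bisect_right
from collections import defaultdict


def build_index(repeats):
    """Group (chrom, start, end) repeats by chromosome, merged and sorted."""
    by_chrom = defaultdict(list)
    for chrom, start, end in repeats:
        by_chrom[chrom].append((start, end))
    index = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        merged = [list(intervals[0])]
        for start, end in intervals[1:]:
            if start <= merged[-1][1]:
                # Overlapping or touching: extend the previous interval.
                merged[-1][1] = max(merged[-1][1], end)
            else:
                merged.append([start, end])
        index[chrom] = ([s for s, _ in merged], [e for _, e in merged])
    return index


def overlaps_repeat(index, chrom, start, end):
    """True if the half-open interval [start, end) overlaps any repeat."""
    if chrom not in index:
        return False
    starts, ends = index[chrom]
    # Last merged repeat that starts before the alignment ends.
    i = bisect_right(starts, end - 1) - 1
    return i >= 0 and ends[i] > start


def count_overlapping(index, alignments):
    """Count alignments that overlap at least one repeat."""
    return sum(1 for chrom, start, end in alignments
               if overlaps_repeat(index, chrom, start, end))
```

Merging the repeats first means each alignment needs only one binary search, at the cost of losing track of which individual repeat was hit.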

**roll** · 09-30-2013, 03:28 AM

Great, so we are getting there.

I would like to have numbers rather than examining it visually.

How do I get the RepeatMasker output from UCSC? Do I upload my SAM file and then tick a RepeatMasker option?

What about bedtools? Which option should I look for?

**roll** · 09-30-2013, 03:35 AM

I have used TopHat2 for mapping. Shall I use the BAM file? The outputs are

accepted_hits.bam
align_summary.txt
deletions.bed
insertions.bed
junctions.bed
logs
prep_reads.info
unmapped.bam

**dpryan** · 09-30-2013, 03:52 AM

Assuming that you're using the mm10 reference, you can download the RepeatMasker output here (mm9 is here). The general idea is to extract the type of feature(s) you want from the RepeatMasker .out file, convert that to BED format, and use "bedtools intersect ..." to get a count of how many reads align there. There are many other ways to do this, but that should work.

In fact, a more straightforward way might be simply to run Cufflinks on your alignments and then intersect the novel transcripts it finds with the RepeatMasker output file. That might end up being easier.
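
The extract-and-convert step could look roughly like the sketch below. It assumes the standard whitespace-separated .out layout (score, three percentage columns, query sequence, begin, end, left, strand as `+` or `C`, repeat name, class/family, ...). The function name and the default class filter are illustrative choices, not anything prescribed by RepeatMasker.

```python
def rmsk_out_to_bed(lines, wanted_classes=("LINE/L1", "SINE")):
    """Yield BED6-style tuples for .out records matching the wanted classes."""
    for line in lines:
        fields = line.split()
        # Skip blank lines and header lines, whose first field is not a score.
        if len(fields) < 11 or not fields[0].isdigit():
            continue
        chrom, start, end = fields[4], int(fields[5]), int(fields[6])
        strand = "-" if fields[8] == "C" else "+"
        name, rep_class = fields[9], fields[10]
        # Accept an exact match ("LINE/L1") or a whole class ("SINE" -> "SINE/B2").
        if not any(rep_class == w or rep_class.startswith(w + "/")
                   for w in wanted_classes):
            continue
        # .out coordinates are 1-based inclusive; BED is 0-based half-open.
        yield (chrom, start - 1, end, f"{name}|{rep_class}", fields[0], strand)
```

Each tuple can be written out tab-separated as one BED line. The score column here is the raw Smith-Waterman score, which may exceed the 0-1000 range that some BED consumers expect.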

**lh3** · 09-30-2013, 04:46 AM

BTW, you can find more detailed RepeatMasker results here:

http://hgdownload.soe.ucsc.edu/goldenPath/mm10/database/rmsk.{sql,txt.gz}

It is easy to convert this file to BED, I believe.
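
A possible conversion, sketched under the assumption that the table follows the usual UCSC rmsk column order (bin, swScore, milliDiv, milliDel, milliIns, genoName, genoStart, genoEnd, genoLeft, strand, repName, repClass, repFamily, ...). It is worth checking this against rmsk.sql before relying on it. The function name and the class filter are illustrative.

```python
import gzip


def rmsk_table_to_bed(path, wanted_classes=("LINE", "SINE")):
    """Yield BED6-style tuples from a gzipped, tab-separated rmsk table dump."""
    with gzip.open(path, "rt") as handle:
        for line in handle:
            f = line.rstrip("\n").split("\t")
            # Assumed column positions; verify against rmsk.sql.
            chrom, start, end, strand = f[5], int(f[6]), int(f[7]), f[9]
            rep_name, rep_class, rep_family = f[10], f[11], f[12]
            if rep_class not in wanted_classes:
                continue
            # UCSC table coordinates are already 0-based half-open, like BED.
            label = f"{rep_name}|{rep_class}/{rep_family}"
            yield (chrom, start, end, label, f[1], strand)
```

Unlike the .out file, the table stores class and family in separate columns, so filtering on a whole class such as LINE needs no prefix matching.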

**roll** · 09-30-2013, 06:46 AM

Thanks a lot. This has been very helpful so far. I am trying what you have suggested and will let you know how it goes.

Do you know where I can download genes and their coordinates in BED format? (Alternatively, how can I assign gene names to my BED file?)

RepeatMasker for 7.5 GB of FASTA data
