RepeatMasker for 7.5 GB of FASTA data

roll replied

11-12-2013, 03:15 AM
Originally posted by dpryan

Assuming that you're using the mm10 reference, you can download the repeatmasker output here (mm9 is here). The general idea is to extract the type of feature(s) you want from the repeatmasker .out file and convert that to bed format and use "bedtools intersect ..." to get a count of how many reads align there. There are many other ways to do this, but that should work.

In fact, a more straight-forward way might be simply to run cufflinks on your alignments and then intersect the novel transcripts it finds with the repeatmasker output file. That might end up being easier.

From the repeat list, I am trying to use the ones on the forward strand. Do you know how to extract these from the .out file?
The column headers look like this:
   SW  perc perc perc  query     position in query                matching  repeat         position in repeat
score  div. del. ins.  sequence  begin    end      (left)         repeat    class/family   begin   end   (left)  ID

  687  17.4  0.0  0.0  chr1      3000002  3000156  (194195276)  C L1_Mur2   LINE/L1        (4310)  1567  1413    1
  917  21.4 11.4  4.5  chr1      3000238  3000733  (194194699)  C L1_Mur2   LINE/L1        (4488)  1389  913     1
  215   3.1  0.0  3.0  chr1      3000734  3000766  (194194666)  + (TTTG)n   Simple_repeat  2       33    (0)     2
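
A minimal sketch of one way to pull out the forward-strand rows with awk. It assumes the .out layout shown above, where the ninth whitespace-separated field is the strand ("+" for forward, "C" for complement). The function name forward_repeats and the file names are placeholders, not part of RepeatMasker.

```shell
# Print only forward-strand rows from a RepeatMasker .out file.
# The two header lines are skipped automatically because their
# ninth field is a word ("matching", "repeat"), never "+".
forward_repeats() {
    awk '$9 == "+"' "$1"
}
```

Usage would be something like `forward_repeats rmsk.out > forward.out`; swapping "+" for "C" gives the complement-strand rows instead.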
dpryan replied

10-03-2013, 03:03 PM
Originally posted by roll

I did the check.
As you suggested, the order is different.

The only differences are that
I have chrM in the fa file and MN in the gtf file.
Also, in the gtf file I have lots of 'NT_123456'-type entries, and I do not know what they are.

Do these differences cause any problems?

That will generally still cause issues with tophat. If you downloaded your bowtie indices from iGenomes (likely via a link on the bowtie webpage), then they came with an appropriate reference annotation file. Just use that one.
roll replied

10-03-2013, 04:16 AM
I did the check.
As you suggested, the order is different.

The only differences are that
I have chrM in the fa file and MN in the gtf file.
Also, in the gtf file I have lots of 'NT_123456'-type entries, and I do not know what they are.

Do these differences cause any problems?
dpryan replied

10-02-2013, 06:37 AM
No need to do that manually.

Just run:

Code:

grep ">" reference.fa

on the reference fasta file to get a list of the contigs and then:

Code:

cat annotation.gtf | cut -f 1 | sort | uniq

on the GTF or GFF file. They should be the same, possibly with a different order (and the output from the grep command will all start with ">", which you can ignore).
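
To avoid comparing the two lists by eye, something along these lines should print only the names that differ. This is a sketch under a few assumptions: FASTA headers look like ">chr1 optional description", column 1 of the GTF/GFF holds the sequence name, and the function names (fasta_names, gtf_names, compare_names) are made up for illustration.

```shell
# Names of the sequences in a FASTA file, one per line, sorted.
fasta_names() {
    # Strip the leading ">" and anything after the first whitespace.
    grep '^>' "$1" | sed 's/^>//' | awk '{print $1}' | sort -u
}

# Names in column 1 of a GTF/GFF file, one per line, sorted.
gtf_names() {
    # Skip "#" comment lines, keep column 1.
    grep -v '^#' "$1" | cut -f 1 | sort -u
}

# Report names found in only one of the two files.
compare_names() {
    fa_list=$(mktemp)
    gtf_list=$(mktemp)
    fasta_names "$1" > "$fa_list"
    gtf_names "$2" > "$gtf_list"
    echo "Only in FASTA:"
    comm -23 "$fa_list" "$gtf_list"
    echo "Only in annotation:"
    comm -13 "$fa_list" "$gtf_list"
    rm -f "$fa_list" "$gtf_list"
}
```

Called as `compare_names reference.fa annotation.gtf`, an empty "Only in annotation:" section means every name in the annotation also exists in the reference.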
roll replied

10-02-2013, 06:34 AM
Do you mean manually checking if all chromosomes are there? I do not know what contigs are or how to check them.

I think chrX and Y are missing in the gtf file. But how can I find information for missing chromosomes and complete it?
dpryan replied

10-02-2013, 06:11 AM
That should be sufficient, just make sure that the annotation doesn't mention any chromosomes/contigs missing from the reference fasta file (I can't recall if that's the case or not).
roll replied

10-02-2013, 05:50 AM
I am using mm9, and I downloaded the bowtie index from their webpage. I had to change the chromosome names, as Ensembl uses 1 instead of chr1. But I don't know whether just adding chr in front of the chromosome names will sort it out. Should I make other changes in addition?
dpryan replied

10-02-2013, 05:36 AM
If you're aligning against mm10, then don't use an annotation file from Ensembl (the chromosome names are different). That will cause no end of issues. If you use the Ensembl annotation, just align against the genome that you can download from Ensembl (the Ensembl annotation is better anyway).

Yeah, that was probably me, there's a large overlap between the people here and on biostars.
roll replied

10-02-2013, 05:32 AM
Originally posted by dpryan

Just download the GTF annotation and use gtf2bed from bedops. That should keep the gene names (or some other useful identifier).

Edit: The GTF annotation is available from the UCSC table browser, in case you weren't aware

Thanks Devon,
Very helpful so far.
I downloaded this annotation from Ensembl. In BioMart I simply left the filters section empty and used the whole output as my list. This is right, no?

PS. I just asked a question on Biostar about flagstat, and I think it was answered by you, but I am not sure.
dpryan replied

09-30-2013, 08:54 AM
Just download the GTF annotation and use gtf2bed from bedops. That should keep the gene names (or some other useful identifier).

Edit: The GTF annotation is available from the UCSC table browser, in case you weren't aware
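
With bedops installed, the conversion is normally just `gtf2bed < annotation.gtf > annotation.bed`. If bedops is not at hand, a rough awk stand-in is sketched below. It is not gtf2bed itself and keeps far less information: it assumes tab-separated GTF with a gene_id "..." entry in the ninth column, and the function name gtf_to_bed is hypothetical.

```shell
# Convert one feature type from a GTF file to 6-column BED,
# using the gene_id attribute as the BED name.
gtf_to_bed() {
    # $1 = GTF file, $2 = feature type to keep (for example: exon)
    awk -F '\t' -v feature="$2" '
        $0 !~ /^#/ && $3 == feature {
            name = "."
            # Pull the value out of gene_id "..." in the attribute column.
            if (match($9, /gene_id "[^"]+"/)) {
                name = substr($9, RSTART + 9, RLENGTH - 10)
            }
            # GTF is 1-based inclusive; BED is 0-based half-open.
            printf "%s\t%d\t%d\t%s\t%s\t%s\n", $1, $4 - 1, $5, name, $6, $7
        }' "$1"
}
```

For example, `gtf_to_bed annotation.gtf exon > exons.bed` would give a BED file whose fourth column carries the gene identifier.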
roll replied

09-30-2013, 06:46 AM
Originally posted by dpryan

Assuming that you're using the mm10 reference, you can download the repeatmasker output here (mm9 is here). The general idea is to extract the type of feature(s) you want from the repeatmasker .out file and convert that to bed format and use "bedtools intersect ..." to get a count of how many reads align there. There are many other ways to do this, but that should work.

In fact, a more straight-forward way might be simply to run cufflinks on your alignments and then intersect the novel transcripts it finds with the repeatmasker output file. That might end up being easier.

Thanks a lot. This has been very helpful so far. I am trying what you have suggested and will let you know how it goes.

Do you know where I can download genes and their coordinates in BED format? (Alternatively, how can I assign gene names to my BED file?)
lh3 replied

09-30-2013, 04:46 AM
BTW, you can find more detailed repeatMask results here:

http://hgdownload.soe.ucsc.edu/goldenPath/mm10/database/rmsk.{sql,txt.gz}

It is easy to convert this file to BED, I believe.
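
A possible conversion, as a sketch only. It assumes the column order that rmsk.sql describes for this table (2 = swScore, 6 = genoName, 7 = genoStart, 8 = genoEnd, 10 = strand, 11 = repName), which is worth checking against the .sql file before relying on it. The function name rmsk_to_bed is a placeholder.

```shell
# Convert a UCSC rmsk.txt.gz table dump to 6-column BED.
rmsk_to_bed() {
    gzip -dc "$1" | awk -F '\t' 'BEGIN { OFS = "\t" }
        # genoStart is already 0-based in UCSC tables, so no offset is applied.
        { print $6, $7, $8, $11, $2, $10 }'
}
```

Run as `rmsk_to_bed rmsk.txt.gz > rmsk.bed`. Note that swScore is used as the BED score column here purely for convenience and can exceed the 0 to 1000 range some tools expect.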
dpryan replied

09-30-2013, 03:52 AM
Assuming that you're using the mm10 reference, you can download the repeatmasker output here (mm9 is here). The general idea is to extract the type of feature(s) you want from the repeatmasker .out file and convert that to bed format and use "bedtools intersect ..." to get a count of how many reads align there. There are many other ways to do this, but that should work.

In fact, a more straight-forward way might be simply to run cufflinks on your alignments and then intersect the novel transcripts it finds with the repeatmasker output file. That might end up being easier.
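
The first approach could look roughly like the following. This is a sketch, not a tested pipeline: it assumes the standard whitespace-separated .out layout (field 5 = sequence, 6 and 7 = begin and end, 9 = strand, 10 = repeat name, 11 = class/family), and out_to_bed, l1.bed and l1_counts.bed are made-up names.

```shell
# Convert a RepeatMasker .out file to 6-column BED, optionally keeping
# only one class/family (for example: LINE/L1).
out_to_bed() {
    # $1 = .out file, $2 = optional class/family to keep (empty keeps all)
    awk -v keep="$2" 'BEGIN { OFS = "\t" }
        # Data rows start with a numeric score; header lines do not.
        $1 ~ /^[0-9]+$/ && (keep == "" || $11 == keep) {
            strand = ($9 == "C") ? "-" : "+"
            # .out coordinates are 1-based; BED starts are 0-based.
            print $5, $6 - 1, $7, $10, $1, strand
        }' "$1"
}

# Then, assuming bedtools is installed and the alignments are in a BAM file:
#   out_to_bed rmsk.out LINE/L1 > l1.bed
#   bedtools intersect -c -a l1.bed -b accepted_hits.bam > l1_counts.bed
```

With `-c`, bedtools appends to each repeat interval the number of overlapping alignments. Older bedtools releases may not accept a BAM file for `-b`, and the chromosome names in the BED and BAM files have to match.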
roll replied

09-30-2013, 03:36 AM
Originally posted by roll

Great, so we are getting there.

I would like to have actual numbers rather than examining it visually.

How do I get the RepeatMasker output from UCSC? Do I upload my SAM file and then tick a RepeatMasker option?

What about bedtools? Which option should I look for?

Originally posted by dpryan

Ah, actually repeat masking the reads won't do what you want then. What you want to do is align your reads to the mouse genome and then download the repeat masker output from UCSC. There are then a number of ways to compare your alignments to where the repeats are (e.g., just visually inspecting things with IGV or using bedtools or something similar to intersect the alignments).

I have used tophat2 for mapping. Shall I use the BAM file? The outputs are:

accepted_hits.bam
align_summary.txt
deletions.bed
insertions.bed
junctions.bed
logs
prep_reads.info
unmapped.bam
roll replied

09-30-2013, 03:35 AM
Originally posted by roll

Great, so we are getting there.

I would like to have actual numbers rather than examining it visually.

How do I get the RepeatMasker output from UCSC? Do I upload my SAM file and then tick a RepeatMasker option?

What about bedtools? Which option should I look for?

I have used tophat2 for mapping. Shall I use the BAM file? The outputs are:

accepted_hits.bam
align_summary.txt
deletions.bed
insertions.bed
junctions.bed
logs
prep_reads.info
unmapped.bam