
  • roll
    replied
    Originally posted by dpryan View Post
    Ah, actually repeat masking the reads won't do what you want then. What you want to do is align your reads to the mouse genome and then download the repeat masker output from UCSC. There are then a number of ways to compare your alignments to where the repeats are (e.g., just visually inspecting things with IGV or using bedtools or something similar to intersect the alignments).
    Great, so we are getting there.

    I would like actual numbers rather than a visual inspection.

    How do I get the RepeatMasker output from UCSC? Do I upload my SAM file and then tick a repeat-masking option?

    What about bedtools? Which option should I look for?



  • dpryan
    replied
    Originally posted by roll View Post
    Hmm, that is an interesting point. It is mouse data that I am dealing with, so it has definitely been sequenced before.
    I am not an expert in the field and am still learning. My boss would like to know whether, and how many, retrotransposons (L1, SINE, etc.) are found in the data that we generated. Maybe there is a better way to analyse this than RepeatMasker?
    Ah, actually repeat masking the reads won't do what you want then. What you want to do is align your reads to the mouse genome and then download the repeat masker output from UCSC. There are then a number of ways to compare your alignments to where the repeats are (e.g., just visually inspecting things with IGV or using bedtools or something similar to intersect the alignments).
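
    To get numbers rather than a visual impression, the comparison described above can be sketched in plain Python. This is only an illustration of the interval-overlap idea behind an intersect step, not the bedtools implementation. It assumes the alignments and the repeat annotation have already been reduced to BED-style, 0-based, half-open intervals; the function name `count_reads_by_repeat_class` and the tuple layouts are hypothetical.

```python
from bisect import bisect_left, bisect_right
from collections import Counter, defaultdict


def count_reads_by_repeat_class(alignments, repeats):
    """Count aligned reads that overlap at least one repeat of each class.

    alignments: iterable of (chrom, start, end)
    repeats:    iterable of (chrom, start, end, repeat_class)
    Coordinates are assumed to be 0-based and half-open, as in BED.
    A read is counted once per repeat class it overlaps.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end, rep_class in repeats:
        by_chrom[chrom].append((start, end, rep_class))

    # Per chromosome: repeats sorted by start, plus the longest repeat length,
    # which bounds how far left of a read an overlapping repeat can begin.
    index = {}
    for chrom, items in by_chrom.items():
        items.sort()
        starts = [s for s, _, _ in items]
        max_len = max(e - s for s, e, _ in items)
        index[chrom] = (starts, items, max_len)

    counts = Counter()
    for chrom, start, end in alignments:
        if chrom not in index:
            continue
        starts, items, max_len = index[chrom]
        lo = bisect_right(starts, start - max_len)  # repeat may still reach the read
        hi = bisect_left(starts, end)               # repeat starts before the read ends
        classes = {c for s, e, c in items[lo:hi] if e > start}
        counts.update(classes)
    return counts


if __name__ == "__main__":
    # Made-up coordinates, for illustration only.
    repeats = [("chr1", 100, 200, "LINE"), ("chr1", 150, 400, "SINE")]
    alignments = [("chr1", 190, 240), ("chr1", 300, 350), ("chr1", 500, 550)]
    print(count_reads_by_repeat_class(alignments, repeats))
```

    For real data the same question is normally answered with an existing tool such as bedtools, as suggested above; the sketch only shows what "intersecting" alignments with repeats means in terms of counts.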



  • roll
    replied
    Originally posted by dpryan View Post
    You don't want to repeat mask that file (you could, but the results would be completely useless). What is the actual biological question you're trying to answer? From context, I'm guessing that this is an organism that hasn't been sequenced before and you'd like to determine its repeat structure or something like that. If that's the case, you need to de novo assemble the genome first. That will produce a proper FASTA file that can be meaningfully repeat masked.
    Hmm, that is an interesting point. It is mouse data that I am dealing with, so it has definitely been sequenced before.
    I am not an expert in the field and am still learning. My boss would like to know whether, and how many, retrotransposons (L1, SINE, etc.) are found in the data that we generated. Maybe there is a better way to analyse this than RepeatMasker?



  • dpryan
    replied
    Originally posted by roll View Post
    What is the best way to convert FASTQ to FASTA, then, so that I keep the chromosome information?
    You don't want to repeat mask that file (you could, but the results would be completely useless). What is the actual biological question you're trying to answer? From context, I'm guessing that this is an organism that hasn't been sequenced before and you'd like to determine its repeat structure or something like that. If that's the case, you need to de novo assemble the genome first. That will produce a proper FASTA file that can be meaningfully repeat masked.



  • roll
    replied
    Originally posted by dpryan View Post
    That looks like the read name for a FASTQ read, not a contig name. Are you sure this file is fasta?
    What is the best way to convert FASTQ to FASTA, then, so that I keep the chromosome information?



  • roll
    replied
    I converted the FASTQ to FASTA myself. The original FASTQ records have headers like this:

    @HS15_6922:7:2307:21180:13152#6/1
    mySequenceHere
    +
    CBFFJ=BJIIJKFKHFLIJJLIIAGLCKKKIHKEKJKJ9JEKQJ;MJIJHNKLHKLHJI=KJ5DFCEIB+H?4?A?I31<FE=>ACG?F?A576;>./
    Last edited by roll; 09-30-2013, 01:56 AM.
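
    For reference, a conversion like the one described can be sketched as follows. This is a minimal illustration, not the script that was actually used; it assumes strict four-line FASTQ records with no blank lines, and the function name `fastq_to_fasta` is hypothetical. Note that the header shown above appears to be a sequencer-assigned read name, so the resulting FASTA headers are still read names and carry no chromosome information.

```python
from itertools import islice


def fastq_to_fasta(lines):
    """Yield FASTA lines (header, sequence) for each 4-line FASTQ record.

    Assumes every record is exactly four lines: '@name', sequence,
    '+' separator, quality string. Raises ValueError otherwise.
    """
    it = iter(lines)
    while True:
        record = [line.rstrip("\r\n") for line in islice(it, 4)]
        if not record:
            return  # input exhausted
        if (len(record) != 4
                or not record[0].startswith("@")
                or not record[2].startswith("+")):
            raise ValueError(f"malformed FASTQ record: {record!r}")
        yield ">" + record[0][1:]  # keep the read name, swap '@' for '>'
        yield record[1]            # sequence only; qualities are dropped


if __name__ == "__main__":
    import sys

    # Usage sketch: python fastq_to_fasta.py < reads.fastq > reads.fasta
    for out_line in fastq_to_fasta(sys.stdin):
        print(out_line)
```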



  • dpryan
    replied
    Originally posted by roll View Post
    Thanks dpryan, but I do not know how to identify the chromosomes or contigs in my data.

    The header of the FASTA file looks something like
    >HS15_6922:7:2307:21180:13152#6/1

    I am not sure how I can partition it using the above information. Can you please advise?
    That looks like the read name for a FASTQ read, not a contig name. Are you sure this file is fasta?



  • roll
    replied
    Thanks dpryan, but I do not know how to identify the chromosomes or contigs in my data.

    The header of the FASTA file looks something like
    >HS15_6922:7:2307:21180:13152#6/1

    I am not sure how I can partition it using the above information. Can you please advise?



  • dpryan
    replied
    RepeatMasker isn't exactly known for its speed. Since you have a cluster, your best option is to split the FASTA file by chromosome/contig and run those pieces on different nodes. You can then merge the results back together. In fact, I believe this is how the repeat-masked files that are available from UCSC et al. were produced.
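
    The splitting step could look roughly like this. It is a minimal sketch, assuming a multi-FASTA file whose records are chromosomes or contigs and whose header names are unique; the function name `split_fasta` and the output naming scheme are hypothetical. Each output file can then be submitted as its own cluster job.

```python
from pathlib import Path


def split_fasta(fasta_path, out_dir):
    """Write each record of a multi-FASTA file to its own file in out_dir.

    Files are named after the first word of each header line. Returns the
    list of paths written, in input order. Records with identical names
    would overwrite each other, so names are assumed to be unique.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    handle = None
    try:
        with open(fasta_path) as fasta:
            for line in fasta:
                if line.startswith(">"):
                    if handle is not None:
                        handle.close()
                    fields = line[1:].split()
                    name = fields[0] if fields else f"record{len(written) + 1}"
                    # Replace characters that are awkward in file names.
                    safe = "".join(c if c.isalnum() or c in "._-" else "_"
                                   for c in name)
                    path = out_dir / f"{safe}.fasta"
                    handle = open(path, "w")
                    written.append(path)
                if handle is not None:
                    handle.write(line)  # header and sequence lines alike
    finally:
        if handle is not None:
            handle.close()
    return written
```

    One file per record only makes sense for a modest number of long sequences; for a file with very many short records, grouping several records per chunk would be the more practical variant.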



  • roll
    started a topic RepeatMasker for 7.5 GB of FASTA data

    RepeatMasker for 7.5 GB of FASTA data

    Hi,

    I am trying to run RepeatMasker on a full FASTA file. The file is 7.5 GB and I would like to identify all repeats.

    I am using a cluster and requested 50 GB of memory, but it still complains that it is out of memory.

    I am using the following options:

    RepeatMasker -e crossmatch -q -species mouse -no_is -dir . -html -gff *.fasta

    How can I run this on the whole FASTA file, and make it fast as well?
