RepeatMasker & RepeatScout


  • Lyn Hsiong
    replied
    Originally posted by sunhh
    Thanks, Mike.t
    Yes, I read the script and I think these "rmblastn" calls are made for RECON. Although my input genome is ~973 Mb, the sample size in this round is only 82500254 bp. I think the main problem is rmblastn, because most of the time it uses only 1 CPU instead of the 20 I assigned with "-num_threads 20"!
    I have to run a de novo search because I will have other genomes to deal with. However, I ran RepeatMasker on the soybean genome and could find only 24.93% LTR elements, whereas in the genome paper LTR elements cover 41.99%.
    Hi, my RepeatModeler run is very slow too, and the input genome is 300 Mb. Maybe ABBlast also uses only 1 CPU by default, so I tried to assign 10 with "-num_threads 10" as you did; however, RepeatModeler has no such option. Could you please tell me how to set this parameter in RepeatModeler/ABBlast?
    Thank you very much!
    lyn

  • sunhh
    replied
    I found another thread on SEQanswers where someone had a problem similar to mine.
    His BLAST+ alignment always dropped to 1 thread no matter how many "-num_threads" he assigned.
    Someone said this is because the query sequences are too short (only the word-matching step is multithreaded), but in my case a batch sequence in RepeatModeler (for RECON) is 40 kb. Is that still not large enough?
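
One way to confirm whether rmblastn really falls back to a single thread is to look at the thread count of the running process. The following is a minimal sketch, assuming a Linux system where /proc/<pid>/status exposes a "Threads:" field; the function name thread_count is only illustrative.

```shell
# Print the number of threads a running process currently has.
# Assumption: Linux, where /proc/<pid>/status contains a "Threads:" line.
thread_count() {
    awk '/^Threads:/ { print $2 }' "/proc/$1/status"
}

# Example usage (prints nothing if no rmblastn process is running):
#   for pid in $(pgrep -x rmblastn); do
#       echo "rmblastn pid $pid: $(thread_count "$pid") thread(s)"
#   done
```

A count above 1 only shows that the threads exist, not that they are busy, so it is worth comparing with the CPU usage reported by top.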

  • sunhh
    replied
    Originally posted by mike.t
    I believe the "Round-5" is coming from one of the programs RepeatModeler uses - It could be RepeatRunner or RECON - I don't remember. In any case, if you're using only 81 Mb then this is not normal behavior.

    There are repeats for soybean already in RepBase. Just run RepeatMasker with them. If you suspect that there are new repeats in your genome that aren't in RepBase, then run RepeatModeler or some other program on the masked genome made by RepeatMasker. It will be a lot faster and possibly won't hang.
    Thanks, Mike.t
    Yes, I read the script and I think these "rmblastn" calls are made for RECON. Although my input genome is ~973 Mb, the sample size in this round is only 82500254 bp. I think the main problem is rmblastn, because most of the time it uses only 1 CPU instead of the 20 I assigned with "-num_threads 20"!
    I have to run a de novo search because I will have other genomes to deal with. However, I ran RepeatMasker on the soybean genome and could find only 24.93% LTR elements, whereas in the genome paper LTR elements cover 41.99%.

  • mike.t
    replied
    Originally posted by sunhh
    Hi mike.t,
    I am using RepeatModeler, but it has spent 151 hours in a "Round-5" with a sample size of 81 Mb.
    The program is still running (over a week now), and I cannot estimate when it will finish.
    Is that normal? My input is the soybean genome, which is ~973 Mb.

    Thanks!
    I believe the "Round-5" is coming from one of the programs RepeatModeler uses - It could be RepeatRunner or RECON - I don't remember. In any case, if you're using only 81 Mb then this is not normal behavior.

    There are repeats for soybean already in RepBase. Just run RepeatMasker with them. If you suspect that there are new repeats in your genome that aren't in RepBase, then run RepeatModeler or some other program on the masked genome made by RepeatMasker. It will be a lot faster and possibly won't hang.
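
The order suggested here (mask with the known repeats first, then search de novo on the masked sequence) could look roughly like the sketch below. It is not a tested recipe: the file name genome.fa, the database name soy_masked, and the species string "glycine max" are placeholders, and -species assumes a RepeatMasker installation that has the RepBase libraries. The function only prints the commands so they can be reviewed before anything is run.

```shell
# Print (do not run) the commands for "mask first, then de novo search".
# $1 = genome FASTA (placeholder), $2 = RepeatModeler database name (placeholder)
mask_then_model_cmds() {
    genome=$1
    db=$2
    # 1. Mask known repeats; RepeatMasker writes <input>.masked next to the input.
    echo "RepeatMasker -species \"glycine max\" $genome"
    # 2. Build a RepeatModeler database from the masked sequence.
    echo "BuildDatabase -name $db $genome.masked"
    # 3. Run the de novo search on that database.
    echo "RepeatModeler -database $db"
}

mask_then_model_cmds genome.fa soy_masked
```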

  • DFJ111
    replied
    Originally posted by tnguyen
    Hi DFJ111 and mike.t,

    I followed the suggestions from you both, and the repeat library was built successfully.

    When I ran the first filter, the results said:

    14184 deleted. 14185 saved. 111 skipped for length.

    but the output file (contigs65fullQC2.filtered.fa.gt1k.fa.repeatscout.filter1) was empty.
    Code:
    cat /group/aquaculture/mussels/sequencing/MUSSEL1/repeatscout/contigs65fullQC2.filtered.fa.gt1k.fa.repeatscout | ./filter-stage-1.prl > contigs65fullQC2.filtered.fa.gt1k.fa.repeatscout.filter1
    Do you have any idea why?

    Thanks,
    TN
    If the problem is occurring when using
    Code:
    filter-stage-1.prl
    check that TRF and nseg are properly installed and on your PATH. I had the same problem, but I can't remember how I solved it. It's solvable though.
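
A quick way to check the PATH part of this advice is sketched below. The binary names trf and nseg are assumptions (TRF in particular is often installed under a versioned file name), and check_tools is just an illustrative helper.

```shell
# Report which of the given programs can be found on PATH.
# Returns 0 if all are found, 1 if at least one is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool -> $(command -v "$tool")"
        else
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Binary names are assumptions; adjust them to match your installation.
check_tools trf nseg || echo "fix PATH before running filter-stage-1.prl"
```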
    Last edited by DFJ111; 10-02-2012, 05:11 PM.

  • sunhh
    replied
    Originally posted by mike.t
    I haven't run RepeatScout in a while so I'm afraid I can't help you. You may want to try another de novo repeat finding program. Try piler or RepeatModeler. piler usually works pretty well on fungi, although I am using the REPET pipeline these days.
    Hi mike.t,
    I am using RepeatModeler, but it has spent 151 hours in a "Round-5" with a sample size of 81 Mb.
    The program is still running (over a week now), and I cannot estimate when it will finish.
    Is that normal? My input is the soybean genome, which is ~973 Mb.

    Thanks!

  • mike.t
    replied
    I haven't run RepeatScout in a while so I'm afraid I can't help you. You may want to try another de novo repeat finding program. Try piler or RepeatModeler. piler usually works pretty well on fungi, although I am using the REPET pipeline these days.

  • tnguyen
    replied
    Hi DFJ111 and mike.t,

    I followed the suggestions from you both, and the repeat library was built successfully.

    When I ran the first filter, the results said:

    14184 deleted. 14185 saved. 111 skipped for length.

    but the output file (contigs65fullQC2.filtered.fa.gt1k.fa.repeatscout.filter1) was empty.
    Code:
    cat /group/aquaculture/mussels/sequencing/MUSSEL1/repeatscout/contigs65fullQC2.filtered.fa.gt1k.fa.repeatscout | ./filter-stage-1.prl > contigs65fullQC2.filtered.fa.gt1k.fa.repeatscout.filter1
    Do you have any idea why?

    Thanks,
    TN
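
For this kind of problem, a simple sanity check is to count the FASTA records before and after the filter step and to test whether the output file is empty. A minimal sketch follows; the function names are illustrative and the two arguments stand for whatever input and output library files are being compared.

```shell
# Count FASTA records (header lines starting with ">") in a file.
count_fasta_records() {
    # grep -c exits non-zero when the count is 0, so ignore its exit status
    grep -c '^>' "$1" || true
}

# Compare record counts before and after a filter step.
# $1 = input library, $2 = filtered library (both placeholders)
compare_counts() {
    echo "input:  $(count_fasta_records "$1") records"
    echo "output: $(count_fasta_records "$2") records"
    # -s is true only if the file exists and is not empty
    [ -s "$2" ] || echo "warning: $2 is empty" >&2
}
```

If the input has records but the output is empty, the filter itself (or one of the programs it calls) is the place to look.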

  • DFJ111
    replied
    By the way Zimbobo, if you're doing de novo repeat element predictions you won't need existing repeat element libraries at all. You generate them yourself.

  • DFJ111
    replied
    Here's an example of a run I did successfully. I never got RepeatModeler working, and the installation of the standalone Blast program RMblast was a bit tricky. Make sure TRF and nseg are working too, for the first filtering stage below. As I understand it, RepeatModeler is basically just a wrapper for the programs below anyway.

    RepeatScout run, using yourgenome.fasta:

    Code:
    ./build_lmer_table -l 14 -sequence yourgenome.fasta -freq your_freq_table
    Build a frequency table of all 14-mers within the genome

    Code:
    ./RepeatScout -sequence yourgenome.fasta  -output your_repeats.fasta -freq your_freq_table -l 14
    Greedily extend 14-mer repeats until they diverge (see http://bix.ucsd.edu/repeatscout/repeatscout-ismb.ppt for a good explanation of this)

    Code:
    cat your_repeats.fasta | ./filter-stage-1.prl > your_repeats_filtered1.fasta
    Filter out low-complexity or tandem repeats

    Code:
    ./RepeatMasker -s -lib your_repeats_filtered1.fasta yourgenome.fasta
    Generate a masked genome using (non-low-complexity, non-tandem) repeats

    Code:
    cat your_repeats_filtered1.fasta | ./filter-stage-2.prl --cat yourgenome.fasta.out --thresh 10 > your_repeats_filtered2.fasta
    Filter out all (non-low-complexity, non-tandem) repeats that occur fewer than 10 times in the genome

    Code:
    ./RepeatMasker -pa 4 -s -lib your_repeats_filtered2.fasta -nolow -norna -no_is -gff yourgenome.fasta
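    Produce a .gff file (among other files) of the repeats, without masking low-complexity/simple repeats (-nolow) or small RNA genes (-norna), and skipping the bacterial insertion element check (-no_is).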

    Obviously you might need to modify parameters here and there to fit your requirements. The naming of the features in the resulting .gff file is a bit uninformative too.

  • tnguyen
    replied
    Thank you Mike. I will try what you suggest. Sounds like a good idea. I will let you know if it works.
    Thanks,
    TN

  • mike.t
    replied
    You probably don't need to use the whole genome for RepeatScout. Just use a few chromosomes or supercontigs. If repeats are distributed across all the chromosomes in the genome, scanning just a few of them with RepeatScout should be enough to find them and create consensus sequences that you can input to RepeatMasker. Then, mask the whole genome with RepeatMasker.
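
Taking a subset of the assembly can be done with standard tools. The sketch below keeps the first N records of a FASTA file; the function name and the file names in the usage comment are placeholders, and the first records in a file are not necessarily the largest or most representative ones, so which sequences to keep is a judgment call.

```shell
# Print the first N records of a FASTA file.
# $1 = number of records to keep, $2 = input FASTA
first_n_records() {
    # Count header lines; stop reading as soon as record N+1 starts.
    awk -v n="$1" '/^>/ { if (++c > n) exit } c { print }' "$2"
}

# Example usage (placeholder file names):
#   first_n_records 5 yourgenome.fasta > subset.fasta
```

The subset can then be given to build_lmer_table and RepeatScout in place of the full genome, and the resulting library used to mask the whole genome.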

  • tnguyen
    replied
    Thank you Rahul,
    My genome is ~1.7 Gb. Any idea how to make RepeatScout work on a large genome?
    TN

  • rahularjun86
    replied
    Hi tnguyen,
    Sorry for the late reply. One genome was ~20 Mb and the other was in the Gb range. I ran it on a cluster and didn't check how much memory it used.
    Best wishes,
    Rahul

  • tnguyen
    replied
    Sorry, the full error message was:

    "Could not allocate space for sequence"
    Last edited by tnguyen; 09-22-2012, 07:02 AM.
