RepeatMasker & RepeatScout

  • #16
    Thank you Mike. I will try what you suggest. Sounds like a good idea. I will let you know if it works.
    Thanks,
    TN



    • #17
      Here's an example of a run I did successfully. I never got RepeatModeler working, and the installation of the standalone Blast program RMblast was a bit tricky. Make sure TRF and nseg are working too, for the first filtering stage below. As I understand it, RepeatModeler is basically just a wrapper for the programs below anyway.

      RepeatScout run, using yourgenome.fasta:

      Code:
      ./build_lmer_table -l 14 -sequence yourgenome.fasta -freq your_freq_table
      Build a frequency table of all 14-mers in the genome (your_freq_table is the file passed to RepeatScout in the next step)

      Code:
      ./RepeatScout -sequence yourgenome.fasta -output your_repeats.fasta -freq your_freq_table -l 14
      Greedily extend 14-mer repeats until they diverge (see http://bix.ucsd.edu/repeatscout/repeatscout-ismb.ppt for a good explanation of this)

      Code:
      cat your_repeats.fasta| ./filter-stage-1.prl >your_repeats_filtered1.fasta
      Filter out low-complexity or tandem repeats

      Code:
      ./RepeatMasker -s -lib your_repeats_filtered1.fasta yourgenome.fasta
      Generate a masked genome using (non-low-complexity, non-tandem) repeats

      Code:
      cat your_repeats_filtered1.fasta | ./filter-stage-2.prl --cat yourgenome.fasta.out --thresh 10 > your_repeats_filtered2.fasta
      Filter out all (non-low-complexity, non-tandem) repeats that occur fewer than 10 times in the genome

      Code:
      ./RepeatMasker -pa 4 -s -lib your_repeats_filtered2.fasta -nolow -norna -no_is -gff yourgenome.fasta
      Produce a .gff file (among other files) of all non-low-complexity, non-tandem, non-rRNA repeats.

      Obviously you might need to modify parameters here and there to fit your requirements. The naming of the features in the resulting .gff file is a bit uninformative too.
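The six steps above can be collected into one shell function. This is only a sketch under assumptions: the tool paths and file names are the placeholders from the post, the stage-2 line writes its result to your_repeats_filtered2.fasta through a shell redirect, and the names `run`, `repeatscout_pipeline` and `DRY_RUN` are hypothetical helpers, not part of RepeatScout or RepeatMasker. By default it only prints the commands so they can be reviewed first.

```shell
#!/bin/sh
# Sketch only: run(), repeatscout_pipeline() and DRY_RUN are hypothetical helpers.
# Tool paths and file names are the placeholders used in the post above.

# Print a command line; execute it only when DRY_RUN=0.
run() {
    echo "+ $1"
    if [ "${DRY_RUN:-1}" = 0 ]; then
        sh -c "$1"
    fi
}

repeatscout_pipeline() {
    genome=$1          # genome FASTA (file name without spaces)
    l=${2:-14}         # l-mer size
    thresh=${3:-10}    # minimum copy number kept by filter stage 2

    run "./build_lmer_table -l $l -sequence $genome -freq your_freq_table" &&
    run "./RepeatScout -sequence $genome -output your_repeats.fasta -freq your_freq_table -l $l" &&
    run "cat your_repeats.fasta | ./filter-stage-1.prl > your_repeats_filtered1.fasta" &&
    run "./RepeatMasker -s -lib your_repeats_filtered1.fasta $genome" &&
    run "cat your_repeats_filtered1.fasta | ./filter-stage-2.prl --cat $genome.out --thresh $thresh > your_repeats_filtered2.fasta" &&
    run "./RepeatMasker -pa 4 -s -lib your_repeats_filtered2.fasta -nolow -norna -no_is -gff $genome"
}
```

Calling `repeatscout_pipeline yourgenome.fasta` prints the six command lines; `DRY_RUN=0 repeatscout_pipeline yourgenome.fasta` runs them and stops at the first step that fails.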



      • #18
        By the way Zimbobo, if you're doing de novo repeat element predictions you won't need existing repeat element libraries at all. You generate them yourself.



        • #19
          Hi DFJ111 and mike.t,

          I followed the suggestions from you both, and the repeat library was built successfully.

          When I ran the first filter, the results said:

          14184 deleted. 14185 saved. 111 skipped for length.

          but the output file (contigs65fullQC2.filtered.fa.gt1k.fa.repeatscout.filter1) was empty.
          Code:
          cat /group/aquaculture/mussels/sequencing/MUSSEL1/repeatscout/contigs65fullQC2.filtered.fa.gt1k.fa.repeatscout | ./filter-stage-1.prl > contigs65fullQC2.filtered.fa.gt1k.fa.repeatscout.filter1
          Do you have any idea why?

          Thanks,
          TN



          • #20
            I haven't run RepeatScout in a while so I'm afraid I can't help you. You may want to try another de novo repeat finding program. Try piler or RepeatModeler. piler usually works pretty well on fungi, although I am using the REPET pipeline these days.



            • #21
              Originally posted by mike.t View Post
              I haven't run RepeatScout in a while so I'm afraid I can't help you. You may want to try another de novo repeat finding program. Try piler or RepeatModeler. piler usually works pretty well on fungi, although I am using the REPET pipeline these days.
              Hi mike.t,
              I am using RepeatModeler, but it has spent 151 hours in "Round-5" with a sample size of 81 Mb.
              The program is still running (over a week now), and I cannot estimate when it will finish.
              Is that normal? My input is the soybean genome, which is ~973 Mb.

              Thanks!



              • #22
                Originally posted by tnguyen View Post
                Hi DFJ111 and mike.t,

                I followed the suggestions from you both, the repeat library was successfully built.

                When I ran the first filter, the results said:

                14184 deleted. 14185 saved. 111 skipped for length.

                but the output file (contigs65fullQC2.filtered.fa.gt1k.fa.repeatscout.filter1) was empty.
                Code:
                cat /group/aquaculture/mussels/sequencing/MUSSEL1/repeatscout/contigs65fullQC2.filtered.fa.gt1k.fa.repeatscout | ./filter-stage-1.prl > contigs65fullQC2.filtered.fa.gt1k.fa.repeatscout.filter1
                Do you have any idea why?

                Thanks,
                TN
                If the problem occurs when running filter-stage-1.prl, check that TRF and nseg are properly installed and on your PATH. I had the same problem but I can't actually remember how I solved it. It's solvable though.
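A quick way to test both points is sketched below. The function names are hypothetical, and the binary names `trf` and `nseg` are assumptions: TRF downloads in particular often carry a versioned file name, so pass whatever name your install uses, and check the top of filter-stage-1.prl for how it locates the two programs.

```shell
# Sketch only: check_tools() and count_fasta_records() are hypothetical helpers.

# Report which programs are missing from PATH; returns non-zero if any is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Count FASTA records (lines starting with '>') in a file; prints 0 for an empty file.
count_fasta_records() {
    grep -c '^>' "$1" || true
}
```

For example, `check_tools trf nseg` shows whether either program is missing, and `count_fasta_records your_repeats_filtered1.fasta` shows whether the filter wrote any sequences.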
                Last edited by DFJ111; 10-02-2012, 05:11 PM.



                • #23
                  Originally posted by sunhh View Post
                  Hi mike.t,
                  I am using RepeatModeler, but it took 151 hours in a "Round-5" with sample size 81 Mb.
                  The program is still running (over a week), and I can not estimate when it will finish.
                  Is that a normal case? Although I am using soybean genome sizing ~ 973 Mb.

                  Thanks!
                  I believe the "Round-5" is coming from one of the programs RepeatModeler uses - It could be RepeatRunner or RECON - I don't remember. In any case, if you're using only 81 Mb then this is not normal behavior.

                  There are repeats for soybean already in RepBase. Just run RepeatMasker with them. If you suspect that there are new repeats in your genome that aren't in RepBase, then run RepeatModeler or some other program on the masked genome made by RepeatMasker. It will be a lot faster and possibly won't hang.
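The two-stage approach described here could look roughly like the following. This is a sketch, not a tested recipe: `masked_denovo_plan` is a hypothetical helper that only prints the commands, the species name you pass must be one your RepeatMasker library recognises, and it assumes RepeatMasker writes the masked sequence next to the input as `<genome>.masked`.

```shell
# Sketch only: masked_denovo_plan() is a hypothetical helper that prints a plan.
masked_denovo_plan() {
    genome=$1     # genome FASTA
    species=$2    # species or clade name for RepeatMasker, e.g. "glycine max"
    db=$3         # name for the RepeatModeler database
    cat <<EOF
RepeatMasker -pa 4 -species "$species" $genome
BuildDatabase -name $db $genome.masked
RepeatModeler -database $db
EOF
}
```

Running `masked_denovo_plan yourgenome.fasta "glycine max" soy_db` prints three lines to review; pipe the output to `sh` to execute them.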



                  • #24
                    Originally posted by mike.t View Post
                    I believe the "Round-5" is coming from one of the programs RepeatModeler uses - It could be RepeatRunner or RECON - I don't remember. In any case, if you're using only 81 Mb then this is not normal behavior.

                    There are repeats for soybean already in RepBase Just run RepeatMasker with them. If you suspect that there are new repeats in your genome that aren't in RepBase, then run RepeatModeler or some other program on the masked genome made by RepeatMasker. It will be a lot faster and possibly won't hang.
                    Thanks, mike.t.
                    Yes, I read the script, and I think these "rmblastn" runs are called for RECON. Although my input genome is ~973 Mb, the sample size in this round is only 82500254 bp. I think the main problem is rmblastn, because most of the time it uses only 1 CPU instead of the 20 I assigned with "-num_threads 20"!
                    I have to run a de novo search because I will have other genomes to deal with. However, I did run RepeatMasker on the soybean genome, and it finds only 24.93% LTR elements, while in the genome paper LTR elements cover 41.99%.



                    • #25
                      I found another thread on SEQanswers where someone had a problem similar to mine.
                      His blast+ alignment always dropped to 1 thread no matter what value he gave "-num_threads".
                      Someone said this happens because the query sequences are too short (only the word-matching step is multithreaded), but in my case a sequence batch in RepeatModeler (for RECON) is 40 kb. Is that still not large enough?



                      • #26
                        Originally posted by sunhh View Post
                        Thanks, Mike.t
                        Yes, I read the script and I think these "rmblastn" are called for RECON. Although my input genome is ~973 Mb, in this round the sample size is only 82500254 bp. I think the main problem should be caused by rmblastn, because most of time it only use 1 cpu instead of 20 I assigned by "-num_threads 20"!
                        I have to run de novo search, because I will have some other genome to deal with. However, I had run RepeatMasker on soybean genome, and I can only find 24.93 % of LTR. While in the genome paper, LTR elements covers 41.99%.
                        Hi, my RepeatModeler run is very slow too, and the input genome is 300 Mb. Maybe ABBlast also uses only 1 CPU by default, so I tried to assign 10 with "-num_threads 10" like you did; however, RepeatModeler has no such option. Could you please tell me how to set this parameter in RepeatModeler/ABBlast?
                        Thank you very much!
                        lyn



                        • #27
                          Originally posted by Lyn Hsiong View Post
                          Hi, my repeatmoderler run very slowly too, and the input genome is 300M. Maybe the abblast, by default, also used only 1 cpu, so I assigned 10 by "-num_threads 10" like you, however, the repeatmoderler contained no this option. Could you pls tell me how to set the parameter in repeatmoderler/abblast.
                          Thank you very much!
                          lyn
                          Modifying the threads value for rmblast helps little. But you can do it in the .pm file (you can find that file by grepping for "threads" in the .pm files). Just wait for less than two weeks, and you will get the final result.
                          Good luck!
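"Grepping for threads" means searching the text of the Perl module files (*.pm) for the word "threads". A minimal sketch follows; the function name `find_thread_settings` is hypothetical, and the directory you pass is wherever RepeatModeler is installed on your system.

```shell
# Sketch only: find_thread_settings() is a hypothetical helper.
# Print every line containing "threads" in the *.pm files under a directory,
# prefixed with the file name and line number.
find_thread_settings() {
    dir=${1:-.}
    # /dev/null forces grep to print file names even if only one .pm file is found.
    find "$dir" -name '*.pm' -exec grep -n 'threads' /dev/null {} +
}
```

For example, `find_thread_settings /path/to/RepeatModeler` lists candidate lines as `file:line:text`, which shows which module to open in an editor.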



                          • #28
                            Originally posted by sunhh View Post
                            It helps little to modify the threads value for rmblast. But you can do it in the .pm file (you can fing that file by grep threads in .pm files). Just wait for less than two weeks, and you will get final result.
                            Good luck!
                            Thank you very much! But I don't know how to deal with the .pm file (I suppose you meant the file "RepModelConfig.pm"). That file only contains the paths of pre-installed programs (perl, RECON, RepeatMasker and so on), so where can I set the threads value? And could you please tell me what "grep threads" means exactly? Thank you!



                            • #29
                              RepeatModeler error when building database

                              I have installed RepeatModeler, but when I build the database with

                              ./BuildDatabase -name test test.fa

                              it shows the error below, and the RepModelConfig.pm file is empty:

                              RepModelConfig.pm did not return a true value at ./BuildDatabase line 146.
                              BEGIN failed--compilation aborted at ./BuildDatabase line 146.

                              Can anyone help me find the cause of the error?

                              Thanks..
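Perl reports "did not return a true value" when a loaded module file does not end by evaluating to something true, and an empty file is the simplest way to hit that. The check below is a small sketch; `check_config_module` is a hypothetical name. If the file turns out to be empty, regenerating it with the configure step of the RepeatModeler installation is the usual suggestion, which is an assumption about your install rather than something confirmed in this thread.

```shell
# Sketch only: check_config_module() is a hypothetical helper.
# Return 0 if the config module exists and is non-empty, 1 otherwise.
check_config_module() {
    file=${1:-RepModelConfig.pm}
    if [ -s "$file" ]; then
        echo "$file is present and non-empty"
    else
        echo "$file is missing or empty; it needs to be regenerated"
        return 1
    fi
}
```

Run `check_config_module RepModelConfig.pm` from the RepeatModeler directory before calling BuildDatabase.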
                              Last edited by amitbik; 01-22-2014, 11:18 PM.



                              • #30
                                Hi DFJ111,

                                I followed your steps and it worked fine, but in the .tbl file I am getting this output:

                                file name: file.fa
                                sequences:        336145
                                total length:  330872632 bp  (330872632 bp excl N/X-runs)
                                GC level:         39.43 %
                                bases masked:  199587278 bp ( 60.32 %)
                                ==================================================
                                               number of      length   percentage
                                               elements*    occupied  of sequence
                                --------------------------------------------------
                                SINEs:                0            0 bp    0.00 %
                                      ALUs            0            0 bp    0.00 %
                                      MIRs            0            0 bp    0.00 %

                                LINEs:                0            0 bp    0.00 %
                                      LINE1           0            0 bp    0.00 %
                                      LINE2           0            0 bp    0.00 %
                                      L3/CR1          0            0 bp    0.00 %

                                LTR elements:         0            0 bp    0.00 %
                                      ERVL            0            0 bp    0.00 %
                                      ERVL-MaLRs      0            0 bp    0.00 %
                                      ERV_classI      0            0 bp    0.00 %
                                      ERV_classII     0            0 bp    0.00 %

                                DNA elements:         0            0 bp    0.00 %
                                      hAT-Charlie     0            0 bp    0.00 %
                                      TcMar-Tigger    0            0 bp    0.00 %

                                Unclassified:    866174    216405375 bp   65.40 %

                                Total interspersed repeats: 216405375 bp   65.40 %


                                Small RNA:            0            0 bp    0.00 %

                                Satellites:           0            0 bp    0.00 %
                                Simple repeats:   51195      2109015 bp    0.64 %
                                Low complexity:       0            0 bp    0.00 %
                                ==================================================

                                * most repeats fragmented by insertions or deletions
                                  have been counted as one element


                                The query species was assumed to be homo
                                RepeatMasker version open-4.0.3 , sensitive mode

                                run with rmblastn version 2.2.27+
                                The query was compared to unclassified sequences in ".../repeats_1.fa"
                                RepBase Update 20130422, RM database version 20130422

                                Can you tell me why most of the values in the output are 0?

                                Thanks in advance...
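A likely reason, going by the line "The query was compared to unclassified sequences" in the report, is that the custom library carries no class labels, so every hit is counted under "Unclassified". RepeatMasker reads the class from FASTA headers of the form `>name#class/subclass`. The sketch below counts how many headers in a library have such a label; `count_classified` is a hypothetical helper, and the header convention should be checked against the RepeatMasker documentation for your version.

```shell
# Sketch only: count_classified() is a hypothetical helper.
# Count FASTA headers with and without a "#class" suffix in a repeat library.
count_classified() {
    lib=$1
    total=$(grep -c '^>' "$lib" || true)
    # A labelled header has '#' inside its first whitespace-delimited word.
    classified=$(grep -c '^>[^[:space:]]*#' "$lib" || true)
    echo "classified=$classified unclassified=$((total - classified))"
}
```

If `count_classified your_repeats_filtered2.fasta` reports `classified=0`, the zeros in the SINE, LINE, LTR and DNA rows are expected, and the repeats would need to be classified separately before those rows fill in.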
