  • roll
    Member
    • Aug 2009
    • 38

    RepeatMasker for 7.5 GB of FASTA data

    Hi,

    I am trying to run RepeatMasker on a complete FASTA file. The file is 7.5 GB and I would like to identify all repeats.

    I am using a cluster and requested 50 GB of memory, but it still complains that it is out of memory.

    I am using the following options:

    RepeatMasker -e crossmatch -q -species mouse -no_is -dir . -html -gff *.fasta

    How can I run this on the whole FASTA file, and make it fast as well?
  • dpryan
    Devon Ryan
    • Jul 2011
    • 3478

    #2
    RepeatMasker isn't exactly known for its speed. Since you have a cluster, your best option is to split the FASTA file by chromosome/contig and run the pieces on different nodes. You can then merge the results back together. In fact, I believe this is how the repeat-masked files that are available from UCSC et al. were produced.
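
A minimal sketch of the splitting step described above, assuming the input is a multi-record FASTA file and that an even round-robin distribution of records across jobs is acceptable. The function names and the `prefix` naming scheme are hypothetical and not part of RepeatMasker.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, seq = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        elif line:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)


def split_fasta(lines, n_chunks):
    """Distribute FASTA records round-robin into n_chunks lists."""
    chunks = [[] for _ in range(n_chunks)]
    for i, record in enumerate(read_fasta(lines)):
        chunks[i % n_chunks].append(record)
    return chunks


def write_chunks(chunks, prefix):
    """Write each chunk to '<prefix>_<index>.fa' (hypothetical naming)."""
    paths = []
    for i, chunk in enumerate(chunks):
        path = f"{prefix}_{i}.fa"
        with open(path, "w") as out:
            for header, seq in chunk:
                out.write(f">{header}\n{seq}\n")
        paths.append(path)
    return paths
```

Each resulting file can then be submitted as its own cluster job, and the per-chunk outputs concatenated afterwards. This holds all records in memory, so for a very large input it would be better to write records out as they are read.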


    • roll
      Member
      • Aug 2009
      • 38

      #3
      Thanks dpryan, but I do not know how to identify the chromosomes or contigs in my data.

      The header of the FASTA file looks something like
      >HS15_6922:7:2307:21180:13152#6/1

      I am not sure how I can partition it using the above information. Can you please advise?


      • dpryan
        Devon Ryan
        • Jul 2011
        • 3478

        #4
        Originally posted by roll
        Thanks dpryan, but I do not know how to identify the chromosomes or contigs in my data.

        The header of the FASTA file looks something like
        >HS15_6922:7:2307:21180:13152#6/1

        I am not sure how I can partition it using the above information. Can you please advise?
        That looks like the read name for a FASTQ read, not a contig name. Are you sure this file is FASTA?


        • roll
          Member
          • Aug 2009
          • 38

          #5
          I converted FASTQ to FASTA myself. The original FASTQ has headers like

          @HS15_6922:7:2307:21180:13152#6/1
          mySequenceHere
          +
          CBFFJ=BJIIJKFKHFLIJJLIIAGLCKKKIHKEKJKJ9JEKQJ;MJIJHNKLHKLHJI=KJ5DFCEIB+H?4?A?I31<FE=>ACG?F?A576;>./
          Last edited by roll; 09-30-2013, 01:56 AM.
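
For reference, a minimal sketch of the kind of FASTQ-to-FASTA conversion described here, assuming strict four-line FASTQ records (no wrapped sequences). The function name is hypothetical. The conversion carries over only the read name, so the output has no chromosome or contig information.

```python
def fastq_to_fasta(lines):
    """Yield FASTA lines from an iterable of four-line FASTQ records.

    Records are read strictly four lines at a time, so a quality string
    that happens to start with '@' is never mistaken for a header.
    """
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        if not header.startswith("@"):
            raise ValueError(f"expected a FASTQ header, got: {header!r}")
        seq = next(it).rstrip("\n")
        next(it)  # '+' separator line, ignored
        next(it)  # quality string, dropped in FASTA
        yield ">" + header[1:]
        yield seq
```

Applied to the record above, the header becomes `>HS15_6922:7:2307:21180:13152#6/1`, which is exactly the read name seen in the converted file.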


          • roll
            Member
            • Aug 2009
            • 38

            #6
            Originally posted by dpryan
            That looks like the read name for a FASTQ read, not a contig name. Are you sure this file is FASTA?
            What is the best way to convert FASTQ to FASTA, then, so that I keep the chromosome information?


            • dpryan
              Devon Ryan
              • Jul 2011
              • 3478

              #7
              Originally posted by roll
              What is the best way to convert FASTQ to FASTA, then, so that I keep the chromosome information?
              You don't want to repeat mask that file (you could, but the results would be completely useless). What is the actual biological question you're trying to answer? From context, I'm guessing that this is an organism that hasn't been sequenced before and you'd like to determine its repeat structure or something like that. If that's the case, you need to de novo assemble the genome first. That will produce a proper FASTA file that can be meaningfully repeat masked.


              • roll
                Member
                • Aug 2009
                • 38

                #8
                Originally posted by dpryan
                You don't want to repeat mask that file (you could, but the results would be completely useless). What is the actual biological question you're trying to answer? From context, I'm guessing that this is an organism that hasn't been sequenced before and you'd like to determine its repeat structure or something like that. If that's the case, you need to de novo assemble the genome first. That will produce a proper FASTA file that can be meaningfully repeat masked.
                Hmm, that is an interesting point. It is mouse data that I am dealing with, so it has definitely been sequenced before.
                I am not an expert in the field and am still learning. My boss would like to know whether, and how many, retrotransposons (L1, SINE, etc.) are found in the data that we generated. Maybe there is a better way to analyse this than RepeatMasker?


                • dpryan
                  Devon Ryan
                  • Jul 2011
                  • 3478

                  #9
                  Originally posted by roll
                  Hmm, that is an interesting point. It is mouse data that I am dealing with, so it has definitely been sequenced before.
                  I am not an expert in the field and am still learning. My boss would like to know whether, and how many, retrotransposons (L1, SINE, etc.) are found in the data that we generated. Maybe there is a better way to analyse this than RepeatMasker?
                  Ah, actually repeat masking the reads won't do what you want then. What you want to do is align your reads to the mouse genome and then download the RepeatMasker output from UCSC. There are then a number of ways to compare your alignments to where the repeats are (e.g., visually inspecting things with IGV, or using bedtools or something similar to intersect the alignments with the repeats).
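
To make the intersection idea concrete, here is a minimal pure-Python sketch that counts how many aligned reads overlap at least one annotated repeat. It is a stand-in for what an interval tool such as bedtools does, not a replacement for it. It assumes both inputs have already been reduced to `(chrom, start, end)` tuples in 0-based, half-open coordinates (as in BED); the function names are hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict


def merge_intervals(intervals):
    """Merge overlapping or touching (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def count_reads_in_repeats(repeats, reads):
    """Return (reads overlapping any repeat, total reads)."""
    by_chrom = defaultdict(list)
    for chrom, start, end in repeats:
        by_chrom[chrom].append((start, end))

    # Per chromosome: sorted start positions plus the merged intervals
    index = {}
    for chrom, intervals in by_chrom.items():
        merged = merge_intervals(intervals)
        index[chrom] = ([s for s, _ in merged], merged)

    hits = total = 0
    for chrom, start, end in reads:
        total += 1
        if chrom not in index:
            continue
        starts, merged = index[chrom]
        # Last merged interval that begins before the read ends
        i = bisect_right(starts, end - 1) - 1
        if i >= 0 and merged[i][1] > start:
            hits += 1
    return hits, total
```

Because the repeat intervals are merged first, each read is counted at most once even when it spans several annotated repeats.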


                  • roll
                    Member
                    • Aug 2009
                    • 38

                    #10
                    Originally posted by dpryan
                    Ah, actually repeat masking the reads won't do what you want then. What you want to do is align your reads to the mouse genome and then download the RepeatMasker output from UCSC. There are then a number of ways to compare your alignments to where the repeats are (e.g., visually inspecting things with IGV, or using bedtools or something similar to intersect the alignments with the repeats).
                    Great, so we are getting there.

                    I would like to have something with numbers rather than examining it visually.

                    How do I get the RepeatMasker output from UCSC? Do I upload my SAM file and then tick a RepeatMasker option?

                    What about bedtools? Which option should I look for?


                    • roll
                      Member
                      • Aug 2009
                      • 38

                      #11
                      Originally posted by roll
                      Great, so we are getting there.

                      I would like to have something with numbers rather than examining it visually.

                      How do I get the RepeatMasker output from UCSC? Do I upload my SAM file and then tick a RepeatMasker option?

                      What about bedtools? Which option should I look for?
                      I have used TopHat2 for mapping. Shall I use the BAM file? The output files are:

                      accepted_hits.bam
                      align_summary.txt
                      deletions.bed
                      insertions.bed
                      junctions.bed
                      logs
                      prep_reads.info
                      unmapped.bam



                        • dpryan
                          Devon Ryan
                          • Jul 2011
                          • 3478

                          #13
                          Assuming that you're using the mm10 reference, you can download the RepeatMasker output here (mm9 is here). The general idea is to extract the type of feature(s) you want from the RepeatMasker .out file, convert that to BED format, and use "bedtools intersect ..." to get a count of how many reads align there. There are many other ways to do this, but that should work.

                          In fact, a more straightforward way might be simply to run Cufflinks on your alignments and then intersect the novel transcripts it finds with the RepeatMasker output file. That might end up being easier.
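
A minimal sketch of the "extract the features and convert to BED" step, assuming the standard whitespace-separated RepeatMasker .out layout (score, divergence, deletions, insertions, query sequence, begin, end, left, strand, repeat name, class/family, ...), with 1-based positions and `C` marking the complement strand. The function name and the default class filter are illustrative choices, not anything prescribed by RepeatMasker.

```python
def rm_out_to_bed(lines, wanted_classes=("LINE/L1", "SINE")):
    """Yield BED6 tuples for .out rows whose class/family is wanted.

    A wanted entry matches either the full class/family string
    (e.g. 'LINE/L1') or a whole class (e.g. 'SINE' matches 'SINE/B2').
    """
    for line in lines:
        fields = line.split()
        # Header and blank lines do not start with an integer score
        if len(fields) < 15 or not fields[0].isdigit():
            continue
        rep_name, rep_class = fields[9], fields[10]
        if not any(
            rep_class == w or rep_class.startswith(w + "/")
            for w in wanted_classes
        ):
            continue
        chrom = fields[4]
        start = int(fields[5]) - 1  # .out is 1-based, BED is 0-based
        end = int(fields[6])
        strand = "-" if fields[8] == "C" else "+"
        yield (chrom, start, end, f"{rep_name}|{rep_class}", fields[0], strand)
```

The tuples can be written out tab-separated, one per line, to give a BED file for the intersection step. The score column here simply carries the Smith-Waterman score through and is not rescaled to the 0-1000 range that the BED specification expects.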


                          • lh3
                            Senior Member
                            • Feb 2008
                            • 686

                            #14
                            BTW, you can find more detailed RepeatMasker results here:

                            http://hgdownload.soe.ucsc.edu/goldenPath/mm10/database/rmsk.{sql,txt.gz}

                            It is easy to convert this file to BED, I believe.
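
A minimal sketch of that conversion, assuming the tab-separated column order of the UCSC rmsk table (bin, swScore, milliDiv, milliDel, milliIns, genoName, genoStart, genoEnd, genoLeft, strand, repName, repClass, repFamily, repStart, repEnd, repLeft, id). This order is an assumption worth checking against the accompanying rmsk.sql. The function names are hypothetical.

```python
import gzip


def rmsk_rows_to_bed(lines):
    """Yield BED6 tuples from tab-separated rmsk table rows."""
    for line in lines:
        f = line.rstrip("\n").split("\t")
        if len(f) < 17:
            continue  # skip blank or truncated rows
        chrom, start, end = f[5], int(f[6]), int(f[7])
        name = f"{f[10]}|{f[11]}|{f[12]}"  # repName|repClass|repFamily
        # genoStart is already 0-based, so no coordinate shift is applied
        yield (chrom, start, end, name, f[1], f[9])


def rmsk_file_to_bed(path_in, path_out):
    """Convert a gzipped rmsk dump to a BED file."""
    with gzip.open(path_in, "rt") as fin, open(path_out, "w") as fout:
        for record in rmsk_rows_to_bed(fin):
            fout.write("\t".join(map(str, record)) + "\n")
```

For example, `rmsk_file_to_bed("rmsk.txt.gz", "rmsk.bed")` would write one BED line per annotated repeat. Filtering on the class or family part of the name column then selects L1 or SINE elements.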


                            • roll
                              Member
                              • Aug 2009
                              • 38

                              #15
                              Originally posted by dpryan
                              Assuming that you're using the mm10 reference, you can download the RepeatMasker output here (mm9 is here). The general idea is to extract the type of feature(s) you want from the RepeatMasker .out file, convert that to BED format, and use "bedtools intersect ..." to get a count of how many reads align there. There are many other ways to do this, but that should work.

                              In fact, a more straightforward way might be simply to run Cufflinks on your alignments and then intersect the novel transcripts it finds with the RepeatMasker output file. That might end up being easier.
                              Thanks a lot. This has been very helpful so far. I am trying what you have suggested and will let you know how it goes.

                              Do you know where I can download genes and their coordinates in BED format? (Alternatively, how can I assign gene names to my BED file?)
