Seqanswers Leaderboard Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • krafiq
    Junior Member
    • Jul 2013
    • 9

    How to convert genoma.fa to chr*.fa

    I am working with a genome file which is not on the USC genome browser. I have my own Genome.fa file and I want to separate the genome fasta file into chromosomes, so that there is one chr*.fa file for each chromosome.

    Could someone please guide me as to how? I am just beginning to use Linux!

    Thanks so much!
  • SNPsaurus
    Registered Vendor
    • May 2013
    • 525

    #2
    Hi krafiq,

    Here is some Perl to do that. You just need to save it (as split.pl) to the folder you have your fasta file, then in the command line, write "perl split.pl genome.fa" where genome.fa is the name of your file. It will save a file for whatever is used as the header (I'm assuming it looks like ">chr1".

    Code:
    #!/usr/bin/perl
    
    $f = $ARGV[0]; #get the file name
    
     open (INFILE, "<$f")
    	or die "Can't open: $f $!";
    		
    while (<INFILE>) {
    	$line = $_;
    	chomp $line;
    	if ($line =~ /\>/) { #if has fasta >
    		close OUTFILE;
    		$new_file = substr($line,1);
    		$new_file .= ".fa";
    		open (OUTFILE, ">$new_file")
    			or die "Can't open: $new_file $!";
    	}
    	print OUTFILE "$line\n";
    }
    close OUTFILE;
    Providing nextRAD genotyping and PacBio sequencing services. http://snpsaurus.com

    Comment

    • krafiq
      Junior Member
      • Jul 2013
      • 9

      #3
      Dear SNPsaurus,

      Thanks for your response! Howevr, my genome.fa file contains scaffold files and not chr files. So if I use your code above, it would create one file per scaffold. How do I get to one chr*.fa file for each chromosome from here?

      Comment

      • dpryan
        Devon Ryan
        • Jul 2011
        • 3478

        #4
        Do you mean to say that there are no ">chr..." lines in the Genome.fa file, only ">scaffold..." or something like that? If that's the case then there's no way to do it without assembling the scaffolds into chromosomes (which would take more sequencing and likely some wet-bench experiments).

        If you mean to say that you have a mixture of ">chr..." and ">scaffold..." entries, then you can just modify the perl script above (just add a variable to store the context that you're in and only write output if you're in a ">chr..." context).

        Comment

        • krafiq
          Junior Member
          • Jul 2013
          • 9

          #5
          Dear dpryan,

          I'm trying to use the hotspot software and I'm trying to run their enumerateUniquelyMappableSpace script on my custom genome file. However, we don't have chromosome information for my genome. The genome.fa file only has >scaffolds. Is there a way to run hotspot without separating the genome fasta file into chromosomes/obtaining chr*.fa for every chromosome?

          Comment

          • EricHaugen
            Member
            • Sep 2009
            • 13

            #6
            Two options:

            1. Change "chr" to "scaffold" in the enumerateUniquelyMappableSpace bash wrapper script, to list the individual fasta files.

            2. Just run the whole genome fasta file, after building a bowtie index, with:

            enumerateUniquelyMappableSpace.pl read_length bowtie_index_prefix genome.fa | sort-bed - | bedops -m - > genome.read_length.mappable_only.bed

            If "sort-bed" runs out of memory here, the BEDOPS suite includes a "bbms" script that can be used in place of sort-bed.

            Comment

            • fengqi
              Member
              • Aug 2012
              • 10

              #7
              Did you try
              'samtools faidx genome.fasta chrX > chrX.fasta'

              Comment

              • krafiq
                Junior Member
                • Jul 2013
                • 9

                #8
                Thanks all!!

                EricHaugen: I'm trying option 1 for now. I'm trying to run the script again with bowtie and bowtie-build in the same folder as the script. But it's giving me the following error:
                ./enumerateUniquelyMappableSpace: line 30: bowtie-build: command not found

                And then it goes on to give the following error multiple times:
                Failed to find bowtie index file Genome.1.ebwt

                Does anyone know why and what I should change?

                Thanks!
                Last edited by krafiq; 07-25-2013, 09:02 PM.

                Comment

                • EricHaugen
                  Member
                  • Sep 2009
                  • 13

                  #9
                  It looks like "bowtie-build" isn't in your PATH, so the shell couldn't find it.

                  Try adding a line near the top of "enumerateUniquelyMappableSpace" like:

                  export PATH=$PATH:/location/of/this/script/folder

                  Then it should be able to find bowtie-build, and the Perl script it calls later will be able to find your bowtie executable there also.

                  Comment

                  • sivasubramani
                    Member
                    • Apr 2011
                    • 13

                    #10
                    In HHblits package, there is a script which does the way you want.

                    HHblits_src/scripts/splitfasta.pl

                    Comment

                    • krafiq
                      Junior Member
                      • Jul 2013
                      • 9

                      #11
                      My enumerateUniquelyMappableSpace script is calling this line:
                      bowtie-build $chromosomeFiles $genome

                      The chromosome files variable in this case is a list of 30,000 file names. So when I run the script, it gives me the following error:

                      Argument list too long

                      is there a way around this?

                      Comment

                      • dpryan
                        Devon Ryan
                        • Jul 2011
                        • 3478

                        #12
                        Try concatenating the files together first (you'll have to do it in a couple batches, since the command will be too long for "cat" too) and just use the multifasta file.

                        Comment

                        • krafiq
                          Junior Member
                          • Jul 2013
                          • 9

                          #13
                          dpryan: I had to split the genome.fa file to get the individual files in the first place to use the hotspot software. should i still cat them? won't that bring it back to genome.fa?

                          Comment

                          • dpryan
                            Devon Ryan
                            • Jul 2011
                            • 3478

                            #14
                            Well, if you read through the perl scripts, it'll become pretty apparent that they only designed hotspot around human/mouse/etc. genomes (rather than your situation with scaffolds), so you're probably going to have to just edit the script. It's just trying to run bowtie-build, which will effectively concatentate everything together anyway (it looks like they normally use individual chromosome files so things can more easily be split to later run on a cluster). The script is pretty simple, so go ahead and change it to suite your usage needs.

                            Comment

                            • krafiq
                              Junior Member
                              • Jul 2013
                              • 9

                              #15
                              dpryan: I'm sorry-could you please clarify a bit as to what exactly I should do?
                              Also, is there a way to get the source code for bowtie?

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                New Genomics Tools and Methods Shared at AGBT 2025
                                by seqadmin


                                This year’s Advances in Genome Biology and Technology (AGBT) General Meeting commemorated the 25th anniversary of the event at its original venue on Marco Island, Florida. While this year’s event didn’t include high-profile musical performances, the industry announcements and cutting-edge research still drew the attention of leading scientists.

                                The Headliner
                                The biggest announcement was Roche stepping back into the sequencing platform market. In the years since...
                                03-03-2025, 01:39 PM
                              • seqadmin
                                Investigating the Gut Microbiome Through Diet and Spatial Biology
                                by seqadmin




                                The human gut contains trillions of microorganisms that impact digestion, immune functions, and overall health1. Despite major breakthroughs, we’re only beginning to understand the full extent of the microbiome’s influence on health and disease. Advances in next-generation sequencing and spatial biology have opened new windows into this complex environment, yet many questions remain. This article highlights two recent studies exploring how diet influences microbial...
                                02-24-2025, 06:31 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, Today, 05:03 AM
                              0 responses
                              16 views
                              0 reactions
                              Last Post seqadmin  
                              Started by seqadmin, Yesterday, 07:27 AM
                              0 responses
                              13 views
                              0 reactions
                              Last Post seqadmin  
                              Started by seqadmin, 03-18-2025, 12:50 PM
                              0 responses
                              15 views
                              0 reactions
                              Last Post seqadmin  
                              Started by seqadmin, 03-03-2025, 01:15 PM
                              0 responses
                              185 views
                              0 reactions
                              Last Post seqadmin  
                              Working...