ask perl script: break contigs into overlapping sequences


  • ask perl script: break contigs into overlapping sequences

    Dear All,
    I am a Perl beginner. I have a FASTA file with many contig sequences and need to break these contigs into 2 kb overlapping fragments (with an overlap of 100 bp). Could anyone help me write a Perl script for this when you have spare time? I would greatly appreciate your help. THANKS!

  • #2
    Sounds like a question for the PerlMonks forum; you can ask there how to properly use the 'substr' function for your task.
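
For readers unfamiliar with 'substr', here is a minimal sketch of how it behaves on a toy string (the 10-base sequence and the variable names are made up for illustration). The point to notice is that the third argument is a length, not an end position, and that asking for more bases than remain simply returns what is left.

```perl
use strict;
use warnings;

my $seq = "ACGTACGTAC";            # toy 10-base sequence (illustrative only)

# substr(STRING, OFFSET, LENGTH): OFFSET is 0-based, third argument is a length
my $first4 = substr($seq, 0, 4);   # bases 1-4
my $next4  = substr($seq, 2, 4);   # starts 2 bases later, so it overlaps $first4 by 2
my $tail   = substr($seq, 8, 4);   # only 2 bases remain, so 2 bases are returned

print "$first4\n$next4\n$tail\n";
```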


    • #3
      You need to get an idea of a) how to parse multi-FASTA files and b) how to split each individual sequence found in your file.

      a) http://lmgtfy.com/?q=perl+parse+fasta+file
      b) http://lmgtfy.com/?q=perl+split+large+genome+sequence

      It's a good exercise for a beginner ..
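
As a rough idea of what step a) can look like without any modules, here is a sketch of a multi-FASTA reader in plain Perl. The sub name read_fasta, the demo data, and the file name contigs.fa mentioned in the comment are all hypothetical; it assumes the ID is the first word after '>' and that a sequence may be wrapped over several lines.

```perl
use strict;
use warnings;

# Read FASTA records from an open filehandle; returns a list of [id, sequence] pairs.
sub read_fasta {
    my ($fh) = @_;
    my (@records, $id, $seq);
    while (my $line = <$fh>) {
        $line =~ s/\r?\n$//;                  # strip line ending (Unix or Windows)
        next if $line eq '';
        if ($line =~ /^>(\S*)/) {             # header line: ID is the first word after '>'
            push @records, [$id, $seq] if defined $id;
            ($id, $seq) = ($1, '');
        } elsif (defined $id) {
            $seq .= $line;                    # sequence may span several lines
        }
    }
    push @records, [$id, $seq] if defined $id;    # flush the last record
    return @records;
}

# Demo on an in-memory file; for a real (hypothetical) file use:
#   open(my $fh, '<', 'contigs.fa') or die $!;
my $demo = ">contig1 first\nACGT\nACGT\n>contig2\nGGCC\n";
open(my $fh, '<', \$demo) or die "cannot open in-memory file: $!";
my @records = read_fasta($fh);
close($fh);
print "$_->[0]\t", length($_->[1]), "\n" for @records;
```

Each record can then be handed to whatever splitting routine you write for step b).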


      • #4
        Dear zhidkov.ilia and sklages,
        THANKS A LOT for your replies. I have thought about it again. I can use a simplified method, i.e., combine all contigs into one sequence (I can do this), then split that sequence into 2 kb fragments (I need a script for this). Would you or others please write this script for me? I GREATLY APPRECIATE YOUR HELP!!


        • #5
          perl script:break contig into 2kb sequences

          Dear zhidkov.ilia and sklages,
          THANKS A LOT for your replies. I have thought about it again. I can use a simplified method, i.e., combine all contigs into one sequence (I can do this), then split that sequence into 2 kb fragments (I need a script for this). Would you or others please write this script for me? I GREATLY APPRECIATE YOUR HELP!!


          • #6
            I would use something like the for loop below:

            Code:
            for (my $i=0;$i<length($seq);$i+=1900){
                 # substr takes (string, offset, length): 2000-base window, 1900-base step = 100 bp overlap
                 print OUT substr($seq,$i,2000),"\n";
            }
            But I don't think anyone is going to write your whole script for you!
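
A slightly fuller sketch of the same loop, wrapped in a sub so the window size and overlap are parameters. The sub name fragments and the toy input are assumptions for illustration. It relies on 'substr' taking an offset and a length, and it stops once a window has reached the end of the sequence so that no trailing piece lying entirely inside the previous window is emitted.

```perl
use strict;
use warnings;

# Split $seq into windows of $size bases, each sharing $overlap bases with the previous one.
# Returns a list of [start, fragment] pairs, where start is a 0-based offset.
sub fragments {
    my ($seq, $size, $overlap) = @_;
    die "overlap must be smaller than size\n" unless $overlap < $size;
    my $step = $size - $overlap;
    my @out;
    for (my $i = 0; $i < length($seq); $i += $step) {
        push @out, [$i, substr($seq, $i, $size)];
        last if $i + $size >= length($seq);   # this window already reached the end
    }
    return @out;
}

# Toy example: 10 bases, windows of 4 with an overlap of 1
for my $f (fragments("ACGTACGTAC", 4, 1)) {
    printf "%d\t%s\n", $f->[0], $f->[1];
}
```

For the case in this thread the call would be fragments($seq, 2000, 100). The last fragment of a sequence is usually shorter than the window size.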


            • #7
              Thank you very much!


              • #8
                Originally posted by bruce01
                I would use something like the for loop below:

                Code:
                for (my $i=0;$i<length($seq);$i+=1900){
                     # substr takes (string, offset, length): 2000-base window, 1900-base step = 100 bp overlap
                     print OUT substr($seq,$i,2000),"\n";
                }
                But I don't think anyone is going to write your whole script for you!
                Sounds like a dare! It's really a trivial program & good template for writing other programs that transform sequence data. A good exercise is to use Getopt::Long to set the cutoff size and overlap size.

                Code:
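                use strict;
                use warnings;
                use Bio::Seq;
                use Bio::SeqIO;
                my $cutSize=2000; my $overlapSize=100;
                my $writer=Bio::SeqIO->new(-file=>">splits.fa",-format=>'fasta');
                foreach my $arg (@ARGV)
                {
                   my $rdr=Bio::SeqIO->new(-file=>$arg);
                   while (my $seqObj=$rdr->next_seq)
                   {
                      my $len=$seqObj->length;
                      # subseq() is 1-based and inclusive, so a full window is [$i, $i+$cutSize-1]
                      for (my $i=1; $i<=$len; $i+=$cutSize-$overlapSize)
                      {
                          my $endPoint=$i+$cutSize-1;
                          $endPoint=$len if ($endPoint>$len);   # clip the last window to the contig end
                          my $subseq=$seqObj->subseq($i,$endPoint);
                          # ID records the 1-based start and end of the fragment within its contig
                          $writer->write_seq(Bio::Seq->new(-id=>$seqObj->id.".$i-$endPoint",-seq=>$subseq));
                          last if ($endPoint==$len);            # this window reached the end of the contig
                      }
                   }
                }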
                Typo correction & debugging left as exercise for the student
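
One possible shape for the Getopt::Long exercise suggested above, as a sketch only: the option names --cut and --overlap and the sub name parse_args are made up for illustration, and the defaults are the 2000 and 100 used in this thread. It uses GetOptionsFromArray so the parsing can be tried on any argument list, not just @ARGV.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Parse --cut and --overlap (hypothetical option names) from an argument list.
# Whatever is left after the options is treated as input file names.
sub parse_args {
    my @args = @_;
    my %opt = (cut => 2000, overlap => 100);    # defaults taken from the thread
    GetOptionsFromArray(\@args,
        'cut=i'     => \$opt{cut},
        'overlap=i' => \$opt{overlap},
    ) or die "usage: $0 [--cut N] [--overlap N] file.fa ...\n";
    die "--overlap must be >= 0 and smaller than --cut\n"
        unless $opt{overlap} >= 0 && $opt{overlap} < $opt{cut};
    return (\%opt, @args);
}

my ($opt, @files) = parse_args(@ARGV);
print "cut=$opt->{cut} overlap=$opt->{overlap} files=@files\n";
```

The two values in $opt would then replace the hard-coded cut size and overlap size in the splitting loop.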


                • #9
                  Originally posted by krobison
                  Sounds like a dare!
                  Good on you krobison! Wasn't being mean, I would have given it a go but had a bit much in front of me. Debugging is the hardest bit when learning.


                  • #10
                    I do not have the impression that the OP wants to learn too much ..
                    So he/she could use google to find some ready-to-use solutions, in perl or whatever language, e.g. http://cpansearch.perl.org/src/CJFIE...p_split_seq.pl ..

                    I still think it would be a great exercise for learning perl (in "bioinformatics"). Though I usually try to avoid bioperl ;-)


                    • #11
                      Thank you all for your input. As a true Perl beginner (I am mostly involved in bench work), I will persist in learning Perl. THANKS for your help!
