Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • rogerholmes.novogene
    Junior Member
    • Jul 2011
    • 3

    Help:scaffold 2 contigs!

    Hi
    I got a scaffold sequence file from SOAPdenovo like:
    >scaffold1 13.6
    AATCCGCCGAGATGGGCAGATCCCTTGAGCTCANNNNNNNAATCCGCCGAGATGGGCA
    .
    .


    I need a perl script to remove "N"s and split the scaffold into contigs like:

    >scaffold1 13.6 contig1
    AATCCGCCGAGATGGGCAGATCCCTTGAGCTCA
    >scaffold1 13.6 contig2
    AATCCGCCGAGATGGGCA
    .
    .


    I'm new in perl programming. Could anyone help me with that?

    Thanks!
  • BENM
    Member
    • May 2009
    • 33

    #2
    Try this run in your terminal:

    perl -e 'my ($name,$seq)=("","");while(<>){chomp;if ((/\>/)||(eof)){if (($seq ne "")||(eof)){$seq.=$_ if (eof);my @Seq=split /N+/,$seq;if (@Seq>1){for my $i(1..@Seq){print "$name contig$i\n";print $Seq[$i-1],"\n";}}else{print "$name\n$seq\n";}}$name=$_;$seq="";}else{chomp;$seq.=$_;}}' your_seq.fasta

    Comment

    • boetsie
      Senior Member
      • Feb 2010
      • 245

      #3
      I once wrote something like this in perl, i've changed it for your purposes. Hope it helps.

      Code:
      #!/usr/bin/perl
      use strict;
      
      my $file = $ARGV[0];
      open(IN,$file) || die "Incorrect file $file. Exiting...\n";
      
      my ($seq, $name)=('','');
      while(<IN>){
        chomp;
        my $line = $_;
        $seq.= uc($line) if(eof(IN));
        if (/\>(\S+)/ || eof(IN)){
          if($seq ne ''){
            my @seqgaps = split(/[N]{1,}/, $seq);
            if($#seqgaps > 0){
              my $ctgcount=0;
              foreach my $ctgseq (@seqgaps){
                $ctgcount++;
                print "$name contig$ctgcount (size=".length($ctgseq).")\n$ctgseq\n";
              }
            }else{
              print ">$name\n$seq\n";
            }
          }
          $seq='';
          $name = $_;
        }else{
          $seq.=uc($line);
        }
      }
      Regards,
      Boetsie
      Last edited by boetsie; 07-27-2011, 06:59 AM.

      Comment

      • kmcarr
        Senior Member
        • May 2008
        • 1181

        #4
        There is a potential problem with the output you requested and produced by the two examples given above. The sequence identifiers are not unique; every sub-contig within scaffold1 will be named "scaffold1" in the resulting FASTA file. Remember that only the first "word" of the definition line is considered the sequence ID and this should be unique for every entry in your file. Adjust the example scripts so that each contig gets a unique name like:

        Code:
        >scaffold1.1
        .......
        >scaffold1.2
        .......
        >scaffold2.1
        .......
        
        etc.

        Comment

        • BENM
          Member
          • May 2009
          • 33

          #5
          Originally posted by boetsie View Post
          I once wrote something like this in perl, i've changed it for your purposes. Hope it helps.

          Code:
          #!/usr/bin/perl
          use strict;
          
          my $file = $ARGV[0];
          open(IN,$file) || die "Incorrect file $file. Exiting...\n";
          
          my ($seq, $name)=('','');
          while(<IN>){
            chomp;
            my $line = $_;
            $seq.= uc($line) if(eof(IN));
            if (/\>(\S+)/ || eof(IN)){
              if($seq ne ''){
                my @seqgaps = split(/[N]{1,}/, $seq);
                if($#seqgaps > 0){
                  my $ctgcount=0;
                  foreach my $ctgseq (@seqgaps){
                    $ctgcount++;
                    print "$name contig$ctgcount (size=".length($ctgseq).")\n$ctgseq\n";
                  }
                }else{
                  print ">$name\n$seq\n";
                }
              }
              $seq='';
              $name = $_;
            }else{
              $seq.=uc($line);
            }
          }
          Regards,
          Boetsie
          Hi boetsie, you would find out you missed the last sequence.

          Comment

          • boetsie
            Senior Member
            • Feb 2010
            • 245

            #6
            Originally posted by BENM View Post
            Hi boetsie, you would find out you missed the last sequence.
            And why is that? i don't think so...

            Comment

            • BENM
              Member
              • May 2009
              • 33

              #7
              Originally posted by boetsie View Post
              And why is that? i don't think so...
              Yes, you're right. Sorry, I didn't pay attention to your code.

              Comment

              • rogerholmes.novogene
                Junior Member
                • Jul 2011
                • 3

                #8
                You guys are great!
                Thanks!

                Comment

                • gringer
                  David Eccles (gringer)
                  • May 2011
                  • 845

                  #9
                  Here's the script I have used for breaking up 454 assemblies for de-novo sequencing (I didn't like the 'replace N with A' thing that was being done). It splits by sequences of N and appends /<num> to the end of split contigs. You can give it an optional -min parameter to make sure the resulting sequences have at least that length:

                  Code:
                  #!/usr/bin/perl
                  
                  # contiguous_fasta.pl -- splits fasta-formatted files into contiguous
                  # sequences of non-ambiguous bases
                  
                  # Author: David Eccles (gringer) 2011 <[email protected]>
                  
                  use warnings;
                  use strict;
                  
                  my $id = "";
                  
                  my $lineCount = 0;
                  my $goodCount = 0;
                  
                  my $allSequence = "";
                  my @sequences = ();
                  
                  my $minLength = 0;
                  if(@ARGV && ($ARGV[0] eq '-min')){
                      shift(@ARGV);
                      $minLength = shift(@ARGV);
                  }
                  
                  print(STDERR "Splitting sequences into contiguous non-ambiguous sequences...");
                  
                  while(<>){
                      if(++$lineCount % 10 ** 6 == 0){
                          print(STDERR " ");
                      }
                      chomp;
                      if(substr($_,0,1) eq ">"){
                          my $newID = $_;
                          if($allSequence){
                              @sequences = split(/N+/, $allSequence);
                              $allSequence = "";
                              my $seqNum = 0;
                              foreach my $sequence (@sequences){
                                  if(++$goodCount % 10 ** 6 == 0){
                                      print(STDERR ".");
                                  }
                                  if(!$minLength || (length($sequence) > $minLength)){
                                      print("$id/".$seqNum++."\n$sequence\n");
                                  }
                              }
                          }
                          $id = $newID;
                      } else {
                          $allSequence .= $_;
                      }
                  }
                  
                  if($allSequence){
                      @sequences = split(/N+/, $allSequence);
                      my $seqNum = 0;
                      foreach my $sequence (@sequences){
                          if($minLength && (length($sequence) > $minLength)){
                              print("$id/".$seqNum++."\n$sequence\n");
                          }
                      }
                  }
                  
                  print(STDERR " done\n");
                  Last edited by gringer; 07-28-2011, 12:12 AM. Reason: correcting a logic bug

                  Comment

                  • gringer
                    David Eccles (gringer)
                    • May 2011
                    • 845

                    #10
                    Here's an alternate version that does away with the duplicated code and progress output (but also doesn't handle a minimum length -- you could pipe through fastx_clipper for that if necessary):

                    Code:
                    #!/usr/bin/perl
                    # contiguous_fasta.pl -- splits fasta-formatted files into contiguous
                    # sequences of non-ambiguous bases
                    # Author: David Eccles (gringer) 2011 <[email protected]>
                    
                    use warnings;
                    use strict;
                    
                    my $id = "";
                    my $first = 1; # true
                    my @sequences = ();
                    my $seqNum = 0;
                    
                    while(<>){
                        chomp;
                        if(substr($_,0,1) eq ">"){
                            $id = $_;
                            $seqNum = 0;
                            print((($first)?"":"\n").$_."/".$seqNum++."\n");
                            $first = 0; # false
                        } else {
                            @sequences = split(/N+/, $_);
                            print(shift(@sequences)) unless !@sequences; # whole line could be N
                            foreach my $sequence (@sequences){
                                print("\n$id/".$seqNum++."\n$sequence");
                            }
                        }
                    }
                    print("\n");

                    Comment

                    • boetsie
                      Senior Member
                      • Feb 2010
                      • 245

                      #11
                      If you want to split only on gaps of say larger than 10 N's, you can change the line;

                      my @seqgaps = split(/[N]{1,}/, $seq);
                      to
                      my @seqgaps = split(/[N]{10,}/, $seq);

                      Comment

                      • rogerholmes.novogene
                        Junior Member
                        • Jul 2011
                        • 3

                        #12
                        Thanks !
                        They really helped!

                        Comment

                        Latest Articles

                        Collapse

                        • SEQadmin2
                          From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
                          by SEQadmin2


                          Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


                          The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
                          ...
                          06-02-2026, 10:05 AM
                        • SEQadmin2
                          Single-Cell Sequencing at an Inflection Point: Early Impacts of New Platforms and Emerging Trends
                          by SEQadmin2


                          With the launch of new single-cell sequencing platforms in 2026, the field stands at an exciting inflection point. This article surveys the most impactful advances in the field and discusses how they’re reshaping research in cancer, immunology, and beyond.


                          Introduction

                          Single-cell sequencing technologies have undergone remarkable advances over the past decade, transitioning from low-throughput experimental approaches to highly scalable platforms capable of...
                          05-22-2026, 06:42 AM
                        • SEQadmin2
                          Environmental Genomics in the Age of NGS: From Microbes to Conservation Strategies
                          by SEQadmin2

                          Studying ecosystems means dealing with complex, multi-species communities that are hard to observe at scale. This complexity, however, hides many important questions to be answered, from how biogeochemical cycles work and how climate change can affect species distribution to how conservation strategies can work best.


                          Genomics, particularly since the expansion of NGS, has transformed ecosystem ecology. By sequencing environmental DNA, we can now assess biodiversity without direct...
                          05-06-2026, 09:04 AM

                        ad_right_rmr

                        Collapse

                        News

                        Collapse

                        Topics Statistics Last Post
                        Started by SEQadmin2, 06-02-2026, 12:03 PM
                        0 responses
                        19 views
                        0 reactions
                        Last Post SEQadmin2  
                        Started by SEQadmin2, 06-02-2026, 11:40 AM
                        0 responses
                        14 views
                        0 reactions
                        Last Post SEQadmin2  
                        Started by SEQadmin2, 05-28-2026, 11:40 AM
                        0 responses
                        29 views
                        0 reactions
                        Last Post SEQadmin2  
                        Started by SEQadmin2, 05-26-2026, 10:12 AM
                        0 responses
                        31 views
                        0 reactions
                        Last Post SEQadmin2  
                        Working...