Unconfigured Ad

**BENM** · 07-27-2011, 06:32 AM

Try this run in your terminal:

perl -e 'my ($name,$seq)=("","");while(<>){chomp;if ((/\>/)||(eof)){if (($seq ne "")||(eof)){$seq.=$_ if (eof);my @Seq=split /N+/,$seq;if (@Seq>1){for my $i(1..@Seq){print "$name contig$i\n";print $Seq[$i-1],"\n";}}else{print "$name\n$seq\n";}}$name=$_;$seq="";}else{chomp;$seq.=$_;}}' your_seq.fasta

**boetsie** · 07-27-2011, 06:33 AM

I once wrote something like this in perl, i've changed it for your purposes. Hope it helps.

Code:

#!/usr/bin/perl
use strict;

my $file = $ARGV[0];
open(IN,$file) || die "Incorrect file $file. Exiting...\n";

my ($seq, $name)=('','');
while(<IN>){
  chomp;
  my $line = $_;
  $seq.= uc($line) if(eof(IN));
  if (/\>(\S+)/ || eof(IN)){
    if($seq ne ''){
      my @seqgaps = split(/[N]{1,}/, $seq);
      if($#seqgaps > 0){
        my $ctgcount=0;
        foreach my $ctgseq (@seqgaps){
          $ctgcount++;
          print "$name contig$ctgcount (size=".length($ctgseq).")\n$ctgseq\n";
        }
      }else{
        print ">$name\n$seq\n";
      }
    }
    $seq='';
    $name = $_;
  }else{
    $seq.=uc($line);
  }
}

Regards,
Boetsie

**kmcarr** · 07-27-2011, 07:10 AM

There is a potential problem with the output you requested and produced by the two examples given above. The sequence identifiers are not unique; every sub-contig within scaffold1 will be named "scaffold1" in the resulting FASTA file. Remember that only the first "word" of the definition line is considered the sequence ID and this should be unique for every entry in your file. Adjust the example scripts so that each contig gets a unique name like:

Code:

>scaffold1.1
.......
>scaffold1.2
.......
>scaffold2.1
.......

etc.

**BENM** · 07-27-2011, 09:14 AM

Originally posted by boetsie View Post

I once wrote something like this in perl, i've changed it for your purposes. Hope it helps.

Code:

#!/usr/bin/perl
use strict;

my $file = $ARGV[0];
open(IN,$file) || die "Incorrect file $file. Exiting...\n";

my ($seq, $name)=('','');
while(<IN>){
  chomp;
  my $line = $_;
  $seq.= uc($line) if(eof(IN));
  if (/\>(\S+)/ || eof(IN)){
    if($seq ne ''){
      my @seqgaps = split(/[N]{1,}/, $seq);
      if($#seqgaps > 0){
        my $ctgcount=0;
        foreach my $ctgseq (@seqgaps){
          $ctgcount++;
          print "$name contig$ctgcount (size=".length($ctgseq).")\n$ctgseq\n";
        }
      }else{
        print ">$name\n$seq\n";
      }
    }
    $seq='';
    $name = $_;
  }else{
    $seq.=uc($line);
  }
}

Regards,
Boetsie

Hi boetsie, you would find out you missed the last sequence.

**boetsie** · 07-27-2011, 10:29 AM

Originally posted by BENM View Post

Hi boetsie, you would find out you missed the last sequence.

And why is that? i don't think so...

**BENM** · 07-27-2011, 11:04 AM

Originally posted by boetsie View Post

And why is that? i don't think so...

Yes, you're right. Sorry, I didn't pay attention to your code.

**rogerholmes.novogene** · 07-27-2011, 05:06 PM

You guys are great!
Thanks!

**gringer** · 07-27-2011, 11:46 PM

Here's the script I have used for breaking up 454 assemblies for de-novo sequencing (I didn't like the 'replace N with A' thing that was being done). It splits by sequences of N and appends /<num> to the end of split contigs. You can give it an optional -min parameter to make sure the resulting sequences have at least that length:

Code:

#!/usr/bin/perl

# contiguous_fasta.pl -- splits fasta-formatted files into contiguous
# sequences of non-ambiguous bases

# Author: David Eccles (gringer) 2011 <[email protected]>

use warnings;
use strict;

my $id = "";

my $lineCount = 0;
my $goodCount = 0;

my $allSequence = "";
my @sequences = ();

my $minLength = 0;
if(@ARGV && ($ARGV[0] eq '-min')){
    shift(@ARGV);
    $minLength = shift(@ARGV);
}

print(STDERR "Splitting sequences into contiguous non-ambiguous sequences...");

while(<>){
    if(++$lineCount % 10 ** 6 == 0){
        print(STDERR " ");
    }
    chomp;
    if(substr($_,0,1) eq ">"){
        my $newID = $_;
        if($allSequence){
            @sequences = split(/N+/, $allSequence);
            $allSequence = "";
            my $seqNum = 0;
            foreach my $sequence (@sequences){
                if(++$goodCount % 10 ** 6 == 0){
                    print(STDERR ".");
                }
                if(!$minLength || (length($sequence) > $minLength)){
                    print("$id/".$seqNum++."\n$sequence\n");
                }
            }
        }
        $id = $newID;
    } else {
        $allSequence .= $_;
    }
}

if($allSequence){
    @sequences = split(/N+/, $allSequence);
    my $seqNum = 0;
    foreach my $sequence (@sequences){
        if($minLength && (length($sequence) > $minLength)){
            print("$id/".$seqNum++."\n$sequence\n");
        }
    }
}

print(STDERR " done\n");

**gringer** · 07-28-2011, 12:20 AM

Here's an alternate version that does away with the duplicated code and progress output (but also doesn't handle a minimum length -- you could pipe through fastx_clipper for that if necessary):

Code:

#!/usr/bin/perl
# contiguous_fasta.pl -- splits fasta-formatted files into contiguous
# sequences of non-ambiguous bases
# Author: David Eccles (gringer) 2011 <[email protected]>

use warnings;
use strict;

my $id = "";
my $first = 1; # true
my @sequences = ();
my $seqNum = 0;

while(<>){
    chomp;
    if(substr($_,0,1) eq ">"){
        $id = $_;
        $seqNum = 0;
        print((($first)?"":"\n").$_."/".$seqNum++."\n");
        $first = 0; # false
    } else {
        @sequences = split(/N+/, $_);
        print(shift(@sequences)) unless !@sequences; # whole line could be N
        foreach my $sequence (@sequences){
            print("\n$id/".$seqNum++."\n$sequence");
        }
    }
}
print("\n");

**boetsie** · 07-28-2011, 12:26 AM

If you want to split only on gaps of say larger than 10 N's, you can change the line;

my @seqgaps = split(/[N]{1,}/, $seq);
to
my @seqgaps = split(/[N]{10,}/, $seq);

**rogerholmes.novogene** · 07-28-2011, 10:05 PM

Thanks !
They really helped!

Topics	Statistics	Last Post
Long-Read RNA Sequencing Uncovers a Hidden Layer of Immune Cell Regulation by SEQadmin2 Started by SEQadmin2, 06-02-2026, 12:03 PM	0 responses 19 views 0 reactions	Last Post by SEQadmin2 06-02-2026, 12:03 PM
DNA Methylation Study Reveals How Epigenetic Changes Pass Between Generations by SEQadmin2 Started by SEQadmin2, 06-02-2026, 11:40 AM	0 responses 14 views 0 reactions	Last Post by SEQadmin2 06-02-2026, 11:40 AM
MetaBeeAI Helps Scientists Process Research Literature Faster by SEQadmin2 Started by SEQadmin2, 05-28-2026, 11:40 AM	0 responses 29 views 0 reactions	Last Post by SEQadmin2 05-28-2026, 11:40 AM
Scientists Solve a 25-Year Mystery in RNA Interference by SEQadmin2 Started by SEQadmin2, 05-26-2026, 10:12 AM	0 responses 31 views 0 reactions	Last Post by SEQadmin2 05-26-2026, 10:12 AM

Unconfigured Ad

Help:scaffold 2 contigs!

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News