Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • Adamo
    Member
    • Jun 2010
    • 28

    [Optimization] perl script for unaligned reads

    Hello,

    I have intended to write a script to extract unaligned reads from a fasta file.

    The script takes the first line of the file which contains all the reads, and run throught the entire one (the one containing all aligned reads), until it finds a match between two reads IDs. If it doesn't, then it means the read is unaligned and it is stored into the file unaligned.fasta (aligned.fasta in the other case).

    The problem is the script runs very, very slowly, and I don't know how to optimize it.
    Any help would be very appreciated.

    The code (I made no comments of it, do not hesitate to tell me if you want me to do so):

    Code:
    #!/usr/bin/perl
    
    use Bio::SeqIO;
    use Bio::Root::Exception;
    
    
    my $usage = "unalignedreads.pl alignedreads totalreads\n";
    my $alignedreads = shift or die $usage;
    my $totalreads = shift or die $usage;
    $inseq1 = Bio::SeqIO->new('-file' => "<$totalreads",
    			     '-format' => 'fasta');
    $inseq2 = Bio::SeqIO->new('-file' => "<$alignedreads",
    			     '-format' => 'fasta');
    my %outfiles = (
    		'aligned' => Bio::SeqIO->new('-file'=> '>/home/biopuce11/Documents/reads/aligned.fasta',
    					     '-format'=> 'fasta'),
    		'unaligned' => Bio::SeqIO->new('-file'=> '>/home/biopuce11/Documents/reads/unaligned.fasta',
    					       '-format' => 'fasta')
    		);
    while ($seqin1 = $inseq1->next_seq) {
    
    	$flag = 0;
    	while (($seqin2 = $inseq2->next_seq) && $flag == 0) {
    
    		$sequ2 = $seqin2->display_id;
    		$sequ1 = $seqin1->display_id;
    		chomp($sequ1);
    		chomp($sequ2);
    		if ( "$sequ1" eq "$sequ2"){
    			$flag = 1;
    		} 			
    	}
    	$inseq2 = Bio::SeqIO->new('-file' => "<$alignedreads",
    			     '-format' => 'fasta');
    
    	if ( $flag == 1) {
    		$outfiles{'aligned'}->write_seq($seqin1);
    	} 
    	else {
    		$outfiles{'unaligned'}->write_seq($seqin1);
    	}
    }
    exit;
    Thanks in advance.
  • Dethecor
    Member
    • May 2010
    • 24

    #2
    Set intersection

    Now, I'm not an expert in (Bio-)Perl, but what you are doing seems to be somewhere in the time complexity of O( |all reads| * |aligned reads| ) with a lot of file connections being opened.

    Depending on how many reads there are, you might be better of just reading both sets of read-ids ('all' and 'aligned') into an appropriate data-structure and do a set intersection.
    Then use theses lists of id's to parse your .fasta-files and split them into 'unaligned' and 'aligned' (is there a Set with operations 'union', 'intersection', etc. defined in Perl?)

    Cheers

    "You are only young once, but you can stay immature indefinitely."

    Comment

    • Bruins
      Member
      • Feb 2010
      • 78

      #3
      First, you read in both 'aligned reads' and 'unaligned reads'. This takes some time.
      Within the while loop, you read in the unaligned reads (variable $inseq2) again. This means that for every sequence in 'aligned reads' the program reads in the unaligned reads. Find a way around this and save a LOT of time. Perhaps rename the first $seqin2 in
      Code:
      while (($seqin2 = $inseq2->next_seq) && $flag == 0) {
      Chrz

      Comment

      • kmcarr
        Senior Member
        • May 2008
        • 1181

        #4
        Dethecor is correct, you are doing this in the most complex (time wise) manner possible. He is also correct that this is formally a set operation, and the operation you want to perform is a 'difference'. We can also make a simplifying assumption, that being that Aligned reads is a proper subset of Total reads.

        Here is how I would do it:

        - Read in the aligned.fasta, storing the display_ids in a hash.

        - Read through the total.fasta, check if the current display_id is defined in your hash.

        - If it is not defined, write the current seq object to the unaligned.fasta file

        (I don't see the need to write out an aligned.fasta file since that file already exists; it was one of your input files after all.)

        This method opens and reads each of the files (aligned.fasta, total.fasta) only once. A single comparison is done for each member of total.fasta.
        Last edited by kmcarr; 07-13-2010, 06:26 AM.

        Comment

        • Adamo
          Member
          • Jun 2010
          • 28

          #5
          Yes, I knew I'd not chosen the most efficient way to do the job... I'm not very comfortable with informatics!
          Thanks kmcarr and Dethecor, I've done what you've suggested: the program now stores aligned reads' ids in an array, and reads the "total reads" file to check if it can find the ids in this array. If not, the id is written in "unaligned.fasta".
          And you are right, aligned.fasta isn't useful at all, don't know why I did that.
          It is very much faster this way!

          @Bruins: Thank you too for your suggestion, it's fine now.

          Comment

          • kmcarr
            Senior Member
            • May 2008
            • 1181

            #6
            Originally posted by Adamo View Post
            Yes, I knew I'd not chosen the most efficient way to do the job... I'm not very comfortable with informatics!
            Thanks kmcarr and Dethecor, I've done what you've suggested: the program now stores aligned reads' ids in an array, and reads the "total reads" file to check if it can find the ids in this array. If not, the id is written in "unaligned.fasta".
            Adamo,

            Are you using an array or a hash to store the ids of the aligned reads? If you are using an array you then for each id in the total reads file you would need to scan through the array looking for the id of interest; this is inefficient. You should store the aligned read ids as keys in a hash (the values associated with these keys are irrelevant, setting them equal to 1 will be fine). Then for each id in your total reads file you only have to perform a single boolean operation, asking is this id defined as key in my hash?

            Comment

            Latest Articles

            Collapse

            • GATTACAT
              Reply to Nine Things a Sample Prep Scientist Thinks About Before Sequencing
              by GATTACAT
              Love this - good data definitely starts from good input, and poor input can only give relatively poor data. I particularly like the mention of Nanodrop/absorbance based methods for quantification. It's such a toss up if you'll get an accurate reading or what amounts to a randomly generated number, and a lot of library/sequencing related issues can be traced back to poor quant.
              07-01-2026, 11:43 AM
            • SEQadmin2
              Nine Things a Sample Prep Scientist Thinks About Before Sequencing
              by SEQadmin2


              I’m not a sequencing expert. I’m a purification scientist who uses NGS to evaluate workflows my group develops. With this perspective, we think about the sample first and the NGS workflow second. The sequencer is an exceptionally honest reporter, but it can only report on what you give it, so whether you get clean, interpretable data from an NGS workflow is largely determined before you begin.

              Here are nine questions we think about, in roughly the order they matter, before...
              06-18-2026, 07:11 AM

            ad_right_rmr

            Collapse

            News

            Collapse

            Topics Statistics Last Post
            Started by SEQadmin2, Yesterday, 11:08 AM
            0 responses
            6 views
            0 reactions
            Last Post SEQadmin2  
            Started by SEQadmin2, 06-30-2026, 05:37 AM
            0 responses
            11 views
            0 reactions
            Last Post SEQadmin2  
            Started by SEQadmin2, 06-26-2026, 11:10 AM
            0 responses
            19 views
            0 reactions
            Last Post SEQadmin2  
            Started by SEQadmin2, 06-17-2026, 06:09 AM
            0 responses
            53 views
            0 reactions
            Last Post SEQadmin2  
            Working...