Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • [Optimization] perl script for unaligned reads

    Hello,

    I have intended to write a script to extract unaligned reads from a fasta file.

    The script takes the first line of the file which contains all the reads, and run throught the entire one (the one containing all aligned reads), until it finds a match between two reads IDs. If it doesn't, then it means the read is unaligned and it is stored into the file unaligned.fasta (aligned.fasta in the other case).

    The problem is the script runs very, very slowly, and I don't know how to optimize it.
    Any help would be very appreciated.

    The code (I made no comments of it, do not hesitate to tell me if you want me to do so):

    Code:
    #!/usr/bin/perl
    
    use Bio::SeqIO;
    use Bio::Root::Exception;
    
    
    my $usage = "unalignedreads.pl alignedreads totalreads\n";
    my $alignedreads = shift or die $usage;
    my $totalreads = shift or die $usage;
    $inseq1 = Bio::SeqIO->new('-file' => "<$totalreads",
    			     '-format' => 'fasta');
    $inseq2 = Bio::SeqIO->new('-file' => "<$alignedreads",
    			     '-format' => 'fasta');
    my %outfiles = (
    		'aligned' => Bio::SeqIO->new('-file'=> '>/home/biopuce11/Documents/reads/aligned.fasta',
    					     '-format'=> 'fasta'),
    		'unaligned' => Bio::SeqIO->new('-file'=> '>/home/biopuce11/Documents/reads/unaligned.fasta',
    					       '-format' => 'fasta')
    		);
    while ($seqin1 = $inseq1->next_seq) {
    
    	$flag = 0;
    	while (($seqin2 = $inseq2->next_seq) && $flag == 0) {
    
    		$sequ2 = $seqin2->display_id;
    		$sequ1 = $seqin1->display_id;
    		chomp($sequ1);
    		chomp($sequ2);
    		if ( "$sequ1" eq "$sequ2"){
    			$flag = 1;
    		} 			
    	}
    	$inseq2 = Bio::SeqIO->new('-file' => "<$alignedreads",
    			     '-format' => 'fasta');
    
    	if ( $flag == 1) {
    		$outfiles{'aligned'}->write_seq($seqin1);
    	} 
    	else {
    		$outfiles{'unaligned'}->write_seq($seqin1);
    	}
    }
    exit;
    Thanks in advance.

  • #2
    Set intersection

    Now, I'm not an expert in (Bio-)Perl, but what you are doing seems to be somewhere in the time complexity of O( |all reads| * |aligned reads| ) with a lot of file connections being opened.

    Depending on how many reads there are, you might be better of just reading both sets of read-ids ('all' and 'aligned') into an appropriate data-structure and do a set intersection.
    Then use theses lists of id's to parse your .fasta-files and split them into 'unaligned' and 'aligned' (is there a Set with operations 'union', 'intersection', etc. defined in Perl?)

    Cheers

    "You are only young once, but you can stay immature indefinitely."

    Comment


    • #3
      First, you read in both 'aligned reads' and 'unaligned reads'. This takes some time.
      Within the while loop, you read in the unaligned reads (variable $inseq2) again. This means that for every sequence in 'aligned reads' the program reads in the unaligned reads. Find a way around this and save a LOT of time. Perhaps rename the first $seqin2 in
      Code:
      while (($seqin2 = $inseq2->next_seq) && $flag == 0) {
      Chrz

      Comment


      • #4
        Dethecor is correct, you are doing this in the most complex (time wise) manner possible. He is also correct that this is formally a set operation, and the operation you want to perform is a 'difference'. We can also make a simplifying assumption, that being that Aligned reads is a proper subset of Total reads.

        Here is how I would do it:

        - Read in the aligned.fasta, storing the display_ids in a hash.

        - Read through the total.fasta, check if the current display_id is defined in your hash.

        - If it is not defined, write the current seq object to the unaligned.fasta file

        (I don't see the need to write out an aligned.fasta file since that file already exists; it was one of your input files after all.)

        This method opens and reads each of the files (aligned.fasta, total.fasta) only once. A single comparison is done for each member of total.fasta.
        Last edited by kmcarr; 07-13-2010, 06:26 AM.

        Comment


        • #5
          Yes, I knew I'd not chosen the most efficient way to do the job... I'm not very comfortable with informatics!
          Thanks kmcarr and Dethecor, I've done what you've suggested: the program now stores aligned reads' ids in an array, and reads the "total reads" file to check if it can find the ids in this array. If not, the id is written in "unaligned.fasta".
          And you are right, aligned.fasta isn't useful at all, don't know why I did that.
          It is very much faster this way!

          @Bruins: Thank you too for your suggestion, it's fine now.

          Comment


          • #6
            Originally posted by Adamo View Post
            Yes, I knew I'd not chosen the most efficient way to do the job... I'm not very comfortable with informatics!
            Thanks kmcarr and Dethecor, I've done what you've suggested: the program now stores aligned reads' ids in an array, and reads the "total reads" file to check if it can find the ids in this array. If not, the id is written in "unaligned.fasta".
            Adamo,

            Are you using an array or a hash to store the ids of the aligned reads? If you are using an array you then for each id in the total reads file you would need to scan through the array looking for the id of interest; this is inefficient. You should store the aligned read ids as keys in a hash (the values associated with these keys are irrelevant, setting them equal to 1 will be fine). Then for each id in your total reads file you only have to perform a single boolean operation, asking is this id defined as key in my hash?

            Comment

            Latest Articles

            Collapse

            • seqadmin
              Current Approaches to Protein Sequencing
              by seqadmin


              Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
              04-04-2024, 04:25 PM
            • seqadmin
              Strategies for Sequencing Challenging Samples
              by seqadmin


              Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
              03-22-2024, 06:39 AM

            ad_right_rmr

            Collapse

            News

            Collapse

            Topics Statistics Last Post
            Started by seqadmin, 04-11-2024, 12:08 PM
            0 responses
            27 views
            0 likes
            Last Post seqadmin  
            Started by seqadmin, 04-10-2024, 10:19 PM
            0 responses
            31 views
            0 likes
            Last Post seqadmin  
            Started by seqadmin, 04-10-2024, 09:21 AM
            0 responses
            27 views
            0 likes
            Last Post seqadmin  
            Started by seqadmin, 04-04-2024, 09:00 AM
            0 responses
            52 views
            0 likes
            Last Post seqadmin  
            Working...
            X