Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • Adamo
    Member
    • Jun 2010
    • 28

    [Optimization] perl script for unaligned reads

    Hello,

    I have intended to write a script to extract unaligned reads from a fasta file.

    The script takes the first line of the file which contains all the reads, and run throught the entire one (the one containing all aligned reads), until it finds a match between two reads IDs. If it doesn't, then it means the read is unaligned and it is stored into the file unaligned.fasta (aligned.fasta in the other case).

    The problem is the script runs very, very slowly, and I don't know how to optimize it.
    Any help would be very appreciated.

    The code (I made no comments of it, do not hesitate to tell me if you want me to do so):

    Code:
    #!/usr/bin/perl
    
    use Bio::SeqIO;
    use Bio::Root::Exception;
    
    
    my $usage = "unalignedreads.pl alignedreads totalreads\n";
    my $alignedreads = shift or die $usage;
    my $totalreads = shift or die $usage;
    $inseq1 = Bio::SeqIO->new('-file' => "<$totalreads",
    			     '-format' => 'fasta');
    $inseq2 = Bio::SeqIO->new('-file' => "<$alignedreads",
    			     '-format' => 'fasta');
    my %outfiles = (
    		'aligned' => Bio::SeqIO->new('-file'=> '>/home/biopuce11/Documents/reads/aligned.fasta',
    					     '-format'=> 'fasta'),
    		'unaligned' => Bio::SeqIO->new('-file'=> '>/home/biopuce11/Documents/reads/unaligned.fasta',
    					       '-format' => 'fasta')
    		);
    while ($seqin1 = $inseq1->next_seq) {
    
    	$flag = 0;
    	while (($seqin2 = $inseq2->next_seq) && $flag == 0) {
    
    		$sequ2 = $seqin2->display_id;
    		$sequ1 = $seqin1->display_id;
    		chomp($sequ1);
    		chomp($sequ2);
    		if ( "$sequ1" eq "$sequ2"){
    			$flag = 1;
    		} 			
    	}
    	$inseq2 = Bio::SeqIO->new('-file' => "<$alignedreads",
    			     '-format' => 'fasta');
    
    	if ( $flag == 1) {
    		$outfiles{'aligned'}->write_seq($seqin1);
    	} 
    	else {
    		$outfiles{'unaligned'}->write_seq($seqin1);
    	}
    }
    exit;
    Thanks in advance.
  • Dethecor
    Member
    • May 2010
    • 24

    #2
    Set intersection

    Now, I'm not an expert in (Bio-)Perl, but what you are doing seems to be somewhere in the time complexity of O( |all reads| * |aligned reads| ) with a lot of file connections being opened.

    Depending on how many reads there are, you might be better of just reading both sets of read-ids ('all' and 'aligned') into an appropriate data-structure and do a set intersection.
    Then use theses lists of id's to parse your .fasta-files and split them into 'unaligned' and 'aligned' (is there a Set with operations 'union', 'intersection', etc. defined in Perl?)

    Cheers

    "You are only young once, but you can stay immature indefinitely."

    Comment

    • Bruins
      Member
      • Feb 2010
      • 78

      #3
      First, you read in both 'aligned reads' and 'unaligned reads'. This takes some time.
      Within the while loop, you read in the unaligned reads (variable $inseq2) again. This means that for every sequence in 'aligned reads' the program reads in the unaligned reads. Find a way around this and save a LOT of time. Perhaps rename the first $seqin2 in
      Code:
      while (($seqin2 = $inseq2->next_seq) && $flag == 0) {
      Chrz

      Comment

      • kmcarr
        Senior Member
        • May 2008
        • 1181

        #4
        Dethecor is correct, you are doing this in the most complex (time wise) manner possible. He is also correct that this is formally a set operation, and the operation you want to perform is a 'difference'. We can also make a simplifying assumption, that being that Aligned reads is a proper subset of Total reads.

        Here is how I would do it:

        - Read in the aligned.fasta, storing the display_ids in a hash.

        - Read through the total.fasta, check if the current display_id is defined in your hash.

        - If it is not defined, write the current seq object to the unaligned.fasta file

        (I don't see the need to write out an aligned.fasta file since that file already exists; it was one of your input files after all.)

        This method opens and reads each of the files (aligned.fasta, total.fasta) only once. A single comparison is done for each member of total.fasta.
        Last edited by kmcarr; 07-13-2010, 06:26 AM.

        Comment

        • Adamo
          Member
          • Jun 2010
          • 28

          #5
          Yes, I knew I'd not chosen the most efficient way to do the job... I'm not very comfortable with informatics!
          Thanks kmcarr and Dethecor, I've done what you've suggested: the program now stores aligned reads' ids in an array, and reads the "total reads" file to check if it can find the ids in this array. If not, the id is written in "unaligned.fasta".
          And you are right, aligned.fasta isn't useful at all, don't know why I did that.
          It is very much faster this way!

          @Bruins: Thank you too for your suggestion, it's fine now.

          Comment

          • kmcarr
            Senior Member
            • May 2008
            • 1181

            #6
            Originally posted by Adamo View Post
            Yes, I knew I'd not chosen the most efficient way to do the job... I'm not very comfortable with informatics!
            Thanks kmcarr and Dethecor, I've done what you've suggested: the program now stores aligned reads' ids in an array, and reads the "total reads" file to check if it can find the ids in this array. If not, the id is written in "unaligned.fasta".
            Adamo,

            Are you using an array or a hash to store the ids of the aligned reads? If you are using an array you then for each id in the total reads file you would need to scan through the array looking for the id of interest; this is inefficient. You should store the aligned read ids as keys in a hash (the values associated with these keys are irrelevant, setting them equal to 1 will be fine). Then for each id in your total reads file you only have to perform a single boolean operation, asking is this id defined as key in my hash?

            Comment

            Latest Articles

            Collapse

            • SEQadmin2
              Nine Things a Sample Prep Scientist Thinks About Before Sequencing
              by SEQadmin2


              I’m not a sequencing expert. I’m a purification scientist who uses NGS to evaluate workflows my group develops. With this perspective, we think about the sample first and the NGS workflow second. The sequencer is an exceptionally honest reporter, but it can only report on what you give it, so whether you get clean, interpretable data from an NGS workflow is largely determined before you begin.


              Here are nine questions we think about, in roughly the order they matter, before...
              06-18-2026, 07:11 AM
            • SEQadmin2
              From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
              by SEQadmin2


              Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


              The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
              ...
              06-02-2026, 10:05 AM
            • SEQadmin2
              Single-Cell Sequencing at an Inflection Point: Early Impacts of New Platforms and Emerging Trends
              by SEQadmin2


              With the launch of new single-cell sequencing platforms in 2026, the field stands at an exciting inflection point. This article surveys the most impactful advances in the field and discusses how they’re reshaping research in cancer, immunology, and beyond.


              Introduction

              Single-cell sequencing technologies have undergone remarkable advances over the past decade, transitioning from low-throughput experimental approaches to highly scalable platforms capable of...
              05-22-2026, 06:42 AM

            ad_right_rmr

            Collapse

            News

            Collapse

            Topics Statistics Last Post
            Started by SEQadmin2, 06-17-2026, 06:09 AM
            0 responses
            21 views
            0 reactions
            Last Post SEQadmin2  
            Started by SEQadmin2, 06-09-2026, 11:58 AM
            0 responses
            40 views
            0 reactions
            Last Post SEQadmin2  
            Started by SEQadmin2, 06-05-2026, 10:09 AM
            0 responses
            46 views
            0 reactions
            Last Post SEQadmin2  
            Started by SEQadmin2, 06-04-2026, 08:59 AM
            0 responses
            49 views
            0 reactions
            Last Post SEQadmin2  
            Working...