**dpryan** · 10-16-2013, 01:29 AM

Do you really want to clean that file or do you just want the clean and synced fastq files? The latter is actually not terribly difficult (see here and here for suggestions).

**splaisan** · 10-16-2013, 01:45 AM

I would like to learn how to clean that file so that I can repeat the operation on future data with similar issues.

Thanks for the links anyway.

**dpryan** · 10-16-2013, 02:10 AM

Fair enough. The following isn't Perl, which I generally loathe, but it's a simple Python solution:

Code:

#!/usr/bin/env python
import sys

f = sys.stdin
of = sys.stdout

read1 = None
name1 = None

for read in f :
    #deal with the header
    if(read[0] == '@') :
        of.write("%s" % read)
    if(name1 == None) :
        read1 = read
        name1 = read1.split("\t")[0]
    else :
        name2 = read.split("\t")[0]
        if(name1 == name2) :
            of.write("%s%s" % (read1, read))
            read1 = None
            name1 = None
        else :
            read1 = read
            name1 = read1.split("\t")[0]

This assumes that both mates in a pair have the same name (so no /1 or /2 suffixes) and that the reads are name-sorted. If you saved that as "blah.py" and made it executable, then usage would be:

Code:

samtools view name_sorted.bam | blah.py | samtools view -bSo name_sorted.filtered.bam -
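If the mates do carry /1 and /2 suffixes, one way to relax the same-name assumption is to compare the names with the suffix stripped. A minimal sketch, assuming the suffix is exactly "/1" or "/2" at the end of the first column; `pair_key` is a hypothetical helper, not part of the script above, and would stand in for the `split("\t")[0]` calls:

```python
def pair_key(sam_line):
    """Return the read name (first SAM column) without a trailing /1 or /2."""
    name = sam_line.split("\t", 1)[0]
    # Assumption: mates are only ever marked with a literal "/1" or "/2" suffix
    if name.endswith("/1") or name.endswith("/2"):
        return name[:-2]
    return name
```

With that in place, two consecutive lines would count as a pair when `pair_key(read1) == pair_key(read)`.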

I haven't tested that it prints the header correctly, so you may need to fix that! I should note that the generally better solution is that employed by HTSeq. There, the RNEXT and PNEXT of a read are compared to that of its proposed mate to ensure that they match. In your case, those are often not set, so I suspect that wouldn't work.
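For data where the mate fields are set, the RNEXT/PNEXT comparison could look roughly like the sketch below. This is only an illustration of the idea, not HTSeq's actual code; `mates_agree` is a hypothetical name, and it assumes complete tab-separated SAM lines with the standard column order.

```python
def mates_agree(line1, line2):
    """Return True if each read's RNEXT/PNEXT points at the other read's RNAME/POS."""
    f1 = line1.rstrip("\n").split("\t")
    f2 = line2.rstrip("\n").split("\t")

    def points_at(a, b):
        # SAM columns (0-based): 2 = RNAME, 3 = POS, 6 = RNEXT, 7 = PNEXT
        rnext = a[2] if a[6] == "=" else a[6]  # "=" means "same as RNAME"
        return rnext == b[2] and a[7] == b[3]

    return points_at(f1, f2) and points_at(f2, f1)
```

Reads whose RNEXT is "*" (unset) never pass this check, which is why it would not help with a file where those fields are often missing.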

Edit: A Perl solution could be similar. You'd use while(<>) for the loop, probably chomp() each line, and then split() it on tabs to get the read name, though perhaps there are more appropriate Perl methods than those. The general workflow could be the same, though.

**splaisan** · 10-16-2013, 05:18 AM

Thanks a LOT Devon,

I added -h to the upstream samtools view command to forward the SAM header, and a 'continue' to the code so that header lines are written out and the loop moves straight on to the next line:

Code:

for read in f :
   #deal with the header
   if(read[0] == '@') :
       of.write("%s" % read)
       continue

Code:

samtools view -h name_sorted.bam | \
	bam_re-pair.py | \
	samtools view -bSo name_sorted.filtered.bam -

I also made a Perl version following your advice that additionally reports counts for all, passed, and failed read lines. Both versions run at the same speed.

Thank you very much for this code; it helped me a lot.

S

### Perl translation of Devon's Python code

Code:

#!/usr/bin/perl -w

# filter unpaired reads from a - read-name sorted - BAM file
# bam_re-pair.pl
# author: Stephane Plaisance (translated from python version by Devon Ryan)
# http://seqanswers.com/forums/showthread.php?p=118936#post118936
# usage:
# samtools view -h name_sorted.bam | \
#	bam_re-pair.pl | \
#	samtools view -bSo name_sorted.filtered.bam -

use warnings;
use strict;

# pending read and its name (undef when no read is waiting for a mate)
my ($read1, $name1);

# counters: all read lines, read lines printed as pairs, read lines dropped
my ($ln, $ok, $no) = (0, 0, 0);

while (my $read = <>) {

	# forward header lines
	if ($read =~ /^@/) {
		print STDOUT $read;
		next;
	}

	# process data
	$ln++;
	my $name = (split("\t", $read))[0];

	if (!defined $name1) {
		# nothing pending: hold this read until the next line
		($read1, $name1) = ($read, $name);
	} elsif ($name1 eq $name) {
		# is paired: print both mates
		$ok += 2;
		print STDOUT $read1, $read;
		($read1, $name1) = (undef, undef);
	} else {
		# is not paired: drop the pending read, hold the current one
		$no++;
		($read1, $name1) = ($read, $name);
	}
}

# a read still pending at the end of the input has no mate
$no++ if defined $name1;

# report counts
printf STDERR "\n########################\n# Results\n# processed:\t%8d\n# passed:\t%8d\n# rejected:\t%8d\n", $ln, $ok, $no;
exit 0;

**dpryan** · 10-16-2013, 05:21 AM

Cool, glad that's working for you. I totally forgot the -h in my example and only added the header stuff after the fact, which is why I didn't notice the missing continue :P Glad you were able to fix that properly!
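For reference, the Python version with the `continue` fix folded in might look like the sketch below. It is wrapped in a function so it can be tried on a small list of lines; `filter_pairs` and the returned counts are illustrative additions, not part of the script posted above. It keeps the same assumptions: name-sorted input and identical names for both mates.

```python
def filter_pairs(lines, out):
    """Write header lines and paired reads to `out`.

    Returns (processed, passed, rejected) counts of read lines.
    """
    read1 = None
    name1 = None
    processed = passed = rejected = 0

    for read in lines:
        # forward header lines and skip the pairing logic for them
        if read.startswith("@"):
            out.write(read)
            continue

        processed += 1
        name = read.split("\t", 1)[0]

        if name1 is None:
            # nothing pending: hold this read until the next line
            read1, name1 = read, name
        elif name == name1:
            # same name as the pending read: write both mates
            out.write(read1 + read)
            passed += 2
            read1 = name1 = None
        else:
            # pending read has no mate: drop it and hold the current one
            rejected += 1
            read1, name1 = read, name

    # a read still pending at the end of the input has no mate
    if name1 is not None:
        rejected += 1

    return processed, passed, rejected
```

To use it in the samtools pipeline, the script would `import sys` and end with a call such as `filter_pairs(sys.stdin, sys.stdout)`, with the upstream command run as `samtools view -h` so that the header is passed through.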

cleaning partial PE sam data
