Seqanswers Leaderboard Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • dpryan
    replied
    Cool, glad that's working for you. I totally forgot the -h in my example and only added the header stuff after the fact, which is why I didn't notice the missing continue :P Glad you were able to fix that properly!

    Leave a comment:


  • splaisan
    replied
    Thanks a LOT Devon,

    I added -h to the upstream samtools view cmd to forward the sam header and 'continue' to the code to process header lines and directly go to the next loop

    HTML Code:
    for read in f :
       #deal with the header
       if(read[0] == '@') :
           of.write("%s" % read)
           continue
    HTML Code:
    samtools view -h <name_sorted.bam> | \
    	bam_re-pair.py | \
    	samtools view -bSo <name_sorted.filtered.bam> -
    I also made a Perl version following your advice that additionally reports counts for all, passed, and failed read lines. Both codes run at identical speed.

    Thanks you really for this code, it helped me a lot

    S

    ### Perl translation of Devon python code
    HTML Code:
    #!/usr/bin/perl -w
    
    # filter unpaired reads from a - read-name sorted - BAM file
    # bam_re-pair.pl
    # author: Stephane Plaisance (translated from python version by Devon Ryan
    # http://seqanswers.com/forums/showthread.php?p=118936#post118936
    # usage:
    # samtools view -h <name_sorted.bam> | \
    #	bam_re-pair.pl | \
    #	samtools view -bSo <name_sorted.filtered.bam> -
    
    use warnings;
    use strict;
    
    # variables
    my $read = "";
    my $read1 = "none";
    my $read2 = "none";
    my $name1 = "none";
    my $name2 = "none";
    
    my ($ln,$ok,$no)=(0,0,0);
    
    while (my $read = <>) {
    
    # forward header lines
    if ($read =~ /^@/){
    	print STDOUT $read;
    	next;
    	}
    	
    # process data
    $ln++;
    if( $name1 eq "none" ){
    	$read1 = $read;
        $name1 = (split("\t", $read1))[0];
    	} else {
    		$name2 = (split("\t", $read))[0];
    		if( $name1 eq $name2 ){
    			# is paired
    			$ok++;
    			print STDOUT sprintf("%s%s", $read1, $read);
    			$read1 = "none";
    			$name1 = "none";
    			} else {
    				# is not paired
    				$no++;
    				$read1 = $read;
    				$name1 = (split("\t", $read1))[0];
    				}
    	}
    }
    
    # report counts
    print STDERR sprintf("\n########################\n# Results\n# processed:\t%8d\n# passed:\t%8d\n# rejected\t%8d\n", $ln, $ok, $no);
    exit 0;

    Leave a comment:


  • dpryan
    replied
    Fair enough. The follow isn't perl, which I generally loathe, but it's a simple python solution:
    Code:
    #!/usr/bin/env python
    import sys
    
    f = sys.stdin
    of = sys.stdout
    
    read1 = None
    name1 = None
    
    for read in f :
        #deal with the header
        if(read[0] == '@') :
            of.write("%s" % read)
        if(name1 == None) :
            read1 = read
            name1 = read1.split("\t")[0]
        else :
            name2 = read.split("\t")[0]
            if(name1 == name2) :
                of.write("%s%s" % (read1, read))
                read1 = None
                name1 = None
            else :
                read1 = read
                name1 = read1.split("\t")[0]
    This assumes that both mates in a pair have the same name (so no /1 or /2 suffixes) and that the reads are name-sorted. If you saved that as "blah.py" and made it executable, then usage would be:

    Code:
    samtools view name_sorted.bam | blah.py | samtools view -bSo name_sorted.filtered.bam -
    I haven't tested that it prints the header correctly, so you may need to fix that! I should note that the generally better solution is that employed by HTSeq. There, the RNEXT and PNEXT of a read are compared to that of its proposed mate to ensure that they match. In your case, those are often not set, so I suspect that wouldn't work.

    Edit: A perl solution could be similar. You'd use while(<>) for the loop and then probably just chomp() that to split things, though perhaps there are more appropriate perl methods than those. The general work-flow could be the same, though.
    Last edited by dpryan; 10-16-2013, 02:14 AM.

    Leave a comment:


  • splaisan
    replied
    I would like to learn how to clean that file in order to be able to redo such operation with future data having similar issues.

    Thanks for the links anyway.

    Leave a comment:


  • dpryan
    replied
    Do you really want to clean that file or do you just want the clean and synced fastq files? The latter is actually not terribly difficult (see here and here for suggestions).

    Leave a comment:


  • splaisan
    started a topic cleaning partial PE sam data

    cleaning partial PE sam data

    Hello there,
    I obtained PE data from Illumina (chr21 subset of NA18507 - ftp://webdata:[email protected]..._100_chr21.bam).

    After a lot of misery and systematic LENIENT use of picard I could manage to use the data to extract fastQ paired reads from it and remap them to another reference build.

    BUT

    I discovered (among many other serious SAM compliancy problems) that not all reads are present in that file and many pairs have lost one end (which probably did not map on chr21 and was filtered out uncleanly!)

    My question is how to clean such SAM/BAM where FLAGS indicate paired reads but one read of the pair is not present anymore.
    • I cannot use the SAM flags to do it because they are erroneous
    • I could not fix the flags to reflect the true paired status


    Below is the head of the original name-sorted reads showing the absence of one of the 'EAS51_0210:7:33:5109:13959' reads (many 1000's like that)

    Thanks for your suggestions on which command to use and how to eliminate these reads to obtain a fully paired file on which bam2fastq will run smoothly. Preferentially using picard and not with some fancy perl code keeping only '*' in the 7th column .

    Thanks a lot for your lights,
    Stephane

    Code:
    EAS51_0210:3:6:3797:7459	165	chr21	9719702	255	*	*	0	0	AACCTTTGTTTGGATGGAGCAGTTTGTAAACAATCCTTTTGTAGAATCTGCAAAGGTATATTTCTGAGCCCATTGAGGCCTATGGTGAAATACGAAATAT	GGGGGGGGGGGGGFGGFEGGGGEGGGEGGGFDFBGGEFEFGEEGEGFEGGEGEEED?EEEGEEGBEBDGEEEEED=DCCCEBEEEEEEEAAC@DDB:CCC	H0:i:0	H1:i:0	H2:i:2	SM:i:-1	AS:i:0
    EAS51_0210:3:6:3797:7459	89	chr21	9719702	73	100M	*	0	0	ATATTTGGAGCGCTTTGAGGCCTATGGTAAAAAAGGAAATACCATCACATAAAAATTCGATGGAAGAATTCTGAGAAACTTCTTTGTGAGGGTTGGATTC	DDC@BEEEEEEGEEBFGEGG@EEDDBEEGEGGGGFFFFGEECGFGGEGGGGGGGGDFGGGFFGGGGFGFEGGGFGGGGGGBGGEGGFGGGGGGGGGGGGG	XD:Z:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN34	SM:i:73	AS:i:0
    EAS51_0210:7:33:5109:13959	145	chr21	9719707	254	100M	chrY	10653706	0	TGGAGCGCTTTGAGGCCTATGGTAAAAAAGGAAATACCATCACATAAAAATTCGATGGAAGAATTCTGAGAAACTTCTTTGTGAGGGTTGGATTCATCTC	FEGDGEGEEEEGFEGEEGEEDFEGGGGGGEFGFGFAFFFEGEGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGG	XD:Z:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN36T2	SM:i:461	AS:i:0
    EAS25_0078:8:23:14907:11377	165	chr21	9719708	255	*	*	0	0	CTTTTGTAGAATCTGCAAAGGTATATTTCTGAGCCCATTGAGGCCTATGGTGAAATACGAAATATCTTCCCATAAAAACTAGACAGAAGGTTTCTAAGAA	GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGEGEFGGGGGGGGGEGGGGGGGGGGFGFGDGGGGGFDFBFGCEBFFFEGFF	H0:i:1	H1:i:6	H2:i:60	SM:i:-1	AS:i:0
    EAS25_0078:8:23:14907:11377	89	chr21	9719708	254	100M	*	0	0	TGGAGCGCTTTGAGGCCTATGGTAAAAAAGGAAATACCATCACATGAAATTCGATGGAAGAATTCTGAGAAACTTCTTTGTGAGGGTTGGATTCATCTCA	EGEEEEGEBGFGGEEEEEGGDEBGFGGGGGGGGGGAGFGGGEGGGGGGGGGGGGGGGGGGGFGGGGGGGGEEGGGGGGGGGGGGGGGGGGGGGGGGGGGG	XD:Z:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN36T3	SM:i:461	AS:i:0
    EAS25_0078:3:4:1830:9254	101	chr21	9719709	255	*	*	0	0	TTGAACCTTTGTTTGGATGGAGCAGTTTGTAAACAATCCTTTTGTAGAATCTGCAAAGGTATATTTTTGAGCCCATTGAGGCCTATGGTGAAATACGAAA	GGGGGGGGGGGGGGGGGGFGBGGGGGGGFEGGGGEGGGGGGGGGEGGGGGGEFGGEFGGEFFFFF/&8?@EEECCFGFGGFDFGFEGF?DEEDEFEEFEE	H0:i:0	H1:i:0	H2:i:3	SM:i:-1	AS:i:0
    EAS25_0078:3:4:1830:9254	153	chr21	9719709	254	100M	*	0	0	GGAGCGCTTTGAGGCCTATGGTAAAAAAGGAAATACCATCACATGAAATTCGATGGAAGAATTCTGAGAAACTTCTTTGTGAGGGTTGGATTCATCTCAC	EEBE?EEEEEEEEGEEBEEEBGGGEGGGEFAFFECEEE=EDGGFGGGGDGFFGGGGGGGGGEEGGGGDGGGEGFGFGFGGGGGGGGGDGGGEGGFGGGGG	XD:Z:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN36T4	SM:i:461	AS:i:0

Latest Articles

Collapse

  • seqadmin
    Pathogen Surveillance with Advanced Genomic Tools
    by seqadmin




    The COVID-19 pandemic highlighted the need for proactive pathogen surveillance systems. As ongoing threats like avian influenza and newly emerging infections continue to pose risks, researchers are working to improve how quickly and accurately pathogens can be identified and tracked. In a recent SEQanswers webinar, two experts discussed how next-generation sequencing (NGS) and machine learning are shaping efforts to monitor viral variation and trace the origins of infectious...
    03-24-2025, 11:48 AM
  • seqadmin
    New Genomics Tools and Methods Shared at AGBT 2025
    by seqadmin


    This year’s Advances in Genome Biology and Technology (AGBT) General Meeting commemorated the 25th anniversary of the event at its original venue on Marco Island, Florida. While this year’s event didn’t include high-profile musical performances, the industry announcements and cutting-edge research still drew the attention of leading scientists.

    The Headliner
    The biggest announcement was Roche stepping back into the sequencing platform market. In the years since...
    03-03-2025, 01:39 PM

ad_right_rmr

Collapse

News

Collapse

Topics Statistics Last Post
Started by seqadmin, 03-20-2025, 05:03 AM
0 responses
41 views
0 reactions
Last Post seqadmin  
Started by seqadmin, 03-19-2025, 07:27 AM
0 responses
49 views
0 reactions
Last Post seqadmin  
Started by seqadmin, 03-18-2025, 12:50 PM
0 responses
36 views
0 reactions
Last Post seqadmin  
Started by seqadmin, 03-03-2025, 01:15 PM
0 responses
191 views
0 reactions
Last Post seqadmin  
Working...