Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • splaisan
    senior molecular biologist
    • Jun 2009
    • 32

    cleaning partial PE sam data

    Hello there,
    I obtained PE data from Illumina (chr21 subset of NA18507 - ftp://webdata:[email protected]..._100_chr21.bam).

    After a lot of misery and systematic LENIENT use of picard I could manage to use the data to extract fastQ paired reads from it and remap them to another reference build.

    BUT

    I discovered (among many other serious SAM compliancy problems) that not all reads are present in that file and many pairs have lost one end (which probably did not map on chr21 and was filtered out uncleanly!)

    My question is how to clean such SAM/BAM where FLAGS indicate paired reads but one read of the pair is not present anymore.
    • I cannot use the SAM flags to do it because they are erroneous
    • I could not fix the flags to reflect the true paired status


    Below is the head of the original name-sorted reads showing the absence of one of the 'EAS51_0210:7:33:5109:13959' reads (many 1000's like that)

    Thanks for your suggestions on which command to use and how to eliminate these reads to obtain a fully paired file on which bam2fastq will run smoothly. Preferentially using picard and not with some fancy perl code keeping only '*' in the 7th column .

    Thanks a lot for your lights,
    Stephane

    Code:
    EAS51_0210:3:6:3797:7459	165	chr21	9719702	255	*	*	0	0	AACCTTTGTTTGGATGGAGCAGTTTGTAAACAATCCTTTTGTAGAATCTGCAAAGGTATATTTCTGAGCCCATTGAGGCCTATGGTGAAATACGAAATAT	GGGGGGGGGGGGGFGGFEGGGGEGGGEGGGFDFBGGEFEFGEEGEGFEGGEGEEED?EEEGEEGBEBDGEEEEED=DCCCEBEEEEEEEAAC@DDB:CCC	H0:i:0	H1:i:0	H2:i:2	SM:i:-1	AS:i:0
    EAS51_0210:3:6:3797:7459	89	chr21	9719702	73	100M	*	0	0	ATATTTGGAGCGCTTTGAGGCCTATGGTAAAAAAGGAAATACCATCACATAAAAATTCGATGGAAGAATTCTGAGAAACTTCTTTGTGAGGGTTGGATTC	DDC@BEEEEEEGEEBFGEGG@EEDDBEEGEGGGGFFFFGEECGFGGEGGGGGGGGDFGGGFFGGGGFGFEGGGFGGGGGGBGGEGGFGGGGGGGGGGGGG	XD:Z:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN34	SM:i:73	AS:i:0
    EAS51_0210:7:33:5109:13959	145	chr21	9719707	254	100M	chrY	10653706	0	TGGAGCGCTTTGAGGCCTATGGTAAAAAAGGAAATACCATCACATAAAAATTCGATGGAAGAATTCTGAGAAACTTCTTTGTGAGGGTTGGATTCATCTC	FEGDGEGEEEEGFEGEEGEEDFEGGGGGGEFGFGFAFFFEGEGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGG	XD:Z:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN36T2	SM:i:461	AS:i:0
    EAS25_0078:8:23:14907:11377	165	chr21	9719708	255	*	*	0	0	CTTTTGTAGAATCTGCAAAGGTATATTTCTGAGCCCATTGAGGCCTATGGTGAAATACGAAATATCTTCCCATAAAAACTAGACAGAAGGTTTCTAAGAA	GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGEGEFGGGGGGGGGEGGGGGGGGGGFGFGDGGGGGFDFBFGCEBFFFEGFF	H0:i:1	H1:i:6	H2:i:60	SM:i:-1	AS:i:0
    EAS25_0078:8:23:14907:11377	89	chr21	9719708	254	100M	*	0	0	TGGAGCGCTTTGAGGCCTATGGTAAAAAAGGAAATACCATCACATGAAATTCGATGGAAGAATTCTGAGAAACTTCTTTGTGAGGGTTGGATTCATCTCA	EGEEEEGEBGFGGEEEEEGGDEBGFGGGGGGGGGGAGFGGGEGGGGGGGGGGGGGGGGGGGFGGGGGGGGEEGGGGGGGGGGGGGGGGGGGGGGGGGGGG	XD:Z:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN36T3	SM:i:461	AS:i:0
    EAS25_0078:3:4:1830:9254	101	chr21	9719709	255	*	*	0	0	TTGAACCTTTGTTTGGATGGAGCAGTTTGTAAACAATCCTTTTGTAGAATCTGCAAAGGTATATTTTTGAGCCCATTGAGGCCTATGGTGAAATACGAAA	GGGGGGGGGGGGGGGGGGFGBGGGGGGGFEGGGGEGGGGGGGGGEGGGGGGEFGGEFGGEFFFFF/&8?@EEECCFGFGGFDFGFEGF?DEEDEFEEFEE	H0:i:0	H1:i:0	H2:i:3	SM:i:-1	AS:i:0
    EAS25_0078:3:4:1830:9254	153	chr21	9719709	254	100M	*	0	0	GGAGCGCTTTGAGGCCTATGGTAAAAAAGGAAATACCATCACATGAAATTCGATGGAAGAATTCTGAGAAACTTCTTTGTGAGGGTTGGATTCATCTCAC	EEBE?EEEEEEEEGEEBEEEBGGGEGGGEFAFFECEEE=EDGGFGGGGDGFFGGGGGGGGGEEGGGGDGGGEGFGFGFGGGGGGGGGDGGGEGGFGGGGG	XD:Z:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN36T4	SM:i:461	AS:i:0
    http://www.bits.vib.be/index.php
  • dpryan
    Devon Ryan
    • Jul 2011
    • 3478

    #2
    Do you really want to clean that file or do you just want the clean and synced fastq files? The latter is actually not terribly difficult (see here and here for suggestions).

    Comment

    • splaisan
      senior molecular biologist
      • Jun 2009
      • 32

      #3
      I would like to learn how to clean that file in order to be able to redo such operation with future data having similar issues.

      Thanks for the links anyway.
      http://www.bits.vib.be/index.php

      Comment

      • dpryan
        Devon Ryan
        • Jul 2011
        • 3478

        #4
        Fair enough. The follow isn't perl, which I generally loathe, but it's a simple python solution:
        Code:
        #!/usr/bin/env python
        import sys
        
        f = sys.stdin
        of = sys.stdout
        
        read1 = None
        name1 = None
        
        for read in f :
            #deal with the header
            if(read[0] == '@') :
                of.write("%s" % read)
            if(name1 == None) :
                read1 = read
                name1 = read1.split("\t")[0]
            else :
                name2 = read.split("\t")[0]
                if(name1 == name2) :
                    of.write("%s%s" % (read1, read))
                    read1 = None
                    name1 = None
                else :
                    read1 = read
                    name1 = read1.split("\t")[0]
        This assumes that both mates in a pair have the same name (so no /1 or /2 suffixes) and that the reads are name-sorted. If you saved that as "blah.py" and made it executable, then usage would be:

        Code:
        samtools view name_sorted.bam | blah.py | samtools view -bSo name_sorted.filtered.bam -
        I haven't tested that it prints the header correctly, so you may need to fix that! I should note that the generally better solution is that employed by HTSeq. There, the RNEXT and PNEXT of a read are compared to that of its proposed mate to ensure that they match. In your case, those are often not set, so I suspect that wouldn't work.

        Edit: A perl solution could be similar. You'd use while(<>) for the loop and then probably just chomp() that to split things, though perhaps there are more appropriate perl methods than those. The general work-flow could be the same, though.
        Last edited by dpryan; 10-16-2013, 02:14 AM.

        Comment

        • splaisan
          senior molecular biologist
          • Jun 2009
          • 32

          #5
          Thanks a LOT Devon,

          I added -h to the upstream samtools view cmd to forward the sam header and 'continue' to the code to process header lines and directly go to the next loop

          HTML Code:
          for read in f :
             #deal with the header
             if(read[0] == '@') :
                 of.write("%s" % read)
                 continue
          HTML Code:
          samtools view -h <name_sorted.bam> | \
          	bam_re-pair.py | \
          	samtools view -bSo <name_sorted.filtered.bam> -
          I also made a Perl version following your advice that additionally reports counts for all, passed, and failed read lines. Both codes run at identical speed.

          Thanks you really for this code, it helped me a lot

          S

          ### Perl translation of Devon python code
          HTML Code:
          #!/usr/bin/perl -w
          
          # filter unpaired reads from a - read-name sorted - BAM file
          # bam_re-pair.pl
          # author: Stephane Plaisance (translated from python version by Devon Ryan
          # http://seqanswers.com/forums/showthread.php?p=118936#post118936
          # usage:
          # samtools view -h <name_sorted.bam> | \
          #	bam_re-pair.pl | \
          #	samtools view -bSo <name_sorted.filtered.bam> -
          
          use warnings;
          use strict;
          
          # variables
          my $read = "";
          my $read1 = "none";
          my $read2 = "none";
          my $name1 = "none";
          my $name2 = "none";
          
          my ($ln,$ok,$no)=(0,0,0);
          
          while (my $read = <>) {
          
          # forward header lines
          if ($read =~ /^@/){
          	print STDOUT $read;
          	next;
          	}
          	
          # process data
          $ln++;
          if( $name1 eq "none" ){
          	$read1 = $read;
              $name1 = (split("\t", $read1))[0];
          	} else {
          		$name2 = (split("\t", $read))[0];
          		if( $name1 eq $name2 ){
          			# is paired
          			$ok++;
          			print STDOUT sprintf("%s%s", $read1, $read);
          			$read1 = "none";
          			$name1 = "none";
          			} else {
          				# is not paired
          				$no++;
          				$read1 = $read;
          				$name1 = (split("\t", $read1))[0];
          				}
          	}
          }
          
          # report counts
          print STDERR sprintf("\n########################\n# Results\n# processed:\t%8d\n# passed:\t%8d\n# rejected\t%8d\n", $ln, $ok, $no);
          exit 0;
          http://www.bits.vib.be/index.php

          Comment

          • dpryan
            Devon Ryan
            • Jul 2011
            • 3478

            #6
            Cool, glad that's working for you. I totally forgot the -h in my example and only added the header stuff after the fact, which is why I didn't notice the missing continue :P Glad you were able to fix that properly!

            Comment

            Latest Articles

            Collapse

            • GATTACAT
              Reply to Nine Things a Sample Prep Scientist Thinks About Before Sequencing
              by GATTACAT
              Love this - good data definitely starts from good input, and poor input can only give relatively poor data. I particularly like the mention of Nanodrop/absorbance based methods for quantification. It's such a toss up if you'll get an accurate reading or what amounts to a randomly generated number, and a lot of library/sequencing related issues can be traced back to poor quant.
              07-01-2026, 11:43 AM
            • SEQadmin2
              Nine Things a Sample Prep Scientist Thinks About Before Sequencing
              by SEQadmin2


              I’m not a sequencing expert. I’m a purification scientist who uses NGS to evaluate workflows my group develops. With this perspective, we think about the sample first and the NGS workflow second. The sequencer is an exceptionally honest reporter, but it can only report on what you give it, so whether you get clean, interpretable data from an NGS workflow is largely determined before you begin.

              Here are nine questions we think about, in roughly the order they matter, before...
              06-18-2026, 07:11 AM

            ad_right_rmr

            Collapse

            News

            Collapse

            Topics Statistics Last Post
            Started by SEQadmin2, 07-02-2026, 11:08 AM
            0 responses
            7 views
            0 reactions
            Last Post SEQadmin2  
            Started by SEQadmin2, 06-30-2026, 05:37 AM
            0 responses
            12 views
            0 reactions
            Last Post SEQadmin2  
            Started by SEQadmin2, 06-26-2026, 11:10 AM
            0 responses
            20 views
            0 reactions
            Last Post SEQadmin2  
            Started by SEQadmin2, 06-17-2026, 06:09 AM
            0 responses
            54 views
            0 reactions
            Last Post SEQadmin2  
            Working...