  • dpryan
    replied
    Cool, glad that's working for you. I totally forgot the -h in my example and only added the header stuff after the fact, which is why I didn't notice the missing continue :P Glad you were able to fix that properly!



  • splaisan
    replied
    Thanks a LOT Devon,

    I added -h to the upstream samtools view command so that the SAM header is forwarded, and added 'continue' to the code so that header lines are written out and the loop moves directly to the next line.

    HTML Code:
    for read in f :
       #deal with the header
       if(read[0] == '@') :
           of.write("%s" % read)
           continue
    HTML Code:
    samtools view -h <name_sorted.bam> | \
    	bam_re-pair.py | \
    	samtools view -bSo <name_sorted.filtered.bam> -
    Following your advice, I also made a Perl version that additionally reports counts for all, passed, and failed read lines. Both scripts run at the same speed.

    Thank you very much for this code, it helped me a lot.

    S

    ### Perl translation of Devon's Python code
    HTML Code:
    #!/usr/bin/perl -w
    
    # filter unpaired reads from a read-name-sorted BAM file
    # bam_re-pair.pl
    # author: Stephane Plaisance (translated from the Python version by Devon Ryan)
    # http://seqanswers.com/forums/showthread.php?p=118936#post118936
    # usage:
    # samtools view -h <name_sorted.bam> | \
    #	bam_re-pair.pl | \
    #	samtools view -bSo <name_sorted.filtered.bam> -
    
    use warnings;
    use strict;
    
    # read held back while waiting for its mate, and its name
    my $read1 = "none";
    my $name1 = "none";
    
    # counters: all read lines, lines in complete pairs, rejected lines
    my ($ln, $ok, $no) = (0, 0, 0);
    
    while (my $read = <>) {
    
    	# forward header lines
    	if ($read =~ /^@/) {
    		print STDOUT $read;
    		next;
    	}
    
    	# process data
    	$ln++;
    	my $name2 = (split("\t", $read))[0];
    	if ($name1 eq "none") {
    		# nothing held: keep this read and wait for its mate
    		$read1 = $read;
    		$name1 = $name2;
    	} elsif ($name1 eq $name2) {
    		# is paired: print both mates
    		$ok += 2;
    		print STDOUT $read1, $read;
    		$read1 = "none";
    		$name1 = "none";
    	} else {
    		# held read is not paired: drop it and hold the current read
    		$no++;
    		$read1 = $read;
    		$name1 = $name2;
    	}
    }
    
    # a read still held at end of input has no mate
    $no++ if $name1 ne "none";
    
    # report counts
    printf STDERR "\n########################\n# Results\n# processed:\t%8d\n# passed:\t%8d\n# rejected:\t%8d\n", $ln, $ok, $no;
    exit 0;
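
For readers who prefer Python, the same pairing logic with header forwarding and counters can be sketched as a single function. This is only a sketch under the assumptions stated in this thread (name-sorted input, both mates carrying the identical name, no /1 or /2 suffixes). The function name `filter_pairs` and the keys of the returned dictionary are chosen here for illustration and are not part of the scripts posted in this thread.

```python
def filter_pairs(lines, out):
    """Copy header lines and complete pairs from name-sorted SAM lines to `out`.

    Returns a dict with the number of processed, passed and rejected read lines.
    """
    counts = {"processed": 0, "passed": 0, "rejected": 0}
    held = None        # previous read line still waiting for its mate
    held_name = None
    for line in lines:
        # forward header lines untouched and move on
        if line.startswith("@"):
            out.write(line)
            continue
        counts["processed"] += 1
        name = line.split("\t", 1)[0]
        if held is not None and name == held_name:
            # both mates seen back to back: keep the pair
            out.write(held + line)
            counts["passed"] += 2
            held = held_name = None
        else:
            # the held read (if any) never met its mate: drop it
            if held is not None:
                counts["rejected"] += 1
            held, held_name = line, name
    # a read still held at end of input has no mate either
    if held is not None:
        counts["rejected"] += 1
    return counts
```

In a pipeline it could be called as `filter_pairs(sys.stdin, sys.stdout)` between the two samtools view commands, with the returned counts printed to stderr so they do not end up in the SAM stream.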



  • dpryan
    replied
    Fair enough. The following isn't Perl, which I generally loathe, but it's a simple Python solution:
    Code:
    #!/usr/bin/env python
    import sys
    
    f = sys.stdin
    of = sys.stdout
    
    read1 = None
    name1 = None
    
    for read in f :
        #deal with the header
        if(read[0] == '@') :
            of.write("%s" % read)
        if(name1 == None) :
            read1 = read
            name1 = read1.split("\t")[0]
        else :
            name2 = read.split("\t")[0]
            if(name1 == name2) :
                of.write("%s%s" % (read1, read))
                read1 = None
                name1 = None
            else :
                read1 = read
                name1 = read1.split("\t")[0]
    This assumes that both mates in a pair have the same name (so no /1 or /2 suffixes) and that the reads are name-sorted. If you saved that as "blah.py" and made it executable, then usage would be:

    Code:
    samtools view name_sorted.bam | blah.py | samtools view -bSo name_sorted.filtered.bam -
    I haven't tested that it prints the header correctly, so you may need to fix that! I should note that the generally better solution is that employed by HTSeq. There, the RNEXT and PNEXT of a read are compared to that of its proposed mate to ensure that they match. In your case, those are often not set, so I suspect that wouldn't work.
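
To make the RNEXT/PNEXT idea concrete, here is a minimal sketch of such a mate check on two raw SAM lines. It is not HTSeq's actual code, only an illustration of the comparison described above; the function name `mates_agree` is made up for this example. Column positions follow the SAM format (RNAME is column 3, POS column 4, RNEXT column 7, PNEXT column 8), and "=" in RNEXT means "same reference as RNAME".

```python
def mates_agree(line_a, line_b):
    """Return True if two SAM records name each other as mates via RNEXT/PNEXT."""
    a = line_a.rstrip("\n").split("\t")
    b = line_b.rstrip("\n").split("\t")
    # 0-based fields: 0 QNAME, 2 RNAME, 3 POS, 6 RNEXT, 7 PNEXT
    if a[0] != b[0]:
        return False

    def next_ref(rec):
        # "=" is shorthand for "mate is on the same reference as this read"
        return rec[2] if rec[6] == "=" else rec[6]

    # each record must point at the other record's reference and position
    return (next_ref(a) == b[2] and a[7] == b[3]
            and next_ref(b) == a[2] and b[7] == a[3])
```

With records like the ones in the original post, where RNEXT is "*" and PNEXT is 0, this check returns False even for reads that share a name, which is why the name-based approach is used here instead.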

    Edit: A Perl solution could be similar. You'd use while(<>) for the loop and then probably chomp() and split() each line, though perhaps there are more appropriate Perl methods than those. The general workflow could be the same, though.
    Last edited by dpryan; 10-16-2013, 02:14 AM.



  • splaisan
    replied
    I would like to learn how to clean that file, so that I can repeat this kind of operation on future data with similar issues.

    Thanks for the links anyway.



  • dpryan
    replied
    Do you really want to clean that file or do you just want the clean and synced fastq files? The latter is actually not terribly difficult (see here and here for suggestions).
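
As a rough illustration of what syncing two FASTQ files can look like (one possible approach, not necessarily what the linked posts suggest), the sketch below keeps only the records whose read name occurs in both files. It assumes plain 4-line FASTQ records and files small enough to hold in memory; the function names `fastq_records`, `base_name` and `sync_pairs` are invented for this example.

```python
from itertools import islice


def fastq_records(lines):
    """Group an iterable of FASTQ lines into 4-line records."""
    it = iter(lines)
    while True:
        rec = list(islice(it, 4))
        if len(rec) < 4:
            return
        yield rec


def base_name(header):
    """Read name without the leading '@', any description, or a trailing /1 or /2."""
    name = header[1:].split()[0]
    return name[:-2] if name.endswith(("/1", "/2")) else name


def sync_pairs(lines1, lines2):
    """Return (kept1, kept2): records whose read name occurs in both inputs."""
    recs1 = list(fastq_records(lines1))
    recs2 = list(fastq_records(lines2))
    shared = {base_name(r[0]) for r in recs1} & {base_name(r[0]) for r in recs2}
    kept1 = [r for r in recs1 if base_name(r[0]) in shared]
    kept2 = [r for r in recs2 if base_name(r[0]) in shared]
    return kept1, kept2
```

The kept records stay in their input order, so the two outputs line up only if both files were in the same relative order to begin with.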



  • splaisan
    started a topic cleaning partial PE sam data

    cleaning partial PE sam data

    Hello there,
    I obtained PE data from Illumina (chr21 subset of NA18507 - ftp://webdata:[email protected]..._100_chr21.bam).

    After a lot of misery and systematic use of Picard's LENIENT validation stringency, I managed to extract paired FASTQ reads from the data and remap them to another reference build.

    BUT

    I discovered (among many other serious SAM compliance problems) that not all reads are present in that file and that many pairs have lost one end (which probably did not map to chr21 and was filtered out uncleanly!).

    My question is how to clean such a SAM/BAM file, where the FLAGs indicate paired reads but one read of the pair is no longer present.
    • I cannot use the SAM flags to do it because they are erroneous
    • I could not fix the flags to reflect the true paired status


    Below is the head of the original name-sorted reads, showing the absence of one of the 'EAS51_0210:7:33:5109:13959' reads (there are many thousands like that).

    Thanks for your suggestions on which command to use and how to eliminate these reads to obtain a fully paired file on which bam2fastq will run smoothly, preferably using Picard rather than some fancy Perl code that keeps only '*' in the 7th column.

    Thanks a lot for your insights,
    Stephane

    Code:
    EAS51_0210:3:6:3797:7459	165	chr21	9719702	255	*	*	0	0	AACCTTTGTTTGGATGGAGCAGTTTGTAAACAATCCTTTTGTAGAATCTGCAAAGGTATATTTCTGAGCCCATTGAGGCCTATGGTGAAATACGAAATAT	GGGGGGGGGGGGGFGGFEGGGGEGGGEGGGFDFBGGEFEFGEEGEGFEGGEGEEED?EEEGEEGBEBDGEEEEED=DCCCEBEEEEEEEAAC@DDB:CCC	H0:i:0	H1:i:0	H2:i:2	SM:i:-1	AS:i:0
    EAS51_0210:3:6:3797:7459	89	chr21	9719702	73	100M	*	0	0	ATATTTGGAGCGCTTTGAGGCCTATGGTAAAAAAGGAAATACCATCACATAAAAATTCGATGGAAGAATTCTGAGAAACTTCTTTGTGAGGGTTGGATTC	DDC@BEEEEEEGEEBFGEGG@EEDDBEEGEGGGGFFFFGEECGFGGEGGGGGGGGDFGGGFFGGGGFGFEGGGFGGGGGGBGGEGGFGGGGGGGGGGGGG	XD:Z:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN34	SM:i:73	AS:i:0
    EAS51_0210:7:33:5109:13959	145	chr21	9719707	254	100M	chrY	10653706	0	TGGAGCGCTTTGAGGCCTATGGTAAAAAAGGAAATACCATCACATAAAAATTCGATGGAAGAATTCTGAGAAACTTCTTTGTGAGGGTTGGATTCATCTC	FEGDGEGEEEEGFEGEEGEEDFEGGGGGGEFGFGFAFFFEGEGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGG	XD:Z:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN36T2	SM:i:461	AS:i:0
    EAS25_0078:8:23:14907:11377	165	chr21	9719708	255	*	*	0	0	CTTTTGTAGAATCTGCAAAGGTATATTTCTGAGCCCATTGAGGCCTATGGTGAAATACGAAATATCTTCCCATAAAAACTAGACAGAAGGTTTCTAAGAA	GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGEGEFGGGGGGGGGEGGGGGGGGGGFGFGDGGGGGFDFBFGCEBFFFEGFF	H0:i:1	H1:i:6	H2:i:60	SM:i:-1	AS:i:0
    EAS25_0078:8:23:14907:11377	89	chr21	9719708	254	100M	*	0	0	TGGAGCGCTTTGAGGCCTATGGTAAAAAAGGAAATACCATCACATGAAATTCGATGGAAGAATTCTGAGAAACTTCTTTGTGAGGGTTGGATTCATCTCA	EGEEEEGEBGFGGEEEEEGGDEBGFGGGGGGGGGGAGFGGGEGGGGGGGGGGGGGGGGGGGFGGGGGGGGEEGGGGGGGGGGGGGGGGGGGGGGGGGGGG	XD:Z:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN36T3	SM:i:461	AS:i:0
    EAS25_0078:3:4:1830:9254	101	chr21	9719709	255	*	*	0	0	TTGAACCTTTGTTTGGATGGAGCAGTTTGTAAACAATCCTTTTGTAGAATCTGCAAAGGTATATTTTTGAGCCCATTGAGGCCTATGGTGAAATACGAAA	GGGGGGGGGGGGGGGGGGFGBGGGGGGGFEGGGGEGGGGGGGGGEGGGGGGEFGGEFGGEFFFFF/&8?@EEECCFGFGGFDFGFEGF?DEEDEFEEFEE	H0:i:0	H1:i:0	H2:i:3	SM:i:-1	AS:i:0
    EAS25_0078:3:4:1830:9254	153	chr21	9719709	254	100M	*	0	0	GGAGCGCTTTGAGGCCTATGGTAAAAAAGGAAATACCATCACATGAAATTCGATGGAAGAATTCTGAGAAACTTCTTTGTGAGGGTTGGATTCATCTCAC	EEBE?EEEEEEEEGEEBEEEBGGGEGGGEFAFFECEEE=EDGGFGGGGDGFFGGGGGGGGGEEGGGGDGGGEGFGFGFGGGGGGGGGDGGGEGGFGGGGG	XD:Z:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN36T4	SM:i:461	AS:i:0
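
To get a feel for how many reads are affected before filtering, a small sketch like the following can list the read names that do not occur exactly twice in name-sorted SAM input. It relies on the same assumption as the rest of the thread (both mates share one name, input is name-sorted); `orphan_names` is a name made up for this illustration.

```python
from itertools import groupby


def orphan_names(sam_lines):
    """Yield (QNAME, count) for read names that do not occur exactly twice.

    Expects name-sorted SAM lines; header lines (starting with '@') are skipped.
    """
    records = (line for line in sam_lines if not line.startswith("@"))
    # consecutive lines with the same first column belong to the same read name
    for name, group in groupby(records, key=lambda line: line.split("\t", 1)[0]):
        n = sum(1 for _ in group)
        if n != 2:
            yield name, n
```

Fed with the excerpt above, via something like `for name, n in orphan_names(sys.stdin)` on the output of samtools view, it would report 'EAS51_0210:7:33:5109:13959' with a count of 1.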
