Extract unaligned reads (Tophat) from FastQ
  • Extract unaligned reads (Tophat) from FastQ

    I would like to further examine the reads that Tophat didn't manage to align in a first run, and I wonder if there is an easy way to get these reads. With Bowtie this would be easy using the "--un" argument, but Tophat doesn't seem to have anything like it. So far I am able to extract the read ids of the reads that do align by:

    Code:
    cut --fields=1 accepted_hits.sam | sort --unique > accepted_hits_readsIds.txt
    From that point I'd need to extract the FastQ entries whose ids don't match any line in the read-id file. Since FastQ entries span four lines rather than one, I got stuck here; any ideas/help would be appreciated!

    Thanks in advance & Cheers
    Uwe
    Last edited by Uwe Appelt; 09-15-2010, 03:10 AM.
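
    One way to do the filtering step described above is a single awk pass (a sketch, not a tested pipeline: it assumes plain 4-line FastQ records, that `reads_1.fq` is the input file, and it strips a trailing /1 or /2 pair suffix from the FastQ header, since the ids in the SAM file lack it):

```shell
# Keep only FASTQ records whose id is absent from the mapped-id list.
# File 1: the id list from the cut|sort command above.
# File 2: the original FASTQ file.
awk 'NR==FNR { ids[$1]=1; next }
     FNR%4==1 { id=substr($1,2); sub(/\/[12]$/, "", id); keep=!(id in ids) }
     keep' accepted_hits_readsIds.txt reads_1.fq > unaligned_1.fq
```

    The `keep` flag set on each header line carries over to the following three lines of the record, so whole records are printed or skipped together.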

  • #2
    you can try
    http://www.ncbi.nlm.nih.gov/pubmed/20605925
    Bioinformatics. 2010 Jul 6. [Epub ahead of print]
    G-SQZ: Compact Encoding of Genomic Sequence and Quality Data.

    Tembe W, Lowey J, Suh E.

    Translational Genomics Research Institute, 445 N 5th Street, Phoenix, AZ 85004, USA.
    Abstract

    SUMMARY: Large volumes of data generated by high-throughput sequencing instruments present non-trivial challenges in data storage, content access, and transfer. We present G-SQZ, a Huffman coding-based sequencing-reads specific representation scheme that compresses data without altering the relative order. G-SQZ has achieved from 65% to 81% compression on benchmark datasets, and it allows selective access without scanning and decoding from start. This paper focuses on describing the underlying encoding scheme and its software implementation, and a more theoretical problem of optimal compression is out of scope. The immediate practical benefits include reduced infrastructure and informatics costs in managing and analyzing large sequencing data. AVAILABILITY: http://public.tgen.org/sqz. Academic/non-profit: Source: available at no cost under a non-open-source license by requesting from the web-site; Binary: available for direct download at no cost. For-Profit: Submit request for for-profit license from the web-site. CONTACT: Waibhav Tembe ([email protected]).

    or, if the number of reads is not too large, maybe use BioPerl
    http://www.bioperl.org/wiki/Module:Bio::Index::Fastq
    http://kevin-gattaca.blogspot.com/



    • #3
      You probably need to do this in two passes. It's also a pain that Tophat seems to alter the sequence id (but maybe this is because I was using paired-end data?), so you have to adjust the ids a bit.

      The code below seems to work on the tophat files I just ran it against.

      Code:
      #!/usr/bin/perl
      use warnings;
      use strict;
      
      my ($fastq,$sam,$outfile) = @ARGV;
      
      unless ($outfile) {
        die "Usage is filter_unmapped_reads.pl [FastQ file] [SAM File] [File for unmapped reads]\n";
      }
      
      if (-e $outfile) {
        die "Won't overwrite an existing file, delete it first!";
      }
      
      open (FASTQ,$fastq) or die "Can't open fastq file: $!";
      open (SAM,$sam) or die "Can't open SAM file: $!";
      open (OUT,'>',$outfile) or die "Can't write to $outfile: $!";
      
      my $ids = read_ids();
      
      filter_fastq($ids);
      
      close OUT or die "Can't write to $outfile: $!";
      
      
      sub filter_fastq {
      
        warn "Filtering FastQ file\n";
      
        my ($ids) = @_;
      
        while (<FASTQ>) {
      
          if (/^@(\S+)/) {
            my $id = $1;
      
            # Remove the end designator from paired end reads
            $id =~ s/\/\d+$//;
      
            my $seq = <FASTQ>;
            my $id2 = <FASTQ>;
            my $qual = <FASTQ>;
      
      
            unless (exists $ids->{$id}) {
              print OUT $_,$seq,$id2,$qual;
            }
          }
          else {
            warn "Line '$_' should have been an id line, but wasn't\n";
          }
      
        }
      
      }
      
      
      sub read_ids {
      
        warn "Collecting mapped ids\n";
      
        my $ids;
      
        while (<SAM>) {
      
          next if (/^@/);
          my ($id) = split(/\t/);
          $ids->{$id} = 1;
        }
      
        return $ids;
      }



      • #4
        Hi Simon,

        thank you so much for that chunk of code, it works like a charm! I worked out a solution of my own as well, but besides the obvious drawbacks (e.g. poor parsing), fgrep appears to consume 40 GB of RAM in order to filter against ~18e6 read ids (in accepted_readIds.txt).

        Code:
        cut --fields=1 ./tophat_out/accepted_hits.sam | sort --unique > ./tophat_out/accepted_readIds.txt
        paste - - - - < ./reads_1.fq | fgrep --invert-match --file ./tophat_out/accepted_readIds.txt | tr "\t" "\n" > ./readsFilt_1.fq
        So thanks again and Cheers,
        Uwe
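
        If memory is the constraint, a sort/join pipeline avoids holding all the ids in RAM (a sketch, assuming read ids contain no spaces or tabs, that the FastQ headers carry exactly the same ids as the SAM file with no /1 or /2 pair suffix, and the file names from the commands above):

```shell
# Flatten FASTQ to one tab-separated record per line, then keep only
# records whose id (field 1, including the FASTQ "@" prefix) has no
# match in the mapped-id list; join -v 1 prints unpairable lines of file 1.
paste - - - - < reads_1.fq | LC_ALL=C sort > reads_flat.txt
sed 's/^/@/' tophat_out/accepted_readIds.txt | LC_ALL=C sort > mapped_ids.txt
LC_ALL=C join -v 1 -t "$(printf '\t')" reads_flat.txt mapped_ids.txt \
  | tr '\t' '\n' > readsFilt_1.fq
```

        Unlike fgrep, join only ever compares two sorted streams line by line, so peak memory stays at whatever sort needs for its temporary runs.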



        • #5
          Simon, that script works wonderfully. Thanks.



          • #6
            Excellent Simon. Thanks!
