Seqanswers Leaderboard Ad

**KevinLam** · 09-14-2010, 11:13 PM

you can try

G-SQZ: compact encoding of genomic sequence and quality data - PubMed

http://www.ncbi.nlm.nih.gov/pubmed/20605925

available at no cost under a non-open-source license by requesting from the web-site; Binary: available for direct download at no cost. For-Profit: Submit request for for-profit license from the web-site.

Bioinformatics. 2010 Jul 6. [Epub ahead of print]
G-SQZ: Compact Encoding of Genomic Sequence and Quality Data.

Tembe W, Lowey J, Suh E.

Translational Genomics Research Institute, 445 N 5th Street, Phoenix, AZ 85004, USA.
Abstract

SUMMARY: Large volumes of data generated by high-throughput sequencing instruments present non-trivial challenges in data storage, content access, and transfer. We present G-SQZ, a Huffman coding-based sequencing-reads specific representation scheme that compresses data without altering the relative order. G-SQZ has achieved from 65% to 81% compression on benchmark datasets, and it allows selective access without scanning and decoding from start. This paper focuses on describing the underlying encoding scheme and its software implementation, and a more theoretical problem of optimal compression is out of scope. The immediate practical benefits include reduced infrastructure and informatics costs in managing and analyzing large sequencing data. AVAILABILITY: http://public.tgen.org/sqz. Academic/non-profit: Source: available at no cost under a non-open-source license by requesting from the web-site; Binary: available for direct download at no cost. For-Profit: Submit request for for-profit license from the web-site. CONTACT: Waibhav Tembe ([email protected]).

or maybe use bioperl

Page not found · GitHub Pages

http://www.bioperl.org/wiki/Module:Bio::Index::Fastq

if the number of reads are not a lot

**simonandrews** · 09-15-2010, 12:03 AM

You probably need to do this in two passes. It's also a pain that tophat seems to alter the sequence id (but maybe this is because I was using paired end data?), so you have to ajdust the ids a bit.

The code below seems to work on the tophat files I just ran it against.

Code:

#!/usr/bin/perl
use warnings;
use strict;

my ($fastq,$sam,$outfile) = @ARGV;

unless ($outfile) {
  die "Usage is filter_unmapped_reads.pl [FastQ file] [SAM File] [File for unmapped reads]\n";
}

if (-e $outfile) {
  die "Won't overwrite an existing file, delete it first!";
}

open (FASTQ,$fastq) or die "Can't open fastq file: $!";
open (SAM,$sam) or die "Can't open SAM file: $!";
open (OUT,'>',$outfile) or die "Can't write to $outfile: $!";

my $ids = read_ids();

filter_fastq($ids);

close OUT or die "Can't write to $outfile: $!";


sub filter_fastq {

  warn "Filtering FastQ file\n";

  my ($ids) = @_;

  while (<FASTQ>) {

    if (/^@(\S+)/) {
      my $id = $1;

      # Remove the end designator from paired end reads
      $id =~ s/\/\d+$//;

      my $seq = <FASTQ>;
      my $id2 = <FASTQ>;
      my $qual = <FASTQ>;


      unless (exists $ids->{$id}) {
	print OUT $_,$seq,$id2,$qual;
      }
    }
    else {
      warn "Line '$_' should have been an id line, but wasn't\n";
    }

  }

}


sub read_ids {

  warn "Collecting mapped ids\n";

  my $ids;

  while (<SAM>) {

    next if (/^@/);
    my ($id) = split(/\t/);
    $ids->{$id} = 1;
  }

  return $ids;
}

**Uwe Appelt** · 09-15-2010, 04:07 AM

Hi Simon,

thank you so much for that chunk of code, it works like a charme! I worked out a solution of my own as well, but besides the obvious (e.g. poor parsing) drawbacks, fgrep appears to consume 40Gb of RAM in order to filter for ~18e6 read ids (in accepted_readIds.txt).

Code:

cut --fields=1 ./tophat_out/accepted_hits.sam | sort --unique > ./tophat_out/accepted_readIds.txt
paste - - - - < ./reads_1.fq | fgrep --invert-match --file ./tophat_out/accepted_readsIds.txt | tr "\t" "\n" > ./readsFilt_1.fq

So thanks again and Cheers,
Uwe

**chadn737** · 07-06-2011, 01:35 PM

Simon, that script works wonderfully. Thanks.

**fjrossello** · 08-07-2012, 04:33 AM

Excellent Simon. Thanks!

Topics	Statistics	Last Post
New Method for DNA Sequence Amplification by seqadmin Started by seqadmin, Today, 08:18 AM	0 responses 7 views 0 likes	Last Post by seqadmin Today, 08:18 AM
New Tools Enhance Single-Molecule DNA Analysis with Minimal Samples by seqadmin Started by seqadmin, Today, 08:04 AM	0 responses 6 views 0 likes	Last Post by seqadmin Today, 08:04 AM
SIX2 Protein Identified as a Key Player in Prostate Cancer Treatment Resistance by seqadmin Started by seqadmin, 06-03-2024, 06:55 AM	0 responses 13 views 0 likes	Last Post by seqadmin 06-03-2024, 06:55 AM
Genetic Mosaicism More Prevalent Than Previously Thought by seqadmin Started by seqadmin, 05-30-2024, 03:16 PM	0 responses 27 views 0 likes	Last Post by seqadmin 05-30-2024, 03:16 PM

Seqanswers Leaderboard Ad

Announcement

Extract unaligned reads (Tophat) from FastQ

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News