Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • fastq xdget? Mosaik issue

    Hello,
    I am using Mosaik to assembly reads using a reference. Here is my problem. The total data set consists of 18 million reads. After filtering the reads for a certain criteria I am interested in I am down to 6 million reads. To get these 6 million reads into a separate file I have converted the fastq to fasta then used xdget to retrieve the reads of interest. I now have a fasta file consisting of the 6 million reads. For the "Build" portion of Mosaik, I need either a fastq file or a fasta file and an accompanying file of base quality scores.

    Does anyone know of an equivalent way retrieve specific sequences from a fastq files (such as xdget to fasta)?

    Thanks,

    John

  • #2
    I think I've figured out a roundabout way of solving this problem using grep commands.

    grep headers form my fasta file:

    >READ_123
    >READ_456
    >READ_789
    >READ_246

    use find and replace in VI editor to create new file containing grep commands:

    grep "READ_123" TOTAL_READS.fastq" -m 2 -A 1 > MyReads.fastq
    grep "READ_456" TOTAL_READS.fastq" -m 2 -A 1 >> MyReads.fastq
    grep "READ_789" TOTAL_READS.fastq" -m 2 -A 1 >> MyReads.fastq
    grep "READ_246" TOTAL_READS.fastq" -m 2 -A 1 >> MyReads.fastq

    -m will take the first 2 occurrences of the search item (ie. READ_123) because fastq files have one header for the base quality score and one header for the sequence

    -A 1 will take one line directly beneath the search item

    The end product is the fastq file of MyReads

    @READ_123
    NGGATTTTTCACTCTACATCCAATANNNNNNNNNNC
    +READ_123
    DMWXYXYZYY[[YYXY[BBBBBBBBBBBBBBBBBBB
    @READ_456
    NCCTCTCCTCAAACATCTTCTCCGANNNNNNNNNNT
    +READ_456
    DLUURWUYYXWSSXBBBBBBBBBBBBBBBBBBBBBB
    @READ_789
    NGCAGTTCTTGTTCCGCGGATCGTGNNNNNNNNNNT
    +READ_789
    DHQNKNWWWSNBBBBBBBBBBBBBBBBBBBBBBBBB
    @READ_246
    NTGCTTTTCATCTTTCGATCACTCTNNNNNNNNNNT
    +READ_246
    DOPTXXXXXYYYYYYBBBBBBBBBBBBBBBBBBBBB
    Last edited by jgibbons1; 01-06-2010, 09:39 AM.

    Comment


    • #3
      So you have an 18 million read FASTQ file, and a 6 million read FASTA file, and you want to make a 6 million read FASTQ file containing all those reads in the FASTA file?

      Personally I'd do this in Biopython, but given the number of records I'd be cautious about indexing an 18 million record file in memory. Instead I'd use Bio.SeqIO.parse to loop over all the FASTA entries and record their identifiers in memory (I'd use a Python set here). Keeping just 6 million short strings in memory shouldn't be a problem. Then loop over the FASTQ file, saving only those records with a desired identifier.

      Something like this should work in Biopython 1.51 or later:
      Code:
      from Bio import SeqIO
      
      #This is a memory efficient list comprehension,
      #only the record identifiers are all kept in memory
      wanted_ids = set(record.id for record in \
                       SeqIO.parse(open("selected.fasta"), "fasta"))
      print "%i unique identifiers in the FASTA file" % len(wanted_ids)
      
      #This is a memory efficient list comprehension,
      #only one record is in memory at a time
      wanted = (record for record in \
                SeqIO.parse(open("everything.fastq"), "fastq") \
                if record.id in wanted_ids)
      
      handle = open("selected.fastq", "w") #output file
      count = SeqIO.write(wanted, handle, "fastq")
      handle.close()
      
      print "%i records saved to FASTQ file" % count
      This can be made faster if need be, using the ideas outlined here:

      Comment


      • #4
        Thanks so much for the insight!

        I'll give this a shot.

        Comment


        • #5
          Originally posted by jgibbons1 View Post
          I think I've figured out a roundabout way of solving this problem using grep commands.
          That's very inventive, although it means creating a 6 million line grep script

          [It also makes some assumptions about the FASTQ files, like no line wrapping and the inclusion of the optional second identifier on the plus lines - but it would get the job done]

          But seriously - learning a scripting language would make this sort of thing much easier (e.g. Python, Perl, ...)
          Last edited by maubp; 01-06-2010, 09:37 AM. Reason: fixed smilie

          Comment


          • #6
            Originally posted by maubp View Post
            But seriously - learning a scripting language would make this sort of thing much easier (e.g. Python, Perl, ...)
            I couldn't agree with you more...learning some perl is on my 2010 to do list!

            Comment


            • #7
              In the rare case that someone is trying to do the same thing as me...here is a handy perl script that takes a fastq file and a headers file and outputs a fastq file from the headers

              #!/usr/local/bin/perl
              use strict;

              ####### Usage: perl parsefastq.pl fastq_file headers_file > output_file

              open(FILE, $ARGV[0]) or die "Fastaq file not found";

              my @fastqarray = <FILE>;

              close FILE;


              open(FILE2, $ARGV[1]) or die "Headers file not found";

              my @headers = <FILE2>;

              close FILE2;

              my @backfastqarray = @fastqarray;
              my $size = @fastqarray;
              my %fastqhash=();
              my @fastqstringarray;
              my @greparray;


              for(my $i=0;$i<$size;$i++) {

              my @fastqstringa = splice(@fastqarray,0,4);
              push(@fastqstringarray,[@fastqstringa]);

              }



              @greparray = grep(/@/, @backfastqarray);

              my $size2 = @greparray;


              for(my $i=0;$i<$size2;$i++) {
              my $key = $greparray[$i];
              $fastqhash{$key} = "@{$fastqstringarray[$i]}";

              }


              foreach my $element(@headers) {


              print $fastqhash{$element};

              }

              Comment


              • #8
                Just FYI, related to this topic, there is another FASTA indexing tool that also supports FASTQ file indexing and querying of random records based on read ID. It's the cdbfasta/cdbyank suite that can be found on this page:
                http://compbio.dfci.harvard.edu/tgi/software/
                (direct link to the source package: ftp://occams.dfci.harvard.edu/pub/bi...dbfasta.tar.gz - don't use the cdbfasta binaries on that site, they are very old and the -Q option wasn't implemented)

                cdbfasta is the indexing program, typically used for FASTA files but with the -Q option it can index FASTQ as well.
                Example usage for fastq files would be:
                Code:
                cdbfasta -Q reads.fq
                .. and this will create the index file reads.fq.cidx.

                cdbyank is the query program (like xdget) for the index file previously created with cdbfasta, and it can take a list of read IDs at stdin, e.g.:
                Code:
                cdbyank reads.fq.cidx < selreads.txt > selreads.fq
                .. and selreads.fq will then have the FASTQ records whose names were given in the selreads.txt file.

                Comment

                Latest Articles

                Collapse

                • seqadmin
                  Recent Advances in Sequencing Analysis Tools
                  by seqadmin


                  The sequencing world is rapidly changing due to declining costs, enhanced accuracies, and the advent of newer, cutting-edge instruments. Equally important to these developments are improvements in sequencing analysis, a process that converts vast amounts of raw data into a comprehensible and meaningful form. This complex task requires expertise and the right analysis tools. In this article, we highlight the progress and innovation in sequencing analysis by reviewing several of the...
                  Today, 07:48 AM
                • seqadmin
                  Essential Discoveries and Tools in Epitranscriptomics
                  by seqadmin




                  The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
                  04-22-2024, 07:01 AM

                ad_right_rmr

                Collapse

                News

                Collapse

                Topics Statistics Last Post
                Started by seqadmin, Today, 07:17 AM
                0 responses
                11 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 05-02-2024, 08:06 AM
                0 responses
                19 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 04-30-2024, 12:17 PM
                0 responses
                20 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 04-29-2024, 10:49 AM
                0 responses
                28 views
                0 likes
                Last Post seqadmin  
                Working...
                X