
  • #16
    Originally posted by maubp View Post
    Sign up to the BioPerl mailing list and talk to them - it may encourage someone to work on it, especially if you offer to help test it
    I'm on the mailing list and this is something I've been thinking about. It can't hurt to ask for some guidance, or at least offer to test. Thanks for the suggestion.



    • #17
      Originally posted by themerlin View Post
      The mira assembler:

      Download MIRA for free. MIRA - Sequence assembler and sequence mapping for whole genome shotgun and EST / RNASeq sequencing data. Can use Sanger, 454, Illumina and IonTorrent data.


      comes with a script called fastqselect.tcl. The script takes a name file and pulls out your sequences of interest. I have used biopython scripts to do the same thing, but this fastqselect tool is much faster.

      J
      I tried this script but it doesn't seem to play nice with fastq files that don't contain the sequence name after the "+" before the quality score line (which is optional):

      Fastq file:

      Code:
      @seq1
      CGAGATTGGTTGTCTCCTACTACCGAGTTGCCTCGAGAGCACCAGACCCGTCCGCCTCGCCTTCAGGGTTTTTTTCCGGGCAGCGATCCGAGGTCATCGACTGGGTGTTTAGCCCGGGCGACGCCCCGATCCCCCAATCTCG
      +
      @@@FFDFDHBF?FGBGHIGHGGII@F@:EBDFGGGGEGIIIIFEIIIIII;CHHGEEFFCCC@CCCCDDDDDDDDDDDDDBBB@>@BD@BBADCA@@CA:DDBDDCAA>B@>DFBAHBE@7@FEJGGHGDDFFAAFDFF@@@
      Script output:

      Code:
      Reading names
      Copying sequence data
      Last sequence name: seq1
      Now reading line: +
      The names don't match?!
          while executing
      "error "Last sequence name: $seqname\nNow reading line: $line\nThe names don't match?!""
          (procedure "conditionalFASTQCopy" line 28)
          invoked from within
      "conditionalFASTQCopy $fin $fout"
          (procedure "faqsel::processit" line 16)
          invoked from within
      "faqsel::processit"
          (file "/bioware/mira/scripts/fastqselect.tcl" line 166)
      Any thoughts or a fix for this?
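
For anyone hitting the same error, here is a minimal Python sketch of name-based selection that accepts a bare "+" line. It assumes strict four-line records, and the function name select_fastq is made up for illustration; it is not part of MIRA or fastqselect.tcl, and it will not be as fast as an indexed tool on very large files.

```python
import io


def select_fastq(names, fin, fout):
    """Copy FASTQ records whose ID is in `names` from fin to fout.

    Assumes exactly four lines per record. The "+" line is only checked
    for its leading "+"; an optional repeated name after it is ignored.
    Returns the number of records copied.
    """
    wanted = set(names)
    copied = 0
    while True:
        header = fin.readline()
        if not header:
            break  # end of file
        if not header.strip():
            continue  # tolerate stray blank lines between records
        seq, plus, qual = fin.readline(), fin.readline(), fin.readline()
        if not header.startswith("@") or not plus.startswith("+") or not qual:
            raise ValueError(f"Malformed record near {header!r}")
        # The ID is the first whitespace-delimited token after "@".
        tokens = header[1:].split()
        read_id = tokens[0] if tokens else ""
        if read_id in wanted:
            fout.write(header + seq + plus + qual)
            copied += 1
    return copied


# Tiny made-up input: one record with a bare "+", one with the name repeated.
demo = "@seq1\nACGT\n+\n@@FF\n@seq2 extra\nGGCC\n+seq2 extra\nIIII\n"
out = io.StringIO()
copied = select_fastq(["seq1"], io.StringIO(demo), out)
print(copied)
print(out.getvalue(), end="")
```

With real files you would pass open file handles instead of the StringIO objects, and read the wanted names from your name file, one per line.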



      • #18
        Hmm... I just copied your fastq sequence into one of my fastq files, then pulled it out successfully with that script. The missing name after the "+" doesn't appear to adversely affect the script in my tests. I'm not sure what the problem might be. Have you tested the script on a different fastq file?



        • #19
          Originally posted by greigite View Post
          I tried this script but it doesn't seem to play nice with fastq files that don't contain the sequence name after the "+" before the quality score line (which is optional):
          ...
          Any thoughts or a fix for this?
          Tell Bastien (the MIRA author) - it should be easy for him to fix once he knows about it.



          • #20
            I've just been chatting with Peter Rice from EMBOSS, and v6.3 can do this with the dbxflat and seqret tools. The documentation should be updated to make this more obvious. In addition to FASTQ, this handles other major flat file formats too (there is a more specialised tool for FASTA files, dbxfasta, to handle all the different ID line conventions).



            • #21
              cdbfasta

              One issue I've noticed with using cdbfasta -Q to index fastq files is that it uses the "@" character as a record delimiter. That works if the qual scores use phred+64 encoding but not if they use phred+33 (Sanger encoding) because the "@" character has a decimal value of 64. If you have quality score lines beginning with "@" the records are not correctly parsed.
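
To make the arithmetic concrete, here is a small Python sketch of the offsets involved. The helper name phred_score is made up for illustration. Under phred+33, "@" decodes to Q31, so it turns up at the start of quality lines all the time; under phred+64 it decodes to Q0, which can in principle occur but should be rare.

```python
def phred_score(char, offset):
    """Decode one quality character to its Phred score for a given offset."""
    return ord(char) - offset


print(ord("@"))              # 64
print(phred_score("@", 33))  # 31 (Sanger / phred+33)
print(phred_score("@", 64))  # 0  (phred+64)

# First few characters of the quality line posted earlier in the thread,
# decoded as phred+33.
print([phred_score(c, 33) for c in "@@@FF"])  # [31, 31, 31, 37, 37]
```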


              Originally posted by SES View Post
              You have 1.4 billion reads in one file? And you used this with BLAST?

              (I don't know what you are trying to do but splitting the data, if possible, will speed up any procedure)

              Anyway, these all sound like great solutions, but I would like to point out that cdbfasta has the -Q option to index fastq files and cdbyank can be used to pull the requested ID or IDs from a list. I have not used these other tools but I have tried BioPerl's Fastq indexing method and SeqIO module for pulling Fastq entries and it became clear to me that these were just not practical solutions for the size of modern NGS sequence files. cdbfasta will probably be the fastest solution for pulling reads, but like any indexing method, you have to create the index. I don't know what is best for your application but it looks like you have some options.
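
The general idea behind index-then-fetch can be sketched in a few lines of Python. This is only an illustration of the approach, assuming strict four-line records; it is not how cdbfasta stores its index, and build_index and fetch are hypothetical names.

```python
import io


def build_index(handle):
    """Map each read ID to the byte offset of its "@" header line.

    `handle` must be opened in binary mode so that tell() and seek()
    work on raw byte offsets. Assumes exactly four lines per record.
    """
    index = {}
    while True:
        offset = handle.tell()
        header = handle.readline()
        if not header:
            break  # end of file
        if not header.strip():
            continue  # skip stray blank lines
        for _ in range(3):
            handle.readline()  # skip sequence, "+" and quality lines
        tokens = header[1:].split()
        index[tokens[0].decode() if tokens else ""] = offset
    return index


def fetch(handle, index, read_id):
    """Seek straight to a record and return its four lines as text."""
    handle.seek(index[read_id])
    return b"".join(handle.readline() for _ in range(4)).decode()


# Made-up two-record file held in memory.
data = io.BytesIO(b"@seq1\nACGT\n+\n@@FF\n@seq2\nGGCC\n+\nIIII\n")
index = build_index(data)
print(index)
print(fetch(data, index, "seq2"), end="")
```

The index costs one full pass over the file, but each lookup afterwards is a single seek, which is why this pays off when you pull many IDs from the same file.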



              • #22
                Originally posted by greigite View Post
                One issue I've noticed with using cdbfasta -Q to index fastq files is that it uses the "@" character as a record delimiter. That works if the qual scores use phred+64 encoding but not if they use phred+33 (Sanger encoding) because the "@" character has a decimal value of 64. If you have quality score lines beginning with "@" the records are not correctly parsed.
                Except I don't believe it solely depends on the record delimiter if you tell it the input file is FASTQ. It expects that the FASTQ file uses at least 4 lines per record and it doesn't start checking for a new record delimiter until after it has read 4 lines. If your FASTQ file sticks to the usual convention of using exactly 4 lines per record (I know it's not required by the standard) then you should be o.k. even if a quality line starts with '@'.

                *Note that this is based on a rudimentary understanding of C; someone please correct me if I'm wrong.
                Last edited by kmcarr; 08-08-2011, 11:54 AM. Reason: typo fix
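
The difference between the two behaviours can be sketched in Python as follows. This only models the logic as described in this post (ignore "@" until four lines of the current record have been read); it is not taken from the cdbfasta source, and both function names are made up.

```python
def count_naive(lines):
    """Treat every line starting with "@" as the start of a new record."""
    return sum(1 for line in lines if line.startswith("@"))


def count_after_four(lines):
    """Only accept "@" as a record start once the previous record has
    at least four lines, as described in the post above."""
    records = 0
    lines_in_record = 4  # lets the very first "@" line open a record
    for line in lines:
        if line.startswith("@") and lines_in_record >= 4:
            records += 1
            lines_in_record = 1
        else:
            lines_in_record += 1
    return records


# Two made-up records; the first quality line starts with "@".
lines = ["@seq1", "ACGT", "+", "@@FF", "@seq2", "GGCC", "+", "IIII"]
print(count_naive(lines))       # 3 (quality line miscounted as a header)
print(count_after_four(lines))  # 2
```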



                • #23
                  Originally posted by greigite View Post
                  One issue I've noticed with using cdbfasta -Q to index fastq files is that it uses the "@" character as a record delimiter. That works if the qual scores use phred+64 encoding but not if they use phred+33 (Sanger encoding) because the "@" character has a decimal value of 64. If you have quality score lines beginning with "@" the records are not correctly parsed.
                  I believe you are correct about the delimiter, but ...

                  Originally posted by kmcarr View Post
                  Except I don't believe it solely depends on the record delimiter if you tell it the input file is FASTQ. It expects that the FASTQ file uses at least 4 lines per record and it doesn't start checking for a new record delimiter until after it has read 4 lines. If your FASTQ file sticks to the usual convention of using exactly 4 lines per record (I know it's not required by the standard) then you should be o.k. even if a quality line starts with '@'.
                  This has been my experience. If you give cdbyank a list of IDs then it will return 4 lines (for my particular data) for each record/ID. I guess it depends on how you are using the index but I have not had any problems.



                  • #24
                    Thanks for the responses. It sounds like I might be doing something else wrong with cdbfasta that's causing it not to work properly.

