
  • cdbfasta -> Error adding cdb record with key

    Hi,

    After running cdbfasta, I get this message:
    Error adding cdb record with key 'HISEQxxx#xxx/1'
    and I end up with a temporary file, fastqfile.idx_tmp, instead of fastqfile.idx.

    [CODE]my @result = `cdbyank fastqfile.idx -d fastqfile -a id`;[/CODE]


    The key HISEQxxx#xxx/1 corresponds to this record in my file:
    @HISEQxxx#xxx/1
    TGTGCGAATATACTTGTGAATCTGTGTGTTTATAAAAATGTTGTAGTATATGTTGTGTCTCGGATTACGATGCNTATAAACAAGCCGACGGGTATGTTTTT
    +HISEQxxx#xxx/1
    eececddadUNN^YZb[a_][]XYR^P\Y`ddddWT[QY]]dd_]c_[^_b`bc]]VX`BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB

    Can someone help me?

  • #2
    It looks like you are trying to call cdbyank from a Perl script. It sounds like the index construction failed if you are ending up with a ".idx_tmp" and there should be a clear message. Try running cdbfasta at the command line to see the error message. One possible solution might be to split the fastq file into smaller files and then build the index of each (assuming you are trying to index a huge file).
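The splitting workaround suggested above can be sketched in Python. This is only a minimal illustration, assuming a FASTQ file with exactly four lines per record; the function name `split_fastq` and the chunk naming scheme are hypothetical and not part of cdbfasta. Each resulting chunk would then be indexed separately.

```python
import itertools
from pathlib import Path


def split_fastq(path, records_per_chunk, out_prefix):
    """Split a 4-line-per-record FASTQ file into numbered chunk files.

    Returns the list of chunk paths that were written.
    """
    if records_per_chunk < 1:
        raise ValueError("records_per_chunk must be at least 1")
    lines_per_chunk = 4 * records_per_chunk
    written = []
    with open(path, "r") as handle:
        for index in itertools.count(1):
            # islice pulls at most one chunk's worth of lines, so only one
            # chunk is held in memory at a time.
            chunk = list(itertools.islice(handle, lines_per_chunk))
            if not chunk:
                break
            if len(chunk) % 4 != 0:
                raise ValueError("truncated FASTQ record in chunk %d" % index)
            chunk_path = Path("%s.%04d.fastq" % (out_prefix, index))
            chunk_path.write_text("".join(chunk))
            written.append(chunk_path)
    return written
```

The chunk size has to be chosen by trial: the thread does not say how many records fit under the limit, so pick a value small enough that each chunk's index builds cleanly.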



    • #3
      Thanks SES,
      Indeed, I split the FASTQ file into smaller files and then built an index for each one.



      • #4
        Originally posted by manore
        Thanks SES,
        Indeed, I split the FASTQ file into smaller files and then built an index for each one.
        Glad to hear that worked for you. I've had a look at the code causing the problem and don't quite understand why this happens on a 64bit system, but splitting up the input into smaller files works for me as well.



        • #5
          Originally posted by SES
          Glad to hear that worked for you. I've had a look at the code causing the problem and don't quite understand why this happens on a 64bit system, but splitting up the input into smaller files works for me as well.
          I just stumbled across this problem myself and it appears that there is a 4GB limit (on the index file?) within cdbfasta. See section "3. Data compression option" on the cdbfasta usage page. Unfortunately compression of the index is not allowed when the input is FASTQ.
          Last edited by kmcarr; 02-28-2012, 03:01 PM. Reason: Fix hyperlink



          • #6
            Originally posted by kmcarr
            I just stumbled across this problem myself and it appears that there is a 4GB limit (on the index file?) within cdbfasta. See section "3. Data compression option" on the cdbfasta usage page. Unfortunately compression of the index is not allowed when the input is FASTQ.
            I couldn't get that link to work, but I think you mean this one: cdbfasta usage.

            That is good information to know. This is somewhat unrelated, but in that section I noticed that there are two options (-F and -R) that are not available when using a compressed database. These options are not documented in the Usage statement printed by cdbfasta. Perhaps they did this intentionally to keep users from improperly invoking those options? Regardless, the compression option is not related to the errors in this thread anyway because I've never used that option.



            • #7
              Originally posted by kmcarr
              I just stumbled across this problem myself and it appears that there is a 4GB limit (on the index file?) within cdbfasta. See section "3. Data compression option" on the cdbfasta usage page. Unfortunately compression of the index is not allowed when the input is FASTQ.
              I think the reason for this limit is due to their "architecture independent" design.
              From the manual:

              Code:
              The index files are now architecture independent, the same index file can be created and used on many different Unix platform (be it 32bit/64bit, big-endian or little-endian architectures) and even Windows.
              I guess this design makes the program available to more users but it also limits the functionality. Pity that multiple versions are not available, though I suppose one could edit the source if they really wanted to.
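To illustrate why a fixed-width, platform-independent layout can impose a 4GB ceiling, here is a small Python sketch. It assumes that offsets are stored as 32-bit unsigned integers; that is a guess about the index format, not something taken from the cdbfasta source, and the helper names are hypothetical.

```python
import struct

MAX_UINT32 = 2**32 - 1  # largest value a 32-bit unsigned field can hold


def pack_offset(offset):
    """Pack a byte offset as a 4-byte little-endian unsigned integer."""
    return struct.pack("<I", offset)


def fits_in_32_bits(offset):
    """Return True if the offset can be stored in a 32-bit unsigned field."""
    try:
        pack_offset(offset)
    except struct.error:
        # struct refuses values outside 0..4294967295 for the "I" format.
        return False
    return True
```

Under that assumption, any position at or beyond 2**32 bytes (4 GiB) simply cannot be written into the field, regardless of whether the machine itself is 32-bit or 64-bit.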



              • #8
                Originally posted by SES
                I couldn't get that link to work, but I think you mean this one cdbfasta usage.
                Doh! Sorry.
                That is good information to know. This is somewhat unrelated, but in that section I noticed that there are two options (-F and -R) that are not available when using a compressed database. These options are not documented in the Usage statement printed by cdbfasta. Perhaps they did this intentionally to keep users from improperly invoking those options? Regardless, the compression option is not related to the errors in this thread anyway because I've never used that option.
                The error manore reported in the OP
                Code:
                Error adding cdb record with key 'HISEQxxx#xxx/1'
                is what happens when cdbfasta reaches the 4GB limit on its index file. It then dies and leaves the incomplete (and useless) *.cidx_tmp file. This happens regardless of whether the compression option is used; I only mentioned the compression section of the usage page because that's where it mentions the 4GB limit.
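A script that calls cdbyank could check for the leftover temporary file before trusting an index. The following Python sketch assumes the temporary file is named by appending `_tmp` to the index path, as in the file names reported in this thread; the function name `index_state` is hypothetical.

```python
from pathlib import Path


def index_state(index_path, tmp_suffix="_tmp"):
    """Classify an index as 'complete', 'incomplete', or 'missing'.

    tmp_suffix is an assumption based on the file names reported in
    this thread (for example fastqfile.idx_tmp).
    """
    index_path = Path(index_path)
    tmp_path = Path(str(index_path) + tmp_suffix)
    if tmp_path.exists():
        # A leftover temporary file is treated as a failed build, even if
        # an older finished index happens to sit next to it.
        return "incomplete"
    if index_path.exists():
        return "complete"
    return "missing"
```

For example, `index_state("fastqfile.idx")` would report "incomplete" in the situation described in the first post, where only fastqfile.idx_tmp was produced.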

                The -F and -R options are from cdbyank, not cdbfasta. They are included in the usage statement for cdbyank.

                I think the reason for this limit is due to their "architecture independent" design.
                From the manual:

                Code:
                The index files are now architecture independent, the same index file can be created and used on many different Unix platform (be it 32bit/64bit, big-endian or little-endian architectures) and even Windows.
                I guess this design makes the program available to more users but it also limits the functionality. Pity that multiple versions are not available, though I suppose one could edit the source if they really wanted to.
                I asked our CS guy to look at the code to confirm my suspicion about the 4GB limit. He says that indeed the index file is limited to 4GB in size. Further, because of design decisions, data structures, pointer sizes, etc. used throughout the code it would take significant effort to change this.

                I know we've discussed alternatives, like Bio::Index::Fastq and Bio::DB::Fasta, in other threads and found them not up to the task. Alas it seems cdbfasta may be added to that list. I was discussing it this afternoon with aforementioned CS guy and the alternative we landed on was abandoning FASTQ for storing reads and moving to BAM. Illumina is starting to move in this direction with CASAVA 1.8. Obviously all downstream analysis software will then need to be modified to use BAM as a read input format; fortunately the samtools library already exists to provide the needed functionality.



                • #9
                  Originally posted by kmcarr
                  The -F and -R options are from cdbyank, not cdbfasta. They are included in the usage statement for cdbyank.
                  Okay, thanks. I guess I was not reading too closely.

                  Originally posted by kmcarr
                  I know we've discussed alternatives, like Bio::Index::Fastq and Bio::DB::Fasta, in other threads and found them not up to the task. Alas it seems cdbfasta may be added to that list. I was discussing it this afternoon with aforementioned CS guy and the alternative we landed on was abandoning FASTQ for storing reads and moving to BAM. Illumina is starting to move in this direction with CASAVA 1.8. Obviously all downstream analysis software will then need to be modified to use BAM as a read input format; fortunately the samtools library already exists to provide the needed functionality.
                  This is an interesting topic that seems to be on the forefront of a lot of email lists and discussion boards lately. My solution has been parallel processing of subsets of the reads for trimming, BLAST searches, etc. and then collating results. I know there has been some discussion of possibly implementing a persistent data structure for efficiently parsing these files in BioPerl. It makes me wary to think about putting a lot of time into this since so many people are discussing going away from Fastq (though some say it is fine). Some of our sequencing providers just provide Fastq, though some only provide BAMs.
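The "process subsets, then collate" approach can be sketched as follows. This is a simplified Python illustration, not the workflow actually used by the poster: the per-chunk task here only counts reads and bases, the function names are hypothetical, and threads are used for brevity (a CPU-bound step such as trimming would more likely use separate processes or cluster jobs).

```python
from concurrent.futures import ThreadPoolExecutor


def summarize_chunk(path):
    """Return (read_count, base_count) for one 4-line-per-record FASTQ chunk."""
    reads = 0
    bases = 0
    with open(path, "r") as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 == 1:  # the sequence line of each record
                reads += 1
                bases += len(line.rstrip("\n"))
    return reads, bases


def summarize_chunks(paths, workers=4):
    """Process chunks concurrently, then collate the per-chunk results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves the order of the input paths.
        per_chunk = list(pool.map(summarize_chunk, paths))
    total_reads = sum(reads for reads, _ in per_chunk)
    total_bases = sum(bases for _, bases in per_chunk)
    return {"reads": total_reads, "bases": total_bases, "per_chunk": per_chunk}
```

The same shape applies to any per-chunk step whose results can be merged afterwards: run the step independently on each subset, keep the outputs in input order, and combine them at the end.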

                  However, since we are probably all under a lot of pressure to write papers, theses, present results, and so on, it still seems sensible to develop an improvement for working with these files right now, even if the formats change and it is a short-lived solution.



                  • #10
                    Dear All

                    I have paired-end Illumina sequences and tried to index them with cdbfasta. When I ran cdbfasta to generate the index file, I got this error:
                    Error adding cdb record with key 'HWI-ST365:2580R6LACXX:5:2106:19203:15268'
                    My file is about 22 GB after concatenation. Does that have something to do with this error?


                    Regards



                    • #11
                      Originally posted by figo1019
                      My file is about 22 GB after concatenation. Does that have something to do with this error?
                      figo,

                      Look at the discussion above. cdbfasta limits its index file to 4GB, which means it will not work for any moderately large Illumina FASTQ file.
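A rough pre-flight check can give a feel for whether a FASTQ file is likely to hit the limit. The Python sketch below is only an estimate under stated assumptions: the limit is taken as 2**32 bytes, the key is the first whitespace-delimited token of each header line, and `per_record_overhead` is a made-up placeholder, not a documented cdbfasta value. The function names are hypothetical.

```python
INDEX_LIMIT_BYTES = 4 * 1024**3  # the 4GB ceiling discussed in this thread


def estimate_index_bytes(fastq_path, per_record_overhead=32):
    """Roughly estimate index size: key bytes plus a fixed cost per record.

    per_record_overhead is a guess, not a documented cdbfasta value.
    """
    total = 0
    with open(fastq_path, "rb") as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 == 0:  # header line, e.g. b"@HISEQ...\n"
                parts = line[1:].split()
                key = parts[0] if parts else b""
                total += len(key) + per_record_overhead
    return total


def likely_exceeds_limit(fastq_path, per_record_overhead=32):
    """Return True if the estimate reaches the assumed 4GB ceiling."""
    estimate = estimate_index_bytes(fastq_path, per_record_overhead)
    return estimate >= INDEX_LIMIT_BYTES
```

Because the overhead figure is a guess, the result is best used to decide how aggressively to split the input, not as a guarantee that an index build will succeed.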
