  • cdbfasta -> Error adding cdb record with key

    Hi,

    After running cdbfasta, I get this message:
    Error adding cdb record with key 'HISEQxxx#xxx/1'
    and I end up with the temporary file fastqfile.idx_tmp instead of fastqfile.idx.

    Code:
    my @result = `cdbyank fastqfile.idx -d fastqfile -a id`;


    The key HISEQxxx#xxx/1 corresponds to this record in my file:
    @HISEQxxx#xxx/1
    TGTGCGAATATACTTGTGAATCTGTGTGTTTATAAAAATGTTGTAGTATATGTTGTGTCTCGGATTACGATGCNTATAAACAAGCCGACGGGTATGTTTTT
    +HISEQxxx#xxx/1
    eececddadUNN^YZb[a_][]XYR^P\Y`ddddWT[QY]]dd_]c_[^_b`bc]]VX`BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB

    Can someone help me?

  • #2
    It looks like you are trying to call cdbyank from a Perl script. If you are ending up with a ".idx_tmp" file, it sounds like the index construction failed, and there should be a clear error message. Try running cdbfasta at the command line to see it. One possible solution is to split the FASTQ file into smaller files and then build an index for each (assuming you are trying to index a huge file).
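A minimal sketch of the splitting step suggested above, assuming every FASTQ record occupies exactly four lines (no wrapped sequence or quality lines). The function name `split_fastq`, the `records_per_chunk` parameter, and the `.partNNNN` naming are hypothetical choices for illustration, not part of cdbfasta.

```python
import itertools
from pathlib import Path


def split_fastq(fastq_path, out_dir, records_per_chunk):
    """Split a four-line-per-record FASTQ file into numbered chunk files.

    Returns the list of chunk paths in the order they were written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    stem = Path(fastq_path).name
    lines_per_chunk = 4 * records_per_chunk  # assumption: 4 lines per record
    chunk_paths = []
    with open(fastq_path, "r") as handle:
        for chunk_no in itertools.count(1):
            # Pull the next block of lines without loading the whole file.
            lines = list(itertools.islice(handle, lines_per_chunk))
            if not lines:
                break
            chunk_path = out_dir / f"{stem}.part{chunk_no:04d}"
            with open(chunk_path, "w") as out:
                out.writelines(lines)
            chunk_paths.append(chunk_path)
    return chunk_paths
```

Each chunk can then be indexed separately with cdbfasta. A suitable `records_per_chunk` depends on the data and is best found by trial: it only needs to be small enough that index construction completes for every chunk.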

    • #3
      Thanks SES,
      Indeed, I split the FASTQ file into smaller files and then built the index for each file.

      • #4
        Originally posted by manore
        Thanks SES,
        Indeed, I split the FASTQ file into smaller files and then built the index for each file.
        Glad to hear that worked for you. I've had a look at the code causing the problem and don't quite understand why this happens on a 64bit system, but splitting up the input into smaller files works for me as well.

        • #5
          Originally posted by SES
          Glad to hear that worked for you. I've had a look at the code causing the problem and don't quite understand why this happens on a 64bit system, but splitting up the input into smaller files works for me as well.
          I just stumbled across this problem myself and it appears that there is a 4GB limit (?on the index file) within cdbfasta. See section "3. Data compression option" on the cdbfasta usage page. Unfortunately compression of the index is not allowed when the input is FASTQ.
          Last edited by kmcarr; 02-28-2012, 03:01 PM. Reason: Fix hyperlink

          • #6
            Originally posted by kmcarr
            I just stumbled across this problem myself and it appears that there is a 4GB limit (?on the index file) within cdbfasta. See section "3. Data compression option" on the cdbfasta usage page. Unfortunately compression of the index is not allowed when the input is FASTQ.
            I couldn't get that link to work, but I think you mean this one: cdbfasta usage.

            That is good information to know. This is somewhat unrelated, but in that section I noticed that there are two options (-F and -R) that are not available when using a compressed database. These options are not documented in the Usage statement printed by cdbfasta. Perhaps they did this intentionally to keep users from improperly invoking those options? Regardless, the compression option is not related to the errors in this thread anyway because I've never used that option.

            • #7
              Originally posted by kmcarr
              I just stumbled across this problem myself and it appears that there is a 4GB limit (?on the index file) within cdbfasta. See section "3. Data compression option" on the cdbfasta usage page. Unfortunately compression of the index is not allowed when the input is FASTQ.
              I think the reason for this limit is due to their "architecture independent" design.
              From the manual:

              Code:
              The index files are now architecture independent, the same index file can be created and used on many different Unix platform (be it 32bit/64bit, big-endian or little-endian architectures) and even Windows.
              I guess this design makes the program available to more users but it also limits the functionality. Pity that multiple versions are not available, though I suppose one could edit the source if they really wanted to.

              • #8
                Originally posted by SES
                I couldn't get that link to work, but I think you mean this one cdbfasta usage.
                Doh! Sorry.
                That is good information to know. This is somewhat unrelated, but in that section I noticed that there are two options (-F and -R) that are not available when using a compressed database. These options are not documented in the Usage statement printed by cdbfasta. Perhaps they did this intentionally to keep users from improperly invoking those options? Regardless, the compression option is not related to the errors in this thread anyway because I've never used that option.
                The error manore reported in the OP
                Code:
                Error adding cdb record with key 'HISEQxxx#xxx/1'
                is what happens when cdbfasta reaches the 4GB limit on its index file. It then dies and leaves the incomplete (and useless) *.cidx_tmp file. This happens regardless of whether the compression options are used; I only mentioned the compression section of the usage page because that's where the 4GB limit is mentioned.

                The -F and -R options are from cdbyank, not cdbfasta. They are included in the usage statement for cdbyank.

                I think the reason for this limit is due to their "architecture independent" design.
                From the manual:

                Code:
                The index files are now architecture independent, the same index file can be created and used on many different Unix platform (be it 32bit/64bit, big-endian or little-endian architectures) and even Windows.
                I guess this design makes the program available to more users but it also limits the functionality. Pity that multiple versions are not available, though I suppose one could edit the source if they really wanted to.
                I asked our CS guy to look at the code to confirm my suspicion about the 4GB limit. He says that indeed the index file is limited to 4GB in size. Further, because of design decisions, data structures, pointer sizes, etc. used throughout the code it would take significant effort to change this.
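A 4GB ceiling is what one would expect if the index format addresses positions with unsigned 32-bit offsets. That is an assumption here: it is consistent with the pointer-size remark above but has not been checked against the cdbfasta source. Under that assumption, a rough sketch of how many records fit is shown below; `bytes_per_record` is a hypothetical average (key length plus whatever per-record bookkeeping the format stores), not a measured value.

```python
# Largest number of bytes addressable with an unsigned 32-bit offset.
MAX_INDEX_BYTES = 2 ** 32  # 4 GiB


def max_records(bytes_per_record, fixed_overhead=0):
    """Upper bound on the records that fit in an index capped at 4 GiB.

    bytes_per_record: hypothetical average index bytes used per read.
    fixed_overhead:   hypothetical bytes used by headers or tables.
    """
    return (MAX_INDEX_BYTES - fixed_overhead) // bytes_per_record


def fits_in_index(n_records, bytes_per_record, fixed_overhead=0):
    """True if n_records would stay within the 4 GiB ceiling."""
    return n_records <= max_records(bytes_per_record, fixed_overhead)
```

For illustration only: at a made-up 60 bytes per record the ceiling would be about 71.5 million reads (`max_records(60)` gives 71582788), so under this model the read count, not just the FASTQ file size, determines whether indexing succeeds.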

                I know we've discussed alternatives, like Bio::Index::Fastq and Bio::DB::Fasta, in other threads and found them not up to the task. Alas, it seems cdbfasta may be added to that list. I was discussing it this afternoon with the aforementioned CS guy, and the alternative we landed on was abandoning FASTQ for storing reads and moving to BAM. Illumina is starting to move in this direction with CASAVA 1.8. Obviously all downstream analysis software will then need to be modified to use BAM as a read input format; fortunately the samtools library already exists to provide the needed functionality.

                • #9
                  Originally posted by kmcarr
                  The -F and -R options are from cdbyank, not cdbfasta. They are included in the usage statement for cdbyank.
                  Okay, thanks. I guess I was not reading too closely.

                  Originally posted by kmcarr
                  I know we've discussed alternatives, like Bio::Index::Fastq and Bio::DB::Fasta, in other threads and found them not up to the task. Alas, it seems cdbfasta may be added to that list. I was discussing it this afternoon with the aforementioned CS guy, and the alternative we landed on was abandoning FASTQ for storing reads and moving to BAM. Illumina is starting to move in this direction with CASAVA 1.8. Obviously all downstream analysis software will then need to be modified to use BAM as a read input format; fortunately the samtools library already exists to provide the needed functionality.
                  This is an interesting topic that seems to be at the forefront of a lot of email lists and discussion boards lately. My solution has been parallel processing of subsets of the reads for trimming, BLAST searches, etc. and then collating the results. I know there has been some discussion of possibly implementing a persistent data structure for efficiently parsing these files in BioPerl. It makes me wary to think about putting a lot of time into this, since so many people are discussing moving away from FASTQ (though some say it is fine). Some of our sequencing providers just provide FASTQ, though some only provide BAMs.
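The "process subsets, then collate" pattern described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the poster's actual pipeline: reads are already in memory as `(header, sequence, plus, quality)` tuples, the per-subset task `count_reads_with_n` is a hypothetical stand-in for trimming or a BLAST search, and threads are used only to keep the example self-contained (CPU-bound work would more likely run as separate processes or cluster jobs).

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """Yield consecutive slices of `items`, each at most `size` long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def count_reads_with_n(records):
    """Hypothetical per-subset task: count reads containing an N base."""
    return sum(1 for _header, seq, _plus, _qual in records if "N" in seq)


def process_in_parallel(records, subset_size, task, workers=4):
    """Run `task` on each subset of reads and return the partial results.

    Results come back in subset order because Executor.map preserves
    the order of its inputs.
    """
    subsets = list(chunked(records, subset_size))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(task, subsets))
```

Collating is then a reduction over the partial results, for example `sum(process_in_parallel(records, 100000, count_reads_with_n))` for a simple count.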

                  However, since we are probably all under a lot of pressure to write papers, theses, present results, and so on, it still seems sensible to develop an improvement for working with these files right now, even if the formats change and it is a short-lived solution.

                  • #10
                    Dear All

                    I have paired-end Illumina sequences and tried to use cdbfasta. When I ran cdbfasta to generate the index file, I got this error:
                    Error adding cdb record with key 'HWI-ST365:2580R6LACXX:5:2106:19203:15268'
                    My file is around 22 GB after concatenation. Does that have something to do with this error?


                    Regards

                    • #11
                      Originally posted by figo1019
                      My file is around 22 GB after concatenation. Does that have something to do with this error?
                      figo,

                      Look at the discussion above. cdbfasta has a 4GB limit on the size of its index file, which means it will not work for any moderately large Illumina FASTQ file.
