  • PinkTips
    Member
    • Feb 2019
    • 10

    BBSplit assertion error: invalid fasta file

    Good morning, BBMappers!

    I have been trying to run BBSplit (on my university's computing cluster) to remove host sequences from metatranscriptome data of a gut community.

    This is the command I am using:
    Code:
    /home/hd55218/BBSplit/bbmap/bbsplit.sh in=/home/hd55218/BBSplit/QualTrimmed_bran11.fasta ref=/home/hd55218/BBSplit/p.americana_genome.fasta,/home/hd55218/BBSplit/Blattabacterium_genome.fasta basename=out_%.fasta outu=/home/hd55218/BBSplit/cleaned_bran11.fasta
    The error message returned after running on the cluster is:
    Code:
    Exception in thread "main" java.lang.AssertionError: Invalid input file: '/home/hd55218/BBSplit/QualTrimmed_bran11.fasta'
            at align2.AbstractMapper.preparse0(AbstractMapper.java:821)
            at align2.AbstractMapper.<init>(AbstractMapper.java:53)
            at align2.BBMap.<init>(BBMap.java:43)
            at align2.BBMap.main(BBMap.java:31)
            at align2.BBSplitter.main(BBSplitter.java:47)
    The first four sequences in my FASTA file appear as:
    Code:
    >NB502039:96:HGLYGBGX3:1:11101:19340:2795 1:N:0:AGTTCC
    GTCCTCTTCCGGGGTCTGGGTGCCAAGGCCCATCGCCTGCAGACCTTCGTTCAGCGGGGTGTACACGGGGCCTTCGAATGCGCCATCGATGACCACGGTCGTCTTGTCATACTCGTTGCCGAAGTTCGCCATTTCGATCTGCAGCGGCTCCAGATCCAGCGTGGTGTAGTCGATGTCCACACGGCTGGGGGGGGGCACGCCGCCGGTGACGAGCCTGTAGGTCTGGCACTCCCC
    >NB502039:96:HGLYGBGX3:1:11101:23904:2797 1:N:0:AGTTCC
    CCGCCTTCAACGCCAAGAGCGCGAATTATGCGTATAGATGCACTTCTAAGCATCATGAGTTCTCTATCAGAAAGTGTTTGCGCAGGAGCTGCAACTATACTGTCACCTGTATGAACACCAACAGGGTCAAGGTTTTCCATCGAACAAATCGTAATGCAGTTATCTGCG
    >NB502039:96:HGLYGBGX3:1:11101:16907:2810 1:N:0:AGTTCC
    GGCACCGAACGCCTTGGCAGCCAAAGCCATAGCCGGCACGAACTGACGGTCGCCGACCGTCTTGCCGCCGCCCGCTCCGGGACGCTGCACCGAGTGGGTACAGTCCATTATCACGCGTGGCGTTATCTGCTTCATATCGGGAATATTGCGGAAATCAACCACCAAGTTATTGTACCCGAAGCTGTTGCCTCGCTCTATCAACCACACGTTTTCGTTACCGCTCTCGCGCACTTTCTGCACGG
    >NB502039:96:HGLYGBGX3:1:11101:20216:2823 1:N:0:AGTTCC
    TAAAGGCAAATGGCTCTATCATGAAATCCTGGAGCCGGGCGTGTTGGTGCATGTTTCTGAGAGCGGTGCCAAAGTATGGACCGTTCGCTGTGGTTCCCCCCGTCTGGTAACGGTCAATTATGTTCGCG
    This FASTA file was converted from a FASTQ file using:
    Code:
    paste - - - - < Qualtrimmed_bran11.fastq | cut -f 1,2 | sed 's/^@/>/' | tr "\t" "\n" > Qualtrimmed_bran11.fasta
    I am stumped as to why my FASTA format is invalid, so any thoughts/help would be greatly appreciated! Thanks!
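A quick sanity check on a conversion like the one above is to compare record counts before and after. This is only a sketch: it assumes a strict four-lines-per-record FASTQ (no wrapped sequence lines), and the function names are made up for illustration.

```shell
# Count records in a FASTQ file, assuming exactly 4 lines per record.
count_fastq_records() {
    fq_lines=$(wc -l < "$1")
    echo $(( fq_lines / 4 ))
}

# Count records in a FASTA file by counting header lines that start with '>'.
count_fasta_records() {
    grep -c '^>' "$1"
}
```

If the two counts differ for a FASTQ file and the FASTA made from it, the FASTQ may contain wrapped lines or a truncated final record, which a paste-based one-liner cannot handle.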
    Last edited by GenoMax; 03-21-2019, 11:36 AM.
  • GenoMax
    Senior Member
    • Feb 2008
    • 7142

    #2
    Do you get an error right away or does the program run for some time?

    Are these bbmerged reads? I wonder if you should try the binning with the original fastq data. Is that a possibility?


    • PinkTips
      Member
      • Feb 2019
      • 10

      #3
      From what I can tell, the error shows up right away. I only get an email when my job is finished on the cluster, but the log file shows the error appearing right away (right after the reference files are merged).

      Yes, these reads were merged with bbmerge.

      I was under the impression that bbsplit wanted the reads as FASTA files, but I will try with the FASTQ files!

      Thank you, and I will let you know how it goes!


      • GenoMax
        Senior Member
        • Feb 2008
        • 7142

        #4
        BBSplit will take fastq reads and bin them. Let me know how that works.

        You can convert the merged fastq reads afterwards with
        Code:
        reformat.sh in=merged.fq.gz out=merged.fa
        No paste/cut/sed needed :-)


        • PinkTips
          Member
          • Feb 2019
          • 10

          #5
          Unfortunately, I get the same assertion error as before when I use my FASTQ file.

          The first few sequences of the FASTQ file:
          Code:
          @NB502039:96:HGLYGBGX3:1:11101:19340:2795 1:N:0:AGTTCC
          GTCCTCTTCCGGGGTCTGGGTGCCAAGGCCCATCGCCTGCAGACCTTCGTTCAGCGGGGTGTACACGGGGCCTTCGAATGCGCCATCGATGACCACGGTCGTCTTGTCATACTCGTTGCCGAAGTTCGCCATTTCGATCTGCAGCGGCTCCAGATCCAGCGTGGTGTAGTCGATGTCCACACGGCTGGGGGGGGGCACGCCGCCGGTGACGAGCCTGTAGGTCTGGCACTCCCC
          +
          AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEAEEEEEEEEEEEEEEEE6EEEEEEEEEEEEEJJJJHJHJHJJJJJJJJJJJDJGFJJJJJJHJJJJJJJJHJ?JHJJJJJJJHJJ7JHJJHJJJJJHJEEEEEEEEEEE/EEAE/EEAA/E/E/EEEEEEEEE/AEEEE/EEEEEAEE/EEEEEAEEEE/EAEEAAEEE/AE/EEEAAAAAA
          @NB502039:96:HGLYGBGX3:1:11101:23904:2797 1:N:0:AGTTCC
          CCGCCTTCAACGCCAAGAGCGCGAATTATGCGTATAGATGCACTTCTAAGCATCATGAGTTCTCTATCAGAAAGTGTTTGCGCAGGAGCTGCAACTATACTGTCACCTGTATGAACACCAACAGGGTCAAGGTTTTCCATCGAACAAATCGTAATGCAGTTATCTGCG
          +
          AAAAAEEEEEEEEEEEEEEJJJJJJJHJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJHJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJEEEEEEEEEEEEAAAAA
          @NB502039:96:HGLYGBGX3:1:11101:16907:2810 1:N:0:AGTTCC
          GGCACCGAACGCCTTGGCAGCCAAAGCCATAGCCGGCACGAACTGACGGTCGCCGACCGTCTTGCCGCCGCCCGCTCCGGGACGCTGCACCGAGTGGGTACAGTCCATTATCACGCGTGGCGTTATCTGCTTCATATCGGGAATATTGCGGAAATCAACCACCAAGTTATTGTACCCGAAGCTGTTGCCTCGCTCTATCAACCACACGTTTTCGTTACCGCTCTCGCGCACTTTCTGCACGG
          +
          AAAAAEEEEEEEEEEEEEEEEEE/EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE/EEEEEEEEEEEEEEEEEEEEEEEEE/EEEEEEAEEEJJJJJJJJJJJJJJJJJIJJJJJJJJJJJJJJHJJJJJJJJJJJJJJJJJHJJJJJJJJEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAAAA
          @NB502039:96:HGLYGBGX3:1:11101:20216:2823 1:N:0:AGTTCC
          TAAAGGCAAATGGCTCTATCATGAAATCCTGGAGCCGGGCGTGTTGGTGCATGTTTCTGAGAGCGGTGCCAAAGTATGGACCGTTCGCTGTGGTTCCCCCCGTCTGGTAACGGTCAATTATGTTCGCG
          +
          DFDDJJH77HJHJHJHJHJHJJJHJHJHJJJJJJH77JHHJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJIJJJ
          Last edited by GenoMax; 03-21-2019, 11:51 AM.


          • GenoMax
            Senior Member
            • Feb 2008
            • 7142

            #6
            Can you validate your fastq files to make sure there are no errors in the file?

            Use validateFiles from Kent Utilities (UCSC). Linux version linked. After downloading, add execute permissions (chmod a+x validateFiles) before running.

            validateFiles -type=fastq file1.gz file2.gz etc
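If validateFiles is not available on a cluster, a rough structural check can be done with awk. This is a sketch, not a replacement for validateFiles: `check_fastq` is a hypothetical helper name, it assumes uncompressed four-line records, and it does not inspect the quality encoding.

```shell
# check_fastq FILE: exit 0 if every record has 4 lines, the header starts
# with '@', the separator starts with '+', and sequence and quality lengths match.
check_fastq() {
    awk '
        NR % 4 == 1 && substr($0, 1, 1) != "@" {
            printf "line %d: header does not start with @\n", NR; bad = 1; exit
        }
        NR % 4 == 2 { seqlen = length($0) }
        NR % 4 == 3 && substr($0, 1, 1) != "+" {
            printf "line %d: separator does not start with +\n", NR; bad = 1; exit
        }
        NR % 4 == 0 && length($0) != seqlen {
            printf "line %d: quality length differs from sequence length\n", NR; bad = 1; exit
        }
        END {
            # A line count that is not a multiple of 4 means a truncated record.
            if (!bad && NR % 4 != 0) {
                printf "file ends mid-record (%d lines)\n", NR; bad = 1
            }
            exit bad + 0
        }
    ' "$1"
}
```

The function stops at the first problem it finds and reports the line number, so it is only useful for locating the first bad record.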


            • PinkTips
              Member
              • Feb 2019
              • 10

              #7
              I used
              Code:
              validateFiles -type=fastq QualTrimmed_bran11.fastq
              and the output was
              Error count 0
              When I used
              Code:
              validateFiles -type=fastq QualTrimmed_bran11.fasta
              the output was
              Error [file=QualTrimmed_bran11.fastq, line=1]: sequence name first char invalid (got '@', wanted '>') [@NB502039:96:HGLYGBGX3:1:11101:19340:2795 1:N:0:AGTTCC]
              Aborting .. found 1 error
              So I figured it was working properly.

              (I am using v38.22)
              Last edited by PinkTips; 02-12-2019, 11:31 AM. Reason: forgot to add


              • GenoMax
                Senior Member
                • Feb 2008
                • 7142

                #8
                I am wondering if the error you are seeing is a red herring. How much memory are you allocating to this job on the cluster? (BBSplit can need a lot of memory, depending on the size of the reference genomes.)

                You should explicitly add a "-Xmx20g" flag (this is 20 GB, just an example) to your bbsplit command. Make sure you request the same amount of memory on the cluster side.
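One way to keep the two numbers in step is to derive the -Xmx flag from the amount requested from the scheduler. The sketch below is illustrative only: `xmx_flag` is a made-up helper, and leaving about 15% of the request free for JVM overhead is an assumption, not something stated in this thread.

```shell
# xmx_flag REQUESTED_GB: print a JVM heap flag sized to about 85% of the
# memory requested from the scheduler (integer arithmetic, rounded down).
xmx_flag() {
    xmx_requested_gb=$1
    xmx_heap_gb=$(( xmx_requested_gb * 85 / 100 ))
    # Never go below 1 GB, so the flag stays valid for tiny requests.
    if [ "$xmx_heap_gb" -lt 1 ]; then xmx_heap_gb=1; fi
    echo "-Xmx${xmx_heap_gb}g"
}
```

In a job script that requests 24 GB, the output of `xmx_flag 24` could be passed to bbsplit.sh in place of a hand-typed flag. Note that Java expects a unit suffix such as "g" or "m" on this flag.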

                On a side note:

                Code:
                validateFiles -type=fastq QualTrimmed_bran11.fasta
                generated an error since you need to change the type to fasta to match. So try

                Code:
                validateFiles -type=fasta QualTrimmed_bran11.fasta


                • PinkTips
                  Member
                  • Feb 2019
                  • 10

                  #9
                  I was only allotting 2GB from the cluster side, likely not enough for BBSplit to do its thing!
                  I've tried again with "-Xmx200gb" and will let you know how it goes!

                  Thanks - I never would have gotten that from java's error message!


                  • PinkTips
                    Member
                    • Feb 2019
                    • 10

                    #10
                    Hi, I'm back again -- with the same assertion error at the same step.

                    I allotted 100 GB (on both the BBSplit side and the cluster side) and still get the same assertion error as before. If it's helpful, this is the output from the cluster after my job has run:
                    java -Djava.library.path=/home/hd55218/BBSplit/bbmap/jni/ -ea -Xmx48g -cp /home/hd55218/BBSplit/bbmap/current/ align2.BBSplitter ow=t fastareadlen=500 minhits=1 minratio=0.56 maxindel=20 qtrim=rl untrim=t trimq=6 in=/home/hd55218/BBSplit/QualTrimmed_bran11.fastq ref=/home/hd55218/BBSplit/p.americana_genome.fasta,/home/hd55218/BBSplit/Blattabacterium_genome.fasta,/home/hd55218/BBSplit/Blattabacterium_plasmid.fasta basename=out_%.fasta outu=/home/hd55218/BBSplit/cleaned_bran11.fastq -Xmx48g
                    Executing align2.BBSplitter [ow=t, fastareadlen=500, minhits=1, minratio=0.56, maxindel=20, qtrim=rl, untrim=t, trimq=6, in=/home/hd55218/BBSplit/QualTrimmed_bran11.fastq, ref=/home/hd55218/BBSplit/p.americana_genome.fasta,/home/hd55218/BBSplit/Blattabacterium_genome.fasta,/home/hd55218/BBSplit/Blattabacterium_plasmid.fasta, basename=out_%.fasta, outu=/home/hd55218/BBSplit/cleaned_bran11.fastq, -Xmx48g]

                    Converted arguments to [ow=t, fastareadlen=500, minhits=1, minratio=0.56, maxindel=20, qtrim=rl, untrim=t, trimq=6, in=/home/hd55218/BBSplit/QualTrimmed_bran11.fastq, basename=out_%.fasta, outu=/home/hd55218/BBSplit/cleaned_bran11.fastq, ref_p.americana_genome=/home/hd55218/BBSplit/p.americana_genome.fasta, ref_Blattabacterium_genome=/home/hd55218/BBSplit/Blattabacterium_genome.fasta, ref_Blattabacterium_plasmid=/home/hd55218/BBSplit/Blattabacterium_plasmid.fasta]
                    Creating merged reference file ref/genome/1/merged_ref_3113916972846229527.fa.gz
                    Ref merge time: 140.410 seconds.
                    Exception in thread "main" java.lang.AssertionError: Invalid input file: '/home/hd55218/BBSplit/QualTrimmed_bran11.fastq'
                    at align2.AbstractMapper.preparse0(AbstractMapper.java:821)
                    at align2.AbstractMapper.<init>(AbstractMapper.java:53)
                    at align2.BBMap.<init>(BBMap.java:43)
                    at align2.BBMap.main(BBMap.java:31)
                    at align2.BBSplitter.main(BBSplitter.java:47)
                    Below is the bbsplit command I used:
                    /home/hd55218/BBSplit/bbmap/bbsplit.sh in=/home/hd55218/BBSplit/QualTrimmed_bran11.fastq ref=/home/hd55218/BBSplit/p.americana_genome.fasta,/home/hd55218/BBSplit/Blattabacterium_genome.fasta basename=out_%.fasta outu=/home/hd55218/BBSplit/cleaned_bran11.fastq -Xmx100g
                    My reference genomes are 3.43GB, 646KB, and 4KB.

                    Thanks for helping me work through this!


                    • GenoMax
                      Senior Member
                      • Feb 2008
                      • 7142

                      #11
                      Now we have a different error.

                      Invalid input file: '/home/hd55218/BBSplit/QualTrimmed_bran11.fastq'
                      Is that file actually in fastq format and is it in that location?

                      Post the output from
                      Code:
                      head -4 /home/hd55218/BBSplit/QualTrimmed_bran11.fastq
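Both questions (is the file there, and does it look like fastq) can be checked in one step. The helper below is a hypothetical sketch: `sniff_reads_file` is not part of BBTools, and it only looks at the first character of the first line. File names on Linux are case-sensitive, and the conversion command in post #1 writes "Qualtrimmed_bran11.fasta" while the bbsplit command reads "QualTrimmed_bran11.fasta", so it may be worth passing the exact path from the failing command.

```shell
# sniff_reads_file PATH: print one of
#   missing, unreadable, empty, fastq, fasta, unknown
sniff_reads_file() {
    sniff_path=$1
    if [ ! -e "$sniff_path" ]; then echo "missing"; return 0; fi
    if [ ! -r "$sniff_path" ]; then echo "unreadable"; return 0; fi
    if [ ! -s "$sniff_path" ]; then echo "empty"; return 0; fi
    # Read the first line; '|| true' tolerates a file without a trailing newline.
    IFS= read -r sniff_first < "$sniff_path" || true
    case "$sniff_first" in
        '@'*) echo "fastq" ;;
        '>'*) echo "fasta" ;;
        *)    echo "unknown" ;;
    esac
}
```

A result of "missing" for the path in the error message would point at the file name or location rather than the file contents.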


                      • PinkTips
                        Member
                        • Feb 2019
                        • 10

                        #12
                        The output from
                        "head -4 /home/hd55218/BBSplit/Qualtrimmed_bran11.fastq"
                        is:
                        Code:
                        @NB502039:96:HGLYGBGX3:1:11101:19340:2795 1:N:0:AGTTCC
                        GTCCTCTTCCGGGGTCTGGGTGCCAAGGCCCATCGCCTGCAGACCTTCGTTCAGCGGGGTGTACACGGGGCCTTCGAATGCGCCATCGATGACCACGGTCGTCTTGTCATACTCGTTGCCGAAGTTCGCCATTTCGATCTGCAGCGGCTCCAGATCCAGCGTGGTGTAGTCGATGTCCACACGGCTGGGGGGGGGCACGCCGCCGGTGACGAGCCTGTAGGTCTGGCACTCCCC
                        +
                        AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEAEEEEEEEEEEEEEEEE6EEEEEEEEEEEEEJJJJHJHJHJJJJJJJJJJJDJGFJJJJJJHJJJJJJJJHJ?JHJJJJJJJHJJ7JHJJHJJJJJHJEEEEEEEEEEE/EEAE/EEAA/E/E/EEEEEEEEE/AEEEE/EEEEEAEE/EEEEEAEEEE/EAEEAAEEE/AE/EEEAAAAAA
                        Last edited by GenoMax; 03-21-2019, 11:23 AM.


                        • GenoMax
                          Senior Member
                          • Feb 2008
                          • 7142

                          #13
                          I don't think you can split to a fasta format file directly. Can you try the following?

                          Code:
                          /home/hd55218/BBSplit/bbmap/bbsplit.sh -Xmx100g threads=2 in=/home/hd55218/BBSplit/QualTrimmed_bran11.fastq ref=/home/hd55218/BBSplit/p.americana_genome.fasta,/home/hd55218/BBSplit/Blattabacterium_genome.fasta basename=out_%.fastq outu=/home/hd55218/BBSplit/cleaned_bran11.fastq
                          Reads that do not match to the two genomes will end up in "cleaned_bran11.fastq" file. Just making sure that is what you want.
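To see which per-reference files a basename pattern should produce, the names can be worked out ahead of time. This sketch assumes that "%" is replaced by each reference file's name without its directory and last extension, which is what the "Converted arguments" line in post #10 suggests (ref_p.americana_genome=...). `expected_outputs` is a hypothetical helper, not a BBTools command.

```shell
# expected_outputs PATTERN REFS: print one output name per reference,
# where REFS is a comma-separated list of reference paths.
expected_outputs() {
    out_pattern=$1
    # Split the comma-separated list into one path per line.
    printf '%s\n' "$2" | tr ',' '\n' | while IFS= read -r out_ref; do
        out_name=$(basename "$out_ref")   # drop the directory
        out_name=${out_name%.*}           # drop the last extension only
        printf '%s\n' "$out_pattern" | sed "s|%|$out_name|g"
    done
}
```

With the pattern and references from the command above, this would list out_p.americana_genome.fastq and out_Blattabacterium_genome.fastq, which are the files to look for alongside cleaned_bran11.fastq.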


                          • PinkTips
                            Member
                            • Feb 2019
                            • 10

                            #14
                            Yes, I want reads that do not match the references to go into "cleaned".

                            I tried changing the basename parameter's extension to fastq, but the invalid input file error remains.


                            • GenoMax
                              Senior Member
                              • Feb 2008
                              • 7142

                              #15
                              Is the invalid assertion error about "fasta" files or "fastq" data? It is possible that something is wrong with the fasta files that you are using. You would want to check on those using the validateFiles tool you used before.

                              If the error is about the fastq data, then at this point I would suggest going back to the original data (before quality trimming or any other processing) and seeing if that works with bbsplit. All BBTools accept gzipped files, so there is no need to uncompress them.
                              Last edited by GenoMax; 03-28-2019, 03:51 AM.

