Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • GenoMax
    Senior Member
    • Feb 2008
    • 7142

    Have you tried adding "VALIDATION_STRINGENCY=LENIENT" to your picard command? Default value is STRICT and sometime it leads to picard throwing errors like this.

    It does looks like reformat.sh took out some of the reads so the problem reads have been hopefully addressed.

    Comment

    • bwubb
      Member
      • Jan 2012
      • 61

      Originally posted by GenoMax View Post
      Have you tried adding "VALIDATION_STRINGENCY=LENIENT" to your picard command? Default value is STRICT and sometime it leads to picard throwing errors like this.
      ...
      Yes indeed, but it will still be an ERROR.

      However, after more tinkering I was able to come up with a combination that worked.

      Code:
      reformat.sh -Xmx5g in=qSortA.bam out=qSortA_reads.fq
      reformat.sh -Xmx5g int=t addcolon=t uniquenames=t in=qSortA_reads.fq out=out_reads.fq
      repair.sh -Xmx5g in=out_reads.fq out1=reads1.fq out2=reads2.fq
      Taking those reads into alignment and so forth will pass ValidateSamFile. Im guessing these actions are allowing me to completely rewrite the pairing status, which I am perfectly ok with.

      Thank you for the comments and help!

      Comment

      • noobie
        Junior Member
        • Jun 2012
        • 7

        I want to subsample 10 million reads from both an interleaved reads file and a singletons reads file at the same time. Can reformat.sh do this?

        Comment

        • GenoMax
          Senior Member
          • Feb 2008
          • 7142

          @Noobie: I don't think so. You will have to do them independently. What is the use case here, if I may ask?

          Comment

          • bartn
            Junior Member
            • Jul 2019
            • 6

            Hi!

            I want to create a (small) bacterial RNAseq testset from transcripts using randomreads.sh:

            Code:
            randomreads.sh reads=30000 ref=transcripts.fasta metagenome=t out1=test_RNA_1.fq out2=test_RNA_2.fq paired=t
            Mapping it back to the transcripts:

            Code:
            bbmap.sh ref=transcripts.fasta  in1=test_RNA_1.fq in2=test_RNA_2.fq covstats=RNA.cov
            Results
            Code:
            Reads:                               	60000
            Mapped reads:                        	54225
            Mapped bases:                        	8063372
            Ref scaffolds:                       	562
            Ref bases:                           	521537
            
            Percent mapped:                      	90.375
            Percent proper pairs:                	64.822
            Average coverage:                    	15.461
            Standard deviation:                    	51.768
            Percent scaffolds with any coverage: 	59.96
            Percent of reference bases covered:  	45.87

            Mapping it to the genome:

            Code:
            Reads:                               	60000
            Mapped reads:                        	54104
            Mapped bases:                        	8061481
            Ref scaffolds:                       	1
            Ref bases:                           	592529
            
            Percent mapped:                      	90.173
            Percent proper pairs:                	64.573
            Average coverage:                    	13.605
            Standard deviation:                    	48.628
            Percent scaffolds with any coverage: 	100.00
            Percent of reference bases covered:  	41.02

            Why are the mapping percentages so low, what I missing with either generating the reads or mapping them back?

            (bowtie2 gives an even worse result..
            15482 (51.61%) aligned concordantly 0 times
            14351 (47.84%) aligned concordantly exactly 1 time
            167 (0.56%) aligned concordantly >1 times
            )

            Comment

            • bartn
              Junior Member
              • Jul 2019
              • 6

              randomreads.sh

              Hi!

              I want to created a (small) RNAseq test set from a small bacterial plasmid. So I used :
              Code:
              randomreads.sh reads=30000 paired=t metagenome=t  ref=transcripts.fasta out1=test_RNA_1.fq out2=test_RNA_2.fq
              (Then I get 30k reads for each pair. No problem, I guess that's because the default is single end)

              Then I mapped back the reads with BBmap
              Code:
              bbmap.sh ref=Rhizobium_Leguminosarum_plasmid_NZ_CP025014.1_transcripts.fasta  in1=Unlock_test_RNA_1.fq in2=Unlock_test_RNA_2.fq covstats=RNA.cov
              Results:
              Code:
              Reads:                               	60000
              Mapped reads:                        	54225
              Mapped bases:                        	8063372
              Ref scaffolds:                       	562
              Ref bases:                           	521537
              
              Percent mapped:                      	90.375
              Percent proper pairs:                	64.822
              Average coverage:                    	15.461
              Standard deviation:                    	51.768
              Percent scaffolds with any coverage: 	59.96
              Percent of reference bases covered:  	45.87
              The mapping percentages seem very low to me. Mapping to the genome gives similar results.
              I am using randomreads.sh in the wrong way for this?

              Comment

              • GenoMax
                Senior Member
                • Feb 2008
                • 7142

                Since you asked for 30K reads you got 30K paired-end reads.

                I would sugegst trying without the metagenome=t option for a single small input fasta. If you do have the entire genome then use that in your randomread generation step along with the plasmid.

                Comment

                • bartn
                  Junior Member
                  • Jul 2019
                  • 6

                  Originally posted by GenoMax View Post
                  Since you asked for 30K reads you got 30K paired-end reads.

                  I would sugegst trying without the metagenome=t option for a single small input fasta. If you do have the entire genome then use that in your randomread generation step along with the plasmid.
                  Yes, that does seem to increase the mapping percentages. Any idea why it is not working with meta option? Changes read length and insert size does not seem to do much..
                  The plasmid is about 0.5MB with 562 transcripts. Median transcript size is 830bp, so that could be a bit short, (but then I would expect the same with the meta option off)

                  I will also look for another suitable (small) genome. I don't want to add the complete genome because it should finish quick, since it's for testing purposes of tools and pipelines.

                  (btw, one other small thing. Adding any statistic file in the arguments will give the small summary in the stdout (like I posted before). Otherwise it will not be printed. Not sure if that is expected behavior)

                  Comment

                  • GenoMax
                    Senior Member
                    • Feb 2008
                    • 7142

                    Metagenome option is supposed to take in multiple (genome) sequences to generate an artificial metagenomic dataset.

                    BBMap writes stats to STDERR by default. So you can capture that.

                    Comment

                    • bartn
                      Junior Member
                      • Jul 2019
                      • 6

                      From the description of the option I understood it can be used for RNA as well:

                      metagenome=f Assign scaffolds a random exponential coverage level, to simulate a metagenomic or RNA coverage distribution

                      Which makes sense in my view since I just give multiple transcripts instead of genomes (albeit short compared to a genome)

                      So why are so many reads not mapping back even thought almost all of it with an even distributed coverage does map.


                      What I meant with the summary part, if I run just bbmap in its simplest form:
                      Code:
                      bbmap.sh ref=transcripts.fasta  in1=read_1.fq  in2=read_2.fq
                      the stderr does not contain this part:
                      Code:
                      Reads:                               	120000
                      Mapped reads:                        	111248
                      Mapped bases:                        	8357095
                      Ref scaffolds:                       	1
                      Ref bases:                           	592529
                      
                      Percent mapped:                      	92.707
                      Percent proper pairs:                	56.370
                      Average coverage:                    	14.104
                      Standard deviation:                    	46.411
                      Percent scaffolds with any coverage: 	100.00
                      Percent of reference bases covered:  	41.67
                      But it does with adding a covstats=cov.stats for example.

                      Comment

                      • GenoMax
                        Senior Member
                        • Feb 2008
                        • 7142

                        Unfortunately Brian would be the only person who can provide an authoritative answer and he no longer participates on this forum. You may want to try emailing him with your question directly.

                        BBMap provides a wide range of coverage/histogram/stats outputs. Default output (while voluminous does not contain coverage stats) but mainly contains % alignments which most users want to see by default. In-line help (run bbmap.sh without any options/inputs) details different options available for coverage/histogram/stats outputs.

                        Comment

                        • bartn
                          Junior Member
                          • Jul 2019
                          • 6

                          That's indeed unfortunate. I might try sending an email. A previous bug report was solved very quickly!

                          One of the reasons I like about BBMap is indeed the wide range of statistics it can generate. Just noticed that final summary percentages like I showed, are to me the most useful thing to look at on first check was missing by default.

                          And thanks a lot for your answers and your time GenoMax!

                          Comment

                          • Illusive Man
                            Member
                            • Sep 2013
                            • 15

                            Dear BBMap users or Brian,

                            I have a database of about 800 proteins and all I would like to do is create an abundance table of "hits" to these 800 proteins for each one of my 30 samples. Samples as columns, genes as rows, cells as counts. Can BBMap help me create this abundance table?

                            Thanks

                            Comment

                            • GenoMax
                              Senior Member
                              • Feb 2008
                              • 7142

                              BBMap is not designed to work with protein data. You will have to find a different tool.

                              Take a look at blat from UCSC's Jim Kent. That should produce the stats in parse-able tab-delimited columner format. You will need to do post-processing to get the table you are looking for.

                              Comment

                              • bartn
                                Junior Member
                                • Jul 2019
                                • 6

                                another program that is often used for these cases is DIAMOND



                                But like GenoMax said you have to do some post processing of the output to get to the format you want..

                                Comment

                                Latest Articles

                                Collapse

                                • SEQadmin2
                                  From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
                                  by SEQadmin2


                                  Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


                                  The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
                                  ...
                                  06-02-2026, 10:05 AM
                                • SEQadmin2
                                  Single-Cell Sequencing at an Inflection Point: Early Impacts of New Platforms and Emerging Trends
                                  by SEQadmin2


                                  With the launch of new single-cell sequencing platforms in 2026, the field stands at an exciting inflection point. This article surveys the most impactful advances in the field and discusses how they’re reshaping research in cancer, immunology, and beyond.


                                  Introduction

                                  Single-cell sequencing technologies have undergone remarkable advances over the past decade, transitioning from low-throughput experimental approaches to highly scalable platforms capable of...
                                  05-22-2026, 06:42 AM
                                • SEQadmin2
                                  Environmental Genomics in the Age of NGS: From Microbes to Conservation Strategies
                                  by SEQadmin2

                                  Studying ecosystems means dealing with complex, multi-species communities that are hard to observe at scale. This complexity, however, hides many important questions to be answered, from how biogeochemical cycles work and how climate change can affect species distribution to how conservation strategies can work best.


                                  Genomics, particularly since the expansion of NGS, has transformed ecosystem ecology. By sequencing environmental DNA, we can now assess biodiversity without direct...
                                  05-06-2026, 09:04 AM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by SEQadmin2, Today, 08:59 AM
                                0 responses
                                7 views
                                0 reactions
                                Last Post SEQadmin2  
                                Started by SEQadmin2, 06-02-2026, 12:03 PM
                                0 responses
                                21 views
                                0 reactions
                                Last Post SEQadmin2  
                                Started by SEQadmin2, 06-02-2026, 11:40 AM
                                0 responses
                                14 views
                                0 reactions
                                Last Post SEQadmin2  
                                Started by SEQadmin2, 05-28-2026, 11:40 AM
                                0 responses
                                29 views
                                0 reactions
                                Last Post SEQadmin2  
                                Working...