Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Brian Bushnell
    replied
    Hi Dario and Genomax,

    I though I had gotten rid of that a while ago, but I guess not - I'll investigate and fix it. It's harmless and due to a race condition for a thread finishing after it was prematurely shut down because enough reads were generated, but it looks scary.

    Leave a comment:


  • GenoMax
    replied
    I would like to add that reformat.sh also exhibits this "issue".

    Code:
    $ reformat.sh in=P1.fastq.gz out=stdout.fa reads=5
    
    Input:                          5 reads                 1094 bases
    Output:                         5 reads (100.00%)       1094 bases (100.00%)
    
    Time:                           0.178 seconds.
    Reads Processed:           5    0.03k reads/sec
    Bases Processed:        1094    0.01m bases/sec
    Exception in thread "main" java.lang.RuntimeException: ReformatReads terminated in an error state; the output may be corrupt.
            at jgi.ReformatReads.process(ReformatReads.java:1148)
            at jgi.ReformatReads.main(ReformatReads.java:45)

    Leave a comment:


  • dariober
    replied
    bbduk.sh: "reads" parameter does not exit clean

    Hi- Not sure this thread is the best place to post this...

    It seems to me that the `reads` parameter makes bbduk exits with an error when the input is not fully processed.

    These runs are both ok (input file has less then 1M reads):

    Code:
    bbduk.sh in=test.fq.gz out=test.out.fq reads=-1 overwrite=true
    bbduk.sh in=test.fq.gz out=test.out.fq reads=1000000 overwrite=true
    However reads=10 gives:

    Code:
    bbduk.sh in=test.fq.gz out=test.out.fq reads=10 overwrite=true
    
    java -Djava.library.path=/home/db291g/applications/BBmap/bbmap/jni/ -ea -Xmx15023m -Xms15023m -cp /home/db291g/applications/BBmap/bbmap/current/ jgi.BBDukF in=test.fq.gz out=test.out.fq reads=10 overwrite=true
    Executing jgi.BBDukF [in=test.fq.gz, out=test.out.fq, reads=10, overwrite=true]
    
    BBDuk version 37.54
    NOTE: No reference files specified, no trimming mode, no min avg quality, no histograms - read sequences will not be changed.
    Initial:
    Memory: max=15097m, free=14703m, used=394m
    
    Input is being processed as paired
    Started output streams:	0.082 seconds.
    Processing time:   		0.017 seconds.
    
    Input:                  	20 reads 		3011 bases.
    Total Removed:          	0 reads (0.00%) 	0 bases (0.00%)
    Result:                 	20 reads (100.00%) 	3011 bases (100.00%)
    
    Time:   			0.114 seconds.
    Reads Processed:          20 	0.18k reads/sec
    Bases Processed:        3011 	0.03m bases/sec
    Exception in thread "main" java.lang.RuntimeException: jgi.BBDukF terminated in an error state; the output may be corrupt.
    	at jgi.BBDukF.process(BBDukF.java:913)
    	at jgi.BBDukF.main(BBDukF.java:72)
    
    echo $?
    1
    I guess the output is still ok but the non-zero exit code causes problems inside a pipeline.

    Am I missing something here or should this be fixed?

    Thanks

    Dario

    Leave a comment:


  • boulund
    replied
    Originally posted by Brian Bushnell View Post
    Yes, that's correct. This is especially important for BBMerge, where insert size can only be computed from reads that overlap; so if the average calculated insert size is say 270bp for 2x150bp reads, but "PercentOfPairs" is only 20%, then presumably the actual insert size is substantially higher but those pairs did not overlap so it could not be computed. It's less important for BBMap but still useful in some cases - if PercentOfPairs is very low, something probably went wrong and the insert size is not trustworthy. For example, if you have an extremely fragmented reference with contig length shorter than insert size, such that most paired reads map to different contigs, then the pairing rate will be low and the calculated insert size will be untrustworthy (shorter than the actual insert size).
    Hi, thanks for that excellent response!
    I'm one of the authors of a BBMap module for the MultiQC bioinformatics log aggregation/summarization tool, and we'd love your input/help with small snippets of explanatory text for some of the outputs produced by the different BBMap tools. Would you be willing to help out? The first reasonably finished version of the BBMap module for MultiQC was finished and merged yesterday, but it unfortunately still lacks explanatory help text for most plot types. Information at the level of your reply regarding PercentOfPairs is perfect for the help text in the MultiQC reports, to aid people unfamiliar with BBMap output to interpret the plotted results included in a MultiQC report. Perhaps you even have suggestions on how to better visualize aggregations of some of the different outputs. Let me know what you think, and we'll set up an Issue in the MultiQC Github issue tracker to organize such work if you're interested.

    Leave a comment:


  • Brian Bushnell
    replied
    Originally posted by boulund View Post
    Hi,

    I have a question about the PercentOfPairs output found in the insert size histogram output (ihist). What does it mean, percent of pairs? Is it the proportion of read pairs where an insert size could be computed?
    Yes, that's correct. This is especially important for BBMerge, where insert size can only be computed from reads that overlap; so if the average calculated insert size is say 270bp for 2x150bp reads, but "PercentOfPairs" is only 20%, then presumably the actual insert size is substantially higher but those pairs did not overlap so it could not be computed. It's less important for BBMap but still useful in some cases - if PercentOfPairs is very low, something probably went wrong and the insert size is not trustworthy. For example, if you have an extremely fragmented reference with contig length shorter than insert size, such that most paired reads map to different contigs, then the pairing rate will be low and the calculated insert size will be untrustworthy (shorter than the actual insert size).

    Leave a comment:


  • boulund
    replied
    Hi,

    I have a question about the PercentOfPairs output found in the insert size histogram output (ihist). What does it mean, percent of pairs? Is it the proportion of read pairs where an insert size could be computed?

    Leave a comment:


  • luc
    replied
    Hi Brian,

    thanks for the details -- there is always a another twist to it.

    Originally posted by Brian Bushnell View Post
    Hi Luc,

    To make this very clear, it's the estimate of the size of sequenced genomic molecule. So, for example, if you have 2 reads like this, where R1 maps to the plus strand and R2 maps to the minus strand:

    Code:
       111111111111111
    -------------------------------------------------------------------
                                              22222222222222222
    ...then it would generally be the distance from the leftmost 1 to the rightmost 2. But if either read contains insertions, those are added to the insert size, and deletions are subtracted from the insert size. Otherwise, the insert sizes would be dramatically overestimated in spliced RNA-seq data.

    Leave a comment:


  • Brian Bushnell
    replied
    That's correct for single-ended reads, but I do not recommend setting "secondarysitescoreratio = 1.0" for paired reads because one pairing is currently slightly favored over others due to the difficulty of tracking all possible pairings. So, just set "secondary=t ambig=all maxsites=100". In other words, it's not currently possible to ensure you get exactly and only all reads that have the exact same (best) alignment score unless they are processed unpaired.

    Leave a comment:


  • darthsequencer
    replied
    Originally posted by Brian Bushnell View Post
    "ambiguous=best" is a bit misleading, but it means the genomically first location with a maxmimum score will be used. "ambiguous=all" will report all locations within the ambiguity threshold of the first. This does not mean they need exactly the same score; it means that they are very close, so much so that none can be confidently determined to be the correct mapping location. Normally they're identical, but if for example one mapping had a single 1bp deletion and another mapping had two 1bp substitutions, the scores would be different, but would be close enough to be both reported. But if there was a third potential mapping with, say, 5 substitutions, that would be excluded. This can be controlled with the "secondarysitescoreratio" flag; if you set it to 1.0, only mappings with identical scores to the best score will be reported.
    Hi - So if I set secondarysitescoreratio to 1.0 do I need to set secondary = T if ambiguous=all and increase the maxsites setting?

    I guess what I'm really asking is what do I need to set to ensure that I'm getting all mapping locations of read with all the same top score.

    Leave a comment:


  • Brian Bushnell
    replied
    Hi Luc,

    To make this very clear, it's the estimate of the size of sequenced genomic molecule. So, for example, if you have 2 reads like this, where R1 maps to the plus strand and R2 maps to the minus strand:

    Code:
       111111111111111
    -------------------------------------------------------------------
                                              22222222222222222
    ...then it would generally be the distance from the leftmost 1 to the rightmost 2. But if either read contains insertions, those are added to the insert size, and deletions are subtracted from the insert size. Otherwise, the insert sizes would be dramatically overestimated in spliced RNA-seq data.

    Leave a comment:


  • luc
    replied
    I am sorry I can't find the info - I am sure it is somewhere.

    How does BBmap define insert size ( e.g of the insert size histograms)?
    Distance between start point of the reads (forward and reverse) or between the ends of the reads?

    #############
    I just checked with a test file; as expected the insert size seems to be the distance between the start point of the reads.
    Last edited by luc; 09-07-2017, 08:52 AM.

    Leave a comment:


  • boulund
    replied
    Oh, that was easy. Thanks a lot for your quick response, both of you. Sorry for not spotting that option!

    Originally posted by GenoMax View Post
    @boulund: Can you include the exact starswrapper.sh command you are running?

    The command I used was simple:
    Code:
    statswrapper.sh in=assembly_1.fasta in=assembly_2.fasta
    With the "format=3" option it produces exactly the output I was expecting.

    Leave a comment:


  • Brian Bushnell
    replied
    Hi Boulund,

    Please add the flag "format=3" to get one line per file.

    Leave a comment:


  • GenoMax
    replied
    @boulund: Can you include the exact starswrapper.sh command you are running?

    Leave a comment:


  • boulund
    replied
    Hi,

    I have a question about statswrapper.sh. It says in the help text description that it supposedly runs stats.sh on multiple assemblies to produce one output line per file, but I only get a concatenation of regular stats.sh outputs, making it far less useful to me than I was hoping for. Is this the intended behavior or am I misunderstanding the help text formulation somehow?

    I was hoping to get some kind of overviewable summary table of all files in a single file, with one output line per file.

    Leave a comment:

Latest Articles

Collapse

  • seqadmin
    Essential Discoveries and Tools in Epitranscriptomics
    by seqadmin




    The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
    04-22-2024, 07:01 AM
  • seqadmin
    Current Approaches to Protein Sequencing
    by seqadmin


    Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
    04-04-2024, 04:25 PM

ad_right_rmr

Collapse

News

Collapse

Topics Statistics Last Post
Started by seqadmin, Today, 11:49 AM
0 responses
4 views
0 likes
Last Post seqadmin  
Started by seqadmin, Yesterday, 08:47 AM
0 responses
16 views
0 likes
Last Post seqadmin  
Started by seqadmin, 04-11-2024, 12:08 PM
0 responses
60 views
0 likes
Last Post seqadmin  
Started by seqadmin, 04-10-2024, 10:19 PM
0 responses
60 views
0 likes
Last Post seqadmin  
Working...
X