Announcement

Collapse

Welcome to the New Seqanswers!

Welcome to the new Seqanswers! We'd love your feedback, please post any you have to this topic: New Seqanswers Feedback.
See more
See less

BBMap (aligner for DNA/RNAseq) is now open-source and available for download.

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • filterbyname.sh is not pulling out reads

    Hello,
    Any suggestions as to why filterbyname is not pulling out reads that I know are there.

    Help??

    Jen

    Comment


    • Originally posted by JenBarb View Post
      Hello,
      Any suggestions as to why filterbyname is not pulling out reads that I know are there.

      Help??

      Jen
      Are you including the entire fastq/fasta header in your retrieval command?

      Comment


      • Originally posted by Coyk View Post
        Hello,
        I've been trying to get any basic command to work on my windows 7 machine. I've installed bbmap in the C:\Users\me directory. My .fastq.gz files are in the bbmap folder. When I type the command
        C:\Users\me>java -ea -Xmx1g -cp \bbmap\current\jgi.BBDukF in1=inputR1.fastq.gz in2=inputR2.fastq.gz out1=test1.fastq out2=test2.fastq ktrim=r mink=8 k=15

        I get "could not find or load main class in1=inputR1.fastq.gz"

        I've messed with the type of slash, putting the infiles in the current directory, and taking away parts of the directory names. I even truncated the argument after -cp to just jgi.BBDukF, which I know is wrong, and it still wants the main class of in1. I don't know a thing about java and when I read instructions about paths, classpaths, path environments, etc, I get confused quickly. When I type "echo %CLASSPATH% into the command prompt, I get CLASSPATH back so that's not set. I tried using Git Bash but when I typed "bbduk.sh" I got "command not found" so I'm stuck, but I would really like to use this tool on my windows machine. Anyone willing to give me very basic instructions on the exact commands that get bbduk to run on a windows machine?
        Try the following. Make sure you change /path/to/ to a real value on your computer.
        Code:
        java -ea -Xmx200m -cp /path/to/bbmap/current/ jgi.BBDukF in=reads.fq out=processed.fq

        Comment


        • bbmap SAM output question

          Hi,

          I don't quite understand the SAM output of bbmap.sh

          here is my line:

          bbmap.sh ref=ref.fasta ambig=all nodisk=t nzo=t mappedonly=t outu=unmapped.fasta scafstats=stats.map in=reads.fastq out=mapped.sam

          I usually just use the stats file to get read counts per feature using a ref with features as a multifasta, but I need to annotate from a .gff3 file using a draft genome ref and the only way I can see to do that is from the sam (to get the mapping coordinate. However, the sam seems to have low quality mappings too (MAPQ=3)..

          What is the threshold actually used for mapping quality? I can see that in the options, and how can I get the sam to only contain the mapped reads above that threshold?

          Thanks,

          S.

          Comment


          • Since you are using ambig=all you are going to get multi-mappers with their MAPQ being set to a small value (3). You could post filter your sam file or set ambig=toss (or ambig=random/best to get only one alignment).

            Comment


            • Originally posted by GenoMax View Post
              Since you are using ambig=all you are going to get multi-mappers with their MAPQ being set to a small value (3). You could post filter your sam file or set ambig=toss (or ambig=random/best to get only one alignment).
              ambig=best worked, Thanks.

              S.

              Comment


              • Ok I thought this was solved, but see below, still getting sam file alignments with MAPQ=3, even though they actually look like good alignments. Is '3' given because you can't calculate a perfect hit MAPQ? i.e. MAPQ=-10*log10(zero)

                @PG ID:BBMap PN:BBMap VN:37.93 CL:java -Djava.library.path=/bin/bbmap/jni/ -ea -Xmx43986m align2.BBMap build=1 overwrite=true fastareadlen=500 ref=genome/genome.fasta ambig=best vslow perfectmode maxsites=10000 nzo=t mappedonly=t outu=test/unmapped/testKT45.unmapped.fasta scafstats=test/mapped/testKT45.map in=testKT45.fastq out=test/mapped/testKT45.sam -idfilter=0.9
                700666F:325:CCC50ANXX:6:2211:7729:2233 1:N:0:GTCGAG 0 ctg3849_RHIMB__RM170330.3849 Backbone_3849|arrow|pilon 111580 3 31= * 0 0 GCATTGGTGGTTCAGTGGTAGAATTCTCGCC CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGG XT:A:R NM:i:0 AM:i:3
                700666F:325:CCC50ANXX:6:2211:7914:2047 1:N:0:GTAGAG 16 ctg18961_RHIMB__RM170330.18961 Backbone_18961|arrow|pilon 20097 3 31= * 0 0 TCACAATGATAGGAAGAGCCGACATCGAAGG GGGGGGGGGGGGGGGGGGGGGGGGGGCCCCC XT:A:R NM:i:0 AM:i:3
                700666F:325:CCC50ANXX:6:2211:7905:2066 1:N:0:GTCGAG 16 ctg18961_RHIMB__RM170330.18961 Backbone_18961|arrow|pilon 34735 3 31= * 0 0 GGTAGGCGCAGAAAGTACCATCGAAAGTTGA GGGGGFGGGGGGGGGGGGGGGGGGGGCCCCC XT:A:R NM:i:0 AM:i:3
                700666F:325:CCC50ANXX:6:2211:7970:2099 1:N:0:GTCGAG 0 ctg16585_RHIMB__RM170330.16585 Backbone_16585|arrow|pilon 33974 3 23= * 0 0 GGGAATACCAGGTGCTGTAGGCT CCCCCGEGGGGGGGGGGGGGGGG XT:A:R NM:i:0 AM:i:3
                700666F:325:CCC50ANXX:6:2211:7937:2145 1:N:0:GTCGAG 16 ctg13544_RHIMB__RM170330.13544 Backbone_13544|arrow|pilon 7627 3 39= * 0 0 GGTGAGAGCGCCGAATCCTAACCACTAGACCACCAGGGA GGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGCCCCC XT:A:R NM:i:0 AM:i:3

                Comment


                • Hi Brian - Is there a bbtool that will keep only mapped reads and if the mate doesn't map output the unmapped also from sam/bam files?

                  Thanks

                  Comment


                  • Originally posted by susanklein View Post
                    Ok I thought this was solved, but see below, still getting sam file alignments with MAPQ=3, even though they actually look like good alignments. Is '3' given because you can't calculate a perfect hit MAPQ? i.e. MAPQ=-10*log10(zero)

                    @PG ID:BBMap PN:BBMap VN:37.93 CL:java -Djava.library.path=/bin/bbmap/jni/ -ea -Xmx43986m align2.BBMap build=1 overwrite=true fastareadlen=500 ref=genome/genome.fasta ambig=best vslow perfectmode maxsites=10000 nzo=t mappedonly=t outu=test/unmapped/testKT45.unmapped.fasta scafstats=test/mapped/testKT45.map in=testKT45.fastq out=test/mapped/testKT45.sam -idfilter=0.9
                    700666F:325:CCC50ANXX:6:2211:7729:2233 1:N:0:GTCGAG 0 ctg3849_RHIMB__RM170330.3849 Backbone_3849|arrow|pilon 111580 3 31= * 0 0 GCATTGGTGGTTCAGTGGTAGAATTCTCGCC CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGG XT:A:R NM:i:0 AM:i:3
                    700666F:325:CCC50ANXX:6:2211:7914:2047 1:N:0:GTAGAG 16 ctg18961_RHIMB__RM170330.18961 Backbone_18961|arrow|pilon 20097 3 31= * 0 0 TCACAATGATAGGAAGAGCCGACATCGAAGG GGGGGGGGGGGGGGGGGGGGGGGGGGCCCCC XT:A:R NM:i:0 AM:i:3
                    700666F:325:CCC50ANXX:6:2211:7905:2066 1:N:0:GTCGAG 16 ctg18961_RHIMB__RM170330.18961 Backbone_18961|arrow|pilon 34735 3 31= * 0 0 GGTAGGCGCAGAAAGTACCATCGAAAGTTGA GGGGGFGGGGGGGGGGGGGGGGGGGGCCCCC XT:A:R NM:i:0 AM:i:3
                    700666F:325:CCC50ANXX:6:2211:7970:2099 1:N:0:GTCGAG 0 ctg16585_RHIMB__RM170330.16585 Backbone_16585|arrow|pilon 33974 3 23= * 0 0 GGGAATACCAGGTGCTGTAGGCT CCCCCGEGGGGGGGGGGGGGGGG XT:A:R NM:i:0 AM:i:3
                    700666F:325:CCC50ANXX:6:2211:7937:2145 1:N:0:GTCGAG 16 ctg13544_RHIMB__RM170330.13544 Backbone_13544|arrow|pilon 7627 3 39= * 0 0 GGTGAGAGCGCCGAATCCTAACCACTAGACCACCAGGGA GGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGCCCCC XT:A:R NM:i:0 AM:i:3
                    MAPQ is a measure of the probability (estimate) that the mapping location is correct. This can factor in various things, including the number of mismatches... but, for example, with these very short sequences, even though they are perfect matches, the specificity is low. MAPQ=3 specifically means that there is at most a 50% chance that the mapping is correct, and thus, there's at least one other perfect alignment for each of these.

                    Comment


                    • Originally posted by darthsequencer View Post
                      Hi Brian - Is there a bbtool that will keep only mapped reads and if the mate doesn't map output the unmapped also from sam/bam files?

                      Thanks
                      Not exactly. reformat.sh has a "mappedonly" flag, but that would only keep the mapped reads. However, you can use requiredbits and filterbits in 2 passes, like this:

                      Code:
                      reformat.sh in=data.sam out=mapped.sam mappedonly
                      reformat.sh in=data.sam out=unmapped_mates_of_mapped_reads.sam requiredbits=4 filterbits=8
                      Then merge the two output files. Alternatively:
                      Code:
                      reformat.sh in=data.sam out=mapped.sam mappedonly
                      filterbyname.sh in=data.sam names=mapped.sam include=t out=mapped_or_mate_mapped.sam
                      The second solution takes more memory but it keeps the reads in the input order and is generally more convenient.

                      Comment


                      • optical duplicates

                        Hi Brian! Hope I'm posting this in the right place.

                        I've been using clumpify for a while. Great tool! I was totally blown away by how fast it is...

                        Simple question. I noticed that you make the following recommendations for detection of optical duplicates:

                        dupedist=40 (dist) Max distance to consider for optical duplicates.
                        Higher removes more duplicates but is more likely to
                        remove PCR rather than optical duplicates.
                        This is platform-specific; recommendations:
                        NextSeq 40 (and spany=t)
                        HiSeq 1T 40
                        HiSeq 2500 40
                        HiSeq 3k/4k 2500
                        Novaseq 12000

                        I was surprised to see that NovaSeq has such a big distance. Is this something you have measured? I have only been able to find the Picard recommendation which groups all the patterned flowcells with the same distance (2500).

                        Thanks!

                        Comment


                        • The coordinate system varies across platforms; the physical distance of 1 pixel is much larger on HiSeq 2500 than HiSeq 3000/4000m for example.

                          NovaSeq has unique problems, though. The distance of 12000 is not really to catch optical duplicates, but rather, clonal colonies that presumably form when a molecule breaks off from a colony and forms a new colony nearby (typically downstream). These have a high degree of positional correlation but are not actually adjacent; they can be fairly distant. They are not even necessarily on the same tile, so Clumpify won't catch all of them when it is restricted to looking for duplicates only within a tile.

                          And yes, the 12000 comes from empirical testing. That catches most of the duplicates formed by this mechanism. I think I posted the approximate amount somewhere... maybe it was around 85%? You can't get all of them unless you completely disregard position and remove all duplicates.

                          Comment


                          • We had a discussion about what the x:y coordinates mean on Biostars.

                            Short take home from poster of that thread was:

                            Contacted illumina support and they said the information I needed, such as dimension of the tile viewing area or zoom of the microscope, was "considered proprietary," which is frustrating.

                            Comment


                            • Originally posted by Brian Bushnell View Post
                              MAPQ is a measure of the probability (estimate) that the mapping location is correct. This can factor in various things, including the number of mismatches... but, for example, with these very short sequences, even though they are perfect matches, the specificity is low. MAPQ=3 specifically means that there is at most a 50% chance that the mapping is correct, and thus, there's at least one other perfect alignment for each of these.
                              ah I see, thanks. So they are good alignments, but to multiple sites. I'll have to rethink my strategy because I really should be keeping all alignments.

                              Thanks,

                              S.

                              Comment


                              • Originally posted by susanklein View Post
                                ah I see, thanks. So they are good alignments, but to multiple sites. I'll have to rethink my strategy because I really should be keeping all alignments.

                                Thanks,

                                S.
                                Consider http://seqanswers.com/forums/showthread.php?p=214346 (posts 204-206). BBMap does not seem to report ALL alignments in output file. @Brian may seen this and comment.

                                Comment

                                Working...
                                X