Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Hi Sebastian,

    You can reduce the memory needs of BBNorm by adding the flags "prefilter" and "bits=16", which might allow it to run on a lower-memory computer, depending on the complexity of the input data.

    But as for reducing the CPU utilization... that's tricky. You can add "pigz=f unpigz=f" to disable pigz, so you will only be left with java threads. Java thread scheduling is nondeterministic due to garbage collection and various other background stuff, and I'm not sure if it's possible to ensure the process will be capped at a certain number of total threads without binding the process to specific cores. But you can try manually specifying the number of GC threads:

    -XX:ParallelGCThreads=2

    This command may or may not work, depending on your java version. You'd have to skip (or modify) the shellscript so that the actual java command would look something like this:

    java -ea -Xmx200g -Xms200g -XX:ParallelGCThreads=2 -cp /path/to/bbmap/current/ jgi.KmerNormalize in1=st1c_R1.fastq.gz in2=st1c_R2.fastq.gz out1=st1c_R1_norm.fastq.gz out2=st1c_R2_norm.fastq.gz threads=12 target=60 mindepth=2

    Comment


    • bbmapskimmer does not find all results

      Hello Brain,

      I'm trying to make use of bbmapskimmer.
      Here is a simple fasta example (read.fasta):

      Code:
      >1
      GAACATGATCCCCTTGTACTA[B]C[/B]TACATTATCACTAGCTTTATGTTTTCTA
      >2
      TAGAAAACATAAAGCTAGTGATAATGTAGTAGTACAAGGGGATCATGTTC
      >3
      GAACATGATCCCCTTGTACTA[B]T[/B]TACATTATCACTAGCTTTATGTTTTCTA
      Read 1 and read 3 are only different in single "T/C" letter, read 2 is reverse-complement of 1.

      I'm aligning to mm10 genome:

      ~/Distr/bbmap/bbmapskimmer.sh in=~/tmp/HiC_repeats/input/temp/read.fasta out=test_bbmap1.sam ref=~/tmp/HiC_repeats/fasta/mm10.fa ambiguous=all vslow k=8 minid=0.6 maxsites2=8000 maxsites=8000 local

      Here is test_bbmap1.sam:
      Code:
      @HD	VN:1.4	SO:unsorted
      @SQ	SN:chr10	LN:130694993
      ....
      @SQ	SN:chrM	LN:16299
      @SQ	SN:chrX	LN:171031299
      @SQ	SN:chrY	LN:91744698
      @PG	ID:BBMap	PN:BBMap	VN:37.02	CL:java -Djava.library.path=/mnt/storage/home/vsfishman/Distr/bbmap/jni/ -ea -Xmx98546m align2.BBMapPacBioSkimmer build=1 overwrite=true minratio=0.40 fastareadlen=6000 ambig=all minscaf=100 startpad=10000 stoppad=10000 midpad=6000 in=/mnt/storage/home/vsfishman/tmp/HiC_repeats/input/temp/read.fasta out=test_bbmap1.sam ref=/mnt/storage/home/vsfishman/tmp/HiC_repeats/fasta/mm10.fa ambiguous=all vslow k=8 minid=0.6 maxsites2=8000 maxsites=8000 local
      1	0	chr16	39205299	43	50=	*	0	0	GAACATGATCCCCTTGTACTACTACATTATCACTAGCTTTATGTTTTCTA	*	NM:i:0	AM:i:43	NH:i:2
      1	272	chr10	38707112	39	28=1X21=	*	0	0	*	*	NM:i:1	AM:i:39	NH:i:2
      2	16	chr16	39205299	43	50=	*	0	0	GAACATGATCCCCTTGTACTACTACATTATCACTAGCTTTATGTTTTCTA	*	NM:i:0	AM:i:43	NH:i:2
      2	256	chr10	38707112	39	28=1X21=	*	0	0	*	*	NM:i:1	AM:i:39	NH:i:2
      3	16	chr10	38707112	3	50=	*	0	0	TAGAAAACATAAAGCTAGTGATAATGTAATAGTACAAGGGGATCATGTTC	*	XT:A:R	NM:i:0	AM:i:3	NH:i:80
      3	272	chr10	72154315	3	50=	*	0	0	*	*	NM:i:0	AM:i:3	NH:i:80
      3	256	chr10	77806115	3	50=	*	0	0	*	*	NM:i:0	AM:i:3	NH:i:80
      3	272	chr10	127797927	3	50=	*	0	0	*	*	NM:i:0	AM:i:3	NH:i:80
      3	256	chr13	4007635	3	50=	*	0	0	*	*	NM:i:0	AM:i:3	NH:i:80
      3	256	chr14	17647057	3	50=	*	0	0	*	*	NM:i:0	AM:i:3	NH:i:80
      3	256	chr14	40019581	3	50=	*	0	0	*	*	NM:i:0	AM:i:3	NH:i:80
      3	256	chr15	50069753	3	50=	*	0	0	*	*	NM:i:0	AM:i:3	NH:i:80
      3	272	chr15	72113945	3	50=	*	0	0	*	*	NM:i:0	AM:i:3	NH:i:80
      3	256	chr15	75259455	3	50=	*	0	0	*	*	NM:i:0	AM:i:3	NH:i:80
      3	256	chr17	69579150	3	50=	*	0	0	*	*	NM:i:0	AM:i:3	NH:i:80
      3	256	chr19	60830365	3	50=	*	0	0	*	*	NM:i:0	AM:i:3	NH:i:80
      3	256	chr1	95272134	3	50=	*	0	0	*	*	NM:i:0	AM:i:3	NH:i:80
      3	256	chr1	102465771	3	50=	*	0	0	*	*	NM:i:0	AM:i:3	NH:i:80
      ....
      (I reduced some header lines and some alinments of read 3)

      As you can see, there are many perfect alignmens for read3, but only 2 alignments of reads 1 and 2. As I understand, with minid=0.6 all alignments of read3 should be also reported for reads 1 and 2?

      Could you, please, help to clarify this?

      Best,
      Minja

      Comment


      • bbmapskimmer does not find all results

        Hello Brain,

        I'm trying to make use of bbmapskimmer.
        Here is a simple fasta example (read.fasta):

        Code:
        >1
        GAACATGATCCCCTTGTACTA[B]C[/B]TACATTATCACTAGCTTTATGTTTTCTA
        >2
        TAGAAAACATAAAGCTAGTGATAATGTAGTAGTACAAGGGGATCATGTTC
        >3
        GAACATGATCCCCTTGTACTA[B]T[/B]TACATTATCACTAGCTTTATGTTTTCTA
        Read 1 and read 3 are only different in single "T/C" letter, read 2 is reverse-complement of 1.

        I'm aligning to mm10 genome:

        ~/Distr/bbmap/bbmapskimmer.sh in=~/tmp/HiC_repeats/input/temp/read.fasta out=test_bbmap1.sam ref=~/tmp/HiC_repeats/fasta/mm10.fa ambiguous=all vslow k=8 minid=0.6 maxsites2=8000 maxsites=8000 local

        Here is test_bbmap1.sam:
        Code:
        @HD	VN:1.4	SO:unsorted
        @SQ	SN:chr10	LN:130694993
        ....
        @SQ	SN:chrM	LN:16299
        @SQ	SN:chrX	LN:171031299
        @SQ	SN:chrY	LN:91744698
        @PG	ID:BBMap	PN:BBMap	VN:37.02	CL:java -Djava.library.path=/mnt/storage/home/vsfishman/Distr/bbmap/jni/ -ea -Xmx98546m align2.BBMapPacBioSkimmer build=1 overwrite=true minratio=0.40 fastareadlen=6000 ambig=all minscaf=100 startpad=10000 stoppad=10000 midpad=6000 in=/mnt/storage/home/vsfishman/tmp/HiC_repeats/input/temp/read.fasta out=test_bbmap1.sam ref=/mnt/storage/home/vsfishman/tmp/HiC_repeats/fasta/mm10.fa ambiguous=all vslow k=8 minid=0.6 maxsites2=8000 maxsites=8000 local
        1	0	chr16	39205299	43	50=	*	0	0	GAACATGATCCCCTTGTACTACTACATTATCACTAGCTTTATGTTTTCTA	*	NM:i:0	AM:i:43	NH:i:2
        1	272	chr10	38707112	39	28=1X21=	*	0	0	*	*	NM:i:1	AM:i:39	NH:i:2
        2	16	chr16	39205299	43	50=	*	0	0	GAACATGATCCCCTTGTACTACTACATTATCACTAGCTTTATGTTTTCTA	*	NM:i:0	AM:i:43	NH:i:2
        2	256	chr10	38707112	39	28=1X21=	*	0	0	*	*	NM:i:1	AM:i:39	NH:i:2
        3	16	chr10	38707112	3	50=	*	0	0	TAGAAAACATAAAGCTAGTGATAATGTAATAGTACAAGGGGATCATGTTC	*	XT:A:R	NM:i:0	AM:i:3	NH:i:80
        3	272	chr10	72154315	3	50=	*	0	0	*	*	NM:i:0	AM:i:3	NH:i:80
        3	256	chr10	77806115	3	50=	*	0	0	*	*	NM:i:0	AM:i:3	NH:i:80
        3	272	chr10	127797927	3	50=	*	0	0	*	*	NM:i:0	AM:i:3	NH:i:80
        3	256	chr13	4007635	3	50=	*	0	0	*	*	NM:i:0	AM:i:3	NH:i:80
        3	256	chr14	17647057	3	50=	*	0	0	*	*	NM:i:0	AM:i:3	NH:i:80
        3	256	chr14	40019581	3	50=	*	0	0	*	*	NM:i:0	AM:i:3	NH:i:80
        3	256	chr15	50069753	3	50=	*	0	0	*	*	NM:i:0	AM:i:3	NH:i:80
        3	272	chr15	72113945	3	50=	*	0	0	*	*	NM:i:0	AM:i:3	NH:i:80
        3	256	chr15	75259455	3	50=	*	0	0	*	*	NM:i:0	AM:i:3	NH:i:80
        3	256	chr17	69579150	3	50=	*	0	0	*	*	NM:i:0	AM:i:3	NH:i:80
        3	256	chr19	60830365	3	50=	*	0	0	*	*	NM:i:0	AM:i:3	NH:i:80
        3	256	chr1	95272134	3	50=	*	0	0	*	*	NM:i:0	AM:i:3	NH:i:80
        3	256	chr1	102465771	3	50=	*	0	0	*	*	NM:i:0	AM:i:3	NH:i:80
        ....
        (I reduced some header lines and some alinments of read 3)

        As you can see, there are many perfect alignmens for read3, but only 2 alignments of reads 1 and 2. As I understand, with minid=0.6 all alignments of read3 should be also reported for reads 1 and 2?

        Could you, please, help to clarify this?

        Best,
        Minja

        Comment


        • Hi Brian,

          Thank you for your quick reply. I've tried to run BBNorm with prefilter=t and bits=16 on our server, but I still run out of memory. I have a server with 128GB of RAM, and I've tried also on a 256 GB but it didn't work. But I saw that the high CPU usage (more than defined in threads=x) is just at the beginning, for a short time and afterwards java limits itself to the required number of cores.
          Luckily, I've found that the PBS system divides the CPU hours to the walltime, and thus gets the CPU usage. The workaround was to ask the job as a STDIN, and be idle for around 30 minutes, which will create a "buffer", and thus even if I start with much higher initial CPU usage, the time buffer is enough to compensate, and thus the job doesn't get killed by the scheduler.

          However now I get an error with pigz:

          Table creation time: 1667.979 seconds.
          Started output threads.
          pigz: write error code 122
          pigz: abort: write error on <stdout>
          Exception in thread "Thread-53" java.lang.RuntimeException: java.io.IOException: Stream closed
          at stream.ReadStreamByteWriter.run(ReadStreamByteWriter.java:31)
          Caused by: java.io.IOException: Stream closed
          at java.lang.ProcessBuilder$NullOutputStream.write(ProcessBuilder.java:433)
          at java.io.OutputStream.write(OutputStream.java:116)
          at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
          at stream.ReadStreamByteWriter.writeFastq(ReadStreamByteWriter.java:451)
          at stream.ReadStreamByteWriter.processJobs(ReadStreamByteWriter.java:96)
          at stream.ReadStreamByteWriter.run2(ReadStreamByteWriter.java:41)
          at stream.ReadStreamByteWriter.run(ReadStreamByteWriter.java:27)
          Exception in thread "Thread-64" java.lang.RuntimeException: Writing to a terminated thread.
          at stream.ConcurrentGenericReadOutputStream.write(ConcurrentGenericReadOutputStream.java:207)
          at stream.ConcurrentGenericReadOutputStream.addOrdered(ConcurrentGenericReadOutputStream.java:193)
          at stream.ConcurrentGenericReadOutputStream.add(ConcurrentGenericReadOutputStream.java:98)
          at jgi.KmerNormalize$ProcessThread.normalizeInThread(KmerNormalize.java:3129)
          at jgi.KmerNormalize$ProcessThread.run(KmerNormalize.java:2801)
          Output buffer became full; key 95560 waiting on 95304.
          LE: I set the $TMPDIR on the cluster to my current working directory and i got rid of the pigz error:-)
          Last edited by sebastian_1; 03-16-2017, 03:24 AM.

          Comment


          • While you wait for @Brian to respond (he is on California Daylight Time) you could turn off "pigz=f unpigz=f" and see if that allows the job to move forward.

            Comment


            • Hello Brain,

              I'm trying to make use of bbmapskimmer.
              Here is a simple fasta example (read.fasta):

              Code:
              >1
              GAACATGATCCCCTTGTACTA[B]C[/B]TACATTATCACTAGCTTTATGTTTTCTA
              >2
              TAGAAAACATAAAGCTAGTGATAATGTAGTAGTACAAGGGGATCATGTTC
              >3
              GAACATGATCCCCTTGTACTA[B]T[/B]TACATTATCACTAGCTTTATGTTTTCTA
              Read 1 and read 3 are only different in single "T/C" letter, read 2 is reverse-complement of 1.

              I'm aligning to mm10 genome:

              ~/Distr/bbmap/bbmapskimmer.sh in=~/tmp/HiC_repeats/input/temp/read.fasta out=test_bbmap1.sam ref=~/tmp/HiC_repeats/fasta/mm10.fa ambiguous=all vslow k=8 minid=0.6 maxsites2=8000 maxsites=8000 local

              Here is test_bbmap1.sam:
              Code:
              @HD    VN:1.4    SO:unsorted
              @SQ    SN:chr10    LN:130694993
              ....
              @SQ    SN:chrM    LN:16299
              @SQ    SN:chrX    LN:171031299
              @SQ    SN:chrY    LN:91744698
              @PG    ID:BBMap    PN:BBMap    VN:37.02    CL:java -Djava.library.path=/mnt/storage/home/vsfishman/Distr/bbmap/jni/ -ea -Xmx98546m align2.BBMapPacBioSkimmer build=1 overwrite=true minratio=0.40 fastareadlen=6000 ambig=all minscaf=100 startpad=10000 stoppad=10000 midpad=6000 in=/mnt/storage/home/vsfishman/tmp/HiC_repeats/input/temp/read.fasta out=test_bbmap1.sam ref=/mnt/storage/home/vsfishman/tmp/HiC_repeats/fasta/mm10.fa ambiguous=all vslow k=8 minid=0.6 maxsites2=8000 maxsites=8000 local
              1    0    chr16    39205299    43    50=    *    0    0    GAACATGATCCCCTTGTACTACTACATTATCACTAGCTTTATGTTTTCTA    *    NM:i:0    AM:i:43    NH:i:2
              1    272    chr10    38707112    39    28=1X21=    *    0    0    *    *    NM:i:1    AM:i:39    NH:i:2
              2    16    chr16    39205299    43    50=    *    0    0    GAACATGATCCCCTTGTACTACTACATTATCACTAGCTTTATGTTTTCTA    *    NM:i:0    AM:i:43    NH:i:2
              2    256    chr10    38707112    39    28=1X21=    *    0    0    *    *    NM:i:1    AM:i:39    NH:i:2
              3    16    chr10    38707112    3    50=    *    0    0    TAGAAAACATAAAGCTAGTGATAATGTAATAGTACAAGGGGATCATGTTC    *    XT:A:R    NM:i:0    AM:i:3    NH:i:80
              3    272    chr10    72154315    3    50=    *    0    0    *    *    NM:i:0    AM:i:3    NH:i:80
              3    256    chr10    77806115    3    50=    *    0    0    *    *    NM:i:0    AM:i:3    NH:i:80
              3    272    chr10    127797927    3    50=    *    0    0    *    *    NM:i:0    AM:i:3    NH:i:80
              3    256    chr13    4007635    3    50=    *    0    0    *    *    NM:i:0    AM:i:3    NH:i:80
              3    256    chr14    17647057    3    50=    *    0    0    *    *    NM:i:0    AM:i:3    NH:i:80
              3    256    chr14    40019581    3    50=    *    0    0    *    *    NM:i:0    AM:i:3    NH:i:80
              3    256    chr15    50069753    3    50=    *    0    0    *    *    NM:i:0    AM:i:3    NH:i:80
              3    272    chr15    72113945    3    50=    *    0    0    *    *    NM:i:0    AM:i:3    NH:i:80
              3    256    chr15    75259455    3    50=    *    0    0    *    *    NM:i:0    AM:i:3    NH:i:80
              3    256    chr17    69579150    3    50=    *    0    0    *    *    NM:i:0    AM:i:3    NH:i:80
              3    256    chr19    60830365    3    50=    *    0    0    *    *    NM:i:0    AM:i:3    NH:i:80
              3    256    chr1    95272134    3    50=    *    0    0    *    *    NM:i:0    AM:i:3    NH:i:80
              3    256    chr1    102465771    3    50=    *    0    0    *    *    NM:i:0    AM:i:3    NH:i:80
              ....
              (I reduced some header lines and some alinments of read 3)

              And, just to note, without "local" option I got same results.

              As you can see, there are many perfect alignmens for read3, but only 2 alignments of reads 1 and 2. As I understand, with minid=0.6 all alignments of read3 should be also reported for reads 1 and 2?

              Could you, please, help to clarify this?
              And one more question, why reads 1 and 2 do not have "XT:A:" tag?

              Best,
              Minja
              Last edited by minja; 03-16-2017, 06:07 AM.

              Comment


              • Bbmap for mapping a transcriptome to a genome

                Hi Brian,
                I would like to use Bbmap to map a transcriptome to a genome.
                My first questions would be in the Results it saids:

                Reads Used: 165317 (61187186 bases)

                but the transcriptome that I am using has 95,389 transcripts. Shouldn't the number of transcripts be the same as the reads used?

                2nd. This transcriptome has a N50 of 363 and N75 of 696, the manual saids to use BBMapPacBio for reads longer than 700bp. However, when I use BBMapPacBio instead of bbmap I get a 28% mapping percentage, that is lower than when using bbmap (38%).This transcriptome has transcripts from other organisms and the mapping percentage should be around 50%.

                What else should I take into account when mapping a transcriptome to the genome, since is it different than mapping mRNA reads?

                Thanks

                Comment


                • @minja - I replied via email and still have not had a chance to look into that; sorry - I will when I have a chance.

                  @catagui - BBMap cannot handle query sequences over 600bp. When you feed it a fasta input file, by default, sequences longer than 500bp are broken into 500bp pieces and mapped individually (and not recombined afterward). Some of these pieces could be quite short (say, 15bp) and will usually map if you have a big enough reference. You can discard them with "minlen=40" or so. Or you could simply use MapPacBio for mapping as it can accommodate much longer reads. In that case, though, you should increase "maxindel" which has different defaults between BBMap and MapPacBio, and furthermore, a longer transcript could contain multiple long introns.

                  Try "mapPacBio.sh k=13 maxindel=200000 maxlen=60000 minlen=40 (other arguments)" and see if that improves its mapping rate. Ultimately, though, longer query sequences will always have lower alignment rates than shorter ones when using a global or glocal aligner, so shredding the sequences (which BBMap does) artificially increases the alignment rate.

                  Comment


                  • Hello Brian,

                    I made some viruses that have a set of degenerate nucleotides in the genome to look at bottlenecks - we're calling these barodes. I'm wondering if any of your tools have a way of extracting the specific region where the barcodes are and then listing the different sequences that are present and possibly how many times they are present. I've been able to do pretty much everything I've needed to with your tools so I figured I would ask. Thanks for all of the work you've done.

                    James

                    Comment


                    • Hi James,

                      Sorry for the slow reply. You could, in your case, do this:

                      Assume the barcode is from position 10 to 19 in the read (inclusive, zero-based). Then you would do:

                      reformat.sh in=reads.fq ftl=10 ftr=19 out=trimmed.fq
                      kmercountexact.sh in=trimmed.fq k=10 rcomp=f out=kmers.fa


                      "kmers.fa" will contain all the barcodes and their counts.

                      Comment


                      • Originally posted by Brian Bushnell View Post
                        Hi James,

                        Sorry for the slow reply. You could, in your case, do this:

                        Assume the barcode is from position 10 to 19 in the read (inclusive, zero-based). Then you would do:

                        reformat.sh in=reads.fq ftl=10 ftr=19 out=trimmed.fq
                        kmercountexact.sh in=trimmed.fq k=10 rcomp=f out=kmers.fa


                        "kmers.fa" will contain all the barcodes and their counts.


                        Awesome Brian, thanks for the response. It seems like that is going to work. It's a little more complicated since there are actually three amplicons with barcodes in different areas in the same file but if I can pull them out using one of your other tools the kmercountexact.sh should work just fine.

                        I thought maybe I would ask you a question about another problem I'm currently having. I have been trying to use BBmap to detect large deletions in viral sequencing data (defective genomes) but so far haven't had good luck. Is there a way I can sort out specifically these reads that have "break points" or deletions in the viral genome away from the reads that map normally?

                        Thanks again for all of your work.

                        James

                        Comment


                        • Normally, I would just map the reads with the flag "maxindel=100k" (or whatever is appropriate, given the length of the genome) and then call variants with "callvariants.sh" using the flag "rarity=0.01" to find low-frequency events. But alternatively, you can separate the reads with multiple BBMap passes, like this:

                          bbmap.sh in=reads.fq outm=mapped.fq maxindel=100k
                          bbmap.sh in=mapped.fq outm=normal.fq outu=longdeletion.fq maxindel=100k dellenfilter=20
                          bbmap.sh in=longdeletion.fq out=mapped_longdeletion.sam maxindel=100k

                          Comment


                          • Originally posted by Brian Bushnell View Post
                            Normally, I would just map the reads with the flag "maxindel=100k" (or whatever is appropriate, given the length of the genome) and then call variants with "callvariants.sh" using the flag "rarity=0.01" to find low-frequency events. But alternatively, you can separate the reads with multiple BBMap passes, like this:

                            bbmap.sh in=reads.fq outm=mapped.fq maxindel=100k
                            bbmap.sh in=mapped.fq outm=normal.fq outu=longdeletion.fq maxindel=100k dellenfilter=20
                            bbmap.sh in=longdeletion.fq out=mapped_longdeletion.sam maxindel=100k
                            Brian, that's awesome, I didn't even know there was a variant caller in your suite. That worked well and I was able to find a bunch of deletions by using the flag "clearfilters" in callvariants.sh. However, I'm sure a lot of it is probably not real. Is there a particular parameter you suggest looking at as a measure of probability of being real? I'm not sure what some of the readouts in the output from callvariants.sh actually mean.

                            Thanks again.

                            Comment


                            • Normally, the variant quality is the best indicator of whether a variant is real. I think, in this case, it's likely that most of them are real - viruses mutate rapidly, so viral "isolates" often contain multiple different viral genomes (which is why they are hard to assemble). The possibilities are, essentially...

                              1) Sequencing error randomly matched somewhere distant in the genome, causing the false appearance of a deletion. This is extremely unlikely with small, non-redundant genomes, and would usually only be represented by a single read.
                              2) Chimeric reads formed from different genomic fragments fusing together during sequencing may look like long deletion events. These should universally be represented by a single read (or pair) with a PCR-free or deduplicated library.
                              3) Real deletion events. Hopefully these will be represented by multiple reads.

                              So, you may want to deduplicate your library (if it was PCR'd, otherwise don't) and then call variants using a threshold of, say, 3 reads (minreads=3 flag, placed after "clearfilters").

                              Comment


                              • Originally posted by Brian Bushnell View Post
                                @minja - I replied via email and still have not had a chance to look into that; sorry - I will when I have a chance.

                                @catagui - BBMap cannot handle query sequences over 600bp. When you feed it a fasta input file, by default, sequences longer than 500bp are broken into 500bp pieces and mapped individually (and not recombined afterward). Some of these pieces could be quite short (say, 15bp) and will usually map if you have a big enough reference. You can discard them with "minlen=40" or so. Or you could simply use MapPacBio for mapping as it can accommodate much longer reads. In that case, though, you should increase "maxindel" which has different defaults between BBMap and MapPacBio, and furthermore, a longer transcript could contain multiple long introns.

                                Try "mapPacBio.sh k=13 maxindel=200000 maxlen=60000 minlen=40 (other arguments)" and see if that improves its mapping rate. Ultimately, though, longer query sequences will always have lower alignment rates than shorter ones when using a global or glocal aligner, so shredding the sequences (which BBMap does) artificially increases the alignment rate.
                                Hi Brian thanks for your reply.

                                I haven't been able to improve the mapping percentage. I tried your suggestion (bbmap/mapPacBio.sh k=13 maxindel=200000 maxlen=60000 minlen=40) and still get 13%.

                                Is it because the mapping is too stringent? As I am dealing with different species within the same genus. Is there another flag aside from "minid" that I could change to improve the mapping.

                                Thanks again

                                Comment

                                Latest Articles

                                Collapse

                                • seqadmin
                                  Current Approaches to Protein Sequencing
                                  by seqadmin


                                  Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                  04-04-2024, 04:25 PM
                                • seqadmin
                                  Strategies for Sequencing Challenging Samples
                                  by seqadmin


                                  Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                  03-22-2024, 06:39 AM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by seqadmin, 04-11-2024, 12:08 PM
                                0 responses
                                12 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-10-2024, 10:19 PM
                                0 responses
                                17 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-10-2024, 09:21 AM
                                0 responses
                                14 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-04-2024, 09:00 AM
                                0 responses
                                43 views
                                0 likes
                                Last Post seqadmin  
                                Working...
                                X