Seqanswers Leaderboard Ad
Collapse
Announcement
Collapse
No announcement yet.
X
-
I don't understand why you are aligning DNA reads to a transcriptome. To clarify, are these DNA reads or RNA-seq reads? And can you explain what you're trying to do?
-
I tried using the settings you had recommended, but the standard deviation is still awfully high. Would this be related to the fact that we are trying to map WGS sequences to a transcriptome?
Here are the results:
With ambig=random setting and local=t
Set USE_COVERAGE_ARRAYS to true
Set USE_BITSETS to false
Note: Coverage capped at 65535
Average coverage: 31.27
Standard deviation: 628.72
Percent scaffolds with any coverage: 99.94
Percent of reference bases covered: 75.89
Time: 2084.076 seconds.
----------------------------------------------------------------------------------------------------------------------
With ambig=random setting, but no local flag
Set USE_COVERAGE_ARRAYS to true
Set USE_BITSETS to false
Note: Coverage capped at 65535
Average coverage: 31.27
Standard deviation: 635.37
Percent scaffolds with any coverage: 99.94
Percent of reference bases covered: 77.75
Time: 1935.627 seconds.
--------------------------------------------------------------------------------------------------------------------
Thank you again Brian for your last reply!
Leave a comment:
-
Well... I'm not quite sure why you're mapping DNA reads to the transcriptome. If the annotation is accurate, you already have the full mRNA nucleotide sequences in the file "rna.fa.gz" and don't need to generate a sam file at all, nor will WGS help you find the mRNA sequences with any methodology I can think of.
Are you sure you meant WGS (ie, DNA reads) rather than RNA-seq data? For RNA-seq data, to find full-length mRNAs, you would probably assemble it using an assembler like Trinity, or map it to the genome and use some other tool to guess isoforms from mapping data.
As for the high standard deviation - BBMap by default will map reads that could map to multiple locations (aka multimapping, ambiguous, or non-uniquely-mapping reads) to the first genomic location they can map to optimally. For this kind of purpose (RNA-seq, or mapping to a transcriptome) I recommend you set "ambig=random" or "ambig=all". Otherwise, for genes with multiple isoforms, all the reads from that gene will map to the first isoform containing the exon in question, leading to a high standard deviation. For mapping RNA to a transcriptome I generally recommend "ambig=random".
Also, if you ARE mapping DNA to the transcriptome (which, again, is probably not useful), you should set the "local" flag - otherwise reads that span intron/exon boundaries will have a lot of mismatches since there is intronic content in DNA reads, but none in the transcriptome.
-Brian
Leave a comment:
-
Greetings community! I am only a first year undergrad student, so please bare with my ignorance.
I have encountered some difficulties trying to use BBMap to align WGS pair end reads belonging to a Sheep to it's pertinent rna.fa file - I'm only interested in the organism's full mRNA sequences, but I do not know where else to get them.
The problem I'm facing surfaces after verifying Average Coverage and Standard Deviation calculations. While the coverage results are as expected, the standard deviation values are well over 600!
I will post the command line and the path to the reference data I am using below, to ease out the process of troubleshooting:
Reference Data: found under NCBI's ftp site, under the following path: genomes/Ovis_aries/RNA/rna.fa.gz
Here is a link that leads directly to the directory containing the reference file I am using: ftp://ftp.ncbi.nlm.nih.gov/genomes/Ovis_aries/RNA/
*If anyone else would kindly suggest a better way of obtaining only the full mRNA nucleotide sequences of the organism, please be kind enough to enlighten my poor soul*
Reads: I am using paired end fastq WGS reads that were used to build the organism's genome assembly. They have already undergone proper quality testing.
After I create my build using BBMap and my reference file (rna.fa), I proceed to align my reads to the reference build. Here is what my command line looks like:
bbmap.sh in1=read_1.fastq in2=read_2.fastq out=MappedFile.sam build=1 minratio=0.56
(I am trying to compare ambiguous values later on, so I need the min ratio. Making the "minratio=<>" parameter closer to 1 does lower the Standard Deviation, but it is still unusually high)
Cool, up to there everything runs smoothly. I take a look at my alignment results and proceed to calculate coverage and standard deviation values using pileup.sh. Again, here is my command line:
pileup.sh in=MappedFile.sam out=stats.txt hist=histogram.txt
After looking at the results, they look something like this:
Set USE_COVERAGE_ARRAYS to true
Set USE_BITSETS to false
Note: Coverage capped at 65535
Average Coverage: 37.84
Standard Deviation: 707.47
Percent scaffolds with any coverage: 80.76
Percent of reference bases covered: 35.87
I'd be incredibly thankful if anyone would shed some light into this little hindrance. What am I messing up?
Leave a comment:
-
Hi Rahul,
For a paired end dataset and a single-end dataset, you need to run mapping twice. However, you can do it in one command with BBWrap like this:
Code:bbwrap.sh ref=reference.fa in=read1.fq,merged.fq in2=read2.fq,null out=mapped.sam append=t
Code:bbmap.sh ref=reference.fa bbmap.sh in=read1.fq in2=read2.fq out=mapped_pairs.sam bbmap.sh in=merged.fq out=mapped_merged.sam
Leave a comment:
-
Aligning reads after bbmerge
Hello,
I have PE reads, many of which overlap. I used bbmerge to merge the overlapping reads. In the end I have 3 fastq files:
1. Merged single end reads
2. End 1 of no-merged reads
3. End 2 of no-merged reads
How do I specify these 3 fastq files to bbmap [what should I give for in and in2] ? Thanks.
bye,
RahulLast edited by rkanwar; 09-11-2015, 10:42 AM.
Leave a comment:
-
Originally posted by Brian Bushnell View PostAs for your second post - that seems much more concerning (to me). I'll examine it in detail tomorrow, and thanks as always for your thorough bug reports. I'm still working on the Seal qhdist issue, by the way; I've just been preoccupied with upgrading some programs to support unlimited max kmer lengths, which is now mostly finished.
Leave a comment:
-
I'll have to doublecheck that pairedonly/killbadpairs thing I thought I was seeing. I like the pairedonly strategy and I can deal with mates mapped to different references in other ways.
Yes, tools like RSEM and eXpress need to have as many alignments as possible per read within reason. In fact the default alignment for RSEM is with bowtie (1) allowing 2 mismatches in the seed and then up to the entire rest of the read to be mismatches and all alignments that fall under that threshold. Multiple best alignments is also super useful for the program because they it can see which reads are totes ambiguous (i.e. the ones that map exactly the same to mulitple isoforms - those that hit an exon that's shared among multiple isoforms). So that information is necessary but it's also necessary to have what most aligners consider suboptimal alignments because those programs don't necessarily look at each alignment's quality but the overall solution to the abundance problem. The need to see which target references a read can map to within some reason. I don't like to allow as much wiggle room as the default RSEM bowtie setting - I typically allow up to 10% error. In some cases it would be entirely reasonable to only accept the best alignment...but I would still need ambig=all to satisfy the requirements for RSEM/eXpress. Obviously I could just use bowtie2 which works great but it is slooowwwwwww. STAR works pretty good and is a great substitute for bowtie2. I want to see if I can find the right settings for BBMap too and see how it ranks. The variable between programs tends to be the number of alignments found. bowtie2 finds many more PE alignments than bowtie1, for example. STAR comes close to bowtie2 while bwa falls short of them both. Basically whichever aligner combines finding the 'most' mappings with speed is my favorite for this job. I haven't specifically looked to see how well BBMap stacks up but I'll do that. The kind of test I'm talking about is to generate transcriptome based simulated reads and then map them with different aligners. I just check, per read/pe-fragment, if the aligner successfully reported the original correct alignment along with, if any, additional alignments. Again bowtie2 was the winner between bowtie1, STAR, bwa (aln and mem), and I think subread. Unfortunately it is slow in the alignment mode where it reports more than a single alignment (-k or -a mode).
FYI the default behavior for STAR, bowtie and bowtie2 when allowing multiple mappings per pair is to report them in R1, R2, R1, R2 strict formatting. So if bowtie2 finds 12 locations for a single fragment it will print out 24 lines in R1,R2 pairs. All of the secondary locations get the 0x100 flag set. Basically that just makes parsing and interpreting them really easy since, as you know, if you can count on a certain formatting it means you can write less code. Obviously the RSEM program could do this but they chose to put the burden of formatting on the user - probably to keep the number of decisions going on behind the scenes to a minimum. I was parsing the BBMap alignments by watching the read name and piling up the 0x40 mates in one array with 0x80 mates in a second array and then afterwards I ran all possible combinations of them through a function that validates whether or not they are a valid pair. It would make their program a little more powerful, in my opinion, if they could just incorporate something like that but alas...I guess that's what keeps Perl in business.
Leave a comment:
-
Woah, I replied to your first post and it never registered. To summarize:
I was unable to reproduce an interaction of the "pairedonly" and "killbadpairs" flag generating singletons, so if you can send me a read pair and reference for which that occurs, it would be helpful. In my testing, when pairedonly=true, I get identical results with killbadpairs true or false. But, I don't use killbadpairs anymore - when I want to produce only properly paired mapped reads, I set pairedonly and use the outm stream, e.g.
bbmap.sh in=reads.fq outm=mapped_paired.fq pairedonly
For RSEM - the default behaviour of BBMap is to produce R1, R2, R1, R2, etc. in sam files, so it should be fine. It will do this with the flags "ambig=best" (default), "ambig=random", and "ambig=toss". The only time it will not do that is with "ambig=all", in which it will print all R1 alignments, then all R2 alignments. Enforcing strict interleaving when printing all alignments would probably result in output that is either incorrect or violates the sam spec, which makes me wonder where it ever occurs... so, I'm a bit reluctant to add the option, but I already have other flags like "keepnames" that (as noted) violate the sam spec but are still useful, so I'm willing to add it if useful. Does RSEM require all mapping locations, rather than a random location, for multimapping reads?
As for your second post - that seems much more concerning (to me). I'll examine it in detail tomorrow, and thanks as always for your thorough bug reports. I'm still working on the Seal qhdist issue, by the way; I've just been preoccupied with upgrading some programs to support unlimited max kmer lengths, which is now mostly finished.
Leave a comment:
-
Sort of continuing on the subject I posted above. Again I'm looking at paired-end alignments to a transcriptome where I'm allowing multiple alignments per pair. I'm using both ambiguous=all and secondary=t to allow some additional sub-optimal alignments to be reported. I need pairs to be reported as neighbors in the output so when there are 2 alignments for the first mate but only 1 for the second I need that to either be two lines or four lines depending on how those alignments were paired. So in parsing the output of bbmap today I found something that's throwing me off that appears to be a bug.
I've been parsing out reads in cases where there are more than one alignment of either the left or right mate so that I can see how bbmap reports those alignments. The following is one such set (first mate separated from second mate alignments):
Code:#== left ==# SRR1016945.2666 83 NM_146193 1295 42 100M = 1199 -196 GAACCTATAGAGATCCTGCCCAATGTCTGTTACACAGCGTGCGCAACACTCAAGGGCCCAGATTCTCACTATGGCACAAAAGGACTGAAGAAAGTAGTGT DCCDEEDEECCCDDDBDDDDDDEEDEEFFFEFFFHHIIIIIIIJIGGFJIJJJJJIJJJJIJJJJJJIJJJIJJJHJJJJJJJIJJJHHHHHFFFFFCCC NM:i:1 AM:i:42 NH:i:1 SRR1016945.2666 353 ENSMUST00000121441 1072 40 100M NM_146193 1295 0 CAGATCATTGAATATGAGAAAAAACAAACCTTGGGACAGAATGATACTGGATCTAGTTGTGATGGGACTGCTAACACATTCAGGGTCATGTTCAAAGAAC CCCFFFFFHHHGHJJJJJIJJJJJJJJJJJJJJJJJJIIGIJJJJJIJJJJIIGIJIJJHIJJJJJIHHHHHFFFFFDEEDEEDDCCDDDEEEDCCDDDD NM:i:2 AM:i:40 NH:i:3 SRR1016945.2666 353 ENSMUST00000178360 1054 40 100M NM_146193 1295 0 CAGATCATTGAATATGAGAAAAAACAAACCTTGGGACAGAATGATACTGGATCTAGTTGTGATGGGACTGCTAACACATTCAGGGTCATGTTCAAAGAAC CCCFFFFFHHHGHJJJJJIJJJJJJJJJJJJJJJJJJIIGIJJJJJIJJJJIIGIJIJJHIJJJJJIHHHHHFFFFFDEEDEEDDCCDDDEEEDCCDDDD NM:i:2 AM:i:40 NH:i:3 #== right ==# SRR1016945.2666 163 NM_146193 1199 42 100M = 1295 196 CAGATCATTGAATATGAGAAAAAACAAACCTTGGGACAGAATGATACTGGATCTAGTTGTGATGGGACTGCTAACACATTCAGGGTCATGTTCAAAGAAC CCCFFFFFHHHGHJJJJJIJJJJJJJJJJJJJJJJJJIIGIJJJJJIJJJJIIGIJIJJHIJJJJJIHHHHHFFFFFDEEDEEDDCCDDDEEEDCCDDDD NM:i:1 AM:i:42 NH:i:3
This appears to be happening randomly because sometimes I'll see multiple first-mate mappings and a single second-mate mapping but all of the 0x40 and 0x80 flags are set correctly.
Another interesting case:
Code:#== left ==# SRR1016945.1906 83 NM_011653 367 44 100M = 266 -201 AGCAGCTCATCACAGGCAAGGAGGATGCTGCCAATAACTATGCTCGTGGCCACTACACCATTGGCAAGGAGATCATTGACCTTGTCCTGGACAGGATTCG DDDDDCEDDDC@DDDDDCDEEDEEEFFFFHGHHGHIIJIIIJJJJJIIJIJJIHHGDJIGGGJJJJJIJJJJIIGHFIGJJJJJJJJHHHHHFFFFFCCC NM:i:0 AM:i:44 NH:i:3 SRR1016945.1906 337 NM_009448 800 42 100M NM_011653 266 0 AGCAGCTCATCACAGGCAAGGAGGATGCTGCCAATAACTATGCTCGTGGCCACTACACCATTGGCAAGGAGATCATTGACCTTGTCCTGGACAGGATTCG DDDDDCEDDDC@DDDDDCDEEDEEEFFFFHGHHGHIIJIIIJJJJJIIJIJJIHHGDJIGGGJJJJJIJJJJIIGHFIGJJJJJJJJHHHHHFFFFFCCC NM:i:1 AM:i:42 NH:i:3 SRR1016945.1906 337 ENSMUST00000122404 269 42 100M NM_011653 266 0 AGCAGCTCATCACAGGCAAGGAGGATGCTGCCAATAACTATGCTCGTGGCCACTACACCATTGGCAAGGAGATCATTGACCTTGTCCTGGACAGGATTCG DDDDDCEDDDC@DDDDDCDEEDEEEFFFFHGHHGHIIJIIIJJJJJIIJIJJIHHGDJIGGGJJJJJIJJJJIIGHFIGJJJJJJJJHHHHHFFFFFCCC NM:i:1 AM:i:42 NH:i:3 SRR1016945.1906 353 NM_009448 699 42 100M NM_011653 367 0 AGGAGCTGGCAAGCATGTGCCCCGGGCAGTGTTCGTAGACCTGGAACCCACGGTCATCGATGAAGTTCGCACCGGCACCTACCGCCAGCTCTTCCACCCT B@CDDDFDHHHHHIJIJI>AHGHIGIBGIGHDHGIBFHGJJJJJIJFI6D;BHDFFFCEEDEDDDACDDBBB=@BBBDBDDCB@BBBDBDCCDDDDD<<A NM:i:1 AM:i:42 NH:i:2 #== right ==# SRR1016945.1906 163 NM_011653 266 44 100M = 367 201 AGGAGCTGGCAAGCATGTGCCCCGGGCAGTGTTCGTAGACCTGGAACCCACGGTCATCGATGAAGTTCGCACCGGCACCTACCGCCAGCTCTTCCACCCT B@CDDDFDHHHHHIJIJI>AHGHIGIBGIGHDHGIBFHGJJJJJIJFI6D;BHDFFFCEEDEDDDACDDBBB=@BBBDBDDCB@BBBDBDCCDDDDD<<A NM:i:0 AM:i:44 NH:i:2
Leave a comment:
-
Hello Brian-
I'm curious if there is an output setting that I'm missing for PE alignments that would maintain a strict grouping of the mates in the output. I'd like to try using bbmap.sh as an aligner to feed RSEM (and maybe eXpress) which was designed around bowtie alignments. RSEM is VERY picky and there is almost no way to get alignments to work with their program without running them through some sort of reformatting script. One thing, however, that would be useful is to have the alignment output be able to keep mates in two-row pairs no matter what. So no matter what if line X is the left mate (mapped or unmapped) then line X+1 is the right mate mapped or unmapped. This way every pair of lines go together even through this will result in one mate or the other's alignment being reported more than once. RSEM flips out if it doesn't find the alignments like this.
TL;DR; An option so that every pair of lines in the output SAM represent the first and second mates of a fragment even if this causes redundant reporting of individual mate alignments.
Another question - I wanted to exclude singletons and improper pairs (most specifically pairs with mates mapped to different transcripts) so I enabled pairedonly AND killbadpairs. Each option taken separately works however killbadpairs seems to happen AFTER pairedonly, internally, because killbadpairs creates singletons in the output. Is it possible to have pairedonly apply to the result of killbadpairs as well?
thanks!
Leave a comment:
-
That's actually intentional, and happens when there's an N in the read or the reference. The SAM spec does not specify what to do in these situations. So, the way I see it, A aligning to A is a match (=), a aligning to C is a substitution (X), and A aligning to N - or even N aligning to N - might be a match, and might be a substitution. Since it's not an insertion or deletion, it's length-neutral, so I call it an alignment match.
Are these causing problems for some downstream application? If so, I can add a flag that changes the behavior.
Leave a comment:
-
Hi Brian,
I found a small bug in the v1.4 cigar strings: Whenever there is a "N" in the read sequence, one gets a "M" for this position instead of a "=".
Originally posted by ExampleRead:
GATATCGGTTTCATCCTCGCCTTAGCATGATTTATCCTACACTCCAACTCATGAGACCCCANAACAAATAGCCCTTCTAAACGCTAATCCAAGCCTCACCCCCACTACTAGGCCTCCTCCTAGCAGCAGCAGGCAAATCAGCCCAATTGGCTCCACCCCTGACTCCCCTCAG
aligns with the following cigar:
56=1I4=1M36=1I49=1X1=2D22=
Leave a comment:
-
This bug has now been fixed as of 35.14, which I just uploaded. Filtering was essentially not being applied when the best site was filtered and inferior sites passed the minid yet did not have cigar strings (a very rare scenario). But, it works fine now.
-Brian
P.S. Without filtering, just with default parameters, you get this alignment:
Code:M01472:214:000000000-AG0YC:1:2108:10755:20410 1:N:0:78 0 NC_001802DRannotations_(modified) 2154 35 46=1032D34=1X1=1X24=1X21=
Leave a comment:
-
OK, that looks like a bug that is related to the "subfilter" setting. Without subfilter, it gets mapped here:
Code:M01472:214:000000000-AG0YC:1:2108:10755:20410 1:N:0:78 0 NC_001802DRannotations_(modified) 3186 15 1=1X2=1X7=3X2=2X1=1X3=3X9=3X2=2X1=2X34=1X1=1X24=1X21=
-Brian
Leave a comment:
Latest Articles
Collapse
-
by seqadmin
Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...-
Channel: Articles
04-04-2024, 04:25 PM -
-
by seqadmin
Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...-
Channel: Articles
03-22-2024, 06:39 AM -
ad_right_rmr
Collapse
News
Collapse
Topics | Statistics | Last Post | ||
---|---|---|---|---|
Started by seqadmin, 04-11-2024, 12:08 PM
|
0 responses
31 views
0 likes
|
Last Post
by seqadmin
04-11-2024, 12:08 PM
|
||
Started by seqadmin, 04-10-2024, 10:19 PM
|
0 responses
32 views
0 likes
|
Last Post
by seqadmin
04-10-2024, 10:19 PM
|
||
Started by seqadmin, 04-10-2024, 09:21 AM
|
0 responses
28 views
0 likes
|
Last Post
by seqadmin
04-10-2024, 09:21 AM
|
||
Started by seqadmin, 04-04-2024, 09:00 AM
|
0 responses
53 views
0 likes
|
Last Post
by seqadmin
04-04-2024, 09:00 AM
|
Leave a comment: