  • SillyPoint
    replied
    I'd just logged on here to post exactly the question acnoll poses above: "is it possible to have pairs with only one end mapping to the genome be included in the alignment file?"

    The implication there, which after reading the manual and running Bowtie 0.12.1 I believe, is that only read pairs which both match, and fall within the -I/-X constraints, will be output. True?

    The alternative for now is to specify the -a option to get all the mapped output, and post-process that to find what you're interested in, be that the best pair (for some definition of "best"), or reads where only one end matches.

    To have the option to do that directly in Bowtie would be nice.

    --TS


  • acnoll
    replied
    Option for output of pairs where only one end aligns

    With bowtie's current set of options is it possible to have pairs with only one end mapping to the genome be included in the alignment file (e.g. sam file)? I am interested in identifying intra-read short indels through the anchoring of one of a mate pair's ends.


  • xuying
    replied
    Hi Ben:
    It seems I can't find a suitable place to put my csfastq file.
    Here are a few lines from the csfastq file generated by the "solid2fastq" program from bfast. Do you think it is OK to use as is? Should I remove the primer letter and the 1st color to get a true base there?

    @2292_469_84
    T210002310010221002200330303002200201120221.2111.2.
    +
    8<;==:=@?=<<>>>;;??<=<;96:?:5<>;85:=7,,:5/",(/)"*"
    @2292_469_216
    T000111101020011320222113222200220200120202.2222.2.
    +
    /6=>=::>>=;==>;;6=;;9<6:8<(3:-<;/9:852=-7/"2(6)")"
    @2292_469_274
    T300101122322222232222222210222222222022220.2222.2.
    +
    ,=#$$#@%#'#>$,&(;$*$*=)*'&6%,%##*,+#,4),#)",5'#","
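For readers wondering what the primer letter and the first color encode, here is a small sketch of naive colorspace decoding. It assumes the standard SOLiD two-base encoding (each color is the XOR of the 2-bit codes of two adjacent bases, with A=0, C=1, G=2, T=3). `decode_colors` is a hypothetical helper for illustration only; it is not how Bowtie handles colorspace reads internally.

```python
BASES = "ACGT"

def decode_colors(read):
    """Naively decode a primer-prefixed colorspace read into bases.

    Each color is the XOR of the 2-bit codes (A=0, C=1, G=2, T=3) of two
    adjacent bases, so a known base plus a color gives the next base.
    Decoding stops at the first '.' (no color call).
    """
    prev = BASES.index(read[0])  # primer base, not part of the sequenced fragment
    decoded = []
    for color in read[1:]:
        if color == ".":
            break
        prev ^= int(color)
        decoded.append(BASES[prev])
    return "".join(decoded)

# First eight characters of the first read shown above.
print(decode_colors("T2100023"))  # CAAAAGC
```

The first color only describes the transition from the primer base to the first real base, which is why the two are handled together. Note also that with this naive approach a single wrong color changes every base after it.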


  • xuying
    replied
    Oh, yes, sorry. I just confused the file with CIGAR notation.


  • Ben Langmead
    replied
    Originally posted by xuying
    There are millions of lines in SAM and pileup files. So fixed "48M" in SAM and fixed "A" in pileup file are unreasonable. (pls wait for me to send you the csfastq files). Thanks a lot! :-)
    Why? Given that M = "match or mismatch", when would you expect something other than 48M?

    Ben


  • xuying
    replied
    Hi Ben:
    I will put the csfastq (maybe part of it) later somewhere because it's huge.
    And I am using bowtie 0.12.1 (but color index was built by using 0.12-beta).
    There are millions of lines in SAM and pileup files. So fixed "48M" in SAM and fixed "A" in pileup file are unreasonable. (pls wait for me to send you the csfastq files). Thanks a lot! :-)


  • Ben Langmead
    replied
    Hi xuying,

    Thank you for the detailed report:

    Originally posted by xuying
    I tried SAMtools on "base space" SAM files generated by aligning "color space" reads with Bowtie. Why do I always get "A" in the 3rd column of the pileup file? It seems some kind of error exists.

    chr1 1185 g A 0 0 60 1 . !
    chr1 1190 c A 0 0 60 1 . !
    chr1 1191 t A 0 0 60 1 . !
    chr1 1222 c A 0 0 60 2 .. !!
    chr1 1231 t A 0 0 60 2 .. !!
    chr1 1232 t A 0 0 60 2 .. !!
    chr1 1509 c A 0 0 60 1 . !
    chr1 1511 t A 0 0 60 1 . !
    chr1 1512 t A 0 0 60 1 .$ !
    chr1 1850 G A 0 0 60 3 ... !!!
    chr1 2134 C A 0 0 60 1 . !

    Is there any problem with calling SNPs from a "base space" SAM file produced by aligning "color space" reads? Or is the coverage too low? I just want to see whether bowtie -C and samtools work for calling SNPs in color space reads (I converted the csfasta and qual files into a csfastq file with "solid2fastq" from bfast).
    If I understand correctly, that is very low coverage (~2), and the qualities are also low. Can you send me the fastq file you're using? Also, are you using 0.12.1? Note that versions < 0.12.1 had an issue whereby Bowtie would fail to trim the first color from csfasta reads.
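A quick way to check the coverage and quality point against the pileup lines quoted above. This is only a sketch: it assumes the ten-column consensus pileup layout (read depth in column 8, base qualities in column 10) and Phred+33 quality encoding, and `summarize_pileup` is a made-up helper name, not part of samtools.

```python
PILEUP = """\
chr1 1185 g A 0 0 60 1 . !
chr1 1190 c A 0 0 60 1 . !
chr1 1191 t A 0 0 60 1 . !
chr1 1222 c A 0 0 60 2 .. !!
chr1 1231 t A 0 0 60 2 .. !!
chr1 1232 t A 0 0 60 2 .. !!
chr1 1509 c A 0 0 60 1 . !
chr1 1511 t A 0 0 60 1 . !
chr1 1512 t A 0 0 60 1 .$ !
chr1 1850 G A 0 0 60 3 ... !!!
chr1 2134 C A 0 0 60 1 . !
"""

def summarize_pileup(text):
    """Return (mean depth, highest base quality) over whitespace-separated pileup lines."""
    depths, quals = [], []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 10:
            continue  # skip anything that is not a full ten-column record
        depths.append(int(fields[7]))                    # column 8: reads covering the site
        quals.extend(ord(c) - 33 for c in fields[9])     # column 10: Phred+33 base qualities
    return sum(depths) / len(depths), max(quals)

mean_depth, max_qual = summarize_pileup(PILEUP)
print(round(mean_depth, 2), max_qual)  # 1.45 0
```

Under those assumptions, every base in the excerpt has quality 0 ("!") and the depth never exceeds 3, which fits the low-coverage, low-quality reading.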

    Originally posted by xuying
    some content in SAM file:

    4_1246_1108 67 chr16 20648174 255 48M = 20650124 1998 CCTCTGGGTTTGTAGATTTGCCACTCTTAAGAGGCAAGGATTGACAGG OOQSQKOJJQECHAE?-=D82.893'!/3!!0=2!!9F?)!!!=?'00 XA:i:0 MD:Z:48 NM:i:0 CM:i:8
    4_1246_1108 131 chr16 20650125 255 48M = 20648173 -2000 AGTAAGTGGTCATCTATAAAGCAAAGACTGCCTGTGAAATAAATGGGA KEFJQTVSTVUSGIOSE2GJ!!@OOSRNPJI@/ATF("/4:;@ACHG+ XA:i:1 MD:Z:48 NM:i:0 CM:i:3
    4_1253_1656 179 chr17 66720558 255 48M = 66723289 2779 GACATGCTAAGGAAAGAGTGAAAATGGAGTCATATTAAAATGTTAAGT !&!!!!!:@N"'WSNNKHIMKORHADMQTTPOULFJMXUUZRRYXYIG XA:i:0 MD:Z:48 NM:i:0 CM:i:7
    4_1253_1656 115 chr17 66723290 255 48M = 66720557 -2781 TAAAGAAATCTCCAGGCCCAAATGGTTTTACTTGTCAATTCTACCAAA !!!!/8NRJCGPULHBBPI@BLVNDAO[QNDK\NLTWVUVNNPNIGTL XA:i:0 MD:Z:48 NM:i:0 CM:i:3
    4_1254_1557 67 chr1 40359166 255 48M = 40361009 1891 TACTGGACAACACAGTTCTAGTATGTAAGCTTTGAGAGAGCAGGGATT K??CGR>;JFHNL>@OA@GCF94<BOF::;84=@C!!NML4/I;9&(A XA:i:0 MD:Z:48 NM:i:0 CM:i:3
    4_1254_1557 131 chr1 40361010 255 48M = 40359165 -1893 CCTTTTTCTTGAATAATCTATTTCTTAGTATGTCTTAATTTACTAATA YTVXX[Y\^^^VJPZYMN[YRNLPVNCKUWZUJLSD?;>IIA:FM!!! XA:i:0 MD:Z:48 NM:i:0 CM:i:2

    All "48M" alignments? Mismatches should be reported, since I used "-C -q -n 2 -l 25 --snpfrac 0.001" to do the bowtie mapping. Can you help me identify my problem? Thanks a lot!
    In CIGAR, "M" means "either match or mismatch" (see the SAM paper), so that output is correct.

    Thanks,
    Ben
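To see where mismatches actually are, look at the MD:Z and NM:i tags rather than the CIGAR string. A minimal sketch, assuming the usual MD syntax (match-run lengths, single mismatched reference bases, and ^-prefixed deletions); `md_mismatches` is a hypothetical helper, not a samtools or Bowtie function.

```python
import re

def md_mismatches(md):
    """Return (0-based offset along the aligned reference, reference base) per mismatch.

    Deleted reference bases (written as ^ACG in MD) advance the offset
    but are not reported as mismatches.
    """
    offset = 0
    found = []
    for run, deletion, base in re.findall(r"(\d+)|\^([A-Za-z]+)|([A-Za-z])", md):
        if run:
            offset += int(run)          # run of matching bases
        elif deletion:
            offset += len(deletion)     # bases deleted from the read
        else:
            found.append((offset, base))
            offset += 1
    return found

print(md_mismatches("48"))       # []  (48 matching bases, as in the records quoted above)
print(md_mismatches("12A7T29"))  # [(12, 'A'), (20, 'T')]
```

So a record with CIGAR 48M, MD:Z:48 and NM:i:0 has no base mismatches at all, while a 50M record with MD:Z:12A7T29 has two, even though both CIGAR strings contain only "M".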


  • xuying
    replied
    Hi Ben:
    I tried SAMtools on "base space" SAM files generated by aligning "color space" reads with Bowtie. Why do I always get "A" in the 3rd column of the pileup file? It seems some kind of error exists.

    chr1 1185 g A 0 0 60 1 . !
    chr1 1190 c A 0 0 60 1 . !
    chr1 1191 t A 0 0 60 1 . !
    chr1 1222 c A 0 0 60 2 .. !!
    chr1 1231 t A 0 0 60 2 .. !!
    chr1 1232 t A 0 0 60 2 .. !!
    chr1 1509 c A 0 0 60 1 . !
    chr1 1511 t A 0 0 60 1 . !
    chr1 1512 t A 0 0 60 1 .$ !
    chr1 1850 G A 0 0 60 3 ... !!!
    chr1 2134 C A 0 0 60 1 . !

    Is there any problem with calling SNPs from a "base space" SAM file produced by aligning "color space" reads? Or is the coverage too low? I just want to see whether bowtie -C and samtools work for calling SNPs in color space reads (I converted the csfasta and qual files into a csfastq file with "solid2fastq" from bfast).

    some content in SAM file:

    4_1246_1108 67 chr16 20648174 255 48M = 20650124 1998 CCTCTGGGTTTGTAGATTTGCCACTCTTAAGAGGCAAGGATTGACAGG OOQSQKOJJQECHAE?-=D82.893'!/3!!0=2!!9F?)!!!=?'00 XA:i:0 MD:Z:48 NM:i:0 CM:i:8
    4_1246_1108 131 chr16 20650125 255 48M = 20648173 -2000 AGTAAGTGGTCATCTATAAAGCAAAGACTGCCTGTGAAATAAATGGGA KEFJQTVSTVUSGIOSE2GJ!!@OOSRNPJI@/ATF("/4:;@ACHG+ XA:i:1 MD:Z:48 NM:i:0 CM:i:3
    4_1253_1656 179 chr17 66720558 255 48M = 66723289 2779 GACATGCTAAGGAAAGAGTGAAAATGGAGTCATATTAAAATGTTAAGT !&!!!!!:@N"'WSNNKHIMKORHADMQTTPOULFJMXUUZRRYXYIG XA:i:0 MD:Z:48 NM:i:0 CM:i:7
    4_1253_1656 115 chr17 66723290 255 48M = 66720557 -2781 TAAAGAAATCTCCAGGCCCAAATGGTTTTACTTGTCAATTCTACCAAA !!!!/8NRJCGPULHBBPI@BLVNDAO[QNDK\NLTWVUVNNPNIGTL XA:i:0 MD:Z:48 NM:i:0 CM:i:3
    4_1254_1557 67 chr1 40359166 255 48M = 40361009 1891 TACTGGACAACACAGTTCTAGTATGTAAGCTTTGAGAGAGCAGGGATT K??CGR>;JFHNL>@OA@GCF94<BOF::;84=@C!!NML4/I;9&(A XA:i:0 MD:Z:48 NM:i:0 CM:i:3
    4_1254_1557 131 chr1 40361010 255 48M = 40359165 -1893 CCTTTTTCTTGAATAATCTATTTCTTAGTATGTCTTAATTTACTAATA YTVXX[Y\^^^VJPZYMN[YRNLPVNCKUWZUJLSD?;>IIA:FM!!! XA:i:0 MD:Z:48 NM:i:0 CM:i:2

    All "48M" alignments? Mismatches should be reported, since I used "-C -q -n 2 -l 25 --snpfrac 0.001" to do the bowtie mapping. Can you help me identify my problem? Thanks a lot!
    Last edited by xuying; 01-12-2010, 03:04 AM.


  • Ben Langmead
    replied
    Yes, it should be usable by tools (like samtools) that call SNPs from .sam files.

    Thanks
    Ben


  • xuying
    replied
    Hi Ben Langmead.
    Can the resulting "base space" .SAM file, produced by mapping "color space" reads, be used for SNP calling (samtools) or with other tools designed for Solexa data? Thanks!


  • Ben Langmead
    replied
    Originally posted by Xi Wang
    Code:
    Read1 16      chr1    7947971 255     50M     *       0       0       ATTAAGGTCACCGTTGCAGGCCTGGCTGGAAAAGACCCAGTACAGTGTAG      IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII  XA:i:0  MD:Z:50 NM:i:0
    Read1 16      chr12   48275260        255     50M     *       0       0       ATTAAGGTCACCGTTGCAGGCCTGGCTGGAAAAGACCCAGTACAGTGTAG      IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII  XA:i:0  MD:Z:12A7T29    NM:i:2
    Hi Xi,

    In -n mode, the "stratum" referred to by --strata is the number of mismatches in the seed. The seed length is set with -l. In your case, the seed doesn't extend to those mismatches.

    Thanks,
    Ben
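The explanation above can be checked by hand on Xi's two hits. The sketch below is an illustration under stated assumptions: the seed is the first `seed_len` bases from the read's 5' end, the default seed length of 28 mentioned elsewhere in this thread applies (Xi did not give a -l value), and for a reverse-strand hit (SAM flag 16) the 5' end of the read sits at the right-hand end of the alignment. `seed_mismatches` is a hypothetical helper name.

```python
def seed_mismatches(ref_offsets, read_len, reverse, seed_len=28):
    """Count mismatches falling inside the seed (first seed_len bases from the 5' end).

    ref_offsets: 0-based mismatch offsets along the alignment as SAM reports it
                 (left to right on the forward reference strand).
    reverse:     True when SAM flag 0x10 is set, so the read's 5' end is at the
                 right-hand end of the alignment.
    """
    if reverse:
        from_5p = [read_len - 1 - o for o in ref_offsets]
    else:
        from_5p = list(ref_offsets)
    return sum(1 for o in from_5p if o < seed_len)

# chr1 hit: MD:Z:50, no mismatches anywhere.
print(seed_mismatches([], 50, reverse=True))        # 0
# chr12 hit: MD:Z:12A7T29 gives alignment offsets 12 and 20, flag 16.
print(seed_mismatches([12, 20], 50, reverse=True))  # 0 (offsets 37 and 29 from the 5' end)
```

Under these assumptions both hits have zero mismatches in the seed, so they fall in the same stratum and --strata keeps both.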


  • bioinfosm
    replied
    I think that has to do with the seed length. For your seed length, are both alignments equally good hits?

    Originally posted by Xi Wang
    Hi,

    I am confused by the bowtie options again. I used the options "-a --best --strata", but got a result as below:

    Code:
    Read1 16      chr1    7947971 255     50M     *       0       0       ATTAAGGTCACCGTTGCAGGCCTGGCTGGAAAAGACCCAGTACAGTGTAG      IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII  XA:i:0  MD:Z:50 NM:i:0
    Read1 16      chr12   48275260        255     50M     *       0       0       ATTAAGGTCACCGTTGCAGGCCTGGCTGGAAAAGACCCAGTACAGTGTAG      IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII  XA:i:0  MD:Z:12A7T29    NM:i:2
    The result shows that there are two hits for this read: one hits chr1 (where the sequence is from) perfectly, and the other hits chr12 with 2 mismatches. However, I expected bowtie to report only the best hit (namely the hit to chr1) when using the options "-a --best --strata". Why do I get this weird result?
    Thanks in advance.
    --
    Xi


  • Xi Wang
    replied
    Hi,

    I am confused by the bowtie options again. I used the options "-a --best --strata", but got a result as below:

    Code:
    Read1 16      chr1    7947971 255     50M     *       0       0       ATTAAGGTCACCGTTGCAGGCCTGGCTGGAAAAGACCCAGTACAGTGTAG      IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII  XA:i:0  MD:Z:50 NM:i:0
    Read1 16      chr12   48275260        255     50M     *       0       0       ATTAAGGTCACCGTTGCAGGCCTGGCTGGAAAAGACCCAGTACAGTGTAG      IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII  XA:i:0  MD:Z:12A7T29    NM:i:2
    The result shows that there are two hits for this read: one hits chr1 (where the sequence is from) perfectly, and the other hits chr12 with 2 mismatches. However, I expected bowtie to report only the best hit (namely the hit to chr1) when using the options "-a --best --strata". Why do I get this weird result?
    Thanks in advance.
    --
    Xi


  • Ben Langmead
    replied
    Originally posted by bioinfosm
    When I limited my reference sequence to the blat hit region, I got the hit with 3 mis-matches, however, not before I increased the -e option to -e 80. Why would I not get this hit previously, when I used -a -e 90 to report all hits?

    And why do I have to do -n 3, when the seed length by default is 28, and there are no more than 2 mis-matches in 28bp?

    HTML Code:
    $ /home/m049157/build/bowtie-0.10.0/bowtie --best -p 4 -t -n 3 -e 80 -a www w ww
    Time loading forward index: 00:00:00
    Time loading mirror index: 00:00:00
    Seeded quality full-index search: 00:00:00
    Reported 1 alignments to 1 output stream(s)
    Time searching: 00:00:00
    Overall time: 00:00:00
    $ cat ww
    HWI-E4:1:87:1633:1127#0/1       -       Zv7_scaffold910 5660144 AGTCTGCTTTTCCATATAAAACTGAGAAGAAGAGACTGCAGCCTTGAACAAACTTGGGAAGTCTTAACTTACACG     %%%%%%3=A;/-(8990(8<:9)<6:@,.4<A?A;28@24B/+<?B@4=BA><?@BBBBA@?70>@@=?@?724B       0       10:G>T,18:C>G,27:T>A
    Hi bioinfosm,

    Try using the --maxbts or -y options to increase the amount of searching effort put in by Bowtie. Note that -n 2 and -n 3 modes are not fully sensitive by default, to avoid excessive backtracking (see the manual section on Maq-like alignment).

    That alignment does have 3 mismatches in the seed (at 0-based offsets 10, 18 and 27 from the 5' end).

    Hope that helps,
    Ben
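The seed count above can be read straight off the last column of the quoted alignment line. A small sketch, assuming that column is a comma-separated list of offset:base>base descriptors with 0-based offsets from the 5' end (as stated above), that the first base is the reference base and the second the read base, and that the seed length is the default 28 mentioned in the question. `parse_mismatches` is a hypothetical helper, not part of Bowtie.

```python
def parse_mismatches(field):
    """Parse a mismatch field such as '10:G>T,18:C>G,27:T>A'
    into (offset, first base, second base) tuples."""
    out = []
    for item in field.split(","):
        if not item:
            continue  # an empty field means no mismatches
        offset, change = item.split(":")
        first_base, second_base = change.split(">")
        out.append((int(offset), first_base, second_base))
    return out

SEED_LEN = 28  # default seed length quoted in the thread
hits = parse_mismatches("10:G>T,18:C>G,27:T>A")
in_seed = [m for m in hits if m[0] < SEED_LEN]
print(len(in_seed))  # 3
```

Offset 27 is the last position of a 28-base seed (offsets 0 to 27), so all three mismatches count against the -n limit.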


  • bioinfosm
    replied
    When I limited my reference sequence to the blat hit region, I got the hit with 3 mis-matches, however, not before I increased the -e option to -e 80. Why would I not get this hit previously, when I used -a -e 90 to report all hits?

    And why do I have to do -n 3, when the seed length by default is 28, and there are no more than 2 mis-matches in 28bp?

    HTML Code:
    $ /home/m049157/build/bowtie-0.10.0/bowtie --best -p 4 -t -n 3 -e 80 -a www w ww
    Time loading forward index: 00:00:00
    Time loading mirror index: 00:00:00
    Seeded quality full-index search: 00:00:00
    Reported 1 alignments to 1 output stream(s)
    Time searching: 00:00:00
    Overall time: 00:00:00
    $ cat ww
    HWI-E4:1:87:1633:1127#0/1       -       Zv7_scaffold910 5660144 AGTCTGCTTTTCCATATAAAACTGAGAAGAAGAGACTGCAGCCTTGAACAAACTTGGGAAGTCTTAACTTACACG     %%%%%%3=A;/-(8990(8<:9)<6:@,.4<A?A;28@24B/+<?B@4=BA><?@BBBBA@?70>@@=?@?724B       0       10:G>T,18:C>G,27:T>A
