Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • raphael123
    replied
    A possible problem :

    According to htseq documentation : http://www-huber.embl.de/users/ander.../counting.html

    "The fact that the records describe the same fragment can be seen from the fact that they have the same read name"

    So Tophat for instance is ouputing read paris with different names. (Adding 1 or 2 at the end of the name)

    So simply do that:

    (samtools view -H in.bam; in.bam | awk '{print substr($1,1,length($1)-1),$0}' | sed 's/ [^ ]*//') | samtools view -bSh - > in.samereadname.bam

    Leave a comment:


  • kcm.eid
    replied
    Sorting SAM file properly can fix the warning

    Hi,
    I came across the same issue while using htseq-count.
    I ve used
    Code:
    sort -n 2234_accepted_hits.sam > 2234_accepted_hits_sorted.sam
    command to sort my accepted_hits.sam, but still getting the warning message. When I little analysed the warning message and my input sorted SAM file, found that my sam file was not sorted properly. One read which is showing warning is as follows:

    ERR032234.10311751 147 chr12 100208754 50 76M = 100208593 -237 GGCAGCCCAGGGCTTGGTGTTGGCAGTTTGAGTTCGTGAGATAGAAAGAGTGGGGTGTCAAGGCAGTACCCCTGAG >>=@=<9;9?>A<=;A;7>5;;;;=8@=<7==@:;5AA;A@A19A5AAAA@<<<;<;=8=5;@;8;<?.55>7?7> XA:i:0 MD:Z:76 NM:i:0 NH:i:1

    ERR032234.1031175 145 chr1 51757180 50 76M = 51755662 -1594 CTCGGACTTGATCTGCCCAGACTTTTGGTCAGCAAGGAGAAGGTTATTGTTTGTTAAGAGGAAAATCCGAGATGTA A3=>D<D??>D?@DD?9DDD>DDBBDDDDDDDDDBDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD XA:i:0 MD:Z:76 NM:i:0 NH:i:1

    ERR032234.10311751 99 chr12 100208593 50 76M = 100208754 237 CTAAAAGCTTACCTCCAAAACAGGATTCTTGTGTAACTAGGAATCCTGCATGAGAACCAGAAACCCTAACCTCCGA CAA@?AC8>ACC>AA<C?8B=6B<98?=>BC<C9=AAA<CA>A=A:=BB=>>::<>@==@*2:;8:7><<;7<><; XA:i:0 MD:Z:76 NM:i:0 NH:i:1


    So the reads were not sorted correctly. I used the instruction from here

    Code:
    sort -bn -k 1.11,1 2234_accepted_hits.sam > 2234_accepted_hits_sorted.sam
    my reads have ids like this: ERR032234.1031175
    this sort command would sort the file on field 1, from chars 11 to end of field, numerically.


    now htseq-count working fine.
    Last edited by kcm.eid; 03-02-2014, 10:07 PM. Reason: correcting instructions

    Leave a comment:


  • Simon Anders
    replied
    If TopHat marks the multimapping reads with the "NH" tag (and, IIRC, it does), htseq-count filters these out for you, too.

    Leave a comment:


  • bbl
    replied
    Thanks Simon. Does it mean I can understand that I only need to filter multiple mapped reads in the accepted_hits.bam from tophat2? No other filtering step necessary prior to HTseq-count?

    Originally posted by Simon Anders View Post
    They are counted either as "ambiguous" (default mode "union") or as "no feature" ("intersection" modes).

    Leave a comment:


  • Simon Anders
    replied
    Originally posted by bbl View Post
    How does HTseq-count count paired-end reads which are mapped on different chromosomes or genes? Or should they be filtered out prior htseq-count run?
    They are counted either as "ambiguous" (default mode "union") or as "no feature" ("intersection" modes).

    Leave a comment:


  • Eveleee
    replied
    Transcriptom annotation

    Hello to everyone,
    I would like to ask for advice wich tools to use to annotate assembled RNA seq. We have sequenced several species at Illumina HiSeq and want to annotate genes now. Then see the differences in expression, because we have different species and hybrids as well.
    Does anyone can advice me the pipeline? I have heard about RAST and then to manually approved it in GMOD.
    I see majority is working with blast2go, but it is not free

    Leave a comment:


  • bbl
    replied
    How does HTseq-count count paired-end reads which are mapped on different chromosomes or genes? Or should they be filtered out prior htseq-count run?

    Leave a comment:


  • adaigle
    replied
    Oh sorry, there should be a bunch of colons in there. That's because I was fooling around with it when I was getting errors earlier about an illegal colon in the optional fields. Here is the real line:

    Y71VY:00004:00106 16 chr10 62135255 0 12M * 0 0 GGGGGGGGAGGG 6666666,,'** ZP:B:f,0.0155953,0.00769459,0.0049578 ZM:B:s,266,0,242,0,0,252,0,278,506,0,0,550,262,278,0,0,226,0,260,0,26,38,42,30,104,0,0,0,764,0,0,0,70,0,5224,0,5866,0,0,0,0,0,0,0,120,124,14,122,0,46 ZF:i:28 RG:Z:Y71VY.IonXpress_014 PG:Z:tmap MD:Z:12 NM:i:0 AS:i:12 XA:Z:map4-1 XS:i:12

    Does this still look screwed up?

    Leave a comment:


  • dpryan
    replied
    That appears to be a pretty screwed up SAM line. Did your file get corrupted?

    Leave a comment:


  • adaigle
    replied
    Hi, I was wondering if you ever figured out the error message you were describing above? I am getting a similar thing with HTSeq, only it is with the first read of the SAM file.

    Error occured when reading first line of sam file.
    Error: ("unsupported format character ''' (0x27) at index 34", 'line 30 of file
    CAL120_sr.sam')
    [Exception type: ValueError, raised in _HTSeq.pyx:1168]

    And this is the line:
    Y71VY:00004:00106 16 chr10 62135255 0 12M * 0 0 GGGGGGGGAGGG 6666666,,'** ZPBf,0.0155953,0.00769459,0.0049578 ZMBs,266,0,242,0,0,252,0,278,506,0,0,550,262,278,0,0,226,0,260,0,26,38,42,30,104,0,0,0,764,0,0,0,70,0,5224,0,5866,0,0,0,0,0,0,0,120,124,14,122,0,46 ZFi28 RGZY71VY.IonXpress_014 PGZtmap MDZ12 NMi0 ASi12 XAZmap4-1 XSi12

    Any help would be GREATLY appreciated!

    Leave a comment:


  • acervera
    replied
    Thanks for the quick reply @dpryan
    I will try to get the counts again following your suggestion for removing the chimeric/fusion alignments.
    It would be great if Simon Anders can implement a better solution for this, but unfortunately "monetary encouragement" is beyond my possibilities at this early stage of my PhD studies ....
    So I will give it a try now and perhaps by Monday I will have the new counts.
    thanks agan!

    Leave a comment:


  • dpryan
    replied
    I wouldn't say that the counts are unreliable, so much as not necessarily representing exactly what you want...though that's just splitting semantic hairs.

    In the example you posted above (in #14), if all the parts of that fusion gene mapped to, say, KCNJ12, then that fusion read would end up getting counted twice, which is not what you'll want. If the second part of the fusion read mapped to, say, KCNJ18 (random example), then the whole paired-read would end up getting counted once for each of the genes, which is also probably not ideal.

    The conservative solution (short of just realigning) would be to simply remove fusion reads, which could be done with
    Code:
    samtools view file.bam | grep -v "XF:Z" | htseq-count [options] - GTF_file > counts_file
    or something like that. This will still produce a bunch of error messages due to having missing mates, but you remove the chimeric-alignment-double-counting issue. In an ideal world, the counting program would treat chimeric alignments that map to the same feature as a single entity and those aligning to different features in a user-defined manner. If you have some budget to spare, perhaps Simon Anders could be "monetarily encouraged" to modify htseq-count to do that...

    Regarding the dexseq_count.py error, it's not clear to me from that line why that would be occuring. It's complaining about there being an apostrophe somewhere, though I don't see one. I suspect there's something with the format of the chimeric read that it doesn't like, but I'm not familiar enough with the inner-workings of the script to say what (perhaps Simon will chime in).

    Leave a comment:


  • acervera
    replied
    @dpryan: thanks for the reply!
    Yes, I used tophat-fusion, so this means that the counts I get from these alignments are unreliable?
    Also, I am using dexseq_count.py on my 105 RNA-Seq samples, and only one of the files triggered an error:
    ValueError: ("unsupported format character ''' (0x27) at index 34", 'line 433369370
    And this is the line:
    SRR391511.sra.ncbi_enc.52549224 163 chr3 49396810 50 50M = 49396966 205 GGGAAACCAATTCCTATTTACTTAGCCCAGCTCCATGGGGTACTGAGATA B*CBABA><7ABB6BACAA>;AB@B@?ABB@@A?=AB<<6-?>BB>A>6@ XA:i:1 MD:Z:1T48 N@BAA=?9:@4@ XA:i:0 MD:Z:50 NM:i:0 XS:A:- XP:Z:chr3 49397428 50M NH:i:1

    Is there something I should fix in that line to make it work? Also, it would be good to know why it is only failing with that one file when all of them have been processed the same way. Thanks for any input in the matter.

    Leave a comment:


  • dpryan
    replied
    @acevera: It looks like you have chimeric reads, is this output from tophat-fusion (or tophat2 with --fusion-search)?

    Edit: I should add a link to a short back and forth on an identical issue from sourceforge. Counting chimeric reads requires a good bit of forethought. In the case above, both sections probably map to the same gene, which has an inversion. In this case, you only want the read counted once (though it will get counted twice), while in other cases, where the read might map to a fusion of two genes, you might want it counted twice or not at all. I wouldn't expect htseq-count to know exactly how you want each of these cases handled.

    BTW, the SAM spec. has only recently been updated to include chimeric alignments, through the addition of the 0x800 flag bit. Tophat doesn't follow that, though given that tophat-fusion predates the specification change, that's pretty unsurprising.
    Last edited by dpryan; 08-15-2013, 06:58 AM.

    Leave a comment:


  • acervera
    replied
    Hi,
    I am having a similiar problem, i.e. getting lots of warning messages from htseq-count, but I believe my files are properly sorted and I did not mix paired-end with single-end reads.
    I used the following commands:
    samtools sort -n accepted_hits.bam sorted
    samtools view sorted.bam | htseq-count -s no - genes.gtf > counts.txt
    Then I get tons of warnings like
    Warning: Read SRR316646.sra.ncbi_enc.1030 claims to have an aligned mate which could not be found. (Is the SAM file properly sorted?)
    Warning: Read SRR316646.sra.ncbi_enc.1140 claims to have an aligned mate which could not be found. (Is the SAM file properly sorted?)
    Warning: Read SRR316646.sra.ncbi_enc.1193 claims to have an aligned mate which could not be found. (Is the SAM file properly sorted?)

    For the first warning the corresponding section of my sorted.bam file looks like this:
    SRR316646.sra.ncbi_enc.1029 163 chr6 30036407 50 69M405N7M = 30036975 644 CANAGAAAAAGGTAGAATGGACAAGTGACACTGTGGACAATGAACACATGGGCCGCCGCTCATCCAAATG
    CTGCTG C@#@BCCCC;>D@CDEEEDDGGFGGEEEEGDEFDFGFFAGGDGEEFGGFCBDEGGGGGEEBEE=EB?@BEB?C?@F XA:i:1 MD:Z:2G73 NM:i:1 XS:A:+ XP:Z:chr6 30036975 76M NH:i:1
    SRR316646.sra.ncbi_enc.1029 83 chr6 30036975 50 76M = 30036407 -644 ACGTGGCCACCGCAAAGGACGGCGTCGTGCAACCCTAGGACCGACCCCCACCACACCTCCCCAGCCTCCTGACCCT :?==@?C?5ECBE?@E?B@EBCED?DECEEABDDD-EEEEC?CBDEDBEEBE;<EEFGDFFFAFAEDFFFFFDE XA:i:1 MD:Z:54C21 NM:i:1 XS:A:+ XP:Z:chr6 30036407 69M405N7M NH:i:1
    SRR316646.sra.ncbi_enc.1030 163 chr3 101544467 50 76M = 101544619 196 GTNTTTTACATATAAATATTTTTCTATGTTAAATAGTCTTCATAAGTTAATTATAAAGATCTGCTTAATAGTTCTG @?#<A@C=B>DA?C?AECEEFFEFEEFFBFFDFFFFBFFFEEBEBFBBFDDB?EEDECBADEF?EEE@D?E=@EB? XA:i:1 MD:Z:2T73 NM:i:1 XS:A:+ XP:Z:chr3-chr3 101544619 36M101544702F40m NH:i:1
    SRR316646.sra.ncbi_enc.1030 83 chr3 101544619 50 36M = 101544467 -196 TATATCAATTTCAGTGGTTTAAAATATGAATTTCTA EE=E@EBEB@BDEEDEDEEFEFEFEFFBFFE?BFDD AS:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:76 NM:i:0 XF:Z:1 chr3-chr3 101544619 36M101544703F40m TATATCAATTTCAGTGGTTTAAAATATGAATTTCTATTCTGGAATATGAGATTCACTACATTAAATTAATACTTTC EE=E@EBEB@BDEEDEDEEFE
    FEFEFFBFFE?BFDDBEFFBFFFFFDECCCCBE@DEDFEDEFFFDEEEE?DDCCD XP:Z:chr3 101544467 76M NH:i:1
    SRR316646.sra.ncbi_enc.1030 83 chr3 101544664 50 40M = 101544467 -196 GAAAGTATTAATTTAATGTAGTGAATCTCATATTCCAGAA DCCDD?EEEEDFFFEDEFDED@EBCCCCED
    FFFFFBFFEB AS:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:76 NM:i:0 XF:Z:2 chr3-chr3 101544619 36M101544703F40m TATATCAATTTCAGTGGTTTAAAATATGAATTTCTATTCTGGAATATGAGATTCACTACATTAAATTAATACTTTC EE=E@
    EBEB@BDEEDEDEEFEFEFEFFBFFE?BFDDBEFFBFFFFFDECCCCBE@DEDFEDEFFFDEEEE?DDCCD XP:Z:chr3 101544467 76M NH:i:1
    SRR316646.sra.ncbi_enc.1032 99 chr6 119499446 50 76M = 119499512 142 ATTAGGTGGAACCATATGAAACTGCTGACAGTTTTTAAACTACAAATGCAGCAATTTCATATGTTTCAGCCTAATC EEE:AD:?:CA>ADCD?DBBDDDA?CCBDDDDD545@CD=CBB:BBA?DBB5B?C=??CAC?=AB?BCAA5?5A XA:i:0 MD:Z:76 NM:i:0 XS:A:- XP:Z:chr6 119499512 76M NH:i:1

    Is it ok to ignore the warnings or there is something wrong with my alignment files??
    I appreciate any help on these.
    Thanks!

    Leave a comment:

Latest Articles

Collapse

  • seqadmin
    Best Practices for Single-Cell Sequencing Analysis
    by seqadmin



    While isolating and preparing single cells for sequencing was historically the bottleneck, recent technological advancements have shifted the challenge to data analysis. This highlights the rapidly evolving nature of single-cell sequencing. The inherent complexity of single-cell analysis has intensified with the surge in data volume and the incorporation of diverse and more complex datasets. This article explores the challenges in analysis, examines common pitfalls, offers...
    06-06-2024, 07:15 AM
  • seqadmin
    Latest Developments in Precision Medicine
    by seqadmin



    Technological advances have led to drastic improvements in the field of precision medicine, enabling more personalized approaches to treatment. This article explores four leading groups that are overcoming many of the challenges of genomic profiling and precision medicine through their innovative platforms and technologies.

    Somatic Genomics
    “We have such a tremendous amount of genetic diversity that exists within each of us, and not just between us as individuals,”...
    05-24-2024, 01:16 PM

ad_right_rmr

Collapse

News

Collapse

Topics Statistics Last Post
Started by seqadmin, Today, 08:58 AM
0 responses
8 views
0 likes
Last Post seqadmin  
Started by seqadmin, Yesterday, 02:20 PM
0 responses
14 views
0 likes
Last Post seqadmin  
Started by seqadmin, 06-07-2024, 06:58 AM
0 responses
181 views
0 likes
Last Post seqadmin  
Started by seqadmin, 06-06-2024, 08:18 AM
0 responses
231 views
0 likes
Last Post seqadmin  
Working...
X