Seqanswers Leaderboard Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • acervera
    Junior Member
    • Feb 2012
    • 6

    #16
    @dpryan: thanks for the reply!
    Yes, I used tophat-fusion, so this means that the counts I get from these alignments are unreliable?
    Also, I am using dexseq_count.py on my 105 RNA-Seq samples, and only one of the files triggered an error:
    ValueError: ("unsupported format character ''' (0x27) at index 34", 'line 433369370
    And this is the line:
    SRR391511.sra.ncbi_enc.52549224 163 chr3 49396810 50 50M = 49396966 205 GGGAAACCAATTCCTATTTACTTAGCCCAGCTCCATGGGGTACTGAGATA B*CBABA><7ABB6BACAA>;AB@B@?ABB@@A?=AB<<6-?>BB>A>6@ XA:i:1 MD:Z:1T48 N@BAA=?9:@4@ XA:i:0 MD:Z:50 NM:i:0 XS:A:- XP:Z:chr3 49397428 50M NH:i:1

    Is there something I should fix in that line to make it work? Also, it would be good to know why it is only failing with that one file when all of them have been processed the same way. Thanks for any input in the matter.

    Comment

    • dpryan
      Devon Ryan
      • Jul 2011
      • 3478

      #17
      I wouldn't say that the counts are unreliable, so much as not necessarily representing exactly what you want...though that's just splitting semantic hairs.

      In the example you posted above (in #14), if all the parts of that fusion gene mapped to, say, KCNJ12, then that fusion read would end up getting counted twice, which is not what you'll want. If the second part of the fusion read mapped to, say, KCNJ18 (random example), then the whole paired-read would end up getting counted once for each of the genes, which is also probably not ideal.

      The conservative solution (short of just realigning) would be to simply remove fusion reads, which could be done with
      Code:
      samtools view file.bam | grep -v "XF:Z" | htseq-count [options] - GTF_file > counts_file
      or something like that. This will still produce a bunch of error messages due to having missing mates, but you remove the chimeric-alignment-double-counting issue. In an ideal world, the counting program would treat chimeric alignments that map to the same feature as a single entity and those aligning to different features in a user-defined manner. If you have some budget to spare, perhaps Simon Anders could be "monetarily encouraged" to modify htseq-count to do that...

      Regarding the dexseq_count.py error, it's not clear to me from that line why that would be occuring. It's complaining about there being an apostrophe somewhere, though I don't see one. I suspect there's something with the format of the chimeric read that it doesn't like, but I'm not familiar enough with the inner-workings of the script to say what (perhaps Simon will chime in).

      Comment

      • acervera
        Junior Member
        • Feb 2012
        • 6

        #18
        Thanks for the quick reply @dpryan
        I will try to get the counts again following your suggestion for removing the chimeric/fusion alignments.
        It would be great if Simon Anders can implement a better solution for this, but unfortunately "monetary encouragement" is beyond my possibilities at this early stage of my PhD studies ....
        So I will give it a try now and perhaps by Monday I will have the new counts.
        thanks agan!

        Comment

        • adaigle
          Junior Member
          • Sep 2013
          • 6

          #19
          Hi, I was wondering if you ever figured out the error message you were describing above? I am getting a similar thing with HTSeq, only it is with the first read of the SAM file.

          Error occured when reading first line of sam file.
          Error: ("unsupported format character ''' (0x27) at index 34", 'line 30 of file
          CAL120_sr.sam')
          [Exception type: ValueError, raised in _HTSeq.pyx:1168]

          And this is the line:
          Y71VY:00004:00106 16 chr10 62135255 0 12M * 0 0 GGGGGGGGAGGG 6666666,,'** ZPBf,0.0155953,0.00769459,0.0049578 ZMBs,266,0,242,0,0,252,0,278,506,0,0,550,262,278,0,0,226,0,260,0,26,38,42,30,104,0,0,0,764,0,0,0,70,0,5224,0,5866,0,0,0,0,0,0,0,120,124,14,122,0,46 ZFi28 RGZY71VY.IonXpress_014 PGZtmap MDZ12 NMi0 ASi12 XAZmap4-1 XSi12

          Any help would be GREATLY appreciated!

          Comment

          • dpryan
            Devon Ryan
            • Jul 2011
            • 3478

            #20
            That appears to be a pretty screwed up SAM line. Did your file get corrupted?

            Comment

            • adaigle
              Junior Member
              • Sep 2013
              • 6

              #21
              Oh sorry, there should be a bunch of colons in there. That's because I was fooling around with it when I was getting errors earlier about an illegal colon in the optional fields. Here is the real line:

              Y71VY:00004:00106 16 chr10 62135255 0 12M * 0 0 GGGGGGGGAGGG 6666666,,'** ZP:B:f,0.0155953,0.00769459,0.0049578 ZM:B:s,266,0,242,0,0,252,0,278,506,0,0,550,262,278,0,0,226,0,260,0,26,38,42,30,104,0,0,0,764,0,0,0,70,0,5224,0,5866,0,0,0,0,0,0,0,120,124,14,122,0,46 ZF:i:28 RG:Z:Y71VY.IonXpress_014 PG:Z:tmap MD:Z:12 NM:i:0 AS:i:12 XA:Z:map4-1 XS:i:12

              Does this still look screwed up?

              Comment

              • bbl
                Member
                • Jul 2010
                • 16

                #22
                How does HTseq-count count paired-end reads which are mapped on different chromosomes or genes? Or should they be filtered out prior htseq-count run?

                Comment

                • Eveleee
                  Junior Member
                  • Nov 2013
                  • 2

                  #23
                  Transcriptom annotation

                  Hello to everyone,
                  I would like to ask for advice wich tools to use to annotate assembled RNA seq. We have sequenced several species at Illumina HiSeq and want to annotate genes now. Then see the differences in expression, because we have different species and hybrids as well.
                  Does anyone can advice me the pipeline? I have heard about RAST and then to manually approved it in GMOD.
                  I see majority is working with blast2go, but it is not free

                  Comment

                  • Simon Anders
                    Senior Member
                    • Feb 2010
                    • 995

                    #24
                    Originally posted by bbl View Post
                    How does HTseq-count count paired-end reads which are mapped on different chromosomes or genes? Or should they be filtered out prior htseq-count run?
                    They are counted either as "ambiguous" (default mode "union") or as "no feature" ("intersection" modes).

                    Comment

                    • bbl
                      Member
                      • Jul 2010
                      • 16

                      #25
                      Thanks Simon. Does it mean I can understand that I only need to filter multiple mapped reads in the accepted_hits.bam from tophat2? No other filtering step necessary prior to HTseq-count?

                      Originally posted by Simon Anders View Post
                      They are counted either as "ambiguous" (default mode "union") or as "no feature" ("intersection" modes).

                      Comment

                      • Simon Anders
                        Senior Member
                        • Feb 2010
                        • 995

                        #26
                        If TopHat marks the multimapping reads with the "NH" tag (and, IIRC, it does), htseq-count filters these out for you, too.

                        Comment

                        • kcm.eid
                          Junior Member
                          • Oct 2012
                          • 2

                          #27
                          Sorting SAM file properly can fix the warning

                          Hi,
                          I came across the same issue while using htseq-count.
                          I ve used
                          Code:
                          sort -n 2234_accepted_hits.sam > 2234_accepted_hits_sorted.sam
                          command to sort my accepted_hits.sam, but still getting the warning message. When I little analysed the warning message and my input sorted SAM file, found that my sam file was not sorted properly. One read which is showing warning is as follows:

                          ERR032234.10311751 147 chr12 100208754 50 76M = 100208593 -237 GGCAGCCCAGGGCTTGGTGTTGGCAGTTTGAGTTCGTGAGATAGAAAGAGTGGGGTGTCAAGGCAGTACCCCTGAG >>=@=<9;9?>A<=;A;7>5;;;;=8@=<7==@:;5AA;A@A19A5AAAA@<<<;<;=8=5;@;8;<?.55>7?7> XA:i:0 MD:Z:76 NM:i:0 NH:i:1

                          ERR032234.1031175 145 chr1 51757180 50 76M = 51755662 -1594 CTCGGACTTGATCTGCCCAGACTTTTGGTCAGCAAGGAGAAGGTTATTGTTTGTTAAGAGGAAAATCCGAGATGTA A3=>D<D??>D?@DD?9DDD>DDBBDDDDDDDDDBDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD XA:i:0 MD:Z:76 NM:i:0 NH:i:1

                          ERR032234.10311751 99 chr12 100208593 50 76M = 100208754 237 CTAAAAGCTTACCTCCAAAACAGGATTCTTGTGTAACTAGGAATCCTGCATGAGAACCAGAAACCCTAACCTCCGA CAA@?AC8>ACC>AA<C?8B=6B<98?=>BC<C9=AAA<CA>A=A:=BB=>>::<>@==@*2:;8:7><<;7<><; XA:i:0 MD:Z:76 NM:i:0 NH:i:1


                          So the reads were not sorted correctly. I used the instruction from here

                          Code:
                          sort -bn -k 1.11,1 2234_accepted_hits.sam > 2234_accepted_hits_sorted.sam
                          my reads have ids like this: ERR032234.1031175
                          this sort command would sort the file on field 1, from chars 11 to end of field, numerically.


                          now htseq-count working fine.
                          Last edited by kcm.eid; 03-02-2014, 10:07 PM. Reason: correcting instructions

                          Comment

                          • raphael123
                            Member
                            • Dec 2013
                            • 37

                            #28
                            A possible problem :

                            According to htseq documentation : http://www-huber.embl.de/users/ander.../counting.html

                            "The fact that the records describe the same fragment can be seen from the fact that they have the same read name"

                            So Tophat for instance is ouputing read paris with different names. (Adding 1 or 2 at the end of the name)

                            So simply do that:

                            (samtools view -H in.bam; in.bam | awk '{print substr($1,1,length($1)-1),$0}' | sed 's/ [^ ]*//') | samtools view -bSh - > in.samereadname.bam

                            Comment

                            Latest Articles

                            Collapse

                            • seqadmin
                              New Genomics Tools and Methods Shared at AGBT 2025
                              by seqadmin


                              This year’s Advances in Genome Biology and Technology (AGBT) General Meeting commemorated the 25th anniversary of the event at its original venue on Marco Island, Florida. While this year’s event didn’t include high-profile musical performances, the industry announcements and cutting-edge research still drew the attention of leading scientists.

                              The Headliner
                              The biggest announcement was Roche stepping back into the sequencing platform market. In the years since...
                              03-03-2025, 01:39 PM
                            • seqadmin
                              Investigating the Gut Microbiome Through Diet and Spatial Biology
                              by seqadmin




                              The human gut contains trillions of microorganisms that impact digestion, immune functions, and overall health1. Despite major breakthroughs, we’re only beginning to understand the full extent of the microbiome’s influence on health and disease. Advances in next-generation sequencing and spatial biology have opened new windows into this complex environment, yet many questions remain. This article highlights two recent studies exploring how diet influences microbial...
                              02-24-2025, 06:31 AM

                            ad_right_rmr

                            Collapse

                            News

                            Collapse

                            Topics Statistics Last Post
                            Started by seqadmin, Today, 05:03 AM
                            0 responses
                            10 views
                            0 reactions
                            Last Post seqadmin  
                            Started by seqadmin, Yesterday, 07:27 AM
                            0 responses
                            11 views
                            0 reactions
                            Last Post seqadmin  
                            Started by seqadmin, 03-18-2025, 12:50 PM
                            0 responses
                            14 views
                            0 reactions
                            Last Post seqadmin  
                            Started by seqadmin, 03-03-2025, 01:15 PM
                            0 responses
                            185 views
                            0 reactions
                            Last Post seqadmin  
                            Working...