Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #91
    Hi Simon,

    I can't believe I missed that. It was clearly explained in the documentation. Sorry about that! When I read the warning "Is the SAM file properly sorted?" I assumed it was coordinate-based sorted

    Now it's working as nicely as usual.
    Thanks again!

    Comment


    • #92
      htseq-count problem

      Dear Simon, all,
      Possibly my or a htseq-count problem (HTSeq-0.4.7p1). I have ran some paired-end RNA-seq SOLiD data with Tophat (version TopHat v1.2.0) and have generated a test SAM file as follows;

      Situation 1) Sam input format 1, as it came from tophat.

      $ htseq-count test.sam fred.gtf > f
      100000 GFF lines processed.
      Warning: Read 1131_1472_555_F5-P2 claims to have an aligned mate which could not be found. (Is the SAM file properly sorted?)
      Warning: Read 1131_1472_555_F3 claims to have an aligned mate which could not be found. (Is the SAM file properly sorted?)
      0 reads processed.

      There is only 1 mate pair in the file, please see the attached text file for formats.

      So maybe I thought the names have to be equivalent, so I knocked off the extensions (see attached, situation 2)

      Same result;
      $ htseq-count test.sam fred.gtf > f
      100000 GFF lines processed.
      Warning: Read 1131_1472_555 claims to have an aligned mate which could not be found. (Is the SAM file properly sorted?)
      0 reads processed.

      Situation 3) So I thought maybe the “forward” mate should come first, so I flipped read positions.

      $ htseq-count test.sam fred.gtf > f
      100000 GFF lines processed.
      Warning: Read 1131_1472_555 claims to have an aligned mate which could not be found. (Is the SAM file properly sorted?)
      0 reads processed.

      Turning off stranded did not help either;

      $ htseq-count -s no test.sam fred.gtf > f
      100000 GFF lines processed.
      Warning: Read 1131_1472_555 claims to have an aligned mate which could not be found. (Is the SAM file properly sorted?)
      0 reads processed.

      It could be very well my user problem but I am not seeing why this error message would occur. Thank you for any advice,

      Kind regards,

      John.
      Attached Files

      Comment


      • #93
        Hi John

        according to the SAM format specification, the names of reads from the same fragment have to be identical, so it was by design that your situation 1 did not work.

        Situations 2 and 3 should work, and I guess they did, only the counting was wrong. (It says "0 read processed" although the reads were processed.) They should be counted as "non_unique_alignment", because of the optional field "NH:i:4".

        I've just fixed the issue that thew wrong number of processed reads is displayed, so please try again with v0.7.4p3 and check whether you get non-zero counts for any of the couters.

        Simon

        Comment


        • #94
          Thank you

          Hi Simon,
          Thanks a lot for looking at that and fixing it; I will give it a whirl tomorrow. About read names; this is how they come of the SOLiD4 machine and I guess AB expect users to use Bioscope.

          Kind regards,

          John.

          Comment


          • #95
            htseq-count; a small consideration/nice to have

            Dear Simon,
            I have been doing some testing regards counts of reads to genes based on genome mapping.

            My GTF file is from hg19 Genome Browser for Refseq gene (table browser).
            Reads are FASTQ mapped with tophat (bam output sorted by name)

            It is possibly just a "nice to have" but because of how Refseq is annotating gene-fusions, some counts are lost. E.g. for DDX39B

            $ grep DDX39B htseq_output
            ATP6V1G2-DDX39B 0
            DDX39B 0

            With a quick/dirty Perl program that does not care about ambiguity.

            $ grep DDX39B dirty_perl_counts
            ATP6V1G2-DDX39B 26
            DDX39B 23

            Since Refseq is annotating gene fusions along with "vanilla" genes in this way, total read counts for a gene or gene-fusion are lost whilst enforcing the ambiguity rule (which I don't think is optional, even with -m intersection-nonempty).

            It seems a bit cumbersome to filter fusions from GTF files. Maybe it could be good to only enforce ambiguity if the overlapping features are on different strands or enable an option to turn it off if desired by the user?

            As it is, it is possible an interesting differentially expressed gene could be missed.

            Kind regards,

            John.

            Comment


            • #96
              Hi Simon and John,

              I have exactly the problem you describe John, getting them warnings and no result. My sam files are sorted on read name and produced with Tophat 1.2.0. Data is Solid 50+35.

              However when I ran the analysis on position-sorted sam files, 6 out of my 8 samples generated results. Still with the warnings though. I can't figure this one out...

              One thing that I have realized is that surprisingly few of my read pairs have both mates properly mapped from Tophat. This is spotted in the sam file as '=' (mate is on the same chrom) or '*' (mate is not mapped). I wonder how this is handled by HTSeq-count - perhaps singles are ignored?

              I have sam files from Bioscope too, and in them the majority of reads pairs mapped together ('=').

              John, did you manage to get proper HTSeq-count results from your Solid-Tophat data, and did you figure out any problem with Solid read pairs in Tophat? I also tried chopping off the read suffixes with no effect, and I checked the csfasta files and made sure they match pair by pair.

              Best regards
              Markus

              Comment


              • #97
                To use paired-end data, SAM files must be sorted by read name. If you sort by position, htseq-count won't find the mate, complain about it with a warning and treat it as a single read.

                Two reasons come to mind why you might not get any results when sorted by read name:

                - Solid data seems to violate the convention that mated reads must have the same read name by adding some suffix to the reverse reads. If present, these need to be removed. (See posts #92, #93 further up.)

                - htseq-count assumed that the forward mate has the orientation of the coding strand and the reverse mate that of the template strand. This works fine for my Illumina data. However, maybe it is the other way round with SOLiD. I've promised an option '--stranded=reverse" to deal with this but haven't added it yet. (See this thread.)

                Comment


                • #98
                  Hi Markus, Simon.
                  Markus, I had a slightly different point, htseq is working for me but some genes would have a count of 0 if the GTF file had a gene fusion annotation.

                  In the GTF there is annotation for both ATP6V1G2-DDX39B fusion and the
                  DDX39B alone. I make the assumption (possibly wrongly) that because these features overlap, the counts are not assigned as it looks like these reads are ambiguous. Simon, is that right?

                  If it helps Markus, read on;

                  My reads typically look like this, following a tophat run.

                  Code:
                  863_343_1081_F5-P2     0        chr1    14571  3 35M    *       0       0       TCCACACAGTGCTGGTTCCGTCACCCCCTCCCAAG     ccc]]babba]RYRPaUO[cNHOUZXSFP]^K3GG     NM:i:0  NH:i:2  CC:Z:chr15      CP:i:102516560
                  1009_1698_1545_F5-P2   0        chr1    14571  3 35M    *       0       0       TCCACACAGTGCTGGTTCCGTCACCCCCTCCCAAG     [[__ac_]\[\YTU]RNMP_(!2BRSRMYZXW2::     NM:i:0  NH:i:2  CC:Z:chr15      CP:i:102516560
                  1120_1709_1928_F5-P2   0        chr1    14572  0 35M    *       0       0       CCACACAGTGCTGGTTCCGTCACCCCCTCCCAAGG     ^^WZaa^_^XAIWT[WIH])!]bTRL9FVXN&6^^     NM:i:0  NH:i:6  CC:Z:chr12      CP:i:91019
                  1070_1592_1892_F5-P2   163      chr1    14586  3 35M    =       14594   0       TTCCGTCACCCCCTCCCAAGGAAGTAGGTCTGAGC     [[abcGG``cWR]ab``a`bRJY`_GI\\[@>W^^     NM:i:0  NH:i:2  CC:Z:chr15      CP:i:102516545  XS:A:-
                  1267_389_283_F5-P2     0        chr1    14586  3 35M    *       0       0       TTCCGTCACCCCCTCCCAAGGAAGTAGGTCTGAGC     SS[b`<;TR[PAL[_ML]UYN5E^^CBRSYK>DUU     NM:i:0  NH:i:2  CC:Z:chr15      CP:i:102516545
                  1070_1592_1892_F3      83       chr1    14594  3 50M    =       14586   0       CCCCCTCCCAAGGAAGTAGGTCTGAGCAGCTTGTCCTGGCTGTGTCCATG      TTWFCWYRQUVUX^]\_^^^WNQ\_``_a_XX]a^[__`cccbbca_acc      NM:i:0  NH:i:2  CC:Z:chr15      CP:i:102516522  XS:A:-
                  1247_1145_1797_F3      83       chr1    14597  3 50M    =       14567   0       CCTCCCAAGGAAGTAGGTCTGAGCAGCTTGTCCTGGCTGTGTCCATGTCA      DDZZRHEOX[`Y6>a\XTW`][XX^_`FFa][^`a`bSScb`Y[ccc___      NM:i:0  NH:i:2  CC:Z:chr15      CP:i:102516519  XS:A:-
                  1100_941_1049_F3       16       chr1    14597  3 50M    *       0       0       CCTCCCAAGGAAGTAGGTCTGAGCAGCTTGTCCTGGCTGTGTCCATGTCA      JJ\]KCEM^``_DFb]QPX[[]\[^ZZGD[_ZY_bccXXbbbZ[bbcbbb      NM:i:0  NH:i:2  CC:Z:chr15      CP:i:102516519
                  If you want to compare what I do/did;

                  Code:
                  #!/bin/bash
                  
                  
                  if [ $# -ne 1 ]
                  then
                    echo "Usage: `basename $0` name_of_bam_from_tophat"
                    exit 1
                  fi
                  
                  
                  
                  
                  # remove _F3 or _F5-P2 from sam output
                  samtools view $1 | perl -wpe 's/^(\S+)(_F3|_F5-P2)/$1/' > sam_file 
                  
                  
                  # USCS utilities has a program called fetchChromSizes; use to get Chromosome sizes, in turn used to make a bam file
                  fetchChromSizes hg19 > sizes
                  
                  # make a new bam file
                  samtools view -bt sizes -o sam_file.bam sam_file
                  
                  # sort data
                  samtools sort -n sam_file.bam sorted_sam_file
                  
                  # run htseq
                  samtools view sorted_sam_file.bam | htseq-count -m intersection-nonempty - ../RefseqHG19_gtf_gene > gene_counts
                  I have found that in most cases my dirtyperl works the same as htseq but not all, I am currently investigating why that is.

                  Kind regards,

                  John.

                  Comment


                  • #99
                    Originally posted by poisson200 View Post
                    In the GTF there is annotation for both ATP6V1G2-DDX39B fusion and the
                    DDX39B alone. I make the assumption (possibly wrongly) that because these features overlap, the counts are not assigned as it looks like these reads are ambiguous. Simon, is that right?
                    Yes, that's right. By the way, are you specially interested in these fusion genes? If not, maybe using the GTF file from Ensembl would make your life easier. It does not contain your fusion gene and probably does not contain other fusion genes either.

                    Comment


                    • Originally posted by Simon Anders View Post
                      To use paired-end data, SAM files must be sorted by read name. If you sort by position, htseq-count won't find the mate, complain about it with a warning and treat it as a single read.

                      Two reasons come to mind why you might not get any results when sorted by read name:

                      - Solid data seems to violate the convention that mated reads must have the same read name by adding some suffix to the reverse reads. If present, these need to be removed. (See posts #92, #93 further up.)

                      - htseq-count assumed that the forward mate has the orientation of the coding strand and the reverse mate that of the template strand. This works fine for my Illumina data. However, maybe it is the other way round with SOLiD. I've promised an option '--stranded=reverse" to deal with this but haven't added it yet. (See this thread.)
                      Hi Simon,

                      I have been having similar problems with SOLID data and HT-seq. I did remove the suffixes but some reads were still erroring (warning cant find aligned mate) despite being sorted correctly. Is it likely that this stranded issue is the problem? If htseq-count assumes that the forward mate has the orientation of the coding strand and the reverse mate that of the template strand and you think that solid data is the opposite way round, then why don't all the reads error?

                      Thanks

                      p.s. I personal messeged you the error and a excerpt of the sam file on the 04/04/2011 09:43

                      Comment


                      • Hello Dr. Anders,

                        I had a brief question for you concerning your great program htseq-counts. I can some of the cases you consider in your clear illustration here:



                        But does htseq-counts consider the (very infrequent) case in which a read may be longer than a gene?

                        Comment


                        • Originally posted by adeonari View Post
                          But does htseq-counts consider the (very infrequent) case in which a read may be longer than a gene?
                          This is treated the same way as a read extending over the gene's boundary (second row in the figure.)

                          Comment


                          • Another quick question: is it possible with htseq-counts to exclude the reads from the 3' UTR?

                            Comment


                            • The '-i' and '-t' options give you some flexibility what htseq-count is counted.

                              If this is insufficient, please remember that htseq-count was actually meant not so much as a tool in its own right but rather a demonstration of the possibilities of HTSeq, our Python framework for HTS analysis. As demonstrated in the documentation, it is fairly easy (if you know Python or are willing to learn it) to write your own counting script using HTSeq and then implement whatever custom logic you need for your use case.

                              Comment


                              • Will HTSeq be able to count reads from a bed file

                                Dear Simon,

                                My case is somehow unusual. I have a mapped sequence reads file in bed format. I wonder if you could point me out how I can use HTSeq for reads counts on gene basis. I understand that the bed file contains only interval data and doesn't have quality scores for each base. But in my case, I only need read counts. I have limited knowledge about python.

                                Your help will be greatly appreciated!

                                Best
                                Steve

                                Comment

                                Latest Articles

                                Collapse

                                • seqadmin
                                  Current Approaches to Protein Sequencing
                                  by seqadmin


                                  Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                  04-04-2024, 04:25 PM
                                • seqadmin
                                  Strategies for Sequencing Challenging Samples
                                  by seqadmin


                                  Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                  03-22-2024, 06:39 AM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by seqadmin, 04-11-2024, 12:08 PM
                                0 responses
                                24 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-10-2024, 10:19 PM
                                0 responses
                                25 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-10-2024, 09:21 AM
                                0 responses
                                21 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-04-2024, 09:00 AM
                                0 responses
                                52 views
                                0 likes
                                Last Post seqadmin  
                                Working...
                                X