No announcement yet.
  • Filter
  • Time
  • Show
Clear All
new posts

  • Using htseq-count with paired-end and orphaned reads

    Dear all,

    I am building a pipeline for 100bp paired-end sequencing and I have a question regarding how htseq-count deals with the output from TopHat and how to best map my reads to features prior to using DESeq.

    For each sample I have 2 bam files: one corresponding to alignments of paired-end reads and another containing single reads that didn't have a mate pair. Originally, I merged my 2 bam files, sorted and converted into SAM and ran htseq-count to get gene counts, but I got an "expecting mate pair error". This made sense as single reads coming from merged files lacked mate pairs that htseq-count expected in sorted SAM files.

    My questions is: Is running htseq-count separately on these two files and them combining their outputs for each sample biasing my analysis, or does it render the same results as if htseq allowed runs of files containing both paired-end and single reads?

    I apologize for the long explanation, and thanks for any help you can give me.

  • #2
    What the difference between the library type single and paired, DeSeq

    I am preparing read counts for DeSeq. I have no reference genome for these transcriptomes. I am writing a script to extract counts from sam files produced by bowtie2

    It seems to me that Deseq assumes each pair counts as one alignment for "paired" data. Bowtie2 and many other packages attempt aligning single reads when the pair fails. I also have reads that have no pairs (one set failed my base quality filters).

    It seems like addition, useful information may be gained using the pairs with a quality issue in only one end or an unexpected insert length. It also seems that feeding the pairing information to Bowtie2 could increase the quality of alignment for some ambiguously mapping reads. It also seems to me DeSeq makes no use of pairs other than to normalize between samples. Does this sound correct to you? If that is the case why label samples "paired" or "single."

    My main questions are:

    1) If I use "paired" in DeSeq should I force Bowtie2 to only report pairs where both align as a pair and divide the counts by 2?

    2) If I include my single end alignments, should I count each single read as +1 and each pair as +1? Should I use "paired" or "single" in DeSeq and why does this setting matter?



    • #3
      The "paired" or "single" in the DESeq vignette is just an example of (1) how to deal with multiple factors and (2) that single-end and paired-end libraries can produce slightly different results (likely due to mappability). If you just have one type of library, then there's no reason to include that in your model. As you suspected, both read pairs and single reads count as 1, since they each represent a single original molecule.

      Perhaps one of the DESeq authors has a different opinion, but I would only really worry about including orphaned reads if the rate of them was very different between samples (you'd see if this is causing an issue in a principal components graph or by doing clustering). This would likely indicate a general difference in overall quality between the two

      BTW, your perl script could be replaced by htseq-count, where you make an annotation file wherein each contig is a feature (this has the benefit of not counting multimappers and allowing filtering by MAPQ).


      • #4
        DeSeq paired-end and orphaned reads

        That you for your reply. I have no reference genome. I am assuming you mean that I create a gff where each feature is a contig. If this is correct, what software do you use to create the gff from contigs.

        I am adding filtering steps to my script currently:
        1) For Bowtie2 column 5 is map quality.

        2) Column 7 indicates whether the read has a mate and which contig the mate aligns to. Keeping reads aligning with a value other than "=" or "*" in column 7 would remove pairs with reads that align to different contigs.

        3) The optional field "XS:i:<N>" only appears for reads that aligns ambiguously. (Would you recommend filtering a pair if one read aligns ambiguously or only if both align ambiguously?)

        4) I could also filter by by problematic pairing using the optional field "YT:Z:<S>" where a value of UU indicates the read was not part of a pair. Value of CP indicates the read was part of a pair and the pair aligned concordantly. Value of DP indicates the read was part of a pair and the pair aligned discordantly. Value of UP indicates the read was part of a pair but the pair failed to aligned either concordantly or discordantly.

        Any thoughts on the filters listed above.

        Last edited by sheltonj; 09-16-2013, 11:12 AM.


        • #5
          Yeah, you'd need to make the GFF or GTF file. A GTF file would be easy to make from the header of one of your SAM/BAM files (since you just need the contig names and lengths). It sounds like the additions to your script should take care of most of the issues )

          Regarding (2), that will also filter any read with a secondary alignment (XS score) equivalent to that of the primary alignment and where the primary alignment contains mismatches. Regardless, you'll certainly want to get rid of those!

          Regarding (3), the presence of an XS score actually just means that there's a secondary alignment with a score of at least that specified by --score-min. If AS=0 and XS=-60, I wouldn't call the alignment ambiguous. This is generally why it's better to just filter at a MAPQ (say 10 or 20).

          Regarding (4), I don't recall off-hand what bowtie2's metric is for calling something unique or not, so I can't advise on the utility of this.

          Best of luck,


          • #6
            Hi thanks,

            I drafted a perl script to count paired end reads and broken pairs that align to de novo transcriptomes. The script takes Bowtie2 "best match" sam files as input.

            The script "" is briefly described at the bottom of this README here

            The script is available here

            Comments, questions, or suggestions would be appreciated



            Latest Articles


            • seqadmin
              Advanced Tools Transforming the Field of Cytogenomics
              by seqadmin

              At the intersection of cytogenetics and genomics lies the exciting field of cytogenomics. It focuses on studying chromosomes at a molecular scale, involving techniques that analyze either the whole genome or particular DNA sequences to examine variations in structure and behavior at the chromosomal or subchromosomal level. By integrating cytogenetic techniques with genomic analysis, researchers can effectively investigate chromosomal abnormalities related to diseases, particularly...
              09-26-2023, 06:26 AM
            • seqadmin
              How RNA-Seq is Transforming Cancer Studies
              by seqadmin

              Cancer research has been transformed through numerous molecular techniques, with RNA sequencing (RNA-seq) playing a crucial role in understanding the complexity of the disease. Maša Ivin, Ph.D., Scientific Writer at Lexogen, and Yvonne Goepel Ph.D., Product Manager at Lexogen, remarked that “The high-throughput nature of RNA-seq allows for rapid profiling and deep exploration of the transcriptome.” They emphasized its indispensable role in cancer research, aiding in biomarker...
              09-07-2023, 11:15 PM





            Topics Statistics Last Post
            Started by seqadmin, Yesterday, 09:38 AM
            0 responses
            Last Post seqadmin  
            Started by seqadmin, 09-27-2023, 06:57 AM
            0 responses
            Last Post seqadmin  
            Started by seqadmin, 09-26-2023, 07:53 AM
            1 response
            Last Post seed_phrase_metal_storage  
            Started by seqadmin, 09-25-2023, 07:42 AM
            0 responses
            Last Post seqadmin