Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #16
    To clear up this confusion:

    1. If you have paired-end data, htseq-count expects that the SAM file is sorted by read name. This is because the SAM format prescribes that paired reads have the same read ID, and hence, sorting causes the read pairs to appear in adjacent line.

    Furthermore, the format specifies that each read contains information about whether the read was aligned, whether its mate was aligned and where the read aligned to and where its mate aligned to. For the partner, another line has to be present that shows a "mirror" of the same alignment, i.e., the information on whether and where the read and its mate align should match if one exchanges read and mate information.

    If you now perform any filtering that might remove reads but not their mates, this rule gets broken, and HTSeq issues a warning (but proceeds anyway, treating the missing partner as not aligned).

    Some aligners manage to produce SAM files with missing mates even without filtering.

    2. If a SAM line's flag field says that the read was not aligned then, in my opinion, the fields for where the read aligned should not contain any coordinates. HTSeq issues a warning in this case but treats the read as not aligned. BWA has a habit of sometimes producing such reads, and this is even documented somewhere.

    3. Finally, a "proper pair" is, as I read the standard, a pair in which both reads align to opposite strands. If a SAM line's flag field indicates that a pair is proper but the alignment information disagrees with this, a warning is issued.

    It is rather frustrating that the SAM format allows so many ways of storing self-contradictory information in a SAM file, without giving clear rules in the specification, because this forces software to expect all these cases.

    Comment


    • #17
      Regarding BWA and the FLAG fields, see also this thread: http://seqanswers.com/forums/showthread.php?t=8429

      Comment


      • #18
        Originally posted by Simon Anders View Post
        To clear up this confusion:

        1. If you have paired-end data, htseq-count expects that the SAM file is sorted by read name. This is because the SAM format prescribes that paired reads have the same read ID, and hence, sorting causes the read pairs to appear in adjacent line.
        So, if I get it well, we don't need to have sorted SAM files if dealing with single-end RNAseq reads? Am-I right?

        Comment


        • #19
          Originally posted by ThePresident View Post
          So, if I get it well, we don't need to have sorted SAM files if dealing with single-end RNAseq reads? Am-I right?
          Yes.

          ------

          Comment


          • #20
            Dear all.

            I'm trying to find information about how HTSeq counts reads. I understood that one pair (properly paired) is counted as 1 count.
            What about pairs that are not flagged as 'properly paired'?
            What about the reads that lost their mate and became single reads?
            Are they counted as 1 count as well? Or not counted at all?

            Additionally I'm loosing quite many reads that have multiple mappings. Anyone figured out a way to deal with this in HTSeq, instead of just throwing them all out?

            Comment


            • #21
              Hi Simon,

              1. If you have paired-end data, htseq-count expects that the SAM file is sorted by read name. This is because the SAM format prescribes that paired reads have the same read ID, and hence, sorting causes the read pairs to appear in adjacent line.
              How could I sorted the SAM file according to name? Thank you

              Comment


              • #22
                "samtools sort -n"

                Comment


                • #23
                  I use picard (http://picard.sourceforge.net/)

                  Best regards,
                  Douglas

                  Comment


                  • #24
                    Hi Simon,

                    Already sort according to name and run htseq-count and here is the output;

                    ENSG00000261838 0
                    ENSG00000261839 2
                    ENSG00000261840 6
                    ENSG00000261841 0
                    ENSG00000261842 0
                    no_feature 1288128
                    ambiguous 512283
                    too_low_aQual 0
                    not_aligned 0
                    alignment_not_unique 1033314

                    Actually, I can't understand well about the htseq-count output. Should it given the raw count of each read/genes? As I want to know the raw count, instead FPKM (from cuff diff output) value. I want to compare the raw count with my array data.

                    Thank you

                    Comment


                    • #25
                      Sorry to add up,

                      I was using GTF file from ensembl (genes.gtf), thus the "ENSG00000261838" is a gene from human. How could I replace the number to name of gene? What should I do? Thanks

                      Comment


                      • #26
                        Originally posted by DZhang View Post
                        Hi vadim,

                        I am happy to report it worked! I guess "export LC_ALL=POSIX" helped. I had not known it was related to the sorting or htseq-count. But you pointed me to the right direction and offered a working solution.

                        Thank you very much and now I am off for a great July 4th!

                        Douglas
                        Originally posted by vadim View Post
                        You may also want to do this prior to sorting:
                        export LC_ALL=POSIX
                        Hi Vadim and Douglas,

                        I am wondering, what does the export quote do?
                        Code:
                        export LC_ALL=POSIX
                        And how would one use it? I am assuming LC_ALL is your .sam file which is unsorted? Would this be correct..

                        I sorted the .sam file using
                        Code:
                        sort -k1
                        Although there were occasional "Is the SAM file properly sorted?", it worked for most of the reads. So I was thinking if the "export: function would help. But I am unsure how to use it

                        Any insight would be great, Many thanks

                        Comment


                        • #27
                          Hi Simon,
                          I am using DEXSeq for analyzing Tophat results. I have some problems for preprocessing the BAM files produced by Tophat, existing in DEXSeq package.
                          1. dexseq_prepare_annotation.py only works with GTF files, while alot of annotation files are in GFF3.
                          2. dexseq_count.py needs sam files to be sorted. but I can not sort the sam file using commands mentioned in this thread. sam files are big anf it says no space left on the device.

                          Comment


                          • #28
                            not very helpful but if you can't sort your alignment files on your computer then you need to find a computer with more memory/hard drive space. unless this is a one-off experiment and you don't plan to work with sequencing much afterwards.

                            it seems to me you should be able to find a GTF format for your annotation - i've never used a GFF3 format file except for a long while back when Tophat accepted that formation and not GTF. what species are you working with?

                            depending on what sorting you need you might be able to have Tophat produce the output the way you need it. For example you can use the --keep-fasta-order option or the --no-sort-bam option which makes it so the output is not coordinate sorted. I'm not sure if either of them will replicate name sorting though.
                            /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
                            Salk Institute for Biological Studies, La Jolla, CA, USA */

                            Comment


                            • #29
                              Yes, we should add an option to deal with GFF3. I'll try to get to this soon. For now: It should be sufficient to reduce your GFF3 file to only the lines containing "exon" in the third column and then replacing in each line the word "Parent" by the word "gene_id" in the attributes.

                              As for sorting: Of course you need to have enough free space on your hard disk to save the sorted file. Actually, you need twice as much free space as your SAM file takes to accommodate both the resulting file and the intermediate ones. But come on, hard disks are really cheap now. Just buy a new one if your old one is full.

                              Comment


                              • #30
                                I am working on a powerful server with about 200 TB hard drive and 256 GB memory, but I can not sort the SAM files using sort -s -c -k1,1 !!! weird >_<

                                Comment

                                Latest Articles

                                Collapse

                                • seqadmin
                                  Essential Discoveries and Tools in Epitranscriptomics
                                  by seqadmin




                                  The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
                                  04-22-2024, 07:01 AM
                                • seqadmin
                                  Current Approaches to Protein Sequencing
                                  by seqadmin


                                  Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                  04-04-2024, 04:25 PM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by seqadmin, 04-25-2024, 11:49 AM
                                0 responses
                                19 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-24-2024, 08:47 AM
                                0 responses
                                18 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-11-2024, 12:08 PM
                                0 responses
                                62 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-10-2024, 10:19 PM
                                0 responses
                                60 views
                                0 likes
                                Last Post seqadmin  
                                Working...
                                X