Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • alittleboy
    Member
    • Apr 2011
    • 60

    a general question on DEXSeq data preparations

    I was following the steps in the vignette of DEXSeq here to construct the exon count files (.txt) for each of my 16 biological samples. However, as shown on page 3, we need to go through these steps:

    (1) samtools index treated2.bam
    (2) samtools view treated2.bam > treated2.sam
    (3) sort k1,1 k2,2n treated2.sam > treated2_sorted.sam
    (4) python deseq_count.py -p yes Dmel.BDGP5.25.62.DEXSeq.chr.gff treated2_sorted.sam treated2fb.txt

    In (1), the indexing is done in a reasonable time (<3 min); in (2) I realize that the SAM files can be huge (>15GB) for each sample; in (3) will the sorted SAM file be of the same size as the unsorted SAM file? How long it will take for the sorting? in (4), can I say after step (3), the generated SAM file (unsorted) is no longer useful, and we only need the sorted SAM files?

    The reason I ask this is because I have 16 samples, so the total size of files becomes very large and the disk space is limited... Just wanna make sure these sizes are reasonable, and ask if I could delete the (if it's useless) SAM files while sorting them in step (3). What would be the Linux command, then?

    Thank you so much!
  • areyes
    Senior Member
    • Aug 2010
    • 165

    #2
    Hi @alittleboy,

    It should not take to long to sort them, however it will be even faster if you use samtools or http://picard.sourceforge.net/command-line-oIt should not take to long to sort them, however it will be even faster if you use samtools or picard tools. You could remove the initial sam files, sort them by read name and then you could sort them back with respect with position in the genome and you should get back your initial file.

    Alejandro

    Comment

    • krespim
      Member
      • Jul 2012
      • 49

      #3
      Hi alittleboy,

      may I ask why you are sorting the SAM files that way? What I normally do is to keep everything in BAM file - as much as possible. So for DEXSeq read count I would index the BAM files with the command you have in (1) and then in (2) pass it directly to deseq_count.py with:

      samtools view sample.bam | python dexseq_count.py <options> geneModel - sample.output.dexseq_count.txt

      You will notice the "-" after geneModel which indicates that the argument (the reads file) is passed to dexseq_count.py as a stdin (standard input). This bypasses the creation of a intermediate huge SAM file.

      Regarding the sorting of the reads, if I am not mistaken, tophat produces already sorted bam files. Just in case, you can always sort your BAM before indexing it:
      samtools sort sample.bam sample.sorted # creates a sorted file
      samtools index sample.sorted.bam # index sorted bam
      rm sample.bam # removes the old bam

      You will end up with a sorted and indexed bam that can be used for pretty much any downstream analysis.

      I hope it helps.

      Comment

      • areyes
        Senior Member
        • Aug 2010
        • 165

        #4
        Hi @krespin,

        may I ask why you are sorting the SAM files that way? What I normally do is to keep everything in BAM file - as much as possible. So for DEXSeq read count I would index the BAM files with the command you have in (1) and then in (2) pass it directly to deseq_count.py with:
        I am afraid its my fold in this case, the pasilla vignette indicates that you could do this in this way and has not been updated. But you are certainly right, its better to do it as you are indicating!

        Alejandro

        Comment

        • krespim
          Member
          • Jul 2012
          • 49

          #5
          Originally posted by areyes View Post
          Hi @krespin,



          I am afraid its my fold in this case, the pasilla vignette indicates that you could do this in this way and has not been updated. But you are certainly right, its better to do it as you are indicating!

          Alejandro
          Still a damn good bioconductor package

          To be fair, I also had some difficulty in the beginning because my annotations/mappings were done with UCSC gene symbols and had to circumvent some issues. Otherwise all good - I am doing some DEXseq analysis as we speak.

          Comment

          • alittleboy
            Member
            • Apr 2011
            • 60

            #6
            Hi @krespim and @areyes:

            Thank you so much for your inputs and suggestions! Yes, as you see, I strictly follow the instructions of the pasilla vignette, and I think we can avoid generating the huge SAM files during the steps. To @areyes, would you please, if possible, send me an updated vignette on the most efficient way of generating the exon count files in command line? Currently I have TopHat accepted_hits.bam files at hand, and I use:

            samtools view C26.bam | sort -k1,1 -k2,2n | python dexseq_count.py -p yes HS.GRCh37.72_norm.DEXSeq.gff - C26.txt

            Is this correct? Is there a dash between the gene model and the output .TXT file?

            BTW, the first C26.bam file is obtained from running:

            samtools index C26.bam (<-- this C26.bam is the "accepted_hits.bam" from TopHat).

            Thank you so much!!

            Comment

            • dpryan
              Devon Ryan
              • Jul 2011
              • 3478

              #7
              The dash is correct (you just use it in place of the file name). You don't need to index the BAM file, you can skip that.

              FYI, I think you can avoid the initial sorting step by giving tophat the "--no-sort-bam" flag. You'll have to "samtools sort" it later if you want to look at it in IGV (or whatever), but overall this might be a bit faster since you can just script the whole process.

              Comment

              • alittleboy
                Member
                • Apr 2011
                • 60

                #8
                Originally posted by dpryan View Post
                The dash is correct (you just use it in place of the file name). You don't need to index the BAM file, you can skip that.

                FYI, I think you can avoid the initial sorting step by giving tophat the "--no-sort-bam" flag. You'll have to "samtools sort" it later if you want to look at it in IGV (or whatever), but overall this might be a bit faster since you can just script the whole process.
                Hi @dpryan, @areyes and @krespim:

                Thanks a lot for all your suggestions! I think I can get the exon count files by "piping" in Linux commands ;-)

                Some question about the .TXT output file: I got the following (first gene with 15 exons) output:

                ENSG00000000003:001 10
                ENSG00000000003:002 0
                ...
                ENSG00000000003:014 0
                ENSG00000000003:015 0


                Q1: I notice that I have items such as shown below: there are two Ensembl gene names, having all exon counts equal 0 (is that weird?). Can I know what does this mean?

                ENSG00000206495+ENSG00000242726:001 0
                ENSG00000206495+ENSG00000242726:002 0
                ...
                ENSG00000206495+ENSG00000242726:057 0
                ENSG00000206495+ENSG00000242726:058 0


                There are also some items with more than 2 genes (non-zero exon counts):


                ...
                ENSG00000197013+ENSG00000269237+ENSG00000196268+ENSG00000268658:064 6



                Q2: At the end of the file, I notice some summaries:

                _ambiguous 0
                _empty 24914546
                _lowaqual 0
                _notaligned 0


                What are each of them represent? I can guess its meaning, but just wanna confirm... especially the "_empty" item... As each of the .TXT files are to be passed to the read.HTSeqCounts() function, will these extra information automatically filtered out?

                Thank you so much!!
                Last edited by alittleboy; 06-26-2013, 11:08 AM.

                Comment

                • dpryan
                  Devon Ryan
                  • Jul 2011
                  • 3478

                  #9
                  Originally posted by alittleboy View Post
                  Q1: I notice that I have items such as shown below: there are two Ensembl gene names, having all exon counts equal 0 (is that weird?). Can I know what does this mean?

                  ENSG00000206495+ENSG00000242726:001 0
                  ENSG00000206495+ENSG00000242726:002 0
                  ...
                  ENSG00000206495+ENSG00000242726:057 0
                  ENSG00000206495+ENSG00000242726:058 0
                  Have a look at those genes in a genome browser. You'll notice that they overlap in at least one transcript.

                  Originally posted by alittleboy View Post
                  Q2: At the end of the file, I notice some summaries:

                  _ambiguous 0
                  _empty 24914546
                  _lowaqual 0
                  _notaligned 0
                  The whole python program is only ~180 lines, so it's easy to go through and just see how those numbers are generated.

                  "_notaligned" is self explanatory (you're using tophat, so those will be in a different file). "_lowaqual" will be incremented when the mapping quality of a read is below the value set by -a/--minaqual. "_ambiguous" will be set when a read can map to more than one "gene". I would assume that this might occur in unstranded libraries with reads aligning to genes that partly overlap on opposite strands, but perhaps areyes could weigh in here. "_empty" is when a read aligns outside of a gene/exon.

                  Comment

                  • krespim
                    Member
                    • Jul 2012
                    • 49

                    #10
                    Originally posted by dpryan View Post
                    "_empty" is when a read aligns outside of a gene/exon.
                    Not having any read mapped suggests, to me, some issues with the annotation and/or with the parameters set for DEXSeq. I had that problem before when I was mapping using either the wrong annotation or the wrong library type.

                    Comment

                    • alittleboy
                      Member
                      • Apr 2011
                      • 60

                      #11
                      Originally posted by krespim View Post
                      Not having any read mapped suggests, to me, some issues with the annotation and/or with the parameters set for DEXSeq. I had that problem before when I was mapping using either the wrong annotation or the wrong library type.
                      Hi @krespim:

                      Thanks for pointing this out!! Could you explain in some more details about "wrong annotation" and "wrong library type"? I think I used the Homo_sapiens.GRCh37.72_norm.gtf file from Ensembl for the annotation, and library type is fr-unstranded for Standard Illumina (paired-end?). Is there anything wrong with this set-up?

                      BTW, how can I tell if the _empty entry of 24914546 is unreasonably large or not?
                      Last edited by alittleboy; 06-28-2013, 11:06 AM.

                      Comment

                      • dpryan
                        Devon Ryan
                        • Jul 2011
                        • 3478

                        #12
                        Originally posted by krespim View Post
                        Not having any read mapped suggests, to me, some issues with the annotation and/or with the parameters set for DEXSeq. I had that problem before when I was mapping using either the wrong annotation or the wrong library type.
                        One expects a certain amount of that. Certainly, if you get 80% of your mapped reads in the empty column then you probably did something wrong (or something very interesting is going on!).

                        Comment

                        • alittleboy
                          Member
                          • Apr 2011
                          • 60

                          #13
                          Originally posted by dpryan View Post
                          One expects a certain amount of that. Certainly, if you get 80% of your mapped reads in the empty column then you probably did something wrong (or something very interesting is going on!).
                          Hi @dpryan and @areyes

                          Oops... I guess the number of empty I got (24914546) might be too large. Is there a way to check which step went wrong? Here I detailed the steps I used:

                          (1) I got BAM files for each of the 2 lanes of the samples, and I used samtools to merge the BAM files into a single one:

                          samtools merge combined.bam accepted_hits_from_L1.bam accepted_hits_from_L2.bam

                          (2) I used the Python script to generate GFF file:

                          python dexseq_prepare_annotation.py Homo_sapiens.GRCh37.72_norm.gtf HS.GRCh37.72_norm.DEXSeq.gff

                          (3) I counted the reads of exons using another Python script to generate TXT files:

                          samtools view combined.bam | sort -k1,1 -k2,2n | dexseq_count.py -p yes HS.GRCh37.72_norm.DEXSeq.gff - combined.txt

                          (4) I switched to R to use the read.HTSeqCounts() function to include those TXT files, and then do the analyses.

                          Interestingly, I did get results for testing differential exon usage (~400 genes are involved by FDR<0.1), and the results "seem" to be reasonable... Just wanna confirm if this is not "garbage in, garbage out"...

                          Thank you so much!!

                          Comment

                          • dpryan
                            Devon Ryan
                            • Jul 2011
                            • 3478

                            #14
                            How many reads mapped uniquely? You can also open your accepted_hits.bam (or whatever it's called) file in IGV or another browser and just spot-check a bit of the data to see if it matches up. It's always good to visually look at things to see if they make sense.

                            Comment

                            • krespim
                              Member
                              • Jul 2012
                              • 49

                              #15
                              Originally posted by dpryan View Post
                              One expects a certain amount of that. Certainly, if you get 80% of your mapped reads in the empty column then you probably did something wrong (or something very interesting is going on!).
                              I misread the outpout and thought that none of this reads were mapped - my bad. Nevertheless, 24M empty seems quite a bit to me. That said I also follow routinely your suggestion of looking at IGV shots to check if everything is Kosher. Another thing I do as a sanity check is to grep a few genes that I know are expressed in my samples just to see if they have a reasonable number of counts*.

                              @ alittleboy: I really don't what is going wrong there, your parameters seem correct. By wrong library I mean using "stranded" instead of default for instance

                              *funnily enough this happened the other day. I ran dexseq count on a number of files and my goto confirmation genes had the right counts in some files but not in others. No pattern. To cut a long story short, it turns out that during the transfer of the bam files from the cluster were the mapping was done, some files got corrupted (the file size remained unchanged). And yes, it was with scp.

                              Comment

                              Latest Articles

                              Collapse

                              • SEQadmin2
                                From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
                                by SEQadmin2


                                Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


                                The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
                                ...
                                06-02-2026, 10:05 AM
                              • SEQadmin2
                                Single-Cell Sequencing at an Inflection Point: Early Impacts of New Platforms and Emerging Trends
                                by SEQadmin2


                                With the launch of new single-cell sequencing platforms in 2026, the field stands at an exciting inflection point. This article surveys the most impactful advances in the field and discusses how they’re reshaping research in cancer, immunology, and beyond.


                                Introduction

                                Single-cell sequencing technologies have undergone remarkable advances over the past decade, transitioning from low-throughput experimental approaches to highly scalable platforms capable of...
                                05-22-2026, 06:42 AM
                              • SEQadmin2
                                Environmental Genomics in the Age of NGS: From Microbes to Conservation Strategies
                                by SEQadmin2

                                Studying ecosystems means dealing with complex, multi-species communities that are hard to observe at scale. This complexity, however, hides many important questions to be answered, from how biogeochemical cycles work and how climate change can affect species distribution to how conservation strategies can work best.


                                Genomics, particularly since the expansion of NGS, has transformed ecosystem ecology. By sequencing environmental DNA, we can now assess biodiversity without direct...
                                05-06-2026, 09:04 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by SEQadmin2, 06-02-2026, 12:03 PM
                              0 responses
                              21 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-02-2026, 11:40 AM
                              0 responses
                              14 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 05-28-2026, 11:40 AM
                              0 responses
                              29 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 05-26-2026, 10:12 AM
                              0 responses
                              31 views
                              0 reactions
                              Last Post SEQadmin2  
                              Working...