Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • a general question on DEXSeq data preparations

    I was following the steps in the vignette of DEXSeq here to construct the exon count files (.txt) for each of my 16 biological samples. However, as shown on page 3, we need to go through these steps:

    (1) samtools index treated2.bam
    (2) samtools view treated2.bam > treated2.sam
    (3) sort k1,1 k2,2n treated2.sam > treated2_sorted.sam
    (4) python deseq_count.py -p yes Dmel.BDGP5.25.62.DEXSeq.chr.gff treated2_sorted.sam treated2fb.txt

    In (1), the indexing is done in a reasonable time (<3 min); in (2) I realize that the SAM files can be huge (>15GB) for each sample; in (3) will the sorted SAM file be of the same size as the unsorted SAM file? How long it will take for the sorting? in (4), can I say after step (3), the generated SAM file (unsorted) is no longer useful, and we only need the sorted SAM files?

    The reason I ask this is because I have 16 samples, so the total size of files becomes very large and the disk space is limited... Just wanna make sure these sizes are reasonable, and ask if I could delete the (if it's useless) SAM files while sorting them in step (3). What would be the Linux command, then?

    Thank you so much!

  • #2
    Hi @alittleboy,

    It should not take to long to sort them, however it will be even faster if you use samtools or http://picard.sourceforge.net/command-line-oIt should not take to long to sort them, however it will be even faster if you use samtools or picard tools. You could remove the initial sam files, sort them by read name and then you could sort them back with respect with position in the genome and you should get back your initial file.

    Alejandro

    Comment


    • #3
      Hi alittleboy,

      may I ask why you are sorting the SAM files that way? What I normally do is to keep everything in BAM file - as much as possible. So for DEXSeq read count I would index the BAM files with the command you have in (1) and then in (2) pass it directly to deseq_count.py with:

      samtools view sample.bam | python dexseq_count.py <options> geneModel - sample.output.dexseq_count.txt

      You will notice the "-" after geneModel which indicates that the argument (the reads file) is passed to dexseq_count.py as a stdin (standard input). This bypasses the creation of a intermediate huge SAM file.

      Regarding the sorting of the reads, if I am not mistaken, tophat produces already sorted bam files. Just in case, you can always sort your BAM before indexing it:
      samtools sort sample.bam sample.sorted # creates a sorted file
      samtools index sample.sorted.bam # index sorted bam
      rm sample.bam # removes the old bam

      You will end up with a sorted and indexed bam that can be used for pretty much any downstream analysis.

      I hope it helps.

      Comment


      • #4
        Hi @krespin,

        may I ask why you are sorting the SAM files that way? What I normally do is to keep everything in BAM file - as much as possible. So for DEXSeq read count I would index the BAM files with the command you have in (1) and then in (2) pass it directly to deseq_count.py with:
        I am afraid its my fold in this case, the pasilla vignette indicates that you could do this in this way and has not been updated. But you are certainly right, its better to do it as you are indicating!

        Alejandro

        Comment


        • #5
          Originally posted by areyes View Post
          Hi @krespin,



          I am afraid its my fold in this case, the pasilla vignette indicates that you could do this in this way and has not been updated. But you are certainly right, its better to do it as you are indicating!

          Alejandro
          Still a damn good bioconductor package

          To be fair, I also had some difficulty in the beginning because my annotations/mappings were done with UCSC gene symbols and had to circumvent some issues. Otherwise all good - I am doing some DEXseq analysis as we speak.

          Comment


          • #6
            Hi @krespim and @areyes:

            Thank you so much for your inputs and suggestions! Yes, as you see, I strictly follow the instructions of the pasilla vignette, and I think we can avoid generating the huge SAM files during the steps. To @areyes, would you please, if possible, send me an updated vignette on the most efficient way of generating the exon count files in command line? Currently I have TopHat accepted_hits.bam files at hand, and I use:

            samtools view C26.bam | sort -k1,1 -k2,2n | python dexseq_count.py -p yes HS.GRCh37.72_norm.DEXSeq.gff - C26.txt

            Is this correct? Is there a dash between the gene model and the output .TXT file?

            BTW, the first C26.bam file is obtained from running:

            samtools index C26.bam (<-- this C26.bam is the "accepted_hits.bam" from TopHat).

            Thank you so much!!

            Comment


            • #7
              The dash is correct (you just use it in place of the file name). You don't need to index the BAM file, you can skip that.

              FYI, I think you can avoid the initial sorting step by giving tophat the "--no-sort-bam" flag. You'll have to "samtools sort" it later if you want to look at it in IGV (or whatever), but overall this might be a bit faster since you can just script the whole process.

              Comment


              • #8
                Originally posted by dpryan View Post
                The dash is correct (you just use it in place of the file name). You don't need to index the BAM file, you can skip that.

                FYI, I think you can avoid the initial sorting step by giving tophat the "--no-sort-bam" flag. You'll have to "samtools sort" it later if you want to look at it in IGV (or whatever), but overall this might be a bit faster since you can just script the whole process.
                Hi @dpryan, @areyes and @krespim:

                Thanks a lot for all your suggestions! I think I can get the exon count files by "piping" in Linux commands ;-)

                Some question about the .TXT output file: I got the following (first gene with 15 exons) output:

                ENSG00000000003:001 10
                ENSG00000000003:002 0
                ...
                ENSG00000000003:014 0
                ENSG00000000003:015 0


                Q1: I notice that I have items such as shown below: there are two Ensembl gene names, having all exon counts equal 0 (is that weird?). Can I know what does this mean?

                ENSG00000206495+ENSG00000242726:001 0
                ENSG00000206495+ENSG00000242726:002 0
                ...
                ENSG00000206495+ENSG00000242726:057 0
                ENSG00000206495+ENSG00000242726:058 0


                There are also some items with more than 2 genes (non-zero exon counts):


                ...
                ENSG00000197013+ENSG00000269237+ENSG00000196268+ENSG00000268658:064 6



                Q2: At the end of the file, I notice some summaries:

                _ambiguous 0
                _empty 24914546
                _lowaqual 0
                _notaligned 0


                What are each of them represent? I can guess its meaning, but just wanna confirm... especially the "_empty" item... As each of the .TXT files are to be passed to the read.HTSeqCounts() function, will these extra information automatically filtered out?

                Thank you so much!!
                Last edited by alittleboy; 06-26-2013, 11:08 AM.

                Comment


                • #9
                  Originally posted by alittleboy View Post
                  Q1: I notice that I have items such as shown below: there are two Ensembl gene names, having all exon counts equal 0 (is that weird?). Can I know what does this mean?

                  ENSG00000206495+ENSG00000242726:001 0
                  ENSG00000206495+ENSG00000242726:002 0
                  ...
                  ENSG00000206495+ENSG00000242726:057 0
                  ENSG00000206495+ENSG00000242726:058 0
                  Have a look at those genes in a genome browser. You'll notice that they overlap in at least one transcript.

                  Originally posted by alittleboy View Post
                  Q2: At the end of the file, I notice some summaries:

                  _ambiguous 0
                  _empty 24914546
                  _lowaqual 0
                  _notaligned 0
                  The whole python program is only ~180 lines, so it's easy to go through and just see how those numbers are generated.

                  "_notaligned" is self explanatory (you're using tophat, so those will be in a different file). "_lowaqual" will be incremented when the mapping quality of a read is below the value set by -a/--minaqual. "_ambiguous" will be set when a read can map to more than one "gene". I would assume that this might occur in unstranded libraries with reads aligning to genes that partly overlap on opposite strands, but perhaps areyes could weigh in here. "_empty" is when a read aligns outside of a gene/exon.

                  Comment


                  • #10
                    Originally posted by dpryan View Post
                    "_empty" is when a read aligns outside of a gene/exon.
                    Not having any read mapped suggests, to me, some issues with the annotation and/or with the parameters set for DEXSeq. I had that problem before when I was mapping using either the wrong annotation or the wrong library type.

                    Comment


                    • #11
                      Originally posted by krespim View Post
                      Not having any read mapped suggests, to me, some issues with the annotation and/or with the parameters set for DEXSeq. I had that problem before when I was mapping using either the wrong annotation or the wrong library type.
                      Hi @krespim:

                      Thanks for pointing this out!! Could you explain in some more details about "wrong annotation" and "wrong library type"? I think I used the Homo_sapiens.GRCh37.72_norm.gtf file from Ensembl for the annotation, and library type is fr-unstranded for Standard Illumina (paired-end?). Is there anything wrong with this set-up?

                      BTW, how can I tell if the _empty entry of 24914546 is unreasonably large or not?
                      Last edited by alittleboy; 06-28-2013, 11:06 AM.

                      Comment


                      • #12
                        Originally posted by krespim View Post
                        Not having any read mapped suggests, to me, some issues with the annotation and/or with the parameters set for DEXSeq. I had that problem before when I was mapping using either the wrong annotation or the wrong library type.
                        One expects a certain amount of that. Certainly, if you get 80% of your mapped reads in the empty column then you probably did something wrong (or something very interesting is going on!).

                        Comment


                        • #13
                          Originally posted by dpryan View Post
                          One expects a certain amount of that. Certainly, if you get 80% of your mapped reads in the empty column then you probably did something wrong (or something very interesting is going on!).
                          Hi @dpryan and @areyes

                          Oops... I guess the number of empty I got (24914546) might be too large. Is there a way to check which step went wrong? Here I detailed the steps I used:

                          (1) I got BAM files for each of the 2 lanes of the samples, and I used samtools to merge the BAM files into a single one:

                          samtools merge combined.bam accepted_hits_from_L1.bam accepted_hits_from_L2.bam

                          (2) I used the Python script to generate GFF file:

                          python dexseq_prepare_annotation.py Homo_sapiens.GRCh37.72_norm.gtf HS.GRCh37.72_norm.DEXSeq.gff

                          (3) I counted the reads of exons using another Python script to generate TXT files:

                          samtools view combined.bam | sort -k1,1 -k2,2n | dexseq_count.py -p yes HS.GRCh37.72_norm.DEXSeq.gff - combined.txt

                          (4) I switched to R to use the read.HTSeqCounts() function to include those TXT files, and then do the analyses.

                          Interestingly, I did get results for testing differential exon usage (~400 genes are involved by FDR<0.1), and the results "seem" to be reasonable... Just wanna confirm if this is not "garbage in, garbage out"...

                          Thank you so much!!

                          Comment


                          • #14
                            How many reads mapped uniquely? You can also open your accepted_hits.bam (or whatever it's called) file in IGV or another browser and just spot-check a bit of the data to see if it matches up. It's always good to visually look at things to see if they make sense.

                            Comment


                            • #15
                              Originally posted by dpryan View Post
                              One expects a certain amount of that. Certainly, if you get 80% of your mapped reads in the empty column then you probably did something wrong (or something very interesting is going on!).
                              I misread the outpout and thought that none of this reads were mapped - my bad. Nevertheless, 24M empty seems quite a bit to me. That said I also follow routinely your suggestion of looking at IGV shots to check if everything is Kosher. Another thing I do as a sanity check is to grep a few genes that I know are expressed in my samples just to see if they have a reasonable number of counts*.

                              @ alittleboy: I really don't what is going wrong there, your parameters seem correct. By wrong library I mean using "stranded" instead of default for instance

                              *funnily enough this happened the other day. I ran dexseq count on a number of files and my goto confirmation genes had the right counts in some files but not in others. No pattern. To cut a long story short, it turns out that during the transfer of the bam files from the cluster were the mapping was done, some files got corrupted (the file size remained unchanged). And yes, it was with scp.

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Current Approaches to Protein Sequencing
                                by seqadmin


                                Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                04-04-2024, 04:25 PM
                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 04-11-2024, 12:08 PM
                              0 responses
                              16 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 10:19 PM
                              0 responses
                              21 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 09:21 AM
                              0 responses
                              16 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-04-2024, 09:00 AM
                              0 responses
                              46 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X