Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #16
    Originally posted by krespim View Post
    I misread the outpout and thought that none of this reads were mapped - my bad. Nevertheless, 24M empty seems quite a bit to me. That said I also follow routinely your suggestion of looking at IGV shots to check if everything is Kosher. Another thing I do as a sanity check is to grep a few genes that I know are expressed in my samples just to see if they have a reasonable number of counts*.

    @ alittleboy: I really don't what is going wrong there, your parameters seem correct. By wrong library I mean using "stranded" instead of default for instance

    *funnily enough this happened the other day. I ran dexseq count on a number of files and my goto confirmation genes had the right counts in some files but not in others. No pattern. To cut a long story short, it turns out that during the transfer of the bam files from the cluster were the mapping was done, some files got corrupted (the file size remained unchanged). And yes, it was with scp.
    Hi @krespim:

    Thanks so much for your suggestions! I am working in a Linux environment, and all files are stored in cluster. Is there a way (please point out) that I can use IGV in the Red Hat Linux OS? Or is there any alternative ways for a sanity check?

    I realize this issue ("_empty" so huge that many reads align outside of gene/exon) must be solved before I continue with subsequent DEXSeq steps... Is there a way to check what proportion for this "_empty"? Sorry I am a newbie in this bioinformatics area, and I'm not very familiar with relevant manipulations (but eager to learn )

    BTW, I think I merged BAM files from two lanes of a sample into a single BAM file. Is that a common practice? Or there is something wrong with the GTF file? Or this step:

    samtools view combined.bam | sort -k1,1 -k2,2n | dexseq_count.py -p yes HS.GRCh37.72_norm.DEXSeq.gff - combined.txt

    is problematic?

    Thanks again for your kind help!
    Last edited by alittleboy; 06-30-2013, 06:27 PM.

    Comment


    • #17
      Use samtools to sort your file

      Originally posted by alittleboy View Post
      Hi @krespim:

      Thanks so much for your suggestions! I am working in a Linux environment, and all files are stored in cluster. Is there a way (please point out) that I can use IGV in the Red Hat Linux OS? Or is there any alternative ways for a sanity check?

      I realize this issue ("_empty" so huge that many reads align outside of gene/exon) must be solved before I continue with subsequent DEXSeq steps... Is there a way to check what proportion for this "_empty"?
      It should be no problem as all you have to do is download IGV to the cluster and add the folder where igv.jar is your PATH(or ask the admnistrator). 2nd step is: do you login remotely to the server? something like:

      ssh [email protected]

      if so, just had to "-Y" to get a graphical interface to the server, meaning that when a window is open from the terminal on the server side you will see it on your local computer. This explains how to run igv from your command line.

      Looking on IGV for genes that you expect high/low read count will tell you if all is well.


      Originally posted by alittleboy View Post
      Sorry I am a newbie in this bioinformatics area, and I'm not very familiar with relevant manipulations (but eager to learn )
      I am on the same boat Here is some advice based on my experience:

      For minor/basic things you would be best advised to get help from someone tech-savvy at your institute, like a linux administrator or a bioinformatitian. Just go and ask. That is what I do all the time. And if you plan on doing bioinformatics regularly do a course on basic programming/Unix like the very good Software Carpentry Boot Camps. Better still, go through the exercises of this book. I have it and it is a life saver.


      Originally posted by alittleboy View Post
      BTW, I think I merged BAM files from two lanes of a sample into a single BAM file. Is that a common practice? Or there is something wrong with the GTF file? Or this step:
      BAM file merging it common practice and it is fine. If you are using the same GTF for mapping and read counting it also should not be the source of the problem


      This I don't like because it is a hack when there are good tools that will do a better job (I've suggested it earlier):
      Originally posted by alittleboy View Post
      samtools view combined.bam | sort -k1,1 -k2,2n | dexseq_count.py -p yes HS.GRCh37.72_norm.DEXSeq.gff - combined.txt
      After merging the BAM files do:
      samtools sort merged.bam merged_sorted
      samtools index merged_sorted.bam merged_sorted.bam.bai
      This is the right way to sort a BAM file!
      Then run DEXSeq without the "sort" command and using your sorted BAM file:
      samtools view merged_sorted.bam | dexseq_count.py -p yes HS.GRCh37.72_norm.DEXSeq.gff - combined.txt

      If there is a problem with your BAM file, samtools will probably detect it, "sort" will not.

      Originally posted by alittleboy View Post
      Thanks again for your kind help!
      No problem. But use SAMTOOLS to sort your file.

      Comment


      • #18
        Originally posted by krespim View Post
        It should be no problem as all you have to do is download IGV to the cluster and add the folder where igv.jar is your PATH(or ask the admnistrator). 2nd step is: do you login remotely to the server? something like:

        ssh [email protected]

        if so, just had to "-Y" to get a graphical interface to the server, meaning that when a window is open from the terminal on the server side you will see it on your local computer. This explains how to run igv from your command line.

        Looking on IGV for genes that you expect high/low read count will tell you if all is well.




        I am on the same boat Here is some advice based on my experience:

        For minor/basic things you would be best advised to get help from someone tech-savvy at your institute, like a linux administrator or a bioinformatitian. Just go and ask. That is what I do all the time. And if you plan on doing bioinformatics regularly do a course on basic programming/Unix like the very good Software Carpentry Boot Camps. Better still, go through the exercises of this book. I have it and it is a life saver.




        BAM file merging it common practice and it is fine. If you are using the same GTF for mapping and read counting it also should not be the source of the problem


        This I don't like because it is a hack when there are good tools that will do a better job (I've suggested it earlier):


        After merging the BAM files do:
        samtools sort merged.bam merged_sorted
        samtools index merged_sorted.bam merged_sorted.bam.bai
        This is the right way to sort a BAM file!
        Then run DEXSeq without the "sort" command and using your sorted BAM file:
        samtools view merged_sorted.bam | dexseq_count.py -p yes HS.GRCh37.72_norm.DEXSeq.gff - combined.txt

        If there is a problem with your BAM file, samtools will probably detect it, "sort" will not.



        No problem. But use SAMTOOLS to sort your file.
        Hi @krespim:

        Thank you so much for the instructions and suggestions! I've order the book you mentioned, and plan to read it carefully ;-)

        I tried to use your correct approach for sorting the BAM files, and in running dexseq_count.py, I got tons of warning messages:

        HTSeq/__init__.py:598: UserWarning: Read FCD15N1ACXX:3:1110:18071:3100#TAGGAATA claims to have an aligned mate which could not be found. (Is the SAM file properly sorted?)
        "which could not be found. (Is the SAM file properly sorted?)" )

        I guess that's the problem?

        Also, I notice that the sorted.bam file generated in this way is of the exactly the same size as what I originally did using the sort command, so I guess there might not be a difference... It's weird that originally I can get the TXT exon count files (but got an error this time...)
        Last edited by alittleboy; 07-01-2013, 06:07 AM.

        Comment


        • #19
          Originally posted by alittleboy View Post
          Hi @krespim:

          Thank you so much for the instructions and suggestions! I've order the book you mentioned, and plan to read it carefully ;-)

          I tried to use your correct approach for sorting the BAM files, and in running dexseq_count.py, I got tons of warning messages:

          HTSeq/__init__.py:598: UserWarning: Read FCD15N1ACXX:3:1110:18071:3100#TAGGAATA claims to have an aligned mate which could not be found. (Is the SAM file properly sorted?)
          "which could not be found. (Is the SAM file properly sorted?)" )

          I guess that's the problem?

          Also, I notice that the sorted.bam file generated in this way is of the exactly the same size as what I originally did using the sort command, so I guess there might not be a difference... It's weird that originally I can get the TXT exon count files (but got an error this time...)
          Well, this is a bit out of my league now. I did found this which might help:
          http://seqanswers.com/forums/showthread.php?t=12376

          Otherwise send and email to the HTseq authors. Good luck.

          Comment


          • #20
            Are you sure you use "-n" option when doing samtools sort? In order to sort them by read name?

            Comment


            • #21
              Originally posted by areyes View Post
              Are you sure you use "-n" option when doing samtools sort? In order to sort them by read name?
              Yes, this time I used "-n" option when doing samtools sort, and got the merged_sort.bam file. However, when I continue to run:

              samtools index merged_sorted.bam merged_sorted.bam.bai

              It has the error message:

              [bam_index_core] the alignment is not sorted (FCD15N1ACXX:3:1101:1493:42913#NAGGAATA): 17-th chr > 5-th chr
              [bam_index_build2] fail to index the BAM file.

              Why? I think it is already sorted in the first step...

              Thanks for your suggests, areyes ;-)

              Comment


              • #22
                HI @alittleboy,

                In order to index a bam file you need it to be sorted by position. However for htseq scripts you do not require the index, you just need the reads sorted by gene name. You should be able to proceed with the htseq scripts without the index!

                Alejandro

                Comment


                • #23
                  Originally posted by areyes View Post
                  HI @alittleboy,

                  In order to index a bam file you need it to be sorted by position. However for htseq scripts you do not require the index, you just need the reads sorted by gene name. You should be able to proceed with the htseq scripts without the index!

                  Alejandro
                  Hi Alejandro:

                  Thanks for the update -- sorry I am a little confused by the different inputs from you, @krespim and the pasilla vignette... I apologize for my confusion. Here I list the steps that I use (from your last post)

                  samtools sort -n S1.bam S1_sorted

                  samtools view S1_sorted.bam | dexseq_count.py -p yes -s yes HS.GRCh37.72_norm.DEXSeq.gff - S1.txt


                  These two commands can run without error, and I think that's the way dexseq_count.py is supposed to do. I checked the results, but still found (I use a single BAM file instead of the merged file this time):

                  _ambiguous 0
                  _empty 9677861
                  _lowaqual 0
                  _notaligned 0

                  the huge number of _empty, which bothers me a lot... I really don't know if this is within a reasonable scope. Because I also used SplicingCompass for DEU, I compared the results with DEXSeq, and found that they're consistent for the very top genes (which makes me feel a little better...).
                  Last edited by alittleboy; 07-01-2013, 12:48 PM.

                  Comment


                  • #24
                    Originally posted by alittleboy View Post
                    samtools sort -n S1.bam S1_sorted

                    samtools view S1_sorted.bam | dexseq_count.py -p yes -s yes HS.GRCh37.72_norm.DEXSeq.gff - S1.txt
                    That looks correct.

                    _ambiguous 0
                    _empty 9677861
                    _lowaqual 0
                    _notaligned 0

                    the huge number of _empty, which bothers me a lot... I really don't know if this is within a reasonable scope. Because I also used SplicingCompass for DEU, I compared the results with DEXSeq, and found that they're consistent for the very top genes (which makes me feel a little better...).
                    Without known how many reads mapped, it's difficult to know how good or bad that number is. Elsewhere, you mention using whole lanes, which would suggest to me that that number could be reasonable.

                    Comment


                    • #25
                      Originally posted by dpryan View Post
                      That looks correct.



                      Without known how many reads mapped, it's difficult to know how good or bad that number is. Elsewhere, you mention using whole lanes, which would suggest to me that that number could be reasonable.
                      Hi @dpryan:

                      Thanks for the clarifications! Here is some summary statistics from the folder (that contains accepted_hits.bam, junctions.bed files, etc.) with the file name "left_kept_reads.info":

                      min_read_len=49
                      max_read_len=49
                      reads_in =19390575
                      reads_out=19389262

                      From the file named "right_kept_reads.info", it shows:

                      min_read_len=49
                      max_read_len=49
                      reads_in =19390575
                      reads_out=19389169

                      It's paired-end, so the total number of reads is ~40 million? How can we know from this information that the "_empty" is large or at reasonable range? BTW, this is from a single lane. Actually there are 2 lanes for each sample, but I choose not to merge them (for fair methodological comparisons).

                      Thank you!
                      Last edited by alittleboy; 07-02-2013, 03:51 AM.

                      Comment


                      • #26
                        You actually will only have ~20 million reads (mates will only be counted once), so that would mean that ~1/2 aren't aligning to annotated genes. That would seem quite high. I'm used to getting around 200 million reads from a lane (HiSeq) rather than ~20 million (were these run on a genome analyzer or multiplexed?), so my "could be reasonable" bit was dependent upon having more reads.

                        Comment


                        • #27
                          Originally posted by dpryan View Post
                          You actually will only have ~20 million reads (mates will only be counted once), so that would mean that ~1/2 aren't aligning to annotated genes. That would seem quite high. I'm used to getting around 200 million reads from a lane (HiSeq) rather than ~20 million (were these run on a genome analyzer or multiplexed?), so my "could be reasonable" bit was dependent upon having more reads.
                          Hi @dpryan:

                          Thank you for the update -- I also feel that ~1/2 not aligning to annotated genes is quite high... but unfortunately I don't know why.

                          The information above is from a single lane. In my experiment, each sample is sequenced in two lanes, and there are a total of 8 samples for each treatment group. Each lane has similar aligning rate (~1/2 not aligning). Thus, for detecting DEU purpose, do you think that would be enough (may be not...)? They were multiplexed as there are some other samples in one lane. However, our study just focuses on a particular sample (which is very likely to give us significant results).

                          So, due to our particular purpose of the study, we plan to just use the data from one lane for this sample. So, the consequence is that, we'll have power loss due to low coverage?

                          Thanks!

                          Comment


                          • #28
                            You'd have to use something like scotty to see how many samples at what depth you might need to figure out what sort of power you might have. For most of my needs (differential expression in mice), 8 samples at that depth would work fine.

                            Comment

                            Latest Articles

                            Collapse

                            • seqadmin
                              Current Approaches to Protein Sequencing
                              by seqadmin


                              Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                              04-04-2024, 04:25 PM
                            • seqadmin
                              Strategies for Sequencing Challenging Samples
                              by seqadmin


                              Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                              03-22-2024, 06:39 AM

                            ad_right_rmr

                            Collapse

                            News

                            Collapse

                            Topics Statistics Last Post
                            Started by seqadmin, 04-11-2024, 12:08 PM
                            0 responses
                            31 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, 04-10-2024, 10:19 PM
                            0 responses
                            32 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, 04-10-2024, 09:21 AM
                            0 responses
                            28 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, 04-04-2024, 09:00 AM
                            0 responses
                            53 views
                            0 likes
                            Last Post seqadmin  
                            Working...
                            X