  • Two questions

    1. I want to ask a question about BAM files.

    I have two sequencing libraries from the same sample, so I have two FASTQ files; the read lengths are 50 bp and 36 bp respectively.
    Because I need to specify -r when I run TopHat, I cannot merge the two FASTQ files before alignment. But once I have the accepted_hits.bam files, can I merge them with samtools merge?

    I need to run cufflinks and cuffdiff on the merged BAM files.

    2. I see that cuffdiff is invoked as
    cuffdiff transcripts.gtf 1.bam 2.bam

    Is this transcripts.gtf the output of cufflinks, or just the reference transcript annotation?


    Thanks, everyone.

  • #2
    I'm guessing the reads are not paired-end, so you can't align the FASTQ files in the same TopHat command. In that case, you can always merge the two BAM files with 'samtools merge' or Picard's MergeSamFiles.
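A minimal sketch of the samtools route, assuming one TopHat output directory per library. The directory and file names are hypothetical, and the merge itself only runs when samtools and both inputs are present; otherwise the command is just printed.

```shell
# Hypothetical paths: one TopHat output directory per library of the same sample.
bam_50bp="tophat_lib50/accepted_hits.bam"
bam_36bp="tophat_lib36/accepted_hits.bam"
merged="sample1_merged.bam"

# samtools merge takes the output file first, then the input BAM files.
merge_cmd="samtools merge $merged $bam_50bp $bam_36bp"
echo "$merge_cmd"

# Only run the merge when samtools and both inputs are actually present.
if command -v samtools >/dev/null 2>&1 && [ -f "$bam_50bp" ] && [ -f "$bam_36bp" ]; then
    $merge_cmd
fi
```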



    You can use either a reference GTF file or the output from cufflinks. If you want novel transcripts, run cufflinks first; if you only want expression for known genes, you can run cuffdiff directly with a GTF file downloaded from Ensembl, UCSC, etc.

    Chris



    • #3
      Thanks very much. But the reads are paired-end. One sample has several libraries, and the read length differs between the libraries. So we first produce a BAM file for each library with

      tophat -r xxx -G hg19_ucsc.gtf <bowtie_index> ERR001_1.fastq ERR001_2.fastq

      and then merge all the BAM files that come from different libraries but belong to the same sample. Is that right? Thanks
      Last edited by camelbbs; 10-24-2011, 12:01 PM.



      • #4
        And if we use the output from cufflinks, there will be two GTF files when we work on two samples. How do we pass these two files to cuffdiff? Thanks very much for your help.



        • #5
          Yes, you can merge BAM files from multiple sequencing runs if they come from the same sample, even if the read lengths differ.
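Because TopHat's -r is the expected inner distance between the two mates, each library needs its own value before the per-library BAM files are merged. The sketch below works that out as fragment size minus twice the read length. The 300 bp fragment size, the output directory names and the FASTQ names are made up for illustration, `<bowtie_index>` is a placeholder, and the TopHat commands are only printed, not run.

```shell
# Expected inner distance between mates: fragment size minus both reads.
mate_inner_dist() {
    # $1 = mean fragment size (bp), $2 = read length (bp)
    echo $(( $1 - 2 * $2 ))
}

# Assumed (made-up) mean fragment size; use the value for your own libraries.
fragment_size=300

for read_len in 50 36; do
    r=$(mate_inner_dist "$fragment_size" "$read_len")
    # Dry run: print one TopHat command per library, each with its own output directory.
    echo "tophat -r $r -o tophat_lib$read_len -G hg19_ucsc.gtf <bowtie_index> lib${read_len}_1.fastq lib${read_len}_2.fastq"
done
```

With these assumed numbers, the 50 bp library gets -r 200 and the 36 bp library gets -r 228.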



          • #6
            Originally posted by camelbbs

            And if we use the output from cufflinks, there will be two GTF files when we work on two samples. How do we pass these two files to cuffdiff? Thanks very much for your help.

            Cufflinks ships with a utility called gffread. According to gffread -h, it has these options:

            -M/--merge : cluster the input transcripts into loci, collapsing matching
                         transcripts (those with the same exact introns and fully contained)
            --cluster-only: same as --merge but without collapsing matching transcripts
            -K  for -M option: also collapse shorter, fully contained transcripts
                with fewer introns than the container
            -Q  for -M option, remove the containment restriction:
                (multi-exon transcripts will be collapsed if just their introns match,
                while single-exon transcripts can partially overlap (80%))

            I've never used it myself, so I'm not sure whether it does what you want. You could also convert to BED format and then use BEDTools, whose intersectBed produces one BED file from two input BED files. A thread on SEQanswers covers getting a final GTF file back from that BED file.

            Note that converting between GTF and BED is not always easy, as you can lose data.

            Chris



            • #7
              Thanks a lot, Chris.
              My actual goal is to find and compare alternative splicing events between two samples.

              My workflow is like this:

              First I got the two merged BAM files for the two samples from TopHat. Then I ran

              cuffdiff hg19_ucsc.gtf sample1.bam sample2.bam

              and got some results, but they don't contain the novel transcripts assembled by cufflinks.

              So I ran cufflinks to get the novel transcripts:

              cufflinks -g hg19_ucsc.gtf sample1.bam
              cufflinks -g hg19_ucsc.gtf sample2.bam

              That gave me one transcripts.gtf file per sample.

              Then I merged the two files, transcript1.gtf and transcript2.gtf, with the reference annotation:

              cuffmerge -o merged gtf_list.txt
              (gtf_list.txt lists hg19_ucsc.gtf, transcript1.gtf and transcript2.gtf)

              Then I ran cuffdiff:

              cuffdiff merged/merged.gtf sample1.bam sample2.bam

              Is that the right workflow for comparing novel alternatively spliced transcripts and their expression between the two samples?

              I also see there is a program called cuffcompare. If I run

              cuffcompare hg19_ucsc.gtf transcript1.gtf transcript2.gtf

              I can also get the differing alternatively spliced transcripts. So does that mean

              cufflinks + cuffcompare == cuffdiff ?

              Thanks a lot!!!
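The cufflinks, cuffmerge and cuffdiff sequence described above can be sketched as a dry run. This is one possible arrangement, not necessarily what the poster ran: the output directory names and `assemblies.txt` are hypothetical, and passing the reference annotation through cuffmerge's -g option (rather than listing it with the assemblies) is an assumption. Each cufflinks run gets its own -o directory so the two transcripts.gtf files do not overwrite each other. Only the list file is actually written; the commands are printed.

```shell
# Hypothetical file names; adjust to your own data.
ref_gtf="hg19_ucsc.gtf"
assemblies="assemblies.txt"

# One Cufflinks run per sample, each with its own output directory.
for s in sample1 sample2; do
    echo "cufflinks -g $ref_gtf -o cufflinks_$s $s.bam"
done

# cuffmerge reads a text file listing the assemblies, one GTF path per line.
printf '%s\n' cufflinks_sample1/transcripts.gtf cufflinks_sample2/transcripts.gtf > "$assemblies"

# With -o merged, cuffmerge writes its result to merged/merged.gtf.
echo "cuffmerge -g $ref_gtf -o merged $assemblies"

# Quantify and test both samples against the merged annotation.
echo "cuffdiff -o cuffdiff_out merged/merged.gtf sample1.bam sample2.bam"
```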
              Last edited by camelbbs; 10-25-2011, 01:35 PM.



              • #8
                Sounds like you've got a better method than the one I suggested; I have never used cuffcompare or cuffmerge.

                cuffdiff always seems to be the last program to run, whether you want FPKMs (expression levels) for known or novel transcripts. It writes the data as tab-delimited text files that open cleanly in a spreadsheet, and it runs some useful statistical tests as well.

                Chris
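As a rough illustration of working with those test results outside a spreadsheet, the sketch below keeps only the rows that cuffdiff flags as significant. The input is a made-up miniature file, not real cuffdiff output: real *.diff files have more columns, and only the `significant` header name with its yes/no values is relied on here. The column is looked up by name because the column order has changed between versions.

```shell
# Made-up miniature stand-in for a cuffdiff *.diff file (tab-delimited, header row first).
printf 'test_id\tgene\tstatus\tp_value\tsignificant\n' > example_exp.diff
printf 'TCONS_00000001\tGENE_A\tOK\t0.001\tyes\n' >> example_exp.diff
printf 'TCONS_00000002\tGENE_B\tOK\t0.40\tno\n' >> example_exp.diff

# Find the "significant" column by name, then keep the header and the rows flagged "yes".
sig_rows=$(awk -F '\t' '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "significant") col = i; print; next }
    col && $col == "yes"
' example_exp.diff)
echo "$sig_rows"
```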



                • #9
                  I did the same thing a few days ago. In my project I used only merged.gtf for cuffdiff, and it went well (there are "u" entries among the class codes). My colleague, however, found no "u" class codes in her merged.gtf, so she ran cuffcompare with merged.gtf and known.gtf (the species was not human), and then used the resulting combined.gtf for cuffdiff as well.

                  So I am still a little confused about the difference between merged.gtf and combined.gtf. Any help would be appreciated.
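For context, merged.gtf is what cuffmerge writes, while cuffcompare writes its own annotation as `<prefix>.combined.gtf`. One quick way to check either file for novel intergenic transcripts is to count the transcripts whose class_code attribute is "u". The sketch below does this on a made-up two-line GTF; the file name and its contents are invented for illustration.

```shell
# Made-up two-line stand-in for a GTF with Cufflinks-style class_code attributes.
printf 'chr1\tCufflinks\texon\t100\t200\t.\t+\t.\tgene_id "XLOC_000001"; transcript_id "TCONS_00000001"; class_code "u";\n' > example.gtf
printf 'chr1\tCufflinks\texon\t500\t900\t.\t+\t.\tgene_id "XLOC_000002"; transcript_id "TCONS_00000002"; class_code "=";\n' >> example.gtf

# Count the distinct transcript IDs whose class code is "u" (unknown, intergenic).
n_novel=$(grep 'class_code "u"' example.gtf \
    | sed 's/.*transcript_id "\([^"]*\)".*/\1/' \
    | sort -u | wc -l | tr -d ' ')
echo "transcripts with class_code u: $n_novel"
```

Running the same count on a real merged.gtf and on a combined.gtf would show whether one of them is missing the "u" transcripts.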



                  • #10
                    Hi, I just want to know what you mean by combined.gtf.



                    • #11
                      I'd like to ask what you mean by combined.gtf.
