Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • stringtie parameters

    Hi,

    I have been trying to use Stringtie for transcriptome re-assembly, based on a reference gtf file.
    Here is how I ran it:

    # for each of the bam files from my project (aligned with tophat2):
    stringtie file.bam -G reference.gtf -o file_stringtie.gtf -p 4 -v -C file_coverage.txt -A file_gene_abundance.out

    # then merging all gtf files together:
    stringtie --merge -G reference.gtf -p 4 -o all_merged.gtf gtf_list.txt

    It is very straightforward. It is also incredibly fast as compared to the cufflinks + cuffmerge pipeline.

    But when I compare the number of transcripts found in the reference GTF file and in the output of Stringtie, it is dramatically different:
    awk '$3=="transcript"' reference.gtf | wc -l
    # 23963
    awk '$3=="transcript"' all_merged.gtf | wc -l
    # 57830

    I expect and hope for new transcripts, but I think this is a bit too much difference (Am I wrong?).

    How can I make the pipeline more stringent?

    Would you advice to increase the minimum input transcript coverage for example, in the merging step?
    Also, If I look at some of cuffmerge's parameters, the minimum isoform fraction is set to 0.05 while in stringtie it is set as 0.01 by default: is it the way to go?

    I have tried these parameters:

    stringtie --merge -c 2.5 -G reference.gtf -p 4 -o all_merged_bis.gtf gtf_list.txt
    awk '$3=="transcript"' all_merged_bis.gtf | wc -l
    # 57476

    stringtie --merge -f 0.05 -G reference.gtf -p 4 -o all_merged_ter.gtf gtf_list.txt
    awk '$3=="transcript"' all_merged_ter.gtf | wc -l
    # 36164

    I am merging together results from about 60 bam files, so I guess the approach can be different than for smaller projects.

    Thank you for any help and advice!

    Best,

  • #2
    I would try gffcompare (by the same author) instead of "stringtie --merge" because it seems to be more stringent. I have also experienced the same issue that you report, but it is worse for a large genome. In my case, "stringtie --merge" generated 3X more transcripts than the reference, while gffcompare only generated about 2X more. You can also discard novel loci with gffcompare if you want to only consider the reference set.

    Alternatively, you can increase the thresholds for stringtie to merge transcripts.

    Comment


    • #3
      Thanks a lot for your input.

      I have now tried gffcompare, but it is actually a lot worse in my case:

      gffcompare -r reference.gtf -s reference.fa -C -D -i gtf_list.txt

      awk '$3=="transcript"' gffcmp.combined.gtf | wc -l
      # 185653

      As I understand it, gffcompare creates the union of all the gtf files given as an input, and as I am merging about 60 files, I get a huge final number of transcripts.

      I think stringtie --merge is more appropriate in my case as it rather constructs a kind of consensus, so I will try and work on optimizing the parameters, although I would like to make sure not to be too stringent on some of them, and too flexible on others.

      Comment


      • #4
        It is very likely that most of the transcripts that make up the difference are intronic or intergenic single exon transcripts. Especially with such a large number of samples, there are many small fragments expressed all over the place. We are more aggressive in filtering these out in StringTie version 1.2.2 (just released today), so please give it a try.

        The other ways to filter more of the transcripts are with the -f parameter just as mentioned before, or with the -F or -T parameters that filter out transcripts of very low abundance in the samples. We like filtering with -F and -T more than with the -f option, because -f filters transcripts that have a relative low abundance compared to the most abundant transcript in the bundle, even if sometimes the transcripts that are filtered out are highly expressed.

        Comment


        • #5
          Originally posted by mpertea View Post
          It is very likely that most of the transcripts that make up the difference are intronic or intergenic single exon transcripts. Especially with such a large number of samples, there are many small fragments expressed all over the place. We are more aggressive in filtering these out in StringTie version 1.2.2 (just released today), so please give it a try.

          The other ways to filter more of the transcripts are with the -f parameter just as mentioned before, or with the -F or -T parameters that filter out transcripts of very low abundance in the samples. We like filtering with -F and -T more than with the -f option, because -f filters transcripts that have a relative low abundance compared to the most abundant transcript in the bundle, even if sometimes the transcripts that are filtered out are highly expressed.
          This is very helpful, thanks. One question I have would be about the merging that gffcompare does vs. the "stringtie --merge" method. It seems like "stringtie --merge" is the more appropriate method for joining libraries from different tissues, followed by an assessment with gffcompare. Is this correct? The docs say that gffcompare also does merging but it is not clear to how this relates to what "stringtie --merge" is doing.

          Comment


          • #6
            Can anyone suggest an interpretation of the following results using stringtie --merge: ?

            Three stringtie assemblies with 29747, 30865, and 29863 transcripts are merged using stringtie --merge and the resulting gtf has only 25130 transcripts.

            Am I losing information? I do not know the internal workings of stringtie --merge but I intuitively expect to have no fewer transcripts than the input assembly with the fewest transcripts.

            Thanks!!
            Matt

            Comment


            • #7
              Originally posted by mcsimenc View Post
              Can anyone suggest an interpretation of the following results using stringtie --merge: ?

              Three stringtie assemblies with 29747, 30865, and 29863 transcripts are merged using stringtie --merge and the resulting gtf has only 25130 transcripts.

              Am I losing information? I do not know the internal workings of stringtie --merge but I intuitively expect to have no fewer transcripts than the input assembly with the fewest transcripts.

              Thanks!!
              Matt
              Hello Matt,

              According to my understanding, the number of merged transcripts presented depends on the relative expression of the input transcript files. As the literature stares, " generate a non-redundant set of transcripts observed in all the RNA-Seq samples assembled previously to generate a a global, unified set of transcripts (isoforms) across multiple RNA-Seq samples." which means, the merge option will only produce transcripts with robust expression (or whatever expression cutoff one selects). Are you using a reference transcriptome file in assembly? you can also use gff compare to check the accuracy of your files.

              Cheers

              Comment

              Latest Articles

              Collapse

              • seqadmin
                Current Approaches to Protein Sequencing
                by seqadmin


                Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                04-04-2024, 04:25 PM
              • seqadmin
                Strategies for Sequencing Challenging Samples
                by seqadmin


                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                03-22-2024, 06:39 AM

              ad_right_rmr

              Collapse

              News

              Collapse

              Topics Statistics Last Post
              Started by seqadmin, 04-11-2024, 12:08 PM
              0 responses
              21 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 04-10-2024, 10:19 PM
              0 responses
              23 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 04-10-2024, 09:21 AM
              0 responses
              18 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 04-04-2024, 09:00 AM
              0 responses
              49 views
              0 likes
              Last Post seqadmin  
              Working...
              X