  • sbcn
    Member
    • Oct 2012
    • 16

    stringtie parameters

    Hi,

    I have been trying to use Stringtie for transcriptome re-assembly, based on a reference gtf file.
    Here is how I ran it:

    # for each of the bam files from my project (aligned with tophat2):
    stringtie file.bam -G reference.gtf -o file_stringtie.gtf -p 4 -v -C file_coverage.txt -A file_gene_abundance.out

    # then merging all gtf files together:
    stringtie --merge -G reference.gtf -p 4 -o all_merged.gtf gtf_list.txt

    It is very straightforward. It is also incredibly fast as compared to the cufflinks + cuffmerge pipeline.

    But when I compare the number of transcripts found in the reference GTF file and in the output of Stringtie, it is dramatically different:
    awk '$3=="transcript"' reference.gtf | wc -l
    # 23963
    awk '$3=="transcript"' all_merged.gtf | wc -l
    # 57830

    I expect and hope for new transcripts, but I think this is a bit too much difference (Am I wrong?).

    How can I make the pipeline more stringent?

    Would you advise increasing the minimum input transcript coverage, for example, in the merging step?
    Also, looking at cuffmerge's parameters, the minimum isoform fraction is set to 0.05, while in stringtie it is 0.01 by default: is raising it the way to go?

    I have tried these parameters:

    stringtie --merge -c 2.5 -G reference.gtf -p 4 -o all_merged_bis.gtf gtf_list.txt
    awk '$3=="transcript"' all_merged_bis.gtf | wc -l
    # 57476

    stringtie --merge -f 0.05 -G reference.gtf -p 4 -o all_merged_ter.gtf gtf_list.txt
    awk '$3=="transcript"' all_merged_ter.gtf | wc -l
    # 36164

    I am merging together results from about 60 bam files, so I guess the approach can be different than for smaller projects.

    Thank you for any help and advice!

    Best,
  • SES
    Senior Member
    • Mar 2010
    • 275

    #2
    I would try gffcompare (by the same author) instead of "stringtie --merge" because it seems to be more stringent. I have also experienced the same issue that you report, but it is worse for a large genome. In my case, "stringtie --merge" generated 3X more transcripts than the reference, while gffcompare only generated about 2X more. You can also discard novel loci with gffcompare if you want to only consider the reference set.

    Alternatively, you can increase the thresholds for stringtie to merge transcripts.

    • sbcn
      Member
      • Oct 2012
      • 16

      #3
      Thanks a lot for your input.

      I have now tried gffcompare, but it is actually a lot worse in my case:

      gffcompare -r reference.gtf -s reference.fa -C -D -i gtf_list.txt

      awk '$3=="transcript"' gffcmp.combined.gtf | wc -l
      # 185653

      As I understand it, gffcompare creates the union of all the gtf files given as an input, and as I am merging about 60 files, I get a huge final number of transcripts.

      I think stringtie --merge is more appropriate in my case as it rather constructs a kind of consensus, so I will try and work on optimizing the parameters, although I would like to make sure not to be too stringent on some of them, and too flexible on others.
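To see how far a plain union could grow compared with a consensus, it may help to count the transcripts in each input file first. The following is a minimal sketch, assuming the list file holds one GTF path per line (as gtf_list.txt does above); the function name per_file_counts is made up for illustration. The total is only an upper bound for a union, since transcripts shared between samples are counted once per file.

```shell
# Print the number of transcript records in each GTF named in a list file
# (one path per line), followed by the total across all files.
per_file_counts() {
    total=0
    while IFS= read -r gtf; do
        # Skip empty lines in the list file
        [ -n "$gtf" ] || continue
        n=$(awk -F'\t' '$3 == "transcript" {c++} END {print c + 0}' "$gtf")
        printf '%s\t%s\n' "$gtf" "$n"
        total=$((total + n))
    done < "$1"
    printf 'TOTAL\t%s\n' "$total"
}
```

It would be called as: per_file_counts gtf_list.txt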

      • mpertea
        Junior Member
        • Mar 2012
        • 1

        #4
        It is very likely that most of the transcripts that make up the difference are intronic or intergenic single exon transcripts. Especially with such a large number of samples, there are many small fragments expressed all over the place. We are more aggressive in filtering these out in StringTie version 1.2.2 (just released today), so please give it a try.
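A rough way to check this on a merged GTF is to count how many transcripts have exactly one exon. This is a minimal sketch, assuming every exon line carries a transcript_id "..." attribute in column 9; the function name exon_profile is hypothetical. It does not tell intronic from intergenic transcripts, it only separates single-exon from multi-exon ones.

```shell
# Count single-exon and multi-exon transcripts in a GTF file by tallying
# exon records per transcript_id.
exon_profile() {
    awk -F'\t' '$3 == "exon" && match($9, /transcript_id "[^"]+"/) {
        # Use the matched attribute text as the per-transcript key
        n[substr($9, RSTART, RLENGTH)]++
    }
    END {
        for (id in n) { if (n[id] == 1) single++; else multi++ }
        printf "single_exon\t%d\nmulti_exon\t%d\n", single, multi
    }' "$1"
}
```

It would be called as: exon_profile all_merged.gtf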

        The other ways to filter out more of the transcripts are the -f parameter, as mentioned before, or the -F and -T parameters, which filter out transcripts of very low abundance in the samples. We prefer filtering with -F and -T over -f, because -f removes transcripts that have a relatively low abundance compared to the most abundant transcript in the bundle, even though the transcripts it removes are sometimes highly expressed.
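To get a feel for an abundance cutoff before re-running the merge, one can count how many transcripts in a single per-sample StringTie GTF reach a given TPM. This is only a sketch: it assumes the transcript lines carry a TPM "..." attribute, as StringTie's per-sample output does, and the function name count_tpm_at_least is made up. It previews one sample at a time and is not a reimplementation of what --merge does with -T.

```shell
# Count transcript records in a per-sample StringTie GTF whose TPM attribute
# is at least the given threshold.
count_tpm_at_least() {
    awk -F'\t' -v min="$2" '$3 == "transcript" && match($9, /TPM "[^"]+"/) {
        # Strip the leading TPM " and the trailing " to get the number
        tpm = substr($9, RSTART + 5, RLENGTH - 6)
        if (tpm + 0 >= min + 0) kept++
    }
    END { print kept + 0 }' "$1"
}
```

It would be called as: count_tpm_at_least file_stringtie.gtf 1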

        • SES
          Senior Member
          • Mar 2010
          • 275

          #5
          This is very helpful, thanks. One question I have is about the merging that gffcompare does vs. the "stringtie --merge" method. It seems like "stringtie --merge" is the more appropriate method for joining libraries from different tissues, followed by an assessment with gffcompare. Is this correct? The docs say that gffcompare also does merging, but it is not clear how this relates to what "stringtie --merge" is doing.

          • mcsimenc
            Junior Member
            • May 2013
            • 5

            #6
            Can anyone suggest an interpretation of the following results from stringtie --merge?

            Three stringtie assemblies with 29747, 30865, and 29863 transcripts are merged using stringtie --merge and the resulting gtf has only 25130 transcripts.

            Am I losing information? I do not know the internal workings of stringtie --merge but I intuitively expect to have no fewer transcripts than the input assembly with the fewest transcripts.

            Thanks!!
            Matt

            • rajeev.vikram
              Junior Member
              • Aug 2015
              • 6

              #7
              Hello Matt,

              According to my understanding, the number of merged transcripts reported depends on the relative expression of the transcripts in the input files. As the literature states, "generate a non-redundant set of transcripts observed in all the RNA-Seq samples assembled previously to generate a global, unified set of transcripts (isoforms) across multiple RNA-Seq samples", which means the merge option will only produce transcripts with robust expression (or whatever expression cutoff one selects). Are you using a reference transcriptome file in the assembly? You can also use gffcompare to check the accuracy of your files.

              Cheers
