Header Leaderboard Ad

Collapse

STAR-alignment/confusion with mismatches & multi-mapping parameter

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • STAR-alignment/confusion with mismatches & multi-mapping parameter

    Hello everyone,


    1. I used STAR for mapping RNA seq data with default parameters only changing --outFilterMismatchNmax 5 (allowing 5 mismatches, default was 10). I am now confused if I should have gone for 3 or less? My read length varies between 10-50. What is the ideal cutoff one should allow for mismatch parameter?

    2. Also, one should go for multi-mapping reads or not? because around 24% of reads in my case is showing multi-mapping, is it significant? i don't think so.. should I switch it to zero multi-mapping?

    3. I also want to do some statistics to see the correlation between my replicates, and some post-mapping statistical analysis like length counts, log(n) etc..can anybody refer a good tool for this..?

    Please suggest. thanks.

  • #2
    1. The max number of mismatches depends on many factors, the most important being read length (more mismatches for longer reads), sequencing quality (more mismatches for poorer quality), sequence divergence of your sample with respect to the reference.

    Since you have reads of widely varying length (are these short RNA?), I would recommend setting the max number of mismatches relative to the read length using
    --outFilterMismatchNoverLmax 0.05
    This is what we typically use for ENCODE small RNA-seq. This means that for reads <20b no mismatches are allowed, 20-39b: 1 mismatch, 40-59b 2 mismatches and so on.

    2. Multi-mapping reads are quite common especially when you map short reads. 24% is a decent number for short RNA-seq. What to do with multi-mappers is a complicated issue. Typically they are used to represent expression of whole classes/families of RNA (repeats (e.g. transposons), gene families etc). If you are looking at your data in a browser, it's useful to have two separate tracks - for unique mappers and multi-mappers.

    3. We typically check correlation between bio-reps for read counts on annotated small RNAs. A good tool to do it is 'coverageBed' from BEDtools package.
    Another nice tool is HTseq.

    Comment


    • #3
      Originally posted by alexdobin View Post
      1. The max number of mismatches depends on many factors, the most important being read length (more mismatches for longer reads), sequencing quality (more mismatches for poorer quality), sequence divergence of your sample with respect to the reference.

      .

      Thankyou alexdobin. It is much help. I am just in starting phase right now.By the way I also need to compare between ployA+ and poly A- RNA-seq data of cytosol and nucleus(A+/A-) taken from encode, for some preliminary analysis. Can you please refer some papers or give any idea what things can I do with this data. I am not finding any reference on poly A- analysis.

      Thanks again,

      Comment


      • #4
        Originally posted by babi2305 View Post
        Thankyou alexdobin. It is much help. I am just in starting phase right now.By the way I also need to compare between ployA+ and poly A- RNA-seq data of cytosol and nucleus(A+/A-) taken from encode, for some preliminary analysis. Can you please refer some papers or give any idea what things can I do with this data. I am not finding any reference on poly A- analysis.

        Thanks again,
        We have not done a lot of analysis within ENCODE comparing A+ vs A- samples. There are a number of interesting questions one could try to explore. The obvious thing is that A- data contain a significant number intronic reads, probably originating from primary transcripts. Many RNAs are not not to be polyadenylated, histone genes being the well-known example, while it is not as well understood for more recently discovered classes such as lncRNAs or eRNAs.

        Comment


        • #5
          Cufflink producing 0.0 FPKM values for Ensembl matched transcripts

          Originally posted by alexdobin View Post
          We have not done a lot of analysis within ENCODE comparing A+ vs A- samples. There are a number of interesting questions one could try to explore. The obvious thing is that A- data contain a significant number intronic reads, probably originating from primary transcripts.
          Thanks so much for replying. I couldn't find much reference on poly A+/polyA-..still i have started my work with the analysis of cufflinks output. Now problem is using cufflink '-g' option and giving Ensembl GTF file (hg19), & .bam as input, I have got different output files of genes/isoforms/transcripts.fpkm.tracking. But cufflink is only showing FPKM values for novel transcripts (eg CUFF.1.X CUFF.2.X etc) but no FPKM values for ENSEMBL matched transcripts, only zeros. Also, in former case FPKM values are too high like >400 or even >800. One reason i can think of is that ENCODE data is based on STAR alignment and cufflinks works better with TopHat sam/bam. Do you have any idea what should I do to get the FPKM values? do I need to change my input files? can any other tool calculate it for me?

          Thanks.

          Comment


          • #6
            Originally posted by babi2305 View Post
            Thanks so much for replying. I couldn't find much reference on poly A+/polyA-..still i have started my work with the analysis of cufflinks output. Now problem is using cufflink '-g' option and giving Ensembl GTF file (hg19), & .bam as input, I have got different output files of genes/isoforms/transcripts.fpkm.tracking. But cufflink is only showing FPKM values for novel transcripts (eg CUFF.1.X CUFF.2.X etc) but no FPKM values for ENSEMBL matched transcripts, only zeros. Also, in former case FPKM values are too high like >400 or even >800. One reason i can think of is that ENCODE data is based on STAR alignment and cufflinks works better with TopHat sam/bam. Do you have any idea what should I do to get the FPKM values? do I need to change my input files? can any other tool calculate it for me?

            Thanks.
            The chromosome names in hg19 start with 'chr' (like chr1,chr2...), while ENSEMBL does not use the 'chr' prefix. I suspect this causes Cufflinks to not report any FPKM for annotated transcripts. You need to rename ENSEMBL chromosomes to be compatible with hg19. Note that 'chrM' in hg19 is called 'MT' in ENSEMBL, and I believe some scaffolds are named entirely differently. Another option is to use GENCODE annotations, which are basically ENSEMBL but compatible with hg19.

            I recommend first to run Cufflinks with -G annot.gtf option (not the -g), which quantifies the annotated ones but does not assemble novel transcripts. Once you are satisfies with the results, you can run the (guided) assembly.

            Comment


            • #7
              How cuffdiff works? Need to know the novel transcripts

              Originally posted by alexdobin View Post
              The chromosome names in hg19 start with 'chr' (like chr1,chr2...), while ENSEMBL does not use the 'chr' prefix. I suspect this causes Cufflinks to not report any FPKM for annotated transcripts. You need to rename ENSEMBL chromosomes to be compatible with hg19. Note that 'chrM' in hg19 is called 'MT' in ENSEMBL, and I believe some scaffolds are named entirely differently. Another option is to use GENCODE annotations, which are basically ENSEMBL but compatible with hg19.

              I recommend first to run Cufflinks with -G annot.gtf option (not the -g), which quantifies the annotated ones but does not assemble novel transcripts. Once you are satisfies with the results, you can run the (guided) assembly.
              Thanks so much, I used the gtf file from gencode v15 and it worked fine. Now I am stuck with cuffdiff output. Can you also give me an idea about how cuffdiff works? I am more interested in finding novel genes/isoforms. I gave the combined.gtf (from cuffcompare) files & sam files of both my samples to cuffdiff. I wanted to know, which genes/isoforms are the one that are novel with respect to the other one. Does cuffdiff returns the novel ones also or just the differently expressed known genes ?

              I ran cufflinks with -g option to get the novel transcripts. Somebody please tell me how cuffdiff works?

              Comment


              • #8
                Originally posted by babi2305 View Post
                Thanks so much, I used the gtf file from gencode v15 and it worked fine. Now I am stuck with cuffdiff output. Can you also give me an idea about how cuffdiff works? I am more interested in finding novel genes/isoforms. I gave the combined.gtf (from cuffcompare) files & sam files of both my samples to cuffdiff. I wanted to know, which genes/isoforms are the one that are novel with respect to the other one. Does cuffdiff returns the novel ones also or just the differently expressed known genes ?

                I ran cufflinks with -g option to get the novel transcripts. Somebody please tell me how cuffdiff works?
                Sorry I cannot give you any solid advise on using Cuffdiff and Cuffcompare - I have only been using Cufflinks for transcripts assembly from our data (and have not been happy with the results).

                As far as I understand, Cuffdiff will give you the differentially expressed isoforms given the same list of transcripts for the two samples, but it will not tell you which isoforms are "structurally" different between two samples. You probably should be able to parse some of the Cuffcompare outputs to get the novel isoforms.

                Comment


                • #9
                  Star doubt

                  Originally posted by alexdobin View Post
                  Since you have reads of widely varying length (are these short RNA?), I would recommend setting the max number of mismatches relative to the read length using
                  --outFilterMismatchNoverLmax 0.05
                  This is what we typically use for ENCODE small RNA-seq. This means that for reads <20b no mismatches are allowed, 20-39b: 1 mismatch, 40-59b 2 mismatches and so on..
                  Hello Alexander,

                  I am using STAR for my single-end small RNAseq data (15-46 read length) and its working great. thanks! However, since its new, I can't find much help on public forums, there are couple of things I need to know, and who can tell it better than you..

                  I used STAR with the following parameters:
                  --genomeDir </path/genome/hg19>
                  --genomeLoad NoSharedMemory
                  --readFilesIn <path/inputReadsFile/min15_max_46length_reads>
                  --runThreadN 4
                  --outFilterMismatchNoverLmax 0.05
                  --outSAMattributes All
                  --outSJfilterOverhangMin 20 10 10 10
                  --outFileNamePrefix <output_location>
                  --sjdbFileChrStartEnd <introns_ucsc_hg19>
                  --sjdbOverhang 45


                  Now, when I created contigs / or visualize it on genome browser I am getting large chunks of regions spanning 100,000bp and more in the genome, and there are many such occurrences in multiple files. which therefore is giving large set of contigs as well. Is this a mapping error?

                  Also, when I identified those regions from my bam files, I found bitwise flag '64' in the second column, and '255' in 5th (for uniquely mapped).

                  I have attached the screenshot, first one is contig file and second one is reads extracted from bam file.

                  Please help,

                  Thanks.
                  Attached Files

                  Comment


                  • #10
                    Hi @babi2305,

                    I am not sure how you make contigs, but I suspect that these long clusters are made of alignments that are spliced. Some small RNA can be spliced (e.g. short mRNA decay products that happen to cross exon-exon junction). Unless you are looking for them specifically, I would suggest prohibiting splicing with --alignIntronMax 10.
                    If you do not want splicing, you also do not --sjdbFileChrStartEnd <introns_ucsc_hg19> --sjdbOverhang 45, and in any case, these parameters should be used at the genome generation step, not mapping step.

                    Note, that you can also filter out spliced alignments after the mapping - they will contain "N" in the CIGAR string.

                    We are routinely using STAR to map "small RNA" (~<200b) data within the ENCODE project with the following parameters:
                    --outFilterMismatchNoverLmax 0.05 --outFilterMatchNmin 16 --outFilterScoreMinOverLread 0 --outFilterMatchNminOverLread 0 --alignIntronMax 1
                    (>=16b matched to the genome, number of mismatches <= 5% of mapped length, i.e. 0MM for 16-19b, 1MM for 20-39b etc, splicing switched off).

                    You can clip 3' adapter before feeding the reads to STAR, or you can use simple built-in clipper --clip3pAdapterSeq TGGAATTCTC --clip3pAdapterMMp 0.1 (second parameter is the proportion of mismatches in the matched adapter length).

                    You would also likely want to filter out reads that STAR "genomically" trims at the 5'.
                    This simple awk one-liner will filter out all alignments that are trimmed by more than 1 base from the 5':
                    Code:
                    awk '{S=0; split($6,C,/[0-9]*/); n=split($6,L,/[NMSID]/);  if (and($2,0x10)>0 && C[n]=="S") {S=L[n-1]} else if (and($2,0x10)==0 && C[2]=="S") {S=L[1]}; if (S<=1) print }' Aligned.out.sam > Aligned.filtered.sam

                    Comment

                    Latest Articles

                    Collapse

                    • seqadmin
                      A Brief Overview and Common Challenges in Single-cell Sequencing Analysis
                      by seqadmin


                      ​​​​​​The introduction of single-cell sequencing has advanced the ability to study cell-to-cell heterogeneity. Its use has improved our understanding of somatic mutations1, cell lineages2, cellular diversity and regulation3, and development in multicellular organisms4. Single-cell sequencing encompasses hundreds of techniques with different approaches to studying the genomes, transcriptomes, epigenomes, and other omics of individual cells. The analysis of single-cell sequencing data i...

                      01-24-2023, 01:19 PM
                    • seqadmin
                      Introduction to Single-Cell Sequencing
                      by seqadmin
                      Single-cell sequencing is a technique used to investigate the genome, transcriptome, epigenome, and other omics of individual cells using high-throughput sequencing. This technology has provided many scientific breakthroughs and continues to be applied across many fields, including microbiology, oncology, immunology, neurobiology, precision medicine, and stem cell research.

                      The advancement of single-cell sequencing began in 2009 when Tang et al. investigated the single-cell transcriptomes
                      ...
                      01-09-2023, 03:10 PM

                    ad_right_rmr

                    Collapse
                    Working...
                    X