Unconfigured Ad

**alexdobin** · 02-21-2013, 02:00 PM

1. The max number of mismatches depends on many factors, the most important being read length (more mismatches for longer reads), sequencing quality (more mismatches for poorer quality), sequence divergence of your sample with respect to the reference.

Since you have reads of widely varying length (are these short RNA?), I would recommend setting the max number of mismatches relative to the read length using
--outFilterMismatchNoverLmax 0.05
This is what we typically use for ENCODE small RNA-seq. This means that for reads <20b no mismatches are allowed, 20-39b: 1 mismatch, 40-59b 2 mismatches and so on.

2. Multi-mapping reads are quite common especially when you map short reads. 24% is a decent number for short RNA-seq. What to do with multi-mappers is a complicated issue. Typically they are used to represent expression of whole classes/families of RNA (repeats (e.g. transposons), gene families etc). If you are looking at your data in a browser, it's useful to have two separate tracks - for unique mappers and multi-mappers.

3. We typically check correlation between bio-reps for read counts on annotated small RNAs. A good tool to do it is 'coverageBed' from BEDtools package.
Another nice tool is HTseq.

**babi2305** · 02-22-2013, 12:51 AM

Originally posted by alexdobin View Post

1. The max number of mismatches depends on many factors, the most important being read length (more mismatches for longer reads), sequencing quality (more mismatches for poorer quality), sequence divergence of your sample with respect to the reference.

.

Thankyou alexdobin. It is much help. I am just in starting phase right now.By the way I also need to compare between ployA+ and poly A- RNA-seq data of cytosol and nucleus(A+/A-) taken from encode, for some preliminary analysis. Can you please refer some papers or give any idea what things can I do with this data. I am not finding any reference on poly A- analysis.

Thanks again,

**alexdobin** · 02-25-2013, 05:20 PM

Originally posted by babi2305 View Post

Thankyou alexdobin. It is much help. I am just in starting phase right now.By the way I also need to compare between ployA+ and poly A- RNA-seq data of cytosol and nucleus(A+/A-) taken from encode, for some preliminary analysis. Can you please refer some papers or give any idea what things can I do with this data. I am not finding any reference on poly A- analysis.

Thanks again,

We have not done a lot of analysis within ENCODE comparing A+ vs A- samples. There are a number of interesting questions one could try to explore. The obvious thing is that A- data contain a significant number intronic reads, probably originating from primary transcripts. Many RNAs are not not to be polyadenylated, histone genes being the well-known example, while it is not as well understood for more recently discovered classes such as lncRNAs or eRNAs.

**babi2305** · 02-26-2013, 08:56 AM

Cufflink producing 0.0 FPKM values for Ensembl matched transcripts

Originally posted by alexdobin View Post

We have not done a lot of analysis within ENCODE comparing A+ vs A- samples. There are a number of interesting questions one could try to explore. The obvious thing is that A- data contain a significant number intronic reads, probably originating from primary transcripts.

Thanks so much for replying. I couldn't find much reference on poly A+/polyA-..still i have started my work with the analysis of cufflinks output. Now problem is using cufflink '-g' option and giving Ensembl GTF file (hg19), & .bam as input, I have got different output files of genes/isoforms/transcripts.fpkm.tracking. But cufflink is only showing FPKM values for novel transcripts (eg CUFF.1.X CUFF.2.X etc) but no FPKM values for ENSEMBL matched transcripts, only zeros. Also, in former case FPKM values are too high like >400 or even >800. One reason i can think of is that ENCODE data is based on STAR alignment and cufflinks works better with TopHat sam/bam. Do you have any idea what should I do to get the FPKM values? do I need to change my input files? can any other tool calculate it for me?

Thanks.

**alexdobin** · 02-26-2013, 07:52 PM

Originally posted by babi2305 View Post

Thanks so much for replying. I couldn't find much reference on poly A+/polyA-..still i have started my work with the analysis of cufflinks output. Now problem is using cufflink '-g' option and giving Ensembl GTF file (hg19), & .bam as input, I have got different output files of genes/isoforms/transcripts.fpkm.tracking. But cufflink is only showing FPKM values for novel transcripts (eg CUFF.1.X CUFF.2.X etc) but no FPKM values for ENSEMBL matched transcripts, only zeros. Also, in former case FPKM values are too high like >400 or even >800. One reason i can think of is that ENCODE data is based on STAR alignment and cufflinks works better with TopHat sam/bam. Do you have any idea what should I do to get the FPKM values? do I need to change my input files? can any other tool calculate it for me?

Thanks.

The chromosome names in hg19 start with 'chr' (like chr1,chr2...), while ENSEMBL does not use the 'chr' prefix. I suspect this causes Cufflinks to not report any FPKM for annotated transcripts. You need to rename ENSEMBL chromosomes to be compatible with hg19. Note that 'chrM' in hg19 is called 'MT' in ENSEMBL, and I believe some scaffolds are named entirely differently. Another option is to use GENCODE annotations, which are basically ENSEMBL but compatible with hg19.

I recommend first to run Cufflinks with -G annot.gtf option (not the -g), which quantifies the annotated ones but does not assemble novel transcripts. Once you are satisfies with the results, you can run the (guided) assembly.

**babi2305** · 03-05-2013, 06:00 AM

How cuffdiff works? Need to know the novel transcripts

Originally posted by alexdobin View Post

The chromosome names in hg19 start with 'chr' (like chr1,chr2...), while ENSEMBL does not use the 'chr' prefix. I suspect this causes Cufflinks to not report any FPKM for annotated transcripts. You need to rename ENSEMBL chromosomes to be compatible with hg19. Note that 'chrM' in hg19 is called 'MT' in ENSEMBL, and I believe some scaffolds are named entirely differently. Another option is to use GENCODE annotations, which are basically ENSEMBL but compatible with hg19.

I recommend first to run Cufflinks with -G annot.gtf option (not the -g), which quantifies the annotated ones but does not assemble novel transcripts. Once you are satisfies with the results, you can run the (guided) assembly.

Thanks so much, I used the gtf file from gencode v15 and it worked fine. Now I am stuck with cuffdiff output. Can you also give me an idea about how cuffdiff works? I am more interested in finding novel genes/isoforms. I gave the combined.gtf (from cuffcompare) files & sam files of both my samples to cuffdiff. I wanted to know, which genes/isoforms are the one that are novel with respect to the other one. Does cuffdiff returns the novel ones also or just the differently expressed known genes ?

I ran cufflinks with -g option to get the novel transcripts. Somebody please tell me how cuffdiff works?

**alexdobin** · 03-05-2013, 08:02 AM

Originally posted by babi2305 View Post

Thanks so much, I used the gtf file from gencode v15 and it worked fine. Now I am stuck with cuffdiff output. Can you also give me an idea about how cuffdiff works? I am more interested in finding novel genes/isoforms. I gave the combined.gtf (from cuffcompare) files & sam files of both my samples to cuffdiff. I wanted to know, which genes/isoforms are the one that are novel with respect to the other one. Does cuffdiff returns the novel ones also or just the differently expressed known genes ?

I ran cufflinks with -g option to get the novel transcripts. Somebody please tell me how cuffdiff works?

Sorry I cannot give you any solid advise on using Cuffdiff and Cuffcompare - I have only been using Cufflinks for transcripts assembly from our data (and have not been happy with the results).

As far as I understand, Cuffdiff will give you the differentially expressed isoforms given the same list of transcripts for the two samples, but it will not tell you which isoforms are "structurally" different between two samples. You probably should be able to parse some of the Cuffcompare outputs to get the novel isoforms.

**babi2305** · 05-07-2013, 06:12 AM

Star doubt

Originally posted by alexdobin View Post

Since you have reads of widely varying length (are these short RNA?), I would recommend setting the max number of mismatches relative to the read length using
--outFilterMismatchNoverLmax 0.05
This is what we typically use for ENCODE small RNA-seq. This means that for reads <20b no mismatches are allowed, 20-39b: 1 mismatch, 40-59b 2 mismatches and so on..

Hello Alexander,

I am using STAR for my single-end small RNAseq data (15-46 read length) and its working great. thanks! However, since its new, I can't find much help on public forums, there are couple of things I need to know, and who can tell it better than you..

I used STAR with the following parameters:
--genomeDir </path/genome/hg19>
--genomeLoad NoSharedMemory
--readFilesIn <path/inputReadsFile/min15_max_46length_reads>
--runThreadN 4
--outFilterMismatchNoverLmax 0.05
--outSAMattributes All
--outSJfilterOverhangMin 20 10 10 10
--outFileNamePrefix <output_location>
--sjdbFileChrStartEnd <introns_ucsc_hg19>
--sjdbOverhang 45

Now, when I created contigs / or visualize it on genome browser I am getting large chunks of regions spanning 100,000bp and more in the genome, and there are many such occurrences in multiple files. which therefore is giving large set of contigs as well. Is this a mapping error?

Also, when I identified those regions from my bam files, I found bitwise flag '64' in the second column, and '255' in 5th (for uniquely mapped).

I have attached the screenshot, first one is contig file and second one is reads extracted from bam file.

Please help,

Thanks.

Attached Files

Screenshot-1.png (122.6 KB, 271 views)

**alexdobin** · 05-07-2013, 12:45 PM

Hi @babi2305,

I am not sure how you make contigs, but I suspect that these long clusters are made of alignments that are spliced. Some small RNA can be spliced (e.g. short mRNA decay products that happen to cross exon-exon junction). Unless you are looking for them specifically, I would suggest prohibiting splicing with --alignIntronMax 10.
If you do not want splicing, you also do not --sjdbFileChrStartEnd <introns_ucsc_hg19> --sjdbOverhang 45, and in any case, these parameters should be used at the genome generation step, not mapping step.

Note, that you can also filter out spliced alignments after the mapping - they will contain "N" in the CIGAR string.

We are routinely using STAR to map "small RNA" (~<200b) data within the ENCODE project with the following parameters:
--outFilterMismatchNoverLmax 0.05 --outFilterMatchNmin 16 --outFilterScoreMinOverLread 0 --outFilterMatchNminOverLread 0 --alignIntronMax 1
(>=16b matched to the genome, number of mismatches <= 5% of mapped length, i.e. 0MM for 16-19b, 1MM for 20-39b etc, splicing switched off).

You can clip 3' adapter before feeding the reads to STAR, or you can use simple built-in clipper --clip3pAdapterSeq TGGAATTCTC --clip3pAdapterMMp 0.1 (second parameter is the proportion of mismatches in the matched adapter length).

You would also likely want to filter out reads that STAR "genomically" trims at the 5'.
This simple awk one-liner will filter out all alignments that are trimmed by more than 1 base from the 5':

Code:

awk '{S=0; split($6,C,/[0-9]*/); n=split($6,L,/[NMSID]/);  if (and($2,0x10)>0 && C[n]=="S") {S=L[n-1]} else if (and($2,0x10)==0 && C[2]=="S") {S=L[1]}; if (S<=1) print }' Aligned.out.sam > Aligned.filtered.sam

Topics	Statistics	Last Post
High-Resolution Sequencing Exposes Hidden Toxoplasma Diversity by SEQadmin2 Started by SEQadmin2, Yesterday, 11:08 AM	0 responses 7 views 0 reactions	Last Post by SEQadmin2 Yesterday, 11:08 AM
New AI Model Captures Long-Range Genomic Signals to Improve RNA Splice Site Prediction by SEQadmin2 Started by SEQadmin2, 06-30-2026, 05:37 AM	0 responses 11 views 0 reactions	Last Post by SEQadmin2 06-30-2026, 05:37 AM
Large-Scale Protein Screen Uncovers Hidden Regulators of Alternative Polyadenylation by SEQadmin2 Started by SEQadmin2, 06-26-2026, 11:10 AM	0 responses 19 views 0 reactions	Last Post by SEQadmin2 06-26-2026, 11:10 AM
Whole-Genome Sequencing Traces Faroe Islands Ancestry to a North Atlantic Founder Population by SEQadmin2 Started by SEQadmin2, 06-17-2026, 06:09 AM	0 responses 53 views 0 reactions	Last Post by SEQadmin2 06-17-2026, 06:09 AM

Unconfigured Ad

STAR-alignment/confusion with mismatches & multi-mapping parameter

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News