Seqanswers Leaderboard Ad

**Ajayi Oyeyemi** · 04-05-2013, 08:02 AM

Simon and all,

I found another one(see the attached). I used cufflinks to assemble the transcripts and it appears that cufflinks assembled the reads in between these two genes as shown. I went further in ensemble and I realised that this (I mean reads between these two genes) were supported by EST alignments (Sequence: GH304014.1 : HW_FAT_169_A03.ab1 HW cDNA library intermuscular fat Bos taurus cDNA clone 169_A03, mRNA sequence.). This is the first time I'm analysing rna-seq data and I basically don't have a clue as to what's going on.

Is there any inference one could deduce from this?

Attached Files

ESTs_igv_snapshot.png (15.0 KB, 4 views)

**Darwin** · 04-05-2013, 08:04 AM

I guess you could filter the reads and only keep the reads that map uniquely (samtools view with the -q option should do that). I would worry about relying on degraded samples and the potential biases introduced, though.

I suspect that you've got a perfect storm of incomplete annotation of gene models (which primarily affects UTRs), a highly paralogous genome, and 3' bias in your sample.

How did the read quality look and what kind of sequencing did you do (length, paired/single)?

**Simon Anders** · 04-05-2013, 08:17 AM

I usually work with model organisms, so I am used to quite complete annotations. For cow, it is maybe not surprising that there are genes missing. Whether it is worth spending time on filling in these missing genes is another question. After all, if you find some of them to be differentially epxressed, you won't know what their function is, anyway. Maybe some other users working with non-model organisms have mroe advice there.

I would still try to figure out first why most of your reads seem to map to repetitive regions. Take a few of those reads with NH:20 and maybe BLAST their sequence against the NCBI nucleotide collection. That is sometimes helpful.

On the other hand, I just wondered whether you really have that many NH:20 reads. You have 32,980,337 SAM records with NH>1. however, if most of these have NH:20, then they all occur 20 times in your SAM file (NH is the number of _reported_ alignments), and so, these are only around 2 million reads, which would be less than 5% of your total number of reads and hence nothing to worry that much about.

**Ajayi Oyeyemi** · 04-05-2013, 08:42 AM

Originally posted by Darwin View Post

I guess you could filter the reads and only keep the reads that map uniquely (samtools view with the -q option should do that). I would worry about relying on degraded samples and the potential biases introduced, though.

I suspect that you've got a perfect storm of incomplete annotation of gene models (which primarily affects UTRs), a highly paralogous genome, and 3' bias in your sample.

How did the read quality look and what kind of sequencing did you do (length, paired/single)?

Thanks Darwin and truthfully, it is a perfect storm indeed. I used fastX tools to clip and trim my sequences. The boxplots ranged from 30 to 40. The size of the pictures is more than what it can accept as the other end is scrapped off. I can send it to you as an e-mail attachment if you don't mind.

Moreso, these are single end reads...

Thanks.

Attached Files

boxplot picture 22.png (41.3 KB, 11 views)

**Darwin** · 04-05-2013, 09:05 AM

sure, if you can't resize the picture (which is probably better so that others can chime in). dichmann<at>berkeley<dot>edu. What's on the uploaded picture looks good.

As Simon points out, it would be helpful if you analyzed the multi mapping statistics.

It still leaves you with 3'bias, but on the other hand some protocols for DE genes rather than transcript discovery actually do targeted sequencing of the 3'end of transcripts. But again, I have no idea of potential biases appearing after RNA degradation.

**Ajayi Oyeyemi** · 04-05-2013, 09:05 AM

Originally posted by Simon Anders View Post

I usually work with model organisms, so I am used to quite complete annotations. For cow, it is maybe not surprising that there are genes missing. Whether it is worth spending time on filling in these missing genes is another question. After all, if you find some of them to be differentially epxressed, you won't know what their function is, anyway. Maybe some other users working with non-model organisms have mroe advice there.

I would still try to figure out first why most of your reads seem to map to repetitive regions. Take a few of those reads with NH:20 and maybe BLAST their sequence against the NCBI nucleotide collection. That is sometimes helpful.

On the other hand, I just wondered whether you really have that many NH:20 reads. You have 32,980,337 SAM records with NH>1. however, if most of these have NH:20, then they all occur 20 times in your SAM file (NH is the number of _reported_ alignments), and so, these are only around 2 million reads, which would be less than 5% of your total number of reads and hence nothing to worry that much about.

@Simon.,
I'll do some of the sequences and blast them as advised. However, a thorough inspection of my sam file revealed that there are some with NH:2, NH:4 NH:16 and NH:20. I guess that's the reason why I have millions of reads with no feature.

While I await input from scientist working with non-model organisms, Is it logical to re-run some tophat alignments using either UCSC or NCBI? (I had wanted to use DESeq that was why I used ensemble anyway) and even if it is logical which of the UCSC or NCBI is more compatible with htseq and DEXSeq scripts? I'm just worried about millions of reads with no features.
Thanks Simon.

**Darwin** · 04-05-2013, 09:26 AM

Your quality scores are great.

I used samtools view -q 50 to get unique maps in my data. lowering -q leaves in progressively more multi aligned reads, so that -q1 filters reads that align 10 times or more. Combine with -c option to output the number of reads. Consult the manual for more.

**Simon Anders** · 04-05-2013, 10:26 AM

No need to rerun tophat, if you want to use a more complete GTF file, provided the new GTF file uses the same coordinate system (assembly) as the old one. If you use something other than Ensembl GTF file, have a very close look at the gene ID attributes. (UCSC tends to mess them up; different transcripts of the same gene must have the same gene ID; but this can be corrected with some fiddling)

**Ajayi Oyeyemi** · 04-09-2013, 06:37 AM

@Darwin,

Thanks for the advice. I followed your lead and I use the following code:
samtools view -q 50 -c sample1.bam > sample_uniquehits.bam

count = 39647278

samtools view -q 1 -c sample1.bam > sample_uniquehits.bam

count = 44277228

For my second sample:
-q50 option gave 39847467 while -q1 option gave 46427418.

Any clue given my output from HTseq count? I have ~48 million reads and ~50 million reads for sample 1 and 2 respectively.

**Ajayi Oyeyemi** · 04-09-2013, 06:40 AM

Originally posted by Simon Anders View Post

No need to rerun tophat, if you want to use a more complete GTF file, provided the new GTF file uses the same coordinate system (assembly) as the old one. If you use something other than Ensembl GTF file, have a very close look at the gene ID attributes. (UCSC tends to mess them up; different transcripts of the same gene must have the same gene ID; but this can be corrected with some fiddling)

Thanks Simon.

Given that I don't have much experience with the fiddling strategy you proposed, I'll just stick to ensemble anyway. I used samtools to check counts for unique alignments and I just posted it. If I can't wiggle my way out, I'll just use the htseq-count output in DESeq.

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 39 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 41 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 35 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 55 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News