Seqanswers Leaderboard Ad

**alittleboy** · 06-30-2013, 06:18 PM

Originally posted by krespim View Post

I misread the outpout and thought that none of this reads were mapped - my bad. Nevertheless, 24M empty seems quite a bit to me. That said I also follow routinely your suggestion of looking at IGV shots to check if everything is Kosher. Another thing I do as a sanity check is to grep a few genes that I know are expressed in my samples just to see if they have a reasonable number of counts*.

@ alittleboy: I really don't what is going wrong there, your parameters seem correct. By wrong library I mean using "stranded" instead of default for instance

*funnily enough this happened the other day. I ran dexseq count on a number of files and my goto confirmation genes had the right counts in some files but not in others. No pattern. To cut a long story short, it turns out that during the transfer of the bam files from the cluster were the mapping was done, some files got corrupted (the file size remained unchanged). And yes, it was with scp.

Hi @krespim:

Thanks so much for your suggestions! I am working in a Linux environment, and all files are stored in cluster. Is there a way (please point out) that I can use IGV in the Red Hat Linux OS? Or is there any alternative ways for a sanity check?

I realize this issue ("_empty" so huge that many reads align outside of gene/exon) must be solved before I continue with subsequent DEXSeq steps... Is there a way to check what proportion for this "_empty"? Sorry I am a newbie in this bioinformatics area, and I'm not very familiar with relevant manipulations (but eager to learn

)

BTW, I think I merged BAM files from two lanes of a sample into a single BAM file. Is that a common practice? Or there is something wrong with the GTF file? Or this step:

samtools view combined.bam | sort -k1,1 -k2,2n | dexseq_count.py -p yes HS.GRCh37.72_norm.DEXSeq.gff - combined.txt

is problematic?

Thanks again for your kind help!

**krespim** · 06-30-2013, 11:50 PM

Use samtools to sort your file

Originally posted by alittleboy View Post

Hi @krespim:

Thanks so much for your suggestions! I am working in a Linux environment, and all files are stored in cluster. Is there a way (please point out) that I can use IGV in the Red Hat Linux OS? Or is there any alternative ways for a sanity check?

I realize this issue ("_empty" so huge that many reads align outside of gene/exon) must be solved before I continue with subsequent DEXSeq steps... Is there a way to check what proportion for this "_empty"?

It should be no problem as all you have to do is download IGV to the cluster and add the folder where igv.jar is your PATH(or ask the admnistrator). 2nd step is: do you login remotely to the server? something like:

ssh [email protected]

if so, just had to "-Y" to get a graphical interface to the server, meaning that when a window is open from the terminal on the server side you will see it on your local computer. This explains how to run igv from your command line.

Looking on IGV for genes that you expect high/low read count will tell you if all is well.

Originally posted by alittleboy View Post

Sorry I am a newbie in this bioinformatics area, and I'm not very familiar with relevant manipulations (but eager to learn

)

I am on the same boat

Here is some advice based on my experience:

For minor/basic things you would be best advised to get help from someone tech-savvy at your institute, like a linux administrator or a bioinformatitian. Just go and ask. That is what I do all the time. And if you plan on doing bioinformatics regularly do a course on basic programming/Unix like the very good Software Carpentry Boot Camps. Better still, go through the exercises of this book. I have it and it is a life saver.

Originally posted by alittleboy View Post

BTW, I think I merged BAM files from two lanes of a sample into a single BAM file. Is that a common practice? Or there is something wrong with the GTF file? Or this step:

BAM file merging it common practice and it is fine. If you are using the same GTF for mapping and read counting it also should not be the source of the problem

This I don't like because it is a hack when there are good tools that will do a better job (I've suggested it earlier):

Originally posted by alittleboy View Post

samtools view combined.bam | sort -k1,1 -k2,2n | dexseq_count.py -p yes HS.GRCh37.72_norm.DEXSeq.gff - combined.txt

After merging the BAM files do:
samtools sort merged.bam merged_sorted
samtools index merged_sorted.bam merged_sorted.bam.bai
This is the right way to sort a BAM file!
Then run DEXSeq without the "sort" command and using your sorted BAM file:
samtools view merged_sorted.bam | dexseq_count.py -p yes HS.GRCh37.72_norm.DEXSeq.gff - combined.txt

If there is a problem with your BAM file, samtools will probably detect it, "sort" will not.

Originally posted by alittleboy View Post

Thanks again for your kind help!

No problem. But use SAMTOOLS to sort your file.

**alittleboy** · 07-01-2013, 05:39 AM

Originally posted by krespim View Post

It should be no problem as all you have to do is download IGV to the cluster and add the folder where igv.jar is your PATH(or ask the admnistrator). 2nd step is: do you login remotely to the server? something like:

ssh [email protected]

if so, just had to "-Y" to get a graphical interface to the server, meaning that when a window is open from the terminal on the server side you will see it on your local computer. This explains how to run igv from your command line.

Looking on IGV for genes that you expect high/low read count will tell you if all is well.

I am on the same boat

Here is some advice based on my experience:

For minor/basic things you would be best advised to get help from someone tech-savvy at your institute, like a linux administrator or a bioinformatitian. Just go and ask. That is what I do all the time. And if you plan on doing bioinformatics regularly do a course on basic programming/Unix like the very good Software Carpentry Boot Camps. Better still, go through the exercises of this book. I have it and it is a life saver.

BAM file merging it common practice and it is fine. If you are using the same GTF for mapping and read counting it also should not be the source of the problem

This I don't like because it is a hack when there are good tools that will do a better job (I've suggested it earlier):

After merging the BAM files do:
samtools sort merged.bam merged_sorted
samtools index merged_sorted.bam merged_sorted.bam.bai
This is the right way to sort a BAM file!
Then run DEXSeq without the "sort" command and using your sorted BAM file:
samtools view merged_sorted.bam | dexseq_count.py -p yes HS.GRCh37.72_norm.DEXSeq.gff - combined.txt

If there is a problem with your BAM file, samtools will probably detect it, "sort" will not.

No problem. But use SAMTOOLS to sort your file.

Hi @krespim:

Thank you so much for the instructions and suggestions! I've order the book you mentioned, and plan to read it carefully ;-)

I tried to use your correct approach for sorting the BAM files, and in running dexseq_count.py, I got tons of warning messages:

HTSeq/__init__.py:598: UserWarning: Read FCD15N1ACXX:3:1110:18071:3100#TAGGAATA claims to have an aligned mate which could not be found. (Is the SAM file properly sorted?)
"which could not be found. (Is the SAM file properly sorted?)" )

I guess that's the problem?

Also, I notice that the sorted.bam file generated in this way is of the exactly the same size as what I originally did using the sort command, so I guess there might not be a difference... It's weird that originally I can get the TXT exon count files (but got an error this time...)

**krespim** · 07-01-2013, 07:04 AM

Originally posted by alittleboy View Post

Hi @krespim:

Thank you so much for the instructions and suggestions! I've order the book you mentioned, and plan to read it carefully ;-)

I tried to use your correct approach for sorting the BAM files, and in running dexseq_count.py, I got tons of warning messages:

HTSeq/__init__.py:598: UserWarning: Read FCD15N1ACXX:3:1110:18071:3100#TAGGAATA claims to have an aligned mate which could not be found. (Is the SAM file properly sorted?)
"which could not be found. (Is the SAM file properly sorted?)" )

I guess that's the problem?

Also, I notice that the sorted.bam file generated in this way is of the exactly the same size as what I originally did using the sort command, so I guess there might not be a difference... It's weird that originally I can get the TXT exon count files (but got an error this time...)

Well, this is a bit out of my league now. I did found this which might help:
http://seqanswers.com/forums/showthread.php?t=12376

Otherwise send and email to the HTseq authors. Good luck.

**areyes** · 07-01-2013, 07:57 AM

Are you sure you use "-n" option when doing samtools sort? In order to sort them by read name?

**alittleboy** · 07-01-2013, 08:03 AM

Originally posted by areyes View Post

Are you sure you use "-n" option when doing samtools sort? In order to sort them by read name?

Yes, this time I used "-n" option when doing samtools sort, and got the merged_sort.bam file. However, when I continue to run:

samtools index merged_sorted.bam merged_sorted.bam.bai

It has the error message:

[bam_index_core] the alignment is not sorted (FCD15N1ACXX:3:1101:1493:42913#NAGGAATA): 17-th chr > 5-th chr
[bam_index_build2] fail to index the BAM file.

Why? I think it is already sorted in the first step...

Thanks for your suggests, areyes ;-)

**areyes** · 07-01-2013, 09:59 AM

HI @alittleboy,

In order to index a bam file you need it to be sorted by position. However for htseq scripts you do not require the index, you just need the reads sorted by gene name. You should be able to proceed with the htseq scripts without the index!

Alejandro

**alittleboy** · 07-01-2013, 10:57 AM

Originally posted by areyes View Post

HI @alittleboy,

In order to index a bam file you need it to be sorted by position. However for htseq scripts you do not require the index, you just need the reads sorted by gene name. You should be able to proceed with the htseq scripts without the index!

Alejandro

Hi Alejandro:

Thanks for the update -- sorry I am a little confused by the different inputs from you, @krespim and the pasilla vignette... I apologize for my confusion. Here I list the steps that I use (from your last post)

samtools sort -n S1.bam S1_sorted

samtools view S1_sorted.bam | dexseq_count.py -p yes -s yes HS.GRCh37.72_norm.DEXSeq.gff - S1.txt

These two commands can run without error, and I think that's the way dexseq_count.py is supposed to do. I checked the results, but still found (I use a single BAM file instead of the merged file this time):

_ambiguous 0
_empty 9677861
_lowaqual 0
_notaligned 0

the huge number of _empty, which bothers me a lot... I really don't know if this is within a reasonable scope. Because I also used SplicingCompass for DEU, I compared the results with DEXSeq, and found that they're consistent for the very top genes (which makes me feel a little better...).

**dpryan** · 07-01-2013, 11:11 PM

Originally posted by alittleboy View Post

samtools sort -n S1.bam S1_sorted

samtools view S1_sorted.bam | dexseq_count.py -p yes -s yes HS.GRCh37.72_norm.DEXSeq.gff - S1.txt

That looks correct.

_ambiguous 0
_empty 9677861
_lowaqual 0
_notaligned 0

the huge number of _empty, which bothers me a lot... I really don't know if this is within a reasonable scope. Because I also used SplicingCompass for DEU, I compared the results with DEXSeq, and found that they're consistent for the very top genes (which makes me feel a little better...).

Without known how many reads mapped, it's difficult to know how good or bad that number is. Elsewhere, you mention using whole lanes, which would suggest to me that that number could be reasonable.

**alittleboy** · 07-02-2013, 03:49 AM

Originally posted by dpryan View Post

That looks correct.

Without known how many reads mapped, it's difficult to know how good or bad that number is. Elsewhere, you mention using whole lanes, which would suggest to me that that number could be reasonable.

Hi @dpryan:

Thanks for the clarifications! Here is some summary statistics from the folder (that contains accepted_hits.bam, junctions.bed files, etc.) with the file name "left_kept_reads.info":

min_read_len=49
max_read_len=49
reads_in =19390575
reads_out=19389262

From the file named "right_kept_reads.info", it shows:

min_read_len=49
max_read_len=49
reads_in =19390575
reads_out=19389169

It's paired-end, so the total number of reads is ~40 million? How can we know from this information that the "_empty" is large or at reasonable range? BTW, this is from a single lane. Actually there are 2 lanes for each sample, but I choose not to merge them (for fair methodological comparisons).

Thank you!

**dpryan** · 07-08-2013, 01:56 AM

You actually will only have ~20 million reads (mates will only be counted once), so that would mean that ~1/2 aren't aligning to annotated genes. That would seem quite high. I'm used to getting around 200 million reads from a lane (HiSeq) rather than ~20 million (were these run on a genome analyzer or multiplexed?), so my "could be reasonable" bit was dependent upon having more reads.

**alittleboy** · 07-08-2013, 05:59 AM

Originally posted by dpryan View Post

You actually will only have ~20 million reads (mates will only be counted once), so that would mean that ~1/2 aren't aligning to annotated genes. That would seem quite high. I'm used to getting around 200 million reads from a lane (HiSeq) rather than ~20 million (were these run on a genome analyzer or multiplexed?), so my "could be reasonable" bit was dependent upon having more reads.

Hi @dpryan:

Thank you for the update -- I also feel that ~1/2 not aligning to annotated genes is quite high... but unfortunately I don't know why.

The information above is from a single lane. In my experiment, each sample is sequenced in two lanes, and there are a total of 8 samples for each treatment group. Each lane has similar aligning rate (~1/2 not aligning). Thus, for detecting DEU purpose, do you think that would be enough (may be not...)? They were multiplexed as there are some other samples in one lane. However, our study just focuses on a particular sample (which is very likely to give us significant results).

So, due to our particular purpose of the study, we plan to just use the data from one lane for this sample. So, the consequence is that, we'll have power loss due to low coverage?

Thanks!

**dpryan** · 07-08-2013, 06:36 AM

You'd have to use something like scotty to see how many samples at what depth you might need to figure out what sort of power you might have. For most of my needs (differential expression in mice), 8 samples at that depth would work fine.

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 31 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 32 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 28 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 53 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News