Seqanswers Leaderboard Ad

**DarthMapper** · 09-20-2015, 11:28 AM

Greetings community! I am only a first year undergrad student, so please bare with my ignorance.

I have encountered some difficulties trying to use BBMap to align WGS pair end reads belonging to a Sheep to it's pertinent rna.fa file - I'm only interested in the organism's full mRNA sequences, but I do not know where else to get them.

The problem I'm facing surfaces after verifying Average Coverage and Standard Deviation calculations. While the coverage results are as expected, the standard deviation values are well over 600!

I will post the command line and the path to the reference data I am using below, to ease out the process of troubleshooting:

Reference Data: found under NCBI's ftp site, under the following path: genomes/Ovis_aries/RNA/rna.fa.gz
Here is a link that leads directly to the directory containing the reference file I am using: ftp://ftp.ncbi.nlm.nih.gov/genomes/Ovis_aries/RNA/

*If anyone else would kindly suggest a better way of obtaining only the full mRNA nucleotide sequences of the organism, please be kind enough to enlighten my poor soul*

Reads: I am using paired end fastq WGS reads that were used to build the organism's genome assembly. They have already undergone proper quality testing.

After I create my build using BBMap and my reference file (rna.fa), I proceed to align my reads to the reference build. Here is what my command line looks like:

bbmap.sh in1=read_1.fastq in2=read_2.fastq out=MappedFile.sam build=1 minratio=0.56

(I am trying to compare ambiguous values later on, so I need the min ratio. Making the "minratio=<>" parameter closer to 1 does lower the Standard Deviation, but it is still unusually high)

Cool, up to there everything runs smoothly. I take a look at my alignment results and proceed to calculate coverage and standard deviation values using pileup.sh. Again, here is my command line:

pileup.sh in=MappedFile.sam out=stats.txt hist=histogram.txt

After looking at the results, they look something like this:
Set USE_COVERAGE_ARRAYS to true
Set USE_BITSETS to false
Note: Coverage capped at 65535

Average Coverage: 37.84
Standard Deviation: 707.47
Percent scaffolds with any coverage: 80.76
Percent of reference bases covered: 35.87

I'd be incredibly thankful if anyone would shed some light into this little hindrance. What am I messing up?

**Brian Bushnell** · 09-20-2015, 07:13 PM

Well... I'm not quite sure why you're mapping DNA reads to the transcriptome. If the annotation is accurate, you already have the full mRNA nucleotide sequences in the file "rna.fa.gz" and don't need to generate a sam file at all, nor will WGS help you find the mRNA sequences with any methodology I can think of.

Are you sure you meant WGS (ie, DNA reads) rather than RNA-seq data? For RNA-seq data, to find full-length mRNAs, you would probably assemble it using an assembler like Trinity, or map it to the genome and use some other tool to guess isoforms from mapping data.

As for the high standard deviation - BBMap by default will map reads that could map to multiple locations (aka multimapping, ambiguous, or non-uniquely-mapping reads) to the first genomic location they can map to optimally. For this kind of purpose (RNA-seq, or mapping to a transcriptome) I recommend you set "ambig=random" or "ambig=all". Otherwise, for genes with multiple isoforms, all the reads from that gene will map to the first isoform containing the exon in question, leading to a high standard deviation. For mapping RNA to a transcriptome I generally recommend "ambig=random".

Also, if you ARE mapping DNA to the transcriptome (which, again, is probably not useful), you should set the "local" flag - otherwise reads that span intron/exon boundaries will have a lot of mismatches since there is intronic content in DNA reads, but none in the transcriptome.

-Brian

**DarthMapper** · 09-23-2015, 07:25 AM

I tried using the settings you had recommended, but the standard deviation is still awfully high. Would this be related to the fact that we are trying to map WGS sequences to a transcriptome?

Here are the results:

With ambig=random setting and local=t

Set USE_COVERAGE_ARRAYS to true
Set USE_BITSETS to false
Note: Coverage capped at 65535

Average coverage: 31.27
Standard deviation: 628.72
Percent scaffolds with any coverage: 99.94
Percent of reference bases covered: 75.89

Time: 2084.076 seconds.

----------------------------------------------------------------------------------------------------------------------

With ambig=random setting, but no local flag

Set USE_COVERAGE_ARRAYS to true
Set USE_BITSETS to false
Note: Coverage capped at 65535

Average coverage: 31.27
Standard deviation: 635.37
Percent scaffolds with any coverage: 99.94
Percent of reference bases covered: 77.75

Time: 1935.627 seconds.
--------------------------------------------------------------------------------------------------------------------
Thank you again Brian for your last reply!

**Brian Bushnell** · 09-23-2015, 10:56 AM

I don't understand why you are aligning DNA reads to a transcriptome. To clarify, are these DNA reads or RNA-seq reads? And can you explain what you're trying to do?

**DarthMapper** · 09-23-2015, 11:04 AM

I'm so sorry; I was editing my last reply when you responded.

I forgot to mention that we are trying to determine copy number via read depth analysis, so we need to know the average and SD read mapping to all genes present in the genome (to compare against the ambiguous genes in question).

We are using WGS reads, which are DNA, and we are mapping these to the transcriptome.

**HESmith** · 09-23-2015, 03:01 PM

You should be mapping to the genome, then extracting the information that corresponds to the transcriptome. I'm not surprised that your standard deviations are wonky. It's probably due to repeat elements in the genome that are under-represented in the transcriptome, which would produce much higher-than-expected coverage. You can check by BLASTing a few examples of your high-coverage regions against the genome, and see how many hits you get.

**DarthMapper** · 09-24-2015, 08:11 AM

Well, usually we map the WGS reads to the genome assembly for quality testing. But, how exactly would we extract the information corresponding to the transcriptome? Would I be able to use BBMap for this?

**Brian Bushnell** · 09-24-2015, 09:49 AM

I think you can do that with Bedtools and a gene annotation file.

**GenoMax** · 09-24-2015, 09:51 AM

One would need BBMap to create the alignment file that can then be used with bedtools

**JamesJoyce** · 10-04-2015, 01:23 PM

Hello Brian;
I am trying to do something similar to what DarthMapper was suggesting.
My PI had instructed me to determine the copy number of certain genes in a single organism by mapping WGS reads to said gene sequences as reference. I then am supposed to compare coverage values of these individual genes with that obtained from mapping the WGS reads to the whole set of mRNA sequences present in the studied organism's transcriptome. The problem is that I am also getting strange standard deviations when using BBMap.
What is wrong with my methodology?

**Brian Bushnell** · 10-04-2015, 04:54 PM

Hi James,

Can you give me the exact command line you are using?

Also, the complete output printed to the screen would be helpful, as well as the name of the organism.

**JamesJoyce** · 10-04-2015, 07:04 PM

Thanks for your reply Brian.
I am using BBMap to align WGS reads to a set of mRNA sequences extracted from the organism's NCBI transcriptome. This is being done with different organisms, such as mammals, reptiles, and fungi. So far, I have tested a few mammals, including Felix Catus (cat) and Rattus Norvegicus (Rat). Again, I was instructed to find the copy number of certain genes within a single organism. My PI told me to map WGS sequences to the transcriptome's mRNA sequences as reference before mapping to individual gene sequences, but we stopped once we saw the results.

The results for both mammals are very similar, so I will only post one example

Here is the commandline I am using and the results:

Index Build
bbmap.sh ref=mRNA.fa build=1

Alignment
bbmap.sh build=1 in1=catreads/read_1.fastq in2=catreads/read_2.fastq ambig=random local=t out=mRNA.sam

Coverage and Standard Deviation:
pileup.sh in=mRNA.sam out=covstats.txt hist=histogram.txt

P.S. I have tried to use samtools depth to calculate coverage and STDV with BBMap's sam outputs but it does not let me - even when I have converted the sam's to bam, and sorted the bam. How can this be achieved?

CoverageResults
Set USE_COVERAGE_ARRAYS to true
Set USE_BITSETS to false
Note: Coverage capped at 65535

Average coverage: 15.29
Standard deviation: 384.14
Percent scaffolds with any coverage: 99.81
Percent of reference bases covered: 69.32

Is there any way to lower that STDV?

**Brian Bushnell** · 10-05-2015, 10:42 AM

First off, you should expect a high standard deviation in this scenario. If one gene has 10 isoforms and another has 1 isoform, the gene with 1 isoform will have (on average) 10x the coverage of the isoforms from the gene with 10, when ambiguously-mapped reads are placed randomly. You need to use exactly one isoform per gene (preferably the longest).

However, it's also possible that there's a problem with the random selection of sites for ambiguously-mapped paired reads. Would you mind trying this (mapping with read 1 only):

bbmap.sh build=1 in1=catreads/read_1.fastq ambig=random local=t out=mRNA.sam covstats=covstats.txt hist=histogram.txt

Note that bbmap can internally generate coverage information and does not need a second pass from pileup. As for why samtools depth is not working, I have no idea. Does it give some kind of error message?

**GenoMax** · 10-05-2015, 10:47 AM

@Brian: How is this analysis is going to give information about number of copies of a gene to @DarthMapper/@JamesJoyce? Am I missing something obvious?

**Brian Bushnell** · 10-05-2015, 10:59 AM

Well... it is kind of a strange analysis... and a normal transcriptome file, with many isoforms per gene, will not very useful. But, if you had exactly one transcript per gene in your reference, then theoretically, genes with 2 copies should get double the coverage of genes with 1 copy, if you map uniform WGS data. Using the "local" flag would be helpful because wherever reads go into introns they will not match the transcriptome. Also, pseudogenes will cause confusion. So I'm not sure how much signal there will be compared to the noise, but with those assumptions (one transcript per gene, local alignments, fairly uniform coverage, no pseudogenes), it should work.

Topics	Statistics	Last Post
Expanding the Horizons of Cellular Research with the Single Cell Atlas by seqadmin Started by seqadmin, 04-25-2024, 11:49 AM	0 responses 15 views 0 likes	Last Post by seqadmin 04-25-2024, 11:49 AM
Genetic Variants and Diabetes Risk in Childhood Cancer Survivors by seqadmin Started by seqadmin, 04-24-2024, 08:47 AM	0 responses 16 views 0 likes	Last Post by seqadmin 04-24-2024, 08:47 AM
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 62 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 60 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM

Seqanswers Leaderboard Ad

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News