% On-Off target - SEQanswers

rogic replied

04-08-2013, 12:43 PM
Hello all,

one follow-up question regarding the baits file - I downloaded the file from Agilent earray website (SureSelect Human All Exon 50Mb), however it does not contain the strand info which seem to be needed by Picard's CalculateHsMetrics tool. Did the bait files that other people used had the strand info? If yes, where did they get them? If not, how did they circumvent Picard's requirement? Is the strand info actually used by the tool?

Thanks,
Sanja
Leave a comment:
chongm replied

01-17-2013, 03:13 PM
% reads on target does not equal % bp on target

Originally posted by ECO View Post

Another vote for Picard's CalculateHsMetrics. It's in the public Galaxy (http://main.g2.bx.psu.edu/ under "NGS: Picard (beta)").

I just wanted to note that Picard's hsmetrics will calculate the percentage of basepairs that map to target regions which isn't the same as the percentage of reads that map to target regions. e.g. if a read sits right on a target region, (correct me if i'm wrong), then picard's hsmetrics (PCT_USABLE_BASES_ON_BAIT) will count 50 bp out of 100 as mapping to the target region, whereas if you are taking % reads mapping to target region, you will count that 1 read as mapping to the target region.
Leave a comment:
chongm replied

01-16-2013, 02:17 PM
Thanks!

Thank you Jon_Keats! The -H seemed to have done the trick!

chongm
Leave a comment:
Jon_Keats replied

01-16-2013, 12:02 PM
Big H not little

samtools view -H

Not

samtools view -h
Leave a comment:
bwubb replied

01-16-2013, 10:20 AM
Can you show a bit of your intervals file?

Some of the header and some of the actual intervals.

Not sure if that is the issue, but its where I usually start.
Leave a comment:
chongm replied

01-16-2013, 10:05 AM
Has anyone used hsmetrics for Illumina truseq exome targets? I'm having issues running this (even though I'm using the same modified bed file - with header from samtools - for both target and bait intervals).

Here is what I've done

samtools view -h sample.bam > header.txt
awk -F $'\t' 'BEGIN { OFS = FS } {print $1,$2,$3,$6,$4;}' truseq.bed > intervals
cat header.txt intervals > truseq_intervals

Then I ran the hsmetrics

java -Xmx4g -Djava.io.tmpdir=/tmp/ \
-jar CalculateHsMetrics.jar \
INPUT=sample.bam \
OUTPUT=sample.hsmetrics \
TARGET_INTERVALS=truseq_intervals\
BAIT_INTERVALS=truseq_intervals

I run into the following error:

Exception in thread "main" net.sf.picard.PicardException: Invalid interval record contains 22 fields: MISEQ:23:000000000-A34AH:1:1111:11133:10035 99 chrM 339 60 151M = 469 283 ACACATCTCTGCCAAACCCCAAAAACAAAGAACCCTAACACCAGCCTAACCAGATTTCAAATTTTATCTTTTGGCGGTATGCACTTTTAACAGTCACCCCCCAACTAACACATTATTTTCCCCTCCCACTCCCATACTACTAATCTCATCA AAAAABBBDDDDDDEEGGGGGGIIIHIIIIIIIIHIIIIIIHIIIIIIIIIIIIIIIIIIHHIIIIIIIIIIIHHHHEEHIIHHHHHHHHHHHHHHHHFHHGGGGHGGGGGGGGGGGGGGEGGGGGGGGGGGGGEGGGGGGEGGGGGGGGG X0:i:1 X1:i:0 MD:Z:71A79 RG:Z:Exome_2011-002 XG:i:0 AM:i:37 NM:i:1 SM:i:37 XM:i:1XO:i:0 XT:A:U
at net.sf.picard.util.IntervalList.fromReader(IntervalList.java:209)
at net.sf.picard.util.IntervalList.fromFile(IntervalList.java:169)
at net.sf.picard.analysis.directed.CollectTargetedMetrics.doWork(CollectTargetedMetrics.java:99)
at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:177)
at net.sf.picard.analysis.directed.CalculateHsMetrics.main(CalculateHsMetrics.java:73)

Can anyone help with this? I'm new to picard.
Leave a comment:
pengchy replied

09-16-2012, 06:35 AM
Originally posted by bwubb View Post

So this method more or less worked. Picard seems to demand the columns be tab separated so my awk was more like:

awk -F $'\t' 'BEGIN { OFS = FS } {print $1,$2,$3,$6,$4;}' SureSelect_baits.bed > bi.txt

Figured Id share.

I had a question about the header still though, and I expect this is something I just dont understand about the conversion to bam process or something with picard.

The CalculateHSMetrics still yells at me that interval file needs a header.
It seems my aligned_reads.bam files are lacking "@HD VN:1.0 SO:coordinate" at the very top. Is this abnormal?

If I use a bam file that has gone through Picard Read group assignment it does have the @HD etc., but it also will have a @RG line as well.

So do these interval files need to be made for each sample after RG assignment?

I think the “@HD VN:1.0 SO:coordinate” information can be added manually to the top of the header.
But why not "samtools sort" added itself?
Leave a comment:

bwubb replied

09-04-2012, 02:24 PM

Ok this is really frustrating. Still is complaining I do not have header in my interval file. Here is a sample of my file...

Code:

@HD	VN:1.0	SO:coordinate
@SQ	SN:1	LN:249250621
@SQ	SN:2	LN:243199373
@SQ	SN:3	LN:198022430
@SQ	SN:4	LN:191154276
@SQ	SN:5	LN:180915260
@SQ	SN:6	LN:171115067
@SQ	SN:7	LN:159138663
@SQ	SN:8	LN:146364022
...

@SQ	SN:GL000192.1	LN:547496
@RG	ID:4346_TTAGGC_ID	PL:illumina	PU:TTAGGC	LB:4346_TTAGGC_LB	SM:4346_TTAGGC_SM
@PG	ID:bwa	PN:bwa	VN:0.5.9-r16
1	45787138	45787258	+	BI426105800_19859
1	45787178	45787298	+	BI426105800_19860
1	45790352	45790472	+	BI426105800_19867
1	45790392	45790512	+	BI426105800_19868
1	45791243	45791363	+	BI426105800_19875
1	45791283	45791403	+	BI426105800_19876
...

Am I doing it wrong? Does the Baits file also need to be in interval format? Thank you.

Leave a comment:

bwubb replied

08-24-2012, 12:14 PM
Originally posted by ECO View Post

That intervals file is annoying to make... here's how I do it (basically you need to add a SAM header and rearrange some columns):

Code:

#Input files to CalculateHsMetrics need SAM header on an interval file ("picard interval file") #example here: ftp://ftp.broadinstitute.org/pub/gsa/exampleFiles/thousand_genomes_alpha_redesign.targets.interval_list #put header from bam file at the top of the BI file above "baits.txt" samtools view -H aligned_reads.bam > header.txt #interval file needs to look like this: #1 1104841 1104940 + target_1 #1 1105283 1105599 + target_2 #1 1105712 1105860 + target_3 #rearrange columns of baits bed file, and add SAM header awk '{print $1,$2,$3,$6,$4;}' SureSelect_baits.bed > bi.txt cat header.txt bi.txt > baits.txt

So this method more or less worked. Picard seems to demand the columns be tab separated so my awk was more like:

awk -F $'\t' 'BEGIN { OFS = FS } {print $1,$2,$3,$6,$4;}' SureSelect_baits.bed > bi.txt

Figured Id share.

I had a question about the header still though, and I expect this is something I just dont understand about the conversion to bam process or something with picard.

The CalculateHSMetrics still yells at me that interval file needs a header.
It seems my aligned_reads.bam files are lacking "@HD VN:1.0 SO:coordinate" at the very top. Is this abnormal?

If I use a bam file that has gone through Picard Read group assignment it does have the @HD etc., but it also will have a @RG line as well.

So do these interval files need to be made for each sample after RG assignment?
Leave a comment:
bwubb replied

07-18-2012, 10:41 AM
Oh wonderful. Thank you so much.
Leave a comment:

ECO replied

07-18-2012, 10:14 AM

That intervals file is annoying to make... here's how I do it (basically you need to add a SAM header and rearrange some columns):

Code:

#Input files to CalculateHsMetrics need SAM header on an interval file ("picard interval file")
#example here: ftp://ftp.broadinstitute.org/pub/gsa/exampleFiles/thousand_genomes_alpha_redesign.targets.interval_list

#put header from bam file at the top of the BI file above "baits.txt"
samtools view -H aligned_reads.bam > header.txt

#interval file needs to look like this:
#1    1104841    1104940    +    target_1
#1    1105283    1105599    +    target_2
#1    1105712    1105860    +    target_3

#rearrange columns of baits bed file, and add SAM header 
awk '{print $1,$2,$3,$6,$4;}' SureSelect_baits.bed > bi.txt
cat header.txt bi.txt > baits.txt

Leave a comment:

bwubb replied

07-18-2012, 08:47 AM
Does anyone know what format the INTERVALS files need to be? Im using simple bed files but I keep getting this error

Exception in thread "main" java.lang.IllegalStateException: Interval list file must contain header.

I tried adding a Chr Start End header but it doesnt like this either. The simplicity of this is confusing me I guess.

Thanks.
Leave a comment:
gwilymh replied

07-08-2012, 01:46 PM
Originally posted by Heisman View Post

Somewhere there is a web page that has the code for each of the programs... I stumbled on it before and am not really a computer guy so don't know where to find it. But it had 250 set for that metric.

Thank you for your explanation
Leave a comment:
Heisman replied

07-08-2012, 01:38 PM
Somewhere there is a web page that has the code for each of the programs... I stumbled on it before and am not really a computer guy so don't know where to find it. But it had 250 set for that metric.
Leave a comment:
gwilymh replied

07-08-2012, 01:22 PM
Originally posted by Heisman View Post

Picard by default does +/- 250 bp. I recommend using the same file for baits and targets: you can have baits that extend past targets and hence get more coverage for baits than for target if you had all of your target sequence covered by baits. If you had say only 80% of your targets covered by baits, then you already know this, and it just complicates things to try to consider it again. So, again, I recommend using the actual bait intervals if possible.

I want to verify that picard does indeed use a +/- 250 bp around each target/bait, but have so far not been able to find this written down anywhere. Where did you come by this information?

Also, can the interval be modified (to, say, +/- 300bp)?
Leave a comment:

Previous 1 2 template Next

Essential Discoveries and Tools in Epitranscriptomics

by seqadmin

The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
- Channel: Articles
04-22-2024, 07:01 AM
Current Approaches to Protein Sequencing

by seqadmin

Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
- Channel: Articles
04-04-2024, 04:25 PM

Topics	Statistics	Last Post
Expanding the Horizons of Cellular Research with the Single Cell Atlas by seqadmin Started by seqadmin, 04-25-2024, 11:49 AM	0 responses 19 views 0 likes	Last Post by seqadmin 04-25-2024, 11:49 AM
Genetic Variants and Diabetes Risk in Childhood Cancer Survivors by seqadmin Started by seqadmin, 04-24-2024, 08:47 AM	0 responses 17 views 0 likes	Last Post by seqadmin 04-24-2024, 08:47 AM
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 62 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 60 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM

Seqanswers Leaderboard Ad

Announcement

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Latest Articles

ad_right_rmr

News