CNVkit: Copy number detection and visualization for targeted sequencing using off-tar

DrWorm replied

03-09-2016, 01:30 PM
errors in production of cns file

Hello all,

I'm having issues working through the install of CNVkit. I've pinpointed my issues to problems with the segment sub-command. When I try to run the test procedure, it dies with this error:

Traceback (most recent call last):
  File "../cnvkit.py", line 11, in <module>
    args.func(args)
  File "/home/snmcnulty/bin/cnvkit-master/cnvlib/commands.py", line 666, in _cmd_segment
    rlibpath=args.rlibpath)
  File "/home/snmcnulty/bin/cnvkit-master/cnvlib/segmentation/__init__.py", line 58, in do_segmentation
    segarr = cnarr.as_dataframe(seg2cns(seg_out))
  File "/home/snmcnulty/bin/cnvkit-master/cnvlib/segmentation/__init__.py", line 143, in seg2cns
    + seg_text)
ValueError: Segmentation output is not valid SEG format:

[then a big table]

Makefile:57: recipe for target 'build/p2-5_5.cns' failed
make: *** [build/p2-5_5.cns] Error 1

If anyone could help me pinpoint the problem, I would really appreciate it!
etal replied

03-08-2016, 07:33 AM
Hi Deepak,

The sequence IDs are the chromosome names in your reference genome and the first column of your BED file. For the human genome the first chromosome might be "1" or "chr1" depending on where you got your reference genome.

Check that the name schemes match between your BED and BAM files, either "1" or "chr1". You can use "samtools view -H" to see the BAM header. If the names don't match, then you should edit your BED file, either adding or removing the "chr" prefix, so that they do match.
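As a rough illustration of that check, here is a minimal Python sketch. It assumes you have saved the text printed by "samtools view -H" and that your BED file is tab-separated; the function names are hypothetical and are not part of CNVkit. It only toggles the "chr" prefix, so other naming differences (for example "chrM" versus "MT") would still need to be handled by hand.

```python
def bam_sequence_names(header_text):
    """Collect the SN: values from the @SQ lines of a SAM/BAM header."""
    names = set()
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        for field in line.split("\t")[1:]:
            if field.startswith("SN:"):
                names.add(field[3:])
    return names


def match_bed_chrom(chrom, bam_names):
    """Return chrom in the BAM naming scheme, or None if nothing matches."""
    if chrom in bam_names:
        return chrom
    # Try the same name with the "chr" prefix removed or added.
    toggled = chrom[3:] if chrom.startswith("chr") else "chr" + chrom
    return toggled if toggled in bam_names else None


def harmonize_bed(bed_lines, bam_names):
    """Yield BED lines whose first column follows the BAM naming scheme."""
    for line in bed_lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            yield line  # pass blank, comment and header lines through
            continue
        fields = line.rstrip("\n").split("\t")
        renamed = match_bed_chrom(fields[0], bam_names)
        if renamed is None:
            raise ValueError(
                f"No BAM sequence matches BED chromosome {fields[0]!r}")
        fields[0] = renamed
        yield "\t".join(fields) + "\n"
```

Write the yielded lines to a new BED file rather than overwriting the original, so you can compare the two before rerunning CNVkit.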

Hope that helps,
Eric
oxydeepu replied

03-08-2016, 04:25 AM
Sequence IDs don't match with bed Error CNVkit.

Hi all,

I am trying to run CNVkit on tumor and normal exome sequencing samples. I have tried everything mentioned in the docs, including different BED files as the target regions and different annotations, but I always end up with this error:
ValueError: BED file 'results/S04380110_Covered.target.bed' sequence IDs don't match any in BAM file.

Has anybody come across this issue? Please help.

Thank you in advance.

Deepak
etal replied

05-14-2015, 09:08 AM
The mapping quality threshold is hard-coded to only exclude unmapped or ambiguously mapped reads, see here:

https://github.com/etal/cnvkit/blob/master/cnvlib/coverage.py#L115

Just change the -Q value to another integer if you'd like to try it yourself. However, I recall reading (can't find the reference at the moment) that keeping low-MAPQ reads did not harm copy number estimation and may have improved it.

In any case, in CNVkit the script genome2access.py can be used to directly exclude poorly mappable genomic regions. This is done already for hg19 in the bundled file access-5k-mappable.hg19.bed.
wisekh replied

05-12-2015, 07:24 PM
using a mapping quality threshold

Dear Eric,
I have another question. Is it possible to apply a mapping quality threshold to, for example, a tumor BAM or multiple tumor BAMs from which I would infer copy number?

thank you,

Hoon
wisekh replied

05-11-2015, 08:08 PM
I got it. Thank you very much for your quick response.

hoon
etal replied

05-11-2015, 03:15 PM
Hi wisekh,

The LOH functionality in CNVkit is described here:

http://cnvkit.readthedocs.org/en/latest/plots.html#scatter

However, the "calls" are simply displayed visually -- the variant allele frequencies are plotted alongside the copy ratios, and a shift in VAF from 0.5 indicates LOH. I'm currently working on expanding this functionality to make it more useful.
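To make the "shift in VAF from 0.5" idea concrete, here is a minimal Python sketch. It is only an illustration and not how CNVkit computes or plots anything: the function names are hypothetical, and it assumes you already have alternate-allele and total read counts for heterozygous SNPs within one segment.

```python
def vaf(alt_reads, total_reads):
    """Variant allele frequency: fraction of reads supporting the alt allele."""
    return alt_reads / total_reads


def mean_vaf_shift(snps):
    """Mean absolute deviation of VAF from 0.5 over (alt, total) read counts.

    Values near 0 are consistent with balanced heterozygous SNPs; larger
    values suggest allelic imbalance such as LOH. Returns None when no SNP
    has coverage.
    """
    shifts = [abs(vaf(alt, total) - 0.5) for alt, total in snps if total > 0]
    return sum(shifts) / len(shifts) if shifts else None
```

For example, SNPs at roughly 80% and 20% VAF in the same segment give a mean shift of about 0.3, while balanced SNPs near 50% give a shift close to 0. What counts as a meaningful shift depends on depth and tumor purity, so no cutoff is suggested here.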

To run the complete pipeline with a tumor-normal pair and make a plot of the copy number and LOH shifts together, follow the quick start guide here:

http://cnvkit.readthedocs.org/en/latest/quickstart.html

Separately from running the CNVkit "batch" pipeline, you'll need to call SNPs in the tumor sample in VCF format. Then use that VCF file as input to the CNVkit "scatter" command along with the .cnr and .cns files from the CNVkit pipeline to make the plot.

Hope that helps,
Eric
wisekh replied

05-11-2015, 01:33 PM
calling LOH from a pair of tumor and normal exome bams

Dear Eric,

I wonder if you could provide instructions or an example showing how to generate LOH calls from a pair of tumor and normal exome BAMs.
I assume this tool can generate them, but I couldn't find a tutorial on the CNVkit website. Perhaps I missed it?

thank you very much in advance,

wisekh

lethalfang replied

04-06-2015, 10:15 AM

Originally posted by etal:

What files were generated by the second run, if any? Can you show me the status messages or any errors that were printed?

When you run CNVkit with a reference, you don't need the "--targets", "--fasta", "--split" and "--access" arguments as that information has already been captured in the reference file. The default output directory is the current directory ("."). Try this instead:

cnvkit.py batch T1.bam T2.bam T3.bam T4.bam T5.bam T6.bam T7.bam T8.bam T9.bam T10.bam -r N.cnn --scatter --diagram

Thanks. For the second run, the one I was asking about, there is no error message, and a set of .targetcoverage.cnn files is generated.

The status messages are:

Code:

Summary: #bins=292, #reads=7633921, mean=26143.5653, min=212.35, max=224646.99
On-target percentage: 37.877 (of 20154547 mapped)
Wrote ./T4.targetcoverage.cnn
Running the CNVkit pipeline on T9.bam ...
Processing reads in T9.bam
Time: 52.775 seconds (154291 reads/sec, 6 bins/sec)
Summary: #bins=292, #reads=8142711, mean=27885.9981, min=268.94, max=279225.27
On-target percentage: 38.472 (of 21165340 mapped)
Wrote ./T3.targetcoverage.cnn
Running the CNVkit pipeline on T10.bam ...
Processing reads in T10.bam
Time: 55.564 seconds (154063 reads/sec, 5 bins/sec)
Summary: #bins=292, #reads=8560291, mean=29316.0670, min=326.38, max=293055.07
On-target percentage: 38.593 (of 22181003 mapped)
Wrote ./T6.targetcoverage.cnn
Time: 57.101 seconds (149544 reads/sec, 5 bins/sec)
Summary: #bins=292, #reads=8539037, mean=29243.2782, min=302.54, max=292123.56
Time: 57.092 seconds (155098 reads/sec, 5 bins/sec)
Summary: #bins=292, #reads=8854878, mean=30324.9281, min=281.23, max=295576.58
Time: 57.155 seconds (147061 reads/sec, 5 bins/sec)
Summary: #bins=292, #reads=8405294, mean=28785.2540, min=254.16, max=285135.1
On-target percentage: 38.485 (of 22188207 mapped)
On-target percentage: 39.087 (of 22654458 mapped)
On-target percentage: 38.633 (of 21757000 mapped)
Wrote ./T2.targetcoverage.cnn
Wrote ./T8.targetcoverage.cnn
Wrote ./T1.targetcoverage.cnn
Time: 58.704 seconds (143289 reads/sec, 5 bins/sec)
Summary: #bins=292, #reads=8411624, mean=28806.9335, min=241.63, max=282191.54
On-target percentage: 38.845 (of 21654563 mapped)
Wrote ./T7.targetcoverage.cnn
Time: 60.050 seconds (150608 reads/sec, 5 bins/sec)
Summary: #bins=292, #reads=9044023, mean=30972.6832, min=298.46, max=306083.6
On-target percentage: 38.815 (of 23300051 mapped)
Wrote ./T5.targetcoverage.cnn
Time: 50.180 seconds (165905 reads/sec, 6 bins/sec)
Summary: #bins=292, #reads=8325083, mean=28510.5614, min=309.7, max=283705.95
On-target percentage: 38.622 (of 21555485 mapped)
Wrote ./T9.targetcoverage.cnn
Time: 51.261 seconds (173281 reads/sec, 6 bins/sec)
Summary: #bins=292, #reads=8882609, mean=30419.8946, min=299.88, max=296973.17
On-target percentage: 38.693 (of 22956916 mapped)
Wrote ./T10.targetcoverage.cnn

I ran again with the command you suggested, and the results are now as expected, i.e., identical to the first run.

Thanks.


etal replied

04-06-2015, 10:05 AM
What files were generated by the second run, if any? Can you show me the status messages or any errors that were printed?

When you run CNVkit with a reference, you don't need the "--targets", "--fasta", "--split" and "--access" arguments as that information has already been captured in the reference file. The default output directory is the current directory ("."). Try this instead:

cnvkit.py batch T1.bam T2.bam T3.bam T4.bam T5.bam T6.bam T7.bam T8.bam T9.bam T10.bam -r N.cnn --scatter --diagram
lethalfang replied

04-06-2015, 09:44 AM
When I tried to run more tumor samples using the reference .cnn file from a past normal run, there is no .cnr output.

This gives me .cnr files:
cnvkit.py batch T1.bam T2.bam T3.bam T4.bam T5.bam T6.bam T7.bam T8.bam T9.bam T10.bam --normal Na.bam Nb.bam Nc.bam --targets captures.bed --fasta b37_decoy.fasta --split --access b37.decoy.accessibles.bed --scatter --diagram --output-reference N.cnn --output-dir .

This does not output .cnr or antitargetcoverage.cnn files:
cnvkit.py batch T1.bam T2.bam T3.bam T4.bam T5.bam T6.bam T7.bam T8.bam T9.bam T10.bam -r N.cnn --targets captures.bed --fasta b37_decoy.fasta --split --access b37.decoy.accessibles.bed --scatter --diagram --output-dir .

What is missing? Thanks.
etal replied

03-12-2015, 01:20 PM
Hi Ies,

Sorry I missed your post earlier. There isn't a verbose logging mode in CNVkit, but the messages on standard error are fairly verbose already and should always report when there is an error. In particular, if something crashes then you'll see a Python traceback message. However, if you parallelized the batch run (-p greater than 1), the messages from each process will be interleaved, which makes them somewhat harder to interpret.

The scatter and diagram PDFs should always be generated in a batch run; there isn't a special code path where they would be skipped. Does the log say something like "Wrote MySample-scatter.pdf" for the missing PDFs, or not?

The --scatter option uses matplotlib to generate a PDF. On a cluster, the default matplotlib backend (e.g. Wx or Gtk) might not be available, and so I guess it's possible the plotting engine of matplotlib gets confused and silently fails to write the file. You could address that by setting a different backend on your cluster -- create a file called "matplotlibrc" in the current working directory or your home folder, with the contents:

backend : pdf
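If you would rather create that file from a script (for example as part of a cluster job setup), a minimal Python sketch follows. The helper name is hypothetical; it simply writes the one-line "matplotlibrc" file described above into a directory of your choice.

```python
from pathlib import Path


def write_matplotlibrc(directory=".", backend="pdf"):
    """Write a one-line matplotlibrc that selects a non-interactive backend."""
    rc_path = Path(directory) / "matplotlibrc"
    rc_path.write_text(f"backend : {backend}\n")
    return rc_path
```

Calling write_matplotlibrc() before launching the batch run places the file in the current working directory. Note that it overwrites any existing "matplotlibrc" there, so check for other settings first.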

The --diagram option uses a different backend, Reportlab, which always generates a PDF from scratch and does not have an interactive mode. I can't think of a reason why this one would occasionally fail to write a file. Can you suggest anything unusual about your system's configuration? Outdated software versions, maybe?

If the diagram is showing labels for hundreds of genes, that means:
(a) you did exome sequencing, so there's lots to show;
(b) significant copy number alterations cover large regions of chromosomes in your sample; and/or
(c) the purity of your tumor samples is fairly high.

You can:

- Thin out the labeling to some extent by specifying a higher threshold (-t) log2 ratio value in the diagram command; the default is 0.6, so try 0.8 or 0.9 to only show the higher-amplitude CNAs.
- Drop the labels altogether by specifying a high value for -t or just passing the .cns segment file (with -s), without the .cnr.
- Use the "heatmap" command instead to view the unlabeled CNA regions for many samples at once.
- Use the "gainloss" command to list all genes with log2 ratio amplitudes beyond a given threshold, essentially the labels you're currently seeing on the diagram but in a more manageable plain-text, tabular format.
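To show what raising the threshold does to the number of labels, here is a minimal Python sketch. It is not the CNVkit implementation: the function name and the gene values are made up for illustration, and treating the cutoff as inclusive (>=) is an assumption.

```python
def genes_beyond_threshold(gene_log2, threshold=0.6):
    """Return (gene, log2) pairs whose absolute log2 ratio meets the threshold.

    gene_log2 maps a gene name to its log2 copy ratio. The default of 0.6
    mirrors the diagram default mentioned above.
    """
    return sorted(
        (gene, log2)
        for gene, log2 in gene_log2.items()
        if abs(log2) >= threshold
    )


# Made-up example values, for illustration only.
example = {"MYC": 1.2, "TP53": -0.9, "BRCA1": 0.3, "EGFR": 0.7}
labeled_default = genes_beyond_threshold(example)        # EGFR, MYC, TP53
labeled_strict = genes_beyond_threshold(example, 0.8)    # MYC, TP53
```

Both gains and losses are kept because the comparison uses the absolute value, which is why a deletion at -0.9 survives the stricter cutoff while a gain at 0.7 does not.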

The diagram is based on Biopython's Graphics/BasicChromosome module. If you're handy with Python and have a specific modification in mind, you could edit cnvlib/diagram.py (202 lines) to do it. For example, you can change PAGE_SIZE to much larger dimensions like 22x17" and the chromosomes will scale proportionally, but the gene labels stay the same size and will be more readable if they were overlapping before.

Thanks for the suggestion on sample.bai, I'll look into it.

Cheers,
Eric
inijman replied

03-02-2015, 03:30 AM
Hi etal,

Very nice work on a promising tool.

I'm giving it a go as well and what strikes me is that I don't always get the PDF files with the diagram and scatter option in a batch run. Is there a way to get verbose (error) logging?

When I do generate them with the individual scatter/diagram commands, they are there, but the diagram is very difficult to read because all the labels overlap each other. Is there a way to influence this, or to break the chromosomes out onto separate pages?

If I'm on an X11 terminal I can generate the plots, but when we run analyses on the cluster, there is no display available and we would need the PDF files.

Additionally, it would be great if both sample.bam.bai and sample.bai files could be used as indices.

Best, Ies
ps: and Hi to all the other posters!

Last edited by inijman; 03-02-2015, 03:42 AM.
Mulos replied

02-27-2015, 05:52 AM
Works like a charm, thank you so much.

Also, very pleased with the results!

Last edited by Mulos; 02-27-2015, 07:25 AM.
etal replied

02-26-2015, 01:33 PM
I've added the "-c/--count" option from the "coverage" command to the "batch" options as well. If you installed CNVkit from the GitHub repo, you can get the latest by either pulling the new commits or downloading the latest source code Zip file from the master branch and installing that.

Please let me know if that works for you.
