Announcement

Collapse
No announcement yet.

CNVkit: Copy number detection and visualization for targeted sequencing using off-tar

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #16
    Hi Ies,

    Sorry I missed your post earlier. There isn't a verbose logging mode in CNVkit, but the messages on standard error are fairly verbose already and should always report when there is an error. In particular, if something crashes then you'll see a Python traceback message. However, if you parallelized the batch run (-p >1), the messages from each process will be interleaved, which makes them somewhat harder to intepret.

    The scatter and diagram PDFs should always be generated in a batch run; there isn't a special code path where they would be skipped. Does the log say something like "Wrote MySample-scatter.pdf" for the missing PDFs, or not?

    The --scatter option uses matplotlib to generate a PDF. On a cluster, the default matplotlib backend (e.g. Wx or Gtk) might not be available, and so I guess it's possible the plotting engine of matplotlib gets confused and silently fails to write the file. You could address that by setting a different backend on your cluster -- create a file called "matplotlibrc" in the current working directory or your home folder, with the contents:

    backend : pdf

    The --diagram option uses a different backend, Reportlab, which always generates a PDF from scratch and does not have an interactive mode. I can't think of a reason why this one would occasionally fail to write a file. Can you suggest anything unusual about your system's configuration? Outdated software versions, maybe?

    If the diagram is showing labels for hundreds of genes, that means:
    (a) you did exome sequencing, so there's lots to show;
    (b) significant copy number alterations cover large regions of chromosomes in your sample; and/or
    (c) the purity of your tumor samples is fairly high.

    You can:

    - Thin out the labeling to some extent by specifying a higher threshold (-t) log2 ratio value in the diagram command; the default is 0.6, so try 0.8 or 0.9 to only show the higher-amplitude CNAs.
    - Drop the labels altogether by specifying a high value for -t or just passing the .cns segment file (with -s), without the .cnr.
    - Use the "heatmap" command instead to view the unlabeled CNA regions for many samples at once.
    - Use the "gainloss" command to list all genes with log2 ratio amplitudes beyond a given threshold, essentially the labels you're currently seeing on the diagram but in a more manageable plain-text, tabular format.

    The diagram is based on Biopython's Graphics/BasicChromosome module. If you're handy with Python and have a specific modification in mind, you could edit cnvlib/diagram.py (202 lines) to do it. For example, you can change PAGE_SIZE to much larger dimensions like 22x17" and the chromosomes will scale proportionally, but the gene labels stay the same size and will be more readable if they were overlapping before.

    Thanks for the suggestion on sample.bai, I'll look into it.

    Cheers,
    Eric

    Comment


    • #17
      When I tried to run more tumor samples using the reference .cnn file from a past normal run, there is no .cnr output.

      This gives me .cnr files:
      cnvkit.py batch T1.bam T2.bam T3.bam T4.bam T5.bam T6.bam T7.bam T8.bam T9.bam T10.bam --normal Na.bam Nb.bam Nc.bam --targets captures.bed --fasta b37_decoy.fasta --split --access b37.decoy.accessibles.bed --scatter --diagram --output-reference N.cnn --output-dir .


      This does not output .cnr pr antitargetcoverage.cnn file:
      cnvkit.py batch T1.bam T2.bam T3.bam T4.bam T5.bam T6.bam T7.bam T8.bam T9.bam T10.bam -r N.cnn --targets captures.bed --fasta b37_decoy.fasta --split --access b37.decoy.accessibles.bed --scatter --diagram --output-dir .

      What is missing? Thanks.

      Comment


      • #18
        What files were generated by the second run, if any? Can you show me the status messages or any errors that were printed?

        When you run CNVkit with a reference, you don't need the "--targets", "--fasta", "--split" and "--access" arguments as that information has already been captured in the reference file. The default output directory is the current directory ("."). Try this instead:

        cnvkit.py batch T1.bam T2.bam T3.bam T4.bam T5.bam T6.bam T7.bam T8.bam T9.bam T10.bam -r N.cnn --scatter --diagram

        Comment


        • #19
          Originally posted by etal View Post
          What files were generated by the second run, if any? Can you show me the status messages or any errors that were printed?

          When you run CNVkit with a reference, you don't need the "--targets", "--fasta", "--split" and "--access" arguments as that information has already been captured in the reference file. The default output directory is the current directory ("."). Try this instead:

          cnvkit.py batch T1.bam T2.bam T3.bam T4.bam T5.bam T6.bam T7.bam T8.bam T9.bam T10.bam -r N.cnn --scatter --diagram
          Thanks. For the second run where I was having question, there is no error message, and a bunch of .targetcoverage.cnn files are generated.

          The status messages are:
          Code:
          Summary: #bins=292, #reads=7633921, mean=26143.5653, min=212.35, max=224646.99
          On-target percentage: 37.877 (of 20154547 mapped)
          Wrote ./T4.targetcoverage.cnn
          Running the CNVkit pipeline on T9.bam ...
          Processing reads in T9.bam
          Time: 52.775 seconds (154291 reads/sec, 6 bins/sec)
          Summary: #bins=292, #reads=8142711, mean=27885.9981, min=268.94, max=279225.27
          On-target percentage: 38.472 (of 21165340 mapped)
          Wrote ./T3.targetcoverage.cnn
          Running the CNVkit pipeline on T10.bam ...
          Processing reads in T10.bam
          Time: 55.564 seconds (154063 reads/sec, 5 bins/sec)
          Summary: #bins=292, #reads=8560291, mean=29316.0670, min=326.38, max=293055.07
          On-target percentage: 38.593 (of 22181003 mapped)
          Wrote ./T6.targetcoverage.cnn
          Time: 57.101 seconds (149544 reads/sec, 5 bins/sec)
          Summary: #bins=292, #reads=8539037, mean=29243.2782, min=302.54, max=292123.56
          Time: 57.092 seconds (155098 reads/sec, 5 bins/sec)
          Summary: #bins=292, #reads=8854878, mean=30324.9281, min=281.23, max=295576.58
          Time: 57.155 seconds (147061 reads/sec, 5 bins/sec)
          Summary: #bins=292, #reads=8405294, mean=28785.2540, min=254.16, max=285135.1
          On-target percentage: 38.485 (of 22188207 mapped)
          On-target percentage: 39.087 (of 22654458 mapped)
          On-target percentage: 38.633 (of 21757000 mapped)
          Wrote ./T2.targetcoverage.cnn
          Wrote ./T8.targetcoverage.cnn
          Wrote ./T1.targetcoverage.cnn
          Time: 58.704 seconds (143289 reads/sec, 5 bins/sec)
          Summary: #bins=292, #reads=8411624, mean=28806.9335, min=241.63, max=282191.54
          On-target percentage: 38.845 (of 21654563 mapped)
          Wrote ./T7.targetcoverage.cnn
          Time: 60.050 seconds (150608 reads/sec, 5 bins/sec)
          Summary: #bins=292, #reads=9044023, mean=30972.6832, min=298.46, max=306083.6
          On-target percentage: 38.815 (of 23300051 mapped)
          Wrote ./T5.targetcoverage.cnn
          Time: 50.180 seconds (165905 reads/sec, 6 bins/sec)
          Summary: #bins=292, #reads=8325083, mean=28510.5614, min=309.7, max=283705.95
          On-target percentage: 38.622 (of 21555485 mapped)
          Wrote ./T9.targetcoverage.cnn
          Time: 51.261 seconds (173281 reads/sec, 6 bins/sec)
          Summary: #bins=292, #reads=8882609, mean=30419.8946, min=299.88, max=296973.17
          On-target percentage: 38.693 (of 22956916 mapped)
          Wrote ./T10.targetcoverage.cnn


          I ran again with the command you suggested, and the results are now as expected, i.e., identical to the first run.

          Thanks.

          Comment


          • #20
            calling LOH from a pair of tumor and normal exome bams

            Dear Eric,

            I wonder if you can provide an instruction/example on how to generate LOH calls from a pair of tumor and normal exom bams.
            I assume this tool can generate them, but I couldn't find a tuturoal in the cnvkit website. probably, I missed it?

            thank you very much in advance,

            wisekh

            Comment


            • #21
              Hi wisekh,

              The LOH functionality in CNVkit is described here:
              http://cnvkit.readthedocs.org/en/lat...s.html#scatter

              However, the "calls" are simply displayed visually -- the variant allele frequencies are plotted alongside the copy ratios, and a shift in VAF from 0.5 indicates LOH. I'm currently working on expanding this functionality to make it more useful.

              To run the complete pipeline with a tumor-normal pair and make a plot of the copy number and LOH shifts together, follow the quick start guide here:
              http://cnvkit.readthedocs.org/en/latest/quickstart.html

              Separately from running the CNVkit "batch" pipeline, you'll need to call SNPs in the tumor sample in VCF format. Then use that VCF file as input to the CNVkit "scatter" command along with the .cnr and .cns files from the CNVkit pipeline to make the plot.

              Hope that helps,
              Eric

              Comment


              • #22
                I got it. Thank you very much for your quick response.

                hoon

                Comment


                • #23
                  using a mapping quality threshold

                  Dear Eric,
                  I have another question. Is it possible to use a mapping quality threshold to, for example, a tumor bam or multiple tumor bams that I would infer copy number from?

                  thank you,

                  Hoon

                  Comment


                  • #24
                    The mapping quality threshold is hard-coded to only exclude unmapped or ambiguously mapped reads, see here:
                    https://github.com/etal/cnvkit/blob/...verage.py#L115

                    Just change the -Q value to another integer if you'd like to try it yourself. However, I recall reading (can't find the reference at the moment) that keeping low-MAPQ reads did not harm copy number estimation and may have improved it.

                    In any case, in CNVkit the script genome2access.py can be used to directly exclude poorly mappable genomic regions. This is done already for hg19 in the bundled file access-5k-mappable.hg19.bed.

                    Comment


                    • #25
                      Sequence IDs don't match with bed Error CNVkit.

                      Hi all,

                      I trying to run CNVkit with tumor and normal samples on exome sequencing. But I tried all possible things mentioned in the docs. I tried different bed file input as the target region and also the annotations but I always end up getting the error.
                      ValueError: BED file 'results/S04380110_Covered.target.bed' sequence IDs don't match any in BAM file.

                      Anybody came across this issue. Please help.

                      Thank you in advance.

                      Deepak

                      Comment


                      • #26
                        Hi Deepak,

                        The sequence IDs are the chromosome names in your reference genome and the first column of your BED file. For the human genome the first chromosome might be "1" or "chr1" depending on where you got your reference genome.

                        Check that the name schemes match between your BED and BAM files, either "1" or "chr1". You can use "samtools view -H" to see the BAM header. If the names don't match, then you should edit your BED file, either adding or removing the "chr" prefix, so that they do match.

                        Hope that helps,
                        Eric

                        Comment


                        • #27
                          errors in production of cns file

                          Hello all,

                          I'm having issues working through the install of CNVkit. I've pinpointed my issues to problems with the segment sub-command. When I try to run the test procedure, it dies with this error:

                          Traceback (most recent call last):
                          File "../cnvkit.py", line 11, in <module>
                          args.func(args)
                          File "/home/snmcnulty/bin/cnvkit-master/cnvlib/commands.py", line 666, in _cmd_segment
                          rlibpath=args.rlibpath)
                          File "/home/snmcnulty/bin/cnvkit-master/cnvlib/segmentation/__init__.py", line 58, in do_segmentation
                          segarr = cnarr.as_dataframe(seg2cns(seg_out))
                          File "/home/snmcnulty/bin/cnvkit-master/cnvlib/segmentation/__init__.py", line 143, in seg2cns
                          + seg_text)
                          ValueError: Segmentation output is not valid SEG format:

                          [then a big table]

                          Makefile:57: recipe for target 'build/p2-5_5.cns' failed
                          make: *** [build/p2-5_5.cns] Error 1

                          If anyone could help me pinpoint the problem, I would really appreciate it!

                          Comment


                          • #28
                            Sorry for the trouble. Could you post the first few lines of the output table and the version of CNVkit you're using (try "cnvkit.py version")?

                            If you're able to load the PSCBS package in R, could you post that package's version too?

                            If the default CBS method is failing, as an alternative you can try the "haar" method, which does not rely on R at all:

                            cnvkit.py segment MySample.cnr -m haar

                            (Use the -t parameter to specify the significance threshold, which might need to be very small for exome sequencing data to segment well.)

                            Comment


                            • #29
                              Hi Eric,

                              I have solved the problems with the reference. But now the problem seems to be different at the step of calculating coverage. This is the error I get.

                              #######################################

                              cnvkit.py coverage /media/ae5197fd-42a8-4d8f-96d6-1c8762b4d8ae/MM02-N_MM02-020200/colossus_analysis/BAM/MM02-020200_final.bam my_targets.bed -o .targetcoverage.cnn
                              Processing reads in MM02-020200_final.bam
                              Traceback (most recent call last):
                              File "/usr/local/bin/cnvkit.py", line 4, in <module>
                              __import__('pkg_resources').run_script('CNVkit==0.7.9.dev0', 'cnvkit.py')
                              File "/usr/local/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 726, in run_script
                              self.require(requires)[0].run_script(script_name, ns)
                              File "/usr/local/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 1491, in run_script
                              exec(script_code, namespace, namespace)
                              File "/usr/local/lib/python2.7/dist-packages/CNVkit-0.7.9.dev0-py2.7.egg/EGG-INFO/scripts/cnvkit.py", line 11, in <module>

                              File "build/bdist.linux-x86_64/egg/cnvlib/commands.py", line 434, in _cmd_coverage

                              File "build/bdist.linux-x86_64/egg/cnvlib/commands.py", line 457, in do_coverage

                              File "build/bdist.linux-x86_64/egg/cnvlib/coverage.py", line 37, in interval_coverages
                              File "build/bdist.linux-x86_64/egg/cnvlib/coverage.py", line 110, in interval_coverages_pileup
                              File "build/bdist.linux-x86_64/egg/cnvlib/coverage.py", line 143, in bedcov
                              RuntimeError: Bad line from bedcov:
                              1

                              ############################

                              Can you shed some light into the issue?

                              Thank you.

                              Regards,
                              Deepak

                              Comment


                              • #30
                                segmentation, ctd.

                                M CNVKit version is 0.7.8. R is installed, and I'm using PSCBS v0.61.0. Here's the output I get from the segmentation command.

                                to connect to socket /tmp/dbus-Hufo1aFYF5: Connection refused
                                Dropped 1 outlier bins:
                                chromosome start end gene log2 weight
                                0 chr4 46638882 46738865 Background -4.1499 0.167756
                                Traceback (most recent call last):
                                File "/opt/apps/cnvkit/0.7.8/bin/cnvkit.py", line 11, in <module>
                                args.func(args)
                                File "/opt/apps/cnvkit/0.7.8/lib/python2.7/site-packages/cnvlib/commands.py", line 666, in _cmd_segment
                                rlibpath=args.rlibpath)
                                File "/opt/apps/cnvkit/0.7.8/lib/python2.7/site-packages/cnvlib/segmentation/__init__.py", line 58, in do_segmentation
                                segarr = cnarr.as_dataframe(seg2cns(seg_out))
                                File "/opt/apps/cnvkit/0.7.8/lib/python2.7/site-packages/cnvlib/segmentation/__init__.py", line 143, in seg2cns
                                + seg_text)
                                ValueError: Segmentation output is not valid SEG format:
                                WARNING: ignoring environment value of R_HOME
                                "sampleName" "chromosome" "start" "end" "nbrOfLoci" "mean"
                                "19987_main" "chr1" 93708 2582834 32 0.0317406246246301
                                "19987_main" "chr1" 2684720 6239917 34 -0.392436669374812
                                "19987_main" "chr1" 6239917 26177167 253 0.0239173699112572
                                "19987_main" "chr1" 26177167 29477537 33 0.379337119869004

                                Also, you were right -- using a non-R workflow completed perfectly.

                                Comment

                                Working...
                                X