Real Time Genomics are pleased to announce the availability of new releases of our full analysis suite, RTG Core (commercial / free for non-commercial use), and our utility package, RTG Tools (free for any use). This release includes new features and performance improvements. Some of the highlights of this release:
* Several improvements to somatic variant calling, including the ability to specify site-specific somatic priors, control of output for gain-of-reference and loss-of-heterozygosity events, and changes to the VCF output to align with the TCGA VCF specification.
* Improvements to metagenomic species reference database management. Several new options allow better customization of a species reference, and extraction of genomic information for individual species contained within the reference database.
* Improvements to our sophisticated variant comparison tool vcfeval, primarily the ability to perform evaluation restricted to individual regions or sets of regions (for example GIAB high-confidence intervals or exome target regions), and the inclusion of more accuracy metrics, both as a new summary file and included in the weighted ROC data file.
* We are also pleased to make the source code to RTG Tools available under the Simplified BSD License, on GitHub. (Source code for RTG Core remains available for non-commercial use.)
* Many other minor improvements (full release notes for this version are detailed below).
If you haven't used RTG Core before (or maybe even if you have), it includes a nice new demo script that runs through an end-to-end demonstration of sex-aware and pedigree-aware family variant calling, including de novo variant detection and variant evaluation with vcfeval. (It also makes a nice demo of our comprehensive simulation tools.)
Commercial users of RTG Core may download the update from our website at http://realtimegenomics.com/products/rtg-core-downloads. Non-commercial users can download the update from our website at http://realtimegenomics.com/products...non-commercial or build from the source on GitHub (note the updated build instructions).
Users of RTG Tools, which is made freely available for non-commercial or commercial use alike, can download the new version from our website at http://realtimegenomics.com/products/rtg-tools or build from the source code on GitHub.
Detailed changes are listed below by area. Please read these through fully, as some command-line flags have changed, so updates to your pipeline scripts may be required. For more information on new features, see the RTG Operations Manual.
RTG Core 3.5 (2015-07-16)
-------------------------
### Basic Formatting and Mapping
* format/map: When formatting or mapping reads supplied as SAM/BAM
input data, any alignments marked as supplementary are ignored.
Note that if the input data has already been aligned, it is
recommended that the BAM file be shuffled to avoid biases during
mapping arising from the data being presented in chromosomal
order. See the user manual for more information.
* sdf2fasta/sdf2fastq: These commands have new flags --names and
--id-file that operate the same as their counterpart in sdfsubset.
* sdfsubset: This command has new flags --start-id and --end-id that
allow specifying a range of sequences by ID.
* sdf2sam: This new command allows the extraction of reads from SDF
in the form of unaligned SAM/BAM. This has a benefit over
extraction as FASTQ in that some metadata (such as read group
information) is preserved, paired-end data is stored in a single
file, and quality encoding is inherent in the format.
* chrstats: Reduce false positives in sex inconsistency detection that
were due to applying the (tighter) sex-chromosome threshold also to
autosomes. This threshold is now applied to sex-chromosomes only.
### Variant Calling and Analysis
* somatic: Now allows the user to specify a BED file containing
per-site somatic priors, which can be used (for example) to reduce
the somatic prior at sites typical of false positives (e.g. presence
in dbSNP) or increase the somatic prior at sites known to harbour
somatic variants (e.g. presence in COSMIC). For more information
see the user manual.
* somatic: At the end of variant calling, the somatic caller produces
an estimate of somatic sample contamination. Previously this
estimate was only available in the log file, but in this release
this computation has been greatly improved, and the contamination
estimate is now included in the standard summary statistics.
* somatic: "Gain of reference" calls are now disabled by default.
These can be included by specifying the new flag
--include-gain-of-reference.
* somatic: Calls that are indicative of loss of heterozygosity (LOH)
are not produced by default (since loss of heterozygosity
analysis is most useful in conjunction with additional data such as
germline variant calls or CNV data). These calls can be produced if
desired by specifying --loh with a prior greater than 0.
* somatic: When LOH calls are enabled, they were previously output in
haploid GT representation. They now use the ploidy appropriate for
the chromosome (according to the reference), for compatibility with
downstream processing tools.
* somatic: VCF output changes to bring the somatic representation in
line with the TCGA 1.2 VCF specification. In particular:
* Calls include a new FORMAT field SS that indicates the somatic
status for the derived (tumor) sample. This field replaces the
previous SOMATIC INFO field.
* Calls include a new FORMAT field SSC which contains the somatic
score for the derived (tumor) sample. This field replaces the
previous RSS INFO field.
* lineage: Supports the input of pedigree in the form of VCF header
annotations as output by the somatic caller, in the form:
##PEDIGREE=<Derived=TUMORSAMPLENAME,Original=NORMALSAMPLENAME>
* population: Fixed a rare case where, after complex call
simplification, the only sample genotype containing a non-ref allele
belonged to a member of the pedigree not being output. In this case the
QUAL score was -10log10 prob(no variant) rather than -10log10
prob(variant) as required by the VCF specification.
* vcfmerge: Added a new flag --force-merge-all to always attempt to
merge headers containing conflicting descriptions.
* vcfmerge: Previously vcfmerge would not process records containing
symbolic alleles. These are now accepted.
* vcfmerge: More graceful handling when encountering records with a GT
that refers to a non-existent ALT.
* vcfeval: Now outputs a summary containing various accuracy
metrics. A first set of statistics is computed from the full set of
variants evaluated (these will typically have highest sensitivity
but potentially poor precision if the input call set has not been
filtered). A second set of statistics is computed based on the ROC
curve information, selected at a threshold which maximises the
F-measure statistic (this provides some balance between sensitivity
and precision, so may be a fairer point to gather statistics for
cross-caller comparison).
* vcfeval: The weighted_roc.tsv file now includes columns containing
additional accuracy metrics.
* vcfeval: Improved the detection that alerts the user when chromosome
names are incompatible between reference, baseline, calls, and BED
regions (if used). Also improved other error and warning messages.
* vcfeval: Added a new flag --bed-regions to supply a BED file
containing a list of regions that the VCF records must overlap with
in order to be included in analysis. For example, a common use case
is to restrict to only evaluating calls contained within the GIAB
high-confidence regions, or only within regions corresponding to
exome target regions.
* vcfeval: Added a new flag --region to specify a single region to
evaluate variants within. This is useful when evaluating calls on a
single chromosome or within a small region of interest.
* vcfeval: Fixed a case where a ref-only call (i.e. containing no
alts) could get output instead of an indel with a padding base at
the same position.
* vcfeval: Disabled the output of slope analysis data files by default,
as these are fairly special purpose (primary ROC files are still
output). They can be re-enabled if desired by using the new
expert/experimental flag --Xslope-files.
* vcffilter: The --remove-all-same-as-ref flag now does not consider a
sample with missing GT as being variant, since the intent of this
flag is to retain only records where at least one sample is called
as variant.
* vcfannotate: Added two new flags --info-id and --info-description to
allow specifying the name of the INFO ID and Description fields
added to the header during annotation. These flags only take effect
if the VCF header does not already contain an INFO declaration with
that ID.
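
The second set of vcfeval summary statistics described above is taken at the score threshold that maximises the F-measure. The following Python sketch illustrates that selection under stated assumptions: it takes the F-measure to be the harmonic mean of precision and sensitivity, and the function name, the tuple layout, and the numbers are all hypothetical. It is not vcfeval's implementation and does not read its output files.

```python
def best_f_measure(roc_points, baseline_total):
    """Return (score, precision, sensitivity, f_measure) at the threshold
    that maximises the F-measure, or None if no point is usable.

    roc_points: iterable of (score, true_positives, false_positives), where
    the counts are cumulative over all calls at or above that score
    (a hypothetical layout chosen for this sketch).
    baseline_total: number of variants in the baseline set.
    """
    best = None
    for score, tp, fp in roc_points:
        if tp + fp == 0 or baseline_total == 0:
            continue  # precision or sensitivity undefined
        precision = tp / (tp + fp)
        sensitivity = tp / baseline_total
        if precision + sensitivity == 0:
            continue  # F-measure undefined
        f = 2 * precision * sensitivity / (precision + sensitivity)
        if best is None or f > best[3]:
            best = (score, precision, sensitivity, f)
    return best


# Made-up numbers: stricter thresholds give fewer false positives
# but also fewer true positives.
points = [(50.0, 80, 2), (30.0, 90, 10), (10.0, 95, 60)]
result = best_f_measure(points, baseline_total=100)
print(result)
```

In this toy data the middle threshold wins, because the strictest one sacrifices sensitivity and the loosest one sacrifices precision. That balance is why this point may be fairer for cross-caller comparison.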
### Metagenomics
* taxfilter: Added a new flag --subtree which allows selecting entire
taxonomic subtrees for inclusion in the output taxonomy.
* taxfilter: Added a new flag --remove-sequences to allow the removal
of sequence data associated with specific taxon IDs.
* sdf2fasta: Added a new flag --taxons which interprets any
supplied ID as a taxon ID and outputs all sequences assigned to that
taxon ID. This provides an easy way to extract genomic
sequence for any species from the reference SDF.
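
To illustrate what selecting an entire taxonomic subtree means, here is a minimal Python sketch. It assumes the taxonomy is held as a child-to-parent mapping. The function name, data layout, and taxon IDs are hypothetical, and this is not how taxfilter itself is implemented.

```python
from collections import defaultdict, deque


def select_subtree(parent_of, root_id):
    """Return the set containing root_id and every taxon beneath it.

    parent_of: dict mapping taxon ID -> parent taxon ID
    (a hypothetical layout chosen for this sketch).
    """
    # Invert the child -> parent mapping so we can walk downwards.
    children = defaultdict(list)
    for taxon, parent in parent_of.items():
        children[parent].append(taxon)

    selected = set()
    queue = deque([root_id])
    while queue:
        taxon = queue.popleft()
        if taxon in selected:
            continue  # guards against cycles in malformed input
        selected.add(taxon)
        queue.extend(children[taxon])
    return selected


# Made-up taxonomy: 1 is the root, with children 10 and 20.
taxonomy = {10: 1, 20: 1, 11: 10, 12: 10, 21: 20}
subtree = select_subtree(taxonomy, 10)
print(sorted(subtree))
```

Selecting taxon 10 keeps 10, 11, and 12, and leaves out the sibling branch under 20.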
### Other
* genomesim: Added a new flag --prefix to specify a prefix for
generated sequence names.
* many: Update the base library used for SAM/BAM input and output to
htsjdk 1.128.
* many: VCF reading now detects cases where a header specifies a field
declaration using an ID that is already in use, preventing duplicate
header declarations.
* extract: Fix a regression where extracting from VCF without any
region specified would include the VCF header.
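
The duplicate header declaration check mentioned above can be pictured with a short Python sketch. It assumes that INFO, FORMAT, and FILTER IDs live in separate namespaces, so the same ID may appear once in each. The function name and the sample header lines are hypothetical, and this is not the RTG implementation.

```python
import re

# Matches header lines such as ##INFO=<ID=DP,...> and captures
# the field type and the declared ID.
DECLARATION = re.compile(r"^##(INFO|FORMAT|FILTER)=<ID=([^,>]+)")


def find_duplicate_declarations(header_lines):
    """Return a list of (field_type, id) pairs declared more than once."""
    seen = set()
    duplicates = []
    for line in header_lines:
        match = DECLARATION.match(line)
        if not match:
            continue  # not an INFO/FORMAT/FILTER declaration
        key = (match.group(1), match.group(2))
        if key in seen:
            duplicates.append(key)
        else:
            seen.add(key)
    return duplicates


# Made-up header: INFO DP is declared twice, FORMAT DP only once.
header = [
    "##fileformat=VCFv4.1",
    '##INFO=<ID=DP,Number=1,Type=Integer,Description="Combined read depth">',
    '##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read depth">',
    '##INFO=<ID=DP,Number=1,Type=Integer,Description="Depth again">',
]
dups = find_duplicate_declarations(header)
print(dups)
```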