I am processing aligned RNA-Seq data to get phased haplotypes of novel genes in sea stars. I am working through the analysis pipeline that used the Genome Analysis Toolkit suite of software, and is detailed on the Broad Institute website:
https://www.broadinstitute.org/gatk/...ces?bpm=RNAseq
and
https://www.broadinstitute.org/gatk/...c?name=methods
I am up to the Split 'n' Trim stage of the pipeline, where the SplitNCigarReads tool is used to refine the sequence alignment. I have hit some problems with my index files. The order of the scaffolds and contigs for .fai (fasta index file) and .bai (index for bam file; bam file has been sorted, read-groups have been added and duplicated marked) files is different.
e.g. the .fai file:
Scaffold1 154793 11 70 71
Scaffold2 383464 157027 70 71
Scaffold3 336159 545981 70 71
eg. the .bai file (samtools idxstats filename.bam):
Scaffold1 154793 0 0
Scaffold10 244803 0 0
Scaffold100 200181 0 0
I therefore got the following error from GATK:
##### ERROR MESSAGE: Input files reads and reference have incompatible contigs: Relative ordering of overlapping contigs differs, which is unsafe.
##### ERROR reads contigs = [Scaffold1, Scaffold10, Scaffold100, Scaffold1000...
##### ERROR reference contigs = [Scaffold1, Scaffold2, Scaffold3, Scaffold4…
I wrote a script to change the order of the items in the .fai file so they would be in the same order as the .bai file, but this produced another error:
##### ERROR MESSAGE: Couldn't read file filename.fa because Mismatch between sequence dictionary fasta index for filename.fa, sequence 'Scaffold2' != 'Scaffold10'.
Can anyone suggest a good work around for this problem? Do I need to change the order of entries in the fasta file reference sequence? Should I alter the .dict file (scaffolds/contigs are in the same order as they are in the original .fai file)?
https://www.broadinstitute.org/gatk/...ces?bpm=RNAseq
and
https://www.broadinstitute.org/gatk/...c?name=methods
I am up to the Split 'n' Trim stage of the pipeline, where the SplitNCigarReads tool is used to refine the sequence alignment. I have hit some problems with my index files. The order of the scaffolds and contigs for .fai (fasta index file) and .bai (index for bam file; bam file has been sorted, read-groups have been added and duplicated marked) files is different.
e.g. the .fai file:
Scaffold1 154793 11 70 71
Scaffold2 383464 157027 70 71
Scaffold3 336159 545981 70 71
eg. the .bai file (samtools idxstats filename.bam):
Scaffold1 154793 0 0
Scaffold10 244803 0 0
Scaffold100 200181 0 0
I therefore got the following error from GATK:
##### ERROR MESSAGE: Input files reads and reference have incompatible contigs: Relative ordering of overlapping contigs differs, which is unsafe.
##### ERROR reads contigs = [Scaffold1, Scaffold10, Scaffold100, Scaffold1000...
##### ERROR reference contigs = [Scaffold1, Scaffold2, Scaffold3, Scaffold4…
I wrote a script to change the order of the items in the .fai file so they would be in the same order as the .bai file, but this produced another error:
##### ERROR MESSAGE: Couldn't read file filename.fa because Mismatch between sequence dictionary fasta index for filename.fa, sequence 'Scaffold2' != 'Scaffold10'.
Can anyone suggest a good work around for this problem? Do I need to change the order of entries in the fasta file reference sequence? Should I alter the .dict file (scaffolds/contigs are in the same order as they are in the original .fai file)?
Comment