Hello again everyone...
So, today I was happy as I got all my fastq files aligned and thought today's a beautiful day and all will go when I want to finally get some analysis out of my sam file (mind you, this is the very first time I am doing this).....so I started (well, attempted to start) converting my sam to bam files using 1) samtools 2) picard SortSam.jar...neither method works and I would really appreciate your input regarding this. I searched this great forum but nobody seemed to have the problem that I am having. Here are the details:
1- Using samtools.
command: samtools view -bS $REF mysamfile.sam > bamfile.bam
output:
samopen] SAM header is present: 84 sequences.
[sam_read1] reference 'AS:human_g1k_v37' is recognized as '*'.
Parse error at line 163: invalid CIGAR character
Aborted
So samtools view basically halted when it encountered * in the cigar string.
For this case, should I remove the alignments with an unknown cigar string? if that's a good solution, how do I do it?
2- Using picard tools SortSam.jar
java -jar /share/apps/picard-tools_current/jar/SortSam.jar SORT_ORDER=coordinate VALIDATION_STRINGENCY=SILENT INPUT=0099758_1.fastqnovo.raw.sam OUTPUT=0099758_1.bam
output:
[Wed Jun 08 14:17:31 BST 2011] net.sf.picard.sam.SortSam INPUT=0099758_1.fastqnovo.raw.sam OUTPUT=0099758_1.bam SORT_ORDER=coordinate VALIDATION_STRINGENCY=SILENT TMP_DIR=/tmp/zibrahimbrc VERBOSITY=INFO QUIET=false COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000
[Wed Jun 08 14:17:31 BST 2011] net.sf.picard.sam.SortSam done.
Runtime.totalMemory()=2058027008
Exception in thread "main" net.sf.samtools.SAMFormatException: Error parsing text SAM file. Not enough fields; Line 163
Line: @SQ SN:1 AS:human_g1k_v37 LN:249250621
at net.sf.samtools.SAMTextReader.reportFatalErrorParsingLine(SAMTextReader.java:213)
at net.sf.samtools.SAMTextReader.access$400(SAMTextReader.java:40)
at net.sf.samtools.SAMTextReader$RecordIterator.parseLine(SAMTextReader.java:305)
at net.sf.samtools.SAMTextReader$RecordIterator.next(SAMTextReader.java:268)
at net.sf.samtools.SAMTextReader$RecordIterator.next(SAMTextReader.java:240)
at net.sf.samtools.SAMFileReader$AssertableIterator.next(SAMFileReader.java:609)
at net.sf.samtools.SAMFileReader$AssertableIterator.next(SAMFileReader.java:587)
at net.sf.picard.sam.SortSam.doWork(SortSam.java:58)
at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:150)
at net.sf.picard.cmdline.CommandLineProgram.instanceMainWithExit(CommandLineProgram.java:117)
at net.sf.picard.sam.SortSam.main(SortSam.java:66)
so I am guessing that it didn't like my SQ field. The thing is, I didn't generate it, it was generated automatically by novoalign (what I used for the alignment). Should I specify @SQ fields when running novoalign (and run it again)? if so, what should I put it in it?
Here's how I ran my novoalign:
novoalign -d /scratch/data/reference_genomes/human/human_g1k_v37 -F STDFQ -f 0099758_1_1.fastq 0099758_1_2.fastq -r Random -K 0099758_1 -o SAM $"@RG\tID:0099758_1.fastq\tPL:sanger\tLB:epilepsy\tSM:0099758_1" 2> 0099758_1.fastqnovo.stats > 0099758_1.fastqnovo.raw.sam
basically specifying the RG field only in the header.
-----------------------------------------
Also, this is the output of "samtools view -SH GSM520_ES.aligned.sam
samopen] SAM header is present: 84 sequences.
@HD VN:1.0 SO:unsorted
@RG ID:0099758_1.fastq PL:sanger LB:epilepsy SM:0099758_1
@PG ID:novoalign VN:V2.07.09 CL:novoalign -d human_g1k_v37 -F STDFQ -f 0099758_1_1.fastq 0099758_1_2.fastq -r Random -K 0099758_1 -o SAM @RG\tID:0099758_1.fastq\tPL:sanger\tLB:epilepsy\tSM:0099758_1
@HD VN:1.0 SO:unsorted
@RG ID:0099758_1.fastq PL:sanger LB:epilepsy SM:0099758_1
@PG ID:novoalign VN:V2.07.09 CL:novoalign -d human_g1k_v37 -F STDFQ -f 0099758_1_1.fastq 0099758_1_2.fastq -r Random -K 0099758_1 -o SAM @RG\tID:0099758_1.fastq\tPL:sanger\tLB:epilepsy\tSM:0099758_1
@SQ SN:1 AS:human_g1k_v37 LN:249250621
@SQ SN:2 AS:human_g1k_v37 LN:243199373
@SQ SN:3 AS:human_g1k_v37 LN:198022430
@SQ SN:4 AS:human_g1k_v37 LN:191154276
@SQ SN:5 AS:human_g1k_v37 LN:180915260
@SQ SN:6 AS:human_g1k_v37 LN:171115067
@SQ SN:7 AS:human_g1k_v37 LN:159138663
@SQ SN:8 AS:human_g1k_v37 LN:146364022
@SQ SN:9 AS:human_g1k_v37 LN:141213431
@SQ SN:10 AS:human_g1k_v37 LN:135534747
@SQ SN:11 AS:human_g1k_v37 LN:135006516
@SQ SN:12 AS:human_g1k_v37 LN:133851895
@SQ SN:13 AS:human_g1k_v37 LN:115169878
@SQ SN:14 AS:human_g1k_v37 LN:107349540
@SQ SN:15 AS:human_g1k_v37 LN:102531392
@SQ SN:16 AS:human_g1k_v37 LN:90354753
@SQ SN:17 AS:human_g1k_v37 LN:81195210
@SQ SN:18 AS:human_g1k_v37 LN:78077248
@SQ SN:19 AS:human_g1k_v37 LN:59128983
@SQ SN:20 AS:human_g1k_v37 LN:63025520
@SQ SN:21 AS:human_g1k_v37 LN:48129895
@SQ SN:22 AS:human_g1k_v37 LN:51304566
@SQ SN:X AS:human_g1k_v37 LN:155270560
@SQ SN:Y AS:human_g1k_v37 LN:59373566
@SQ SN:MT AS:human_g1k_v37 LN:16569
@SQ SN:GL000207.1 AS:human_g1k_v37 LN:4262
@SQ SN:GL000226.1 AS:human_g1k_v37 LN:15008
@SQ SN:GL000229.1 AS:human_g1k_v37 LN:19913
@SQ SN:GL000231.1 AS:human_g1k_v37 LN:27386
@SQ SN:GL000210.1 AS:human_g1k_v37 LN:27682
------------------------------------------------
Any help would be appreciated . Thank you!
So, today I was happy as I got all my fastq files aligned and thought today's a beautiful day and all will go when I want to finally get some analysis out of my sam file (mind you, this is the very first time I am doing this).....so I started (well, attempted to start) converting my sam to bam files using 1) samtools 2) picard SortSam.jar...neither method works and I would really appreciate your input regarding this. I searched this great forum but nobody seemed to have the problem that I am having. Here are the details:
1- Using samtools.
command: samtools view -bS $REF mysamfile.sam > bamfile.bam
output:
samopen] SAM header is present: 84 sequences.
[sam_read1] reference 'AS:human_g1k_v37' is recognized as '*'.
Parse error at line 163: invalid CIGAR character
Aborted
So samtools view basically halted when it encountered * in the cigar string.
For this case, should I remove the alignments with an unknown cigar string? if that's a good solution, how do I do it?
2- Using picard tools SortSam.jar
java -jar /share/apps/picard-tools_current/jar/SortSam.jar SORT_ORDER=coordinate VALIDATION_STRINGENCY=SILENT INPUT=0099758_1.fastqnovo.raw.sam OUTPUT=0099758_1.bam
output:
[Wed Jun 08 14:17:31 BST 2011] net.sf.picard.sam.SortSam INPUT=0099758_1.fastqnovo.raw.sam OUTPUT=0099758_1.bam SORT_ORDER=coordinate VALIDATION_STRINGENCY=SILENT TMP_DIR=/tmp/zibrahimbrc VERBOSITY=INFO QUIET=false COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000
[Wed Jun 08 14:17:31 BST 2011] net.sf.picard.sam.SortSam done.
Runtime.totalMemory()=2058027008
Exception in thread "main" net.sf.samtools.SAMFormatException: Error parsing text SAM file. Not enough fields; Line 163
Line: @SQ SN:1 AS:human_g1k_v37 LN:249250621
at net.sf.samtools.SAMTextReader.reportFatalErrorParsingLine(SAMTextReader.java:213)
at net.sf.samtools.SAMTextReader.access$400(SAMTextReader.java:40)
at net.sf.samtools.SAMTextReader$RecordIterator.parseLine(SAMTextReader.java:305)
at net.sf.samtools.SAMTextReader$RecordIterator.next(SAMTextReader.java:268)
at net.sf.samtools.SAMTextReader$RecordIterator.next(SAMTextReader.java:240)
at net.sf.samtools.SAMFileReader$AssertableIterator.next(SAMFileReader.java:609)
at net.sf.samtools.SAMFileReader$AssertableIterator.next(SAMFileReader.java:587)
at net.sf.picard.sam.SortSam.doWork(SortSam.java:58)
at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:150)
at net.sf.picard.cmdline.CommandLineProgram.instanceMainWithExit(CommandLineProgram.java:117)
at net.sf.picard.sam.SortSam.main(SortSam.java:66)
so I am guessing that it didn't like my SQ field. The thing is, I didn't generate it, it was generated automatically by novoalign (what I used for the alignment). Should I specify @SQ fields when running novoalign (and run it again)? if so, what should I put it in it?
Here's how I ran my novoalign:
novoalign -d /scratch/data/reference_genomes/human/human_g1k_v37 -F STDFQ -f 0099758_1_1.fastq 0099758_1_2.fastq -r Random -K 0099758_1 -o SAM $"@RG\tID:0099758_1.fastq\tPL:sanger\tLB:epilepsy\tSM:0099758_1" 2> 0099758_1.fastqnovo.stats > 0099758_1.fastqnovo.raw.sam
basically specifying the RG field only in the header.
-----------------------------------------
Also, this is the output of "samtools view -SH GSM520_ES.aligned.sam
samopen] SAM header is present: 84 sequences.
@HD VN:1.0 SO:unsorted
@RG ID:0099758_1.fastq PL:sanger LB:epilepsy SM:0099758_1
@PG ID:novoalign VN:V2.07.09 CL:novoalign -d human_g1k_v37 -F STDFQ -f 0099758_1_1.fastq 0099758_1_2.fastq -r Random -K 0099758_1 -o SAM @RG\tID:0099758_1.fastq\tPL:sanger\tLB:epilepsy\tSM:0099758_1
@HD VN:1.0 SO:unsorted
@RG ID:0099758_1.fastq PL:sanger LB:epilepsy SM:0099758_1
@PG ID:novoalign VN:V2.07.09 CL:novoalign -d human_g1k_v37 -F STDFQ -f 0099758_1_1.fastq 0099758_1_2.fastq -r Random -K 0099758_1 -o SAM @RG\tID:0099758_1.fastq\tPL:sanger\tLB:epilepsy\tSM:0099758_1
@SQ SN:1 AS:human_g1k_v37 LN:249250621
@SQ SN:2 AS:human_g1k_v37 LN:243199373
@SQ SN:3 AS:human_g1k_v37 LN:198022430
@SQ SN:4 AS:human_g1k_v37 LN:191154276
@SQ SN:5 AS:human_g1k_v37 LN:180915260
@SQ SN:6 AS:human_g1k_v37 LN:171115067
@SQ SN:7 AS:human_g1k_v37 LN:159138663
@SQ SN:8 AS:human_g1k_v37 LN:146364022
@SQ SN:9 AS:human_g1k_v37 LN:141213431
@SQ SN:10 AS:human_g1k_v37 LN:135534747
@SQ SN:11 AS:human_g1k_v37 LN:135006516
@SQ SN:12 AS:human_g1k_v37 LN:133851895
@SQ SN:13 AS:human_g1k_v37 LN:115169878
@SQ SN:14 AS:human_g1k_v37 LN:107349540
@SQ SN:15 AS:human_g1k_v37 LN:102531392
@SQ SN:16 AS:human_g1k_v37 LN:90354753
@SQ SN:17 AS:human_g1k_v37 LN:81195210
@SQ SN:18 AS:human_g1k_v37 LN:78077248
@SQ SN:19 AS:human_g1k_v37 LN:59128983
@SQ SN:20 AS:human_g1k_v37 LN:63025520
@SQ SN:21 AS:human_g1k_v37 LN:48129895
@SQ SN:22 AS:human_g1k_v37 LN:51304566
@SQ SN:X AS:human_g1k_v37 LN:155270560
@SQ SN:Y AS:human_g1k_v37 LN:59373566
@SQ SN:MT AS:human_g1k_v37 LN:16569
@SQ SN:GL000207.1 AS:human_g1k_v37 LN:4262
@SQ SN:GL000226.1 AS:human_g1k_v37 LN:15008
@SQ SN:GL000229.1 AS:human_g1k_v37 LN:19913
@SQ SN:GL000231.1 AS:human_g1k_v37 LN:27386
@SQ SN:GL000210.1 AS:human_g1k_v37 LN:27682
------------------------------------------------
Any help would be appreciated . Thank you!
Comment