samtools problems with reference in pileup file

nupurgupta replied

03-05-2013, 01:49 PM
Same problem

Did you find a solution to the 'null' problem, please?

Originally posted by ericpante

Hi everybody,

I am having similar problems with samtools 0.1.18. I would like to have reference characters listed in a pileup file, but I have problems with headers.

samtools faidx AGSbrut.fasta
samtools view -q 20 -buh -t AGSbrut.fasta.fai A.sam | samtools sort - A
samtools view -q 20 -buh -t AGSbrut.fasta.fai S.sam | samtools sort - S
samtools mpileup -B -f AGSbrut_index.fai A.bam S.bam > AS.mpileup

[fai_build_core] different line length in sequence 'null'.
Segmentation fault

I hypothesized that this 'null' sequence may be a blank line; so I looked for it manually and with sed, with no luck. I also looked for other potential problems based on what was previously reported (no extra spaces, characters, etc in reference sequence names in fai and sam files). I also tried to re-head the file, with no success:

samtools view -HS -t AGSbrut.fasta.fai A.sam > Aheader.sam
samtools reheader Aheader.sam A.bam > Aheaded.bam

[bam_header_read] EOF marker is absent. The input is probably truncated.

All insights are welcome!
thank you, eric
jgibbons1 replied

01-22-2013, 11:10 AM
Using pileup with the -f argument allows you to supply the faidx indexed reference sequence file. I used this option and it fixed my problem.
jgibbons1 replied

01-22-2013, 08:41 AM
Hey folks,

I have been struggling to figure out why I am getting N's for my pileup reference sequence. I found hope when I discovered this thread, but I have followed all the suggestions to no avail. I've tried this with different versions of samtools, different data sets, and different reference files, and I have simplified ID names, rebuilt the faidx index, etc.

Still can't figure out what's going on here. Has anyone found any other solutions?

Thanks
adowney replied

12-08-2011, 12:43 PM
Originally posted by colindaven

Here's another possible solution - the headers are not consistent between SAM/BAM and the original fasta:

Even though the reference file was the same one in both cases, sometimes aligners just write a substring out into the SAM file. Samtools seems to take the full header.

For example the first contiguous part of my genome header is
gi|110645304|ref|NC_002516.2|

However in my SAM file the aligner has only written
NC_002516.2

Samtools has written the full header to the .fa.fai index
gi|110645304|ref|NC_002516.2|

... and this does not match.

Solution:

Try changing the original header in the reference FASTA to just the substring that the aligner uses, e.g.
gi|110645304|ref|NC_002516.2|
to
NC_002516.2

The above suggestion fixed the problem when I got this error.
ericpante replied

11-25-2011, 10:46 AM
Hi everybody,

I am having similar problems with samtools 0.1.18. I would like to have reference characters listed in a pileup file, but I have problems with headers.

samtools faidx AGSbrut.fasta
samtools view -q 20 -buh -t AGSbrut.fasta.fai A.sam | samtools sort - A
samtools view -q 20 -buh -t AGSbrut.fasta.fai S.sam | samtools sort - S
samtools mpileup -B -f AGSbrut_index.fai A.bam S.bam > AS.mpileup

[fai_build_core] different line length in sequence 'null'.
Segmentation fault

I hypothesized that this 'null' sequence may be a blank line; so I looked for it manually and with sed, with no luck. I also looked for other potential problems based on what was previously reported (no extra spaces, characters, etc in reference sequence names in fai and sam files). I also tried to re-head the file, with no success:

samtools view -HS -t AGSbrut.fasta.fai A.sam > Aheader.sam
samtools reheader Aheader.sam A.bam > Aheaded.bam

[bam_header_read] EOF marker is absent. The input is probably truncated.

All insights are welcome!
thank you, eric
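
One detail in the commands above may be worth checking. The -f option takes the faidx-indexed reference sequence file, i.e. the FASTA itself, whereas the mpileup line passes AGSbrut_index.fai. If samtools tries to index that file as though it were FASTA, it would find no '>' header line, which might explain a sequence named 'null'. This is a guess from the commands shown, not something confirmed in this thread. A minimal sketch of a sanity check follows; looks_like_fasta is a hypothetical helper name, not part of samtools.

```shell
# Hypothetical helper: succeed if the first non-blank line of a file
# starts with ">", i.e. the file at least looks like FASTA.
# Usage: looks_like_fasta FILE
looks_like_fasta() {
    awk 'NF { ok = ($0 ~ /^>/); exit } END { exit !ok }' "$1"
}

# Example use before calling mpileup (file names taken from the post):
#   looks_like_fasta AGSbrut_index.fai || echo "not FASTA: pass AGSbrut.fasta to -f instead"
```

If the check fails for the file given to -f, pointing -f at AGSbrut.fasta (with AGSbrut.fasta.fai sitting next to it) would be the first thing to try.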
smehr12 replied

05-05-2011, 11:11 AM
Originally posted by SMHfrog

I had this same problem, and after seeing no solution here did some more digging, and have a possible solution for you.

I noticed that the ref.fa.fai file for my whole genome was 0 kb. The .fai is used by samtools when building the pileup. When I ran the command to re-build the .fai:

samtools faidx reference.fa

I got the following error message:

[fai_build_core] different line length in sequence 'scaffold_14'.
Segmentation fault

No doubt this same message occurred the first time I ran the pileup command (which also builds the .fai if it doesn't exist), but I apparently didn't pay attention. After that first time, the .fai file EXISTED so no errors were subsequently reported when I ran pileup again.

In my case, there was an extra line after scaffold_14. I removed this, and re-built the .fai using the samtools faidx command and then re-ran the pileup command. My pileup then contained the reference base as intended!

Hope this helps y'all find the solution to your problem.
Best,
Shannon
University of Texas at Austin

Hi all,
I have the same error.
samtools faidx bwa.ref/ref.fasta ref.fa

ERROR:
different line length in sequence 'scaffold_67'.
Segmentation fault
NOTE: I see NNNN in that scaffold. Does anyone have a suggestion?
bgibb replied

04-20-2011, 05:39 PM
I noticed the same problem when running pileup under SAMtools-0.1.15. However, the problem does not seem to occur when running pileup under SAMtools-0.1.4 (using the same reference file, same BAM file and same command line options).

samtools-0.1.4/samtools pileup -s -f reference.fa sorted.bam > pileup.out
colindaven replied

02-03-2011, 08:31 AM
Here's another possible solution - the headers are not consistent between SAM/BAM and the original fasta:

Even though the reference file was the same one in both cases, sometimes aligners just write a substring out into the SAM file. Samtools seems to take the full header.

For example the first contiguous part of my genome header is
gi|110645304|ref|NC_002516.2|

However in my SAM file the aligner has only written
NC_002516.2

Samtools has written the full header to the .fa.fai index
gi|110645304|ref|NC_002516.2|

... and this does not match.

Solution:

Try changing the original header in the reference FASTA to just the substring that the aligner uses, e.g.
gi|110645304|ref|NC_002516.2|
to
NC_002516.2
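
For anyone wanting to test for this mismatch directly, here is a rough sketch that lists reference names found in the SAM/BAM header but missing from the .fai. It assumes the header has been dumped to a text file first (for example with samtools view -H) and that the .fai is the usual tab-separated index with the sequence name in column 1. missing_from_fai is a hypothetical helper name.

```shell
# Hypothetical helper: print every @SQ SN: name from a SAM header file
# that does not appear in column 1 of a .fai index.
# Usage: missing_from_fai header.sam reference.fa.fai
missing_from_fai() {
    awk -F '\t' '
        FILENAME == ARGV[1] { known[$1] = 1; next }   # first file: the .fai
        $1 == "@SQ" {
            for (i = 2; i <= NF; i++)
                if ($i ~ /^SN:/) {
                    name = substr($i, 4)
                    if (!(name in known)) print name
                }
        }
    ' "$2" "$1"
}

# Example use:
#   samtools view -H sorted.bam > header.sam
#   missing_from_fai header.sam reference.fa.fai
```

Any name it prints is one the aligner wrote differently from the FASTA header, which is the situation described above. No output means the names agree and the cause is probably elsewhere.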
smol replied

11-25-2010, 10:12 AM
Hi
I'm having the same problem with Ns in my pileup file and have tried everything mentioned above (thanks for suggestions!). I am using:

./samtools pileup data.sorted.bam -f reference.fasta > data.pileup

My reference .fai file looks like this:

chr2L 49364325 7 60 61
chr2R 61545105 50187078 60 61
chr3L 41963435 112757942 60 61
chr3R 53200684 155420775 60 61
chrUNKN 42389979 209508147 60 61
chrX 24393108 252604632 60 61
chrY 237045 277404298 60 61
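
For what it's worth, the index listed above looks internally consistent. Taking chr2L: offset 7 plus 49364325 bases plus 822739 line endings (60 bases per line) gives 50187071, and the 7 bytes of the ">chr2R" header line bring that to 50187078, which is the offset listed for chr2R. The sketch below repeats that arithmetic for every row. It assumes each FASTA header line contains only the sequence name, that there are no blank lines between records, and that every record ends with a line ending. check_fai_offsets is a hypothetical helper name.

```shell
# Hypothetical helper: recompute each .fai offset from the previous row
# and report rows where the listed offset differs.
# Columns: name, length, offset, bases per line, bytes per line.
# Usage: check_fai_offsets reference.fa.fai
check_fai_offsets() {
    awk '
        NR > 1 {
            # Next record should start right after ">" + name + line ending.
            expected = end + 1 + length($1) + nl
            if ($3 != expected)
                printf "%s: offset %s, expected %.0f\n", $1, $3, expected
        }
        $4 > 0 {
            nl = $5 - $4                         # bytes used by each line ending
            lines = int(($2 + $4 - 1) / $4)      # ceil(length / bases per line)
            end = $3 + $2 + lines * nl           # first byte after this record
        }
    ' "$1"
}
```

No output means the rows agree with each other, so in a case like this the .fai itself is probably not the culprit and the sequence names or the samtools version may be better places to look.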

Any ideas?
brutus replied

09-03-2010, 03:26 AM
I also had this experience; in my case, the problem disappeared when I removed spaces in the reference sequence name.
mmartin replied

08-25-2010, 07:52 AM
I had the same problem. In my case, I had colons in the reference sequence names, something like "Region1:1-100". When I removed them, samtools pileup worked as expected.
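
A quick way to spot names like this, or names containing spaces as reported elsewhere in this thread, is to grep the FASTA header lines. This is only a sketch: suspicious_names is a hypothetical helper name, and a header that merely carries a description after the name will also be listed.

```shell
# Hypothetical helper: print (with line numbers) FASTA header lines that
# contain whitespace or a colon anywhere after the leading ">".
# Usage: suspicious_names reference.fa
suspicious_names() {
    grep -n '^>.*[[:space:]:]' "$1"
}
```

Each output line has the form LINE:HEADER, so a header such as ">Region1:1-100" on line 3 would be shown as "3:>Region1:1-100".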
hollandorange replied

08-06-2010, 04:48 AM
I got the same problem.
chr17 418628 N 54
chr17 418629 N 58
chr17 418630 N 57
skingan replied

07-22-2010, 05:50 AM
It turned out to be a similar problem to the one SMHfrog had. In my reference file, each chromosome sequence was on a single line, so when samtools built the .fai file there was a segmentation fault because of the length of the sequence. I used a different reference with line breaks and it worked. I used the same reference file for the Mosaik run and the pileup build.
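
If no reference with line breaks is at hand, re-wrapping the existing one is another option. The sketch below joins each record's sequence and prints it back at a fixed width (60 by default). wrap_fasta is a hypothetical helper name, and holding a whole chromosome in one awk string is an assumption that may not hold for very large genomes or every awk implementation.

```shell
# Hypothetical helper: rewrite a FASTA file so that every sequence line
# has the same width (default 60), except the last line of each record.
# Usage: wrap_fasta in.fa [WIDTH] > out.fa
wrap_fasta() {
    awk -v w="${2:-60}" '
        function flush(   i, n) {
            n = length(seq)
            for (i = 1; i <= n; i += w) print substr(seq, i, w)
            seq = ""
        }
        /^>/ { flush(); print; next }   # emit the previous record, then the header
        { seq = seq $0 }                # accumulate sequence lines
        END { flush() }
    ' "$1"
}
```

After re-wrapping, the .fai would need to be rebuilt with samtools faidx, and the same wrapped file should then be used for both the aligner and the pileup step.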
Sarah
thaley replied

07-16-2010, 07:17 AM
Ran into the same problem. It may be worth someone adding a note about null strings in the reference to the faidx documentation, or making the thrown error more descriptive.
SMHfrog replied

07-07-2010, 01:03 PM
I had this same problem, and after seeing no solution here did some more digging, and have a possible solution for you.

I noticed that the ref.fa.fai file for my whole genome was 0 kb. The .fai is used by samtools when building the pileup. When I ran the command to re-build the .fai:

samtools faidx reference.fa

I got the following error message:

[fai_build_core] different line length in sequence 'scaffold_14'.
Segmentation fault

No doubt this same message occurred the first time I ran the pileup command (which also builds the .fai if it doesn't exist), but I apparently didn't pay attention. After that first time, the .fai file EXISTED so no errors were subsequently reported when I ran pileup again.
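
Because a leftover empty .fai hides the original error, it can be worth checking the index before every pileup run. A minimal sketch, assuming the index sits next to the FASTA as FILE.fai; fai_ok is a hypothetical helper name, and the -ot file test is not in POSIX, although common shells such as bash support it.

```shell
# Hypothetical helper: succeed only if FILE.fai exists, is non-empty,
# and is not older than FILE.
# Usage: fai_ok reference.fa || samtools faidx reference.fa
fai_ok() {
    fai="$1.fai"
    if [ ! -s "$fai" ]; then
        echo "$fai is missing or empty" >&2
        return 1
    fi
    if [ "$fai" -ot "$1" ]; then
        echo "$fai is older than $1" >&2
        return 1
    fi
    return 0
}
```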

In my case, there was an extra line after scaffold_14. I removed this, and re-built the .fai using the samtools faidx command and then re-ran the pileup command. My pileup then contained the reference base as intended!
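
Finding the offending line by eye is tedious in a large genome, so here is a rough sketch that scans for it. It encodes my understanding of what faidx expects: within a record, every sequence line has the same length except the last one, so a non-empty line that follows a line of a different length is suspect. check_fasta_lines is a hypothetical helper name, and the exact rule may differ between samtools versions.

```shell
# Hypothetical helper: report the first suspicious line of each FASTA
# record, i.e. a non-empty sequence line that comes after a line whose
# length differs from the first sequence line of that record.
# Usage: check_fasta_lines reference.fa
check_fasta_lines() {
    awk '
        BEGIN { width = -1 }
        /^>/ {
            name = substr($0, 2)
            sub(/[ \t].*$/, "", name)   # keep the name up to the first whitespace
            width = -1                  # length of the first sequence line, unknown yet
            ended = 0                   # set once a line of a different length is seen
            reported = 0
            next
        }
        {
            len = length($0)
            if (width < 0) { if (len > 0) width = len; next }
            if (ended && len > 0 && !reported) {
                printf "%s: line %d follows a line of different length\n", name, NR
                reported = 1
            }
            if (len != width) ended = 1
        }
    ' "$1"
}
```

The line number it prints points at the first line after the irregular one, so the extra or short line to remove is usually just above it.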

Hope this helps y'all find the solution to your problem.
Best,
Shannon
University of Texas at Austin