
**lh3** · 03-06-2009, 10:56 AM

Have you sorted the alignment? Indexing only works on sorted alignments. Also remember to use the latest bwa. Older versions may generate some funny alignments, though this happens very rarely.

**webbrewer** · 03-06-2009, 11:50 AM

Originally posted by lh3:

> Have you sorted the alignment? Indexing only works on sorted alignments. Also remember to use the latest bwa. Older versions may generate some funny alignments, though this happens very rarely.

I hadn't sorted it before. I have now run "samtools sort" and then "samtools index" on the sorted output, but the result is the same: a seg fault.
I am using bwa version 0.4.5. Is there a newer svn version?
samtools index works without issue on converted MAQ alignments.

**lparsons** · 03-10-2009, 02:12 PM

I imported an ELAND alignment and was able to convert it into SAM, then to BAM, and then sort it. However, at the indexing step I too ran into a segmentation fault. I'm using the 0.1.2 version from the download page, not from SVN. Any suggestions?

**lh3** · 03-10-2009, 02:20 PM

Have you sorted the alignment first? Indexing in 0.1.2 has a bug, but it should not cause a segfault. Thanks.

**lparsons** · 03-10-2009, 02:25 PM

Yes, I sorted it just fine. In fact the indexing step will complain that the file isn't sorted.

One possible issue: I just realized that the ref_list file I gave during the import didn't have the reference size in it. I assume this means the length of each reference sequence? I'll try adding it when I first import the file (convert to BAM).

**lh3** · 03-10-2009, 02:36 PM

You can generate the ref_list file by running "bam faidx" on your reference sequence. The resulting index file can be used with import. Note that faidx also lets you quickly extract subsequences from the genome, which may be useful to you.
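
For readers wondering what the reference-size information looks like, here is a minimal Python sketch that derives a name/length table from a FASTA file. It is only an illustration, not the samtools implementation: the function names `fasta_lengths` and `ref_list_text` are hypothetical, and the output carries just the sequence name and length (an index written by faidx has additional columns that this sketch does not produce).

```python
def fasta_lengths(lines):
    """Yield (name, length) for each sequence in FASTA-formatted lines."""
    name, length = None, 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, length
            # The name is the first whitespace-delimited token of the header.
            fields = line[1:].split()
            name, length = (fields[0] if fields else ""), 0
        elif name is not None:
            length += len(line)
    if name is not None:
        yield name, length


def ref_list_text(lines):
    """Format the lengths as tab-separated 'name<TAB>length' lines."""
    return "".join(f"{name}\t{length}\n" for name, length in fasta_lengths(lines))


# Hypothetical two-sequence reference, used only for illustration.
example = [">chr1 test", "ACGTACGT", "ACG", ">chr2", "GGCC"]
print(ref_list_text(example), end="")
```

With a real reference you would pass an open file handle instead of the `example` list.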

**lparsons** · 03-11-2009, 03:19 PM

Thanks for the quick replies. I've tried with various values, etc. and the indexing step still seg faults. Any other ideas on debugging this? Perhaps using an older version? Thanks for any ideas.

**thondeboer** · 03-17-2009, 10:55 PM

I have been looking at the SAM format to see if it is something we should consider for the output of the mapping for a genome assembly we are doing at Complete Genomics. As you may know, we have quite an unusual read structure that may be difficult to represent in SAM (5+10+10+10 times two, mate paired reads). The problem lies in the fact that there is some overlap between the first 5 base read and the second 10 bases (we call them negative gaps).

The read could be

acgtc tcgattgcgg ...

which maps to the reference like this

acgTCgattgcc...

The capital TC show the negative gaps where there were actual overlaps in the read sequence.

Does anyone know how we could represent this in SAM? Can the CIGAR standard deal with negative numbers (e.g. 5M-2M8M)?

Should we map the 5 base read and the other 30 bp read as two separate reads? But we also have mate pairs, so how should we represent those? And would any other tool be able to deal with a read structure like this?

Thanks,

Thon
Complete Genomics

**lh3** · 03-18-2009, 01:31 AM

Negative length in CIGAR would be good, but that is not supported at the moment. Alternatively, you can save the read as acgtggattgcc and write the CIGAR as 11M. In this way, you cannot get the original read sequence, but I guess this is not so important in most cases. What do you think?

**thondeboer** · 03-18-2009, 09:13 AM

Well... the reads are independent and sometimes don't agree, and this is something we need to capture, so we could not just remove that information.

Is there any other way in SAM (now or in the future) that would allow us to capture our read structure? We are going to be producing thousands of genomes very soon, and I'm sure many of our customers will want to use a format that is comparable to the one used in the 1000 Genomes Project, but we would also want our read structure to be supported in that format...

Is there a way for us to get involved in the design of the SAM standard?

Thon
Complete Genomics

**lh3** · 03-18-2009, 02:20 PM

Hello Thon,

You can join the samtools mailing lists, which can be found here:

http://sourceforge.net/mail/?group_id=246254

You may circulate your suggestion and see how others respond. For the moment, I think you may save the read as "acgtggattgcc" and add an optional field to indicate that the first tg is actually an overlap between the first 5bp and the rest of the read. In this way, you can use samtools without losing any information. Note that samtools will not look into the optional fields, but the information is kept anyway. Would this be enough for your application?

**lh3** · 03-20-2009, 01:39 AM

I copied my reply to the samtools-devel mailing list here. I think the following strategy is the best for data generated by Complete Genomics.

Now I prefer to store "acgtctcgattgcgg" as:

SEQ=acgtctcgattgcgg
CIGAR=5M2S8M (or 3M2S10M)

where S stands for "soft clipping/skip on the read". I have checked the source code and think the current "samtools pileup" supports internal soft clipping, which means pileup, consensus/indel calling and glf can be applied. Even if samtools does not support internal soft clipping now, I think it should not be hard to change the code.
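
To make the two proposed CIGAR strings concrete, here is a small Python sketch that walks a CIGAR string along the read and reports which read bases land at which reference offsets. It is an illustration under assumptions, not samtools code: the names `parse_cigar` and `aligned_blocks` are hypothetical, and S is treated exactly as described in this post (it consumes read bases but no reference bases), wherever it appears in the string.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_READ = set("MIS=X")   # operations that advance along the read
CONSUMES_REF = set("MDN=X")    # operations that advance along the reference


def parse_cigar(cigar):
    """Split a CIGAR string into (length, op) pairs; reject anything else."""
    if not re.fullmatch(r"(\d+[MIDNSHP=X])+", cigar):
        raise ValueError(f"unsupported CIGAR string: {cigar!r}")
    return [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]


def aligned_blocks(seq, cigar):
    """Return (reference_span, [(ref_offset, read_bases), ...])."""
    read_pos = ref_pos = 0
    blocks = []
    for n, op in parse_cigar(cigar):
        if op in "M=X":
            blocks.append((ref_pos, seq[read_pos:read_pos + n]))
        if op in CONSUMES_READ:
            read_pos += n
        if op in CONSUMES_REF:
            ref_pos += n
    if read_pos != len(seq):
        raise ValueError("CIGAR does not account for every base in SEQ")
    return ref_pos, blocks


seq = "acgtctcgattgcgg"
for cigar in ("5M2S8M", "3M2S10M"):
    print(cigar, aligned_blocks(seq, cigar))
```

Both encodings cover the same 13 reference bases; they differ only in which copy of the overlapping "tc" is skipped. The parser also rejects a string such as 5M-2M8M, which matches the earlier remark that negative lengths are not supported.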

**nilshomer** · 03-22-2009, 06:51 PM

Just wanted to let everyone know that BFAST (http://genome.ucla.edu/bfast) now supports SAM output. Congratulations Heng (I recognize your coding style from reading MAQ's source!), and others for creating a great tool (samtools) and a good start at an open format (SAM).

Nils

**qtrinh** · 03-25-2009, 11:43 AM

I got a seg fault when running "samtools pileup". The error message is "[bam_pileup_core] the input is not sorted. Abort!", even though I am using the sorted input file. Has anyone else seen this before?

Q

**lh3** · 03-25-2009, 11:48 AM

You should run "samtools sort" first. Pileup works on a stream without loading the whole alignment into memory, and therefore the alignments must be sorted by chromosomal position.
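
The streaming requirement can be illustrated with a short Python sketch that checks sort order in a single pass, keeping only the previous record in memory. This is an assumption-laden illustration rather than what samtools does internally: the function name `first_unsorted` is hypothetical, and each alignment is reduced to a `(reference_index, position)` tuple.

```python
def first_unsorted(records):
    """Return the index of the first out-of-order record, or None if sorted.

    `records` is any iterable of (reference_index, position) tuples, so the
    check works on a stream and never holds more than one previous record.
    """
    prev = None
    for i, rec in enumerate(records):
        # Tuples compare by reference index first, then by position.
        if prev is not None and rec < prev:
            return i
        prev = rec
    return None


# Hypothetical records, used only for illustration.
sorted_stream = [(0, 10), (0, 10), (0, 50), (1, 5)]
unsorted_stream = [(0, 10), (0, 50), (0, 20)]
print(first_unsorted(sorted_stream), first_unsorted(unsorted_stream))
```

A check like this is one way to confirm whether the file being passed to pileup really is the sorted output rather than the original unsorted one.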
