Have you sorted the alignment? Indexing only works for sorted alignment. Also remember to use the latest bwa. The old version may generate some funny alignments, though this happens very rarely.
Seqanswers Leaderboard Ad
Collapse
Announcement
Collapse
No announcement yet.
X
-
Originally posted by lh3 View PostHave you sorted the alignment? Indexing only works for sorted alignment. Also remember to use the latest bwa. The old version may generate some funny alignments, though this happens very rarely.
I am using bwa version 0.4.5. Is there a newer svn version?
samtools index works without issue on converted MAQ alignments.
Comment
-
Yes, I sorted it just fine. In fact the indexing step will complain that the file isn't sorted.
One issue could be that I just realized that the ref_list file I gave during the import didn't have the reference size in it. I assume this means the length of the reference sequence? I'll have to give that a try when I first import the file (convert to bam).
Comment
-
I have been looking at the SAM format to see if it is something we should consider for the output of the mapping for a genome assembly we are doing at Complete Genomics. As you may know, we have quite an unusual read structure that may be difficult to represent in SAM (5+10+10+10 times two, mate paired reads). The problem lies in the fact that the there is some overlap between the first 5 base read and the second 10 bases (we call them negative gaps).
The read could be
acgtc tcgattgcgg ...
which maps to the reference like this
acgTCgattgcc...
The capital TC show the negative gaps where there were actual overlaps in the read sequence.
Anyone now how we could represent this in SAM? Can the CIGAR standard deal with negative numbers? 5M-2M8M ?)
Should we map the 5 base read and the other 30 bp read as two separate reads? But we also have mate pairs, so how should we represent those? And would any other tool be able to deal with a read structure like this?
Thanks,
Thon
Complete GenomicsThon
__________________________________
Thon de Boer, Ph.D.
Director of Product Management, Software
Strand Life Sciences
548 Market Street, Suite 82804
San Francisco, CA 94104, USA
[email protected]
www.strandls.com
Pioneers in Discovery Research Informatics
_______________________________________
Comment
-
Negtive length in CIGAR would be good, but that is not supported at the moment. Alternatively, you can save the read as acgtggattgcc and write the CIGAR as 11M. In this way, you cannot get the original read sequence, but I guess this is not so important in most cases. What do you think?
Comment
-
Well...The reads are independent and sometimes don't agree and this is something we need to capture so we could not just remove that information.
Is there any other way in SAM (now or future) that would allow us to capture our read structure? We are going to be producing thousands of genomes very soon and I'm sure many of our customers would want to use a format that is comparable to the one used in the 1000 genome project, but we also would want our read structure to be supported in that format...
Is there a way for us to get involved in the design of the SAM standard?
Thon
Complete GenomicsThon
__________________________________
Thon de Boer, Ph.D.
Director of Product Management, Software
Strand Life Sciences
548 Market Street, Suite 82804
San Francisco, CA 94104, USA
[email protected]
www.strandls.com
Pioneers in Discovery Research Informatics
_______________________________________
Comment
-
Hello Thon,
You can join samtools' mailing lists which can be found here:
You may send around your suggestion and see what others respond. For the moment, I think you may save the read as "acgtggattgcc" and add an optional field to indicate that the first tg is actually an overlap between the first 5bp and the rest of read. In this way, you can use samtools without losing any information. Note that samtools will not look into the optional fields, but the information is kept anyway. Would this be enough for your application?
Comment
-
I copied my reply to the samtools-devel mailing list here. I think the following strategy is the best for data generated by Complete Genomics.
Now I prefer to store "acgtctcgattgcgg" as:
SEQ=acgtctcgattgcgg
CIGAR=5M2S8M (or 3M2S10M)
where S stands for "soft clipping/skip on the read". I have checked the source code and think the current "samtools pileup" supports internal soft clipping, which means pileup, consensus/indel calling and glf can be applied. Even if samtools does not support internal soft clipping now, I think it should not be hard to change the code.
Comment
-
Just wanted to let everyone know that BFAST (http://genome.ucla.edu/bfast) now supports SAM output. Congratulations Heng (I recognize your coding style from reading MAQ's source!), and others for creating a great tool (samtools) and a good start at an open format (SAM).
Nils
Comment
Latest Articles
Collapse
-
by seqadmin
Innovations in next-generation sequencing technologies and techniques are driving more precise and comprehensive exploration of complex biological systems. Current advancements include improved accessibility for long-read sequencing and significant progress in single-cell and 3D genomics. This article explores some of the most impactful developments in the field over the past year.
Long-Read Sequencing
Long-read sequencing has seen remarkable advancements,...-
Channel: Articles
12-02-2024, 01:49 PM -
ad_right_rmr
Collapse
News
Collapse
Topics | Statistics | Last Post | ||
---|---|---|---|---|
Newborn Genomic Screening Shows Promise in Reducing Infant Mortality and Hospitalization
by seqadmin
Started by seqadmin, Yesterday, 08:22 AM
|
0 responses
9 views
0 likes
|
Last Post
by seqadmin
Yesterday, 08:22 AM
|
||
Started by seqadmin, 12-02-2024, 09:29 AM
|
0 responses
169 views
0 likes
|
Last Post
by seqadmin
12-02-2024, 09:29 AM
|
||
Started by seqadmin, 12-02-2024, 09:06 AM
|
0 responses
60 views
0 likes
|
Last Post
by seqadmin
12-02-2024, 09:06 AM
|
||
Started by seqadmin, 12-02-2024, 08:03 AM
|
0 responses
51 views
0 likes
|
Last Post
by seqadmin
12-02-2024, 08:03 AM
|
Comment