Have you sorted the alignment? Indexing only works for sorted alignment. Also remember to use the latest bwa. The old version may generate some funny alignments, though this happens very rarely.
Seqanswers Leaderboard Ad
Collapse
Announcement
Collapse
No announcement yet.
X
-
Originally posted by lh3 View PostHave you sorted the alignment? Indexing only works for sorted alignment. Also remember to use the latest bwa. The old version may generate some funny alignments, though this happens very rarely.
I am using bwa version 0.4.5. Is there a newer svn version?
samtools index works without issue on converted MAQ alignments.
Comment
-
Yes, I sorted it just fine. In fact the indexing step will complain that the file isn't sorted.
One issue could be that I just realized that the ref_list file I gave during the import didn't have the reference size in it. I assume this means the length of the reference sequence? I'll have to give that a try when I first import the file (convert to bam).
Comment
-
I have been looking at the SAM format to see if it is something we should consider for the output of the mapping for a genome assembly we are doing at Complete Genomics. As you may know, we have quite an unusual read structure that may be difficult to represent in SAM (5+10+10+10 times two, mate paired reads). The problem lies in the fact that the there is some overlap between the first 5 base read and the second 10 bases (we call them negative gaps).
The read could be
acgtc tcgattgcgg ...
which maps to the reference like this
acgTCgattgcc...
The capital TC show the negative gaps where there were actual overlaps in the read sequence.
Anyone now how we could represent this in SAM? Can the CIGAR standard deal with negative numbers? 5M-2M8M ?)
Should we map the 5 base read and the other 30 bp read as two separate reads? But we also have mate pairs, so how should we represent those? And would any other tool be able to deal with a read structure like this?
Thanks,
Thon
Complete GenomicsThon
__________________________________
Thon de Boer, Ph.D.
Director of Product Management, Software
Strand Life Sciences
548 Market Street, Suite 82804
San Francisco, CA 94104, USA
[email protected]
www.strandls.com
Pioneers in Discovery Research Informatics
_______________________________________
Comment
-
Negtive length in CIGAR would be good, but that is not supported at the moment. Alternatively, you can save the read as acgtggattgcc and write the CIGAR as 11M. In this way, you cannot get the original read sequence, but I guess this is not so important in most cases. What do you think?
Comment
-
Well...The reads are independent and sometimes don't agree and this is something we need to capture so we could not just remove that information.
Is there any other way in SAM (now or future) that would allow us to capture our read structure? We are going to be producing thousands of genomes very soon and I'm sure many of our customers would want to use a format that is comparable to the one used in the 1000 genome project, but we also would want our read structure to be supported in that format...
Is there a way for us to get involved in the design of the SAM standard?
Thon
Complete GenomicsThon
__________________________________
Thon de Boer, Ph.D.
Director of Product Management, Software
Strand Life Sciences
548 Market Street, Suite 82804
San Francisco, CA 94104, USA
[email protected]
www.strandls.com
Pioneers in Discovery Research Informatics
_______________________________________
Comment
-
Hello Thon,
You can join samtools' mailing lists which can be found here:
You may send around your suggestion and see what others respond. For the moment, I think you may save the read as "acgtggattgcc" and add an optional field to indicate that the first tg is actually an overlap between the first 5bp and the rest of read. In this way, you can use samtools without losing any information. Note that samtools will not look into the optional fields, but the information is kept anyway. Would this be enough for your application?
Comment
-
I copied my reply to the samtools-devel mailing list here. I think the following strategy is the best for data generated by Complete Genomics.
Now I prefer to store "acgtctcgattgcgg" as:
SEQ=acgtctcgattgcgg
CIGAR=5M2S8M (or 3M2S10M)
where S stands for "soft clipping/skip on the read". I have checked the source code and think the current "samtools pileup" supports internal soft clipping, which means pileup, consensus/indel calling and glf can be applied. Even if samtools does not support internal soft clipping now, I think it should not be hard to change the code.
Comment
-
Just wanted to let everyone know that BFAST (http://genome.ucla.edu/bfast) now supports SAM output. Congratulations Heng (I recognize your coding style from reading MAQ's source!), and others for creating a great tool (samtools) and a good start at an open format (SAM).
Nils
Comment
Latest Articles
Collapse
-
by seqadmin
The sequencing world is rapidly changing due to declining costs, enhanced accuracies, and the advent of newer, cutting-edge instruments. Equally important to these developments are improvements in sequencing analysis, a process that converts vast amounts of raw data into a comprehensible and meaningful form. This complex task requires expertise and the right analysis tools. In this article, we highlight the progress and innovation in sequencing analysis by reviewing several of the...-
Channel: Articles
05-06-2024, 07:48 AM -
-
by seqadmin
The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...-
Channel: Articles
04-22-2024, 07:01 AM -
ad_right_rmr
Collapse
News
Collapse
Topics | Statistics | Last Post | ||
---|---|---|---|---|
Started by seqadmin, 05-10-2024, 06:35 AM
|
0 responses
19 views
0 likes
|
Last Post
by seqadmin
05-10-2024, 06:35 AM
|
||
Started by seqadmin, 05-09-2024, 02:46 PM
|
0 responses
21 views
0 likes
|
Last Post
by seqadmin
05-09-2024, 02:46 PM
|
||
Started by seqadmin, 05-07-2024, 06:57 AM
|
0 responses
19 views
0 likes
|
Last Post
by seqadmin
05-07-2024, 06:57 AM
|
||
Started by seqadmin, 05-06-2024, 07:17 AM
|
0 responses
21 views
0 likes
|
Last Post
by seqadmin
05-06-2024, 07:17 AM
|
Comment