Here is an example of short reads being aligned differently by TopHat 1.1, 1.2 and 1.3 (even though I set the segment values as recommended on the TopHat home page):
Chris
-
The original post says this:
Originally posted by Jon_Keats:
I never used TopHat v1.1.1, which listed a fix for the SAM sort header (see below), but in the newest version, TopHat v1.2.0, all my SAM files have a sort order of "sorted" rather than "coordinate" in the header. Oddly, running the file through Samtools sort does not fix the problem, but running it through Picard does. Also, the sequences in the SAM header are not listed in numeric order:
Such as:
chr1
chr11
chr12
...
chr2
chr20
Not:
chr1
chr2
chr3
chr4
Anyone else seeing these minor issues?
...
-
cjp, it's fixed in TopHat 1.3.1.
Originally posted by tophat:
TLEN field in SAM format is correctly output
5_Solexa_0503:5:75:14816:7572#0 99 SL2.40ch01 17202 255 75M = 17281 159 CGGCCGCACAGTTATTCGTGATGTCGCCATCGGATGTGGCCATAGTAATCACGGTATGTTTATTGGGGCTGCCGG CCCCCCCCCCCCCCCCCCCCCCCDCCCCCCCCCACCBCCCC@C@C@DCDCABBC?BA=@CC<@C=BBBDB@:@8@ NH:i:1 NM:i:2
5_Solexa_0503:5:75:14816:7572#0 147 SL2.40ch01 17281 255 80M = 17202 -159 TTGGGTCTTGGAGGAGGCTCTATGTCACTTGTTGGACAACTCGGTGGACAAACAGGTGGAGCCTTTAGTTACTGTTTGGA >DCCB@@?@DCDDCDBCDCACDCDAC?A=CCCCCCACCDCCACCCCCDCCCDCCCCCCCCCCCCCCCCCCCCCCCCCCCC NH:i:1 NM:i:2
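As a sanity check on the two records above, TLEN can be recomputed as the distance from the leftmost to the rightmost mapped base of the pair. This is a minimal sketch, assuming both CIGAR strings are plain matches (75M and 80M), so the reference span equals the read length. `fragment_length` is a hypothetical helper name, not part of samtools or TopHat.

```shell
# fragment_length START1 SPAN1 START2 SPAN2
# Prints the distance from the leftmost to the rightmost mapped base of a pair.
# Assumes SPAN is the number of reference bases covered (true for a pure "M" CIGAR).
fragment_length() {
    start1=$1; span1=$2; start2=$3; span2=$4
    end1=$((start1 + span1 - 1))
    end2=$((start2 + span2 - 1))
    # Leftmost start across both mates
    left=$start1
    if [ "$start2" -lt "$left" ]; then left=$start2; fi
    # Rightmost end across both mates
    right=$end1
    if [ "$end2" -gt "$right" ]; then right=$end2; fi
    echo $((right - left + 1))
}

# POS and read length of the two mates shown above
fragment_length 17202 75 17281 80
```

For these two reads it prints 159, which matches the absolute value in column 9 of both records.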
-
Are you sure it is fixed? I just found one of my TopHat 1.3 files (the @PG line says so, anyway) and its header looks like this (chromosomes are still in the order 1, 10, 11):
@HD VN:1.0 SO:coordinate
@SQ SN:1 LN:249250621
@SQ SN:10 LN:135534747
@SQ SN:11 LN:135006516
...
@SQ SN:GL000247.1 LN:36422
@SQ SN:GL000248.1 LN:39786
@SQ SN:GL000249.1 LN:38502
@SQ SN:MT LN:16569
@SQ SN:X LN:155270560
@SQ SN:Y LN:59373566
@PG ID:TopHat VN:1.3.1 CL:/home/cjp64/src/tophat-1.3.1/src/tophat -p 12 --segment-length 15 --segment-mismatches 0 -o A37_2_west -G /home/easih/gtf/hg19_ccds_08022011.gtf /home/easih/refs/human_1kg/bowtie/human_g1k_v37 /scratch/svvd2/A37/A3700002.1.f
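The two things to check in a header like this are the SO field on the @HD line and the order of the @SQ lines. Here is a minimal sketch that pulls both out of header text on standard input. `header_summary` is a hypothetical helper. In practice the input would come from `samtools view -H`.

```shell
# header_summary: reads SAM header text on stdin and prints the declared
# sort order plus the sequence names in the order they appear.
header_summary() {
    awk '
        /^@HD/ { for (i = 2; i <= NF; i++) if ($i ~ /^SO:/) print "sort order: " substr($i, 4) }
        /^@SQ/ { for (i = 2; i <= NF; i++) if ($i ~ /^SN:/) print "sequence: " substr($i, 4) }
    '
}

# Example with the first lines of the header above (tab-separated, as in real SAM)
printf '@HD\tVN:1.0\tSO:coordinate\n@SQ\tSN:1\tLN:249250621\n@SQ\tSN:10\tLN:135534747\n' | header_summary
```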
-
Didn't know that. Thanks for letting me know. Yes, that makes total sense. Fortunately, I work on 80 bp paired-end reads. The problem I faced with TopHat 1.2.0 was that column 9 of the SAM format (TLEN) was always 0. I would like to know the length of the entire mapped fragment.
Best,
Arun.
-
Originally posted by cedance:
This post is quite old, and newer versions of TopHat (since 1.3.0, I guess), developed in collaboration with the Picard developers, have overcome these issues, as well as the problem with the SAM TLEN field.
It's better to use 1.3.1 (1.3.2 is out but still in beta), in my opinion.
"For short reads (usually <45-bp), it is recommended that users decrease segment length (--segment-length) to about half the read length and segment mismatches (--segment-mismatches) to 0 or 1"
When I ran it on 36 bp data, I had to play with these settings, and I got different results than I did with TopHat 1.2: reads didn't align across splice sites, or aligned in different places or across different splice sites. In the help pages, I couldn't find an explanation of why the new version needs these parameter changes when older versions of TopHat did not.
On long-read data, TopHat 1.3 seems to work well.
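For the 36 bp case, the quoted recommendation works out roughly as follows. This is only a sketch: `segment_opts` is a hypothetical helper, "about half the read length" is taken as integer division, and the mismatch value of 0 is just one of the two values the quote allows.

```shell
# segment_opts READ_LENGTH
# Prints TopHat segment options following the "about half the read length" advice.
segment_opts() {
    read_len=$1
    echo "--segment-length $((read_len / 2)) --segment-mismatches 0"
}

segment_opts 36
```

For 36 bp reads this prints `--segment-length 18 --segment-mismatches 0`.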
-
This post is quite old, and newer versions of TopHat (since 1.3.0, I guess), developed in collaboration with the Picard developers, have overcome these issues, as well as the problem with the SAM TLEN field.
It's better to use 1.3.1 (1.3.2 is out but still in beta), in my opinion.
-
Hi Jon,
You can use Picard ReorderSam.
First, TopHat gives the wrong sort order in the header, so you'll have to change that or Picard will complain,
e.g.,
samtools view -H acc.bam | sed 's/sorted/unsorted/' > acc.header.sam
samtools reheader acc.header.sam acc.bam > acc_head.bam
This is the picard command that works for me (it uses the order of sequences in the reference file in the output BAM file):
java -jar /path/to/picard/jars/ReorderSam.jar I=acc_head.bam O=acc_order.bam R=/path/to/ref/human_g1k_v37.fasta
Then you may have to re-sort the BAM file. If you trust TopHat's sorting, though, I guess you can change the sed line above to: sed 's/sorted/coordinate/'. I like to add read group info anyway, since a lot of software, such as GATK, needs it to run, so I use Picard AddOrReplaceReadGroups, which also lets you sort with the SO option:
e.g.,
name="MY_SAMPLE"
java -jar /path/to/picard/jars/AddOrReplaceReadGroups.jar I=acc_order.bam O=acc_rg.bam RGID=$name RGLB=$name RGPL=ILLUMINA RGPU=$name RGSM=$name SO=coordinate
There are other options I use with picard:
TMP_DIR=/path/to/tmp VALIDATION_STRINGENCY=SILENT VERBOSITY=ERROR QUIET=true CREATE_INDEX=true
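One caveat about the sed step earlier in this post: `s/sorted/unsorted/` rewrites the first "sorted" on any header line, so it could also touch a @PG command line or a file path. Here is a slightly more defensive sketch that restricts the substitution to the SO field of the @HD line. `fix_sort_order` is a hypothetical name, and it reads header text on standard input.

```shell
# fix_sort_order [NEW_ORDER]
# Rewrites SO:sorted on the @HD line only; other header lines pass through unchanged.
# NEW_ORDER defaults to "unsorted"; pass "coordinate" if you trust the existing sort.
fix_sort_order() {
    new_order=${1:-unsorted}
    sed "/^@HD/ s/SO:sorted/SO:${new_order}/"
}

# Example: the @PG line contains "sorted" too, but is left alone
printf '@HD\tVN:1.0\tSO:sorted\n@PG\tID:TopHat\tCL:tophat -o sorted_out\n' | fix_sort_order
```

It would slot into the pipeline as `samtools view -H acc.bam | fix_sort_order > acc.header.sam`.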
Chris
-
Picard seems to do a better job than Samtools of recording in the SAM header whether the BAM has been sorted. I've been working with a group of computer scientists who picked up on this one, so I have changed from Samtools to Picard for SAM/BAM conversion and sorting.
-
TopHat v1.2.0 sort header
I never used TopHat v1.1.1, which listed a fix for the SAM sort header (see below), but in the newest version, TopHat v1.2.0, all my SAM files have a sort order of "sorted" rather than "coordinate" in the header. Oddly, running the file through Samtools sort does not fix the problem, but running it through Picard does. Also, the sequences in the SAM header are not listed in numeric order:
Such as:
chr1
chr11
chr12
...
chr2
chr20
Not:
chr1
chr2
chr3
chr4
Anyone else seeing these minor issues?
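The two orderings above are lexicographic versus natural (number-aware) order. This small sketch shows the difference using plain `sort`. The `-V` (version sort) flag is a GNU coreutils extension and may not be available on other systems.

```shell
names='chr1
chr2
chr11
chr12
chr20'

# Lexicographic order, as seen in the TopHat header: chr1 chr11 chr12 chr2 chr20
printf '%s\n' "$names" | LC_ALL=C sort

# Natural order: chr1 chr2 chr11 chr12 chr20
printf '%s\n' "$names" | LC_ALL=C sort -V
```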
----Previous release notes-----
TopHat 1.1.1 release 10/11/2010
This release of TopHat includes some fixes related to Colorspace read mapping.
* Negative quality values are now handled correctly.
* Comments at the beginning of csfasta files no longer trigger an error.
* --integer-quals no longer conflicts with -i
* The header in TopHat BAM files now correctly lists the sort order as coordinate, with group order reference