Forget what I wrote in the last post (I will not delete it though).
The solution is much simpler. I didn't align the reads myself, but recieved those files from a former collaborator and had no clue about the aligner which was used. The SAM/BAM format specification allow for two different letters to indicate a gap in the mapping: N for Introns and D for deletions.
However the -split option of bedtools only splits reads, if the gaps are recorded as N in the CIGAR string of the BAM file. (there is the special option -splitD to force bedtools to split also reads seperated by D). In my BAM file, I have had some intron-spanning reads (the longest introns) which were recorded with Ds and therefore not splitted.
Seqanswers Leaderboard Ad
Collapse
Announcement
Collapse
No announcement yet.
X
-
Some pondering about this problem
Dear John,
As mentioned in the previous post, I have no bioinformatic education (apart from self-taught stuff) and depend on forums like that to get answers. So I unfortunately can't provide you with a satisfactory, expert-bioinformatician-approved, ready-to-use solution to this problem.
However I spent some time trying and will summarise what I did in the end. I decided that there should be an possibility to filter out the undesired reads, which cause the plateaus.
Therefore I first obtained a BED file with the coordinates of all introns. If you have a known organism with a released reference genome on the UCSC Genome Browser this is fairly easy:
Navigate to the UCSC Table tool and decide for your appropriate organism and assembly. Then select as group "Gene and Gene Prediction Tracks" and the gene annotation track of your choice (e.g. " Refseq Genes" or "Ensembl Genes"). Choose the region "genome" and the output format "BED - browser extensible data". Name the file and click on the "get output" button. In the radio button form select "Introns plus 0" for the exact intron regions.
I used this introns.bed file to intersect my aligned reads with. To do so, I applied intersectBed from the bedtools suite and retained only those reads, which didn't overlap more than 30% of the read length with an intron (-v and -f 0.3 options).
Code:bedtools bamtobed -split -i Alignment.bam | bedtools intersect -v -f 0.30 -a stdin -b refseq_introns.bed > alignment-piped.bed bedtools intersect -v -f 0.30 -abam -split -a Alignment.bam -b refseq_introns.bed > alignment-direct.bed
The second commands execution time is much faster (because it is not piped) and it retains for my file less, but longer fragments. (BED file entries: command 1: 24797290, command 2: 19152916)
I usedCode:awk < alignment-piped.bed '{print $3-$2 '\n'}' >> pipedlength.txt
> summary(pipedlength)
X36
Min. : 1.00
1st Qu.:36.00
Median :36.00
Mean :31.89
3rd Qu.:36.00
Max. :36.00
> summary(directlength)
X36
Min. : 36.0
1st Qu.: 36.0
Median : 36.0
Mean : 37.9
3rd Qu.: 36.0
Max. :427053.0
>
So you have the choice: Apply the first command (despite the longer processing time) or use the second command and additionally filter those long reads.
Code:awk '$3-$2 <= 36 { print }' alignment-direct.bed > alignment-direct-filtered.bed
I hope I could help you a bit and if you find out about the reason of this differences, I'd be happy to read you reply here.
Best
MatthiasLast edited by Thias; 01-23-2014, 06:50 AM.
Leave a comment:
-
Thias, did you find a solution? I have a similar problem with two data sets I created. I can't see what I did different that would result in there being values for the introns in the ucsc browser for one experiment but not the other.
thanks
John
Leave a comment:
-
Intron spanning reads result in plateaus in wiggle format ?
Hello everybody,
I would like to visualise the amount of mapped reads of custom RNAseq samples in the UCSC Genome Browser.
For this purpose, I can start from already sorted and indexed BAM files, which we obtained from a former collaborator. Since we are all wetlab scientists with limited insight into RNAseq, I unfortunately can’t tell you, which platform and aligner was used.
To generate the wiggle format tracks, I tried the following code, but despite using the –split option of genomeCoverageBed, I can’t reproduce the tracks as they have been previously created by the collaborator.
Code:genomeCoverageBed –bg -split -ibam ~/Bioinfo/BAM/RNASeq/unique_mapped/RNASeq_HYPO_L1_unique.bam -g ./chromInfo-mm9.txt > ~/Bioinfo/Bioinformatics/HypoL1.bedgraph wigToBigWig ./Bioinformatics/HypoL1.bedgraph ./chromInfo-mm9.txt ./Bioinformatics/HypoL1.bigwig
Somehow, the former collaborator either excluded those reads or split them to into two pieces – the blue track at least doesn’t show those plateaus. (The track in the middle shows the collaborators bedgraph track after conversion into BigWig - so its not the different format)
Link to public example session
Does anybody have a hint for me ?
Thanks a lot
ThiasTags: None
Latest Articles
Collapse
-
by seqadmin
Innovations in next-generation sequencing technologies and techniques are driving more precise and comprehensive exploration of complex biological systems. Current advancements include improved accessibility for long-read sequencing and significant progress in single-cell and 3D genomics. This article explores some of the most impactful developments in the field over the past year.
Long-Read Sequencing
Long-read sequencing has seen remarkable advancements,...-
Channel: Articles
12-02-2024, 01:49 PM -
ad_right_rmr
Collapse
News
Collapse
Topics | Statistics | Last Post | ||
---|---|---|---|---|
Started by seqadmin, Today, 08:24 AM
|
0 responses
9 views
0 likes
|
Last Post
by seqadmin
Today, 08:24 AM
|
||
Started by seqadmin, Yesterday, 07:41 AM
|
0 responses
8 views
0 likes
|
Last Post
by seqadmin
Yesterday, 07:41 AM
|
||
Started by seqadmin, 12-11-2024, 07:45 AM
|
0 responses
13 views
0 likes
|
Last Post
by seqadmin
12-11-2024, 07:45 AM
|
||
Started by seqadmin, 12-10-2024, 07:59 AM
|
0 responses
14 views
0 likes
|
Last Post
by seqadmin
12-10-2024, 07:59 AM
|
Leave a comment: