Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Thias
    replied
    Forget what I wrote in the last post (I will not delete it though).

    The solution is much simpler. I didn't align the reads myself, but recieved those files from a former collaborator and had no clue about the aligner which was used. The SAM/BAM format specification allow for two different letters to indicate a gap in the mapping: N for Introns and D for deletions.

    However the -split option of bedtools only splits reads, if the gaps are recorded as N in the CIGAR string of the BAM file. (there is the special option -splitD to force bedtools to split also reads seperated by D). In my BAM file, I have had some intron-spanning reads (the longest introns) which were recorded with Ds and therefore not splitted.

    Leave a comment:


  • Thias
    replied
    Some pondering about this problem

    Dear John,

    As mentioned in the previous post, I have no bioinformatic education (apart from self-taught stuff) and depend on forums like that to get answers. So I unfortunately can't provide you with a satisfactory, expert-bioinformatician-approved, ready-to-use solution to this problem.

    However I spent some time trying and will summarise what I did in the end. I decided that there should be an possibility to filter out the undesired reads, which cause the plateaus.

    Therefore I first obtained a BED file with the coordinates of all introns. If you have a known organism with a released reference genome on the UCSC Genome Browser this is fairly easy:

    Navigate to the UCSC Table tool and decide for your appropriate organism and assembly. Then select as group "Gene and Gene Prediction Tracks" and the gene annotation track of your choice (e.g. " Refseq Genes" or "Ensembl Genes"). Choose the region "genome" and the output format "BED - browser extensible data". Name the file and click on the "get output" button. In the radio button form select "Introns plus 0" for the exact intron regions.

    I used this introns.bed file to intersect my aligned reads with. To do so, I applied intersectBed from the bedtools suite and retained only those reads, which didn't overlap more than 30% of the read length with an intron (-v and -f 0.3 options).

    Code:
    bedtools bamtobed -split -i Alignment.bam | bedtools intersect -v -f 0.30 -a stdin  -b refseq_introns.bed > alignment-piped.bed 
    
    bedtools intersect -v -f 0.30 -abam -split -a Alignment.bam  -b refseq_introns.bed > alignment-direct.bed
    Attention: Both of the above commands work, but they don't produce the same output as detailed below!

    The second commands execution time is much faster (because it is not piped) and it retains for my file less, but longer fragments. (BED file entries: command 1: 24797290, command 2: 19152916)

    I used
    Code:
    awk < alignment-piped.bed  '{print $3-$2 '\n'}' >> pipedlength.txt
    to get an idea about the length distribution of the BED entries and summarised that in R out of curiosity:

    > summary(pipedlength)
    X36
    Min. : 1.00
    1st Qu.:36.00
    Median :36.00
    Mean :31.89
    3rd Qu.:36.00
    Max. :36.00
    > summary(directlength)
    X36
    Min. : 36.0
    1st Qu.: 36.0
    Median : 36.0
    Mean : 37.9
    3rd Qu.: 36.0
    Max. :427053.0
    >
    So in case of the second command, some long reads are retained, while they are probably chopped into smaller fragments with command 1. I guess that those long reads cause the plateaus in the wiggle format.

    So you have the choice: Apply the first command (despite the longer processing time) or use the second command and additionally filter those long reads.

    Code:
    awk '$3-$2 <= 36 { print }' alignment-direct.bed  > alignment-direct-filtered.bed
    Obviously it would be interesting to resolve the nature of those long reads. I think - since our samples originate from cancer cells - this could indicate some chromosomal rearrangements. However I didn't proceed further at this point yet.



    I hope I could help you a bit and if you find out about the reason of this differences, I'd be happy to read you reply here.

    Best
    Matthias
    Last edited by Thias; 01-23-2014, 06:50 AM.

    Leave a comment:


  • john789
    replied
    Thias, did you find a solution? I have a similar problem with two data sets I created. I can't see what I did different that would result in there being values for the introns in the ucsc browser for one experiment but not the other.

    thanks
    John

    Leave a comment:


  • Intron spanning reads result in plateaus in wiggle format ?

    Hello everybody,

    I would like to visualise the amount of mapped reads of custom RNAseq samples in the UCSC Genome Browser.

    For this purpose, I can start from already sorted and indexed BAM files, which we obtained from a former collaborator. Since we are all wetlab scientists with limited insight into RNAseq, I unfortunately can’t tell you, which platform and aligner was used.

    To generate the wiggle format tracks, I tried the following code, but despite using the –split option of genomeCoverageBed, I can’t reproduce the tracks as they have been previously created by the collaborator.

    Code:
    genomeCoverageBed –bg -split -ibam  ~/Bioinfo/BAM/RNASeq/unique_mapped/RNASeq_HYPO_L1_unique.bam  -g ./chromInfo-mm9.txt  >   ~/Bioinfo/Bioinformatics/HypoL1.bedgraph
    
    wigToBigWig ./Bioinformatics/HypoL1.bedgraph ./chromInfo-mm9.txt ./Bioinformatics/HypoL1.bigwig
    Namely, there seem to be reads which map across an intron and cause a plateau in the tracks wiggle representation - have a look at the second black track, which is the one I generated.

    Somehow, the former collaborator either excluded those reads or split them to into two pieces – the blue track at least doesn’t show those plateaus. (The track in the middle shows the collaborators bedgraph track after conversion into BigWig - so its not the different format)


    Link to public example session

    Does anybody have a hint for me ?

    Thanks a lot
    Thias

Latest Articles

Collapse

  • seqadmin
    Recent Advances in Sequencing Technologies
    by seqadmin



    Innovations in next-generation sequencing technologies and techniques are driving more precise and comprehensive exploration of complex biological systems. Current advancements include improved accessibility for long-read sequencing and significant progress in single-cell and 3D genomics. This article explores some of the most impactful developments in the field over the past year.

    Long-Read Sequencing
    Long-read sequencing has seen remarkable advancements,...
    12-02-2024, 01:49 PM

ad_right_rmr

Collapse

News

Collapse

Topics Statistics Last Post
Started by seqadmin, Today, 08:24 AM
0 responses
9 views
0 likes
Last Post seqadmin  
Started by seqadmin, Yesterday, 07:41 AM
0 responses
8 views
0 likes
Last Post seqadmin  
Started by seqadmin, 12-11-2024, 07:45 AM
0 responses
13 views
0 likes
Last Post seqadmin  
Started by seqadmin, 12-10-2024, 07:59 AM
0 responses
14 views
0 likes
Last Post seqadmin  
Working...
X