Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Plot of NGS paired end distance at each genome position

    Hi,

    could anybody help in the following: standard Illumina paired end 300 bp NGS library is mapped on human genome. I would like FOR EACH GENOMIC POSITION to plot the average size of fragments that map across that position. The reason is, we see a huge decrease of fragments from the regular 300 down to ~100 in difficult GC rich areas.

    Thank you,
    Daniel

  • #2
    I don't know of any premade packages to do that. However, you could use pysam or Rsamtools to iterate through alignments crossing each position and average the TLEN values. I don't recall if the underlying bam_fetch function will return both mates in a pair, but it so make sure to include only one (based on sign of TLEN), and if not you'll have to handle things accordingly.

    You could also do this in C or java if you prefer, though it seems most people tend to prefer python or R.

    Comment


    • #3
      To save a bit of coding time trying to get around the intricacies of pysam, I would use samtools view and awk to generate the numbers:
      Code:
      samtools view input.bam | \
        awk -v 'OFS=,' 'BEGIN{print "RName,RPos,TLen"} {if($9>0){print $3,$4,$9}}' | \
        gzip > template_lengths.csv.gz
      Then R to generate the graph:
      Code:
      data.df <- read.csv("template_lengths.csv.gz");
      pdf("MyTemplateLengthPlots.pdf",width=20,height=8);
      for(rName in levels(data.df$RName)){
        sub.df <- subset(data.df, RName == rName);
        plot(sub.df$Rpos, sub.df$TLen, main = rName);
      }
      graphics.off()
      Last edited by gringer; 04-02-2014, 02:18 AM.

      Comment


      • #4
        You want to plot 3.2 billion data points? That is going to be a problem. You should look into some other way of representing that data.

        If you do want to plot with R then I reccomend the ggbio package from bioconductor.
        Last edited by Jeremy; 04-02-2014, 07:44 PM.

        Comment


        • #5
          ALE (Assembly Likelihood Estimator) may be useful here. It can detect areas where the apparent insert size differs from usual, and generates a histogram that you can view, as well as probabilities that the reference represents the reads. It accepts an assembly (fasta) file and set of constituent mapped reads (sam), to generate output.

          Comment


          • #6
            Additional information

            Thank you for the suggestions. Let me give you some more detail, maybe somebody encountered similar problem. We are trying to fill in some gaps in the chicken genome sequence. Those gaps do contain genes but are extremely GC rich. EVEN WITH ~1000x COVERAGE (!) OF THE CHICKEN GENOME we cannot extend our sequence into the gaps by using the paired reads. What happens that as we get closer to the boundary of the difficult sequence, the paired distances diminish and at the border the paired reads basically overlap each other, so there is not chance to walk into the gap. Just by inspecting few genomic regions, I can see this happening repeatedly. So I wanted to visualize this better by plotting the paired distances (not on the entire genome, just on selected genomic regions, on the order of several kbp).

            Comment


            • #7
              You may simply need longer "reads". For example, PacBio reads, or MiSeq 300bpx2 with an average insert of around 500bp merged with bbmerge. Since you already have 300bp reads, it may be useful to merge them into consensi before using them, assuming your insert size is appropriate. Of course, reads with low-complexity sequence will not merge very well. Fortunately... low-complexity sequence (like ACACACAC...ACAC) is not very important for most purposes.

              Have you tried ALE? If it doesn't work I can write something that will plot the pairing distance as a function of position, but ALE should already do this extremely well.

              Comment

              Latest Articles

              Collapse

              • seqadmin
                Recent Developments in Metagenomics
                by seqadmin





                Metagenomics has improved the way researchers study microorganisms across diverse environments. Historically, studying microorganisms relied on culturing them in the lab, a method that limits the investigation of many species since most are unculturable1. Metagenomics overcomes these issues by allowing the study of microorganisms regardless of their ability to be cultured or the environments they inhabit. Over time, the field has evolved, especially with the advent...
                09-23-2024, 06:35 AM
              • seqadmin
                Understanding Genetic Influence on Infectious Disease
                by seqadmin




                During the COVID-19 pandemic, scientists observed that while some individuals experienced severe illness when infected with SARS-CoV-2, others were barely affected. These disparities left researchers and clinicians wondering what causes the wide variations in response to viral infections and what role genetics plays.

                Jean-Laurent Casanova, M.D., Ph.D., Professor at Rockefeller University, is a leading expert in this crossover between genetics and infectious...
                09-09-2024, 10:59 AM

              ad_right_rmr

              Collapse

              News

              Collapse

              Topics Statistics Last Post
              Started by seqadmin, 10-02-2024, 04:51 AM
              0 responses
              11 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 10-01-2024, 07:10 AM
              0 responses
              19 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 09-30-2024, 08:33 AM
              0 responses
              23 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 09-26-2024, 12:57 PM
              0 responses
              17 views
              0 likes
              Last Post seqadmin  
              Working...
              X