Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to calculate proportion of reads with each base at every reference position

    I have a need to calculate the following, and wonder if anyone else has done something similar or could advise me on how to do it. The question is:

    For each base position on a reference sequence, what is the proportion of mapped reads with an A/G/C/T at this position?
    e.g. for reference position x, 50% of reads show an A/50% of reads show a T.

    At the moment I have Illumina paired end 51bp data and some 76bp data and have mapped the reads to a reference using the maq software. Any advice oon how to do this type of downstream analysis would be much appreciated, thanks.

    Lindsey

  • #2
    Are you familiar with python or perl? This can be be done pretty easily with BioPython/BioPerl.

    Hopefully someone else can chime in with an existing solution.
    --
    Senthil Palanisami

    Comment


    • #3
      Yes, I am learning perl. Does anyone have an existing solution?

      Comment


      • #4
        AFAIK, nothing existing but you could use bioperl to accomplish it without too much work (my guess is a few hours tops).
        --
        Senthil Palanisami

        Comment


        • #5
          I had a quick go since this seemed a fairly simple task. The documentation for mapview output isn't very clear so I may not be handling the sequence the right way when it's a reverse hit, but you can at least use this to get an idea for how this could be done.
          Code:
          #!/usr/bin/perl
          use warnings;
          use strict;
          
          my @mapview_files = @ARGV;
          
          foreach my $file (@mapview_files) {
            process_file ($file);
          }
          
          sub process_file {
            my ($file) = @_;
            open (IN,$file) or die "$file: $!";
          
            # Data Structure is a hash of chromosomes
            # containing arrays of positions where each
            # position is a hash of GATC mapping to a count
            my %chrs;
          
            while (<IN>) {
              chomp;
              # Ignore headers
              next if (/^#/);
          
              # Extract data
              my ($chr,$pos,$seq) = (split(/\t/))[1,2,14];
          
              # Split sequence into bases
              my @seq = split(//,$seq);
          
              # Add each base to the data structure
              for my $offset (0..$#seq) {
                ++$chrs{$chr}->[$pos+$offset]->{uc($seq[$offset])};
              }
            }
          
            # Print a header
            print join("\t",qw(File Chr Pos G A T C)),"\n";
          
            # Go through each chromosome
            foreach my $chr (sort keys %chrs) {
          
              # Go through the positions on that chromosome
              my @positions = @{$chrs{$chr}};
              for my $position (1..$#positions) {
                my @line = ($file,$chr,$position);
          
                # Get the counts for each base
                foreach my $base qw(G A T C) {
          	if (exists $positions[$position]->{$base}) {
          	  push @line, $positions[$position]->{$base};
          	}
          	else {
          	  push @line, 0;
          	}
                }
          
                # Print the result
                print join("\t",@line),"\n";
              }
            }
          }

          Comment


          • #6
            Thank you so very much for taking the time to produce this code. It is an excellent starting point for me and I appreciate the time you have spent.

            I hope soon I will be able to produce perl code this quickly myself!

            Comment


            • #7
              Hmm, perhaps the site should have a wiki for contributed code (I once posted a Perl program as well)

              Comment


              • #8
                Originally posted by krobison View Post
                Hmm, perhaps the site should have a wiki for contributed code (I once posted a Perl program as well)
                Excellent idea, yes please! That would be so useful.

                Comment

                Latest Articles

                Collapse

                • seqadmin
                  Current Approaches to Protein Sequencing
                  by seqadmin


                  Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                  04-04-2024, 04:25 PM
                • seqadmin
                  Strategies for Sequencing Challenging Samples
                  by seqadmin


                  Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                  03-22-2024, 06:39 AM

                ad_right_rmr

                Collapse

                News

                Collapse

                Topics Statistics Last Post
                Started by seqadmin, 04-11-2024, 12:08 PM
                0 responses
                25 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 04-10-2024, 10:19 PM
                0 responses
                28 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 04-10-2024, 09:21 AM
                0 responses
                24 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 04-04-2024, 09:00 AM
                0 responses
                52 views
                0 likes
                Last Post seqadmin  
                Working...
                X