Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • < Script to compute distribution length of sequences >

    Hi all,

    i have a question for you:

    Is there any tool to identify in a fasta file the read lenght distribution ?? I mean for exemple talking about miRNAs: " How many reads have 18 nucleotides, how many 19 nt, 20 nt....until 28 nt !!
    Can you suggest me something also a little script please ???
    At the end i need to do a graph showing this information.

    Thank you very much

  • #2
    Originally posted by Giorgio C View Post
    Hi all,

    i have a question for you:

    Is there any tool to identify in a fasta file the read lenght distribution ?? I mean for exemple talking about miRNAs: " How many reads have 18 nucleotides, how many 19 nt, 20 nt....until 28 nt !!
    Can you suggest me something also a little script please ???
    At the end i need to do a graph showing this information.

    Thank you very much
    In perl you can use the Statistics:escriptive::Full module. Here is an example using the range you specified. It will read from a fasta file (named on the command line), fill an object with length data for each read and then print out a summary of that data, including a table of read length distribution suitable for plotting a histogram.

    Code:
    #!/usr/bin/perl
    
    use warnings;
    use strict;
    use Statistics::Descriptive;
    
    my $stat = Statistics::Descriptive::Full->new();
    my (%distrib);
    my @bins = qw/18 19 20 21 22 23 24 25 26 27 28/;
    
    my $fastaFile = shift;
    open (FASTA, "<$fastaFile");
    $/ = ">";
    
    my $junkFirstOne = <FASTA>;
    
    while (<FASTA>) {
    	chomp;
    	my ($def,@seqlines) = split /\n/, $_;
    	my $seq = join '', @seqlines;
    	$stat->add_data(length($seq));
    }
    
    
    %distrib = $stat->frequency_distribution(\@bins);
    
    print "Total reads:\t" . $stat->count() . "\n";
    print "Total nt:\t" . $stat->sum() . "\n";
    print "Mean length:\t" . $stat->mean() . "\n";
    print "Median length:\t" . $stat->median() . "\n";
    print "Mode length:\t" . $stat->mode() . "\n";
    print "Max length:\t" . $stat->max() . "\n";
    print "Min length:\t" . $stat->min() . "\n";
    print "Length\t# Seqs\n";
    foreach (sort {$a <=> $b} keys %distrib) {
    	print "$_\t$distrib{$_}\n";
    }
    You can adjust the range and spacing of the distribution plot by adjusting the @bins array. See the documentation for the module.

    Comment


    • #3
      Amazing !!! Heartfelt thanks !!!

      Comment


      • #4
        Do you know how to modify the @bins array to know the total number of reads > of 28 nt and the tolat number of reads < of 18 nt....Is there possibile in one time ???

        Thanks

        Comment


        • #5
          From the module documentation for the frequency_distribution method:
          The number of entries for a particular partition key are the number of items which are greater than the previous partition key and less then or equal to the current partition key.
          This means that the bin for the lowest value (18 in this case) will count all occurrences which are ≤ 18 (so the program as written really won't return what you asked for in this case). Also for this case any sequences > 28nt in length will not be counted. Since you know that the value for a sequence length must be an integer you can add 17 as the first element of the list; this will count all reads where 0 < read length ≤ 17 (i.e. less than 18). To count any reads > 28nt you will need the last element in your @bin array to be greater than or equal to the longest possible read length in your data set; this could be 100, 1,000 or 1,000,000, it doesn't matter so long as the interval is large enough to capture any possible read length > 28nt in your data set.

          Comment


          • #6
            Infact it wasn't clear to me but now works fine for my purpose; Thank you very much again you have been very helpful !

            Comment


            • #7
              other bioperl script

              Code:
              #!/usr/bin/perl
              
              #
              # distribution length of a fasta file
              #
              
              use warnings;
              use strict;
              use Bio::SeqIO;
              
              my $i = $ARGV[0];
              my $o = $ARGV[1];
              
              open (OUT, '>',$o);
              
              my $seq_in  = Bio::SeqIO->new( -format => 'fasta',-file => $i);
              
              my %length = ();
              while( my $seq = $seq_in->next_seq() ) {
              	my $seq1 = $seq->seq;	
              	chomp($seq1);
              	#print($seq1);
              	my $l = length($seq1);
              	#print $l."\n";
              	my $tmp =  $length {$l};
              	$length {$l} = $tmp+1;  
              }
              my @x = keys(%length);
              my $sum = 0;
              foreach my $k (@x) {
                 $sum = $sum + $length{$k};
              }
              
              @x = sort{$a <=> $b}@x;
              print(@x);
              foreach my $k (@x) {
                 print OUT $k."\t".$length{$k}."\t".$length{$k}*100/$sum."\n";
              }
              
              close (OUT);

              Comment


              • #8
                Thanks you, i'm trying it !!!

                Comment


                • #9
                  problem to setup @bins

                  --hi

                  how to adjust @bins to print the report of all the length distribution ?

                  thank you --

                  Laurent --

                  Comment

                  Latest Articles

                  Collapse

                  • seqadmin
                    Exploring the Dynamics of the Tumor Microenvironment
                    by seqadmin




                    The complexity of cancer is clearly demonstrated in the diverse ecosystem of the tumor microenvironment (TME). The TME is made up of numerous cell types and its development begins with the changes that happen during oncogenesis. “Genomic mutations, copy number changes, epigenetic alterations, and alternative gene expression occur to varying degrees within the affected tumor cells,” explained Andrea O’Hara, Ph.D., Strategic Technical Specialist at Azenta. “As...
                    07-08-2024, 03:19 PM
                  • seqadmin
                    Exploring Human Diversity Through Large-Scale Omics
                    by seqadmin


                    In 2003, researchers from the Human Genome Project (HGP) announced the most comprehensive genome to date1. Although the genome wasn’t fully completed until nearly 20 years later2, numerous large-scale projects, such as the International HapMap Project and 1000 Genomes Project, continued the HGP's work, capturing extensive variation and genomic diversity within humans. Recently, newer initiatives have significantly increased in scale and expanded beyond genomics, offering a more detailed...
                    06-25-2024, 06:43 AM

                  ad_right_rmr

                  Collapse

                  News

                  Collapse

                  Topics Statistics Last Post
                  Started by seqadmin, 07-10-2024, 07:30 AM
                  0 responses
                  23 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 07-03-2024, 09:45 AM
                  0 responses
                  200 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 07-03-2024, 08:54 AM
                  0 responses
                  209 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 07-02-2024, 03:00 PM
                  0 responses
                  192 views
                  0 likes
                  Last Post seqadmin  
                  Working...
                  X