Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Software to quickly calculate GC content?

    Hello all,

    I am doing some post-sequencing quality control as well as some interspecific comparison and was wondering if anyone out there had a good suggestion on a means of generating metrics like GC-content quickly for both a larger number of short reads and whole genome sequence. I can do it through some commercial software, but it's kind of cumbersome and indirect. I could write something, but given my scripting ability, it'll probably be real slow. Any suggests on a simple program out there I could use to generate a bunch of estimates from these different types of samples?

    Thanks

  • #2
    What kind of file formats are you starting with for the read data (fasta, fastq, seq, qseq, export, BAM, SAM, etc.)?? Also, which sequencing platform (base space or color space?).

    For something this simple, it should not be hard to write a script that will perform just fine without being written by a programming guru... It is also the perfect task on which to practice your scripting... If you post a sample of your data here, you may get some specific tips/code as well as links to general tools for the type of data you are working with.

    Comment


    • #3
      Here's my script for this (I'm a novice programmer too). It prints two columns with the fasta header and the GC content seperated by a tab.

      Code:
      #!/usr/bin/perl -w
      
      use strict;
      
      my $i=1;
      my $infasta=$ARGV[0];
      
      open FH1 , "$infasta" or die "$!";
      
      while(<FH1>){
              if ($i==1){
                      chomp;
                      $_ =~ tr/>//d;
                      print "$_\t";
                      $i++;}
              elsif ($i==2){
                      my $UC = uc($_);
                      my $length = ($UC =~ tr/ATGCN//);
                      my $numsGCs = ($UC =~ tr/GC/gc/);
                      my $percentGC = ($numsGCs/$length)*100;
                      my $rounded = sprintf("%.2f", $percentGC);
                      print "$rounded\n";
                      $i++;}
              elsif ($i>2){
                      $i=1;
                      redo;};
      }

      Comment


      • #4
        You can use Biopieces for this (www.biopieces.org) by using analyze_gc.

        Basically, you read in your data, analyze the GC content, and output the mean value for the GC%:

        Code:
        read_fasta -i test.fna | analyze_gc | mean_vals -k GC% -x
        If you want tabular output with sequence name and GC% per entry:

        Code:
        read_fasta -i test.fna | analyze_gc | write_tab -k SEQ_NAME,GC% -x
        If your data is in FASTQ format, use read_fastq instead of read_fasta.

        Other nifty Biopieces exist to give an idea of your data, check this out:

        Code:
        read_fasta -i test_big.fna -n 10000 |   # read in 10000 entries
        analyze_gc |                            # analyze GC% per entry
        bin_vals -k GC% |                       # bin GC% values
        plot_histogram -k GC%_BIN -s num -x     # plot a histogram
        
                                              Histogram
        
            550 ++-----------------------------------------------------------------++
                |                                                                   |
            500 ++                 *****                                           ++
            450 ++                 *   *                                           ++
                |                  *   *                                            |
            400 ++                 *   *                                           ++
                |         ****     *   *     ****                                   |
            350 ++        *  *     *   *     *  *                                  ++
                |         *  *     *   *     *  *      ****                         |
            300 ++        *  *     *   *     *  *      *  *                        ++
            250 ++        *  *     *   *     *  *      *  *      ****              ++
                |         *  *     *   *     *  *      *  *      *  *               |
            200 ++        *  *     *   *     *  *      *  *      *  *              ++
                |         *  *     *   *     *  *      *  *      *  *               |
            150 ++        *  *     *   *     *  *      *  *      *  *              ++
                |         *  *     *   *     *  *      *  *      *  *               |
            100 ++        *  *     *   *     *  *      *  *      *  *              ++
             50 ++        *  *     *   *     *  *      *  *      *  *              ++
                |         *  *     *   *     *  *      *  *      *  *     *****     |
              0 ++--------****-----*****-----****------****------****-----*****----++
                         20       30        40        50        60       70

        Cheers,


        Martin

        Comment


        • #5
          that is one cool histogram!

          Comment


          • #6
            Originally posted by JueFish View Post
            Hello all,

            I am doing some post-sequencing quality control as well as some interspecific comparison and was wondering if anyone out there had a good suggestion on a means of generating metrics like GC-content quickly for both a larger number of short reads and whole genome sequence. I can do it through some commercial software, but it's kind of cumbersome and indirect. I could write something, but given my scripting ability, it'll probably be real slow. Any suggests on a simple program out there I could use to generate a bunch of estimates from these different types of samples?

            Thanks
            FastQC will calculate an overall %GC and a per-sequence %GC distribution profile as well as many other stats about your run. If you want to parse the information then there's a simple text output which should be easy to read back in.

            Comparing the per-sequence %GC profile for a sequencing run to one known to have come from the correct species is a really easy way to confirm you're working with the correct data.

            Comment


            • #7
              Thanks for all the great suggestions, everyone. I'll be sure to check them out as well. I also have been using some unix commands to do the same thing. I think it should work fine:

              cat assemblyFile | grep -v ">" | awk 'BEGIN{a=0; c=0; g=0; t=0;} {a+=gsub("A",""); c+=gsub("C",""); g+=gsub("G",""); t+=gsub("T","");} END{print a,c,g,t}'

              Comment


              • #8
                Originally posted by shurjo View Post
                Here's my script for this (I'm a novice programmer too). It prints two columns with the fasta header and the GC content seperated by a tab.

                Code:
                #!/usr/bin/perl -w
                
                use strict;
                
                my $i=1;
                my $infasta=$ARGV[0];
                
                open FH1 , "$infasta" or die "$!";
                
                while(<FH1>){
                        if ($i==1){
                                chomp;
                                $_ =~ tr/>//d;
                                print "$_\t";
                                $i++;}
                        elsif ($i==2){
                                my $UC = uc($_);
                                my $length = ($UC =~ tr/ATGCN//);
                                my $numsGCs = ($UC =~ tr/GC/gc/);
                                my $percentGC = ($numsGCs/$length)*100;
                                my $rounded = sprintf("%.2f", $percentGC);
                                print "$rounded\n";
                                $i++;}
                        elsif ($i>2){
                                $i=1;
                                redo;};
                }
                @shurjo very nice attempt, unfortunately the output is unexpected to me, and I know too little about perl to be able to fix it.

                When I try your script, I get:

                Code:
                adrian@ubuntu:~/Documents$ ./gccontent.pl ~/Research/project_Hme/gdr18_RAW_k23-93_contigs500.fasta | head
                NODE_1908_length_1770_cov_0.429338_ID_6992244	53.33
                TATTTGATTAAAAAGCCCTTGCAAAAAGCTACTAAGAAAAAAGTACAGGTGGCTAATTTT	40.00
                CGTCAAGACTTAAAGCAGGGGCAACTGATTGTGATCACGGCTGGCGGCAATGATGTCATG	36.67
                CAATATACCGCTAATATGGAGCAAATGGTTCATGAAATTAGAACTCTAAACAAAAAAGCC	28.33
                ATGAAACGGGCAATCAATAACTGGAATAAAAATACGCAACAGTTGACTCAGGAACATTGG	40.00
                AAAACTCAGGAAACCACTAATCCGCTCTTATATACCAAGGATTATTTTCATCCTAATCGT	30.00
                TGGAGGCCTTGAATATGTCCCGCTCGCAAAAAAAGTCATTTAATCCTTGGAAATATGCCT	31.67
                ATAGCGATATGAAAACTAATTGGAATGAAAGTCAGCCCACAACCAGAGTAGAGCAACAGC	31.67
                TTAAGGAATTTCTGAATAATGATGATATTAAATATCATTTGACAGTAGAAAAAAATCAGG	35.00
                AACCCTACGCCACTGCTAATGGGGGAATTCAGTTAAAAGCCCAACGTTTAAAAGTGGGTA	26.67
                Ideally, the output would be
                Code:
                contig_name gc_content
                and here it looks like it might be calculating GC content on a perl line basis, rather than a per contig basis.

                Thanks for all the great suggestions, everyone. I'll be sure to check them out as well. I also have been using some unix commands to do the same thing. I think it should work fine:

                cat assemblyFile | grep -v ">" | awk 'BEGIN{a=0; c=0; g=0; t=0;} {a+=gsub("A",""); c+=gsub("C",""); g+=gsub("G",""); t+=gsub("T","");} END{print a,c,g,t}'
                @JueFish I always appreciate a simple bash command for my scripting needs

                However, it seems like your script outputs 4 numbers, the base pair count for all 4 basepairs for the entire file, rather than as a percent of GC per contig.

                I am trying to get Biopieces running, takes forever to install

                Comment


                • #9
                  This is a good example of when knowing at least one programming language is very useful:

                  Code:
                  #include <stdio.h>
                  #include <string.h>
                  #include <stdlib.h>
                  
                  int main(int argc, char *argv[]) {
                      char *name = NULL, line[1024];
                      int i;
                      unsigned long long GC, AT;
                      FILE *fp;
                  
                      if(argc != 2 || strcmp(argv[1], "-h") == 0) {
                          printf("Usage: %s file.fa\n", argv[0]);
                          return 1;
                      }
                  
                      fp = fopen(argv[1], "r");
                      while(fgets(line, 1024, fp) != NULL) {
                          if(*line == '>') {
                              if(name != NULL) {
                                  printf("%s\t%f\n", name, ((float) GC)/(GC+AT));
                                  free(name);
                              }
                              GC = AT = 0;
                              name = strdup(line+1);
                              for(i=0;i<strlen(name);i++) { //trim off line endings or descriptors
                                  if(*(name+i) == ' ' || *(name+i) == '\n') {
                                      *(name+i) = '\0';
                                      break;
                                  }
                              }
                          } else {
                              i=0;
                              while(1) {
                                  if(*(line+i) == 'G' || *(line+i) == 'C' || *(line+i) == 'g' || *(line+i) == 'c') GC++;
                                  else AT++;
                                  if(*(line+ (++i)) == '\n') break;
                              }
                          }
                      }
                      if(name != NULL) {
                          printf("%s\t%f\n", name, ((float) GC)/(GC+AT));
                          free(name);
                      }
                      fclose(fp);
                      return 0;
                  }
                  That's a solution in C (perl and python would be shorter!). If you saved it as "gc.c", you could compile it with "gcc -o gc gc.c" and then:
                  Code:
                  gc genome.fa
                  chr1	0.399447
                  chr10	0.403991
                  chr11	0.426990
                  chr12	0.404469
                  chr13	0.402601
                  chr13_random	0.359645
                  chr14	0.399880
                  chr15	0.407239
                  Those are the GC contents for the first few chromosomes of mm9.

                  Comment


                  • #10
                    You could also modify that to ignore any N's in the sequence, which might be a nice idea. To do that, just change the "else AT++;" line to:

                    Code:
                    else if(*(line+i) != 'N' && *(line+i) != 'n') AT++;
                    Similar modifications could be used to deal with IUPAC codes (e.g. S for a G or C), though those are unlikely to exist in a reference genome.
                    Last edited by dpryan; 01-19-2014, 02:12 PM.

                    Comment


                    • #11
                      Hi,

                      Is there any perl script to get GC content in specific slide window?

                      Thanks

                      Comment


                      • #12
                        Probably. If not, don't try to read the whole genome into memory with perl unless you really have to, it'd be very inefficient. It'd be better to just do a system call to samtools faidx.

                        Comment


                        • #13
                          Thanks for quick respond, I need to draw GC content plot of genome.

                          I want a perl script which can give me out put like below.

                          postion GC content
                          1 5000 0.690
                          5000 15000 0.710
                          15000 25000 0.736
                          25000 35000 0.759
                          35000 45000 0.749



                          Thanks

                          Comment

                          Latest Articles

                          Collapse

                          • seqadmin
                            Current Approaches to Protein Sequencing
                            by seqadmin


                            Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                            04-04-2024, 04:25 PM
                          • seqadmin
                            Strategies for Sequencing Challenging Samples
                            by seqadmin


                            Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                            03-22-2024, 06:39 AM

                          ad_right_rmr

                          Collapse

                          News

                          Collapse

                          Topics Statistics Last Post
                          Started by seqadmin, 04-11-2024, 12:08 PM
                          0 responses
                          31 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 04-10-2024, 10:19 PM
                          0 responses
                          33 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 04-10-2024, 09:21 AM
                          0 responses
                          28 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 04-04-2024, 09:00 AM
                          0 responses
                          53 views
                          0 likes
                          Last Post seqadmin  
                          Working...
                          X