Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Software to quickly calculate GC content?

    Hello all,

    I am doing some post-sequencing quality control as well as some interspecific comparison and was wondering if anyone out there had a good suggestion on a means of generating metrics like GC-content quickly for both a larger number of short reads and whole genome sequence. I can do it through some commercial software, but it's kind of cumbersome and indirect. I could write something, but given my scripting ability, it'll probably be real slow. Any suggests on a simple program out there I could use to generate a bunch of estimates from these different types of samples?

    Thanks

  • #2
    What kind of file formats are you starting with for the read data (fasta, fastq, seq, qseq, export, BAM, SAM, etc.)?? Also, which sequencing platform (base space or color space?).

    For something this simple, it should not be hard to write a script that will perform just fine without being written by a programming guru... It is also the perfect task on which to practice your scripting... If you post a sample of your data here, you may get some specific tips/code as well as links to general tools for the type of data you are working with.

    Comment


    • #3
      Here's my script for this (I'm a novice programmer too). It prints two columns with the fasta header and the GC content seperated by a tab.

      Code:
      #!/usr/bin/perl -w
      
      use strict;
      
      my $i=1;
      my $infasta=$ARGV[0];
      
      open FH1 , "$infasta" or die "$!";
      
      while(<FH1>){
              if ($i==1){
                      chomp;
                      $_ =~ tr/>//d;
                      print "$_\t";
                      $i++;}
              elsif ($i==2){
                      my $UC = uc($_);
                      my $length = ($UC =~ tr/ATGCN//);
                      my $numsGCs = ($UC =~ tr/GC/gc/);
                      my $percentGC = ($numsGCs/$length)*100;
                      my $rounded = sprintf("%.2f", $percentGC);
                      print "$rounded\n";
                      $i++;}
              elsif ($i>2){
                      $i=1;
                      redo;};
      }

      Comment


      • #4
        You can use Biopieces for this (www.biopieces.org) by using analyze_gc.

        Basically, you read in your data, analyze the GC content, and output the mean value for the GC%:

        Code:
        read_fasta -i test.fna | analyze_gc | mean_vals -k GC% -x
        If you want tabular output with sequence name and GC% per entry:

        Code:
        read_fasta -i test.fna | analyze_gc | write_tab -k SEQ_NAME,GC% -x
        If your data is in FASTQ format, use read_fastq instead of read_fasta.

        Other nifty Biopieces exist to give an idea of your data, check this out:

        Code:
        read_fasta -i test_big.fna -n 10000 |   # read in 10000 entries
        analyze_gc |                            # analyze GC% per entry
        bin_vals -k GC% |                       # bin GC% values
        plot_histogram -k GC%_BIN -s num -x     # plot a histogram
        
                                              Histogram
        
            550 ++-----------------------------------------------------------------++
                |                                                                   |
            500 ++                 *****                                           ++
            450 ++                 *   *                                           ++
                |                  *   *                                            |
            400 ++                 *   *                                           ++
                |         ****     *   *     ****                                   |
            350 ++        *  *     *   *     *  *                                  ++
                |         *  *     *   *     *  *      ****                         |
            300 ++        *  *     *   *     *  *      *  *                        ++
            250 ++        *  *     *   *     *  *      *  *      ****              ++
                |         *  *     *   *     *  *      *  *      *  *               |
            200 ++        *  *     *   *     *  *      *  *      *  *              ++
                |         *  *     *   *     *  *      *  *      *  *               |
            150 ++        *  *     *   *     *  *      *  *      *  *              ++
                |         *  *     *   *     *  *      *  *      *  *               |
            100 ++        *  *     *   *     *  *      *  *      *  *              ++
             50 ++        *  *     *   *     *  *      *  *      *  *              ++
                |         *  *     *   *     *  *      *  *      *  *     *****     |
              0 ++--------****-----*****-----****------****------****-----*****----++
                         20       30        40        50        60       70

        Cheers,


        Martin

        Comment


        • #5
          that is one cool histogram!

          Comment


          • #6
            Originally posted by JueFish View Post
            Hello all,

            I am doing some post-sequencing quality control as well as some interspecific comparison and was wondering if anyone out there had a good suggestion on a means of generating metrics like GC-content quickly for both a larger number of short reads and whole genome sequence. I can do it through some commercial software, but it's kind of cumbersome and indirect. I could write something, but given my scripting ability, it'll probably be real slow. Any suggests on a simple program out there I could use to generate a bunch of estimates from these different types of samples?

            Thanks
            FastQC will calculate an overall %GC and a per-sequence %GC distribution profile as well as many other stats about your run. If you want to parse the information then there's a simple text output which should be easy to read back in.

            Comparing the per-sequence %GC profile for a sequencing run to one known to have come from the correct species is a really easy way to confirm you're working with the correct data.

            Comment


            • #7
              Thanks for all the great suggestions, everyone. I'll be sure to check them out as well. I also have been using some unix commands to do the same thing. I think it should work fine:

              cat assemblyFile | grep -v ">" | awk 'BEGIN{a=0; c=0; g=0; t=0;} {a+=gsub("A",""); c+=gsub("C",""); g+=gsub("G",""); t+=gsub("T","");} END{print a,c,g,t}'

              Comment


              • #8
                Originally posted by shurjo View Post
                Here's my script for this (I'm a novice programmer too). It prints two columns with the fasta header and the GC content seperated by a tab.

                Code:
                #!/usr/bin/perl -w
                
                use strict;
                
                my $i=1;
                my $infasta=$ARGV[0];
                
                open FH1 , "$infasta" or die "$!";
                
                while(<FH1>){
                        if ($i==1){
                                chomp;
                                $_ =~ tr/>//d;
                                print "$_\t";
                                $i++;}
                        elsif ($i==2){
                                my $UC = uc($_);
                                my $length = ($UC =~ tr/ATGCN//);
                                my $numsGCs = ($UC =~ tr/GC/gc/);
                                my $percentGC = ($numsGCs/$length)*100;
                                my $rounded = sprintf("%.2f", $percentGC);
                                print "$rounded\n";
                                $i++;}
                        elsif ($i>2){
                                $i=1;
                                redo;};
                }
                @shurjo very nice attempt, unfortunately the output is unexpected to me, and I know too little about perl to be able to fix it.

                When I try your script, I get:

                Code:
                adrian@ubuntu:~/Documents$ ./gccontent.pl ~/Research/project_Hme/gdr18_RAW_k23-93_contigs500.fasta | head
                NODE_1908_length_1770_cov_0.429338_ID_6992244	53.33
                TATTTGATTAAAAAGCCCTTGCAAAAAGCTACTAAGAAAAAAGTACAGGTGGCTAATTTT	40.00
                CGTCAAGACTTAAAGCAGGGGCAACTGATTGTGATCACGGCTGGCGGCAATGATGTCATG	36.67
                CAATATACCGCTAATATGGAGCAAATGGTTCATGAAATTAGAACTCTAAACAAAAAAGCC	28.33
                ATGAAACGGGCAATCAATAACTGGAATAAAAATACGCAACAGTTGACTCAGGAACATTGG	40.00
                AAAACTCAGGAAACCACTAATCCGCTCTTATATACCAAGGATTATTTTCATCCTAATCGT	30.00
                TGGAGGCCTTGAATATGTCCCGCTCGCAAAAAAAGTCATTTAATCCTTGGAAATATGCCT	31.67
                ATAGCGATATGAAAACTAATTGGAATGAAAGTCAGCCCACAACCAGAGTAGAGCAACAGC	31.67
                TTAAGGAATTTCTGAATAATGATGATATTAAATATCATTTGACAGTAGAAAAAAATCAGG	35.00
                AACCCTACGCCACTGCTAATGGGGGAATTCAGTTAAAAGCCCAACGTTTAAAAGTGGGTA	26.67
                Ideally, the output would be
                Code:
                contig_name gc_content
                and here it looks like it might be calculating GC content on a perl line basis, rather than a per contig basis.

                Thanks for all the great suggestions, everyone. I'll be sure to check them out as well. I also have been using some unix commands to do the same thing. I think it should work fine:

                cat assemblyFile | grep -v ">" | awk 'BEGIN{a=0; c=0; g=0; t=0;} {a+=gsub("A",""); c+=gsub("C",""); g+=gsub("G",""); t+=gsub("T","");} END{print a,c,g,t}'
                @JueFish I always appreciate a simple bash command for my scripting needs

                However, it seems like your script outputs 4 numbers, the base pair count for all 4 basepairs for the entire file, rather than as a percent of GC per contig.

                I am trying to get Biopieces running, takes forever to install

                Comment


                • #9
                  This is a good example of when knowing at least one programming language is very useful:

                  Code:
                  #include <stdio.h>
                  #include <string.h>
                  #include <stdlib.h>
                  
                  int main(int argc, char *argv[]) {
                      char *name = NULL, line[1024];
                      int i;
                      unsigned long long GC, AT;
                      FILE *fp;
                  
                      if(argc != 2 || strcmp(argv[1], "-h") == 0) {
                          printf("Usage: %s file.fa\n", argv[0]);
                          return 1;
                      }
                  
                      fp = fopen(argv[1], "r");
                      while(fgets(line, 1024, fp) != NULL) {
                          if(*line == '>') {
                              if(name != NULL) {
                                  printf("%s\t%f\n", name, ((float) GC)/(GC+AT));
                                  free(name);
                              }
                              GC = AT = 0;
                              name = strdup(line+1);
                              for(i=0;i<strlen(name);i++) { //trim off line endings or descriptors
                                  if(*(name+i) == ' ' || *(name+i) == '\n') {
                                      *(name+i) = '\0';
                                      break;
                                  }
                              }
                          } else {
                              i=0;
                              while(1) {
                                  if(*(line+i) == 'G' || *(line+i) == 'C' || *(line+i) == 'g' || *(line+i) == 'c') GC++;
                                  else AT++;
                                  if(*(line+ (++i)) == '\n') break;
                              }
                          }
                      }
                      if(name != NULL) {
                          printf("%s\t%f\n", name, ((float) GC)/(GC+AT));
                          free(name);
                      }
                      fclose(fp);
                      return 0;
                  }
                  That's a solution in C (perl and python would be shorter!). If you saved it as "gc.c", you could compile it with "gcc -o gc gc.c" and then:
                  Code:
                  gc genome.fa
                  chr1	0.399447
                  chr10	0.403991
                  chr11	0.426990
                  chr12	0.404469
                  chr13	0.402601
                  chr13_random	0.359645
                  chr14	0.399880
                  chr15	0.407239
                  Those are the GC contents for the first few chromosomes of mm9.

                  Comment


                  • #10
                    You could also modify that to ignore any N's in the sequence, which might be a nice idea. To do that, just change the "else AT++;" line to:

                    Code:
                    else if(*(line+i) != 'N' && *(line+i) != 'n') AT++;
                    Similar modifications could be used to deal with IUPAC codes (e.g. S for a G or C), though those are unlikely to exist in a reference genome.
                    Last edited by dpryan; 01-19-2014, 02:12 PM.

                    Comment


                    • #11
                      Hi,

                      Is there any perl script to get GC content in specific slide window?

                      Thanks

                      Comment


                      • #12
                        Probably. If not, don't try to read the whole genome into memory with perl unless you really have to, it'd be very inefficient. It'd be better to just do a system call to samtools faidx.

                        Comment


                        • #13
                          Thanks for quick respond, I need to draw GC content plot of genome.

                          I want a perl script which can give me out put like below.

                          postion GC content
                          1 5000 0.690
                          5000 15000 0.710
                          15000 25000 0.736
                          25000 35000 0.759
                          35000 45000 0.749



                          Thanks

                          Comment

                          Latest Articles

                          Collapse

                          • seqadmin
                            Best Practices for Single-Cell Sequencing Analysis
                            by seqadmin



                            While isolating and preparing single cells for sequencing was historically the bottleneck, recent technological advancements have shifted the challenge to data analysis. This highlights the rapidly evolving nature of single-cell sequencing. The inherent complexity of single-cell analysis has intensified with the surge in data volume and the incorporation of diverse and more complex datasets. This article explores the challenges in analysis, examines common pitfalls, offers...
                            06-06-2024, 07:15 AM
                          • seqadmin
                            Latest Developments in Precision Medicine
                            by seqadmin



                            Technological advances have led to drastic improvements in the field of precision medicine, enabling more personalized approaches to treatment. This article explores four leading groups that are overcoming many of the challenges of genomic profiling and precision medicine through their innovative platforms and technologies.

                            Somatic Genomics
                            “We have such a tremendous amount of genetic diversity that exists within each of us, and not just between us as individuals,”...
                            05-24-2024, 01:16 PM

                          ad_right_rmr

                          Collapse

                          News

                          Collapse

                          Topics Statistics Last Post
                          Started by seqadmin, Yesterday, 07:24 AM
                          0 responses
                          10 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 06-13-2024, 08:58 AM
                          0 responses
                          11 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 06-12-2024, 02:20 PM
                          0 responses
                          16 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 06-07-2024, 06:58 AM
                          0 responses
                          184 views
                          0 likes
                          Last Post seqadmin  
                          Working...
                          X