Seqanswers Leaderboard Ad

**malachig** · 09-28-2010, 11:59 AM

What kind of file formats are you starting with for the read data (fasta, fastq, seq, qseq, export, BAM, SAM, etc.)?? Also, which sequencing platform (base space or color space?).

For something this simple, it should not be hard to write a script that will perform just fine without being written by a programming guru... It is also the perfect task on which to practice your scripting... If you post a sample of your data here, you may get some specific tips/code as well as links to general tools for the type of data you are working with.

**shurjo** · 09-28-2010, 02:05 PM

Here's my script for this (I'm a novice programmer too). It prints two columns with the fasta header and the GC content seperated by a tab.

Code:

#!/usr/bin/perl -w

use strict;

my $i=1;
my $infasta=$ARGV[0];

open FH1 , "$infasta" or die "$!";

while(<FH1>){
        if ($i==1){
                chomp;
                $_ =~ tr/>//d;
                print "$_\t";
                $i++;}
        elsif ($i==2){
                my $UC = uc($_);
                my $length = ($UC =~ tr/ATGCN//);
                my $numsGCs = ($UC =~ tr/GC/gc/);
                my $percentGC = ($numsGCs/$length)*100;
                my $rounded = sprintf("%.2f", $percentGC);
                print "$rounded\n";
                $i++;}
        elsif ($i>2){
                $i=1;
                redo;};
}

**maasha** · 09-28-2010, 11:06 PM

You can use Biopieces for this (www.biopieces.org) by using analyze_gc.

Basically, you read in your data, analyze the GC content, and output the mean value for the GC%:

Code:

read_fasta -i test.fna | analyze_gc | mean_vals -k GC% -x

If you want tabular output with sequence name and GC% per entry:

Code:

read_fasta -i test.fna | analyze_gc | write_tab -k SEQ_NAME,GC% -x

If your data is in FASTQ format, use read_fastq instead of read_fasta.

Other nifty Biopieces exist to give an idea of your data, check this out:

Code:

read_fasta -i test_big.fna -n 10000 |   # read in 10000 entries
analyze_gc |                            # analyze GC% per entry
bin_vals -k GC% |                       # bin GC% values
plot_histogram -k GC%_BIN -s num -x     # plot a histogram

                                      Histogram

    550 ++-----------------------------------------------------------------++
        |                                                                   |
    500 ++                 *****                                           ++
    450 ++                 *   *                                           ++
        |                  *   *                                            |
    400 ++                 *   *                                           ++
        |         ****     *   *     ****                                   |
    350 ++        *  *     *   *     *  *                                  ++
        |         *  *     *   *     *  *      ****                         |
    300 ++        *  *     *   *     *  *      *  *                        ++
    250 ++        *  *     *   *     *  *      *  *      ****              ++
        |         *  *     *   *     *  *      *  *      *  *               |
    200 ++        *  *     *   *     *  *      *  *      *  *              ++
        |         *  *     *   *     *  *      *  *      *  *               |
    150 ++        *  *     *   *     *  *      *  *      *  *              ++
        |         *  *     *   *     *  *      *  *      *  *               |
    100 ++        *  *     *   *     *  *      *  *      *  *              ++
     50 ++        *  *     *   *     *  *      *  *      *  *              ++
        |         *  *     *   *     *  *      *  *      *  *     *****     |
      0 ++--------****-----*****-----****------****------****-----*****----++
                 20       30        40        50        60       70

Cheers,

Martin

**Bruins** · 09-28-2010, 11:59 PM

that is one cool histogram!

**simonandrews** · 09-29-2010, 12:09 AM

Originally posted by JueFish View Post

Hello all,

I am doing some post-sequencing quality control as well as some interspecific comparison and was wondering if anyone out there had a good suggestion on a means of generating metrics like GC-content quickly for both a larger number of short reads and whole genome sequence. I can do it through some commercial software, but it's kind of cumbersome and indirect. I could write something, but given my scripting ability, it'll probably be real slow. Any suggests on a simple program out there I could use to generate a bunch of estimates from these different types of samples?

Thanks

FastQC will calculate an overall %GC and a per-sequence %GC distribution profile as well as many other stats about your run. If you want to parse the information then there's a simple text output which should be easy to read back in.

Comparing the per-sequence %GC profile for a sequencing run to one known to have come from the correct species is a really easy way to confirm you're working with the correct data.

**JueFish** · 10-19-2010, 08:05 AM

Thanks for all the great suggestions, everyone. I'll be sure to check them out as well. I also have been using some unix commands to do the same thing. I think it should work fine:

cat assemblyFile | grep -v ">" | awk 'BEGIN{a=0; c=0; g=0; t=0;} {a+=gsub("A",""); c+=gsub("C",""); g+=gsub("G",""); t+=gsub("T","");} END{print a,c,g,t}'

**AdrianP** · 01-19-2014, 11:09 AM

Originally posted by shurjo View Post

Here's my script for this (I'm a novice programmer too). It prints two columns with the fasta header and the GC content seperated by a tab.

Code:

#!/usr/bin/perl -w

use strict;

my $i=1;
my $infasta=$ARGV[0];

open FH1 , "$infasta" or die "$!";

while(<FH1>){
        if ($i==1){
                chomp;
                $_ =~ tr/>//d;
                print "$_\t";
                $i++;}
        elsif ($i==2){
                my $UC = uc($_);
                my $length = ($UC =~ tr/ATGCN//);
                my $numsGCs = ($UC =~ tr/GC/gc/);
                my $percentGC = ($numsGCs/$length)*100;
                my $rounded = sprintf("%.2f", $percentGC);
                print "$rounded\n";
                $i++;}
        elsif ($i>2){
                $i=1;
                redo;};
}

@shurjo very nice attempt, unfortunately the output is unexpected to me, and I know too little about perl to be able to fix it.

When I try your script, I get:

Code:

adrian@ubuntu:~/Documents$ ./gccontent.pl ~/Research/project_Hme/gdr18_RAW_k23-93_contigs500.fasta | head
NODE_1908_length_1770_cov_0.429338_ID_6992244	53.33
TATTTGATTAAAAAGCCCTTGCAAAAAGCTACTAAGAAAAAAGTACAGGTGGCTAATTTT	40.00
CGTCAAGACTTAAAGCAGGGGCAACTGATTGTGATCACGGCTGGCGGCAATGATGTCATG	36.67
CAATATACCGCTAATATGGAGCAAATGGTTCATGAAATTAGAACTCTAAACAAAAAAGCC	28.33
ATGAAACGGGCAATCAATAACTGGAATAAAAATACGCAACAGTTGACTCAGGAACATTGG	40.00
AAAACTCAGGAAACCACTAATCCGCTCTTATATACCAAGGATTATTTTCATCCTAATCGT	30.00
TGGAGGCCTTGAATATGTCCCGCTCGCAAAAAAAGTCATTTAATCCTTGGAAATATGCCT	31.67
ATAGCGATATGAAAACTAATTGGAATGAAAGTCAGCCCACAACCAGAGTAGAGCAACAGC	31.67
TTAAGGAATTTCTGAATAATGATGATATTAAATATCATTTGACAGTAGAAAAAAATCAGG	35.00
AACCCTACGCCACTGCTAATGGGGGAATTCAGTTAAAAGCCCAACGTTTAAAAGTGGGTA	26.67

Ideally, the output would be

Code:

contig_name gc_content

and here it looks like it might be calculating GC content on a perl line basis, rather than a per contig basis.

Thanks for all the great suggestions, everyone. I'll be sure to check them out as well. I also have been using some unix commands to do the same thing. I think it should work fine:

cat assemblyFile | grep -v ">" | awk 'BEGIN{a=0; c=0; g=0; t=0;} {a+=gsub("A",""); c+=gsub("C",""); g+=gsub("G",""); t+=gsub("T","");} END{print a,c,g,t}'

@JueFish I always appreciate a simple bash command for my scripting needs

However, it seems like your script outputs 4 numbers, the base pair count for all 4 basepairs for the entire file, rather than as a percent of GC per contig.

I am trying to get Biopieces running, takes forever to install

**dpryan** · 01-19-2014, 02:08 PM

This is a good example of when knowing at least one programming language is very useful:

Code:

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    char *name = NULL, line[1024];
    int i;
    unsigned long long GC, AT;
    FILE *fp;

    if(argc != 2 || strcmp(argv[1], "-h") == 0) {
        printf("Usage: %s file.fa\n", argv[0]);
        return 1;
    }

    fp = fopen(argv[1], "r");
    while(fgets(line, 1024, fp) != NULL) {
        if(*line == '>') {
            if(name != NULL) {
                printf("%s\t%f\n", name, ((float) GC)/(GC+AT));
                free(name);
            }
            GC = AT = 0;
            name = strdup(line+1);
            for(i=0;i<strlen(name);i++) { //trim off line endings or descriptors
                if(*(name+i) == ' ' || *(name+i) == '\n') {
                    *(name+i) = '\0';
                    break;
                }
            }
        } else {
            i=0;
            while(1) {
                if(*(line+i) == 'G' || *(line+i) == 'C' || *(line+i) == 'g' || *(line+i) == 'c') GC++;
                else AT++;
                if(*(line+ (++i)) == '\n') break;
            }
        }
    }
    if(name != NULL) {
        printf("%s\t%f\n", name, ((float) GC)/(GC+AT));
        free(name);
    }
    fclose(fp);
    return 0;
}

That's a solution in C (perl and python would be shorter!). If you saved it as "gc.c", you could compile it with "gcc -o gc gc.c" and then:

Code:

gc genome.fa
chr1	0.399447
chr10	0.403991
chr11	0.426990
chr12	0.404469
chr13	0.402601
chr13_random	0.359645
chr14	0.399880
chr15	0.407239

Those are the GC contents for the first few chromosomes of mm9.

**dpryan** · 01-19-2014, 02:09 PM

You could also modify that to ignore any N's in the sequence, which might be a nice idea. To do that, just change the "else AT++;" line to:

Code:

else if(*(line+i) != 'N' && *(line+i) != 'n') AT++;

Similar modifications could be used to deal with IUPAC codes (e.g. S for a G or C), though those are unlikely to exist in a reference genome.

**Bioinform** · 09-25-2014, 08:51 AM

Hi,

Is there any perl script to get GC content in specific slide window?

Thanks

**dpryan** · 09-25-2014, 09:07 AM

Probably. If not, don't try to read the whole genome into memory with perl unless you really have to, it'd be very inefficient. It'd be better to just do a system call to samtools faidx.

**Bioinform** · 09-25-2014, 10:30 AM

Thanks for quick respond, I need to draw GC content plot of genome.

I want a perl script which can give me out put like below.

postion GC content
1 5000 0.690
5000 15000 0.710
15000 25000 0.736
25000 35000 0.759
35000 45000 0.749

Thanks

**Zaag** · 09-26-2014, 05:13 AM

Just a moment...

https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_coverage_GCContentByInterval.php

Topics	Statistics	Last Post
Gene Misexpression in the Healthy Human Population by seqadmin Started by seqadmin, Yesterday, 06:46 AM	0 responses 9 views 0 likes	Last Post by seqadmin Yesterday, 06:46 AM
New Method for Rapid Genetic Diagnosis of Mendelian Disorders by seqadmin Started by seqadmin, 07-24-2024, 11:09 AM	0 responses 24 views 0 likes	Last Post by seqadmin 07-24-2024, 11:09 AM
Advancing Nanopore Technology for Portable Sensing Devices by seqadmin Started by seqadmin, 07-19-2024, 07:20 AM	0 responses 159 views 0 likes	Last Post by seqadmin 07-19-2024, 07:20 AM
New RNA-Based Gene Writing Technology Achieves Precise Gene Integration by seqadmin Started by seqadmin, 07-16-2024, 05:49 AM	0 responses 127 views 0 likes	Last Post by seqadmin 07-16-2024, 05:49 AM

Seqanswers Leaderboard Ad

Announcement

Software to quickly calculate GC content?

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News