Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • illinu
    Member
    • Jul 2013
    • 55

    Python counting bases fasta file

    New to python but trying to learn, here I wrote a program to count the nucleotides in a fasta file. The approach is that when the line does not contain a ">" symbol then the program counts nucleotides one by one and adds it to the variable sum. The program does not work and I don't know why. Please help!

    #!/usr/bin/env python
    import sys

    # Takes fasta file and counts total bases

    infile = open(sys.argv[0], "rU")
    sum = 0
    line = infile.readlines()
    while ">" not in lline:
    for i in line:
    sum += 1
    print (sum)

    usage: $ python countbases.py infile.fasta
  • Cogitare
    Junior Member
    • Aug 2013
    • 1

    #2
    Perhaps this

    It would be helpful if you gave more information about how the program fails. However, one issue I think is this line:

    infile = open(sys.argv[0], "rU")

    Should be:

    infile = open(sys.argv[1], "rU")

    In this case sys.argv[0] = your script name, whereas sys.argv[1] = infile.fasta. See here for more information:
    Quoting from docs.python.org: "sys.argv The list of command line arguments passed to a Python script. argv[0] is the script name (it is operating system dependent whether this is a full pathname o...

    Comment

    • illinu
      Member
      • Jul 2013
      • 55

      #3
      Originally posted by Cogitare View Post
      It would be helpful if you gave more information about how the program fails. However, one issue I think is this line:

      infile = open(sys.argv[0], "rU")

      Should be:

      infile = open(sys.argv[1], "rU")

      In this case sys.argv[0] = your script name, whereas sys.argv[1] = infile.fasta. See here for more information:
      http://stackoverflow.com/questions/5...-documentation
      Thank you. I also tried with sys.argv[1]
      The program does not return any error, it just does not return anything. If I run it in Idle, it crashes and if I run it in terminal, it jumps to the next line as if it was running but it stays there forever. I am trying with very small test files and if it worked it should go fairly fast instead of staying hanged for a long time.

      Comment

      • dpryan
        Devon Ryan
        • Jul 2011
        • 3478

        #4
        Originally posted by illinu View Post
        New to python but trying to learn, here I wrote a program to count the nucleotides in a fasta file. The approach is that when the line does not contain a ">" symbol then the program counts nucleotides one by one and adds it to the variable sum. The program does not work and I don't know why. Please help!

        #!/usr/bin/env python
        import sys

        # Takes fasta file and counts total bases

        infile = open(sys.argv[0], "rU")
        sum = 0
        line = infile.readlines()
        while ">" not in lline:
        for i in line:
        sum += 1
        print (sum)

        usage: $ python countbases.py infile.fasta
        Aside from the argv[0], issue, you're also looking for things in "lline" rather than "line"...

        Comment

        • illinu
          Member
          • Jul 2013
          • 55

          #5
          Originally posted by dpryan View Post
          Aside from the argv[0], issue, you're also looking for things in "lline" rather than "line"...
          That's a typo here , in the program it's ok. So you're saying that apart from that the program seems ok and it should work?

          Comment

          • dpryan
            Devon Ryan
            • Jul 2011
            • 3478

            #6
            Well, you're also trying to iterate through a list in an odd way that I'm not sure would work. You might take a rather more efficient approach:
            Code:
            #!/usr/bin/env python
            import sys
            
            f = open(sys.argv[1], "rU")
            total=0
            for line in f :
                if(line[0] == ">") :
                    continue
            
                total += len(line)-1
            print("%s has %i nucleotides" % (sys.argv[1], total))
            With readlines(), you end up reading the whole file into a list, which is really memory inefficient if you're dealing with a whole genome.
            Last edited by dpryan; 08-20-2013, 06:49 AM. Reason: typo

            Comment

            • illinu
              Member
              • Jul 2013
              • 55

              #7
              Impressed! it works like a charm

              Thanks a million

              Comment

              • dpryan
                Devon Ryan
                • Jul 2011
                • 3478

                #8
                No problem. In a real version, you'd probably want to use argparse (so you can easily give help and usage information) and check to see that the fasta file exists before running further. You also might want to add the number of bases per chromosome, likely only printed if the user specifies.

                Comment

                • brentp
                  Member
                  • Apr 2010
                  • 72

                  #9
                  import sys
                  print sum(len(l.rstrip()) for l in open(sys.argv[1]) if l[0] != ">")

                  Comment

                  • CHObot
                    Member
                    • May 2013
                    • 11

                    #10
                    Yet another way (and more general) is to use the BioPython module. You will want to get familiar with handling sequences like this anyway, it is much more convenient to have a data structure. It would be this:

                    Code:
                    import sys
                    from Bio import SeqIO
                    
                    for seq_record in SeqIO.parse(sys.argv[1], "fasta"):
                        print seq_record.id
                        print len(seq_record)
                    And that would handle multifasta files (with multiple sequences in them) which you will no doubt encounter eventually.

                    Comment

                    Latest Articles

                    Collapse

                    • SEQadmin2
                      Nine Things a Sample Prep Scientist Thinks About Before Sequencing
                      by SEQadmin2


                      I’m not a sequencing expert. I’m a purification scientist who uses NGS to evaluate workflows my group develops. With this perspective, we think about the sample first and the NGS workflow second. The sequencer is an exceptionally honest reporter, but it can only report on what you give it, so whether you get clean, interpretable data from an NGS workflow is largely determined before you begin.


                      Here are nine questions we think about, in roughly the order they matter, before...
                      06-18-2026, 07:11 AM
                    • SEQadmin2
                      From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
                      by SEQadmin2


                      Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


                      The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
                      ...
                      06-02-2026, 10:05 AM

                    ad_right_rmr

                    Collapse

                    News

                    Collapse

                    Topics Statistics Last Post
                    Started by SEQadmin2, 06-17-2026, 06:09 AM
                    0 responses
                    30 views
                    0 reactions
                    Last Post SEQadmin2  
                    Started by SEQadmin2, 06-09-2026, 11:58 AM
                    0 responses
                    96 views
                    0 reactions
                    Last Post SEQadmin2  
                    Started by SEQadmin2, 06-05-2026, 10:09 AM
                    0 responses
                    116 views
                    0 reactions
                    Last Post SEQadmin2  
                    Started by SEQadmin2, 06-04-2026, 08:59 AM
                    0 responses
                    109 views
                    0 reactions
                    Last Post SEQadmin2  
                    Working...