Seqanswers Leaderboard Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • AdrianP
    Senior Member
    • Apr 2011
    • 130

    Help with script, sort by coverage

    Hello everyone,

    I am not good with with writing optimal scripts, so I would appreciate someones help, either perl or python is fine.

    I have contigs generated by assemblers in the following format:
    NODE_2361_length_509_cov_1.43018_ID_745236

    This is common to velvet and SPAdes. I was wondering how could I extract all contig names that in their name have a coverage above a user specified value. So I could perl selectAboveCoverage.pl 1.0 and get all above 1.0 including 1.0?

    I know I have to first split by >, and then the index 0 will be contig name and index 1 will be the sequence. I need to further split index 0 by _, and index 5 will have the coverage value. I can then somehow sort by that and pick all of those above argv[0], and write them to a new file argv[1].

    I know this from experience in TCL, but scripts on those languages take too long to execute.

    Any help will be appreciated, thank you!
  • rhinoceros
    Senior Member
    • Apr 2013
    • 372

    #2
    How about

    Code:
    awk '{FS="_"; if($6>=1) print}' file
    savetherhino.org

    Comment

    • AdrianP
      Senior Member
      • Apr 2011
      • 130

      #3
      Originally posted by rhinoceros View Post
      How about

      Code:
      awk '{FS="_"; if($6>=1) print}' file

      You sir, are awesome.

      Comment

      • AdrianP
        Senior Member
        • Apr 2011
        • 130

        #4
        How would I use a similar script to sort the contigs by lengths, and tell me what the top 10 or 20 lengths are?

        Comment

        • winsettz
          Member
          • Sep 2012
          • 91

          #5
          Could try adding to the above:

          Code:
          sort -r -n file | head
          In short:

          Code:
          sort -r -n file
          sorts the contents of file, -r to use reverse order (highest to lowest), and -n uses numeric sort. Otherwise you get 1, 10, 100, 2, 20, 200, 3, 30, 300.

          head just displays the top 10 of the file, or stdout (lest it dump a pile of numbers into stdout). To adjust the number of lines displayed, use -n. For example,
          Code:
          head -n 20
          displays the first 20 lines of a file.

          Comment

          • AdrianP
            Senior Member
            • Apr 2011
            • 130

            #6
            Sorry but the above shows contig names, how will sort know what to sort by? Since it's given the full length of the contig?

            Comment

            • winsettz
              Member
              • Sep 2012
              • 91

              #7
              Oh, whoops.

              Contents of file I wish to process
              Code:
              NODE_2361_length_509_cov_1.43018_ID_745236
              NODE_2361_length_509_cov_5.43018_ID_745236
              NODE_2361_length_509_cov_3.43018_ID_745236
              NODE_2361_length_509_cov_2.43018_ID_745236
              NODE_2361_length_509_cov_21.43018_ID_745236
              The one-liner
              Code:
              sed 's/_/ /g' filename | awk '{FS=" "; print $6}' | sort -r -n | head -n 10
              sed will replace _ with space, then spit out the output to stdout. | to pipe to awk.

              awk will use spaces to separate columns, print $6 instructs awk to display everything from column six (which is the numbers after cov, but before ID).
              You can then pipe that again to sort, with head to display just what you want to display.

              The output
              Code:
              21.43018
              5.43018
              3.43018
              2.43018
              1.43018
              Obviously not the only solution, and there are probably better ones.

              --------------
              Edit: If you're dealing with fasta you will have to separate the headers from the sequences. Standard format is >NODE_...; which can be captured as
              Code:
              grep "^>NODE" contigs.fasta > contignames
              Last edited by winsettz; 10-07-2013, 08:32 AM.

              Comment

              • rhinoceros
                Senior Member
                • Apr 2013
                • 372

                #8
                I would just pipe the awk output to:

                Code:
                sort -r -g -k 4 -t _
                When you have an idea what tool might work, but don't know exactly how, just google "man sort" or whatever the name of the tool happens to be. Also, with sort, it's sometimes better to use -g (general numerical value) instead of -n. For example, -n (string numerical value) doesn't work when you have exponents (1e-10 would be smaller than 2e-100 so e.g. sorting blast output by e-value would fail). Actually, I don't know if there's any reason why -g shouldn't be used always by default..
                Last edited by rhinoceros; 10-07-2013, 08:46 AM.
                savetherhino.org

                Comment

                Latest Articles

                Collapse

                • seqadmin
                  New Genomics Tools and Methods Shared at AGBT 2025
                  by seqadmin


                  This year’s Advances in Genome Biology and Technology (AGBT) General Meeting commemorated the 25th anniversary of the event at its original venue on Marco Island, Florida. While this year’s event didn’t include high-profile musical performances, the industry announcements and cutting-edge research still drew the attention of leading scientists.

                  The Headliner
                  The biggest announcement was Roche stepping back into the sequencing platform market. In the years since...
                  03-03-2025, 01:39 PM
                • seqadmin
                  Investigating the Gut Microbiome Through Diet and Spatial Biology
                  by seqadmin




                  The human gut contains trillions of microorganisms that impact digestion, immune functions, and overall health1. Despite major breakthroughs, we’re only beginning to understand the full extent of the microbiome’s influence on health and disease. Advances in next-generation sequencing and spatial biology have opened new windows into this complex environment, yet many questions remain. This article highlights two recent studies exploring how diet influences microbial...
                  02-24-2025, 06:31 AM

                ad_right_rmr

                Collapse

                News

                Collapse

                Topics Statistics Last Post
                Started by seqadmin, 03-20-2025, 05:03 AM
                0 responses
                18 views
                0 reactions
                Last Post seqadmin  
                Started by seqadmin, 03-19-2025, 07:27 AM
                0 responses
                25 views
                0 reactions
                Last Post seqadmin  
                Started by seqadmin, 03-18-2025, 12:50 PM
                0 responses
                19 views
                0 reactions
                Last Post seqadmin  
                Started by seqadmin, 03-03-2025, 01:15 PM
                0 responses
                187 views
                0 reactions
                Last Post seqadmin  
                Working...