  • Dagga
    Member
    • Feb 2014
    • 20

    Remove N's and split contigs

    Hi,

    I have some genomes that I will be uploading to NCBI soon. I have been told that all N's need to be removed and the contigs split at those positions.

    I am new to the command line, so I was hoping someone could recommend a program or a simple script that could do this for me. I would like to remove all N's and split each contig at the location of the N's, resulting in two new contigs. For example:

    Contig 1: ATCGGATAANNNNNNNNNATCGCCGAT

    Contig 1.1: ATCGGATAA

    Contig 1.2: ATCGCCGAT


    Thanks!
  • TiborNagy
    Senior Member
    • Mar 2010
    • 329

    #2
    perl -ne 'if($_ =~ /([^N]+)N+([^N]+)/){print $1;print stderr $1}' input.seq >contig1.txt 2>contig2.txt

    It will split the input file (input.seq) into contig1.txt and contig2.txt: the first piece goes to standard output and the second to standard error.


    • mastal
      Senior Member
      • Mar 2009
      • 666

      #3
      Should the second print be this instead?

      Code:
      print stderr $2
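
With that correction, the pattern captures the non-N stretch on either side of the first run of N's. The following is a minimal sketch of the same regular expression in Python rather than Perl, run on the example sequence from the first post. The function names `first_two_pieces` and `all_pieces` are made up for illustration. It also suggests why the pattern may fall short on real assemblies: when a sequence has several N runs it keeps only the first two pieces, whereas splitting on `N+` keeps all of them.

```python
import re

# Same pattern as the Perl one-liner: non-N run, N run, non-N run.
PATTERN = re.compile(r"([^N]+)N+([^N]+)")


def first_two_pieces(seq):
    """Return the pieces on either side of the first N run, or None if nothing matches."""
    m = PATTERN.search(seq)
    return (m.group(1), m.group(2)) if m else None


def all_pieces(seq):
    """Split on every run of N's, dropping empty pieces left by leading/trailing N's."""
    return [p for p in re.split(r"N+", seq) if p]


if __name__ == "__main__":
    print(first_two_pieces("ATCGGATAANNNNNNNNNATCGCCGAT"))
    print(first_two_pieces("AANNCCNNGG"))  # the third piece is not captured
    print(all_pieces("AANNCCNNGG"))
```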


      • Dagga
        Member
        • Feb 2014
        • 20

        #4
        Thanks for that!!

        Will this rename the contigs?

        Will the contig that is split be called the same thing in contig1.txt and contig2.txt?

        Is it possible to rename the contigs when they are split? For example, if contig 84 is split into two contigs, can the two halves be renamed contig 84.1 and contig 84.2, respectively?


        • TiborNagy
          Senior Member
          • Mar 2010
          • 329

          #5
          mastal: you are right!
          Dagga: This script does not handle the contig names, only the sequences, because you did not tell us what input format you have.
          Last edited by TiborNagy; 02-18-2014, 05:42 AM.


          • Dagga
            Member
            • Feb 2014
            • 20

            #6
            TiborNagy: Sorry, the file will be in FASTA format, straight from de novo assembly.

            Would you be able to alter the script to handle contig names, please?

            Thanks!


            • mastal
              Senior Member
              • Mar 2009
              • 666

              #7
              If you are doing your assemblies with Velvet, setting '-scaffolding no' when running velvetg will stop Velvet from joining contigs together with stretches of N's.


              • Dagga
                Member
                • Feb 2014
                • 20

                #8
                Excellent!

                Whilst this does help with some genomes that I am assembling right now, we have some older genomes that were sequenced by BGI and these contain N's that we still need to have removed...


                • TiborNagy
                  Senior Member
                  • Mar 2010
                  • 329

                  #9
                  Just for you :-)
                  Code:
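                  #!/usr/bin/perl
                  use strict;
                  use warnings;
                  
                  my $id  = "";   # current FASTA header line (starts with '>')
                  my $seq = "";   # sequence collected for the current record
                  
                  # Print the two pieces around the first run of N's:
                  # the first piece to STDOUT, the second to STDERR.
                  # Note: only the first N run is handled, and records
                  # without N's produce no output.
                  sub print_split {
                     my ($id, $seq) = @_;
                     if($seq =~ /([^N]+)N+([^N]+)/){
                        print "$id.1\n$1\n";
                        print STDERR "$id.2\n$2\n";
                     }
                  }
                  
                  while(<>){
                     chomp;
                  
                     if(/^>/){
                        # New header: flush the previous record, then start a new one.
                        print_split($id, $seq) if $seq ne "";
                        $seq = "";
                        $id = $_;
                     }
                     else{
                        # Sequence line: append to the current record.
                        $seq .= $_;
                     }
                  }
                  
                  # Flush the last record.
                  print_split($id, $seq);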
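
The script above writes only the two pieces around the first run of N's and prints nothing for contigs that contain no N's. If every N has to go, a more general variant might split on every N run, number the pieces .1, .2, and so on, and pass N-free contigs through unchanged. The following is a sketch of that idea in Python rather than Perl, under stated assumptions: the function names (`read_fasta`, `split_record`, `split_fasta`) are hypothetical, lower-case `n` is treated like `N`, the suffix is appended to the first word of the header, a contig that yields a single piece keeps its original name, and no minimum contig length is enforced.

```python
import re

N_RUN = re.compile(r"[Nn]+")  # assumption: lower-case n also marks a gap


def read_fasta(lines):
    """Yield (header, sequence) pairs; header excludes the leading '>'."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def split_record(header, seq):
    """Split seq on every N run; return a list of (new_header, piece)."""
    pieces = [p for p in N_RUN.split(seq) if p]
    if len(pieces) <= 1:
        # Zero pieces (all N's) drops the record; one piece keeps its name.
        return [(header, p) for p in pieces]
    name, _, desc = header.partition(" ")
    out = []
    for i, piece in enumerate(pieces, start=1):
        new_header = f"{name}.{i}" + (f" {desc}" if desc else "")
        out.append((new_header, piece))
    return out


def split_fasta(lines):
    """Return the output FASTA lines for the given input lines."""
    out = []
    for header, seq in read_fasta(lines):
        for new_header, piece in split_record(header, seq):
            out.append(f">{new_header}")
            out.append(piece)
    return out


if __name__ == "__main__":
    # Made-up demo records; the first sequence is wrapped over two lines.
    demo = [">contig84", "ATCGGATAANNNN", "NNNNNATCGCCGAT", ">contig85", "ACGT"]
    print("\n".join(split_fasta(demo)))
```

To run it on a real file, an open file handle can be passed to `split_fasta` in place of the demo list, since it only needs an iterable of lines. All pieces end up in a single output, which avoids having to merge two files afterwards.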


                  • Dagga
                    Member
                    • Feb 2014
                    • 20

                    #10
                    Thanks!! I appreciate it!
