  • Ambiguous bases should not be more than total 10% length or more than 14n's in a row.

    Hi all,
    I am trying to submit a transcriptome assembly to the TSA.
    The format is like this:
    >seq1234
    TTTTTTTNNNTTTTTTTTTTTTGGTTTTCTTGAGTAAAGTAAAAAAACCTGAATGATG
    GATGAGGCGAATGATGTGAGGATAAATNNNNAAACGANTNTTATAAGATGTAAAAGTT
    GTCATTAACTTAGTAAAGGCCCTAATTATTGAAGTTAATTATTCCAATGGATAAAAAT
    >seq1235
    AGACACATCGTGTGTTTCTGGATCTTTTTCAGCTTCTTCCTTCAAATCTACTCTGGTT
    GGTGCTGCTGTCAACTGCATCATTTTCGTTTGCTNNNNNCTTTTTGGCCGGAGCATCA
    and so on...

    The TSA is asking for this criterion:
    Ambiguous bases should not be more than total 10% length or more than 14n's in a row.

    Does anyone know a quick Linux-based solution for this?
    I googled it, but I only found solutions that replace the ambiguous bases, like this:

    or this,

    but I have Perl issues with this.

    Any Linux-based solution would be appreciated!
    Thanks

  • #2
    The quoted script will convert Ns into As. I doubt that this is what you really want to submit to the TSA, since at that point you would be submitting incorrect information.

    I do not have a program to recommend, but I would simply throw away the scaffolds/contigs that do not meet TSA's criteria.


    • #3
      I would also recommend throwing away scaffolds that are more than 10% ambiguous. But for scaffolds with more than 14 consecutive Ns, you can either split them into two scaffolds at that point, or change the Ns into a single N (which is still technically valid, as N signifies an unknown sequence of unknown length). Otherwise you could lose a lot of useful information.

      Unfortunately I don't have a tool that does this.
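
A minimal sketch of the splitting option described above, in plain Python. The function name `split_at_long_n_runs` and the `_1`, `_2` naming of the pieces are illustrative assumptions, not part of any existing tool. Each piece would still need to be checked against the 10% limit separately.

```python
import re

MAX_N_RUN = 14  # runs longer than this trigger a split


def split_at_long_n_runs(name, seq, max_run=MAX_N_RUN):
    """Split seq at every run of more than max_run Ns.

    Returns a list of (name, piece) tuples. The long N runs themselves
    are dropped; shorter runs are left untouched inside the pieces.
    """
    # Match max_run + 1 or more consecutive Ns (either case).
    pattern = re.compile(r"[Nn]{%d,}" % (max_run + 1))
    pieces = [p for p in pattern.split(seq) if p]
    if len(pieces) <= 1:
        # Nothing to split (or only a leading/trailing run was trimmed).
        return [(name, p) for p in pieces]
    # Hypothetical naming scheme: original name plus a 1-based suffix.
    return [(f"{name}_{i}", p) for i, p in enumerate(pieces, 1)]


if __name__ == "__main__":
    for piece_name, piece in split_at_long_n_runs("seq1234", "ACGT" + "N" * 20 + "GGCC"):
        print(f">{piece_name}\n{piece}")
```

A sequence that consists only of a long N run yields an empty list, so it disappears from the output.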


      • #4
        Originally posted by Brian Bushnell
        ... or change the Ns into a single N (which is still technically valid as N signifies an unknown sequence of unknown length). ...
        I do not agree with Brian on this. A single N should mean a single base that cannot be resolved, often due to quality or other technical factors. It should not represent an unknown length. Multiple Ns, just like poly-A or other poly tracts, often do represent unknown lengths, because it is hard to accurately sequence and assemble long stretches of a single nucleotide.


        • #5
          And as reference to an authority (instead of my own personal opinion), NCBI says (I made the relevant text bold)
          TSA does not accept assemblies which have Ns inserted to represent gaps of unknown length. Sequences containing Ns representing gaps of unknown length need to be split into individual assemblies. Internal Ns representing ambiguous bases or known length gaps can be submitted. If the Ns represent ambiguous bases they should not be more than 10% of the sequence length or more than 14 n's in a row. If the N's represent a known length gap then an assembly_gap feature must be used.


          • #6
            OK, I will defer to that guidance, then. I interpret single N's as single unknown bases, but I know I have read alternate definitions of N as meaning unknown sequence of unknown length, though I couldn't find a reference to that when searching.

            Note, though, that those guidelines are not necessarily ideal, and preclude the submission of scaffolded assemblies such as HG19.


            • #7
              @papori - What software were you using for the transcriptome assembly? In the example you posted, were there multiple reads with Ns in those positions, or was there no consensus among the reads that spanned that region?


              • #8
                I am using Trinity, but I just figured out that I didn't use it properly, and that is the reason for the Ns.
                Now Trinity has finished running again, and I found that I don't have any Ns in the whole assembly.

                So it is still an interesting question:
                How to filter out contigs with more than 10% Ns or more than 14 in a row?

                But for me the problem was solved by using different parameters in Trinity.
                Thanks!


                • #9
                  For filtering I would think bioperl or biopython would come in useful. Just read in the resulting fasta files with those and then iterate over the contigs, calculating N content and such. That should be a pretty straightforward program to write (assuming you can code, otherwise I imagine it'd prove anything but straightforward).
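
As a rough illustration of that idea, here is a sketch that uses only the Python standard library instead of BioPerl or Biopython, so it runs without extra installs. The function names (`read_fasta`, `passes_n_limits`, `filter_fasta`) are made up for this example, and the thresholds are taken from the TSA wording quoted in this thread.

```python
import re

MAX_N_FRACTION = 0.10  # more than 10% Ns fails
MAX_N_RUN = 14         # more than 14 Ns in a row fails


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def passes_n_limits(seq):
    """Return True if seq is within both N limits (empty sequences fail)."""
    if not seq:
        return False
    upper = seq.upper()
    if upper.count("N") / len(upper) > MAX_N_FRACTION:
        return False
    longest = max((len(m.group()) for m in re.finditer(r"N+", upper)), default=0)
    return longest <= MAX_N_RUN


def filter_fasta(in_handle, out_handle, width=60):
    """Copy passing records to out_handle; return (kept, dropped) counts."""
    kept = dropped = 0
    for header, seq in read_fasta(in_handle):
        if passes_n_limits(seq):
            out_handle.write(f">{header}\n")
            for i in range(0, len(seq), width):
                out_handle.write(seq[i:i + width] + "\n")
            kept += 1
        else:
            dropped += 1
    return kept, dropped
```

To run it on files, something like `with open("in.fasta") as fin, open("out.fasta", "w") as fout: filter_fasta(fin, fout)` should work, with the file names being placeholders. This only counts N/n; other IUPAC ambiguity codes (R, Y, and so on) are not treated as ambiguous here, which may or may not match what the TSA checks.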
