  • eadonyo
    Junior Member
    • Jun 2018
    • 4

    NGS seq. Analysis

    Hello,
    I have a question I need help with. I have these sequences in a FASTA file.
    Code:
    >HG2FEE201A723Q	SAMPLE=USERID-19_JOBID-10_HG2FEE201_166281C_MID21	GENE=PR	STRAND=-	NOTRIM_LEN=512	Mean:33	Len:497	Trimmedat5':0	Trimmedat3':5 	AlignmentScore: 21630	AmpliconCoverage: 402	FullCoverage: Y
    ---CTTGTCTCAAT-AAGGTAGGGGGCCA---GATAAGGGAGGCTCTCTTAGACACAGGAGCAGATGATACAGTATTAGAAGAAATAAGTTTGCCAGGAAAATGGAAACCAAAAATGATAGGGGGAATTGGAGGTTTTATCAAAGTAAGACAGTATGATCAAGTACCTATAGAAATTTGTGGAAAAAAGGCTATAGGCACAGTATTAATAGGACCTACACCTATCAACATAATTGGAAGGAATATGTTGACTCAACTTGGATGCACACTAAATTTTCCAATTAGTCCCATTGAAACTGTACCAGTAAAATTAAAGCCAGGAATGGATGGCCCAAAGGTCAAACAATGGCCATTGACAGAAGAGAAAATAAAAGCATTAACAGC---A---ATTTGTGAAGA---AATGGAGAAGGAA
    >HG2FEE201B2MWP	SAMPLE=USERID-19_JOBID-10_HG2FEE201_166281C_MID21	GENE=PR	STRAND=+	NOTRIM_LEN=544	Mean:31	Len:450	Trimmedat5':0	Trimmedat3':61 	AlignmentScore: 19950	AmpliconCoverage: 402	FullCoverage: Y
    Questions

    1) How many genes are represented in this data, and how many sequences are there for each sequenced gene?

    2) What is the average read length before and after trimming (denoted by NOTRIM_LEN and Len respectively)?

    3) Are any of the DNA sequences in the file identical to each other, and if so what is the highest number of identical sequences? (Hint: sort isn’t just for numbers!)
    Last edited by GenoMax; 06-08-2018, 04:18 AM.
  • GenoMax
    Senior Member
    • Feb 2008
    • 7142

    #2
    Do you know how this file was generated?

    It looks like this may actually be an aligned FASTA file, so it would not be straightforward to identify how many duplicate sequences there are.

    • eadonyo
      Junior Member
      • Jun 2018
      • 4

      #3
      I have the whole FASTA file. I just copied the first three lines using the Bash command head -3.
      We were asked to use Bash commands to find the genes and sequences.

      • GenoMax
        Senior Member
        • Feb 2008
        • 7142

        #4
        Is this an assignment?

        • r.rosati
          Member
          • Aug 2015
          • 95

          #5
          Paraphrasing Stack Overflow, "What have you tried so far?"

          • eadonyo
            Junior Member
            • Jun 2018
            • 4

            #6
            Yes, it is a class assignment. That is why I posted only the first three lines; the whole file is about 1.6 MB. What I am trying to work out is how to identify and count genes within the nucleotide sequences.

            • eadonyo
              Junior Member
              • Jun 2018
              • 4

              #7
              This is what I have done so far:


              Put all the sequences on one line:
              awk '{ printf "%s", $0 }' BigData1.fasta

              and break the lines at the ">" separator:

              awk '{ printf "%s", (/^>/ ? "\n" $0 : $0) }' BigData1.fasta

              Now I can count the word occurrences using wc -w and wc -l in Bash.

              • finswimmer
                Member
                • Oct 2016
                • 60

                #8
                Hello,

                Before you can start to answer your questions, you have to get familiar with the file format. Let's analyse the format you showed us.

                In a FASTA file, each record consists of a header line that begins with ">" and one or more lines containing the sequence itself. In your case it seems that each sequence is on a single line.

                The header line of each sequence holds several pieces of information arranged in tab-delimited columns. It seems that the same kind of information is always in the same column.

                So whenever we want to extract information from the header, we have to look at the lines that start with ">". If we are interested in the sequence, we need the lines without ">".
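
To make the header/sequence split concrete, here is a minimal sketch on a made-up two-record file. The file contents and the variable names (demo, n_headers, n_seqs) are assumptions for illustration only, not part of your data.

```shell
# Build a tiny two-record FASTA file (made-up data) in a temporary location.
demo=$(mktemp)
printf '>read1\tSAMPLE=demo\tGENE=PR\n--ACGT-A\n>read2\tSAMPLE=demo\tGENE=RT\nACGTTT\n' > "$demo"

# Header lines start with ">"; grep -c counts the matching lines.
n_headers=$(grep -c '^>' "$demo")

# Sequence lines are everything else; -v inverts the match.
n_seqs=$(grep -vc '^>' "$demo")

echo "headers: $n_headers, sequence lines: $n_seqs"
rm -f "$demo"
```

If both counts agree on your real file, that supports the guess that every sequence sits on a single line.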

                Let's have a look at your first question:

                1) How many genes are represented in this data and how many sequences are there for each sequenced gene.

                The gene name:
                1. is in the header line
                2. is in the third column
                3. is prefixed with "GENE="
                4. can occur multiple times


                One way to get the list of distinct names is this:

                Code:
                grep "^>" your.fasta|cut -f3|sed 's/GENE=//'|sort -u > genes.txt
                grep finds all lines starting with ">", cut selects the third column, sed removes the "GENE=" prefix and leaves only the gene name, and sort -u sorts the names and removes duplicates.

                With this list of gene names we can answer the second part of the question. We need to iterate over the list and count the lines that contain each gene name.

                Code:
                for gene in $(cat genes.txt); do echo $gene; grep -wc "GENE=$gene" your.fasta; done|paste - -
                paste is used to show the gene name and the counts in a row.
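
As an alternative sketch, the list and the counts can probably be produced in one pass by letting uniq -c do the counting. The three-record file below is made up, and the names demo and gene_counts are only for illustration.

```shell
# Made-up three-record FASTA file: two reads for gene PR, one for RT.
demo=$(mktemp)
printf '>r1\tSAMPLE=demo\tGENE=PR\nACGT\n>r2\tSAMPLE=demo\tGENE=RT\nACGA\n>r3\tSAMPLE=demo\tGENE=PR\nACGG\n' > "$demo"

# Extract the gene names as before, then sort so equal names are adjacent
# and let uniq -c prefix each distinct name with its count.
# The final awk only swaps the columns to "name<TAB>count".
gene_counts=$(grep '^>' "$demo" | cut -f3 | sed 's/^GENE=//' | sort | uniq -c | awk '{ print $2 "\t" $1 }')

echo "$gene_counts"
rm -f "$demo"
```

This reads the file once instead of once per gene, which should matter little for a 1.6 MB file but avoids the intermediate genes.txt.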

                2) What is the average read length before and after trimming (denoted by NOTRIM_LEN and Len respectively)

                I showed you above how to extract values from the header of each read, so I will not post a full solution here. The extracted read lengths can be piped to awk, which can calculate the average read length.

                Code:
                [extracted_read_length]|awk '{ total += $1; } END { print total/NR }'
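
To show only the averaging step, here is a small sketch that feeds awk the lengths from the two example headers above instead of a real extraction. The variable names avg_before and avg_after are made up, and the NR check is an added guard against empty input.

```shell
# NOTRIM_LEN values from the two example headers: 512 and 544.
avg_before=$(printf '512\n544\n' | awk '{ total += $1 } END { if (NR > 0) print total / NR }')

# Len values from the same two headers: 497 and 450.
avg_after=$(printf '497\n450\n' | awk '{ total += $1 } END { if (NR > 0) print total / NR }')

echo "before trimming: $avg_before, after trimming: $avg_after"
```
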
                3) Are any of the DNA sequences in the file identical to each other, and if so what is the highest number of identical sequences? (Hint: sort isn’t just for numbers!)

                As this is an assignment, I will give you just some hints. Check the man pages for grep, sort and uniq for helpful options.
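
As an illustration of what those man pages describe, here is a sketch on toy input (plain letters, not sequence data). The variable name top is made up. How gap characters in aligned sequences should be treated is left open here.

```shell
# Toy input: "b" appears three times, "a" twice, "c" once.
# sort groups equal lines, uniq -c counts each group,
# sort -rn puts the largest count first, head keeps that line.
# The final awk just strips the padding uniq -c puts before the count.
top=$(printf 'b\na\nb\nc\na\nb\n' | sort | uniq -c | sort -rn | head -n 1 | awk '{ print $1, $2 }')

echo "$top"
```
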
                fin swimmer
                Last edited by finswimmer; 06-13-2018, 05:35 AM.
