Before you can start to answer your question, you have to get familiar with the file format. Let's analyse the format you showed us.
In a FASTA file, each sequence record consists of a header line that starts with ">" and one or more lines with the sequence itself. In your case it seems that each sequence is on a single line.
The header line of each sequence holds several pieces of information, arranged in columns delimited by tabs. It seems that the same kind of information is always in the same column.
So whenever we want to extract information from the header, we have to look for lines that start with ">". If we are interested in the sequence, we need the lines without ">".
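To make the header/sequence distinction concrete, here is a minimal sketch on a made-up two-record file. The read names and the SAMPLE= column are only assumptions for illustration; your real file will have different columns.

```shell
# Build a tiny hypothetical FASTA file (tab-delimited header columns).
tmp=$(mktemp)
printf '>read1\tSAMPLE=a\tGENE=abc1\nACGTACGT\n>read2\tSAMPLE=a\tGENE=xyz2\nGGCC\n' > "$tmp"

# Header lines start with ">", sequence lines do not.
headers=$(grep -c '^>' "$tmp")
seqs=$(grep -vc '^>' "$tmp")

echo "header lines: $headers, sequence lines: $seqs"
rm -f "$tmp"
```

The option -c makes grep print the number of matching lines, and -v inverts the match.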
Let's have a look at your first question:
1) How many genes are represented in this data and how many sequences are there for each sequenced gene.
The information about the gene name is
- in the header line
- in the third column
- prefixed with "GENE="
- a gene name can occur multiple times
One way to get the list of distinct names is this:
Code:
grep "^>" your.fasta|cut -f3|sed 's/GENE=//'|sort -u > genes.txt
With this list of gene names we can answer the second part of the question. We need to iterate over the list and count the lines that contain each gene name.
Code:
for gene in $(cat genes.txt); do echo $gene; grep -wc "GENE=$gene" your.fasta; done|paste - -
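As an alternative to the loop, the counting can probably be done in a single pass with sort and uniq -c. This is only a sketch on a made-up three-record file; it assumes, as above, that the gene name is in the third tab-delimited column and prefixed with "GENE=".

```shell
# Hypothetical sample: gene abc1 appears twice, xyz2 once.
tmp=$(mktemp)
printf '>r1\tX=1\tGENE=abc1\nACGT\n>r2\tX=1\tGENE=xyz2\nGG\n>r3\tX=1\tGENE=abc1\nTTAA\n' > "$tmp"

# Keep header lines, take column 3, strip the prefix,
# then sort so that uniq -c can count identical neighbours.
counts=$(grep '^>' "$tmp" | cut -f3 | sed 's/^GENE=//' | sort | uniq -c)

echo "$counts"
rm -f "$tmp"
```

Each output line holds a count followed by the gene name, and the number of output lines is the number of distinct genes. The loop above reads the file once per gene, which can matter for large files with many genes.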
2) What is the average read length before and after trimming (denoted by NOTRIM_LEN and Len respectively)
I showed you above how to extract the values for each read, so I will not post a full solution here. The extracted read lengths can be piped to awk, which can calculate the average read length.
Code:
[extracted_read_length]|awk '{ total += $1; } END { print total/NR }'
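To see what the awk part does on its own, you can feed it a few made-up numbers, one per line. The three values here are purely illustrative and stand in for the extracted read lengths.

```shell
# Three hypothetical read lengths, one per line.
# awk sums the first field of every line; NR is the number of lines read.
avg=$(printf '100\n150\n200\n' | awk '{ total += $1 } END { if (NR > 0) print total / NR }')

echo "$avg"
```

The check on NR is a small addition that avoids a division by zero when the input is empty.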
As this is an assignment, I only gave you some hints. Check the man pages for grep, sort and uniq for helpful options.
fin swimmer