Hi,
I need some help parsing a GFF3 file. Essentially, I am trying to pull specific attributes out of the info section (the ninth column) using awk. I can't simply pick them by position because the lines do not all carry the same attributes, so I need to match on the attribute name. A few lines from the file are shown below.
What I am doing is using awk to pull out the last field, i.e. $9, replace the semicolons with tabs, and then parse further with a regex match to get the fields I want. The issue is that my printf does not seem to be printing only the matching fields. My command and its output are shown below.
OK, so it looks like everything works up to the last awk: if I change the last awk to simply print $3, I get the expected result. What is not happening is the match printing ONLY the matching fields. What am I doing wrong in my last awk such that it isn't printing just the $i that matches?
Sample lines from the file:
NC_015011.2 Gnomon gene 18691 26481 . + . ID=gene0;Dbxref=GeneID:100538868;Name=LOC100538868;gbkey=Gene;gene=LOC100538868;partial=true;start_range=.,18691
NC_015011.2 Gnomon mRNA 18691 26481 . + . ID=rna0;Parent=gene0;Dbxref=GeneID:100538868,Genbank:XM_010707932.1;Name=XM_010707932.1;gbkey=mRNA;gene=LOC100538868;partial=true;product=hematopoietic lineage cell-specific protein-like;start_range=.,18691;transcript_id=XM_010707932.1
NC_015011.2 Gnomon exon 18691 18743 . + . ID=id1;Parent=rna0;Dbxref=GeneID:100538868,Genbank:XM_010707932.1;gbkey=mRNA;gene=LOC100538868;partial=true;product=hematopoietic lineage cell-specific protein-like;start_range=.,18691;transcript_id=XM_010707932.1
NC_015011.2 Gnomon exon 18865 18994 . + . ID=id2;Parent=rna0;Dbxref=GeneID:100538868,Genbank:XM_010707932.1;gbkey=mRNA;gene=LOC100538868;partial=true;product=hematopoietic lineage cell-specific protein-like;transcript_id=XM_010707932.1
Code:
awk -F "\t" '{ print $9 }' mga_ref_Turkey_5.0_NCBI_FINAL_no_GI_no_region.gff3.txt | grep product | awk -F ";" '{ gsub(";","\t",$0);print $0 }' | awk -F "\t" '{for(i=0;i<NF;i++){if($i~/gene\=/){printf $i};if($i~/product\=/){printf $i }}printf "\n"}' | head
Output:
ID=rna0 Parent=gene0 Dbxref=GeneID:100538868,Genbank:XM_010707932.1 Name=XM_010707932.1 gbkey=mRNA gene=LOC100538868 partial=true product=hematopoietic lineage cell-specific protein-like start_range=.,18691 transcript_id=XM_010707932.1ID=rna0 Parent=gene0 Dbxref=GeneID:100538868,Genbank:XM_010707932.1 Name=XM_010707932.1 gbkey=mRNA gene=LOC100538868 partial=true product=hematopoietic lineage cell-specific protein-like start_range=.,18691 transcript_id=XM_010707932.1gene=LOC100538868product=hematopoietic lineage cell-specific protein-like
ID=id1 Parent=rna0 Dbxref=GeneID:100538868,Genbank:XM_010707932.1 gbkey=mRNA gene=LOC100538868 partial=true product=hematopoietic lineage cell-specific protein-like start_range=.,18691 transcript_id=XM_010707932.1ID=id1 Parent=rna0 Dbxref=GeneID:100538868,Genbank:XM_010707932.1 gbkey=mRNA gene=LOC100538868 partial=true product=hematopoietic lineage cell-specific protein-like start_range=.,18691 transcript_id=XM_010707932.1gene=LOC100538868product=hematopoietic lineage cell-specific protein-like
ID=id2 Parent=rna0 Dbxref=GeneID:100538868,Genbank:XM_010707932.1 gbkey=mRNA gene=LOC100538868 partial=true product=hematopoietic lineage cell-specific protein-like transcript_id=XM_010707932.1ID=id2 Parent=rna0 Dbxref=GeneID:100538868,Genbank:XM_010707932.1 gbkey=mRNA gene=LOC100538868 partial=true product=hematopoietic lineage cell-specific protein-like transcript_id=XM_010707932.1gene=LOC100538868product=hematopoietic lineage cell-specific protein-like
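
A possible explanation, offered as a sketch rather than a definitive diagnosis: awk numbers fields from 1, and $0 is the whole record. With the loop starting at i=0, the first test runs against the entire line, which matches both gene= and product=, so the full line is printed twice before the two wanted fields. The condition i<NF also stops one short of the last field, and printf $i uses the data itself as the format string, which would misbehave on a literal %. The version below does the extraction in a single awk by splitting column 9 on semicolons. The function name extract_gene_product and the shortened sample records are made up for illustration, and the input is assumed to be tab-separated.

```shell
# Hypothetical helper: reads tab-separated GFF3 on stdin and prints the
# gene= and product= attributes of every record that has a product.
extract_gene_product() {
  awk -F '\t' '
    $9 ~ /product=/ {
      n = split($9, attr, ";")          # attributes are ";"-separated
      gene = ""; product = ""
      for (i = 1; i <= n; i++) {        # indices run from 1 to n inclusive
        if (attr[i] ~ /^gene=/)    gene = attr[i]
        if (attr[i] ~ /^product=/) product = attr[i]
      }
      printf "%s\t%s\n", gene, product  # data goes in as arguments, not as the format
    }'
}

# Illustrative input: shortened versions of the gene and mRNA records.
{
  printf 'NC_015011.2\tGnomon\tgene\t18691\t26481\t.\t+\t.\tID=gene0;gene=LOC100538868;partial=true\n'
  printf 'NC_015011.2\tGnomon\tmRNA\t18691\t26481\t.\t+\t.\tID=rna0;Parent=gene0;gene=LOC100538868;product=hematopoietic lineage cell-specific protein-like\n'
} | extract_gene_product
```

On the two sample records this prints a single line with the gene and product attributes separated by a tab; the gene record is skipped because it has no product. If the original three-stage pipeline is preferred, changing the loop to for(i=1;i<=NF;i++) and the prints to printf "%s", $i should have the same effect on the last stage.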