Seqanswers Leaderboard Ad

**Dinab** · 04-04-2017, 05:58 AM

Quality distribution graph

A bit late, but hope you can help me:
I want to create an average quality distribution graph of multiple sequences from fastq file. Is it possible with reformat?
Also, if I use reformat.sh with maq=XX, does it simply calculates the avvarge quality based on base quality and checks the error rate (2% max)?

Thank you!

**Brian Bushnell** · 04-04-2017, 04:09 PM

Originally posted by Dinab View Post

A bit late, but hope you can help me:
I want to create an average quality distribution graph of multiple sequences from fastq file. Is it possible with reformat?

Yes, the histogram section of the help covers that. Depending on how you want to plot it, you can use qhist, qchist, aqhist, or bqhist. The most generic is aqhist, which plots the number of reads per average quality score.

Also, if I use reformat.sh with maq=XX, does it simply calculates the average quality based on base quality and checks the error rate (2% max)?

"maq" transforms the qualities into probability space, calculates the average error rate, and then transforms that back into the phred scale. So, a 2% maximum error rate would be approximately maq=17.

**Dinab** · 04-09-2017, 05:11 AM

Perfect, Thank you!

**Dinab** · 04-09-2017, 11:26 PM

Another question

Hi Brian,
Your previous response worked perfectly, thank you again.
I now have additional complexity to add: is it possible to get a plot of the mean quality of each sequence as variable of the length?

**Brian Bushnell** · 04-10-2017, 08:51 AM

Originally posted by Dinab View Post

I now have additional complexity to add: is it possible to get a plot of the mean quality of each sequence as variable of the length?

No, I've never implemented something like that... the closest you could get with the existing package is to use a loop for each length with something like...

Code:

reformat.sh in=reads.fq minlen=147 maxlen=147 out=stdout.fq int=f | reformat.sh in=stdin.fq aqhist=aqhist_147.txt

**GenoMax** · 04-10-2017, 04:33 PM

Originally posted by Dinab View Post

is it possible to get a plot of the mean quality of each sequence as variable of the length?

FastQC gives you an average per sequence quality score.

**blsfoxfox** · 05-07-2017, 10:25 PM

fastq format error

Hi Brian,

I found there is some format error with my fastq. By using reformat.sh in=in.fastq out=out.fastq, I get following error:

Input is being processed as unpaired
java.lang.AssertionError:
Error in in.fastq, line 4241767, with these 4 lines:
>m150430_235943_42146_c100804572550000001823173810081565_s1_p0/108988/0_3288 RQ=0.868
CTCTTTCCGGTAACACGCGCCTAGATACTAACAACCTACCTTATACTAATCAGCATGCCTAATACCAACA
ACCTACTTTATACCTAATAACAATGCCTTGATAACTAACAACATACCTTAATACCTAATACACGCTTAAT
ACCTAACAACAATACCTTAATACCTAACAACATACTTTAATACCTAACAACCTACCTTAATAACTTAATA

at stream.FASTQ.quadToRead(FASTQ.java:779)
at stream.FASTQ.toReadList(FASTQ.java:710)
at stream.FastqReadInputStream.fillBuffer(FastqReadInputStream.java:110)
at stream.FastqReadInputStream.nextList(FastqReadInputStream.java:95)
at stream.ConcurrentGenericReadInputStream$ReadThread.readLists(ConcurrentGenericReadInputStream.java:656)
at stream.ConcurrentGenericReadInputStream$ReadThread.run(ConcurrentGenericReadInputStream.java:635)
Input: 1060400 reads 9378503777 bases
Output: 1060400 reads (100.00%) 9378503777 bases (100.00%)

Time: 75.193 seconds.
Reads Processed: 1060k 14.10k reads/sec
Bases Processed: 9378m 124.73m bases/sec
Exception in thread "main" java.lang.RuntimeException: ReformatReads terminated in an error state; the output may be corrupt.
at jgi.ReformatReads.process(ReformatReads.java:1130)
at jgi.ReformatReads.main(ReformatReads.java:45)

Could you please provide any suggestions on which steps shoud I follow to find the exact error of that read?

Thanks!

**Brian Bushnell** · 05-07-2017, 10:33 PM

The problem here is that you are using a fasta sequence named "in.fastq". BBTools is sensitive to filenames, and *.fastq will be processed as a fastq file. If the file has no extension it will usually look at the contents to try to figure out what it is, but when there is a known extension, it assumes it is correct.

So, just rename the input file to "in.fasta" or add the flag "extin=.fasta" to override filename. Although for most uses of PacBio data I do recommend that you go back and get the original fastq file and use that, because the quality scores are often useful.

**blsfoxfox** · 05-07-2017, 10:42 PM

Hi Brian,

Thanks for your quick response!

Actually, I am doing that on purpose. I found there is format error in my fastq file while using another software, but I don't know where is it. By using reformat.sh with in=in.fastq out=out.fastq, bbmap will report where is that error. In this case, read >m150430_235943_42146_c100804572550000001823173810081565_s1_p0/108988/0_3288 RQ=0.868 causes a corruption of the output. However, I still don't know what's wrong and the only solution I can think of now is using filterbyname tool to exclude that read from my fastq.
Thus, I really appreciate if you could provide any suggestion on how could I identify and fix that format error.

Thanks again for your time,

**blsfoxfox** · 05-08-2017, 06:39 AM

Originally posted by Brian Bushnell View Post

The problem here is that you are using a fasta sequence named "in.fastq". BBTools is sensitive to filenames, and *.fastq will be processed as a fastq file. If the file has no extension it will usually look at the contents to try to figure out what it is, but when there is a known extension, it assumes it is correct.

So, just rename the input file to "in.fasta" or add the flag "extin=.fasta" to override filename. Although for most uses of PacBio data I do recommend that you go back and get the original fastq file and use that, because the quality scores are often useful.

Hi Brian,

Sorry for that I should double checked my input.fastq. After grep multiple lines after that read, I find some fasta sequences in that fastq file. As they are raw sequences, I don't have a 'clean' backup for it. So my problem now is to find a tool which could extract fasta sequences from fastq.

Thanks,

**GenoMax** · 05-08-2017, 06:47 AM

Code:

reformat.sh in=seq.fq out=seq.fa

will convert fastq sequences to fasta. If your problem is malformed fastq records then you need to find and delete those records manually.

**blsfoxfox** · 05-08-2017, 06:58 AM

Originally posted by GenoMax View Post

Code:

reformat.sh in=seq.fq out=seq.fa

will convert fastq sequences to fasta. If your problem is malformed fastq records then you need to find and delete those records manually.

Hi,

Thanks for your response!

My problem is that my file is mixed of fasta and fastq, and I don't know how to identify those fasta sequences and extract them.

**GenoMax** · 05-08-2017, 07:02 AM

How did that happen?

Unless you have a defined reason I would suggest not trusting that file.

**skatrinli** · 05-30-2017, 01:29 AM

Hey Brian,

Thanks for this great tool but I could not properly download it. I installed the latest version 37.25 but when i try to test the installation with the command
$ (installation directory)/stats.sh in=(installation directory)/resources/phix174_ill.ref.fa.gz
it comes this error:
Exception in thread "main" java.lang.RuntimeException: Unknown parameter Downloads/bbmap/resources/phix174_ill.ref.fa.gz
at jgi.AssemblyStats2.<init>(AssemblyStats2.java:166)
at jgi.AssemblyStats2.main(AssemblyStats2.java:39)

What did I do wrong?

**GenoMax** · 05-30-2017, 04:26 AM

There is no "installation" needed for BBMap. Just uncompress and run (as long as you have Java available for your OS). Are you using Java 1.7 or 1.8?
@Brian no longer tests against Java 1.6 (which is what you may be using) if I recall.

I get the following when I run stats.sh.

Code:

$ stats.sh in=/path_to/bbmap/resources/phix174_ill.ref.fa.gz 
A       C       G       T       N       IUPAC   Other   GC      GC_stdev
0.2399  0.2144  0.2326  0.3130  0.0000  0.0000  0.0000  0.4471  0.0000

Main genome scaffold total:             1
Main genome contig total:               1
Main genome scaffold sequence total:    0.005 MB
Main genome contig sequence total:      0.005 MB        0.000% gap
Main genome scaffold N/L50:             1/5.386 KB
Main genome contig N/L50:               1/5.386 KB
Main genome scaffold N/L90:             1/5.386 KB
Main genome contig N/L90:               1/5.386 KB
Max scaffold length:                    5.386 KB
Max contig length:                      5.386 KB
Number of scaffolds > 50 KB:            0
% main genome in scaffolds > 50 KB:     0.00%


Minimum         Number          Number          Total           Total           Scaffold
Scaffold        of              of              Scaffold        Contig          Contig  
Length          Scaffolds       Contigs         Length          Length          Coverage
--------        --------------  --------------  --------------  --------------  --------
    All                      1               1           5,386           5,386   100.00%
   5 KB                      1               1           5,386           5,386   100.00%

Topics	Statistics	Last Post
ASHG 2024 Highlights – Part Two by seqadmin Started by seqadmin, 11-08-2024, 11:09 AM	0 responses 35 views 0 likes	Last Post by seqadmin 11-08-2024, 11:09 AM
ASHG 2024 Highlights – Part One by seqadmin Started by seqadmin, 11-08-2024, 06:13 AM	0 responses 28 views 0 likes	Last Post by seqadmin 11-08-2024, 06:13 AM
Seq-Scope Expands Possibilities for High-Resolution Gene Expression Analysis by seqadmin Started by seqadmin, 11-01-2024, 06:09 AM	0 responses 32 views 0 likes	Last Post by seqadmin 11-01-2024, 06:09 AM
New Model Aims to Explain Polygenic Diseases by Connecting Genomic Mutations and Regulatory Networks by seqadmin Started by seqadmin, 10-30-2024, 05:31 AM	0 responses 23 views 0 likes	Last Post by seqadmin 10-30-2024, 05:31 AM

Seqanswers Leaderboard Ad

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News