**akorobeynikov** · 12-01-2014, 11:44 PM

Originally posted by ssully

But 454 paired end reads are two 'end' reads connected by a linker sequence. Does the IonHammer corrector actually recognize those and split the reads before correcting? Or do the 454 PE reads have to first be split into left/right by linker removal, then run through --only-error-correction?

...and oriented rf (reverse-forward) if they are to be interpreted as Illumina mate pairs? (yes, they are low coverage)

Yes, you need to split them first. Make sure you specify the correct library type (mate pairs) and the correct orientation (whatever you have; even ff is supported). See http://spades.bioinf.spbau.ru/releas...al.html#sec3.2 for more information.
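For readers who want to see what the split amounts to, here is a minimal Python sketch. It is only an illustration: `split_at_linker` is a hypothetical helper, the linker is whatever the caller passes in (the one below is made up), and matching is exact, whereas real 454 reads can carry errors inside the linker, so a dedicated tool is the safer choice for real data.

```python
def split_at_linker(seq, qual, linker):
    """Split one read at the first exact occurrence of the linker.

    Returns ((left_seq, left_qual), (right_seq, right_qual)), or None when
    the linker is absent. Case is ignored so lower-case bases do not hide
    a match.
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    start = seq.upper().find(linker.upper())
    if start == -1:
        return None
    end = start + len(linker)
    return (seq[:start], qual[:start]), (seq[end:], qual[end:])


# Toy read: 8 bases, a made-up 7-base linker, then 8 more bases.
read = "AAAACCCC" + "GATTACA" + "TTTTGGGG"
quals = "I" * len(read)
print(split_at_linker(read, quals, "GATTACA"))
```

Both halves come out in the direction in which they were sequenced, which is why the library orientation still has to be declared to SPAdes afterwards.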

**ssully** · 12-02-2014, 06:43 PM

I have removed the linkers and split the 454 mate pair reads with sff_extract; I have them now as (after deinterlacing) a pair of fastq files (454_1.fastq and 454_2.fastq) containing reads _1 and _2 only, respectively. In each case Read_1 represents the pre-linker and Read_2 represents the post-linker part of the original read, both in forward orientation:

schematic of original read

Code:

================================^^^^^^^^^^^^^^^=======================
454_1--->                             linker    454_2--->

But I'm a bit confused about which mate-pair parameters to feed to SPAdes for 454 mate pair reads, because when assembled they should be ordered _2 --> _1 (again, both in forward, i.e. 5'--3', orientation), with the library insert size between them.

schematic of assembled reads

Code:

454_2                                                   454_1
-------->                  (~3kb)                        -------->
==================================================================

How can I make sure SPAdes assembles these pairs in the correct order and orientation?

Would a YAML readset section like this work?

{
  orientation: "ff",
  type: "mate-pairs",
  right reads: [
    "/FULL_PATH_TO_DATASET/454_1.fastq"
  ],
  left reads: [
    "/FULL_PATH_TO_DATASET/454_2.fastq"
  ]
},

or should it be

{
  orientation: "ff",
  type: "mate-pairs",
  right reads: [
    "/FULL_PATH_TO_DATASET/454_2.fastq"
  ],
  left reads: [
    "/FULL_PATH_TO_DATASET/454_1.fastq"
  ]
},

?

(I adapted these views from http://seqanswers.com/forums/showpos...85&postcount=2 )

**akorobeynikov** · 12-03-2014, 12:04 AM

The second variant looks correct to me. Basically, you need to specify the first and the second read of a fragment and the direction in which each was read.

Anyway, you can simply feed the data to SPAdes and check whether it inferred the insert size distribution properly.

**ssully** · 12-03-2014, 08:52 AM

I don't know; to me the second variant seems to say, 'the reads from the right side of the library read (post-linker, 454_2.fastq) belong at the right end of the genome fragment', which would be incorrect.

For me it really comes down to what 'right reads' and 'left reads' means in the YAML specification:

E.g., does 'right reads' refer to a read's position in the 454 mate pair library read (i.e., right side/post-linker in the 454 read, but mapping to the left end of the genomic fragment), or to its position with respect to the genome (i.e., mapping to the right end of the genomic fragment, but coming from the left side/pre-linker half of the 454 read)?

(it's also unusual to me that 'right read' is specified before 'left read' in the YAML, for both paired end and mate pair types, given that sequences are typically read by humans from left to right, 5' to 3'... is there a particular reason for that?)

But anyway I can try inputting it both ways, in two runs, and see which one assembles the 454 mate pairs correctly.
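One way to prepare the two runs is to generate both dataset descriptions from the same helper, so they differ only in which file is listed as left and which as right. The Python sketch below is a hedged illustration: the keys are the ones quoted in this thread, the paths are the same placeholders, `mate_pair_library`, `render_dataset` and the `variant_*.yaml` names are hypothetical, and it is an assumption (not checked here) that SPAdes accepts the JSON-style text this produces as a YAML dataset file.

```python
import json


def mate_pair_library(left_reads, right_reads, orientation="ff"):
    """Describe one mate-pair library using the keys quoted in this thread."""
    return {
        "orientation": orientation,
        "type": "mate-pairs",
        "left reads": [left_reads],
        "right reads": [right_reads],
    }


def render_dataset(libraries):
    """Serialise a list of libraries as JSON text (assumed YAML-compatible)."""
    return json.dumps(libraries, indent=2)


# Placeholder paths, exactly as used in the posts above.
first = "/FULL_PATH_TO_DATASET/454_1.fastq"
second = "/FULL_PATH_TO_DATASET/454_2.fastq"

# variant_1 mirrors the first YAML section above, variant_2 the second.
variants = {
    "variant_1.yaml": [mate_pair_library(left_reads=second, right_reads=first)],
    "variant_2.yaml": [mate_pair_library(left_reads=first, right_reads=second)],
}

for name, libraries in variants.items():
    print("# " + name)
    print(render_dataset(libraries))
```

The sketch only prints the two descriptions; each would be saved to its own file and passed to a separate run.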

**ssully** · 12-06-2014, 07:38 AM

I worked out the correct orientation and order of 454 paired reads input for SPAdes, and have corrected the reads with the --iontorrent option (ionhammer). But now I have questions regarding ionhammer error correction: does it pay any attention to fastq quality scores?

Here is an original paired-end sff read (converted to fastq; note the 'sanger style' quality scores, and lower case for low-quality bases). I have underlined the part that constitutes the 'post linker' read.

sff to fastq

Code:

@GIDY76W02G4JWL
tcagTTATTGATCAGTATTAGAATGAGGCCTATTAATAGCCAATTATCACATTTTGGATCTATTTTGTATCGATGATATCATTTATCGATAATCATCATAGTTATTTCGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA[U]TTATTGCTATAAATAAACGTACTTCTGGAGTAGAATTGAAGTGAGATAGAATTTCTGGTTTTAAGctgagactgccaaggcacacaggggatagg[/U]n
+
III;;;;BIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII;:8599>>@:9////92EBEDDDGIIIIIIFEC?:??IIHHHEIIIIIIIIIICCECC:??C==?EEEIEGHHIIIIIIIIGHFHGIIIIC?==CIIIIEEAAAEE>8333C444IIICIIIIGGGGGIIIGGGGIIIIIIIIIIIIA>999=499----./25:===;=@A>>::::EEIIII@@BAGGGII!

Here is the post-linker read, after sffToCA (a tool from Celera Assembler) has removed the linker from the original sff read and split it into two reads. Parameters were set to perform NO quality trimming, since I expected ionhammer to do that, so all the 'low quality' bases remain at the end of the read but are converted to upper case. Fastq scores remain the same:

sffToCA

Code:

@GIDY76W02G4JWLb clr=0,95 clv=1,0 max=1,0 tnt=1,0 rnd=t
TTATTGCTATAAATAAACGTACTTCTGGAGTAGAATTGAAGTGAGATAGAATTTCTGGTTTTAAGCTGAGACTGCCAAGGCACACAGGGGATAGG
+
IEEAAAEE>8333C444IIICIIIIGGGGGIIIGGGGIIIIIIIIIIIIA>999=499----./25:===;=@A>>::::EEIIII@@BAGGGII
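To see what the quality line actually encodes, the characters can be decoded by hand. The Python sketch below assumes the 'sanger style' (Phred+33) encoding mentioned above; `phred33_scores` and `low_quality_positions` are hypothetical helper names, and the threshold of 20 is an arbitrary value chosen for illustration, not something IonHammer is known to use.

```python
def phred33_scores(quality):
    """Decode a Sanger-style (Phred+33) FASTQ quality string into integers."""
    return [ord(symbol) - 33 for symbol in quality]


def low_quality_positions(quality, threshold=20):
    """Return 0-based positions whose Phred score is below the threshold."""
    return [i for i, q in enumerate(phred33_scores(quality)) if q < threshold]


# Short excerpt taken from the middle of the quality line above.
excerpt = "A>999=499----./25:"
print(phred33_scores(excerpt))
print(low_quality_positions(excerpt))
```

Running the same functions over the whole quality line shows where the low scores sit relative to the bases that were lower case in the original sff read.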

here was my spades command

Code:

spades.py --only-error-correction --iontorrent --dataset 454_4.yaml -t 8 --sc -k 21,33,55  --disable-gzip-output -o sff2ca_spades_corrected

and here is the output of ionhammer for the above read

Code:

>GIDY76W02G4JWLb
TTATTGCTATAAATAAACGTACTTCTGGAGTAGAATTGAAGTGAGATAGAATTTCT[U]G[/U]TTTTAAGCTGAGACTGCCAAGGCACACAGGGGATAGG

The only difference is the removal of a single G base (at the underlined position) in the middle of the read (one G of a two-base GG run, not a long homopolymer). All of the low-quality (originally lower case) bases remain.
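The comparison can be double-checked mechanically rather than by eye. This is a small Python sketch using the standard library's `difflib`; `edit_operations` is a hypothetical helper, the two strings are the sffToCA and ionhammer reads copied from the blocks above (with the underline markup removed), and `SequenceMatcher` is not guaranteed to report a minimal edit script in general, although for a single-base change like this one it is adequate.

```python
from difflib import SequenceMatcher


def edit_operations(before, after):
    """List the non-matching opcodes between two reads.

    Each entry is (operation, 0-based start in `before`, removed, added).
    """
    matcher = SequenceMatcher(None, before, after, autojunk=False)
    return [
        (tag, i1, before[i1:i2], after[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Reads copied from the sffToCA and ionhammer output shown above.
before = "TTATTGCTATAAATAAACGTACTTCTGGAGTAGAATTGAAGTGAGATAGAATTTCTGGTTTTAAGCTGAGACTGCCAAGGCACACAGGGGATAGG"
after = "TTATTGCTATAAATAAACGTACTTCTGGAGTAGAATTGAAGTGAGATAGAATTTCTGTTTTAAGCTGAGACTGCCAAGGCACACAGGGGATAGG"

print(len(before), len(after))
print(edit_operations(before, after))
```

For these two reads the lengths are 95 and 94, and the only operation reported is the deletion of one G.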

So, I'm not clear on what ionhammer should be doing; it appears I need to quality-trim my 454 reads *before* running them through ionhammer...*OR* I need to preserve the lower-case base formatting in the input file?

**akorobeynikov** · 12-08-2014, 11:45 AM

Originally posted by ssully

So, I'm not clear on what ionhammer should be doing; it appears I need to quality-trim my 454 reads *before* running them through ionhammer...*OR* I need to preserve the lower-case base formatting in the input file?

This is more or less expected. IonHammer is conservative: when it cannot correct something, it preserves the original read and leaves the final decision to the assembler. In general, we suggest not trimming reads when the coverage is low.
