
**sklages** · 05-11-2012, 09:03 AM

What have you done? What folders are you referring to? From which files are the read examples you showed? What is the problem? ;-)

Sven

**Avila** · 05-11-2012, 12:00 PM

Hello Sven, some time ago we sent samples out for transcriptome sequencing, and the provider sent back two SFF files (about 300,000 reads) in a folder called "one". Because nobody in the laboratory knew bioinformatics, the service also did the assembly. So far so good.

But now I'm training for a new transcriptome analysis, so I am checking everything the sequencing and assembly service delivered, and I found a folder called "two" with two SFF files (about 160,000 reads). Comparing the files in both folders, I found reads with the same name. The reads above are an example: the first comes from the files with 300,000 reads, while the second comes from the files with 160,000 reads.

My hypothesis is that they removed low-quality reads, reads contaminated with vector, etc., so that only 160,000 reads remained, and the people in charge of the work created a new folder ("two"). Given this hypothesis, my question is: why do reads with the same name (as in the example above) not have the same length, why are there gaps, and why are those positions not masked with an N?
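The comparison described above (finding reads with the same name in both folders and checking whether their lengths agree) could be sketched roughly as follows. This is only a minimal illustration: it assumes the reads have already been extracted from the SFF files into plain `{read_name: sequence}` dictionaries, and the function name `compare_read_sets` and the toy reads are hypothetical, not data from this thread.

```python
def compare_read_sets(reads_one, reads_two):
    """Compare two {read_name: sequence} dicts.

    Returns the sorted list of names present in both sets, and a dict
    mapping each shared name whose lengths differ to (len_one, len_two).
    """
    shared = sorted(set(reads_one) & set(reads_two))
    length_diffs = {
        name: (len(reads_one[name]), len(reads_two[name]))
        for name in shared
        if len(reads_one[name]) != len(reads_two[name])
    }
    return shared, length_diffs


# Hypothetical toy data standing in for folders "one" and "two".
folder_one = {"readA": "ACGGGTTTAC", "readB": "TTGACCA", "readC": "GGGATC"}
folder_two = {"readA": "ACGGTTAC", "readB": "TTGACCA"}

shared, length_diffs = compare_read_sets(folder_one, folder_two)
print(shared)        # names found in both folders
print(length_diffs)  # shared names whose sequence lengths disagree
```

If most shared names turn up in `length_diffs`, that would suggest the second folder was reprocessed rather than simply filtered, since pure filtering should leave the surviving reads unchanged.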

**sklages** · 05-11-2012, 12:15 PM

Hard to guess ... maybe one has been amplicon processed, the other shotgun?
If you have access to the SFF tools, sffinfo -m might help.

But the simplest and most effective solution is to just ask your sequencing provider about the data provided. I'd not rely on a "hypothesis" ;-)

Sven

**flxlex** · 05-13-2012, 11:54 PM

What strikes me is that the second sequence is different, in particular some of the homopolymers are one base shorter, and there is some more trimming at the 3' end. It might indeed be amplicon versus shotgun analysis/basecalling, but the homopolymer shortening still surprises me (I need to check our own amplicon runs to see if they show the same pattern).
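One way to check the homopolymer pattern described above is to collapse each read into its runs of identical bases and compare the run lengths. The sketch below is a simple illustration under stated assumptions: both sequences are plain strings for the same read, the helper names `homopolymer_runs` and `compare_runs` are made up for this example, and the comparison stops as soon as the order of bases diverges, so it only handles reads that differ in run length and in 3' trimming.

```python
from itertools import groupby


def homopolymer_runs(seq):
    """Run-length encode a sequence, e.g. 'AAGT' -> [('A', 2), ('G', 1), ('T', 1)]."""
    return [(base, len(list(group))) for base, group in groupby(seq.upper())]


def compare_runs(seq_a, seq_b):
    """List runs whose lengths differ between two versions of the same read.

    Each entry is (run_index, base, length_in_a, length_in_b). Runs are
    paired by position; pairing stops at the end of the shorter read or
    at the first run where the bases themselves differ.
    """
    diffs = []
    paired = zip(homopolymer_runs(seq_a), homopolymer_runs(seq_b))
    for index, ((base_a, len_a), (base_b, len_b)) in enumerate(paired):
        if base_a != base_b:
            break  # base order diverges; this naive pairing no longer applies
        if len_a != len_b:
            diffs.append((index, base_a, len_a, len_b))
    return diffs


# Hypothetical toy reads, not the sequences posted in this thread.
print(compare_runs("ACGGGTTTAC", "ACGGTTAC"))
```

Note that a read trimmed in the middle of its last run will show that run as a difference too, so the final entry should be interpreted with the 3' trimming in mind.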

**ajthomas** · 05-14-2012, 09:22 AM

Back when the amplicon processing pipeline first became available, I tried reprocessing amplicon data with the shotgun pipeline to get more data. I found that there was indeed more data, but no more useful data. The most noticeable differences were in homopolymer regions, where the shotgun-processed data tended to have more variation in length. The amplicon-processed data was usually correct in the homopolymers, whereas the shotgun-processed data would have a number of reads with an extra base or, less commonly, a missing base in homopolymers of about 4 or more.

Long story short, yes, the difference here could be amplicon vs. shotgun processing, as I have seen some homopolymer shortening using the amplicon pipeline. However, I never saw it as extensive as in this example, especially in 2-base homopolymers, so the extent of it here surprises me and makes me think it may be something else. I don't have any other ideas, though.

**Avila** · 05-15-2012, 10:55 AM

Apparently this is rare. I'll try to get an answer from the sequencing service.
I do not understand what you mean by "amplicon run".

Thanks

**flxlex** · 05-16-2012, 01:11 AM

Originally posted by Avila

I do not understand what you mean with amplicon run.

When you sequence PCR products (amplicons) on the GS FLX, you need to use another pipeline for basecalling than for shotgun runs. This has to do with the usually much stronger sequencing signals from PCR products.

454 output files
