It might help us if you demonstrated that the file is indeed not empty. How about an 'ls -l' on the file, or an 'od -c yourfile.sff | head --lines 4'? Or show the actual command you sent to SeqIO.convert, so that we can be sure that you did send your file to it.
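If you would rather run the same sanity check from Python, here is a minimal sketch. It relies only on the fact that an SFF file starts with the four magic bytes ".sff"; the helper name looks_like_sff is made up for illustration and is not part of Biopython.

```python
import os

SFF_MAGIC = b".sff"  # first four bytes of an SFF file (0x2E736666)


def looks_like_sff(path):
    """Return (size_in_bytes, has_sff_magic) for the file at `path`."""
    size = os.path.getsize(path)
    with open(path, "rb") as handle:
        magic = handle.read(4)
    return size, magic == SFF_MAGIC
```

A result of (0, False) would mean the file really is empty; a non-zero size with False suggests the file is not an SFF file at all.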
-
Hi,
I was actually able to get it to run today. Not sure what the problem was yesterday. But I got some funny results anyhow: some of the nt's are uppercase and some are lowercase. This caused problems for some of the Galaxy fastx tools that summarize quality data.
Any thoughts?
@HH42GP401CAJLD
gactagactcgacgtGTACTCAGGCTCGCACCGTGGCATGTCGCACTGTACTCAAGGCTCGCACCGTGGCATGTCGCACTGTACTTAAGGCTCACACCGTGGCATGTCGCACTGTACTCAAGGCACACAGGGGntaggnn
+
IIIIIIIIIIIIIIIIIIIGD666IIIIIIIIGDDDIIIIIIIIIIIIIIIGB;;;;IIIGGGGGCC>>>CIHID@@@C==:99==GGIIIIHIIIIIIIGGGCCCHIDDDC@777@C>1111AA@>;84445!;:44!!
@HH42GP401B4BC5
gactagactcgacgtGCAGTAGCTGCAATGGCGCAGAAGGCGTGCTTCtctctcncacgcacacacgagagagagngnnn
+
FFFFFFFFFFFFFFFIIIIIIIIIFFFFDDAAAB?<4444<>>9422323663/!//5///59=///2222////!2!!!
Here is the code that I ran (117,221 is the right number of reads for this file):
>>> from Bio import SeqIO
>>> SeqIO.convert("454Reads.JA11255_155_RL13.sff", "sff", "untrimmed.fastq", "fastq")
117221
-
Originally posted by lplough81:
    I was actually able to get it to run today.. Not sure what the problem was yesterday. But i got some funny results anyhow. Some of the nt's are uppercase and some are lowercase.

You'll see the same from Roche's own tools. The lower case are the bits which would be trimmed off as adapters or low quality bases.

Originally posted by lplough81:
    This caused problems for some of the Galaxy fastx tools that summarize quality data.
    Any thoughts?

That could be an oversight in fastx - ask them about it.
Or, what you probably want to do is ask for the trimmed sequences (which will be all upper case):
Code:
SeqIO.convert("454Reads.JA11255_155_RL13.sff", "sff-trim", "trimmed.fastq", "fastq")
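If you already have the mixed-case FASTQ and no longer have the SFF file to hand, the same clipping can be approximated from the case alone. This is only a sketch, not how Biopython does it: it assumes the upper-case region is one contiguous block, and clip_lowercase_flanks is a hypothetical helper name.

```python
def clip_lowercase_flanks(seq, qual):
    """Drop the lower-case bases at both ends of a read, plus their qualities."""
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    # Positions of the bases that the trimming information kept (upper case).
    upper_positions = [i for i, base in enumerate(seq) if base.isupper()]
    if not upper_positions:
        return "", ""  # nothing survives trimming
    start, end = upper_positions[0], upper_positions[-1] + 1
    return seq[start:end], qual[start:end]
```

Applied to each record of the untrimmed FASTQ, this should leave only the upper-case middle of every read. Using "sff-trim" on the original SFF file is still the safer route.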
-
Originally posted by lplough81:
    Got it. Fairly new work for me, so I appreciate the patient replies. Can I specify the quality cutoff for trimming? Or what is the default that the biopython fastq trimmer uses?

There are two things to consider - getting rid of the adapter sequences and quality trimming. Roche does a good job of this as part of the base calling and production of the SFF file. When reading SFF files, Biopython (and other tools like sff_extract and Roche's own tools) will just apply the trimming information recorded in the SFF file. Using the Roche trimming is usually fine.
You may need to further trim off PCR primers or other library specific adapters if the Roche software wasn't told about them.
You may decide to further apply some quality cutoff trimming as well. This may be a good idea for some downstream analysis, not for others.
It is possible to do this kind of trimming in Biopython, but not in one line. There are some examples in the tutorial. I've written some SFF trimming tools using Biopython available within the Galaxy Tool Shed (if your institute runs its own Galaxy instance that may be interesting).
There are also other tools which will do it for you - especially if you want to work with the FASTQ file (or FASTA+QUAL) instead of the SFF file.
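To give an idea of what a quality cutoff looks like in practice, here is a rough sketch in plain Python that works on the sequence and quality strings of one FASTQ record. It assumes Sanger-style FASTQ (Phred scores encoded with an offset of 33). The cutoff of 20 is an arbitrary example value, not a Biopython or Roche default, and trim_3prime is a hypothetical helper name.

```python
def trim_3prime(seq, qual, min_quality=20, offset=33):
    """Remove trailing bases whose Phred score is below `min_quality`."""
    end = len(seq)
    # Walk back from the 3' end until a base meets the cutoff.
    while end > 0 and ord(qual[end - 1]) - offset < min_quality:
        end -= 1
    return seq[:end], qual[:end]
```

This only clips the 3' end and stops at the first base that passes, so a low-quality base in the middle of a read is left alone. Dedicated trimming tools usually do something smarter, such as a sliding window.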
-
how to trim FASTA name
Hi,
Is there a simple way to reduce the FASTA name, e.g. from
"> HH42GP401CAJLD length=118 xy=0823_0287 region=1 run=R_2012_01_27_13_59_03_ "
to ">HH42GP401CAJLD"?
Similar to trimming an SFF file to FASTA with Biopython's SeqIO.convert(), but taking a FASTA file as the input and then outputting another FASTA file?
Thanks,
Louis
-
Try something like this, untested:
Code:
from Bio import SeqIO

in_file = "example.fasta"
out_file = "new.fasta"
file_format = "fasta"

def remove_descr(record):
    # An empty description means only the record id is written to the header.
    record.description = ""
    return record

# This is a generator expression - not all in memory at once!
wanted = (remove_descr(r) for r in SeqIO.parse(in_file, file_format))
count = SeqIO.write(wanted, out_file, file_format)
print("Saved %i records" % count)
-
Originally posted by lplough81:
    Hi,
    Is there a simple way to reduce the fasta name (e.g /
    "> HH42GP401CAJLD length=118 xy=0823_0287 region=1 run=R_2012_01_27_13_59_03_ "
    to ">HH42GP401CAJLD"?

I don't think you need a script for that. If your file is "454reads.fas" then just do:
Code:
sed 's/\s.*//' 454reads.fas > 454reads_trimmedheader.fas
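If sed is not available (the \s shorthand is a GNU extension), or if your headers really do have a space right after the ">", a plain-Python version of the same idea might look like this. It is a sketch only; shorten_fasta_headers is a hypothetical helper name and the file names in the comments are just the ones from the sed example.

```python
def shorten_fasta_headers(lines):
    """Yield FASTA lines with each header cut down to its first word."""
    for line in lines:
        if line.startswith(">"):
            # Split on whitespace so "> name rest" and ">name rest" both work.
            fields = line[1:].split()
            yield ">" + (fields[0] if fields else "") + "\n"
        else:
            yield line  # sequence lines pass through unchanged


# Example use on files (lines are streamed, not loaded all at once):
# with open("454reads.fas") as src, open("454reads_trimmedheader.fas", "w") as dst:
#     dst.writelines(shorten_fasta_headers(src))
```

Unlike the sed one-liner, this only touches lines that start with ">", so sequence lines are never altered.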
-
Originally posted by kmcarr:
    Nice catch drio, thanks. One of those really subtle things you don't catch until you work with a different set of files.
    Eugeni, sorry I didn't get back to you on this; got really crushed at work. I have uploaded a modified version of the script incorporating drio's fix.

@kmcarr: I found your script very useful. I am currently an MSc Bioinformatics student working on an assignment which involves developing a web interface to a little mapping pipeline. This is purely for educational purposes. Would I be allowed to use your script to prepare the fastq file for the pipeline?
I would really appreciate it.