-
Originally posted by themerlin:
The MIRA assembler (sequence assembly and mapping for whole genome shotgun and EST / RNASeq sequencing data; it can use Sanger, 454, Illumina and IonTorrent data) comes with a script called fastqselect.tcl. The script takes a file of sequence names and pulls out the matching sequences. I have used Biopython scripts to do the same thing, but fastqselect is much faster.
J
Fastq file:
Code:
@seq1
CGAGATTGGTTGTCTCCTACTACCGAGTTGCCTCGAGAGCACCAGACCCGTCCGCCTCGCCTTCAGGGTTTTTTTCCGGGCAGCGATCCGAGGTCATCGACTGGGTGTTTAGCCCGGGCGACGCCCCGATCCCCCAATCTCG
+
@@@FFDFDHBF?FGBGHIGHGGII@F@:EBDFGGGGEGIIIIFEIIIIII;CHHGEEFFCCC@CCCCDDDDDDDDDDDDDBBB@>@BD@BBADCA@@CA:DDBDDCAA>B@>DFBAHBE@7@FEJGGHGDDFFAAFDFF@@@

Code:
Reading names
Copying sequence data
Last sequence name: seq1
Now reading line: +
The names don't match?!
    while executing
"error "Last sequence name: $seqname\nNow reading line: $line\nThe names don't match?!""
    (procedure "conditionalFASTQCopy" line 28)
    invoked from within
"conditionalFASTQCopy $fin $fout"
    (procedure "faqsel::processit" line 16)
    invoked from within
"faqsel::processit"
    (file "/bioware/mira/scripts/fastqselect.tcl" line 166)
-
Hmm... I just copied your FASTQ sequence into one of my FASTQ files, then pulled it out successfully with that script. In my tests, whether or not the name is repeated after the "+" doesn't appear to affect the script, so I'm not sure what the problem might be. Have you tested the script on a different FASTQ file?
-
Originally posted by greigite:
I tried this script, but it doesn't seem to play nice with FASTQ files that omit the sequence name after the "+" on the line before the quality scores (repeating the name there is optional):
...
Any thoughts or a fix for this?
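For anyone who just needs a workaround, here is a minimal Python sketch of the same idea: copy the records whose names appear in a list, and accept the "+" line with or without the repeated name. This is not the fastqselect.tcl logic; the function names `read_fastq` and `select_fastq` are made up for illustration, and the sketch assumes every record is exactly 4 lines.

```python
import io
from itertools import islice


def read_fastq(handle):
    """Yield (name, header, seq, plus, qual), assuming exactly 4 lines per record."""
    while True:
        lines = [line.rstrip("\n") for line in islice(handle, 4)]
        if not any(lines):
            return  # end of file, or only blank lines left
        if (len(lines) < 4 or not lines[0].startswith("@")
                or not lines[2].startswith("+")):
            raise ValueError(f"Malformed FASTQ record near: {lines[0]!r}")
        header, seq, plus, qual = lines
        # The name is the header text up to the first whitespace, without the '@'.
        # The '+' line is only checked for its first character, so both "+" and
        # "+seq1" are accepted.
        name = header[1:].split(maxsplit=1)[0] if header[1:].strip() else ""
        yield name, header, seq, plus, qual


def select_fastq(in_handle, out_handle, wanted):
    """Copy records whose name is in `wanted`; return how many were written."""
    wanted = set(wanted)
    written = 0
    for name, header, seq, plus, qual in read_fastq(in_handle):
        if name in wanted:
            out_handle.write(f"{header}\n{seq}\n{plus}\n{qual}\n")
            written += 1
    return written


# Small in-memory demo: seq1 has a bare "+", seq2 repeats its name after "+".
demo = "@seq1\nACGT\n+\n@@@F\n@seq2 extra\nTTGA\n+seq2 extra\nIIII\n"
out = io.StringIO()
count = select_fastq(io.StringIO(demo), out, ["seq1"])
print(count, repr(out.getvalue()))
```

For real files you would pass open file handles instead of `io.StringIO` objects. This streams the whole file once, so it needs no index, but every query costs a full pass.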
-
I'm just chatting to Peter Rice from EMBOSS and v6.3 can do this with the dbxflat and seqret tools. The documentation should be updated to make this more obvious. In addition to FASTQ, this handles other major flat files too (there is a more specialised tool for FASTA files, dbxfasta, to handle all the different ID line conventions).
-
cdbfasta
One issue I've noticed with using cdbfasta -Q to index fastq files is that it uses the "@" character as a record delimiter. That works if the qual scores use phred+64 encoding but not if they use phred+33 (Sanger encoding) because the "@" character has a decimal value of 64. If you have quality score lines beginning with "@" the records are not correctly parsed.
Originally posted by SES:
You have 1.4 billion reads in one file? And you used this with BLAST?
(I don't know what you are trying to do but splitting the data, if possible, will speed up any procedure)
Anyway, these all sound like great solutions, but I would like to point out that cdbfasta has the -Q option to index fastq files and cdbyank can be used to pull the requested ID or IDs from a list. I have not used these other tools but I have tried BioPerl's Fastq indexing method and SeqIO module for pulling Fastq entries and it became clear to me that these were just not practical solutions for the size of modern NGS sequence files. cdbfasta will probably be the fastest solution for pulling reads, but like any indexing method, you have to create the index. I don't know what is best for your application but it looks like you have some options.
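To make the "you have to create the index" trade-off concrete, here is a rough Python sketch of the index-then-seek idea. It is not cdbfasta's index format or algorithm; `build_index` and `fetch` are hypothetical names, the index lives in memory, and the sketch assumes exactly 4 lines per record.

```python
import io


def build_index(handle):
    """Map read name -> byte offset of its record (binary handle, 4-line records)."""
    index = {}
    while True:
        offset = handle.tell()
        header = handle.readline()
        if not header.strip():
            break  # end of file or trailing blank line
        parts = header[1:].split()
        if not header.startswith(b"@") or not parts:
            raise ValueError(f"Expected a FASTQ header at byte {offset}")
        index[parts[0].decode("ascii")] = offset
        # Skip the sequence, '+' and quality lines by position, not by content.
        for _ in range(3):
            handle.readline()
    return index


def fetch(handle, index, name):
    """Seek straight to the record for `name` and return its 4 lines as text."""
    handle.seek(index[name])
    return b"".join(handle.readline() for _ in range(4)).decode("ascii")


# In-memory demo; for a real file use open(path, "rb").
data = b"@r1\nACGT\n+\nIIII\n@r2 desc\nGGCC\n+\n@@@@\n"
fh = io.BytesIO(data)
idx = build_index(fh)
print(idx)
print(fetch(fh, idx, "r2"))
```

Building the index costs one full pass over the file, but after that each lookup is a single seek, which is why an indexed approach pays off when you pull reads repeatedly from the same large file.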
-
Originally posted by greigite:
One issue I've noticed with using cdbfasta -Q to index fastq files is that it uses the "@" character as a record delimiter. That works if the qual scores use phred+64 encoding but not if they use phred+33 (Sanger encoding) because the "@" character has a decimal value of 64. If you have quality score lines beginning with "@" the records are not correctly parsed.
*Note that this is based on a rudimentary understanding of C; someone please correct me if I'm wrong.
-
Originally posted by greigite:
One issue I've noticed with using cdbfasta -Q to index fastq files is that it uses the "@" character as a record delimiter. That works if the qual scores use phred+64 encoding but not if they use phred+33 (Sanger encoding) because the "@" character has a decimal value of 64. If you have quality score lines beginning with "@" the records are not correctly parsed.
Originally posted by kmcarr:
Except I don't believe it depends solely on the record delimiter if you tell it the input file is FASTQ. It expects that the FASTQ file uses at least 4 lines per record, and it doesn't start checking for a new record delimiter until after it has read 4 lines. If your FASTQ file sticks to the usual convention of using exactly 4 lines per record (I know it's not required by the standard), then you should be OK even if a quality line starts with '@'.
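The difference between the two parsing strategies discussed here can be shown with a small Python sketch. It only illustrates the behaviour as described in this thread and is not taken from the cdbfasta source; `naive_count`, `four_line_count` and `phred` are hypothetical helper names.

```python
def naive_count(text):
    """Count records by treating every line that starts with '@' as a header."""
    return sum(1 for line in text.splitlines() if line.startswith("@"))


def four_line_count(text):
    """Count records by position: lines 1, 5, 9, ... must be headers."""
    lines = text.splitlines()
    if len(lines) % 4 != 0:
        raise ValueError("Input is not a whole number of 4-line records")
    for i in range(0, len(lines), 4):
        if not lines[i].startswith("@") or not lines[i + 2].startswith("+"):
            raise ValueError(f"Malformed record starting at line {i + 1}")
    return len(lines) // 4


def phred(char, offset):
    """Quality score encoded by one character under the given ASCII offset."""
    return ord(char) - offset


# Two records; the first quality line starts with '@', as in the example above.
records = "@seq1\nACGT\n+\n@@@F\n@seq2\nTTGA\n+\nIIII\n"

print(naive_count(records))      # 3: the quality line "@@@F" is miscounted
print(four_line_count(records))  # 2: headers are found by position
print(phred("@", 33))            # 31: an ordinary score in phred+33
print(phred("@", 64))            # 0: the lowest score in phred+64
```

Under phred+33, "@" encodes Q31, so quality lines starting with "@" are common, whereas under phred+64 it encodes Q0 and rarely starts a line. Counting by position sidesteps the ambiguity, but only for files that keep to exactly 4 lines per record.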