Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • yifangt
    replied
    This is a cool script. Thanks gpcr! Yifang

    Leave a comment:


  • gpcr
    replied
    Code:
    awk '{print ; getline } {print substr($0, 11, 76) ; getline; print ; getline ; print substr($0, 11, 76) }' input.fastq

    Leave a comment:


  • yifangt
    replied
    Thanks a lot kmcarr! I was searching for it, just missed the inside. I will give it a try.

    Best!
    Yifang

    Leave a comment:


  • kmcarr
    replied
    Originally posted by yifangt View Post
    My question is related, but not quite the same, which is: I need to get a sub-string of the sequence and the corresponding quality score for each of the entries, the file format untouched. My Illumina reads consists of 101bp long. I want remove the first 10bp and last 25bp then only the middle 66bp left.

    The reason I need do this is my DNA sequence is methylated and the first 10 and last 25bp seem not having good quality in general so that I want get rid of them. My challenge is to remove both quality score and sequence correspondingly.
    Code:
    use Bio::SeqIO; use Bio::Seq::Quality;
    
    $seqio = Bio::SeqIO->new('-format'=>'fastq' , '-file'=>'some.fasq');
    
    my $out_fastq = Bio::SeqIO->new( -format => 'fastq', '-file'=> 'subset.fastq');
    
    while((my $line = $seqio->next_seq() ) { 
    # keep the id 
    # substring the sequence (from 11 to 66) 
    # substring the quality score (from 11 to 66) $out_fastq->write_seq($line); }
    Or any tools there to do my job?

    Thanks a lot!

    Yifang
    Yifang,

    First rule of coding, never write your own when a tool already exist to do what you want. The FASTX-Toolkit has a utility to do exactly what you want, namely the fastx_trimmer. You give an input file (fasta or fastq), the postion of the first and last base you wish to keep (in your case 11 and 76) and it will produce a trimmed (bases and quality) file.

    Leave a comment:


  • yifangt
    replied
    My question is related, but not quite the same, which is: I need to get a sub-string of the sequence and the corresponding quality score for each of the entries, the file format untouched. My Illumina reads consists of 101bp long. I want remove the first 10bp and last 25bp then only the middle 66bp left.

    The reason I need do this is my DNA sequence is methylated and the first 10 and last 25bp seem not having good quality in general so that I want get rid of them. My challenge is to remove both quality score and sequence correspondingly.
    Code:
    use Bio::SeqIO; use Bio::Seq::Quality;
    
    $seqio = Bio::SeqIO->new('-format'=>'fastq' , '-file'=>'some.fasq');
    
    my $out_fastq = Bio::SeqIO->new( -format => 'fastq', '-file'=> 'subset.fastq');
    
    while((my $line = $seqio->next_seq() ) { 
    # keep the id 
    # substring the sequence (from 11 to 66) 
    # substring the quality score (from 11 to 66) $out_fastq->write_seq($line); }
    Or any tools there to do my job?

    Thanks a lot!

    Yifang
    Last edited by yifangt; 08-13-2011, 04:44 AM.

    Leave a comment:


  • gpcr
    replied
    Originally posted by zlu View Post
    Perhaps I didn't explain it clearly. I wasn't trying to convert the Illumina short reads but mapped consensus sequence of a few chromosomes. Each is a few Mbp long, e.g:

    @chr1
    nnnnnnnagatagaaataCACGATGCGAGCAATCAAATTTCATAACATCACCATGAGTTT
    GGTCCGAAGCATGAGTGTTTACAATGTTTGAAtaCCTTATACAGTTCTTATACATACTTT
    ATAAATTATTTCCCaagctgttttgatacactcactaacagaTATTCTATAGAAGGAAAA
    GTTATCCACTTATGCACATTTATAGTTTTCAGAATTGTGGATAATTAGAAATTACACACA
    AAGTTATACTATTTTTAGCAACATATTCACAGGTATTTGACATATAGAGAACTGAAAAAG
    TATAATTGTGTGGATAAGTCGTCCAACTCATGATTTTATAAGGATTTATTTATTGATATT
    TACATAAAAATACTGTGCATAACTAATAAGCAGGATAAAGTTATCCACCGATTGTTATTA
    +
    !!!!!!!????????BBBEEEHKKKKKKKKKKKKNNNNNNQQNNNNNNQWWWW7WZWWWZ
    ZZZ]]]`````````>]]]]]]]ZTQQQTQNNKKKKKKHKKHHHEEEEEEEHHHHHHHHH
    HKKHHHHHHEEEEEBBBBBBBBBBBB?=B@BBBBBB??BBBBEEEEEEEEEEHKNNNNNQ
    QQTTTTWWWWWWWTTWZWWW]`>```c``]`ccc``cfcfliiloouSxxuuuuTollLo
    olliifilloolfif````]]WWWWWTTTTTTTTQQQQQQQNKKHHHHEEEEHEEEHHEE
    EEEEEEEHHHHHHHHHEEEEEEHHHHHEEHKHHHHHHHHHHHEEHHHHHHHHKNNNNKNN
    NQQ@NKNNNQQQTWWWZZZBZZZZ]<`]ZZZZZZZZLZZ``]]]ZZZWTTTQQQNNNNNK

    And using the method described above, this is what I got:

    >chr1nnnnnnnagatagaaataCACGATGCGAGCAATCAAATTTCATAACATCACCATGAGTTT
    >`]]``fGff`````]iifc`````cciiLoxuuuxZ{{xxruuxx{{{~~rrrrrrrrr
    Ooiiff`ZZZWTTTTTTQQNTQTWWWWWW]]]`]]cfifffciiiiiorrxx{{~x~{{x
    >QNNQQQTTQQQQNNNNQZ``cfffolllloollruuuurrroruxxxxx{~~xuroorr
    rrorruxxxroooux{{{xxuuuuuruurrrrollccfcc]ZZB]?]]W7QQTTWWWWWW
    >QTT7TQQTTTWZZZZ]]]]]]ZZZZ]]]]]]`````]]]]]`]`c`]```Z]``fflll
    louxxxxuS{{{{{~~~~~~{{~{{{{xuuroiilroiiollorlloorZ{~~{xxxuuu
    >KKKKKKKHHEB????BBBEEKHHHKKKKNNNNQQQQTZZZZZZ]]````````]]WWW<
    WWWW7TTTTTTTQKKKKKKHHEEEE<BBEHHH=HEEEEEHHHHEEEEHHHHHHKKKKKKK
    >QQQBQQQNNNNKKKKKKQQQQQQTTTTWWZWTTQQQQQQQQTTWZZZZZ]ZZTWWTTTQ
    QQTQQNNNTWWWZZ]]]ZZWWWWWWTTTTQQQEQQQQNNNNNNHHHHEEBEEEEEBBBBB
    >QQQATTTTTTTWWTTTQQQQQQQNNKKKQQQTTTT7TTQQ5QNNNQKNNQQQQQQQQQQ
    QQN4HKKHHHHHHHHHHHHHKHHEEBBBBBBBBBBBBBBB7??????????????!!!!!
    >LAcffffooor{{Z~~~~~~~~~~|~~~q~~~~~v~~~~~~~~~~~~~~~~~~~~~~~~
    ~~~~~~~~~~~~~~c~~xuuric```]]ZZWTNKHEEB????!!!!!!!!!!!!!!!!!!
    >QQQ5>QQ@QNNNNNNQQQQNNNNNNKHEBBBB???BBBBBBBBBEEEEEEBBBBBBBBB
    BBBBBBEHKNNKKKKKKNNNKKKKKKKKKKKW]]]cccfffc`]Z`>cfiOollllllLl
    >NQQTTTWQQQQQQTTNNNQQQT7GTTAWWWTTTZZW7TQTTQQQQQQTQQTTTQTTTTT
    WWZWZ]]]]WZ]``D]]]D```D]]]Z]]]ZZWWWT7@QNKKKKKHEBBBBHKHHHNQTT
    awk '/@chr1$/,/^+$/' consensus.fastq | perl -pe "s/@/>/ ; s/\+//" > chr1.fasta

    here "chr1" is chromosome

    Leave a comment:


  • lh3
    replied
    The version in MAQ svn, which is never released, parses multi-line fastq and accepts gzipped input, as it uses the same parser as bwa. Nonetheless, as maq only works with reads no longer than 127bp, not supporting multi-line fastq is not a major problem. It is more important for bwa to parse multi-line fastq as it works with long reads.

    Leave a comment:


  • maubp
    replied
    Just because the consensus is that line wrapping in FASTQ is/was a bad idea, doesn't mean it can't be reliably parsed. It does make the code a little more complicated I admit - but as this thread shows, it would be useful to some if MAQ could understand line-wrapped FASTQ.

    Leave a comment:


  • kmcarr
    replied
    Originally posted by maubp View Post
    Originally posted by lh3 View Post
    Most of fastq parsers do not work with multi-line fastq files.
    Why doesn't MAQ understand multi-line fastq? Would you accept a patch to implement this?
    From the Wikipedia article on FASTQ (http://en.wikipedia.org/wiki/FASTQ_format):

    The original Sanger FASTQ files also allowed the sequence and quality strings to be wrapped (split over multiple lines), but this is generally discouraged as it can make parsing complicated due to the unfortunate choice of "@" and "+" as markers (these characters can also occur in the quality string).

    Leave a comment:


  • maubp
    replied
    Originally posted by lh3 View Post
    Most of fastq parsers do not work with multi-line fastq files. You'd better write your own script. After all, processing multi-line fastq is not that hard.
    Why doesn't MAQ understand multi-line fastq? Would you accept a patch to implement this?

    Leave a comment:


  • zlu
    replied
    Using the patched emboss, I can now convert the line wrapped fastq files. Thank you.

    Leave a comment:


  • zlu
    replied
    Originally posted by maubp View Post
    Is that plain un-patched EMBOSS 6.1.0? They are currently on patch 3, and this did include some FASTQ fixes.
    I'll try to patch it and see how it goes.

    Originally posted by maubp View Post
    How big is the file? If you zipped it up and emailed it to me I could try it here for you if you liked
    The compressed file is more than 1.3GB.

    Thanks for your help. I'll try to process the fastq with a script.

    Leave a comment:


  • lh3
    replied
    Most of fastq parsers do not work with multi-line fastq files. You'd better write your own script. After all, processing multi-line fastq is not that hard.

    Leave a comment:


  • maubp
    replied
    Originally posted by zlu View Post
    yes, the lines are indedd wrapped. Thanks for the reminder.
    That probably does explain MAQ's failure.
    Originally posted by zlu View Post
    However, trying seqret of EMBOSS 6.1,
    Is that plain un-patched EMBOSS 6.1.0? They are currently on patch 3, and this did include some FASTQ fixes. See:
    ftp://emboss.open-bio.org/pub/EMBOSS/fixes/README.fixes

    Originally posted by zlu View Post
    I got the following error:

    [SBSUser@pipeline Assembly2]$ seqret -sformat fastq-sanger
    Reads and writes (returns) sequences
    Input (gapped) sequence(s): CNS.fq
    Error: Unable to read sequence CNS.fq'
    That is odd. Note you can also try giving the filename as part of the command line, e.g.

    $ seqret -sformat fastq-sanger -sequence CNS.fq -osformat fasta -outseq CNS.fasta

    How big is the file? If you zipped it up and emailed it to me I could try it here for you if you liked.

    Leave a comment:


  • zlu
    replied
    yes, the lines are indedd wrapped. Thanks for the reminder.

    However, trying seqret of EMBOSS 6.1, I got the following error:

    [SBSUser@pipeline Assembly2]$ seqret -sformat fastq-sanger
    Reads and writes (returns) sequences
    Input (gapped) sequence(s): CNS.fq
    Error: Unable to read sequence CNS.fq'

    Leave a comment:

Latest Articles

Collapse

  • seqadmin
    An Introduction to the Technologies Transforming Precision Medicine
    by seqadmin


    In recent years, precision medicine has become a major focus for researchers and healthcare professionals. This approach offers personalized treatment and wellness plans by utilizing insights from each person's unique biology and lifestyle to deliver more effective care. Its advancement relies on innovative technologies that enable a deeper understanding of individual variability. In a joint documentary with our colleagues at Biocompare, we examined the foundational principles of precision...
    01-27-2025, 07:46 AM

ad_right_rmr

Collapse

News

Collapse

Topics Statistics Last Post
Started by seqadmin, Today, 09:30 AM
0 responses
14 views
0 likes
Last Post seqadmin  
Started by seqadmin, 02-05-2025, 10:34 AM
0 responses
22 views
0 likes
Last Post seqadmin  
Started by seqadmin, 02-03-2025, 09:07 AM
0 responses
25 views
0 likes
Last Post seqadmin  
Started by seqadmin, 01-31-2025, 08:31 AM
0 responses
33 views
0 likes
Last Post seqadmin  
Working...
X