
  • Problem with cmpfastq, can't process my .fastq /1 and /2 files

    Hi,

    I am having a problem using cmpfastq, even though I've been using it reliably for months.

    Normally, I can take my trimmed 1_1.fastq and 1_2.fastq, process them through cmpfastq, and get my .common.out and .unique.out files for downstream processing. However, a couple of data sets are really giving me trouble: cmpfastq prints an error message for every line of the .fastq files and fails to generate the appropriate output files.

    Here is a sample of the output data:

    BEGIN cmpfastq3 on TpruniS3_1.trimmed TpruniS3_2.trimmed at Wed Oct 10 15:00:08 EDT 2012
    Could not match the sequence ID from the name: @M00649:2:000000000-A1721:1:1101:17085:1532/2
    Could not match the sequence ID from the name: TACTCCTACTGCGCAGCAATATTATTCTTTCGTTAGAGCTAAAAGGCAGAGTGGGAATCGAACCCACTTCGTTAGATTTGCAATC
    Could not match the sequence ID from the name: +
    Could not match the sequence ID from the name: 555??BBDDDDDDBDCCFFFFEFI;BEFHIIHFHFHH@@GHHIFHHHFEFH8CD@@BFD@EFHCEEHECFFHIIFHHDFGHIIHH
    Could not match the sequence ID from the name: @M00649:2:000000000-A1721:1:1101:16787:1535/2
    Could not match the sequence ID from the name: TAGACGTTTAAGTGACACCGAAAGAAGAAAGAGCTTTGTAGATGCTTAGCGCGGTCTACGAGCCTGGCGGATCAGAAAGCGGAAG
    Could not match the sequence ID from the name: +
    Could not match the sequence ID from the name: 5<?????DDDDDBDBFFFFFFHDACFHFHHB=CFDGHHHEDGGFGFGGHIHHC>EDEHHHHHHHB@?DHHCHHFFHHD=F;A@EE
    Could not match the sequence ID from the name: @M00649:2:000000000-A1721:1:1101:14795:1537/2
    Could not match the sequence ID from the name: AACGGAGCGAAGGATTTTAGCTTCACGAATTTCCCAAACTTGGCGAGGTCCTGTGTCGATTCCCGGACTTCCTTGGTCTTTGCGCC
    Could not match the sequence ID from the name: +
    Could not match the sequence ID from the name: 5<????@DDDDBDDBFFFFFFIIIHIIHHEHIHIIIFHHH/AFFCH++?EE?EFGGHHFF-CA-5CEEAGH,CCDF@DBGDFFCEE


    Does anyone have an idea?

    Thanks for the help!

  • #2
    Neverending Illumina format changes

    I don't really know anything about 'cmpfastq' but I've had a look at the source code:


    From what I can tell, it expects the ID line to match this pattern /^@(.*)#.*/
    which means an @ followed by some chars, then a # followed by some chars.

    Your IDs do not fit this pattern, because you don't have the #xxxxx part.

    Illumina used to use #AGCTCG to denote barcodes in multiplex samples. These days it uses a different format, or doesn't print it at all.

    To make it work with your data, change it to /^@(.*)(#.*)?/ or /^@(.*)/

    Good luck.
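To see how the patterns from post #2 behave on the headers in the first post, here is a small sketch using Python's `re` module, which treats these particular patterns the same way Perl does. The old-style header with `#0/1` and the `SUFFIX_AWARE` pattern are illustrative assumptions, not taken from the cmpfastq source. Note that the relaxed patterns keep the trailing `/1` or `/2` inside the captured ID, so if a tool compares captured IDs verbatim (an assumption about cmpfastq, not verified here), mates would never look identical.

```python
import re

# Pattern quoted in post #2 as what cmpfastq expects.
ORIGINAL = re.compile(r"^@(.*)#.*")
# First relaxed pattern suggested in post #2.
RELAXED = re.compile(r"^@(.*)(#.*)?")
# Hypothetical variant (not from the thread): stop before "#..." and a trailing /1 or /2.
SUFFIX_AWARE = re.compile(r"^@(.*?)(?:#.*)?(?:/[12])?$")


def capture_id(pattern, header):
    """Return the captured read ID, or None if the header does not match."""
    match = pattern.match(header.strip())
    return match.group(1) if match else None


# Header copied from the error output in the first post.
miseq = "@M00649:2:000000000-A1721:1:1101:17085:1532/2"
# Illustrative older-style header with a '#' index part (assumed example).
old_style = "@HWUSI-EAS100R:6:73:941:1973#0/1"

for name, pattern in [("original", ORIGINAL), ("relaxed", RELAXED), ("suffix-aware", SUFFIX_AWARE)]:
    print(name, capture_id(pattern, miseq), capture_id(pattern, old_style))
```

The original pattern returns no match for the MiSeq header because it contains no `#`, which is consistent with the "Could not match the sequence ID" messages.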



    • #3
      Thank you very much for the reply. You have correctly identified the problem, and I can now resolve it to work with MiSeq reads. Thanks again for the insight!



      • #4
        Hello. I'm having the same problem. I tried changing the pattern to match my header, but it wrote all my reads to the unique files, whereas the common files remain empty. Please help.



        • #5
          What exactly are you trying to do? I have a program called "filterbyname" that can probably do it...



          • #6
            Pairing of fastq files(F/R)

            I'm trying to pair my fastq files after quality filtering and trimming them via FastQC. My files look like this:

            ==> mexD1B_filt_trim_1.fastq <==
            @MexD1BSRR1562087.10.1/1
            GAGCTAGATCAGCACCATATATTACACGATGATCAGCTGTAACATTTACCTGCATCTGGTTCTTCATTCCTATCCGACCATCCTTGG
            +SRR1562087.10.1/1
            JJJJJJIIJJJJJJJJIJJJJJJJJJJJJJJJIJJJJJJJGIIJJJJIJJJJJJJJJIJJJJDHIHHHHHHHFDFFDDDDDDDDD>C
            @MexD1BSRR1562087.11.1/1
            AGGTTGACTATGGTCCAGGCCATGCCAGGAGAGCAACCGAAAACAGAGAGAACGGTAAGCCAGGAGAAGAACAGTATGAGTATATAG
            +SRR1562087.11.1/1
            IJJGHIJIIIFIBHHGAFHGGIHJIJGJEGIGGGHGIJJJJHHGFEFEDACEEDDBDBCCCDDDDDDBDDDCDDCADDDCCCDDDDD
            @MexD1BSRR1562087.15.1/1
            TAACATCCACAATCTCCTTCTACCCAAGAAGTCTGGAACTTCAGCATCAAAGGCTGGTGATGACGACAACTAATCCATTTACTGAAT



            ==> mexD1B_filt_trim_2.fastq <==
            @MexD1BSRR1562087.7.2/2
            CCTGTAGATATACGTACTGCCAAAGGGTAGATAGTTGCCCATCTCAGAAAACACAACTTCAACAGCCAAGATTAATATCCATGTGAT
            +SRR1562087.7.2/2
            IJJJGGJBHIJJGHHHIIHJJGJGJIIDFHIJIJJJGHJJJJJJJIJGIGH@FHJIJIHIIIHHH=BDFFAEECCEEFDEDDCDCA>
            @MexD1BSRR1562087.9.2/2
            GTAATCCAAATAAGGTATACTCACTCATCGGAGGATTTTGTGCTTCCCCTGTGAATTTCCACGCTAAGGATGGCTCCGGCTATAAAT
            +SRR1562087.9.2/2
            JIJIIJJJGGIIJIBC@FH@HHJGIJGCHGIEGIFHDFHJIJIJIHHIIIIJGGHHHHHCDDFDDDBDDDDDDDCDBDDBD@CDCEE
            @MexD1BSRR1562087.11.2/2
            GAAACACTGATTGGTTCACGTATCCAGGTGTATGGACCACCTATATACTCATACTGTTCTTCTCCTGGCTTACCGTTCTCTCTGTTT
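One thing visible in the listing above: the two mates of a read differ in both the `.1`/`.2` part and the `/1`/`/2` part of the name, so any comparison on the full name will find nothing in common. A minimal sketch of the idea, assuming the `.1`/`.2` directly before the slash is a mate index as it appears to be in this data (`pair_key` and `split_common_unique` are hypothetical helper names, not part of cmpfastq or BBMap):

```python
import re

# Assumed mate suffix: an optional ".1"/".2" followed by "/1" or "/2" at the end of the name.
MATE_SUFFIX = re.compile(r"(?:\.[12])?/[12]$")


def pair_key(header):
    """Reduce a FASTQ header to a name that both mates of a pair share."""
    name = header.split()[0].lstrip("@")  # drop '@' and any description after whitespace
    return MATE_SUFFIX.sub("", name)


def split_common_unique(headers_1, headers_2):
    """Return (keys in both files, keys only in file 1, keys only in file 2)."""
    keys_1 = {pair_key(h) for h in headers_1}
    keys_2 = {pair_key(h) for h in headers_2}
    common = keys_1 & keys_2
    return common, keys_1 - common, keys_2 - common


# Headers copied from the two listings above.
forward = ["@MexD1BSRR1562087.10.1/1", "@MexD1BSRR1562087.11.1/1", "@MexD1BSRR1562087.15.1/1"]
reverse = ["@MexD1BSRR1562087.7.2/2", "@MexD1BSRR1562087.9.2/2", "@MexD1BSRR1562087.11.2/2"]

common, only_1, only_2 = split_common_unique(forward, reverse)
print("common:", sorted(common))
print("only in file 1:", sorted(only_1))
print("only in file 2:", sorted(only_2))
```

In the lines shown, only read 11 is present in both files; the others would end up as singletons.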



            • #7
              @safina: You should use a program called repair.sh that is part of the BBMap package. Brian has an example posted here: http://seqanswers.com/forums/showpos...0&postcount=45

              Your command would look something like this:
              Code:
              $ repair.sh in1=mexD1B_filt_trim_1.fastq in2=mexD1B_filt_trim_2.fastq out1=mexD1B_filt_trim_1_fixed.fq out2=mexD1B_filt_trim_2_fixed.fq outsingle=single.fq
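After re-pairing, it can be worth checking that the two output files really are in sync, record by record. The sketch below is one way to do that in Python; it is not part of BBMap, the helper names are hypothetical, and it assumes plain 4-line FASTQ records with the same `.1/1` and `.2/2` mate suffixes seen in post #6.

```python
import itertools
import re

# Assumed mate suffix: an optional ".1"/".2" followed by "/1" or "/2" at the end of the name.
MATE_SUFFIX = re.compile(r"(?:\.[12])?/[12]$")


def headers(lines):
    """Yield the header line of each 4-line FASTQ record, ignoring blank lines."""
    non_blank = (line.strip() for line in lines if line.strip())
    for index, line in enumerate(non_blank):
        if index % 4 == 0:
            yield line


def pair_key(header):
    """Reduce a FASTQ header to a name that both mates of a pair share."""
    return MATE_SUFFIX.sub("", header.split()[0].lstrip("@"))


def first_mismatch(lines_1, lines_2):
    """Return the 1-based index of the first out-of-sync record, or None if all pairs match."""
    paired = itertools.zip_longest(headers(lines_1), headers(lines_2))
    for number, (h1, h2) in enumerate(paired, start=1):
        if h1 is None or h2 is None or pair_key(h1) != pair_key(h2):
            return number
    return None


# In-memory demo; with real files, pass two open file handles instead of lists.
in_sync_1 = ["@MexD1BSRR1562087.11.1/1", "ACGT", "+", "IIII"]
in_sync_2 = ["@MexD1BSRR1562087.11.2/2", "TGCA", "+", "IIII"]
print(first_mismatch(in_sync_1, in_sync_2))
```

If the function returns a number, the files diverge at that record, and anything that assumes matched read order downstream should not be trusted on that pair of files.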
