
  • #16
    On some OSes the first grep produces results separated by a record-separator line ("--"). The second grep removes those separators. (Some versions of grep have an option to suppress the separator, but grep implementations differ between OSes, so I added the second grep instead.)

    What happens on your machine if you just use the first part? Does it produce the record separators?
    Code:
    $ grep -A 3 "_1:N" your_file | more
    If grep is not producing clean Read 1/2 files then there may be some other formatting issue (wrapped lines, malformed/truncated fastq records etc) in your files.
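For anyone who wants to avoid the separator question altogether, here is a minimal sketch that selects whole records with awk instead of `grep -A 3`. It assumes strict 4-line FASTQ records (no wrapped lines), and `extract_reads` is just a made-up helper name. It only tests the header line, so a tag that happens to appear inside a quality string cannot trigger a match, and there is no second filter that could drop a quality line beginning with "--" (possible in Phred+33 data, where "-" is a valid quality character).

```shell
# Print every 4-line FASTQ record whose header line contains the given tag.
# Assumes strict 4-line records (no wrapped sequence or quality lines).
# extract_reads is a hypothetical helper name, not part of any tool.
extract_reads() {
    # $1 = literal tag to look for in the header, e.g. "_1:N"
    awk -v tag="$1" '
        NR % 4 == 1 { keep = (index($0, tag) > 0) }  # decide once per record
        keep                                          # print all lines of kept records
    '
}

# Usage (your_file as in the command above):
#   extract_reads "_1:N" < your_file > read1.fastq
```

If the files contain wrapped or truncated records, the 4-line framing breaks and this will give wrong results, so it is worth validating the input first.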
    Last edited by GenoMax; 04-23-2016, 04:42 AM.



    • #17
      I do get something like this:
      Code:
      @HWUSI-EAS101E:4:FC:2:1:1029:10648_1:N:0:/1
      CGACAGCTCCTCAACTGCCTCCATGTCATCACCCTGTACAACCGCATCAAG
      +
      HDH:HGHHHHHHHGHHGHHHHHHHHHHHHHHEDFHHHHHHHHHDGHGHGHH
      --
      @HWUSI-EAS101E:4:FC:2:1:1030:8559_1:N:0:/1
      TAGATTTCCACTAGCTCCCCTCAGACTAAAAGTTGTGCCCCAGTCCACTTC
      +
      IEIIIIIIIGIIIIIIIIIIIIIIHIHIEIGEGIDEIGIIHIIHIIHIHFH
      --
      I'm on OS X, if that's any help.



      • #18
        You can see that the first grep produces records separated by "--". The second grep statement will remove those.

        You could be more explicit about the second grep and see if that helps:

        Code:
        $ grep -A 3 "_1:N" your_file | grep -v "^\-\-"
        You can also run fastQValidator on the files produced to see if you can identify any formatting problems.

        Can you post a few lines of the new files that are produced post-grep?



        • #19
          I ran fastQValidator on the first of the raw files I got, and it said this:

          Code:
          ERROR on Line 281: Repeated Sequence Identifier: HWUSI-EAS101E:4:FC:2:1:1055:6024_1:N:0:/1 at Lines 277 and 281
          ERROR on Line 285: Repeated Sequence Identifier: HWUSI-EAS101E:4:FC:2:1:1055:6024_1:N:0:/1 at Lines 277 and 285
          Finished processing /Users/erikfasterius/local/data/dmd/fastq/4000_UNEW_rna_lane2.sorted.bam.exp2_1.fastq with 18411216 lines containing 4602804 sequences.
          There were a total of 1449556 errors.
          Returning: 1 : FASTQ_INVALID
          There were more lines like the ones at the top, 20 in total (it only prints the first 20 errors by default). I looked at one of the duplicated reads:

          Code:
          @HWUSI-EAS101E:4:FC:2:1:1030:8559_1:N:0:/1
          TAGATTTCCACTAGCTCCCCTCAGACTAAAAGTTGTGCCCCAGTCCACTTC
          +
          IEIIIIIIIGIIIIIIIIIIIIIIHIHIEIGEGIDEIGIIHIIHIIHIHFH
          @HWUSI-EAS101E:4:FC:2:1:1030:8559_1:N:0:/1
          TAGATTTCCACTAGCTCCCCTCAGACTAAAAGTTGTGCCCCAGTCCACTTC
          +
          IEIIIIIIIGIIIIIIIIIIIIIIHIHIEIGEGIDEIGIIHIIHIIHIHFH
          So... that's weird. How is it that they're repeated? They are, in fact, identical... Down to the last quality score. What does that even mean?
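To get a feel for how widespread the repeats are without reading through the validator output, something like the following could list every header that occurs more than once. This is only a sketch that assumes strict 4-line records, and `dup_ids` is a made-up name.

```shell
# Print each header line that occurs more than once (one copy per header).
# Assumes strict 4-line FASTQ records; dup_ids is a hypothetical helper name.
dup_ids() {
    # keep only header lines, then let sort/uniq find the repeated ones
    awk 'NR % 4 == 1' | sort | uniq -d
}

# Usage (file name is a placeholder):
#   dup_ids < reads_1.fastq | wc -l    # number of distinct repeated headers
```

This only looks at identifiers, so it does not tell you whether the sequence and quality lines of the repeated records are identical as well.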

          Your code change did help: I can now run repair.sh. It gives me an error for a single read, but otherwise it seems to run through the data fine:

          Code:
          Set INTERLEAVED to false
          Started output stream.
          java.lang.AssertionError: 
          Mismatch between length of bases and qualities for read 98576 (id=HWUSI-EAS101E:4:FC:2:5:1158:12254 2:N:0:).
          # qualities=40, # bases=51
          
          @HWUSI-EAS101E:4:FC:2:5:1159:6178 2:N:0:
          GCCCAGAGGTAACAGAACAGCTTCAGGTTATCGAAATAACAATGTTAAGGA
          
          	at stream.Read.validate(Read.java:103)
          	at stream.Read.<init>(Read.java:78)
          	at stream.Read.<init>(Read.java:61)
          	at stream.FASTQ.quadToRead(FASTQ.java:806)
          	at stream.FASTQ.toReadList(FASTQ.java:653)
          	at stream.FastqReadInputStream.fillBuffer(FastqReadInputStream.java:111)
          	at stream.FastqReadInputStream.nextList(FastqReadInputStream.java:96)
          	at stream.ConcurrentGenericReadInputStream$ReadThread.readLists(ConcurrentGenericReadInputStream.java:656)
          	at stream.ConcurrentGenericReadInputStream$ReadThread.run(ConcurrentGenericReadInputStream.java:635)
          
          Set cris2Active=false
          
          Input:                  	1578418 reads 		80499318 bases.
          Result:                 	1578418 reads (100.00%) 	80499318 bases (100.00%)
          Pairs:                  	705382 reads (44.69%) 	35974482 bases (44.69%)
          Singletons:             	873036 reads (55.31%) 	44524836 bases (55.31%)
          
          Time:   			4.740 seconds.
          Reads Processed:       1578k 	333.02k reads/sec
          Bases Processed:      80499k 	16.98m bases/sec
          However, fastQValidator still gives the same duplication errors.
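The AssertionError above is about a record with 51 bases but only 40 qualities. A rough way to find such records before handing the file to repair.sh might be the sketch below. It assumes strict 4-line records, and `check_lengths` is a made-up name.

```shell
# Report records whose sequence and quality strings differ in length.
# Assumes strict 4-line FASTQ records; check_lengths is a hypothetical name.
check_lengths() {
    awk '
        NR % 4 == 1 { id = $0 }               # remember the header
        NR % 4 == 2 { nbases = length($0) }   # length of the sequence line
        NR % 4 == 0 && length($0) != nbases {
            printf "record %d (%s): bases=%d qualities=%d\n", NR / 4, id, nbases, length($0)
        }
    '
}

# Usage (file name is a placeholder):
#   check_lengths < read1_cleaned.fastq | head
```

If a line is missing somewhere, every record after that point is out of frame and may be reported too, so the first record reported is the one to look at.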

          I also thought that, if the reads really are mixed between the two files, I'd need to run your code twice to gather the "_1" and "_2" reads from both files and collect each set into one file. I did so, removed the "_" and the trailing "/1" or "/2" as previously, ran repair.sh, and got this:

          Code:
          Input:                  	3060236 reads 		156072036 bases.
          Result:                 	3060236 reads (100.00%) 	156072036 bases (100.00%)
          Pairs:                  	3010208 reads (98.37%) 	153520608 bases (98.37%)
          Singletons:             	50028 reads (1.63%) 	2551428 bases (1.63%)
          
          Time:   			8.854 seconds.
          Reads Processed:       3060k 	345.64k reads/sec
          Bases Processed:        156m 	17.63m bases/sec
          I still get the error for that one read (same as above), and I still get duplication errors from fastQValidator, but it looks better (if I'm understanding this correctly). I'm going to try to align these and see what happens.
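For reference, the header clean-up could look roughly like the sketch below. The exact commands from earlier in the thread are not repeated here, so this is an assumed equivalent: it turns headers like `@..._1:N:0:/1` into `@... 1:N:0:`, which matches the ids in the repair.sh output above. `normalize_headers` is a made-up name.

```shell
# Rewrite header lines only: first "_" becomes a space, trailing /1 or /2 is dropped.
# Assumes strict 4-line FASTQ records and a single "_" per header, as in the
# reads shown in this thread; normalize_headers is a hypothetical helper name.
normalize_headers() {
    awk '
        NR % 4 == 1 {
            sub(/_/, " ")        # first underscore becomes a space
            sub(/\/[12]$/, "")   # drop a trailing /1 or /2
        }
        { print }                # sequence, "+" and quality lines pass through
    '
}

# Usage (file names are placeholders):
#   normalize_headers < read1_raw.fastq > read1_clean.fastq
```

Restricting the edits to header lines matters because "_" and "/" can also occur in quality strings, which must stay untouched.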

          [Edit]: Did the alignment; it didn't work. Kind of expected that, because of all the duplications...
          Last edited by ErikFas; 04-24-2016, 03:35 AM.



          • #20
            Since we are doing this as an academic exercise... you could run "dedupe.sh" on your cleaned files to remove the duplicates. At least the duplication seems to be complete all the way (FASTQ header, sequence and qualities). After you dedupe, take only a small subset of the R1/R2 files and check that the R1/R2 reads are different (that is, they are not identical). Then try alignments to see what happens. Don't bother with TopHat. Just use an NGS aligner (use BBMap if you don't have a favorite). Are there no alignments, or discordant alignments? Does the insert size match expectations?
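If you just want to see what happens when the exact repeats are gone, a quick sketch like the one below drops records that are identical in all four lines and keeps the first copy of each. It is not a replacement for dedupe.sh and makes no claim about how that tool works. It assumes strict 4-line records and keeps every unique record in memory, and `dedupe_exact` is a made-up name.

```shell
# Keep the first copy of each record; drop later records that are identical
# in header, sequence, "+" line and qualities.
# Assumes strict 4-line FASTQ records; dedupe_exact is a hypothetical name.
dedupe_exact() {
    awk '
        { rec = (NR % 4 == 1) ? $0 : rec "\n" $0 }    # build up the current record
        NR % 4 == 0 && !seen[rec]++ { print rec }     # print it the first time only
    '
}

# Usage (file names are placeholders):
#   dedupe_exact < read1_clean.fastq > read1_dedup.fastq
```

Run it on R1 and R2 separately and then re-pair the files, since the two files may not lose the same records.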



            • #21
              I've just spoken to my PI about this problem, and we've decided to stop mucking about with these weird FASTQ files. Instead, we'll wait for the actual raw data and do more fruitful things in the meantime. I kind of feel like we wouldn't really want to use files that needed this many "corrections" (or whatever you want to call them) anyway...



              • #22
                Agreed. That would be the best solution in this case.
