Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • SRA fastq-dump puzzle

    Hopefully someone can help me out here as I've been scratching my head now for a couple of days to try and solve the puzzle.

    I've downloaded the SRA file for SRR081275, and have tried various different options to generate $file_1.fastq and $file_2.fastq files for mapping. I'll present two examples here:

    1) ${SRATK}/fastq-dump --split-3 -O ${DATA}/$i ${DATA}/SRR081275.sra
    2) ${SRATK}/fastq-dump -I --split-3 -O ${DATA}/$i ${DATA}/SRR081275.sra

    And an example of a read from $file_1.fastq in each case:

    1)

    @SRR081275.1 HWI-EA332_307V4AAXX:3:1:1008:1695 length=36
    GTGGGTGTGCATTGTCCTTACCTGCTGGCAGACAAT
    +SRR081275.1 HWI-EA332_307V4AAXX:3:1:1008:1695 length=36
    IIIIIIIIIIIIIIII?II;D;A-55%5)&)#'(,9

    2)

    @SRR081275.1.1 HWI-EA332_307V4AAXX:3:1:1008:1695 length=36
    GTGGGTGTGCATTGTCCTTACCTGCTGGCAGACAAT
    +SRR081275.1.1 HWI-EA332_307V4AAXX:3:1:1008:1695 length=36
    IIIIIIIIIIIIIIII?II;D;A-55%5)&)#'(,9


    In each case the reads are mapped using BWA
    Code:
    bwa aln -t 4 -e 5 -d 5 -q 20 -f ${DATA}/$i/"$i"_1.sai ${REF} ${DATA}/$i/"$i"_1.fastq.gz 
    bwa aln -t 4 -e 5 -d 5 -q 20 -f ${DATA}/$i/"$i"_2.sai ${REF} ${DATA}/$i/"$i"_2.fastq.gz 
    bwa sampe -f ${DATA}/$i/$i.sam ${REF} \
      -r @RG"\t"ID:$i"\t"SM:$i"\t"LB:$i"\t"PL:Illumina \
      ${DATA}/$i/"$i"_1.sai ${DATA}/$i/"$i"_2.sai \
      ${DATA}/$i/"$i"_1.fastq.gz ${DATA}/$i/"$i"_2.fastq.gz
    converted to BAM
    Code:
    samtools view -bhT ${REF} ${DATA}/$i/$i.sam > ${DATA}/$i/$i.bam
    cleaned
    Code:
    java -d64 -Xmx4g -jar ${PICARD}/CleanSam.jar \
      INPUT=${DATA}/$i/$i.bam \
      OUTPUT=${DATA}/$i/$i.clean.bam \
      VALIDATION_STRINGENCY=SILENT
    mates fixed
    Code:
    java -d64 -Xmx4g -jar ${PICARD}/FixMateInformation.jar \
      INPUT=${DATA}/$i/$i.clean.bam \
      SORT_ORDER=coordinate \
      VALIDATION_STRINGENCY=SILENT
    duplicates marked
    Code:
    java -d64 -Xmx4g -jar ${PICARD}/MarkDuplicates.jar \
      INPUT=${DATA}/$i/$i.clean.bam \
      OUTPUT=${DATA}/$i/$i.md.bam \
      ASSUME_SORTED=true \
      METRICS_FILE=${DATA}/$i/$i.md.metrics \
      VALIDATION_STRINGENCY=SILENT
    and finally sent to GASV
    Code:
    cd ${GASV}
    perl ${GASV}/BAM_preprocessor.pl \
      ${DATA}/$i/$i.md.bam \
      -CHROMOSOME_NAMING ${CHRS} \
      -OUTPUT_PREFIX ${DATA}/$i/GASV/gasv
    .

    In example (1) Picard's MarkDuplicates returns 1371073/2000000 duplicate reads and 0 optical duplicates. Whilst in example (2) it returns 15523/2000000 duplicate reads and 0 optical duplicates.

    When sent to GASV, example (1) can have its insert sizes calculated from Lmin and Lmax, and causes no problems. Whilst for example (2) GASV fails to calculate the insert sizes and as a result fails to identify any discordant reads.

    It seems I'm failing to generate a BAM file that will work in both Picard's MarkDuplicates AND GASV, and it appears to be due to the read definition line. I've tried to address this by going through all of the fastq-dump options, and attempting to alter the default READ_NAME_REGEX in Picard's MarkDuplicates, but don't appear to be getting anywhere.

    Is anyone able to help?


  • #2
    I forgot to mention that after running the fast-dump command I used a subset of the fastq file to make checking the different command options quicker
    Code:
    cat ${DATA}/$i/SRR081275_1.fastq | sed -n 1,4000000p > ${DATA}/$i/"$i"_1.fastq
    cat ${DATA}/$i/SRR081275_2.fastq | sed -n 1,4000000p > ${DATA}/$i/"$i"_2.fastq
    head ${DATA}/$i/"$i"_1.fastq
    head ${DATA}/$i/"$i"_2.fastq
    bgzip -f "$i"_1.fastq
    bgzip -f "$i"_2.fastq
    Also, I have NGS data from Illumain GAIIx 1.8 format which works fine in both Picard MarkDuplicates and GASV, the fastq read format for which looks like this:

    @HWUSI-EAS1643R:30:FC:1:1:3636:1000 1:N:0:CGATGT
    NTTCCTACTCCGACATGCTAAAGATCCATGAAACAGAGTGTTTGCCAACAAGTCCAGATTTTTACCAGGCTCATCTCTTCAGTTTCAAAGAATCAGTTTC
    +
    #***,/223/@:@@@@@@@@:<<<<@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@:@@@@@@22.2577577@@@@@@@@@@@228@@@@@@

    Comment


    • #3
      Did you check the bam file header. What's the read name? I encounter a warning for READ_NAME_REGEX unalbe to detect "read name" during MarkDuplicate with picardtools. And the output shows 0 optical duplicate.

      Comment

      Latest Articles

      Collapse

      • seqadmin
        Essential Discoveries and Tools in Epitranscriptomics
        by seqadmin




        The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
        04-22-2024, 07:01 AM
      • seqadmin
        Current Approaches to Protein Sequencing
        by seqadmin


        Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
        04-04-2024, 04:25 PM

      ad_right_rmr

      Collapse

      News

      Collapse

      Topics Statistics Last Post
      Started by seqadmin, Today, 08:47 AM
      0 responses
      12 views
      0 likes
      Last Post seqadmin  
      Started by seqadmin, 04-11-2024, 12:08 PM
      0 responses
      60 views
      0 likes
      Last Post seqadmin  
      Started by seqadmin, 04-10-2024, 10:19 PM
      0 responses
      59 views
      0 likes
      Last Post seqadmin  
      Started by seqadmin, 04-10-2024, 09:21 AM
      0 responses
      54 views
      0 likes
      Last Post seqadmin  
      Working...
      X