Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • SRA fastq-dump puzzle

    Hopefully someone can help me out here as I've been scratching my head now for a couple of days to try and solve the puzzle.

    I've downloaded the SRA file for SRR081275, and have tried various different options to generate $file_1.fastq and $file_2.fastq files for mapping. I'll present two examples here:

    1) ${SRATK}/fastq-dump --split-3 -O ${DATA}/$i ${DATA}/SRR081275.sra
    2) ${SRATK}/fastq-dump -I --split-3 -O ${DATA}/$i ${DATA}/SRR081275.sra

    And an example of a read from $file_1.fastq in each case:

    1)

    @SRR081275.1 HWI-EA332_307V4AAXX:3:1:1008:1695 length=36
    GTGGGTGTGCATTGTCCTTACCTGCTGGCAGACAAT
    +SRR081275.1 HWI-EA332_307V4AAXX:3:1:1008:1695 length=36
    IIIIIIIIIIIIIIII?II;D;A-55%5)&)#'(,9

    2)

    @SRR081275.1.1 HWI-EA332_307V4AAXX:3:1:1008:1695 length=36
    GTGGGTGTGCATTGTCCTTACCTGCTGGCAGACAAT
    +SRR081275.1.1 HWI-EA332_307V4AAXX:3:1:1008:1695 length=36
    IIIIIIIIIIIIIIII?II;D;A-55%5)&)#'(,9


    In each case the reads are mapped using BWA
    Code:
    bwa aln -t 4 -e 5 -d 5 -q 20 -f ${DATA}/$i/"$i"_1.sai ${REF} ${DATA}/$i/"$i"_1.fastq.gz 
    bwa aln -t 4 -e 5 -d 5 -q 20 -f ${DATA}/$i/"$i"_2.sai ${REF} ${DATA}/$i/"$i"_2.fastq.gz 
    bwa sampe -f ${DATA}/$i/$i.sam ${REF} \
      -r @RG"\t"ID:$i"\t"SM:$i"\t"LB:$i"\t"PL:Illumina \
      ${DATA}/$i/"$i"_1.sai ${DATA}/$i/"$i"_2.sai \
      ${DATA}/$i/"$i"_1.fastq.gz ${DATA}/$i/"$i"_2.fastq.gz
    converted to BAM
    Code:
    samtools view -bhT ${REF} ${DATA}/$i/$i.sam > ${DATA}/$i/$i.bam
    cleaned
    Code:
    java -d64 -Xmx4g -jar ${PICARD}/CleanSam.jar \
      INPUT=${DATA}/$i/$i.bam \
      OUTPUT=${DATA}/$i/$i.clean.bam \
      VALIDATION_STRINGENCY=SILENT
    mates fixed
    Code:
    java -d64 -Xmx4g -jar ${PICARD}/FixMateInformation.jar \
      INPUT=${DATA}/$i/$i.clean.bam \
      SORT_ORDER=coordinate \
      VALIDATION_STRINGENCY=SILENT
    duplicates marked
    Code:
    java -d64 -Xmx4g -jar ${PICARD}/MarkDuplicates.jar \
      INPUT=${DATA}/$i/$i.clean.bam \
      OUTPUT=${DATA}/$i/$i.md.bam \
      ASSUME_SORTED=true \
      METRICS_FILE=${DATA}/$i/$i.md.metrics \
      VALIDATION_STRINGENCY=SILENT
    and finally sent to GASV
    Code:
    cd ${GASV}
    perl ${GASV}/BAM_preprocessor.pl \
      ${DATA}/$i/$i.md.bam \
      -CHROMOSOME_NAMING ${CHRS} \
      -OUTPUT_PREFIX ${DATA}/$i/GASV/gasv
    .

    In example (1) Picard's MarkDuplicates returns 1371073/2000000 duplicate reads and 0 optical duplicates. Whilst in example (2) it returns 15523/2000000 duplicate reads and 0 optical duplicates.

    When sent to GASV, example (1) can have its insert sizes calculated from Lmin and Lmax, and causes no problems. Whilst for example (2) GASV fails to calculate the insert sizes and as a result fails to identify any discordant reads.

    It seems I'm failing to generate a BAM file that will work in both Picard's MarkDuplicates AND GASV, and it appears to be due to the read definition line. I've tried to address this by going through all of the fastq-dump options, and attempting to alter the default READ_NAME_REGEX in Picard's MarkDuplicates, but don't appear to be getting anywhere.

    Is anyone able to help?


  • #2
    I forgot to mention that after running the fast-dump command I used a subset of the fastq file to make checking the different command options quicker
    Code:
    cat ${DATA}/$i/SRR081275_1.fastq | sed -n 1,4000000p > ${DATA}/$i/"$i"_1.fastq
    cat ${DATA}/$i/SRR081275_2.fastq | sed -n 1,4000000p > ${DATA}/$i/"$i"_2.fastq
    head ${DATA}/$i/"$i"_1.fastq
    head ${DATA}/$i/"$i"_2.fastq
    bgzip -f "$i"_1.fastq
    bgzip -f "$i"_2.fastq
    Also, I have NGS data from Illumain GAIIx 1.8 format which works fine in both Picard MarkDuplicates and GASV, the fastq read format for which looks like this:

    @HWUSI-EAS1643R:30:FC:1:1:3636:1000 1:N:0:CGATGT
    NTTCCTACTCCGACATGCTAAAGATCCATGAAACAGAGTGTTTGCCAACAAGTCCAGATTTTTACCAGGCTCATCTCTTCAGTTTCAAAGAATCAGTTTC
    +
    #***,/223/@:@@@@@@@@:<<<<@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@:@@@@@@22.2577577@@@@@@@@@@@228@@@@@@

    Comment


    • #3
      Did you check the bam file header. What's the read name? I encounter a warning for READ_NAME_REGEX unalbe to detect "read name" during MarkDuplicate with picardtools. And the output shows 0 optical duplicate.

      Comment

      Latest Articles

      Collapse

      • seqadmin
        Advanced Tools Transforming the Field of Cytogenomics
        by seqadmin


        At the intersection of cytogenetics and genomics lies the exciting field of cytogenomics. It focuses on studying chromosomes at a molecular scale, involving techniques that analyze either the whole genome or particular DNA sequences to examine variations in structure and behavior at the chromosomal or subchromosomal level. By integrating cytogenetic techniques with genomic analysis, researchers can effectively investigate chromosomal abnormalities related to diseases, particularly...
        09-26-2023, 06:26 AM
      • seqadmin
        How RNA-Seq is Transforming Cancer Studies
        by seqadmin



        Cancer research has been transformed through numerous molecular techniques, with RNA sequencing (RNA-seq) playing a crucial role in understanding the complexity of the disease. Maša Ivin, Ph.D., Scientific Writer at Lexogen, and Yvonne Goepel Ph.D., Product Manager at Lexogen, remarked that “The high-throughput nature of RNA-seq allows for rapid profiling and deep exploration of the transcriptome.” They emphasized its indispensable role in cancer research, aiding in biomarker...
        09-07-2023, 11:15 PM

      ad_right_rmr

      Collapse

      News

      Collapse

      Topics Statistics Last Post
      Started by seqadmin, Yesterday, 09:36 AM
      0 responses
      8 views
      0 likes
      Last Post seqadmin  
      Started by seqadmin, 10-02-2023, 07:14 AM
      0 responses
      15 views
      0 likes
      Last Post seqadmin  
      Started by seqadmin, 09-29-2023, 09:38 AM
      0 responses
      16 views
      0 likes
      Last Post seqadmin  
      Started by seqadmin, 09-27-2023, 06:57 AM
      0 responses
      16 views
      0 likes
      Last Post seqadmin  
      Working...
      X