
  • Help with Illumina Paired-End Data

    I am a beginner at bioinformatics but have some experience with Python and software development.

    I am trying to take some Illumina sequence data (mRNA-level complementary DNA I think) and prepare it for BLAST alignment. It is supposed to be paired-end. However, I'm trying to make sure this is true.

    For example, I have the following data files:
    J06643_NoIndex_L002_R1_001.fastq
    J06643_NoIndex_L002_R1_002.fastq
    J06643_NoIndex_L002_R1_003.fastq
    J06643_NoIndex_L002_R1_004.fastq
    J06643_NoIndex_L002_R1_005.fastq
    J06643_NoIndex_L002_R1_006.fastq
    J06643_NoIndex_L002_R1_007.fastq
    J06643_NoIndex_L002_R1_008.fastq
    J06643_NoIndex_L002_R1_009.fastq
    J06643_NoIndex_L002_R1_010.fastq
    J06643_NoIndex_L002_R1_011.fastq
    J06643_NoIndex_L002_R1_012.fastq
    J06643_NoIndex_L002_R1_013.fastq
    J06643_NoIndex_L002_R1_014.fastq
    J06643_NoIndex_L002_R2_001.fastq
    J06643_NoIndex_L002_R2_002.fastq
    J06643_NoIndex_L002_R2_003.fastq
    J06643_NoIndex_L002_R2_004.fastq
    J06643_NoIndex_L002_R2_005.fastq
    J06643_NoIndex_L002_R2_006.fastq
    J06643_NoIndex_L002_R2_007.fastq
    J06643_NoIndex_L002_R2_008.fastq
    J06643_NoIndex_L002_R2_009.fastq
    J06643_NoIndex_L002_R2_010.fastq
    J06643_NoIndex_L002_R2_011.fastq
    J06643_NoIndex_L002_R2_012.fastq
    J06643_NoIndex_L002_R2_013.fastq
    J06643_NoIndex_L002_R2_014.fastq

    It would seem logical that R1 is one end of the pair, and that R2 is the other. However, when I look at each set of files, I do not see the "/1" and "/2" designations. (according to this site, they should be there: http://loblolly.ucdavis.edu/bipod/ft...al_RNA-Seq.pdf)

    R1_001:

    <#0@@#############################################
    @D3NH4HQ1:710G1KACXX:2:1101:1488:2217 1:N:0:
    GTAAGGGCAAGGGCACTGAGCTATGTCATCTGGGCTCAAATTCTGCTACC
    +
    B@@FFFFFHHHHHJJJIJJJJJIJJIIGIIIJIJJGIGGIIIGJIEIIIH
    @D3NH4HQ1:710G1KACXX:2:1101:1279:2224 1:Y:0:
    GGCTTATTTGATACTCATGGTACAGAAGCGACGATCAAATAGATTGAGAA

    R2_001:

    ###4##22ADFHG#####################################
    @D3NH4HQ1:710G1KACXX:2:1101:2135:2174 2:N:0:
    NNGATGCAGGTGGCNNGGANNNNNNNNCGCCATNNTGCCTNNNNNNNNNN
    +
    ##14A?DBD<CACB##42<########11??FE##00?B@##########
    @D3NH4HQ1:710G1KACXX:2:1101:2088:2176 2:N:0:
    NNTGTTGTCACTTTNNAGANNNNNNNNTTGCTATNAAGCTNNNNNNNNNN

    Does this mean the data are not paired end?

  • #2
    I'm not sure of the exact specifics of the new format, but the 1:N:0 or 2:N:0 in the header denotes what /1 and /2 used to. This Wikipedia page is helpful:



    • #3

      It would seem logical that R1 is one end of the pair, and that R2 is the other. However, when I look at each set of files, I do not see the "/1" and "/2" designations. (according to this site, they should be there: http://loblolly.ucdavis.edu/bipod/ft...al_RNA-Seq.pdf)
      What did your parents (or teachers) tell you about not trusting everything you read on the internet?

      The Illumina specs have changed back and forth a couple of times in the last several months. It looks like you received files from the time that they decided to remove the '/1' and '/2' designations. Instead look at the first number after the white space:

      @D3NH4HQ1:710G1KACXX:2:1101:2088:2176 2:N:0:
      The above is an R2 read.
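
A minimal Python sketch of that check, assuming the comment after the whitespace has the colon-separated form seen above (read number first) and falling back to the older "/1" and "/2" suffix. The function name `read_number` is made up for illustration.

```python
def read_number(header):
    """Return 1 or 2 for a FASTQ header line in either Illumina naming style."""
    name, _, comment = header.rstrip("\n").partition(" ")
    if comment:
        # Newer style: the first colon-separated field after the space is the read number.
        return int(comment.split(":")[0])
    if name.endswith(("/1", "/2")):
        # Older style: the read number is appended to the read name.
        return int(name[-1])
    raise ValueError(f"no read number found in header: {header!r}")


print(read_number("@D3NH4HQ1:710G1KACXX:2:1101:2088:2176 2:N:0:"))  # prints 2
```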



      • #4
        Great. That makes total sense.

        The first thing I would like to do is subtract all human sequences from the data. We are only interested in viruses. I have attempted this with the following process. Does this look correct?

        2. Each set of R1 and R2 files was concatenated using the following command, producing one R1 FASTQ file and one R2 FASTQ file.
        a. cat J06643_NoIndex_L002_R1_001.fastq J06643_NoIndex_L002_R1_002.fastq J06643_NoIndex_L002_R1_003.fastq J06643_NoIndex_L002_R1_004.fastq J06643_NoIndex_L002_R1_005.fastq J06643_NoIndex_L002_R1_006.fastq J06643_NoIndex_L002_R1_007.fastq J06643_NoIndex_L002_R1_008.fastq J06643_NoIndex_L002_R1_009.fastq J06643_NoIndex_L002_R1_010.fastq J06643_NoIndex_L002_R1_011.fastq J06643_NoIndex_L002_R1_012.fastq J06643_NoIndex_L002_R1_013.fastq J06643_NoIndex_L002_R1_014.fastq > J06_R1.fastq

        3. Illumina adapters and low quality reads were removed using cutadapt.
        a. cutadapt -f fastq -q 20 -a AGATCGGAAGAGC J06_R1.fastq > ./J06_trimmed.fastq

        4. Bowtie against hg19 to subtract out all human sequences
        a. bowtie --un J06_subtracted.fastq -p 8 --chunkmbs 512 hg19 -1 J06_R1_trimmed.fastq -2 J06_R2_trimmed.fastq J06.sam



        • #5
          cat *R1*.fastq > JO6_R1.fq
          Probably would have worked just as well, with a lot less typing.
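
For anyone who would rather do this from Python, here is a rough equivalent of the glob-and-concatenate idea. It is only a sketch: `concatenate` is a hypothetical helper, and it sorts the matching names so that the R1 and R2 chunks are joined in the same order, which keeps the mates at matching positions in the two output files.

```python
import glob
import os
import shutil


def concatenate(pattern, out_path):
    """Concatenate files matching pattern, in sorted name order, into out_path."""
    out_abs = os.path.abspath(out_path)
    # Skip the output file itself in case it already exists and matches the pattern.
    paths = sorted(p for p in glob.glob(pattern) if os.path.abspath(p) != out_abs)
    with open(out_path, "wb") as out:
        for path in paths:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)
    return paths


# Example call (file names follow the thread):
# concatenate("*R1*.fastq", "J06_R1.fq")
```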

          If you know the virus you expect to see, it might work slightly better if you align against a genome that has human sequence and virus sequence together. You'll have to make the index for that yourself, rather than downloading the pre-made one. You can then filter the .bam for the lines that aligned to virus.

          But that won't make a very big difference.
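
To illustrate the filtering step, here is a small sketch that works on SAM text rather than BAM, so it needs no extra libraries. It keeps the header lines plus any alignment whose reference name (third column) is in a given set. The reference name `virus_genome` and the function name are placeholders, not anything from the actual index.

```python
def filter_sam_by_reference(lines, wanted_refs):
    """Yield SAM header lines plus alignments whose RNAME is in wanted_refs."""
    wanted = set(wanted_refs)
    for line in lines:
        if line.startswith("@"):
            yield line  # header lines are passed through unchanged
            continue
        fields = line.split("\t")
        # Column 3 (index 2) of a SAM alignment line is the reference name.
        if len(fields) > 2 and fields[2] in wanted:
            yield line


sam = [
    "@SQ\tSN:chr1\tLN:1000\n",
    "@SQ\tSN:virus_genome\tLN:500\n",
    "read1\t0\tchr1\t10\t255\t4M\t*\t0\t0\tACGT\tIIII\n",
    "read2\t0\tvirus_genome\t5\t255\t4M\t*\t0\t0\tACGT\tIIII\n",
    "read3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n",
]
kept = list(filter_sam_by_reference(sam, {"virus_genome"}))
print(len(kept))  # prints 3: two header lines and read2
```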



          • #6
            Hah, thanks. That would've saved me some time.

            How about the cutadapt and bowtie commands?

            For cutadapt, is -q 20 appropriate? Did I select the right adapter sequence, and is there a way to make sure of this?
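
To get a feel for what a 3' quality cutoff of 20 does to a read, here is a sketch of the trimming rule as I understand it from the cutadapt documentation: subtract the cutoff from each quality, take running sums from the 3' end, and cut where the trimmed tail is "worst". It assumes Phred+33 encoding; `quality_trim_3prime` is my own name, not a cutadapt function, and this is an approximation for illustration rather than cutadapt's actual code.

```python
def quality_trim_3prime(seq, qual, cutoff=20, base=33):
    """Trim low-quality bases from the 3' end of a read (BWA-style rule)."""
    running = 0
    best = 0
    cut = len(qual)
    for i in range(len(qual) - 1, -1, -1):
        # Positive contribution when the base quality is below the cutoff.
        running += cutoff - (ord(qual[i]) - base)
        if running < 0:
            break
        if running > best:
            best = running
            cut = i
    return seq[:cut], qual[:cut]


print(quality_trim_3prime("ACGTACGT", "IIII####"))  # prints ('ACGT', 'IIII')
print(quality_trim_3prime("ACGTACGT", "########"))  # prints ('', '')
```

Under this rule a read whose qualities are all "#" (Phred 2 with offset 33) is trimmed to nothing, so a strict cutoff can leave empty or very short records in the output.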

            For Bowtie, do I need to alter the "maxins" parameter? My reads are 50bp, and the default maxins parameter is 250.

            Right now, Bowtie is outputting some blank and incomplete reads. Is that normal, and will it screw up the assembly step?

            For example, here are the first few lines of the R1 bowtie output:

            @D3NH4HQ1:710G1KACXX:2:1101:1233:2172 1:Y:0:
            A
            +
            <
            @D3NH4HQ1:710G1KACXX:2:1101:1406:2044 1:Y:0:
            AAAA
            +
            <<<@
            @D3NH4HQ1:710G1KACXX:2:1101:1317:2025 1:Y:0:
            AGCT
            +
            <<<?
            @D3NH4HQ1:710G1KACXX:2:1101:15237:2000 1:Y:0:

            +

            @D3NH4HQ1:710G1KACXX:2:1101:15197:2000 1:Y:0:

            +

            @D3NH4HQ1:710G1KACXX:2:1101:15556:2000 1:Y:0:

            +
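
One way to deal with records like these is to drop any pair in which either mate is shorter than some minimum length, so the R1 and R2 files stay in sync. The sketch below assumes 4-line FASTQ records, headers in the newer style shown above (read name, space, comment), and a hypothetical threshold of 20 bases. The function names are made up for illustration.

```python
import io
from itertools import islice


def read_fastq(handle):
    """Yield (header, sequence, quality) from a 4-line-per-record FASTQ file."""
    while True:
        record = [line.rstrip("\n") for line in islice(handle, 4)]
        if not record:
            return
        if len(record) < 4:
            raise ValueError("truncated FASTQ record")
        header, seq, _plus, qual = record
        yield header, seq, qual


def filter_pairs(r1_records, r2_records, min_length=20):
    """Yield only the pairs in which both mates have at least min_length bases."""
    # Note: zip stops at the shorter input, so unequal files are not detected here.
    for (h1, s1, q1), (h2, s2, q2) in zip(r1_records, r2_records):
        if h1.split()[0] != h2.split()[0]:
            raise ValueError(f"read names differ: {h1!r} vs {h2!r}")
        if len(s1) >= min_length and len(s2) >= min_length:
            yield (h1, s1, q1), (h2, s2, q2)


long_seq, long_qual = "ACGT" * 6, "I" * 24
r1 = io.StringIO(
    f"@a 1:N:0:\n{long_seq}\n+\n{long_qual}\n"
    "@b 1:Y:0:\nA\n+\n<\n"
    "@c 1:Y:0:\n\n+\n\n"
)
r2 = io.StringIO(
    f"@a 2:N:0:\n{long_seq}\n+\n{long_qual}\n"
    f"@b 2:N:0:\n{long_seq}\n+\n{long_qual}\n"
    f"@c 2:N:0:\n{long_seq}\n+\n{long_qual}\n"
)
kept = list(filter_pairs(read_fastq(r1), read_fastq(r2)))
print(len(kept))  # prints 1: only pair "a" has two mates of at least 20 bases
```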
