  • Help with Illumina Paired-End Data

    I am a beginner at bioinformatics but have some experience with Python and software development.

    I am trying to take some Illumina sequence data (mRNA-level complementary DNA I think) and prepare it for BLAST alignment. It is supposed to be paired-end. However, I'm trying to make sure this is true.

    For example, I have the following data files:
    J06643_NoIndex_L002_R1_001.fastq
    J06643_NoIndex_L002_R1_002.fastq
    J06643_NoIndex_L002_R1_003.fastq
    J06643_NoIndex_L002_R1_004.fastq
    J06643_NoIndex_L002_R1_005.fastq
    J06643_NoIndex_L002_R1_006.fastq
    J06643_NoIndex_L002_R1_007.fastq
    J06643_NoIndex_L002_R1_008.fastq
    J06643_NoIndex_L002_R1_009.fastq
    J06643_NoIndex_L002_R1_010.fastq
    J06643_NoIndex_L002_R1_011.fastq
    J06643_NoIndex_L002_R1_012.fastq
    J06643_NoIndex_L002_R1_013.fastq
    J06643_NoIndex_L002_R1_014.fastq
    J06643_NoIndex_L002_R2_001.fastq
    J06643_NoIndex_L002_R2_002.fastq
    J06643_NoIndex_L002_R2_003.fastq
    J06643_NoIndex_L002_R2_004.fastq
    J06643_NoIndex_L002_R2_005.fastq
    J06643_NoIndex_L002_R2_006.fastq
    J06643_NoIndex_L002_R2_007.fastq
    J06643_NoIndex_L002_R2_008.fastq
    J06643_NoIndex_L002_R2_009.fastq
    J06643_NoIndex_L002_R2_010.fastq
    J06643_NoIndex_L002_R2_011.fastq
    J06643_NoIndex_L002_R2_012.fastq
    J06643_NoIndex_L002_R2_013.fastq
    J06643_NoIndex_L002_R2_014.fastq

    It would seem logical that R1 is one end of the pair, and that R2 is the other. However, when I look at each set of files, I do not see the "/1" and "/2" designations. (according to this site, they should be there: http://loblolly.ucdavis.edu/bipod/ft...al_RNA-Seq.pdf)

    R1_001:

    <#0@@#############################################
    @D3NH4HQ1:710G1KACXX:2:1101:1488:2217 1:N:0:
    GTAAGGGCAAGGGCACTGAGCTATGTCATCTGGGCTCAAATTCTGCTACC
    +
    B@@FFFFFHHHHHJJJIJJJJJIJJIIGIIIJIJJGIGGIIIGJIEIIIH
    @D3NH4HQ1:710G1KACXX:2:1101:1279:2224 1:Y:0:
    GGCTTATTTGATACTCATGGTACAGAAGCGACGATCAAATAGATTGAGAA

    R2_001:

    ###4##22ADFHG#####################################
    @D3NH4HQ1:710G1KACXX:2:1101:2135:2174 2:N:0:
    NNGATGCAGGTGGCNNGGANNNNNNNNCGCCATNNTGCCTNNNNNNNNNN
    +
    ##14A?DBD<CACB##42<########11??FE##00?B@##########
    @D3NH4HQ1:710G1KACXX:2:1101:2088:2176 2:N:0:
    NNTGTTGTCACTTTNNAGANNNNNNNNTTGCTATNAAGCTNNNNNNNNNN

    Does this mean the data are not paired end?

  • #2
    I'm not sure what the exact specifics of the new format are, but the 1:N:0 or 2:N:0 in the header denotes what /1 and /2 used to. This Wikipedia page is helpful:



    • #3

      It would seem logical that R1 is one end of the pair, and that R2 is the other. However, when I look at each set of files, I do not see the "/1" and "/2" designations. (according to this site, they should be there: http://loblolly.ucdavis.edu/bipod/ft...al_RNA-Seq.pdf)
      What did your parents (or teachers) tell you about not trusting everything you read on the internet?

      The Illumina specs have changed back and forth a couple of times in the last several months. It looks like you received files from the time that they decided to remove the '/1' and '/2' designations. Instead look at the first number after the white space:

      @D3NH4HQ1:710G1KACXX:2:1101:2088:2176 2:N:0:
      The above is an R2 read.
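
A minimal Python sketch of that check, assuming the newer header layout shown above (read name, a space, then colon-separated fields beginning with the read number) and falling back to the older "/1" and "/2" suffix. The function name `read_number` is made up for illustration.

```python
def read_number(header: str) -> int:
    """Return 1 or 2 for a FASTQ header line in either Illumina naming style."""
    header = header.strip()
    if not header.startswith("@"):
        raise ValueError(f"not a FASTQ header: {header!r}")
    # Split the read name from the comment that follows the first space.
    name, _, comment = header[1:].partition(" ")
    if comment:
        # Newer style: the first colon-separated field is the read number.
        return int(comment.split(":")[0])
    if name.endswith(("/1", "/2")):
        # Older style: the read number is a suffix on the name itself.
        return int(name[-1])
    raise ValueError(f"no read number found in: {header!r}")


if __name__ == "__main__":
    print(read_number("@D3NH4HQ1:710G1KACXX:2:1101:2088:2176 2:N:0:"))
```

Running it over the first header of each file should report 1 for the R1 files and 2 for the R2 files if the data are paired as the file names suggest.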



      • #4
        Great. That makes total sense.

        The first thing I would like to do is subtract all human sequences from the data. We are only interested in viruses. I have attempted this with the following process. Does this look correct?

        2. Each set of R1 and R2 files was concatenated using the following command, producing one R1 fastq file and one R2 fastq file.
        a. cat J06643_NoIndex_L002_R1_001.fastq J06643_NoIndex_L002_R1_002.fastq J06643_NoIndex_L002_R1_003.fastq J06643_NoIndex_L002_R1_004.fastq J06643_NoIndex_L002_R1_005.fastq J06643_NoIndex_L002_R1_006.fastq J06643_NoIndex_L002_R1_007.fastq J06643_NoIndex_L002_R1_008.fastq J06643_NoIndex_L002_R1_009.fastq J06643_NoIndex_L002_R1_010.fastq J06643_NoIndex_L002_R1_011.fastq J06643_NoIndex_L002_R1_012.fastq J06643_NoIndex_L002_R1_013.fastq J06643_NoIndex_L002_R1_014.fastq > J06_R1.fastq

        3. Illumina adapters and low quality reads were removed using cutadapt.
        a. cutadapt -f fastq -q 20 -a AGATCGGAAGAGC J06_R1.fastq > ./J06_trimmed.fastq

        4. Bowtie against hg19 to subtract out all human sequences
        a. bowtie --un J06_subtracted.fastq -p 8 --chunkmbs 512 hg19 -1 J06_R1_trimmed.fastq -2 J06_R2_trimmed.fastq J06.sam
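
Since step 4 hands bowtie two files with -1 and -2, it may be worth confirming that they still pair up record by record. The sketch below is one way to do that in Python. It assumes plain four-line FASTQ records and the newer header style, where the read name before the space is identical for both mates; the function names are made up for illustration.

```python
from itertools import zip_longest


def read_ids(path):
    """Yield the read name (text before the first space) of every FASTQ record."""
    with open(path) as handle:
        for line_number, line in enumerate(handle):
            # The header is the first line of each four-line record.
            if line_number % 4 == 0 and line.strip():
                yield line[1:].split()[0]


def first_mismatch(r1_path, r2_path):
    """Return the index of the first record whose names differ, or None if all pair up.

    A file that runs out of records early also counts as a mismatch.
    """
    pairs = zip_longest(read_ids(r1_path), read_ids(r2_path))
    for index, (id1, id2) in enumerate(pairs):
        if id1 != id2:
            return index
    return None
```

Calling `first_mismatch` on the two concatenated (or trimmed) files and getting `None` back would suggest the mates are still in the same order.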



        • #5
          cat *R1*.fastq > J06_R1.fq
          Probably would have worked just as well, with a lot less typing.

          If you know the virus you expect to see, it might work slightly better if you align against a genome that has human sequence and virus sequence together. You'll have to make the index for that yourself, rather than downloading the pre-made one. You can then filter the .bam for the lines that aligned to virus.

          But that won't make a very big difference.
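
As a rough sketch of the filtering step, assuming the alignments are available as SAM text (tab-separated, with the reference name in the third column) rather than binary BAM. The function name and the reference name `my_virus` are placeholders, not names from any real index.

```python
def virus_alignments(sam_lines, virus_names):
    """Yield SAM alignment lines whose reference name (column 3) is a virus sequence."""
    for line in sam_lines:
        if line.startswith("@"):
            # SAM header lines start with "@" and carry no alignment.
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) > 2 and fields[2] in virus_names:
            yield line


if __name__ == "__main__":
    example = [
        "@SQ\tSN:chr1\tLN:100\n",
        "read1\t0\tchr1\t10\t255\t4M\t*\t0\t0\tACGT\tIIII\n",
        "read2\t0\tmy_virus\t5\t255\t4M\t*\t0\t0\tACGT\tIIII\n",
    ]
    for hit in virus_alignments(example, {"my_virus"}):
        print(hit, end="")
```

The set passed as `virus_names` would hold whatever sequence names were used for the virus when the combined index was built.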



          • #6
            Hah, thanks. That would've saved me some time.

            How about the cutadapt and bowtie commands?

            For cutadapt, is -q 20 appropriate? Did I select the right adapter sequence, and is there a way to make sure of this?

            For Bowtie, do I need to alter the "maxins" parameter? My reads are 50bp, and the default maxins parameter is 250.

            Right now, Bowtie is outputting some blank and incomplete reads. Is that normal, and will it screw up the assembly step?

            For example, here are the first few lines of the R1 bowtie output:

            @D3NH4HQ1:710G1KACXX:2:1101:1233:2172 1:Y:0:
            A
            +
            <
            @D3NH4HQ1:710G1KACXX:2:1101:1406:2044 1:Y:0:
            AAAA
            +
            <<<@
            @D3NH4HQ1:710G1KACXX:2:1101:1317:2025 1:Y:0:
            AGCT
            +
            <<<?
            @D3NH4HQ1:710G1KACXX:2:1101:15237:2000 1:Y:0:

            +

            @D3NH4HQ1:710G1KACXX:2:1101:15197:2000 1:Y:0:

            +

            @D3NH4HQ1:710G1KACXX:2:1101:15556:2000 1:Y:0:

            +
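
One way to see how widespread the blank and very short records are is to tally sequence lengths. This is only a sketch, assuming four-line FASTQ records; the function name and the 20-base threshold are arbitrary choices for illustration.

```python
from collections import Counter


def length_summary(fastq_lines, min_length=20):
    """Count FASTQ records whose sequence is empty, shorter than min_length, or longer."""
    counts = Counter()
    for line_number, line in enumerate(fastq_lines):
        if line_number % 4 != 1:
            # The sequence is the second line of each four-line record.
            continue
        length = len(line.strip())
        if length == 0:
            counts["empty"] += 1
        elif length < min_length:
            counts["short"] += 1
        else:
            counts["ok"] += 1
    return counts
```

Passing an open file handle as `fastq_lines` works, since the function only iterates over lines. If the short records turn out to come from the trimming step, cutadapt's -m (--minimum-length) option can discard them, though with paired files both mates would need to be dropped together to keep R1 and R2 in sync.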
