Read lengths, inserts, fragment size...

  • Read lengths, inserts, fragment size...

    Hi,

    I am new to sequencing and bioinformatics and am trying to get the terms right.

    I am doing WGS of E. coli genomes on Illumina HiSeq machines.

    Read length: Is that the length of the DNA fragment between the tags being replicated on the flow cell?

    Inserts: The actual sequence you get from the machine?

    Fragment size: Not sure...

    Cycles: I have samples run with 75 and 100 cycles. What does that mean?

    Please help me, so confused...

  • #2
    You should read through the stickies in the Illumina library prep section. That said:

    Read length and cycles are related terms. With Illumina, one base is sequenced per cycle, so 100 cycles gives you 100 bp reads. The read length is the number of bases sequenced. It does not matter where along the entire strand the bases are sequenced (although typically they start right after the sequencing primer site, which is ligated on to each fragment as part of the adapter during the library prep).

    Insert refers to the sequence between the universal adapters that are ligated on. These adapters typically have the flow cell adapter sequence and sequencing primer. They can also have an index or a barcode. If they have a barcode, the barcode is considered to be included in the insert since it will be sequenced at the beginning of the read. An index would not be sequenced at the beginning of the read and thus does not contribute to the insert size.

    Fragment size is similar to insert size: it is the average size of your fragments. You should specify whether it refers to the fragments before or after the adapters are ligated on.


    • #3
      Okay,

      I have reads of 75 bp, but the mean insert size in one sample is 239 bp. Is the barcode that long?

      Thank you for making it clearer!


      • #4
        Originally posted by avm
        Okay,

        I have reads of 75 bp, but the mean insert size in one sample is 239 bp. Is the barcode that long?

        Thank you for making it clearer!

        No, that (probably) means that there is a 239 bp DNA strand between the two adapters that were ligated on. Defined this way, the insert size is independent of the read length.

        To explain further: I think some people define the fragment size as the distance between where the adapters are ligated, and the insert size as the distance between where the reads end in paired-end sequencing. With that definition, the insert size would vary with read length. However, I do not believe that's universal, and I always use insert size to mean the total length of DNA between where the adapters are ligated.

        If you use software where you have to specify the insert size, check to make sure that you understand what definition the software is using.
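
To make the two definitions concrete, here is a minimal Python sketch. It assumes the insert size is the full stretch of DNA between the adapters (the definition used in this post) and derives the unsequenced gap between the two mates, which is the quantity that changes with read length. The function name `inner_distance` is made up for illustration and does not come from any particular tool.

```python
def inner_distance(insert_size: int, read_length: int) -> int:
    """Bases between the ends of the two mates of a paired-end read.

    Assumes insert_size is the DNA between the adapters and that both
    mates have the same length. A negative value means the mates overlap.
    """
    return insert_size - 2 * read_length


# The 239 bp insert from the question, read with 75 and 100 cycles per mate.
print(inner_distance(239, 75))   # 89
print(inner_distance(239, 100))  # 39
```

The insert itself stays at 239 bp in both cases; only the gap between the mates shrinks as the reads get longer.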


        • #5
          Thank you for explaining and for quick answers!!


          • #6
            We got our Illumina paired-end data for a 2x100 bp run processed with CASAVA 1.8 (demultiplexed FASTQ files). Since this is our very first run and we are new to downstream Illumina data processing, I would appreciate it if you could answer our queries:
            1) Our data for almost all the lanes looks as below. Is this normal? The position of the Ns is almost the same in each sample from different lanes. If not, what is the cause of such data?

            ********************************************************************************************************
            @DJG64KN1:78:C0MG3ACXX:4:1101:1119:1986 1:Y:0:GCCAATA
            TTCTCCCCTTNNNNNNNNNNNTTCTTTGAACCCACNNNNNNNNTATCATGACTACTTATGTAANNNNNNNTACACAGCCACCATTTCTGANNNCTGCTCA
            +
            <<<[email protected][email protected]@###########228???????????########--<=????????????;[email protected]@#####################################
            @DJG64KN1:78:C0MG3ACXX:4:1101:1212:1989 1:Y:0:GCCAATA
            TATGAAAAATNNNNNNNNNNAATGTTATAATTTCTANGNNNNNGAGGGCTATTTATAGTCTAANNNNNTCAACTATGCTAATTATCACAATTAGCCCCTT
            +
            <<<@[email protected]@@[email protected]##########[email protected][email protected][email protected]@[email protected]?????#0#####00==????????>[email protected]@@@#####,,9==>>?>???>>?????==========<<<
            @DJG64KN1:78:C0MG3ACXX:4:1101:1473:1987 1:Y:0:GCCAATA
            CTTACATATANNNNNNNNNNNAAAAGTAAGTTTGAGNCNNNNNTCCAATTTAGATGAAGAATCNNNNNACATTTCATATTTTTAATAGATACTTAACTAT
            +
            <<<@@@@@@@###########[email protected]>[email protected]???)=#0#####00<????>???>??=?9;?;#####,,9==?>=>>>?????;=:===26;===<===
            @DJG64KN1:78:C0MG3ACXX:4:1101:1253:1997 1:Y:0:GCCAATA
            ATTTGTATTANNNNNNNNNTCAAAAATTAAGATGAGTATNNNNTGAAGTAAACATGATTTGGCNNNNNTGAAAACATAGACGAGATAGGAAAATAGAAAG
            +
            <<<@@@@@@@#########[email protected]@@@@@[email protected]????????####00=??????????>[email protected]@@@?#####--=???><>?>??<<<<<======<=======
            @DJG64KN1:78:C0MG3ACXX:4:1101:1385:1998 1:Y:0:GCCAATA
            AACCAAAGCTNNNNNNNNNAATTAAAGTCATTTCTCAACNNNNAGTATCAACATCTATACATANNNNNATTATCGATCAGTTATATAAAGTTCTTTTCTA
            +
            <<<@@@@@@@#########32@@@@[email protected]????????????####00<[email protected]@@@>#####-,9=????=?<??????===============
            @DJG64KN1:78:C0MG3ACXX:4:1101:1667:1982 1:Y:0:GCCAATA
            ANGACTTAAGNNNNNNNNNNNTCCAGAGATAATTANNNNNNNNTTTTTTTCTTATTTATGAGNNNNNNNAACATCCAAAAAACTATTGTATTTTTGTGTC
            +
            <#[email protected]@@@@@@###########[email protected]>@>????????########00<????????????????######################################
            @DJG64KN1:78:C0MG3ACXX:4:1101:1519:1984 1:Y:0:GCCAATA
            TNCCCATTTTNNNNNNNNNNNCTTATTCACAAATCNNNNNNNNAACTTACAGTAGTTTTCATNNNNNNNAAAAACAGTTCAAACTGCAATTGTATTTGTG
            +
            9#0<@@([email protected]@##########################################################################################
            @DJG64KN1:78:C0MG3ACXX:4:1101:1594:1985 1:Y:0:GCCAATA
            TTATAATCAANNNNNNNNNNNAAAAAAAAAGCCCGNNNNNNNNAATTAAACATTGTTAAACCANNNNNNAACATTGTTAAACCAATAATAAGCAGTTATT
            +
            <<<@[email protected][email protected]??###########[email protected]@?????8>???#################################################################
            @DJG64KN1:78:C0MG3ACXX:4:1101:1644:1989 1:Y:0:GCCAATA
            AGATGAGTAANNNNNNNNNNTACATGCTCGAACGCTNTNNNNNGAGCAAATACGTTTTAAAACNNNNNAAGTTAAAACAACTTCTTGAAAATGAATCAAG
            +
            <<<@[email protected]@@?##########32=?????????????#-#####.-<=??9;>[email protected]@???#####################################
            @DJG64KN1:78:C0MG3ACXX:4:1101:1809:1988 1:Y:0:GCCAATA
            TAGCCTTATCNNNNNNNNNNNCCAAACTAGACACCTNANNNNNCAACACTATGCCTTCTTTAANNNNNAAATGACATTTTTCCCAATTAAGAACAAGGTG
            *****************************************************************************************************************

            2) We have got around 21-30 FASTQ files per lane for both read 1 and read 2, named like: SJL-2b_ACAGTGA_L008_R1_001.fastq.gz ..................... SJL-2b_ACAGTGA_L008_R2_021.fastq.gz.
            Does this mean that the read length of this sample is only 21 bp?


            • #7
              1. Having stretches of N's like that is not normal. I'm not sure what the cause would be. You should check with whatever sequencing core ran those samples to see if there was anything weird with the run as a whole. If so you may be able to get them to rerun it for free.

              2. No, your reads are 100bp. That's just the name of the file; could mean anything.
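
For what it's worth, the file names in the question look like the CASAVA 1.8 naming scheme, in which the trailing three-digit number is a file counter for that sample, lane and read, not a read length. Here is a rough sketch of pulling the fields apart. It assumes that scheme (`<sample>_<index>_L<lane>_R<read>_<counter>.fastq.gz`) and a single index made only of A, C, G, T or N. `parse_fastq_name` is an illustrative helper, not part of any Illumina software.

```python
import re

# Assumed CASAVA 1.8 layout: <sample>_<index>_L<lane>_R<read>_<counter>.fastq.gz
FASTQ_NAME = re.compile(
    r"^(?P<sample>.+)_(?P<index>[ACGTN]+)_L(?P<lane>\d{3})"
    r"_R(?P<read>[12])_(?P<counter>\d{3})\.fastq\.gz$"
)


def parse_fastq_name(name: str) -> dict:
    """Split a file name into its fields; raise ValueError if it does not fit."""
    match = FASTQ_NAME.match(name)
    if match is None:
        raise ValueError(f"not a recognised FASTQ file name: {name}")
    fields = match.groupdict()
    # Lane, read and counter are zero-padded numbers in the file name.
    for key in ("lane", "read", "counter"):
        fields[key] = int(fields[key])
    return fields


print(parse_fastq_name("SJL-2b_ACAGTGA_L008_R2_021.fastq.gz"))
# {'sample': 'SJL-2b', 'index': 'ACAGTGA', 'lane': 8, 'read': 2, 'counter': 21}
```

Under this reading, `021` only says that the file is the 21st chunk of read 2 for that sample on lane 8.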


              • #8
                Thanks for the response. The pattern looks the same in almost all the samples. Is this a problem with the library preparation or just a problem with the sequencing run?

                I want to elaborate on my second question (for both R1 and R2):
                Some samples have R1_001..to ...20.fastq.gz and some R1_001..to ...35.fastq.gz. Why do different samples have different numbers of files? What does this suggest?

                Can you please let me know which software to use for clipping the adapter sequences and the indices, and for further downstream processing?

                Thanks a lot, sir!


                • #9
                  No idea if it's a problem with the library prep or the run; I would check with the sequencing core (I'd imagine it's a problem with the run, though).

                  There may be different numbers of files if there are different total number of reads, not different read lengths.

                  From the reads you showed it looks like the indices have already been clipped and put into the headers. You may want to look at the FastX toolkit to find a way to trim adapter sequences. I align with Novoalign which does it during the alignment.
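
To illustrate the point about the index sitting in the header, here is a small sketch that splits one of the header lines shown above into its fields. It assumes the CASAVA 1.8 header layout (`@instrument:run:flowcell:lane:tile:x:y read:filtered:control:index`). As far as I know, the `Y`/`N` field is `Y` when the read was filtered out (did not pass the chastity filter) and `N` otherwise. The names `ReadHeader` and `parse_header` are illustrative only.

```python
from typing import NamedTuple


class ReadHeader(NamedTuple):
    instrument: str
    run: int
    flowcell: str
    lane: int
    tile: int
    x: int
    y: int
    read: int            # 1 or 2 in a paired-end run
    failed_filter: bool  # True when the header flag is "Y"
    control: int
    index: str           # index sequence moved into the header


def parse_header(line: str) -> ReadHeader:
    """Parse a header line, assuming the CASAVA 1.8 layout described above."""
    location, description = line.strip().lstrip("@").split(" ", 1)
    instrument, run, flowcell, lane, tile, x, y = location.split(":")
    read, is_filtered, control, index = description.split(":")
    return ReadHeader(instrument, int(run), flowcell, int(lane), int(tile),
                      int(x), int(y), int(read), is_filtered == "Y",
                      int(control), index)


h = parse_header("@DJG64KN1:78:C0MG3ACXX:4:1101:1119:1986 1:Y:0:GCCAATA")
print(h.lane, h.failed_filter, h.index)  # 4 True GCCAATA
```

If that reading of the flag is right, every read in the excerpt above carries `Y`, which would fit with them being low-quality reads.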


                  • #10
                    thanks a lot...it helped


                    • #11
                      Just a follow-up to the above. Illumina support gave the following answer to the problem, and I did find that the sequences in the middle of the file are good.

                      "The data that you provided looks to be very normal. Generally speaking there will be data at the beginning and end of the FASTQ that is of lower quality than the data in the middle of the file. This is simply due to sorting. This data appears to be of normal quality and appears to be intact. "

                      If you have time, I want to discuss my course of action:

                      I have 454 unpaired and paired data and Illumina reads. I have assembled the 454 data using Newbler. I plan to assemble the Illumina data using Velvet and then combine the assemblies using Minimus.

                      arc


                      • #12
                        Oh, right; the first reads of a fastq file for Illumina will be around the edge of the flowcell, I think, making them more likely to be weird. Maybe do "less +1000000" and see what that looks like.
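
If eyeballing the file with `less` is not enough, the same check can be put into numbers. The sketch below is one way to do it, assuming plain four-line FASTQ records (no wrapped sequence lines) and optionally gzip-compressed input. `n_fraction` is an illustrative name, and the default window of 10,000 reads is arbitrary.

```python
import gzip
from itertools import islice


def open_text(path):
    """Open a FASTQ file as text, transparently handling .gz files."""
    path = str(path)
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "rt")


def n_fraction(path, skip_reads=0, sample_reads=10000):
    """Fraction of N base calls in a window of reads.

    Skips the first skip_reads records, then looks at up to sample_reads
    records. Assumes every record is exactly four lines long.
    """
    n_count = total = 0
    start = skip_reads * 4 + 1               # sequence line of the first kept record
    stop = (skip_reads + sample_reads) * 4   # end of the window, in lines
    with open_text(path) as handle:
        for line in islice(handle, start, stop, 4):
            seq = line.strip().upper()
            n_count += seq.count("N")
            total += len(seq)
    return n_count / total if total else 0.0
```

Comparing `n_fraction(path)` with `n_fraction(path, skip_reads=250000)` contrasts the start of the file with the region that line 1,000,000 falls in, since four lines make one read.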

                        I've never done any assembly so you'll have to find somebody else.


                        • #13
                          Hi, sorry for the delayed response. I did "less +1000000" and the data looked good.

                          I have a few more queries:
                          1) I can see sequences flagged with both "N" and "Y", which indicates that the sequences have not been filtered. Are there programs to do that?
                          2) Our sequencing provider has given us multiple fastq.gz files per lane. What is the protocol to concatenate such files?
                          3) I am confused about the Illumina paired-end library in comparison to the 454 PE library. The Illumina library has the following setup: adapter-seq-adapter, in comparison to 454, which has seq-linker-seq. If the sequences are 100 bp each, then in 454 we end up getting 200 bp PE reads, whereas in Illumina we get separate 100 bp R1 and R2 reads (for a 2x100 run). This means that in Illumina we are just getting extra 100 bp reads from the PE run which do not have any linking information. We could save money by doing unpaired Illumina runs. What is the use of doing a PE Illumina run?

                          Sorry for bombarding you with so many questions.

                          regards,
                          arc


                          • #14
                            1. I don't know, as we only receive the unfiltered ones. Maybe "grep -v"?

                            2. I think you'll need to unzip them all and then concatenate.

                            3. With Illumina paired end runs you have something like this:

                            [flowcell adapter][sequencing primer][insert][sequencing primer][flowcell adapter]

                            The key is that the insert may be, say, 300 bp, and if you do 2x100 reads, you'll sequence it like this (the dots are only spacers):

                            ........--------->...................<---------.....
                            ........xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.....

                            So when aligning paired-end data, it is clear that the two mates of a pair should align in that orientation, fairly close to each other.
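
On points 1 and 2, here is a hedged Python sketch of one possible approach. It relies on two things that are worth double-checking for your own data: gzip files can be joined byte for byte into a valid multi-member gzip file, so unzipping first is optional, and the headers follow the CASAVA 1.8 layout, where the second field after the space is `Y` for reads that failed the filter. Note that a line-based filter such as `grep -v` would drop only the header line of a record and leave its other three lines behind. Both function names are made up for illustration.

```python
import gzip
import shutil


def concatenate_gz(parts, out_path):
    """Join gzip files byte for byte into one multi-member gzip file.

    Pass the parts in file-counter order, and use the same order for R1
    and R2 so the mates stay in step.
    """
    with open(out_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)


def drop_failed_reads(in_path, out_path):
    """Copy only records whose header flag is N; return (kept, dropped).

    Assumes four-line records and CASAVA 1.8 style headers such as
    "@... 1:N:0:GCCAATA".
    """
    kept = dropped = 0
    with gzip.open(in_path, "rt") as src, gzip.open(out_path, "wt") as dst:
        while True:
            record = [src.readline() for _ in range(4)]
            if not record[0]:
                break  # end of file
            flag = record[0].split(" ", 1)[1].split(":")[1]
            if flag == "N":
                dst.writelines(record)
                kept += 1
            else:
                dropped += 1
    return kept, dropped
```

Since the filter flag should be the same for both mates of a cluster, running `drop_failed_reads` on the merged R1 file and on the merged R2 file separately should leave the two files paired up, but it is worth confirming that the read counts match afterwards.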


                            • #15
                              thanks a lot.
