  • How to trim Vector and Contamination from Illumina reads?

    We sequenced a few pooled BAC clones on Illumina. Because the BACs carry vector sequence and E. coli genome contamination, we need to get rid of these sequences.

    We have CLC Bio Genomics Workbench, but it did not remove the vector sequences efficiently. Is there any alternative software for this kind of sequence trimming?

  • #2
    You may try the fastx toolkit or play with the good old EMBOSS suite :-)



    • #3
      Same question

      I have the same question, but I have not found a direct answer so far. The FASTX tools are not suitable: fastx_trimmer needs the position of the adaptor, fastx_clipper only clips off the sequence after the adaptor, and I am not sure Biopieces did the right thing after several tries. The tricky part is that the insert can be read in either direction, so there are four sets of border sequences that serve as markers to be clipped off. Say:
      Code:
      5-TGGCCAATTnnnnnnnnnnTGCTAGCACTAG-3
      3-ACCGGTTAAnnnnnnnnnnACGATCGTGATC-5
      The n's are the insert sequence.
      So
      Code:
      TGCTAGCTAG--->vector--seq---
      AATTGGCCA--->vector--seq---
      should be clipped off
      and
      Code:
      --vector--seq---<---TGGCCAATT
      --vector--seq---<---CTAGTGCTAGCA
      should be clipped off too.

      I am not sure whether all the available tools take this into consideration. I hope one of the authors can address this question. Thanks in advance!
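The four border sequences follow from the two known ends plus their reverse complements, since reading the other strand 5' to 3' gives the reverse complement. Here is a minimal Python sketch, assuming the two ends TGGCCAATT and TGCTAGCACTAG from the example above. The function names are hypothetical and not part of any tool mentioned in this thread.

```python
# Hypothetical helpers; only the two border sequences come from the post.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def border_set(left: str, right: str) -> dict:
    """List the border sequences a read can carry in either orientation.

    A forward read starts with `left` and ends with `right`; a read from
    the opposite strand starts with revcomp(right) and ends with revcomp(left).
    """
    return {
        "forward_read_start": left,
        "forward_read_end": right,
        "reverse_read_start": reverse_complement(right),
        "reverse_read_end": reverse_complement(left),
    }


borders = border_set("TGGCCAATT", "TGCTAGCACTAG")
for role, seq in borders.items():
    print(f"{role}: {seq}")
```

The two reverse-strand entries correspond to the AATTGGCCA and CTAGTGCTAGCA markers in the diagrams above.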

      YT



      • #4
        How to trim Vector and Contamination from Illumina reads?

        Hi guys,

        If you are working with Illumina data, try trimmomatic,



        Best wishes,
        Maria



        • #5
          Did you try aligning to the E.coli and vector sequences, and then filtering the .bam?
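As a rough illustration of the filtering step, the sketch below keeps only records whose SAM FLAG has the 0x4 (unmapped) bit set, assuming the alignment against the vector and E. coli references has been written out as SAM text. The records and function names are made up for the example. It works on plain SAM lines rather than a binary .bam, and it drops every mapped read, including any that span a vector/insert border.

```python
def is_unmapped(sam_line: str) -> bool:
    """True when the FLAG field (column 2) has the 0x4 'unmapped' bit set."""
    flag = int(sam_line.split("\t")[1])
    return bool(flag & 0x4)


def keep_unmapped(sam_lines):
    """Yield header lines unchanged and alignment lines only if unmapped."""
    for line in sam_lines:
        if line.startswith("@") or is_unmapped(line):
            yield line


# Made-up records: read1 maps to the vector, read2 does not map.
example = [
    "@SQ\tSN:vector\tLN:7500",
    "read1\t0\tvector\t101\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tTTGACCATGA\tIIIIIIIIII",
]
kept = list(keep_unmapped(example))
for line in kept:
    print(line.split("\t")[0])
```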



          • #6
            Thanks swbarnes2!
            I did align them to the vectors, but my point is NOT to discard those mapped reads, because they contain the border sequence of my BAC insert. There seem to be tools for this in Biopieces, but I have a problem with the installation. The FASTX tools handle only part of my problem; at least I could not figure out how to do the job with them.

            mastal, I have looked into your suite, but I could not figure out how to clip off the border sequences of each read based not on quality but on the insert border sequences, which vary among reads. This is different from adaptor removal in RNA-seq and similar data.

            I appreciate any expertise, though. Thanks again!
            Last edited by yifangt; 03-03-2013, 04:39 PM.



            • #7
              Biopieces should be able to do this. Why don't you run a couple of small tests to see? You may need to reverse complement sequences or adaptors, but that is what a test will show you. Here is my little test (note that I use x instead of N, since N is the IUPAC code for A, T, C or G and will match anything):

              Code:
              maasha@mel:~$ read_fasta -i test.fna | find_adaptor -f TGGCCAATT -r TGCTAGCACTAG 
              SEQ_NAME: test1
              SEQ: TGGCCAATTxxxxxxxxxxTGCTAGCACTAG
              SEQ_LEN: 31
              ADAPTOR_POS_LEFT: 0
              ADAPTOR_LEN_LEFT: 9
              ADAPTOR_PAT_LEFT: TGGCCAATT
              ADAPTOR_POS_RIGHT: 18
              ADAPTOR_LEN_RIGHT: 13
              ADAPTOR_PAT_RIGHT: xTGCTAGCACTAG
              ---
              SEQ_NAME: test2
              SEQ: ACCGGTTAAxxxxxxxxxxACGATCGTGATC
              SEQ_LEN: 31
              ---
              Note that x is included in the matched pattern because by default we allow 10% mismatches.

              Now to get the adaptors trimmed from the second entry you simply need to supply the appropriate adaptors - and run through another round of find_adaptor:

              Code:
              maasha@mel:~$ read_fasta -i test.fna | find_adaptor -f TGGCCAATT -r TGCTAGCACTAG | find_adaptor -f ACCGGTTAA -r ACGATCGTGATC
              SEQ_NAME: test1
              SEQ: TGGCCAATTxxxxxxxxxxTGCTAGCACTAG
              SEQ_LEN: 31
              ADAPTOR_POS_LEFT: 0
              ADAPTOR_LEN_LEFT: 9
              ADAPTOR_PAT_LEFT: TGGCCAATT
              ADAPTOR_POS_RIGHT: 18
              ADAPTOR_LEN_RIGHT: 13
              ADAPTOR_PAT_RIGHT: xTGCTAGCACTAG
              ---
              SEQ_NAME: test2
              SEQ: ACCGGTTAAxxxxxxxxxxACGATCGTGATC
              SEQ_LEN: 31
              ADAPTOR_POS_LEFT: 0
              ADAPTOR_LEN_LEFT: 9
              ADAPTOR_PAT_LEFT: ACCGGTTAA
              ADAPTOR_POS_RIGHT: 18
              ADAPTOR_LEN_RIGHT: 13
              ADAPTOR_PAT_RIGHT: xACGATCGTGATC
              ---
              And finally clip_adaptor:

              Code:
              maasha@mel:~$ read_fasta -i test.fna | find_adaptor -f TGGCCAATT -r TGCTAGCACTAG | find_adaptor -f ACCGGTTAA -r ACGATCGTGATC | clip_adaptor
              SEQ_NAME: test1
              SEQ: xxxxxxxxx
              SEQ_LEN: 9
              ADAPTOR_POS_LEFT: 0
              ADAPTOR_LEN_LEFT: 9
              ADAPTOR_PAT_LEFT: TGGCCAATT
              ADAPTOR_POS_RIGHT: 18
              ADAPTOR_LEN_RIGHT: 13
              ADAPTOR_PAT_RIGHT: xTGCTAGCACTAG
              ---
              SEQ_NAME: test2
              SEQ: xxxxxxxxx
              SEQ_LEN: 9
              ADAPTOR_POS_LEFT: 0
              ADAPTOR_LEN_LEFT: 9
              ADAPTOR_PAT_LEFT: ACCGGTTAA
              ADAPTOR_POS_RIGHT: 18
              ADAPTOR_LEN_RIGHT: 13
              ADAPTOR_PAT_RIGHT: xACGATCGTGATC
              ---
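For readers without a working Biopieces installation, the same two-round idea can be sketched in plain Python. This is a simplified stand-in, not the Biopieces implementation: it uses exact string matching instead of the fuzzy matching of find_adaptor, so it keeps all ten x characters where the output above keeps nine. The function name and the adaptor list are assumptions for illustration.

```python
# Adaptor pairs taken from the two find_adaptor rounds above.
ADAPTOR_PAIRS = [
    ("TGGCCAATT", "TGCTAGCACTAG"),
    ("ACCGGTTAA", "ACGATCGTGATC"),
]


def clip(seq: str, pairs=ADAPTOR_PAIRS) -> str:
    """Clip exact-match adaptors: keep what lies after a left adaptor
    and before a right adaptor, checking every adaptor pair in turn."""
    start, end = 0, len(seq)
    for forward, reverse in pairs:
        left = seq.find(forward)      # leftmost hit of the left adaptor
        if left != -1:
            start = max(start, left + len(forward))
        right = seq.rfind(reverse)    # rightmost hit of the right adaptor
        if right != -1:
            end = min(end, right)
    # If the adaptors overlap or are in the wrong order, nothing is left.
    return seq[start:end] if start <= end else ""


for name, seq in [
    ("test1", "TGGCCAATTxxxxxxxxxxTGCTAGCACTAG"),
    ("test2", "ACCGGTTAAxxxxxxxxxxACGATCGTGATC"),
]:
    print(name, clip(seq))
```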
              Last edited by maasha; 03-06-2013, 12:56 AM.



              • #8
                clip off vector border sequence

                Thanks Martin!
                That's what I was trying. Unfortunately I ran into a problem installing Biopieces, related to Ruby. I have not yet sorted it out on my Ubuntu system, and I have posted it in the Google group. I would appreciate it if you could have a look and give some suggestions.
                Thanks a lot again!

                YT
                Last edited by yifangt; 03-04-2013, 07:20 AM.



                • #9
                  Hi Martin!

                  An update on removing vector sequences. Two things I realized I need to pay attention to:
                  1) The -f and -r arguments for the adaptor sequences of the other strand should be swapped relative to your last reply, because the sequences are reverse complemented, i.e. the second find_adaptor command should be:
                  Code:
                  read_fasta -i test.fna | find_adaptor -f TGGCCAATT -r TGCTAGCACTAG | find_adaptor -r ACCGGTTAA -f ACGATCGTGATC
                  2) There seem to be bugs with some adaptor combinations, e.g. seq14, which combines seq01 and seq04 and for which the adaptors should be trimmed off. They were detected but not clipped when the adaptor sequence was right at the end of the read; see >seq03_head_last.
                  An example of what I did is:
                  Code:
                  >seq01
                  AGTCGACCTGCAGGCATGCAAGCTTxxxxxxx111xxxxxxxxxxxxxxxxxxx
                  >seq02
                  XXXXX222XXXXXXXXXXXXXXXXXXXXCTATAGTGTCACCTAAATAGCTTGG
                  >seq03
                  GTGACACTATAGAATACTCAAGCTTXXX333XXXXXXXXXXXXXXXXXX
                  >seq04
                  XXX4444XXXXXXXXXXXXXXXXXXXXXXXXXXGCATGCCTGCAGGTCGACTCTAGAG 
                  >seq12
                  AGTCGACCTGCAGGCATGCAAGCTTxxx111XX222XXXXXXXXXXXXXXCTATAGTGTCACCTAAATAGCTTGG
                  >seq34
                  GTGACACTATAGAATACTCAAGCTTXXX333XXX4444XXXXXXXXXXXXXXGCATGCCTGCAGGTCGACTCTAGAG 
                  >seq13
                  AGTCGACCTGCAGGCATGCAAGCTTxxxxxxx111xxxxxxxxxxxxXXX333XXXXXXXXXXGTGACACTATAGAATACTCAAGCTTxxxxx333
                  >seq14
                  AGTCGACCTGCAGGCATGCAAGCTTxxxxxxx111xxxxxxxxxxXXX4444XXXXXXXXXXXGCATGCCTGCAGGTCGACTCTAGAG 
                  >seq32
                  GTGACACTATAGAATACTCAAGCTTXXX333XXXXXXXXXXXXXXXX222XXXXXXXXXCTATAGTGTCACCTAAATAGCTTGG
                  >seq20
                  xxxxxxxxxxxCTATAGTGTCACCTAAATAGCTTGGXXXXXXX222XXXXXXXXXXXXX
                  >seq03_head_last
                  XXXXXXXXXXXGTGACACTATAGAATACTCAAGCTT
                  >seq03_head_last_n_tail
                  XXXXXXXXXXXGTGACACTATAGAATACTCAAGCTTXXxxxx3xxtailXXXXXXXXX
                  Code:
                  read_fasta -i demo_seq.fa | find_adaptor -f AGTCGACCTGCAGGCATGCAAGCTT -r CTATAGTGTCACCTAAATAGCTTGG | find_adaptor -f GTGACACTATAGAATACTCAAGCTT -r GCATGCCTGCAGGTCGACTCTAGAG  | clip_adaptor
                  The output is:
                  Code:
                  SEQ_NAME: seq01
                  SEQ: xxxxxxx111xxxxxxxxxxxxxxxxxxx
                  SEQ_LEN: 29
                  ADAPTOR_POS_LEFT: 0
                  ADAPTOR_LEN_LEFT: 25
                  ADAPTOR_PAT_LEFT: AGTCGACCTGCAGGCATGCAAGCTT
                  ---
                  SEQ_NAME: seq02
                  SEQ: XXXXX222XXXXXXXXXXXXXXXXXXX
                  SEQ_LEN: 27
                  ADAPTOR_POS_RIGHT: 27
                  ADAPTOR_LEN_RIGHT: 26
                  ADAPTOR_PAT_RIGHT: XCTATAGTGTCACCTAAATAGCTTGG
                  ---
                  SEQ_NAME: seq03
                  SEQ: XXX333XXXXXXXXXXXXXXXXXX
                  SEQ_LEN: 24
                  ADAPTOR_POS_LEFT: 0
                  ADAPTOR_LEN_LEFT: 25
                  ADAPTOR_PAT_LEFT: GTGACACTATAGAATACTCAAGCTT
                  ---
                  SEQ_NAME: seq04
                  SEQ: XXX4444XXXXXXXXXXXXXXXXXXXXXXXXX
                  SEQ_LEN: 32
                  ADAPTOR_POS_RIGHT: 32
                  ADAPTOR_LEN_RIGHT: 26
                  ADAPTOR_PAT_RIGHT: XGCATGCCTGCAGGTCGACTCTAGAG
                  ---
                  SEQ_NAME: seq12
                  SEQ: xxx111XX222XXXXXXXXXXXXX
                  SEQ_LEN: 24
                  ADAPTOR_POS_LEFT: 0
                  ADAPTOR_LEN_LEFT: 25
                  ADAPTOR_PAT_LEFT: AGTCGACCTGCAGGCATGCAAGCTT
                  ADAPTOR_POS_RIGHT: 49
                  ADAPTOR_LEN_RIGHT: 26
                  ADAPTOR_PAT_RIGHT: XCTATAGTGTCACCTAAATAGCTTGG
                  ---
                  SEQ_NAME: seq34
                  SEQ: XXX333XXX4444XXXXXXXXXXXXX
                  SEQ_LEN: 26
                  ADAPTOR_POS_LEFT: 0
                  ADAPTOR_LEN_LEFT: 25
                  ADAPTOR_PAT_LEFT: GTGACACTATAGAATACTCAAGCTT
                  ADAPTOR_POS_RIGHT: 51
                  ADAPTOR_LEN_RIGHT: 26
                  ADAPTOR_PAT_RIGHT: XGCATGCCTGCAGGTCGACTCTAGAG
                  ---
                  SEQ_NAME: seq13
                  SEQ: xxxxx333
                  SEQ_LEN: 8
                  ADAPTOR_POS_LEFT: 62
                  ADAPTOR_LEN_LEFT: 26
                  ADAPTOR_PAT_LEFT: XGTGACACTATAGAATACTCAAGCTT
                  ---
                  SEQ_NAME: seq14
                  SEQ: xxxxxxx111xxxxxxxxxxXXX4444XXXXXXXXXX
                  SEQ_LEN: 37
                  ADAPTOR_POS_LEFT: 0
                  ADAPTOR_LEN_LEFT: 25
                  ADAPTOR_PAT_LEFT: AGTCGACCTGCAGGCATGCAAGCTT
                  ADAPTOR_POS_RIGHT: 62
                  ADAPTOR_LEN_RIGHT: 26
                  ADAPTOR_PAT_RIGHT: XGCATGCCTGCAGGTCGACTCTAGAG
                  ---
                  SEQ_NAME: seq32
                  SEQ: XXX333XXXXXXXXXXXXXXXX222XXXXXXXX
                  SEQ_LEN: 33
                  ADAPTOR_POS_RIGHT: 58
                  ADAPTOR_LEN_RIGHT: 26
                  ADAPTOR_PAT_RIGHT: XCTATAGTGTCACCTAAATAGCTTGG
                  ADAPTOR_POS_LEFT: 0
                  ADAPTOR_LEN_LEFT: 25
                  ADAPTOR_PAT_LEFT: GTGACACTATAGAATACTCAAGCTT
                  ---
                  SEQ_NAME: seq20
                  SEQ: xxxxxxxxxx
                  SEQ_LEN: 10
                  ADAPTOR_POS_RIGHT: 10
                  ADAPTOR_LEN_RIGHT: 26
                  ADAPTOR_PAT_RIGHT: xCTATAGTGTCACCTAAATAGCTTGG
                  ---
                  SEQ_NAME: seq03_head_last
                  SEQ: XXXXXXXXXXXGTGACACTATAGAATACTCAAGCTT
                  SEQ_LEN: 36
                  ADAPTOR_POS_LEFT: 10
                  ADAPTOR_LEN_LEFT: 26
                  ADAPTOR_PAT_LEFT: XGTGACACTATAGAATACTCAAGCTT
                  ---
                  SEQ_NAME: seq03_head_last_n_tail
                  SEQ: XXxxxx3xxtailXXXXXXXXX
                  SEQ_LEN: 22
                  ADAPTOR_POS_LEFT: 10
                  ADAPTOR_LEN_LEFT: 26
                  ADAPTOR_PAT_LEFT: XGTGACACTATAGAATACTCAAGCTT
                  ---
                  You can see that the sequence
                  Code:
                  >seq03_head_last
                  should have been clipped to an empty sequence, since the adaptor is at the end of the read. However, the clipping is correct if extra sequence is attached after the adaptor, cf.
                  Code:
                  seq03_head_last_n_tail
                  Did I miss anything? Thanks!
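The expected behaviour for the two seq03_head_last cases can be written down as a small Python sketch. It assumes exact matching and the rule that everything up to and including a left adaptor is removed; the function name is hypothetical and this is not how clip_adaptor is implemented.

```python
def clip_left(seq: str, adaptor: str) -> str:
    """Remove everything up to and including the first exact hit of
    `adaptor`; return the read unchanged when the adaptor is absent."""
    pos = seq.find(adaptor)
    if pos == -1:
        return seq
    # When the adaptor ends the read, this slice is the empty string.
    return seq[pos + len(adaptor):]


ADAPTOR = "GTGACACTATAGAATACTCAAGCTT"

head_last = "XXXXXXXXXXXGTGACACTATAGAATACTCAAGCTT"
head_last_n_tail = "XXXXXXXXXXXGTGACACTATAGAATACTCAAGCTTXXxxxx3xxtailXXXXXXXXX"

print(repr(clip_left(head_last, ADAPTOR)))
print(repr(clip_left(head_last_n_tail, ADAPTOR)))
```

Under this rule the first read comes out empty, and the second matches the SEQ line reported for seq03_head_last_n_tail above.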
                  Last edited by yifangt; 03-04-2013, 12:41 PM.



                  • #10
                    Thanks yifangt, I will post this to the Biopieces Google Group and answer there.

