  • extract reads matching barcodes from fastq file?

    Hi,
    I'm trying to extract only certain barcoded reads from a multiplexed Illumina run. I tried using the fastx barcode splitter, but my barcodes are of different lengths, so it errors out:
    Code:
    Error: found barcodes in different lengths. this feature is not supported yet.
    Are there other tools for this?
    I started to write a Python script, but it is more difficult than I thought.

    Thanks!

  • #2
    Are these barcodes "inline" i.e. part of the actual sequence?



    • #3
      Yes, they are inline only - not in the header.



      • #4
        I started on a Python script. Maybe others can comment on whether this will work?
        I haven't thoroughly tested it yet.

        Code:
        import sys
        from Bio.SeqIO.QualityIO import FastqGeneralIterator

        def extract_reads(input_filename, barcodes):
            # Build a dictionary mapping each barcode to its sample ID
            barcode_dict = {}
            # The with statement ensures the file is closed afterwards
            with open(barcodes, 'r') as barcodefile:
                for line in barcodefile:
                    if not line.strip():
                        continue  # skip blank lines in the barcode file
                    barcode, sampleID = line.split()
                    barcode_dict[barcode] = sampleID
            # Search the fastq file for reads that start with a barcode and
            # append each match to a file named after its sample ID
            with open(input_filename, 'r') as fastqfile:
                for title, seq, qual in FastqGeneralIterator(fastqfile):
                    # .items() replaces the Python 2-only .iteritems()
                    for barcode, sampleID in barcode_dict.items():
                        if seq.startswith(barcode):
                            with open(sampleID + ".fq", "a") as outputfile:
                                outputfile.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))

        if __name__ == "__main__":
            # Usage: python script.py reads.fq barcodes.txt
            extract_reads(sys.argv[1], sys.argv[2])
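        One thing to watch with barcodes of different lengths: a short barcode can be a prefix of a longer one, so a plain startswith() loop may assign the same read to two samples. Below is a minimal sketch of one way around that, checking the longest barcodes first and stopping at the first match. It uses only the standard library and assumes plain 4-line fastq records (no wrapped sequence lines). The names read_fastq and demultiplex and the toy reads and barcodes are made up for illustration, and the sketch collects reads in memory rather than writing files.

```python
import io

def read_fastq(handle):
    """Yield (title, seq, qual) from a fastq handle with 4 lines per record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file (or a blank line)
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        yield header[1:], seq, qual  # drop the leading '@'

def demultiplex(records, barcode_to_sample):
    """Group records by sample ID, testing the longest barcodes first."""
    ordered = sorted(barcode_to_sample, key=len, reverse=True)
    by_sample = {}
    for title, seq, qual in records:
        for barcode in ordered:
            if seq.startswith(barcode):
                sample = barcode_to_sample[barcode]
                by_sample.setdefault(sample, []).append((title, seq, qual))
                break  # assign each read to at most one sample
    return by_sample

# Toy data (hypothetical): "ACG" is a prefix of "ACGTT"
fastq_text = (
    "@read1\nACGTTTGGA\n+\nIIIIIIIII\n"
    "@read2\nACGGGCCTA\n+\nIIIIIIIII\n"
    "@read3\nTTTTTTTTT\n+\nIIIIIIIII\n"
)
barcodes = {"ACG": "sampleA", "ACGTT": "sampleB"}
result = demultiplex(read_fastq(io.StringIO(fastq_text)), barcodes)
for sample, reads in sorted(result.items()):
    print(sample, [title for title, _, _ in reads])
```

        Here read1 goes to sampleB only, read2 to sampleA, and read3 matches nothing and is dropped. For a real run the per-sample lists would be replaced by open output files, since holding a whole lane in memory is unlikely to be practical.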



        • #5
          I am not sure if this script will work but give it a try: http://creskolab.uoregon.edu/stacks/...ss_radtags.php

          If you are in a position to ask the sequence provider to de-multiplex that data again (even inline barcodes can be done as described in this post: http://seqanswers.com/forums/showthread.php?t=18692) then that may be another possible option.
          Last edited by GenoMax; 01-27-2014, 05:14 PM.



          • #6
            Thanks. Stacks is a great program, but it also cannot process barcodes of different lengths in a single run. You can of course create multiple barcode files and run them sequentially, but that is not ideal.



            • #7
              Hi odoyle81,

              Did you find a solution for a multiplexed Illumina run with barcodes of different lengths? If so, please update.

              Thanks,
              hkm128



              • #8
                As I recall, I created separate keyfiles, one per barcode length, and processed all barcodes of the same length together, then the next set, and so on. Not ideal, but not too much effort.
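                For anyone repeating this, splitting one keyfile into per-length groups can be scripted. This is only a rough sketch: it assumes each keyfile line is "barcode sampleID" separated by whitespace (the layout used by the script in post #4), and the function name and example barcodes are made up. The keyfile format your demultiplexer expects may differ, so check before writing the groups out.

```python
from collections import defaultdict

def split_keyfile_by_length(lines):
    """Group (barcode, sample_id) pairs by barcode length."""
    groups = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        barcode, sample_id = line.split()
        groups[len(barcode)].append((barcode, sample_id))
    return dict(groups)

# Hypothetical keyfile contents
keyfile_lines = ["ACGT s1", "TTGCA s2", "GGCC s3"]
groups = split_keyfile_by_length(keyfile_lines)
for length, pairs in sorted(groups.items()):
    print(length, pairs)
```

                Each group can then be written to its own keyfile and run through the demultiplexer separately.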
