Filtering SOLiD reads

  • Filtering SOLiD reads

    I have 120 million 50 bp SOLiD reads from a eukaryote, and I'd like to remove anything plastid-related. I have the assembled plastid genome, but I need to do the matching in color space, correct? Normally I'd just do this with BLAST. Is there a tool in Corona that will do this?

    Thanks!

  • #2
    1. Align your reads against the plastid genome (I personally like bwa and bfast).
    2. Once you have the alignments, it is trivial to separate reads that come
    from one or the other organism.

    If you want to go the ABI way, use BioScope instead of Corona.
    -drd


    • #3
      Originally posted by drio View Post
      2. Once you have the alignments, it is trivial to separate reads that come
      from one or the other organism.
      Hmm, I beg to differ that it's trivial to separate the reads.
      Getting the IDs of the reads that map to the two different organisms is simple,

      but working with that many reads isn't.
      You will have to use disk-based hash tables or load the sequences into MySQL to sort and extract the reads efficiently.
      http://kevin-gattaca.blogspot.com/


      • #4
        Originally posted by KevinLam View Post
        Hmm, I beg to differ that it's trivial to separate the reads.
        Getting the IDs of the reads that map to the two different organisms is simple,

        but working with that many reads isn't.
        You will have to use disk-based hash tables or load the sequences into MySQL to sort and extract the reads efficiently.
        Sort the reads by read ID and iterate over the two sets, dropping reads that don't map to the organism.
        -drd
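A minimal Python sketch of the approach described in this post, assuming the list of IDs to drop is in the same relative order as the records in the csfasta file, so no comparison key is needed. The function name `drop_listed_reads` and the two-line record layout (a `>` header line followed by one color-space sequence line) are illustrative assumptions, not code from the thread.

```python
def drop_listed_reads(csfasta_lines, drop_ids):
    """Yield csfasta lines, skipping records whose ID is in drop_ids.

    Assumption: drop_ids lists IDs in the same relative order as the
    records appear in csfasta_lines, so one forward pass over each
    input is enough (constant memory, linear time).
    """
    drop_iter = (d.strip() for d in drop_ids)
    next_drop = next(drop_iter, None)
    skipping = False
    for line in csfasta_lines:
        if line.startswith("#"):
            # Pass csfasta comment/header lines through unchanged.
            yield line
            continue
        if line.startswith(">"):
            read_id = line[1:].strip()
            if read_id == next_drop:
                # This record is on the drop list: skip it and advance.
                skipping = True
                next_drop = next(drop_iter, None)
            else:
                skipping = False
        if not skipping:
            yield line


if __name__ == "__main__":
    reads = [">1_1_1_F3\n", "T0123\n", ">1_1_2_F3\n", "T3210\n"]
    for out_line in drop_listed_reads(reads, ["1_1_2_F3"]):
        print(out_line, end="")
```

Only one pending ID is held in memory at a time, so the size of the drop list does not matter. If the two inputs are not in the same order, the sketch silently keeps reads it should have dropped, so the ordering assumption needs checking first.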


        • #5
          Originally posted by drio View Post
          Sort the reads by read ID and iterate over the two sets, dropping reads that don't map to the organism.
          I would love to look at your code if you got it working the way you mentioned.
          As for me:

          I needed to extract 40 million IDs from a 70-million-read csfasta.
          Looping through the csfasta is simple,
          but I ran into memory issues when I stored 40 million IDs in a normal hash.
          So I split the IDs into chunks of 1 million (I think I could get away with 10 million, but it failed intermittently) and iterated over the csfasta 40 times.

          The next implementation will use a disk-based hash so that I only need to loop through the csfasta once.

          So if you got it working the way you said, I would really love to see how I got it wrong.
          http://kevin-gattaca.blogspot.com/
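The single-pass, disk-backed lookup planned in this post could look roughly like the Python sketch below. It swaps in an on-disk SQLite table (standard-library `sqlite3`) as a stand-in for the disk-based hash or MySQL mentioned above; the table name `wanted` and the function names are hypothetical, and the csfasta is assumed to have one `>` header line followed by one sequence line per read.

```python
import sqlite3


def build_id_index(db_path, ids):
    """Store the wanted read IDs in an indexed on-disk SQLite table."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS wanted (read_id TEXT PRIMARY KEY)"
    )
    with conn:  # commit all inserts as a single transaction
        conn.executemany(
            "INSERT OR IGNORE INTO wanted VALUES (?)",
            ((i.strip(),) for i in ids),
        )
    return conn


def extract_reads(csfasta_lines, conn):
    """Yield only the records whose ID is in the index, in one pass."""
    keep = False
    for line in csfasta_lines:
        if line.startswith("#"):
            continue  # drop csfasta comment lines from the extract
        if line.startswith(">"):
            row = conn.execute(
                "SELECT 1 FROM wanted WHERE read_id = ?",
                (line[1:].strip(),),
            ).fetchone()
            keep = row is not None
        if keep:
            yield line


if __name__ == "__main__":
    conn = build_id_index(":memory:", ["1_1_2_F3"])
    reads = [">1_1_1_F3\n", "T0123\n", ">1_1_2_F3\n", "T3210\n"]
    print("".join(extract_reads(reads, conn)), end="")
```

The IDs live on disk rather than in a RAM hash, at the cost of one indexed lookup per read. Passing a file path instead of `":memory:"` keeps the index on disk.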


          • #6
            Originally posted by KevinLam View Post
            I would love to look at your code if you got it working the way you mentioned.
            As for me:

            I needed to extract 40 million IDs from a 70-million-read csfasta.
            Looping through the csfasta is simple,
            but I ran into memory issues when I stored 40 million IDs in a normal hash.
            So I split the IDs into chunks of 1 million (I think I could get away with 10 million, but it failed intermittently) and iterated over the csfasta 40 times.

            The next implementation will use a disk-based hash so that I only need to loop through the csfasta once.

            So if you got it working the way you said, I would really love to see how I got it wrong.
            If the reads are sorted by read name, then why do you need such a complicated hash? You should be able to use constant memory and linear time.


            • #7
              Originally posted by nilshomer View Post
              If the reads are sorted by read name, then why do you need such a complicated hash? You should be able to use constant memory and linear time.
              I didn't actually try to sort the csfasta by read name. I just assumed that task was doomed to fail (GNU sort might work for the IDs, but sorting the csfasta in BioPerl or Biopython would probably run out of memory) and went on to other options.
              I am actually not sure whether they are already sorted coming off the machine.
              http://kevin-gattaca.blogspot.com/


              • #8
                Originally posted by KevinLam View Post
                I didn't actually try to sort the csfasta by read name. I just assumed that task was doomed to fail (GNU sort might work for the IDs, but sorting the csfasta in BioPerl or Biopython would probably run out of memory) and went on to other options.
                I am actually not sure whether they are already sorted coming off the machine.
                They are sorted coming off the machine, so there is no need to re-sort.


                • #9
                  I agree with drio. It's an old, classic computer science problem. Google "intersection of sorted lists". If your lists aren't sorted, then use GNU sort beforehand. You only need to write a shell script; there is no need for huge hashes in RAM.
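For reference, the two-pointer intersection of sorted lists mentioned here can be sketched in a few lines of Python rather than shell. The function name `intersect_sorted` is hypothetical, and the sketch assumes both inputs are sorted under the same ordering that Python's `<` uses on the items (for plain ASCII ID strings, byte order, which is what `LC_ALL=C sort` produces).

```python
def intersect_sorted(a, b):
    """Yield items present in both sorted iterables.

    Assumption: a and b are sorted in the same order that Python's
    `<` gives for their items. Only one item from each input is held
    at a time, so memory use is constant and time is linear.
    """
    it_a, it_b = iter(a), iter(b)
    x, y = next(it_a, None), next(it_b, None)
    while x is not None and y is not None:
        if x == y:
            yield x
            x, y = next(it_a, None), next(it_b, None)
        elif x < y:
            x = next(it_a, None)  # a is behind: advance it
        else:
            y = next(it_b, None)  # b is behind: advance it


if __name__ == "__main__":
    mapped = ["read_a", "read_c", "read_d"]
    all_ids = ["read_a", "read_b", "read_c", "read_d", "read_e"]
    for read_id in intersect_sorted(mapped, all_ids):
        print(read_id)
```

The inputs can be open file handles, so neither list has to fit in RAM. IDs in the machine's native order would need a matching comparison key instead of plain string comparison.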
