
  • #16
    Edit: I should have refreshed before posting; maubp's suggestion is probably easier!

    You might be able to just use grep (grep -v -f ids.file ...), though you would first have to collapse each read name and its sequence onto one line (with awk), pipe that into grep, and then pipe the output back through awk to split the lines apart again.
    Last edited by dpryan; 07-25-2013, 01:34 AM. Reason: Too slow
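For readers who want to see what that collapse, filter, and split round trip does, here is a minimal Python sketch of the same idea. It is an illustration only, not the grep/awk pipeline itself: the function names (`linearize`, `drop_matching`, `unlinearize`) and the toy read names are made up for this example, and the filter imitates fixed-string matching (as with grep's -F option) rather than regular expressions.

```python
def linearize(lines):
    """Collapse each FASTA record to one line: header, a tab, then the sequence."""
    header, seq = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header + "\t" + "".join(seq)
            header, seq = line, []
        elif line and header is not None:
            seq.append(line)
    if header is not None:
        yield header + "\t" + "".join(seq)


def drop_matching(records, ids):
    """Drop records containing any ID as a plain substring (like grep -v -F -f)."""
    return [rec for rec in records if not any(i in rec for i in ids)]


def unlinearize(records):
    """Split each collapsed record back into a header line and a sequence line.

    Assumes headers contain no tab characters.
    """
    out = []
    for rec in records:
        header, seq = rec.split("\t", 1)
        out.extend([header, seq])
    return out


# Toy input; the read names are invented for illustration.
fasta = [
    ">read1 1:N:0:TGTCAA", "ACGT",
    ">read2 1:N:0:AGACCA", "GGCC",
    ">read10 1:N:0:TGTCAA", "TTAA",
]

cleaned = unlinearize(drop_matching(linearize(fasta), ["read2"]))
print("\n".join(cleaned))

# Caveat: substring matching over-matches. Removing "read1" also removes "read10".
over_filtered = unlinearize(drop_matching(linearize(fasta), ["read1"]))
print(len(over_filtered) // 2, "record(s) left")
```

The last two lines show why substring matching is risky for read IDs: one ID can be a prefix of another, so matching on the exact ID is the safer route.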



    • #17
      Thank you...
      but could you please be a little more specific? I am very new to this field and still learning. I also need to decide whether to use the split files or whether I can use my big file (which contains ~20 million reads).

      Thanks in advance for any advice and suggestions.



      • #18
        So grep is not working, or at least not in this case, so I'll try to explain better.

        1. I have a FASTA file split into several files (each file contains 3 million reads) with lines like this:
        >D3P26HQ1:180:c0yj8acxx:4:2208:3279:56003 1:N:0:TGTCAA
        CCTCACCAGCCGCACGAACACGCCCCCGCTGAGCAAGCATCCCGTGGCGTCAGCGGATGAGCGACGCGGAGACAGCACCTGACCCATGTTGATGTAGTGT
        >D3P26HQ1:180:c0yj8acxx:4:1108:21179:84973 1:N:0:AGACCA
        CACGTCTGAACTCCAGTCACACAGTGATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAGAAAGAAAGAAGAAACAGTGGGAGAGTGGGGGGGACGGAG
        >D3P26HQ1:180:c0yj8acxx:4:2103:16692:4396 1:N:0:TGTCAA
        GATCGGAAGAGCACACGTCTGAACTCCAGTCACAGTCAAATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAACAACACCAAAAAGGTGAAGAGATCGATA
        >D3P26HQ1:180:c0yj8acxx:4:2307:9878:25361 1:N:0:GGCTAA
        ACAGCAATAACTGTGCCGCCATCGTCAGAATATTGGCGGGCGATTTTCATGATTTGAATTTTGTGACGAATATCTAAGCTTGAGATTGGCTAGATCTGAA

        2. A text file with the IDs of the reads I want to remove:
        D3P26HQ1:180:c0yj8acxx:4:2316:9843:31035
        D3P26HQ1:180:c0yj8acxx:4:2316:9844:63006
        D3P26HQ1:180:c0yj8acxx:4:2316:9885:5144
        D3P26HQ1:180:c0yj8acxx:4:2316:9888:45894
        D3P26HQ1:180:c0yj8acxx:4:2316:9914:29032

        What I want to do is remove all the reads whose IDs are present in the text file.

        Can anyone help me sort this out?

        Thanks!



        • #19
          It'll be easier for you to use the python script that maubp linked to.
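In case it helps to see the idea behind such a script, here is a minimal sketch in plain Python. It is not maubp's script (that one isn't reproduced in this thread). The function names `load_ids`, `read_id`, and `filter_fasta` are hypothetical, and the sketch assumes the read ID is everything between '>' and the first whitespace, as in the headers shown in #18. The sequences in the example are shortened placeholders.

```python
def load_ids(lines):
    """Build a set of IDs from an iterable of lines, skipping blank lines."""
    return {line.strip() for line in lines if line.strip()}


def read_id(header):
    """Return the read ID: the text after '>' up to the first whitespace."""
    fields = header[1:].split()
    return fields[0] if fields else ""


def filter_fasta(lines, ids_to_remove):
    """Yield only the lines of records whose ID is NOT in ids_to_remove.

    Works on any iterable of lines, so a large file is streamed rather than
    loaded into memory. Sequences wrapped over several lines are handled,
    because the keep/skip decision persists until the next header.
    """
    keep = False
    for line in lines:
        if line.startswith(">"):
            keep = read_id(line) not in ids_to_remove
        if keep:
            yield line


# Headers taken from post #18; sequences shortened for the example.
fasta = [
    ">D3P26HQ1:180:c0yj8acxx:4:2208:3279:56003 1:N:0:TGTCAA\n",
    "CCTCACCAGC\n",
    ">D3P26HQ1:180:c0yj8acxx:4:1108:21179:84973 1:N:0:AGACCA\n",
    "CACGTCTGAA\n",
]
ids = load_ids(["D3P26HQ1:180:c0yj8acxx:4:1108:21179:84973\n", "\n"])

kept = list(filter_fasta(fasta, ids))
print("".join(kept), end="")
```

With real files, both inputs could be opened with `open(...)` and passed in directly, since the functions only iterate over lines. Only the ID set has to fit in memory, so the same approach should apply to either the split files or the single large file.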



          • #20
            I usually use filter_fasta.py from QIIME for this purpose.
            savetherhino.org



            • #21
              Thanks rhinoceros, that sounds nice. But instead of keeping the sequences in the list, can I just discard them? Or create two files, one with the kept reads and one with the discarded ones?

              filter_fasta.py -f inseqs.fasta -o list_filtered_seqs.fasta -s seqs_to_keep.txt



              • #22
                Originally posted by flacchy:
                Thanks rhinoceros, that sounds nice. But instead of keeping the sequences in the list, can I just discard them? Or create two files, one with the kept reads and one with the discarded ones?

                filter_fasta.py -f inseqs.fasta -o list_filtered_seqs.fasta -s seqs_to_keep.txt

                To keep the listed sequences:
                filter_fasta.py -f inseqs.fasta -o list_filtered_seqs.fasta -s seqs_to_keep.txt
                To discard the listed sequences instead (-n negates the ID list):
                filter_fasta.py -f inseqs.fasta -o list_filtered_seqs.fasta -s seqs_to_remove.txt -n


                Note that it's not a standalone solution: it has dependencies, so you need to have QIIME installed (which I highly recommend, because it comes with a ton of other useful stuff too).
                Last edited by rhinoceros; 07-25-2013, 06:06 AM.
                savetherhino.org
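If QIIME isn't available, the keep/discard split asked about in #21 could be sketched in plain Python along these lines. This is a rough stand-in and not how filter_fasta.py is implemented. `partition_fasta` is a hypothetical name, the read names are toy data, and the read ID is assumed to be the first whitespace-delimited token after '>'.

```python
import io


def partition_fasta(src, ids, kept_out, discarded_out):
    """Write each record to kept_out or discarded_out in a single pass.

    src is any iterable of lines; the two outputs are writable file objects.
    Records whose ID is in ids go to discarded_out, all others to kept_out.
    Any lines before the first header are written to kept_out.
    """
    out = kept_out
    counts = {"kept": 0, "discarded": 0}
    for line in src:
        if line.startswith(">"):
            fields = line[1:].split()
            rid = fields[0] if fields else ""
            if rid in ids:
                out = discarded_out
                counts["discarded"] += 1
            else:
                out = kept_out
                counts["kept"] += 1
        out.write(line)
    return counts


# Toy data held in memory; real use would pass files opened with open(...).
src = io.StringIO(
    ">r1 1:N:0:TGTCAA\nACGT\n"
    ">r2 1:N:0:AGACCA\nGGCC\n"
    ">r3 1:N:0:GGCTAA\nTTAA\n"
)
kept, discarded = io.StringIO(), io.StringIO()
counts = partition_fasta(src, {"r2"}, kept, discarded)
print(counts)
```

Writing both outputs in one pass avoids reading a large input twice. The returned counts give a quick check that kept plus discarded equals the number of input reads.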



                • #23
                  Thanks so much, I'll try. Just one more thing... I know I'm a bit of a pain, but I seriously only started 3 months ago and there are tons of things I need to learn.

                  Could I use just this: filter_fasta.py -f inseqs.fasta -o list_filtered_seqs.fasta -s seqs_to_remove.txt -n
                  and get a clean file with only the reads that are not present in the ID list?



                  • #24
                    Originally posted by flacchy:
                    Thanks so much, I'll try. Just one more thing... I know I'm a bit of a pain, but I seriously only started 3 months ago and there are tons of things I need to learn.

                    Could I use just this: filter_fasta.py -f inseqs.fasta -o list_filtered_seqs.fasta -s seqs_to_remove.txt -n
                    and get a clean file with only the reads that are not present in the ID list?

                    Yes, but like I said, the script is not standalone; it relies on other QIIME components, so you need to have QIIME installed. If you happen to be on Mac OS X, I highly recommend MacQIIME, which is painless to install.
                    savetherhino.org



                    • #25
                      We do have QIIME installed on the Bio-Linux platform.



                      • #26
                        Oh my, THANK YOU so much, rhinoceros, it worked! ^_^

