  • Sorting fasta file according to header

    Hi there,
    I have a fasta file like this:
    Code:
    [zillur@genomics filter]$ head new_12.fasta 
    >000000M00365:7:000000000-A48JK:1:1110:10044:9619
    TACGGAGGGTGCAAGCGTTATCCGGAATCACTGGGTTTAAAGGGTGCGTAGGCGGATATATAAGTCAGAGGTGAAAGCTCGCAGCTTAACTGCGGAATTGCCTTTGATACTGTTTATCTTGAATTATGTTGAGGTTAGCGGAATGAGTCAT
    >000000M00365:7:000000000-A48JK:1:2105:14983:8496
    TACGGAGGGGGTTAGCGTTGTTCGGAATTACTGGGCGTAAAGCGTACGTAGGCGGATTGGAAAGTATGGGGTGAAATCCCAGGGCTCAACCCTGGAACTGCCCTGTAAACTATCAGTCTAGAGTTCTGGAGAGGTGAGTGGAATTGCTAGG
    >000000M00365:7:000000000-A48JK:1:2113:12381:28279
    TACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCGCGTAGGCGGTTTGATAAGTCAGATGTGAAATCCCCGGGCTTAACCTGGGAACTGCATTTGATACTGTCAGACTAGAGTATGTTAGAGGAATGCGGAATTCCGGGT
    >000001M00365:7:000000000-A48JK:1:1110:15899:9619
    TACGAACTGTGCAAACGTTATTCGGAATCACTGGGCTTAAAGGGTGCGTAGGCGGGTTTGTAAGTCAGAGGTGAAAGTTTGCAGCTTAACTGTAAAATTGCCTTTGAAACTGTAGAACTTGAGTAGCGTTGAGGTCAGCGGAATGTGACAT
    >000001M00365:7:000000000-A48JK:1:2105:15157:8497
    TACGAAGGTCCCAAGCGTTATTCGGAATCACTGGGCGTAAAGGGAGCGTAGGCGGCGTGGAAAGTCAGATGTGAAATCTCAAGGCTCAACCTTGAAACTGCATCCGATACTTCCATGCTAGAGGACTGGAGAGGTGTTTGGAATTATCGGT
    I want to sort this file by its header information. How can I do this?

    Best Regards
    Zillur

  • #2
    Can you be more specific about which header information? Alphabetical sorting?


    • #3
      Thank you very much. Alphabetically or numerically, whichever is more convenient.

      Best Regards
      Zillur


      • #4
        And the reason you want to do this, if I may ask?


        • #5
          Thanks.
          And the reason you want to do this, if I may ask?
          Yeah, sure. I wanted to create a FASTQ file from my .qual and FASTA files using QIIME, but it gave me:
          Code:
          KeyError: 'QUAL header (M00365:7:000000000-A48JK:1:1101:14885:1320) does not match FASTA header (M00365:7:000000000-A48JK:1:1101:16466:1388)
          My qual file contains many other sequences in addition to the ones in my FASTA file, so I think sorting may resolve the issue. I would appreciate your suggestions.

          Best Regards
          Zillur


          • #6
            I guess sort on Linux will work, as long as every record is exactly two lines (header, then sequence).
            Code:
            cat file.fasta | paste - - | sort | sed 's/\t/\n/g'
            Try this.
            Persistent LABS
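The one-liner above pairs lines two at a time, so it only works when each sequence sits on a single line. A minimal sketch that first joins wrapped sequence lines and then sorts by header might look like the following. The function name `sort_fasta_by_header` is hypothetical, and the sketch assumes that no tab characters occur in the file.

```shell
# Hypothetical helper: sort FASTA records by their header line.
# Assumes no tab characters occur in headers or sequences.
sort_fasta_by_header() {
    awk '
        # On each header, flush the previous record as "header<TAB>sequence".
        /^>/ { if (hdr != "") print hdr "\t" seq; hdr = $0; seq = ""; next }
        # Join wrapped sequence lines into one string.
        { seq = seq $0 }
        END { if (hdr != "") print hdr "\t" seq }
    ' "$1" |
        LC_ALL=C sort -t "$(printf '\t')" -k1,1 |   # byte-order sort on the header only
        tr '\t' '\n'                                 # back to one header line, one sequence line
}
```

For example, `sort_fasta_by_header new_12.fasta > new_12.sorted.fasta` (the output name is only an illustration). Setting `LC_ALL=C` makes the order independent of the system locale.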


            • #7
              The following is untested, but you could give it a try and see if it works. It may avoid the need to sort. You will find reformat.sh in the BBMap suite.

              Code:
              reformat.sh in=your_fasta_file.fa qfin=your_qual_file.qual out=fastq_format_file.fq


              • #8
                Thank you very much. I have tried this:
                cat file.fasta|paste - -|sort|sed 's/\t/\n/g'
                But it doesn't resolve everything:
                Code:
                (qiime191) [zillur@genomics final]$ head new_sorted_1.fasta 
                >M00365:7:000000000-A48JK:1:1101:10000:14343
                TACGGAGGGGGCTAGCGTTGTTCGGAATTACTGGGCGTAAAGCGTGCGTAGGCGGATTATTAAGTTAGGGGTGAAATCCCGAGGCTCAACCTCGGAACTGCCCTTAAAACTGTTGGTCTTGAGTTCTGGAGAGGTGAGTGGAATTGCTAGT
                >M00365:7:000000000-A48JK:1:1101:10000:18084
                TACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCGCGTAGGCGGCTAGGTCAGTCAGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATTTGATACTGCCTAGCTAGAGTATGTTAGAGGAATGCGGAATTCCAGGT
                >M00365:7:000000000-A48JK:1:1101:10000:25105
                TACGAAGGGGGCTAGCGTTGTTCGGAATTACTGGGCGTAAAGAGTTCGTAGGCGGGTTATTAAGTCAGATGTGAAATCCCAGGGCTCAACCTTGGAACTGCATTTGAAACTGGTAACCTAGAGACTAGGAGAGGTCAGTGGAATACCGAGT
                >M00365:7:000000000-A48JK:1:1101:10000:5055
                CACGTAGGGGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGGGCTCGTAGGCTGTTCAGTAAGTCAGGTGTGAAAATCCAAGGCTCAACCTTGGGACGCCACCTGATACCGCTGTGACTAGAGTCCGGTAGAGGAGATTGGAATTCCTGG
                >M00365:7:000000000-A48JK:1:1101:10001:16084
                TACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCGCGTAGGCGGCTAGGTCAGTCAGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATTTGATACTGCCTAGCTAGAGTATGTTAGAGGATTGCGGAATTCCAGGT
                reformat.sh gives me:
                Code:
                [zillur@genomics final]$ ./bbmap/reformat.sh in=new_15.fasta qfin=qual_.1.qual out=f_nw_15_ql_.1.fq
                java -ea -Xmx111g -cp /home/zillur/Desktop/zillur/yadira/study_1799_split_library_seqs_and_mapping/filter/final/bbmap/current/ jgi.ReformatReads in=new_15.fasta qfin=qual_.1.qual out=f_nw_15_ql_.1.fq
                Executing jgi.ReformatReads [in=new_15.fasta, qfin=qual_.1.qual, out=f_nw_15_ql_.1.fq]
                
                Input is being processed as unpaired
                Exception in thread "Thread-1" java.lang.AssertionError: Quality and Base headers differ for read 0
                	at stream.FastaQualReadInputStream.toReadList(FastaQualReadInputStream.java:128)
                	at stream.FastaQualReadInputStream.toReads(FastaQualReadInputStream.java:110)
                	at stream.FastaQualReadInputStream.fillBuffer(FastaQualReadInputStream.java:94)
                	at stream.FastaQualReadInputStream.hasMore(FastaQualReadInputStream.java:54)
                	at stream.ConcurrentGenericReadInputStream$ReadThread.readLists(ConcurrentGenericReadInputStream.java:643)
                	at stream.ConcurrentGenericReadInputStream$ReadThread.run(ConcurrentGenericReadInputStream.java:635)
                What should I do now?

                Best Regards
                Zillur


                • #9
                  When you sorted the FASTA file, did you also sort the qual file?

                  Originally posted by zillur
                  In my qual file I have many other sequences including my fasta.
                  What do you mean by having other sequences in your qual file?


                  • #10
                    If you have BioPerl ≥ 1.6.922 and Sort::Naturally, then

                    https://github.com/douglasgscofield/...ipts/fastaSort

                    shows how to sort on sequence name, using natural sort as it seems you require.
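Without BioPerl, a similar order for these particular Illumina-style headers can be approximated with the field keys of `sort`. This is only a sketch and not the linked script's method. It assumes two-line records and colon-separated headers whose fourth to seventh fields are numeric, as in the examples posted in this thread. The function name `natural_sort_fasta` is hypothetical.

```shell
# Hypothetical helper: sort two-line FASTA records so that the numeric parts
# of colon-separated Illumina-style headers are compared as numbers.
# Assumes exactly one header line and one sequence line per record.
natural_sort_fasta() {
    paste - - < "$1" |
        # Fields 1-3 are compared as text, fields 4-7 as numbers.
        LC_ALL=C sort -t: -k1,3 -k4,4n -k5,5n -k6,6n -k7,7n |
        tr '\t' '\n'
}
```

With this ordering, a header ending in `:10000:5055` comes before one ending in `:10000:14343`, whereas a plain alphabetical sort places it after `:10000:25105`.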


                    • #11
                      Originally posted by zillur
                      Thank you very much. I have tried this, but it doesn't resolve everything.
                      The sort example sorted your data alphabetically. If you sort your qual file the same way, I think you will get the same order of headers.
                      Persistent LABS
