  • westerman
    Rick Westerman
    • Jun 2008
    • 1104

    #31
    It might help us if you showed that the file is indeed not empty. How about an 'ls -l' on the file, or an 'od -c yourfile.sff | head --lines 4', or the actual command you sent to SeqIO.convert, so that we can be sure that you did pass your file to it?
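For anyone who would rather do that check from Python, here is a rough sketch of the same idea: report the file size (like 'ls -l') and peek at the first four bytes (like 'od -c'), which in an SFF file should be the magic number ".sff". The function name and the returned dictionary are made up for illustration, and it does not validate the rest of the header.

```python
import os


def describe_sff(path):
    """Return the file size and whether the file starts with the SFF magic bytes.

    Hypothetical helper for illustration only: it inspects just the first
    four bytes and does not check the rest of the SFF header.
    """
    size = os.path.getsize(path)
    with open(path, "rb") as handle:
        magic = handle.read(4)
    return {"size": size, "looks_like_sff": magic == b".sff"}
```

Calling describe_sff("yourfile.sff") on an empty file gives a size of 0 and looks_like_sff set to False, which would point to a problem before SeqIO.convert is ever involved.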


    • lplough81
      Junior Member
      • Feb 2012
      • 7

      #32
      Hi,
      I was actually able to get it to run today. Not sure what the problem was yesterday. But I got some funny results anyhow. Some of the nucleotides are uppercase and some are lowercase. This caused problems for some of the Galaxy fastx tools that summarize quality data.

      Any thoughts?

      @HH42GP401CAJLD
      gactagactcgacgtGTACTCAGGCTCGCACCGTGGCATGTCGCACTGTACTCAAGGCTCGCACCGTGGCATGTCGCACTGTACTTAAGGCTCACACCGTGGCATGTCGCACTGTACTCAAGGCACACAGGGGntaggnn
      +
      IIIIIIIIIIIIIIIIIIIGD666IIIIIIIIGDDDIIIIIIIIIIIIIIIGB;;;;IIIGGGGGCC>>>CIHID@@@C==:99==GGIIIIHIIIIIIIGGGCCCHIDDDC@777@C>1111AA@>;84445!;:44!!
      @HH42GP401B4BC5
      gactagactcgacgtGCAGTAGCTGCAATGGCGCAGAAGGCGTGCTTCtctctcncacgcacacacgagagagagngnnn
      +
      FFFFFFFFFFFFFFFIIIIIIIIIFFFFDDAAAB?<4444<>>9422323663/!//5///59=///2222////!2!!!

      The code that I ran is here (117,221 is the right number of reads for this file):
      >>> SeqIO.convert("454Reads.JA11255_155_RL13.sff", "sff", "untrimmed.fastq", "fastq")
      117221


      • maubp
        Peter (Biopython etc)
        • Jul 2009
        • 1544

        #33
        Originally posted by lplough81
        Hi,
        I was actually able to get it to run today. Not sure what the problem was yesterday. But I got some funny results anyhow. Some of the nucleotides are uppercase and some are lowercase.
        You'll see the same from Roche's own tools. The lower case letters are the parts which would be trimmed off as adapters or low quality bases.

        Originally posted by lplough81
        This caused problems for some of the Galaxy fastx tools that summarize quality data.

        Any thoughts?
        That could be an oversight in fastx - ask them about it.

        Or, what you probably want to do is ask for the trimmed sequences (which will be all upper case):

        Code:
        SeqIO.convert("454Reads.JA11255_155_RL13.sff", "sff-trim", "trimmed.fastq", "fastq")
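If all you have is the untrimmed FASTQ and not the SFF file, a rough equivalent is to strip the lower case flanks yourself. This is only a sketch: it assumes the lower case bases appear only at the two ends of each read, and it is not what Biopython does internally (that applies the clip positions stored in the SFF file). The function name and the toy read are made up for illustration.

```python
def trim_lowercase_flanks(seq, qual):
    """Drop leading and trailing non-uppercase bases, keeping qualities in step.

    Assumes the lower case letters mark the regions flagged for clipping
    and that they only occur at the ends of the read.
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    start = 0
    while start < len(seq) and not seq[start].isupper():
        start += 1
    end = len(seq)
    while end > start and not seq[end - 1].isupper():
        end -= 1
    return seq[start:end], qual[start:end]


# Toy read (not from the real data): 4 clipped bases, 9 kept, 2 clipped.
trimmed_seq, trimmed_qual = trim_lowercase_flanks("acgtACGTNACGTnn", "!!!!IIIIIIIII!!")
print(trimmed_seq, trimmed_qual)  # ACGTNACGT IIIIIIIII
```

When the SFF file is available, the "sff-trim" conversion is still the simpler and safer route.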


        • lplough81
          Junior Member
          • Feb 2012
          • 7

          #34
          OK!

          Got it. Fairly new work for me, so I appreciate the patient replies. Can I specify the quality cutoff for trimming? Or what is the default that the biopython fastq trimmer uses?

          Thanks again.

          LP


          • maubp
            Peter (Biopython etc)
            • Jul 2009
            • 1544

            #35
            Originally posted by lplough81
            Got it. Fairly new work for me, so I appreciate the patient replies. Can I specify the quality cutoff for trimming? Or what is the default that the biopython fastq trimmer uses?
            There are two things to consider - getting rid of the adapter sequences and quality trimming. Roche does a good job of this as part of the base calling and production of the SFF file. When reading SFF files, Biopython (and other tools like sff_extract and Roche's own tools) will just apply the trimming information recorded in the SFF file. Using the Roche trimming is usually fine.

            You may need to further trim off PCR primers or other library specific adapters if the Roche software wasn't told about them.

            You may decide to further apply some quality cutoff trimming as well. This may be a good idea for some downstream analysis, not for others.

            It is possible to do this kind of trimming in Biopython, but not in one line. There are some examples in the tutorial. I've also written some SFF trimming tools using Biopython, which are available from the Galaxy Tool Shed (that may be of interest if your institute runs its own Galaxy instance).

            There are also other tools which will do it for you - especially if you want to work with the FASTQ file (or FASTA+QUAL) instead of the SFF file.
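To give a flavour of what a simple quality cutoff looks like, here is a minimal sketch in plain Python working directly on the FASTQ sequence and quality strings. It assumes Sanger style FASTQ (PHRED+33, which is what Biopython's "fastq" format uses) and trims only the 3' end. The cutoff of 20 is an arbitrary choice for illustration, not a default used by Biopython or Roche, and the function name is hypothetical.

```python
def trim_3prime_by_quality(seq, qual, min_phred=20, offset=33):
    """Remove bases from the 3' end while their PHRED score is below min_phred.

    Assumes Sanger FASTQ encoding (ASCII offset 33). The threshold is an
    illustrative assumption, not a recommended or default value.
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_phred:
        end -= 1
    return seq[:end], qual[:end]


# Toy read: 'I' is PHRED 40, '5' is PHRED 20, '!' is PHRED 0.
print(trim_3prime_by_quality("ACGTACGT", "IIII5!!!"))  # ('ACGTA', 'IIII5')
```

Whether this sort of extra trimming helps depends on the downstream analysis, as noted above, so treat it as a starting point and not a recipe.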


            • lplough81
              Junior Member
              • Feb 2012
              • 7

              #36
              how to trim FASTA name

              Hi,
              Is there a simple way to reduce the FASTA name, e.g. from
              "> HH42GP401CAJLD length=118 xy=0823_0287 region=1 run=R_2012_01_27_13_59_03_ "

              to ">HH42GP401CAJLD"?

              Similar to converting an SFF file to FASTA with Biopython's SeqIO.convert(), but taking a FASTA file as the input and then outputting another FASTA file?

              Thanks,

              Louis


              • maubp
                Peter (Biopython etc)
                • Jul 2009
                • 1544

                #37
                Try something like this, untested:
                Code:
                from Bio import SeqIO
                
                in_file = "example.fasta"
                out_file = "new.fasta"
                file_format = "fasta"
                
                def remove_descr(record):
                    record.description=""
                    return record
                
                #This is a generator expression - not all in memory at once!
                wanted = (remove_descr(r) for r in SeqIO.parse(in_file, file_format))
                count = SeqIO.write(wanted, out_file, file_format)
                print("Saved %i records" % count)


                • SES
                  Senior Member
                  • Mar 2010
                  • 275

                  #38
                  Originally posted by lplough81
                  Hi,
                  Is there a simple way to reduce the FASTA name, e.g. from
                  "> HH42GP401CAJLD length=118 xy=0823_0287 region=1 run=R_2012_01_27_13_59_03_ "

                  to ">HH42GP401CAJLD"?
                  I don't think you need a script for that. If your file is "454reads.fas" then just do:
                  Code:
                  sed 's/\s.*//' 454reads.fas > 454reads_trimmedheader.fas
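One caveat with the sed one-liner: it deletes everything from the first whitespace character onwards, so if the headers really do have a space straight after the ">" (as in the quoted example) they would be cut down to a bare ">". The `\s` shorthand may also not be understood by every version of sed. Here is a hedged alternative in plain Python that only touches header lines and copes with the optional space. The function name is invented for illustration, and the file names in the usage note simply reuse the ones above.

```python
def shorten_fasta_headers(lines):
    """Yield FASTA lines with each header cut down to its first word.

    Sequence lines are passed through unchanged. A space between '>' and
    the identifier is tolerated.
    """
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            identifier = fields[0] if fields else ""
            yield ">" + identifier + "\n"
        else:
            yield line


example = ["> HH42GP401CAJLD length=118 xy=0823_0287 region=1\n", "ACGT\n"]
print("".join(shorten_fasta_headers(example)), end="")
```

To run it on a file, open "454reads.fas" for reading and "454reads_trimmedheader.fas" for writing, then pass the input handle to shorten_fasta_headers() and give the result to writelines() on the output handle.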


                  • coolFlame
                    Junior Member
                    • Jun 2012
                    • 1

                    #39
                    Originally posted by kmcarr
                    Nice catch drio, thanks. One of those really subtle things you don't catch until you work with a different set of files.

                    Eugeni, sorry I didn't get back to you on this; got really crushed at work. I have uploaded a modified version of the script incorporating drio's fix.
                    @kmcarr: I found your script very useful. I am currently an MSc Bioinformatics student working on an assignment which involves developing a web interface to a small mapping pipeline. This is purely for educational purposes. Would I be allowed to use your script to prepare the fastq file for the pipeline?
                    I would really appreciate it.


                    • Sunil Bhavsar
                      Junior Member
                      • Feb 2015
                      • 1

                      #40
                      I am a beginner at using Perl. When I try to use the FastaQual2fastq.pl script to make my fastq file, it gives this error: readline() on closed filehandle.
                      Please help me find the right way to supply the FASTA sequence file.
