Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • bioinfosm
    Senior Member
    • Jan 2008
    • 483

    Human reference sequence and tools like MAQ

    How do people use that?

    Use each chromosome separately or combine the fasta sequences and run through maq or tool of choice .. I had this issue earlier when I used contigs and the headers were not too parseable by tools. Any standards on the chromosomes to use, edit their headers before using them as reference..

    It is just taking me too long to manage the reference and am looking for a more efficient way to do so !
    --
    bioinfosm
  • wesb
    Junior Member
    • Jul 2008
    • 6

    #2
    Reference

    For Maq, I split the reference by chromosome and submit each Maq run for a given chromosome to a different node on my cluster. With Soap, I do the same thing. It also works for me to split the reference into 5 chunks for Soap runs. The cluster I was using couldn't handle Soap if I give it all my reads and all the chromosomes at once. I had memory problems. It works fine if you split it up in chunks, or split up your reads as well. I use the hg18 chromosomes from UCSC Genome Browser. The headers are simple chr1, chr2 ...
    Wes Beckstead
    Predoctoral Fellow in Bioinformatics
    Boston University Partnership with NIH
    [email protected]

    Comment

    • lh3
      Senior Member
      • Feb 2008
      • 686

      #3
      For maq, one should not split the reference genome; otherwise the mapping quality will be useless and you cannot run subsequent analyses such as mapmerge and assemble.

      If you feel maq is too slow, you can split reads into 2-million chunk. It will take 5-7 CPU hours to align one chunk to the human reference, which is not fast but acceptable.

      Comment

      • zee
        NGS specialist
        • Apr 2008
        • 249

        #4
        I agree with Ih3 by not splitting the reference genome up. With any tool if you do this you'll need to combine the results and sort them post-alignment.
        I've been using maq, soap, eland and novoalign. Eland's quite fast if you want exact, 1 or 2 mismatches and you dont care about quality. Building the index for eland is also very quick.
        There is a script with the maq package that will do the automated chunk mapping. If you use novoalign do some neatening up of the fasta headers in your genome, it will save downstream perl/sed jobs and indexing of a full human genome on at least 8Gb server will take about 4-5 min.

        Comment

        • mukatira
          Junior Member
          • Jun 2008
          • 2

          #5
          Genome Viewer - Annotations

          Hello,
          I would like to know if any of you are able to view annotations along with the assemblies. Any information about which Genome Viewer would be best for such viewing will be appreciated.

          I am looking at data from whole genome sequencing effort using 454 FLX.

          Cheers

          Sm

          Comment

          • zee
            NGS specialist
            • Apr 2008
            • 249

            #6
            Originally posted by mukatira View Post
            Hello,
            I would like to know if any of you are able to view annotations along with the assemblies. Any information about which Genome Viewer would be best for such viewing will be appreciated.

            I am looking at data from whole genome sequencing effort using 454 FLX.

            Cheers

            Sm
            Hi Mukatira,

            I would recommend the UCSC genome browser or gbrowse. Gbrowse will be much easier to set up. I would suggest that you align all your reads and calculate the read density to see which regions are significantly higher than others. Then you can load in the density - wiggle tracks (as developed by UCSC) into either one of these genome browsers along with your read annotations.
            For Gbrowse you should be using mysql/postrgresql back-end databases for better performance over in-memory dbs.
            Use DAS or upload public human gene, conservation, etc, public annotation tracks from public browsers if they are available.

            Hope this helps.

            Comment

            • kmay
              Member
              • Aug 2008
              • 29

              #7
              Originally posted by mukatira View Post
              Hello,
              I would like to know if any of you are able to view annotations along with the assemblies. Any information about which Genome Viewer would be best for such viewing will be appreciated.

              I am looking at data from whole genome sequencing effort using 454 FLX.

              Cheers

              Sm
              Mukatira,

              if you have your tags clustered and not more than 10 million regions, you can upload them into ElDorado with a mouse click in RegionMiner

              Downside is, that is is not a public database. However, you can have unlimited free access for up to two weeks after registering
              So think about the questions you want to ask beforehand and register then. After two weeks it will be gone...

              With more than 10 mio regions, you would require the GGA on site, which delivers all data and programs locally.


              Split human reference genome
              We developed our own mapping. We do not split the reference genome for many good reasons. Most fast methods are hash tree based where a split would not make much sense. We pre-processed the genomes extensively, analysed for shortest unique information content and load "the genome" into main memory in it´s entirety ( 32 Gig memory required). Then we map the reads. Any tag length above 9bp, number of point mutations and indels allowed.

              Pretty fast. 10 million Solid tags, perfect and unique in less than 30 minutes. Increasing number of allowed indels slows down considerably, but usually in a couple of hours it´s done.
              For RNA-seq we use an additional reference: all potential splice junctions pre-calculated.

              Cheers

              Klaus

              Comment

              • zee
                NGS specialist
                • Apr 2008
                • 249

                #8
                Hmmm, 10 million reads in 30 minutes. So you're doing that with 1 CPU ? That's incredibly fast, about 5,555 reads/second. Whoah! I bet indexing the genome and not the reads means that performance will be linear.

                Originally posted by kmay View Post
                Mukatira,

                if you have your tags clustered and not more than 10 million regions, you can upload them into ElDorado with a mouse click in RegionMiner

                Downside is, that is is not a public database. However, you can have unlimited free access for up to two weeks after registering
                So think about the questions you want to ask beforehand and register then. After two weeks it will be gone...

                With more than 10 mio regions, you would require the GGA on site, which delivers all data and programs locally.




                We developed our own mapping. We do not split the reference genome for many good reasons. Most fast methods are hash tree based where a split would not make much sense. We pre-processed the genomes extensively, analysed for shortest unique information content and load "the genome" into main memory in it´s entirety ( 32 Gig memory required). Then we map the reads. Any tag length above 9bp, number of point mutations and indels allowed.

                Pretty fast. 10 million Solid tags, perfect and unique in less than 30 minutes. Increasing number of allowed indels slows down considerably, but usually in a couple of hours it´s done.
                For RNA-seq we use an additional reference: all potential splice junctions pre-calculated.

                Cheers

                Klaus

                Comment

                • kmay
                  Member
                  • Aug 2008
                  • 29

                  #9
                  Hmmm, 10 million reads in 30 minutes. So you're doing that with 1 CPU ? That's incredibly fast, about 5,555 reads/second. Whoah! I bet indexing the genome and not the reads means that performance will be linear.
                  Zee,

                  in fact a "best unique match" search is now is even faster. See the Helicos forum for some more detail on what that means. I will post some benchmarks in a couple of weeks.

                  Yes, we do it on one CPU. In fact, my earlier posting was outdated. The GMS has tody comes with 64 GB main memory. So... no desktop system


                  Yes, is scales linear.

                  Cheers

                  Klaus

                  Comment

                  • bioinfosm
                    Senior Member
                    • Jan 2008
                    • 483

                    #10
                    Originally posted by zee View Post
                    Hi Mukatira,

                    I would recommend the UCSC genome browser or gbrowse. Gbrowse will be much easier to set up. I would suggest that you align all your reads and calculate the read density to see which regions are significantly higher than others. Then you can load in the density - wiggle tracks (as developed by UCSC) into either one of these genome browsers along with your read annotations.
                    For Gbrowse you should be using mysql/postrgresql back-end databases for better performance over in-memory dbs.
                    Use DAS or upload public human gene, conservation, etc, public annotation tracks from public browsers if they are available.

                    Hope this helps.
                    Zee - is there more information on how one could go about setting up Gbrowse on aligned reads as a viz tool. ANy conversion tools from alignment, say MAQ map, to .bed or wiggle that will be useful to see?
                    --
                    bioinfosm

                    Comment

                    • zee
                      NGS specialist
                      • Apr 2008
                      • 249

                      #11
                      Hi,

                      It's not that difficult actually.

                      First you should convert your alignments to gene feature format (GFF). Then you should also have all your reference genome sequence definitions in GFF as well. There are some good templates

                      You will need mysql/postgresql setup with database create/file/read permissions. A gbrowse.conf file for your database will also need to exist. A good example is http://ftp.hapmap.org/jimwatsonseque...wsequence.conf

                      1) Create your gbrowse database in Mysql
                      2) Use the bp_load_gff.pl or bp_bulk_load_gff.pl to load GFF files of your read alignments, chromosome names, genes, etc
                      3) make sure you have the source and method columns for each GFF file specific enough to define each track. Have a look at the fgroup table in the gbrowse schema.
                      4)Edit the gbrows.conf file to reflect you track names (method:source)
                      5) Ensure your gbrowse.conf has the correct username/hostname/password to connect to the database containing your data.

                      If all the data was loaded correctly it should be straight forward to see the tracks. I usually edit an existing configuration file.

                      Some example files can be found on:






                      I have some scripts to do novoalign->gff conversion. For maq you could probably take a look at maqview (from the maq page) because it's quite a good browser and reads the .map file directly.

                      Originally posted by bioinfosm View Post
                      Zee - is there more information on how one could go about setting up Gbrowse on aligned reads as a viz tool. ANy conversion tools from alignment, say MAQ map, to .bed or wiggle that will be useful to see?

                      Comment

                      • Layla
                        Member
                        • Sep 2008
                        • 58

                        #12
                        chunks

                        Originally posted by lh3 View Post
                        For maq, one should not split the reference genome; otherwise the mapping quality will be useless and you cannot run subsequent analyses such as mapmerge and assemble.

                        If you feel maq is too slow, you can split reads into 2-million chunk. It will take 5-7 CPU hours to align one chunk to the human reference, which is not fast but acceptable.
                        I know this thread is quite old but useful for me nonetheless. I have 20 million reads (10 million pairs) to align. If I split this file into 2 million chunks do I need to ensure that the pairs stay together in each chunk, i.e. this 2 million chunk will have 1 million pairs or does this not matter?

                        Secondly, are you able to refer me to the command that can do this splitting within maq?

                        Thank you

                        L

                        Comment

                        • Jonathan
                          Member
                          • Jun 2009
                          • 36

                          #13
                          From the maq manual:
                          Code:
                          maq fastq2bfq [-n nreads] in.read.fastq out.read.bfq|out.prefix
                          
                          Convert reads in FASTQ format to Maq’s BFQ (binary FASTQ) format.
                          
                          OPTIONS:
                          -n INT 	number of reads per file [not specified]
                          As both end'-readfiles are converted separately, I guess you won't do anything wrong if you are using the same 'n' value for both 'end'files.

                          Followups:
                          maq map steps for all bfq-parts (using PE-corresponding part files of course)
                          followed by
                          maq mapmerge out.aln.map in.aln1.map in.aln2.map [...]

                          see the maq-docu for this, too.

                          Best
                          -Jonathan

                          Comment

                          • zee
                            NGS specialist
                            • Apr 2008
                            • 249

                            #14
                            Layla,

                            You should have each member of the pair in a separate file. In other words whatever you're going to be splitting needs to be done on each of these files containing mate 1 and mate 2.
                            The maq.pl easyrun command should take care of the process. I would recommend splitting into chunks of size 2 million, therefore it should do 5 iterations.

                            Assuming you have 1 side in reads1.fq and the other in reads2.fq

                            Something like:

                            maq.pl easyrun db.bfa reads1.fq reads2.fq


                            should work fine.

                            Originally posted by Layla View Post
                            I know this thread is quite old but useful for me nonetheless. I have 20 million reads (10 million pairs) to align. If I split this file into 2 million chunks do I need to ensure that the pairs stay together in each chunk, i.e. this 2 million chunk will have 1 million pairs or does this not matter?

                            Secondly, are you able to refer me to the command that can do this splitting within maq?

                            Thank you

                            L

                            Comment

                            • Layla
                              Member
                              • Sep 2008
                              • 58

                              #15
                              Thankyou Jonathon and Zee,

                              I guess this should be good enough. Split both files into 2 million chunks and then run the following command 5 times..
                              ./maq match out.map all_chrom.bfa 2million_5prime.bfq 2million_3prime.bfq

                              Any ideas how long this should take on a 32GB Mac OS?

                              Running the above command for all 20million reads in one go took 7 days!

                              Whilst on this subject, is anyone able to guide me with a specific data filtering issue before I use Batman?

                              Cheers again

                              L

                              Comment

                              Latest Articles

                              Collapse

                              • SEQadmin2
                                From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
                                by SEQadmin2


                                Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


                                The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
                                ...
                                Yesterday, 10:05 AM
                              • SEQadmin2
                                Single-Cell Sequencing at an Inflection Point: Early Impacts of New Platforms and Emerging Trends
                                by SEQadmin2


                                With the launch of new single-cell sequencing platforms in 2026, the field stands at an exciting inflection point. This article surveys the most impactful advances in the field and discusses how they’re reshaping research in cancer, immunology, and beyond.


                                Introduction

                                Single-cell sequencing technologies have undergone remarkable advances over the past decade, transitioning from low-throughput experimental approaches to highly scalable platforms capable of...
                                05-22-2026, 06:42 AM
                              • SEQadmin2
                                Environmental Genomics in the Age of NGS: From Microbes to Conservation Strategies
                                by SEQadmin2

                                Studying ecosystems means dealing with complex, multi-species communities that are hard to observe at scale. This complexity, however, hides many important questions to be answered, from how biogeochemical cycles work and how climate change can affect species distribution to how conservation strategies can work best.


                                Genomics, particularly since the expansion of NGS, has transformed ecosystem ecology. By sequencing environmental DNA, we can now assess biodiversity without direct...
                                05-06-2026, 09:04 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by SEQadmin2, Yesterday, 12:03 PM
                              0 responses
                              19 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, Yesterday, 11:40 AM
                              0 responses
                              14 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 05-28-2026, 11:40 AM
                              0 responses
                              29 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 05-26-2026, 10:12 AM
                              0 responses
                              31 views
                              0 reactions
                              Last Post SEQadmin2  
                              Working...