  • nilshomer
    replied
    You have to use the dwgsim read generator; otherwise dwgsim_eval gets confused. The read name convention can be found at http://sourceforge.net/apps/mediawik...ome_Simulation. I am working on documentation, but you are welcome to help (go open source)!

    A Perl hash would be the easiest to implement, so try that.
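
A minimal sketch of the hash approach, written in Python rather than Perl (a dict plays the role of the Perl hash). It assumes that the part of the read name shared by all aligners is the SOLiD-style triple of integers seen in the example records, that the input is tab-separated SAM text (for example, the output of `samtools view`), and that strand is ignored. The function names `core_id`, `parse_record`, `index_reads`, and `matching_reads` are hypothetical and not part of any existing tool.

```python
import re

# Assumption: the shared read ID is three integers joined by "_", optionally
# preceded by an aligner prefix and followed by a tag such as "_F3".
CORE_ID = re.compile(r"(\d+_\d+_\d+)(?:_[A-Za-z]\w*)?$")


def core_id(read_name):
    """Strip aligner-specific prefixes and suffixes from a read name."""
    match = CORE_ID.search(read_name)
    return match.group(1) if match else read_name


def parse_record(line):
    """Return (core ID, reference, 1-based position), or None for
    header lines and unmapped reads."""
    if line.startswith("@"):
        return None
    fields = line.rstrip("\n").split("\t")
    if int(fields[1]) & 0x4:  # SAM flag bit 0x4: read is unmapped
        return None
    return core_id(fields[0]), fields[2], int(fields[3])


def index_reads(sam_lines):
    """Build the hash for one file: core ID -> (reference, position)."""
    index = {}
    for line in sam_lines:
        record = parse_record(line)
        if record is not None:
            read_id, chrom, pos = record
            index[read_id] = (chrom, pos)
    return index


def matching_reads(index, sam_lines, tolerance=1):
    """Yield reads of a second file mapped within `tolerance` bp of the
    position stored for the same read in `index`."""
    for line in sam_lines:
        record = parse_record(line)
        if record is None:
            continue
        read_id, chrom, pos = record
        hit = index.get(read_id)
        if hit and hit[0] == chrom and abs(hit[1] - pos) <= tolerance:
            yield read_id, chrom, hit[1], pos


# The example records from the first post, truncated after the mate fields.
bwa = ["prefix_3_30_738\t0\tchr8\t11162354\t37\t48M\t*\t0\t0"]
bioscope = ["3_30_738\t0\tchr8\t11162353\t100\t50M\t*\t0\t0"]
novoalign = ["3_30_738_F3\t0\tchr8\t11162353\t150\t50M\t*\t0\t0"]

bwa_index = index_reads(bwa)
print(list(matching_reads(bwa_index, bioscope)))
# [('3_30_738', 'chr8', 11162354, 11162353)]
```

Only one file is held in memory; the second is streamed, so the hash size is bounded by the number of mapped reads in the first file. Raising `tolerance` would also report reads placed in a wider neighborhood.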


  • epigen
    replied
    storing read information: hash versus tree

    Thanks for the answer, Nils. I tried dwgsim_eval, but it rejected my reads because they were not created by dwgsim and are single-end:

    In function "process_bam": Warning[OutOfRange]. Variable/Value: 614 1806 266.
    Message: [dwgsim_eval] read was not generated by dwgsim?.
    ***** Warning *****
    ************************************************************
    ************************************************************
    In function "run": Fatal Error[OutOfRange]. Message: Found a read that was not paired.
    ***** Exiting due to errors *****
    ************************************************************

    By the way, some documentation on the DNAA tools would be really helpful. As far as I can tell, dwgsim_eval only computes statistics for a single BAM or SAM file. What I want to do is explicitly count how many reads were mapped to a similar coordinate by two different tools, and output these reads.
    For this purpose, I would obviously have to store the read information of one file. In a thread about duplicate removal, someone recommended storing it in a trie instead of one of these RAM-greedy Perl hashes. As far as I remember from my computer science lectures, a trie (prefix tree) does not necessarily need less space than a hash. Looking up a key of length n in a trie is O(n), whereas a hash lookup is O(1) on average. Also, efficiency depends on how the tree structure is implemented.
    Now when it comes to putting that theoretical knowledge into code, read IDs are different from the strings usually used to demonstrate how suffix trees find matches. In my opinion, a giant Perl hash shouldn't be much of a problem with 32+ GB RAM available, even if one may call this approach "quick and dirty".

    What do the experts out there think?

    Barbara
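
To make the trie-versus-hash comparison above concrete, here is a small illustrative trie in Python. The class name `ReadTrie` and the sample read IDs are hypothetical. It shows that a lookup walks one node per character of the read ID, and that space is only saved where IDs share prefixes.

```python
class ReadTrie:
    """Minimal character trie mapping read IDs to values (illustration only)."""

    _END = object()  # sentinel key marking the end of a stored read ID

    def __init__(self):
        self.root = {}
        self.node_count = 1  # counts the root node

    def insert(self, read_id, value):
        node = self.root
        for ch in read_id:
            if ch not in node:
                node[ch] = {}
                self.node_count += 1
            node = node[ch]
        node[self._END] = value

    def get(self, read_id, default=None):
        node = self.root
        for ch in read_id:  # one step per character: O(len(read_id))
            node = node.get(ch)
            if node is None:
                return default
        return node.get(self._END, default)


trie = ReadTrie()
flat = {}  # the plain hash, for comparison
for read_id, pos in [("3_30_738", 11162354), ("3_30_739", 11162400), ("3_31_12", 11170000)]:
    trie.insert(read_id, pos)
    flat[read_id] = pos

print(trie.get("3_30_738"), flat["3_30_738"])  # both give 11162354
print(trie.node_count)  # 14 nodes for 23 characters of keys
```

In this Python sketch every trie node is itself a dict, so the trie would likely use more memory than the single flat dict. A memory saving would need a compact node layout and read IDs with long shared prefixes.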


  • nilshomer
    replied
    Yes, I developed the "dwgsim" toolset in DNAA (http://dnaa.sf.net). The "dwgsim" tool creates simulated reads, the "dwgsim_eval" tool gives mapping sensitivity/accuracy statistics, and the "dwgsim_pileup_eval.pl" script gives the sensitivity/accuracy of variant calling after samtools. Let me know if this works.


  • epigen
    started a topic Compare mapped reads from different aligners

    I want to count the number of (single-end) reads that were mapped to approximately the same coordinates by different aligners.
    The problem is that the reads do not have identical IDs, and their coordinates may be shifted by 1 bp (SOLiD reads mapped with BWA), for example:

    BWA:
    prefix_3_30_738 0 chr8 11162354 37 48M * 0 0 ...

    ABI BioScope:
    3_30_738 0 chr8 11162353 100 50M * 0 0 ...

    NovoalignCS:
    3_30_738_F3 0 chr8 11162353 150 50M * 0 0 ...

    The reads are in sorted, indexed BAM files. Of course, I could change the read IDs and coordinates to find exact matches with Picard CompareSAMs, but I'd like to avoid redundancy, reduce computational time, and also output the matching reads. Besides, I'm interested in finding reads that may be aligned within a certain neighborhood.
    Has anyone already developed a tool that can handle this? If not, what would be the most efficient strategy?

    Thank you in advance for any advice!

    Barbara
