  • Finding the genomic location of an insert

    Is there some way to use RNA-seq and/or whole-genome sequencing data (I have both for the relevant samples) to find the genomic location of an insert whose location is unknown? The insert itself is of known sequence, and aligns correctly to a reference containing only itself plus some minor control sequences.

    I was told that one thing I might do is to align my data to the reference containing only the insert sequences, but with my (paired-end) data split in two, i.e. aligning one mate at a time as single-end reads (the "..._1" files and the "..._2" files separately). I should then take the names of all the reads that align and use them to subset the other original FASTQ file, so that I get their mates (i.e. subset "..._2" by the reads that aligned from "..._1"), and align those mates to the normal reference genome, again as single-end. I would then, hopefully, get reads aligning to the same region, and I would know the location of my insert (after which I could design some PCR primers and validate the result).

    I have done this with my WGS data, but the reads map more or less randomly across all chromosomes... I suspect I might be subsetting the read names incorrectly, mostly because I'm not sure exactly how the reads are named and how to match up the pairs properly. At the moment, this is what I'm doing:

    Code:
    (... alignment with BWA)
    
    samtools view mapped.sorted.rmdup.input_1.bam | \
    	gawk '{print $1}' | \
    	sort | \
    	uniq > unique.txt
    
    fastqutils filter -whitelist unique.txt input_2.fastq > 1-to-2.fastq
    Am I doing something wrong with the analysis, or is the idea somehow flawed? I am being fairly stringent in the first alignment step, using the -B 40 -O 60 -E 10 options (with BWA), in the hope of aligning only near-exact matches (I have also tried without this stringency, with more or less the same results).

    Does anybody have any idea what I'm doing wrong, what's wrong with the idea, or have any other idea on how to find an unknown insert?
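
For readers following the mate-selection step described above, here is a minimal sketch of one way to do it in plain Python. It is an illustration, not the poster's pipeline: the function names (`normalize_name`, `mapped_read_names`, `filter_fastq`) are hypothetical, and it makes two assumptions that may or may not apply to the data in question: that the alignment file may still contain unmapped records (flag bit 0x4), which must be skipped, and that read names in the two FASTQ files may differ only by a trailing `/1` or `/2` or by a comment after the first space.

```python
import re


def normalize_name(name):
    """Reduce a FASTQ header or SAM QNAME to a bare read name.

    Drops a leading '@', anything after the first whitespace,
    and a trailing /1 or /2 mate suffix.
    """
    tokens = name.split()
    if not tokens:
        return ""
    bare = tokens[0]
    if bare.startswith("@"):
        bare = bare[1:]
    return re.sub(r"/[12]$", "", bare)


def mapped_read_names(sam_lines):
    """Collect names of reads that actually aligned (flag bit 0x4 unset).

    Expects SAM text lines, e.g. the output of `samtools view -h`.
    """
    names = set()
    for line in sam_lines:
        if line.startswith("@"):  # SAM header line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue
        if int(fields[1]) & 0x4:  # read is unmapped, so skip it
            continue
        names.add(normalize_name(fields[0]))
    return names


def filter_fastq(fastq_lines, keep):
    """Yield 4-line FASTQ records whose normalized name is in `keep`."""
    record = []
    for line in fastq_lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            if normalize_name(record[0]) in keep:
                yield tuple(record)
            record = []
```

If unmapped reads are not excluded when the names are collected, every read name ends up in the whitelist and the mates will naturally map all over the genome, so that is worth ruling out first.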

  • #2
    This is quite difficult in general and leads to false positive hits in my experience.

    It's difficult to estimate how many false positives you can expect without knowing the read length and the genome size / repetitiveness.

    Maybe you've tried this, but doing a couple of de novo assemblies and looking for any flanking genomic regions around your insert would probably be more helpful. If these flanks are mappable and unique in the genome, then that is good evidence.
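
As a rough illustration of the "flanking regions" idea, the sketch below pulls out the sequence on either side of the insert from an assembled contig. It is a simplification under stated assumptions: the insert is assumed to appear in the contig as an exact, uninterrupted match on one strand or the other (a real analysis would more likely align the insert to the contigs with an aligner), and the names `insert_flanks` and `flank_len` are made up for this example.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    table = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    return seq.translate(table)[::-1]


def insert_flanks(contig, insert, flank_len=200):
    """Find `insert` in `contig` and return its flanking sequence.

    Returns (left_flank, right_flank, strand), with both flanks given
    in the contig's own orientation, or None if the insert is not
    found as an exact match on either strand.
    """
    contig = contig.upper()
    insert = insert.upper()
    for strand, query in (("+", insert), ("-", reverse_complement(insert))):
        pos = contig.find(query)
        if pos != -1:
            end = pos + len(query)
            left = contig[max(0, pos - flank_len):pos]
            right = contig[end:end + flank_len]
            return left, right, strand
    return None
```

The returned flanks could then be aligned to the reference genome; if both map uniquely and next to each other, that supports the candidate insertion site.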


    • #3
      Ah, interesting... I have never done a de novo assembly before, at either the genomic or the transcriptome level. I assume you're advising that I do it at the genomic level? Could you point me towards some tool(s) that I could use for this?


      • #4
        For RNA-seq, a good de novo tool is Trinity. For genomic assemblies, perhaps ABySS, Minia or SOAPdenovo might suit your needs. If you have no experience, perhaps you can find these on a Galaxy instance somewhere, maybe at iPlant. I think Sweden also has a very good infrastructure set up that you could get time on (I forget what it's called).
