  • Backspaces in Chip-Seq data file

    Hi,

    I am new to NGS data analysis and I have some ELAND output files (specifically the sorted.txt file) which I am planning to analyse using MACS. However, MACS keeps falling over with a "Strand information can not be recognized in this line" error. I have deduced that this is due to backspace characters which have appeared between some fields in my file: because MACS can't find a tab between those fields, it complains that the line is not in the correct format.

    Here is the offending line: (see the '^H' between the 0 and 1)

    HWI-EAS486 23 1 97 11471 15019 0^H1 CAGGGTCACCCAGAGTGAGTGTGAAGCCAGCCTGAGATC hhYghhhhhhhggfhhghhhgghghghhhghghhhdfch chr10.fa 80424503 F 34G1C1A 6

    Here is the same line as output by the MACS error (backspace represented as \x08, hex I think):
    HWI-EAS486\t23\t1\t97\t11471\t15019\t0\x081\tCAGGGTCACCCAGAGTGAGTGTGAAGCCAGCCTGAGATC\thhYghhhhhhhggfhhghhhgghghghhhghghhhdfch\tchr10.fa\t\t80424503\tF\t34G1C1A\t6","34G1C1A


    Does anyone have any idea how to replace these ^H (\x08) backspace characters with tabs? The problem I have is that there are numerous occurrences of ^H in the file which are legitimate.

    Any help or advice would be very useful.

    Thanks

  • #2
    regular expressions

    I would assume that you're running into a software bug in ELAND. I've only used data coming from it a couple times and don't recall ever running into backspaces.

    Anyway, you could just replace the backspaces using regular expressions in your preferred programming language (or even perl from the command line, if you're into that sort of thing).

    For example, in Python, something like the following would replace all backspaces with tabs. A small amount of editing would restrict it to only the ones you want.
    Code:
    import re

    # "\x08" is the backspace character (^H). It is spelled out here rather than
    # written as "\b" because in a raw-string pattern \b means a word boundary.
    with open("your_file.txt", "r") as f, open("a_new_file.txt", "w") as of:
        for line in f:
            of.write(re.sub("\x08", "\t", line))
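    The "small amount of editing" mentioned above could look something like the sketch below. It assumes (this is my guess from the example line, not something taken from the ELAND documentation) that the stray backspaces only ever appear in the fields before the read sequence, and that the legitimate ones sit in or after the quality string. The function name fix_line is made up for illustration.

```python
import re

BACKSPACE = "\x08"  # the ^H character

# Assumption: the first tab-delimited field made up only of base calls
# (A, C, G, T, N or '.') is the read sequence.
SEQUENCE_FIELD = re.compile(r"\t([ACGTN.]+)\t")


def fix_line(line):
    """Replace backspaces with tabs, but only before the read sequence."""
    match = SEQUENCE_FIELD.search(line)
    if match is None:
        # No recognisable sequence field, so leave the line untouched.
        return line
    head = line[:match.start()]
    tail = line[match.start():]
    # Backspaces in the tail (e.g. in the quality string) are kept as they are.
    return head.replace(BACKSPACE, "\t") + tail
```

    You would then call of.write(fix_line(line)) inside the loop instead of the blanket substitution. It is worth checking a handful of output lines by eye before feeding the file to MACS.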
    Assuming that the backspaces are only ever replacing a single character and you know (or can look up) the ELAND file format, you could instead just use a regex to parse the various fields, with . matching whichever single character (tab or backspace) separates them:
    Code:
    re.search(r"(HWI-EAS\d+).(\d+).(\d+).(\d+).(\d+).(\d+).(\d+).(\d+).([ACGT]+)...",line)
    Or something along those lines. There are a lot of ways one could do that. If you're uncomfortable doing that sort of thing yourself then you can probably find someone to write a short script for you in return for a pint or two of decent beer.
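    To make the field-parsing idea a bit more concrete, here is a rough sketch that parses the leading fields and writes them back out joined by tabs. The layout (instrument name, seven numeric fields, then the read) is inferred from the single example line in the first post, so treat it as an assumption and adjust it to the real format. The separator is restricted to tab or backspace rather than any character, and rebuild_line is just a name I picked.

```python
import re

SEP = r"[\t\x08]"  # a field separator: a real tab, or a stray backspace

# Assumed layout: instrument name, seven numeric fields, read sequence,
# then everything else on the line (kept verbatim, newline included).
LEADING_FIELDS = re.compile(
    r"(HWI-EAS\d+)" + (SEP + r"(\d+)") * 7 + SEP + r"([ACGTN.]+)(.*)",
    re.DOTALL,
)


def rebuild_line(line):
    """Return the line with its leading fields re-joined by tabs.

    Returns None when the line does not fit the assumed layout, so the
    caller can decide whether to skip it or report it.
    """
    match = LEADING_FIELDS.match(line)
    if match is None:
        return None
    *fields, rest = match.groups()
    return "\t".join(fields) + rest
```

    Lines that come back as None are the ones that don't fit the assumed layout, which is a handy way of finding out whether the backspaces are doing anything stranger than standing in for a single tab.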
