Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Xi Wang
    replied
    Originally posted by m!x View Post
    Thanks for the quick reply. I tried to put the results of gene and number of reads in the UCSC custom track, and found out that there are a lot of reads that are mapping to 3' UTR when I viewed the track in genome browser. This becomes the problem when I started to do the subsequent statistical analysis.
    There are a lot of genes that are said to be significantly induced/repressed, but actually most of the reads are in the 3' UTR, and almost none of the reads are in the exons.

    Have you found this kind of phenomenon before? What do you suggest to correct this?

    Thanks!
    Yes, we have observed the similar phenomenon in our data, that in the 3'UTR the read coverage was significantly higher than other regions. We think this non-uniform read distribution is largely due to sample degradation during the library preparation.

    As we don't have a biology backgroud, we are wondering whether this explanation is right or not. Only you know what caused this phenomenon, you can correct this distribution bias correctly.

    We are also seeking for biology guys to explain the reasons. Thanks!

    Leave a comment:


  • m!x
    replied
    Thanks for the quick reply. I tried to put the results of gene and number of reads in the UCSC custom track, and found out that there are a lot of reads that are mapping to 3' UTR when I viewed the track in genome browser. This becomes the problem when I started to do the subsequent statistical analysis.
    There are a lot of genes that are said to be significantly induced/repressed, but actually most of the reads are in the 3' UTR, and almost none of the reads are in the exons.

    Have you found this kind of phenomenon before? What do you suggest to correct this?

    Thanks!

    Leave a comment:


  • Xi Wang
    replied
    Originally posted by m!x View Post
    Hi,

    Is the output for number of reads in each from getGeneExp function specific to the exons only? I used the refFlat format for the gene annotation.

    Thanks!
    Yes, only reads in exons are counted. If you want to count reads in introns, make your own refFlat files.

    Leave a comment:


  • m!x
    replied
    Hi,

    Is the output for number of reads in each from getGeneExp function specific to the exons only? I used the refFlat format for the gene annotation.

    Thanks!

    Leave a comment:


  • Xi Wang
    replied
    Dear DEGseq users and DEGseq potential users,

    Thank you for taking into account of using DEGseq. As the update of R platform, DEGseq also updates to its version 1.2.2, which changed quite a few from its version 1.1.x. Here, I’d like to suggest the users (1) download the most recent version with its “reference manual”, (2) following the examples in the manual, apply DEGseq’s functions to the test data, (3) replace the data set, and apply DEGseq to your own data. This procedure also gives a rapid way to diagnose commonly encountered mistakes: if step (2) works well and (3) not, please check the formats of the input files by comparing your files with the test files.

    For further bugs or questions regarding the programming, please send emails to my colleague Likun Wang via [email protected] in a more direct and rapid manner. For further scientific questions or more general questions, you may leave messages here seeking for helps from the community.

    Thanks again for your attention on DEGseq.

    Leave a comment:


  • pr0t3us
    replied
    Hi there,

    I've a problem with DEGseq command.... I'm using solid data and I have prepared all files....
    I have this error

    > DEGseq(MAP_SANI, MAP_NT, fileFormat = "bed",refFlat = REFFLAT, outputDir = outputDir, method = "LRT")
    Please wait...

    mapResultBatch1:
    /local/database/HUMAN_GENOME_GRCh37/PROJECTS/TOMANIN_RNA/RESULTS/BED_FORMAT/sani_BED.txt
    mapResultBatch2:
    /local/database/HUMAN_GENOME_GRCh37/PROJECTS/TOMANIN_RNA/RESULTS/BED_FORMAT/NT_BED.txt
    file format: bed
    refFlat: /home/alex/PROGETTI/TOMANIN_RESULT/PROVA_DAG/Total_human_annotation.ucsc
    Ignore the strand information when count the reads mapped to genes!
    Count the number of reads mapped to each gene ...
    This will take several minutes, please wait patiently!
    Please wait...

    SampleFiles:
    /local/database/HUMAN_GENOME_GRCh37/PROJECTS/TOMANIN_RNA/RESULTS/BED_FORMAT/sani_BED.txt
    Count the number of reads mapped to each gene.
    This will take several minutes.
    Please wait ...

    *** caught segfault ***
    address 0xfffffffffffffff8, cause 'memory not mapped'

    Traceback:
    1: .C(".getGeneExp", as.character(refFlat), as.character(mapResultBatch), as.integer(length(mapResultBatch)), as.character(output), as.character(fileFormat), as.integer(readLength), as.integer(needStrand), as.numeric(min.overlapPercent))
    2: getGeneExp(mapResultBatch1, fileFormat, readLength, strandInfo, refFlat, GeneExp1)
    3: DEGseq(MAP_SANI, MAP_NT, fileFormat = "bed", refFlat = REFFLAT, outputDir = outputDir, method = "LRT")

    Possible actions:
    1: abort (with core dump, if enabled)
    2: normal R exit
    3: exit R without saving workspace
    4: exit R saving workspace
    Selection: 1
    aborting ...
    Segmentation fault

    Can someone help me ?!?

    Alessandro

    Leave a comment:


  • Xi Wang
    replied
    Thanks, Davis. You give us a very clear and detailed way to analyze RNA-seq data, especially if one want to compare genes’ differential expression in two experimental conditions. Here, I still have some minor points to complete.

    1. To process RNA-seq short read data, one may be suggested to use Tophat or SpliceMap for mapping the short reads against the reference genome (e.g., the human genome). These tools have the ability to detect splicing junctions, even novel junctions. As mentioned by Daves, Bowtie is a very fast mapper, but if one wants to apply Bowtie directly to RNA-seq data, one should prepared junction sequences (put two near ends of adjacent exons together to form a new sequence), in order to detect potential junctions.

    2. SAM format is recommended to save the read mapping results, as it can be used in general cases and reserves lots of information. If one only focuses on DE, we provide a very simple BED-like format for user to choose. We also provide a PERL script (if cannot access, contact me) to convert the general SAM format to the BED-like format. This BED-like format can be directly fed to DEGseq.

    3. As DEGseq can directly deal with 6-colunm BED-like files, no others additional tools are required. The embedded C++ code is able to process read counting swiftly.

    4. DEGseq can not only give users a list of DE genes, also provide figures showing the loci of DE genes in the M-A plot, which offers the researchers a direct visual impression on the DE genes identified.

    5. In the DEGseq Bioconductor package, the function samWrapper, which imports the very popular SAM method in microarray studies, can deal with the RNA-seq data with biological replicates. Rather than the strategy involved in edgeR, samWrapper uses an empirical approach to estimate the FDR of DE gene identification.

    6. For high level analyses, given the DE gene list by DEGseq, users may have very flexible options to choose tools according to their specific scientific questions. GO term enrichment, pathway enrichment, and gene set enrichment analyses are all under users’ choice. GOseq, specifically designed for RNA-seq (and other techniques may result in selection bias), is recommended.

    Hope the above information helps. I’d like to help researchers in the community with data analysis or other problems that I am able to.

    Cheers!
    Last edited by Xi Wang; 05-27-2010, 07:02 PM.

    Leave a comment:


  • Davis McC
    replied
    DE Analysis pipeline

    Hi David

    I am one of the developers for the Bioconductor package edgeR , which is designed for carrying out differential expression analysis of count data (like RNA-seq). Check out the User's Guide for more details and case studies to provide examples on how to use the package.

    Colleagues of mine suggest the following sort of steps to go from raw RNA-seq short read data from the raw fasta files, through to GO category testing.

    It sounds like you would jump in somewhere in the middle, since you have output files already from tophat and cufflinks. NB: similar sort of approach to that suggested by Xi, just a little more fleshed-out.

    Steps with required tools & files

    To perform the entire analysis, the following steps and tools will be needed:

    1. Get some short read RNA-seq data, for at least two different experimental conditions you wish to compare

    2. Choose a reference to map against, and map your data using a short read mapper that outputs in SAM format. We tend to use bowtie. Other options are bwa, SOAP2, novoalign, shrimp.

    3. Use SAMtools to convert SAM output into the binary BAM format, which is both smaller on disk and allows for fast indexing.

    4. Summarize reads on the gene/transcript/exon level. We use the R platform with the Rsamtools and GenomicFeatures packages.

    5. Calculate DE genes from counts summarized on the gene level. We use the R package edgeR, which we have developed, although there are other tools out there, such as DEGseq. edgeR can account for biological variation in the data (using a negative binomial model), separate biological from technical variation, produce an MDS plot, and conduct exact testing procedures.

    6. Perform GO category testing on the results of the differential expression analysis, using the R package goseq.

    Considerations for DE Analysis

    You should find the visualization tools in DEGseq useful, but you may find that the Poisson/variable sampling rate model does not adequately model the variability in the data. Extra-Poisson variation (or overdispersion) is typical of RNA-seq data, especially if there is biological replication amongst your samples. If you only have technical replicates then this may not be an issue, but I would recommend running your data through edgeR to get some idea of the inter-library variability. If you have overdispersed data, then using a Poisson model will *drastically* overestimate the levels of differential expression in your data. Using a NB model like in edgeR can account for this extra variation in the data and give much better assessment of DE.

    DEGseq links in quite well with edgeR, so you might look at your data first with DEGseq then use edgeR to deal with overdispersion in the data, investigate inter-library (incl. biological) variability and get exact p-values for DE based on the NB model.

    Hope that is helpful and good luck with your data analysis. Please ask if you have any more questions I might be able to help with.

    Best regards
    Davis

    Leave a comment:


  • Xi Wang
    replied
    Hi David,

    You can first use this script to convert the SAM files output by Tophat to BED files, and then feed the BED files to DEGseq. See the example of function DEGseq in the manual.
    Any further questions, just let me know.

    Leave a comment:


  • David Lyon
    replied
    Hi Xi or anyone

    I have the tophat and cufflinks output files from mapping RNA seq against the annotated genome from different tissues. As I am a newbie can someone provide the recommended method in a step by step guide how to use the results for input into DEGseq and process them by DEGseq to get the differential expression from the different tissues.

    Thank you

    Leave a comment:


  • Xi Wang
    replied
    Originally posted by apratap View Post
    Just following up on this educating thread on RNA-Seq calculation.

    Is it fine to normalize by gene length or should we use transcript length( i.e sum of exons) and NOT sum of (exons+introns) for a gene

    Thanks !
    -Abhi
    Hi Abhi,

    If you just want to identify differentially expressed genes, as is suggested in the DEGseq manuscript, you'd better not normalize the raw read counts. The reason is that the hypothesis tests (including Fisher's exact test, likelihood ratio test, and our MARS) are based on the random sampling model, and this model requires the raw counts as input to correctly calculate p-values. You are right that, according to the known gene annotation, only reads in the exonic regions should be counted.
    Alternatively, if you want to estimate gene expression levels, you can refer to the concept RPKM. A newly published NBT paper made great efforts to clarify the correct way to estimate gene expression.
    Hope this helps.

    Leave a comment:


  • apratap
    replied
    Just following up on this educating thread on RNA-Seq calculation.

    Is it fine to normalize by gene length or should we use transcript length( i.e sum of exons) and NOT sum of (exons+introns) for a gene

    Thanks !
    -Abhi

    Leave a comment:


  • Xi Wang
    replied
    Originally posted by Davis McC View Post
    Hi Adam



    This is definitely something to be taken into consideration. See Oshlack and Wakefield, 'Transcript length bias in RNA-seq data confounds systems biology' (http://www.biology-direct.com/content/4/1/14) for further discussion.

    The more we look into RNA-seq data, the more it seems that various normalization steps are required to get the most out of the data.

    Cheers
    Davis
    Hi Davis,

    You are quite right that the gene length and also expression levels affect the statistical power of detecting DEGs. Goseq provides a good example on how to solve this problem. Please note that the normalization itself cannot completely eliminate the biases.

    Leave a comment:


  • Davis McC
    replied
    Gene/transcript length bias

    Hi Adam

    Originally posted by adamreid View Post
    I guess I'm asking whether there is/ought to be a correction for gene length because more reads are expected to map to longer genes.
    This is definitely something to be taken into consideration. See Oshlack and Wakefield, 'Transcript length bias in RNA-seq data confounds systems biology' (http://www.biology-direct.com/content/4/1/14) for further discussion.

    The more we look into RNA-seq data, the more it seems that various normalization steps are required to get the most out of the data.

    Cheers
    Davis

    Leave a comment:


  • RSK
    replied
    Thanks Xi!

    Leave a comment:

Latest Articles

Collapse

  • seqadmin
    Essential Discoveries and Tools in Epitranscriptomics
    by seqadmin




    The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
    04-22-2024, 07:01 AM
  • seqadmin
    Current Approaches to Protein Sequencing
    by seqadmin


    Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
    04-04-2024, 04:25 PM

ad_right_rmr

Collapse

News

Collapse

Topics Statistics Last Post
Started by seqadmin, 05-02-2024, 08:06 AM
0 responses
16 views
0 likes
Last Post seqadmin  
Started by seqadmin, 04-30-2024, 12:17 PM
0 responses
20 views
0 likes
Last Post seqadmin  
Started by seqadmin, 04-29-2024, 10:49 AM
0 responses
25 views
0 likes
Last Post seqadmin  
Started by seqadmin, 04-25-2024, 11:49 AM
0 responses
28 views
0 likes
Last Post seqadmin  
Working...
X