Accepted papers
Online proceedings:
http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-645/
The Sequence Ontology (SO) is a member ontology of the Open Biomedical Ontologies (OBO) library that is charged with formally representing types of biomacromolecular sequences and their associated attributes as well as the interrelationships among these. By providing a common vocabulary and set of definitions, it is widely used to facilitate accurate storage, processing, and exchange of sequence data. While some parts of the SO are quite mature, particularly the independently defined sequence features, the current representation of sequence variation is more problematic in that the representation of corresponding variations, sequences containing these variations, and processes resulting in these variations and their interrelationships is incomplete. Additionally, corresponding variations in DNAs, RNAs, and polypeptides and their interrelationships are not represented. We report here on our progress in more completely and precisely representing these concepts, which will allow for more consistent annotation of variant sequence data. Furthermore, formally linking and defining these sequence-variation classes within the OBO framework will enable powerful, logically sound reasoning with various types of variant data as well as with other types of annotated biological data.
Our previous work on text mining for mutation impacts resulted in (i) the development of a GATE-based pipeline that mines texts for information about impacts of mutations on proteins, (ii) the population of this information into our OWL DL mutation impact ontology, and (iii) the establishment of an experimental OWL database for storing the results of text mining. The current focus of the project is to look for ways of deploying our software and data to facilitate the integration of our mutation impact data in a broader biological context. This paper explores the possibility of using the SADI framework as a medium for publishing our mutation impact software and data. SADI is a set of conventions for creating web services with semantic descriptions that facilitate automatic discovery and orchestration. Here we describe a case study exploring and demonstrating the utility of the SADI approach in our context. We describe several SADI services we created based on our text mining API and data, and demonstrate how they can be used in a number of biologically meaningful scenarios through a SPARQL interface (SHARE) to SADI services. In all cases we pay special attention to the integration of mutation impact services with external SADI services providing information about related biological entities, such as proteins, pathways, and drugs.
Motivation: The recent development of high-throughput sequencing technologies provides new opportunities to characterize SNPs with lower minor allele frequencies (MAF) in large sample populations. While scientifically enticing, such genotyping studies will be technically challenging, with escalating data storage requirements per sample. To address this need, data compression methods have been proposed that leverage SNP information to reference genotyping data. However, these representations do not efficiently account for SNPs shared by a low number of individuals in a population. Results: GenoS is a novel data model for the efficient storage of genotyping data. GenoS is developed around a segment-based architecture designed to organize and share biospecimen-related data produced by commonly used genomics technology platforms. This architecture allows both explicit and referenced representations of these data. This approach is particularly effective at storing genotypes. Compared to the widely used PLINK format, GenoS achieves 1.4 times higher compression on the storage of genotypes related to SNPs obtained from the 1000 Genomes Project and more than 8 times higher compression of genotypes for SNPs with MAF <1%. On this dataset, data extraction times are comparable to PLINK. As an increasing number of low MAF SNPs are discovered, GenoS will provide a method for efficient and economical genotype data storage, while maintaining good data retrieval performance.
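The distinction between explicit and referenced representations can be illustrated with a minimal sketch. This is not the GenoS format, whose details the abstract does not give; it is one possible reading, in which a SNP whose genotypes mostly match a reference value is stored as a short list of exceptions. The function names, the record layout, and the size heuristic are all hypothetical.

```python
def encode_snp(genotypes, reference=0):
    """Encode one SNP's genotypes across samples (hypothetical layout).

    'referenced': keep only the samples that differ from the reference genotype.
    'explicit':   keep every genotype as a plain list.
    """
    exceptions = {i: g for i, g in enumerate(genotypes) if g != reference}
    # Assumed cost model: each exception costs an index plus a value,
    # so the sparse form only pays off when exceptions are rare.
    if 2 * len(exceptions) < len(genotypes):
        return ("referenced", len(genotypes), reference, exceptions)
    return ("explicit", list(genotypes))


def decode_snp(record):
    """Rebuild the full genotype list from either representation."""
    if record[0] == "explicit":
        return list(record[1])
    _, n_samples, reference, exceptions = record
    return [exceptions.get(i, reference) for i in range(n_samples)]


# Toy data (not from the paper): a rare variant and a common one.
rare = [0] * 20
rare[3], rare[17] = 1, 2
common = [0, 1, 2, 1, 0, 1]

rare_record = encode_snp(rare)
common_record = encode_snp(common)
```

In this toy model the rare variant is stored as two exceptions rather than twenty values, which is the kind of saving one would expect to grow as MAF falls.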
Biomedical researchers consume and analyze PubMed abstracts on a daily basis, seeking to update their existing knowledge with insights from newly published literature. Plain text descriptions fail to deliver contextual knowledge to users who require a comprehensive understanding of the content of a paper before deciding to access it. To achieve this, biological named entities described in the abstracts must be linked to their related entries in biological databases and established controlled vocabularies such as SwissProt and Gene Ontology. Semantic Assistants support users in content retrieval, analysis, and development by offering context-sensitive NLP services directly integrated in standard desktop clients, like a word processor. They are implemented through an open service-oriented architecture, using Semantic Web ontologies and W3C Web Services. Here we present a deployment of the Semantic Assistants framework to provide links from mutation, protein, protein property, gene and organism mentions in abstracts to their related entries in standardized biological databases and controlled vocabularies. The underlying text mining pipeline used to identify named entities has previously shown high levels of precision, and we make this functionality easily accessible through a Semantic Assistant to end users reviewing PubMed abstracts through a Firefox client.
Background: Mining relevant features from protein mutation data is fundamental for understanding the characteristics of a protein functional site. The mined features could also be used for engineering novel proteins with useful functions. Results: We propose a simple relational learning approach for protein engineering. First, we learn a set of relational rules from mutation data; then we use them to generate a set of candidate mutations that are most likely to confer resistance to a certain inhibitor or improved activity on a specific substrate. We tested our approach on a dataset of HIV drug resistance mutations, comparing it to a baseline random generator. Statistically significant improvements are obtained on both categories of nucleoside and non-nucleoside HIV reverse transcriptase inhibitors. Conclusions: Our promising preliminary results suggest that the proposed approach for learning mutations has potential for guiding mutant engineering, as well as for predicting virus evolution in order to devise appropriate countermeasures. The approach can be generalized quite easily to learning mutants characterized by more complex rules correlating multiple mutations.
Background: Single Nucleotide Polymorphisms (SNPs) are an important source of human genome variability. The non-synonymous SNPs occurring in coding regions, which result in single amino acid polymorphisms (SAPs), may affect protein function and lead to pathology. Several methods attempt to estimate the impact of SAPs using different sources of information. Although sequence-based predictors have shown good performance, the quality of the prediction can be further improved by introducing new features derived from the protein three-dimensional structure. Results: In this paper, we present a structure-based machine learning approach to predict disease-related SAPs. We have trained a Support Vector Machine (SVM) on a set of 3,342 disease-related mutations and 1,644 neutral polymorphisms from 784 protein chains. We use SVM input features from the protein sequence, structure and function information. After dataset balancing, the structure-based method reaches an overall accuracy of 84%, a correlation coefficient of 0.67, and an area under the receiver operating characteristic curve (AUC) of 0.91. When compared with a similar sequence-based predictor, the structure-based method results in an increase of ~3% in the overall accuracy and the AUC, and of 0.06 in the correlation coefficient. Conclusion: This work demonstrates that structural information can increase the accuracy of detecting disease-related SAPs. Our results also quantify the magnitude of the improvement on a large dataset. This improvement is in agreement with previously observed results in the prediction of protein stability change upon mutation.
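The three figures of merit quoted in the preceding abstract can be computed as in the following sketch. It assumes that the "correlation coefficient" is the Matthews correlation coefficient, which the abstract does not state, and it uses made-up toy scores rather than any data from the paper. The function names are illustrative only.

```python
from math import sqrt


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def matthews_cc(tp, tn, fp, fn):
    """Matthews correlation coefficient (assumed to be the paper's metric)."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def auc(scores, labels):
    """AUC as the probability that a random positive outscores a random
    negative, with ties counted as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: 1 = disease-related, 0 = neutral (invented values).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
preds = [1 if s >= 0.5 else 0 for s in scores]

# Confusion-matrix counts at the 0.5 decision threshold.
tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
tn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 0)
fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
```

Accuracy and the correlation coefficient depend on the chosen decision threshold, whereas AUC summarizes ranking quality over all thresholds, which is why the abstract reports them separately.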
Background: Protein Kinases are a superfamily of proteins involved in crucial cellular processes such as cell cycle regulation and signal transduction. Accordingly, they play an important role in cancer biology. To contribute to the study of the relation between kinases and disease, we compared pathogenic mutations to neutral mutations. First, we analyzed native and mutant proteins in terms of amino acid composition. Second, mutations were characterized according to their potential structural effects. Finally, we assessed the location of the different classes of polymorphisms with respect to kinase-relevant positions in terms of subfamily specificity, conservation, accessibility and functional sites. Results: Pathogenic Protein Kinase mutations perturb essential aspects of protein function, including disruption of substrate binding and/or effector recognition at family-specific positions. Interestingly, these mutations in Protein Kinases display a tendency to avoid structurally relevant positions, which represents a significant difference with respect to the average distribution of pathogenic mutations in other protein families. Conclusions: Disease-associated mutations display sound differences with respect to neutral mutations: several amino acids are specific to each mutation type, different structural properties characterize each class, and the distribution of pathogenic mutations within the consensus structure of the Protein Kinase domain is substantially different from that of non-pathogenic mutations.