2010
Hardy, Barry J.; Douglas, Nicki; Helma, Christoph; Rautenberg, Micha; Jeliazkova, Nina; Jeliazkov, Vedrin; Nikolova, Ivelina; Benigni, Romualdo; Tcheremenskaia, Olga; Kramer, Stefan; Girschick, Tobias; Buchwald, Fabian; Wicker, Jörg; Karwath, Andreas; Gütlein, Martin; Maunz, Andreas; Sarimveis, Haralambos; Melagraki, Georgia; Afantitis, Antreas; Sopasakis, Pantelis; Gallagher, David; Poroikov, Vladimir; Filimonov, Dmitry; Zakharov, Alexey V.; Lagunin, Alexey; Gloriozova, Tatyana; Novikov, Sergey; Skvortsova, Natalia; Druzhilovsky, Dmitry; Chawla, Sunil; Ghosh, Indira; Ray, Surajit; Patel, Hitesh; Escher, Sylvia
Collaborative development of predictive toxicology applications Journal Article
In: J. Cheminformatics, vol. 2, pp. 7, 2010.
Tags: crossvalidation, data mining, QSAR, scientific knowledge, validation
@article{hardy2010,
title = {Collaborative development of predictive toxicology applications},
author = {Barry J. Hardy and Nicki Douglas and Christoph Helma and Micha Rautenberg and Nina Jeliazkova and Vedrin Jeliazkov and Ivelina Nikolova and Romualdo Benigni and Olga Tcheremenskaia and Stefan Kramer and Tobias Girschick and Fabian Buchwald and Jörg Wicker and Andreas Karwath and Martin Gütlein and Andreas Maunz and Haralambos Sarimveis and Georgia Melagraki and Antreas Afantitis and Pantelis Sopasakis and David Gallagher and Vladimir Poroikov and Dmitry Filimonov and Alexey V. Zakharov and Alexey Lagunin and Tatyana Gloriozova and Sergey Novikov and Natalia Skvortsova and Dmitry Druzhilovsky and Sunil Chawla and Indira Ghosh and Surajit Ray and Hitesh Patel and Sylvia Escher},
url = {http://dx.doi.org/10.1186/1758-2946-2-7},
doi = {10.1186/1758-2946-2-7},
year = {2010},
date = {2010-08-31},
urldate = {2010-08-31},
journal = {J. Cheminformatics},
volume = {2},
pages = {7},
abstract = {OpenTox provides an interoperable, standards-based Framework for the support of predictive toxicology data management, algorithms, modelling, validation and reporting. It is relevant to satisfying the chemical safety assessment requirements of the REACH legislation as it supports access to experimental data, (Quantitative) Structure-Activity Relationship models, and toxicological information through an integrating platform that adheres to regulatory requirements and OECD validation principles. Initial research defined the essential components of the Framework including the approach to data access, schema and management, use of controlled vocabularies and ontologies, architecture, web service and communications protocols, and selection and integration of algorithms for predictive modelling. OpenTox provides end-user oriented tools to non-computational specialists, risk assessors, and toxicological experts in addition to Application Programming Interfaces (APIs) for developers of new applications. OpenTox actively supports public standards for data representation, interfaces, vocabularies and ontologies, Open Source approaches to core platform components, and community-based collaboration approaches, so as to progress system interoperability goals.
The OpenTox Framework includes APIs and services for compounds, datasets, features, algorithms, models, ontologies, tasks, validation, and reporting which may be combined into multiple applications satisfying a variety of different user needs. OpenTox applications are based on a set of distributed, interoperable OpenTox API-compliant REST web services. The OpenTox approach to ontology allows for efficient mapping of complementary data coming from different datasets into a unifying structure having a shared terminology and representation.
Two initial OpenTox applications are presented as an illustration of the potential impact of OpenTox for high-quality and consistent structure-activity relationship modelling of REACH-relevant endpoints: ToxPredict which predicts and reports on toxicities for endpoints for an input chemical structure, and ToxCreate which builds and validates a predictive toxicity model based on an input toxicology dataset. Because of the extensible nature of the standardised Framework design, barriers of interoperability between applications and content are removed, as the user may combine data, models and validation from multiple sources in a dependable and time-effective way.},
keywords = {crossvalidation, data mining, QSAR, scientific knowledge, validation},
pubstate = {published},
tppubtype = {article}
}
2009
Schulz, Hannes; Kersting, Kristian; Karwath, Andreas
ILP, the Blind, and the Elephant: Euclidean Embedding of Co-proven Queries Conference
Inductive Logic Programming, 19th International Conference, ILP 2009, Springer Verlag, Berlin Heidelberg, Germany, 2009, ISBN: 978-3-642-13839-3.
Tags: cheminformatics, dimensionality reduction, inductive logic programming, relational learning, scientific knowledge, visualization
@conference{schulz2009,
title = {ILP, the Blind, and the Elephant: Euclidean Embedding of Co-proven Queries},
author = {Hannes Schulz and Kristian Kersting and Andreas Karwath},
url = {http://dx.doi.org/10.1007/978-3-642-13840-9_20},
doi = {10.1007/978-3-642-13840-9_20},
isbn = {978-3-642-13839-3},
year = {2009},
date = {2009-01-01},
booktitle = {Inductive Logic Programming, 19th International Conference, ILP 2009},
pages = {209-216},
publisher = {Springer Verlag},
address = {Berlin Heidelberg, Germany},
organization = {Springer-Verlag Berlin Heidelberg},
crossref = {DBLP:conf/ilp/2009},
abstract = {Relational data is complex. This complexity makes one of the basic steps of ILP difficult: understanding the data and results. If the user cannot easily understand it, he draws incomplete conclusions. The situation is very much as in the parable of the blind men and the elephant that appears in many cultures. In this tale the blind work independently and with quite different pieces of information, thereby drawing very different conclusions about the nature of the beast. In contrast, visual representations make it easy to shift from one perspective to another while exploring and analyzing data. This paper describes a method for embedding interpretations and queries into a single, common Euclidean space based on their co-proven statistics. We demonstrate our method on real-world datasets showing that ILP results can indeed be captured at a glance.},
keywords = {cheminformatics, dimensionality reduction, inductive logic programming, relational learning, scientific knowledge, visualization},
pubstate = {published},
tppubtype = {conference}
}
2008
Karwath, Andreas; Kersting, Kristian; Landwehr, Niels
Boosting Relational Sequence Alignments Conference
The 8th IEEE International Conference on Data Mining, ICDM 2008, IEEE, 2008, ISBN: 978-0-7695-3502-9.
Tags: inductive logic programming, machine learning, relational learning, scientific knowledge
@conference{karwath2008,
title = {Boosting Relational Sequence Alignments},
author = {Andreas Karwath and Kristian Kersting and Niels Landwehr},
url = {http://dx.doi.org/10.1109/ICDM.2008.127},
doi = {10.1109/ICDM.2008.127},
isbn = {978-0-7695-3502-9},
year = {2008},
date = {2008-12-15},
booktitle = {The 8th IEEE International Conference on Data Mining, ICDM 2008},
pages = {857-862},
publisher = {IEEE},
crossref = {DBLP:conf/icdm/2008},
abstract = {The task of aligning sequences arises in many applications. Classical dynamic programming approaches require the explicit state enumeration in the reward model. This is often impractical: the number of states grows very quickly with the number of domain objects and relations among these objects. Relational sequence alignment aims at exploiting symbolic structure to avoid the full enumeration. This comes at the expense of a more complex reward model selection problem: virtually infinitely many abstraction levels have to be explored. In this paper, we apply gradient-based boosting to leverage this problem. Specifically, we show how to reduce the learning problem to a series of relational regression problems. The main benefit of this is that interactions between state variables are introduced only as needed, so that the potentially infinite search space is not explicitly considered. As our experimental results show, this boosting approach can significantly improve upon established results in challenging applications.},
keywords = {inductive logic programming, machine learning, relational learning, scientific knowledge},
pubstate = {published},
tppubtype = {conference}
}
Kersting, Kristian; De Raedt, Luc; Gutmann, Bernd; Karwath, Andreas; Landwehr, Niels
Relational Sequence Learning Book Chapter
In: Probabilistic Inductive Logic Programming - Theory and Applications, vol. 4911, pp. 28-55, Springer Verlag, Berlin Heidelberg, Germany, 2008, ISBN: 978-3-540-78651-1.
Tags: inductive logic programming, machine learning, relational learning, scientific knowledge
@inbook{kersting2008,
title = {Relational Sequence Learning},
author = {Kristian Kersting and De Raedt, Luc and Bernd Gutmann and Andreas Karwath and Niels Landwehr},
url = {http://dx.doi.org/10.1007/978-3-540-78652-8_2},
doi = {10.1007/978-3-540-78652-8_2},
isbn = {978-3-540-78651-1},
year = {2008},
date = {2008-01-01},
booktitle = {Probabilistic Inductive Logic Programming - Theory and Applications},
volume = {4911},
pages = {28-55},
publisher = {Springer Verlag},
address = {Berlin Heidelberg, Germany},
organization = {Springer-Verlag Berlin Heidelberg},
crossref = {DBLP:conf/ilp/2008p},
abstract = {Sequential behavior and sequence learning are essential to intelligence. Often the elements of sequences exhibit an internal structure that can elegantly be represented using relational atoms. Applying traditional sequential learning techniques to such relational sequences requires one either to ignore the internal structure or to live with a combinatorial explosion of the model complexity. This chapter briefly reviews relational sequence learning and describes several techniques tailored towards realizing this, such as local pattern mining techniques, (hidden) Markov models, conditional random fields, dynamic programming and reinforcement learning.},
keywords = {inductive logic programming, machine learning, relational learning, scientific knowledge},
pubstate = {published},
tppubtype = {inbook}
}
2007
Karwath, Andreas; Kersting, Kristian
Relational Sequence Alignments and Logos Conference
Inductive Logic Programming, 16th International Conference, ILP 2006, Lecture Notes in Computer Science, vol. 4455, Springer Verlag, Berlin Heidelberg, Germany, 2007, ISBN: 978-3-540-73846-6.
Tags: bioinformatics, inductive logic programming, relational learning, scientific knowledge
@conference{karwath2007,
title = {Relational Sequence Alignments and Logos},
author = {Andreas Karwath and Kristian Kersting},
url = {http://dx.doi.org/10.1007/978-3-540-73847-3_29},
doi = {10.1007/978-3-540-73847-3_29},
isbn = {978-3-540-73846-6},
year = {2007},
date = {2007-01-01},
booktitle = {Inductive Logic Programming, 16th International Conference, ILP 2006},
volume = {4455},
pages = {290-304},
publisher = {Springer Verlag},
address = {Berlin Heidelberg, Germany},
organization = {Springer-Verlag Berlin Heidelberg},
series = {Lecture Notes in Computer Science},
crossref = {DBLP:conf/ilp/2006},
abstract = {The need to measure sequence similarity arises in many application domains and often coincides with sequence alignment: the more similar two sequences are, the better they can be aligned. Aligning sequences not only shows how similar sequences are, it also shows where there are differences and correspondences between the sequences.
Traditionally, the alignment has been considered for sequences of flat symbols only. Many real world sequences such as natural language sentences and protein secondary structures, however, exhibit rich internal structures. This is akin to the problem of dealing with structured examples studied in the field of inductive logic programming (ILP). In this paper, we introduce Real, which is a powerful, yet simple approach to align sequences of structured symbols using well-established ILP distance measures within traditional alignment methods. Although straight-forward, experiments on protein data and Medline abstracts show that this approach works well in practice, that the resulting alignments can indeed provide more information than flat ones, and that they are meaningful to experts when represented graphically.},
keywords = {bioinformatics, inductive logic programming, relational learning, scientific knowledge},
pubstate = {published},
tppubtype = {conference}
}
King, Ross D.; Karwath, Andreas; Clare, Amanda; Dehaspe, Luc
Logic and the Automatic Acquisition of Scientific Knowledge: An Application to Functional Genomics Conference
Computational Discovery of Scientific Knowledge, Introduction, Techniques, and Applications in Environmental and Life Sciences, Lecture Notes in Computer Science, vol. 4660, Springer Verlag, Berlin Heidelberg, Germany, 2007, ISBN: 978-3-540-73919-7.
Tags: bioinformatics, data mining, inductive logic programming, machine learning, relational learning, scientific knowledge
@conference{king2007,
title = {Logic and the Automatic Acquisition of Scientific Knowledge: An Application to Functional Genomics},
author = {Ross D. King and Andreas Karwath and Amanda Clare and Luc Dehaspe},
url = {http://dx.doi.org/10.1007/978-3-540-73920-3_13},
doi = {10.1007/978-3-540-73920-3_13},
isbn = {978-3-540-73919-7},
year = {2007},
date = {2007-01-01},
booktitle = {Computational Discovery of Scientific Knowledge, Introduction, Techniques, and Applications in Environmental and Life Sciences},
volume = {4660},
pages = {273-289},
publisher = {Springer Verlag},
address = {Berlin Heidelberg, Germany},
organization = {Springer-Verlag Berlin Heidelberg},
series = {Lecture Notes in Computer Science},
crossref = {DBLP:conf/dis/2007book},
abstract = {This paper is a manifesto aimed at computer scientists interested in developing and applying scientific discovery methods. It argues that: science is experiencing an unprecedented “explosion” in the amount of available data; traditional data analysis methods cannot deal with this increased quantity of data; there is an urgent need to automate the process of refining scientific data into scientific knowledge; inductive logic programming (ILP) is a data analysis framework well suited for this task; and exciting new scientific discoveries can be achieved using ILP scientific discovery methods. We describe an example of using ILP to analyse a large and complex bioinformatic database that has produced unexpected and interesting scientific results in functional genomics. We then point a possible way forward to integrating machine learning with scientific databases to form intelligent databases.},
keywords = {bioinformatics, data mining, inductive logic programming, machine learning, relational learning, scientific knowledge},
pubstate = {published},
tppubtype = {conference}
}
2006
Karwath, Andreas; De Raedt, Luc
SMIREP: Predicting Chemical Activity from SMILES Journal Article
In: Journal of Chemical Information and Modeling, vol. 46, no. 6, pp. 2432-2444, 2006.
Tags: cheminformatics, graph mining, machine learning, QSAR, relational learning, scientific knowledge
@article{karwath06c,
title = {SMIREP: Predicting Chemical Activity from SMILES},
author = {Andreas Karwath and De Raedt, Luc},
url = {http://pubs.acs.org/doi/abs/10.1021/ci060159g},
doi = {10.1021/ci060159g},
year = {2006},
date = {2006-10-12},
journal = {Journal of Chemical Information and Modeling},
volume = {46},
number = {6},
pages = {2432-2444},
abstract = {Most approaches to structure-activity-relationship (SAR) prediction proceed in two steps. In the first step, a typically large set of fingerprints, or fragments of interest, is constructed (either by hand or by some recent data mining techniques). In the second step, machine learning techniques are applied to obtain a predictive model. The result is often not only a highly accurate but also hard to interpret model. In this paper, we demonstrate the capabilities of a novel SAR algorithm, SMIREP, which tightly integrates the fragment and model generation steps and which yields simple models in the form of a small set of IF-THEN rules. These rules contain SMILES fragments, which are easy to understand to the computational chemist. SMIREP combines ideas from the well-known IREP rule learner with a novel fragmentation algorithm for SMILES strings. SMIREP has been evaluated on three problems: the prediction of binding activities for the estrogen receptor (Environmental Protection Agency's (EPA's) Distributed Structure-Searchable Toxicity (DSSTox) National Center for Toxicological Research estrogen receptor (NCTRER) Database), the prediction of mutagenicity using the carcinogenic potency database (CPDB), and the prediction of biodegradability on a subset of the Environmental Fate Database (EFDB). In these applications, SMIREP has the advantage of producing easily interpretable rules while having predictive accuracies that are comparable to those of alternative state-of-the-art techniques.},
keywords = {cheminformatics, graph mining, machine learning, QSAR, relational learning, scientific knowledge},
pubstate = {published},
tppubtype = {article}
}
Karwath, Andreas; Kersting, Kristian
Relational Sequence Alignments Conference
Proc. The 4th International Workshop on Mining and Learning with Graphs, MLG 2006, ed. Thomas Gärtner, Gemma C. Garriga and Thorsten Meinl, September 2006, (workshop).
Tags: bioinformatics, cheminformatics, relational learning, scientific knowledge
@conference{karwath06b,
title = {Relational Sequence Alignments},
author = {Andreas Karwath and Kristian Kersting},
year = {2006},
date = {2006-01-01},
booktitle = {Proc. The 4th International Workshop on Mining and Learning with Graphs, MLG 2006},
editor = {Thomas Gärtner and Gemma C. Garriga and Thorsten Meinl},
month = {September},
pages = {149-156},
note = {workshop},
keywords = {bioinformatics, cheminformatics, relational learning, scientific knowledge},
pubstate = {published},
tppubtype = {conference}
}
Clare, Amanda; Karwath, Andreas; Ougham, Helen; King, Ross D.
Functional bioinformatics for Arabidopsis thaliana Journal Article
In: Bioinformatics, vol. 22, no. 9, pp. 1130-1136, 2006.
Tags: bioinformatics, data mining, inductive logic programming, machine learning, relational learning, scientific knowledge
@article{karwath06a,
title = {Functional bioinformatics for Arabidopsis thaliana},
author = {Amanda Clare and Andreas Karwath and Helen Ougham and Ross D. King},
url = {https://bioinformatics.oxfordjournals.org/content/22/9/1130.full.pdf+html},
doi = {10.1093/bioinformatics/btl051},
year = {2006},
date = {2006-01-01},
journal = {Bioinformatics},
volume = {22},
number = {9},
pages = {1130-1136},
abstract = {Motivation: The genome of Arabidopsis thaliana, which has the best understood plant genome, still has approximately one-third of its genes with no functional annotation at all from either MIPS or TAIR. We have applied our Data Mining Prediction (DMP) method to the problem of predicting the functional classes of these protein sequences. This method is based on using a hybrid machine-learning/data-mining method to identify patterns in the bioinformatic data about sequences that are predictive of function. We use data about sequence, predicted secondary structure, predicted structural domain, InterPro patterns, sequence similarity profile and expressions data.
Results: We predicted the functional class of a high percentage of the Arabidopsis genes with currently unknown function. These predictions are interpretable and have good test accuracies. We describe in detail seven of the rules produced.
Availability: Rulesets are available at http://www.aber.ac.uk/compsci/Research/bio/dss/arabpreds/ and predictions are available at http://www.genepredictions.org
Contact: afc@aber.ac.uk},
keywords = {bioinformatics, data mining, inductive logic programming, machine learning, relational learning, scientific knowledge},
pubstate = {published},
tppubtype = {article}
}
2002
Karwath, Andreas
Large Logical Databases and their Applications to Molecular Biology PhD Thesis
University of Wales, Aberystwyth, 2002.
Tags: bioinformatics, data mining, inductive logic programming, machine learning, relational learning, scientific knowledge
@phdthesis{karwath02b,
title = {Large Logical Databases and their Applications to Molecular Biology},
author = {Andreas Karwath},
year = {2002},
date = {2002-01-01},
school = {University of Wales, Aberystwyth},
keywords = {bioinformatics, data mining, inductive logic programming, machine learning, relational learning, scientific knowledge},
pubstate = {published},
tppubtype = {phdthesis}
}
2001
King, Ross D.; Karwath, Andreas; Clare, Amanda; Dehaspe, Luc
The utility of different representations of protein sequence for predicting functional class Journal Article
In: Bioinformatics, vol. 17, no. 5, pp. 445-454, 2001.
Tags: bioinformatics, data mining, inductive logic programming, relational learning, scientific knowledge
@article{King2001a,
title = {The utility of different representations of protein sequence for predicting functional class},
author = {Ross D. King and Andreas Karwath and Amanda Clare and Luc Dehaspe},
url = {https://bioinformatics.oxfordjournals.org/content/17/5/445},
doi = {10.1093/bioinformatics/17.5.445},
year = {2001},
date = {2001-01-19},
journal = {Bioinformatics},
volume = {17},
number = {5},
pages = {445-454},
abstract = {Motivation: Data Mining Prediction (DMP) is a novel approach to predicting protein functional class from sequence. DMP works even in the absence of a homologous protein of known function. We investigate the utility of different ways of representing protein sequence in DMP (residue frequencies, phylogeny, predicted structure) using the Escherichia coli genome as a model.
Results: Using the different representations DMP learnt prediction rules that were more accurate than default at every level of function using every type of representation. The most effective way to represent sequence was using phylogeny (75% accuracy and 13% coverage of unassigned ORFs at the most general level of function: 69% accuracy and 7% coverage at the most detailed). We tested different methods for combining predictions from the different types of representation. These improved both the accuracy and coverage of predictions, e.g. 40% of all unassigned ORFs could be predicted at an estimated accuracy of 60% and 5% of unassigned ORFs could be predicted at an estimated accuracy of 86%.
Availability: The rules and data are freely available. Warmr is free to academics.
Contact: rdk@aber.ac.uk},
keywords = {bioinformatics, data mining, inductive logic programming, relational learning, scientific knowledge},
pubstate = {published},
tppubtype = {article}
}
2000
King, Ross D.; Karwath, Andreas; Clare, Amanda; Dehaspe, Luc
Accurate prediction of protein functional class from sequence in the Mycobacterium tuberculosis and Escherichia coli genomes using data mining Journal Article
In: Yeast (Comparative and Functional Genomics), vol. 17, pp. 283-293, 2000.
Tags: bioinformatics, data mining, inductive logic programming, relational learning, scientific knowledge
@article{king00a,
title = {Accurate prediction of protein functional class from sequence in the Mycobacterium tuberculosis and Escherichia coli genomes using data mining},
author = {Ross D. King and Andreas Karwath and Amanda Clare and Luc Dehaspe},
url = {http://onlinelibrary.wiley.com/doi/10.1002/1097-0061(200012)17:4%3C283::AID-YEA52%3E3.0.CO;2-F/abstract},
doi = {10.1002/1097-0061(200012)17:4<283::AID-YEA52>3.0.CO;2-F},
year = {2000},
date = {2000-12-08},
journal = {Yeast (Comparative and Functional Genomics)},
volume = {17},
pages = {283-293},
abstract = {The analysis of genomics data needs to become as automated as its generation. Here we present a novel data-mining approach to predicting protein functional class from sequence. This method is based on a combination of inductive logic programming clustering and rule learning. We demonstrate the effectiveness of this approach on the M. tuberculosis and E. coli genomes, and identify biologically interpretable rules which predict protein functional class from information only available from the sequence. These rules predict 65% of the ORFs with no assigned function in M. tuberculosis and 24% of those in E. coli, with an estimated accuracy of 60–80% (depending on the level of functional assignment). The rules are founded on a combination of detection of remote homology, convergent evolution and horizontal gene transfer. We identify rules that predict protein functional class even in the absence of detectable sequence or structural homology. These rules give insight into the evolutionary history of M. tuberculosis and E. coli.},
keywords = {bioinformatics, data mining, inductive logic programming, relational learning, scientific knowledge},
pubstate = {published},
tppubtype = {article}
}
King, Ross D.; Karwath, Andreas; Clare, Amanda; Dehaspe, Luc
Logic and the Automatic Acquisition of Scientific Knowledge Journal Article
In: EACIS (Electronic Articles in Computer and Information Science), vol. 5, no. 031, 2000.
Tags: bioinformatics, data mining, scientific knowledge
@article{King2000c,
title = {Logic and the Automatic Acquisition of Scientific Knowledge},
author = {Ross D. King and Andreas Karwath and Amanda Clare and Luc Dehaspe},
url = {http://www.ida.liu.se/ext/epa/cis/mi-17/02/orig.html},
year = {2000},
date = {2000-12-01},
journal = {EACIS (Electronic Articles in Computer and Information Science)},
volume = {5},
number = {031},
abstract = {This paper is a manifesto. It argues that:
Science is experiencing an unprecedented "explosion" in the amount of available data.
Traditional data analysis methods cannot deal with this increased quantity of data.
There is therefore an urgent need to automate the process of refining scientific data into scientific knowledge.
Inductive logic programming (ILP) is the data analysis framework best suited for this task.
We describe an example of using ILP to analyse a large and complex bioinformatic database which produced unexpected and interesting scientific results. We then point a possible way forward to integrating machine learning with scientific databases to form intelligent inductive databases.},
keywords = {bioinformatics, data mining, scientific knowledge},
pubstate = {published},
tppubtype = {article}
}