Informatics: Bio-Database Management


Author: Vishal Rosha [vishalrosha@gmail.com]

Consultant, Satyam Computer Services Ltd.

 

Abstract

 

As of 2005, there are around 650 public and commercial biological databases [1]. These databases contain information about the nucleotide sequences of genes and the amino acid sequences of proteins. Furthermore, information about function, structure, localisation on the chromosome, clinical effects of mutations, and similarities between biological sequences can be found. The terms “database development” and “biological informatics activities” describe a range of activities, from the formative, theoretical development of new algorithms, data structures and tools specific to the management of biological information, to the development and utilisation of established resources needed by whole communities of biological researchers.

 

Bio-Database: An Introduction

 

Biological databases have become an important tool in assisting scientists to understand and explain a host of biological phenomena, from the structure of biomolecules and their interactions, to the whole metabolism of organisms, to the evolution of species. This knowledge helps facilitate the fight against disease, assists in the development of medical drugs, and aids in discovering basic relationships amongst species in the history of life. Biological knowledge is usually distributed amongst many different specialized databases. This makes it difficult to ensure the consistency of information, which sometimes leads to low data quality.

 

Data collected from different sources is not useful until it is converted into information, which is why the distinction between these two terms plays a major role in informatics. Data becomes information when it is presented in a user-defined shape and form.

 

The emphasis of informatics is on providing solutions that address the formative stages of this work: theoretical research on data structures; new database architectures better tuned to the complexity of biology; planning and prototype development of new types of biological data- or knowledge-bases; and the design of easy-to-use interfaces and tools for data input, manipulation, analysis and extraction.

 

Existing data management tools and methods, such as commercial database management systems, data warehousing tools and statistical methods, are normally adapted for the biological domain. Dealing with uncertainty or inconsistency in experimental data has required statistical, rather than data management, methods; accordingly, the application of statistical methods to gene expression data analysis at various levels of granularity has been the subject of research and development in recent years. The most difficult problems have been encountered in the area of data semantics: properly qualifying data values and their relationships, especially in the context of continuously changing platforms and evolving biological knowledge (for example, an estimated expression value). While such problems are encountered across all data management areas, from data generation through data collection and integration to data analysis, the solutions require domain-specific knowledge and extensive data definition and creation work, with data management providing only the framework in which to address these problems [2].
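
To make the statistical treatment concrete, here is a minimal sketch, assuming Python with SciPy available; the replicate expression values and the choice of Welch's t-test are illustrative assumptions, not taken from the text:

from scipy import stats

# Hypothetical replicate expression estimates (arbitrary units) for one
# gene under two conditions; the values are illustration data only.
control = [5.1, 4.8, 5.3, 5.0, 4.9]
treated = [6.2, 6.5, 5.9, 6.4, 6.1]

# Welch's t-test: does mean expression differ between conditions, given
# the replicate-to-replicate variability (the uncertainty in the data)?
t_stat, p_value = stats.ttest_ind(control, treated, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")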

 

From an industry point of view, solutions to data management challenges need to be considered in terms of complexity, cost, robustness, performance, and other user- and product-specific requirements. Devising effective solutions for biological data management problems requires an in-depth understanding of the biological application, the data management field, and the overall context in which the problems are considered. An insufficient understanding of the biological application and of data management technology and practices causes more problems than the limitations of existing data management technology in supporting biology-specific data structures or queries.

[Image: http://w3markets.smugmug.com/photos/32315817-M.jpg]

Figure 1: Hierarchical Data Generation and Storage Model

 

Bio-database management normally follows a hierarchical data generation and storage model [3], which applies across data types and makes analysis both richer and more complex. The model is relatively simple, and separates databases and their maintenance responsibility into a three-tier architecture: 1) the private database, 2) the specialized database, and 3) the public database [see Figure 1]. The private database is the most free-form, representing the database of an individual lab, or the core LIMS for a project. Such databases are designed primarily to address the needs of the site. In implementation they may range from a spreadsheet to a relational database management system [RDBMS], with any number of variations in between.
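
The tiering can be made concrete with a small sketch. The following Python classes and field names are hypothetical illustrations of the three-tier flow described above, not a published schema:

from dataclasses import dataclass

@dataclass
class Record:
    accession: str
    sequence: str
    curated: bool = False  # set when a specialized database certifies the record

class PrivateDatabase:
    """Tier 1: free-form lab or project LIMS store; structure is up to the site."""
    def __init__(self):
        self.records = []

    def submit_to(self, specialized):
        # Private resources provide raw or processed data upward.
        for record in self.records:
            specialized.certify(record)

class SpecializedDatabase:
    """Tier 2: certifying agent for a specialized community."""
    def __init__(self):
        self.records = []

    def certify(self, record):
        record.curated = True  # quality and integrity checks stand in here
        self.records.append(record)

    def publish_to(self, archive):
        # Certified data flows onward to the world-accessible archive.
        archive.records.extend(self.records)

class PublicArchive:
    """Tier 3: archival repository of high-quality, globally accessible data."""
    def __init__(self):
        self.records = []

# Example flow: private -> specialized -> public.
lab = PrivateDatabase()
lab.records.append(Record("LAB0001", "ATGCGT"))
community_db = SpecializedDatabase()
lab.submit_to(community_db)
archive = PublicArchive()
community_db.publish_to(archive)
print([r.accession for r in archive.records if r.curated])  # ['LAB0001']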

 

Beyond that level, the interactions become more significant than the data itself, so the design and implementation of the private data set become more important. At a minimum, private resources are expected to provide raw or processed data to either a specialized database or a public archive. Specialized and archival databases are responsible to a different community than the private database is. At the highest level, the archival repository is responsible for the maintenance of a structured environment of high-quality data, accessible to the world. The responsibilities and expectations of the certifying authorities for the quality and integrity of the data are extremely high, as the global community depends on the existence of this information and applies it in a generally anonymous fashion. Users trust the certification authorities that such records are maintained and that the data are usable. In addition, archival databases develop tools, primarily for access to their enormous repositories, in a trustworthy and user-friendly fashion.

 

The specialized database acts as a certifying agent, setting standards for the integrity, quality and availability of its data and analyses. The primary difference from the archival database is that the specialized database is responsible to the needs of the specialized community that it serves.

 

The specialized database frequently has the computational and workforce resources to provide multiple analyses of datasets, and to develop customized views of the data that address the needs of the community. In certain cases, new software or database resources are initially developed exclusively for the special-purpose projects and the data developed or used by those projects. The specialized database may act as a channel and standards-developing group for the integration of other data and information of value to the community, and its tools and techniques may thus make their way into a larger community of users.

 

Almost by definition, a specialized database must evolve. If a specialized database is to continue to serve a community, what it does must change in synchrony with the goals of that community: even a research community may shift the focus of its projects, aspects of the projects it supports may come to an end, or new and unexpected requirements may crop up.

 

Data Integration


Integrated informatics is the dynamic construction of schemas to organise related cross-domain analysis results and background knowledge.

 

The Problem

 

·         Too much un-integrated data

·         From a variety of incompatible sources

·         No standard naming convention

·         Each with a custom browsing and querying mechanism (no common interface)

·         Poor interaction with other data sources

 

The value of a particular dataset can be increased enormously by integrating the data and analyses, at least at the interface level, with externally developed, structured information. Classically, there are several methods of accomplishing integration (a minimal sketch of one approach follows the list below):

[Image: http://w3markets.smugmug.com/photos/32315815-M-1.jpg]

Figure 2: Data Integration

 

1.       Linked, indexed data systems: connect flat-file databases using web [HTTP] links and indexes;

2.       Loose integration: systematize the data into a multi-database system without a common schema but with a common query mechanism;

3.       Tight integration with views: organize the data into a database federation with a common schema and a central query mechanism;

4.       Loose integration with materialized data: organize the data into a data warehouse without a common schema and periodically load all data into a central location;

5.       Tight integration with materialized data: construct a data warehouse with a common schema and periodically load all data into a central location.
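
As promised above, here is a minimal sketch of approach 2, loose integration: each source keeps its own schema and naming convention, and only the query mechanism is shared. The source classes, record keys and the Mediator name are hypothetical illustrations written in Python:

class GeneSource:
    # Flat-file-style source with its own record structure.
    data = [{"gene_id": "ABC1", "organism": "yeast"}]

    def search(self, term):
        return [r for r in self.data if term in r["gene_id"]]

class ProteinSource:
    # A second, incompatible source: different keys, different conventions.
    data = [{"prot_acc": "P12345", "name": "ABC1 kinase"}]

    def search(self, term):
        return [r for r in self.data if term in r["name"]]

class Mediator:
    """Common query interface; no attempt at a common schema."""
    def __init__(self, sources):
        self.sources = sources

    def query(self, term):
        # Fan the query out and tag each hit with its origin, leaving
        # the source-specific record structure untouched.
        return [(type(s).__name__, hit)
                for s in self.sources
                for hit in s.search(term)]

# Example: one query, two heterogeneous answers.
for source, record in Mediator([GeneSource(), ProteinSource()]).query("ABC1"):
    print(source, record)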

 

These methods differ dramatically in their complexity of implementation and in their scalability and maintainability. They lie along a spectrum, leading from complete physical and logical separation, to a logical union of physically disparate datasets, and finally to an integrated whole.

In the long-term view, data integration is a key element in developing functional and useful bioinformatics systems. While it is crucial that we archive and publish our data and results, the sheer number of important data types, and of data-type-related databases, is daunting. Raw data include all types of nucleic acid, protein and hypothetical protein sequences, similarity results between and among these, functional data from a myriad of sources and varying protocols [expression data, proteomic and metabolomic data], genetic and physical maps, breeding data, associated environmental information and phenotypic information. By using computational tools to reduce the complexity inherent in the reporting system, we can begin to understand the dimensionality and complexity of the system that is truly of interest, rather than focusing on details of the reporting method [see Figure 2].

 

While the preceding has been, in a sense, a primer on the elements of bioinformatics systems, it has been important to present it in the interest of developing a common basis for discussing not just the current state of bioinformatics, but also its future needs and developments.

 

Data Mining

 

Database mining is the process of finding and extracting useful information from raw datasets. Data mining is a frequently used term in genomics for the generic exploration of information. To date, most data mining research has focused on applications in business and finance. However, data mining researchers have begun to develop techniques for scientific applications as well, including applications in systems biology.

 

Biological applications of data mining differ significantly from those developed for commercial enterprises. At the broadest level, the practice of extracting patterns from large sets of data, explaining those patterns in terms of evidence sets, and assessing those patterns for their stability is identical across disciplines. Existing data mining applications have clear-cut requirements in terms of data types, e.g., transactions, series of transactions, and multi-dimensional vectors [such as demographics]. Some biological data can be transformed, through the use of feature functions encapsulating domain-specific knowledge, into the sorts of vectors for which existing data mining applications were originally designed. In supervised learning, a pattern recognizer is trained on data for which the correct classifications are already known; neural networks, decision trees, Bayesian inference networks, and support vector machines are examples of such techniques. By contrast, clustering and other unsupervised learning algorithms may discover unexpected patterns in the data. A system able to propose “this would be a strong pattern, if this annotation were incorrect” would be of value in the genomics world.
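
As an illustration of that transformation, the following sketch assumes Python with scikit-learn; the dinucleotide feature function, the toy sequences and labels, and the choice of a decision tree are all illustrative assumptions:

from itertools import product
from sklearn.tree import DecisionTreeClassifier

DINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=2)]

def feature_vector(seq):
    # Domain-specific feature function: dinucleotide frequencies
    # (non-overlapping counts are sufficient for this sketch).
    counts = [seq.count(d) for d in DINUCLEOTIDES]
    total = max(sum(counts), 1)
    return [c / total for c in counts]

# Toy training set: sequences with known classifications (supervised learning).
train_seqs = ["ATATATAT", "GCGCGCGC", "ATATGATA", "GGCCGCGG"]
labels = ["AT-rich", "GC-rich", "AT-rich", "GC-rich"]

clf = DecisionTreeClassifier().fit(
    [feature_vector(s) for s in train_seqs], labels)

print(clf.predict([feature_vector("ATTATATT")]))  # likely 'AT-rich'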

 

Beyond such applications, systems biology data have variable and frequently mixed components of temporal, sequential, spatial, geometric, and/or relational natures. The data also have dynamic properties defined by the researcher in the context of their exploration work. The development of new algorithms able to operate simultaneously on the native data as well as on such more fluid, interactive properties will enable new kinds of scientific discoveries. One might imagine data mining tools that allow the user to alter, in real time, the selected data, its context, the metadata, their perceived grouping of the data, and their proposed view of the result.

 

 

The Conclusion

 

Improving the informational infrastructure of the biological sciences will require a number of activities:

 

·         Development of new methods and tools for the construction and operation of, and access to, biological databases, including research into generic database infrastructures designed to be extendable to different biological domains;

 

·         Research into and development of new data structures and new data-management systems for biological databases, for example to handle data from mass spectrometry of macromolecular complexes and other interaction technologies;

 

·         Research and development of “metadata base” architectures for biology, for example, single query interfaces that present data from transparent queries across multiple databases;

 

·         Development of algorithms and software related to the retrieval and analysis of biological information;

 

·         Activities that will facilitate the development of biological databases and knowledge bases, such as efforts to standardize nomenclature, conceptual information models, and semantic content;

 

·         Development (including planning and subsequent design, prototypes, implementation, testing, and distribution) of databases and related software tools crucial for biological research;

 

·         Activities that will facilitate the exchange of ideas among those involved in biological database research;

 

·         Activities (such as training, and collaborations between computer scientists and biological researchers) that will enhance the development and use of information resources, together with exploration of and research on alternative economic models for the long-term, sustainable support of important community resources.

 

References

 

[1]        Galperin, M.Y. "The Molecular Biology Database Collection: 2005 update." Nucleic Acids Research, 33 (Database issue), 2005. National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.

 

[2]        Markowitz, V.M. "Data Management Challenges for Molecular and Cell Biology: An Industry Perspective." Gene Logic Inc., Data Management Systems.

 

[3]        Retzel, E. "Bioinformatics: Toward an Integrated Environment."