Recent Techniques in Biological
Research: Bioinformatics
Pankaj Sohaney
Introduction
Successful and productive
research in any discipline depends on the efficacy of its
hypothesis. A good theoretical model includes all the known
and relevant data so that it closely approximates real
life and is free of mistaken conjecture that can result in
the loss of crucial time. Therefore, in order to
use, store, maintain, analyse and manipulate the heaps of research
data generated every day, there is an urgent need for better
computational models.
In a way, the enormous volume of data that was a challenge in
biology is a challenge in computing as well. Bioinformatics is
conceptualising biology in terms of molecules (chemistry) and
applying "informatics techniques" (maths, computer science,
statistics) to understand, organise and predict the information
associated with these molecules on a large scale.
Bioinformatics plays a very important role in the progress of
biology. Without the help of bioinformatics, the changeover of
biology into an industrial-scale, comprehensive science with
significant medical and thus economic impact would be
inconceivable. In
many ways, current techniques of mathematics and computer
science have not been sufficient to support life science
research. Bioinformatics, as a result, is not only bringing
existing methods to bear on the new problems but also
developing, or catalyzing the development of novel techniques
in the biological sciences. With its increased importance, the
fostering of bioinformatics has become a crucial part of
efforts to promote biotechnology and the life sciences
in general.
Over the past few decades, major
advances in the field of Biology, coupled with advances in
genomics and proteomics, have led to an explosive growth in
the biological information generated by the scientific
community. This deluge of genomic information has, in turn,
led to an absolute requirement for computerized databases to
store, organize and index the data, and for specialized tools
to view and analyze the data.
Bioinformatics
Bioinformatics can be defined as
the interface between biotechnology and information technology
[IT]. Thus, most people working in this field have training in
either biology or information technology, and have learned
about the other field by dealing with its problems or
using its tools. Although the term 'Bioinformatics'
is not really well defined, this scientific field can be said
to deal with the computational management of all kinds of
biological information. Most of the bioinformatics work that
is being done can be described as analyzing biological
data.
Bioinformatics is the field of
science in which biology, computer science, and information
technology merge to form a single discipline. The ultimate
goal is to enable the discovery of new biological insights as
well as to create a global perspective from which unifying
principles in biology can be discerned. At the beginning of
the "genomic revolution," bioinformatics was mainly
concerned with the creation and maintenance of a database to
store biological information, such as nucleotide and amino
acid sequences. Development of this type of database involved
not only design issues, but also the development of complex
interfaces whereby researchers could both access existing data
and submit new or revised data.
Eventually, all this data must be
combined to form a comprehensive picture of normal cellular
activities so that researchers may study how these activities
are altered in different disease states. Therefore, the field
of bioinformatics has evolved such that the most pressing
task now involves the analysis and interpretation of various
types of data, including nucleotide and amino acid sequences,
protein domains, and protein structures. The actual process of
analysing and interpreting data is referred to as
computational biology. Important sub-disciplines within
bioinformatics and computational biology include: the
development and implementation of tools that enable efficient
access to, and use and management of, various types of
information; and the development of new algorithms
(mathematical formulas) and statistics with which to assess
relationships among members of large data sets, such as
methods to locate a gene within a sequence, predict protein
structure and/or function, and cluster protein sequences into
families of related sequences.
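One of the methods mentioned above, locating a gene within a
sequence, can be illustrated with a toy scan for open reading
frames (ORFs). The Python sketch below is only an illustration
under simplifying assumptions that are not part of the original
text: it looks at the forward strand only, treats ATG as the
start codon and TAA, TAG and TGA as stop codons, and uses a
hypothetical function name (find_orfs) and a hypothetical
minimum length. Real gene-finding programs rely on far more
elaborate statistical models.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return (start, end, sequence) for each forward-strand ORF.

    An ORF here is an ATG followed, in the same reading frame, by a
    stop codon. min_codons counts the start and stop codons too.
    This is a teaching sketch, not a real gene finder.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):                     # three forward reading frames
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # remember the first start codon
            elif start is not None and codon in STOP_CODONS:
                end = i + 3                    # end index is exclusive
                if (end - start) // 3 >= min_codons:
                    orfs.append((start, end, dna[start:end]))
                start = None                   # keep scanning for later ORFs
    return sorted(orfs)


if __name__ == "__main__":
    for start, end, seq in find_orfs("CCATGAAATAGGG"):
        print(start, end, seq)
```

For the made-up input CCATGAAATAGGG the scan reports a single
ORF, ATGAAATAG, spanning positions 2 to 11 (zero-based, end
exclusive).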
Biological Database
A biological database is a large,
organized body of persistent data, usually associated with
computerized software designed to update, query, and retrieve
components of the data stored within the system. A simple
database might be a single file containing many records, each
of which includes the same set of information. For example, a record
associated with a nucleotide sequence database typically
contains information such as a contact name; the input sequence
with a description of the type of molecule; the scientific
name of the source organism from which it was isolated; and,
often, literature citations associated with the sequence. A
database of the International Mycological Institute's Index of
Fungi is also available. It
can be searched by genus or species of fungus and gives the
reference (volume and page) to the Index of Fungi.
For researchers to benefit from
the data stored in a database, two additional requirements
must be met: easy access to the information; and a method for
extracting only that information needed to answer a specific
biological question.
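The idea of a simple single-file database, and of extracting
only the records that answer a specific question, can be
sketched in a few lines of Python. The field names follow the
record description given above; the "|" separated file layout,
the class and function names, and all record contents are
hypothetical placeholders chosen for illustration, not the
format of any real database.

```python
from dataclasses import dataclass, field


@dataclass
class SequenceRecord:
    contact: str
    molecule_type: str
    organism: str
    sequence: str
    citations: list = field(default_factory=list)


# Hypothetical flat file: one record per line, fields separated by "|",
# citations separated by ";". All values are invented placeholders.
FLAT_FILE = """\
contact A|DNA|Aspergillus niger|ATGGCGTTA|ref-1
contact B|mRNA|Homo sapiens|AUGGCUAAG|
contact C|DNA|Aspergillus niger|ATGCCCTGA|ref-2;ref-3
"""


def parse_records(text):
    """Turn the flat-file text into a list of SequenceRecord objects."""
    records = []
    for line in text.strip().splitlines():
        contact, molecule_type, organism, sequence, refs = line.split("|")
        citations = [r for r in refs.split(";") if r]
        records.append(
            SequenceRecord(contact, molecule_type, organism, sequence, citations)
        )
    return records


def query(records, **criteria):
    """Return only the records whose fields equal every given criterion."""
    return [
        r for r in records
        if all(getattr(r, name) == value for name, value in criteria.items())
    ]


if __name__ == "__main__":
    db = parse_records(FLAT_FILE)
    for rec in query(db, organism="Aspergillus niger", molecule_type="DNA"):
        print(rec.contact, rec.sequence, rec.citations)
```

Keeping every record to the same set of fields is what makes a
generic query function like this possible; real databases add
indexing so that such lookups stay fast as the file grows.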
Evolutionary Biology
Bioinformatics is an evolution that has joined the
muscle of math and computing to the heart of the life
sciences. It is essential to the use of
genomic information in understanding human diseases and in the
identification of new molecular targets for drug discovery. In
recognition of this, many universities, government
institutions and pharmaceutical firms have formed
bioinformatics groups, consisting of computational biologists
and bioinformatics computer scientists.
Equally
exciting is the potential for uncovering evolutionary
relationships and patterns between different forms of life.
With the aid of nucleotide and protein sequences, it should be
possible to find the ancestral ties between different
organisms. So far, experience has taught us that closely
related organisms have similar sequences and that more
distantly related organisms have more dissimilar sequences.
Proteins that show significant sequence conservation,
indicating a clear evolutionary relationship, are said to
belong to the same protein family. By studying protein folds
(distinct protein building blocks) and families, scientists
are able to reconstruct the evolutionary relationship between
two species and to estimate the time of divergence between two
organisms since they last shared a common ancestor.
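The observation that closely related organisms have similar
sequences can be made quantitative with a simple measure such
as percent identity. The sketch below assumes that the two
sequences have already been aligned to the same length, with
"-" marking gaps; the function name and the toy sequences are
invented for illustration. Estimating an actual time of
divergence requires additional evolutionary models that are
not shown here.

```python
def percent_identity(seq_a, seq_b):
    """Percent of identical positions between two pre-aligned sequences.

    Positions where either sequence has a gap ("-") are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            continue                 # ignore gapped columns
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped positions to compare")
    return 100.0 * matches / compared


if __name__ == "__main__":
    reference = "MKTAYIAKQR"         # invented toy sequences
    close_relative = "MKTAYIAKQK"
    distant_relative = "MRSAYLGKHK"
    print(percent_identity(reference, close_relative))    # 90.0
    print(percent_identity(reference, distant_relative))  # 40.0
```

In this toy case the first pair differs at one position out of
ten and the second at six, which mirrors the rule of thumb
stated above: the more distant the relationship, the lower the
identity.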
The human genome, found in every
cell of a human being, consists of 23 pairs of chromosomes.
These chromosomes constitute the 3 billion letters of chemical
code that specify the blueprint for a human being. The worldwide
Human Genome Project, a vast endeavor aimed at reading this
entire DNA code, will completely transform biology, medicine
and biotechnology. The entire code will be available on our
computers; all 30,000 human genes will be identified; all 5000
inherited diseases will become diagnosable and potentially
curable; drug design will be completely transformed; and our
understanding of ourselves will move into a new dimension. The
Genome Project focuses on two main objectives: mapping -
pinpointing the genomic location of all genes and markers; and
DNA sequencing - reading the chemical "text" of all
the genes and their intervening sequences. DNA sequences are
entered into large databases, where they can be compared with
the known genes, including inter-species comparisons. The
explosion of publicly available genomic information resulting
from the Human Genome Project has precipitated the need for
Bioinformatics capabilities. The science of Bioinformatics,
which is the melding of molecular biology with computer
science, is essential to the use of genomic information in
understanding human diseases and in the identification of new
molecular targets for drug discovery.
Protein Modelling
The term proteome refers to all
the proteins expressed by a genome, and thus proteomics
involves the identification of proteins in the body and the
determination of their role in physiological and
pathophysiological functions. The ~30,000 genes defined by the
Human Genome Project translate into 300,000 to 1 million
proteins when alternate splicing and post-translational
modifications are considered. While a genome remains unchanged
to a large extent, the proteins in any particular cell change
dramatically as genes are turned on and off in response to its
environment.
Sequence comparison is a very
powerful tool in molecular biology, genetics and protein
chemistry. Frequently it is unknown which protein a new
DNA sequence codes for, or whether it codes for any protein at
all. If a new coding sequence is compared with all known
sequences, there is a high probability of finding a similar
sequence. Often the role that the matching protein in the data
bank plays in the cell is already known.
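The comparison of a new sequence against all known sequences
can be sketched as a search for the best-scoring entry in a
small data bank. The example below uses a deliberately simple
stand-in for real alignment tools: it scores similarity as the
fraction of shared 3-letter words (k-mers) between two
sequences. The entry names, sequences and functional labels
are all hypothetical, and the scoring method is an assumption
made for illustration only.

```python
def kmers(seq, k=3):
    """Set of all overlapping words of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity(seq_a, seq_b, k=3):
    """Jaccard index of the two k-mer sets (0.0 = nothing shared, 1.0 = same set)."""
    words_a, words_b = kmers(seq_a, k), kmers(seq_b, k)
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def best_hit(query, database, k=3):
    """Return (name, function, score) of the most similar database entry."""
    scored = [
        (similarity(query, seq, k), name, function)
        for name, (seq, function) in database.items()
    ]
    score, name, function = max(scored)
    return name, function, score


# Hypothetical data bank: name -> (sequence, known role in the cell).
KNOWN_SEQUENCES = {
    "entry_1": ("ATGGCGTTAGCCGATTGA", "hypothetical kinase"),
    "entry_2": ("ATGTTTCCCAAAGGGTGA", "hypothetical transporter"),
}

if __name__ == "__main__":
    new_sequence = "ATGGCGTTAGCCGACTGA"   # differs from entry_1 at one base
    name, function, score = best_hit(new_sequence, KNOWN_SEQUENCES)
    print(name, function, round(score, 2))
```

Here the new sequence is matched to entry_1, so the role
recorded for that entry becomes a testable guess about the
role of the new sequence. A high score is a hint, not proof,
and would still need experimental confirmation.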
The process of evolution has
resulted in the production of DNA sequences that encode
proteins with specific functions. In the absence of a protein
structure that has been determined by X-ray crystallography or
NMR spectroscopy, researchers can try to predict the
three-dimensional structure using protein or molecular
modeling. This method uses experimentally determined protein
structures (templates) to predict the structure of another
protein that has a similar amino acid sequence (target).
Although molecular modeling may
not be as accurate at determining a protein's structure as
experimental methods, it is still extremely helpful in
proposing and testing various biological hypotheses. Molecular
modeling also provides a starting point for researchers
wishing to confirm a structure through X-ray crystallography
and NMR spectroscopy. As the different genome projects are
producing more sequences, and because novel protein folds and
families are being determined, protein modeling will become an
increasingly important tool for scientists working to
understand normal and disease-related processes in living
organisms.
Genome Mapping
In 1971, scientists
devised a method to cut the large pieces of DNA on each chromosome
into smaller, more manageable pieces, and the job of locating
genes got a lot easier. Within each of these smaller pieces,
scientists were finally able to locate the regions containing
genes. As the positions of more and more genes were found, a
"genetic map" was constructed which showed the positions of the
genes relative to each other, and relative to the ends and
center of the chromosomes. Genomic maps serve as a scaffold
for orienting sequence information. A few years ago, a
researcher wishing to localize a gene, or nucleotide sequence,
was forced to manually map the genomic region of interest, a
time-consuming and often painstaking process.
The science of locating these
genes is called "Genetic Mapping" and although we
now know the location of a number of very important genes, the
map is far from complete. Today, thanks to new technologies
and the influx of sequence data, a number of high quality,
genome-wide maps are available to the scientific community for
use in their research.
Computerized maps make gene
hunting faster, cheaper and more practical for almost any
scientist. In a nutshell, a scientist would first use a
genetic map to assign a gene to a relatively small area of a
chromosome. In light of these advances, a researcher's burden
has shifted from mapping a genome or genomic region of
interest, to navigating a vast number of Web sites and
databases.
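The first step described above, using a map to assign a gene
to a relatively small area of a chromosome, amounts to finding
the two mapped markers that flank a position. A minimal Python
sketch follows; the marker names, their positions (in
arbitrary map units) and the function name are hypothetical
and serve only to show the lookup.

```python
from bisect import bisect_right

# Hypothetical map of one chromosome: (position, marker), sorted by position.
MARKERS = [
    (0.0, "marker_A"),
    (12.5, "marker_B"),
    (30.0, "marker_C"),
    (47.5, "marker_D"),
]


def flanking_markers(position, markers=MARKERS):
    """Return the pair of neighbouring markers whose interval holds position."""
    positions = [p for p, _ in markers]
    if not positions[0] <= position <= positions[-1]:
        raise ValueError("position lies outside the mapped region")
    i = bisect_right(positions, position)      # first marker beyond position
    i = min(max(i, 1), len(markers) - 1)       # clamp so both neighbours exist
    return markers[i - 1][1], markers[i][1]


if __name__ == "__main__":
    print(flanking_markers(20.0))   # ('marker_B', 'marker_C')
```

Because the markers are kept sorted, the lookup is a binary
search and stays fast even for maps with very many markers.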
The rapidly emerging field of
bioinformatics promises to lead to advances in understanding
basic mycological processes and, in turn, advances in the
diagnosis, treatment, and prevention of many genetic diseases.
Bioinformatics has transformed the discipline of biology from
a purely lab-based science to an information science as well.
Increasingly, biological studies begin with a scientist
conducting vast numbers of database and Web site searches to
formulate specific hypotheses or design large-scale
experiments. The implications behind this change, for both
science and medicine, are staggering.
Importance
The justification for applying
computational approaches to facilitate the understanding of
various biological processes includes: a more global
perspective in experimental design; and the ability to
capitalize on the emerging technology of database-mining: the
process by which testable hypotheses are generated regarding
the function or structure of a gene or protein of interest by
identifying similar sequences in better characterized
organisms.
The contribution of bioinformatics to
drug discovery is twofold. Firstly, the computer may help to
optimize the pharmacological profile of existing drugs by
guiding the synthesis of new and "better" compounds.
Secondly, as more and more structural information on possible
protein targets and their biochemical role in the cell becomes
available, completely new therapeutic concepts can be
developed. The computer helps in both steps: to find out about
possible biological functions of a protein by comparing its
amino acid sequence to databases of proteins with known
function, and to understand the molecular workings of a given
protein structure. Understanding the biological or biochemical
mechanism of a disease then often suggests the types of
molecules needed for new drugs.
The effective integration and use of information will
become the single biggest differentiator of pharmaceutical
R&D competitive advantage in the next decade.
IT Scenario
Accurate predictions help
scientists and researchers to prepare in advance for potential
calamities. They also help them to decide on alternative
methods in advance.
However, the step from theoretical biomathematics to
applied bioinformatics, intending to produce software from an
algorithm, is not an easy one. It requires a supportive
R&D climate that generates a local need for such research,
and an appropriate computer science infrastructure. At
present, bioinformatics has successfully been applied only in
those developing countries where these requirements are met,
such as Brazil, China, India, Mexico and South Africa.
Much of the work currently available
in Bioinformatics involves the design and implementation of
programs and systems for the storage, management and analysis
of vast amounts of DNA sequence data. This requires in-depth
programming and relational database skills, which very few
biologists possess, and so it is largely the computational
specialists who are filling these roles. This is not to say
the computer-savvy biologist doesn't play an important role.
As the bioinformatics field matures there will be a huge
demand for outreach to the biological community to sift
through gigabases of genomic sequence in search of novel
targets.
Programming Skills
In addition to
extensive knowledge of mycological packages, one will need to
learn web and programming skills including HTML, Perl, Java
and C++ and be familiar with a variety of operating systems
(especially UNIX and Linux). Relational database skills like
SQL and database applications such as Sybase or Oracle will be
highly advantageous.
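To give a flavour of the relational database skills mentioned
above, the following Python sketch stores a few sequence
records in a table and summarises them with an SQL query. It
uses SQLite, through Python's standard sqlite3 module, purely
as a lightweight stand-in for systems such as Sybase or
Oracle; the table layout, function names and rows are
assumptions made for illustration.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database holding a few invented sequence records."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE sequences (
               id INTEGER PRIMARY KEY,
               organism TEXT NOT NULL,
               molecule_type TEXT NOT NULL,
               sequence TEXT NOT NULL
           )"""
    )
    conn.executemany(
        "INSERT INTO sequences (organism, molecule_type, sequence) "
        "VALUES (?, ?, ?)",
        [
            ("Homo sapiens", "DNA", "ATGGCGTTA"),
            ("Homo sapiens", "DNA", "ATGCCC"),
            ("Aspergillus niger", "DNA", "ATGAAATAG"),
        ],
    )
    conn.commit()
    return conn


def summarise_by_organism(conn):
    """Return (organism, record count, total bases) rows, sorted by organism."""
    return conn.execute(
        "SELECT organism, COUNT(*), SUM(LENGTH(sequence)) "
        "FROM sequences GROUP BY organism ORDER BY organism"
    ).fetchall()


if __name__ == "__main__":
    connection = build_demo_db()
    for row in summarise_by_organism(connection):
        print(row)
    connection.close()
```

The same SELECT ... GROUP BY pattern carries over to larger
relational systems, although connection handling and SQL
dialect details differ from product to product.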