Welcome to the GenomeWeb
Bioinformatics libraries, papers and theory

This is a collection of resources that may interest those of you writing bioinformatics programs.


[info] The Bioperl Project
[info] Bioperl Documentation
[info] XML Bioinformatics
[info] BioXML
[info] Genome Annotation Markup Elements (GAME)
[info] Distributed Sequence Annotation System (DAS)


[info] GFF: a proposed exchange format for gene-finding features
[info] NEXUS file format

Sites and Collections

[info] Genesafe - Gene prediction data sets
[info] BANBURY CROSS - Site for Gene Identification Software Benchmarking
[info] GASP1
[info] EST-Confirmed Human Splice Sites


[info] Bibliography on Features, Patterns, Correlations in DNA and Protein Sequences
[info] A Bibliography on Computational Gene Recognition


[info] Linking Biological Databases using CORBA
[info] Forsdyke's Bioinformatics background papers

Detailed information on the above options

The Bioperl Project
The Bioperl Project is an international association of developers of public domain Perl tools for computational molecular biology.

Bioperl Documentation
Documentation on the Bioperl modules.

XML Bioinformatics
Technical description of the XML bioinformatics standards

BioXML
Technical description of the XML bioinformatics standards

Genome Annotation Markup Elements (GAME)
The motivation for GAME is a desire to provide a syntax, together with some simple tools, that will facilitate the exchange of genomic annotations. It will enable genome centres, model organism databases, and individual researchers to clearly specify the conclusions they have drawn from their analyses of primary sequence data and share these XML descriptions with one another. The development of GAME was necessary to allow the Drosophila Genome Project to coordinate its efforts with Celera, which required a stable and expressive interchange format.

Distributed Sequence Annotation System (DAS)
The solution that we advocate allows sequence annotation to be decentralized among multiple third-party annotators and integrated on an as-needed basis by client-side software. A single server is designated the "reference server." It serves essential structural information about the genome: the physical map, which relates one entry to another (where an "entry" is an arbitrary segment of the sequence, such as a sequenced BAC or a contig), the DNA sequence for each entry, and the standard authorship information. Multiple sites then act as third-party "annotation servers." Using a web browser-like application, researchers can interrogate one or more annotation servers to retrieve features in a region of interest. The servers return the results using a standard data format, allowing the sequence browser to integrate the annotations and display them in graphical or tabular form. No attempt is made to automatically resolve contradictions between different third-party annotations. Indeed, it is the ability to facilitate comparison among different centers' annotations that distinguishes this proposal.
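
The client-side integration described above can be illustrated with a small Python sketch. It is not the DAS protocol or its data format: the annotation servers are stood in for by in-memory lists, and every name here (Feature, features_in_region, integrate, lab_a, lab_b, contig1) is a hypothetical chosen for illustration. It only shows the idea of collecting features for a region from several sources and keeping each source's label, without trying to reconcile disagreements.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Feature:
    entry: str   # entry identifier on the reference map, e.g. a contig
    start: int   # 1-based inclusive start coordinate
    end: int     # 1-based inclusive end coordinate
    kind: str    # feature type, e.g. "exon"
    source: str  # name of the annotation server that reported it


# Hypothetical stand-in for one server's data: (entry, start, end, kind)
RawFeature = Tuple[str, int, int, str]


def features_in_region(server_name: str, raw: List[RawFeature],
                       entry: str, start: int, end: int) -> List[Feature]:
    """Return one server's features that overlap [start, end] on an entry."""
    return [
        Feature(ent, s, e, kind, server_name)
        for (ent, s, e, kind) in raw
        if ent == entry and s <= end and e >= start
    ]


def integrate(servers: Dict[str, List[RawFeature]],
              entry: str, start: int, end: int) -> List[Feature]:
    """Collect features from every server; contradictions are kept, not resolved."""
    merged: List[Feature] = []
    for name, raw in servers.items():
        merged.extend(features_in_region(name, raw, entry, start, end))
    # Sort by position so overlapping annotations from different sources sit together.
    return sorted(merged, key=lambda f: (f.start, f.end, f.source))


# Two hypothetical annotation servers that disagree about an exon's boundaries.
servers = {
    "lab_a": [("contig1", 100, 200, "exon"), ("contig1", 900, 950, "exon")],
    "lab_b": [("contig1", 120, 210, "exon"), ("contig2", 1, 50, "exon")],
}

for f in integrate(servers, "contig1", 1, 500):
    print(f.source, f.entry, f.start, f.end, f.kind)
```

Both versions of the exon appear in the merged view, each tagged with its source, so a browser could display them side by side for comparison.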

GFF: a proposed exchange format for gene-finding features
GFF (Gene-Finding Features) is a format specification for describing genes and other features associated with genomic sequences. This page is a starting-point for finding out about this format and its use in bioinformatics. In particular, since its proposal a considerable amount of software has been developed for use with GFF and this page is intended as a focus for the collation of this software, whether developed in the Sanger Centre or elsewhere.
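
As a rough illustration of working with such a format, the Python sketch below reads a single GFF line. It assumes the commonly used tab-separated layout (seqname, source, feature, start, end, score, strand, frame, followed by an optional grouping or attributes field); the specification itself remains the authority on the details. The names GffRecord and parse_gff_line, and the sample line, are hypothetical.

```python
from typing import NamedTuple, Optional


class GffRecord(NamedTuple):
    seqname: str
    source: str
    feature: str
    start: int
    end: int
    score: Optional[float]  # None when the score column is "."
    strand: str
    frame: str
    attributes: str         # optional trailing field, left unparsed here


def parse_gff_line(line: str) -> Optional[GffRecord]:
    """Parse one GFF line; return None for blank lines and '#' comment lines."""
    line = line.rstrip("\r\n")
    if not line.strip() or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) < 8:
        raise ValueError(f"expected at least 8 tab-separated fields, got {len(fields)}")
    seqname, source, feature, start, end, score, strand, frame = fields[:8]
    attributes = fields[8] if len(fields) > 8 else ""
    return GffRecord(
        seqname, source, feature,
        int(start), int(end),
        None if score == "." else float(score),
        strand, frame, attributes,
    )


# Hypothetical example line, not taken from any real data set.
record = parse_gff_line("contig1\tgenefinder\texon\t100\t200\t0.9\t+\t0\tgene1")
print(record)
```

Keeping the trailing field as an unparsed string avoids committing to one attribute syntax, since conventions for that column differ between versions of the format.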

NEXUS file format
Technical description of the NEXUS file format

Genesafe - Gene prediction data sets
Genesafe was created to help gene predictors collaborate on training and testing sets. Genesafe is about making and distributing common datasets for gene finding.

It consists of this set of web pages, a mailing list and a set of data on the FTP site.

BANBURY CROSS - Site for Gene Identification Software Benchmarking
This benchmark site is intended to be a forum for scientists working in the field of gene identification and anonymous genomic sequence annotation, with the goal of improving current methods in the context of very large genomic sequences, vertebrate ones in particular.

GASP1
The goal of this experiment is to obtain an in-depth and objective assessment of the current state of the art in gene and functional site predictions in genomic DNA. To this end, participants will predict as much as possible about a sample genomic region that has been studied intensively in the past. All participants will be provided with datasets that can be used to help make predictions or to train computational methods. There will be no winners or losers. We are interested in seeing what level of genome annotation is achievable when the community works together. Results of the experiment will be made available through this web site after the ISMB '99 meeting.

EST-Confirmed Human Splice Sites

Bibliography on Features, Patterns, Correlations in DNA and Protein Sequences
This bibliography started out with a narrow focus: non-trivial long-range statistical correlations in DNA sequences. Gradually, I have been collecting papers on other topics as well. Now I have a collection of papers studying the most basic features of DNA and protein sequences, those concerning these sequences as symbolic strings.

A Bibliography on Computational Gene Recognition
The topic of computational gene recognition has become more and more important as long stretches of DNA are sequenced in the Human Genome Project. How do we know where the genes are located from the sequence information alone? The papers listed in this bibliography are an accumulation of more than 15 years of research in computational molecular biology on this topic.

Linking Biological Databases using CORBA
The objective of this work is to combine the data and services of a number of European partners using CORBA. These partners will provide access to a wide range of distributed data sources (EMBL nucleotide sequence database, SwissProt, PIR, MSD, GDB, TRANSFAC, P53 and RHdb). Client applications will be developed that make use of this provision and build on it to provide integrated views of these data. These integrated views will enable access to the data at a higher level, in which the data are assembled into compound objects that hide the unnatural partitions in these data and represent our understanding of biology more adequately.

Forsdyke's Bioinformatics background papers
Introduction to bioinformatics theory.

Any Comments, Questions? Support@hgmp.mrc.ac.uk