Welcome to the GenomeWeb
Protein Pattern and Domain Databases

These are a collection of protein pattern and domain database sites.

[info] PROSITE
[info] PrositeScan - search the PROSITE database with your sequence
[info] ProfileScan - Search the profiles-entries in PROSITE with your sequence
[info] Frame-ProfileScan - Search DNA sequence vs. a protein profile database
[info] PatternFind - search a protein database with a pattern
[info] PRINTS
[info] Pfam
[info] ProDom
[info] Blocks
[info] SBASE
[info] MOTIF - Search for protein sequence motifs
[info] ProClass
[info] Clusters of Orthologous Groups (COGs)
[info] MODULES in Proteins
[info] SMART - Simple Modular Architecture Research Tool
[info] 3Dee - Database of Protein Domain Definitions
[info] Proteome Analysis @ EBI
[info] InterPro
[info] CluSTr (Clusters of SWISS-PROT+TrEMBL proteins) database
[info] CDD: A Conserved Domain Database and Search Service

Detailed information on the above options

PROSITE is a method for determining the function of uncharacterized proteins translated from genomic or cDNA sequences. It consists of a database of biologically significant sites, patterns and profiles that help to reliably identify which known protein family (if any) a new sequence belongs to.

PrositeScan - search the PROSITE database with your sequence
This allows you to search one or more sequences against the current release of Amos Bairoch's PROSITE database.

ProfileScan - Search the profiles-entries in PROSITE with your sequence
This uses the pfscan program to search a single sequence against all profile entries in the current release of PROSITE. The PROSITE collection of protein sequence motifs contains a large number of patterns and currently only a few profiles. The particular strength of profiles is that they can be used to describe very divergent protein motifs.

Frame-ProfileScan - Search DNA sequence vs. a protein profile database
This server uses the frame-search capabilities of pfscan to query the collection of PROSITE profiles (including pre-release) with a single DNA sequence. The six reading frames of the DNA query are inspected. Coding frameshifts in the DNA sequence are supported. Since frame-tolerant searches consume a lot of CPU time, DNA sequence length is limited to about 2400 bases.
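As an illustration of what inspecting the six reading frames means, here is a minimal Python sketch that translates a DNA sequence in all three forward and three reverse-complement frames using the standard genetic code. It is not the pfscan implementation and it does not model frameshift tolerance; the function names are chosen for this example only.

```python
from itertools import product

# Standard genetic code, listed in the conventional TCAG codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frames(dna):
    """Return the translations of the three forward and three reverse frames."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]  # reverse complement
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


frames = six_frames("ATGGCCTAA")
print(frames["+1"])  # MA*
print(frames["-1"])  # LGH
```

Each of the six translations could then be compared with a protein profile. A frame-tolerant search goes further by allowing the alignment to switch frames partway through the sequence.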

PatternFind - search a protein database with a pattern
This takes a user-defined pattern (PROSITE-format or regular expression) and searches a protein database. It offers several useful output options.
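The two pattern styles are closely related. The Python sketch below shows one way a PROSITE-format pattern could be turned into a regular expression and run against a sequence. It covers the common syntax only (x, [..], {..}, repeat counts, and the < and > anchors outside brackets), it is not the PatternFind code, and the function names are assumptions made for this example.

```python
import re


def prosite_to_regex(pattern):
    """Convert a PROSITE-format pattern to a Python regular expression."""
    pattern = pattern.strip().rstrip(".")  # patterns end with a period
    parts = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):  # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):  # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        core, repeat = element, ""
        if element.endswith(")"):  # repeat count: x(2) or x(2,4)
            core, _, count = element[:-1].partition("(")
            repeat = "{" + count + "}"
        if core == "x":  # x matches any residue
            core = "."
        elif core.startswith("{"):  # {P} means any residue except P
            core = "[^" + core[1:-1] + "]"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


def find_pattern(pattern, sequence):
    """Return (1-based start, matched text) for each non-overlapping match."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start() + 1, m.group()) for m in regex.finditer(sequence)]


print(prosite_to_regex("N-{P}-[ST]-{P}."))             # N[^P][ST][^P]
print(find_pattern("N-{P}-[ST]-{P}.", "MKNGSAANPTQ"))  # [(3, 'NGSA')]
```

Note that re.finditer reports non-overlapping matches only, so a production search tool would need extra handling to report overlapping sites.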

PRINTS is a compendium of protein fingerprints. A fingerprint is a group of conserved motifs used to characterise a protein family; its diagnostic power is refined by iterative scanning of OWL. Usually the motifs do not overlap, but are separated along a sequence, though they may be contiguous in 3D-space. Fingerprints can encode protein folds and functionalities more flexibly and powerfully than can single motifs: the database thus provides a useful adjunct to PROSITE.

Pfam is a high-quality comprehensive collection of protein domain families.

rd search.

ProDom is a comprehensive collection of protein families. It was constructed by clustering all complete protein sequences in SWISS-PROT with the clustering algorithm Domainer (Sonnhammer and Kahn, 1994). The novelty of ProDom is that the modular arrangement of proteins has been taken into account: whenever domain boundaries were detected, the sequences were cut to produce consistent families of domains.

Blocks is operated by the Fred Hutchinson Cancer Research Center. An aid to the detection and verification of protein sequence homologies, Blocks compares a protein or DNA sequence to a database of protein blocks. Blocks are short multiply aligned sequences corresponding to the most highly conserved regions of proteins. The rationale behind searching a database of blocks is that information from multiply aligned sequences is present in a concentrated form, reducing background and increasing sensitivity to distant relationships.

SBASE is a database of annotated protein domains. SBASE is searchable by subfields and is cross-referenced to SWISS-PROT, PROSITE, EMBL, MEDLINE, MEDLARS, OMIM, ProDom, PRINTS and Blocks.

There is an interface to a Blast mailserver.

MOTIF - Search for protein sequence motifs
Search for protein sequence motifs in PROSITE patterns, PROSITE profiles, Blocks, ProDom, PRINTS, or a user-defined profile.

The ProClass database is a non-redundant protein database organized according to family relationships as defined collectively by PROSITE patterns and PIR superfamilies. By combining global and motif similarities into a single family organization scheme, ProClass can facilitate protein family information retrieval, unveil domain and family relationships, and classify multi-domain proteins.

Clusters of Orthologous Groups (COGs)
Clusters of Orthologous Groups (COGs) were delineated by comparing protein sequences encoded in 7 complete genomes, representing 5 major phylogenetic lineages. Each COG consists of individual proteins or groups of paralogs from at least 3 lineages and thus corresponds to an ancient conserved domain.
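The lineage requirement can be read as a simple filter on candidate groups. The Python sketch below checks whether a group's members come from at least three distinct lineages. The genome names, lineage labels and function name are hypothetical placeholders, and the sketch does not reproduce how COGs are actually constructed.

```python
def spans_enough_lineages(members, lineage_of, minimum=3):
    """Return True if the group's proteins come from at least `minimum` lineages."""
    lineages = {lineage_of[genome] for genome, _protein in members}
    return len(lineages) >= minimum


# Hypothetical genome-to-lineage assignment (placeholder labels).
lineage_of = {
    "genomeA": "lineage1",
    "genomeB": "lineage1",
    "genomeC": "lineage2",
    "genomeD": "lineage3",
}

# Each member is a (genome, protein) pair; paralogs share a genome.
group_ok = [("genomeA", "p1"), ("genomeC", "p7"), ("genomeD", "p3")]
group_small = [("genomeA", "p2"), ("genomeB", "p5"), ("genomeC", "p9")]

print(spans_enough_lineages(group_ok, lineage_of))     # True
print(spans_enough_lineages(group_small, lineage_of))  # False
```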

MODULES in Proteins
The module pages contain information and research tools on mobile protein domains.

SMART - Simple Modular Architecture Research Tool
This searches your protein sequence against a database of domain profiles and displays a diagram of the domains together with low-complexity regions, transmembrane regions, etc.

You can then optionally do a BLAST search of the regions of your sequence which did not match a known domain.

3Dee - Database of Protein Domain Definitions
This database contains definitions of structural domains for all protein chains in the Brookhaven Protein Databank (PDB) that have 20 or more residues and are not theoretical models. The domains have been clustered on sequence similarity and structural similarity to form families. The families are stored as a hierarchy.

Updating does not require complete regeneration of the database and is almost completely automated, so we expect to be able to complete updates every 1-2 months.

Proteome Analysis @ EBI
The genome sequencing projects are providing a vast amount of sequence data which remain largely unexploited. With access to whole genome sequences from various organisms and imminent completion of many more, the SWISS-PROT group at the European Bioinformatics Institute (EBI) has decided to develop a research-oriented initiative in order to utilise all the existing resources and provide comparative analysis of the predicted protein coding sequences of all complete genomes. The two main projects used in this proteome analysis effort, InterPro and CluSTr, are aiming to give a new perspective on domain structure and function, gene duplication and protein families in different genomes.

Proteome analysis has already been produced for a number of completely sequenced organisms.

InterPro is an Integrated Resource of Protein Domains and Functional Sites. InterPro rationalises the complementary efforts of the PROSITE, PRINTS, Pfam and ProDom database projects.

Each combined InterPro entry includes functional descriptions and literature references, and links are made back to the relevant member database(s), allowing users to see at a glance whether a particular family or domain has associated patterns, profiles, fingerprints, etc. Merged and individual entries (i.e., those that have no counterpart in the companion resources) are assigned unique accession numbers. Each InterPro entry lists all the matches against SWISS-PROT and TrEMBL.

CluSTr (Clusters of SWISS-PROT+TrEMBL proteins) database
The CluSTr (Clusters of SWISS-PROT+TrEMBL proteins) database offers an automatic classification of SWISS-PROT + TrEMBL proteins into groups of related proteins. The clustering is based on analysis of all pairwise comparisons between protein sequences. The database provides links to InterPro, which integrates information on protein families, domains and functional sites from PROSITE, PRINTS, Pfam and ProDom. CluSTr also has cross-references to HSSP and PDB.

CluSTr is a useful resource for whole genome analysis and has already been used for the proteome analysis of a number of completely sequenced genomes.
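To make the idea of grouping proteins from all pairwise comparisons concrete, here is a minimal Python sketch of single-linkage clustering: any two proteins whose similarity score reaches a threshold end up in the same group. The scores, threshold, protein names and function name are invented for illustration, and the sketch is not a description of how CluSTr itself is computed.

```python
def cluster_by_similarity(proteins, pair_scores, threshold):
    """Group proteins so that any pair scoring >= threshold shares a cluster."""
    parent = {p: p for p in proteins}  # union-find forest

    def find(p):
        # Follow parent links to the cluster representative, shortening the path.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for (a, b), score in pair_scores.items():
        if score >= threshold:
            parent[find(a)] = find(b)  # merge the two clusters

    clusters = {}
    for p in proteins:
        clusters.setdefault(find(p), []).append(p)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical pairwise similarity scores.
scores = {
    ("P1", "P2"): 12.0,
    ("P2", "P3"): 9.5,
    ("P3", "P4"): 2.0,
    ("P1", "P4"): 1.0,
}
print(cluster_by_similarity(["P1", "P2", "P3", "P4"], scores, threshold=8.0))
# [['P1', 'P2', 'P3'], ['P4']]
```

Raising the threshold splits the groups into smaller, more closely related clusters. Lowering it merges them.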

CDD: A Conserved Domain Database and Search Service
Proteins often contain several modules or domains, each with a distinct evolutionary origin and function. The CD-Search service may be used to identify the conserved domains present in a protein sequence.

Computational biologists define conserved domains based on recurring sequence patterns or motifs. CDD currently contains domains derived from two popular collections, SMART and Pfam, plus contributions from colleagues at NCBI. The source databases also provide descriptions and links to citations. Since conserved domains correspond to compact structural units, CDs contain links to 3D structure via Cn3D whenever possible.

To identify conserved domains in a protein sequence, the CD-Search service employs the reverse position-specific BLAST algorithm. The query sequence is compared to a position-specific score matrix prepared from the underlying conserved domain alignment. Hits may be displayed as a pairwise alignment of the query sequence with a representative domain sequence, or as a multiple alignment.
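The core idea of comparing a query against a position-specific score matrix can be sketched in a few lines of Python. The example below slides a small matrix along the query without gaps and reports the best-scoring window. The matrix values, default score and function name are invented for illustration; the real service uses the reverse position-specific BLAST algorithm, which this sketch does not reproduce.

```python
def best_pssm_hit(query, pssm, default=-1):
    """Slide the matrix along the query and return (score, 1-based start, window)."""
    width = len(pssm)
    best = None
    for start in range(len(query) - width + 1):
        window = query[start:start + width]
        # Each matrix column scores the residue found at that position.
        score = sum(column.get(residue, default)
                    for column, residue in zip(pssm, window))
        if best is None or score > best[0]:
            best = (score, start + 1, window)
    return best


# Hypothetical three-column matrix: one dict of residue scores per position.
pssm = [
    {"G": 5},
    {"K": 4, "R": 3},
    {"S": 4, "T": 4},
]
print(best_pssm_hit("MAGKTLLGRS", pssm))  # (13, 3, 'GKT')
```

Because each column has its own scores, the same residue can be rewarded at one position and penalized at another, which is what distinguishes this from scoring with a single substitution matrix.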

Any Comments, Questions? Support@hgmp.mrc.ac.uk