Data sources and curation
Data curation in the Chorismate Synthase Database (CSDB) starts from manual curation of all publicly available sequence, structure and functional information for pathogens in UniProtKB [13, 14]. In addition to literature references and annotations of sequence and structure features, identifiers from other databases were imported (e.g. NCBI taxonomy codes, Gene Ontology classifications, InterPro and Pfam accessions, SUPERFAMILY, SCOP, PROSITE, KEGG and PubChem Substance). The CSDB taxonomy is derived from the NCBI taxonomy database.
The data in CSDB are organized into 7 fields (Figure 1), covering protein resources, gene annotations, features, gene and nucleotide sequences, pathways, molecular targets, taxonomic IDs and literature references. The classification of pathogenic bacteria used in CSDB is similar to the list published as “Classification of Pathogenic Bacteria” (http://www.buzzle.com/articles/pathogenic-bacteria-list.html). Where further information on a pathogenic bacterium is present in external resources such as Swiss-Prot, the NCBI Taxonomy Browser, EMBL-EBI, the Sanger Institute, chemical databases, PDB and PubMed, links to those resources are provided.
An extensive literature survey was carried out using PubMed and MEDLINE to extract information about human diseases caused by various bacterial pathogens. For each bacterial strain, critical features related to chorismate synthase were collected: the gene sequence, gene ID and protein sequence in FASTA format, together with domain and motif information retrieved from domain and motif databases. Structure-related information was retrieved from PDB, CATH and SCOP, kinetic data from the literature, pathway information from KEGG, and Gene Ontology information from the GO database. The database was constructed by integrating this information in a flat-file format.
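The flat-file integration step can be illustrated with a short Python sketch. The layout below (two-letter line codes, three-space separator, `//` record terminator) and the field names are assumptions made only for illustration; the text does not describe the actual CSDB flat-file layout.

```python
from typing import Dict, List

# Hypothetical line codes; the real CSDB field layout is not given in the text.
FIELD_CODES = {
    "organism": "OS",
    "gene_id": "GI",
    "pathway": "PW",
    "structure": "DR",
    "sequence": "SQ",
}
CODE_TO_FIELD = {code: name for name, code in FIELD_CODES.items()}
SEPARATOR = "   "  # three spaces between the line code and its value


def write_record(record: Dict[str, str]) -> str:
    """Serialize one entry as 'CODE   value' lines terminated by '//'."""
    lines = [f"{FIELD_CODES[name]}{SEPARATOR}{value}" for name, value in record.items()]
    return "\n".join(lines + ["//"])


def parse_flat_file(text: str) -> List[Dict[str, str]]:
    """Parse flat-file text back into a list of field dictionaries."""
    records: List[Dict[str, str]] = []
    current: Dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.rstrip()
        if line == "//":  # end-of-record marker
            if current:
                records.append(current)
            current = {}
        elif line:
            code, _, value = line.partition(SEPARATOR)
            current[CODE_TO_FIELD[code]] = value.strip()
    return records
```

Calling `write_record` on a dictionary of values gathered from the different sources yields one text block per entry, and `parse_flat_file` recovers the same dictionaries, so entries can be concatenated into a single file and read back without a database server.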
The features of this database can be categorized into three broad areas:
1. Query interface: The query interface lists all the pathogenic bacteria, with the strain information available in the literature, and relates each to the disease it causes in humans.
2. Feature enrichment: This category comprises sequence annotation from well-curated databases, multiple sequence alignment of chorismate synthase across all strains, and 3D structure determination using MODELLER v9.10 with validation by GNR plot.
3. External references/links: This category includes pathogenic organism databases, genome databases, databases of protein-protein interactions, systems biology pathways, DrugBank and structure prediction servers.
Molecular modeling in this work was performed with MODELLER version 9.10. The MODELLER runs were fully automated to calculate comparative models for a large number of protein sequences, using many different template structures and sequence-structure alignments [15–17]. Sequence-structure matches were established with SALIGN [18, 19] by aligning the sequence profile of the target sequence against each of the template sequences extracted from PDB [14] (Figure 2).
Database architecture
CSDB is built on Apache HTTP Server 2.2.11, with MySQL Server 5.1.36 as the back-end and PHP 5.3.0, HTML, JavaScript and CSS as the front-end. Apache, MySQL and PHP were preferred because they are open-source and platform independent. In addition, MySQL is one of the most popular open-source SQL (Structured Query Language) databases on the internet. MySQL (Figure 3) is a fast relational database management system that supports multiple users and multi-threading and runs on both Windows and Linux. It provides triggers, cursors and stored procedures, which improve developer productivity.
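A relational back-end of this kind can be sketched as follows. This is a minimal illustration, not the CSDB schema: the table names, column names and the `find_proteins` helper are hypothetical, and Python's built-in `sqlite3` module stands in for MySQL so that the example runs without a server.

```python
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    """Create two illustrative tables (hypothetical, not the CSDB schema)."""
    conn.executescript(
        """
        CREATE TABLE organism (
            taxonomy_id INTEGER PRIMARY KEY,
            name        TEXT NOT NULL,
            disease     TEXT
        );
        CREATE TABLE protein (
            accession   TEXT PRIMARY KEY,
            taxonomy_id INTEGER NOT NULL REFERENCES organism(taxonomy_id),
            sequence    TEXT NOT NULL
        );
        """
    )


def find_proteins(conn: sqlite3.Connection, organism_fragment: str):
    """Return (accession, organism, disease) rows whose organism name matches."""
    query = """
        SELECT p.accession, o.name, o.disease
        FROM protein AS p
        JOIN organism AS o ON o.taxonomy_id = p.taxonomy_id
        WHERE o.name LIKE ?
        ORDER BY p.accession
    """
    # A bound parameter keeps user input out of the SQL text.
    return conn.execute(query, (f"%{organism_fragment}%",)).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_schema(conn)
    # Placeholder rows only; these are not real CSDB entries.
    conn.execute("INSERT INTO organism VALUES (1, 'Example bacterium', 'example disease')")
    conn.execute("INSERT INTO protein VALUES ('X00001', 1, 'MKV')")
    print(find_proteins(conn, "Example"))
```

Linking each protein entry to its organism by taxonomy ID lets a single join answer the kind of organism-to-disease lookup described for the query interface. The same two functions would apply to MySQL with a MySQL driver in place of `sqlite3`, apart from the parameter placeholder syntax.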