l NAVIGATING DBH2H
===================================================================
On the 'Browse' page, hyperlinked summary statistics lead users to overviews of h2h pairs that share certain features, such as pairs found in a particular species, located on the same chromosome, or annotated with similar functional terms. The distribution of TSS distances is also shown on this page; each bar represents a 100 bp interval of TSS distance and is hyperlinked to the corresponding group of h2h pairs.
To retrieve h2h pairs of interest, fill in a logical combination of search fields on the 'Query' page. H2h pairs can be fetched by keywords describing the pairs or their annotations, such as the names of the genes involved, the range of the TSS distance, transcription factors, genetic disorders, and more. Alternatively, users may retrieve h2h pairs homologous to a given DNA sequence using the BLAST service on the 'Blast' page.

When the database returns a list of matching h2h pairs, users can open a page detailing the annotations of each pair, which reports pair identity, sequence features (including TFBSs, CGIs and SNPs), OMIM associations, functional annotations (including GO and KEGG), and expression correlation information. To provide a global view of the bi-directional promoter, the sequence features are marked proportionally on a linear portrait of the block sequence. All IDs and keywords are hyperlinked to their source databases.
l Notes
===================================================================
l Steps to build up DBH2H:
Transcription start sites (TSSs) of genes were acquired from standard genome annotation projects (supplementary file 1). H2h gene pairs, whose TSSs are separated by less than 1000 bp, were determined for human, mouse, rat, chicken and fugu. The DNA sequence between the 3' endpoints of the two genes of each h2h pair, termed the 'block sequence', was extracted from the reference chromosome sequences (ftp://ftp.ncbi.nih.gov/genomes/).
H2h gene pairs were screened for orthologous pairs whose linkage is conserved in multiple species. The ortholog information was obtained from the OrthoDB database (http://cegg.unige.ch/orthodb) (Kriventseva, et al., 2008).
The h2h blocks were marked with SNP information extracted from dbSNP (http://www.ncbi.nlm.nih.gov/projects/SNP/).
The 'NCBI-strict' CpG islands (CGIs) for human, mouse, rat, and chicken were obtained from NCBI MapView (see http://www.ncbi.nlm.nih.gov/mapview/static/humansearch.html#cpg for further information) and mapped to the h2h blocks.
The h2h genes were also associated with transcription factor binding site (TFBS) information from TRANSFAC (www.gene-regulation.com/pub/databases.html) through the mapping of Entrez Gene ID, official symbols, or gene synonyms.
The h2h genes were annotated with Gene Ontology (GO, http://www.geneontology.org/) terms, KEGG pathways (www.genome.jp/kegg/) and Online Mendelian Inheritance in Man (OMIM, http://www.ncbi.nlm.nih.gov/sites/entrez?db=omim) entries. For GO annotation, each pair was assigned the most specific (lowest) GO term shared by both genes.
Forty-three GDS datasets profiled with the Affymetrix GeneChip Human Genome U133 Plus 2.0 Array and 74 GDS datasets profiled with the Affymetrix GeneChip Mouse Genome 430 2.0 Array were obtained from Gene Expression Omnibus (GEO, www.ncbi.nlm.nih.gov/geo/) and used to study the expression correlation of h2h gene pairs in human and mouse. Both Pearson and Spearman correlation coefficients were calculated for each pair (see Li, et al., 2006 for methodological details).
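The first step in the list above (pairing genes whose TSSs lie less than 1000 bp apart) can be sketched as follows. This is a minimal illustration, not the pipeline used to build DBH2H: the record keys (`name`, `chrom`, `strand`, `tss`) and the gene names are hypothetical, and the sketch assumes that two genes on opposite strands of the same chromosome with nearby TSSs form an h2h pair.

```python
from itertools import combinations

MAX_TSS_DISTANCE = 1000  # bp; the TSSs of a pair must be closer than this


def find_h2h_pairs(genes):
    """Return (positive_gene, negative_gene, tss_distance) tuples.

    `genes` is a list of dicts with the hypothetical keys
    'name', 'chrom', 'strand' ('+' or '-') and 'tss' (integer coordinate).
    """
    pairs = []
    for a, b in combinations(genes, 2):
        # Assumption: an h2h pair lies on one chromosome, on opposite strands.
        if a["chrom"] != b["chrom"] or a["strand"] == b["strand"]:
            continue
        distance = abs(a["tss"] - b["tss"])
        if distance < MAX_TSS_DISTANCE:
            pos, neg = (a, b) if a["strand"] == "+" else (b, a)
            pairs.append((pos["name"], neg["name"], distance))
    return pairs


genes = [
    {"name": "geneA", "chrom": "chr1", "strand": "-", "tss": 10_000},
    {"name": "geneB", "chrom": "chr1", "strand": "+", "tss": 10_450},
    {"name": "geneC", "chrom": "chr1", "strand": "+", "tss": 12_000},
    {"name": "geneD", "chrom": "chr2", "strand": "-", "tss": 10_200},
]
print(find_h2h_pairs(genes))  # [('geneB', 'geneA', 450)]
```

The all-against-all comparison keeps the sketch short; for a whole genome, sorting genes by chromosome and TSS and comparing only neighbours would likely be preferable.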
l Block:
The DNA sequence between the 3' endpoints of the two genes of an h2h pair is termed a 'block'.
l Positive gene / negative gene:
Each h2h pair consists of two divergent genes: the positive gene and the negative gene. The positive gene is the gene on the '+' strand, while the negative one is on the '-' strand.
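Combining the two definitions above, the block interval of a pair can be derived from the gene coordinates. The sketch below is an illustration under stated assumptions: gene records use the hypothetical keys `start`, `end` and `strand`, with `start <= end` in chromosome coordinates, and the two 3' endpoints are taken as inclusive block boundaries (the text does not state whether the endpoints themselves belong to the block).

```python
def three_prime_end(gene):
    """Chromosome coordinate of the gene's 3' endpoint.

    On the '+' strand the 3' end is the highest coordinate of the gene;
    on the '-' strand it is the lowest one.
    """
    return gene["end"] if gene["strand"] == "+" else gene["start"]


def block_interval(pos_gene, neg_gene):
    """Return (block_start, block_end) spanning both 3' endpoints."""
    ends = (three_prime_end(pos_gene), three_prime_end(neg_gene))
    return min(ends), max(ends)


# Hypothetical divergent pair: the negative gene lies to the left.
neg_gene = {"start": 5_000, "end": 10_000, "strand": "-"}
pos_gene = {"start": 10_450, "end": 18_000, "strand": "+"}
print(block_interval(pos_gene, neg_gene))  # (5000, 18000)
```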
l CGI association:
NCBI maintains both 'strict' and 'relaxed' CGIs (see http://www.ncbi.nlm.nih.gov/mapview/static/humansearch.html#cpg for more explanation); the 'strict' set is generally regarded as being of higher quality than the 'relaxed' set. Only the 'strict' CGIs annotated unambiguously on the 'reference' chromosomes were adopted in this work. If either the start site or the stop site of a CGI falls within the h2h block region, the CGI is associated with the h2h gene pair.
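The association rule above amounts to a simple endpoint test. The following sketch assumes inclusive integer coordinates on the same chromosome; the function name and arguments are hypothetical.

```python
def cgi_associated(cgi_start, cgi_end, block_start, block_end):
    """True if the CGI's start site or stop site lies within the block."""
    def inside(position):
        return block_start <= position <= block_end

    return inside(cgi_start) or inside(cgi_end)


block = (5_000, 18_000)
print(cgi_associated(4_800, 5_300, *block))    # True: stop site is inside
print(cgi_associated(17_900, 18_400, *block))  # True: start site is inside
print(cgi_associated(2_000, 3_000, *block))    # False: both sites are outside
```

Read literally, the rule tests only the two endpoints, so a CGI that starts before the block and ends after it would not be associated.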
l Conservation of h2h pairs:
Orthologous groups, which involve at least two of the five species we investigated, were downloaded from the OrthoDB database (http://cegg.unige.ch/orthodb). Each orthologous group comprises a group of proteins identified with Ensembl protein IDs.
Ensembl protein IDs were mapped to Entrez Gene IDs with the Ensembl BioMart tool (http://www.ensembl.org/biomart/martview). [For fugu, the Ensembl protein IDs were converted to Ensembl gene IDs.] Once the protein IDs had been mapped to gene IDs, the h2h gene pairs were mapped to h2h ortholog pairs, each identified by a pair of orthologous group IDs. H2h pairs from different species that map to a common h2h ortholog pair are considered conserved with linkage in evolution.
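The grouping step described above can be sketched as follows, assuming the ID mapping has already been reduced to a lookup from (species, gene ID) to orthologous group ID. All identifiers in the example (`h1`, `m1`, `OG1`, ...) are hypothetical placeholders, not real Entrez or OrthoDB IDs.

```python
from collections import defaultdict


def conserved_pairs(h2h_pairs, gene_to_group):
    """Group h2h pairs by ortholog pair and keep those seen in >= 2 species.

    h2h_pairs:     iterable of (species, gene_a, gene_b)
    gene_to_group: dict mapping (species, gene) -> orthologous group ID
    """
    by_ortholog_pair = defaultdict(list)
    for species, gene_a, gene_b in h2h_pairs:
        group_a = gene_to_group.get((species, gene_a))
        group_b = gene_to_group.get((species, gene_b))
        if group_a is None or group_b is None:
            continue  # at least one gene has no orthologous group
        key = tuple(sorted((group_a, group_b)))  # order-independent key
        by_ortholog_pair[key].append((species, gene_a, gene_b))
    return {
        key: members
        for key, members in by_ortholog_pair.items()
        if len({species for species, _, _ in members}) >= 2
    }


h2h_pairs = [("human", "h1", "h2"), ("mouse", "m1", "m2"), ("human", "h3", "h4")]
gene_to_group = {
    ("human", "h1"): "OG1", ("human", "h2"): "OG2",
    ("mouse", "m1"): "OG2", ("mouse", "m2"): "OG1",
    ("human", "h3"): "OG3", ("human", "h4"): "OG4",
}
print(conserved_pairs(h2h_pairs, gene_to_group))
# {('OG1', 'OG2'): [('human', 'h1', 'h2'), ('mouse', 'm1', 'm2')]}
```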
l GO terms associated with an h2h pair:
The positive gene and the negative gene of a pair were annotated with GO terms, and the annotations were extended to include the ancestor terms of the annotated terms. The most specific (lowest) GO term annotated to both genes was regarded as the 'co-function' of the pair and is shown as the annotation of the pair. The annotation was performed separately in each GO sub-ontology.
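One way to picture the co-function rule is as a search for the deepest shared term in the ontology graph. The sketch below works on a toy ontology whose term names (`root`, `A`, `A1`, ...) are hypothetical, handles a single sub-ontology, and assumes that 'most specific' can be measured by depth (the longest path to the root); the source does not state how specificity was measured.

```python
from functools import lru_cache

# Toy ontology (hypothetical terms): each term maps to its parent terms.
PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "A1": ["A"],
    "A2": ["A"],
}


def with_ancestors(terms):
    """Return the given terms together with all of their ancestors."""
    seen, stack = set(), list(terms)
    while stack:
        term = stack.pop()
        if term not in seen:
            seen.add(term)
            stack.extend(PARENTS[term])
    return seen


@lru_cache(maxsize=None)
def depth(term):
    """Longest path from the term up to the root (root has depth 0)."""
    parents = PARENTS[term]
    return 0 if not parents else 1 + max(depth(p) for p in parents)


def co_function(pos_terms, neg_terms):
    """Deepest term shared by both genes, or None if nothing is shared."""
    shared = with_ancestors(pos_terms) & with_ancestors(neg_terms)
    if not shared:
        return None
    # Sorting first makes ties resolve deterministically (alphabetically).
    return max(sorted(shared), key=depth)


print(co_function({"A1"}, {"A2"}))  # A
print(co_function({"A1"}, {"B"}))   # root
```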
l Expression correlation of h2h pairs:
43 GDS datasets profiled with Affymetrix GeneChip Human Genome U133 Plus 2.0 Array and 74 GDS datasets profiled with Affymetrix GeneChip Mouse Genome 430 2.0 Array, obtained from GEO, were used to study the expression correlation of 874 human h2h gene pairs and 826 mouse h2h gene pairs, respectively. Both the Pearson correlation coefficients and the Spearman correlation coefficients were calculated for h2h gene pairs, and the corresponding statistical significances were estimated. The expression correlation information was visually shown based on these significance values.
Recent studies have pointed out that different probesets interrogating the same gene may actually represent distinct alternatively spliced transcripts, so the routine aggregation step of averaging risks masking real expression correlations between certain transcript pairs. Therefore, the maximum absolute correlation, rather than the averaged one, was associated with each h2h gene pair. The same operation was adopted in (Li, et al., 2006).
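The maximum-absolute-correlation rule can be illustrated as follows. This is a sketch only: it uses a plain Pearson coefficient written out by hand (the Spearman variant and the significance estimate are omitted), and the expression values are made up for the example.

```python
from itertools import product
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles.

    Assumes neither profile is constant (otherwise the denominator is zero).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def max_abs_correlation(probesets_a, probesets_b):
    """Signed correlation with the largest absolute value over all
    probeset combinations of the two genes."""
    correlations = (
        pearson(x, y) for x, y in product(probesets_a, probesets_b)
    )
    return max(correlations, key=abs)


# Made-up profiles: gene A has two probesets, gene B has one.
gene_a = [[1.0, 2.0, 3.0, 4.0], [2.0, 1.0, 2.0, 1.0]]
gene_b = [[4.0, 3.0, 2.0, 1.0]]
print(max_abs_correlation(gene_a, gene_b))  # close to -1
```

Averaging the two probesets of gene A before correlating would weaken the strong negative correlation that the first probeset shows, which is the risk the paragraph above describes.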
l SR:
Significance Ratio: the number of datasets in which the pair was significantly correlated, divided by the total number of datasets with correlation values for the pair.
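As a worked illustration of the definition, the sketch below computes SR from per-dataset p-values. The significance threshold `alpha=0.05` is an assumption made for the example (the source does not state the cutoff), and `None` marks a dataset in which no correlation value was available.

```python
def significance_ratio(p_values, alpha=0.05):
    """Fraction of datasets with a significant correlation.

    p_values: one entry per dataset; None means no correlation value.
    alpha:    assumed significance threshold (not given in the source).
    """
    valid = [p for p in p_values if p is not None]
    if not valid:
        return None  # SR is undefined without any correlation value
    significant = sum(1 for p in valid if p < alpha)
    return significant / len(valid)


# Five datasets, one without a correlation value: 2 significant of 4.
print(significance_ratio([0.001, 0.2, None, 0.03, 0.5]))  # 0.5
```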
l Binding Quality of TFBS:
A value taken from the set {1, 2, 3, 4, 5, 6}. Binding quality levels were assigned in TRANSFAC based on definitiveness of the binding evidence.
l How did we calculate chromosome coordinates of TFBS:
The relative positions of TFBSs given by TRANSFAC were transformed to chromosome coordinates. In the original coordinates, the TSS is at site '1' and the immediately upstream neighboring site is '-1'. A start or end site of '0' means that the exact site is not known. Given these facts, the chromosome coordinates of the start and end of a TFBS are calculated according to the following rule. Note that a default length of 20 was assigned to a TFBS if its end site was unknown, as the median length of the TFBSs in DBH2H was 20.
If both site_start and site_end are non-zero {
    If (Entrez geneID) is PosGene of the PairID {
        if site>0 {Site_chr = TSS_chr + (site-1) };
        if site<0 {Site_chr = TSS_chr + site };
    };
    If (Entrez geneID) is NegGene of the PairID {
        if site>0 {Site_chr = TSS_chr + (-site+1) };
        if site<0 {Site_chr = TSS_chr + (-site) };
    };
} else if site_start is non-zero and site_end is zero {
    site_start is transformed as above;
    site_end is 20mer away from site_start
} else {
    start site and end site of the TFBS are obscure and cannot be transformed to chromosome coordinates.
};
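The rule above can be written as a short runnable sketch. The function and argument names are hypothetical. One point is an interpretation rather than something the rule states: when the end site is unknown, the sketch places the end 19 positions beyond the start in the direction of transcription, so that the site covers 20 bases; '20mer away' could also be read as an offset of 20.

```python
DEFAULT_TFBS_LENGTH = 20  # median TFBS length in DBH2H, per the text above


def site_to_chr(site, tss_chr, strand):
    """Convert one TSS-relative site (never 0) to a chromosome coordinate.

    The TSS is site 1 and its upstream neighbour is site -1, so positive
    sites are shifted by one to remove the missing zero.
    """
    if site == 0:
        raise ValueError("site 0 means the position is unknown")
    offset = site - 1 if site > 0 else site
    # Positive gene: downstream is towards higher coordinates.
    # Negative gene: downstream is towards lower coordinates.
    return tss_chr + offset if strand == "+" else tss_chr - offset


def tfbs_to_chr(site_start, site_end, tss_chr, strand):
    """Return (start_chr, end_chr), or None if the site cannot be placed."""
    if site_start == 0:
        return None  # start unknown: cannot be transformed
    start_chr = site_to_chr(site_start, tss_chr, strand)
    if site_end != 0:
        end_chr = site_to_chr(site_end, tss_chr, strand)
    else:
        # Assumption: extend to the default length along the gene's direction.
        step = 1 if strand == "+" else -1
        end_chr = start_chr + step * (DEFAULT_TFBS_LENGTH - 1)
    return start_chr, end_chr


print(tfbs_to_chr(-50, -40, 1000, "+"))  # (950, 960)
print(tfbs_to_chr(-50, -40, 1000, "-"))  # (1050, 1040)
print(tfbs_to_chr(10, 0, 1000, "+"))     # (1009, 1028)
print(tfbs_to_chr(0, 0, 1000, "+"))      # None
```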
l Annotations of h2h pairs in current DBH2H were taken from the following sources
| Database / Source | Site | Version or date |
|---|---|---|
| Reference genome sequences | | Mar 2008 |
| OrthoDB | | Mar 2008 |
| dbSNP | | Build 128 |
| CGIs in NCBI MapView | the mapview subdirectory of respective genomes at ftp://ftp.ncbi.nih.gov | Mar 2008 |
| TRANSFAC | http://www.biobase-international.com/pages/index.php?id=transfac | Professional 9.4 |
| OMIM | | Mar 2008 |
| GO | | Mar 2008 |
| KEGG | | Mar 2008 |
| GEO | | Mar 2008 |
l Further Help
===================================================================
For scientific questions, contact yyli@scbit.org
For technical problems, contact lifecenter@scbit.org