The definitions below clarify how certain terms are used in the context of the CPI Pilot database. It is useful to start by explaining the purpose and method of a general high-throughput screening (HTS) experiment and the types of HTS experiments the CPI Pilot deals with.
The description of a high-throughput experiment shown below consists of extracts taken from Wikipedia. The full Wikipedia reference can be found here.
Using robotics, data processing and control software, liquid handling devices, and sensitive detectors, High-Throughput Screening or HTS allows a researcher to quickly conduct millions of biochemical, genetic or pharmacological tests. Through this process one can rapidly identify active compounds, antibodies or genes which modulate a particular biomolecular pathway. The results of these experiments provide starting points for drug design and for understanding the interaction or role of a particular biochemical process in biology.
In essence, HTS uses automation to run a screen of an assay against a library of candidate reagents. An assay is a test for specific activity: usually inhibition or stimulation of a biochemical or biological mechanism. The CPI Pilot project looks specifically at cell-based assays. The reagents usually perform some type of RNA interference (siRNA, dsRNA, etc.), effectively knocking down certain genes by targeting the genes’ transcripts.
The key labware or testing vessel of HTS is the microtiter plate: a small container, usually disposable and made of plastic, that features a grid of small, open divots called wells. Modern (circa 2008) microplates for HTS generally have either 384, 1536, or 3456 wells. To prepare for an assay, the researcher fills each well of the plate with some cells and adds the reagent to be tested. After some incubation time has passed to allow the biological matter to absorb, bind to, or otherwise react (or fail to react) with the reagents in the wells, images or movies are taken across all the plate’s wells. Phenotypes are assigned to each well either manually or by sophisticated image analysis routines. This phenotype assignment can be binary or continuous in nature. The automated analysis can be performed relatively quickly, so thousands of experimental phenotype measurements can be generated.
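As an illustration of the grid layout described above, the following minimal sketch converts a well's position into a conventional label such as A01. It assumes the usual 2:3 plate geometry (16 x 24, 32 x 48 and 48 x 72 wells) and row-major numbering of wells; the function names and the labelling scheme are illustrative assumptions and are not part of the CPI Pilot database.

```python
# Assumed plate geometries (rows, columns) for the well counts named in the text.
PLATE_SHAPES = {384: (16, 24), 1536: (32, 48), 3456: (48, 72)}


def row_label(row: int) -> str:
    """Convert a 0-based row index to letters: 0 -> A, 25 -> Z, 26 -> AA."""
    label = ""
    n = row + 1
    while n > 0:
        n, rem = divmod(n - 1, 26)
        label = chr(ord("A") + rem) + label
    return label


def well_name(index: int, n_wells: int) -> str:
    """Return the label of the well at a 0-based, row-major index."""
    if n_wells not in PLATE_SHAPES:
        raise ValueError(f"unsupported plate size: {n_wells}")
    if not 0 <= index < n_wells:
        raise ValueError(f"well index {index} is outside a {n_wells}-well plate")
    _rows, cols = PLATE_SHAPES[n_wells]
    row, col = divmod(index, cols)
    return f"{row_label(row)}{col + 1:02d}"


print(well_name(0, 384))      # first well of a 384-well plate
print(well_name(383, 384))    # last well of a 384-well plate
```

Rows beyond Z continue with two letters (AA, AB, ...), which is one common convention for 1536-well plates; a real screen should follow whatever labelling the plate reader or experimenter uses.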
This Pilot project deals with both the raw data that these high-throughput screens produce and the results generated by the subsequent analysis.
Experiments of this scale are costly to perform, so the distribution of the data they generate is considered important.
The gene is the basic unit of heredity. Each gene is referenced by its HUGO Gene Nomenclature Committee (HGNC) name. At the moment, gene names are added to the database only as they appear in the experiment import data. This may change in future so that the gene list is obtained from existing online databases.
A target is any entity targeted by a reagent. For the data that this database currently holds, targets are either Ensembl transcripts or NCBI RefSeq sequences. These targets are normally mapped to genes, although they need not be.
A reagent is any substance added to the well that has the potential to influence the state of the cells growing in that well. Normally reagents perform some type of RNA interference. Reagents of these types normally act on a specific target, although this doesn’t always have to be the case. Reagents can also be chemical compounds. At the moment the relationship between reagents and targets is recorded from the experiment import data and no automatic remapping of reagents to targets is done. In the case of siRNA/dsRNA, the experimenter has the option to upload the sequence of the reagents. This is recommended because the mapping supplied by the reagent vendor cannot be fully trusted, either initially or as time passes and gene transcript mappings change.
An imageset is the term for the data collected from one well over the course of an experiment. This may be one static image, a time course of images (a movie), one three-dimensional image, or a time course of three-dimensional images (a 3D movie).
Sometimes an imageset is broken down into many images. For example, different fluorescent protein tags may show the locations of different proteins within the cell. In this case a set of images would be the best way of storing this data. An image usually contains a reference to a file on disk. Images and movies aren’t stored in the MySQL database itself. In the case of a single movie obtained from a well, an image would refer to the entire movie file.
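The relationship between an imageset and its images can be sketched as a pair of small record types. This is only an illustration of the definitions above, not the actual database schema: the class names, field names, the `kind` values and the file paths are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List

# Assumed labels for the four kinds of imageset described in the text.
IMAGESET_KINDS = {"image", "movie", "3d_image", "3d_movie"}


@dataclass
class Image:
    """One image (or one whole movie file); holds a reference, not pixel data."""
    path: str           # location of the file on disk (hypothetical path)
    channel: str = ""   # e.g. the fluorescent tag imaged, if any


@dataclass
class ImageSet:
    """All data collected from one well over the course of an experiment."""
    well: str
    kind: str
    images: List[Image] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.kind not in IMAGESET_KINDS:
            raise ValueError(f"unknown imageset kind: {self.kind}")


# A well imaged with two fluorescent tags yields one imageset with two images.
two_channel = ImageSet(
    well="A01",
    kind="image",
    images=[
        Image(path="plate1/A01_tag1.tif", channel="tag1"),
        Image(path="plate1/A01_tag2.tif", channel="tag2"),
    ],
)

# A single movie from a well is one image that refers to the entire movie file.
movie = ImageSet(well="A02", kind="movie", images=[Image(path="plate1/A02.avi")])
```

Keeping only file paths in the records mirrors the point made above that the image and movie files themselves live outside the database.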
A mapping is a per-experiment set of relationships between the reagents used in the experiment and the targets suggested for them by the experimenter. Although none exists currently, an automatic or scripted mapping between reagents and their targets could also be stored. This could keep the database up to date with advances in genome sequence knowledge.
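A minimal sketch of how such a mapping might be represented and queried is given below. It assumes a reagent can have zero, one or several targets and that a target may or may not map to a gene; the identifiers, the dictionary layout and the function name are hypothetical placeholders, not the database's real structure.

```python
from typing import Dict, Optional, Set

# Hypothetical per-experiment mapping: reagent id -> targets suggested by the experimenter.
mapping: Dict[str, Set[str]] = {
    "reagent_1": {"transcript_A", "transcript_B"},
    "reagent_2": {"refseq_C"},
    "compound_1": set(),   # a chemical compound with no recorded target
}

# Hypothetical target -> gene lookup; None marks a target not mapped to any gene.
target_to_gene: Dict[str, Optional[str]] = {
    "transcript_A": "GENE_X",
    "transcript_B": "GENE_X",
    "refseq_C": None,
}


def genes_for_reagent(
    reagent: str,
    mapping: Dict[str, Set[str]],
    target_to_gene: Dict[str, Optional[str]],
) -> Set[str]:
    """Collect the genes reached from a reagent through its suggested targets."""
    genes: Set[str] = set()
    for target in mapping.get(reagent, set()):
        gene = target_to_gene.get(target)
        if gene is not None:
            genes.add(gene)
    return genes


print(genes_for_reagent("reagent_1", mapping, target_to_gene))
```

Because the mapping is kept separate from the reagents and targets themselves, a scripted remapping could in principle be stored as a second dictionary of the same shape alongside the experimenter's original one.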
A phenotype is a very loose term and has no ontology associated with it. Each experimenter defines a list of phenotypes that they are interested in and subsequently tests each individual well in the experiment for association with each phenotype. In essence this relationship is between the reagent used in the well and the phenotype. This association may be binary (hit/no hit) or it may be a continuous real value between 0 and 1. At the moment, the database only stores binary relationships between reagents and phenotypes. The experimenter is required to threshold their continuous real-valued data to obtain binary relationships before submitting. The real-valued data may be uploaded to the database and made available for download, but only in experimenter-defined flat-file formats.
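The thresholding step an experimenter performs before submission could look like the sketch below. The cutoff value, the choice of "greater than or equal to" as the hit rule, and the reagent names are assumptions for illustration; each experimenter decides their own threshold.

```python
from typing import Dict


def threshold_phenotype(scores: Dict[str, float], cutoff: float) -> Dict[str, bool]:
    """Turn per-reagent scores in [0, 1] into binary hit / no-hit calls."""
    hits: Dict[str, bool] = {}
    for reagent, score in scores.items():
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score for {reagent} is outside [0, 1]: {score}")
        # Assumed rule: a score at or above the cutoff counts as a hit.
        hits[reagent] = score >= cutoff
    return hits


# Hypothetical scores for one phenotype across three reagents.
example_scores = {"reagent_1": 0.92, "reagent_2": 0.10, "reagent_3": 0.50}
example_hits = threshold_phenotype(example_scores, cutoff=0.5)
print(example_hits)
```

Only the resulting hit/no-hit values would be stored as reagent-phenotype relationships; the original scores would travel separately in the experimenter's own flat file.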
A vendor sells libraries of reagents. These libraries of reagents are normally designed to target distinct genes.