A new approach to identifying patterns in gene expression data is more effective than the most popular method, according to a joint Penn State and University at Buffalo study reported by US scientists this week.
Using two published gene expression data sets as test cases, the research team found that the KL clustering method, which uses a measure of similarity not previously applied to gene expression analysis, was superior to the most popular method, hierarchical clustering, in separating the data into dense clusters with similar patterns.
In gene expression analysis, the identification of groups of genes with similar temporal patterns of expression is usually a critical step because it provides insights into gene-gene interactions and the underlying biological processes. Experiments suggest that genes with similar function may exhibit similar temporal patterns of co-regulation.
Dr. Raj Acharya, professor and head of the Department of Computer Science and Engineering at Penn State, said that although the study was conducted with gene data, KL clustering could be applied to any large set of temporal data.
Team researcher Jyotsna Kasturi commented: "We wanted gene expression data with similar patterns to be put in the same cluster with as little variation as possible, which implies dense clusters."
The team used the Davies-Bouldin cluster validity index as its primary measure of cluster quality. It also used a statistical measure based on the chi-square test to assess the similarity between the clusters obtained by the different methods.
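For readers unfamiliar with the Davies-Bouldin index, the sketch below shows how it is commonly computed: for each cluster, the worst-case ratio of within-cluster scatter to the distance between cluster centres is taken, and these ratios are averaged, so lower values indicate denser, better-separated clusters. The use of Euclidean distance, the function names and the toy data are assumptions made for illustration and are not taken from the study.

```python
import math

def centroid(points):
    """Mean point of a cluster."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def euclidean(a, b):
    """Euclidean distance (an assumption; the study's distance is not stated here)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def davies_bouldin(clusters):
    """Davies-Bouldin index for a list of clusters; lower is better."""
    centers = [centroid(c) for c in clusters]
    # Scatter: average distance from each member to its cluster centre.
    scatters = [
        sum(euclidean(p, ctr) for p in c) / len(c)
        for c, ctr in zip(clusters, centers)
    ]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        # Worst-case similarity of cluster i to any other cluster.
        total += max(
            (scatters[i] + scatters[j]) / euclidean(centers[i], centers[j])
            for j in range(k)
            if j != i
        )
    return total / k

# Hypothetical toy data: two dense, distant clusters versus two loose, close ones.
tight = [[[0.0, 0.0], [0.0, 1.0]], [[10.0, 0.0], [10.0, 1.0]]]
loose = [[[0.0, 0.0], [0.0, 4.0]], [[3.0, 0.0], [3.0, 4.0]]]
```

On this toy data the dense, well-separated arrangement scores lower than the loose one, which is the behaviour the researchers were looking for when they asked for dense clusters.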
"Even simple visual inspection showed that the KL method created clusters that were better separated than those of the most popular method. The evaluation with quantitative measures confirmed the visual observations," added Acharya.
The KL method uses the KL divergence to measure the similarity between two gene expression profiles and a self-organising map algorithm for clustering. The clustering can be compared to creating a series of bins, each containing a ball of a different colour.
The algorithm sorts data into each bin according to how closely its "colour" matches the ball already in the bin. The result is a set of bins or clusters densely packed with genes that exhibit similar patterns of expression and that appear well separated on visual inspection.
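The bin analogy can be made concrete with a minimal sketch. It assumes each expression profile is first rescaled to sum to one so that it can be treated as a probability distribution, and it assigns each profile to the bin whose reference profile (the "ball") has the smallest KL divergence from it. This is only the assignment step: a real self-organising map would also update the reference profiles during training. The function names, the rescaling step and the data are hypothetical and are not taken from the study.

```python
import math

def normalize(profile, eps=1e-9):
    """Rescale a non-negative profile to sum to 1 (assumed preprocessing)."""
    shifted = [max(x, 0.0) + eps for x in profile]  # eps avoids log(0)
    total = sum(shifted)
    return [x / total for x in shifted]

def kl_divergence(p, q):
    """KL divergence D(p || q) between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def assign_to_bins(profiles, prototypes):
    """For each profile, return the index of the closest prototype by KL divergence."""
    norm_protos = [normalize(proto) for proto in prototypes]
    labels = []
    for profile in profiles:
        p = normalize(profile)
        divergences = [kl_divergence(p, q) for q in norm_protos]
        labels.append(divergences.index(min(divergences)))
    return labels

# Hypothetical time-course profiles: two rising, two falling.
profiles = [[1, 2, 4, 8], [2, 4, 8, 16], [8, 4, 2, 1], [16, 8, 4, 2]]
# Hypothetical "balls": one rising and one falling reference pattern.
prototypes = [[1, 2, 3, 4], [4, 3, 2, 1]]
labels = assign_to_bins(profiles, prototypes)
```

Because the profiles are rescaled before comparison, two genes whose expression rises in the same way land in the same bin even if one is expressed at twice the level of the other.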
In the hierarchical method, the algorithm looks at all the data points and puts the two closest in one bin. It forms more bins by considering the remaining data two points at a time. This approach creates more bins than KL clustering, and several of them contain only a few data points.
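The hierarchical approach can be sketched along the same lines. The version below is standard agglomerative clustering with single linkage: every point starts in its own bin and the two closest bins are merged repeatedly until a target number remains. The linkage rule, the stopping criterion, the function names and the data are assumptions made for illustration; the exact configuration used in the study is not described here.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_linkage(c1, c2):
    """Distance between two clusters: the closest pair of members."""
    return min(euclidean(a, b) for a in c1 for b in c2)

def agglomerate(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain."""
    clusters = [[p] for p in points]  # every point starts in its own bin
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = single_linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters

# Hypothetical one-dimensional data with an isolated point at 9.0.
points = [[0.0], [0.1], [5.0], [5.2], [9.0]]
result = agglomerate(points, 3)
```

In this toy run the isolated point ends up in a bin of its own, a small-scale version of the sparsely populated bins described above.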
Full findings are published in the March issue of the journal Bioinformatics.