Size | Format |
---|---|
7.36 MB | Adobe PDF |
Authors
Advisor(s)
Abstract(s)
Microarray technology allows the simultaneous variation of thousands of genes to be studied at a large scale and has been used intensively in Bioinformatics [80]. As little is known about the function of genes, the approach normally followed is to look at the association of genes with those whose function is already known, i.e., "guilt by association" [2]. Clustering is an exploratory and unsupervised learning technique and is generally a first step in the analysis of gene expression data. In this thesis, a clustering algorithm called Fuzzy C-Means with Focal Point (FCMFP) was implemented. The closer we are to a set of objects, the more groups emerge, and vice versa [9]. FCMFP uses this metaphor to identify the number of clusters, avoiding the need to pre-specify it, unlike FCM. FCMFP incorporates a focal point (the observer's position) and a zoom factor that affects the distance from the focal point to the data. Irrelevant clusters are removed as the zoom factor increases. In this way, a set of valid alternatives for the "correct" number of clusters is provided, based on the distance and direction of the observer. Ways of determining the neighborhood of the focal point were proposed in an attempt to determine, more efficiently, the centroids that should be removed. The sensitivity of several parameters was also studied and discussed.
The clusters were validated using the Xie-Beni index [9] as well as other validity measures. The human breast cancer [28], yeast cell cycle, and Iris [27] datasets were used to gain experience with the algorithm, which was then applied to yeast gene expression data [7]. The results obtained are consistent with those found in the literature and show that the algorithm offers a more intuitive and efficient way of determining the number of clusters.
DNA microarrays, which allow the simultaneous variation of thousands of genes to be studied at a genome-wide scale, have been widely adopted in Bioinformatics [80]. As little is known about the function of most genes, a common approach is to look at their association with genes whose function is already known, i.e. “guilt by association” [2]. Clustering, an exploratory and unsupervised learning technique, is usually a first stage in gene expression data analysis. In this thesis, a soft partition clustering algorithm named Fuzzy C-Means with Focal Point (FCMFP) was implemented. The closer one gets to a set of objects, the more groups emerge, and vice versa [9]. FCMFP uses this metaphor to identify the number of clusters, thereby avoiding the need to pre-specify it, unlike FCM. The metaphor is realized by incorporating into the objective function a focal point (the observer’s position) and a zoom factor that affects the distance of the focal point to the data. Irrelevant clusters are subsumed by the focal point as the zoom factor increases. In this way, a set of valid alternatives for the “correct” number of clusters is provided, based on the distance and direction of the observer. A way of determining the neighborhood of the focal point was proposed in an attempt to determine which centroids get subsumed. The sensitivity of various parameters was also studied and discussed. The clusters were validated using the Xie-Beni index [9] and other cluster validity measures. The Human Breast Cancer [28], yeast Cell Cycle, and Iris [27] datasets were used to validate the accuracy of the algorithm, which was then applied to yeast gene expression data [7]. The results show that the algorithm gives more intuitive and reasonable results in determining the number of clusters.
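For orientation, the baseline that FCMFP extends can be sketched as follows. This is a minimal illustration of standard Fuzzy C-Means and the Xie-Beni validity index, not the thesis implementation: the focal-point and zoom-factor terms are not reproduced, because the abstract does not give their form. The function names (`fcm`, `xie_beni`, `sq_dist`), the default fuzzifier `m = 2`, the stopping tolerance, and the toy data are all assumptions made for this example.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fcm(points, c, m=2.0, iters=100, tol=1e-6, seed=0):
    """Standard Fuzzy C-Means (no focal point). Returns (centroids, memberships)."""
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    # Start from random memberships; each row is normalised to sum to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(c)]
        total = sum(row)
        u.append([v / total for v in row])
    centroids = []
    for _ in range(iters):
        # Centroid update: mean of the points weighted by membership^m.
        centroids = []
        for j in range(c):
            w = [u[i][j] ** m for i in range(n)]
            total = sum(w)
            centroids.append(
                [sum(w[i] * points[i][d] for i in range(n)) / total for d in range(dim)]
            )
        # Membership update: inverse squared distances raised to 1/(m-1).
        shift = 0.0
        for i in range(n):
            d2 = [max(sq_dist(points[i], v), 1e-12) for v in centroids]
            inv = [d ** (-1.0 / (m - 1.0)) for d in d2]
            total = sum(inv)
            new_row = [v / total for v in inv]
            shift = max(shift, max(abs(a - b) for a, b in zip(new_row, u[i])))
            u[i] = new_row
        if shift < tol:  # memberships have stopped changing
            break
    return centroids, u


def xie_beni(points, centroids, u, m=2.0):
    """Xie-Beni index: fuzzy compactness over (n * minimum centroid separation).

    Lower values indicate a better partition. Requires at least two centroids.
    """
    n = len(points)
    compactness = sum(
        u[i][j] ** m * sq_dist(points[i], v)
        for i in range(n)
        for j, v in enumerate(centroids)
    )
    separation = min(
        sq_dist(a, b) for k, a in enumerate(centroids) for b in centroids[k + 1:]
    )
    return compactness / (n * separation)


# Toy data (hypothetical): two small, well-separated groups in the plane.
toy = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6), (6, 5), (6, 6)]
for c in (2, 3):
    cents, memb = fcm(toy, c)
    print(c, xie_beni(toy, cents, memb))
```

In plain FCM the number of clusters `c` has to be fixed in advance, so a validity index such as Xie-Beni is typically computed for several candidate values and compared. The abstract describes FCMFP as avoiding this pre-specification by letting clusters be subsumed as the zoom factor increases.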
Description
Master's dissertation in Informatics Engineering, Faculdade de Ciências e Tecnologia, Universidade do Algarve, 2010
Keywords
Bioinformatics, Genetics