K-Means-Based Pseudo-Labeling Technique in Supervised Learning Models for Regional Classification Based on Types of Non-Communicable Diseases

 


 

 

Non-Communicable Diseases (NCDs) pose a critical threat to global public health, and Indonesia faces significant challenges due to high mortality rates and uneven regional distribution. In Banten Province, limited access to labeled health data hampers effective, data-driven intervention strategies. This study proposes a semi-supervised learning approach to develop a regional classification model for NCDs. The methodology begins with K-Means clustering applied to data from 254 community health centers (Puskesmas) to generate pseudo-labels. Cluster configurations from k=2 to k=8 were evaluated, and the optimal result was two clusters, with a silhouette score of 0.735. These clusters were then used to create a pseudo-labeled dataset for supervised learning. Eight classification algorithms (CN2 Rule Inducer, k-Nearest Neighbor (kNN), Logistic Regression, Naïve Bayes, Neural Network, Random Forest, Support Vector Machine (SVM), and Decision Tree) were trained and compared. Among them, the Neural Network model achieved the highest performance, with an AUC of 0.999 and an MCC of 0.976, indicating excellent stability and predictive accuracy. The findings support the effectiveness of semi-supervised learning for health classification tasks when labeled data are scarce. This approach can serve as a decision-support tool for regional health planning and targeted interventions, improving the precision and efficiency of public health responses.

Method

The methodology begins by identifying the key issues in classifying regions of Banten Province by type of Non-Communicable Disease (NCD). Medical data are then collected from 254 community health centers distributed across eight administrative regions. The collected data first undergo pre-processing to ensure quality and suitability for analysis. This includes min-max normalization of all numerical attributes so that features share a uniform range, a critical requirement for K-Means clustering because the algorithm relies on distance-based similarity measures.

After pre-processing, K-Means clustering, an unsupervised learning method, is applied to group regions according to patterns in the data. K-Means was selected for its efficiency in clustering by attribute similarity, its ease of implementation [44], and its proven effectiveness in health-related research [45], particularly in generating pseudo-labels from unlabelled datasets such as medical imagery [46]-[48]. It also offers strong computational performance and is well suited to medium-sized, numerically scaled datasets such as the one used in this study [49]. The resulting clusters serve as pseudo-labels, that is, target classes, for the subsequent classification model.

Before the supervised learning phase, an additional pre-processing step aligns the dataset format with the newly assigned cluster labels. The classification model is then developed with a supervised learning approach that evaluates eight machine learning algorithms: CN2 Rule Inducer, Random Forest, Neural Network, Naïve Bayes, k-Nearest Neighbor (kNN), Decision Tree, Support Vector Machine (SVM), and Logistic Regression. Each algorithm's performance is assessed to identify the most effective model for classifying regions according to NCD types.

The final stage deploys the best-performing classification model as a practical tool for health mapping and targeted intervention planning in Banten Province. All analyses use Orange Data Mining software and the R programming language as the primary computational tools.
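
The clustering stage described above (min-max scaling, K-Means, and choosing k by silhouette score) can be illustrated with a minimal sketch. The study itself used Orange Data Mining and R, so this standard-library Python version is not the authors' implementation. The two-indicator synthetic dataset, the random seed, and the helper names (min_max_scale, k_means, silhouette_score) are assumptions made purely for illustration.

```python
import math
import random


def min_max_scale(rows):
    """Rescale every column to [0, 1]; constant columns map to 0."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    spans = [(max(c) - min(c)) or 1.0 for c in cols]
    return [[(v - lo) / sp for v, lo, sp in zip(row, lows, spans)] for row in rows]


def k_means(points, k, rng, n_init=10, max_iter=100):
    """Lloyd's algorithm with random restarts; returns the labels of the lowest-inertia run."""
    best_labels, best_inertia = None, math.inf
    for _restart in range(n_init):
        centroids = rng.sample(points, k)  # initialise from k distinct data points
        labels = None
        for _step in range(max_iter):
            new_labels = [
                min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
            ]
            if new_labels == labels:  # assignments stable: converged
                break
            labels = new_labels
            for j in range(k):
                members = [p for p, lab in zip(points, labels) if lab == j]
                if members:  # an empty cluster keeps its previous centroid
                    centroids[j] = [sum(c) / len(members) for c in zip(*members)]
        inertia = sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))
        if inertia < best_inertia:
            best_labels, best_inertia = labels, inertia
    return best_labels


def silhouette_score(points, labels):
    """Mean silhouette coefficient; singleton clusters contribute 0."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        return 0.0  # undefined for a single cluster; treated as 0 here
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)  # mean intra-cluster distance
        b = min(  # mean distance to the nearest other cluster
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        denom = max(a, b)
        total += (b - a) / denom if denom else 0.0
    return total / len(points)


# Hypothetical data: two NCD indicators per health center, NOT the study's data.
rng = random.Random(42)
low = [[rng.gauss(20, 3), rng.gauss(15, 3)] for _ in range(20)]
high = [[rng.gauss(80, 3), rng.gauss(70, 3)] for _ in range(20)]
scaled = min_max_scale(low + high)

scores, labels_by_k = {}, {}
for k in range(2, 9):  # k = 2 to 8, the range evaluated in the study
    labels_by_k[k] = k_means(scaled, k, rng)
    scores[k] = silhouette_score(scaled, labels_by_k[k])

best_k = max(scores, key=scores.get)
pseudo_labels = labels_by_k[best_k]  # cluster ids reused as target classes
print("best k:", best_k, "silhouette:", round(scores[best_k], 3))
```

The pseudo_labels list plays the role of the target column that is attached to the dataset before the supervised phase. Scaling comes first because, without it, the indicator with the largest numeric range would dominate the Euclidean distances that K-Means relies on.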

Discussion 

The findings show that a semi-supervised learning methodology, beginning with K-Means clustering followed by dataset labeling, established a robust foundation for a regional classification model based on Non-Communicable Disease (NCD) case data. Orange Data Mining streamlined the analytical tasks, particularly data exploration, model development, and performance evaluation. The initial clustering yielded two clusters with an optimal silhouette score of 0.735, denoting strong inter-cluster separation. These clusters, Cluster C1 (regions with high disease prevalence) and Cluster C2 (regions with lower disease prevalence), then served as pseudo-labels for training the supervised learning models. Although pseudo-labeling offers a practical solution when ground-truth labels are absent, it also introduces limitations, such as the risk of inaccurate grouping because the labels rest on purely statistical similarity rather than domain-expert validation.

During the supervised learning stage, eight machine learning algorithms were evaluated to determine the most effective classification model. Most of the tested models performed very well, with Area Under the Curve (AUC) values exceeding 0.98, reflecting robust discriminative capability. Among these, the Neural Network and k-Nearest Neighbor (kNN) models stood out, achieving nearly perfect scores in key evaluation metrics such as Classification Accuracy (CA), F1-score, Precision, and Recall. Both models also recorded very high Matthews Correlation Coefficient (MCC) scores, which supports the reliability of their classifications, a point that matters given potential class imbalance.

Nonetheless, high performance on a small dataset can reflect overfitting. To mitigate this, 10-fold cross-validation was used to assess model generalizability. In addition, dropout regularization was employed in training the Neural Network model to prevent co-adaptation of neurons, improving the model's capacity to generalize across varying data instances. These safeguards help ensure that the reported performance metrics are not merely artifacts of memorization or spurious patterns in the training data.
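
The evaluation protocol discussed above, 10-fold cross-validation scored with the Matthews Correlation Coefficient, can be sketched as follows. This is a minimal illustration under stated assumptions and does not reproduce the Orange workflow used in the study: the synthetic two-class data, the choice of a 3-nearest-neighbor classifier, the stratified fold assignment, and the helper names (stratified_folds, knn_predict, mcc) are all hypothetical.

```python
import math
import random


def stratified_folds(labels, n_folds, rng):
    """Deal shuffled indices of each class round-robin into n_folds folds."""
    folds = [[] for _ in range(n_folds)]
    by_class = {}
    for i, lab in enumerate(labels):
        by_class.setdefault(lab, []).append(i)
    pos = 0
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for i in idxs:
            folds[pos % n_folds].append(i)
            pos += 1
    return folds


def knn_predict(train_X, train_y, x, k=3):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    nearest = sorted(zip(train_X, train_y), key=lambda t: math.dist(t[0], x))[:k]
    votes = [lab for _, lab in nearest]
    return max(set(votes), key=votes.count)


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from binary confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical scaled features with pseudo-labels 0 and 1, NOT the study's data.
rng = random.Random(0)
X = [[rng.gauss(0.3, 0.1), rng.gauss(0.3, 0.1)] for _ in range(30)]
X += [[rng.gauss(0.7, 0.1), rng.gauss(0.7, 0.1)] for _ in range(30)]
y = [0] * 30 + [1] * 30

folds = stratified_folds(y, 10, rng)
tp = tn = fp = fn = 0
for test_idx in folds:
    held_out = set(test_idx)
    train_X = [X[i] for i in range(len(X)) if i not in held_out]
    train_y = [y[i] for i in range(len(y)) if i not in held_out]
    for i in test_idx:  # every sample is predicted by a model that never saw it
        pred = knn_predict(train_X, train_y, X[i])
        if pred == 1 and y[i] == 1:
            tp += 1
        elif pred == 0 and y[i] == 0:
            tn += 1
        elif pred == 1 and y[i] == 0:
            fp += 1
        else:
            fn += 1

score = mcc(tp, tn, fp, fn)
print("confusion (tp, tn, fp, fn):", (tp, tn, fp, fn), "MCC:", round(score, 3))
```

Because MCC uses all four confusion-matrix cells, it stays informative when one pseudo-label class is much larger than the other, which is why it complements accuracy in this setting.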




