SUPPORT VECTOR MACHINE FOR MULTICLASS CLASSIFICATION OF REDUNDANT INSTANCES

SUPPORT VECTOR MACHINE HAS BECOME ONE OF THE MOST IMPORTANT CLASSIFICATION TECHNIQUES IN PATTERN RECOGNITION, MACHINE LEARNING, AND DATA MINING.

AN EFFICIENT MACHINE LEARNING PREDICTION METHOD FOR VEHICLE DETECTION: DATA ANALYTICS FRAMEWORK

THE RISE IN POPULATION HAS LED TO A CORRESPONDING INCREASE IN THE NUMBER OF VEHICLES ON THE ROADWAYS.

STREAMLINING STOCK PRICE ANALYSIS: HADOOP ECOSYSTEM FOR MACHINE LEARNING MODELS AND BIG DATA ANALYTICS

INTEGRATING MACHINE LEARNING MODELS WITHIN THIS ECOSYSTEM ALLOWS FOR ADVANCED ANALYTICS AND PREDICTIVE MODELING.

COGNITIVE APPROACH USING SFL THEORY IN CAPTURING TACIT KNOWLEDGE IN BUSINESS INTELLIGENCE

THE COMPLEXITY OF BUSINESS INTELLIGENCE (BI) PROCESSES NEEDS TO BE EXPLORED IN ORDER TO ENSURE THAT THE BI SYSTEM PROPERLY TREATS TACIT KNOWLEDGE AS PART OF THE DATA SOURCE IN THE BI FRAMEWORK.

TACIT KNOWLEDGE FOR BUSINESS INTELLIGENCE FRAMEWORK: A PART OF UNSTRUCTURED DATA?

THE IDEA OF CAPTURING KNOWLEDGE FROM DIFFERENT SOURCES CAN BE VERY BENEFICIAL TO BUSINESS INTELLIGENCE (BI).

A Novel Hybrid AdaBoost–Gradient Boosting Ensemble for Enhanced Short-Term Energy Consumption Forecasting

 


 

This study proposes and evaluates a novel hybrid ensemble model that combines AdaBoost and Gradient Boosting for short-term electricity consumption forecasting. The model is designed to address the challenges posed by nonlinear load fluctuations influenced by meteorological and operational factors, which often lead to reduced forecasting accuracy, grid instability, and inefficient resource utilization. To enhance prediction performance, the dataset undergoes comprehensive preprocessing, including removal of missing target values, median imputation of feature gaps, and standardization for linear and SVR models. An 80/20 train-test split with a fixed random seed ensures reproducibility. Baseline models—Linear Regression, SVR, Random Forest, Gradient Boosting, and AdaBoost—alongside hybrid configurations such as Gradient Boosting + Random Forest and a two-stage voting ensemble, are developed using the scikit-learn framework. The proposed hybrid model integrates AdaBoost and Gradient Boosting within a VotingRegressor architecture, with manually tuned ensemble weights ranging from 0.2 to 0.8 to optimize the R² score. Experimental results indicate that the hybrid AdaBoost + Gradient Boosting model achieves the best overall performance (R² = 0.153, RMSE = 61.888, Accuracy = 77.34%), outperforming all other models. The study’s key contributions include an effective weight-tuning strategy for ensemble learning, empirical validation through quantitative and visual analyses, and practical guidelines for deploying hybrid ensemble models in real-world energy forecasting systems.

 PROPOSED METHOD

This research adopts an experimental quantitative methodology to investigate the effectiveness of a hybrid ensemble model combining AdaBoost and Gradient Boosting for short-term energy consumption forecasting. The methodological workflow comprises four main stages: (1) data preprocessing, (2) dataset partitioning, (3) model development—including both baseline and hybrid models—and (4) model evaluation using standard performance metrics. Each stage is designed to ensure reproducibility, robustness, and fair comparison across models.

 The proposed method introduces a systematically designed hybrid ensemble framework that integrates AdaBoost and Gradient Boosting within a weighted VotingRegressor. The model aims to optimize short-term energy consumption forecasting by capturing both nonlinear interactions and difficult-to-predict fluctuations through adaptive ensemble learning. The approach comprises four main components: dataset partitioning, preprocessing, base model initialization, and hybrid model construction with manual ensemble weight tuning.
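The weighted voting idea can be sketched in plain Python as follows. This is a minimal illustration, not the study's scikit-learn code: it assumes the two ensemble weights form a pair (w, 1 - w), that the candidate grid steps by 0.1 from 0.2 to 0.8, and that tuning is scored on held-out data. The prediction lists, the step size, and the function names are hypothetical.

```python
from math import fsum

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = fsum(y_true) / len(y_true)
    ss_res = fsum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = fsum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def weighted_vote(pred_a, pred_b, w_a):
    """Weighted average of two models' predictions (weights w_a and 1 - w_a)."""
    w_b = 1.0 - w_a
    return [w_a * a + w_b * b for a, b in zip(pred_a, pred_b)]

def tune_weight(y_val, pred_a, pred_b, candidates):
    """Return (best_weight, best_r2) over the candidate weights for model A."""
    scored = [(r2_score(y_val, weighted_vote(pred_a, pred_b, w)), w)
              for w in candidates]
    best_r2, best_w = max(scored)
    return best_w, best_r2

# Hypothetical held-out targets and base-model predictions (illustrative only).
y_val = [10.0, 12.0, 15.0, 11.0, 14.0]
pred_ada = [11.0, 12.5, 13.0, 11.5, 13.0]  # stand-in for AdaBoost output
pred_gb = [9.0, 11.0, 16.0, 10.0, 15.5]    # stand-in for Gradient Boosting output

# Assumed grid: 0.2, 0.3, ..., 0.8 (the source gives only the range).
candidates = [round(0.2 + 0.1 * i, 1) for i in range(7)]
best_w, best_r2 = tune_weight(y_val, pred_ada, pred_gb, candidates)
```

Scoring the candidate weights on data that is also used for the final evaluation would bias the reported R² upward, so a separate validation subset is the safer choice for this search.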


A. Dataset Partitioning

 Let D = {(x_i, y_i) | i = 1, 2, …, n} represent the original dataset, where x_i denotes the feature vector and y_i is the target energy consumption of the i-th sample. The dataset is randomly split into training and testing subsets using an 80:20 ratio. A fixed random_state = 42 is applied to ensure reproducibility across experiments. The training set is used exclusively for model learning, while the test set is reserved for out-of-sample evaluation.
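A seeded 80:20 split can be sketched with the standard library alone. The study itself works in scikit-learn; the helper name `train_test_split_indices` and the toy dataset below are hypothetical and only show why a fixed seed makes the partition repeatable.

```python
import random

def train_test_split_indices(n, test_ratio=0.2, seed=42):
    """Shuffle indices 0..n-1 with a fixed seed and cut off the test share."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # same seed gives the same order
    n_test = round(n * test_ratio)
    return indices[n_test:], indices[:n_test]

# Hypothetical dataset of 10 (x_i, y_i) pairs, for illustration only.
data = [([float(i), float(i) ** 2], 3.0 * i) for i in range(10)]

train_idx, test_idx = train_test_split_indices(len(data))
train = [data[i] for i in train_idx]  # used for model fitting only
test = [data[i] for i in test_idx]    # held out for evaluation
```

Calling the helper again with the same seed returns the same two index lists, which is the property the fixed random_state provides.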


B. Preprocessing Pipeline


Prior to model training, data preprocessing is performed to enhance model robustness and stability:

  1. Target Cleansing: All rows with missing values in the target variable y are removed to eliminate label noise.
  2. Feature Imputation: Missing values in input features are imputed using the median of each respective column. Median imputation is chosen for its resilience against skewness and outliers.
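  3. Feature Standardization: For models sensitive to feature scale, namely Linear Regression and SVR, feature values are standardized using the z-score formula as in (1):

     z = (x − μ) / σ     (1)

     where x is a raw feature value, and μ and σ are the mean and standard deviation of that feature.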
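The three preprocessing steps can be sketched in plain Python as follows. This is an illustrative version under stated assumptions: missing values are represented as None, the population standard deviation is used for the z-score, no feature column is constant, and the sample rows are invented.

```python
from statistics import mean, median, pstdev

def drop_missing_targets(rows, targets):
    """Keep only the rows whose target value is present (not None)."""
    kept = [(r, t) for r, t in zip(rows, targets) if t is not None]
    return [r for r, _ in kept], [t for _, t in kept]

def impute_median(rows):
    """Replace None in each column with the median of that column's observed values."""
    n_cols = len(rows[0])
    medians = [median([r[j] for r in rows if r[j] is not None])
               for j in range(n_cols)]
    return [[medians[j] if r[j] is None else r[j] for j in range(n_cols)]
            for r in rows]

def standardize(rows):
    """Apply z = (x - mean) / std per column (assumes no constant column)."""
    n_cols = len(rows[0])
    mus = [mean([r[j] for r in rows]) for j in range(n_cols)]
    sigmas = [pstdev([r[j] for r in rows]) for j in range(n_cols)]
    return [[(r[j] - mus[j]) / sigmas[j] for j in range(n_cols)] for r in rows]

# Hypothetical raw data: two features, with gaps in both features and target.
raw_rows = [[1.0, 10.0], [None, 20.0], [3.0, None], [5.0, 40.0]]
raw_targets = [100.0, 110.0, None, 130.0]

rows, targets = drop_missing_targets(raw_rows, raw_targets)  # step 1
imputed = impute_median(rows)                                # step 2
scaled = standardize(imputed)                                # step 3
```

In a train/test setting, the medians, means, and standard deviations would normally be computed on the training subset and then reused on the test subset, so that no test information leaks into preprocessing.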

 RESULTS AND DISCUSSION

A. Comparative Model Performance

The results are summarized in Table II, which shows that the hybrid AdaBoost + Gradient Boosting ensemble consistently outperforms all other models, achieving the highest R² score (0.153), the lowest RMSE (61.888), and a competitive accuracy level of 77.34%. This performance suggests that the hybrid approach successfully captures the nonlinear and volatile nature of short-term energy consumption patterns, particularly due to the complementary strengths of AdaBoost’s adaptive weighting and Gradient Boosting’s sequential error correction.

The three top-performing models are all hybrid ensembles, reaffirming the hypothesis that multi-algorithmic integration enhances forecasting capability in nonlinear time series data. In contrast, all linear models (Linear Regression, Lasso, Ridge, ElasticNet) exhibit negative R² scores, reflecting their poor fit to the complex fluctuation patterns inherent in energy consumption data.

B. Model Comparisons

Figure 3 presents a comparative bar chart of the R² scores across all evaluated models. The figure clearly illustrates the performance hierarchy, with the Hybrid AdaBoost + Gradient Boosting ensemble achieving the highest R² value (0.153), thereby outperforming all other models in terms of variance explanation. This is followed closely by the Voting GB + (AdaBoost + GB) ensemble and the GB + RF hybrid, both registering identical R² scores (0.134). The fourth-best performer is the standalone Gradient Boosting Regressor, which, although not hybridized, maintains a competitive R² of 0.083. In stark contrast, all linear models, including Linear Regression, Ridge, Lasso, and ElasticNet, yield negative R² scores, indicating that these models perform worse than a naive mean predictor. The bar chart thereby reinforces the central claim of this study: hybrid ensemble methods significantly improve predictive accuracy and model generalization in short-term energy forecasting tasks, especially in the presence of nonlinear load fluctuations.
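The statement that a negative R² is worse than a naive mean predictor follows directly from the metric's definition. The short sketch below uses invented numbers: predicting the mean everywhere gives R² = 0, and any model with a larger squared error than that falls below zero.

```python
from math import fsum, sqrt

def r2_score(y_true, y_pred):
    """R² = 1 - SS_res / SS_tot, where SS_tot is the error of the mean predictor."""
    mean_y = fsum(y_true) / len(y_true)
    ss_res = fsum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = fsum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    return sqrt(fsum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Hypothetical consumption values and two hypothetical predictors.
y_true = [50.0, 60.0, 70.0, 80.0]
mean_pred = [65.0] * 4               # always predicts the mean of y_true
bad_pred = [80.0, 70.0, 60.0, 50.0]  # trends in the wrong direction

r2_mean = r2_score(y_true, mean_pred)  # baseline: exactly 0
r2_bad = r2_score(y_true, bad_pred)    # below the baseline: negative
```

Here `bad_pred` has four times the squared error of the mean predictor, so its R² is 1 - 4 = -3, and its RMSE is correspondingly twice as large.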






 

 

K-Means-Based Pseudo-Labeling Technique in Supervised Learning Models for Regional Classification Based on Types of Non-Communicable Diseases

 


 

 

Non-Communicable Diseases (NCDs) pose a critical threat to global public health, with Indonesia experiencing significant challenges due to high mortality rates and uneven regional distribution. In Banten Province, limited access to labeled health data hampers effective, data-driven intervention strategies. This study proposes a semi-supervised learning approach to develop a regional classification model for NCDs. The methodology begins with K-Means clustering applied to data from 254 community health centers (Puskesmas) to generate pseudo-labels. Various cluster configurations (k=2 to 8) were evaluated, with the optimal result being two clusters based on a silhouette score of 0.735. These clusters were then used to create a semi-labeled dataset for supervised learning. Eight classification algorithms (CN2 Rule Inducer, k-Nearest Neighbor (kNN), Logistic Regression, Naïve Bayes, Neural Network, Random Forest, Support Vector Machine (SVM), and Decision Tree) were trained and compared. Among them, the Neural Network model achieved the highest performance, with an AUC of 0.999 and an MCC of 0.976, indicating excellent stability and predictive accuracy. The findings validate the effectiveness of semi-supervised learning for health classification tasks when labeled data is scarce. This approach can serve as a valuable decision-support tool for regional health planning and targeted interventions, enhancing the precision and efficiency of public health responses.

 METHOD

 The methodology of this study begins with identifying critical issues related to regional classification based on the types of Non-Communicable Diseases (NCDs) in Banten Province. Subsequently, medical data is collected from 254 community health centers, which are distributed across eight administrative regions. Initially, the collected data undergoes a pre-processing phase aimed at ensuring data quality and suitability for subsequent analysis. This includes normalization of all numerical attributes using min-max scaling to ensure uniform feature ranges, which is a critical requirement for K-Means clustering due to its reliance on distance-based similarity measures.

 Following this preliminary processing, an unsupervised learning method utilizing the K-Means clustering algorithm is applied to categorize regions based on discernible data patterns. K-Means was selected due to its efficiency in clustering based on attribute similarity, ease of implementation [44], and proven effectiveness in health-related research [45], particularly in generating pseudo-labels from unlabelled datasets such as medical imagery [46]-[48]. Moreover, K-Means demonstrates strong computational performance and is well-suited to medium-sized, numerically scaled datasets such as those used in this study [49]. The resulting clusters generated through this method serve as pseudo-labels or target classes for constructing the subsequent classification model.

 Before proceeding to the supervised learning phase, an additional data pre-processing step is performed to align the dataset format with the newly assigned cluster labels. The classification model is then developed using a supervised learning approach, evaluating the performance of eight machine learning algorithms, specifically CN2 Rule Inducer, Random Forest, Neural Network, Naïve Bayes, k-Nearest Neighbor (kNN), Decision Tree, Support Vector Machine (SVM), and Logistic Regression. Each algorithm's performance is rigorously assessed to identify the most effective model for accurately classifying regions according to NCD types.

 The final stage involves deploying the best-performing classification model as a practical tool to facilitate enhanced health mapping and targeted intervention planning within Banten Province. All analytical processes in this research utilize Orange Data Mining software and the R programming language as the primary computational tools.
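The scaling, clustering, and pseudo-labeling steps can be illustrated with a small standard-library Python sketch. The study itself uses Orange Data Mining and R, so this is a swapped-in illustration rather than the authors' pipeline: the case counts, the function names, the plain Lloyd's-algorithm K-Means, and the Euclidean silhouette are all assumptions made for the example.

```python
import random
from math import sqrt

def min_max_scale(rows):
    """Rescale every column to [0, 1] via (x - min) / (max - min)."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    spans = [(max(c) - min(c)) or 1.0 for c in cols]  # guard constant columns
    return [[(v - lo) / sp for v, lo, sp in zip(r, lows, spans)] for r in rows]

def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_means(rows, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm; returns one cluster index per row."""
    centers = random.Random(seed).sample(rows, k)
    labels = [0] * len(rows)
    for _ in range(n_iter):
        labels = [min(range(k), key=lambda c: dist(r, centers[c])) for r in rows]
        for c in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:  # keep the old center if a cluster empties out
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def silhouette(rows, labels):
    """Mean silhouette coefficient; singleton clusters score 0."""
    scores = []
    for i, r in enumerate(rows):
        by_cluster = {}
        for j, other in enumerate(rows):
            if j != i:
                by_cluster.setdefault(labels[j], []).append(dist(r, other))
        own = by_cluster.pop(labels[i], [])
        if not own or not by_cluster:
            scores.append(0.0)
            continue
        a = sum(own) / len(own)                                 # cohesion
        b = min(sum(d) / len(d) for d in by_cluster.values())   # separation
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Hypothetical case counts for six health centers (two disease types).
raw = [[120.0, 300.0], [130.0, 310.0], [125.0, 290.0],
       [20.0, 40.0], [25.0, 35.0], [15.0, 50.0]]

scaled = min_max_scale(raw)
pseudo_labels = k_means(scaled, k=2)     # cluster ids used as class labels
score = silhouette(scaled, pseudo_labels)
```

To mirror the model selection described in the text, the same two calls would be repeated for each k from 2 to 8 on the full dataset, keeping the k with the highest silhouette score. The resulting `pseudo_labels` list then plays the role of the target column for the supervised classifiers.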

Discussion 

The findings of this study clearly illustrate that employing a semi-supervised learning methodology, initiating with K-Means clustering followed by dataset labeling, effectively established a robust foundation for developing a regional classification model based on Non-Communicable Disease (NCD) case data. Utilizing Orange Data Mining significantly streamlined analytical tasks, particularly in data exploration, model development, and performance evaluation phases. The initial clustering yielded two clusters with an optimal silhouette score of 0.735, denoting strong inter-cluster separation. These clusters, specifically Cluster C1 (regions with high disease prevalence) and Cluster C2 (regions with lower disease prevalence), subsequently served as pseudo-labels for training the supervised learning model. Although this pseudo-labeling approach offers a practical solution in the absence of ground-truth labels, it also introduces potential limitations, such as the risk of inaccurate grouping due to reliance on purely statistical similarity rather than domain-expert validation.

During the supervised learning stage, eight distinct machine learning algorithms were evaluated to determine the most effective classification model. The majority of tested models demonstrated excellent performance, as evidenced by Area Under the Curve (AUC) values exceeding 0.98, reflecting robust discriminative capabilities. Among these, the Neural Network and k-Nearest Neighbor (kNN) models stood out prominently, achieving nearly perfect scores in key evaluation metrics such as Classification Accuracy (CA), F1-score, Precision, and Recall. Both models also recorded exceptionally high Matthews Correlation Coefficient (MCC) scores, reinforcing their reliable classification performance, especially significant given potential data imbalances.

Nonetheless, it is important to acknowledge that high performance on a small dataset can be susceptible to overfitting. To mitigate this, 10-fold cross-validation was utilized to validate model generalizability. In addition, dropout regularization was employed in training the Neural Network model to prevent co-adaptation of neurons, thereby enhancing the model’s capacity to generalize across varying data instances. These methodological safeguards were critical in ensuring that the models' performance metrics were not merely artifacts of memorization or spurious patterns in the training data.
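Two of the evaluation tools mentioned above, the Matthews Correlation Coefficient and 10-fold cross-validation, can be sketched in a few lines of Python. The function names are hypothetical, the folds are contiguous and unshuffled for simplicity, and the only figure taken from the study is the dataset size of 254 health centers.

```python
from math import sqrt

def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from binary confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # conventional value when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denom

def k_fold_indices(n, k=10):
    """Yield (train_idx, test_idx) for k contiguous folds of near-equal size."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n))
        yield train_idx, test_idx
        start += size

# 254 samples, as in the study; every sample is tested exactly once.
folds = list(k_fold_indices(254, k=10))
fold_test_sizes = [len(test_idx) for _, test_idx in folds]
```

Because MCC uses all four confusion-matrix cells, it stays informative when one pseudo-label class is much larger than the other, which is why it complements accuracy here. In practice the sample order would usually be shuffled, or stratified by class, before the folds are cut.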




