International Journal of Informatics and Communication Technology (IJ-ICT)
Vol. 14, No. 1, April 2025, pp. 276-286
ISSN: 2252-8776, DOI: 10.11591/ijict.v14i1.pp276-286

Ensemble approach to rumor detection with BERT, GPT, and POS features

Barsha Pattanaik (1), Sourav Mandal (1), Rudra Mohan Tripathy (1), Arif Ahmed Sekh (2)
(1) School of Computer Science and Engineering, XIM University, Bhubaneswar, India
(2) Department of Computer Science, UiT The Arctic University of Norway, Tromsø, Norway

Article Info
Article history: Received Jun 26, 2024; Revised Oct 7, 2024; Accepted Nov 19, 2024
Keywords: BERT; BiLSTM; Ensemble model; GPT; Part of speech; Rumor detection

ABSTRACT
As vast amounts of rumor content are transmitted on social media, it is very challenging to detect them. This study explores an ensemble approach to rumor detection in social media messages, leveraging the strengths of advanced natural language processing (NLP) models. Specifically, we implemented three distinct models: (i) generative pre-trained transformer (GPT) combined with a bidirectional long short-term memory (BiLSTM) network; (ii) a model integrating part-of-speech (POS) tagging with bidirectional encoder representations from transformers (BERT) and BiLSTM; and (iii) a model that merges POS tagging with GPT and BiLSTM. We included additional features from the dataset in all these models. Each model captures different linguistic, syntactical, and contextual features within the text, contributing uniquely to the classification task. To enhance the robustness and accuracy of our predictions, we employed an ensemble method using hard voting. This technique aggregates the predictions from each model, determining the final classification based on the majority vote. Our experimental results demonstrate that the ensemble approach significantly outperforms individual models, achieving superior accuracy in identifying rumors.
To determine the performance of our model, we used the publicly available PHEME and Weibo datasets. Our model achieved 97.6% and 98.4% accuracy, respectively, on these datasets and outperformed the state-of-the-art models.

This is an open access article under the CC BY-SA license.

Corresponding Author:
Barsha Pattanaik
School of Computer Science and Engineering, XIM University
Bhubaneswar-752050, Odisha, India
Email: barsha@xustudent.edu.in

Journal homepage: http://ijict.iaescore.com

1. INTRODUCTION
In this digital age, social media platforms have become ubiquitous, serving as primary channels for information dissemination and communication. While these platforms offer numerous benefits, they also facilitate the rapid spread of misinformation and rumors, which can have significant societal impacts. The proliferation of false information on social media can lead to public panic, misinformation crises, and harm to individuals or groups. Therefore, developing effective techniques for detecting and mitigating the spread of rumors has become a critical area of research. Traditional rumor detection methods often rely on manual verification, which is time-consuming and impractical given the vast volume of data generated on social media. Automated rumor detection systems, leveraging advancements in natural language processing (NLP), machine learning (ML), and deep learning (DL), offer a promising solution to this challenge. Recent developments in DL-based models, such as generative pre-trained transformer (GPT) and bidirectional encoder representations from transformers (BERT), have demonstrated remarkable capabilities in understanding and generating human-like text, making them suitable for tackling complex linguistic tasks.

Previous research focused on the role of textual characteristics in rumor identification but did not specifically consider the effect of additional social media factors, such as the number of retweets, user followers, and user friends. These features offer essential context for comprehending the dissemination and influence of rumors. Previous studies have predominantly concentrated on language-centric models, ignoring the potential of social media information. This study addresses this gap by integrating text embeddings and supplementary social media data into an ensemble model, providing a more comprehensive approach to rumor identification. Furthermore, adding to the novelty, we have incorporated both BERT and GPT embeddings to study semantic understanding and syntactic features along with contextual embeddings.

In this study, we propose an ensemble-based approach for rumor detection, integrating the strengths of multiple NLP models. For this purpose, we extracted four different types of features using NLP tools and libraries. All the features are explained below, along with an overview of our proposed methodology. The extracted features and methodology include:
- BERT embeddings: we use pre-trained BERT embeddings to generate dense vector embeddings that capture the semantic meaning and context of the words in the messages.
- GPT embeddings: like BERT embeddings, these capture the contextual meaning of a text but are unidirectional, considering context from left to right. They help the model understand language features and semantics.
- Part of speech (POS) tag features: these provide syntactic information crucial for understanding sentence structure and identifying patterns in rumors.
  Each word is tagged with its part of speech (e.g., noun or verb), and these tags are converted into one-hot encoded vectors.
- Additional features (AF): these include 'retweet_count', 'user.followers_count', 'user.friends_count', and 'favorite_count', normalized using a standard scaler. They are numerical metrics associated with the social media post and its author (favorites, retweets, follower and friend counts) that quantify the post's engagement and the user's social influence, which can be indicative of the spread and credibility of the message.

Using these features, we developed three distinct models, as discussed in section 3.2. To enhance the overall performance, we employed an ensemble of these three models using hard voting, wherein the majority vote from the individual models determines the final classification. This ensemble approach aims to leverage the diverse strengths of each model, resulting in improved accuracy and reliability in detecting rumors. We used two publicly available rumor datasets, PHEME [1] and Weibo [2], for our experiments in this study. We translated the Weibo dataset into English using Google Translator, as discussed in section 3.1. Therefore, we refer to this dataset as 'WeiboE'.

The following are our significant contributions in this study:
i) We have used multiple baseline architectures and pre-trained models, such as BERT and GPT, to create alternative neural networks for rumor detection.
ii) We have presented an ensemble classifier for rumor detection (the first of its kind), including thorough research and performance analysis, and improved on the baseline.
iii) Our proposed model outperformed all the previous rumor detection systems on both standard datasets, PHEME and Weibo.

Next, we describe related work in section 2, the proposed methodology in section 3, result analysis in section 4, and conclude in section 5.

2. RELATED WORK
In the literature, numerous techniques have been developed to detect fake news [3]–[5]. Recently, the research community has also focused on identifying rumors. Although fake news and rumors are distinct, the methods employed by researchers are quite similar, involving text or document classification using NLP and advanced ML or DL techniques. With the increasing prevalence of rumor content on social media, researchers have devised various DL-based models to tackle this issue. Detailed information on these methods and their performance can be found in several survey papers [6]–[11]. In this section, we summarize some of the notable research that utilized ML or DL-based approaches and demonstrated strong performance on their respective datasets.
2.1. Traditional methods for rumor detection
Early approaches to rumor detection primarily relied on traditional ML techniques combined with handcrafted features. These methods typically involved two steps. First, features were manually extracted from lexical cues, metadata (e.g., user information, message propagation patterns), and temporal patterns. Second, ML algorithms such as support vector machines (SVM), decision trees, and Naive Bayes were used to classify messages as rumors or non-rumors. For instance, Castillo et al. [12] discussed the relevance of information quality, particularly credibility, in the context of Twitter, one of the fastest-growing social media platforms, on which both true information and false rumors are posted. They developed a method that applies SVM with factors such as the number of retweets, URLs in the data, and credibility scores of the posting user to filter out fake news. This approach offered an improvement, but it had downsides: the features had to be extracted tediously by hand and might not always apply to other datasets or scenarios. Their study reported an accuracy of about 86%.

2.2. Use of deep learning models
With the arrival of DL, many researchers developed a new generation of models capable of extracting features from raw text without manual intervention. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM) and bidirectional LSTM (BiLSTM) networks [13], emerged as effective tools for capturing the sequential structure of textual data. Ma et al. [2] proposed a novel model adopting LSTM networks, which incorporated temporal information of tweet sequences to enhance the identification of rumor tweets.
The model's accuracy is 88.1% on the Twitter dataset and 91% on the Weibo dataset. Using both datasets, Ruchansky et al. [14] addressed fake news detection with the CSI (capture, score, integrate) model, based on RNNs and on user and comment features. RNN and convolutional neural network (CNN) [15] components capture both temporal and content features, giving high accuracy, while the incorporation of user behavior significantly enhances the robustness of the model. The CSI model reached approximately 89% accuracy on the Twitter dataset and 95.3% on the Weibo dataset.

2.3. Use of transformer-based models
Later, deeper models such as BERT, GPT, and many others transformed NLP by providing better techniques for encoding contextual information. These models store deep semantic and syntactic features and are generally effective for text classification. Devlin et al. [16] presented BERT, a model pre-trained on a large text corpus and then fine-tuned, which achieved state-of-the-art accuracy on numerous NLP tasks by employing bidirectional context. A considerable advantage of BERT is that it recognizes word context in both directions, which is especially beneficial for recognizing the language used in rumors. Anggrainingsih et al. [17] developed a BERT-based approach for rumor detection that uses sentence embeddings to capture the contextual meaning of a message. GPT-2 was designed by Radford et al. [18]; its excellent capacity for language generation, established by 'guessing' the next word in a given text, results from capturing contextual dependencies. This work is a landmark in this line of research, demonstrating that massive language models can perform a number of tasks without problem-specific training.
Liu et al. [19] used different large language models, such as GPT and BERT, to check whether these models can detect rumors in social media by using both news and comments and their propagation information.

2.4. Use of ensemble methods
Ensemble methods combine multiple models so that the combination improves on any single base model. They improve overall performance by exploiting the strengths of each model. Hard voting [20] takes a majority vote over the predicted categories, while soft voting uses the probabilities that each model assigns to the categories. Both techniques improve classification by overcoming the weaknesses of individual models. Kotteti et al. [21] employed an ensemble of DL-based models using CNNs, RNNs, and LSTMs for processing time-series data to improve the accuracy and robustness of rumor detection. They used features from the time-series data, including tweet content, user metadata, and network propagation patterns. The extracted features were then fused to create a comprehensive representation used for classification. The model gave a micro F1-score of 64.3% on the PHEME datasets, which is 79% more than the baselines. Recently, Yuan et al. [22] developed an ensemble model using two different kinds of features, image and text. For text data, the authors used BERT, BiLSTM, and gated recurrent unit (GRU) [23], and for image data, they used CNN. They then ensembled the different models and used soft voting for the final classification. Nithya et al. [24] developed a hybrid model that leverages the strengths of multiple advanced techniques in NLP and DL to classify rumor texts effectively, combining deep contextual understanding (using BERT), hierarchical feature extraction, feature importance analysis, and sequence modeling (using Bi-LSTM) into a cohesive framework.

Our work focused on developing three different models to capture different aspects of textual data, leveraging both semantic understanding and syntactic features along with contextual embeddings. Finally, an ensemble model is proposed to integrate the predictions from these models, enhancing the robustness and accuracy of the classification. We are the first to propose a model that considers these different features and integrates them into an ensemble model.

3. PROPOSED METHOD
The objective of this research is to detect whether a message or post is a rumor or not, which we treat as a binary classification problem. As a baseline model, we used BERT embeddings with the BiLSTM network [25], [26] developed for sentiment analysis and entity recognition for clinical tests, followed by a dense layer and softmax for classification. The BiLSTM processes the sequence of embeddings in both forward and backward directions, capturing dependencies from the past and future contexts.

3.1. Data collection and pre-processing
For our experiment, we use two publicly available datasets, PHEME [1] and Weibo [2], focusing on messages labeled as either rumors or non-rumors. The dataset was curated to ensure a balanced representation of both classes, allowing for effective training and evaluation of our models. The PHEME dataset (5,802 samples) provides a detailed breakdown of rumor (1,972 instances) and non-rumor (3,830 instances) tweets across five events (OT: Ottawa shooting, GC: Germanwings crash, FE: Ferguson, CH: Charlie Hebdo, SY: Sydney siege). Similarly, the Weibo dataset contains 2,313 rumors and 2,351 non-rumors. The Weibo dataset contains many features, but we only used four of them for our model: 'retweet_count', 'user.followers_count', 'user.friends_count', and 'favorite_count'. We translated the Weibo dataset, originally available in Chinese, to English using the 'GoogleTranslator' of the 'deep-translator' package. We name the translated dataset WeiboE, and a link to it is available on GitHub [27] for future research. Unlike PHEME, Weibo is not segregated into events. Figure 1 shows the class distribution of the datasets as a stacked bar graph, and Table 1 shows sample messages with their AF and labels.

Data pre-processing involved several steps to clean and prepare the text data for model training. First, all text is broken down into individual tokens (words or subwords). Then, common stop-words that do not contribute significantly to the meaning of a sentence are removed. Next, lemmatization reduces each word to its base form. Annotation then uses POS tags to capture the syntactic information. After tagging, we use padding and truncation to ensure uniform input lengths for batch processing.
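The cleaning steps above can be sketched in a few lines of Python, applied to the Ottawa shooting sample tweet from this section. This is a minimal illustration, not the authors' code: the regex tokenizer, the tiny `STOP_WORDS` set, the `<pad>` token, and `max_len=16` are assumptions chosen for the example, and lemmatization and POS tagging are left out to keep the sketch dependency-free.

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real pipeline would use a
# full list such as the one shipped with NLTK.
STOP_WORDS = {"a", "an", "the", "is", "are", "at", "in", "on", "of", "to",
              "and", "no", "other"}


def clean_tokens(text):
    """Lowercase, tokenize, and drop punctuation, hashtag marks and stop-words."""
    # Keeping only runs of letters/digits tokenizes the text and strips
    # punctuation and '#' in one pass.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [tok for tok in tokens if tok not in STOP_WORDS]


def pad_or_truncate(tokens, max_len=16, pad_token="<pad>"):
    """Force every token sequence to exactly max_len items for batching."""
    clipped = tokens[:max_len]
    return clipped + [pad_token] * (max_len - len(clipped))


sample = ("Ottawa police are confirming a shooting at the War Memorial. "
          "Minutes ago. No other info. #cbcOTT #OTTnews")
cleaned = clean_tokens(sample)
batch_ready = pad_or_truncate(cleaned)
print(cleaned)
print(len(batch_ready))
```

On the sample tweet this yields eleven lowercase tokens, from 'ottawa' to 'ottnews', which are then padded to the assumed length of 16.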
For example, for the Ottawa shooting event, consider the sample "Ottawa police are confirming a shooting at the War Memorial. Minutes ago. No other info. #cbcOTT #OTTnews". First, tokenization is done. Subsequently, lowercasing transforms all tokens to lowercase to ensure uniformity. Stopwords like "are" and "at" are eliminated to emphasize more significant keywords. Punctuation and special characters, including hashtag symbols, are removed to sanitize the data. The cleaned tokens obtained are ['ottawa', 'police', 'confirming', 'shooting', 'war', 'memorial', 'minutes', 'ago', 'info', 'cbcott', 'ottnews']. Moreover, POS tagging can be utilized to assign a grammatical class to each token, offering enhanced linguistic context.

3.2. Experimental models
3.2.1. Baseline model: BERT and BiLSTM
We use a simple BERT with BiLSTM network (BERT+BiLSTM) developed by [25], [26] as the base model for our rumor detection task. We used the PHEME dataset of five events containing social media posts as input messages. The model used a pre-trained transformer, BERT, to extract contextual information for word embedding. Next, the embedding vectors from BERT are sequentially fed to the BiLSTM network for learning bi-directional long-term dependencies of the words (vectors) across the input sentences. The concatenated and flattened vector for each sequence is then fed to the dense layer, followed by the softmax layer for classification. We introduce three new variants of this baseline model, each using a different feature combination but sharing a common BiLSTM architecture, which are explained in the sections below.

Figure 1. Rumors and non-rumors data distribution across PHEME and WeiboE datasets

Table 1. Sample data of PHEME and WeiboE datasets showing features with labels (0: non-rumor, 1: rumor)
Dataset  favorite_count  retweet_count  user.followers_count  user.friends_count  Label  Sample text
PHEME    117             198            25565                 1593                0      Being black in this country is dangerous business. #Ferguson
PHEME    96              112            14793                 1052                1      Rest in Peace, Cpl. Nathan Cirillo. Killed today in #OttawaShooting http://t.co/YzLXYX5JJt http://t.co/8F0qAcj9sg
WeiboE   33              1272           977                   499                 1      At 8:26 am on February 26, Li Tianyi was released on bail and is now returning home.
WeiboE   4               1073           32914                 2180                0      We are all grandsons, why are our realms so different?

3.2.2. Model-1 (GPT, AF, and BiLSTM)
In this model, we use two features: GPT embeddings and additional features (AF), as shown in Figure 2. We utilized the pre-trained GPT instead of BERT to generate contextualized embeddings for each word in the input message. GPT's capacity to understand and generate coherent text was leveraged to capture the nuanced context within the data. The embedding vectors from GPT, along with the other features, are then fed into a BiLSTM network (GPT+AF+BiLSTM), and the rest is the same as in the baseline model.

3.2.3. Model-2 (POS, BERT, AF, and BiLSTM)
Model-2 in Figure 3 uses three features: AF, POS features derived from the text, and BERT embeddings (POS+BERT+AF+BiLSTM). POS tagging is a critical linguistic processing step that involves annotating each word in a sentence with its corresponding part of speech, such as noun, verb, or adjective. In the context of rumor detection, POS tag-based features serve several important purposes.
Each token is annotated with its POS tag, providing additional syntactic information. These features can be particularly useful for capturing syntactic and grammatical nuances that purely word-based embedding techniques might miss. This enriched feature set can improve the overall performance of the models in detecting subtle cues indicative of rumors. In this model, the tokens and their POS are input into the BERT model to obtain rich, contextualized embedding vectors. Similar to the BERT+BiLSTM model, these embedding vectors are then processed through a BiLSTM network to capture sequential dependencies.
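As a rough illustration of how the three feature groups of this model can end up in one input vector, the sketch below one-hot encodes POS tags, z-scores the additional features, and concatenates both with a text embedding, following the concatenation described in section 3.2.5. Everything specific here is an assumption made for the example: the five-tag `POS_TAGS` set, the hand-written tags, the four-dimensional placeholder embedding, the flattening of the padded one-hot rows, and the pure-Python scaler standing in for scikit-learn's `StandardScaler`. The AF rows are the two PHEME samples from Table 1.

```python
from statistics import mean, pstdev

# Hypothetical reduced tag set; NLTK's tagger uses the larger Penn Treebank set.
POS_TAGS = ["NOUN", "VERB", "ADJ", "ADV", "OTHER"]


def one_hot_pos(tags, max_len):
    """One-hot encode each tag, pad with all-zero rows, and flatten."""
    rows = []
    for tag in tags[:max_len]:
        idx = POS_TAGS.index(tag if tag in POS_TAGS else "OTHER")
        rows.append([1.0 if i == idx else 0.0 for i in range(len(POS_TAGS))])
    rows += [[0.0] * len(POS_TAGS) for _ in range(max_len - len(rows))]
    return [value for row in rows for value in row]


def standard_scale(rows):
    """Z-score each column: (x - mean) / population std, std 0 treated as 1."""
    columns = list(zip(*rows))
    stats = [(mean(col), pstdev(col) or 1.0) for col in columns]
    return [[(x - m) / s for x, (m, s) in zip(row, stats)] for row in rows]


def fuse(embedding, pos_vector, scaled_af):
    """Concatenate the three feature groups into one flat vector."""
    return list(embedding) + list(pos_vector) + list(scaled_af)


# Column order: favorite_count, retweet_count, followers count, friends count.
af_rows = [[117, 198, 25565, 1593], [96, 112, 14793, 1052]]
scaled = standard_scale(af_rows)

embedding = [0.12, -0.40, 0.33, 0.05]  # placeholder for a BERT or GPT vector
pos_vector = one_hot_pos(["NOUN", "VERB"], max_len=3)
features = fuse(embedding, pos_vector, scaled[0])
print(len(features))
```

With only two rows, every standardized column is +1 for the larger value and -1 for the smaller one, which makes the scaling easy to check by hand.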
Figure 2. Model-1: GPT with additional features and BiLSTM

Figure 3. Model-2 and model-3; BERT and GPT are used interchangeably to make model-2 or model-3

3.2.4. Model-3 (POS, GPT, AF, and BiLSTM)
Like the previous model, this model uses POS and AF features but replaces BERT with GPT embeddings (Figure 3), giving POS+GPT+AF+BiLSTM. POS tagging boosts rumor detection by enhancing syntactic, contextual, and semantic understanding and enriching features for deep learning. The POS-tagged tokens are processed through the GPT model to generate embeddings. The embeddings are input into a BiLSTM network to capture contextual dependencies.

3.2.5. Detailed procedure and descriptions
As described in the previous sections, we use four types of features as input to the BiLSTM. First, to extract the POS features, the tokenized and cleaned text is processed with NLTK's 'pos_tag' function, which assigns a grammatical category such as noun, verb, or adjective to each word. After that, these POS tags are encoded as numerical vectors using one-hot encoding. When combined with models such as BERT and GPT, POS tags allow linguistically grounded characterizations of rumors that capture the linguistic patterns of the phenomenon more accurately. When rumors make statements about a certain topic, they tend to use certain adjectives or adverbs that inflate the fact at hand. The algorithms can learn these patterns more efficiently, and rumor detection accuracy improves, when these POS tags are incorporated. In this step, we independently use either the BERT or the GPT tokenizer from the Hugging Face Transformers [28] library to tokenize each text. Contextual embeddings are then obtained by passing the tokenized text through BERT or GPT.
The additional features (AF) are also floored and ceilinged before being scaled using the 'StandardScaler' in 'scikit-learn'. The BERT or GPT embedding vectors, one-hot encoded POS features, and scaled additional features are combined using 'np.hstack', according to the model's requirements, resulting in a single comprehensive feature vector for each input text. The BiLSTM [29] model takes these combined feature vectors as input, processes them, and feeds them into the fully connected layer. The fully connected layer, followed by the softmax layer, is responsible for mapping the combined BiLSTM output to the class scores.

3.3. Model-4: proposed ensemble method
Figure 4 depicts our proposed ensemble model for rumor detection. We employed a hard voting mechanism [20] to combine the predictions from the three models discussed in the previous subsection. Each model independently classifies a message as a rumor or non-rumor, and the final classification is determined by the majority vote among the three models. This approach leverages the strengths and compensates for the weaknesses of individual models, aiming to enhance overall accuracy. Algorithm 1 describes our ensemble model with hard voting. The detailed procedure of each model is described in section 3.2.5. By using these three models in an ensemble, we want to capitalize on:
- The capability of GPT to effectively capture long-range dependencies and contextual continuity.
- BERT's bidirectional contextual comprehension augmented by POS tagging.
- Additional (supplementary) features (AF) to record behavioral indicators, including tweet and retweet frequencies.
- BiLSTM's sequential modeling, which facilitates capturing both forward and backward text dependencies.

Figure 4. Proposed ensemble model for rumor detection

Algorithm 1. Ensemble by vote
Require: T (text)
Ensure: p_final (final prediction)
procedure ENSEMBLE(T)
    p_1 = Predict_model1(T)
    p_2 = Predict_model2(T)
    p_3 = Predict_model3(T)
    p_final = mode(p_i), i = 1..3
    return p_final
end procedure

3.4. Evaluation metrics
In our studies, we employed two critical assessment criteria to evaluate the model's performance:
- Accuracy: the ratio of correctly predicted labels to the total number of labels, serving as a comprehensive indicator of the model's classification efficacy.
- F1-score (weighted): the weighted F1-score considers both precision and recall, rendering it more appropriate for datasets with imbalanced classes. The weighted average F1-score is especially effective for assessing performance across all classes in our classification task.
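To make the hard-voting rule of Algorithm 1 and the two metrics above concrete, here is a small pure-Python sketch. The label lists are made-up toy predictions, not results from the paper, and the function names are illustrative. The weighted F1 follows the usual definition: a per-class F1 computed from precision and recall, averaged with the class supports as weights.

```python
from collections import Counter


def hard_vote(*model_predictions):
    """Return the majority label per sample (the mode in Algorithm 1)."""
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(*model_predictions)]


def accuracy(y_true, y_pred):
    """Fraction of labels predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with class supports as weights."""
    total = 0.0
    for label in sorted(set(y_true)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        total += f1 * (tp + fn)  # weight by the class support
    return total / len(y_true)


# Toy labels (1 = rumor, 0 = non-rumor); purely hypothetical values.
y_true = [1, 0, 1, 0, 1, 0, 1]
model_1 = [1, 0, 0, 0, 1, 1, 0]
model_2 = [1, 1, 1, 0, 0, 0, 0]
model_3 = [0, 0, 1, 0, 1, 0, 1]

ensemble = hard_vote(model_1, model_2, model_3)
print(ensemble)
print(round(accuracy(y_true, ensemble), 3))
print(round(weighted_f1(y_true, ensemble), 3))
```

With three binary classifiers a strict majority always exists, so no tie-breaking rule is needed; an ensemble with an even number of voters would have to define one.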
3.5. Experimental setup, training and testing
For our experiment, we divided the dataset into training and testing sets using an 80-20 split: 80% of the data is allocated for training the model, while the remaining 20% is reserved for testing its performance. The model was trained with 128 BiLSTM layers and a learning rate of 0.001. We used the 'Adam' optimizer and the 'CrossEntropyLoss' loss function. The model was trained for 300 epochs. We used a standard scaler to normalize the numerical data, referred to as AF. During each epoch, the loss, training accuracy, and F1-score were tracked, and the model was evaluated on the test set to monitor its performance. We also plotted the loss values, training and testing accuracy, and F1-scores over the epochs.

4. RESULTS AND ANALYSIS
This study explored the efficacy of integrating pre-trained language models (BERT and GPT) with POS tagging and other additional features in identifying rumors on social media. Previous studies have examined individual models such as BERT for text classification problems. However, they have not specifically considered GPT embeddings or the benefits of incorporating supplementary features (e.g., 'retweet_count', 'user.followers_count', 'user.friends_count', and 'favorite_count') with ensemble learning to enhance the performance of the rumor detection model. In this section, we discuss the performance of the different models along with the proposed ensemble model (model-4) in terms of accuracy and F1-score on the PHEME and translated Weibo datasets. The results indicated that the ensemble model consistently outperformed the individual models across all metrics. The hard voting mechanism effectively combined the strengths of the different models, leading to a more accurate and reliable rumor detection system.
However, for Weibo, a direct comparison with other systems is not fully justified, as we used a dataset translated into English. Table 2 compares all the models with our ensemble model and shows that the ensemble model gave an average accuracy of 97.6% on the PHEME datasets and 98.4% on the WeiboE dataset. An asterisk (*) marks the best value among the individual models (model-1 to model-3) for each dataset. Among the three models, the POS+GPT+AF+BiLSTM model (model-3) gives the best results for four events of the PHEME datasets, whereas for one event (Ferguson), the GPT+AF+BiLSTM model (model-1) gives the best results. Model-3 also gives the best result on the WeiboE dataset. The integration of additional characteristics and POS tags enhanced generalization in the task. The proposed ensemble model (model-4), combining GPT+AF+BiLSTM (model-1), POS+BERT+AF+BiLSTM (model-2), and POS+GPT+AF+BiLSTM (model-3), excels in rumor detection by leveraging semantic and syntactic features, achieving superior accuracy and robustness compared to individual models.

Table 2. Comparison between baseline models with other variants. OT: Ottawa shooting, GC: Germanwings crash, FE: Ferguson, CH: Charlie Hebdo, SY: Sydney siege (OT to SY are PHEME events)
Models              Metric  OT      GC      FE      CH      SY      WeiboE
Baseline            Acc     84.8%   86.2%   86.2%   88.5%   85.7%   92.8%
                    F1      84.8%   86.2%   86.7%   88.4%   85.7%   92.8%
Model-1             Acc     89.3%*  85.1%   86.9%*  92.5%   84.9%   93.9%
                    F1      89.3%*  85.1%   86.9%*  92.4%*  84.9%   93.8%
Model-2             Acc     83.7%   81.9%   82.9%   89.2%   84.9%   93.2%
                    F1      83.2%   81.9%   83.3%   89.2%   84.9%   93.2%
Model-3             Acc     89.3%*  87.2%*  86.4%   92.6%*  86.2%*  94.3%*
                    F1      89.3%*  87.2%*  86.6%   92.4%*  86.1%*  93.9%*
Model-4 (proposed)  Acc     97.5%   97.8%   97.3%   98.4%   97.1%   98.4%
                    F1      97.4%   97.7%   97.2%   98.3%   97.1%   98.3%

Tables 3 and 4 show the comparison with similar models in terms of accuracy and F1-score on the PHEME and WeiboE datasets. Our ensemble model-4 outperformed similar systems by a large margin. Integrating diverse neural architectures mitigates individual weaknesses and enhances generalization across varied datasets. However, this approach entails significant computational overhead and complexity, raising challenges in real-time applications and model maintenance. Over-fitting risks and dependency on high-quality training data are notable concerns, along with difficulties in interpretability and debugging. Despite these challenges, the model's high performance on benchmarks highlights its potential, necessitating further optimization and validation for practical deployment. Our findings corroborate prior research indicating the efficacy of GPT and BERT models in processing textual data; moreover, our ensemble model revealed that integrating additional metadata and applying hard voting can significantly improve classification performance.

Table 3. Comparison between different models on PHEME datasets
Model                     Accuracy  F1-score
gDART [30]                94.8%     89.7%
RDLNP [31]                88.6%     88.6%
CNN-IG-ACO NB [32]        73.28%    73.2%
BiLSTM-CNN [33]           86.1%     86.1%
BERT+BiLSTM [25], [26]    86.2%     86.1%
Model-4 (proposed)        97.6%     97.5%

Table 4. Comparison between different models on Weibo datasets
Model                     Accuracy  F1-score
VAE-GCN [34]              94.1%     94.0%
PostCom2DR [35]           95.0%     95.0%
Bi-GCN [36]               96.0%     96.0%
DDGCN [37]                94.8%     95.2%
Model-4 (proposed)        98.4%     98.3%
Note: We have used WeiboE (translated into English)

In this work, we assessed the efficacy of three distinct models (model-1, model-2, and model-3), each employing varied feature sets and topologies. Model-1 combined GPT embeddings, AF, and BiLSTM; model-2 employed POS tagging, BERT embeddings, AF, and BiLSTM; model-3 incorporated POS tagging, GPT embeddings, AF, and BiLSTM. According to Table 2, their performance fluctuated, with accuracy between 81.9% and 92.6% and F1-scores from 81.9% to 92.4% across the PHEME events, and accuracy from 93.2% to 94.3% with F1-scores from 93.2% to 93.9% on the WeiboE dataset. The proposed ensemble model, which consolidates predictions from these models through hard voting, attained an overall accuracy of 97.6% and an F1-score of 97.5% for the PHEME dataset and an accuracy of 98.4% and an F1-score of 98.3% for the WeiboE dataset, a substantial improvement obtained by utilizing the strengths of each model and improving overall classification robustness.
The proposed ensemble model inte grating GPT , BER T , POS tagging, and supplementary characteristics e xhibited enhanced performance compared to indi vidual models. The results strongly indicate that inte grating di v erse informat ion types enhances rumor identication, with potential applications in real-time monitoring systems. Although the ensemble model sho wed strong performance, this w ork concentrated mostly on a dataset of tweets, perhaps constraining the applicability of the ndings to other types of te xtual data. The computational cost ass ociated with training e xtensi v e models such as GPT and BER T may pro vide a constraint for real-time applications. 5. CONCLUSION Our w ork adv ance s rumor detection research by inte grating POS tagging, additional features, and BER T or GPT -based embeddings with BiLSTM netw orks using the standard PHEME and W iebo datasets. W e de v eloped three predicti v e models by combining these features and ultimately proposed an ensemble method. This approach aims to le v erage the strengths of indi vidual models into an ensemble, resulting in a rob ust and accurate rumor detection system. Our method addresses the limitations of traditional techniques and standalone deep learning models, of fering a comprehensi v e solution. Through v arious e xperimental studies, indeed, it is clear that our ensemble method ranks better than other methods in terms of the accurac y of rumor detection. Through the use of multiple models that encode dif ferent aspects of the language and conte xt, the possibility of certain methods’ deciencies reecting on the nal output is signicantly reduced. More w ork is planned to be done in the future including the e xamination of other types of ensemble algorithms, e.g. soft v oting or stacking, as well as including ne w features such as temporal data or metadata about the users in order to impro v e detection rates. Int J Inf & Commun T echnol, V ol. 14, No. 
1, April 2025: 276–286 Evaluation Warning : The document was created with Spire.PDF for Python.
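The hard-voting step used by the ensemble, and the soft-voting alternative named as future work, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `hard_vote` and `soft_vote`, the label convention (1 = rumor, 0 = non-rumor), the tie-breaking rule, the 0.5 threshold, and the toy predictions are all hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def hard_vote(predictions: Sequence[Sequence[int]]) -> List[int]:
    """Majority vote; predictions[m][i] is model m's label for sample i."""
    if not predictions:
        raise ValueError("at least one model is required")
    n_samples = len(predictions[0])
    if any(len(p) != n_samples for p in predictions):
        raise ValueError("all models must label the same samples")
    voted = []
    for votes in zip(*predictions):
        counts = Counter(votes)
        top = max(counts.values())
        # Assumption: ties (only possible with an even number of models or
        # more than two classes) are resolved in favor of the smallest label.
        voted.append(min(label for label, c in counts.items() if c == top))
    return voted


def soft_vote(probabilities: Sequence[Sequence[float]],
              threshold: float = 0.5) -> List[int]:
    """Average each model's rumor probability per sample, then threshold."""
    if not probabilities:
        raise ValueError("at least one model is required")
    n_samples = len(probabilities[0])
    if any(len(p) != n_samples for p in probabilities):
        raise ValueError("all models must score the same samples")
    return [int(sum(p) / len(p) >= threshold) for p in zip(*probabilities)]


# Hypothetical labels from three models (1 = rumor, 0 = non-rumor).
model_1 = [1, 0, 1, 1, 0]
model_2 = [1, 1, 0, 1, 0]
model_3 = [0, 0, 1, 1, 1]
print(hard_vote([model_1, model_2, model_3]))  # [1, 0, 1, 1, 0]

# Hypothetical rumor probabilities for two samples.
probs_1 = [0.90, 0.55]
probs_2 = [0.80, 0.55]
probs_3 = [0.10, 0.10]
print(soft_vote([probs_1, probs_2, probs_3]))  # [1, 0]
```

In the second soft-voting sample, two of the three models would individually predict a rumor (0.55 is above the threshold), so hard voting would output 1, whereas the averaged probability of 0.4 yields 0. This difference is what a comparison of ensemble strategies would need to measure.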
REFERENCES
[1] A. Zubiaga, M. Liakata, R. Procter, G. W. S. Hoi, and P. Tolmie, “Analysing how people orient to and spread rumours in social media by looking at conversational threads,” PLoS ONE, vol. 11, no. 3, p. e0150989, Nov. 2016, doi: 10.1371/journal.pone.0150989.
[2] J. Ma et al., “Detecting rumors from microblogs with recurrent neural networks,” in IJCAI International Joint Conference on Artificial Intelligence, 2016, pp. 3818–3824.
[3] M. Celliers and M. Hattingh, “A systematic review on fake news themes reported in literature,” in Responsible Design, Implementation and Use of Information and Communication Technology: 19th IFIP WG 6.11 Conference on e-Business, e-Services, and e-Society, I3E 2020, 2020, vol. 12067 LNCS, pp. 223–234, doi: 10.1007/978-3-030-45002-1_19.
[4] X. Zhou and R. Zafarani, “A survey of fake news: fundamental theories, detection methods, and opportunities,” ACM Computing Surveys, vol. 53, no. 5, pp. 1–40, Sep. 2021, doi: 10.1145/3395046.
[5] D. de Beer and M. Matthee, “Approaches to identify fake news: a systematic literature review,” Integrated science in digital age 2020, vol. 136, pp. 13–22, 2021, doi: 10.1007/978-3-030-49264-9_2.
[6] A. Zubiaga, A. Aker, K. Bontcheva, M. Liakata, and R. Procter, “Detection and resolution of rumours in social media,” ACM Computing Surveys, vol. 51, no. 2, pp. 1–36, Mar. 2018, doi: 10.1145/3161603.
[7] M. Al-Sarem, W. Boulila, M. Al-Harby, J. Qadir, and A. Alsaeedi, “Deep learning-based rumor detection on microblogging platforms: a systematic review,” IEEE Access, vol. 7, pp. 152788–152812, 2019, doi: 10.1109/ACCESS.2019.2947855.
[8] A. Bondielli and F. Marcelloni, “A survey on fake news and rumour detection techniques,” Information Sciences, vol. 497, pp. 38–55, Sep. 2019, doi: 10.1016/j.ins.2019.05.035.
[9] D. Varshney and D. K. Vishwakarma, “A review on rumour prediction and veracity assessment in online social network,” Expert Systems with Applications, vol. 168, p. 114208, Apr. 2021, doi: 10.1016/j.eswa.2020.114208.
[10] L. Tan, G. Wang, F. Jia, and X. Lian, “Research status of deep learning methods for rumor detection,” Multimedia Tools and Applications, vol. 82, no. 2, pp. 2941–2982, Jan. 2023, doi: 10.1007/s11042-022-12800-8.
[11] B. Pattanaik, S. Mandal, and R. M. Tripathy, “A survey on rumor detection and prevention in social media using deep learning,” Knowledge and Information Systems, vol. 65, no. 10, pp. 3839–3880, Oct. 2023, doi: 10.1007/s10115-023-01902-w.
[12] C. Castillo, M. Mendoza, and B. Poblete, “Information credibility on twitter,” in Proceedings of the 20th international conference on World Wide Web, Mar. 2011, pp. 675–684, doi: 10.1145/1963405.1963500.
[13] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, Nov. 1997, doi: 10.1162/neco.1997.9.8.1735.
[14] N. Ruchansky, S. Seo, and Y. Liu, “CSI: a hybrid deep model for fake news,” in Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, 2017, pp. 797–806.
[15] Y. LeCun, P. Haffner, L. Bottou, and Y. Bengio, “Object recognition with gradient-based learning,” in Shape, Contour and Grouping in Computer Vision, Springer, 1999.
[16] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
[17] R. Anggrainingsih, G. M. Hassan, and A. Datta, “BERT based classification system for detecting rumours on Twitter,” arXiv preprint arXiv:2109.02975, 2021, [Online]. Available: http://arxiv.org/abs/2109.02975.
[18] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, “Language models are unsupervised multitask learners,” OpenAI Blog, vol. 1, no. 8, p. 9, 2018.
[19] Q. Liu, X. Tao, J. Wu, S. Wu, and L. Wang, “Can large language models detect rumors on social media?,” arXiv preprint arXiv:2402.03916, 2024, [Online]. Available: http://arxiv.org/abs/2402.03916.
[20] A. Chakraborty, S. Joardar, and A. A. Sekh, “Ensemble classifier for Hindi hostile content detection,” ACM Transactions on Asian and Low-Resource Language Information Processing, vol. 23, no. 1, pp. 1–17, 2024, doi: 10.1145/3591353.
[21] C. M. M. Kotteti, X. Dong, and L. Qian, “Ensemble deep learning on time-series representation of tweets for rumor detection in social media,” Applied Sciences (Switzerland), vol. 10, no. 21, pp. 1–21, 2020, doi: 10.3390/app10217541.
[22] L. Yuan, J. Wang, and X. Zhang, “YNU-HPCC at SemEval-2020 Task 8: using a parallel-channel model for memotion analysis,” in 14th International Workshops on Semantic Evaluation, SemEval 2020 - co-located 28th International Conference on Computational Linguistics, COLING 2020, Proceedings, 2020, pp. 916–921, doi: 10.18653/v1/2020.semeval-1.116.
[23] K. Cho et al., “Learning phrase representations using RNN encoder-decoder for statistical machine translation,” in EMNLP 2014 - 2014 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference, 2014, pp. 1724–1734, doi: 10.3115/v1/d14-1179.
[24] K. Nithya, M. Krishnamoorthi, S. V. Easwaramoorthy, C. R. Dhivyaa, S. Yoo, and J. Cho, “Hybrid approach of deep feature extraction using BERT–OPCNN & FIAC with customized Bi-LSTM for rumor text classification,” Alexandria Engineering Journal, vol. 90, pp. 65–75, 2024, doi: 10.1016/j.aej.2024.01.056.
[25] R. Cai et al., “Sentiment analysis about investors and consumers in energy market based on BERT-BILSTM,” IEEE Access, vol. 8, pp. 171408–171415, 2020, doi: 10.1109/ACCESS.2020.3024750.
[26] Z. Zhu and L. Wang, “BERT-BiLSTM model for entity recognition in clinical text,” Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2022) co-located with the Conference of the Spanish Society for Natural Language Processing (SEPLN 2022), vol. 3202, 2022.
[27] B. Pattanaik, “WeiboE,” GitHub, 2024. https://github.com/barshapattanaik/WeiboE (accessed Jun. 27, 2024).
[28] “Gpt-2 documentation,” Hugging Face. https://huggingface.co/docs/transformers/main/en/model_doc/gpt2 (accessed Nov. 29, 2024).
[29] M. M. Rahman, Y. Watanobe, and K. Nakamura, “A bidirectional LSTM language model for code evaluation and repair,” Symmetry, vol. 13, no. 2, p. 247, 2021, doi: 10.3390/sym13020247.
[30] S. Roy, M. Bhanu, S. Saxena, S. Dandapat, and J. Chandra, “gDART: improving rumor verification in social media with discrete attention representations,” Information Processing & Management, vol. 59, no. 3, p. 102927, May 2022, doi: 10.1016/j.ipm.2022.102927.
[31] A. Lao, C. Shi, and Y. Yang, “Rumor detection with field of linear and non-linear propagation,” in Proceedings of the Web Conference 2021, Apr. 2021, pp. 3178–3187, doi: 10.1145/3442381.3450016.