Oversampling vs. undersampling in TF-IDF variations for imbalanced Indonesian short texts classification

I Nyoman Prayana Trisna, Udayana University

Ni Wayan Emmy Rosiana Dewi, Udayana University

Muhammad Alam Pasirulloh, Udayana University

Telecommunication Computing Electronics and Control

Abstract

Although it is considered a traditional method compared with more modern algorithms, term frequency-inverse document frequency (TF-IDF) still produces good results in a range of text mining tasks. This study assesses the effectiveness of several TF-IDF modifications for short text classification and also addresses the problem of imbalanced datasets. To correct the class imbalance, we combine standard, log-scaled, and boolean TF-IDF with undersampling and oversampling methods in short text classification. Each experiment is evaluated using precision, recall, and F-measure. The best result is obtained with boolean TF-IDF combined with oversampling. Oversampling methods outperform undersampling methods in every experiment, although in some cases the undersampling results are still worth considering. Additionally, our study shows that modified TF-IDF, such as the boolean or log-scaled versions, benefits classification performance more than relying solely on standard TF-IDF, particularly when handling imbalanced datasets.
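To make the three weighting schemes and the two resampling directions concrete, the following is a minimal sketch rather than the authors' implementation. The abstract does not give formulas, so the term-frequency definitions (raw count, `1 + log(count)`, and presence/absence), the `log(N / df)` inverse document frequency, and the purely random duplication or removal of samples are all assumptions, as are the function names `tf_weight`, `tfidf`, and `resample` and the toy Indonesian sentences.

```python
import math
import random
from collections import Counter


def tf_weight(count, variant):
    """Term-frequency weight for one term (assumed textbook definitions)."""
    if count == 0:
        return 0.0
    if variant == "standard":
        return float(count)           # raw count
    if variant == "log":
        return 1.0 + math.log(count)  # log-scaled count
    if variant == "boolean":
        return 1.0                    # presence only
    raise ValueError(f"unknown variant: {variant}")


def tfidf(docs, variant="standard"):
    """Return (vocab, vectors) for whitespace-tokenized documents."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {tok: math.log(n_docs / df[tok]) for tok in vocab}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([tf_weight(counts[tok], variant) * idf[tok] for tok in vocab])
    return vocab, vectors


def resample(samples, labels, mode, seed=0):
    """Balance classes by random duplication ("over") or random removal ("under")."""
    if mode not in ("over", "under"):
        raise ValueError(f"unknown mode: {mode}")
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    sizes = [len(xs) for xs in by_class.values()]
    target = max(sizes) if mode == "over" else min(sizes)
    out_x, out_y = [], []
    for y, xs in by_class.items():
        if len(xs) >= target:
            chosen = rng.sample(xs, target)  # keep `target` samples without repeats
        else:
            extra = [rng.choice(xs) for _ in range(target - len(xs))]
            chosen = xs + extra              # pad the minority class with duplicates
        out_x.extend(chosen)
        out_y.extend([y] * target)
    return out_x, out_y


# Hypothetical toy data, not taken from the paper's dataset.
docs = ["barang bagus bagus", "barang rusak", "pengiriman cepat"]
labels = ["pos", "neg", "pos"]
vocab, vectors = tfidf(docs, variant="boolean")
balanced_x, balanced_y = resample(vectors, labels, mode="over")
```

In this sketch the three variants differ only in how a repeated term is counted, which matters little for short texts where most terms occur once. Resampling would normally be applied to the training split only, so that duplicated samples do not leak into evaluation.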

DOI

10.12928/telkomnika.v23i2.26510

ISSN Information

1693-6930

Pages

382-392

More Information

Volume 23

Issue 2

Published 2025-04-01
