A comparative analysis of PoS tagging tools for Hindi and Marathi

International Journal of Informatics and Communication Technology

Abstract

Many tools exist for part-of-speech (PoS) tagging in Hindi and Marathi. Still, no standard benchmark or performance evaluation data exists for these tools to help researchers choose the one that best fits their needs. This paper presents a performance comparison of different PoS taggers and widely available trained models for these two languages. We used data sets of different granularity to compare the performance and precision of these tools against the Stanford PoS tagger. Because the tag sets used by these PoS taggers differ, we propose a mapping between the different PoS tag sets to address this inherent challenge in tagger comparison. We tested the proposed PoS tag mappings on newly created Hindi and Marathi movie script and subtitle data sets, since movie scripts differ from other text in how they are formatted and structured. We survey and compare five PoS taggers: the IMLT Hindi rule-based PoS tagger, the LTRC IIIT Hindi PoS tagger, the CDAC Hindi PoS tagger, the LTRC Marathi PoS tagger, and the CDAC Marathi PoS tagger. The study also helps evaluate how the Bureau of Indian Standards (BIS) tag set for Indian languages compares to the Universal Dependencies (UD) PoS tag set, an aspect that no previous study has evaluated.
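To make the idea of a tag set mapping concrete, the following is a minimal sketch of how the output of a BIS-style tagger might be projected onto UD tags before being scored against a UD-tagged reference. The mapping table is an illustrative subset chosen for this example and is not the mapping proposed in the paper; the helper names `to_ud` and `accuracy`, the fallback tag `X` for unmapped labels, and the sample tag sequences are likewise assumptions.

```python
from typing import Dict, List

# Illustrative subset of a BIS -> UD mapping.
# ASSUMPTION: this is NOT the paper's proposed mapping, only an example.
BIS_TO_UD: Dict[str, str] = {
    "N_NN": "NOUN",      # common noun
    "N_NNP": "PROPN",    # proper noun
    "PR_PRP": "PRON",    # personal pronoun
    "V_VM": "VERB",      # main verb
    "V_VAUX": "AUX",     # auxiliary verb
    "JJ": "ADJ",         # adjective
    "RB": "ADV",         # adverb
    "PSP": "ADP",        # postposition
    "CC_CCD": "CCONJ",   # coordinating conjunction
    "CC_CCS": "SCONJ",   # subordinating conjunction
    "RD_PUNC": "PUNCT",  # punctuation
}


def to_ud(tags: List[str], mapping: Dict[str, str], unknown: str = "X") -> List[str]:
    """Project source tags onto UD tags; unmapped tags fall back to `unknown`."""
    return [mapping.get(tag, unknown) for tag in tags]


def accuracy(predicted: List[str], gold: List[str]) -> float:
    """Token-level accuracy between two aligned tag sequences."""
    if len(predicted) != len(gold):
        raise ValueError("tag sequences must be aligned token by token")
    if not gold:
        return 0.0
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


# Hypothetical tagger output (BIS tags) and reference annotation (UD tags)
# for the same six-token sentence.
bis_output = ["N_NNP", "N_NN", "PSP", "V_VM", "V_VAUX", "RD_PUNC"]
ud_gold = ["PROPN", "NOUN", "ADP", "VERB", "AUX", "PUNCT"]

mapped = to_ud(bis_output, BIS_TO_UD)
score = accuracy(mapped, ud_gold)
```

Projecting the finer-grained tag set onto the coarser one, rather than the reverse, avoids having to guess distinctions that the coarser set does not encode; whether the paper maps in this direction is not stated in the abstract.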
