IAES Inter national J our nal of Articial Intelligence (IJ-AI) V ol. 15, No. 2, April 2026, pp. 1771 1782 ISSN: 2252-8938, DOI: 10.11591/ijai.v15.i2.pp1771-1782 1771 T ransf ormer -based Hindi image description and storytelling using enhanced attention and F astT ext embeddings Anjali Sharma 1 , Mayank Aggarwal 1 , Jitin Khanna 2 1 Department of Computer Science and Engineering, F aculty of Engineering and T echnology , Gurukula Kangri (Deemed to be Uni v ersity), Haridw ar , India 2 Manager Data and Analytics, IBM, P aramus, United States Article Inf o Article history: Recei v ed Jun 23, 2025 Re vised Feb 6, 2026 Accepted Mar 5, 2026 K eyw ords: Ev aluation metrics F astT e xt embeddings Hindi image Squeeze-and-e xcitation T ransformer models ABSTRA CT This w ork presents a no v el image description generation frame w ork that combines a T ransformer -based encoder -decoder architecture with a custom squeeze-and-e xcitation (SE) attention block inte grated into an Ef cientNet feature e xtractor . The decoder uses F astT e xt embeddings specically trained for Hindi and is e v aluated on the Microsoft common objects i n conte xt (MS-COCO) dataset. T o impro v e the captioning process, the model i ncorporates a generati v e pre-trained transformer (GPT) module to generate narrati v e descriptions based on the initial captions and applies multiple similarity metrics to assess output quality . The proposed system signicantly outperforms e xisting methods, achie ving high bilingual e v aluation understudy (BLEU) scores (BLEU-1 to BLEU-4: 83.24, 73.17, 64.56, and 58.22), a consensus-based image description e v aluation (CIDEr) score of 81.41, an F1 score of 90.29, and a metric for e v aluation of translati on with e xplicit ordering (METEOR) score of 81.18, indicating strong caption accurac y . Furthermore, the model achie v es lo w error rates, with a w ord error rate (WER) of 15% and a character error rate (CER) of 11%. This w ork highlights the challenges of applying lar ge-sc ale datasets lik e MS-COCO to resource-limited languages and demonstrates the ef fecti v eness of inte grating F astT e xt embeddings with transformer -based models for Hindi image captioning. This is an open access article under the CC BY -SA license . Corresponding A uthor: Anjali Sharma Department of Computer Science and Engineering, F aculty of Engineering and T echnology Gurukula Kangri (Deemed to be Uni v ersity) Haridw ar , India Email: 23631001@gkv .ac.in 1. INTR ODUCTION Image des cription synthesis emplo ys perception techniques alongside language models to cr eate correct and rele v ant te xt. Deep learning models lik e con v olutional neural netw orks (CNNs), recurrent neural netw orks (RNNs), and architectures that are based on transformers (e.g., vision transformers (V iTs) and data-ef cient image transformers (DeiTs) ) ha v e signicantly adv anced this eld by producing semantically rich captions [1], [2]. Multilingual captioning, especially in comple x languages lik e Hindi, f aces challenges due to linguistic di v ersity and limited datasets. Hindi’ s unique syntax and morphology demand adapted models, b ut current resources lik e the translated Flickr8k dataset remain insuf cient [3]–[5]. J ournal homepage: http://ijai.iaescor e .com Evaluation Warning : The document was created with Spire.PDF for Python.
Despite progress in Hindi image captioning, limitations in dataset diversity and model adaptability hinder performance. Existing datasets like Flickr8k-Hindi and HIC restrict model generalizability. Integrating CNNs to extract visual features with transformer architectures for capturing global context enhances the overall quality of generated picture descriptions [6], [7]. Enhancing attention mechanisms further strengthens contextual and linguistic coherence [8]. This study targets three core objectives: building diverse datasets, refining contextual feature extraction, and improving linguistic accuracy.

This research advances Hindi image captioning by translating Microsoft common objects in context (MS-COCO) into Hindi, creating a robust dataset. It integrates squeeze-and-excitation (SE) attention-enhanced EfficientNet for detailed visual feature extraction and a transformer architecture tailored to Hindi's linguistic structure. FastText embeddings improve semantic richness, while a generative pre-trained transformer (GPT) refines captions for narrative depth. Evaluated using bilingual evaluation understudy (BLEU), consensus-based image description evaluation (CIDEr), metric for evaluation of translation with explicit ordering (METEOR), word error rate (WER), and character error rate (CER), the model establishes a strong performance baseline. Unique contributions include a Devanagari-adapted SE block and GPT-based caption extension, addressing dataset scarcity and linguistic complexity, with broad implications for inclusive AI in low-resource, morphologically rich languages.

The proposed method surpasses prior Hindi image captioning approaches by combining SE-attention, EfficientNet, and FastText embeddings for detailed visual capture and narrative-level generation. Unlike earlier sentence-level models with limited multimodal fusion and evaluation, our approach introduces a linguistically rich framework and a custom Hindi MS-COCO dataset with comprehensive metric coverage. Table 1 shows the comparative analysis of Hindi image captioning approaches.

Table 1. Comparative analysis of Hindi image captioning approaches (✓ = present, x = absent)
Paper                 Criteria 1-7
Sharma et al. [9]     x x x
Kaur et al. [10]      x x x
Patel et al. [11]     x x x x x
Gupta et al. [12]     x x x x x
Mishra et al. [13]    x x
Mishra et al. [14]    x x x
Bisht et al. [15]     x x
Rai et al. [16]       x x x
Harshit et al. [17]   x x x x
Proposed method       ✓ ✓ ✓ ✓ ✓ ✓ ✓
Note: 1: SE-attention, 2: CNN backbone, 3: embedding model, 4: multimodal integration, 5: dataset type MS-COCO, 6: narrative generation, and 7: evaluation metrics

Recent research in Hindi image captioning has explored various deep-learning architectures. Early efforts used CNN-long short-term memory (LSTM) models with datasets like Flickr8k and Flickr30k to generate Hindi captions, showing moderate success in visual-text alignment [18], [19]. Later works introduced attention blocks and transformer-based decoders (e.g., GPT-2), improving syntactic coherence and context capture [20], [13]. However, these models remain constrained by limited data and sentence-level generation, leaving gaps in narrative fluency and linguistic richness [21]. Attention mechanisms have become pivotal in enhancing caption quality by allowing models to focus on key image regions.
Techniques like self-enhanced attention (SEA), top-down attention, and enhanced focal modules have demonstrated improved performance on standard datasets by refining spatial focus and object relevance [22]-[24]. Multi-view and heterogeneous attention frameworks further advanced multimodal alignment and multilingual adaptability [25], [26], but often lacked customization for Indian language scripts like Devanagari.

EfficientNet has emerged as a compelling image encoder, balancing accuracy and computational efficiency. Studies show its integration with transformer decoders improves feature extraction and caption fluency while maintaining low model complexity, an essential aspect for deployment in resource-constrained environments [27]-[29]. Lightweight combinations like EfficientNet-MobileNet-Transformer have proven effective across standard benchmarks [30], [31].

One of the most enduring issues in Hindi picture description generation is the language's intricate linguistic structure. Morphologically rich constructions, high out-of-vocabulary (OOV) rates, and limited training datasets hinder performance. FastText embeddings, with their subword modelling, have been shown to outperform traditional Word2Vec and even rival transformer-based embeddings for named entity recognition and sentiment tasks in Hindi [32], [33]. Additionally, the lack of large-scale, diverse Hindi caption datasets like MS-COCO further restricts model generalizability. This research addresses these limitations through a novel integration of SE-attention EfficientNet, FastText embeddings, and transformer decoders, explicitly tailored to Hindi morphology and Devanagari script structure.
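To make the subword property concrete, the following is a minimal sketch of FastText producing a vector for an out-of-vocabulary Hindi inflection. It uses gensim with the publicly released cc.hi.300.bin Hindi vectors from fasttext.cc as a stand-in; the paper trains its own Hindi FastText embeddings, so the file path and example words here are purely illustrative.

```python
# Minimal sketch: FastText subword lookups for Hindi (illustrative only).
from gensim.models.fasttext import load_facebook_vectors

# Assumed local copy of the pretrained Hindi vectors from fasttext.cc.
vectors = load_facebook_vectors("cc.hi.300.bin")

# In-vocabulary word: returned directly from the embedding matrix.
v_known = vectors["लड़का"]   # "boy"

# Inflected form: even if absent from the vocabulary, FastText composes a
# vector from character n-gram (subword) embeddings, so morphological
# variants of Hindi words still receive meaningful representations.
v_oov = vectors["लड़कों"]    # "boys" (oblique plural)

print(v_known.shape, v_oov.shape)  # both (300,)
```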
2. METHOD
This section presents a transformer-based Hindi image captioning framework combining EfficientNet-B4, SE-attention, and FastText embeddings to handle Hindi's morphological richness. The system is divided into four stages: data preprocessing, SE-attention feature extraction, transformer-based encoding-decoding, and GPT-based caption enhancement.

Overall architecture: the model workflow (Figure 1) begins with translating MS-COCO captions into Hindi [34], cleaning and tokenizing them, and preparing image-caption pairs. Images are resized, converted into tensors, and passed to EfficientNet-B4, enhanced with SE blocks.

Figure 1. Proposed image captioning system

Feature extraction (EfficientNet + SE-attention): EfficientNet uses compound scaling to balance depth, width, and resolution [35]. The SE-attention and encoder-decoder flowcharts are shown in Figure 2, where Figure 2(a) shows the SE-attention mechanism and Figure 2(b) shows the encoder-decoder architecture. We extend EfficientNet's built-in SE block with a custom module, improving channel-wise and spatial recalibration. Each feature channel $c$ is rescaled by a learned gate $s_c$:

$\tilde{X}_{i,j,c} = s_c \cdot X_{i,j,c}$ (1)

Transformer encoder-decoder with FastText: features are fed into a transformer encoder with FastText Hindi embeddings and positional encodings [36]. The decoder leverages self-attention, cross-attention, and sequential feed-forward networks to produce captions rich in contextual meaning. Each attention head computes:

$\mathrm{Attention}(Q_i, K_i, V_i) = \mathrm{softmax}\!\left(\dfrac{Q_i K_i^{T}}{\sqrt{d_{\mathrm{head}}}}\right) V_i$ (2)

FastText's subword modeling effectively captures Hindi morphology and OOV words [37]. Its skip-gram with negative sampling (SGNS) objective enhances rare-word representation quality.

Caption generation and GPT integration: the decoder produces a Hindi caption $C_i$ from the decoded features $F_d$:

$C_i = \mathrm{Decode}(F_d)$ (3)
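The following is a minimal PyTorch sketch of the channel-wise SE gate in Eq. (1). The reduction ratio of 16 is a conventional assumption rather than a value reported in the paper, and the paper's additional spatial-recalibration branch is omitted for brevity.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation: global pooling -> bottleneck MLP -> sigmoid
    gates, applied channel-wise as in Eq. (1): X~_{i,j,c} = s_c * X_{i,j,c}."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        s = x.mean(dim=(2, 3))           # squeeze: (B, C) channel descriptors
        s = self.fc(s).view(b, c, 1, 1)  # excitation: per-channel gates s_c
        return x * s                     # recalibration, Eq. (1)

# Example: recalibrating EfficientNet-B4 feature maps (1792 channels).
feats = torch.randn(2, 1792, 7, 7)
out = SEBlock(1792)(feats)               # same shape: (2, 1792, 7, 7)
```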
Captions are evaluated using BLEU, CIDEr, METEOR [38]-[40], and the error metrics WER and CER [41], [42]. We then refine each caption $C_i$ with GPT into a narrative $S_i$, improving fluency and narrative depth:

$S_i = \mathrm{GPT}(C_i)$ (4)

Figure 2. Attention and encoder-decoder flowcharts: (a) SE-attention mechanism and (b) encoder-decoder architecture flowchart

GPT was used as a post-processing module to enhance narrative richness while preserving semantic alignment. Transformer-generated captions were provided using a structured prompt with constraints on length, tense, and topic relevance, ensuring coherent and image-consistent storytelling.
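A hedged sketch of this post-processing stage (Eq. (4)) is shown below, assuming the OpenAI Python client (openai >= 1.0). The paper does not name the exact GPT variant or client library, so the model identifier and API usage are illustrative; the system prompt is the template quoted in the next paragraph.

```python
# Sketch of GPT-based caption refinement, S_i = GPT(C_i); illustrative only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Structured prompt constraining length, tense, and topic relevance.
PROMPT = (
    "Given the following Hindi image caption, generate a short, coherent "
    "narrative. Do not introduce new objects, actions, or events beyond the "
    "caption. Maintain the present tense and limit the output to 2-3 sentences."
)

def refine_caption(hindi_caption: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice, not the paper's
        messages=[
            {"role": "system", "content": PROMPT},
            {"role": "user", "content": hindi_caption},
        ],
        temperature=0.7,
    )
    return response.choices[0].message.content.strip()
```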
GPT-enhanced captions are assessed using syntactic, lexical, semantic, and Jaccard similarity metrics [43], enabling broader applications like visual question answering and multilingual dialogue systems. In addition, a small-scale human evaluation assessed narrative coherence and expressive quality, confirming that GPT outputs improved fluency and descriptive depth without semantic drift.

GPT was guided by a structured prompt template to control narrative generation and prevent hallucinated content. An example of the prompt used is: "Given the following Hindi image caption, generate a short, coherent narrative. Do not introduce new objects, actions, or events beyond the caption. Maintain the present tense and limit the output to 2-3 sentences."

3. EXPERIMENTAL RESULTS
This section discusses the dataset, preprocessing pipeline, model training, performance metrics, ablation study, benchmarking, and qualitative evaluations.

3.1. Dataset details
We used the COCO-2017 dataset [34], translating English captions into Hindi using the Google Translate API, and applied a multi-stage quality assurance process to reduce noise and address Hindi's linguistic complexity. Table 2 summarizes the dataset statistics; Figure 3 shows a sample translation frame.

Table 2. COCO-2017 dataset statistics
Dataset     Training  Validation  Testing  Vocabulary size
COCO-2017   118k      5k          40.7k    29,075

Figure 3. A sample translation frame

Translation quality assurance and validation: to improve translation quality, a three-stage post-translation validation process was applied.

Automated filtering: captions were normalized using Unicode standardization for the Devanagari script, removal of duplicated tokens, punctuation correction, and elimination of non-Hindi artifacts.

Semantic consistency: Hindi captions $H_i$ were compared with their English counterparts $E_i$ using FastText embeddings. Captions with cosine similarity below a threshold ($\tau = 0.65$) were discarded:

$\mathrm{Sim}(E_i, H_i) = \dfrac{E_i \cdot H_i}{\lVert E_i \rVert \, \lVert H_i \rVert}$ (5)

Manual validation: a random subset of 5,000 image-caption pairs was reviewed by native Hindi speakers, with over 93% of captions deemed linguistically acceptable after automated filtering.

3.2. Data pre-processing and model training configuration
Images were resized to 224×224 and normalized. Captions were cleaned, tokenized, and padded to 51 tokens. FastText Hindi embeddings (300-dim) were projected to 512-dim using:

$E' = WE + b$ (6)

where $E$ is the original 300-dimensional FastText embedding, $W$ is a learnable weight matrix of shape $(512 \times 300)$, and $b$ is a bias vector.
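The sketch below illustrates these two steps, the semantic-consistency filter of Eq. (5) and the embedding projection of Eq. (6), as a minimal implementation. It assumes each caption is represented by a single 300-dimensional FastText sentence vector (for example, averaged word vectors) in a space where the English-Hindi comparison is meaningful; the paper does not specify how sentence vectors are formed.

```python
# Sketch of Eq. (5) (cosine filter, tau = 0.65) and Eq. (6) (projection).
import numpy as np
import torch
import torch.nn as nn

def keep_pair(e_en: np.ndarray, e_hi: np.ndarray, tau: float = 0.65) -> bool:
    """Eq. (5): retain an English-Hindi caption pair only if the cosine
    similarity of their sentence embeddings is at least tau."""
    sim = float(e_en @ e_hi) / (np.linalg.norm(e_en) * np.linalg.norm(e_hi))
    return sim >= tau

# Eq. (6): E' = WE + b, realized as a learnable 300-d -> 512-d linear layer.
project = nn.Linear(in_features=300, out_features=512, bias=True)
caption_emb = torch.randn(51, 300)   # 51 padded tokens, 300-d FastText each
model_input = project(caption_emb)   # shape: (51, 512), fed to the encoder
```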
Model training was performed using the AdamW optimizer with label smoothing (0.1), KLDivLoss, a warmup scheduler with 4000 steps, and gradient clipping (norm = 1.0). A teacher-forcing strategy was adopted throughout training. The hyperparameters and associated model performance metrics are provided in Table 3.

Table 3. Training hyperparameters and model performance metrics
Hyperparameter      Value             Performance metric  Value
Label smoothing     0.1               Model size          157.3 MB
Optimizer           AdamW             Trainable params    39.2 M
Learning rate       5 × 10⁻⁴          Inference time      16.1 ms/image
Loss function       KLDivLoss         GPU usage           4450 MB
Warmup steps        4000              FLOPs               0.84 GFLOPs
Gradient clipping   1.0
Training strategy   Teacher forcing

3.3. Performance and evaluation metrics
BLEU, CIDEr, METEOR, WER, and CER were used to evaluate captioning. GPT-refined captions were assessed using syntactic, lexical, semantic, and Jaccard similarities. Evaluation metrics such as BLEU and CIDEr primarily rely on n-gram overlap and may be sensitive to surface-level variations in morphologically rich languages like Hindi, where multiple valid inflected forms and flexible word order are common. As a result, these metrics can underestimate semantic correctness despite accurate visual grounding. METEOR, along with WER and CER, provides complementary insight by accounting for linguistic variation, word alignment, and error patterns specific to Hindi. Training metrics and system performance trends are shown in Figure 4.

Figure 4. Hindi image captioning: training accuracy and loss overview
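As a hedged sketch of how these scores can be computed, the snippet below uses NLTK for BLEU-1 to BLEU-4 and jiwer for WER/CER. The Hindi sentence pair is illustrative rather than taken from the dataset, and CIDEr/METEOR require their reference implementations (e.g., the coco-caption toolkit), which are omitted here.

```python
# Sketch: BLEU-1..4, WER, and CER for one caption pair (illustrative data).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
import jiwer

reference = "एक आदमी घोड़े की सवारी कर रहा है"   # ground-truth Hindi caption
hypothesis = "आदमी घोड़े की सवारी कर रहा है"      # model output

ref_tokens, hyp_tokens = reference.split(), hypothesis.split()
smooth = SmoothingFunction().method1

# Cumulative BLEU-n uses uniform weights over the first n n-gram orders.
for n in range(1, 5):
    weights = tuple(1.0 / n for _ in range(n))
    score = sentence_bleu([ref_tokens], hyp_tokens, weights=weights,
                          smoothing_function=smooth)
    print(f"BLEU-{n}: {score:.4f}")

# Word- and character-level error rates (cer requires a recent jiwer release).
print("WER:", jiwer.wer(reference, hypothesis))
print("CER:", jiwer.cer(reference, hypothesis))
```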
Final training and validation accuracies reached 76% and 78%, respectively (Table 4). The model is efficient (16.1 ms inference time), using 4450 MB of GPU memory. Evaluation metrics are presented in Figure 5, where Figure 5(a) shows image captioning evaluation metrics, Figure 5(b) shows WER and CER evaluation, Figure 5(c) shows scalability analysis, and Figure 5(d) shows the ablation study. Table 4 summarizes the scores.

Table 4. Combined summary of model training, performance, and evaluation metrics
Metric           Training  Validation
Final loss       1.3       1.2
Final accuracy   76%       78%

Model metric     Value
Model size       157.3 MB
Trainable params 39.2 M
Inference time   16.1 ms/image
GPU usage        4450 MB
FLOPs            0.84 GFLOPs

Evaluation metric  Score
BLEU-1 to BLEU-4   83.24%, 73.17%, 64.56%, 58.22%
CIDEr              81.41%
METEOR             81.18%
F1-score           90.29%
WER                14.82%
CER                10.75%

Figure 5. Combined view of four image captioning analysis visuals: (a) image captioning evaluation metrics, (b) WER and CER evaluation for image captioning, (c) scalability analysis, and (d) ablation study

3.4. Ablation analysis
Three model variants were tested: the base transformer, the base with SE-attention, and the base with SE-attention + FastText. Table 5 and Figure 5(d) demonstrate consistent improvement in all evaluation metrics with each architectural enhancement. To assess the statistical significance of the performance gains introduced by SE-attention and FastText embeddings, a two-tailed paired t-test was conducted across five independent experimental runs (N = 5). The test was applied to per-run evaluation scores computed on the same test set for all model configurations. Results indicate that the SE + FastText model achieves statistically significant improvements over the base configuration in BLEU-4, CIDEr, METEOR, and F1-score at the 95% confidence level (p < 0.05), suggesting that the observed gains are unlikely to be due to random variation. Furthermore, as shown in Figure 5(c), the model scales linearly in training time and memory requirements as the dataset size increases from 10K to 118K samples, confirming its suitability for large-scale deployment scenarios.

The ablation study evaluates components that directly influence caption generation, including SE-attention and FastText embeddings. GPT was excluded from the ablation analysis, as it functions solely as a post-processing module for narrative enhancement and does not affect core captioning metrics. Its impact is assessed separately through narrative-level evaluation.
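A minimal sketch of this significance test is shown below: a two-tailed paired t-test over per-run BLEU-4 scores for two configurations, with N = 5 runs. The score values are placeholders for illustration, not the paper's actual per-run results.

```python
# Sketch: two-tailed paired t-test over per-run scores (N = 5 runs).
from scipy import stats

base_bleu4 = [40.9, 41.3, 41.0, 41.2, 40.7]      # hypothetical per-run scores
se_ft_bleu4 = [58.0, 58.5, 58.1, 58.4, 58.1]     # hypothetical per-run scores

# ttest_rel pairs scores run-by-run, matching the same test set per run.
t_stat, p_value = stats.ttest_rel(se_ft_bleu4, base_bleu4)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")    # significant if p < 0.05
```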
All ablation experiments were conducted over multiple runs, and the reported results reflect consistent performance trends across configurations, indicating the robustness of the observed improvements.

Table 5. Ablation study results with statistical significance analysis (p < 0.05)
Metric   Base   SE     SE+FastText  t-value  p-value
BLEU-4   41.08  49.44  58.22        3.12     0.021
CIDEr    68.49  77.26  81.41        3.45     0.015
F1       84.00  88.47  90.29        2.87     0.028
METEOR   76.23  80.76  81.18        2.41     0.041

3.5. Scalability and generalization
Beyond the scalability evaluation, the proposed model was also tested on an external Hindi image captioning dataset, Flickr8k-Hindi, to examine its generalization capability. Using the same model architecture and evaluation setup, the approach maintained stable performance across datasets, obtaining a BLEU-4 score of 54.10, a CIDEr score of 74.85, a METEOR score of 76.32, and an F1-score of 87.45. Linguistic accuracy remained strong, with a WER of 18.6% and a CER of 13.9%. Although the dataset differs in scale and annotation style, the model consistently preserved relative performance gains over baseline configurations, demonstrating effective robustness and cross-dataset generalization.

3.6. Benchmarking and qualitative evaluation
As Table 6 shows, the proposed model outperforms existing Hindi image captioning approaches across BLEU scores, establishing a strong performance baseline. To assess the impact of GPT-based narrative enhancement, a comparative evaluation was conducted between base captions and GPT-enhanced narratives. Automatic similarity metrics, including syntactic, lexical, semantic, and Jaccard similarity, were used to verify semantic consistency and prevent content drift.

Table 6. Comparison of Hindi image captioning models
Authors         Model              B1     B2     B3     B4
Mishra et al.   Encoder-decoder    62.9   43.3   29.1   19.0
Singh et al.    CNN + RNN          51.3   30.4   16.7   12.4
Dhir            CNN + GRU          57.0   39.0   26.4   17.3
Rathi           CNN + LSTM         58.0   47.0   39.0   35.0
Meghwal         CNN + LSTM         62.5   45.8   32.8   23.2
Proposed model  CNN + transformer  83.24  73.17  64.56  58.22

In addition, a small-scale human evaluation was performed on a randomly selected subset of samples. Native Hindi speakers rated both versions using a three-point Likert scale (low, medium, high) based on narrative coherence and expressive richness. As summarized in Table 7, GPT-enhanced narratives consistently improved coherence and expressive depth while preserving the original semantic content. Sample qualitative results with captions and narrations are presented in Figure 6.

To contextualize the quantitative evaluation metrics, a qualitative error analysis was conducted on a randomly selected subset of 200 generated captions. Each caption was manually examined and assigned a dominant linguistic error category. The analysis revealed that most inaccuracies were minor and linguistically driven rather than semantic. Common error types included gender and number agreement mismatches, variations in word order, and postposition usage. These errors generally preserved the intended meaning but negatively affected n-gram-based metrics, highlighting the importance of complementing quantitative scores with qualitative analysis for morphologically rich languages such as Hindi. The results are summarized in Table 8.
Table 7. Human evaluation of caption vs. GPT-enhanced narrative
Criterion             Base caption  GPT-enhanced
Narrative coherence   Medium        High
Expressive richness   Low           Improved
Semantic consistency  High          High
Figure 6. Sample images with generated captions, evaluation scores, and GPT-enhanced narrations

Table 8. Distribution of common error types in Hindi caption generation
Error type                      Percentage (%)
Gender/number agreement errors  36.0
Word order variations           28.5
Postposition usage errors       20.0
Lexical inflection errors       15.5

Scores indicate semantic correctness and fluency. GPT-based narrations further enhance richness and expressiveness, and the descriptions maintain semantic alignment with high syntactic and lexical similarity. The model effectively balances accuracy and creativity in Hindi caption generation.

4. CONCLUSION
This work presents an advanced Hindi image captioning framework that integrates a custom SE-attention mechanism with an EfficientNet-based transformer encoder-decoder architecture. Experimental results show substantial improvements in BLEU, CIDEr, and METEOR scores, along with reduced WER and CER, compared to existing methods. The use of FastText embeddings enables effective modeling of Hindi's morphological and syntactic characteristics, making the approach suitable for non-English and low-resource language captioning. The model maintains robust performance across datasets of varying scale. Although the model exhibits good cross-dataset generalization, domain-specific variations in visual content and linguistic style may affect performance in new application settings. Future work will focus on domain adaptation and transfer learning strategies, such as fine-tuning on target-domain data and multilingual pretraining, as well as exploring advanced attention mechanisms, multimodal extensions, and reinforcement learning to further enhance caption quality.
FUNDING INFORMATION
This work was carried out independently and did not receive financial assistance from any governmental, corporate, or academic grant-awarding bodies.

AUTHOR CONTRIBUTIONS STATEMENT
This journal uses the Contributor Roles Taxonomy (CRediT) to recognize individual author contributions, reduce authorship disputes, and facilitate collaboration.

Name of Author    C  M  So  Va  Fo  I  R  D  O  E  Vi  Su  P  Fu
Anjali Sharma
Mayank Aggarwal
Jitin Khanna

C: Conceptualization, M: Methodology, So: Software, Va: Validation, Fo: Formal Analysis, I: Investigation, R: Resources, D: Data Curation, O: Writing - Original Draft, E: Writing - Review & Editing, Vi: Visualization, Su: Supervision, P: Project Administration, Fu: Funding Acquisition

CONFLICT OF INTEREST STATEMENT
The authors declare that this research was conducted without any competing financial or personal interests.

ETHICAL CONSIDERATIONS
This study uses publicly available datasets and addresses ethical considerations related to automated image captioning and storytelling in Hindi. When applied to sensitive domains such as news or education, ensuring accuracy, cultural sensitivity, and human oversight is essential to prevent misinterpretation or misleading content.

INFORMED CONSENT
No informed consent was required, as the study did not involve human subjects or personal data.

DATA AVAILABILITY
The dataset employed in this research is publicly accessible and can be retrieved from the official COCO dataset website: https://cocodataset.org/#download.

REFERENCES
[1] K. Rage, "A study on different deep learning architectures on image captioning," in 2022 8th International Conference on Smart Structures and Systems (ICSSS), Apr. 2022, pp. 1-9, doi: 10.1109/ICSSS54381.2022.9782260.
[2] R. Castro, I. Pineda, W. Lim, and M. E. M.-Cayamcela, "Deep learning approaches based on transformer architectures for image captioning tasks," IEEE Access, vol. 10, pp. 33679-33694, 2022, doi: 10.1109/ACCESS.2022.3161428.
[3] J. Zhang, D. Guo, X. Yang, P. Song, and M. Wang, "Visual-linguistic-stylistic triple reward for cross-lingual image captioning," ACM Transactions on Multimedia Computing, Communications, and Applications, vol. 20, no. 4, pp. 1-23, Apr. 2024, doi: 10.1145/3634917.
[4] B. R. Reddy, S. Gunti, R. P. Kumar, and S. Sridevi, "Multilingual image captioning: multimodal framework for bridging visual and linguistic realms in Tamil and Telugu through transformers," Research Square, doi: 10.21203/rs.3.rs-3380598/v1.
[5] V. Jayaswal, R. Rani, and J. Kaur, "A deep learning-based efficient image captioning approach for Hindi language," in Developments Towards Next Generation Intelligent Systems for Sustainable Development. New York, United States: IGI Global, 2024, pp. 225-246, doi: 10.4018/979-8-3693-5643-2.ch009.
[6] H. Ahmadabadi, O. N. Manzari, and A. Ayatollahi, "Distilling knowledge from CNN-transformer models for enhanced human action recognition," in 2023 13th International Conference on Computer and Knowledge Engineering (ICCKE), Nov. 2023, pp. 180-184, doi: 10.1109/ICCKE60553.2023.10326272.