IAES International Journal of Artificial Intelligence (IJ-AI)
Vol. 15, No. 2, April 2026, pp. 1771–1782
ISSN: 2252-8938, DOI: 10.11591/ijai.v15.i2.pp1771-1782
Transformer-based Hindi image description and storytelling using enhanced attention and FastText embeddings
Anjali Sharma¹, Mayank Aggarwal¹, Jitin Khanna²

¹Department of Computer Science and Engineering, Faculty of Engineering and Technology, Gurukula Kangri (Deemed to be University), Haridwar, India
²Manager Data and Analytics, IBM, Paramus, United States
Article Info

Article history:
Received Jun 23, 2025
Revised Feb 6, 2026
Accepted Mar 5, 2026

Keywords:
Evaluation metrics
FastText embeddings
Hindi image
Squeeze-and-excitation
Transformer models
ABSTRACT

This work presents a novel image description generation framework that combines a Transformer-based encoder-decoder architecture with a custom squeeze-and-excitation (SE) attention block integrated into an EfficientNet feature extractor. The decoder uses FastText embeddings specifically trained for Hindi and is evaluated on the Microsoft common objects in context (MS-COCO) dataset. To improve the captioning process, the model incorporates a generative pre-trained transformer (GPT) module to generate narrative descriptions based on the initial captions and applies multiple similarity metrics to assess output quality. The proposed system significantly outperforms existing methods, achieving high bilingual evaluation understudy (BLEU) scores (BLEU-1 to BLEU-4: 83.24, 73.17, 64.56, and 58.22), a consensus-based image description evaluation (CIDEr) score of 81.41, an F1 score of 90.29, and a metric for evaluation of translation with explicit ordering (METEOR) score of 81.18, indicating strong caption accuracy. Furthermore, the model achieves low error rates, with a word error rate (WER) of 15% and a character error rate (CER) of 11%. This work highlights the challenges of applying large-scale datasets like MS-COCO to resource-limited languages and demonstrates the effectiveness of integrating FastText embeddings with transformer-based models for Hindi image captioning.

This is an open access article under the CC BY-SA license.
Corresponding Author:
Anjali Sharma
Department of Computer Science and Engineering, Faculty of Engineering and Technology
Gurukula Kangri (Deemed to be University)
Haridwar, India
Email: 23631001@gkv.ac.in
1. INTRODUCTION
Image description synthesis employs perception techniques alongside language models to create correct and relevant text. Deep learning models like convolutional neural networks (CNNs), recurrent neural networks (RNNs), and transformer-based architectures (e.g., vision transformers (ViTs) and data-efficient image transformers (DeiTs)) have significantly advanced this field by producing semantically rich captions [1], [2]. Multilingual captioning, especially in complex languages like Hindi, faces challenges due to linguistic diversity and limited datasets. Hindi's unique syntax and morphology demand adapted models, but current resources like the translated Flickr8k dataset remain insufficient [3]–[5].
Despite progress in Hindi image captioning, limitations in dataset diversity and model adaptability hinder performance. Existing datasets like Flickr8k-Hindi and HIC restrict model generalizability. Integrating CNNs to extract visual features with transformer architectures for capturing global context enhances the overall quality of generated picture descriptions [6], [7]. Enhancing attention mechanisms further strengthens contextual and linguistic coherence [8]. This study targets three core objectives: building diverse datasets, refining contextual feature extraction, and improving linguistic accuracy.
This research advances Hindi image captioning by translating Microsoft common objects in context (MS-COCO) into Hindi, creating a robust dataset. It integrates squeeze-and-excitation (SE) attention-enhanced EfficientNet for detailed visual feature extraction and a transformer architecture tailored to Hindi's linguistic structure. FastText embeddings improve semantic richness, while a generative pre-trained transformer (GPT) refines captions for narrative depth. Evaluated using bilingual evaluation understudy (BLEU), consensus-based image description evaluation (CIDEr), metric for evaluation of translation with explicit ordering (METEOR), word error rate (WER), and character error rate (CER), the model establishes a strong performance baseline. Unique contributions include a Devanagari-adapted SE block and GPT-based caption extension, addressing dataset scarcity and linguistic complexity, with broad implications for inclusive AI in low-resource, morphologically rich languages.

The proposed method surpasses prior Hindi image captioning approaches by combining SE-attention, EfficientNet, and FastText embeddings for detailed visual capture and narrative-level generation. Unlike earlier sentence-level models with limited multimodal fusion and evaluation, our approach introduces a linguistically rich framework and a custom Hindi MS-COCO dataset with comprehensive metric coverage. Table 1 shows the comparative analysis of Hindi image captioning approaches.
Table 1. Comparative analysis of Hindi image captioning approaches (✓ = present, x = absent)

Paper                1  2  3  4  5  6  7
Sharma et al. [9]    x  ✓  ✓  ✓  x  x  ✓
Kaur et al. [10]     x  ✓  ✓  ✓  x  x  ✓
Patel et al. [11]    x  ✓  x  ✓  x  x  x
Gupta et al. [12]    x  ✓  ✓  x  x  x  x
Mishra et al. [13]   x  ✓  ✓  ✓  ✓  x  ✓
Mishra et al. [14]   x  ✓  x  ✓  ✓  x  ✓
Bisht et al. [15]    x  ✓  ✓  ✓  ✓  x  ✓
Rai et al. [16]      ✓  x  x  ✓  ✓  x  ✓
Harshit et al. [17]  x  x  x  ✓  ✓  x  ✓
Proposed method      ✓  ✓  ✓  ✓  ✓  ✓  ✓

Note: 1: SE-attention, 2: CNN backbone, 3: embedding model, 4: multimodal integration, 5: dataset type MS-COCO, 6: narrative generation, and 7: evaluation metrics.
Recent research in Hindi image captioning has explored various deep-learning architectures. Early efforts used CNN-long short-term memory (LSTM) models with datasets like Flickr8k and Flickr30k to generate Hindi captions, showing moderate success in visual-text alignment [18], [19]. Later works introduced attention blocks and transformer-based decoders (e.g., GPT-2), improving syntactic coherence and context capture [20], [13]. However, these models remain constrained by limited data and sentence-level generation, leaving gaps in narrative fluency and linguistic richness [21].
Attention mechanisms have become pivotal in enhancing caption quality by allowing models to focus on key image regions. Techniques like self-enhanced attention (SEA), top-down attention, and enhanced focal modules have demonstrated improved performance on standard datasets by refining spatial focus and object relevance [22]–[24]. Multiview and heterogeneous attention frameworks further advanced multimodal alignment and multilingual adaptability [25], [26], but often lacked customization for Indian language scripts like Devanagari.
EfficientNet has emerged as a compelling image encoder, balancing accuracy and computational efficiency. Studies show its integration with transformer decoders improves feature extraction and caption fluency while maintaining low model complexity, an essential aspect for deployment in resource-constrained environments [27]–[29]. Lightweight combinations like EfficientNet-MobileNet-Transformer have proven effective across standard benchmarks [30], [31].
One of the most enduring issues in Hindi picture description generation is the language's intricate linguistic structure. Morphologically rich structures, high out-of-vocabulary (OOV) rates, and limited training
datasets hinder performance. FastText embeddings, with their subword modeling, have been shown to outperform traditional Word2Vec and even rival transformer-based embeddings for named entity recognition and sentiment tasks in Hindi [32], [33]. Additionally, the lack of large-scale, diverse Hindi caption datasets like MS-COCO further restricts model generalizability.
This research addresses these limitations through a novel integration of SE-attention EfficientNet, FastText embeddings, and transformer decoders, explicitly tailored to Hindi morphology and Devanagari script structure.
2. METHOD
This section presents a transformer-based Hindi image captioning framework combining EfficientNet-B4, SE-attention, and FastText embeddings to handle Hindi's morphological richness. The system is divided into four stages: data preprocessing, SE-attention feature extraction, transformer-based encoding-decoding, and GPT-based caption enhancement.
Overall architecture: the model workflow (Figure 1) begins with translating MS-COCO captions into Hindi [34], cleaning and tokenizing them, and preparing image-caption pairs. Images are resized, converted into tensors, and passed to EfficientNet-B4, enhanced with SE blocks.
Figure 1. Proposed image captioning system
Feature extraction: EfficientNet + SE-attention. EfficientNet uses compound scaling for balancing depth, width, and resolution [35]. The SE-attention and encoder-decoder flowcharts are shown in Figure 2, where Figure 2(a) shows the SE-attention mechanism and Figure 2(b) shows the encoder–decoder architecture. We extend its built-in SE block with a custom module, improving channel-wise and spatial recalibration:

X'_{i,j,c} = s_c X_{i,j,c}    (1)
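A minimal PyTorch sketch of the recalibration in (1) follows, assuming the standard squeeze (global average pooling) and a two-layer excitation network with reduction ratio 16; the custom module used in this work extends this basic form, so details may differ.

    import torch
    import torch.nn as nn

    class SEBlock(nn.Module):
        """Squeeze-and-excitation: rescales channel c of the feature map X by a
        learned gate s_c, i.e. X'_{i,j,c} = s_c * X_{i,j,c} as in (1)."""
        def __init__(self, channels: int, reduction: int = 16):
            super().__init__()
            self.squeeze = nn.AdaptiveAvgPool2d(1)  # global average pool per channel
            self.excite = nn.Sequential(
                nn.Linear(channels, channels // reduction),
                nn.ReLU(inplace=True),
                nn.Linear(channels // reduction, channels),
                nn.Sigmoid(),  # channel gates s_c in (0, 1)
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            b, c, _, _ = x.shape
            s = self.excite(self.squeeze(x).view(b, c))  # (b, c) gates
            return x * s.view(b, c, 1, 1)  # recalibrated features X'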
Transformer encoder-decoder with FastText: features are fed into a transformer encoder with FastText Hindi embeddings and positional encodings [36]. The decoder leverages self-attention, cross-attention, and sequential feed-forward networks to produce captions that are rich in contextual meaning:

Attention(Q_i, K_i, V_i) = softmax(Q_i K_i^T / \sqrt{d_{head}}) V_i    (2)
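The scaled dot-product attention in (2) can be sketched as follows (a simplified single-head form; the actual model uses multi-head attention with learned projections):

    import torch
    import torch.nn.functional as F

    def attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        """softmax(Q K^T / sqrt(d_head)) V, the per-head form in (2)."""
        d_head = q.size(-1)
        scores = q @ k.transpose(-2, -1) / d_head ** 0.5  # (..., L_q, L_k)
        return F.softmax(scores, dim=-1) @ v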
FastText's subword modeling effectively captures Hindi morphology and OOV words [37]. Its skip-gram with negative sampling (SGNS) objective enhances rare-word representation quality.
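As an illustration, subword-aware Hindi embeddings with the SGNS objective can be trained with gensim's FastText; the toy corpus and hyperparameters below are assumptions, not the paper's exact configuration.

    from gensim.models import FastText

    # Toy tokenized Hindi corpus; in practice, the translated MS-COCO captions.
    corpus = [
        ["एक", "आदमी", "घोड़े", "पर", "बैठा", "है"],
        ["एक", "कुत्ता", "घास", "पर", "दौड़", "रहा", "है"],
    ]
    # sg=1 with negative sampling gives the SGNS objective; min_n/max_n set the
    # character n-gram range that provides subword (and hence OOV) coverage.
    model = FastText(corpus, vector_size=300, window=5, min_count=1,
                     sg=1, negative=10, min_n=2, max_n=5, epochs=10)
    vec = model.wv["घोड़ों"]  # an unseen inflection still gets a vector via its n-grams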
Caption generation and GPT integration: the decoder outputs Hindi captions:

C'_i ← Decode(F_d)    (3)
Captions are evaluated using BLEU, CIDEr, and METEOR [38]–[40], and the error metrics WER and CER [41], [42]. We then refine them with GPT, improving fluency and narrative depth:

S_i ← GPT(C'_i)    (4)
Figure 2. Attention and encoder–decoder flowcharts: (a) SE-attention mechanism and (b) encoder–decoder architecture flowchart
GPT was used as a post-processing module to enhance narrative richness while preserving semantic alignment. Transformer-generated captions were provided to GPT using a structured prompt with constraints on length, tense, and topic relevance, ensuring coherent and image-consistent storytelling. GPT-enhanced captions are
assessed using syntactic, lexical, semantic, and Jaccard similarity metrics [43], enabling broader applications like visual question answering and multilingual dialogue systems. In addition, a small-scale human evaluation assessed narrative coherence and expressive quality, confirming that GPT outputs improved fluency and descriptive depth without semantic drift.
GPT was guided using a structured prompt template to control narrative generation and prevent hallucinated content. An example of the prompt used is: "Given the following Hindi image caption, generate a short, coherent narrative. Do not introduce new objects, actions, or events beyond the caption. Maintain the present tense and limit the output to 2–3 sentences."
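A minimal sketch of this post-processing step, assuming access to an OpenAI-style chat API (the paper does not specify the GPT variant or access method, so the client and model name here are assumptions):

    from openai import OpenAI  # assumption: OpenAI-style chat API

    client = OpenAI()
    PROMPT = ("Given the following Hindi image caption, generate a short, coherent "
              "narrative. Do not introduce new objects, actions, or events beyond "
              "the caption. Maintain the present tense and limit the output to 2-3 "
              "sentences.\n\nCaption: {caption}")

    def enhance(caption: str) -> str:
        """S_i <- GPT(C'_i), the narrative-enhancement step in (4)."""
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # hypothetical choice; the paper's GPT variant is unstated
            messages=[{"role": "user", "content": PROMPT.format(caption=caption)}],
        )
        return response.choices[0].message.content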
3. EXPERIMENTAL RESULTS
This section discusses the dataset, preprocessing pipeline, model training, performance metrics, ablation study, benchmarking, and qualitative evaluations.
3.1. Dataset details
We used the COCO-2017 dataset [34], translating its English captions into Hindi with the Google Translate API and applying a multi-stage quality assurance process to reduce noise and address Hindi's linguistic complexity. Table 2 summarizes the dataset statistics; Figure 3 shows a sample translation frame.
Table 2. COCO-2017 dataset statistics

Dataset    Training  Validation  Testing  Vocabulary size
COCO-2017  118k      5k          40.7k    29,075
Figure 3. A sample translation frame
Translation quality assurance and validation: to improve translation quality, a three-stage post-translation validation process was applied.
– Automated filtering: captions were normalized using Unicode standardization for the Devanagari script, removal of duplicated tokens, punctuation correction, and elimination of non-Hindi artifacts.
– Semantic consistency: Hindi captions H_i were compared with their English counterparts E_i using FastText embeddings, as sketched after this list. Captions with cosine similarity below a threshold (τ = 0.65) were discarded:

Sim(E_i, H_i) = (E_i · H_i) / (||E_i|| ||H_i||)    (5)

– Manual validation: a random subset of 5,000 image–caption pairs was reviewed by native Hindi speakers, with over 93% of captions deemed linguistically acceptable after automated filtering.
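A sketch of the stage-two filter in (5), assuming caption-level embeddings obtained by averaging FastText word vectors and cross-lingually aligned embedding spaces (the paper does not detail the alignment):

    import numpy as np

    def cosine_sim(e_i: np.ndarray, h_i: np.ndarray) -> float:
        """Sim(E_i, H_i) from (5)."""
        return float(e_i @ h_i / (np.linalg.norm(e_i) * np.linalg.norm(h_i)))

    def keep_pair(e_i: np.ndarray, h_i: np.ndarray, tau: float = 0.65) -> bool:
        # E_i, H_i are caption-level embeddings (e.g., averaged FastText word
        # vectors); comparing across languages assumes aligned embedding spaces.
        return cosine_sim(e_i, h_i) >= tau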
3.2. Data pre-processing and model training configuration
Images were resized to 224×224 and normalized. Captions were cleaned, tokenized, and padded to 51 tokens. FastText Hindi embeddings (300-dim) were projected to 512-dim using:

E' = W E + b    (6)

where E is the original 300-dimensional FastText embedding, W is a learnable weight matrix of shape (512 × 300), and b is a bias vector.
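The preprocessing and projection steps can be sketched as follows; the ImageNet normalization statistics are an assumption (the text states only "normalized"), and the tensor shapes illustrate the 51-token padding and the projection in (6).

    import torch
    import torch.nn as nn
    from torchvision import transforms

    # Image pipeline: resize to 224x224, convert to tensor, normalize
    # (ImageNet statistics assumed).
    preprocess = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])

    # E' = W E + b, as in (6): lift 300-dim FastText vectors to the 512-dim model space.
    project = nn.Linear(300, 512)      # W: (512 x 300), b: (512,)
    batch = torch.randn(8, 51, 300)    # (batch, 51 padded tokens, 300)
    projected = project(batch)         # (batch, 51, 512)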
Model training was performed using the AdamW optimizer with label smoothing (0.1), KLDivLoss, a warmup scheduler with 4000 steps, and gradient clipping (norm = 1.0). A teacher forcing strategy was adopted throughout training. The hyperparameters and associated model performance metrics are provided in Table 3, followed by a configuration sketch.
Table 3. Training hyperparameters and model performance metrics

Hyperparameter     Value      Performance metric  Value
Label smoothing    0.1        Model size          157.3 MB
Optimizer          AdamW      Trainable params    39.2 M
Learning rate      5 × 10^-4  Inference time      16.1 ms/image
Loss function      KLDivLoss  GPU usage           4450 MB
Warmup steps       4000       FLOPs               0.84 GFLOPs
Gradient clipping  1.0        Training strategy   Teacher forcing
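A configuration sketch consistent with Table 3 follows; the model stand-in and the warmup decay shape are assumptions (only the optimizer, loss, 4000 warmup steps, and clipping norm are stated).

    import torch
    import torch.nn as nn

    model = nn.Transformer(d_model=512)  # stand-in for the full captioning network
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)
    criterion = nn.KLDivLoss(reduction="batchmean")  # paired with label smoothing of 0.1

    def warmup(step: int, warmup_steps: int = 4000) -> float:
        # Linear warmup to the base rate, then inverse-sqrt decay; the decay shape
        # is an assumption (the paper states only 4000 warmup steps).
        step = max(step, 1)
        return min(step / warmup_steps, (warmup_steps / step) ** 0.5)

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup)

    # Applied each step after loss.backward(), before optimizer.step():
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)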
3.3. Performance and evaluation metrics
BLEU, CIDEr, METEOR, WER, and CER were used to evaluate captioning. GPT-refined captions were assessed using syntactic, lexical, semantic, and Jaccard similarities.
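For illustration, sentence-level BLEU together with WER and CER can be computed as below; the nltk and jiwer tooling is an assumption, as the paper does not name its implementation.

    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
    import jiwer

    reference = ["एक", "आदमी", "घोड़े", "पर", "बैठा", "है"]
    hypothesis = ["एक", "आदमी", "घोड़े", "पर", "सवार", "है"]

    # Sentence-level BLEU-4 (the paper reports corpus-level scores).
    smooth = SmoothingFunction().method1
    bleu4 = sentence_bleu([reference], hypothesis,
                          weights=(0.25, 0.25, 0.25, 0.25), smoothing_function=smooth)

    wer = jiwer.wer(" ".join(reference), " ".join(hypothesis))  # word error rate
    cer = jiwer.cer(" ".join(reference), " ".join(hypothesis))  # character error rate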
Evaluation metrics such as BLEU and CIDEr primarily rely on n-gram overlap and may be sensitive to surface-level variations in morphologically rich languages like Hindi, where multiple valid inflected forms and flexible word order are common. As a result, these metrics can underestimate semantic correctness despite accurate visual grounding. METEOR, along with WER and CER, provides complementary insight by accounting for linguistic variation, word alignment, and error patterns specific to Hindi. Training metrics and system performance trends are shown in Figure 4.
Figure 4. Hindi image captioning: training accuracy and loss overview
Final training and validation accuracy reached 76% and 78%, respectively (Table 4). The model is efficient, with a 16.1 ms inference time and 4450 MB of GPU memory usage. Evaluation metrics are presented in Figure 5, where Figure 5(a) shows image captioning evaluation metrics, Figure 5(b) shows WER and CER evaluation, Figure 5(c) shows scalability analysis, and Figure 5(d) shows the ablation study. Table 4 summarizes the scores.
Table 4. Combined summary of model training, performance, and evaluation metrics

Metric          Training  Validation  Model metric      Value          Evaluation metric  Score
Final loss      1.3       1.2         Model size        157.3 MB       BLEU-1 to 4        83.24%, 73.17%, 64.56%, 58.22%
Final accuracy  76%       78%         Trainable params  39.2 M         CIDEr              81.41%
–               –         –           Inference time    16.1 ms/image  METEOR             81.18%
–               –         –           GPU usage         4450 MB        F1-score           90.29%
–               –         –           FLOPs             0.84 GFLOPs    WER                14.82%
–               –         –           –                 –              CER                10.75%
Figure 5. Combined view of four image captioning analysis visuals: (a) image captioning evaluation metrics, (b) WER and CER evaluation for image captioning, (c) scalability analysis, and (d) ablation study
3.4. Ablation analysis
Three model variants were tested: the base transformer, with SE-attention, and with SE + FastText. Table 5 and Figure 5(d) demonstrate consistent improvement in all evaluation metrics with each architectural enhancement.
To assess the statistical significance of performance gains introduced by SE-attention and FastText embeddings, a two-tailed paired t-test was conducted across five independent experimental runs (N = 5). The test was applied to per-run evaluation scores computed on the same test set for all model configurations. Results indicate that the SE + FastText model achieves statistically significant improvements over the base configuration in BLEU-4, CIDEr, METEOR, and F1-score at the 95% confidence level (p < 0.05), suggesting that the observed gains are unlikely to be due to random variation.
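A sketch of the significance test, using scipy's paired t-test on illustrative per-run scores (the paper's actual run-level values are not reported):

    from scipy import stats

    # Illustrative per-run BLEU-4 scores for two configurations (N = 5 runs).
    base_runs = [40.9, 41.3, 41.0, 41.2, 41.0]
    se_fasttext_runs = [58.0, 58.4, 58.1, 58.5, 58.1]

    t_stat, p_value = stats.ttest_rel(se_fasttext_runs, base_runs)  # two-tailed paired t-test
    significant = p_value < 0.05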
Furthermore, as shown in Figure 5(c), the model scales linearly in training time and memory requirements as the dataset size increases from 10K to 118K samples. This confirms the model's suitability for large-scale deployment scenarios.
The ablation study evaluates components that directly influence caption generation, including SE-attention and FastText embeddings. GPT was excluded from the ablation analysis, as it functions solely as a post-processing module for narrative enhancement and does not affect core captioning
metrics. Its impact is assessed separately through narrative-level evaluation. All ablation experiments were conducted over multiple runs, and the reported results reflect consistent performance trends across configurations, indicating the robustness of the observed improvements.
Table 5. Ablation study results with statistical significance analysis (p < 0.05)

Metric   Base   SE     SE+FastText  t-value  p-value
BLEU-4   41.08  49.44  58.22        3.12     0.021
CIDEr    68.49  77.26  81.41        3.45     0.015
F1       84.00  88.47  90.29        2.87     0.028
METEOR   76.23  80.76  81.18        2.41     0.041
3.5. Scalability and generalization
Beyond scalability evaluation, the proposed model was also tested on an external Hindi image captioning dataset, Flickr8k-Hindi, to examine its generalization capability. Using the same model architecture and evaluation setup, the approach maintained stable performance across datasets, obtaining a BLEU-4 score of 54.10, a CIDEr score of 74.85, a METEOR score of 76.32, and an F1-score of 87.45. Linguistic accuracy remained strong, with a WER of 18.6% and a CER of 13.9%. Although the dataset differs in scale and annotation style, the model consistently preserved relative performance gains over baseline configurations, demonstrating effective robustness and cross-dataset generalization.
3.6. Benchmarking and qualitative evaluation
As shown in Table 6, the proposed model outperforms existing Hindi image captioning approaches across BLEU scores, establishing a strong performance baseline. To assess the impact of GPT-based narrative enhancement, a comparative evaluation was conducted between base captions and GPT-enhanced narratives. Automatic similarity metrics, including syntactic, lexical, semantic, and Jaccard similarity, were used to verify semantic consistency and prevent content drift.
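Of these, Jaccard similarity is the simplest to state; a token-level sketch follows (whitespace tokenization assumed):

    def jaccard_similarity(caption: str, narrative: str) -> float:
        """Token-level Jaccard similarity between a caption and its GPT narrative."""
        a, b = set(caption.split()), set(narrative.split())
        return len(a & b) / len(a | b) if (a | b) else 0.0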
Table 6. Comparison of Hindi image captioning models

Authors         Model              B1     B2     B3     B4
Mishra et al.   Encoder-decoder    62.9   43.3   29.1   19.0
Singh et al.    CNN + RNN          51.3   30.4   16.7   12.4
Dhir            CNN + GRU          57.0   39.0   26.4   17.3
Rathi           CNN + LSTM         58.0   47.0   39.0   35.0
Meghwal         CNN + LSTM         62.5   45.8   32.8   23.2
Proposed model  CNN + transformer  83.24  73.17  64.56  58.22
In addition, a small-scale human evaluation was performed on a randomly selected subset of samples. Native Hindi speakers rated both versions using a three-point Likert scale (low, medium, high) based on narrative coherence and expressive richness. As summarized in Table 7, GPT-enhanced narratives consistently improved coherence and expressive depth while preserving the original semantic content. Sample qualitative results with captions and narrations are presented in Figure 6.
To contextualize the quantitative evaluation metrics, a qualitative error analysis was conducted on a randomly selected subset of 200 generated captions. Each caption was manually examined and assigned a dominant linguistic error category. The analysis revealed that most inaccuracies were minor and linguistically driven rather than semantic. Common error types included gender and number agreement mismatches, variations in word order, and postposition usage errors. These errors generally preserved the intended meaning but negatively affected n-gram–based metrics, highlighting the importance of complementing quantitative scores with qualitative analysis for morphologically rich languages such as Hindi. The results are summarized in Table 8.
Table 7. Human evaluation of caption vs. GPT-enhanced narrative

Criterion             Base caption  GPT-enhanced
Narrative coherence   Medium        High
Expressive richness   Low           Improved
Semantic consistency  High          High
Figure 6. Sample images with generated captions, evaluation scores, and GPT-enhanced narrations
Table 8. Distribution of common error types in Hindi caption generation

Error type                      Percentage (%)
Gender/number agreement errors  36.0
Word order variations           28.5
Postposition usage errors       20.0
Lexical inflection errors       15.5
Scores indicate semantic correctness and fluency. GPT-based narrations further enhance richness and expressiveness, and the descriptions maintain semantic alignment with high syntactic and lexical similarity. The model effectively balances accuracy and creativity in Hindi caption generation.
4. CONCLUSION
This work presents an advanced Hindi image captioning framework that integrates a custom SE-attention mechanism with an EfficientNet-based transformer encoder–decoder architecture. Experimental results show substantial improvements in BLEU, CIDEr, and METEOR scores, along with reduced WER and CER, compared to existing methods. The use of FastText embeddings enables effective modeling of Hindi's morphological and syntactic characteristics, making the approach suitable for non-English and low-resource language captioning. The model maintains robust performance across datasets of varying scale. Although the model exhibits good cross-dataset generalization, domain-specific variations in visual content and linguistic style may affect performance in new application settings. Future work will focus on domain adaptation and transfer learning strategies, such as fine-tuning on target-domain data and multilingual pretraining, as well as exploring advanced attention mechanisms, multi-modal extensions, and reinforcement learning to further enhance caption quality.
FUNDING INFORMATION
This work was carried out independently and did not receive financial assistance from any governmental, corporate, or academic grant-awarding bodies.
AUTHOR CONTRIBUTIONS STATEMENT
This journal uses the Contributor Roles Taxonomy (CRediT) to recognize individual author contributions, reduce authorship disputes, and facilitate collaboration.

Name of Author   C  M  So  Va  Fo  I  R  D  O  E  Vi  Su  P  Fu
Anjali Sharma    ✓  ✓  ✓   ✓   ✓   ✓  ✓  ✓  ✓  ✓
Mayank Aggarwal  ✓  ✓  ✓   ✓   ✓   ✓  ✓
Jitin Khanna     ✓  ✓  ✓   ✓   ✓   ✓  ✓

C: Conceptualization, M: Methodology, So: Software, Va: Validation, Fo: Formal Analysis, I: Investigation, R: Resources, D: Data Curation, O: Writing - Original Draft, E: Writing - Review & Editing, Vi: Visualization, Su: Supervision, P: Project Administration, Fu: Funding Acquisition
CONFLICT OF INTEREST STATEMENT
The authors declare that this research was conducted without any competing financial or personal interests.
ETHICAL CONSIDERATIONS
This study uses publicly available datasets and addresses ethical considerations related to automated image captioning and storytelling in Hindi. When applied to sensitive domains such as news or education, ensuring accuracy, cultural sensitivity, and human oversight is essential to prevent misinterpretation or misleading content.
INFORMED CONSENT
No informed consent was required, as the study did not involve human subjects or personal data.
DATA AVAILABILITY
The dataset employed in this research is publicly accessible and can be retrieved from the official COCO dataset website: https://cocodataset.org/#download.
REFERENCES
[1] K. Rage, "A study on different deep learning architectures on image captioning," in 2022 8th International Conference on Smart Structures and Systems (ICSSS), Apr. 2022, pp. 1–9, doi: 10.1109/ICSSS54381.2022.9782260.
[2] R. Castro, I. Pineda, W. Lim, and M. E. M.-Cayamcela, "Deep learning approaches based on transformer architectures for image captioning tasks," IEEE Access, vol. 10, pp. 33679–33694, 2022, doi: 10.1109/ACCESS.2022.3161428.
[3] J. Zhang, D. Guo, X. Yang, P. Song, and M. Wang, "Visual-linguistic-stylistic triple reward for cross-lingual image captioning," ACM Transactions on Multimedia Computing, Communications, and Applications, vol. 20, no. 4, pp. 1–23, Apr. 2024, doi: 10.1145/3634917.
[4] B. R. Reddy, S. Gunti, R. P. Kumar, and S. Sridevi, "Multilingual image captioning: multimodal framework for bridging visual and linguistic realms in Tamil and Telugu through transformers," Research Square, doi: 10.21203/rs.3.rs-3380598/v1.
[5] V. Jayaswal, R. Rani, and J. Kaur, "A deep learning-based efficient image captioning approach for Hindi language," in Developments Towards Next Generation Intelligent Systems for Sustainable Development. New York, United States: IGI Global, 2024, pp. 225–246, doi: 10.4018/979-8-3693-5643-2.ch009.
[6] H. Ahmadabadi, O. N. Manzari, and A. Ayatollahi, "Distilling knowledge from CNN-transformer models for enhanced human action recognition," in 2023 13th International Conference on Computer and Knowledge Engineering (ICCKE), Nov. 2023, pp. 180–184, doi: 10.1109/ICCKE60553.2023.10326272.