TELKOMNIKA Telecommunication, Computing, Electronics and Control
Vol. 23, No. 2, April 2025, pp. 349∼370
ISSN: 1693-6930, DOI: 10.12928/TELKOMNIKA.v23i2.26455
Improving visual perception through technology: a comparative analysis of real-time visual aid systems
Othmane Sebban¹, Ahmed Azough², Mohamed Lamrini¹
¹Laboratory of Applied Physics, Informatics and Statistics (LPAIS), Faculty of Sciences Dhar El Mahraz, Sidi Mohamed Ben Abdellah University, Fez, Morocco
²De Vinci Higher Education, De Vinci Research Center, Paris, France
Article Info
Article history: Received Jul 7, 2024; Revised Jan 16, 2025; Accepted Jan 23, 2025
Keywords: Accessibility; Assistive technology; Benchmarking; Deep learning; Point of interest detection; Visually impaired
ABSTRACT
Visually impaired individuals continue to face barriers in accessing reading and listening resources. To address these challenges, we present a comparative analysis of cutting-edge technological solutions designed to assist people with visual impairments by providing relevant feedback and effective support. Our study examines various models leveraging InceptionV3 and V4 architectures, long short-term memory (LSTM) and gated recurrent unit (GRU) decoders, and datasets such as Microsoft Common Objects in Context (MSCOCO) 2017. Additionally, we explore the integration of optical character recognition (OCR), translation tools, and image detection techniques, including scale-invariant feature transform (SIFT), speeded-up robust features (SURF), oriented FAST and rotated BRIEF (ORB), and binary robust invariant scalable keypoints (BRISK). Through this analysis, we highlight the advancements and potential of assistive technologies. To assess these solutions, we have implemented a rigorous benchmarking framework evaluating accuracy, usability, response time, robustness, and generalizability. Furthermore, we investigate mobile integration strategies for real-time practical applications. As part of this effort, we have developed a mobile application incorporating features such as automatic captioning, OCR-based text recognition, translation, and text-to-audio conversion, enhancing the daily experiences of visually impaired users. Our research focuses on system efficiency, user accessibility, and potential improvements, paving the way for future innovations in assistive technology.
This is an open access article under the CC BY-SA license.
Corresponding Author:
Othmane Sebban
Laboratory of Applied Physics, Informatics and Statistics (LPAIS), Faculty of Sciences Dhar El Mahraz
Sidi Mohamed Ben Abdellah University
Fez 30003, Morocco
Email: othmane.sebban@usmba.ac.ma
1. INTRODUCTION
Visually impaired people [1] face many difficulties in their daily activities, including navigating unfamiliar environments, reading text, and interpreting visual information [2]. Although assistive technologies such as Microsoft Seeing AI and Google Lookout have been developed to analyze images in real time and provide descriptions, these solutions still have significant limitations, including high costs, restricted functionality, and dependence on a stable internet connection. These constraints considerably reduce their effectiveness, particularly in critical situations such as crossing streets or reading important documents in real time.
Despite technological advancements, assistive technologies for the visually impaired struggle to deliver real-time performance because of computational demands and limited adaptability to real-world environments. Their dependence on the internet reduces their offline effectiveness, compromising the immediate and reliable assistance users need for essential tasks such as crossing streets or reading important documents.
Journal homepage: http://journal.uad.ac.id/index.php/TELKOMNIKA
To address these limitations, our study proposes a comprehensive benchmarking system designed to evaluate and optimize the performance of assistive technologies for visually impaired users. This system measures the effectiveness of various components, including image captioning, optical character recognition (OCR), real-time translation, and key image element detection [3]. Our goal is to encourage the development of mobile applications [4] that combine both accuracy and speed, ensuring optimal performance in real-time use cases. The mobile application we developed, "SeeAround," integrates these functionalities to provide reliable visual assistance [5]. The general diagram of our system, presented in Figure 1, outlines the various modules and their interactions, illustrating how they work together to offer real-time assistance.
Figure 1. General diagram of the real-time visual assistance system for the visually impaired
For image captioning, we use an encoder-decoder architecture combining convolutional neural networks (CNN) and recurrent neural networks (RNN). Specifically, we employ InceptionV3 [6] and InceptionV4 [7] models, adapted to process images efficiently in real-world contexts for visually impaired people. Additionally, we use the Microsoft Common Objects in Context (MSCOCO) 2017 dataset to train the models with enhanced parameters, optimizing them for mobile environments. Our system also integrates long short-term memory (LSTM) and gated recurrent unit (GRU) decoders to capture temporal sequences more effectively, improving the generation of image captions by modeling long-term relationships between visual and textual elements [8].
The OCR component processes images containing mainly text, even in visually complex environments. Using advanced algorithms, it accurately detects and extracts text, providing contextual information essential for visually impaired users. Furthermore, the real-time translation functionality of our mobile application removes language barriers by supporting a wide range of languages [9]. Recognized text is translated into the chosen language and then converted into speech via our text-to-speech module [10], making it easier to understand image descriptions and textual content extracted by OCR.
An essential aspect of providing relevant visual information is precisely extracting key image elements during camera analysis [11]. Each image has different characteristics such as saturation, brightness, contrast, and camera angle, meaning that uniform processing approaches can prove ineffective. To address this, we have incorporated advanced image detection algorithms, including scale-invariant feature transform (SIFT) [1], speeded-up robust features (SURF) [1], oriented features from accelerated segment test (FAST) and rotated BRIEF (ORB) [12], and binary robust invariant scalable keypoints (BRISK) [12]. These algorithms are renowned for their robustness and accuracy under difficult conditions. Our system detects the limitations of current methods and proposes improvements to ensure optimum performance, particularly in the real-life situations of visually impaired people.
This paper is structured into five main sections, each addressing a key aspect of the study. Section 2 provides a detailed overview of previous research and current technologies designed for visually impaired users. Section 3 focuses on our benchmarking system and the key components used in our application. Section 4 presents the experimental results and their analysis, demonstrating the impact of our technical choices. Finally, section 5 concludes by summarizing our findings, discussing the implications of this work, and suggesting avenues for future research and development in assistive technology.
2. RELATED WORK
Visual impairment, a common disability, presents different levels of severity. Assistive technologies are crucial in providing visual alternatives via various products, devices, software, and systems [4], [6]. Sandhya et al. [13] present an application that helps visually impaired people understand their environment using neural networks and natural language processing. The application generates textual descriptions of images captured by a camera and integrates an OCR module to read the text on signs and documents. The descriptions are then converted into audio, providing information in several languages such as Telugu, Hindi, and English. This application requires a system with an integrated GPU.
Ganesan et al. [14] propose an innovative approach to facilitating access to printed content for the visually impaired, using CNNs and LSTMs to encode and decode information. The system integrates OCR to convert printed text into digital format and then into speech via a text-to-speech application programming interface (API), making the content accessible via voice reading. Bagrecha et al. [15] present "VirtualEye," an innovative application in assistive technologies for the visually impaired, offering functions such as object and distance detection, recognition of Indian banknotes, and OCR. The system provides voice instructions in English and Hindi, enhancing users' independence and improving their quality of life.
Uslu et al. [16] aim to generate grammatically correct and semantically relevant captions for visual content via a personalized mobile app, improving accessibility, particularly for the visually impaired. Integrated with the "CaptionEye" Android app, the system enables captions to be generated offline and controlled by voice, offering a user-friendly interface. Çaylı et al. [17] present a captioning system designed to provide natural language descriptions of visual scenes, improving accessibility and reducing social isolation for visually impaired people. This research demonstrates the practical application of computer vision and natural language processing to create assistive tools. Despite significant advances, it is crucial to continue developing reliable mobile systems adapted to everyday life to improve users' autonomy and quality of life.
3. METHOD
In this section, we present our solution based on a benchmarking analysis. Subsection 3.1 presents the initial module, detailing the process of benchmarking the modules illustrated in Figure 1 of our system. Subsection 3.2 explores the generation of image captions using encoder-decoder architectures optimized with TensorFlow Lite, integrating multi-GRU and LSTM models for accurate descriptions. Finally, subsection 3.3 compares the performance of four separate systems using various models for image captioning, OCR, translation, and key point extraction.
3.1. Description of the benchmarking process for the evaluation of visual assistance systems
Benchmarking measures a company's performance against market leaders [18], [19] to identify gaps and drive continuous improvement. In this study, realistic tasks adapted to the needs of visually impaired people, such as image caption generation, text recognition, and key point extraction, were designed. By comparing our solutions with industry standards, the aim is to close performance gaps and improve assistive technologies.
3.1.1. Benchmarking criteria and methodology
To improve accessibility and comprehension of multimedia content for visually impaired users, we have integrated automatic captioning, OCR, text translation, and image recognition modules. These components use advanced machine-learning algorithms to optimize processing accuracy and speed, ensuring fluid, instantaneous interaction. Every component has been designed for a smooth, optimized experience.
3.1.2. Performance evaluation of visual assistance systems
We evaluated each visual assistance system according to four key criteria: accuracy, response time, robustness, and generalizability. Accuracy was measured by comparing the results obtained with the expected outputs for tasks such as image description, OCR, and translation. Response time, expressed in milliseconds, was used to assess system efficiency. Robustness was analyzed under difficult conditions, including low lighting, high noise, and complex backgrounds, to ensure reliability. Finally, generalizability was examined using unpublished images, videos, and documents to judge each system's suitability for new contexts.
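The accuracy and response-time criteria above can be sketched as a small measurement harness. This is a minimal illustration, not the benchmarking code used in the study: the position-wise word accuracy metric, the function names `word_accuracy` and `benchmark`, and the `toy_ocr` stand-in are all assumptions introduced here.

```python
import time
from statistics import mean


def word_accuracy(predicted: str, expected: str) -> float:
    """Share of positions where predicted and expected words agree (case-insensitive)."""
    pred, exp = predicted.lower().split(), expected.lower().split()
    if not exp:
        return 1.0 if not pred else 0.0
    hits = sum(1 for p, e in zip(pred, exp) if p == e)
    return hits / max(len(pred), len(exp))


def benchmark(system, samples):
    """Run `system` on (input, expected) pairs; report mean accuracy and mean latency in ms."""
    accuracies, latencies_ms = [], []
    for item, expected in samples:
        start = time.perf_counter()
        output = system(item)
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
        accuracies.append(word_accuracy(output, expected))
    return {"accuracy": mean(accuracies), "mean_latency_ms": mean(latencies_ms)}


# Hypothetical stand-in for an OCR module: it echoes its input in upper case.
def toy_ocr(image_text: str) -> str:
    return image_text.upper()


report = benchmark(toy_ocr, [("exit door", "exit door"), ("bus stop", "bus station")])
print(report)
```

Robustness and generalizability could be probed with the same harness by swapping in degraded inputs (low light, noise) or held-out samples, which keeps the four criteria comparable across systems.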
3.1.3. Comparative analysis of benchmark results
We analyzed the results to determine the best-performing systems for each task and usage scenario, emphasizing their strengths and weaknesses. This rigorous comparative analysis identified the most effective real-time visual assistance solutions, providing valuable insights into their capabilities. These findings will help guide the future development of more advanced, efficient, and user-friendly technologies tailored to the needs of visually impaired individuals.
3.2. Optimization of automatic image caption generation
This subsection presents a system for automatically generating image captions, based on an encoder-decoder model. The CNNs InceptionV3 and InceptionV4 are used as encoders to extract visual features. The multilayer decoder, composed of GRU and LSTM, generates the semantic captions, as illustrated in Figure 2. The work mentioned in [6], [7] combines CNN and recurrent networks, but the excessive increase in the number of time steps, due to the length of the captions, led to inferior performance. By reducing this number, we optimized the use of GRU and LSTM, leading to better results.
Figure 2. Model architecture for multi-RNN-based automated image captioning
3.2.1. InceptionV3-gated recurrent unit-based multi-layer image caption generator model
The image captioning system consists of two main elements: the encoder and the decoder, each based on a distinct neural architecture. The encoder, based on InceptionV3 [6], extracts key information from the image. This is then passed on to the decoder, which uses a GRU to generate the caption word by word. The proposed general model is illustrated in Figure 3.
Figure 3. Flowchart of the InceptionV3 multi-layer GRU image caption generator
Encoder-InceptionV3 architecture: the encoder, based on a CNN, is composed of convolution, pooling, and fully connected layers. Sparse interactions identify key visual elements, while parameter sharing enables the same kernel to be applied across the image for optimized learning. Before training, a 2048-dimensional vector is generated from the images via the average pooling layer of the InceptionV3 model [6]. During training, a dense layer [20] refines this vector, progressively compressing it into a more compact and discriminative representation for image captioning.
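The pooling-then-dense step described above can be sketched with NumPy alone. This is an illustration under stated assumptions, not the authors' encoder: the random array stands in for the last convolutional feature map of InceptionV3 (8 x 8 x 2048 for a 299 x 299 input), and the dense layer is assumed to project to the 256-dimensional embedding size listed in Table 1, with a ReLU activation chosen here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)


def global_average_pool(feature_map: np.ndarray) -> np.ndarray:
    """Average an (H, W, C) feature map over its spatial axes, giving a C-vector."""
    return feature_map.mean(axis=(0, 1))


def dense_relu(x: np.ndarray, weights: np.ndarray, bias: np.ndarray) -> np.ndarray:
    """Fully connected layer followed by ReLU (activation is an assumption)."""
    return np.maximum(0.0, x @ weights + bias)


# Stand-in for the final convolutional output of the CNN encoder.
feature_map = rng.standard_normal((8, 8, 2048))
pooled = global_average_pool(feature_map)          # 2048-dimensional image vector

# Hypothetical trainable projection to the embedding size.
W_dense = 0.01 * rng.standard_normal((2048, 256))
b_dense = np.zeros(256)
embedding = dense_relu(pooled, W_dense, b_dense)   # compact 256-dimensional representation

print(pooled.shape, embedding.shape)
```

Because the 2048-dimensional vector does not depend on the trainable layers, it can be computed once per image before training, which is consistent with the description above.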
Decoder-GRU implementation: the decoder exploits the extracted features to generate descriptive sentences, relying on RNNs to store part of the input data. However, RNNs face limitations due to vanishing and exploding gradient problems, compromising their ability to handle long-term dependencies. To overcome these difficulties, we use GRUs, an enhanced version of RNNs with gating mechanisms [20] that facilitate dependency management. Figure 4 shows a typical GRU architecture with its update gate, reset gate, and hidden state. The decoder comprises an embedding layer, a multilayer GRU, and a linear layer. The embedding layer converts words into vectors suitable for language modeling, while the GRU adjusts the hidden state using its gate mechanisms.
Figure 4. Architecture of GRU
In (1)-(4) [21], $x_t$ represents the input, and $h_t$ is the hidden state at time $t$. The weights associated with the reset, update, and new information creation gates are denoted as $W_r$, $W_z$, and $W_u$, respectively; in the equations each appears as a pair, an input weight $W_{x\cdot}$ and a recurrent weight $U_{h\cdot}$. The hyperbolic tangent and sigmoid activation functions are symbolized by $\tanh$ and $\sigma$, respectively.
$$r_t = \sigma\left(W_{xr} x_t + U_{hr} h_{t-1}\right) \tag{1}$$

$$z_t = \sigma\left(W_{xz} x_t + U_{hz} h_{t-1}\right) \tag{2}$$

$$u_t = \tanh\left(W_{xu} x_t + U_{hu} \left(r_t \odot h_{t-1}\right)\right) \tag{3}$$

$$h_t = (1 - z_t)\, h_{t-1} + z_t\, u_t \tag{4}$$
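Equations (1)-(4) translate almost line for line into code. The sketch below is a single GRU step in NumPy under illustrative assumptions: the dimensions, the random initialization, and the helper names `init_params` and `gru_step` are not from the paper, and bias terms are omitted because (1)-(4) do not include them.

```python
import numpy as np


def sigmoid(a: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-a))


def init_params(input_dim: int, hidden_dim: int, rng) -> dict:
    """Small random weights; W_x* act on the input, U_h* on the previous hidden state."""
    cols = {"W_xr": input_dim, "W_xz": input_dim, "W_xu": input_dim,
            "U_hr": hidden_dim, "U_hz": hidden_dim, "U_hu": hidden_dim}
    return {name: 0.1 * rng.standard_normal((hidden_dim, c)) for name, c in cols.items()}


def gru_step(x_t: np.ndarray, h_prev: np.ndarray, p: dict) -> np.ndarray:
    r_t = sigmoid(p["W_xr"] @ x_t + p["U_hr"] @ h_prev)            # (1) reset gate
    z_t = sigmoid(p["W_xz"] @ x_t + p["U_hz"] @ h_prev)            # (2) update gate
    u_t = np.tanh(p["W_xu"] @ x_t + p["U_hu"] @ (r_t * h_prev))    # (3) candidate state
    return (1.0 - z_t) * h_prev + z_t * u_t                        # (4) new hidden state


rng = np.random.default_rng(0)
params = init_params(input_dim=4, hidden_dim=3, rng=rng)
h = np.zeros(3)
for x in rng.standard_normal((5, 4)):   # five time steps of toy input
    h = gru_step(x, h, params)
print(h)
```

Since (4) is a gate-weighted average of the previous state and a tanh output, every component of the hidden state stays strictly inside (-1, 1) when it starts at zero.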
3.2.2. InceptionV4-long short-term memory-based multi-layer image caption generator model
The model follows an encoder-decoder approach, where InceptionV4 [7] acts as an encoder to extract visual features from images. These features are then transmitted to a recurrent neural network equipped with LSTM cells that acts as the decoder. The latter uses this information to generate sequences of words, thus producing descriptive captions for the images. The general scheme of the model is illustrated in Figure 5.
Encoder-InceptionV4 architecture: we use InceptionV4, a CNN pre-trained by Google, as the encoder in our framework. This model extracts high-level visual features through deep convolutional layers. The InceptionV4-based encoder [7], [22], [23] converts raw images into fixed-length vectors by capturing relevant information from the intermediate pooling layer, just before the final output. This process provides a concise and relevant image representation for subsequent processing.
Decoder-LSTM implementation: the decoder is a deep recurrent neural network with LSTM cells, as shown in Figure 6. In our model, the decoder operates in two phases: learning and inference. During learning, the RNN decoder with LSTM cells aims to maximize the probability of each word in a caption based on the convolved features of the image and previously generated words [7].
Figure 5. Flowchart of the InceptionV4 multi-layer LSTM image caption generator
Figure 6. Architecture of LSTM
To learn a sentence of length $N$, the decoder loops back on itself for $N$ time steps, storing previous information in its cell memory. The memory $C_t$ is modified at each time step by the LSTM gates: the forget gate $f_t$, the input gate $i_t$, and the output gate $o_t$. The LSTM decoder learns the word sequences from the convolved features and the original caption. At step $t = 0$, the hidden state $h_t$ of the decoder is initialized using the image features $F$. The main idea of the encoder-decoder model is illustrated by (5)-(11):
$$f_t = \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right) \tag{5}$$

$$i_t = \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right) \tag{6}$$

$$\tilde{C}_t = \sigma\left(W_C \cdot [h_{t-1}, x_t] + b_C\right) \tag{7}$$

$$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t \tag{8}$$

$$o_t = \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right) \tag{9}$$

$$h_t = o_t * \tanh(C_t) \tag{10}$$

$$O_t = \arg\max\left(\mathrm{softmax}(h_t)\right) \tag{11}$$
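A single LSTM step following (5)-(11) can be sketched in NumPy as below. Several points are assumptions made for this illustration: the dimensions and initialization are arbitrary, the helper names are hypothetical, and the candidate state uses tanh, as in the standard LSTM formulation, whereas (7) as printed uses $\sigma$ (swapping `np.tanh` for `sigmoid` on that line reproduces the printed form). Equation (11) is applied directly to $h_t$ as written; in practice a linear projection to the vocabulary size usually precedes the softmax.

```python
import numpy as np


def sigmoid(a: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-a))


def softmax(a: np.ndarray) -> np.ndarray:
    e = np.exp(a - a.max())
    return e / e.sum()


def init_params(input_dim: int, hidden_dim: int, rng) -> dict:
    """One weight matrix and bias per gate, acting on the concatenation [h_{t-1}, x_t]."""
    p = {}
    for gate in ("f", "i", "C", "o"):
        p[f"W_{gate}"] = 0.1 * rng.standard_normal((hidden_dim, hidden_dim + input_dim))
        p[f"b_{gate}"] = np.zeros(hidden_dim)
    return p


def lstm_step(x_t, h_prev, c_prev, p):
    concat = np.concatenate([h_prev, x_t])              # [h_{t-1}, x_t]
    f_t = sigmoid(p["W_f"] @ concat + p["b_f"])         # (5) forget gate
    i_t = sigmoid(p["W_i"] @ concat + p["b_i"])         # (6) input gate
    c_tilde = np.tanh(p["W_C"] @ concat + p["b_C"])     # (7) candidate (tanh: standard form)
    c_t = f_t * c_prev + i_t * c_tilde                  # (8) cell memory update
    o_t = sigmoid(p["W_o"] @ concat + p["b_o"])         # (9) output gate
    h_t = o_t * np.tanh(c_t)                            # (10) hidden state
    return h_t, c_t


rng = np.random.default_rng(0)
params = init_params(input_dim=4, hidden_dim=3, rng=rng)
h, c = np.zeros(3), np.zeros(3)
for x in rng.standard_normal((5, 4)):
    h, c = lstm_step(x, h, c, params)
word_index = int(np.argmax(softmax(h)))                 # (11) most probable index
print(h, c, word_index)
```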
3.2.3. Training process and techniques for model optimization
The transformation of input $x_t$ at time $t$ into output word $O_t$ is guided by equations using learnable weight and bias vectors $(W_f, b_f)$, $(W_i, b_i)$, $(W_o, b_o)$, activated by sigmoid $\sigma$ and hyperbolic tangent $\tanh$ functions [6], [7]. Each word $X_t$ is converted into fixed-length vectors using a word representation $W_e$ of dimension $V \times W$, where $V$ is the number of words in the vocabulary and $W$ is the length of the embedding learned during training. The decoder's objective is to maximize the probability $p$ of a word's appearance at time $t$ given the cell and hidden states, features $F$, and previous words $X_{t:0 \to t}$. This is achieved by minimizing the loss function $L$, which is the cross-entropy of the sampled word probabilities [6], [7].
$$Z = \arg\max_{\beta} \left( \sum_{t=0}^{N} \log\left( p\left(O_t \mid X_{t:0 \to t-1}, \phi_t; \beta\right) \right) \right) \tag{12}$$

$$L = H(u, v) = \min \left( \sum_{t=0}^{N} -u(X_t) \log\left(v(O_t)\right) \right) \tag{13}$$
where $H(u, v)$ is the cross entropy, and $u$ and $v$ represent the softmax probability distributions of the ground truth word $X_t$ and the generated word $O_t$ at time $t$.
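As a worked illustration of the cross-entropy in (13): when the ground-truth distribution $u$ is one-hot, the sum over the vocabulary collapses to the negative log-probability that $v$ assigns to the correct word at each step. The sketch below computes that quantity for a toy caption; the function names and the toy logits are assumptions made for this example, and the minimization over parameters is left to the training loop.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def caption_cross_entropy(step_logits, target_ids):
    """Sum over time steps of -log v(O_t) evaluated at the ground-truth word index."""
    loss = 0.0
    for logits, target in zip(step_logits, target_ids):
        probs = softmax(logits)
        loss += -math.log(probs[target])
    return loss


# Toy example: 3 time steps, vocabulary of 4 words, uninformative (uniform) predictions.
uniform_logits = [[0.0, 0.0, 0.0, 0.0]] * 3
loss = caption_cross_entropy(uniform_logits, target_ids=[2, 0, 3])
print(loss)  # each step contributes log(4) when the prediction is uniform
```

A model that puts more probability on the correct word at each step drives this sum toward zero, which is the sense in which minimizing $L$ maximizes the likelihood in (12).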
During inference, the input image is passed through the encoder to obtain the convolved features, which are then sent to the decoder. At time $t = 0$, the decoder samples the start token $O_{t=0} = \langle S \rangle$ from the input features $F$. For subsequent instants, the decoder samples a new word based on the input features and previously sampled words $O_{t:0 \to t}$ until it encounters an end token $\langle /S \rangle$ at instant $t = N$ [6], [7].
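The inference loop described above can be sketched as greedy decoding: start from the start token, repeatedly pick the most probable next word, and stop at the end token or after a fixed number of steps. This is an illustrative sketch; `greedy_decode`, the `next_word_scores` callback, and the scripted `toy_scorer` are hypothetical stand-ins for the trained decoder, and greedy selection is only one possible sampling strategy.

```python
START, END = "<S>", "</S>"


def greedy_decode(features, next_word_scores, max_steps=18):
    """Generate words one at a time until the end token or the step limit is reached."""
    words = [START]
    for _ in range(max_steps):
        scores = next_word_scores(features, words)   # dict: candidate word -> score
        best = max(scores, key=scores.get)
        if best == END:
            break
        words.append(best)
    return words[1:]                                  # drop the start token


# Hypothetical scorer that replays a fixed caption, standing in for the decoder network.
def toy_scorer(features, words):
    script = ["a", "dog", "runs", END]
    target = script[min(len(words) - 1, len(script) - 1)]
    return {w: (1.0 if w == target else 0.0) for w in script}


caption = greedy_decode(features=None, next_word_scores=toy_scorer)
print(" ".join(caption))
```

The `max_steps` limit plays the role of the number of time steps: lowering it bounds the work done per image, which matters for real-time use on a phone.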
Figure 7 illustrates the backup architecture of the TensorFlow model for the LSTM and GRU encoder-decoders. Figure 7(a) shows our InceptionV3-GRU model, which uses a CNN to extract visual features and GRU units to generate captions. Figure 7(b) shows the architecture of the InceptionV4-LSTM model, where InceptionV4 extracts visual features and an LSTM generates captions.
Figure 7. The architecture for training phases of: (a) the InceptionV3-GRU model and (b) the InceptionV4-LSTM model
These diagrams show the architectures of models using InceptionV3 [6] and InceptionV4 [7] as encoders, with decoders based on GRU or LSTM units. The input image is resized and processed by convolutional layers, then features are extracted via global pooling to initialize the hidden state of the decoders. Each word in the caption is then converted to a vector and processed to generate the probability of the next word.
Table 1 summarizes the main parameters used to train the different image caption generation models, such as batch size, number of epochs, and time steps, which influence performance and quality. Our new image caption generation model optimizes previous versions [6], [7]. Reducing time steps from 22 to 18 improves performance by reducing computational complexity. Increasing batch size to 148 stabilizes training, while limiting captions to 16 words enhances efficiency.
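One way to read the relationship between the 16-word caption limit and the 18 time steps is that each caption is truncated to 16 words and then framed by a start and an end token. That reading is an assumption made for this sketch and is not stated in the text; the token strings, the padding token, and the function name `prepare_caption` are likewise hypothetical.

```python
START, END, PAD = "<S>", "</S>", "<PAD>"


def prepare_caption(caption: str, max_words: int = 16, num_timesteps: int = 18):
    """Lower-case, truncate to max_words, add start/end tokens, pad to num_timesteps."""
    words = caption.lower().split()[:max_words]
    tokens = [START] + words + [END]
    tokens += [PAD] * (num_timesteps - len(tokens))   # no-op when already full length
    return tokens


short = prepare_caption("A man riding a wave on top of a surfboard")
long_caption = " ".join(f"w{i}" for i in range(20))   # 20 words, will be truncated
full = prepare_caption(long_caption)
print(short)
print(full)
```

Fixing the sequence length this way means every training batch has the same shape, and shortening it from 22 to 18 steps removes four recurrent updates per caption.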
Table 1. Pre-trained model settings for image caption generation
| Model | Embedding size | Caption preprocessing | Error rate | Batch size | Num timesteps | Epochs |
|---|---|---|---|---|---|---|
| InceptionV4-LSTM [7] | 256 | 20 words | 2 × 10^-3 | 100 | 22 | 120 |
| InceptionV4-LSTM (our model) | 256 | 16 words | 2 × 10^-3 | 148 | 18 | 120 |
| InceptionV3-GRU [6] | 256 | 20 words | 2 × 10^-3 | 128 | 22 | 120 |
| InceptionV3-GRU (our model) | 256 | 16 words | 2 × 10^-3 | 148 | 18 | 120 |
3.2.4. Common dataset utilization for enhanced performance
High-quality data is crucial for an effective model. Using diverse datasets helps avoid overfitting and improve performance. We used MSCOCO 2017 [20], which contains annotated images with five human captions. Table 2 compares MSCOCO 2017, Flickr30k [24], and MSCOCO 2014 [25], highlighting the distribution of training, validation, and test sets, with MSCOCO 2017 offering the largest number of examples for image captioning.
Table 2. Characteristics of datasets used to train image caption generation models
| Dataset | Training split (k) | Validation split (k) | Testing split (k) | Total images (k) |
|---|---|---|---|---|
| Flickr30k (imeca [6]) | 28 | 1 | 1 | 8 |
| MSCOCO 2014 (cam2caption [7]) | 83 | 41 | 41 | 144 |
| MSCOCO 2017 (our model) | 118 | 41 | 5 | 164 |
3.2.5. Integration of our pre-trained model in the mobile application
We are optimizing our encoding-decoding model for real-time use in the "SeeAround" mobile application via TensorFlow, exploiting its dataflow graph architecture [7] and processor parallelism to improve efficiency. Graph-based image preprocessing accelerates speed by a factor of six. During training, checkpoints and metadata files are generated regularly. Checkpoints store learned weights, while graph definitions link them, enabling the model to be reconstructed and reused for inference and training. Figure 8 shows the backup architecture for the LSTM and GRU models, with Figure 8(a) illustrating the LSTM model and Figure 8(b) the GRU model.
Figure 8. TensorFlow model backup architecture for: (a) the LSTM encoder-decoder and (b) the GRU encoder-decoder
We have combined the pre-processing, encoding, and decoding files into three ProtoBuf files to create an end-to-end model suited to static and real-time requirements. A final ProtoBuf file serves as a black box for caption generation. Our model uses 18 words instead of 22 [6], [7], improving reliability and speeding up real-time caption generation for image streams from the camera. Figure 9 shows the captions generated by the LSTM and GRU decoding models. Figure 9(a) illustrates the captions generated by the LSTM model, while Figure 9(b) shows those generated by the GRU model.
Figure 9. Captions generated by the model: (a) LSTM decoder outputs and (b) GRU decoder outputs
3.3. Detailed descriptions of our visual assistance systems
In this subsection, we present the four systems designed for image caption generation, OCR, machine translation, and key point extraction. Each system is built on specialized models carefully selected for their efficiency and relevance. As illustrated in Table 3, these choices are guided by performance metrics and task-specific adaptability, ensuring optimal accuracy and reliability.
Table 3. Systems description for benchmarking evaluation
| System | Image captioning | OCR | Translation | Keyframe extraction |
|---|---|---|---|---|
| System 1 | InceptionV4-LSTM with 22 words [7] | Google Mobile Vision API | Google Cloud Translation API | SIFT |
| System 2 | InceptionV4-LSTM with 18 words (our model) | Firebase Vision Text Detector | Google ML Kit | SURF |
| System 3 | InceptionV3-GRU with 22 words [6] | Google Firebase Machine Learning Kit | Firebase ML Kit | BRISK |
| System 4 | InceptionV3-GRU with 18 words (our model) | TessTwo-Android | Google Translate API | ORB |
3.3.1. Inception-V4 with LSTM and the harmony of advanced vision and language processing
Using the Google Mobile Vision API to recognize text in images: technological advances in information capture and text recognition have led to innovative services such as document analysis and secure access to devices. OCR [26], a technology for detecting and extracting text from scanned images or directly from the camera, can work with or without an internet connection. Google offers Mobile Vision, an open-source tool for creating text recognition and instant translation applications on Android. In this research, OCR is used to assist the visually impaired. Although this technology is effective for document scanning and text analysis, it encounters limitations in applications dependent on a stable internet connection, particularly in areas with low connectivity. The extracted text data is then processed by a REST API [27], which interacts with a database and displays the information live on the device, as illustrated in Figure 10.
Figure 10. Using the Google Mobile Vision API for the OCR process
Google Cloud Translation to convert recognized text into multiple languages: the Google Cloud Platform offers pre-trained machine learning models for creating applications that interact with their environment [28]. Among these models, the Google Cloud Translation API offers the possibility of converting content between dozens of languages. We used this API to translate the information produced by captioning and OCR. Google Translate then renders this data in the language chosen by the visually impaired person. Figure 11 illustrates this workflow, from text recognition through captioning and OCR to automated translation, based on cloud technologies.
Figure 11. Multilingual translation process with the Google Cloud Translation API
Keyframe detection with SIFT: the SIFT descriptor, designed by Lowe [29], is widely used for its efficiency in image processing, particularly for identifying and characterizing points of interest using local gradients. The SIFT process is divided into four main phases, including the application of the difference-of-Gaussians (DoG) method. This involves subtracting images filtered by Gaussian filters applied at different scales. The extrema detected between two adjacent levels are then exploited for further analysis [30], [31], as expressed in (14):
$$D(X, \sigma) = \left(G(X, k\sigma) - G(X, \sigma)\right) * I(X) \tag{14}$$
where $I$ is the input image and $X$ is a specific point $X(x, y)$. The variable $\sigma$ represents the scale, while $G(X, \sigma)$ denotes the Gaussian applied to the point $X$. Regions of interest based on the difference of Gaussians (DoG) [31] are identified as extrema in the image plane and along the scale axis of the function $D(X, \sigma)$. To locate these points, the $D(X, \sigma)$ value of each point is compared with its neighbors at the same and different scales. The SIFT algorithm extracts and describes these points of interest for obstacle detection and recognition.
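The DoG response in (14) can be sketched with NumPy by blurring the image at two scales and subtracting, which is equivalent to convolving with the difference of the two Gaussians because convolution is linear. This is a minimal illustration and not the SIFT implementation used in the benchmark: the kernel radius of about three standard deviations, the edge padding, the default $k = \sqrt{2}$, and the function names are assumptions, and the full neighbor comparison across scales is not shown.

```python
import numpy as np


def gaussian_kernel(sigma: float) -> np.ndarray:
    """Normalized 1-D Gaussian kernel truncated at roughly 3 sigma."""
    radius = max(1, int(3 * sigma + 0.5))
    x = np.arange(-radius, radius + 1)
    k = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()


def gaussian_blur(image: np.ndarray, sigma: float) -> np.ndarray:
    """Separable Gaussian blur with edge padding; output has the input's shape."""
    k = gaussian_kernel(sigma)
    pad = len(k) // 2
    padded = np.pad(image, pad, mode="edge")
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="valid"), 0, rows)


def difference_of_gaussians(image: np.ndarray, sigma: float, k: float = np.sqrt(2)) -> np.ndarray:
    """D(X, sigma) = G(X, k*sigma) * I(X) - G(X, sigma) * I(X), as in (14)."""
    return gaussian_blur(image, k * sigma) - gaussian_blur(image, sigma)


# Toy image: a single bright point on a dark background.
image = np.zeros((21, 21))
image[10, 10] = 1.0
dog = difference_of_gaussians(image, sigma=1.0)
peak = tuple(int(i) for i in np.unravel_index(np.argmin(dog), dog.shape))
print(peak)  # location of the strongest (negative) DoG response
```

For a bright point the wider Gaussian has the lower peak, so the DoG response is most negative at the point itself; candidate keypoints are extrema of this kind, checked against neighbors in the adjacent scales.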
3.3.2. Inception-V4 with LSTM and Firebase: text detection optimization
Firebase Vision Text Detector for efficient text detection: Google's Firebase Cloud Storage service enables developers to store and share user content, such as photos, videos, and audio files, in the cloud [32]. Based on Google Cloud Storage, it offers a scalable object storage solution, perfectly integrated with web