Indonesian Journal of Electrical Engineering and Computer Science
Vol. 40, No. 2, November 2025, pp. 801∼813
ISSN: 2502-4752, DOI: 10.11591/ijeecs.v40.i2.pp801-813
Monocular vision-based visual control for SCARA-type robotic arms: a depth mapping approach
Diego Chambi¹, Bryan Challco¹, Jonathan Catari¹, Walker Aguilar¹, Lizardo Pari¹
¹Electronic Engineering Professional School, Faculty of Production and Services, Universidad Nacional de San Agustín de Arequipa, Arequipa, Perú
Article Info

Article history:
Received Jan 22, 2025
Revised Jul 14, 2025
Accepted Oct 14, 2025

Keywords:
Computer vision
Robotic arm
SCARA-type robotic arm
Vision transformers
Visual servoing
ABSTRACT

The accelerated growth of an increasingly automated industry requires the use of autonomous robotic systems. However, these systems commonly require a large number of sensors. In this paper we evaluate the performance of a new system for visual control of a selective compliance assembly robot arm (SCARA) using a monocular depth map that requires only one monocular camera. This system aims to be an efficient alternative that reduces the number of sensors in the robotic arm area while maintaining the effectiveness of traditional vision algorithms that use stereoscopic camera architectures. For this purpose, the system is compared with representative state-of-the-art vision algorithms focused on the control of robotic arms. The results are statistically analyzed, indicating that the algorithm proposed in this research performs competitively with state-of-the-art robotic arm visual control algorithms while using only a single monocular camera.

This is an open access article under the CC BY-SA license.
Corresponding Author:
Diego Chambi
Electronic Engineering Professional School, Faculty of Production and Services
Universidad Nacional de San Agustín de Arequipa
04001 Arequipa, Perú
Email: dchambitu@unsa.edu.pe
1. INTRODUCTION
Automation has been employed in every industry in recent years. From precision industrial robots to home automation (domotics), automation has taken on an essential role in performing repetitive and dangerous tasks, allowing humans to focus on activities of greater relevance [1]. Among the many systems used in automation, robotic systems are the most widely adopted and offer the broadest range of applications. These systems were first introduced in factories in the 1960s and, by the 1980s, were being used globally, particularly in the automotive sector. Today, robotic systems are found in a wide variety of settings, including small businesses, educational institutions, and agricultural fields [2]-[4]. Robotic arms, in particular, are composed of multiple links and actuators, enabling them to be used in tasks such as painting, pharmaceutical production, and welding in assembly lines [5]-[7]. Each robotic arm is designed and implemented according to the specific requirements of the task it is intended to perform. To achieve this level of adaptability and precision, robotic arms often require a large number of sensors [8]. Consequently, many industries that could benefit from automation hesitate to adopt robotic systems due to the high cost of these sensing components.
Cameras have been widely used in research on the control of robotic arms; achieving this requires adequate processing of the video captured by the camera. In Intisar et al. [9], the video obtained by a camera is processed to classify different objects by color using a transformation to hue, saturation, and value (HSV).
Then, a robotic arm performs a pick-and-place task on the selected object. Its interface allows the user to select an object and have it automatically manipulated by the robotic arm without the need for extensive knowledge of the system's inner workings. However, this system directly depends on objects being placed at the same level, limiting its functionality. In Kumar et al. [10], a stereo camera system generates disparity maps to estimate object location and distance, allowing a robotic arm with three degrees of freedom to perform pick-and-place tasks. However, it requires precise calibration, synchronization, and computationally intensive algorithms, with added challenges from robotic arm movements in handheld setups. According to Liyanage and Krouglicof [11], visual control for a selective compliance assembly robot arm (SCARA) robot incorporates a high-speed camera with an infrared marker placed at the end effector. Kim et al. [12] highlights a wheelchair-mounted robotic arm that employs stereoscopic cameras along with a coarse-to-fine motion control strategy. As noted in [13], the ARMAR-III robot applies stereo vision combined with stored object orientation data to calculate the full 6D pose of objects relative to their 3D models in real time, supporting advanced scene analysis. A rose pruning robot, described in [14], integrates stereoscopic cameras positioned near the end-effector to minimize interference. Meanwhile, Ranftl et al. [15] discusses a dual robotic arm system that autonomously adjusts the camera's viewpoint to maintain an occlusion-free visual field. Additionally, Urrea and Pascal [16] and Fioravanti et al. [17] describe dual-arm systems using stereoscopic cameras for calibration-free control and accurate distance estimation, respectively.

Despite these developments, the computational load, sensitivity to environmental changes, and complexity of calibration make stereo vision-based systems impractical for embedded or low-cost applications. Monocular vision algorithms have become a viable substitute in this regard. For instance, Li et al. [18] introduces a hybrid visual servo system for agricultural harvesting that uses a single camera, and Nicolis et al. [19] investigates the application of Vision Transformers for improved depth prediction in monocular settings. Although these techniques simplify hardware and allow for more flexible deployment, there is still limited integration of these techniques into robotic control systems, especially for pick-and-place and absolute distance estimation tasks.
To fill these gaps, our study proposes a visual control system that integrates a SCARA-style robotic arm with monocular depth estimation based on the MiDaS algorithm [20]. Our method achieves comparable accuracy (RMSE of 0.46 cm) with a single camera, obviating the need for stereo matching and calibration, whereas earlier works like [10], [12], and [13] achieve high precision using stereo vision (e.g., RMSE of 0.49 cm at 15 cm). By making vision-based robotic manipulation more feasible and affordable for embedded systems, where stereo vision has traditionally been too costly and computationally demanding, this method tackles important issues. We use a regression-based metric conversion, motivated by [21], to translate the relative depth given by MiDaS into absolute coordinates for robotic control. This transformation makes inverse kinematics and real-time 3D localization possible. The system achieves high accuracy in robotic tasks while lowering hardware costs and setup complexity by doing away with the need for stereo cameras. The main contributions of this work are:
- Computational efficiency: the monocular system avoids stereo matching and synchronization overhead [10], [12], enabling its use on low-cost, embedded platforms.
- Precision: an RMSE of 0.46 cm at 15 cm, competitive with traditional stereo vision systems (Table 3), providing a high-precision, affordable solution.
- Robustness: stable performance under varying lighting, surpassing baseline systems like [12] and making the system more adaptable to real-world conditions.
To the best of our knowledge, this is the first implementation combining i) monocular depth estimation optimized for embedded platforms [20], ii) real-time absolute metric conversion [21], and iii) a low-cost SCARA robotic manipulator manufactured via additive technologies, offering a breakthrough for cost-effective automation in robotics.
The research is organized as follows: section 2 presents a brief review of the algorithm used for visual control, as well as the materials and methods used to validate the proposed algorithm. Section 3 details the results obtained in the distance estimation and approach tests of the robotic gripper to the target. Section 4 discusses the results, highlighting the most relevant observations. Finally, section 5 presents the conclusions and possible lines of future work.
2. METHOD

2.1. Hardware
A 3-DOF SCARA robotic arm was designed and built using additive manufacturing and aluminum rods to validate the proposed vision-based pick-and-place system, given its industrial versatility and ease of control [22], [23]. Previous studies such as [24] have also demonstrated the feasibility of SCARA robots in precision tasks like peg-in-hole assembly, highlighting their suitability for applications requiring accuracy and compliance.

To illustrate the structural and analytical basis of the proposed robotic system, Figure 1 shows the proposed SCARA robotic arm's kinematic model and physical structure. Figure 1(a) presents the kinematic configuration, highlighting the three degrees of freedom (d1, θ2, and θ3) and their associated links (L2, L3) within a Cartesian reference frame. This model is fundamental for deriving both the forward and inverse kinematics. Figure 1(b) shows the CAD rendering of the physical robotic arm, developed through additive manufacturing techniques. This design was optimized for low-cost robotic applications.

The mechanical structure of the SCARA arm was fabricated using PLA for the 3D-printed components and aluminum rods for vertical support. The system is actuated by three NEMA17 stepper motors for planar movements and an MG92R servo motor for the gripper. GT2 pulleys and belts are used for motion transmission, while linear bearings ensure smooth movement. The robot is controlled by a GT2560 board programmed using the Arduino IDE.

The kinematic model of the robotic arm is based on Denavit-Hartenberg (D-H) parameters, which define the spatial relationships between consecutive links. The parameters for each joint are summarized in Table 1. The arm consists of three joints: one prismatic (d1) and two revolute (θ2, θ3). The corresponding link lengths are L2 and L3, and all joint twists are set to zero (α = 0).
Figure 1. Proposed SCARA robotic arm's kinematic model and physical structure: (a) kinematic representation of the robot with articulated parameters in a Cartesian reference system and (b) physical model of the robot showing its structural design under dynamic conditions
Table 1. D-H parameters of the three D.O.F. for the SCARA robotic arm

          θ     di    ai    α
Joint 1   0     d1    0     0
Joint 2   θ2    0     L2    0
Joint 3   θ3    0     L3    0
The kinematics of a serial-link mechanism can be determined through homogeneous transformation matrices, combining basic rotations and translations for each joint, as described by Corke in [25]. Using the Denavit-Hartenberg (D-H) parameters from Table 1, the transformation matrices A1, A2, and A3 are computed. The direct kinematics is obtained by multiplying these matrices:

T3 = A1 · A2 · A3    (1)
The resulting matrix T3 gives the position and orientation of the end effector with respect to the base frame. In its expanded form, the position is a function of the joint angles θ2 and θ3 and the link lengths L1, L2, and L3. To calculate the joint angle θ3 for object manipulation, the inverse kinematics equation is used:
θ3 = arccos((Px² + Py² − L1² − L2²) / (2·L1·L2))    (2)
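As a concrete reading of (1) and (2), the sketch below builds the Ai matrices from the rows of Table 1 and recovers the elbow angle from a planar target. This is a minimal numpy illustration, not the authors' implementation; it follows the naming of (2), where the two planar links are written L1 and L2.

```python
import numpy as np

def dh_matrix(theta, d, a, alpha):
    """Homogeneous transformation for one row of a standard D-H table."""
    ct, st, ca, sa = np.cos(theta), np.sin(theta), np.cos(alpha), np.sin(alpha)
    return np.array([[ct, -st * ca,  st * sa, a * ct],
                     [st,  ct * ca, -ct * sa, a * st],
                     [0.0,      sa,       ca,      d],
                     [0.0,     0.0,      0.0,    1.0]])

def forward_kinematics(d1, theta2, theta3, L2, L3):
    """Eq. (1): T3 = A1 * A2 * A3, using the three rows of Table 1."""
    A1 = dh_matrix(0.0,    d1,  0.0, 0.0)   # prismatic joint d1
    A2 = dh_matrix(theta2, 0.0, L2,  0.0)   # revolute joint theta2, link L2
    A3 = dh_matrix(theta3, 0.0, L3,  0.0)   # revolute joint theta3, link L3
    return A1 @ A2 @ A3                     # end-effector pose in the base frame

def inverse_theta3(px, py, L1, L2):
    """Eq. (2): elbow angle for a planar target (px, py)."""
    c3 = (px**2 + py**2 - L1**2 - L2**2) / (2.0 * L1 * L2)
    return np.arccos(np.clip(c3, -1.0, 1.0))  # clip guards against rounding
```

For example, with placeholder link lengths L1 = L2 = 15 cm, a target at (20, 10) cm gives inverse_theta3(20, 10, 15, 15) ≈ 1.46 rad.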
For the vision system, Avatec cameras with 720p resolution, a USB interface, and a 30 FPS refresh rate were used to track the position of the object, either separately or in a stereo configuration.
2.2. Software

This subsection details the algorithms necessary to perform the picking task with the SCARA robotic arm. Typically, stereoscopic vision-based systems use tracking algorithms to obtain a disparity between cameras. To represent this type of system, we implement this algorithm using the MIL tracking model, as discussed in [26]. As a second system, the monocular vision depth mapping algorithm is introduced. In this configuration, only one webcam is used along with the MiDaS model, which has been shown to effectively estimate depth from monocular images [27]. The performance of this visual control system is then compared to the conventional stereoscopic camera system. The two systems to be compared are summarized as follows:
- Stereoscopic architecture: an algorithm based on stereoscopic vision using MIL tracking and the disparity algorithm, as outlined in [26].
- Monocular vision: the proposed system, which uses the MiDaS algorithm based on monocular vision [27].
Once each algorithm detects the position of the object in the three Cartesian coordinates, a third algorithm based on the kinematics of the robotic arm picks up the indicated object. A user interface allows the user to signal the object to be picked up by the gripper for manipulation by the robotic arm, as described in [28] and [29].
2.2.1. Stereoscopic architecture

In this vision mode, a two-camera array in stereo configuration is used. This algorithm is widely used in visual control systems for robotic arms due to its simple operating principle. Usually, an object tracking algorithm is used so that the operator can select, through a user interface, the object on which the robotic arm performs the pick-and-place task. We used the MIL algorithm for this specific case, which is considered one of the most robust against disturbances in continuous image capture. We use the OpenCV library and the command cv2.TrackerMIL_create(). Once the object is tracked, we obtain its center of mass from the image moments computed with cv2.moments(). We then use the disparity algorithm to calculate the distance between the object and the stereo camera array.
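A minimal OpenCV sketch of this tracking step is shown below; it is illustrative rather than the authors' exact code, and the camera index and operator-selected ROI are assumptions. In recent OpenCV builds the factory function is cv2.TrackerMIL.create() rather than cv2.TrackerMIL_create().

```python
import cv2

cap = cv2.VideoCapture(0)                      # one camera of the stereo pair
ok, frame = cap.read()
bbox = cv2.selectROI("select object", frame)   # operator picks the target object
tracker = cv2.TrackerMIL_create()
tracker.init(frame, bbox)

while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    found, (x, y, w, h) = tracker.update(frame)
    if found:
        patch = cv2.cvtColor(frame[int(y):int(y + h), int(x):int(x + w)],
                             cv2.COLOR_BGR2GRAY)
        m = cv2.moments(patch)
        if m["m00"] > 0:
            # centroid of the tracked patch, in full-frame coordinates
            cx = x + m["m10"] / m["m00"]
            cy = y + m["m01"] / m["m00"]
```

The same centroid, taken from the left and right frames, provides the XL and XR values used in the disparity computation described next.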
Figure 2 graphically shows the disparity obtained from the position difference captured by both cameras. In Figure 2(a), Oc represents the optical centers of the cameras, T is the baseline, and f is the focal length of each lens. The point P is the object in the environment, and Z is the distance we want to calculate. In Figure 2(b), we observe the object as seen by both frames of the stereoscopic camera, where XL and XR are the distances from the reference frame of each camera to the center of mass of the detected object.
Figure 2. Disparity obtained from a position difference captured: (a) depth triangulation scheme in stereo vision showing the geometry of the cameras and the observed object and (b) disparity representation in images captured by the left and right cameras to estimate the distance
To calculate the distance Z to the object using stereo vision, the following steps are carried out. First, the positions XL and XR of the object are extracted from the left and right camera frames, respectively. The disparity is then calculated as the difference between these two positions:

disparity = XL − XR

If the disparity is zero (i.e., the object is directly aligned between both cameras), it is adjusted to a small value (usually 1) to avoid division by zero. The depth, or distance Z, is then computed using the formula:

Z = (f · T) / disparity

where f is the focal length of the cameras and T is the baseline (the distance between the two cameras). The result is the estimated distance Z to the object along the Z-axis. The distance Z is always positive, so its absolute value is taken to ensure the result is non-negative. In this way, the triangulation process determines the 3D coordinates of the object in space by calculating its location on the X and Y axes, along with the approximate distance along the Z axis.
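These steps amount to only a few lines of code. The following sketch assumes the tracked centroids XL and XR (in pixels) from each camera, a focal length f in pixels, and a measured baseline T; the numbers in the usage comment are illustrative only.

```python
def depth_from_disparity(x_left, x_right, focal_px, baseline_cm):
    """Triangulated distance Z = f * T / disparity, with the zero-disparity guard."""
    disparity = x_left - x_right
    if disparity == 0:
        disparity = 1   # object centered on both images; avoid division by zero
    return abs(focal_px * baseline_cm / disparity)

# Illustrative numbers only: a 41-pixel disparity with f = 700 px and T = 6 cm
# gives depth_from_disparity(412.0, 371.0, 700.0, 6.0) ≈ 102.4 cm.
```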
2.2.2. Monocular architecture

For the proposed monocular vision system, we use the MiDaS depth estimation model based on deep learning. Recent work by Smith et al. [30] introduced alternative methods for linear depth estimation from uncalibrated monocular images using polarization cues; however, our approach focuses on transformer-based depth prediction for robotic control applications. MiDaS offers three versions with varying computational demands. To reduce the implementation cost of visual control in industrial robotic arms, we selected the Small version due to its low computational requirements, which make it suitable for low-power processors. Figure 3 shows the depth map generated by the MiDaS algorithm and the corresponding top view of the test object. In Figure 3(a), the depth map is visualized with colors that indicate the relative distances of the objects. Figure 3(b) presents the same scene converted to grayscale, highlighting the depth variations more clearly for easier processing by the control system.
Figure 3. Depth map visualized with colors indicating the relative distances of the objects: (a) MiDaS Small algorithm example and (b) proposed image processing for the monocular architecture
The monocular vision system uses a backbone-based neural network for distance estimation. However, applying it to industrial robotic tasks requires additional signal processing steps, including perspective transformation, noise filtering, and absolute distance estimation from relative measurements. Figure 4 summarizes these sequential steps, which are detailed below.
First, the image from the webcam is captured; this video, obtained from a single camera, presents a "fisheye" effect that spherically distorts the image. To correct for this distortion, a perspective transformation is performed using the command cv2.warpPerspective(), which requires selecting four points at the edges of the working area. Once the image has been corrected, the MiDaS depth map algorithm is applied, specifically the Small version. This model is loaded from the PyTorch library using the command midas = torch.hub.load('intel-isl/MiDaS', 'MiDaS_small'). At this stage, a depth map version of the input image is obtained. Subsequently, the depth map is normalized using the cv2.normalize() command; for this research, normalization was applied to a range between 1 and 10 to facilitate further data processing. Figure 3(b) shows an example of this normalized depth map in the robotic arm's workspace.
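Condensing these stages, a sketch of the capture, correction, and depth-map steps might look as follows. The working-area corner points and output size are placeholders, and the transform module follows the published intel-isl/MiDaS torch.hub interface; this is an illustration of the pipeline, not the authors' exact code.

```python
import cv2
import numpy as np
import torch

midas = torch.hub.load('intel-isl/MiDaS', 'MiDaS_small')
midas.eval()
small_transform = torch.hub.load('intel-isl/MiDaS', 'transforms').small_transform

# Operator-selected corners of the working area (placeholder coordinates)
src = np.float32([[42, 35], [598, 30], [612, 455], [38, 460]])
w, h = 480, 480
dst = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
M = cv2.getPerspectiveTransform(src, dst)

ok, frame = cv2.VideoCapture(0).read()
frame = cv2.warpPerspective(frame, M, (w, h))        # flatten the working area

rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
with torch.no_grad():
    pred = midas(small_transform(rgb))               # relative (inverse) depth
    pred = torch.nn.functional.interpolate(
        pred.unsqueeze(1), size=rgb.shape[:2],
        mode='bicubic', align_corners=False).squeeze()

# Normalize to the 1-10 range used in this work
depth_map = cv2.normalize(pred.cpu().numpy(), None, 1, 10, cv2.NORM_MINMAX)
```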
Figure 4. Sequential steps for absolute distance estimation
After normalization, spline interpolation is performed to smooth transitions between pixel values. From the interpolated data, the distance from the center of the tracked object to the camera is calculated. A moving average filter is then applied to stabilize the obtained values over time. Subsequently, the relative distance between the background and the object is determined based on the generated depth map. In Masoumian et al. [24], a similar problem is addressed by approximating the absolute distance from the relative measurement using a quadratic function given by:

Y = (c0 + c1·X + c2·X²)·h    (3)

where the coefficients c0, c1, and c2 are obtained using least squares, h is the height at which the camera is located, and X is the relative distance. This same problem is presented in [25] and is solved by finding the optimal curve through least squares. For this, a total of six images at different distances from the camera were used to calibrate the model. Finally, the estimated absolute distance is subtracted from the 35 cm height at which the camera is located to determine the object's height.
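The calibration itself reduces to a least-squares polynomial fit. The sketch below uses the relative/absolute pairs that appear in Algorithm 1 as illustrative calibration data; note that in this direct fit the height factor h of eq. (3) is folded into the fitted coefficients.

```python
import numpy as np

# Relative MiDaS readings and measured absolute distances (cm) from calibration shots
relative = np.array([11.8, 10.843, 10.411, 10.2])
absolute = np.array([21.0, 23.0, 26.0, 31.0])

coeffs = np.polyfit(relative, absolute, deg=2)   # least-squares fit of eq. (3)
to_absolute = np.poly1d(coeffs)

camera_height = 35.0                             # cm, camera mounting height
filtered_reading = 10.6                          # e.g. moving-average output (placeholder)
object_height = camera_height - to_absolute(filtered_reading)
```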
The core steps of the monocular vision and distance estimation process are summarized in Algorithm 1. The algorithm follows these steps: first, the image is captured and the perspective distortion is corrected. Then, the depth map is generated and normalized, followed by the distance calculation. Finally, the absolute distance is estimated using a quadratic fitting model.
Algorithm 1 Proposed algorithm
1: procedure PERSPECTIVECORRECTION(frame)
2:     P1, P2, P3, P4 ← select four points
3:     Points ← [P1, P2, P3, P4]
4:     if length(Points) = 4 then
5:         new_frame ← cv2.warpPerspective(frame, Points)
6:     end if
7:     frame ← new_frame
8: end procedure
9:
10: procedure DEPTHMAP(frame, img_batch)
11:     Midas ← model_type.MiDaS_small
12:     depth_map ← Midas(img_batch, frame)
13:     depth_map ← depth_map.interpolate(frame)
14:     depth_map ← cv2.normalize(depth_map)
15: end procedure
16:
17: procedure DISTANCETOCAMERA(frame, depth_map)
18:     Tracking_algorithm ← MIL
19:     Object ← select.Object
20:     [XC, YC] ← Tracking_algorithm(Object)
21:     Bounding_box ← Tracking_algorithm(Object, frame)
22: end procedure
23:
24: procedure ABSOLUTEDISTANCEESTIMATION(frame, Relative_distance)
25:     x ← [11.8, 10.843, 10.411, 10.2]
26:     y ← [21, 23, 26, 31]
27:     degree ← 2
28:     Quadratic_function ← np.polyfit(x, y, degree)
29:     Distance ← Quadratic_function(Filtered)
30: end procedure
To provide a practical demonstration of the entire process, a video has been included that shows the monocular vision system in action with the SCARA robotic arm. The video illustrates how the steps outlined in Algorithm 1 are executed, from image capture and perspective correction to depth map generation and object tracking. This visual example helps to clarify the methodology and highlight the system's functionality. The video can be viewed at [31].
Algorithm 2 Robotic arm control
1: procedure MICROCONTROLLER(SerialCommunication)
2:     Motor1_Step, Motor1_Dir ← 25, 23
3:     Motor1_Angle ← (200/360) · (62/20)
4:     Motor2_Step, Motor2_Dir ← 31, 33
5:     Motor2_Angle ← (200/360) · (89/20)
6:     MotorZ_Step, MotorZ_Dir ← 37, 39
7:     Motor1_Distance ← 200/1.2
8:     ServoMotor_Pin ← 11
9:     [M1, M2, Mz, Servo] ← SerialCommunication
10:     Motor1_Position ← M1 · Motor1_Angle
11:     Motor2_Position ← M2 · Motor2_Angle
12:     MotorZ_Position ← Mz · Motor1_Angle
13:     ServoMotor_Position ← Servo
14: end procedure
15: procedure INVERSEKINEMATICS(px, py, pz, SpaceButton)
16:     d1 ← pz
17:     Gripper ← 180
18:     θ3 ← arccos((px² + py² − l1² − l2²) / (2·l1·l2))
19:     θ2 ← (l2·(px·sin(θ3) + py·cos(θ3)) + py·l1) / (px² + py²)
20:     data ← [θ2, θ3, d1, Gripper]
21:     if SpaceButton = 1 then
22:         SerialCommunication ← data
23:     end if
24: end procedure
2.2.3. Robotic arm control

Once the object is fixed and its exact position has been obtained through the algorithms detailed above, the inverse kinematics of the robotic arm is used so that it reaches the object and picks it up. In Algorithm 2, the first procedure corresponds to the algorithm implemented in the microcontroller of the robotic arm, which is in charge of receiving, through serial communication, the angle that each motor must travel; for this, we must transform the angle into the steps the motor must take, considering the teeth of the motor gear and the pulley of the corresponding link. Within this microcontroller procedure, we also need to name the pins connected to the motors, obtained from the GT2560 board. For the motor that raises or lowers the robotic arm along the Z axis, the transformation reduces to:

AngleToSteps = BeltTeeth / GearTeeth
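For illustration, the angle-to-step conversions set up at the start of Algorithm 2 can be expressed as follows; the tooth counts and the 200-step motor resolution are taken from the algorithm listing, while the helper name is ours.

```python
STEPS_PER_REV = 200                      # NEMA17 full steps per revolution

def steps_per_degree(gear_teeth, pulley_teeth=20):
    """Angle-to-steps factor for a belt-driven joint (cf. Algorithm 2, lines 3 and 5)."""
    return (STEPS_PER_REV / 360.0) * (gear_teeth / pulley_teeth)

motor1_angle = steps_per_degree(62)      # (200/360) * (62/20) steps per degree
motor2_angle = steps_per_degree(89)      # (200/360) * (89/20) steps per degree
```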
The second procedure in the algorithm represents the inverse kinematics, which is executed on the computer that has serial communication with the robot; this calculation is given by the equations derived in the Hardware subsection. Finally, a conditional waits for the operator's indication (pressing the space key) before the angles are sent over serial communication to the microcontroller and executed by the robotic arm.
3. RESULTS

3.1. Results on distance estimation

Each system was tested with the SCARA robot, using additively printed objects at different heights and positions within the workspace. The system includes a low-cost SCARA robot, a laptop for processing, test objects, a monocular camera, and a stereoscopic camera array. The complete setup, including all components, can be seen in Figure 5, which shows both the hardware and the arrangement of the sensors and robotic arm in the test environment.
Figure 5. Setup implemented for experimentation
Multiple picking tasks were performed with each algorithm to evaluate both proposed systems. Because we seek to implement a system that correctly identifies the position of the object so that the robotic gripper can pick it up, we do not evaluate parameters such as speed, torque, or power consumption of the robotic arm. In addition, a user interface was implemented that allows the user to select the object to be picked up with the robotic arm. Within this interface, the user can see the camera view in real time and select the objects to be picked up by the robotic arm; for this experiment, circular figures were used in both cases to make the evaluation fair.
Table 2 presents the estimated distances and corresponding error values obtained using both the proposed monocular vision system and a traditional stereoscopic system. The trials are grouped by real distance: 15 cm, 13 cm, 10 cm, and 5 cm. Each group compares the estimated distance with the actual object distance, and the difference is shown as the estimation error. A color-coded heatmap highlights low (green), moderate (yellow), and high (red) errors, facilitating a visual assessment of accuracy. This format allows for a clear comparative analysis between the two systems across multiple trials and distances.
Table 2. Distance estimation at 15 cm, 13 cm, 10 cm, and 5 cm

Real        Monocular vision (proposed)      Stereoscopic vision
distance    Estimated        Error           Estimated        Error
(cm)        distance (cm)    (cm)            distance (cm)    (cm)
15          14.809           0.191           14.268           0.732
15          14.854           0.146           15.573           0.573
15          14.760           0.240           14.675           0.325
15          14.643           0.357           15.294           0.294
15          14.434           0.566           14.921           0.079
15          14.112           0.888           15.407           0.407
15          14.978           0.022           14.351           0.649
15          14.225           0.775           15.831           0.831
15          15.393           0.393           14.733           0.267
15          15.094           0.094           15.122           0.122
13          13.133           0.133           12.367           0.633
13          13.484           0.484           13.721           0.721
13          14.087           1.087           12.946           0.054
13          13.224           0.224           14.012           1.012
13          12.235           0.765           13.532           0.532
13          13.389           0.389           12.689           0.311
13          12.791           0.209           13.248           0.248
13          13.031           0.031           12.574           0.426
13          13.804           0.804           13.896           0.896
13          13.114           0.114           12.315           0.685
10          10.654           0.654           9.842            0.158
10          9.334            0.666           10.369           0.369
10          10.412           0.412           11.019           1.019
10          10.688           0.688           9.738            0.262
10          10.101           0.101           10.124           0.124
10          10.838           0.838           10.965           0.965
10          10.928           0.928           9.173            0.827
10          10.263           0.263           9.512            0.488
10          10.978           0.978           10.876           0.876
10          10.145           0.145           9.321            0.679
5           5.316            0.316           4.825            0.175
5           5.755            0.755           5.692            0.692
5           5.583            0.583           4.213            0.787
5           5.086            0.086           5.336            0.336
5           5.557            0.557           4.181            0.819
5           5.782            0.782           5.812            0.812
5           5.805            0.805           4.567            0.433
5           4.923            0.077           6.109            1.109
5           5.034            0.034           4.896            0.104
5           4.702            0.298           4.429            0.571
When compared with existing stereo vision systems, such as those described in [10] and [13], where stereo setups with dual high-precision cameras achieved RMSE values around 0.49 cm at 15 cm, our monocular system achieves comparable accuracy (RMSE of 0.46 cm) while requiring only a single camera. This makes our approach more cost-effective and easier to deploy, particularly in resource-constrained environments. These results demonstrate that the proposed monocular system can perform at a level of accuracy similar to that of stereo vision systems, but with far fewer hardware requirements. The implication of this is significant for applications in industrial robotics, where minimizing hardware cost and complexity is often crucial. By replacing expensive stereo vision setups with a single camera, we open up the possibility of implementing visual control systems on more cost-effective and embedded robotic platforms.
Table 3 provides a comparative analysis of the monocular vision algorithm (proposed) and the stereoscopic vision algorithm based on minimum error, maximum error, and root mean square error (RMSE) at different real distances. The results show that the monocular vision algorithm achieves lower RMSE at shorter distances while maintaining competitive performance at longer distances, highlighting its robustness and reliability compared to the stereoscopic method.
Table 3. Comparison of monocular vision (proposed) and stereoscopic vision, error and RMSE

Real distance   Monocular vision (proposed)                 Stereoscopic vision
(cm)            Error max   Error min   RMSE                Error max   Error min   RMSE
                (cm)        (cm)        (cm)                (cm)        (cm)        (cm)
15              0.888       0.022       0.4600              0.831       0.079       0.4925
13              1.087       0.054       0.5407              1.012       0.054       0.6189
10              0.978       0.124       0.6430              1.019       0.124       0.6607
5               0.805       0.104       0.5179              1.109       0.104       0.6577
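The RMSE figures in Table 3 follow directly from the per-trial errors in Table 2; for instance, the monocular errors at 15 cm reproduce the 0.4600 cm entry:

```python
import numpy as np

# Monocular errors at 15 cm, taken from Table 2
errors = np.array([0.191, 0.146, 0.240, 0.357, 0.566,
                   0.888, 0.022, 0.775, 0.393, 0.094])
rmse = np.sqrt(np.mean(errors ** 2))
print(round(rmse, 4))  # 0.46, matching the Table 3 entry of 0.4600
```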
For a comparative view of the results, Figure 6 represents the results of Table 2 in a box plot; the results are grouped in pairs, each pair comprising the estimates of the monocular vision algorithm and the stereo vision-based algorithm, giving four pairs for the proposed distances.
Figure 6. Box plot comparing distance estimation errors
Because the main focus of the presented algorithms is the determination of the distance from the cameras to the target, a statistical analysis is performed to evaluate the performance of both algorithms in this estimation. From the errors in Table 2, we obtain normal distributions according to the Shapiro-Wilk test. However, there is no homogeneity of variances according to Levene's test; due to this, we use a non-parametric analysis based on the Mann-Whitney U test. The following hypotheses are assumed for this test:
- H0: there is no significant difference between the two data groups.
- Hi: there is a significant difference between the two data groups.
By assigning a significance value alpha = 0.05 (5%), the P values shown in Table 4 are obtained.
Table 4. Hypotheses for each estimation distance

Distance (cm)   α       P value    H0         Hi
15              0.05    0.09938    Accepted   Rejected
13              0.05    0.18217    Accepted   Rejected
10              0.05    0.14495    Accepted   Rejected
5               0.05    0.11323    Accepted   Rejected
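As an illustration of this test sequence, the sketch below applies the same three tests to the two 15 cm error groups from Table 2 using scipy; exact p-values depend on the test options chosen, so they may differ slightly from the Table 4 entries.

```python
import numpy as np
from scipy import stats

# 15 cm error columns from Table 2
mono = np.array([0.191, 0.146, 0.240, 0.357, 0.566,
                 0.888, 0.022, 0.775, 0.393, 0.094])
stereo = np.array([0.732, 0.573, 0.325, 0.294, 0.079,
                   0.407, 0.649, 0.831, 0.267, 0.122])

print(stats.shapiro(mono).pvalue, stats.shapiro(stereo).pvalue)  # normality (Shapiro-Wilk)
print(stats.levene(mono, stereo).pvalue)                         # variance homogeneity (Levene)
print(stats.mannwhitneyu(mono, stereo).pvalue)                   # non-parametric comparison
```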
In summary, the monocular vision algorithm offers considerable benefits in terms of cost and simplicity while exhibiting strong performance with small error margins and reaching accuracy levels comparable to stereoscopic systems. These findings suggest that monocular vision can be a very successful substitute for robotic applications, especially in settings where computational efficiency and cost reduction are top priorities. These results validate our initial hypothesis that a monocular vision system can serve as a viable and cost-effective alternative to more complex stereoscopic systems in robotic applications.
3.2. Results on gripper approximation

Once the distances of the objects to the camera are estimated, the approximations of the SCARA arm's robotic gripper to the position of each object are computed using the inverse kinematics equations presented in the Hardware section. The calculations performed by Algorithm 2, which contains the kinematics equations, are shown in Table 5, together with the errors obtained between the actual position and these calculations. This error is given by the distance between two points in three dimensions:

Error = √((x2 − x1)² + (y2 − y1)² + (z2 − z1)²)
Table 5. Gripper approximation results with proposed algorithm

Real position                        Gripper position                     Error
X axis    Y axis    Z axis           X axis    Y axis    Z axis           (cm)
(cm)      (cm)      (cm)             (cm)      (cm)      (cm)
5.00      5.00      15.00            5.013     4.823     14.809           0.2607
5.00      12.50     15.00            4.847     12.422    14.854           0.2254
5.00      20.00     15.00            4.786     20.453    14.760           0.5555
5.00      5.00      15.00            5.074     14.643    14.643           0.3660
12.50     12.50     15.00            12.385    12.493    14.434           0.5776
12.50     20.00     15.00            12.871    19.765    14.112           0.9907
20.00     5.00      15.00            20.122    5.018     14.978           0.1253
20.00     12.50     15.00            19.964    12.405    14.225           0.7816
20.00     20.00     15.00            19.817    19.958    15.393           0.4355
5.00      5.00      10.00            5.056     4.896     10.654           0.6646
5.00      12.50     10.00            5.110     12.506    9.334            0.6750
5.00      20.00     10.00            4.935     19.509    10.412           0.6442
12.50     5.00      10.00            12.578    5.098     10.688           0.6993
12.50     12.50     10.00            12.492    12.381    10.101           0.1563
12.50     20.00     10.00            12.853    19.872    10.838           0.9183
20.00     5.00      10.00            20.262    4.842     10.928           0.9771
20.00     12.50     10.00            20.114    12.485    10.263           0.2870
20.00     20.00     10.00            19.973    20.421    10.978           1.0651
5.00      5.00      5.00             4.823     5.044     5.316            0.3649
5.00      12.50     5.00             5.276     12.519    5.755            0.8041
5.00      20.00     5.00             5.198     20.167    5.583            0.6380
12.50     5.00      5.00             12.622    4.897     5.086            0.1814
12.50     12.50     5.00             12.735    12.365    5.557            0.6194
12.50     20.00     5.00             12.631    20.352    5.782            0.8675
20.00     5.00      5.00             20.255    4.932     5.805            0.8472
20.00     12.50     5.00             20.153    12.460    4.923            0.1759
20.00     20.00     5.00             20.318    20.122    5.034            0.3423
At larger distances (e.g., 15 cm), the gripper's approximation error is relatively small, with a maximum error of 0.2607 cm. However, at shorter distances, such as 5 cm, the error increases to 0.8675 cm, suggesting that the system performs better at longer ranges but needs further optimization for accuracy at close distances. The gripper's approximation errors align with previous studies, which report errors of 0.5 cm to 1 cm for similar robotic systems using inverse kinematics for position estimation at distances of 10 to 15 cm [9], [10]. Our system, with maximum errors around 1.0651 cm at 5 cm, shows comparable performance but highlights the potential