TELKOMNIKA Telecommunication, Computing, Electronics and Control
Vol. 23, No. 6, December 2025, pp. 1729∼1742
ISSN: 1693-6930, DOI: 10.12928/TELKOMNIKA.v23i6.27500 ❒ 1729
Object detection and tracking with decoupled DeepSORT based on αβ filter
Lakhdar Djelloul Mazouz, Abdessamad Kaddour Trea, Tarek Amiour, Abdelaziz Ouamri
Image and Signal Laboratory (LSI), University of Sciences and Technology of Oran (USTO-MB), Oran, Algeria
Article Info

Article history:
Received Aug 25, 2025
Revised Oct 5, 2025
Accepted Oct 19, 2025
Keywords:
Deep learning
DeepSORT
High order tracking accuracy
Object detection
Object tracking
Video surveillance
ABSTRACT

With the rapid growth of the population, the demand for autonomous video surveillance systems has substantially increased. Recently, artificial intelligence has played a key role in the development of these systems. In this paper, we present an enhanced autonomous system for object detection and tracking in video streams, tailored for transportation and video surveillance applications. The system comprises two main stages. Detection stage: this stage employs you only look once (YOLO)v8m, trained on the KITTI dataset, and is configured to detect only pedestrians and cars. The model achieves an average precision of 97.3% and 87.1% for the car and pedestrian classes respectively, resulting in a final mean average precision (mAP) of 92.2%. Tracking stage: the tracking component utilizes the DeepSORT algorithm, which originally incorporates a Kalman filter for motion prediction and performs data association using cosine and Mahalanobis distances to maintain consistent object identifiers across frames. To improve tracking performance, we introduce two key modifications to the original DeepSORT: an architecture modification and a Kalman filter replacement. The tracking tests are carried out on the KITTI and MOTChallenge benchmarks. The final high order tracking accuracy (HOTA) scores achieve 77.645 and 54.019 for the Car and Pedestrian classes respectively in the KITTI benchmark, and 45.436 for the Pedestrian class in the MOTChallenge benchmark.
This is an open access article under the CC BY-SA license.
Corresponding Author:
Lakhdar Djelloul Mazouz
Signals and Image Laboratory (LSI), University of Sciences and Technology of Oran (USTO-MB)
Bir El Djir 31000, Oran, Algeria
Email: lakhdar.djelloul@univ-usto.dz
1. INTRODUCTION
Real-time object detection is a critical component in applications such as autonomous vehicles, robotics, and video surveillance. Among the leading algorithms, you only look once (YOLO) stands out for its optimal balance between speed and accuracy, enabling efficient object recognition in static images. Since its introduction, several variants of YOLO have been developed, each improving upon its predecessor to enhance performance and address previous limitations.

To ensure public safety and minimize risk, we have developed an intelligent, autonomous video surveillance system capable of real-time detection and tracking of pedestrians and vehicles. The system operates in two stages: object detection using the YOLO algorithm, followed by tracking with an optimized version of DeepSORT.
During recent years, extensive research has been conducted in the field of object detection, leading to the development of various techniques. These techniques can be broadly categorized into two groups:
traditional techniques, based on color [1]-[3], texture [2], morphology [4], edge detection [3], and classical machine learning [5], [6]; and advanced techniques using deep learning and artificial intelligence [7]-[12].
In this work, we focus on the detection and tracking of two classes of objects: pedestrians and cars. We train our YOLO model on the KITTI dataset, on which it achieves a high mAP score. We first apply object detection, using YOLOv8, to identify pedestrians and vehicles. Then, for tracking, we employ DeepSORT, an enhanced version of the simple online real-time tracking (SORT) algorithm. Rather than using a single DeepSORT instance for all object classes, we implement a decoupled approach, assigning a dedicated DeepSORT tracker to each class. This reduces inter-class confusion and minimizes ID switches. To optimize computational efficiency, we replace the traditional Kalman filter with a simpler and faster αβ filter.
This paper is organized as follows. Section 2 presents a comprehensive theoretical basis: we first present the dataset used, followed by an introduction to the YOLOv8 detection algorithm and a description of how it was trained to enhance its performance; next, we present the DeepSORT tracker along with the two modifications we applied to improve its tracking capabilities. Section 3 is devoted to a detailed description of the steps of the proposed method. Section 4 presents the different metrics employed to assess the performance of the detection algorithm and the various DeepSORT-based trackers. Section 5 is primarily dedicated to reporting the experimental results used to evaluate the performance of the detection algorithm and three versions of the DeepSORT tracker (the original and the two modified versions). Finally, section 6 concludes the paper.
2. THE COMPREHENSIVE THEORETICAL BASIS
2.1. Dataset
YOLOv8 was initially trained on the common objects in context (COCO) dataset, which contains approximately 330,000 annotated images spanning 80 object categories. However, our application focuses on developing an autonomous video surveillance system dedicated to monitoring pedestrians and vehicles, that is to say two object classes: person and car. This narrower scope necessitates the use of a more targeted dataset. Consequently, we opted for the KITTI dataset, which includes only 8 object categories, among them the two classes relevant to our study: pedestrian and car [13], [14].

For object detection training, we utilized the KITTI object detection dataset, comprising 7,481 training images and 7,518 test images, with a total of 80,256 labeled instances. All images are in color and stored in PNG format. For object tracking evaluation, we employed the KITTI tracking dataset [15]. It consists of 21 training sequences and 29 test sequences, totaling 8,008 color images in PNG format. Among these, 31 sequences comprise images with a resolution of 1242 × 375 pixels, while the remaining sequences contain images with similar dimensions (i.e., 12xx × 37x).

To ensure efficient inference and alignment with benchmark standards, our system is configured to detect and track only two object classes: car and person. Notably, the KITTI benchmark restricts its evaluation to these specific categories using the TrackEval-Master Python source codes [16]. To generalize our application, another benchmark evaluation is used (MOTChallenge [17]).
2.2. Object detection
As previously mentioned, this study employs the YOLO algorithm for object detection. YOLO is a single-stage detector that partitions the input image into a grid of equally sized cells, typically of dimension (N × N). Each cell is responsible for predicting the presence and location of objects within its boundaries. In this work, we utilize the YOLOv8m model, released by Ultralytics in January 2023. The YOLOv8 family comprises five variants designed for tasks such as object detection, segmentation, and classification: YOLOv8n (Nano), YOLOv8s (Small), YOLOv8m (Medium), YOLOv8l (Large), and YOLOv8x (Extra large) [18], [19]. YOLOv8n is the most lightweight and fastest, making it suitable for real-time applications with limited computational resources. In contrast, YOLOv8x offers the highest accuracy, albeit with increased computational cost and inference time.
2.3. Object tracking
Online object tracking systems commonly rely on the SORT algorithm, which consists of four principal components: detection, estimation, data association, and identity management (creation and termination). Despite its efficiency, SORT faces significant limitations in maintaining consistent object identifiers (IDs), particularly when objects reappear following occlusion. In such cases, the algorithm often assigns a new ID, treating the reappearing object as entirely new. To address this limitation, the DeepSORT algorithm introduces
a more robust tracking mechanism by incorporating a deep appearance-based association metric. This enhancement enables the tracker to consider visual features of the object, thereby improving identity preservation across occlusions. The architecture of the DeepSORT algorithm is illustrated in Figure 1.

Figure 1. DeepSORT architecture [20]
The DeepSORT algorithm maintains consistent object identifiers by evaluating a combined distance metric. An object is assigned the same ID if the following condition is satisfied [21]-[23]:

$$\lambda \times d_1(i,j) + (1 - \lambda) \times d_2(i,j) \le \mathrm{Threshold} \tag{1}$$
In this formulation, $d_1(i,j)$ represents the cosine distance, while $d_2(i,j)$ denotes the Mahalanobis distance. The variables $i$ and $j$ correspond to the coordinates of the object's center, and $\lambda \in [0,1]$ is a weighting factor that balances the contribution of the appearance-based and motion-based metrics. This criterion enables the algorithm to associate reappearing objects with their original identities, thereby improving tracking robustness in scenarios involving occlusion.
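To make the association rule in (1) concrete, the following is a minimal Python sketch of the gating test, assuming the cosine and Mahalanobis distance matrices have already been computed; the λ and threshold values are illustrative, not the paper's:

```python
import numpy as np

def gate_matches(d1, d2, lam=0.5, threshold=0.7):
    """Combined-distance gating as in (1).

    d1: cosine distance matrix (tracks x detections)
    d2: Mahalanobis distance matrix (tracks x detections)
    Returns a boolean matrix: True where a (track, detection)
    pair is admissible for association.
    """
    combined = lam * d1 + (1.0 - lam) * d2
    return combined <= threshold
```

Pairs that pass this gate are then resolved into one-to-one assignments by the association step (e.g., the Hungarian algorithm in standard DeepSORT implementations).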
3. METHOD
In order to enhance the performance of DeepSORT, two critical modifications were incorporated into its original framework.
3.1. Architecture modification
To reduce identity confusion between distinct object classes, we propose to modify the architecture of the DeepSORT tracking algorithm. Specifically, the modification involves deploying separate DeepSORT instances for each class: one dedicated to pedestrians and the other to vehicles (e.g., cars). These trackers operate concurrently and independently, allowing for class-specific identity management and reducing cross-class mis-association. The final tracking output is obtained by merging the results from both trackers, thereby preserving class integrity throughout the tracking process. The proposed dual-tracker architecture is illustrated in Figure 2. It will be referred to in the following as the decoupled DeepSORT algorithm.

Figure 2. Decoupled DeepSORT architecture
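The decoupled design can be sketched as follows; `tracker_factory`, the `update()` interface and the detection tuple layout are illustrative assumptions, not the authors' code:

```python
# Minimal sketch of the decoupled design: one tracker per class.
# The factory is assumed to return any DeepSORT implementation
# exposing an update(detections) -> tracks interface.

CLASSES = ("car", "pedestrian")

class DecoupledDeepSort:
    def __init__(self, tracker_factory):
        # Independent tracker (and hence ID space) per class.
        self.trackers = {c: tracker_factory() for c in CLASSES}

    def update(self, detections):
        """detections: list of (bbox, confidence, class_name)."""
        merged = []
        for cls, tracker in self.trackers.items():
            # Route each detection to the tracker of its own class,
            # so a pedestrian can never be matched to a car track.
            dets = [d for d in detections if d[2] == cls]
            merged.extend(tracker.update(dets))
        return merged
```

Because the two ID spaces never interact, a cross-class match is structurally impossible, which is what drives the ID-switch reductions reported in section 5.2.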
3.2. Kalman filter replacement
The second modification that we made to the DeepSORT algorithm involves replacing the Kalman filter with a simplified and computationally efficient alternative: the αβ filter. This is a fixed-coefficient filter, which may be viewed as a second-order steady-state Kalman filter. It was originally designed for target tracking in the radar field. Compared to the Kalman filter, the αβ filter offers several advantages in the context of real-time object tracking [24]:
- Simplified prediction and update mechanisms: unlike the Kalman filter, which dynamically computes gain values based on the innovation covariance matrix in each frame, the αβ filter uses fixed gain parameters, α for position and β for velocity, resulting in more straightforward computations, as shown in (2) and (3).
- Reduced computational complexity: the αβ filter updates the state vector and associated parameters without requiring matrix inversion, thereby significantly lowering the computational burden compared to the Kalman filter, as shown in (4) and (5).
- Comparable performance in simple tracking scenarios: despite its simplicity, the αβ filter achieves tracking accuracy similar to that of the Kalman filter in scenarios with limited noise and linear motion.
The state vector at time step (frame) $k$, used in the filter, encompasses the object's bounding box parameters $[X_k, Y_k, W_k, H_k]$ and their velocities $[\dot{X}_k, \dot{Y}_k, \dot{W}_k, \dot{H}_k]$, with $[X_k, Y_k]$ denoting the object center coordinates, and $[W_k, H_k]$ the width and height of the bounding box. The operational steps of the αβ filter are outlined below. To simplify, we give the equations only for the $X$ component; similar equations apply to the remaining components. Let $X^e_k$, $X^p_k$ and $X^m_k$ denote, respectively, the estimate, the prediction and the measurement (provided by the detection stage) of $X_k$.
- Initialization: the $X_k$ component of the state vector is initialized with the $X$ coordinate of the center of the first ($k = 0$) bounding box, provided by the detection stage. The $\dot{X}_k$ component of the state vector is initialized with zero.
- Prediction:
$$X^p_k = X^e_k + T \times \dot{X}^e_k \tag{2}$$
$$\dot{X}^p_k = \dot{X}^e_k \tag{3}$$
where $T$ represents the frame period.
- Update:
$$X^e_k = X^p_k + \alpha_x \times (X^m_k - X^p_k) \tag{4}$$
$$\dot{X}^e_k = \dot{X}^p_k + \frac{\beta_x}{T} \times (X^m_k - X^p_k) \tag{5}$$
where $\alpha_x$ and $\beta_x$ are the fixed coefficients of the filter, relative to the component $X_k$ of the state vector. The selection of these coefficients depends on the system's dynamics: the higher these coefficients are, the more responsive the filter is.
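A minimal single-component sketch of (2)-(5) follows; the residual uses the standard measurement-minus-prediction convention, the default gains match the couple (α, β) = (0.2, 0.022) retained in section 5.2, and T = 1/30 s is an assumption corresponding to a 30 fps stream:

```python
class AlphaBetaFilter1D:
    """Alpha-beta filter for a single state component, per (2)-(5)."""

    def __init__(self, x0, alpha=0.2, beta=0.022, T=1 / 30):
        # x0: first measured value (e.g., box center X at k = 0)
        self.x = x0      # position estimate X^e, per the initialization step
        self.v = 0.0     # velocity estimate, initialized with zero
        self.alpha, self.beta, self.T = alpha, beta, T

    def step(self, x_meas):
        # Prediction, (2) and (3)
        x_pred = self.x + self.T * self.v
        v_pred = self.v
        # Update with the measurement residual, (4) and (5)
        r = x_meas - x_pred
        self.x = x_pred + self.alpha * r
        self.v = v_pred + (self.beta / self.T) * r
        return self.x, self.v
```

In the tracker, one such filter runs for each of the eight state components, with no matrix inversion anywhere in the loop.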
4. METRICS
4.1. Object detection metrics
To evaluate the suitability of various object detection models for a given application, several performance metrics are commonly employed. These include recall, precision, intersection over union (IoU), F1 score, average precision (AP), and mean average precision (mAP) [25].
4.1.1. Average precision
We begin by defining two very important metrics for detector evaluation: precision and recall. These two metrics are given by:
$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{6}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{7}$$
where:
- TP = true positives
- FP = false positives
- FN = false negatives
For a single class, the AP is computed as the area under the precision-recall curve:
$$AP = \int_0^1 P(r) \, dr \tag{8}$$
where $P(r)$ is the precision as a function of recall.
In practice, the average precision for class $i$ may be approximated using:
$$AP_i = \sum_{j=0}^{k-1} \left[ \mathrm{Recall}(i,j) - \mathrm{Recall}(i,j-1) \right] \times \mathrm{Precision}(i,j) \tag{9}$$
where $\mathrm{Recall}(i,j)$ and $\mathrm{Precision}(i,j)$ are the recall and precision of class $i$, evaluated using the $j$th threshold, and $k$ is the number of thresholds. The AP is a metric that may be used to assess the performance of detection and localization algorithms: the higher it is, the more efficient the algorithm. It corresponds to the area under the precision-recall curve, and it may be estimated using the pairs (precision, recall) for several confidence thresholds.
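A minimal sketch of the discrete approximation in (9) is shown below; it deliberately omits the precision interpolation used by VOC/COCO-style evaluators:

```python
import numpy as np

def average_precision(recalls, precisions):
    """AP per (9): sum of precision-weighted recall increments.

    recalls, precisions: values measured at k confidence
    thresholds, with recalls sorted in ascending order.
    """
    recalls = np.concatenate(([0.0], recalls))  # Recall(i, -1) = 0
    deltas = np.diff(recalls)                   # recall increments
    return float(np.sum(deltas * precisions))
```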
4.1.2. Mean average precision
The mAP is another important performance metric, mainly used for evaluating machine learning models. It is defined as the average of the average precisions of the different detected classes, and is calculated through (10):
$$mAP = \frac{1}{N} \times \sum_{i=1}^{N} AP_i \tag{10}$$
where $N$ indicates the number of classes and $AP_i$ is the average precision of class $i$.
4.1.3. F1 score
The F1 score is a performance metric used in classification and detection tasks. It represents the harmonic mean of precision and recall, thus balancing both metrics into a single value:
$$F1\ \mathrm{score} = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{11}$$
This metric is particularly useful when the dataset is imbalanced, as it considers both false positives and false negatives.
4.2. Object tracking metrics [26]
The classification of events, activities, and relationships (CLEAR) workshop has defined a common, unified framework for evaluating multi-object tracking (MOT) algorithms, known as the CLEAR-MOT metrics, which places the multiple object tracking accuracy (MOTA) metric as the primary metric for tracking evaluation, although it has been criticized for favoring detection over association. In recent years, the most commonly used benchmarks for evaluating multi-object tracking algorithms are MOTChallenge and KITTI; the main metrics used in these benchmarks are MOTA, IDF1 and high order tracking accuracy (HOTA).
4.2.1. DetA
Detection accuracy (DetA) measures how well the tracker detects objects, independent of identity preservation. The formula for its computation is:
$$DetA = \frac{TP}{TP + FP + FN} \tag{12}$$
4.2.2. AssA
Association accuracy (AssA) evaluates how well the tracker maintains object identities across frames. It is computed by:
$$AssA = \frac{TPA}{TPA + FPA + FNA} \tag{13}$$
where:
- TPA = true positive associations
- FPA = false positive associations
- FNA = false negative associations
4.2.3. Identification F1 score (IDF1)
IDF1 evaluates the accuracy of identity preservation in tracking. It is the harmonic mean of identity precision and identity recall:
$$IDF1 = \frac{2 \times IDTP}{2 \times IDTP + IDFP + IDFN} \tag{14}$$
where:
- IDTP = identity true positives
- IDFP = identity false positives
- IDFN = identity false negatives
4.2.4. LocA
Localization accuracy (LocA) measures the average spatial alignment of correctly detected objects using IoU. The formula for its computation is:
$$LocA = \frac{1}{|TP|} \times \sum_{c \in TP} \mathrm{Loc\text{-}IoU}(c) \tag{15}$$
where $\mathrm{Loc\text{-}IoU}(c)$ represents the intersection over union for true positive candidate $c$.
4.2.5. Multiple object tracking accuracy
MOTA evaluates overall tracking performance by penalizing false positives, missed detections, and identity switches. It reflects how well the tracker maintains object presence and identity. Its formula is given below (a short computation sketch follows the definitions):
$$MOTA = 1 - \frac{\sum_k FP_k + \sum_k FN_k + \sum_k IDSW_k}{\sum_k GTDet_k} \tag{16}$$
where:
- $k$ = frame index
- IDSW = identity switches
- GTDet = ground-truth detections
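A direct transcription of (16) as a helper function; the per-frame count sequences are assumed to come from an evaluator such as TrackEval:

```python
def mota(fp, fn, idsw, gt):
    """MOTA per (16), computed from per-frame counts.

    fp, fn, idsw, gt: sequences of per-frame false positives,
    false negatives, identity switches, and ground-truth detections.
    """
    return 1.0 - (sum(fp) + sum(fn) + sum(idsw)) / sum(gt)
```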
4.2.6. Multiple object tracking precision
MOTP measures the average localization precision of matched object detections, based on the spatial overlap (IoU) between predicted and ground truth bounding boxes. It can be computed by:
$$MOTP = \frac{\sum_{k,i} IoU_{k,i}}{\sum_k c_k} \tag{17}$$
where $IoU_{k,i}$ represents the bounding box overlap of object $i$ at time $k$, and $c_k$ the number of matches in frame $k$.
4.2.7. HOTA
HOTA is a more recent metric that jointly evaluates detection, association, and localization accuracy (LocA). It represents the geometric mean of detection and association accuracies:
$$HOTA_\alpha = \sqrt{DetA_\alpha \times AssA_\alpha} = \sqrt{\frac{\sum_{c \in TP} \mathrm{Ass\text{-}IoU}(c)}{|TP_\alpha| + |FN_\alpha| + |FP_\alpha|}} \tag{18}$$
In this formula, the term $\alpha$ represents the different IoU thresholds used to compute the metric. A generalized version of $HOTA_\alpha$, denoted as HOTA, is computed over a range of thresholds $\alpha \in [0,1]$:
$$HOTA = \int_0^1 HOTA_\alpha \, d\alpha \approx \frac{1}{19} \sum_{\alpha = 0.05}^{0.95} HOTA_\alpha \tag{19}$$
HOTA is a scalar metric that summarizes the overall tracking performance of a system by averaging the HOTA scores across a range of IoU thresholds. It captures the balance between detection accuracy, association accuracy, and localization precision, making it one of the most comprehensive metrics for evaluating multi-object tracking. Unlike traditional metrics that focus heavily on either detection (MOTA) or identity preservation (IDF1), HOTA integrates all three aspects (detection, association, and localization) into a unified score that reflects performance across varying spatial tolerances.
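A sketch of the threshold averaging in (19), assuming the per-threshold DetA and AssA values are already available from the evaluator:

```python
import numpy as np

def hota(det_a, ass_a):
    """HOTA per (18)-(19): geometric mean of DetA and AssA per
    threshold, averaged over the 19 IoU thresholds 0.05, ..., 0.95.

    det_a, ass_a: arrays of DetA_alpha and AssA_alpha, one value
    per threshold.
    """
    hota_alpha = np.sqrt(np.asarray(det_a) * np.asarray(ass_a))
    return float(hota_alpha.mean())
```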
5. EXPERIMENTAL RESULTS
5.1. Object detection
The KITTI dataset was used for training and evaluating the employed object detection method. It was split into three folders, with a ratio of 80:10:10 for training, validation and test, respectively.
The hardware configuration used for training is:
- Graphics processing unit (GPU): NVIDIA GeForce RTX 3060.
- Central processing unit (CPU): 10th Gen Intel Core(TM) i5-10400, 2.9 GHz (12 CPUs).
- Memory: 64 GB.
The software configuration used in training:
- Python version 3.11.9, and Visual Studio Code version 1.102.3.
Training hyperparameters (a training sketch follows the list):
- Epochs = 50, imgsz = 640, batch = 16, learning rate = 0.01 and 60 fps.
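For reproducibility, a minimal Ultralytics training call matching these hyperparameters might look as follows; the dataset YAML name and path are assumptions, not the authors' files:

```python
from ultralytics import YOLO

# Load the medium YOLOv8 detection model (pretrained weights).
model = YOLO("yolov8m.pt")

# Train with the hyperparameters reported above.
# "kitti.yaml" is a placeholder for a dataset config listing the
# KITTI train/val/test splits and the two classes (car, pedestrian).
model.train(
    data="kitti.yaml",
    epochs=50,
    imgsz=640,
    batch=16,
    lr0=0.01,  # initial learning rate
)
```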
The mean average precision (mAP) obtained from training is presented in Figure 3.

Figure 3. Average precision: (a) KITTI dataset and (b) COCO dataset
Figure 3 shows the comparative mAP results, revealing a clear advantage for our KITTI-trained model (Figure 3(a)) compared to the COCO reference (Figure 3(b)), with a significant improvement in the score: 4.2% for the pedestrian class and 26% for the car class.
5.2. Object tracking results
Once a high-performance object detector has been obtained, it must be integrated with a tracking algorithm to develop a fully autonomous video surveillance system. For the tracking component, we selected DeepSORT, one of the most widely adopted algorithms in this domain, due to its proven effectiveness in tracking moving objects within video scenes. As previously noted, two modifications were introduced to enhance DeepSORT's performance. To evaluate the impact of these improvements, we compare the tracking metrics obtained using the modified versions against those from the original implementation. The tracking hyperparameters used in both configurations are detailed below (a configuration sketch follows the list):
- max_cosine_distance = 0.3: enforces stricter appearance matching.
- nn_budget = 100: specifies the number of appearance features to cache.
- max_iou_distance = 0.7: sets the bounding box overlap threshold.
- max_age = 60: determines the number of frames a track is retained without updates.
- n_init = 3: indicates the number of detections required to confirm a track.
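As a sketch, the constructor of the deep-sort-realtime package exposes settings with these names (an assumption about tooling: any DeepSORT implementation with equivalent parameters would serve):

```python
from deep_sort_realtime.deepsort_tracker import DeepSort

tracker = DeepSort(
    max_cosine_distance=0.3,  # stricter appearance matching
    nn_budget=100,            # appearance features cached per track
    max_iou_distance=0.7,     # bbox overlap threshold for matching
    max_age=60,               # frames a track survives without updates
    n_init=3,                 # detections needed to confirm a track
)
```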
The α value was chosen empirically from the interval [0,1]. The optimal β value was chosen according to the Benedict-Bordner rule [24], where:
$$\beta = \frac{\alpha^2}{2 - \alpha} \tag{20}$$
After several tests, the couple (α, β) that gave the best results in terms of metrics, and that was chosen to validate the experimental results of our application, is (0.2, 0.022).
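As a quick check, the Benedict-Bordner rule (20) reproduces the retained pair: with α = 0.2, β = 0.2² / (2 − 0.2) = 0.04 / 1.8 ≈ 0.022.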
The real-time performance specifications are: FPS: 30, GPU load: 15%, and latency: 37 ms.
Table 1 and Figure 4 present the performance metrics obtained from the original DeepSORT implementation. According to Table 1 and Figure 4, the scores obtained for the MOTA and HOTA (Figure 4(a)) metrics were 68.359 and 0.68 respectively, with an ID switch count of 434 for the car class, while for the pedestrian class, the MOTA and HOTA (Figure 4(b)) scores were 42.474 and 0.42 respectively, with an ID switch count of 510.
Table 2 and Figure 5 show the different metrics obtained from the implementation of the decoupled DeepSORT.
Table 1. Original DeepSORT metrics
Class      | MOTA   | MOTP   | TP    | FN   | FP   | IDSW | Dets  | GT Dets | IDs | GT IDs
Car        | 68.359 | 68.007 | 20026 | 4044 | 1234 | 434  | 21260 | 24070   | 938 | 564
Pedestrian | 42.474 | 44.682 | 8103  | 3006 | 3393 | 510  | 11496 | 11109   | 472 | 167
Figure 4. Original DeepSORT metrics: (a) car and (b) pedestrian
Table 2. Decoupled DeepSORT metrics
Class      | MOTA   | MOTP   | TP    | FN   | FP   | IDSW       | Dets  | GT Dets | IDs | GT IDs
Car        | 88.238 | 86.772 | 22432 | 1638 | 947  | 381 (-53)  | 23379 | 24070   | 823 | 564
Pedestrian | 67.954 | 79.448 | 8959  | 2150 | 1072 | 370 (-140) | 10031 | 11109   | 308 | 167
Figure 5. Decoupled DeepSORT metrics: (a) car and (b) pedestrian
Compared with the original DeepSORT metrics, we can see that the MOTA and HOTA scores (Figure 5(a)) have improved significantly, by +19.238 (88.238) and +0.07 (0.75) respectively, with a reduction of -53 (381) in ID switches for the car class, while for the pedestrian class, the MOTA and HOTA scores (Figure 5(b)) show an improvement of +25.48 and +0.1 respectively, and a reduction of -140 in ID switches. These gains come from the decoupled DeepSORT, which solves the confusion problem by using two parallel architectures, making it impossible to confuse the YOLO classes.
Table 3 and Figure 6 show the different metrics obtained from the implementation of the αβ filter based decoupled DeepSORT algorithm.
Compared with the original DeepSORT metrics, we can see that the MOTA and HOTA scores (Figure 6(a)) have improved significantly, by +22.023 and +0.1 respectively, with a reduction of -188 in ID switches for the car class, while for the pedestrian class, the MOTA and HOTA scores (Figure 6(b)) show an improvement of +27.073 and +0.12 respectively, and a reduction of -172 in ID switches.
Tables 4 and 5 present a comparison of the metrics obtained with the 3 versions of the DeepSORT algorithm (the original one and the 2 modified versions), for the pedestrian and car classes. As shown in the comparison, all evaluation metrics are improved by the decoupled DeepSORT algorithm compared to the original version. These metrics are further enhanced when the αβ filter is integrated into the decoupled DeepSORT architecture.
In order to generalize the results of our application, we repeated the tests on another evaluation benchmark (the MOTChallenge benchmark [17]). The MOTChallenge focuses on pedestrian tracking only. The results obtained are presented in Figure 7, where Figure 7(a) presents the metrics of the original DeepSORT, Figure 7(b) the metrics obtained from the decoupled DeepSORT, and Figure 7(c) the metrics obtained from the decoupled DeepSORT based on the αβ filter. According to Figure 7 and Tables 6 and 7, we can conclude that the results are consistent with those obtained from the KITTI benchmark [15]. Figure 8 shows some tracking results using the original DeepSORT before and after occlusion, from the KITTI object tracking evaluation dataset [16].
Table 3. Decoupled DeepSORT based αβ filter metrics
Class      | MOTA   | MOTP   | TP    | FN   | FP  | IDSW       | Dets  | GT Dets | IDs | GT IDs
Car        | 90.382 | 87.085 | 22630 | 1440 | 494 | 246 (-188) | 23124 | 24070   | 745 | 564
Pedestrian | 69.547 | 79.434 | 8931  | 2178 | 835 | 338 (-172) | 9766  | 11109   | 256 | 167
Figure 6. Metrics of the αβ filter based decoupled DeepSORT: (a) cars and (b) pedestrians
Table 4. Comparison of the metrics obtained with the 3 versions of the DeepSORT algorithm, for the cars' class
Metric (%) | Original DeepSORT | Decoupled DeepSORT | Decoupled DeepSORT based αβ filter
HOTA       | 68.359 | 74.767 (+6.408) | 77.645 (+9.286)
DetA       | 68.007 | 77.032 (+9.025) | 79.296 (+11.289)
AssA       | 69.093 | 73.074 (+3.981) | 76.502 (+7.409)
DetRe      | 73.712 | 82.754 (+9.042) | 83.742 (+10.03)
DetPr      | 83.455 | 85.2 (+1.745)   | 87.168 (+3.713)
AssRe      | 73.694 | 77.82 (+4.126)  | 80.209 (+6.515)
AssPr      | 85.374 | 86.728 (+1.354) | 88.342 (+2.998)
LocA       | 87.966 | 88.099 (+0.133) | 88.385 (+0.419)
Table 5. Comparison of the metrics obtained with the 3 versions of the DeepSORT algorithm, for the pedestrians' class
Metric (%) | Original DeepSORT | Decoupled DeepSORT | Decoupled DeepSORT based αβ filter
HOTA       | 42.474 | 52.506 (+10.032) | 54.019 (+11.545)
DetA       | 44.682 | 58.606 (+13.924) | 59.142 (+14.46)
AssA       | 40.576 | 47.297 (+6.721)  | 49.589 (+9.013)
DetRe      | 59.805 | 66.285 (+6.78)   | 65.835 (+6.03)
DetPr      | 57.792 | 73.408 (+15.616) | 74.888 (+17.096)
AssRe      | 46.445 | 54.441 (+7.996)  | 56.331 (+9.886)
AssPr      | 65.463 | 68.092 (+2.629)  | 68.928 (+3.465)
LocA       | 81.742 | 82.042 (+0.3)    | 82.067 (+0.325)
Table 6. MOTChallenge benchmark's metrics with the 3 versions of DeepSORT
Algorithm                   | MOTA   | MOTP   | TP    | FN   | FP    | IDSW       | Dets  | GT Dets | IDs | GT IDs
Original DeepSORT           | 33.171 | 76.761 | 31767 | 8138 | 18009 | 512        | 49776 | 39905   | 794 | 500
Decoupled DeepSORT          | 34.399 | 76.746 | 31682 | 8223 | 17681 | 274 (-238) | 49363 | 39905   | 873 | 500
Decoupled DeepSORT based αβ | 34.399 | 76.746 | 31682 | 8223 | 17681 | 274 (-238) | 49363 | 39905   | 873 | 500
In the original DeepSORT algorithm, both classes are tracked simultaneously by the same DeepSORT instance. Occlusion can cause confusion between the two classes, i.e., a pedestrian can be classified as a car and vice versa. This confusion is translated into an ID switch. As shown in Figure 8, the system confuses their identities.