IAES International Journal of Artificial Intelligence (IJ-AI)
Vol. 15, No. 2, April 2026, pp. 1709∼1718
ISSN: 2252-8938, DOI: 10.11591/ijai.v15.i2.pp1709-1718
❒ 1709
YOLOv8-TMS: spatiotemporal attention networks for real-time occlusion-resilient urban traffic monitoring
Vidhya Kandasamy¹, Antony Taurshia¹, Thavittupalayam M. Thiyagu², Catherine Joy RusselRaj³, Jenefa Archpaul¹
¹School of Computer Science and Technology, Karunya Institute of Technology and Sciences, Coimbatore, India
²Department of Computer Science and Engineering (Artificial Intelligence and Machine Learning), Vel Tech Rangarajan Dr. Sagunthala R&D Institute of Science and Technology, Chennai, India
³Division of Electronics and Communication Engineering, Karunya Institute of Technology and Sciences, Coimbatore, India
Article Info

Article history:
Received Feb 8, 2025
Revised Jan 17, 2026
Accepted Feb 6, 2026

Keywords:
Computer vision
Occlusion resilience
Spatiotemporal attention
Traffic monitoring
YOLOv8

ABSTRACT
Traffic monitoring from roadside cameras benefits from fast object detection, yet real street scenes remain difficult because occlusions, small targets, and adverse weather conditions reduce visual reliability. This study presents YOLOv8 for traffic management system (TMS), which enhances YOLOv8 using hybrid attention refinement, temporal coherence modeling, and adaptive occlusion handling to improve stability in crowded frames. Experiments on the traffic management enhanced dataset from the Roboflow Universe street-view project use 5,805 training images and 279 testing images across five road-user categories. The model achieves 95.2% mAP@0.50 in sunny scenes and 90.0% mAP@0.50 in rainy scenes, while sustaining 50 ms inference time and 30 frames per second throughput with 8 GB graphics processing unit memory. The results support reliable deployment for near real-time traffic analytics under varying conditions.

This is an open access article under the CC BY-SA license.
Corresponding Author:
Jenefa Archpaul
School of Computer Science and Technology, Karunya Institute of Technology and Sciences
Coimbatore, India
Email: jenefaa@karunya.edu
1. INTRODUCTION
Urban traffic management increasingly depends on automated understanding of camera feeds to track road-user activity, congestion, and safety-relevant events, since manual monitoring is slow and difficult to sustain at scale. Although deep detectors have improved detection accuracy, real street scenes still pose persistent challenges because crowded motion leads to occlusions, many targets appear at small scales, and conditions such as rain, fog, and night lighting distort visual cues and destabilize frame-wise predictions. Traditional pipelines based on background subtraction, optical flow, and hand-crafted features remain sensitive to shadows, reflections, and sensor noise, while heavier two-stage models can be costly for multi-camera operation, and simple per-frame inference often produces jitter that weakens downstream analytics.

Recent research on intelligent traffic monitoring and smart mobility increasingly combines deep learning, attention mechanisms, and system-level optimization to improve robustness in complex urban scenes. Wajid et al. [1] introduced a digital-twin-driven smart mobility framework that couples multimodal data with optimization-assisted deep convolutional neural networks (DCNNs), highlighting the role of virtual replicas for decision support. For scene-level counting in public infrastructures, Zou et al. [2] enhanced YOLOv5 via feature association to improve person–vehicle counting accuracy in smart-park environments.

Journal homepage: http://ijai.iaescore.com

Complementing street-level analytics, Sun et al. [3] proposed a spatial-attention stacking network for road extraction from remote-sensing imagery, demonstrating the value of attention in extracting thin and discontinuous structures. Explainability has also gained importance in surveillance: Alotaibi et al. [4] integrated explainable artificial intelligence with deep models for crowd density estimation, improving interpretability for real-world monitoring. A broader perspective on YOLO's evolution is provided in [5], which surveyed multispectral YOLO-based detection and emphasized challenges in cross-sensor generalization.

Related vision-driven monitoring studies extend beyond road traffic and help motivate design choices for robust detection. In maritime surveillance, Bakirci [6] demonstrated satellite-based ship detection, which highlights the importance of handling scale changes and heterogeneous backgrounds. Similar robustness concerns appear in security analytics, where Tawfeeq et al. [7] improved VPN traffic classification using adversarially trained EfficientNet, and in medical imaging, where Anari et al. [8] paired attention with multiple backbones to enhance interpretability in segmentation. For traffic safety applications, Singh et al. [9] applied EfficientNet to accident detection from CCTV footage, while Kumar et al. [10] reported unmanned aerial vehicle (UAV)-based traffic analysis that supports broader spatial coverage. Federated road-condition assessment by Khan et al. [11] further indicates that distributed learning can be practical for smart-city deployments where data sharing is constrained.

For traffic density estimation, Mittal et al. [12] combined faster regional convolutional neural network (Faster R-CNN) and YOLO in a hybrid strategy, highlighting accuracy–efficiency trade-offs. Related efficiency-driven designs include MobileNetV3-based vehicular intrusion detection by Wang et al. [13] and lightweight satellite image classification by Yang et al. [14], while Zhou [15] proposed MobileNet-based encrypted traffic classification for low-cost inference. Robust vehicle detection under real-time constraints was further addressed in [16] using Faster R-CNN variants, and system-level efficiency was improved in [17] through parallel video traffic management strategies. For fine-grained traffic signal understanding, Tammisetti et al. [18] introduced meta-learning enhancements to YOLOv8 for precise traffic-light color recognition. In connected mobility, Khang et al. [19] discussed wireless sensor network roles in intelligent transportation, while Balaji [20] demonstrated deep learning for real-time traffic classification in operational settings.

Moving toward predictive analytics, Wang et al. [21] combined multi-target detection with flow prediction supported by Chan–Vese segmentation, linking perception with dynamics. Since occlusion remains a primary failure mode, Uthaman et al. [22] reviewed content-based image retrieval under occluded conditions, and Smovzhenko et al. [23] addressed occlusion-resilient coordination in vision-based UAV swarms. Tracking-centric robustness has also progressed: Xu et al. [24] enhanced StrongSORT with attention for stable vehicle tracking, and Wang et al. [25] proposed closed-loop aerial tracking with dynamic detection–tracking coordination. Collectively, these studies motivate a unified design that preserves real-time speed while explicitly modeling temporal coherence and occlusion resilience, which forms the basis of the proposed YOLOv8-TMS framework.
To address these limitations, this paper proposes YOLOv8 for traffic management system (TMS), a real-time traffic monitoring framework that augments YOLOv8 with hybrid attention for stronger multi-scale feature learning. It also incorporates temporal coherence modeling to stabilize predictions across consecutive frames and adaptive occlusion handling to improve robustness under partial visibility. The key contributions of this work are summarized as follows: i) a YOLOv8-based spatiotemporal architecture (YOLOv8-TMS) for occlusion-resilient traffic monitoring; ii) a hybrid attention feature pyramid to enhance multi-scale detection in dense urban scenes; iii) a temporal coherence module to improve frame-to-frame consistency for video analytics; and iv) a unified evaluation on a hybrid dataset combining still images and traffic sequences under diverse conditions.

The remainder of this paper is organized as follows. Section 2 details the proposed methodology. Section 3 presents experimental results and discussion. Section 4 concludes the paper with limitations and future directions.
2. METHOD
This section describes the proposed YOLOv8-TMS framework for real-time urban traffic monitoring. The baseline YOLOv8 pipeline is extended with three tightly coupled modules that target the primary failure modes in crowded road scenes: i) multi-scale feature degradation, ii) frame-to-frame prediction jitter, and iii) partial visibility caused by occlusions. The resulting design improves localization reliability and detection consistency without sacrificing real-time throughput, as illustrated in Figure 1.
Figure 1. YOLOv8-TMS architecture diagram
2.1. Architecture overview
Given an input frame I_t ∈ R^{H×W×3} at time index t, the detector produces class probabilities and bounding boxes for N_t candidates. Backbone and neck processing extract multi-scale representations that feed a detection head, which outputs {b_{t,i}, p_{t,i}}_{i=1}^{N_t}, where b_{t,i} = (x, y, w, h) and p_{t,i} ∈ [0, 1]^C for C categories. Multi-scale features at pyramid level l are denoted as in (1).
F_t^{(l)} = B^{(l)}(I_t)    (1)
where B^{(l)}(·) represents the backbone and neck mapping at level l in (1). The standard YOLOv8 training objective is expressed as a weighted sum of classification and localization components, and (2) serves as the base loss that is refined later with occlusion-aware weighting in this framework. The overall pipeline keeps the YOLOv8 head intact, while inserting an attention refinement block before prediction and applying temporal and occlusion-aware post-processing at inference.
L_det = L_cls + λ_box L_box + λ_d L_d    (2)
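As a plain-Python illustration of the weighted sum in (2) (the helper name and the default weight values below are hypothetical placeholders, not values from the paper):

```python
def yolo_base_loss(l_cls, l_box, l_d, lambda_box=7.5, lambda_d=1.5):
    """Combine per-batch scalar losses as in (2): L_det = L_cls + λ_box·L_box + λ_d·L_d.

    The default weights here are illustrative only.
    """
    return l_cls + lambda_box * l_box + lambda_d * l_d

# With λ_box = 7.5 and λ_d = 1.5: 0.40 + 7.5·0.10 + 1.5·0.20 = 1.45
total = yolo_base_loss(l_cls=0.40, l_box=0.10, l_d=0.20)
```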
To provide a clear understanding of the proposed framework, the complete training and inference workflow of the YOLOv8-TMS model is summarized in Algorithm 1. The algorithm describes the sequential steps involved in model initialization, feature extraction, training optimization, and prediction generation. This workflow highlights how the proposed architecture processes input data and produces traffic detection results.
Algorithm 1. YOLOv8-TMS training and inference workflow
Require: Training frames {I_t} with labels, smoothing factor α, loss weights λ_box, λ_d, λ_o, λ_temp
Ensure: Trained detector and temporally stabilised predictions
1: Initialise YOLOv8 backbone, neck, and head; initialise attention parameters in (3)–(5)
2: for each training iteration do
3:   Sample mini-batch frames I_t and labels
4:   Extract multi-scale features F_t^{(l)} using (1)
5:   Compute attention maps using (3) and (4); refine features using (5)
6:   Predict {b_{t,i}, p_{t,i}}_{i=1}^{N_t} from refined features
7:   Compute occlusion scores s_{t,i} using (9) and weights w_{t,i} using (10)
8:   Update temporal states and compute L_temp using (7) and (8) when sequential frames exist
9:   Compute total loss L_TMS using (11) and update parameters
10: end for
11: Inference: for each incoming frame I_t, compute refined features via (1)–(5)
12: Predict {b_{t,i}, p_{t,i}} and apply smoothing via (6) and (7)
13: Apply non-max suppression and report final detections
2.2. Hybrid attention feature extraction
Urban road scenes often contain small objects and partially visible instances, which benefit from selective emphasis on informative channels and spatial regions. For each pyramid level, a channel attention vector is computed from globally pooled statistics, where GAP(·) is global average pooling, δ(·) is a ReLU nonlinearity, σ(·) is a sigmoid gate, and W_1, W_2 are learned weights in (3). Spatial attention is then derived from pooled feature maps to highlight salient regions, where [·, ·] denotes channel-wise concatenation and Conv(·) in (4) is a learnable convolution. The refined representation used by the detection head is obtained by applying the two attention maps multiplicatively, where ⊙ in (5) denotes broadcast element-wise multiplication. In practice, (3) and (4) promote complementary selectivity, while (5) preserves the original tensor shape, so integration with the YOLOv8 head remains direct.
a_{c,t}^{(l)} = σ(W_2 δ(W_1 GAP(F_t^{(l)})))    (3)

a_{s,t}^{(l)} = σ(Conv([AvgPool(F_t^{(l)}), MaxPool(F_t^{(l)})]))    (4)

F̂_t^{(l)} = F_t^{(l)} ⊙ a_{c,t}^{(l)} ⊙ a_{s,t}^{(l)}    (5)
2.3. Temporal coherence and adaptive occlusion handling
Frame-wise detections can fluctuate even when objects move smoothly, especially under lighting changes or transient occlusions. To reduce prediction jitter, an exponential smoothing update is applied to class probabilities and bounding boxes, where α ∈ (0, 1] controls responsiveness in (6) and (7). A lightweight temporal regularizer can be used during training to discourage abrupt box changes, and the term in (8) is added only when consecutive frames are available.
p̃_{t,i} = α p_{t,i} + (1 − α) p̃_{t−1,i}    (6)

b̃_{t,i} = α b_{t,i} + (1 − α) b̃_{t−1,i}    (7)

L_temp = (1/N_t) Σ_{i=1}^{N_t} ∥b_{t,i} − b̃_{t,i}∥_1    (8)
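The updates in (6)–(7) and the regularizer in (8) can be sketched in plain Python (the helper names are illustrative; boxes are (x, y, w, h) lists, and α = 0.5 is an example value, not a tuned setting from the paper):

```python
def ema_update(current, previous, alpha=0.5):
    """Exponential smoothing as in (6)/(7): α in (0, 1] sets responsiveness.

    `current` and `previous` are equal-length vectors, e.g. a box (x, y, w, h)
    or a class-probability vector; α = 1 disables smoothing entirely.
    """
    return [alpha * c + (1.0 - alpha) * p for c, p in zip(current, previous)]

def temporal_l1(raw_boxes, smoothed_boxes):
    """Training regularizer from (8): mean L1 gap between raw and smoothed boxes."""
    n = len(raw_boxes)
    return sum(
        sum(abs(b - s) for b, s in zip(box, sbox))
        for box, sbox in zip(raw_boxes, smoothed_boxes)
    ) / n

# A box that jumps between frames is pulled halfway back toward its history.
prev = [10.0, 10.0, 50.0, 80.0]
curr = [14.0, 10.0, 50.0, 80.0]
smooth = ema_update(curr, prev, alpha=0.5)  # [12.0, 10.0, 50.0, 80.0]
loss = temporal_l1([curr], [smooth])        # 2.0
```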
Occlusion is treated as a measurable crowding effect based on overlaps among predicted boxes. For each candidate i, an occlusion score is defined as the maximum overlap with any other candidate in the same frame, as in (9).
s_{t,i} = max_{j≠i} IoU(b_{t,i}, b_{t,j})    (9)
where s_{t,i} in (9) increases when instances are tightly packed or partially covering each other. This score drives the adaptive weighting on the localization term in (10), so that learning remains attentive to difficult, partially visible objects.
w_{t,i} = 1 + λ_o s_{t,i}    (10)
The occlusion-aware detection objective then becomes (11).
L_TMS = L_cls + λ_box (1/N_t) Σ_{i=1}^{N_t} w_{t,i} L_box^{(i)} + λ_d L_d + λ_temp L_temp    (11)
where L_box^{(i)} is the per-instance localization loss, and λ_o in (10) and λ_temp in (11) control the occlusion and temporal contributions. At inference, the final reported outputs use p̃_{t,i} and b̃_{t,i} from (6) and (7), followed by standard non-max suppression, which benefits from the reduced jitter and the improved crowded-scene learning encouraged by (11).
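The occlusion score of (9) and the weight of (10) can be sketched in plain Python (an assumption for illustration: boxes here are corner-form (x1, y1, x2, y2) tuples for an easy IoU, whereas the paper parameterizes boxes as (x, y, w, h); the function names are hypothetical):

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2) corners."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def occlusion_scores(boxes):
    """s_{t,i} from (9): each box's maximum overlap with any other box in the frame."""
    return [
        max((iou(bi, bj) for j, bj in enumerate(boxes) if j != i), default=0.0)
        for i, bi in enumerate(boxes)
    ]

def occlusion_weights(scores, lambda_o=1.0):
    """w_{t,i} from (10): up-weight crowded, partially covered instances."""
    return [1.0 + lambda_o * s for s in scores]

# Two overlapping vehicles and one isolated pedestrian.
boxes = [(0, 0, 10, 10), (5, 0, 15, 10), (40, 40, 44, 50)]
s = occlusion_scores(boxes)  # overlapping pair scores 1/3 each, isolated box scores 0
w = occlusion_weights(s)     # isolated box keeps weight 1.0
```

Isolated detections keep weight 1, so (10) leaves easy cases untouched and only emphasizes crowded ones.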
3. EXPERIMENTAL RESULTS AND DISCUSSION
This section examines the detector behaviour through quantitative scores and visual checks to connect metric trends with scene-level outcomes. In addition, comparisons with baseline detectors and deployment-oriented measurements are reported to clarify the accuracy–efficiency trade-off for real-time traffic monitoring.
3.1. Dataset composition and training strategy
The traffic management experiments use the Traffic Management Enhanced Dataset collected from the Roboflow Universe street-view project, which offers dense urban scene imagery with consistent bounding-box labels for common road users. A total of 5,805 images are used for training and 279 images are held out for testing, covering five object categories that directly align with monitoring needs in mixed traffic corridors, namely bicycle, bus, car, motorcycle, and person. Detection is implemented using Ultralytics YOLOv8 in the medium configuration, chosen to provide a practical balance between inference cost and localisation quality for multi-class street surveillance. Table 1 reports the dataset specifications adopted in the proposed pipeline. Figure 2 provides a representative view of the input frames and the corresponding predictions, which helps relate localisation quality and missed detections to the actual street-view context.
3.2. Hyperparameter configuration and tuning
Hyperparameters for YOLOv8 training were selected through a controlled tuning study in which candidate values were evaluated under the same data split and augmentation pipeline, and the final selection was guided by validation detection quality and stable optimisation behaviour. As summarised in Table 2, the configuration that delivered the most consistent convergence used a learning rate of 0.01 with a batch size of 32, together with weight decay of 5 × 10⁻⁴ to limit overfitting while preserving learning capacity. Training was extended to 300 epochs so that the detector could benefit from repeated exposure to diverse street-view scenes, and Adam was preferred over stochastic gradient descent (SGD) because it produced smoother updates and fewer oscillations under the same schedule. For post-processing, a non-max suppression IoU threshold of 0.5 was adopted to suppress duplicate boxes while retaining closely spaced instances in crowded frames, and anchor scales were kept at the default setting since scaled variants did not provide a clear improvement in the observed results.
Table 1. Dataset description for traffic management system

Parameter         | Details
------------------|-----------------------------------------------------------
Dataset name      | Traffic management enhanced dataset
Source            | Roboflow Universe (https://universe.roboflow.com/fsmvu/street-view-gdogo)
Total images      | 5,805 images for training, 279 images for testing
Algorithm used    | YOLOv8 – medium
Object categories | Bicycle, bus, car, motorcycle, person
(a)