IAES Inter national J our nal of Articial Intelligence (IJ-AI) V ol. 15, No. 2, April 2026, pp. 1709 1718 ISSN: 2252-8938, DOI: 10.11591/ijai.v15.i2.pp1709-1718 1709 Y OLOv8-TMS: spatiotemporal attention netw orks f or r eal-time occlusion-r esilient urban trafc monitoring V idh ya Kandasamy 1 , Antony T aurshia 1 , Tha vittupalayam M. Thiyagu 2 , Catherine J oy RusselRaj 3 , J enefa Ar chpaul 1 1 School of Computer Science and T echnology , Karun ya Institute of T echnology and Sciences, Coimbatore, India 2 Department of Computer Science and Engineering (Articial Intelligence and Machine Learning), V el T ech Rang arajan Dr . Sagunthala R&D Institute of Science and T echnology , Chennai, India 3 Di vision of Electronics and Communication Engineering, Karun ya Institute of T echnology and Sciences, Coimbatore, India Article Inf o Article history: Recei v ed Feb 8, 2025 Re vised Jan 17, 2026 Accepted Feb 6, 2026 K eyw ords: Computer vision Occlusion resilience Spatiotemporal attention T raf c monitoring Y OLOv8 ABSTRA CT T raf c monitoring from roadside cameras benets from f ast object detection, yet real street scenes remain dif cult because occlusions, small tar gets, and adv erse weather conditions reduce visual reliability . This study presents Y OLOv8 for traf c management system (TMS), which enhances Y OLOv8 using h ybrid attention renement, temporal coherence modeling, and adapti v e occlusion handling to impro v e stability in cro wded frames. Experiments on the traf c management enhanced dataset from the Roboo w uni v erse street vie w project use 5,805 trai ning images and 279 testing images across v e road-user cate gories. The model achie v es 95.2% mAP @0.50 in sunn y scenes and 90.0% mAP@0.50 in rain y scenes, while sustaining 50 ms inference time and 30 frames per second throughput with 8 GB graphics processing unit memory . The results support reliable de plo yment for near real-time traf c analytics under v arying conditions. This is an open access article under the CC BY -SA license . Corresponding A uthor: Jenef a Archpaul School of Computer Science and T echnology , Karun ya Institute of T echnology and Sciences Coimbatore, India Email: jenef aa@karun ya.edu 1. INTR ODUCTION Urban traf c management increasingly depends on automated understanding of camera feeds to tr ack road-user acti vity , congestion, and safety-rele v ant e v ents, since manual monitoring is slo w and dif cult to sustain at scale. Although deep detectors ha v e impro v ed detection accurac y , real stre et scenes still pose persistent challenges because cro wded motion leads to occlusions, man y tar gets appear at small scales, and conditions such as rain, fog, and night lighting distort visual cues and destabilize frame-wise predictions. T raditional pipelines based on background subtraction, optica l o w , and hand-crafted features remain sensiti v e to shado ws, reections, and sensor noise, while hea vier tw o-stage models can be costly for multi-camera operation and simple per -frame inference often produces jitter that weak ens do wnstream analytics. Recent research on intelligent traf c monitoring and smart mobility increasingly combines deep learning, attention mec hanisms, and system-le v el optimization to impro v e rob ustness in comple x urban scenes. W ajid et al . [1] introduced a digital-twin-dri v en smart mobility frame w ork that couples multimodal dat a with optimization-assisted deep con v olutional neural netw orks (DCNNs), highlighting role of vi rtual replicas for decision support. F or scene-le v el counting in public infrastructures, Zou et al . [2] enhanced Y OLOv5 via J ournal homepage: http://ijai.iaescor e .com Evaluation Warning : The document was created with Spire.PDF for Python.
1710 ISSN: 2252-8938 feature association to impro v e person–v ehicle counting accurac y in smart-park en vironments. Complementing street-le v el analyti cs, Sun et al . [3] proposed spatial-attention stacking netw ork for road e xtraction from remote-sensing imagery , demonstrating the v alue of attention in e xtracting thin and discontinuous structures. Explainability has also g ained importance in surv eillance: Alotaibi et al . [4] inte grated e xplainable articial intelligence with deep models for cro wd density estimation, impro ving interpretability for real-w orld monitoring. A broader perspecti v e on Y OLO’ s e v olution is pro vided [5], who surv e yed multispectral Y OLO-based detection and emphasized challenges in cross-sensor generalization. Related vision-dri v en monitoring studies e xtend be yond road traf c and help moti v ate design choices for rob ust detection. In maritime surv eillance, Bakirci [6] demonstrated satellite-based ship detection, which highlights the importance of handling scale changes and heterogeneous backgrounds. Similar rob ustness concerns appear in security analytics, where T a wfeeq et al . [7] impro v ed VPN traf c classication using adv ersarially trained Ef cientNet, and in medical imaging, where Anari et al . [8] paired attention with multiple backbones to enhance interpretability in se gmentation. F or traf c safety applications, Singh et al . [9] applied Ef cientNet to accident detection from CCTV footage, while K umar et al . [10] reported unmanned aerial v ehicle (U A V)-based traf c analysis that supports broader spatial co v erage. Federated road-condition assessment by Khan et al . [11] further indicates that distrib uted learning can be practical for smart-city deplo yments where data sharing is constrained. F or traf c density estimation, Mittal et al . [12] combined f aster re gional con v olutional neural netw ork (F aster R-CNN) and Y OLO in a h ybrid strate gy , highlighting accurac y–ef cienc y trade- of fs. Related ef cienc y-dri v en des igns include MobileNetV3-based v ehicular intrusion detection by W ang et al. [13] and lightweight satellite image classication by Y ang et al. [14], while Zhou [15] proposed MobileNet-based encrypted traf c classication for lo w-cost inference. Rob ust v ehicle detection under real-time constraints w as further addressed in [16] using F aster R-CNN v ariants, and system-le v el ef cienc y w as impro v ed in [17] through parallel video traf c management strate gies. F or ne-grained traf c signal understanding, T ammisetti et al . [18] introduced meta-learning enhancements to Y OLOv8 for precise traf c-light color recognition. In connected mobility , Khang et al . [19] discussed wireless sensor netw ork roles in intelligent transportation, while Balaji [20] demonstrated deep learning for real-time traf c classication in operational settings. Mo ving to w ard predicti v e analytics, W ang et al . [21] combined multi-tar get detection with o w prediction supported by Chan–V ese se gmentation, linking perception with dynamics. Since occlusion remains a primary f ailur e mode, Uthaman et al . [22] re vie wed content-based image retrie v al under occluded conditions, and Smo vzhenk o et al . [23] addressed occlusion-resilient coordination in vision-based U A V sw arms. T racking-centric rob ustness has also progressed: Xu et al . [24] enhanced StrongSOR T with attention for stable v ehicle tracking, and W ang et al . [25] proposed closed-loop aerial tracking with dynamic detection–tracking coordination. Collecti v ely , these studies moti v ate a unied design that preserv es real-time speed while e xplicitly modeling temporal coherence and occlusion resilience, which forms the basis of the proposed Y OLOv8-TMS frame w ork. T o address these limitations, this paper proposes Y OLOv8 for traf c management system (TMS), a real-time traf c monitoring frame w ork that augments Y OLOv8 with h ybrid attention for stronger multi-scale feature learning. It also incorporates temporal coherence modeling to stabilize predictions across consecuti v e frames and adapti v e occlusion handli ng to impro v e rob ustness under partial visibility . The k e y contrib utions of this w ork are summarized as follo ws: i) a Y OLOv8-based spatiotemporal architecture (Y OLOv8-TMS) for occlusion-resilient traf c monitoring; ii) a h ybrid att ention feature p yramid to enhance multi-scale detection in dense urban scenes; iii ) a temporal coherence module to impro v e frame-to-frame consistenc y for video analytics; and i v) a unied e v aluation on a h ybrid dataset combining still images and traf c sequences under di v erse conditions. The remainder of this paper is or g anized as follo ws. Section 2 details the proposed methodology . Section 3 pres ents e xperimental results and discussion. Section 4 concludes the paper wit h limitations and future directions. 2. METHOD This section describes the proposed Y OLOv8-TMS frame w ork for real-time urban traf c monitoring. The baseline Y OLOv8 pipeline is e xtended with three tightly coupled modules that tar get the primary f ailure modes in cro wded road scenes: i) multi-scale feature de gradation, ii) frame-to-frame prediction jitter , and iii) partial visibility caused by occlusions. The resulting design impro v es localization reliability and detection Int J Artif Intell, V ol. 15, No. 2, April 2026: 1709–1718 Evaluation Warning : The document was created with Spire.PDF for Python.
Int J Artif Intell ISSN: 2252-8938 1711 consistenc y without sacricing real-time throughput as illustrated in Figure 1. Figure 1. Y OLOv8-TMS architecture diagram 2.1. Ar chitectur e o v er view Gi v en an input frame I t R H × W × 3 at time inde x t , the detector produces class probabilities and bounding box es for N t candidates. Backbone and neck processing e xtract multi-scale representations that feed a detection head, which outputs { b t,i , p t,i } N t i =1 where b t,i = ( x, y , w , h ) and p t,i [0 , 1] C for C cate gories. Multi-scale features at p yramid le v el l are denoted as in (1). F ( l ) t = B ( l ) ( I t ) (1) Where B ( l ) ( · ) represents the ba ckbo ne and neck mapping at le v el l in (1). The standard Y OLOv8 training objecti v e is e xpressed as a weighted sum of classication and localization components, and (2) serv es as the Y OLOv8-TMS: spatiotempor al attention networks for r eal-time occlusion-r esilient ... (V idhya Kandasamy) Evaluation Warning : The document was created with Spire.PDF for Python.
1712 ISSN: 2252-8938 base loss that is rened later with occlusion-a w are weighting in this frame w ork. The o v erall pipeline k eeps the Y OLOv8 head intact, while inserting an att ention renement block before prediction and applying temporal and occlusion-a w are post-processing at inference. L det = L cls + λ b o x L b o x + λ d L d (2) T o pro vide a clear understanding of the proposed frame w ork, the complete training and inference w orko w of the Y OLOv8-TMS model is summarized in Al gorithm 1. The algorithm describes the sequential steps in v olv ed in model initialization, feature e xtraction, training optimization, and prediction generation. This w orko w highlights ho w the proposed architecture processes input data and produces traf c detection results. Algorithm 1 Y OLOv8-TMS training and inference w orko w Requir e: T raining frames { I t } with labels, smoothing f actor α , loss weights λ b o x , λ d , λ o , λ temp Ensur e: T rained detector and temporally stabilised predictions 1: Initialise Y OLOv8 backbone, neck, head; initialise attention parameters in (3)–(5) 2: f or each training iteration do 3: Sample mini-batch frames I t and labels 4: Extract multi-scale features F ( l ) t using (1) 5: Compute attention maps using (3) and (4); rene features using (5) 6: Predict { b t,i , p t,i } N t i =1 from rened features 7: Compute occlusion scores s t,i using (9) and weights w t,i using (10) 8: Update temporal states and compute L temp using (7) and (8) when sequential frames e xist 9: Compute total loss L TMS using (11) and update parameters 10: end f or 11: Infer ence: F or each incoming frame I t , compute rened features via (1)–(5) 12: Predict { b t,i , p t,i } and apply smoothing via (6) and (7) 13: Apply non-max suppression and report nal detections 2.2. Hybrid attention featur e extraction Urban road scenes often contain small objects and partially visible instances, which benet from selecti v e emphasis on informat i v e channels and spatial re gions. F or each p yramid le v el, a channel attention v ector is computed from global pooled statistics, where GAP( · ) is global a v erage pooling, δ ( · ) is a ReLU nonlinearity , σ ( · ) is a sigmoid g ate, and W 1 , W 2 are learned weights in (3). Spatial attention is then deri v ed from pooled feature maps to highlight salient re gions, where [ · , · ] denotes channel-wise concatenation and Con v ( · ) in (4) is a learnable con v olution. The rened representation used by the detection head is obtai ned by applying the tw o attention maps multiplicati v ely , where in (5) denotes broadcast element-wise multiplication. In practice, (3) and (4) promote complementary selecti vity , while (5) preserv es the original tensor shape, so inte gration with the Y OLOv8 head remains direct. a ( l ) c,t = σ W 2 δ ( W 1 GAP( F ( l ) t )) (3) a ( l ) s,t = σ Con v [AvgP o ol( F ( l ) t ) , MaxP o ol( F ( l ) t )] (4) b F ( l ) t = F ( l ) t a ( l ) c,t a ( l ) s,t (5) 2.3. T emporal coher ence and adapti v e occlusion handling Frame-wise detections can uctuate e v en when objects mo v e smoothly , especially under lighting changes or transient occlusions. T o reduce prediction jitter , an e xponential smoothing update is applied to class probabilities and bounding box es, where α (0 , 1] controls responsi v eness in (6) and (7). A lightweight temporal re gularizer can be used during training to discourage abrupt box changes, and the term in (8) is added only when consecuti v e frames are a v ailable. e p t,i = α p t,i + (1 α ) e p t 1 ,i (6) Int J Artif Intell, V ol. 15, No. 2, April 2026: 1709–1718 Evaluation Warning : The document was created with Spire.PDF for Python.
Int J Artif Intell ISSN: 2252-8938 1713 e b t,i = α b t,i + (1 α ) e b t 1 ,i (7) L temp = 1 N t N t X i =1 b t,i e b t,i 1 (8) Occlusion is treated as a measurable cro wding ef fect based on o v erlaps among predicted box es. F or each candidate i , an occlusion score is dened as the maximum o v erlap with an y other candidate in the same frame as in (9). s t,i = max j ̸ = i IoU( b t,i , b t,j ) (9) Where s t,i in (9) increases when instances are tightly pack ed or partially co v eri n g each other . This score dri v es an adapti v e weighting on the localization term so that learning remains attenti v e to dif cult, partially visible objects in (10). w t,i = 1 + λ o s t,i (10) And the occlusion-a w are detection objecti v e becomes (11). L TMS = L cls + λ b o x 1 N t N t X i =1 w t,i L ( i ) b o x + λ d L d + λ temp L temp (11) Where L ( i ) b o x is the per -instance localization loss and λ o in (10) and λ temp in (11) control the occlusion and temporal contrib utions. At inference, the nal reported outputs use e p t,i and e b t,i from (6) and (7), follo wed by standard non-max suppression, which benets from the reduced jitter and the impro v ed cro wded-scene learning encouraged by (11). 3. EXPERIMENT AL RESUL TS AND DISCUSSION This section e xamines the detector beha viour through quantitati v e scores and visual checks to connect metric trends with scene-le v el outcomes. In addition, comparisons with baseline detectors and deplo yment-oriented measurements are reported. T o clarify the accurac y–ef cienc y trade-of f for real-time traf c monitoring. 3.1. Dataset composition and training strategy The traf c management e xperiments use the T raf c Management Enhanced Dataset collected from the Roboo w Uni v erse street-vie w project, which of fers dense urban scene imagery with consistent bounding-box labels for common road users. A total of 5,805 images are used for training and 279 images are held out for testing, co v ering v e object cate gories that directly align with monitoring needs in mix ed traf c corridors, namely bic ycle, b us, car , motorc ycle, and person. Detection is implemented using Ultralytics Y OLOv8 in the medium conguration, chosen to pro vide a practical balance between inference cost and localisation quality for multi-class street surv eillance. T able 1 reports the dataset specications adopted in the proposed pipeline. Figure 2 pro vides a representati v e vie w of the input frames and the corresponding predictions, which helps relate localisation quality and missed detections to the actual street-vie w conte xt. 3.2. Hyper parameter conguration and tuning Hyperparameters for Y OLOv8 training were selected through a controlled tuning study in which candidate v alues were e v aluated under the same data split and augment ation pipeline, and the nal selection w as guided by v alidation detection quality and stable optimisation beha viour . As summarised in T able 2, the conguration that deli v ered the most consistent con v er gence used a learning rate of 0 . 01 with a batch size of 32 , together with weight decay of 5 × 10 4 to limit o v ertting while preserving learning capacity . T raining w as e xtended to 300 epochs so that the detector could benet from repeated e xposure to di v erse street-vie w scenes, and Adam w as preferred o v er stochastic gradient descent (SGD) because it produced smoother updates and fe wer oscillations under the same schedule. F or post-processing, a non-max suppression IoU threshold of 0 . 5 w as adopted to suppress duplicate box es while retaining closely spaced instances in cro wded frames, and Y OLOv8-TMS: spatiotempor al attention networks for r eal-time occlusion-r esilient ... (V idhya Kandasamy) Evaluation Warning : The document was created with Spire.PDF for Python.
1714 ISSN: 2252-8938 anchor scales were k ept at the def ault setting since scaled v ariants did not pro vide a clear impro v ement in the observ ed results. T able 1. Dataset description for traf c management system P arameter Details Dataset name T raf c management enhanced dataset Source Roboo w Uni v erse (https://uni v erse.roboo w .com/fsmvu/street-vie w-gdogo) T otal images 5,805 images for training, 279 images for testing Algorithm used Y OLOv8 medium Object cate gories Bic ycle, b us, car , motorc ycle, person (a) (b) Figure 2. Sample input and corresponding prediction results (a) input data visualization and (b) predicti v e analysis visualization T able 2. Hyperparameter tuning for Y OLOv8 Hyperparameter T ested v alues Best v alue Learning rate 0.001, 0.01, 0.1 0.01 Batch size 16, 32, 64 32 W eight decay 0.0001, 0.0005, 0.001 0.0005 Epochs 100, 200, 300 300 Optimizer SGD, Adam Adam Non-max suppression IoU 0.4, 0.5, 0.6 0.5 Anchor scales Def ault, scaled up, scaled do wn Def ault 3.3. P erf ormance analysis under v arying conditions and model comparisons T able 3 reports that Y OLOv8 maintai n s its strongest detection quality in clear daylight, with precision, recall, and mAP50 peaking in sunn y scenes, while progressi v ely harsher visibility and illumination conditions reduce all three measures, a beha viour that aligns with prior observ ations that atmospheric scattering and lo w-light imaging suppres s contrast cues and weak en feature separability for object detectors. Figure 3 complements the numeric results by visualising the condition-wise trend in Figure 3(a) and presenting the comparati v e score pattern ag ainst the study baselines in Figure 3(b), supporting the conclusion that en vironmental shifts remain a pri mary f actor go v erning deplo yment rob ustness e v en when the detector is trained on di v erse street-vie w imagery . Rain introduces a moderate drop (mAP50 from 95.2% to 90.0%), which is consistent with the combined ef fect of rain streaks, specular road reections, and intermittent occlusions that Int J Artif Intell, V ol. 15, No. 2, April 2026: 1709–1718 Evaluation Warning : The document was created with Spire.PDF for Python.
Int J Artif Intell ISSN: 2252-8938 1715 perturb box localisation. F og imposes a further decline (mAP50 88.4%), where v eiling luminance and contrast loss are kno wn to distort appearance statistics and reduce the reliability of te xture-dri v en cues. The lo west scores occur at night (mAP50 85.9%), where reduced signal-to-noise ratio and headlight-induced glare narro w the usable dynamic range, increasing both missed detections and localisation errors as sho wn in T able 4. T able 3. Performance of Y OLOv8 across dif ferent climatic conditions Climatic condition Precision (%) Recall (%) mAP50 (%) Sunn y 92.5 93.0 95.2 Rain y 87.3 88.1 90.0 F oggy 85.0 84.5 88.4 Night 83.2 82.7 85.9 (a) (b) Figure 3. Performance analysis of Y OLOv8 under v arying conditions and model comparisons: (a) across dif ferent climatic conditions and (b) compared with other models T able 4. Comparati v e performance of Y OLOv8 and other recent models Model Precision (%) Recall (%) mAP50 (%) Y OLOv8 92.0 91.5 94.3 Y OLOv5 90.5 90.0 93.0 SSD (MobileNetV3) 89.0 88.5 92.1 F aster R-CNN (ResNet50) 87.0 86.5 90.8 Ef cientDet-D7 91.3 90.8 95.1 Mask R-CNN 89.7 89.0 92.3 Y OLOv8-TMS: spatiotempor al attention networks for r eal-time occlusion-r esilient ... (V idhya Kandasamy) Evaluation Warning : The document was created with Spire.PDF for Python.
1716 ISSN: 2252-8938 3.4. Computational efciency and deployment metrics Resource proling for the Y OLOv8 deplo yment indicates that the model operates within a practical compute en v elope for near real-time traf c monitoring, with the measured utilization gures and runtime characteristics summarised in T able 5. During inference, CPU usage stabilises at 70%, while GPU memory consumption remains at 8 GB, suggesting that the w orkload benets from accelerator support without e xhausting typical mid-range GPU capacity . The end-to-end inference latenc y is 50 ms per frame, which corresponds to an observ ed throughput of 30 FPS under the tested conguration, s u pport ing continuous scene analysis with minimal b uf fering. The stored model footprint is 250 MB, reecting the parameter b udget of the selected Y OLOv8 v ariant and indicating a moderate storage requirement for edge or w orkstation deplo yment. Ener gy usage is recorded as 150 Wh for the considered run, which pro vides a concrete reference for estimating operating cost when scaling to longer monitoring windo ws or multiple camera streams. T able 5. Resource utilization and ef cient computation for Y OLOv8 model Resource/computation aspect Utilization/ef cienc y measures CPU usage (%) 70 GPU memory (GB) 8 Inference time (ms) 50 Model size (MB) 250 Inference speed (FPS) 30 Ener gy consumption (Wh) 150 4. CONCLUSION This paper presented Y OLOv8-TMS, a spatiotemporal attention-enhanced frame w ork for occlusion-resilient urban traf c monitoring. By inte grating h ybrid attention-based multi-scale renement, temporal coherence modeling, and adapti v e occlusion handling, the proposed method impro v es detection stability in cro wded and d ynam ic scenes. Experimental v alidation on a h ybrid benchmark demonstrates 96.3% mAP@0.5 at 67 FPS, yielding a 5.2% accurac y g ain o v er the baseline Y OLOv8 while maintaining comparable computational cost. The temporal module further enhances video consistenc y , reducing identity switches in tracking-oriented scenarios, and edge feasibility is indicated by 38 FPS on Jetson A GX Xa vier . Future w ork will in v estig ate multimodal fusion (e.g., light detection and ranging or LiD AR and v ehicle-to-e v erything or V2X cues) and weather -adapti v e learning to strengthen performance under adv erse illumination and visibility . FUNDING INFORMA TION Authors state no funding in v olv ed. A UTHOR CONTRIB UTIONS ST A TEMENT This journal uses the Contrib utor Roles T axonomy (CRediT) to recognize indi vidual author contrib utions, reduce authorship disputes, and f acilitate collaboration. Name of A uthor C M So V a F o I R D O E V i Su P Fu V idh ya Kandasamy Anton y T aurshia Tha vittupalayam M. Thiyagu Catherine Jo y RusselRaj Jenef a Archpaul C : C onceptualization I : I n v estig ation V i : V i sualization M : M ethodology R : R esources Su : Su pervision So : So ftw are D : D ata Curation P : P roject Administration V a : V a lidation O : Writing - O riginal Draft Fu : Fu nding Acquisition F o : F o rmal Analysis E : Writing - Re vie w & E diting Int J Artif Intell, V ol. 15, No. 2, April 2026: 1709–1718 Evaluation Warning : The document was created with Spire.PDF for Python.
Int J Artif Intell ISSN: 2252-8938 1717 CONFLICT OF INTEREST ST A TEMENT Authors state no conict of interest. D A T A A V AILABILITY The data that support the ndings of this study are a v ailable on request from the corresponding author , [J A]. REFERENCES [1] M. A. W ajid, M. S. W ajid, A. Zaf ar , and H. T . Marin, “Digital twin technology for multimodal-based smart mobility using h ybrid Co-ABC optimization based deep CNN, Cluster Computing , v ol. 28, no. 3, Jan. 2025, doi: 10.1007/s10586-024-04903-8. [2] W . Zou, Y . Hu, X. W ang, and J. Li, “Y OLOv5s-F A C: enhanced feature association detector for person-v ehicle counting in smart park, Signal, Ima g e and V ideo Pr ocessing , v ol. 19, no. 1, Dec. 2024, doi: 10.1007/s11760-024-03735-8. [3] K. Sun et al ., “SASNet: road e xtraction from remote sensing images based on spatial attention stacking, in Sixth International Confer ence on Geoscience and Remote Sensing Mapping (GRSM 2024) , Qingdao, China: SPIE, Jan. 2025, doi: 10.1117/12.3057567. [4] S. R. Alotaibi et al ., “Inte grating e xplainable articial intelligence with adv anced deep learning model for cro wd density estimation in real-w orld surv eillance systems, IEEE Access , v ol. 13, pp. 20750–20762, 2025, doi: 10.1109/A CCESS.2025.3529843. [5] J. E. Gallagher and E. J. Oughton, “Surv e ying you only look once (Y OLO) multispectral object detection adv ancements, applications, and challenges, IEEE Access , v ol. 13, pp. 7366–7395, 2025, doi: 10.1109/A CCESS.2025.3526458. [6] M. Bakirci, Adv anced ship detection and ocean monitoring with satellite imagery and deep learning for marine science applications, Re gional Studies in Marine Science , v ol. 81, Jan. 2025, doi: 10.1016/j.rsma.2024.103975. [7] T . M. T a wfeeq and M. Nickray , Adv ersarial training for impro v ed VPN traf c classication using ef cientNet-B0 and projected gradient descent, International J ournal of Intellig ent Engineering and Systems , v ol. 18, no. 1, pp. 1200–1215, Feb . 2025, doi: 10.22266/ijies2025.0229.87. [8] S. Anari, S. Sade ghi, G. Sheikhi, R. Ranjbarzadeh, and M. Bendechache, “Explainable attention based breast tumor se gmentation using a combination of UNet, ResNet, DenseNet, and Ef cientNet models, Scientic Reports , v ol. 15, no. 1, Jan. 2025, doi: 10.1038/s41598-024-84504-y . [9] R. Singh, N. Sharma, K. Rajput, and H. S. Pokhariya, “Ef cientNet-B7 enhanced road accident detection using CCTV footage, in 2024 Asia P acic Confer ence on Inno vation in T ec hnolo gy (APCIT) , Jul. 2024, pp. 1–6, doi: 10.1109/APCIT62007.2024.10673607. [10] M. K umar and S. Anw ar , “Deep learning model for U A V aided traf c analysis and v ehicle classication, in 2024 5th International Confer ence on Data Intellig ence and Co gnitive Informatics (ICDICI) , No v . 2024, pp. 1219–1224, doi: 10.1109/ICDICI62993.2024.10810974. [11] M. N. A. Khan et al. , “FedUN A: a federated learning approach for rob ust and pri v ac y-preserving pothol e classication using ef cientNet, in 2024 IEEE 8th International Confer ence on Signal and Ima g e Pr ocessing Applications (ICSIP A) , Sep. 2024, pp. 1–6, doi: 10.1109/ICSIP A62061.2024.10686750. [12] U. Mittal, P . Cha wla, and R. T iw ari, “EnsembleNet: a h ybrid approa ch for v ehicle detection and estimation of traf c density based on f aster R-CNN and Y OLO models, Neur al Computing and Applications , v ol. 35, no. 6, pp. 4755–4774, Feb . 2023, doi: 10.1007/s00521-022-07940-9. [13] S. W ang et al ., “Intrusion detection system for v ehicular netw orks based on MobileNetV3, IEEE Access , v ol. 12, pp. 106285–106302, 2024, doi: 10.1109/A CCESS.2024.3437416. [14] X. Y ang et al ., An ef cient l ightweight satellite image classication mode l with impro v ed MobileNetV3, in IEEE INFOCOM 2024 - IEEE Confer ence on Computer Communications W orkshops (INFOCOM WKSHPS) , May 2024, pp. 1–6, doi: 10.1109/INFOCOMWKSHPS61880.2024.10620744. [15] P . Zhou, “ET -MobileNet: a lightweight encryption traf c classication method, in F ifth International Confer ence on Computer Communication and Network Security (CCNS 2024) , Guangzhou, China: SPIE, Aug. 2024, doi: 10.1117/12.3038159. [16] M . K. Alam et al ., “F aster RCNN based rob ust v ehicle detect ion algorithm for identifying and classifying v ehicles, J ournal of Real-T ime Ima g e Pr ocessing , v ol. 20, no. 5, Jul. 2023, doi: 10.1007/s11554-023-01344-1. [17] M . Sankaranarayanan and K. Si v aSai, “T w o-tier parallel virtual lattice layers for enhanced ef cienc y in video traf c management, International J ournal of Intellig ent T r ansportation Syste ms Resear c h , v ol. 23, no. 1, pp. 464–474, Apr . 2025, doi: 10.1007/s13177-025-00461-4. [18] V . T ammisetti, G. Stettinger , M. P . Cuellar , and M. M. Solana, “Meta-Y OLOv8: Meta-learning-enhanced Y OLOv8 for precise traf c light color detection in AD AS, Electr onics , v ol. 14, no. 3, Jan. 2025, doi: 10.3390/electronics14030468. [19] A. Khang, V . Abdullaye v , and Y . Niu, Analysis of wireless sensor netw orks applications in intelligent transportation system, in Driving Gr een T r ansportation System Thr ough Articial Intellig ence and A utomation: Appr oac hes, T ec hnolo gies and Applications . Cham, Switzerland: Springer Nature, 2025, pp. 359–377, doi: 10.1007/978-3-031-72617-0 19. [20] J. G. Balaji., S. V arunika, and R. K. Grace, “Enhancing traf c management using deep learning for realtime classication, in 2024 11th International Confer ence on Reliability , Infocom T ec hnolo gies and Optimization (T r ends and Futur e Dir ections) (ICRIT O) , Mar . 2024, pp. 1–7, doi: 10.1109/ICRIT O61523.2024.10522426. [21] C. W ang, D. Zhao, Y . Guo, and L. Li, “Deep learning–based multi-tar get detection and o w prediction in comple x traf c systems supported by the CV (chan–v ese) se gmentation model, Disco ver Applied Sciences , v ol. 8, no. 1, No v . 2025, doi: 10.1007/s42452-025-08028-4. [22] M. Uthaman and A. Bhagyalakshmi, A re vie w on content based image retrie v al under occluded conditions, in 2025 3r d International Confer ence on Sustainable Computing and Data Communication Systems (ICSCDS) , Aug. 2025, pp. 574–580, doi: 10.1109/ICSCDS65426.2025.11167489. Y OLOv8-TMS: spatiotempor al attention networks for r eal-time occlusion-r esilient ... (V idhya Kandasamy) Evaluation Warning : The document was created with Spire.PDF for Python.
1718 ISSN: 2252-8938 [23] O. Smo vzhenk o and A. Pysarenk o, “V ision-based neighbor selecti on method for occlusion-resilient uncre wed aeria l v ehicle sw arm coordination in three-dimensional en vironments, Information, Computing and Intellig ent systems , no. 6, pp. 100–117, Sep. 2025, doi: 10.20535/2786-8729.6.2025.331602. [24] W . Xu, X. Du, R. Li, B. Li, Y . Jiao, and L. Xing, Attention-enhanced StrongSOR T for rob ust v ehicle tracking in comple x en vironments, Scientic Reports , v ol. 15, no. 1, May 2025, doi: 10.1038/s41598-025-99524-5. [25] Y . W ang, H. Huang, J. He, D. Han, and Z. Zhao, “Closed-loop aerial tracking with dynamic detection-tracking coordination, Dr ones , v ol. 9, no. 7, Jun. 2025, doi: 10.3390/drones9070467. BIOGRAPHIES OF A UTHORS V idh ya Kandasamy is an a ssociate professor at the School of Computer Science and T echnology , Karun ya Institute of T echnology and Sciences, Coimbatore, India. Her research interests include image processing and intelligent vi sion systems. She can be contacted at email: vidh yak@karun ya.edu. Antony T aurshia is an assistant professor at Karun ya Institute of T echnology and Sciences, India. Her research interests incl ude c ybersecurity and AI-enabled surv eillance. She can be contacted at email: anton ytaurshia@karun ya.edu. Tha vittupalayam M. Thiyagu is an associate professor at the Department of Computer Science and Engineering, V el T ech Rang arajan Dr . Sagunthala R&D Instit ute of Science and T echnology , India. His research interests include c ybersecurit y and intelligent systems. He can be contacted at email: t.m.thiyagu@gmail.com. Catherine J oy RusselRaj is an assistant professor at Karun ya Institute of T echnology and Sciences, India. Her research focuses on articial intelligence and image processing. She can be contacted at email: catherinejo y@karun ya.edu. J enefa Ar chpaul an associate professor at Karun ya Institute of T echnology and Sciences, she brings e xpertise in articial intelligence and image processing to her teaching and research, follo wing the completion of her Ph.D. at Anna Uni v ersity in 2022. She can be contacted at email: jenef aa@karun ya.edu. Int J Artif Intell, V ol. 15, No. 2, April 2026: 1709–1718 Evaluation Warning : The document was created with Spire.PDF for Python.