Indonesian Journal of Electrical Engineering and Computer Science
Vol. 40, No. 2, November 2025, pp. 801-813
ISSN: 2502-4752, DOI: 10.11591/ijeecs.v40.i2.pp801-813

Monocular vision-based visual control for SCARA-type robotic arms: a depth mapping approach

Diego Chambi, Bryan Challco, Jonathan Catari, Walker Aguilar, Lizardo Pari
Electronic Engineering Professional School, Faculty of Production and Services, Universidad Nacional de San Agustín de Arequipa, Arequipa, Perú

Article Info
Article history: Received Jan 22, 2025; Revised Jul 14, 2025; Accepted Oct 14, 2025
Keywords: Computer vision; Robotic arm; SCARA-type robotic arm; Vision transformers; Visual servoing

ABSTRACT
The accelerated growth of an increasingly automated industry requires the use of autonomous robotic systems. However, these systems commonly require an enormous number of sensors. In this paper we evaluate the performance of a new system for visual control of a selective compliance assembly robot arm (SCARA) using a monocular depth map that requires only one monocular camera. This system aims to be an efficient alternative that reduces the number of sensors in the robotic arm area while maintaining the effectiveness of traditional vision algorithms based on stereoscopic camera architectures. For this purpose, the system is compared with representative state-of-the-art vision algorithms focused on the control of robotic arms. The results are statistically analyzed and indicate that the proposed algorithm achieves performance competitive with state-of-the-art robotic arm visual control algorithms while using only a single monocular camera.

This is an open access article under the CC BY-SA license.

Corresponding Author:
Diego Chambi
Electronic Engineering Professional School, Faculty of Production and Services
Universidad Nacional de San Agustín de Arequipa
04001 Arequipa, Perú
Email: dchambitu@unsa.edu.pe

1. INTRODUCTION
Automation has been employed in every industry in recent years. From precision industrial robots to home automation (domotics), automation has taken on an essential role in performing repetitive and dangerous tasks, allowing humans to focus on activities of greater relevance [1]. Among the many systems used in automation, robotic systems are the most widely adopted and offer the broadest range of applications. These systems were first introduced in factories in the 1960s and, by the 1980s, were being used globally, particularly in the automotive sector. Today, robotic systems are found in a wide variety of settings, including small businesses, educational institutions, and agricultural fields [2]-[4]. Robotic arms, in particular, are composed of multiple links and actuators, enabling them to be used in tasks such as painting, pharmaceutical production, and welding in assembly lines [5]-[7]. Each robotic arm is designed and implemented according to the specific requirements of the task it is intended to perform. To achieve this level of adaptability and precision, robotic arms often require a large number of sensors [8]. Consequently, many industries that could benefit from automation hesitate to adopt robotic systems due to the high cost of these sensing components.

Cameras have been widely used in research on the control of robotic arms; achieving this requires adequate processing of the video captured by the camera. In Intisar et al.
[9], the video obtained by a camera is processed to classify different objects by color using a transformation to hue, saturation, and value (HSV).
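As a brief illustration of this color-based selection step (not the code from [9]), the following OpenCV sketch segments an object by hue range; the threshold values are assumptions chosen for a red object.

```python
import cv2
import numpy as np

# Minimal sketch of HSV color segmentation in the spirit of [9];
# the hue/saturation/value thresholds are illustrative assumptions.
frame = cv2.imread("scene.jpg")                    # BGR frame from the camera
hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)       # transform to hue, saturation, value

lower = np.array([0, 120, 70])                     # hypothetical range for a red object
upper = np.array([10, 255, 255])                   # (OpenCV hue spans 0-179)
mask = cv2.inRange(hsv, lower, upper)              # binary mask of matching pixels

contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
if contours:
    target = max(contours, key=cv2.contourArea)    # largest blob of the chosen color
    x, y, w, h = cv2.boundingRect(target)          # region handed to the pick-and-place planner
```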
Then, a robotic arm performs a pick-and-place task on the selected object. Its interface allows the user to select an object and have it automatically manipulated by the robotic arm without the need for extensive knowledge of the system's inner workings. However, this system depends on the objects being placed at the same level, which limits its functionality. In Kumar et al. [10], a stereo camera system generates disparity maps to estimate object location and distance, allowing a three-degree-of-freedom robotic arm to perform pick-and-place tasks. However, it requires precise calibration, synchronization, and computationally intensive algorithms, with added challenges from robotic arm movements in handheld setups. According to Liyanage and Krouglicof [11], visual control for a selective compliance assembly robot arm (SCARA) robot incorporates a high-speed camera with an infrared marker placed at the end effector. Kim et al. [12] highlight a wheelchair-mounted robotic arm that employs stereoscopic cameras along with a coarse-to-fine motion control strategy. As noted in [13], the ARMAR-III robot applies stereo vision combined with stored object orientation data to calculate the full 6D pose of objects relative to their 3D models in real time, supporting advanced scene analysis. A rose pruning robot, described in [14], integrates stereoscopic cameras positioned near the end-effector to minimize interference. Meanwhile, Ranftl et al. [15] discuss a dual robotic arm system that autonomously adjusts the camera's viewpoint to maintain an occlusion-free visual field. Additionally, Urrea and Pascal [16] and Fioravanti et al. [17] describe dual-arm systems using stereoscopic cameras for calibration-free control and accurate distance estimation, respectively. Despite these developments, the computational load, sensitivity to environmental changes, and complexity of calibration make stereo vision-based systems impractical for embedded or low-cost applications. Monocular vision algorithms have become a viable substitute in this regard. For instance, Li et al. [18] introduce a hybrid visual servo system for agricultural harvesting that uses a single camera, and Nicolis et al. [19] investigate the application of vision transformers for improved depth prediction in monocular settings. Although these techniques simplify hardware and allow for more flexible deployment, their integration into robotic control systems remains limited, especially for pick-and-place and absolute distance estimation tasks.

To fill these gaps, our study proposes a visual control system that integrates a SCARA-style robotic arm with monocular depth estimation based on the MiDaS algorithm [20]. Whereas earlier works such as [10], [12], and [13] achieve high precision using stereo vision (e.g., RMSE of 0.49 cm at 15 cm), our method achieves comparable accuracy (RMSE of 0.46 cm) with a single camera, obviating the need for stereo matching and calibration. By making vision-based robotic manipulation more feasible and affordable for embedded systems, where stereo vision has traditionally been too costly and computationally demanding, this method tackles important issues. We use a regression-based metric conversion, motivated by [21], to translate the relative depth given by MiDaS into absolute coordinates for robotic control.
Inverse kinematics and real-time 3D localization are made possible by this transformation. By doing away with the need for stereo cameras, the system achieves high accuracy in robotic tasks while lowering hardware costs and setup complexity. The main contributions of this work are:
- Computational efficiency: the monocular system avoids stereo matching and synchronization overhead [10], [12], enabling its use on low-cost, embedded platforms.
- Precision: an RMSE of 0.46 cm at 15 cm, competitive with traditional stereo vision systems (Table 3), providing a high-precision, affordable solution.
- Robustness: stable performance under varying lighting, surpassing baseline systems such as [12] and making the system more adaptable to real-world conditions.

To the best of our knowledge, this is the first implementation combining i) monocular depth estimation optimized for embedded platforms [20], ii) real-time absolute metric conversion [21], and iii) a low-cost SCARA robotic manipulator manufactured via additive technologies, offering a breakthrough for cost-effective automation in robotics.

The research is organized as follows: section 2 presents a brief review of the algorithm used for visual control, as well as the materials and methods used to validate the proposed algorithm. Section 3 details the results obtained in the distance estimation and approach tests of the robotic gripper to the target. Section 4 discusses the results, highlighting the most relevant observations. Finally, section 5 presents the conclusions and possible lines of future work.
2. METHOD
2.1. Hardware
A 3-DOF SCARA robotic arm was designed and built using additive manufacturing and aluminum rods to validate the proposed vision-based pick-and-place system, given its industrial versatility and ease of control [22], [23]. Previous studies such as [24] have also demonstrated the feasibility of SCARA robots in precision tasks like peg-in-hole assembly, highlighting their suitability for applications requiring accuracy and compliance.

To illustrate the structural and analytical basis of the proposed robotic system, Figure 1 shows the proposed SCARA robotic arm's kinematic model and physical structure. Figure 1(a) presents the kinematic configuration, highlighting the three degrees of freedom (d1, θ2, and θ3) and their associated links (L2, L3) within a Cartesian reference frame. This model is fundamental for deriving both the forward and inverse kinematics. Figure 1(b) shows the CAD rendering of the physical robotic arm, developed through additive manufacturing techniques. This design was optimized for low-cost robotic applications.

The mechanical structure of the SCARA arm was fabricated using PLA for the 3D-printed components and aluminum rods for vertical support. The system is actuated by three NEMA17 stepper motors for planar movements and an MG92R servo motor for the gripper. GT2 pulleys and belts are used for motion transmission, while linear bearings ensure smooth movement. The robot is controlled by a GT2560 board programmed using the Arduino IDE.

The kinematic model of the robotic arm is based on Denavit-Hartenberg (D-H) parameters, which define the spatial relationships between consecutive links. The parameters for each joint are summarized in Table 1. The arm consists of three joints: one prismatic (d1) and two revolute (θ2, θ3). The corresponding link lengths are L2 and L3, and all joint twists are set to zero (α = 0).

Figure 1. Proposed SCARA robotic arm's kinematic model and physical structure: (a) kinematic representation of the robot with articulated parameters in a Cartesian reference system and (b) physical model of the robot showing its structural design under dynamic conditions

Table 1. D-H parameters of the three-DOF SCARA robotic arm
Joint   | θ   | di  | ai  | α
Joint 1 | 0   | d1  | 0   | 0
Joint 2 | θ2  | 0   | L2  | 0
Joint 3 | θ3  | 0   | L3  | 0

The kinematics of a serial-link mechanism can be determined through homogeneous transformation matrices, combining basic rotations and translations for each joint, as described by Corke in [25]. Using the D-H parameters from Table 1, the transformation matrices A1, A2, and A3 are computed. The direct kinematics is obtained by multiplying these matrices:

T3 = A1 · A2 · A3    (1)

The resulting matrix T3 gives the position and orientation of the end effector with respect to the base frame. In its expanded form, the position is a function of the joint angles θ2 and θ3 and the link lengths L1,
L2, and L3. To calculate the joint angle θ3 for object manipulation, the inverse kinematics equation is used:

θ3 = arccos((Px² + Py² − L1² − L2²) / (2·L1·L2))    (2)

For the vision system, Avatec cameras with 720p resolution, a USB interface, and a 30 FPS refresh rate were used to track the position of the object, either separately or in a stereo configuration.

2.2. Software
This subsection details the algorithms necessary to perform the picking task with the SCARA robotic arm. Typically, stereoscopic vision-based systems use tracking algorithms to obtain a disparity between cameras. To represent this type of system, we implement this algorithm using the MIL tracking model, as discussed in [26]. As a second system, the monocular vision depth mapping algorithm is introduced. In this configuration, only one webcam is used along with the MiDaS model, which has been shown to effectively estimate depth from monocular images [27]. The performance of this visual control system is then compared to the conventional stereoscopic camera system. The two systems to be compared are summarized as follows:
- Stereoscopic architecture: an algorithm based on stereoscopic vision using MIL tracking and the disparity algorithm, as outlined in [26].
- Monocular vision: the proposed system uses the MiDaS algorithm based on monocular vision [27].

Once each algorithm detects the position of the object in the three Cartesian coordinates, a third algorithm based on the kinematics of the robotic arm picks up the indicated object. A user interface allows the user to signal the object to be picked up by the gripper for manipulation by the robotic arm, as described in [28] and [29].

2.2.1. Stereoscopic architecture
In this vision mode, a two-camera array in stereo configuration is used. This algorithm is widely used in visual control systems for robotic arms due to its simple operating principle. Usually, an object tracking algorithm is used so that the operator can select, through a user interface, the object on which the robotic arm performs the pick-and-place task. We used the MIL algorithm for this specific case, as it is considered one of the most robust against disturbances in continuous image capture. We use the OpenCV library and the command cv2.TrackerMIL_create(). Once the object is tracked, we obtain its center of mass from the image moments computed with the command cv2.moments(). We then use the disparity algorithm to calculate the distance between the object and the stereo camera array.

Figure 2 graphically shows the disparity obtained from the position difference captured by both cameras. In Figure 2(a), Oc represents the optical centers of the cameras, T is the baseline, and f is the focal length of each lens. The point P is the object in the environment, and Z is the distance we want to calculate. In Figure 2(b), we observe the object as seen by both frames of the stereoscopic camera, where XL and XR are the distances from the reference frame of each camera to the center of mass of the detected object.

Figure 2. Disparity obtained from the position difference captured by both cameras: (a) depth triangulation scheme in stereo vision showing the geometry of the cameras and the observed object and (b) disparity representation in images captured by the left and right cameras to estimate the distance
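A minimal Python sketch of this tracking-plus-triangulation pipeline is shown below; it assumes the focal length f (in pixels) and the baseline T come from a prior calibration, and it is an illustration rather than the exact implementation used in the experiments. The steps it condenses are detailed next.

```python
import cv2

def centroid_x(frame, bbox):
    """Centroid x-coordinate of the tracked object, obtained from the
    image moments (m10/m00) of the tracked region via cv2.moments."""
    x, y, w, h = (int(v) for v in bbox)
    gray = cv2.cvtColor(frame[y:y + h, x:x + w], cv2.COLOR_BGR2GRAY)
    m = cv2.moments(gray)
    return x + m["m10"] / m["m00"]

def stereo_depth(x_left, x_right, f, T):
    """Triangulated distance Z = f*T/disparity (Figure 2).
    f: focal length in pixels; T: baseline between the cameras.
    Zero disparity is clamped to 1 to avoid division by zero."""
    disparity = x_left - x_right
    if disparity == 0:
        disparity = 1
    return abs(f * T / disparity)

# Usage: one MIL tracker per camera, initialized on the object selected
# by the operator (tracker availability depends on the OpenCV build).
tracker_left = cv2.TrackerMIL_create()
tracker_right = cv2.TrackerMIL_create()
```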
To calculate the distance Z to the object using stereo vision, the following steps are carried out. First, the positions XL and XR of the object are extracted from the left and right camera frames, respectively. The disparity is then calculated as the difference between these two positions:

disparity = XL − XR

If the disparity is zero (i.e., the object appears at the same position in both cameras), it is adjusted to a small value (usually 1) to avoid division by zero. The depth, or distance Z, is then computed using the formula:

Z = (f · T) / disparity

where f is the focal length of the cameras and T is the baseline (the distance between the two cameras). The result is the estimated distance Z to the object along the Z axis. The distance Z is always positive, so its absolute value is taken to ensure the result is non-negative.

In this way, the triangulation process determines the 3D coordinates of the object in space by calculating its location on the X and Y axes, along with the approximate distance along the Z axis.

2.2.2. Monocular architecture
For the proposed monocular vision system, we use the MiDaS depth estimation model based on deep learning. Recent work by Smith et al. [30] introduced alternative methods for linear depth estimation from uncalibrated monocular images using polarization cues; however, our approach focuses on transformer-based depth prediction for robotic control applications. MiDaS offers three versions with varying computational demands. To reduce the implementation cost of visual control in industrial robotic arms, we selected the Small version due to its low computational requirements, which makes it suitable for low-power processors.

Figure 3 shows the depth map generated by the MiDaS algorithm and the corresponding top view of the test object. In Figure 3(a), the depth map is visualized with colors that indicate the relative distances of the objects. Figure 3(b) presents the same scene converted to grayscale, highlighting the depth variations more clearly for easier processing by the control system.

Figure 3. Depth map visualized with colors that indicate the relative distances of the objects: (a) MiDaS Small algorithm example and (b) proposed image processing for the monocular architecture

The monocular vision system uses a neural network based on backbones for distance estimation. However, applying it to industrial robotic tasks requires additional signal processing steps, including perspective transformation, noise filtering, and absolute distance estimation from relative measurements. Figure 4 summarizes these sequential steps, which are detailed below.

First, the image from the webcam is captured; this video, obtained from a single camera, presents a "fisheye" effect that spherically distorts the image. To correct this distortion, a perspective transformation is performed using the command cv2.warpPerspective(), which requires selecting four points at the edge of the working area. Once the image has been corrected, the MiDaS depth map algorithm is applied, specifically the Small version. This model is loaded from the PyTorch library using the command midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small"). At this stage, a depth map version of the input image is obtained.
Subsequently, the depth map is normalized using the cv2.normalize() command; for this research, normalization was applied to a range between 1 and 10 to facilitate further data processing. Figure 3(b) shows an example of this normalized depth map in the robotic arm's workspace.
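The pipeline up to this point (capture, perspective correction, MiDaS Small inference, and normalization to the [1, 10] range) can be condensed into the following sketch; the working-area corner points and output resolution are illustrative assumptions rather than the values used in the experiments.

```python
import cv2
import numpy as np
import torch

# Corners of the working area as selected by the operator (illustrative values)
SRC = np.float32([[40, 30], [600, 25], [615, 450], [35, 460]])
DST = np.float32([[0, 0], [640, 0], [640, 480], [0, 480]])
M = cv2.getPerspectiveTransform(SRC, DST)

midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small")   # Small version, as in the text
midas.eval()
transform = torch.hub.load("intel-isl/MiDaS", "transforms").small_transform

def normalized_depth_map(frame):
    """Perspective-correct a camera frame, run MiDaS Small, and normalize
    the relative depth map to the [1, 10] range used in this work."""
    corrected = cv2.warpPerspective(frame, M, (640, 480))
    rgb = cv2.cvtColor(corrected, cv2.COLOR_BGR2RGB)
    with torch.no_grad():
        pred = midas(transform(rgb))
        pred = torch.nn.functional.interpolate(            # resize map to the frame size
            pred.unsqueeze(1), size=rgb.shape[:2],
            mode="bicubic", align_corners=False).squeeze()
    return cv2.normalize(pred.cpu().numpy(), None, 1, 10, cv2.NORM_MINMAX)
```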
Figure 4. Sequential steps for absolute distance estimation

After normalization, spline interpolation is performed to smooth the transitions between pixel values. From the interpolated data, the distance from the center of the tracked object to the camera is calculated. A moving average filter is then applied to stabilize the obtained values over time. Subsequently, the relative distance between the background and the object is determined from the generated depth map. In Masoumian et al. [24], a similar problem is addressed by approximating the absolute distance from the relative measurement using a quadratic function given by:

Y = (c0 + c1·X + c2·X²)·h    (3)

where the coefficients c0, c1, and c2 are obtained using least squares, h is the height at which the camera is located, and X is the relative distance. The same problem is presented in [25] and is solved by finding the optimal curve through least squares. For this, a total of six images at different distances from the camera were used to calibrate the model. Finally, the estimated absolute distance is subtracted from the 35 cm height at which the camera is located to determine the object's height.

The core steps of the monocular vision and distance estimation process are summarized in Algorithm 1. The algorithm follows these steps: first, the image is captured and the perspective distortion is corrected. Then, the depth map is generated and normalized, followed by the distance calculation. Finally, the absolute distance is estimated using a quadratic fitting model.

Algorithm 1 Proposed algorithm
1: procedure PerspectiveCorrection(frame)
2:   P1, P2, P3, P4 ← select four points
3:   Points ← [P1, P2, P3, P4]
4:   if length(Points) = 4 then
5:     new_frame ← cv2.warpPerspective(frame, Points)
6:   end if
7:   frame ← new_frame
8: end procedure
9:
10: procedure DepthMap(frame, img_batch)
11:   Midas ← model_type.MiDaS_small
12:   depth_map ← Midas(img_batch, frame)
13:   depth_map ← depth_map.interpolate(frame)
14:   depth_map ← cv2.normalize(depth_map)
15: end procedure
16:
17: procedure DistanceToCamera(frame, depth_map)
18:   Tracking_algorithm ← MIL
19:   Object ← select.Object
20:   [XC, YC] ← Tracking_algorithm(Object)
21:   Bounding_box ← Tracking_algorithm(Object, frame)
22: end procedure
23:
24: procedure AbsoluteDistanceEstimation(frame, Relative_distance)
25:   x ← [11.8, 10.843, 10.411, 10.2]
26:   y ← [21, 23, 26, 31]
27:   degree ← 2
28:   Quadratic_function ← np.polyfit(x, y, degree)
29:   Distance ← Quadratic_function(Filtered)
30: end procedure

To provide a practical demonstration of the entire process, a video has been included that shows the monocular vision system in action with the SCARA robotic arm. The video illustrates how the steps outlined
in Algorithm 1 are executed, from image capture and perspective correction to depth map generation and object tracking. This visual example helps to clarify the methodology and highlights the system's functionality. The video can be viewed at [31].

Algorithm 2 Robotic arm control
1: procedure Microcontroller(SerialCommunication)
2:   Motor1_Step, Motor1_Dir ← 25, 23
3:   Motor1_Angle ← (200/360) · (62/20)
4:   Motor2_Step, Motor2_Dir ← 31, 33
5:   Motor2_Angle ← (200/360) · (89/20)
6:   MotorZ_Step, MotorZ_Dir ← 37, 39
7:   Motor1_Distance ← 200/1.2
8:   ServoMotor ← Pin 11
9:   [M1, M2, Mz, Servo] ← SerialCommunication
10:  Motor1_Position ← M1 · Motor1_Angle
11:  Motor2_Position ← M2 · Motor2_Angle
12:  MotorZ_Position ← Mz · Motor1_Angle
13:  ServoMotor_Position ← Servo
14: end procedure
15: procedure InverseKinematics(px, py, pz, SpaceButton)
16:   d1 ← pz
17:   Gripper ← 180
18:   θ3 ← arccos((px² + py² − l1² − l2²) / (2·l1·l2))
19:   θ2 ← (l2·(px·sin(θ3) + py·cos(θ3)) + py·l1) / (px² + py²)
20:   data ← [θ2, θ3, d1, Gripper]
21:   if SpaceButton = 1 then
22:     SerialCommunication ← data
23:   end if
24: end procedure

2.2.3. Robotic arm control
Once the object is fixed and its exact position has been obtained through the algorithms detailed above, the inverse kinematics of the robotic arm is used so that the arm reaches the object and picks it up. In Algorithm 2, the first procedure corresponds to the routine implemented in the microcontroller of the robotic arm, which is in charge of receiving, through serial communication, the angles that each motor must travel; for this, we must convert between motor steps and the required angle, taking into account the teeth of the motor gear and of the pulley on the corresponding link. Within this microcontroller procedure, we also need to name the pins connected to the motors, obtained from the GT2560 board. For the motor that raises or lowers the robotic arm along the Z axis, the transformation reduces to:

AngleToSteps = BeltTeeth / GearTeeth

The second procedure in the algorithm represents the inverse kinematics, which is executed on the computer that has serial communication with the robot; this calculation is given by the equations derived in the Hardware subsection. Finally, a conditional waits for the operator's indication, given by pressing the space key, for the angles to be sent by serial communication to the microcontroller and executed by the robotic arm.
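The host-side computation can be sketched in Python as follows. The θ3 expression matches Eq. (2); for θ2 the sketch uses the standard planar two-link atan2 formulation, a common equivalent of the expression stated in Algorithm 2. The link lengths and serial port name are illustrative assumptions.

```python
import math
import serial  # pyserial; the port name below is an assumption

def inverse_kinematics(px, py, pz, l1, l2):
    """Joint values for a target point. theta3 follows Eq. (2); theta2 uses
    the standard planar two-link atan2 form (an equivalent alternative to
    the expression in Algorithm 2). d1 is simply the prismatic Z value."""
    c3 = (px**2 + py**2 - l1**2 - l2**2) / (2 * l1 * l2)
    c3 = max(-1.0, min(1.0, c3))          # clamp against rounding error
    theta3 = math.acos(c3)
    theta2 = math.atan2(py, px) - math.atan2(l2 * math.sin(theta3),
                                             l1 + l2 * math.cos(theta3))
    return math.degrees(theta2), math.degrees(theta3), pz

# On the operator's confirmation (space key), the angles are sent over serial
# to the controller board, mirroring the conditional at the end of Algorithm 2.
theta2, theta3, d1 = inverse_kinematics(12.5, 12.5, 15.0, l1=20.0, l2=15.0)
link = serial.Serial("/dev/ttyUSB0", 115200, timeout=1)
link.write(f"{theta2:.2f},{theta3:.2f},{d1:.2f},180\n".encode())
```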
3. RESULTS
3.1. Results on distance estimation
Each system was tested with the SCARA robot, using additively printed objects at different heights and positions within the workspace. The system includes a low-cost SCARA robot, a laptop for processing, test objects, a monocular camera, and a stereoscopic camera array. The complete setup, including all components, can be seen in Figure 5, which shows both the hardware and the arrangement of the sensors and robotic arm in the test environment.

Figure 5. Setup implemented for experimentation

Multiple picking tasks are performed with each algorithm to evaluate both proposed systems. Because we seek to implement a system that correctly identifies the position of the object so that the robotic gripper can pick it up, we do not evaluate parameters such as the speed, torque, or power consumption of the robotic arm. In addition, a user interface was implemented that allows the user to select the object to be picked up with the robotic arm. Within this interface, the user can see the camera view in real time and select the objects to be picked up; for this experiment, circular figures were used in both cases to make the evaluation fair.

Table 2 presents the estimated distances and corresponding error values obtained using both the proposed monocular vision system and the traditional stereoscopic system. The trials are grouped by real distance: 15 cm, 13 cm, 10 cm, and 5 cm. Each row compares the estimated distance with the actual object distance, and the difference is shown as the estimation error. A color-coded heatmap highlights low (green), moderate (yellow), and high (red) errors, facilitating a visual assessment of accuracy. This format allows a clear comparative analysis between the two systems across multiple trials and distances.

Table 2. Distance estimation at 15 cm, 13 cm, 10 cm, and 5 cm
Real distance (cm) | Monocular (proposed) estimated (cm) | Monocular error (cm) | Stereo estimated (cm) | Stereo error (cm)
15 | 14.809 | 0.191 | 14.268 | 0.732
15 | 14.854 | 0.146 | 15.573 | 0.573
15 | 14.760 | 0.240 | 14.675 | 0.325
15 | 14.643 | 0.357 | 15.294 | 0.294
15 | 14.434 | 0.566 | 14.921 | 0.079
15 | 14.112 | 0.888 | 15.407 | 0.407
15 | 14.978 | 0.022 | 14.351 | 0.649
15 | 14.225 | 0.775 | 15.831 | 0.831
15 | 15.393 | 0.393 | 14.733 | 0.267
15 | 15.094 | 0.094 | 15.122 | 0.122
13 | 13.133 | 0.133 | 12.367 | 0.633
13 | 13.484 | 0.484 | 13.721 | 0.721
13 | 14.087 | 1.087 | 12.946 | 0.054
13 | 13.224 | 0.224 | 14.012 | 1.012
13 | 12.235 | 0.765 | 13.532 | 0.532
13 | 13.389 | 0.389 | 12.689 | 0.311
13 | 12.791 | 0.209 | 13.248 | 0.248
13 | 13.031 | 0.031 | 12.574 | 0.426
13 | 13.804 | 0.804 | 13.896 | 0.896
13 | 13.114 | 0.114 | 12.315 | 0.685
10 | 10.654 | 0.654 | 9.842 | 0.158
10 | 9.334 | 0.666 | 10.369 | 0.369
10 | 10.412 | 0.412 | 11.019 | 1.019
10 | 10.688 | 0.688 | 9.738 | 0.262
10 | 10.101 | 0.101 | 10.124 | 0.124
10 | 10.838 | 0.838 | 10.965 | 0.965
10 | 10.928 | 0.928 | 9.173 | 0.827
10 | 10.263 | 0.263 | 9.512 | 0.488
10 | 10.978 | 0.978 | 10.876 | 0.876
10 | 10.145 | 0.145 | 9.321 | 0.679
5 | 5.316 | 0.316 | 4.825 | 0.175
5 | 5.755 | 0.755 | 5.692 | 0.692
5 | 5.583 | 0.583 | 4.213 | 0.787
5 | 5.086 | 0.086 | 5.336 | 0.336
5 | 5.557 | 0.557 | 4.181 | 0.819
5 | 5.782 | 0.782 | 5.812 | 0.812
5 | 5.805 | 0.805 | 4.567 | 0.433
5 | 4.923 | 0.077 | 6.109 | 1.109
5 | 5.034 | 0.034 | 4.896 | 0.104
5 | 4.702 | 0.298 | 4.429 | 0.571

When compared with existing stereo vision systems, such as those described in [10] and [13], where stereo setups with dual high-precision cameras achieved RMSE values of around 0.49 cm at 15 cm, our monocular system achieves comparable accuracy (RMSE of 0.46 cm) while requiring only a single
camera. This makes our approach more cost-effective and easier to deploy, particularly in resource-constrained environments.

These results demonstrate that the proposed monocular system can perform at a level of accuracy similar to that of stereo vision systems, but with far fewer hardware requirements. The implication is significant for applications in industrial robotics, where minimizing hardware cost and complexity is often crucial. By replacing expensive stereo vision setups with a single camera, we open up the possibility of implementing visual control systems on more cost-effective and embedded robotic platforms.

Table 3 provides a comparative analysis of the proposed monocular vision algorithm and the stereoscopic vision algorithm based on minimum error, maximum error, and root mean square error (RMSE) at different real distances. The results show that the monocular vision algorithm achieves lower RMSE at shorter distances while maintaining competitive performance at longer distances, highlighting its robustness and reliability compared to the stereoscopic method.

Table 3. Comparison of monocular vision (proposed) and stereoscopic vision: error and RMSE
Real distance (cm) | Monocular max error (cm) | Monocular min error (cm) | Monocular RMSE (cm) | Stereo max error (cm) | Stereo min error (cm) | Stereo RMSE (cm)
15 | 0.888 | 0.022 | 0.4600 | 0.831 | 0.079 | 0.4925
13 | 1.087 | 0.054 | 0.5407 | 1.012 | 0.054 | 0.6189
10 | 0.978 | 0.124 | 0.6430 | 1.019 | 0.124 | 0.6607
5  | 0.805 | 0.104 | 0.5179 | 1.109 | 0.104 | 0.6577

For a comparative view of the results, Figure 6 represents the results of Table 2 in a box plot; the results are grouped in pairs, each pair comprising the estimates of the monocular vision algorithm and the stereo vision-based algorithm, giving four pairs for the proposed distances.

Figure 6. Box plot comparing distance estimation errors

Because the main focus of the presented algorithms is determining the distance from the cameras to the target, a statistical analysis is performed to evaluate the performance of both algorithms in this estimation. The errors in Table 2 follow normal distributions according to the Shapiro-Wilk test. However, there is no homogeneity of variances according to Levene's test; for this reason, we use a non-parametric analysis based on the Mann-Whitney U test. The following hypotheses are assumed for this test:
- H0: there is no significant difference between the two groups of data.
- Hi: there is a significant difference between the two groups of data.

With a significance level of alpha = 0.05 (5%), the p-values shown in Table 4 are obtained.

Table 4. Hypothesis test results for each estimation distance
Distance (cm) | α | p-value | H0 | Hi
15 | 0.05 | 0.09938 | Accepted | Rejected
13 | 0.05 | 0.18217 | Accepted | Rejected
10 | 0.05 | 0.14495 | Accepted | Rejected
5  | 0.05 | 0.11323 | Accepted | Rejected
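The statistical procedure can be reproduced with SciPy as sketched below, here using the 15 cm error samples from Table 2 as an example; the printed p-value is then compared against α = 0.05.

```python
from scipy import stats

# Error samples for the 15 cm trials, taken from Table 2
mono   = [0.191, 0.146, 0.240, 0.357, 0.566, 0.888, 0.022, 0.775, 0.393, 0.094]
stereo = [0.732, 0.573, 0.325, 0.294, 0.079, 0.407, 0.649, 0.831, 0.267, 0.122]

print(stats.shapiro(mono))                  # Shapiro-Wilk normality check
print(stats.shapiro(stereo))
print(stats.levene(mono, stereo))           # Levene's test for equal variances

# Non-parametric comparison of the two groups; compare p with alpha = 0.05
u, p = stats.mannwhitneyu(mono, stereo, alternative="two-sided")
print(f"Mann-Whitney U = {u}, p = {p:.5f}")
```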
In summary, the monocular vision algorithm offers considerable benefits in terms of cost and simplicity while exhibiting strong performance with small error margins, reaching accuracy levels comparable to stereoscopic systems. These findings suggest that monocular vision can be a very successful substitute for robotic applications, especially in settings where computational efficiency and cost reduction are top priorities. These results validate our initial hypothesis that a monocular vision system can serve as a viable and cost-effective alternative to more complex stereoscopic systems in robotic applications.

3.2. Results on gripper approximation
Once the distances from the objects to the camera are estimated, the approximations of the SCARA arm's robotic gripper to the position of each object are calculated using the inverse kinematics equations presented in the Hardware section. The positions computed by Algorithm 2, which contains the kinematics equations, are shown in Table 5, along with the errors between the actual position and these calculations. This error is given by the Euclidean distance between two points in three dimensions:

Error = √((x2 − x1)² + (y2 − y1)² + (z2 − z1)²)

Table 5. Gripper approximation results with the proposed algorithm
Real X (cm) | Real Y (cm) | Real Z (cm) | Gripper X (cm) | Gripper Y (cm) | Gripper Z (cm) | Error (cm)
5.00  | 5.00  | 15.00 | 5.013  | 4.823  | 14.809 | 0.2607
5.00  | 12.50 | 15.00 | 4.847  | 12.422 | 14.854 | 0.2254
5.00  | 20.00 | 15.00 | 4.786  | 20.453 | 14.760 | 0.5555
12.50 | 5.00  | 15.00 | 5.074  | 14.643 | 14.643 | 0.3660
12.50 | 12.50 | 15.00 | 12.385 | 12.493 | 14.434 | 0.5776
12.50 | 20.00 | 15.00 | 12.871 | 19.765 | 14.112 | 0.9907
20.00 | 5.00  | 15.00 | 20.122 | 5.018  | 14.978 | 0.1253
20.00 | 12.50 | 15.00 | 19.964 | 12.405 | 14.225 | 0.7816
20.00 | 20.00 | 15.00 | 19.817 | 19.958 | 15.393 | 0.4355
5.00  | 5.00  | 10.00 | 5.056  | 4.896  | 10.654 | 0.6646
5.00  | 12.50 | 10.00 | 5.110  | 12.506 | 9.334  | 0.6750
5.00  | 20.00 | 10.00 | 4.935  | 19.509 | 10.412 | 0.6442
12.50 | 5.00  | 10.00 | 12.578 | 5.098  | 10.688 | 0.6993
12.50 | 12.50 | 10.00 | 12.492 | 12.381 | 10.101 | 0.1563
12.50 | 20.00 | 10.00 | 12.853 | 19.872 | 10.838 | 0.9183
20.00 | 5.00  | 10.00 | 20.262 | 4.842  | 10.928 | 0.9771
20.00 | 12.50 | 10.00 | 20.114 | 12.485 | 10.263 | 0.2870
20.00 | 20.00 | 10.00 | 19.973 | 20.421 | 10.978 | 1.0651
5.00  | 5.00  | 5.00  | 4.823  | 5.044  | 5.316  | 0.3649
5.00  | 12.50 | 5.00  | 5.276  | 12.519 | 5.755  | 0.8041
5.00  | 20.00 | 5.00  | 5.198  | 20.167 | 5.583  | 0.6380
12.50 | 5.00  | 5.00  | 12.622 | 4.897  | 5.086  | 0.1814
12.50 | 12.50 | 5.00  | 12.735 | 12.365 | 5.557  | 0.6194
12.50 | 20.00 | 5.00  | 12.631 | 20.352 | 5.782  | 0.8675
20.00 | 5.00  | 5.00  | 20.255 | 4.932  | 5.805  | 0.8472
20.00 | 12.50 | 5.00  | 20.153 | 12.460 | 4.923  | 0.1759
20.00 | 20.00 | 5.00  | 20.318 | 20.122 | 5.034  | 0.3423

At larger distances (e.g., 15 cm), the gripper's approximation error is relatively small, with a maximum error of 0.2607 cm. However, at shorter distances, such as 5 cm, the error increases to 0.8675 cm, suggesting that the system performs better at longer ranges but needs further optimization for accuracy at close distances. The gripper's approximation errors align with previous studies, which report errors of 0.5 cm to 1 cm for similar robotic systems using inverse kinematics for position estimation at 10 to 15 cm distances [9], [10].
Our system, with maximum errors around 1.0651 cm at 5 cm, shows comparable performance but highlights the potential