TELKOMNIKA Telecommunication, Computing, Electronics and Control
Vol. 23, No. 2, April 2025, pp. 349–370
ISSN: 1693-6930, DOI: 10.12928/TELKOMNIKA.v23i2.26455

Improving visual perception through technology: a comparative analysis of real-time visual aid systems

Othmane Sebban¹, Ahmed Azough², Mohamed Lamrini¹
¹ Laboratory of Applied Physics, Informatics and Statistics (LPAIS), Faculty of Sciences Dhar El Mahraz, Sidi Mohamed Ben Abdellah University, Fez, Morocco
² De Vinci Higher Education, De Vinci Research Center, Paris, France

Article Info
Article history: Received Jul 7, 2024; Revised Jan 16, 2025; Accepted Jan 23, 2025
Keywords: Accessibility; Assistive technology; Benchmarking; Deep learning; Point of interest detection; Visually impaired

ABSTRACT
Visually impaired individuals continue to face barriers in accessing reading and listening resources. To address these challenges, we present a comparative analysis of cutting-edge technological solutions designed to assist people with visual impairments by providing relevant feedback and effective support. Our study examines various models leveraging InceptionV3 and V4 architectures, long short-term memory (LSTM) and gated recurrent unit (GRU) decoders, and datasets such as Microsoft Common Objects in Context (MSCOCO) 2017. Additionally, we explore the integration of optical character recognition (OCR), translation tools, and image detection techniques, including scale-invariant feature transform (SIFT), speeded-up robust features (SURF), oriented FAST and rotated BRIEF (ORB), and binary robust invariant scalable keypoints (BRISK). Through this analysis, we highlight the advancements and potential of assistive technologies. To assess these solutions, we have implemented a rigorous benchmarking framework evaluating accuracy, usability, response time, robustness, and generalizability.
Furthermore, we investigate mobile integration strategies for real-time practical applications. As part of this effort, we have developed a mobile application incorporating features such as automatic captioning, OCR-based text recognition, translation, and text-to-audio conversion, enhancing the daily experiences of visually impaired users. Our research focuses on system efficiency, user accessibility, and potential improvements, paving the way for future innovations in assistive technology.

This is an open access article under the CC BY-SA license.

Corresponding Author:
Othmane Sebban
Laboratory of Applied Physics, Informatics and Statistics (LPAIS), Faculty of Sciences Dhar El Mahraz
Sidi Mohamed Ben Abdellah University
Fez 30003, Morocco
Email: othmane.sebban@usmba.ac.ma

Journal homepage: http://journal.uad.ac.id/index.php/TELKOMNIKA

1. INTRODUCTION
Visually impaired people [1] face many difficulties in their daily activities, including navigating unfamiliar environments, reading text, and interpreting visual information [2]. Although assistive technologies such as Microsoft Seeing AI and Google Lookout have been developed to analyze images in real time and provide descriptions, these solutions still have significant limitations, including high costs, restricted functionality, and dependence on a stable internet connection. These constraints considerably reduce their effectiveness, particularly in critical situations such as crossing streets or reading important documents in real time. Despite technological advancements, assistive technologies for the visually impaired struggle to deliver real-time performance because of computational demands and limited adaptability to real-world environments. Their dependence on the internet also reduces their offline effectiveness, compromising the immediate and reliable assistance users need for such essential tasks.

To address these limitations, our study proposes a comprehensive benchmarking system designed to evaluate and optimize the performance of assistive technologies for visually impaired users. This system measures the effectiveness of various components, including image captioning, optical character recognition (OCR), real-time translation, and key image element detection [3]. Our goal is to encourage the development of mobile applications [4] that combine both accuracy and speed, ensuring optimal performance in real-time use cases. The mobile application we developed, "SeeAround", integrates these functionalities to provide reliable visual assistance [5]. The general diagram of our system, presented in Figure 1, outlines the various modules and their interactions, illustrating how they work together to offer real-time assistance.

Figure 1. General diagram of the real-time visual assistance system for the visually impaired

For image captioning, we use an encoder-decoder architecture combining convolutional neural networks (CNN) and recurrent neural networks (RNN). Specifically, we employ InceptionV3 [6] and InceptionV4 [7] models, adapted to process images efficiently in real-world contexts for visually impaired people. Additionally, we use the Microsoft Common Objects in Context (MSCOCO) 2017 dataset to train the models with enhanced parameters, optimizing them for mobile environments.
Our system also integrates long short-term memory (LSTM) and gated recurrent unit (GRU) decoders to capture temporal sequences more effectively, improving the generation of image captions by modeling long-term relationships between visual and textual elements [8].

The OCR component processes images containing mainly text, even in visually complex environments. Using advanced algorithms, it accurately detects and extracts text, providing contextual information essential for visually impaired users. Furthermore, the real-time translation functionality of our mobile application removes language barriers by supporting a wide range of languages [9]. Recognized text is translated into the chosen language and then converted into speech via our text-to-speech module [10], making it easier to understand image descriptions and textual content extracted by OCR.

An essential aspect of providing relevant visual information is precisely extracting key image elements during camera analysis [11]. Each image has different characteristics such as saturation, brightness, contrast, and camera angle, meaning that uniform processing approaches can prove ineffective. To address this, we have incorporated advanced image detection algorithms, including scale-invariant feature transform (SIFT) [1], speeded-up robust features (SURF) [1], oriented FAST (features from accelerated segment test) and rotated BRIEF (ORB) [12], and binary robust invariant scalable keypoints (BRISK) [12]. These algorithms are renowned for their robustness and accuracy under difficult conditions. Our system detects the limitations of current methods and proposes improvements to ensure optimum performance, particularly in the real-life situations of visually impaired people.
This paper is structured into five main sections, each addressing a key aspect of the study. Section 2 provides a detailed overview of previous research and current technologies designed for visually impaired users. Section 3 focuses on our benchmarking system and the key components used in our application. Section 4 presents the experimental results and their analysis, demonstrating the impact of our technical choices. Finally, section 5 concludes by summarizing our findings, discussing the implications of this work, and suggesting avenues for future research and development in assistive technology.

2. RELATED WORK
Visual impairment, a common disability, presents different levels of severity. Assistive technologies are crucial in providing visual alternatives via various products, devices, software, and systems [4], [6]. Sandhya et al. [13] present an application that helps visually impaired people understand their environment using neural networks and natural language processing. The application generates textual descriptions of images captured by a camera and integrates an OCR module to read the text on signs and documents. The descriptions are then converted into audio, providing information in several languages such as Telugu, Hindi, and English. This application requires a system with an integrated GPU. Ganesan et al. [14] propose an innovative approach to facilitating access to printed content for the visually impaired, using CNNs and LSTMs to encode and decode information. The system integrates OCR to convert printed text into digital format and then into speech via a text-to-speech application programming interface (API), making the content accessible via voice reading. Bagrecha et al.
[15] present "VirtualEye", an innovative application in assistive technologies for the visually impaired, offering functions such as object and distance detection, recognition of Indian banknotes, and OCR. The system provides voice instructions in English and Hindi, enhancing users' independence and improving their quality of life. Uslu et al. [16] aim to generate grammatically correct and semantically relevant captions for visual content via a personalized mobile app, improving accessibility, particularly for the visually impaired. Integrated with the "CaptionEye" Android app, the system enables captions to be generated offline and controlled by voice, offering a user-friendly interface. Çaylı et al. [17] present a captioning system designed to provide natural language descriptions of visual scenes, improving accessibility and reducing social isolation for visually impaired people. This research demonstrates the practical application of computer vision and natural language processing to create assistive tools. Despite significant advances, it is crucial to continue developing reliable mobile systems adapted to everyday life to improve users' autonomy and quality of life.

3. METHOD
In this section, we present our solution based on a benchmarking analysis. Subsection 3.1 presents the initial module, detailing the process of benchmarking the modules of our system illustrated in Figure 1. Subsection 3.2 explores the generation of image captions using encoder-decoder architectures optimized with TensorFlow Lite, integrating multi-layer GRU and LSTM models for accurate descriptions. Finally, subsection 3.3 compares the performance of four separate systems using various models for image captioning, OCR, translation, and key point extraction.

3.1. Description of the benchmarking process for the evaluation of visual assistance systems
Benchmarking measures a company's performance against market leaders [18], [19] to identify gaps and drive continuous improvement. In this study, we designed realistic tasks adapted to the needs of visually impaired people, such as image caption generation, text recognition, and key point extraction. By comparing our solutions with industry standards, the aim is to close performance gaps and improve assistive technologies.

3.1.1. Benchmarking criteria and methodology
To improve accessibility and comprehension of multimedia content for visually impaired users, we have integrated automatic captioning, OCR, text translation, and image recognition modules. These components use advanced machine-learning algorithms to optimize processing accuracy and speed, ensuring fluid, instantaneous interaction. Every component has been designed for a smooth, optimized experience.

3.1.2. Performance evaluation of visual assistance systems
We evaluated each visual assistance system according to four key criteria: accuracy, response time, robustness, and generalizability. Accuracy was measured by comparing the results obtained with the expected outputs for tasks such as image description, OCR, and translation. Response time, expressed in milliseconds, was used
to assess system efficiency. Robustness was analyzed under difficult conditions, including low lighting, high noise, and complex backgrounds, to ensure reliability. Finally, generalizability was examined using previously unseen images, videos, and documents to judge each system's suitability for new contexts.

3.1.3. Comparative analysis of benchmark results
We analyzed the results to determine the best-performing systems for each task and usage scenario, emphasizing their strengths and weaknesses. This rigorous comparative analysis identified the most effective real-time visual assistance solutions, providing valuable insights into their capabilities. These findings will help guide the future development of more advanced, efficient, and user-friendly technologies tailored to the needs of visually impaired individuals.

3.2. Optimization of automatic image caption generation
This subsection presents a system for automatically generating image captions, based on an encoder-decoder model. The CNNs InceptionV3 and InceptionV4 are used as encoders to extract visual features. The multilayer decoder, composed of GRU and LSTM units, generates the semantic captions, as illustrated in Figure 2. The work in [6], [7] combines CNNs and recurrent networks, but the excessive increase in the number of time steps, due to the length of the captions, led to inferior performance. By reducing this number, we optimized the use of GRU and LSTM, leading to better results.

Figure 2. Model architecture for multi-RNN-based automated image captioning

3.2.1. InceptionV3-gated recurrent unit-based multi-layer image caption generator model
The image captioning system consists of two main elements: the encoder and the decoder, each based on a distinct neural architecture. The encoder, based on InceptionV3 [6], extracts key information from the image.
This is then passed on to the decoder, which uses a GRU to generate the caption word by word. The proposed general model is illustrated in Figure 3.

Figure 3. Flowchart of the InceptionV3 multi-layer GRU image caption generator
Encoder-InceptionV3 architecture: the encoder, based on a CNN, is composed of convolution, pooling, and fully connected layers. Sparse interactions identify key visual elements, while parameter sharing enables the same kernel to be applied across the image for optimized learning. Before training, a 2048-dimensional vector is generated from each image via the average pooling layer of the Inception-v3 model [6]. During training, a dense layer [20] refines this vector, progressively compressing it into a more compact and discriminative representation for image captioning.

Decoder-GRU implementation: the decoder exploits the extracted features to generate descriptive sentences, relying on RNNs to store part of the input data. However, RNNs face limitations due to vanishing and exploding gradient problems, compromising their ability to handle long-term dependencies. To overcome these difficulties, we use GRUs, an enhanced version of RNNs with gating mechanisms [20] that facilitate dependency management. Figure 4 shows a typical GRU architecture with its update, reset, and hidden state gates. The decoder comprises an embedding layer, a multilayer GRU, and a linear layer. The embedding layer converts words into vectors suitable for language modeling, while the GRU adjusts the hidden state using its gate mechanisms.

Figure 4. Architecture of GRU

In (1)-(4) [21], $x_t$ represents the input and $h_t$ is the hidden state at time $t$. The input and recurrent weights of the reset, update, and candidate-state gates are denoted $(W_{xr}, U_{hr})$, $(W_{xz}, U_{hz})$, and $(W_{xu}, U_{hu})$, respectively. The hyperbolic tangent and sigmoid activation functions are symbolized by $\tanh$ and $\sigma$, respectively, and $\odot$ denotes element-wise multiplication.

$$r_t = \sigma(W_{xr} x_t + U_{hr} h_{t-1}) \tag{1}$$
$$z_t = \sigma(W_{xz} x_t + U_{hz} h_{t-1}) \tag{2}$$
$$u_t = \tanh(W_{xu} x_t + U_{hu}(r_t \odot h_{t-1})) \tag{3}$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot u_t \tag{4}$$

3.2.2. InceptionV4-long short-term memory-based multi-layer image caption generator model
The model follows an encoder-decoder approach, where InceptionV4 [7] acts as an encoder to extract visual features from images. These features are then transmitted to a recurrent neural network equipped with LSTM cells that acts as the decoder. The latter uses this information to generate sequences of words, thus producing descriptive captions for the images. The general scheme of the model is illustrated in Figure 5.

Encoder-InceptionV4 architecture: we use InceptionV4, a CNN pre-trained by Google, as the encoder in our framework. This model extracts high-level visual features through deep convolutional layers. The InceptionV4-based encoder [7], [22], [23] converts raw images into fixed-length vectors by capturing relevant information from the intermediate pooling layer, just before the final output. This process provides a concise and relevant image representation for subsequent processing.

Decoder-LSTM implementation: the decoder is a deep recurrent neural network with LSTM cells, as shown in Figure 6. In our model, the decoder operates in two phases: learning and inference. During learning, the RNN decoder with LSTM cells aims to maximize the probability of each word in a caption based on the convolved features of the image and previously generated words [7].
Figure 5. Flowchart of the InceptionV4 multi-layer LSTM image caption generator

Figure 6. Architecture of LSTM

To learn a sentence of length $N$, the decoder loops back on itself for $N$ time steps, storing previous information in its cell memory. The memory $C_t$ is modified at each time step by the LSTM gates: the forget gate $f_t$, the input gate $i_t$, and the output gate $o_t$. The LSTM decoder learns the word sequences from the convolved features and the original caption. At step $t = 0$, the hidden state $h_t$ of the decoder is initialized using the image features $F$. The main idea of the encoder-decoder model is illustrated by (5)-(11), where $[h_{t-1}, x_t]$ denotes concatenation and $\odot$ element-wise multiplication:

$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \tag{5}$$
$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \tag{6}$$
$$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C) \tag{7}$$
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \tag{8}$$
$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \tag{9}$$
$$h_t = o_t \odot \tanh(C_t) \tag{10}$$
$$O_t = \arg\max(\operatorname{softmax}(h_t)) \tag{11}$$
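The gate recurrences used by the two decoders, (1)-(4) for the GRU and (5)-(11) for the LSTM, can be sketched in a few lines of dependency-free Python. This is only an illustrative sketch, not the authors' TensorFlow implementation: the function names, the parameter dictionary keys, and the `zeros` helper are hypothetical, weights are plain nested lists, and the candidate cell state uses tanh, which is the standard LSTM formulation.

```python
import math


def sigmoid(v):
    """Element-wise logistic function on a list of floats."""
    return [1.0 / (1.0 + math.exp(-a)) for a in v]


def tanh(v):
    """Element-wise hyperbolic tangent."""
    return [math.tanh(a) for a in v]


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def add(a, b):
    return [x + y for x, y in zip(a, b)]


def mul(a, b):
    """Element-wise product."""
    return [x * y for x, y in zip(a, b)]


def zeros(rows, cols):
    """Hypothetical helper: a rows x cols matrix of zeros."""
    return [[0.0] * cols for _ in range(rows)]


def gru_step(x, h_prev, p):
    """One GRU time step following (1)-(4)."""
    r = sigmoid(add(matvec(p["W_xr"], x), matvec(p["U_hr"], h_prev)))       # (1) reset gate
    z = sigmoid(add(matvec(p["W_xz"], x), matvec(p["U_hz"], h_prev)))       # (2) update gate
    u = tanh(add(matvec(p["W_xu"], x), matvec(p["U_hu"], mul(r, h_prev))))  # (3) candidate state
    return [(1.0 - zi) * hi + zi * ui for zi, hi, ui in zip(z, h_prev, u)]  # (4) new hidden state


def lstm_step(x, h_prev, c_prev, p):
    """One LSTM time step following (5)-(10)."""
    hx = list(h_prev) + list(x)                           # concatenation [h_{t-1}, x_t]
    f = sigmoid(add(matvec(p["W_f"], hx), p["b_f"]))      # (5) forget gate
    i = sigmoid(add(matvec(p["W_i"], hx), p["b_i"]))      # (6) input gate
    c_tilde = tanh(add(matvec(p["W_C"], hx), p["b_C"]))   # (7) candidate memory (tanh)
    c = add(mul(f, c_prev), mul(i, c_tilde))              # (8) new cell memory
    o = sigmoid(add(matvec(p["W_o"], hx), p["b_o"]))      # (9) output gate
    h = mul(o, tanh(c))                                   # (10) new hidden state
    return h, c


def greedy_word(h):
    """Index of the most probable entry after a softmax, as in (11)."""
    m = max(h)
    exps = [math.exp(a - m) for a in h]  # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    return probs.index(max(probs))
```

With all weights and biases set to zero, every gate evaluates to 0.5 and every candidate to 0, so each step simply halves the previous state; this gives a quick sanity check of the wiring. In a real decoder the hidden state would first pass through a linear layer mapping it to vocabulary size before the softmax.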
3.2.3. Training process and techniques for model optimization
The transformation of the input $x_t$ at time $t$ into the output word $O_t$ is governed by (5)-(11), which use the learnable weight and bias pairs $(W_f, b_f)$, $(W_i, b_i)$, $(W_o, b_o)$ together with the sigmoid $\sigma$ and hyperbolic tangent $\tanh$ activation functions [6], [7]. Each word $X_t$ is converted into a fixed-length vector using a word representation $W_e$ of dimension $V \times W$, where $V$ is the number of words in the vocabulary and $W$ is the length of the embedding learned during training. The decoder's objective is to maximize the probability $p$ of a word's appearance at time $t$ given the cell and hidden states, the features $F$, and the previous words $X_{0:t}$. This is achieved by minimizing the loss function $L$, which is the cross-entropy of the sampled word probabilities [6], [7].

$$Z = \arg\max_{\beta} \left( \sum_{t=0}^{N} \log p(O_t \mid X_{0:t-1}, \phi_t; \beta) \right) \tag{12}$$
$$L = H(u, v) = \min \left( -\sum_{t=0}^{N} u(X_t) \log v(O_t) \right) \tag{13}$$

where $H(u, v)$ is the cross-entropy, and $u$ and $v$ represent the softmax probability distributions of the ground truth word $X_t$ and the generated word $O_t$ at time $t$. During inference, the input image is passed through the encoder to obtain the convolved features, which are then sent to the decoder. At time $t = 0$, the decoder samples the start token $O_{t=0} = \langle S \rangle$ from the input features $F$. For subsequent instants, the decoder samples a new word based on the input features and the previously sampled words $O_{0:t}$ until it encounters an end token $\langle /S \rangle$ at instant $t = N$ [6], [7]. Figure 7 illustrates the training-phase architecture of the TensorFlow models for the GRU and LSTM encoder-decoders. Figure 7(a) shows our InceptionV3-GRU model, which uses a CNN to extract visual features and GRU units to generate captions.
Figure 7(b) shows the architecture of the InceptionV4-LSTM model, where InceptionV4 extracts visual features and an LSTM generates captions.

Figure 7. The architecture for training phases of: (a) the InceptionV3-GRU model and (b) the InceptionV4-LSTM model

These diagrams show the architectures of models using InceptionV3 [6] and InceptionV4 [7] as encoders, with decoders based on GRU or LSTM units. The input image is resized and processed by convolutional
layers, then features are extracted via global pooling to initialize the hidden state of the decoders. Each word in the caption is then converted to a vector and processed to generate the probability of the next word. Table 1 summarizes the main parameters used to train the different image caption generation models, such as batch size, number of epochs, and time steps, which influence performance and quality. Our new image caption generation model optimizes previous versions [6], [7]. Reducing the number of time steps from 22 to 18 improves performance by reducing computational complexity. Increasing the batch size to 148 stabilizes training, while limiting captions to 16 words enhances efficiency.

Table 1. Pre-trained model settings for image caption generation
Model | Embedding size | Caption preprocessing | Error rate | Batch size | Num timesteps | Epochs
InceptionV4-LSTM [7] | 256 | 20 words | 2 × 10^-3 | 100 | 22 | 120
InceptionV4-LSTM (our model) | 256 | 16 words | 2 × 10^-3 | 148 | 18 | 120
InceptionV3-GRU [6] | 256 | 20 words | 2 × 10^-3 | 128 | 22 | 120
InceptionV3-GRU (our model) | 256 | 16 words | 2 × 10^-3 | 148 | 18 | 120

3.2.4. Common dataset utilization for enhanced performance
High-quality data is crucial for an effective model. Using diverse datasets helps avoid overfitting and improve performance. We used MSCOCO 2017 [20], in which each annotated image comes with five human-written captions. Table 2 compares MSCOCO 2017, Flickr30k [24], and MSCOCO 2014 [25], highlighting the distribution of training, validation, and test sets, with MSCOCO 2017 offering the largest number of examples for image captioning.

Table 2. Characteristics of datasets used to train image caption generation models
Dataset | Training split (k) | Validation split (k) | Testing split (k) | Total images (k)
Flickr30k (imeca [6]) | 28 | 1 | 1 | 8
MSCOCO 2014 (cam2caption [7]) | 83 | 41 | 41 | 144
MSCOCO 2017 (our model) | 118 | 41 | 5 | 164

3.2.5. Integration of our pre-trained model in the mobile application
We optimize our encoder-decoder model for real-time use in the "SeeAround" mobile application via TensorFlow, exploiting its dataflow graph architecture [7] and processor parallelism to improve efficiency. Performing image preprocessing inside the graph speeds it up by a factor of six. During training, checkpoints and metadata files are generated regularly. Checkpoints store the learned weights, while graph definitions link them, enabling the model to be reconstructed and reused for inference and training. Figure 8 shows the backup architecture for the LSTM and GRU models, with Figure 8(a) illustrating the LSTM model and Figure 8(b) the GRU model.

Figure 8. TensorFlow model backup architecture for: (a) the LSTM encoder-decoder and (b) the GRU encoder-decoder
We have combined the pre-processing, encoding, and decoding files into three ProtoBuf files to create an end-to-end model suited to static and real-time requirements. A final ProtoBuf file serves as a black box for caption generation. Our model uses 18 words instead of 22 [6], [7], improving reliability and speeding up real-time caption generation for image streams from the camera. Figure 9 shows the captions generated by the LSTM and GRU decoding models. Figure 9(a) illustrates the captions generated by the LSTM model, while Figure 9(b) shows those generated by the GRU model.

Figure 9. Captions generated by the model: (a) LSTM decoder outputs and (b) GRU decoder outputs

3.3. Detailed descriptions of our visual assistance systems
In this subsection, we present the four systems designed for image caption generation, OCR, machine translation, and key point extraction. Each system is built on specialized models carefully selected for their efficiency and relevance. As illustrated in Table 3, these choices are guided by performance metrics and task-specific adaptability, ensuring optimal accuracy and reliability.

Table 3. Systems description for benchmarking evaluation
System | Image captioning | OCR | Translation | Keyframe extraction
System 1 | InceptionV4-LSTM with 22 words [7] | Google Mobile Vision API | Google Cloud Translation API | SIFT
System 2 | InceptionV4-LSTM with 18 words (our model) | Firebase Vision Text Detector | Google ML Kit | SURF
System 3 | InceptionV3-GRU with 22 words [6] | Google Firebase Machine Learning kit | Firebase ML Kit | BRISK
System 4 | InceptionV3-GRU with 18 words (our model) | TessTwo-Android | Google Translate API | ORB

3.3.1. Inception-V4 with LSTM and the harmony of advanced vision and language processing
Using the Google Mobile Vision API to recognize text in images: technological advances in information capture and text recognition have led to innovative services such as document analysis and secure access to devices. OCR [26], a technology for detecting and extracting text from scanned images or directly from the camera, can work with or without an internet connection. Google offers Mobile Vision, an open-source tool
for creating text recognition and instant translation applications on Android. In this research, OCR is used to assist the visually impaired. Although this technology is effective for document scanning and text analysis, it encounters limitations in applications dependent on a stable internet connection, particularly in areas with low connectivity. The extracted text data is then processed by a REST API [27], which interacts with a database and displays the information live on the device, as illustrated in Figure 10.

Figure 10. Using the Google Mobile Vision API for the OCR process

Google Cloud Translation to convert recognized text into multiple languages: the Google Cloud Platform offers pre-trained machine learning models for creating applications that interact with their environment [28]. Among these models, the Google Cloud Translation API offers the possibility of converting content between dozens of languages. We used this API to translate the information produced by captioning and OCR. Google Translate then renders this data in the language chosen by the visually impaired person. Figure 11 illustrates this workflow, from text recognition through captioning and OCR to automated translation, based on cloud technologies.

Figure 11. Multilingual translation process with the Google Cloud Translation API

Keyframe detection with SIFT: the SIFT descriptor, designed by Lowe [29], is widely used for its efficiency in image processing, particularly for identifying and characterizing points of interest using local gradients. The SIFT process is divided into four main phases, including the application of the difference of Gaussians (DoG) method. This involves subtracting images filtered by Gaussian filters applied at different scales. The extrema detected between two adjacent levels are then exploited for further analysis [30], [31].
The DoG function is given by (14):

$$D(X, \sigma) = (G(X, k\sigma) - G(X, \sigma)) \ast I(X) \tag{14}$$

where $I$ is the input image, $X$ is a specific point $X(x, y)$, and $\ast$ denotes convolution. The variable $\sigma$ represents the scale, while $G(X, \sigma)$ denotes the Gaussian applied to the point $X$. Regions of interest based on the difference of Gaussians (DoG) [31] are identified as extrema in the image plane and along the scale axis of the function $D(X, \sigma)$. To locate these points, the $D(X, \sigma)$ value of each point is compared with its neighbors at the same and at different scales. The SIFT algorithm extracts and describes these points of interest for obstacle detection and recognition.

3.3.2. Inception-V4 with LSTM and Firebase: text detection optimization
Firebase Vision text detector for efficient text detection: Google's Firebase Cloud Storage service enables developers to store and share user content, such as photos, videos, and audio files, in the cloud [32]. Based on Google Cloud Storage, it offers a scalable object storage solution, perfectly integrated with web