IAES International Journal of Artificial Intelligence (IJ-AI)
Vol. 14, No. 1, February 2025, pp. 151-158
ISSN: 2252-8938, DOI: 10.11591/ijai.v14.i1.pp151-158

Large language models-based metric for generative question answering systems

Hazem Abdel Azim, Mohamed Tharwat Waheed, Ammar Mohammed
School of Computing and Digital Technologies, ESLSCA University, Cairo, Egypt

Article history: Received Mar 20, 2024; Revised Aug 13, 2024; Accepted Aug 30, 2024

Keywords: Evaluation metrics; Generative question answering; Large language models; Likert-scale scoring; Zero-shot prompting

ABSTRACT
Text generation has advanced rapidly in recent years, but techniques for evaluating the performance and quality of the generated text have lagged behind. Traditionally, lexical-based metrics such as bilingual evaluation understudy (BLEU), recall-oriented understudy for gisting evaluation (ROUGE), metric for evaluation of translation with explicit ordering (METEOR), consensus-based image description evaluation (CIDER), and F1 have been utilized, primarily relying on n-gram similarity for evaluation. In recent years, neural and machine-learning-based metrics, like bidirectional encoder representations from transformers (BERT) score, key phrase question answering (KPQA), and BERT supervised training of learned evaluation metric for reading comprehension (LERC), have shown superior performance over traditional metrics, but they generalize poorly across domains and require massive human-labeled training data. The main contribution of the current research is to investigate the use of train-free large language models (LLMs) as scoring metrics, evaluators, and judges within a question-answering context, encompassing both closed and open-QA scenarios. To validate this idea, we employ simple zero-shot prompting of Mixtral 8x7B, a popular and widely used open-source LLM, to score a variety of datasets and domains. The experimental results on ten different benchmark datasets are compared against human judgments, revealing that, on average, simple LLM-based metrics outperformed sophisticated state-of-the-art statistical and neural machine-learning-based metrics by 2-8 points on answer-pairs scoring tasks and up to 15 points on contrastive preferential tasks.

This is an open access article under the CC BY-SA license.

Corresponding Author:
Hazem Abdel Azim
School of Computing and Digital Technologies, ESLSCA University
Cairo, Egypt
Email: hazem.abdelazim@eslsca.edu.eg

Journal homepage: http://ijai.iaescore.com

1. INTRODUCTION
Question answering (QA), dating back to the seminal work of Hirschman and Gaizauskas [1], has long aspired to equip computer systems with the ability to furnish accurate and pertinent responses to posed inquiries, leveraging either predefined context or curated knowledge bases. QA systems are typically decomposed into two key components [2]: a retriever and a reader. The retriever searches an extensive collection of passages and retrieves the passage most relevant to the query. The reader comprehends the retrieved passage or set of passages and answers the query from it. The current research focuses on the reader component, namely the reading comprehension (RC) task, and in particular on the metrics used to measure the performance of the RC task.
Traditional RC-QA systems [3] rely on extractive "span-based" QA, whether in a closed or open domain. In span-based extractive QA, the model is given a passage and a question and must extract the answer from within the passage as a span of [start, end] indices. Accordingly, the metrics used to evaluate those systems were designed to capture lexical similarities between the model answers and the ground-truth ideal answers created by human annotators. Recent QA systems are generative [4], sometimes known as "abstractive" QA: they generate a semantically correct answer from the passage and do not necessarily extract a span. More advanced metrics are required to evaluate those generative responses.

Generally, the current landscape of QA metrics can be categorized into three broad categories: lexical-statistical metrics, embedding-based metrics, and neural bidirectional encoder representations from transformers (BERT)-based models [5]. Lexical-statistical metrics are the more conventional metrics and have been used for several years. They rely on token matching, whether exact match (EM) or relaxed (F1-score), with different n-gram variants. These metrics include bilingual evaluation understudy (BLEU), recall-oriented understudy for gisting evaluation (ROUGE), metric for evaluation of translation with explicit ordering (METEOR), consensus-based image description evaluation (CIDER), EM, and F1-score. BLEU is precision-centric and widely used in evaluating translation tasks [1]; ROUGE is recall-centric and commonly used in summarization tasks [6]. Although these traditional metrics have provided acceptable performance for span-based extractive QA systems, they suffer from a critical drawback: they do not capture the semantic features of the tokens.
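The token-matching behaviour described above can be made concrete with a small sketch. The code below is an illustration only, assuming a naive whitespace tokenizer; the function names `tokenize`, `exact_match`, and `token_f1` are hypothetical and are not taken from any of the cited metric implementations. It shows how a correct paraphrase receives a low score because it shares almost no tokens with the reference.

```python
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split on whitespace; a deliberately simple tokenizer."""
    return text.lower().split()


def exact_match(candidate: str, reference: str) -> float:
    """Return 1.0 when the token sequences are identical, else 0.0."""
    return float(tokenize(candidate) == tokenize(reference))


def token_f1(candidate: str, reference: str) -> float:
    """Harmonic mean of unigram precision and recall over shared tokens."""
    cand, ref = tokenize(candidate), tokenize(reference)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# Illustrative strings (not from any dataset): same meaning, different words.
reference = "she put the money in the bank"
paraphrase = "she deposited her cash at a financial institution"

print("EM:", exact_match(paraphrase, reference))
print("F1:", token_f1(paraphrase, reference))
```

Both scores stay low for the paraphrase even though a human would judge it correct, which is the drawback that motivates the semantic metrics discussed next.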
The semantic aspect has been addressed in the second category, embedding-based metrics, which use token embeddings to provide a more nuanced similarity score and mitigate the limitations of lexical metrics [7], [8]. While these metrics offer flexibility and improve QA scoring compared to lexical metrics, they struggle to adapt to specific contexts because their embeddings are static and ignore the contextual nuances of tokens within questions or answers [9]. For instance, a word like "bank" would yield the same static embedding vector in different contexts, such as "depositing a paycheck in the bank" and "crossing the river bank".

Those limitations were handled in the third category, neural BERT-based models, which use different variants of BERT architectures [10] to capture contextualized embeddings and showed superior correlation with human judgments compared to the other categories. Several models were reported recently, like BERTscore [8], which relies on either word or contextualized embeddings and cosine similarity to generate a numeric score. Bilingual evaluation understudy with representations from transformers (BLEURT) [11] is a refined version of BERTscore that leverages augmented synthetic data to train the model. Another advanced version uses BERT to train the model to learn importance weights for each token, as in key phrase question answering (KPQA) models [9]. The refinement here is that instead of treating tokens in the model answer and ground-truth gold answers equally, they are weighted based on their importance in answering the question. A standard BERT architecture is followed by a softmax classifier layer to generate the weights for each token, and those weights are incorporated into conventional metrics like ROUGE, BLEU, and the BERTscore metric.
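To illustrate why importance weighting helps, here is a minimal sketch of a weighted unigram F1. It is not the KPQA formulation from [9]: the weights below are hand-set, hypothetical values standing in for what a trained BERT classifier would predict, and the function name `weighted_f1` is an assumption made for this example.

```python
def weighted_f1(cand_tokens, cand_weights, ref_tokens, ref_weights):
    """Unigram F1 where each token contributes its importance weight."""
    total_c, total_r = sum(cand_weights), sum(ref_weights)
    if total_c == 0 or total_r == 0:
        return 0.0
    ref_set, cand_set = set(ref_tokens), set(cand_tokens)
    # Weight mass of candidate tokens that also occur in the reference.
    matched_c = sum(w for t, w in zip(cand_tokens, cand_weights) if t in ref_set)
    # Weight mass of reference tokens that also occur in the candidate.
    matched_r = sum(w for t, w in zip(ref_tokens, ref_weights) if t in cand_set)
    precision, recall = matched_c / total_c, matched_r / total_r
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical question: "Where was the treaty signed?"
ref = ["it", "was", "signed", "in", "paris"]
wrong = ["it", "was", "signed", "in", "london"]  # fluent but incorrect
right = ["paris"]                                 # terse but correct

# Hand-set illustrative weights: the location token carries the answer.
w = [0.1, 0.1, 0.2, 0.1, 0.9]
uniform = [1.0] * 5

print("uniform :", weighted_f1(wrong, uniform, ref, uniform),
      weighted_f1(right, [1.0], ref, uniform))
print("weighted:", weighted_f1(wrong, w, ref, w),
      weighted_f1(right, [0.9], ref, w))
```

With uniform weights the incorrect answer scores higher because it shares more tokens with the reference; once the key token dominates the weight mass, the ranking flips in favour of the correct answer.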
A BERT-based direct supervised learning approach was adopted by [12]; it learns the required rating directly from massive labelled training data. The model is called learned evaluation metric for reading comprehension (LERC). It is based on a BERT architecture that has been fine-tuned on human judgment scores. LERC takes as input a passage (context), question, reference, and candidate, and its output score measures the accuracy of the candidate relative to the ground-truth human judgment.

The preceding neural-BERT systems have demonstrated significantly superior performance to traditional lexical and static embedding metrics, especially within the domains for which they are trained. However, they are hindered by a complex training procedure that relies on large amounts of costly human-annotated data. Additionally, they exhibit limited generalization across domains, and there is still room for improvement on out-of-distribution data, particularly on contrastive pairs [12].

Recently, a fourth category, which uses large language models (LLMs) as scoring judges, has shown significant promise compared to the preceding three categories. Utilizing LLMs with carefully crafted prompts [13] has demonstrated remarkable success in various tasks, both within academic benchmarks [14] and real-world settings [15]. However, to our knowledge, no published research has yet reported on using LLMs as a scoring agent for RC tasks in a QA context to mimic human judgments on a Likert scale and on the simpler binary task of correct/incorrect answers. Thus, this research's primary contribution lies in exploring this fourth category, employing GPT LLMs and zero-shot prompting to assess the correlation between model scores and human judgments compared to other state-of-the-art QA metrics. We conducted experiments using this proposed approach on 10 state-of-the-art datasets [12], [16]–[23], encompassing diverse domains and QA styles. The first eight datasets feature human scores on a 5-point Likert scale, while the last two are binary open-QA datasets. In these latter datasets, the task assigned to the LLM is to determine whether the answer is correct compared to a gold (ground truth) answer. Through careful prompting, we directed the LLM to execute the scoring task. This discriminative task is particularly challenging, especially with the 5-point Likert scale judgment, as the LLM must distinctly differentiate between closely labelled categories between 1 and 5.

The rest of the paper is organized as follows. Section 2 discusses the research method: subsection 2.1 presents the proposed Mixtral-based LLM scoring model, and subsection 2.2 describes the datasets employed in this research, providing a foundation for the empirical analysis. The findings from our empirical analysis across all datasets are presented and discussed in section 3. Finally, section 4 concludes the paper with a summary of key insights and conclusions drawn from the research.

2. METHOD
In this section, we describe the methodology used in our research to develop and evaluate the proposed LLM-based metric, as well as the datasets used for evaluation.

2.1. Large language model-based metric
The proposed metric capitalizes on the reasoning capabilities of the open-source Mixtral 8x7B LLM, used as a scoring machine, which we will call LLM-Mixtral. This research tests the hypothesis that a simple prompt-based, zero-shot, open-source LLM can outperform the existing state-of-the-art metrics covered in the previous section and correlate better with human judgments.
We formally design a prompt containing a question q, a gold (reference) answer, and a model-generated (AI-generated) answer, and obtain the predicted score as in (1):

$$\hat{y} = M_{LLM}(\text{prompt}) \tag{1}$$

The predicted score $\hat{y}$ is then compared to the corresponding human judgments. An example of a prompt that can be applied as an input to (1) is: "Here is a question, a set of golden answers (split with /), an AI-generated answer. Can you judge whether the AI-generated answer is correct according to the question and golden answers, answer Yes or No."

We used several prompts depending on the task and the dataset used. An example is shown in Figure 1, which instructs LLM-Mixtral to generate a human-like judgment on how well the hypothesis (candidate) answer is aligned semantically with the ground-truth reference answer. The predicted judgment $\hat{y}$ is on a Likert scale from 1 to 5 for the first eight datasets or a binary judgment (correct/incorrect) for the last two datasets, as will be explained in the next section.

Figure 1. Zero-shot prompt applied to LLM-Mixtral model
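The scoring step in (1) can be sketched as a prompt builder plus a parser for the model's reply. Everything below is an assumption made for illustration: the prompt wording is hypothetical and is not the prompt of Figure 1, and `fake_llm` is a stand-in for whatever client is used to call Mixtral 8x7B, which the paper does not specify.

```python
import re
from typing import Callable, Optional

# Hypothetical prompt wording for Likert scoring; not the paper's prompt.
PROMPT_TEMPLATE = (
    "Here is a question, a reference answer, and an AI-generated answer.\n"
    "Rate how well the AI-generated answer matches the reference answer "
    "on a scale from 1 (worst) to 5 (best). Reply with the number only.\n"
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "AI-generated answer: {candidate}\n"
    "Score:"
)


def build_prompt(question: str, reference: str, candidate: str) -> str:
    """Fill the zero-shot template with one QA triple."""
    return PROMPT_TEMPLATE.format(
        question=question, reference=reference, candidate=candidate
    )


def parse_likert(reply: str) -> Optional[int]:
    """Return the first standalone digit 1-5 in the reply, or None."""
    match = re.search(r"\b([1-5])\b", reply)
    return int(match.group(1)) if match else None


def score_answer(
    llm: Callable[[str], str], question: str, reference: str, candidate: str
) -> Optional[int]:
    """y_hat = M_LLM(prompt): call the model and parse its judgment."""
    return parse_likert(llm(build_prompt(question, reference, candidate)))


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call; always answers with a fixed score."""
    return "Score: 4"


print(score_answer(fake_llm, "Who wrote Hamlet?", "Shakespeare",
                   "William Shakespeare"))
```

Returning `None` for unparseable replies keeps malformed outputs visible to the caller instead of silently mapping them to a score; how such cases were handled in the experiments is not stated in the paper.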
2.2. Datasets used in question answering evaluation
Numerous benchmark datasets are available in the literature for evaluating QA. We selected datasets that were deployed in the same research setting, that is, comparing the metric with human judgments, mostly on a Likert scale from 1 to 5, where 5 denotes the model answer most relevant to the gold (ground truth) answers. The datasets utilized in our experiments are summarized in Table 1.

Table 1. Summary of datasets

| Dataset | Description | References |
|---|---|---|
| NarrativeQA | Benchmark for GenQA metrics, with short answers averaging 4.7 words. | [17], [24] |
| SemEval | Used for GenQA metrics, with very short answers averaging 2.5 words. | [16], [17] |
| MS-MARCO | Contains human judgments for model-generated answers, known for longer responses. | [17] |
| AVSD | Collected human judgments on model responses, with longer and complex answers. | [17] |
| MCScript | Evaluates reasoning within stories for children, assessing comprehension skills. | [16] |
| CosmosQA | Focuses on commonsense reasoning through everyday blogs, assessing real-world reasoning. | [18] |
| SocialIQA | Evaluates social reasoning from knowledge-base passages, focusing on social interactions. | [19] |
| Quoref | Assesses coreferential reasoning within Wikipedia articles for language comprehension. | [20] |
| Contrastive pairs | Consists of contrastive answer pairs for evaluating models against human judgments. | [12] |
| EVOUNA (NQ, TQ) | Aggregates outcomes from various Open-QA models on NQ and TQ datasets. | [21]–[23] |

3. EXPERIMENTAL RESULTS
We tested our proposed LLM-Mixtral metric on ten different datasets and compared it with all the methods covered in section 1. We chose Mixtral 8x7B for three reasons. First, very little research has tackled this problem using open-source models, and most of the related research in this area used closed GPT models (OpenAI and Claudera), which are paid services.
The second reason is that Mixtral 8x7B is one of the top-performing open-source models [25] on general tasks, with relatively fewer parameters than many open-source LLMs. Mixtral matches or surpasses Llama 2 70B and GPT-3.5 on public tasks, with remarkable results in mathematics, code generation, and multilingual tasks. We therefore investigate the model's performance on this challenging, specific closed task of Likert-scale scoring of QA-generated answers against human judgments. The third reason is that open-source models are, for privacy reasons, more appealing to government and private-sector enterprises where data privacy is critical and data must stay on-premises, which is achievable using open-source models.

3.1. Experiment I: comparison with key phrase question answering metric
We benchmarked LLM-Mixtral against the datasets used in [9]. For each dataset, the question, candidate answer, and ground-truth reference are extracted and passed to the LLM using the prompt in Figure 1, and the resulting output is parsed to obtain the Likert-scale judgment from 1 to 5. The Pearson correlation coefficient is then computed between the LLM-Mixtral scores and the human judgments. Test sets are used in the comparative study. As shown in Table 2, the simple proposed LLM-Mixtral model outperforms all metrics on average and on three out of four datasets. The lexical metrics are far behind in terms of correlation with human judgments. KPQA provides a relatively good correlation but, on average, is somewhat lower than the simple LLM-Mixtral metric.

Table 2. Benchmarking LLM-Mixtral against lexical and KPQA metrics [18]

| Metric | MS-MARCO | AVSD | NarrativeQA | SemEval | Average |
|---|---|---|---|---|---|
| BLEU-1 | 0.349 | 0.58 | 0.634 | 0.359 | 0.4805 |
| BLEU-4 | 0.193 | 0.499 | 0.258 | -0.035 | 0.22875 |
| ROUGE-L | 0.309 | 0.585 | 0.707 | 0.566 | 0.54175 |
| METEOR | 0.423 | 0.578 | 0.735 | 0.543 | 0.56975 |
| CIDER | 0.275 | 0.567 | 0.648 | 0.429 | 0.47975 |
| BERTScore | 0.463 | 0.658 | 0.785 | 0.63 | 0.634 |
| BLEU-1-KPQA | 0.675 | 0.719 | 0.716 | 0.362 | 0.618 |
| ROUGE-L-KPQA | 0.698 | 0.712 | 0.774 | 0.742 | 0.7315 |
| BERTScore-KPQA | 0.673 | 0.729 | 0.782 | 0.741 | 0.73125 |
| LLM-Mixtral | 0.691 | 0.749 | 0.818 | 0.777 | 0.75875 |
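The correlations in Table 2 are Pearson coefficients between metric scores and human Likert ratings. As a reminder of how such a value is obtained, the following is a small self-contained sketch; the score lists are made-up illustrative numbers, not data from any of the benchmark datasets, and the function name `pearson` is chosen for this example.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Made-up ratings for five answers (illustrative only).
human_scores = [5, 1, 3, 4, 2]
metric_scores = [4, 1, 3, 5, 2]

print(pearson(metric_scores, human_scores))
```

In practice a library routine such as `scipy.stats.pearsonr` would typically be used; the manual version is shown only to make the computation explicit.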
3.2. Experiment II: comparison with learned evaluation for reading comprehension metric
We benchmarked LLM-Mixtral on the datasets and models presented by the authors in [12]. The results are shown in Table 3. BERT semantic textual similarity benchmark (STS-B) is a BERT-base model fine-tuned on the sentence similarity task, STS-B [26]. LERC, as described in section 1, is a learned metric based on supervised fine-tuning on a dataset of more than 40k examples. The proposed LLM-Mixtral metric outperformed BERT STS-B and LERC on average, although not on the Quoref dataset. LERC, the second-best performer after LLM-Mixtral, was competitive on two datasets, CosmosQA and Quoref. Overall, LERC produced an average correlation of 0.744, while our proposed LLM-Mixtral achieved 0.822, an increase of about 8 points.

Table 3. Benchmarking LLM-Mixtral against lexical and LERC metrics [12]

| Metric | NarrativeQA | MCScript | CosmosQA | Quoref | Average |
|---|---|---|---|---|---|
| BLEU-1 | 0.472 | 0.260 | 0.670 | 0.578 | 0.460 |
| METEOR | 0.615 | 0.502 | 0.711 | 0.716 | 0.611 |
| ROUGE-L | 0.495 | 0.297 | 0.701 | 0.604 | 0.490 |
| BERTScore | 0.534 | 0.194 | 0.779 | 0.286 | 0.447 |
| BERT STS-B | 0.686 | 0.449 | 0.789 | 0.750 | 0.638 |
| LERC | 0.738 | 0.694 | 0.824 | 0.741 | 0.744 |
| LLM-Mixtral | 0.884 | 0.795 | 0.824 | 0.735 | 0.822 |

3.3. Experiment III: comparing LLM-Mixtral with LERC on out-of-distribution datasets
Although LERC achieved good performance on some of the datasets, investigating the training and test datasets used in LERC showed that the data is statistically biased, which casts doubt on the generalization capabilities of LERC. As shown in Figure 2, the training and test data have similar biases. To verify this, we applied LERC to entirely unseen, out-of-distribution data from the first group of datasets (Experiment I), namely Microsoft machine reading comprehension (MS-MARCO) and audio-visual scene understanding (AVSD).
As expected, the correlation results in Table 4 show that LERC performs worse than LLM-Mixtral on these datasets. This is one of the critical advantages of the LLM-Mixtral-based metric: it is dataset and domain agnostic and is not influenced by biases in a training distribution.

Figure 2. Biases in the training and test sets used in LERC

Table 4. Comparison of models LERC and LLM-Mixtral on MS-MARCO and AVSD datasets [17]

| Model | MS-MARCO | AVSD |
|---|---|---|
| LLM-Mixtral | 0.691 | 0.749 |
| LERC | 0.601 | 0.621 |
3.4. Experiment IV: contrastive scoring task
The experiment was conducted on the contrastive pairs dataset [12]. This dataset assesses the preference between two possible answers. The results are summarized in Table 5, reported as accuracy (%). Lexical-based metrics performed poorly, as expected, since the contrastive pairs were designed to have similar token overlap with the reference. The sentence similarity model STS-B, on the other hand, outperformed the lexical metrics, likely because it generalizes beyond token overlap. Among the existing metrics, the LERC model achieved the best results, with an average accuracy of 80.3%. Our proposed LLM-Mixtral metric reached an average accuracy of 95%. This result supports our hypothesis that LLM-based models outperform conventional and state-of-the-art models in this scoring task.

Table 5. Results of contrastive pairs experiment on datasets [12]

| Metric | NarrativeQA | MCScript | CosmosQA | SocialIQA | Avg. |
|---|---|---|---|---|---|
| BLEU-1 | 53 | 54 | 52 | 55 | 53.5 |
| ROUGE-L | 53 | 57 | 53 | 53 | 61.2 |
| METEOR | 60 | 62 | 57 | 53 | 54 |
| BERTScore | 70 | 58 | 74 | 62 | 66 |
| BERT STS-B | 70.6 | 70 | 59.3 | 66.6 | 66.6 |
| LERC | 80 | 87.3 | 72.6 | 81.3 | 80.3 |
| LLM-Mixtral | 96 | 94 | 96 | 94 | 95 |

3.5. Experiment V: open question answering datasets
The previous experiments were conducted using closed QA, where the answer text is provided within a given context passage. In the current experiment, we evaluate the metric on a more challenging task using commonly used OpenQA datasets, namely natural questions (NQ), Trivia question answering (TQ), and event and opinion understanding in natural language (EVOUNA) datasets. LLM-Mixtral outperformed BERTscore applied to the same datasets, as shown in Table 6, which summarizes the relative performance of LLM-Mixtral over other state-of-the-art models. The best-performing neural-BERT model is chosen for each subset of the ten datasets used in the experimentation.
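For the binary open-QA setting, the LLM's Yes/No reply has to be mapped to a label and compared with the human label. A minimal sketch of that step follows, under stated assumptions: the parsing rule (look only at the first word) and the choice to count unparseable replies as disagreements are illustrative decisions, not details reported in the paper, and the names `parse_yes_no` and `agreement` are hypothetical.

```python
from typing import Optional


def parse_yes_no(reply: str) -> Optional[bool]:
    """Map a reply that starts with Yes/No to True/False, else None."""
    cleaned = reply.strip().lower().replace(".", " ").replace(",", " ")
    words = cleaned.split()
    if not words:
        return None
    if words[0] == "yes":
        return True
    if words[0] == "no":
        return False
    return None


def agreement(predictions: list[Optional[bool]], human: list[bool]) -> float:
    """Fraction of items where the parsed judgment equals the human label.

    Unparseable replies (None) never equal a human label, so they count
    as disagreements.
    """
    if len(predictions) != len(human) or not human:
        raise ValueError("need two non-empty lists of equal length")
    return sum(p == h for p, h in zip(predictions, human)) / len(human)


# Made-up replies and labels, for illustration only.
replies = ["Yes.", "No, the answer names the wrong city.", "Unclear", "Yes"]
human_labels = [True, False, True, False]

parsed = [parse_yes_no(r) for r in replies]
print(parsed, agreement(parsed, human_labels))
```

The same comparison pattern applies to the contrastive task of Experiment IV, where the metric's preferred answer in each pair is checked against the human-preferred one.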
The incremental difference between the proposed LLM-Mixtral model and the best-performing neural-BERT model ranges from 2.7 points to 8.4 points on the answer-pairs scoring task and reaches 14.7 points on the contrastive answer-pairs task.

Table 6. Comparative analysis of best performing neural BERT models with LLM-Mixtral

| Datasets | Model | Avg. performance | LLM-Mixtral | Difference |
|---|---|---|---|---|
| MS-MARCO, AVSD, NarrativeQA, SemEval | ROUGE-L-KPQA | 73.15 | 75.87 | 2.72 |
| CosmosQA, MCScript, NarrativeQA, Quoref | LERC | 74.44 | 82.2 | 7.76 |
| NaturalQuestions | BERTScore | 80.84 | 88.2 | 7.36 |
| TriviaQA | BERTScore | 85.28 | 93.68 | 8.4 |
| Contrastive pairs datasets (CosmosQA, MCScript, NarrativeQA, SocialIQA) | LERC | 80.3 | 95 | 14.7 |

4. CONCLUSION
This study explored applying LLMs as an evaluation metric for QA tasks. Our inquiry has resulted in a deeper understanding of the capabilities of LLMs in assessing, adjudicating, and appraising the performance of QA systems in both closed and open domains. We conducted extensive experiments on ten datasets, comparing our proposed LLM-Mixtral metric with existing methods on QA tasks. The results indicated the superiority of LLM-Mixtral in providing accurate evaluations of answer quality. It outperformed traditional lexical metrics, neural BERT-based models, and KPQA approaches. Mixtral 8x7B, used as a simple LLM-based metric, showed higher correlations with human judgments than more sophisticated state-of-the-art statistical and neural machine-learning-based metrics. It reached an average Pearson correlation of 0.822 with human judgments on the answer-pair scoring datasets of Experiment II and an average accuracy of 95% in contrastive scoring. This superior performance across a diverse range of datasets and models underscores the potential of LLMs in QA evaluation. Our adopted metric exhibited versatility in open-domain QA experiments, specifically on NQ and TQ datasets. It achieved results closer to human judgments and outperformed over-relaxed lexical matching metrics, bridging the gap between automated scoring and human assessment. The correlation with human judgments on these datasets reinforced the effectiveness of LLM-Mixtral, positioning it on par with GPT-3.5 and outperforming state-of-the-art neural BERT-based models like BERTScore.

Our findings open new horizons for applying LLMs in QA evaluation, offering a complementary approach to traditional and neural-based metrics. This research marks a crucial step in pursuing more accurate and effective QA evaluation methods. Some key benefits of using LLM-based metrics over state-of-the-art metrics include customizability, multifaceted evaluation, and train-free capabilities. These features enable us to create a metric that can flexibly perform the judgment task across various datasets without requiring a learning process while still achieving competitive performance. LLM-based metrics are more domain agnostic than most machine learning BERT-based techniques, which showed a distribution and domain bias when correlating with human judgments.

REFERENCES
[1] L. Hirschman and R. Gaizauskas, "Natural language question answering: The view from here," Natural Language Engineering, vol. 7, no. 4, pp. 275–300, 2001, doi: 10.1017/S1351324901002807.
[2] A. M. N. Allam and M. H. Haggag, "The question answering systems: A survey," International Journal of Research and Reviews in Information Sciences (IJRRIS), vol. 2, no. 3, 2012.
[3] M. Rotaru and D. J. Litman, "Improving question answering for reading comprehension tests by combining multiple systems," in Proceedings of the AAAI 2005 Workshop on Question Answering in Restricted Domains, 2005, pp. 46–50.
[4] Y. Liu, C. Zhang, X. Yan, Y. Chang, and P. S. Yu, "Generative question refinement with deep reinforcement learning in retrieval-based QA system," in Proceedings of the 28th ACM International Conference on Information and Knowledge Management, 2019, pp. 1643–1652, doi: 10.1145/3357384.3358046.
[5] D. Deutsch, T. B.-Weiss, and D. Roth, "Towards question-answering as an automatic metric for evaluating the content quality of a summary," Transactions of the Association for Computational Linguistics, vol. 9, pp. 774–789, 2021, doi: 10.1162/tacl_a_00397.
[6] C.-Y. Lin, "ROUGE: A package for automatic evaluation of summaries," in Text Summarization Branches Out, 2004, pp. 74–81.
[7] E. Clark, A. Celikyilmaz, and N. A. Smith, "Sentence mover's similarity: automatic evaluation for multi-sentence texts," in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 2748–2760, doi: 10.18653/v1/P19-1264.
[8] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi, "BERTscore: evaluating text generation with BERT," in 8th International Conference on Learning Representations, ICLR 2020, 2020, pp. 1–43.
[9] H. Lee et al., "KPQA: A metric for generative question answering using keyphrase weights," in 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference, 2021, pp. 2105–2115, doi: 10.18653/v1/2021.naacl-main.170.
[10] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," in Proceedings of NAACL-HLT 2019, 2019, pp. 4171–4186.
[11] T. Sellam, D. Das, and A. P. Parikh, "BLEURT: Learning robust metrics for text generation," in Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2020, pp. 7881–7892, doi: 10.18653/v1/2020.acl-main.704.
[12] A. Chen, G. Stanovsky, S. Singh, and M. Gardner, "MOCHA: A dataset for training and evaluating generative reading comprehension metrics," in EMNLP 2020 - 2020 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference, 2020, pp. 6521–6532, doi: 10.18653/v1/2020.emnlp-main.528.
[13] P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, and G. Neubig, "Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing," arXiv-Computer Science, pp. 1–46, 2021, doi: 10.48550/arXiv.2107.13586.
[14] V. Sanh et al., "Multitask prompted training enables zero-shot task generalization," arXiv-Computer Science, 2021, doi: 10.48550/arXiv.2110.08207.
[15] L. Ouyang et al., "Training language models to follow instructions with human feedback," arXiv-Computer Science, pp. 1–68, 2022, doi: 10.48550/arXiv.2203.02155.
[16] S. Ostermann, M. Roth, A. Modi, S. Thater, and M. Pinkal, "SemEval-2018 Task 11: Machine comprehension using commonsense knowledge," in Proceedings of The 12th International Workshop on Semantic Evaluation, 2018, pp. 747–757, doi: 10.18653/v1/S18-1119.
[17] B. Bi, C. Wu, M. Yan, W. Wang, J. Xia, and C. Li, "Incorporating external knowledge into machine reading for generative question answering," in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 2521–2530, doi: 10.18653/v1/D19-1255.
[18] L. Huang, R. L. Bras, C. Bhagavatula, and Y. Choi, "COSMOS QA: Machine reading comprehension with contextual commonsense reasoning," in EMNLP-IJCNLP 2019 - 2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing, Proceedings of the Conference, 2019, pp. 2391–2401, doi: 10.18653/v1/d19-1243.
[19] M. Sap, H. Rashkin, D. Chen, R. L. Bras, and Y. Choi, "Social IQA: Commonsense reasoning about social interactions," in 2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing, 2019, pp. 4463–4473, doi: 10.18653/v1/d19-1454.
[20] P. Dasigi, N. F. Liu, A. Marasović, N. A. Smith, and M. Gardner, "Quoref: A reading comprehension dataset with questions requiring coreferential reasoning," in 2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing, 2019, pp. 5925–5932, doi: 10.18653/v1/d19-1606.
[21] C. Wang et al., "Evaluating open-QA evaluation," in 37th International Conference on Neural Information Processing Systems, 2023, pp. 77013–77042.
[22] T. Kwiatkowski et al., "Natural questions: a benchmark for question answering research," Transactions of the Association for Computational Linguistics, vol. 7, pp. 453–466, 2019, doi: 10.1162/tacl_a_00276.
[23] M. Joshi, E. Choi, D. S. Weld, and L. Zettlemoyer, "TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension," in ACL 2017 - 55th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, 2017, vol. 1, pp. 1601–1611, doi: 10.18653/v1/P17-1147.
[24] T. Kočiský et al., "The narrativeQA reading comprehension challenge," Transactions of the Association for Computational Linguistics, vol. 6, pp. 317–328, 2018, doi: 10.1162/tacl_a_00023.
[25] A. Q. Jiang et al., "Mixtral of experts," arXiv-Computer Science, pp. 1–13, 2024, doi: 10.48550/arXiv.2401.04088.
[26] D. Cer, M. Diab, E. Agirre, I. L.-Gazpio, and L. Specia, "SemEval-2017 task 1: Semantic textual similarity multilingual and cross-lingual focused evaluation," in Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2017, pp. 1–14, doi: 10.18653/v1/s17-2001.

BIOGRAPHIES OF AUTHORS

Hazem Abdelazim is currently a Professor of AI and ML and Dean at ESLSCA University's School of Computing and Digital Technology. He has been locally and internationally recognized for his achievements. He was awarded an "Invention Achievement Award" from IBM in 1991, the First Scientific Innovation Prize for Arab Scientists (1993), the State Excellence and Encouragement Award (1995), and the MBA Director's Cup (2003) from MSM, Netherlands. His journey included academic positions at Cairo University, AUC, and UAE University, and professional positions as an IBM Research Scientist and Director of Research at Microsoft. His research interests are generative artificial intelligence (AI), LLM, information retrieval, and NLP. He has 35+ publications. He can be contacted at email: hazem.abdelazim@eslsca.edu.eg.

Mohamed Tharwat Waheed graduated from the Department of Electronics and Communication, Faculty of Engineering, Cairo University in 2006. He received the M.Sc. degree in using reinforcement learning in mobile communication in 2017. He completed his Ph.D. with a focus on the applications of AI/ML in the telecom industry at Cairo University. In addition to his industry role as a Subject Matter Expert in the technology domain at Vodafone, Egypt, he is a research and teaching doctor at ESLSCA University School of Computing and Digital Technologies. He is also an IEEE Senior Member. His research interests span a diverse spectrum, including IoT in smart cities, 5G, autonomous driving, AI/ML in mobile communication, and the implementation of generative AI in domain-specific tasks. He can be contacted at email: mohamed.mohamed-waheed@vodafone.com.

Ammar Mohammad earned his bachelor's and master's degrees in computer science from Cairo University, Egypt, and obtained his Ph.D. in computer science from the University of Koblenz-Landau, Germany, in 2010. He has previously served as a researcher and research fellow with the AI Research Group at the University of Koblenz-Landau. Currently, he holds the position of professor of computer science at both Cairo University and MSA University in Egypt. His research interests encompass machine and deep learning techniques, methods, algorithms, and applications across various domains. He can be contacted at email: ammar@cu.edu.eg.