IAES International Journal of Artificial Intelligence (IJ-AI)
Vol. 14, No. 1, February 2025, pp. 151~158
ISSN: 2252-8938, DOI: 10.11591/ijai.v14.i1.pp151-158
Large language models-based metric for generative question answering systems
Hazem Abdel Azim, Mohamed Tharwat Waheed, Ammar Mohammed
School of Computing and Digital Technologies, ESLSCA University, Cairo, Egypt
Article Info
Article history:
Received Mar 20, 2024
Revised Aug 13, 2024
Accepted Aug 30, 2024
Keywords:
Evaluation metrics
Generative question answering
Large language models
Likert-scale scoring
Zero-shot prompting
ABSTRACT
In the evolving landscape of text generation, which has advanced rapidly in recent years, techniques for evaluating the performance and quality of the generated text lag behind. Traditionally, lexical-based metrics such as bilingual evaluation understudy (BLEU), recall-oriented understudy for gisting evaluation (ROUGE), metric for evaluation of translation with explicit ordering (METEOR), consensus-based image description evaluation (CIDER), and F1 have been utilized, primarily relying on n-gram similarity for evaluation. In recent years, neural and machine-learning-based metrics, like bidirectional encoder representations from transformers (BERT) score, key phrase question answering (KPQA), and BERT supervised training of learned evaluation metric for reading comprehension (LERC), have shown superior performance over traditional metrics but suffer from a lack of generalization towards different domains and require massive human-labeled training data. The main contribution of the current research is to investigate the use of train-free large language models (LLMs) as scoring metrics, evaluators, and judges within a question-answering context, encompassing both closed and open-QA scenarios. To validate this idea, we employ simple zero-shot prompting of Mixtral 8x7B, a popular and widely used open-source LLM, to score a variety of datasets and domains. The experimental results on ten different benchmark datasets are compared against human judgments, revealing that, on average, simple LLM-based metrics outperformed sophisticated state-of-the-art statistical and neural machine-learning-based metrics by 2-8 points on answer-pairs scoring tasks and up to 15 points on contrastive preferential tasks.
This is an open access article under the CC BY-SA license.
Corresponding Author:
Hazem Abdel Azim
School of Computing and Digital Technologies, ESLSCA University
Cairo, Egypt
Email: hazem.abdelazim@eslsca.edu.eg
1. INTRODUCTION
Question answering (QA), dating back to the seminal work of Hirschman and Gaizauskas [1], has long aspired to equip computer systems with the ability to furnish accurate and pertinent responses to posed inquiries, leveraging either predefined context or curated knowledge bases. QA systems are typically decomposed into two key components [2]: a retriever and a reader. The retriever's function is to search an extensive collection of passages and retrieve the most relevant passage given the query. The reader's function is to comprehend the passage and answer the query from the given passage or set of retrieved passages. The current research focuses on the reader component, namely the reading comprehension (RC) task, and in particular on the metrics used to measure the performance of the RC task.
Journal homepage: http://ijai.iaescore.com
Traditional RC-QA systems [3] rely on extractive "span-based" QA, whether in a closed or open domain. In span-based extractive QA, the AI model is given a passage and a question, and its task is to extract the answer from within the passage as a span of [start, end] indices. Accordingly, the metrics used to evaluate those systems were designed to capture lexical-based similarities between the model answers and the ground-truth ideal answers created by human annotators. Recent QA systems are generative [4], sometimes known as "abstractive" QA: they generate a "semantically" correct answer from the passage and do not necessarily extract a span. More advanced metrics are required to evaluate those generative responses.
Generally, the current landscape of QA metrics can be categorized into three broad categories: lexical-statistical metrics, embedding-based metrics, and neural bidirectional encoder representations from transformers (BERT)-based models [5]. Lexical-statistical metrics are the more conventional metrics and have been used for several years. They rely on token matching, whether exact match (EM) or relaxed (F1-score), with different n-gram variants. These metrics include bilingual evaluation understudy (BLEU), recall-oriented understudy for gisting evaluation (ROUGE), metric for evaluation of translation with explicit ordering (METEOR), consensus-based image description evaluation (CIDER), EM, and F1-score. BLEU is precision-centric and widely used in evaluating translation tasks [1]; ROUGE is recall-centric and commonly used in summarization tasks [6].
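To make the token-matching idea concrete, the following is a minimal sketch of exact match and token-level F1 between a candidate answer and a single reference. The whitespace tokenizer and the function names are illustrative assumptions, not the normalization used by any particular benchmark.

```python
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def exact_match(candidate: str, reference: str) -> int:
    """1 if the normalized token sequences are identical, else 0."""
    return int(normalize(candidate) == normalize(reference))


def token_f1(candidate: str, reference: str) -> float:
    """Harmonic mean of token-overlap precision and recall."""
    cand, ref = normalize(candidate), normalize(reference)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# High overlap, different wording: partial credit from shared tokens.
print(token_f1("he put the check in the bank",
               "he deposited the check at the bank"))
# A paraphrase with no shared tokens scores zero despite similar meaning.
print(token_f1("the man was joyful", "he is happy"))
```

The second call illustrates the weakness discussed next: a semantically acceptable paraphrase receives no credit because no tokens overlap.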
Although these traditional metrics have provided acceptable performance for span-based extractive QA systems, they suffer from critical drawbacks as they do not capture semantic features in the tokens. On the other hand, the semantic capturing aspect has been addressed in the second category of embedding-based metrics, which utilize token embeddings to provide a more nuanced similarity score and mitigate the limitations of lexical metrics [7], [8]. While these metrics offer flexibility and improve QA scoring compared to lexical metrics, they nonetheless encounter challenges adapting to specific contexts due to their static nature, failing to consider the contextual nuances of tokens within questions or answers [9]. For instance, a word like "bank" would yield the same static embedding vector in different contexts, such as "depositing a paycheck in the bank" and "crossing the river bank".
Those limitations were handled in the third category, neural BERT-based models, which use different variants of BERT architectures [10] to capture contextualized embeddings and showed superior correlation with human judgments compared to the other categories. Several models were reported recently, like BERTscore [8], which relies on either word or contextualized embeddings and cosine similarity to generate a numeric score. Bilingual evaluation understudy with representations from transformers (BLEURT) [11] is a refined version of BERTscore that leverages augmented synthesized data to train the model. Another advanced version uses BERT to train the model to learn importance weights for each token, as in key phrase question answering (KPQA) models [9]. The refinement here is that instead of treating tokens in the model answer and ground truth gold answers equally, they are weighted based on their importance in answering the question. A standard BERT architecture is followed by a softmax classifier layer to generate the weights for each token, and those weights are incorporated into conventional metrics like ROUGE, BLEU, and the BERTscore metric. A BERT-based direct supervised learning approach was adopted by [12], which learns the required rating directly using massive labelled training data. The model is called learned evaluation metric for reading comprehension (LERC). This model is based on a BERT architecture that has been fine-tuned on human judgment scores. LERC takes as input a passage (context), question, reference, and candidate, and the output score measures the accuracy of the candidate as compared to the ground truth human judgement.
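The embedding-plus-cosine-similarity scoring mentioned above can be sketched as a greedy matching between token vectors. This is only an illustration of the general idea: the two-dimensional vectors are made-up stand-ins for contextual embeddings, and the published BERTscore additionally uses a real BERT encoder and optional importance weighting, both omitted here.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def greedy_match_f1(cand_vecs: list[list[float]],
                    ref_vecs: list[list[float]]) -> float:
    """F1 over greedy best-match cosine similarities between token vectors."""
    # Recall: every reference token is matched to its closest candidate token.
    recall = sum(max(cosine(r, c) for c in cand_vecs)
                 for r in ref_vecs) / len(ref_vecs)
    # Precision: every candidate token is matched to its closest reference token.
    precision = sum(max(cosine(c, r) for r in ref_vecs)
                    for c in cand_vecs) / len(cand_vecs)
    if precision + recall <= 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical token embeddings; in practice these would come from an encoder.
reference = [[1.0, 0.0], [0.0, 1.0]]
close_candidate = [[0.9, 0.1], [0.1, 0.9]]
unrelated_candidate = [[-1.0, 0.0]]

print(greedy_match_f1(close_candidate, reference))
print(greedy_match_f1(unrelated_candidate, reference))
```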
The preceding neural-BERT systems have demonstrated significantly superior performance to traditional lexical and static embedding metrics, especially within the domains for which they are trained. However, they are hindered by a complex training procedure that relies on large amounts of costly, manually annotated data. Additionally, they exhibit limited generalization across various domains, and there is still room for improvement on out-of-distribution data, particularly on contrastive pairs [12].
Recently, a fourth category based on using large language models (LLMs) as scoring judges has shown significant promise compared to the preceding three categories. Utilizing LLMs with carefully crafted prompts [13] has demonstrated remarkable success in various tasks, both within academic benchmarks [14] and real-world settings [15]. However, to our knowledge, no published research has yet reported on using LLMs as a scoring agent for RC tasks in a QA context to mimic human judgments on a Likert scale and on the simpler binary task of correct/incorrect answers. Thus, this research's primary contribution lies in exploring this fourth category, employing GPT LLMs and zero-shot prompting to assess the correlation between model scores and human judgments compared to other state-of-the-art QA metrics. We conducted experiments using
this proposed approach on 10 state-of-the-art datasets [12], [16]–[23], encompassing diverse domains and QA styles. The first eight datasets feature human scores based on a 5-point Likert scale, while the last two consist of binary openQA datasets. In these latter datasets, the task assigned to the LLM is to determine whether the answer is correct compared to a gold (ground truth) answer. Through careful prompting, we directed the LLM to execute the scoring task. This discriminative task is particularly challenging, especially with the 5-point Likert scale judgment, as the LLM must distinctly differentiate between closely labelled categories between 1 and 5.
The rest of the paper is organized as follows. Section 2 discusses the research method: subsection 2.1 presents the proposed Mixtral-based LLM scoring model, and subsection 2.2 describes the datasets employed in this research, providing a foundation for the empirical analysis. The findings from our empirical analysis across all datasets are presented and discussed in section 3. Finally, section 4 concludes the paper with a summary of key insights and conclusions drawn from the research.
2. METHOD
In this section, we describe the methodology used in our research to develop and evaluate the proposed LLM-based metric, as well as the datasets used for evaluation.
2.1. Large language model-based metric
The proposed metric capitalizes on the reasoning capabilities of the open-source Mixtral 8x7B LLM as a scoring machine, which we will call LLM-Mixtral. This research tests the hypothesis that a simple, prompt-based, zero-shot open-source LLM can outperform the state-of-the-art existing metrics covered in the previous section and correlate better with human judgements. We formally design a prompt containing a question q, a gold (reference) answer, and a model-generated (AI-generated) answer, as in (1):
$$\hat{y} = M_{\mathrm{LLM}}(\mathrm{prompt}) \tag{1}$$
The predicted score $\hat{y}$ is then compared to the corresponding human judgement. An example of a prompt that can be applied as an input to (1) is: "Here is a question, a set of golden answers (split with /), an AI-generated answer. Can you judge whether the AI-generated answer is correct according to the question and golden answers, answer Yes or No."
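The binary judging step can be sketched as follows. The instruction text is the prompt quoted above; everything else is an assumption made for illustration: the field labels appended after the instruction, the `judge` and `parse_yes_no` helpers, and the `fake_llm` stub that stands in for a real call to Mixtral 8x7B.

```python
from typing import Callable, Optional, Sequence

# Instruction text from the paper; the three field labels below it are assumed.
PROMPT_TEMPLATE = (
    "Here is a question, a set of golden answers (split with /), "
    "an AI-generated answer. Can you judge whether the AI-generated answer "
    "is correct according to the question and golden answers, "
    "answer Yes or No.\n"
    "Question: {question}\n"
    "Golden answers: {golden}\n"
    "AI-generated answer: {candidate}\n"
)


def build_prompt(question: str, golden_answers: Sequence[str],
                 candidate: str) -> str:
    """Fill the template, joining the golden answers with '/'."""
    return PROMPT_TEMPLATE.format(
        question=question, golden="/".join(golden_answers), candidate=candidate
    )


def parse_yes_no(reply: str) -> Optional[bool]:
    """Map the first word of the reply to True/False; None if unparseable."""
    words = reply.strip().lower().split()
    first = words[0].strip(".,!:;\"'") if words else ""
    if first == "yes":
        return True
    if first == "no":
        return False
    return None


def judge(llm: Callable[[str], str], question: str,
          golden_answers: Sequence[str], candidate: str) -> Optional[bool]:
    """Apply y_hat = M_LLM(prompt) for the binary correct/incorrect task."""
    return parse_yes_no(llm(build_prompt(question, golden_answers, candidate)))


def fake_llm(prompt: str) -> str:
    """Placeholder model: says Yes only if the candidate mentions Paris."""
    candidate_part = prompt.split("AI-generated answer:")[-1]
    return "Yes." if "Paris" in candidate_part else "No."


print(judge(fake_llm, "What is the capital of France?",
            ["Paris", "City of Paris"], "Paris"))
```

Passing the model as a callable keeps the sketch independent of any particular inference library; a real deployment would replace `fake_llm` with a function that sends the prompt to the hosted or on-premises model.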
We used several prompts depending on the task and the dataset used. An example is shown in Figure 1, which instructs LLM-Mixtral to generate a human-like judgement on how well the candidate (hypothesis) answer is aligned semantically with the ground truth reference answer. The predicted judgment $\hat{y}$ is on a Likert scale from 1 to 5 for the first eight datasets or a binary judgement (correct/incorrect) for the last two datasets, as will be explained in the next section.
Figure 1. Zero-shot prompt applied to LLM-Mixtral model
2.2. Datasets used in question answering evaluation
Numerous benchmark datasets are available in the literature for evaluating QA. We selected datasets that were deployed in the same research setting, that is, comparing the metric with human judgments, mostly on a Likert scale from 1 to 5, where 5 denotes the model answer most relevant to the gold (ground truth) answers. The datasets utilized in our experiments are summarized in Table 1.
Table 1. Summary of datasets

| Dataset | Description | References |
|---|---|---|
| NarrativeQA | Benchmark for GenQA metrics, with short answers averaging 4.7 words. | [17], [24] |
| SemEval | Used for GenQA metrics, with very short answers averaging 2.5 words. | [16], [17] |
| MS-MARCO | Contains human judgments for model-generated answers, known for longer responses. | [17] |
| AVSD | Collected human judgments on model responses, with longer and complex answers. | [17] |
| MCScript | Evaluates reasoning within stories for children, assessing comprehension skills. | [16] |
| CosmosQA | Focuses on commonsense reasoning through everyday blogs, assessing real-world reasoning. | [18] |
| SocialIQA | Evaluates social reasoning from knowledge-base passages, focusing on social interactions. | [19] |
| Quoref | Assesses coreferential reasoning within Wikipedia articles for language comprehension. | [20] |
| Contrastive pairs | Consists of contrastive answer pairs for evaluating models against human judgments. | [12] |
| EVOUNA (NQ, TQ) | Aggregates outcomes from various Open-QA models on NQ and TQ datasets. | [21]–[23] |
3. EXPERIMENTAL RESULTS
We tested our proposed LLM-Mixtral metric on ten different datasets and compared it with all the methods covered in section 1. We chose Mixtral 8x7B for three reasons. First, very little research has tackled this problem using open-source models; most of the related research in this area used closed GPT models (OpenAI and Claude), which are paid services. Second, Mixtral 8x7B is one of the top-performing open-source models [25] on general tasks, with relatively fewer parameters than many open-source LLMs. Mixtral notably matches or surpasses Llama 2 70B and GPT-3.5 on public tasks, with remarkable results in mathematics, code generation, and multilingual tasks. We therefore investigate the model's performance on the challenging, specific task of Likert-scale scoring of QA-generated answers versus human judgments. Third, open-source models are more appealing for privacy reasons to some government and private-sector enterprises where data privacy is critical and data must be kept on-premises, which is achievable using open-source models.
3.1. Experiment I: comparison with key phrase question answering metric
We benchmarked LLM-Mixtral on the datasets used in [9]. For each dataset, the question, candidate answer, and ground truth reference are extracted and passed to the LLM using the prompt in Figure 1, and the resulting output is parsed to get the Likert-scale judgment from 1 to 5. The Pearson correlation coefficient is then computed between the LLM-Mixtral scores and the human judgments. Test sets are used in the comparative study. As shown in Table 2, the simple proposed LLM-Mixtral model outperforms all metrics on average and on three out of four datasets. The lexical metrics are far behind in terms of correlation with human judgments. KPQA provides a relatively good correlation but, on average, is somewhat lower than the simple LLM-Mixtral metric.
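The parse-then-correlate procedure described above can be sketched as follows. The regular expression, the helper names, and the toy replies and human scores are illustrative assumptions; the actual parsing rules and data used in the experiments are not specified in the text.

```python
import math
import re
from typing import Optional, Sequence


def parse_likert(reply: str) -> Optional[int]:
    """Return the first standalone digit 1-5 found in the reply, else None."""
    match = re.search(r"\b([1-5])\b", reply)
    return int(match.group(1)) if match else None


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if std_x == 0 or std_y == 0:
        raise ValueError("correlation is undefined for constant sequences")
    return cov / (std_x * std_y)


# Made-up model replies and human Likert judgments for five answers.
replies = ["Score: 5", "I would rate this 2 out of 5", "3", "Rating: 4", "1"]
human_scores = [5, 1, 3, 4, 2]

predicted = [parse_likert(r) for r in replies]
print(predicted)
print(pearson(predicted, human_scores))
```

Taking the first digit in the reply is a design choice that handles phrasings such as "2 out of 5"; replies with no parseable score return None and would need to be filtered or retried before computing the correlation.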
Table 2. Benchmarking LLM-Mixtral against Lexical and KPQA metrics [18]

| Metric | MS-MARCO | AVSD | NarrativeQA | SemEval | Average |
|---|---|---|---|---|---|
| BLEU-1 | 0.349 | 0.58 | 0.634 | 0.359 | 0.4805 |
| BLEU-4 | 0.193 | 0.499 | 0.258 | -0.035 | 0.22875 |
| ROUGE-L | 0.309 | 0.585 | 0.707 | 0.566 | 0.54175 |
| METEOR | 0.423 | 0.578 | 0.735 | 0.543 | 0.56975 |
| CIDER | 0.275 | 0.567 | 0.648 | 0.429 | 0.47975 |
| BERTScore | 0.463 | 0.658 | 0.785 | 0.63 | 0.634 |
| BLEU-1-KPQA | 0.675 | 0.719 | 0.716 | 0.362 | 0.618 |
| ROUGE-L-KPQA | 0.698 | 0.712 | 0.774 | 0.742 | 0.7315 |
| BERTScore-KPQA | 0.673 | 0.729 | 0.782 | 0.741 | 0.73125 |
| LLM-Mixtral | 0.691 | 0.749 | 0.818 | 0.777 | 0.75875 |
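As a quick arithmetic check, the Average column of Table 2 is the plain mean of the four per-dataset correlations, and the "three out of four datasets" statement can be recomputed from the same numbers. The snippet below only re-derives values already reported in the table; the variable names are arbitrary.

```python
# Pearson correlations copied from Table 2, in the column order
# MS-MARCO, AVSD, NarrativeQA, SemEval.
DATASETS = ["MS-MARCO", "AVSD", "NarrativeQA", "SemEval"]
TABLE2 = {
    "BLEU-1": [0.349, 0.58, 0.634, 0.359],
    "BLEU-4": [0.193, 0.499, 0.258, -0.035],
    "ROUGE-L": [0.309, 0.585, 0.707, 0.566],
    "METEOR": [0.423, 0.578, 0.735, 0.543],
    "CIDER": [0.275, 0.567, 0.648, 0.429],
    "BERTScore": [0.463, 0.658, 0.785, 0.63],
    "BLEU-1-KPQA": [0.675, 0.719, 0.716, 0.362],
    "ROUGE-L-KPQA": [0.698, 0.712, 0.774, 0.742],
    "BERTScore-KPQA": [0.673, 0.729, 0.782, 0.741],
    "LLM-Mixtral": [0.691, 0.749, 0.818, 0.777],
}

# Mean correlation per metric across the four datasets.
averages = {metric: sum(vals) / len(vals) for metric, vals in TABLE2.items()}
best_on_average = max(averages, key=averages.get)

# Datasets on which LLM-Mixtral has the highest correlation of all metrics.
wins = [
    name for i, name in enumerate(DATASETS)
    if max(TABLE2, key=lambda m: TABLE2[m][i]) == "LLM-Mixtral"
]

print(best_on_average, round(averages[best_on_average], 5))
print(wins)
```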
3.2. Experiment II: comparison with learned evaluation for reading comprehension metric
We benchmarked LLM-Mixtral on the different datasets and models presented by the authors in [12]. The results are shown in Table 3. BERT semantic textual similarity benchmark (STS-B) is a BERT-base model fine-tuned on the sentence similarity task, STS-B [26]. LERC, as described in section 1, is a learned metric based on supervised fine-tuning on a dataset of 40k+ human judgment scores. The proposed LLM-Mixtral metric outperformed BERT STS-B and LERC on average, although not on the Quoref dataset. LERC, the second-best performer after LLM-Mixtral, was competitive on two datasets, CosmosQA and Quoref. Overall, LERC produced an average correlation of 0.744, while our proposed LLM-Mixtral achieved an average correlation of 0.822, an increase of about 7.8 points.
Table 3. Benchmarking LLM-Mixtral against Lexical and LERC metrics [12]

| Metric | NarrativeQA | MCScript | CosmosQA | Quoref | Average |
|---|---|---|---|---|---|
| BLEU 1 | 0.472 | 0.260 | 0.670 | 0.578 | 0.460 |
| METEOR | 0.615 | 0.502 | 0.711 | 0.716 | 0.611 |
| ROUGE-L | 0.495 | 0.297 | 0.701 | 0.604 | 0.490 |
| BERTScore | 0.534 | 0.194 | 0.779 | 0.286 | 0.447 |
| BERT STS-B | 0.686 | 0.449 | 0.789 | 0.750 | 0.638 |
| LERC | 0.738 | 0.694 | 0.824 | 0.741 | 0.744 |
| LLM-Mixtral | 0.884 | 0.795 | 0.824 | 0.735 | 0.822 |
3.3. Experiment III: comparing LLM-Mixtral with LERC on out-of-distribution datasets
Although LERC achieved good performance on some of the datasets, investigating the training and test datasets used in LERC showed that the data is statistically biased, which casts doubt on the generalization capabilities of LERC. As shown in Figure 2, the distributions of the training and test data have similar biases. To verify this, we applied LERC to totally unseen out-of-distribution data from the datasets of Experiment I, namely Microsoft machine reading comprehension (MS-MARCO) and audio-visual scene-aware dialog (AVSD). The correlation results in Table 4 showed, as expected, a lower performance compared to LLM-Mixtral. This is one of the critical advantages of an LLM-Mixtral-based metric: it is dataset and domain agnostic and is not influenced by training distribution biases.
Figure 2. Biases in the training and test sets used in LERC
Table 4. Comparison of models LERC and LLM-Mixtral on MS-MARCO and AVSD datasets [17]

| Model | MS-MARCO | AVSD |
|---|---|---|
| LLM-Mixtral | 0.691 | 0.749 |
| LERC | 0.601 | 0.621 |
3.4. Experiment IV: contrastive scoring task
This experiment was conducted on the contrastive pairs dataset [12], which assesses the preference between two possible answers. The results are summarized in Table 5 and reported as accuracy. Lexical-based metrics performed poorly, as expected, since the contrastive pairs were designed to have similar token overlap with the reference. The sentence similarity model BERT STS-B, on the other hand, outperformed the lexical metrics, likely because it generalizes beyond token overlap. The LERC model, proposed for this research setting, achieved the best results among the existing metrics, with an average accuracy of 80%. Our proposed LLM-Mixtral metric reached an average accuracy of 95%. This result supports our hypothesis that LLM-based models outperform conventional and state-of-the-art models in this scoring task.
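One way to read the accuracy reported for this task is as the fraction of answer pairs in which the metric scores the human-preferred answer higher. The sketch below assumes that reading; the tuple layout, the function name, and the choice to count ties as errors are assumptions and may differ from the protocol actually used in [12].

```python
from typing import Iterable, Tuple


def pairwise_accuracy(pairs: Iterable[Tuple[float, float, bool]]) -> float:
    """Fraction of pairs where the metric agrees with the human preference.

    Each pair is (score_a, score_b, human_prefers_a), where the scores are
    the metric's ratings for the two candidate answers.
    """
    correct = 0
    total = 0
    for score_a, score_b, human_prefers_a in pairs:
        total += 1
        if score_a == score_b:
            continue  # assumption: a tie expresses no preference, so it is wrong
        if (score_a > score_b) == human_prefers_a:
            correct += 1
    if total == 0:
        raise ValueError("no pairs to evaluate")
    return correct / total


# Made-up Likert scores for four contrastive pairs.
toy_pairs = [
    (5, 2, True),   # metric prefers A, humans prefer A: agree
    (1, 4, False),  # metric prefers B, humans prefer B: agree
    (3, 3, True),   # tie: counted as an error
    (2, 4, True),   # metric prefers B, humans prefer A: disagree
]
print(pairwise_accuracy(toy_pairs))
```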
Table 5. Results of contrastive pairs experiment on datasets [12]

| Metric | NarrativeQA | MCScript | CosmosQA | SocialIQA | Avg. |
|---|---|---|---|---|---|
| BLEU-1 | 53 | 54 | 52 | 55 | 53.5 |
| ROUGE-L | 53 | 57 | 53 | 53 | 61.2 |
| METEOR | 60 | 62 | 57 | 53 | 54 |
| BERTScore | 70 | 58 | 74 | 62 | 66 |
| BERT STS-B | 70.6 | 70 | 59.3 | 66.6 | 66.6 |
| LERC | 80 | 87.3 | 72.6 | 81.3 | 80.3 |
| LLM-Mixtral | 96 | 94 | 96 | 94 | 95 |
3.5. Experiment V: open question answering datasets
The previous experiments were conducted using closed QA, where the answer text is provided within a given context passage. In the current experiment, we aim to evaluate the metric on a more challenging task on commonly used OpenQA datasets, namely natural questions (NQ), Trivia question answering (TQ), and event and opinion understanding in natural language (EVOUNA) datasets. LLM-Mixtral outperformed BERTscore applied on the same datasets, as shown in Table 6, which summarizes the relative performance of LLM-Mixtral over other state-of-the-art models. The best-performing neural-BERT model is chosen for each subset of the ten datasets used in the experimentation. The incremental difference between the proposed LLM-Mixtral model and the best-performing neural-BERT model ranges from 2.7 points to 8.4 points on the answer-pairs scoring task and reaches 14.7 points on the contrastive answer-pairs task.
Table 6. Comparative analysis of best performing neural BERT models with LLM-Mixtral

| Datasets | Best neural-BERT model | Its avg. performance | LLM-Mixtral | Difference |
|---|---|---|---|---|
| MS-MARCO, AVSD, NarrativeQA, SemEval | ROUGE-L-KPQA | 73.15 | 75.87 | 2.72 |
| CosmosQA, MCScript, NarrativeQA, Quoref | LERC | 74.44 | 82.2 | 7.76 |
| NaturalQuestions | BERTScore | 80.84 | 88.2 | 7.36 |
| TriviaQA | BERTScore | 85.28 | 93.68 | 8.4 |
| Contrastive pairs datasets (CosmosQA, MCScript, NarrativeQA, SocialIQA) | LERC | 80.3 | 95 | 14.7 |
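The Difference column of Table 6, and the 2.7 to 8.4 point range quoted for the answer-pairs scoring task, follow directly from subtracting the best baseline from the LLM-Mixtral score. The short recomputation below uses only the numbers in the table; the row labels are abbreviated and the variable names are arbitrary.

```python
# (dataset group, best baseline, baseline score, LLM-Mixtral score, task type)
TABLE6 = [
    ("MS-MARCO/AVSD/NarrativeQA/SemEval", "ROUGE-L-KPQA", 73.15, 75.87, "scoring"),
    ("CosmosQA/MCScript/NarrativeQA/Quoref", "LERC", 74.44, 82.2, "scoring"),
    ("NaturalQuestions", "BERTScore", 80.84, 88.2, "scoring"),
    ("TriviaQA", "BERTScore", 85.28, 93.68, "scoring"),
    ("Contrastive pairs", "LERC", 80.3, 95.0, "contrastive"),
]

# Point gain of LLM-Mixtral over the best baseline, rounded to two decimals.
differences = [round(llm - base, 2) for _, _, base, llm, _ in TABLE6]

scoring_gains = [d for d, row in zip(differences, TABLE6) if row[4] == "scoring"]
contrastive_gains = [d for d, row in zip(differences, TABLE6) if row[4] == "contrastive"]

print(differences)
print(min(scoring_gains), max(scoring_gains), contrastive_gains)
```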
4. CONCLUSION
This study explored applying LLMs as an evaluation metric for QA tasks. Our inquiry has resulted in a deeper understanding of the capabilities of LLMs in assessing, adjudicating, and appraising the performance of QA systems in both closed and open domains. We conducted extensive experiments on ten datasets, comparing our proposed LLM-Mixtral metric with existing methods on QA tasks. The results indicated the superiority of LLM-Mixtral in providing accurate evaluations of answer quality. It outperformed traditional lexical metrics, neural BERT-based models, and KPQA approaches. Mixtral 8x7B, used as a simple LLM-based metric, showed higher correlations with human judgments than more sophisticated state-of-the-art statistical and neural machine-learning-based metrics, reaching an average Pearson correlation of over 0.80 in Experiment II. In contrastive scoring, its preferences between answer pairs agreed with human judgments with an average accuracy of 95%. This superior performance across a diverse range of datasets and models underscores the potential of LLMs in QA evaluation. Our adopted metric exhibited versatility in open-domain QA experiments, specifically
on NQ and TQ datasets. It achieved results closer to human judgments and outperformed over-relaxed lexical matching metrics, bridging the gap between automated scoring and human assessment. The correlation with human judgments on these datasets reinforced the effectiveness of LLM-Mixtral, positioning it on par with GPT-3.5 and outperforming state-of-the-art neural BERT-based models like BERTScore. Our findings open new horizons for applying LLMs in QA evaluation, offering a complementary approach to traditional and neural-based metrics. This research marks a crucial step in pursuing more accurate and effective QA evaluation methods. Some key benefits of using LLM-based metrics over state-of-the-art metrics include customizability, multifaceted evaluation, and train-free capabilities. These features enable us to create a metric that can flexibly perform the judgment task across various datasets without requiring a learning process while still achieving competitive performance. LLM-based metrics are more domain agnostic than most machine learning BERT-based techniques, which showed a distribution (domain) bias when correlating with human judgments.
REFERENCES
[1] L. Hirschman and R. Gaizauskas, "Natural language question answering: The view from here," Natural Language Engineering, vol. 7, no. 4, pp. 275–300, 2001, doi: 10.1017/S1351324901002807.
[2] A. M. N. Allam and M. H. Haggag, "The question answering systems: A survey," International Journal of Research and Reviews in Information Sciences (IJRRIS), vol. 2, no. 3, 2012.
[3] M. Rotaru and D. J. Litman, "Improving question answering for reading comprehension tests by combining multiple systems," in Proceedings of the AAAI 2005 Workshop on Question Answering in Restricted Domains, 2005, pp. 46–50.
[4] Y. Liu, C. Zhang, X. Yan, Y. Chang, and P. S. Yu, "Generative question refinement with deep reinforcement learning in retrieval-based QA system," in Proceedings of the 28th ACM International Conference on Information and Knowledge Management, 2019, pp. 1643–1652, doi: 10.1145/3357384.3358046.
[5] D. Deutsch, T. Bedrax-Weiss, and D. Roth, "Towards question-answering as an automatic metric for evaluating the content quality of a summary," Transactions of the Association for Computational Linguistics, vol. 9, pp. 774–789, 2021, doi: 10.1162/tacl_a_00397.
[6] C.-Y. Lin, "ROUGE: A package for automatic evaluation of summaries," in Text Summarization Branches Out, 2004, pp. 74–81.
[7] E. Clark, A. Celikyilmaz, and N. A. Smith, "Sentence mover's similarity: automatic evaluation for multi-sentence texts," in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 2748–2760, doi: 10.18653/v1/P19-1264.
[8] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi, "BERTscore: evaluating text generation with BERT," in 8th International Conference on Learning Representations, ICLR 2020, 2020, pp. 1–43.
[9] H. Lee et al., "KPQA: A metric for generative question answering using keyphrase weights," in 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference, 2021, pp. 2105–2115, doi: 10.18653/v1/2021.naacl-main.170.
[10] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," in Proceedings of NAACL-HLT 2019, 2019, pp. 4171–4186.
[11] T. Sellam, D. Das, and A. P. Parikh, "BLEURT: Learning robust metrics for text generation," in Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2020, pp. 7881–7892, doi: 10.18653/v1/2020.acl-main.704.
[12] A. Chen, G. Stanovsky, S. Singh, and M. Gardner, "MOCHA: A dataset for training and evaluating generative reading comprehension metrics," in EMNLP 2020 - 2020 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference, 2020, pp. 6521–6532, doi: 10.18653/v1/2020.emnlp-main.528.
[13] P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, and G. Neubig, "Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing," arXiv-Computer Science, pp. 1–46, 2021, doi: 10.48550/arXiv.2107.13586.
[14] V. Sanh et al., "Multitask prompted training enables zero-shot task generalization," arXiv-Computer Science, 2021, doi: 10.48550/arXiv.2110.08207.
[15] L. Ouyang et al., "Training language models to follow instructions with human feedback," arXiv-Computer Science, pp. 1–68, 2022, doi: 10.48550/arXiv.2203.02155.
[16] S. Ostermann, M. Roth, A. Modi, S. Thater, and M. Pinkal, "SemEval-2018 Task 11: Machine comprehension using commonsense knowledge," in Proceedings of The 12th International Workshop on Semantic Evaluation, 2018, pp. 747–757, doi: 10.18653/v1/S18-1119.
[17] B. Bi, C. Wu, M. Yan, W. Wang, J. Xia, and C. Li, "Incorporating external knowledge into machine reading for generative question answering," in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 2521–2530, doi: 10.18653/v1/D19-1255.
[18] L. Huang, R. L. Bras, C. Bhagavatula, and Y. Choi, "COSMOS QA: Machine reading comprehension with contextual commonsense reasoning," in EMNLP-IJCNLP 2019 - 2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing, Proceedings of the Conference, 2019, pp. 2391–2401, doi: 10.18653/v1/d19-1243.
[19] M. Sap, H. Rashkin, D. Chen, R. L. Bras, and Y. Choi, "Social IQA: Commonsense reasoning about social interactions," in 2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing, 2019, pp. 4463–4473, doi: 10.18653/v1/d19-1454.
[20] P. Dasigi, N. F. Liu, A. Marasović, N. A. Smith, and M. Gardner, "Quoref: A reading comprehension dataset with questions requiring coreferential reasoning," in 2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing, 2019, pp. 5925–5932, doi: 10.18653/v1/d19-1606.
[21] C. Wang et al., "Evaluating open-QA evaluation," in 37th International Conference on Neural Information Processing Systems, 2023, pp. 77013–77042.
[22] T. Kwiatkowski et al., "Natural questions: a benchmark for question answering research," Transactions of the Association for Computational Linguistics, vol. 7, pp. 453–466, 2019, doi: 10.1162/tacl_a_00276.
[23] M. Joshi, E. Choi, D. S. Weld, and L. Zettlemoyer, "TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension," in ACL 2017 - 55th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, 2017, vol. 1, pp. 1601–1611, doi: 10.18653/v1/P17-1147.
[24] T. Kočiský et al., "The narrativeQA reading comprehension challenge," Transactions of the Association for Computational Linguistics, vol. 6, pp. 317–328, 2018, doi: 10.1162/tacl_a_00023.
[25] A. Q. Jiang et al., "Mixtral of experts," arXiv-Computer Science, pp. 1–13, 2024, doi: 10.48550/arXiv.2401.04088.
[26] D. Cer, M. Diab, E. Agirre, I. Lopez-Gazpio, and L. Specia, "SemEval-2017 task 1: Semantic textual similarity multilingual and cross-lingual focused evaluation," in Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2017, pp. 1–14, doi: 10.18653/v1/s17-2001.
BIOGRAPHIES OF AUTHORS
Hazem Abdelazim is currently a Professor of AI and ML and Dean at ESLSCA University's School of Computing and Digital Technology. He has been locally and internationally recognized for his achievements. He was awarded an 'Invention Achievement Award' from IBM in 1991, the First Scientific Innovation Prize for Arab Scientists (1993), the State Excellence and Encouragement Award (1995), and the MBA Director's Cup (2003) from MSM, Netherlands. His journey included academic positions at Cairo University, AUC, and UAE University, and professional positions as an IBM Research Scientist and Director of research at Microsoft. His research interests are generative artificial intelligence (AI), LLM, information retrieval, and NLP. He has 35+ publications. He can be contacted at email: hazem.abdelazim@eslsca.edu.eg.
Mohamed Tharwat Waheed graduated from the Department of Electronics and Communication, Faculty of Engineering, Cairo University in 2006. He received the M.Sc. degree in using reinforcement learning in mobile communication in 2017. He completed his Ph.D. with a focus on the applications of AI/ML in the Telecom industry at Cairo University. In addition to his industry role as a Subject Matter Expert in the technology domain at Vodafone, Egypt, he is a research and teaching doctor at ESLSCA University School of Computing and Digital Technologies. He is also an IEEE Senior Member. His research interests span a diverse spectrum, including IoT in smart cities, 5G, autonomous driving, AI/ML in mobile communication, and the implementation of generative AI in domain-specific tasks. He can be contacted at email: mohamed.mohamed-waheed@vodafone.com.
Ammar Mohammad earned his bachelor's and master's degrees in computer science from Cairo University, Egypt, and obtained his Ph.D. in computer science from the University of Koblenz-Landau, Germany, in 2010. He has previously served as a researcher and research fellow with the AI Research Group at the University of Koblenz-Landau. Currently, he holds the position of a professor of computer science at both Cairo University and MSA University in Egypt. His research interests encompass machine and deep learning techniques, methods, algorithms, and applications across various domains. He can be contacted at email: ammar@cu.edu.eg.