IAES International Journal of Artificial Intelligence (IJ-AI)
Vol. 14, No. 4, August 2025, pp. 3241~3252
ISSN: 2252-8938, DOI: 10.11591/ijai.v14.i4.pp3241-3252
Journal homepage: http://ijai.iaescore.com

Human sentiment analytics using multi-model deep learning approach

Anil Kumar Muthevi1, Maganti Venkatesh2, Pallavi Gaurav Adke3, Rajashree Gadhave4, G. L. Narasamba Vanguri5, Thiruveedula Srinivasulu5
1Department of Computer Science and Engineering, Aditya University, Surampalem, India
2Department of Computer Science and Engineering (AIML), Aditya University, Surampalem, India
3Institute of Artificial Intelligence, Dr. Vishwanath Karad MIT-World Peace University, Pune, India
4Department of Computer Engineering, Pillai HOC College of Engineering and Technology, University of Mumbai, Mumbai, India
5Department of Information Technology, Aditya University, Surampalem, India

Article Info
Article history:
Received Feb 14, 2024
Revised Apr 16, 2025
Accepted Jun 8, 2025

Keywords:
Deep learning
Emotional artificial intelligence
Feature-level fusion
Human emotions
Machine learning
Natural language processing
Neural networks

ABSTRACT
For assessing human beings, the measurement of willpower and human emotions plays an important role because human beings are emotional creatures. Emotional analysis, also known as sentiment analysis, is the procedure of using natural language processing (NLP) and machine learning to determine the emotions expressed in speech, text, or other ways of communication. However, critical emotional analysis is limited to human interactions only. Human emotional artificial intelligence, or human sentimental analytics, a subdomain of NLP, seeks to improve this understanding. The present study develops a model using a multi-model deep learning (DL) approach that is capable of efficiently understanding human emotions and their intentions, closely mirroring human cognition. By extending emotional analysis beyond the traditional limits, this model will collect broad-ranging data to uncover clear and hidden emotional details. The main intention of this paper is to build a highly effective model that provides in-depth insights into human emotions, leading to logical conclusions depending on all available factors and reasons. The necessary input data for the current study will be collected from audio-visual media covering a vast range of audio and visual samples.

This is an open access article under the CC BY-SA license.

Corresponding Author:
Anil Kumar Muthevi
Department of Computer Science and Engineering, Aditya University
Aditya Nagar, ADB Road, Surampalem 533437, Kakinada District, Andhra Pradesh, India
Email: lettertoanil@gmail.com

1. INTRODUCTION
Emotions are truly mind-boggling elements that can change the entire meaning of a human conversation. Multiple types of emotions influence how we live and interact with other people. Sometimes, it appears that emotions are the ones in control of us. Our decisions, behaviors, and perceptions are all driven by the feelings people encounter in everyday life. Psychologists have made efforts to recognize the different kinds of sentiments people go through from the vast spectrum of human experience. Several distinct perspectives have emerged in an attempt to classify and represent the feelings that individuals possess. Paul Ekman proposed six basic emotions and stated that these are the most widely seen across all human cultures. These commonly recognized emotions include fear, surprise, disgust, happiness, sadness, and anger. These six emotions are considered the fundamental origin of many other emotions, except for those of neutrality and calmness.

Artificial intelligence (AI) is essentially the imitation of natural human intelligence in machines to perform tasks in a human-like manner. It involves designing a machine capable of thinking, handling everything from basic functions to complex processes, with varying levels of cognitive skill. Advancements in brain science have enabled the shift from artificial narrow intelligence (ANI) to artificial general intelligence (AGI), which allows machines to perform, think, and carry out tasks similarly to humans. Although AGI remains decades away, basic human evaluations aid in refining the techniques used for replicating the human mind. This specific domain within AI, focusing on the analysis of human communication, is loosely termed natural language understanding.

Human emotional analytics is derived from a specific component of natural language processing (NLP). A wide array of human emotions needs to be categorized to determine the correct polarity, feeling, or intent behind a statement. NLP emphasizes interpreting text in human language to generate insights that assist in simplifying business decision-making. However, the human emotional spectrum is significantly more intricate. It primarily relies on visual cues, tone of voice, or spoken words. With the growing capabilities of AI and machine learning (ML), there is promising potential to develop a machine capable of identifying a user’s emotions.

Is there any perfect and complete system for emotion detection? Answering this question is the objective of this research, which aims to build a highly efficient system that provides deep insights into human emotions, leading to logical conclusions based on all available factors and contextual reasoning. The necessary input data for this analysis will be sourced from audio-visual media, incorporating a diverse range of audio and visual samples.

2. LITERATURE SURVEY
Machine intelligence, which had been considered a daydream since the early 1900s, aimed to enable computers to comprehend natural data. With the rise of many imaginary stories and movies, it appeared to remain a daydream until the early 1950s. That was the period when the foundations of AI were established. Figure 1 illustrates the different approaches to sentiment analysis. Research persisted, experimenting with diverse methods such as supervised ML and unsupervised ML, among others, in efforts to determine the polarity of a face in an image or to detect traces of polarity within a piece of text.

Figure 1. Sentiment analysis and methods

The introduction of advanced concepts like the artificial neural network (ANN) expanded the scope of emotion detection, enabling machines to work together more effectively with human users. Through the application of feature extraction and deep learning (DL) techniques, machines achieved significantly improved outcomes in analyzing facial expressions and word order polarity. Despite substantial advancements, extracting human-like results from multi-dimensional data remains underexplored. The fusion of visual and auditory content presents the potential to uncover insights that are imperceptible through independent processing. Processing intricate and complex inputs using basic conventional neural networks can be laborious. In such scenarios, deeply-fused networks can play a pivotal role in evaluating the depth of the data within the deep network. While deriving emotions from equivalent expressions is manageable, human emotions are inherently diverse and often defy simplistic classification scales. Mixed emotions and emotional fluctuations are unique to humans, as the most emotionally complex species. Understanding deep and combined emotions based on their intensity adds another layer of difficulty, and classifying an entity accordingly becomes an even greater challenge. Over the years, sentiment analysis research has inspired researchers to develop a variety of systems to aid in analysis. Most of these systems are tailored to analyze a single type of content for sentiment classification within their specific domains.

Opinion mining involves classifying unstructured data and text into negative, positive, and neutral categories [1]. Significant advancements have been achieved in the fields of emotion and sentiment analysis through various ML techniques [2], [3]. Traditionally, sentiments are classified into two main categories, positive and negative [4], [5]. Numerous ML techniques have been developed for sentiment classification, including stochastic gradient descent (SGD), which enables learning from classifiers based on non-differentiable loss functions, as presented by Bifet and Frank [6]. Another well-known and effective algorithm is naïve Bayes, initially introduced by Thomas Bayes and later elaborated upon by Ciresan et al. [7]. Among supervised learning algorithms, support vector machine (SVM) is highly prominent [8]. While many tools and methods are available for sentiment analysis using ML, SVM consistently demonstrates superior accuracy and efficiency compared to other approaches, according to the comparative studies in [9]‒[11], which thoroughly explored text and audio-visual cues for multimodal emotional analysis.

As noted by Zhang et al. [12], emotion and sentiment analysis both pertain to an individual's internal state, and only two notable methodologies for multimodal emotional analysis exist, as proposed by [13], [14]. Prior research in multimodal emotional analysis generally falls into two categories: one focusing on feature extraction from individual modalities, and the other on techniques to fuse features derived from multiple modalities.

In 1970, the authors in [15], [16] conducted comprehensive studies on facial expressions, concluding that universal facial expressions help in identifying emotions. They recognized anger, surprise, disgust, fear, sadness, and joy as six basic emotional categories. These categories effectively represent most facially expressed emotions. A seventh category, contempt, was later introduced by Ciresan et al. [7]. Pak and Paroubek [17] developed the facial action coding system (FACS), which decodes facial language by breaking down expressions into a series of action units (AU).

Recent investigation on speech-based sentiment analysis has focused on recognizing audio features like fundamental frequency (pitch), bandwidth, speech intensity, and duration, as explored by Chen [18]. Speaker-dependent approaches often yield better outcomes compared to speaker-independent ones. This is evident in the impressive results of Navas et al. [19], who achieved around 98% accuracy using Gaussian mixture models (GMM) and incorporating prosodic, vocal quality, and mel frequency cepstral coefficients (MFCC) as speech characteristics. However, speaker-dependent methods are impractical for applications involving a large user base. Visual sentiment examination based on text descriptions is effectively described by Ortis et al. [20].

Text-based sentiment recognition is a rapidly growing field in NLP, drawing significant attention from both academic and industrial sectors. Traditionally, sentiment and emotion detection in text has relied on rule-based systems and bag-of-words models using expansive emotion or sentiment lexicons, as mentioned by Mishne [21]. Data-driven approaches leveraging large annotated datasets are also used, as described by Muthevi et al. [22] and Xia et al. [23].

According to Wei et al. [24], deep neural networks (DNN) have seen notable improvements in recent years, especially in optimization techniques, activation functions, regularization, pooling, and network design. The multi-column DNN introduced by Ahmad et al. [25] explored decision fusion, which was later expanded by Matsumoto [26] to include weighted averaging and adaptive methods based on input conditions. The current methodology takes a different route by deeply integrating features across multiple intermediate layers, concurrently learning the representations of the base networks. Wang et al. [24] proposed a novel DL method, deeply-fused nets, centered on deep fusion.

Data pre-processing techniques are comprehensively addressed by Ilyas and Chu [27], while Malley et al. [28] detail a variety of pre-processing methods. Modern sentiment analysis approaches using DL are described in [29], [30].

3. CONTRIBUTED WORK
The previous approaches in this domain involved utilizing a variety of ML algorithms and logical rule removals on single, specific datasets to extract results. However, they presented several limitations that could not be resolved due to a restricted perspective on data features. These limitations include tone and subjectivity, communication context, polarity inference, sarcasm and irony, limited class labels, and dependence on the dataset. To address the aforementioned limitations observed in earlier models, the current work aims to construct a new machine that overcomes certain shortcomings of prior methodologies. Consequently, we have adopted newly developed DL methods and neural network modules to capture essential data features that are often hidden or difficult to detect in conventional analysis.

3.1. Databases used
To build a machine capable of detecting various emotional aspects from audio-visual (AV) data, we are constrained by the availability of suitable datasets. Several major sources have contributed relevant data, including: i) SAVEE: this dataset contains both audio and video recordings from four male actors using phonetically balanced, generic British English sentences to represent various emotions across multiple repetitions; ii) RAVDESS: this dataset includes 24 participants (12 female and 12 male), who articulate lexically-matched phrases in a neutral North American accent; iii) TESS: comprising recordings from two female actors, one younger and one older, this dataset portrays a variety of emotions with neutral emotional intensity using generic statements; iv) YouTube: a global video-sharing platform offering thousands of videos from various categories, contributed by diverse users and organizations across the web; v) FER2013: this is a visual dataset featuring facial expressions of male and female actors, collected from films and other resources, depicting multiple emotional states; and vi) Google News Vectors: this resource is part of Google’s code project, containing a vast dictionary of English vocabulary and terms, intended for classifying textual content into precise groups. These datasets vary in content, covering audio, video, facial expressions, and textual vectors, and provide essential resources for detecting multiple emotional states.

3.2. Data pre-processing
The data obtained from these extensive datasets and manually gathered sources is initially unstructured and mixed in content. Therefore, to ensure usability, it is essential to organize and normalize this data. Datasets undergo pre-processing functions that categorize the data by emotional type and gender (e.g., male or female voice) and reformat it using specific identifiers. This facilitates diversity handling and classification efficiency. The preferred dimensional standards are kept lossless, minimizing information loss and maximizing feature extraction. All collections are transformed into structured formats to ensure emotional attributes are retained distinctly. However, the heterogeneity of the data still poses a challenge, necessitating standardized rule sets for smoother neural network training. In the same way, only consistent visual data that offer rich feature sets are included, while inconsistent or misleading data units are filtered out using several classifiers to ensure uniform correctness.

3.3. Proposed method
The primary aim of this study is to create a framework capable of integrating multiple sentiment analysis modalities into a unified outcome using DL neural networks. Multimodal input data is considered in this process, where each modality is individually analyzed and their outcomes are combined to yield a comprehensive sentiment conclusion. This approach enhances sentiment reliability and uncovers additional data characteristics. The convolutional neural network (CNN) performs on par with human experts across tasks, demonstrating the ability to detect sentiment polarity and classify emotions with a competence level comparable to human judgment. Neural network models differ significantly from traditional ML techniques such as SVM, naïve Bayes, and linear regression, offering improvements in areas where earlier models faltered.

3.3.1. Data discrimination
The audio and visual data is processed through three separate modules focused on tone, video, and text. Data segregation here refers to isolating each modality rather than combining multiple types. We extract audio tones, visual cues from videos, and text transcribed from audio, assigning corresponding tension-weight levels. This segmentation process is depicted in Figure 2.

Figure 2. Separated data distributed to individual neural nets

To ensure a structured methodology and adaptability for future research modifications, the algorithmic progression is outlined as follows:
‒ Step 1: the proposed system extracts multimodal features from videos, each depicting a unique scenario. These features include a sequence of video frames, an audio signal in WAV format, and text obtained through audio transcription.
‒ Step 2: facial expression and emotion recognition techniques are applied to the video frames to extract visual emotional features.
‒ Step 3: the audio signal is analyzed to obtain relevant acoustic features, identifying speech instances and extracting components like voicing probability, tonality, and the main frequency of speech variations.
‒ Step 4: the audio data is further processed to extract only the transcribed textual content.

The segregated outputs from all three modules are then forwarded to pre-processing units. The audio, visual, and textual modalities (i.e., multiple media sources) are each handled by dedicated pre-processing units tailored to their specific type. These units perform individual operations such as data cleaning and conversion. They are subsequently linked to separate neural networks.
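
The routing described in the steps above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the `Sample` container, the three `preprocess_*` functions, and their cleaning rules are hypothetical names introduced here to show each modality passing through its own pre-processing unit before it reaches a separate network.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Sample:
    """Hypothetical container for one video clip split into three modalities."""
    frames: List[List[float]]   # sequence of video frames (flattened pixel rows)
    audio: List[float]          # mono audio samples decoded from the WAV signal
    transcript: str             # text obtained through audio transcription


def preprocess_visual(frames: List[List[float]]) -> List[List[float]]:
    # Assumed cleaning rule: drop empty frames and scale 8-bit pixels to [0, 1].
    return [[p / 255.0 for p in frame] for frame in frames if frame]


def preprocess_audio(audio: List[float]) -> List[float]:
    # Assumed cleaning rule: peak-normalise the waveform; silence is left untouched.
    peak = max((abs(x) for x in audio), default=0.0)
    return [x / peak for x in audio] if peak > 0 else list(audio)


def preprocess_text(transcript: str) -> List[str]:
    # Assumed cleaning rule: lower-case the transcript and split on whitespace.
    return transcript.lower().split()


def segregate(sample: Sample) -> Dict[str, object]:
    """Send each modality to its dedicated unit; nothing is fused at this stage."""
    return {
        "visual": preprocess_visual(sample.frames),
        "audio": preprocess_audio(sample.audio),
        "text": preprocess_text(sample.transcript),
    }
```

Keeping the three units independent means a modality can be re-cleaned or replaced without touching the other two, which matches the stated goal of adaptability for future modifications.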

3.3.2. Visual processing
We employ the Haar cascade classifier from OpenCV’s computer vision modules to detect faces in the visual data. To reduce the dimensional complexity associated with RGB color storage, the extracted frames from input videos are converted to greyscale. Detecting edges and boundaries in colored visuals is notably more complex; hence, greyscale conversion retains intensity levels and enhances classifier performance, while ensuring no bias is introduced, irrespective of the subject’s race. After the transformation, facial key points are detected and the visual is segmented to isolate features specific to the intended individual. The face coordinates obtained are mapped to produce a segmented image, minimizing interference from external elements that could introduce unintended noise during analysis. Upon completion of feature extraction, the segmented visual is transformed into a multidimensional array that preserves pixel-level data, which is subsequently fed into the neural network.

A sequential model is employed, with characteristic values and forms configured to initialize it for visual data processing. The model comprises multiple layers involving pooling and dropout iterations to retain optimal features and eliminate weak connections. Activation functions used in this phase include rectified linear unit (ReLU) and SoftMax. These were selected based on their effectiveness for our specific task. The ReLU is a piecewise linear activation function that returns the input itself when it is positive, and zero when it is not. The resulting vector is an intermediate representation stored for further processing by deeper network layers.
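
The two activation functions named above can be written out in a few lines. The ReLU follows the definition given in the text; the SoftMax uses its standard textbook definition, which the text does not spell out, and the example scores are hypothetical values chosen only for illustration.

```python
import math
from typing import List


def relu(x: float) -> float:
    """Return x when it is positive and zero otherwise."""
    return x if x > 0 else 0.0


def softmax(scores: List[float]) -> List[float]:
    """Map raw class scores to probabilities that sum to one."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


hidden = [relu(v) for v in [-1.5, 0.0, 2.0]]   # non-positive inputs are clipped to zero
probs = softmax([2.0, 1.0, 0.1])               # hypothetical scores for three emotion classes
```

In a classifier of this kind, ReLU is typically used in the hidden layers and SoftMax in the output layer, so that the final vector can be read as one probability per emotion class.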

3.3.3. Tonal analysis
Traditional emotion recognition techniques employ NLP to analyze the semantics of words and phrases, then assess sentiment accordingly. However, language is inherently complex, and such conventional analysis often overlooks nuances like regional dialects, tone, pitch, and volume. Hence, we propose a system that not only analyzes the content of speech but also its delivery. Audio features are derived from each segmented portion of the videos using a 48 kHz sampling rate and a 100 ms sliding window, allowing for the capture of fine-grained details. To normalize the audio data, Z-standardization is applied, enhancing the visibility of diverse acoustic features for further analysis.
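
As a worked illustration of these settings, a 100 ms window at 48 kHz covers 48,000 × 0.1 = 4,800 samples. The sketch below computes that figure, cuts a signal into windows, and applies Z-standardization. The hop size is not stated in the text, so `hop` is an assumed parameter, and the choice of the population standard deviation is likewise an assumption.

```python
from statistics import mean, pstdev
from typing import List

SAMPLE_RATE_HZ = 48_000   # sampling rate stated in the text
WINDOW_MS = 100           # sliding-window length stated in the text

# 48 000 samples/s * 0.1 s = 4 800 samples per window
WINDOW_SAMPLES = SAMPLE_RATE_HZ * WINDOW_MS // 1000


def sliding_windows(signal: List[float], window: int, hop: int) -> List[List[float]]:
    """Cut the signal into full-length windows; `hop` is an assumed parameter."""
    return [signal[i:i + window] for i in range(0, len(signal) - window + 1, hop)]


def z_standardize(values: List[float]) -> List[float]:
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]   # constant input carries no variation
    return [(v - mu) / sigma for v in values]
```

For example, `sliding_windows(samples, WINDOW_SAMPLES, WINDOW_SAMPLES // 2)` would yield half-overlapping 100 ms windows, each of which could then be standardized before feature extraction.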

The speech waveform is then transformed into a parametric symbolic representation, which reduces the data rate and facilitates efficient downstream processing. The effectiveness of classification relies heavily on the distinctiveness and quality of these extracted features. For this purpose, we utilize MFCC. The mel frequency for a given input frequency is calculated using (1), and the MFCCs are computed using (2).

$$\mathrm{Mel}(f) = 2595 \times \log_{10}\left(1 + \frac{f}{700}\right) \tag{1}$$

Here, Mel(f) is the frequency in mels and f is the frequency in Hz.

$$\hat{C}_n = \sum_{k=1}^{K} \left(\log \hat{S}_k\right) \times \cos\left[n\left(k - \frac{1}{2}\right)\frac{\pi}{K}\right] \tag{2}$$

where K is the number of mel cepstrum coefficients, $\hat{C}_n$ is the final MFCC coefficient, and $\hat{S}_k$ is the output of the filter bank.
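
Equations (1) and (2) translate directly into code. The sketch below is a plain-Python illustration rather than the pipeline used in the study. It treats K as the number of filter-bank outputs in the sum and exposes the number of coefficients to keep as a separate, assumed parameter `n_coeffs`. Computing the filter-bank outputs themselves is out of scope here.

```python
import math
from typing import List


def hz_to_mel(f_hz: float) -> float:
    """Equation (1): convert a frequency in Hz to mels."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mfcc_from_filterbank(energies: List[float], n_coeffs: int) -> List[float]:
    """Equation (2): cosine transform of the log filter-bank outputs.

    `energies` holds the K positive filter-bank outputs S_k; the result holds
    the coefficients C_1 ... C_n_coeffs.
    """
    K = len(energies)
    log_e = [math.log(s) for s in energies]
    return [
        sum(log_e[k - 1] * math.cos(n * (k - 0.5) * math.pi / K) for k in range(1, K + 1))
        for n in range(1, n_coeffs + 1)
    ]
```

A useful sanity check is that a flat filter-bank output (all energies equal) gives coefficients of zero, because the cosine terms cancel over k. The coefficients therefore describe the shape of the spectrum, not its overall level.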

The MFCC data is compressed into 13 coefficients, representing the frequency spectrum from 20 Hz to 22 kHz. These coefficients correspond to specific frequency regions, with their intensities visualized through varying color depths at mapped coordinate points. The LibROSA library is used to convert stereo audio into mono while maintaining the original sampling rate, ensuring that essential audio characteristics are preserved. The extracted MFCC features are then structured into an n-dimensional array and organized into a data frame. After completing MFCC segmentation and feature extraction, the dataset is prepared for input into neural network models.

Another sequential model is used for tonal information processing. Model values and shapes are appropriately configured. This network undergoes numerous iterations of pooling, convolution, dropout, and flattening to enhance feature recognition in hidden layers and remove inconsistent data links. Activation functions used again include ReLU and SoftMax, applied across multiple layers to evaluate each node connection. A small learning rate is used as a hyperparameter to ensure the neural network learns gradually, improving its ability to detect tone-dominant features. Optimizers used include root mean square propagation (RMSprop) and Adam, selected through one-vs-one comparison to determine the optimal choice per scenario. The output is an intermediate data vector stored for additional processing.

Once the speech segments are identified, the extracted audio is passed through a speech-to-text module to recover the spoken content. To construct a reliable textual analysis model, various grammatical and syntactic rules are applied, including the subject noun rule, direct insignificant objects, negation, modifiers (adjectival, adverbial, participial), prepositional phrases, and noun compound modifiers. These rules ensure that the resulting model maintains the integrity of the textual information while delivering consistent and meaningful predictions. The model outputs a vector, which forms another intermediate result ready for integration in the final fused neural network.
3.
3.
4
.
S
yn
t
h
e
s
is
of
t
r
i
-
m
od
a
l
an
alys
is
This module focuses on feature-level fusion, combining information from textual, audio, and visual modalities. Multimodal fusion serves as a core element in any effective emotion detection system, significantly contributing to the improvement of agent–user interaction quality. A primary challenge in this domain lies in devising an effective strategy for integrating cognitive and functional information from diverse sources, each characterized by unique temporal scales and data dimensions. As illustrated in Figure 3, two main fusion techniques are utilized: i) feature-level combination: this approach merges attributes from each modality into a unified joint vector before any classification step is undertaken; and ii) decision-level combination: each modality is modeled and categorized separately. The individual results are then combined using established methods, such as expert rules or simple mathematical operations comprising summation, product, majority voting, and statistical weighting. Multi-model analysis fusion can be observed in Figure 4.
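The two fusion techniques described above can be sketched in a few lines of Python. The function names, the example scores, and the modality weights are hypothetical; the sketch only shows the mechanics of concatenation (feature level) and of summation, product, majority voting, and weighting (decision level).

```python
from collections import Counter

def feature_level_fusion(text_vec, audio_vec, visual_vec):
    # Feature-level: concatenate per-modality features into one joint vector
    # before any classification takes place.
    return list(text_vec) + list(audio_vec) + list(visual_vec)

def decision_level_fusion(scores_by_modality, method="sum", weights=None):
    # Decision-level: each modality has already been classified separately;
    # combine the per-label scores afterwards.
    if method == "vote":
        votes = Counter(max(s, key=s.get) for s in scores_by_modality.values())
        return votes.most_common(1)[0][0]
    labels = list(next(iter(scores_by_modality.values())))
    fused = {}
    for label in labels:
        if method == "sum":
            fused[label] = sum(s[label] for s in scores_by_modality.values())
        elif method == "product":
            value = 1.0
            for s in scores_by_modality.values():
                value *= s[label]
            fused[label] = value
        elif method == "weighted":
            fused[label] = sum(weights[m] * s[label]
                               for m, s in scores_by_modality.items())
        else:
            raise ValueError(f"unknown fusion method: {method}")
    return max(fused, key=fused.get)

# Hypothetical per-modality class scores for a single segment.
scores = {
    "text":   {"joy": 0.6, "sadness": 0.4},
    "audio":  {"joy": 0.3, "sadness": 0.7},
    "visual": {"joy": 0.7, "sadness": 0.3},
}
joint = feature_level_fusion([1, 2], [3], [4, 5])
by_sum = decision_level_fusion(scores, "sum")
by_vote = decision_level_fusion(scores, "vote")
by_weight = decision_level_fusion(
    scores, "weighted", weights={"text": 0.2, "audio": 0.7, "visual": 0.1})
```

With these example scores, summation and voting favor "joy", while weighting the audio modality heavily shifts the decision to "sadness", which shows how the choice of combination rule can change the outcome.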
Figure 3. Fusion methods and types
Figure 4. Fusion of multimodal analysis
In the current study, feature-level fusion was applied by combining the feature vectors from all modalities to construct a unified, extended feature vector. Additionally, decision-level fusion was applied across various emotional intents to examine how different analytical approaches perform when processed through the subsequent classifier module to obtain emotional descriptions.
Feature vectors from each modality were also merged into a single feature stream in this study. The assumption that a simple fusion generating a resultant vector limited to a basic emotion set would increase machine assurance is a commonly drawn but mistaken conclusion from such analysis. However, the tri-modality approach introduces a broader perspective, enabling the detection of a wide range of emotional intensities expressed by human subjects. Humans, as inherently complex emotional beings capable of experiencing and expressing mixed emotions, offer a valuable foundation for utilizing this emotional model to explore deeper emotional states. The model’s depth depends on the intermediate support and confidence levels derived from the fusion process. Referencing Plutchik’s wheel of emotions, as depicted in Figure 5, we can identify diverse emotional spectrums and their influence, facilitating a more detailed emotional recognition process in humans, not merely capturing the emotional sequence but also interpreting the intent behind communicative expressions. Plutchik’s model serves as a framework for viewing emotional literacy through a more expansive lens.
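As a minimal sketch of how mixed emotions could be read off Plutchik’s wheel, the snippet below maps the two strongest basic emotions in a fused score vector to their primary dyad. The dyad names follow Plutchik’s published wheel; the threshold, the score format, and the function name are assumptions for illustration and do not describe the classifier used in this study.

```python
# Primary dyads: blends of adjacent basic emotions on Plutchik's wheel.
PRIMARY_DYADS = {
    frozenset({"joy", "trust"}): "love",
    frozenset({"trust", "fear"}): "submission",
    frozenset({"fear", "surprise"}): "awe",
    frozenset({"surprise", "sadness"}): "disapproval",
    frozenset({"sadness", "disgust"}): "remorse",
    frozenset({"disgust", "anger"}): "contempt",
    frozenset({"anger", "anticipation"}): "aggressiveness",
    frozenset({"anticipation", "joy"}): "optimism",
}

def mixed_emotion(scores, threshold=0.3):
    """Name the blend of the two strongest emotions, if both are strong enough.

    `scores` maps basic emotion names to confidence values; `threshold` is a
    hypothetical cut-off for treating the second emotion as present.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    top, second = ranked[0], ranked[1]
    if scores[second] >= threshold:
        # Fall back to a plain "a+b" label for pairs that are not adjacent.
        return PRIMARY_DYADS.get(frozenset({top, second}), f"{top}+{second}")
    return top

blend = mixed_emotion({"joy": 0.5, "trust": 0.4, "fear": 0.1})
single = mixed_emotion({"joy": 0.8, "trust": 0.1, "fear": 0.1})
```

A fuller treatment would also cover secondary and tertiary dyads and the intensity levels of each basic emotion.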
Figure 5. Plutchik’s wheel of emotions
Utilizing the same tri-model outcomes, both behavioral irony and communicative irony can be identified to a significant degree. Thus, enhancing emotional literacy involves more than expanding vocabulary for emotions; it encompasses understanding the interrelationships among emotions and recognizing how they evolve over time. Leveraging the tri-model results, this work also demonstrates the potential to detect behavioral and communicative irony with greater precision.
4. RESULTS AND DISCUSSIONS
A Tkinter GUI application has been used to organize the implemented prototype and assess the performance of the deeply-fused neural network under different conditions. The introductory screen of the application allows users to easily locate and load the data intended for analysis, and provides a report button to initiate the evaluation process. A screenshot of the report window is shown in Figure 6.
Figure 6. Performance of the deeply-fused neural network detailed report
The machine demonstrated commendable performance when tested with a generic simulated dataset entity. However, this does not imply that real-time outcomes would necessarily yield the same level of accuracy. Therefore, to thoroughly evaluate the system, it was essential to collect complex data capable of producing critical analytical results. As a result, we selected films and TV shows, where actors portray challenging emotional expressions on screen. In one sample, the machine successfully detected the anxiety experienced by a girl who was simultaneously exhibiting signs of fear and sadness while reflecting on life without a lost loved one. It also accurately identified the individual as female, and determined that the word order used in her speech indicated no specific polarity dominance. Figure 7 presents samples of the enhanced visual and audio segmented data employed by the system.
Figure 7. Samples of the enhanced visual and audio segmented data
This demonstrates how the machine was capable of handling challenging data and providing deeper insights into the overall emotional state of the individual. Advanced elements such as emotional wellness, behavioral irony, and wordplay detection were fine-tuned to extract uncommon and often unnoticed features. In the detailed report, every conclusion derived from the emotions of different segregated inputs was clearly documented. The report provides a precise evaluation of a person’s sentiment, successfully delivering a confidence factor for the resulting interpretation. It validates emotional feeling, polarity, tonal information, intent of emotion, and irony simultaneously. Table 1 shows the types of emotional outputs.
Table 1. Types of emotional outputs
Entry | Description
Tri-modal result | Presents the conclusion derived from analyzing all data modalities, including visual, audio, and spoken speech.
Facial dominance | Displays the predominant emotion expressed on the individual’s face in the visual.
Facial recessive | Displays a list of emotions expressed by the individual, excluding the most apparent ones, if any.
Tonal conclusion | Provides the emotional inference derived from the speaker’s voice and tone.
Speech polarity | Provides the polarity-based conclusion of the speech sentence spoken by the speaker.
Vocal range | Indicates whether the speaker’s voice resembles that of a male or female.
Irony percentage | Indicates the percentage of ironic content detected by the system throughout the analysis.
Detailed report | Provides a comprehensive summary of the analysis and the various factors derived from it.
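The entries of Table 1 could be carried through the application as one record per analyzed clip. The sketch below is an assumption about how such a record might look; the class name, field names, and example values are hypothetical placeholders and are not taken from the prototype or its results.

```python
from dataclasses import dataclass, field

@dataclass
class EmotionReport:
    # One field per entry of Table 1; the detailed report is built by summary().
    tri_modal_result: str
    facial_dominance: str
    tonal_conclusion: str
    speech_polarity: str
    vocal_range: str
    irony_percentage: float
    facial_recessive: list = field(default_factory=list)

    def summary(self):
        # Compact, human-readable version of the detailed report.
        recessive = ", ".join(self.facial_recessive) or "none"
        return (f"{self.tri_modal_result} (face: {self.facial_dominance}; "
                f"recessive: {recessive}; tone: {self.tonal_conclusion}; "
                f"polarity: {self.speech_polarity}; voice: {self.vocal_range}; "
                f"irony: {self.irony_percentage:.0f}%)")

# Placeholder values for illustration only.
report = EmotionReport(
    tri_modal_result="anxiety",
    facial_dominance="fear",
    tonal_conclusion="sadness",
    speech_polarity="neutral",
    vocal_range="female",
    irony_percentage=0.0,
    facial_recessive=["sadness"],
)
text = report.summary()
```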
As a result, the model moves beyond traditional emotion classification, embracing emotion detection to more effectively interpret the rich and overlapping emotional patterns typically found in human behavior. Furthermore, the model is capable of assessing whether an individual in the input data is demonstrating emotional balance. Emotional balance is a vital aspect of mental health, and individuals with frequent fluctuations can be accurately detected by the machine.
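One simple way to quantify "frequent fluctuations" is to count how often the detected emotion changes between consecutive segments. The measure and the threshold below are assumptions made for illustration; the study does not specify how emotional balance is computed.

```python
def fluctuation_rate(segment_labels):
    """Fraction of consecutive segment pairs whose detected emotion differs."""
    if len(segment_labels) < 2:
        return 0.0
    changes = sum(1 for a, b in zip(segment_labels, segment_labels[1:]) if a != b)
    return changes / (len(segment_labels) - 1)

def is_balanced(segment_labels, max_rate=0.5):
    # `max_rate` is a hypothetical cut-off, not a clinically validated value.
    return fluctuation_rate(segment_labels) <= max_rate

steady = ["joy", "joy", "sadness", "sadness"]
volatile = ["joy", "fear", "joy", "anger"]
steady_rate = fluctuation_rate(steady)      # one change across three transitions
volatile_rate = fluctuation_rate(volatile)  # a change at every transition
```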
5. CONCLUSION
The proposed system addresses several limitations encountered by prior singular models. One suggested improvement for achieving more precise results is to segment the media clips at natural pauses, separators, or sequence endings approximately every 6 seconds and then process each segment in a sequential iteration, which enhances outcome accuracy. While the current model is still in its research phase, it is already capable of detecting a range of 40-50 different emotional states. The implementation of multimodal sentiment analysis to perform multidimensional emotion analysis could be revolutionary. However, this is not the upper limit of its potential.
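The suggested segmentation, cutting at a natural pause roughly every 6 seconds, can be sketched as follows. The function assumes that pause timestamps have already been detected by some other component; the function name, the example timestamps, and the fallback of cutting at the exact target mark when no pause remains are all hypothetical.

```python
def segment_at_pauses(pause_times, clip_length, target=6.0):
    """Split a clip into (start, end) segments of roughly `target` seconds.

    Each cut is placed at the detected pause closest to the next target mark;
    if no pause remains, the cut falls on the target mark itself.
    """
    boundaries = []
    start = 0.0
    while clip_length - start > target:
        goal = start + target
        later = [p for p in pause_times if start < p < clip_length]
        cut = min(later, key=lambda p: abs(p - goal)) if later else goal
        boundaries.append(cut)
        start = cut
    segments = []
    previous = 0.0
    for boundary in boundaries + [clip_length]:
        segments.append((previous, boundary))
        previous = boundary
    return segments

# Hypothetical pause timestamps (seconds) in a 20-second clip.
segments = segment_at_pauses([5.2, 6.5, 11.8, 13.0, 19.0], clip_length=20.0)
for start, end in segments:
    pass  # each segment would be passed to the tri-modal analysis in order
```

Each resulting segment would then be analyzed in sequence, so that the per-segment outputs can be compared over time.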
There are numerous applications for emotional state identification in humans. In today’s digital era, where communication and reviews are shifting from purely textual content to rich media, the proposed model can be used to analyze media content, influencing decisions in business, healthcare, and other sectors, driving progress at an entirely new level. Additionally, it can be employed for fraud detection, where individuals attempt to impersonate others, as even minor discrepancies in their emotional states can be detected by the enhanced system. It may also be integrated into the future of AI, contributing to the creation of intelligent personal assistants like the conceptual “Jarvis” tailored to individual users, recognizing their emotional states and interacting in a more human-like manner by understanding the subtleties of conversation.
At present, the system’s functionality is restricted to the English language, since speech extraction and emotion interpretation are primarily executed in English. Furthermore, irony and wordplay differ significantly across languages and cultures. One of the future goals is to expand the system to support multilingual and multicultural detection. Currently, tonal data is based mainly on British and North American accents. Expanding the dataset to include other accents such as American, Indian, Australian, and German could significantly enhance the model’s adaptability and performance across diverse global populations.
ACKNOWLEDGMENTS
The authors thank the Management for their support and resources throughout this research. Special appreciation is extended to the faculty and staff for their guidance and encouragement.
FUNDING INFORMATION
This research received no specific grant from any funding agency, commercial, or not-for-profit sectors.
AUTHOR CONTRIBUTIONS STATEMENT
This journal uses the Contributor Roles Taxonomy (CRediT) to recognize individual author contributions, reduce authorship disputes, and facilitate collaboration.
Name of Author C M So Va Fo I R D O E Vi Su P Fu
Anil Kumar Muthevi ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
Maganti Venkatesh ✓ ✓ ✓ ✓ ✓ ✓ ✓
Pallavi Gaurav Adke ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
Rajashree Gadhave ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
G. L. Narasamba Vanguri ✓ ✓ ✓ ✓ ✓
Thiruveedula Srinivasulu ✓ ✓ ✓ ✓ ✓
C: Conceptualization
M: Methodology
So: Software
Va: Validation
Fo: Formal analysis
I: Investigation
R: Resources
D: Data Curation
O: Writing - Original Draft
E: Writing - Review & Editing
Vi: Visualization
Su: Supervision
P: Project administration
Fu: Funding acquisition
CONFLICT OF INTEREST STATEMENT
Authors state no conflict of interest.
INFORMED CONSENT
We have obtained informed consent from all individuals included in this study.
ETHICAL APPROVAL
The research related to human use has complied with all relevant national regulations and institutional policies in accordance with the tenets of the Helsinki Declaration and has been approved by the authors' institutional review board or equivalent committee.
DATA AVAILABILITY
The data that support the findings of this study are available from the corresponding author, [MAK], upon reasonable request.