
IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 27, NO. 4, APRIL 2019

Improving Conditional Sequence Generative Adversarial Networks by Stepwise Evaluation

Yi-Lin Tuan and Hung-Yi Lee

Abstract—Sequence generative adversarial networks (SeqGAN) have been used to improve conditional sequence generation tasks, for example, chit-chat dialogue generation. To stabilize the training of SeqGAN, Monte Carlo tree search (MCTS) or reward at every generation step (REGS) is used to evaluate the goodness of a generated subsequence. MCTS is computationally intensive, but the performance of REGS is worse than that of MCTS. In this paper, we propose stepwise GAN (StepGAN), in which the discriminator is modified to automatically assign scores quantifying the goodness of each subsequence at every generation step. StepGAN has significantly lower computational cost than MCTS. We demonstrate that StepGAN outperforms previous GAN-based methods on both a synthetic experiment and chit-chat dialogue generation.

Index Terms—Generative adversarial network, sequence generation.

Manuscript received August 15, 2018; revised December 21, 2018; accepted January 20, 2019. Date of publication January 30, 2019; date of current version February 19, 2019. This work was supported by the Ministry of Science and Technology of Taiwan under Contract 108-2636-E-002-001. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Tatsuya Kawahara. (Corresponding author: Hung-Yi Lee.)
Y.-L. Tuan is with the Speech Processing and Machine Learning Laboratory, National Taiwan University, Taipei City 10617, Taiwan (e-mail: b02901048@ntu.edu.tw).
H.-Y. Lee is with the Department of Electrical Engineering and the Department of Computer Science and Information Engineering, National Taiwan University, Taipei City 10617, Taiwan (e-mail: hungyilee@ntu.edu.tw).
Digital Object Identifier 10.1109/TASLP.2019.2896437

I. INTRODUCTION

CONDITIONAL sequence generation refers to tasks in which a response is generated according to an input. Such tasks include machine translation, summarization, question answering, and dialogue generation. In these applications, the input messages and output responses usually have a one-to-many property. For example, in dialogue generation, given a message ("How was your day?"), there are many acceptable responses ("It is good.", "Very bad."). This property, which is most pronounced in dialogue generation, forces a conditional model to learn a distribution instead of a specific answer, which can prevent the model from generating high-quality answers.

Generally, conditional sequence generation is learned with a sequence-to-sequence model (seq2seq) trained by minimizing the cross-entropy loss [1], a method often called maximum likelihood estimation (MLE) [1], [2]. MLE achieves acceptable results in terms of coherence, but it leads to three problems. (1) Exposure bias [3]: targets and generated responses are respectively taken as the decoder inputs during the training and inference stages, a mismatch that causes accumulated errors at inference time. (2) One-directional KL divergence [4]: MLE only minimizes the forward KL divergence and ignores the backward one, leaving the predicted distribution unbounded where the data distribution is near zero. (3) General responses: previous work [5] empirically found that MLE tends to produce general responses (e.g., "i don't know."), leading to dialogues with little information. GANs have the potential to solve these three issues.

To tackle the problems of MLE, SeqGAN was recently proposed for chit-chat dialogue generation [2], [6]. By using a discriminator to evaluate the difference between the model responses and the ground truths, the problem of one-directional KL divergence is addressed. Additionally, exposure bias is eliminated by using policy gradient [3] for optimization. Empirically, SeqGAN can generate more creative sentences [2], [6]; however, its optimization remains inefficient.

SeqGAN uses policy gradient to sidestep the intractable backpropagation through discrete outputs, but a common problem in reinforcement learning appears: sparse reward, that is, a non-zero reward is only observed at the last time step. The primary disadvantage of sparse reward is that it makes training sample-inefficient. The inefficiency slows down training because the generator has very little successful experience. Sparse reward causes another problem for chit-chat chatbots. In chatting, an incorrect response and a correct one can share the same prefix. For example, "I'm John." and "I'm sorry." share the prefix "I'm". But for the input "What 's your name?", the first response is reasonable and the second one is weak. The same prefix then receives opposite feedback. This phenomenon happens continuously during training, making the training signals highly variant. The training is therefore unstable [2], [3], [6], [7].

To deal with sparse reward, the original SeqGAN is trained with a stepwise evaluation method, Monte Carlo tree search (MCTS) [6]. MCTS stabilizes the training, but it is computationally intractable for large datasets. To meet the need for large datasets for chatbots, reward at every generation step (REGS) was proposed to replace MCTS, but with worse performance [2]. According to previous attempts [2], [6], [8], [9], we know that stepwise evaluation can remarkably affect the results, but it has not been thoroughly explored.

We propose an alternative stepwise evaluation method to replace MCTS: StepGAN. The motivation is to use a discriminator, in a theoretically grounded way, to estimate immediate rewards without computing a tree search. StepGAN requires only a small modification of the discriminator, but has significantly lower computational cost than MCTS. To evaluate the effects of StepGAN, we first quantify the ability of GANs in learning less general responses, different directions of KL divergence, and coverage of many acceptable answers.


Then we check which stepwise evaluation method better realizes the advantages of GANs by comparing different stepwise evaluation methods on a synthetic task and a real-world scenario, chit-chat dialogue generation. On the synthetic task, we show that StepGAN outperforms previous work by learning a more balanced KL divergence. In dialogue generation, we show that StepGAN produces fewer general responses with higher sentence quality. Although GANs may not benefit a model in all aspects, we empirically demonstrate that StepGAN utilizes the advantages of adversarial learning in conditional sequence generation the most. Moreover, although we only apply StepGAN to dialogue generation in this paper, the proposed approach can be applied to any task that can be formulated as conditional sequence generation, for example, machine translation, summarization, and video caption generation.

The contributions of this paper can be summarized in three points:
• The experiments demonstrate that GANs can improve some aspects of conditional sequence generation, but may worsen others, such as coherence in dialogue generation.
• The proposed StepGAN realizes the advantages of GANs the most, but still trades off against their drawbacks.
• The proposed StepGAN is significantly faster than MCTS and shows comparable performance.

II. RELATED WORKS

Neural conditional sequence generation was first successfully trained using the seq2seq model [1], [10]. The model is initially learned by minimizing the cross-entropy between the true data distribution and the model distribution. This method, often called MLE, suffers from exposure bias, so beam search, scheduled sampling [11], and REINFORCE [3], [12] have been proposed to alleviate the problem. On the other hand, MLE also leads to general responses, so mutual information [5] and adversarial learning [2] have been used to produce more creative sentences.

To optimize a task-specific score (e.g., BLEU [13]), REINFORCE [3], a reinforcement learning [14] based algorithm, was first presented to guide seq2seq models. Further, MIXER [3] integrates MLE and REINFORCE to reduce the exploration space. Afterwards, other reinforcement learning methods were also proposed to improve the performance, including actor-critic [7], off-policy learning [15], and deep reinforcement learning for multi-turn dialogue [16].

Recently, the BLEU score was shown to be weakly correlated with human judgment [17]. Therefore GANs, which automatically assign scores to sentences, have been applied to sequence modeling. Related work includes SeqGAN [6], MaliGAN [8], REGS [2], WGAN-GP [18], [19], TextGAN [20], RankGAN [21], and MaskGAN [9]. To conquer the intractable backpropagation through discrete sequences, policy gradient [6], Gumbel-softmax [22], soft-argmax [23], and the Wasserstein distance [24] have been applied. Among them, policy gradient is the most widely used. An early version of using policy gradient for sequence modeling is REINFORCE. As researchers have found that REINFORCE causes high variance during training, MCTS [2], [6], a stepwise evaluation method, is used to stabilize the training; however, it suffers from high computational costs. To reduce the computation, several alternatives have been proposed to evaluate every subsequence [2], [9], but their performance is weaker than that of MCTS [2].

GANs have been applied to several tasks that can be formulated as conditional sequence generation. For dialogue generation, clear performance improvements on multiple metrics have been observed with SeqGAN using MCTS and REGS [2]. GANs improve machine translation models with different network architectures, including RNNSearch [25], [26] and the Transformer [26]. For abstractive summarization, GANs are able to generate more abstractive, readable, and diverse summaries than conventional approaches [27]. Image captioning models trained with GANs produce captions that are diverse and match the statistics of human-generated captions significantly better than the baseline models [28]–[30].

We observe that there is still no research comparing stepwise evaluation methods, even though many previous works [2], [6], [8], [9] have noted their importance. Therefore, in this paper we study the effectiveness of stepwise evaluation methods on conditional sequence generation, and propose StepGAN, which preserves the property of MCTS but with computational efficiency. This is the first paper comparing stepwise evaluation methods, and StepGAN is proposed to replace them as a better training method for sequence GANs.

III. CONDITIONAL SEQUENCE GENERATION

When a model aims to produce an output sequence y given an input x, it is called a conditional sequence generative model. For example, in chit-chat dialogue generation, both x and y are word sequences defined as x = {x_1, x_2, ..., x_t, ..., x_N} and y = {y_1, y_2, ..., y_t, ..., y_M}, where x_t and y_t are words in a vocabulary, and N and M are respectively the lengths of the input and output sequences.

A. Maximum Likelihood Estimation

The conditional sequence generator G used in this paper is a recurrent neural network (RNN) based seq2seq model, consisting of an encoder and a decoder with embedding layers. The encoder reads in an input sentence one token x_t at a time. The decoder aims to generate an output sentence ŷ as close as possible to the target y*.

During inference, after the encoder reads the whole x, the decoder generates the first word distribution P_G(y_1|x). The first word ŷ_1 is obtained by sampling or argmax from P_G(y_1|x). The decoder then takes ŷ_1 as the next input token and generates the next word distribution P_G(y_2|x, ŷ_1). When the token representing the end of the sentence is predicted, the generation process stops and the output sequence ŷ is completely generated. Because sampling is used in the generation process, the generator can be considered as a distribution of y given x: P_G(y|x) = \prod_{t=1}^{M} P_G(y_t | x, y_{1:t-1}), where y_{1:t-1} represents the first (t - 1) tokens of sequence y, i.e., y_{1:t-1} = {y_1, y_2, ..., y_{t-1}}.

Fig. 1. The illustration of StepGAN. The generator's encoder (red) reads in the sentence x, and then the generator's decoder (orange) produces the output sentence ŷ. The discriminator (blue and green) either reads in the generated sentence pair (x, ŷ) or the target sentence pair (x, y*), and then outputs estimated Q-values Q̂(s_t, ŷ_t) (or Q̂(s_t, y*_t)) at each time step t. The average of the estimated Q-values over all time steps t is taken as the final discriminator score D(x, ŷ) (or D(x, y*)).

For instance, if argmax is used in the decoder, we can consider P_G as a Kronecker delta function (or a very sharp distribution) whose probabilities are zero everywhere except for the chosen sample.

The model is trained to minimize the cross-entropy between ŷ and the ground-truth sequence y*. This method is often called maximum likelihood estimation (MLE) [1]. However, the inconsistency of decoder inputs between the training stage and the inference stage leads to exposure bias [3], whereby the generator accumulates errors at inference time.
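To make the autoregressive generation process above concrete, the following is a minimal sketch of decoding with a conditional generator. The function next_token_probs is an assumed stand-in for the seq2seq decoder, not the authors' released implementation; it only illustrates the factorization P_G(y|x) = \prod_t P_G(y_t | x, y_{1:t-1}) and the choice between sampling and argmax.

```python
# Minimal sketch of autoregressive decoding (Section III-A). The decoder is
# replaced by a placeholder distribution; real systems use an RNN seq2seq model.
import numpy as np

VOCAB = ["<eos>", "i", "'m", "fine", "john", "sorry"]
EOS = 0

def next_token_probs(x_tokens, prefix):
    # Stand-in for P_G(y_t | x, y_{1:t-1}): any function returning a
    # distribution over the vocabulary can be plugged in here.
    rng = np.random.default_rng(abs(hash((tuple(x_tokens), tuple(prefix)))) % 2**32)
    logits = rng.normal(size=len(VOCAB))
    probs = np.exp(logits) / np.exp(logits).sum()
    return probs

def generate(x_tokens, max_len=10, use_argmax=False):
    prefix, log_prob = [], 0.0
    for _ in range(max_len):
        probs = next_token_probs(x_tokens, prefix)      # P_G(y_t | x, y_{1:t-1})
        yt = int(np.argmax(probs)) if use_argmax else int(np.random.choice(len(VOCAB), p=probs))
        log_prob += float(np.log(probs[yt]))            # accumulates log P_G(y | x)
        prefix.append(yt)
        if yt == EOS:                                   # stop at the end-of-sentence token
            break
    return prefix, log_prob

sample, lp = generate(["how", "are", "you", "?"])
print([VOCAB[t] for t in sample], lp)
```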

B. Generative Adversarial Network (GAN)

A GAN for sequence learning, consisting of a generator G and a discriminator D, is often called a sequence GAN. Given an input sequence x, the discriminator learns to maximize the score D(x, y*) and minimize the score D(x, ŷ), while the generator learns to produce a response ŷ that maximizes D(x, ŷ):

\min_G \max_D \; \mathbb{E}_{(x, y^*) \sim P_R(x, y)}[\log D(x, y^*)] + \mathbb{E}_{x \sim P_R(x),\, \hat{y} \sim P_G(y|x)}[\log(1 - D(x, \hat{y}))],   (1)

where P_R(x) and P_R(x, y) are the probability distribution of x and the joint probability distribution of (x, y) in the training data.

To overcome the intractable gradients due to the discrete nature of language, previous work formulates sequence generation as a Markov decision process and trains the GAN using policy gradient [6]. The basic idea is to take the policy as the parameters of the generator P_G, the state at time step t as {x, ŷ_{1:t-1}}, and the action taken at time step t as ŷ_t. Specifically, the rewards at all but the last time step are set to zero, and the final-step reward is set to D(x, y). The intuition of training a sequence GAN with policy gradient is that the generator raises the probabilities of words with higher expected D(x, y) and lowers the others. The gradient of this method can be approximated as follows:

\nabla_G = D(x, \hat{y}) \sum_{t=1}^{M} \nabla \log P_G(\hat{y}_t | x, \hat{y}_{1:t-1})   (2)

In Equation (2), the gradients of the log probabilities are all weighted by the same weight D(x, ŷ).

Compared to MLE, a generator can minimize any loss defined by a discriminator instead of the one-directional KL divergence. Moreover, policy gradient uses the same ŷ_{1:t-1} in the inference and training stages, thus solving the problem of exposure bias.
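As a sanity check on Equation (2), the sketch below applies a single sequence-level reward D(x, ŷ) to every generation step. It is a simplified illustration under stated assumptions: the conditioning on x is omitted for brevity, and discriminator_score is a placeholder rather than the trained discriminator.

```python
# Sketch of the REINFORCE-style update of Equation (2): one sequence-level
# reward weights the gradient of every step's log-probability.
import torch

vocab_size, hidden, M = 6, 16, 5
decoder = torch.nn.GRU(input_size=vocab_size, hidden_size=hidden, batch_first=True)
out_proj = torch.nn.Linear(hidden, vocab_size)
opt = torch.optim.SGD(list(decoder.parameters()) + list(out_proj.parameters()), lr=1e-2)

def discriminator_score(y_hat):              # placeholder for D(x, y_hat) in [0, 1]
    return torch.sigmoid(torch.randn(()))

y_hat = torch.randint(vocab_size, (1, M))    # a sampled response y_hat_{1:M}
inputs = torch.nn.functional.one_hot(
    torch.cat([torch.zeros(1, 1, dtype=torch.long), y_hat[:, :-1]], dim=1),
    vocab_size).float()                      # decoder inputs: <bos>, y_1, ..., y_{M-1}
states, _ = decoder(inputs)
log_probs = torch.log_softmax(out_proj(states), dim=-1)
chosen = log_probs.gather(-1, y_hat.unsqueeze(-1)).squeeze(-1)   # log P_G(y_t | y_{1:t-1})
reward = discriminator_score(y_hat).detach()                     # same weight for every step t
loss = -(reward * chosen.sum())              # maximizing Eq. (2) == minimizing this loss
opt.zero_grad(); loss.backward(); opt.step()
```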
IV. STEPWISE EVALUATION APPROACH FOR SEQUENTIAL GANS

To improve sequence GANs by tackling both sparse reward and training instability, we propose a new stepwise evaluation method called StepGAN. Borrowing the idea of Q-learning, StepGAN automatically estimates state-action values with very light computational costs. The framework of StepGAN is depicted in Figure 1.

A. State-Action Value

In Figure 1, each Q(s_t, y_t) is a state-action value when taking action y_t at state s_t = {x, y_{1:t-1}}. This value, a general form of the Q-value, is often used in reinforcement learning to stabilize the training [14], [31]–[33]. The state-action value in a conditional sequence GAN is defined as:

Q(s_t, \hat{y}_t) = \mathbb{E}_{z \sim P_G(\cdot \,|\, x, \hat{y}_{1:t})}[D(x, \{\hat{y}_{1:t}, z\})],   (3)

where z is a sequence of words generated by the current generator given the input x and the generated prefix ŷ_{1:t}. Accordingly, the state-action value Q(s_t, ŷ_t) is the expected return over all the responses sharing the same prefix ŷ_{1:t}. The inputs of the Q-function are s_t = {x, y_{1:t-1}} and y_t, which are discrete tokens. Because the Q-values are the outputs of the discriminator, which has an embedding layer, we can consider that the tokens are transformed into embedded forms within the Q-function.

To stabilize the REINFORCE algorithm for GANs, the term D(x, ŷ) in Equation (2) is replaced with Q(s_t, ŷ_t), and the equation is modified as:

\nabla_G = \sum_{t=1}^{M} Q(s_t, \hat{y}_t) \nabla \log P_G(\hat{y}_t | x, \hat{y}_{1:t-1})   (4)

In Equation (4), each generation step is weighted by a step-dependent value Q(s_t, ŷ_t). However, computing Q(s_t, ŷ_t) by Equation (3) is intractable. We propose an efficient approach that estimates Q(s_t, ŷ_t) for the generator while training the discriminator in StepGAN.

B. Discriminator for StepGAN

In this paper, the discriminator is a seq2seq model that takes x as encoder inputs and y (either ŷ or y*) as decoder inputs. Although in Figure 1 the discriminator is only a plain seq2seq model, any state-of-the-art seq2seq model, for example, an attention-based model [34], a CNN-based model [35], or the Transformer [36], can be used here. As in regular GANs, the discriminator is trained to maximize D(x, y*) of ground-truth examples and minimize D(x, ŷ) of generated responses:

\max_D \; \mathbb{E}_{(x, y^*) \sim P_R(x, y)}[\log D(x, y^*)] + \mathbb{E}_{x \sim P_R(x),\, \hat{y} \sim P_G(y|x)}[\log(1 - D(x, \hat{y}))]   (5)

However, different from all previous GAN-based sequence generation approaches, the definition of D(x, y) in StepGAN is designed to estimate Q(s_t, y_t). Instead of generating a single scalar as the final discriminator score D(x, y) [6], we slightly modify the discriminator to automatically obtain the estimates of Q(s_t, y_t). As shown in the upper block of Figure 1, after reading in the input sentence x and the partial response sequence y_{1:t}, the discriminator generates a scalar Q̂(s_t, y_t). The discriminator score D(x, y) is the average of all the scalars Q̂(s_t, y_t) over the generated sequence length M:

D(x, y) = \frac{1}{M} \sum_{t=1}^{M} \hat{Q}(s_t, y_t).   (6)

The term Q̂(s_t, y_t) in Equation (6) directly matches the role of Q(s_t, y_t). Therefore, we take the scalars Q̂(s_t, y_t) as the approximation of the state-action values Q(s_t, y_t) to train the generator in Section IV-C. The theory behind this approximation will be made clear in Section IV-D.
be clear in Section IV-D. into M different scores for all the time steps t and take the
decomposed scores as Q(st , yt ). Because of Equation (8), we
C. Generator for StepGAN estimate state-action values Q(st , yt ) by adding a simple com-
The generator reads in the input sentence x and then predicts ponent to the architecture of discriminator as shown in (6).
one token ŷt at a time based on the generated word sequence We can derive Equation (8) by following:
ŷ1:t−1 . For stepwise evaluation, we update generator G by re- M 

placing Q(st , ŷt ) with Q̂(st , ŷt ) in Equation (4). E Q(st , ŷt )
ŷ 1 : M ∼P G (y |x)
D. Theory Behind StepGAN

StepGAN is based on a direct link between D(x, y) and Q(s_t, y_t):

\mathbb{E}_{\hat{y} \sim P_G(y|x)}[D(x, \hat{y})] = \mathbb{E}_{\hat{y} \sim P_G(y|x)}\!\left[\frac{1}{M} \sum_{t=1}^{M} Q(s_t, \hat{y}_t)\right].   (8)

The equation above suggests that we can decompose D(x, y) into M different scores, one for each time step t, and take the decomposed scores as Q(s_t, y_t). Because of Equation (8), we estimate the state-action values Q(s_t, y_t) by adding a simple component to the architecture of the discriminator, as shown in Equation (6).

We can derive Equation (8) as follows:

\mathbb{E}_{\hat{y}_{1:M} \sim P_G(y|x)}\!\left[\sum_{t=1}^{M} Q(s_t, \hat{y}_t)\right] = \mathbb{E}_{\hat{y}_{1:M} \sim P_G(y|x)}\!\left[\sum_{t=1}^{M} \mathbb{E}_{z \sim P_G(\cdot | x, \hat{y}_{1:t})}[r_{ter}(x, \hat{y}_{1:t}, z)]\right],   (9)

where z is sampled from P_G given the input x and the partial prefix ŷ_{1:t}, and r_{ter} is the terminal reward of the whole sequence {x, ŷ_{1:t}, z}. In a GAN, r_{ter} is estimated by the discriminator score D, so the equation can be rewritten as:

\mathbb{E}_{\hat{y}_{1:M} \sim P_G(y|x)}\!\left[\sum_{t=1}^{M} \mathbb{E}_{z \sim P_G(\cdot | x, \hat{y}_{1:t})}[r_{ter}(x, \hat{y}_{1:t}, z)]\right] = \mathbb{E}_{\hat{y}_{1:M} \sim P_G(y|x)}\!\left[\sum_{t=1}^{M} \mathbb{E}_{z \sim P_G(\cdot | x, \hat{y}_{1:t})}[D(x, \{\hat{y}_{1:t}, z\})]\right].   (10)

The two expectations can be combined, because for ŷ_{1:M} ∼ P_G(y|x) and z ∼ P_G(·|x, ŷ_{1:t}), the concatenation {ŷ_{1:t}, z} is distributed as a full response drawn from P_G(y|x):

\sum_{t=1}^{M} \mathbb{E}_{\hat{y}_{1:M} \sim P_G(y|x)} \mathbb{E}_{z \sim P_G(\cdot | x, \hat{y}_{1:t})}[D(x, \{\hat{y}_{1:t}, z\})] = \sum_{t=1}^{M} \mathbb{E}_{\hat{y}_{1:M} \sim P_G(y|x)}[D(x, \hat{y}_{1:M})] = M \cdot \mathbb{E}_{\hat{y}_{1:M} \sim P_G(y|x)}[D(x, \hat{y}_{1:M})].   (11)

Dividing the above by M, we obtain:

\mathbb{E}_{\hat{y} \sim P_G(y|x)}[D(x, \hat{y})] = \mathbb{E}_{\hat{y} \sim P_G(y|x)}\!\left[\frac{1}{M} \sum_{t=1}^{M} Q(s_t, \hat{y}_t)\right].   (12)

The same equation can also be derived by substituting P_G with P_R and ŷ with y*. Equation (6) is therefore a sample estimate of Equation (12).

TABLE I
THE EQUATIONS FOR DIFFERENT STEPWISE EVALUATION METHODS. THE DETAILS ARE EXPLAINED IN SECTION V

V. EXPERIMENTS: BASELINE APPROACHES

The following stepwise evaluation methods are used as baselines in the experiments. The corresponding equations are listed in Table I.

SeqGAN [6]. The discriminator predicts a final score D(x, y) after reading the entire pair of sentences (x, y). The generator adopts a 1-sample estimate, using D(x, y) as the multiplier for every time step t in the policy gradient updates.

Reward at Every Generation Step (REGS) [2]. The discriminator optimizes a randomly selected score D(x, y) ∼ {Q̂(s_t, y_t)}_{t=1}^{M}. All the scores {Q̂(s_t, y_t)}_{t=1}^{M} are taken as the multipliers for policy gradient.

Monte Carlo tree search (MCTS) [2], [6], [8]. MCTS computes an estimated state-action value Q*(s_t, y_t). Given the input sentence and a fixed prefix (x, y_{1:t}), we roll out I suffixes y^{i}_{t+1:M_i} using the generator G, where i indexes the suffixes from 1 to I and each suffix may have a different length M_i. Here {y_{1:t}, y^{i}_{t+1:M_i}} forms a full response whose first tokens are y_{1:t} and whose remaining tokens are y^{i}_{t+1:M_i}. We average D(x, y) over the I roll-out responses, \frac{1}{I}\sum_{i=1}^{I} D(x, \{y_{1:t}, y^{i}_{t+1:M_i}\}), as the approximated state-action value Q*(s_t, y_t). Monte Carlo search has high precision if I is very large, but at very high computational cost.

MaskGAN [9]. The discriminator outputs a score D(x, y_{1:t}) = Q̂(s_t, y_t) for every time step t; the generator takes each score D(x, y_{1:t}) as a received reward and estimates the state-action value by summing future rewards: Q^*(s_t, y_t) = \sum_{\tau=t}^{M} D(x, y_{1:\tau}).

The proposed approach differs from prior work in two aspects. First, MCTS stabilizes the training of SeqGAN with a large amount of extra computation, whereas StepGAN stabilizes the training with nearly zero extra cost. Second, while REGS and MaskGAN predict state-action values heuristically, StepGAN theoretically decomposes a discriminator score into the average of the corresponding state-action values.
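For contrast with StepGAN's single discriminator pass, here is a minimal sketch of the MCTS-style roll-out estimate described above; generator_rollout and discriminator_score are assumed placeholder callables, not the paper's models.

```python
# Sketch of the MCTS roll-out estimate of Q*(s_t, y_t) (Section V):
# complete the prefix I times with the generator and average D over the
# completed sequences.
import random

def generator_rollout(x, prefix, max_len=20):
    # placeholder: the real roll-out samples a suffix z ~ P_G(. | x, prefix)
    return prefix + [random.randint(1, 9) for _ in range(random.randint(1, max_len))]

def discriminator_score(x, y):
    # placeholder for D(x, y) in [0, 1]
    return random.random()

def mcts_q_estimate(x, prefix, num_rollouts=5):
    scores = []
    for _ in range(num_rollouts):                     # I roll-out paths per prefix
        full_response = generator_rollout(x, prefix)  # {y_{1:t}, y^i_{t+1:M_i}}
        scores.append(discriminator_score(x, full_response))
    return sum(scores) / num_rollouts                 # averaged D is the Q estimate

q = mcts_q_estimate(x=[1, 8, 3], prefix=[0, 1])
```

Every prefix of every training sequence needs its own set of roll-outs, which is the source of the computational cost compared in Figure 5.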
VI. EXPERIMENTS: SYNTHETIC EXPERIMENT

To compare the performance of the methods accurately, we conducted experiments on synthetic data. This also helps us understand how StepGAN works. The source code is available at https://github.com/Pascalson/Conditional-Seq-GANs

A. Task Description

A counting task is designed to capture the one-to-many property of conditional sequence generation tasks. In the counting task, given an input sequence x = {x_1, x_2, ..., x_t, ..., x_N}, a correct output sequence y = {y_1, y_2, y_3} obeys the following rules:

k \sim \{1, 2, \ldots, N\}, \quad y_1 = k - 1, \quad y_2 = x_k, \quad y_3 = N - k,   (13)

where each token x_t is a digit from 0 to 9, and N is the length of the input.
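To make the rules in Equation (13) concrete, the following is a small sketch that enumerates the valid targets for a given input; it reflects our reading of the rules rather than the released data-generation script.

```python
# Sketch of the counting task (Equation (13)): pick a position k, then output
# (k - 1, x_k, N - k). Any k in {1, ..., N} yields a correct answer, which is
# what gives the task its one-to-many property.
import random

def sample_target(x):
    N = len(x)
    k = random.randint(1, N)          # k ~ {1, 2, ..., N}
    return [k - 1, x[k - 1], N - k]   # y_1 = k - 1, y_2 = x_k, y_3 = N - k

def all_targets(x):
    N = len(x)
    return [[k - 1, x[k - 1], N - k] for k in range(1, N + 1)]

print(all_targets([1, 8, 3]))   # [[0, 1, 2], [1, 8, 1], [2, 3, 0]]
```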

TABLE II
RESULTS OF SYNTHETIC EXPERIMENT - COUNTING

Based on the above setup, given the same input sequence, several different output sequences are correct. For example, given the input sequence < 1, 8, 3 >, the possible answers are < 0, 1, 2 >, < 1, 8, 1 >, and < 2, 3, 0 >.

We generated 100,000 training examples, 10,000 validation examples, and 10,000 testing examples according to the counting rule. The maximum length of the input sequence was set to 10 (N ≤ 10), and the average number of possible answers is 4.97.

We evaluated different GAN-based sequence generation approaches. All the GANs were trained upon a pretrained model trained by MLE until convergence, with a minibatch size of 64 for 5,000 iterations.

B. Evaluation Metrics

In the synthetic experiments, because we know all the possible answers to an input sentence, we can compute precision and recall rates. Given each input x, we used the generator G to generate one sample by argmax and 100 samples by sampling from the softmax output distribution. Three such metrics are evaluated in Table II: (1) Prec, the precision of the argmax outputs; recall cannot be evaluated for argmax because there is only one sample. (2) SampP, the precision of the sampled outputs. (3) SampR, the recall of the sampled outputs. They are intuitive indexes of the performance of the models.

To gain more insight into the models, we analyzed the generator by evaluating both the forward KL divergence (FKLD) and the inverse KL divergence (IKLD) between the conditional distribution of the generator, P_G(y|x), and the true distribution defined by the counting rule, P_R(y|x). FKLD and IKLD are defined as:

FKLD = \sum_x \sum_y P_R(y|x) \log \frac{P_R(y|x)}{P_G(y|x)}, \quad IKLD = \sum_x \sum_y P_G(y|x) \log \frac{P_G(y|x)}{P_R(y|x)}.   (14)

All the above metrics are benefits of using a synthetic task. In real applications such as dialogue generation, it is not possible to list all the correct responses, so precision and recall cannot be accurately measured. Additionally, the computation of FKLD and IKLD is intractable in most real applications; it is tractable in the synthetic task due to the finite number of answers.
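The two divergences in Equation (14) can be computed directly when both distributions are enumerable, as in the counting task. Below is a minimal sketch over dictionaries mapping answers to probabilities; this representation is an assumption for illustration, not the paper's evaluation code.

```python
# Sketch of FKLD and IKLD (Equation (14)) for a single input x, assuming both
# conditional distributions are given as {answer: probability} dictionaries.
import math

def kl(p, q, eps=1e-12):
    # sum_y p(y) log(p(y) / q(y)); eps guards against zero probabilities in q
    support = set(p) | set(q)
    return sum(p.get(y, 0.0) * math.log((p.get(y, 0.0) + eps) / (q.get(y, 0.0) + eps))
               for y in support if p.get(y, 0.0) > 0.0)

p_true = {(0, 1, 2): 1/3, (1, 8, 1): 1/3, (2, 3, 0): 1/3}   # P_R(y|x) for x = <1, 8, 3>
p_gen  = {(0, 1, 2): 0.6, (1, 8, 1): 0.3, (2, 3, 0): 0.1}   # P_G(y|x) from the generator

fkld = kl(p_true, p_gen)   # forward KL
ikld = kl(p_gen, p_true)   # inverse KL
print(fkld, ikld)
```

Accumulating these per-input values over the test inputs would give scores of the kind reported in Table II.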
C. Results

The results are shown in Table II. We compared StepGAN with other stepwise evaluation methods, including REGS, MCTS, and MaskGAN. In addition, we evaluated a weighted StepGAN (StepGAN-W) that uses decaying weights in the policy gradient: the weights in Equation (7) were set to α_t = M − t, where M is the length of the generated output. We tested both increasing and decaying schedules and found that increasing weights did not improve the training, so the results of increasing weights are not shown here.

Precision and Recall. Compared to MLE, all GANs improve the precision of the argmax outputs (Prec in Table II). Specifically, SeqGAN improves the model by 0.94; previous stepwise evaluation methods improve the model by 1.19 ∼ 3.68; our proposed approaches StepGAN and StepGAN-W improve the model by 3.87 and 5.97, respectively.

For sampled softmax outputs, Table II shows that the GANs have better precision (the column labeled SampP) and weaker recall (the column labeled SampR). In particular, MaskGAN increases the precision the most but significantly drops the recall. This indicates that MaskGAN overfits to a smaller portion of the possible answers.

In Figure 2, we show the variance of the performance of the GAN-based approaches under different random parameter initializations. Each box represents the results of one model in Table II trained with different random initializations, and the green line represents the precision of MLE. The results show that StepGAN-W not only achieves better performance than the other approaches, it is also more stable under different parameter initializations.

Forward and Inverse KL Divergence. Compared to MLE, we observe that GANs increase FKLD (except SeqGAN) and decrease IKLD. This is reasonable because GANs do not minimize FKLD as MLE does, but minimize other distance metrics. Because minimizing only FKLD leaves P_G unbounded where P_R is almost zero (Equation (14)), and vice versa, we consider FKLD and IKLD equally important for measuring the true distance between two distributions. Hence, we add them together to obtain a score, FKLD+IKLD, that indicates which model is good overall. SeqGAN and the previous stepwise evaluation methods improve the FKLD+IKLD score by 0.009 ∼ 0.071; our approaches StepGAN and StepGAN-W improve this divergence score by 0.162 and 0.31, respectively, much better than the previous methods.

Fig. 2. The box plot of results obtained with different random parameter initializations. Each box represents the results of one model in Table II trained with different random initializations. The green line represents the precision of MLE.

The results show that the GANs fine-tune a pretrained model that has already converged to reach better performance and lower IKLD, and the precision scores are thus higher than those of the pretrained model. Furthermore, the stepwise evaluation methods can improve or maintain the model's ability to generate as many of the correct answers as possible, as shown by the comparable recall scores. Among them, StepGAN and StepGAN-W reach better precision with little penalty in recall, and have more balanced FKLD+IKLD scores.

VII. EXPERIMENTS: CHIT-CHAT DIALOGUE GENERATION

We trained chit-chat dialogue generation on OpenSubtitles [39], a collection of movie subtitles. After pretraining the generator by MLE, we fine-tuned the model for one epoch with three different stepwise evaluation methods: vanilla SeqGAN, REGS, and StepGAN. We choose to compare vanilla SeqGAN and REGS because they have been reported to be helpful for dialogue generation [2]. We do not report results for MCTS here because its computation is not tractable on our machine.

A. Experimental Setup

We trained chit-chat dialogue generation on OpenSubtitles [39] and used the top 4000 most frequent words as the vocabulary. After pretraining the generator by MLE, we trained the model for one epoch with the different stepwise evaluation methods: vanilla SeqGAN, REGS, and StepGAN. All the discriminators were pre-trained on real data and on generated data sampled from the pre-trained generator. To tune the parameters, grid search was used over optimization operation = {SGD, Adam, RMSProp}, learning rate = {1e-1, 1e-2, 1e-3, 1e-4}, discriminator iteration steps = {1, 5}, and batch size = {32, 64}. All the generators and discriminators are 1-layer GRUs with 512 dimensions, which is an acceptable number of parameters for sentence generation [18], [19].

B. Evaluation Metrics

Chit-chat dialogue generation currently still lacks generally accepted rules for evaluation [17]; therefore, we use several different metrics to evaluate the generated responses.

BLEU. The BLEU score [13] is an automatic evaluation metric that counts the appearance of n-grams. Although BLEU has been reported to be inconsistent with human evaluation on dialogue [17], we report it here as a reference.

Coherence Human Score (CoHS) and Semantics Human Score (SHS). We invited 5 judges to evaluate 200 examples. Because coherence is naturally binary (yes or no) and semantics is continuous, we asked the judges to measure them by a 0-1 test and by ranking, respectively. For CoHS, judges were asked to rate whether each response is coherent with the input sentence (0 or 1); for SHS, they were asked to rank the information abundance of the given responses. The ranking is then normalized to a score ranging from 0 to 1. For both CoHS and SHS, higher is better.

LEN and GErr. LEN stands for length and is the average length of the generated responses; GErr stands for grammar error, which we measured with an open-source tool, https://github.com/languagetool-org/languagetool, by counting the number of violations of grammar rules.

General. This metric counts the number of generated general responses. Here we define the top 10% responses generated by the model learned by MLE as general responses. The top two general responses are "I don't know" and "I'm sorry". However, if a ground-truth response is one of the top 10% responses, it is not counted as a general response.

Dist-N. This metric counts the number of distinct n-grams and divides it by the total number of generated words [5].
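A minimal sketch of the Dist-N computation as described above (distinct n-grams divided by total words); whitespace tokenization is an assumed simplification.

```python
# Sketch of Dist-N: number of distinct n-grams over all generated responses,
# divided by the total number of generated words.
def dist_n(responses, n):
    ngrams, total_words = set(), 0
    for r in responses:
        tokens = r.split()
        total_words += len(tokens)
        ngrams.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(ngrams) / max(total_words, 1)

outputs = ["i do n't know .", "i 'm john .", "it is good ."]
print(dist_n(outputs, 1), dist_n(outputs, 2))
```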
C. Results

The results for the proposed evaluation metrics are listed in Table III. Overall, we found that MLE gets the highest score on CoHS; SeqGAN is the highest on BLEU; and on the metrics related to the quality of the generated sentences (i.e., SHS, LEN, GErr, and General), StepGAN consistently gets the best scores.

We observed that all three types of GANs can generate sentences with better semantic scores (SHS), longer lengths (LEN), and fewer general responses (General), but they do not lower GErr equally. Table III shows that, among them, StepGAN is the most beneficial one, yielding the highest semantics human score (SHS), the longest generated sentences (LEN), the fewest grammatical errors (GErr), and the fewest general responses (General). This shows that the choice of stepwise evaluation method clearly impacts the results. The assumption that MLE easily results in general responses [5] is also verified by this observation. Even though MLE has the highest CoHS, it has relatively low SHS and LEN. This is probably because CoHS is more related to conditioning than to sequence modeling.

TABLE III
THE RESULTS OF CHIT-CHAT DIALOGUE GENERATION. COHS (%) IS THE COHERENCE HUMAN SCORE; SHS (%) IS THE SEMANTICS HUMAN SCORE. TO ENSURE THE RELIABILITY OF THE HUMAN SCORES, WE MEASURED THEIR INTRACLASS CORRELATION COEFFICIENT (ICC) [37], [38], WHICH SHOWS SUBSTANTIAL CONSISTENCY

TABLE IV
THE RESULTS OF HUMAN EVALUATION FOR CHIT-CHAT DIALOGUE GENERATION WITH DIFFERENT INFERENCE METHODS: ARGMAX, BEAM SEARCH (BS), AND MAXIMUM MUTUAL INFORMATION (MMI)

Fig. 3. The results of Dist-N for chit-chat dialogue generation.

The Dist-N metrics are plotted in Figure 3. Compared to MLE, the GANs do not have better diversity below the 4-gram level, but show better diversity above the 5-gram level. Specifically, for larger N, StepGAN shows much better performance than the others. This demonstrates the ability of StepGAN to strengthen the benefits of GANs.

In Table V, we present some generated examples of our trained dialogue generative models.

D. CoHS and SHS Results

To survey the effectiveness of different inference methods, we asked 5 judges to evaluate 200 randomly selected examples. Each example consists of an input and 15 outputs generated by five different training algorithms and three inference methods. The training algorithms include MLE, SeqGAN, MaliGAN, REGS, and StepGAN; the inference methods include argmax, beam search, and MMI [5].

As shown in Table IV, MLE with beam search achieves the best CoHS, while StepGAN-W with argmax achieves the best SHS. In terms of training methods, this phenomenon is consistent with the results in Table III. For the inference methods, although MMI has better SHS and weaker CoHS than beam search, both beam search and MMI lead to higher CoHS but lower SHS. This indicates that these inference methods tend to ensure coherence rather than informativeness, thus eliminating the benefits of GANs. Therefore, argmax is suggested for sequence GANs.

E. Analysis

1) Discriminator Outputs: To verify that the proposed stepwise evaluation method can estimate the state-action values, we analyzed the outputs of the discriminators. Empirically, given the same response, the discriminator scores D change frequently between training iterations. This makes the values Q̂(s_t, y_t) from D unstable and thus difficult to analyze directly. We therefore propose to measure the variance at each generation step throughout the training iterations. During training, Q̂(s_t, y_t) oscillates according to the performance of the generator at that time. If Q̂(s_t, y_t) oscillates violently at generation step t, the word generated at step t is crucial for determining whether a response is fake or real.
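A minimal sketch of the variance measurement described above, assuming the per-step scores Q̂(s_t, y_t) for one fixed (input, response) pair have been recorded at several training iterations; the recorded numbers below are fabricated placeholders.

```python
# Sketch of the per-step variance analysis (Section VII-E1): for a fixed
# sentence pair, record Q_hat(s_t, y_t) over training checkpoints and compute
# the variance at each step; darker cells in Figure 4 correspond to larger variance.
import numpy as np

# shape (num_checkpoints, M): Q_hat for each generation step, over training
q_history = np.array([
    [0.52, 0.48, 0.71, 0.30],
    [0.50, 0.55, 0.20, 0.33],
    [0.49, 0.51, 0.65, 0.35],
])
step_variance = q_history.var(axis=0)     # one variance per generation step t
critical_step = int(step_variance.argmax())
print(step_variance, critical_step)
```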
Four examples are shown in Figure 4. Figures 4a and 4b are, respectively, a true response and a wrong response given the input sentence "how are you ?"; Figures 4c and 4d are true and wrong responses given the input sentence "what 's your name ?". The depth of the color is proportional to the variance at each generation step for MCTS,1 REGS, and StepGAN.

We can observe that the variances of StepGAN and MCTS are similar, and the generation steps with large variance, that is, the critical words in the response, are consistent with human intuition.

1 In Figure 4, the MCTS results are calculated by applying MCTS to the SeqGAN model.

TABLE V
EXAMPLES OF GENERATED RESPONSES

Fig. 4. The variance of the learned state-action value during training. Higher variance is represented by a darker color. (a)(b) are given "how are you ?" as input, and (c)(d) are given "what 's your name ?" as input.

For example, when given "what 's your name ?", they focus on the bold parts of "i 'm john ." in Figure 4c and "i 'm sorry ." in Figure 4d. People cannot identify whether "i 'm" is good or bad, but they can identify whether the sentence is good or bad given more information. Besides, we found that the variances of StepGAN are even more reasonable to humans than those of MCTS (e.g., (a) "i 'm fine" ... (c) "i 'm john"). The most likely reason is that the variances of StepGAN are directly an expectation over all the training examples, whereas the variances of MCTS are an average over I = 5 roll-out paths and highly depend on the number I. Figure 4(a)-(d) also shows that REGS puts its emphasis on the last few steps, and its critical words do not match human intuition.

Fig. 5. The average computation times for one training iteration (i.e., one mini-batch) of chit-chat dialogue generation (OpenSubtitles) and the synthetic experiment (Counting).

2) Computation: As demonstrated in Figure 5, StepGAN is much more computationally efficient than MCTS. The computational consumption of MCTS depends on the number of roll-out paths, which is set to 5 in Figure 5. In the synthetic task, SeqGAN, REGS, and StepGAN are twice as fast as MCTS; in dialogue modeling, they are about five times faster than MCTS. For longer sequence lengths and larger numbers of roll-out paths, the time consumption of MCTS grows further. MCTS is therefore intractable for large datasets.

Fig. 6. The proportion of generated responses that have appeared in the training data.

3) General Responses: To see how the GANs learn from the training data, we first sorted the training data into input-outputs pairs: each input was paired with all of its outputs appearing in the training data. We then used the models trained by MLE, SeqGAN, REGS, and StepGAN to generate responses and counted how many responses appear in the training data. The ratio of the number of appeared responses to the number of inputs is shown in Figure 6. The results verify that GANs learn more creative responses [6] that are not in the training data. StepGAN in particular generates the fewest non-creative ones.

VIII. CONCLUSION

This paper verifies that stepwise evaluation methods have a notable impact on conditional sequence generation, and proposes a novel stepwise evaluation method, StepGAN, that directly estimates state-action values with the discriminator. In the experiments, we show that StepGAN helps conditional sequence generation. Compared to MCTS, it is computationally efficient; compared to vanilla SeqGAN and REGS, it is more accurate on the synthetic experiment and generates sentences of higher quality in dialogue generation.

REFERENCES

[1] O. Vinyals and Q. Le, "A neural conversational model," ICML Deep Learn. Workshop, Jun. 2015.
[2] J. Li, W. Monroe, T. Shi, S. Jean, A. Ritter, and D. Jurafsky, "Adversarial learning for neural dialogue generation," in Proc. Conf. Empirical Methods Natural Lang. Process., 2017, pp. 2157–2169.
[3] M. Ranzato, S. Chopra, M. Auli, and W. Zaremba, "Sequence level training with recurrent neural networks," Int. Conf. Learn. Represent., 2016.
[4] M. Arjovsky and L. Bottou, "Towards principled methods for training generative adversarial networks," Int. Conf. Learn. Represent., 2017.
[5] J. Li, M. Galley, C. Brockett, J. Gao, and B. Dolan, "A diversity-promoting objective function for neural conversation models," in Proc. Conf. North Am. Chapter Assoc. Comput. Linguistics: Human Lang. Technologies, 2016, pp. 110–119.
[6] L. Yu, W. Zhang, J. Wang, and Y. Yu, "SeqGAN: Sequence generative adversarial nets with policy gradient," in Proc. Nat. Conf. Artif. Intell., 2017, pp. 2852–2858.

[7] D. Bahdanau et al., "An actor-critic algorithm for sequence prediction," Int. Conf. Learn. Represent., 2017.
[8] T. Che et al., "Maximum-likelihood augmented discrete generative adversarial networks," 2017, arXiv:1702.07983.
[9] W. Fedus, I. Goodfellow, and A. M. Dai, "MaskGAN: Better text generation via filling in the _," Int. Conf. Learn. Represent., 2018. [Online]. Available: https://openreview.net/forum?id=ByOExmWAb
[10] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 3104–3112.
[11] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer, "Scheduled sampling for sequence prediction with recurrent neural networks," in Proc. Adv. Neural Inf. Process. Syst., 2015, pp. 1171–1179.
[12] R. J. Williams, "Simple statistical gradient-following algorithms for connectionist reinforcement learning," Mach. Learn., vol. 8, no. 3–4, pp. 229–256, 1992.
[13] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, "BLEU: A method for automatic evaluation of machine translation," in Proc. 40th Annu. Meeting Assoc. Comput. Linguistics, 2002, pp. 311–318.
[14] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, vol. 1. Cambridge, MA, USA: MIT Press, 1998.
[15] K. Kandasamy, Y. Bachrach, R. Tomioka, D. Tarlow, and D. Carter, "Batch policy gradient methods for improving neural conversation models," Int. Conf. Learn. Represent., 2017. [Online]. Available: https://openreview.net/forum?id=rJfMusFll
[16] J. Li, W. Monroe, A. Ritter, D. Jurafsky, M. Galley, and J. Gao, "Deep reinforcement learning for dialogue generation," in Proc. Conf. Empirical Methods Natural Lang. Process., 2016, pp. 1192–1202.
[17] C.-W. Liu et al., "How not to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation," in Proc. Conf. Empirical Methods Natural Lang. Process., 2016, pp. 2122–2132.
[18] O. Press, A. Bar, B. Bogin, J. Berant, and L. Wolf, "Language generation with recurrent generative adversarial networks without pre-training," 1st Workshop Learn. Generate Natural Lang., ICML, 2017.
[19] S. Rajeswar, S. Subramanian, F. Dutil, C. Pal, and A. Courville, "Adversarial generation of natural language," in Proc. 2nd Workshop Represent. Learn. NLP, 2017, pp. 241–251.
[20] Y. Zhang et al., "Adversarial feature matching for text generation," in Proc. Int. Conf. Mach. Learn., 2017, pp. 4006–4015.
[21] K. Lin, D. Li, X. He, Z. Zhang, and M.-T. Sun, "Adversarial ranking for language generation," in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 3155–3165.
[22] M. J. Kusner and J. M. Hernández-Lobato, "GANs for sequences of discrete elements with the Gumbel-softmax distribution," 2016, arXiv:1611.04051.
[23] Y. Zhang, Z. Gan, and L. Carin, "Generating text via adversarial training," in NIPS Workshop Adversarial Training, vol. 21, 2016.
[24] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville, "Improved training of Wasserstein GANs," in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 5767–5777.
[25] L. Wu et al., "Adversarial neural machine translation," in Proc. 10th Asian Conf. Mach. Learn., 2017, pp. 534–549.
[26] Z. Yang, W. Chen, F. Wang, and B. Xu, "Improving neural machine translation with conditional sequence generative adversarial nets," in Proc. Conf. North Am. Chapter Assoc. Comput. Linguistics: Human Lang. Technol., 2018, pp. 1346–1355.
[27] L. Liu, Y. Lu, M. Yang, Q. Qu, J. Zhu, and H. Li, "Generative adversarial network for abstractive text summarization," in Proc. Nat. Conf. Artif. Intell., 2018, pp. 8109–8110.
[28] R. Shetty, M. Rohrbach, L. A. Hendricks, M. Fritz, and B. Schiele, "Speaking the same language: Matching machine to human captions by adversarial training," in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 4155–4164.
[29] D. Li, Q. Huang, X. He, L. Zhang, and M.-T. Sun, "Generating diverse and accurate visual captions by comparative adversarial learning," Workshop Visually Grounded Interaction Lang., NIPS, 2018.
[30] X. Liang, Z. Hu, H. Zhang, C. Gan, and E. P. Xing, "Recurrent topic-transition GAN for visual paragraph generation," in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 3382–3391.
[31] V. Mnih et al., "Playing Atari with deep reinforcement learning," NIPS Deep Learn. Workshop, 2013.
[32] V. Mnih et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[33] M. Hausknecht and P. Stone, "Deep recurrent Q-learning for partially observable MDPs," AAAI Fall Symp. Series, 2015, pp. 29–37.
[34] K. Xu et al., "Show, attend and tell: Neural image caption generation with visual attention," in Proc. 32nd Int. Conf. Mach. Learn., 2015, pp. 2048–2057.
[35] J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin, "Convolutional sequence to sequence learning," in Proc. 34th Int. Conf. Mach. Learn., 2017, pp. 1243–1252.
[36] A. Vaswani et al., "Attention is all you need," in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 6000–6010.
[37] J. L. Fleiss and J. Cohen, "The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability," Edu. Psychological Meas., vol. 33, no. 3, pp. 613–619, 1973.
[38] J. J. Bartko, "The intraclass correlation coefficient as a measure of reliability," Psychological Rep., vol. 19, no. 1, pp. 3–11, 1966.
[39] J. Tiedemann, "News from OPUS - A collection of multilingual parallel corpora with tools and interfaces," in Recent Advances in Natural Language Processing, vol. V, N. Nicolov, K. Bontcheva, G. Angelova, and R. Mitkov, Eds. Borovets, Bulgaria: John Benjamins, Amsterdam/Philadelphia, 2009, pp. 237–248.

Yi-Lin Tuan received the B.S. degree in electrical engineering from National Taiwan University (NTU), Taipei City, Taiwan, in 2017. He is currently a Research Assistant with the Speech Processing and Machine Learning Laboratory, NTU.

Hung-Yi Lee received the M.S. and Ph.D. degrees from National Taiwan University (NTU), Taipei, Taiwan, in 2010 and 2012, respectively. From September 2012 to August 2013, he was a Postdoctoral Fellow with the Research Center for Information Technology Innovation, Academia Sinica. From September 2013 to July 2014, he was a Visiting Scientist with the Spoken Language Systems Group, MIT Computer Science and Artificial Intelligence Laboratory. He is currently an Assistant Professor with the Department of Electrical Engineering, NTU, with a joint appointment with the Department of Computer Science and Information Engineering of the university. His research interests include spoken language understanding, speech recognition, and machine learning.
