A Hierarchical Neural Autoencoder for Paragraphs and Documents

Jiwei Li, Minh-Thang Luong and Dan Jurafsky
Computer Science Department, Stanford University, Stanford, CA 94305, USA
jiweil, lmthang, jurafsky@stanford.edu

Abstract

Natural language generation of coherent long texts like paragraphs or longer documents is a challenging problem for recurrent network models. In this paper, we explore an important step toward this generation task: training an LSTM (Long Short-Term Memory) auto-encoder to preserve and reconstruct multi-sentence paragraphs. We introduce an LSTM model that hierarchically builds an embedding for a paragraph from embeddings for sentences and words, then decodes this embedding to reconstruct the original paragraph. We evaluate the reconstructed paragraph using standard metrics like ROUGE and Entity Grid, showing that neural models are able to encode texts in a way that preserves syntactic, semantic, and discourse coherence. While only a first step toward generating coherent text units from neural models, our work has the potential to significantly impact natural language generation and summarization. (Code for the models described in this paper is available at www.stanford.edu/~jiweil/.)

1 Introduction

Generating coherent text is a central task in natural language processing. A wide variety of theories exist for representing relationships between text units, such as Rhetorical Structure Theory (Mann and Thompson, 1988) or Discourse Representation Theory (Lascarides and Asher, 1991), for extracting these relations from text units (Marcu, 2000; LeThanh et al., 2004; Hernault et al., 2010; Feng and Hirst, 2012, inter alia), and for extracting other coherence properties characterizing the role each text unit plays with others in a discourse (Barzilay and Lapata, 2008; Barzilay and Lee, 2004; Elsner and Charniak, 2008; Li and Hovy, 2014, inter alia). However, applying these to text generation remains difficult. To understand how discourse units are connected, one has to understand the communicative function of each unit and the role it plays within the context that encapsulates it, recursively all the way up for the entire text. Identifying increasingly sophisticated human-developed features may be insufficient for capturing these patterns, but developing neural-based alternatives has also been difficult. Although neural representations for sentences can capture aspects of coherent sentence structure (Ji and Eisenstein, 2014; Li et al., 2014; Li and Hovy, 2014), it is not clear how they could help in generating more broadly coherent text.

Recent LSTM models (Hochreiter and Schmidhuber, 1997) have shown powerful results on generating meaningful and grammatical sentences in sequence generation tasks like machine translation (Sutskever et al., 2014; Bahdanau et al., 2014; Luong et al., 2015) or parsing (Vinyals et al., 2014). This performance is at least partially attributable to the ability of these systems to capture local compositionality: the way neighboring words are combined semantically and syntactically to form the meanings they express.

Could these models be extended to deal with the generation of larger structures like paragraphs or even entire documents? In standard sequence-to-sequence generation tasks, an input sequence is mapped to a vector embedding that represents the sequence, and then to an output string of words. Multi-text generation tasks like summarization could work in a similar way: the system reads a collection of input sentences, and is then asked to generate meaningful texts with certain properties (such as, for summarization, being succinct and conclusive). Just as the local semantic and syntactic compositionality of words can be captured by LSTM models, can the compositionality of discourse relations of higher-level text units (e.g., clauses, sentences, paragraphs, and documents) be captured in a similar way, with clues about how text units connect with one another stored in the neural compositional matrices?

In this paper we explore a first step toward this task of neural natural language generation. We focus on the component task of training a paragraph (document)-to-paragraph (document) autoencoder to reconstruct the input text sequence from a compressed vector representation produced by a deep learning model. We develop hierarchical LSTM models that arrange tokens, sentences and paragraphs in a hierarchical structure, with different levels of LSTMs capturing compositionality at the token-to-token and sentence-to-sentence levels.

The following section offers a brief description of sequence-to-sequence LSTM models. The proposed hierarchical LSTM models are then described in Section 3, followed by experimental results in Section 4, and then a brief conclusion.
2 Long Short-Term Memory (LSTM)

In this section we give a quick overview of LSTM models. LSTM models (Hochreiter and Schmidhuber, 1997) are defined as follows: given a sequence of inputs X = {x_1, x_2, ..., x_{n_X}}, an LSTM associates each timestep with an input, memory, and output gate, respectively denoted as i_t, f_t and o_t. We disambiguate e and h as follows: e_t denotes the vector for the individual text unit (e.g., word or sentence) at time step t, while h_t denotes the vector computed by the LSTM model at time t by combining e_t and h_{t-1}. \sigma denotes the sigmoid function. The vector representation h_t for each time step t is given by:

[i_t; f_t; o_t; l_t] = [\sigma; \sigma; \sigma; \tanh] \big( W \cdot [h_{t-1}; e_t] \big)    (1)

c_t = f_t \cdot c_{t-1} + i_t \cdot l_t    (2)

h_t = o_t \cdot c_t    (3)

where W ∈ R^{4K×2K}. In sequence-to-sequence generation tasks, each input X is paired with a sequence of outputs to predict: Y = {y_1, y_2, ..., y_{n_Y}}. An LSTM defines a distribution over outputs and sequentially predicts tokens using a softmax function:

P(Y|X) = \prod_{t \in [1, n_Y]} p(y_t \mid x_1, \ldots, x_{n_X}, y_1, \ldots, y_{t-1}) = \prod_{t \in [1, n_Y]} \frac{\exp(f(h_{t-1}, e_{y_t}))}{\sum_{y'} \exp(f(h_{t-1}, e_{y'}))}    (4)

f(h_{t-1}, e_{y_t}) denotes the activation function between h_{t-1} and e_{y_t}, where h_{t-1} is the representation output by the LSTM at time t-1. Note that each sentence ends with a special end-of-sentence symbol <end>. Commonly, the input and output sides use two different LSTMs with different sets of parameters for capturing different compositional patterns.

In the decoding procedure, the algorithm terminates when an <end> token is predicted. At each timestep, either a greedy approach or beam search can be adopted for word prediction. Greedy search selects the token with the largest conditional probability, the embedding of which is then combined with the preceding output for next-step token prediction. For beam search, Sutskever et al. (2014) found that a beam size of 2 suffices to provide most of the benefits of beam search.
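To make the gate arithmetic of Equations (1)-(3) and the token distribution of Equation (4) concrete, here is a minimal numpy sketch (illustrative only, not the authors' released code); the dot product against the embedding matrix used for the activation f(h, e_y), the toy dimensions, and the variable names are assumptions of the sketch, while the W ∈ R^{4K×2K} shape follows the paper.

```python
# Illustrative sketch of one LSTM step (Eqs. 1-3) and next-token prediction (Eq. 4).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(W, e_t, h_prev, c_prev):
    """One timestep: gates from [h_{t-1}; e_t], then memory and hidden updates."""
    K = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, e_t])   # W has shape (4K, 2K), Eq. (1)
    i_t = sigmoid(z[0:K])                   # input gate
    f_t = sigmoid(z[K:2*K])                 # forget gate
    o_t = sigmoid(z[2*K:3*K])               # output gate
    l_t = np.tanh(z[3*K:4*K])               # candidate update
    c_t = f_t * c_prev + i_t * l_t          # Eq. (2)
    h_t = o_t * c_t                         # Eq. (3)
    return h_t, c_t

def next_token_distribution(h_prev, E):
    """Eq. (4), with f(h, e_y) taken to be a dot product against embeddings E (V x K)."""
    scores = E @ h_prev
    scores -= scores.max()                  # numerical stability
    p = np.exp(scores)
    return p / p.sum()

# Toy usage: hidden size K = 4, vocabulary of 10 types.
rng = np.random.default_rng(0)
K, V = 4, 10
W = rng.normal(scale=0.08, size=(4 * K, 2 * K))
E = rng.normal(scale=0.08, size=(V, K))
h, c = np.zeros(K), np.zeros(K)
for token_id in [3, 7, 1]:                  # encode a tiny "sentence"
    h, c = lstm_step(W, E[token_id], h, c)
print(next_token_distribution(h, E))        # distribution over the next token
```

In the models below, the same kind of cell is applied at both the word and sentence levels, with separate parameter sets for encoding and decoding.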
3 Paragraph Autoencoder

In this section, we introduce our proposed hierarchical LSTM model for the autoencoder.

3.1 Notation

Let D denote a paragraph or a document, comprised of a sequence of N_D sentences, D = {s_1, s_2, ..., s_{N_D}, end_D}. An additional "end_D" token is appended to each document. Each sentence s is comprised of a sequence of tokens s = {w_1, w_2, ..., w_{N_s}}, where N_s denotes the length of the sentence, and each sentence ends with an "end_s" token. Each word w is associated with a K-dimensional embedding e_w = {e_w^1, e_w^2, ..., e_w^K}. Let V denote the vocabulary size. Each sentence s is associated with a K-dimensional representation e_s.

An autoencoder is a neural model whose output units are directly connected with, or identical to, its input units. Typically, inputs are compressed into a representation using a neural model (encoding), which is then used to reconstruct the input (decoding). For a paragraph autoencoder, both the input X and the output Y are the same document D. The autoencoder first compresses D into a vector representation e_D and then reconstructs D based on e_D.

For simplicity, we define LSTM(h_{t-1}, e_t) to be the LSTM operation on vectors h_{t-1} and e_t that yields h_t as in Equations (1) and (2). For clarification, we first describe the following notations used in the encoder and decoder:

• h_t^w and h_t^s denote hidden vectors from LSTM models; the subscript indicates timestep t and the superscript indicates operation at the word level (w) or sequence level (s). h_t^w(enc) specifies the encoding stage and h_t^w(dec) specifies the decoding stage.

• e_t^w and e_t^s denote the word-level and sentence-level embedding for the word and sentence at position t within its residing sentence or document.

3.2 Model 1: Standard LSTM

The whole input and output are treated as one sequence of tokens. Following Sutskever et al. (2014) and Bahdanau et al. (2014), we trained an autoencoder that first maps input documents into vector representations with an LSTM_encode and then reconstructs the inputs by predicting tokens within the document sequentially with an LSTM_decode. Two separate LSTMs are implemented for encoding and decoding, with no sentence structure considered. An illustration is shown in Figure 1.
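The following sketch shows the Model 1 wiring at a glance (illustrative only, not the authors' code): the document is treated as one flat token sequence, a single encoder compresses it into a vector, and a separate decoder greedily re-emits tokens until an end token appears. A plain tanh recurrence stands in for the LSTM cell of Equations (1)-(3) to keep the sketch short, and the zero vector used as the first decoder input is an assumed start-of-sequence embedding.

```python
# Illustrative sketch of the flat (Model 1) autoencoder: encode to one vector, decode greedily.
import numpy as np

rng = np.random.default_rng(1)
K, V, END = 8, 12, 0                      # hidden size, toy vocabulary, id of the end token
E = rng.normal(scale=0.08, size=(V, K))   # word embeddings
W_enc = rng.normal(scale=0.08, size=(K, 2 * K))   # encoder parameters
W_dec = rng.normal(scale=0.08, size=(K, 2 * K))   # separate decoder parameters

def step(W, h, e):
    # tanh recurrence standing in for the LSTM cell of Eqs. (1)-(3)
    return np.tanh(W @ np.concatenate([h, e]))

def encode(token_ids):
    h = np.zeros(K)
    for t in token_ids:                   # document treated as one flat token sequence
        h = step(W_enc, h, E[t])
    return h                              # compressed document representation

def greedy_decode(doc_vec, max_len):
    h, prev, out = doc_vec, np.zeros(K), []   # prev = assumed start-of-sequence embedding
    for _ in range(max_len):
        h = step(W_dec, h, prev)
        w = int(np.argmax(E @ h))         # greedy choice of the next token
        out.append(w)
        if w == END:                      # terminate once the end token is predicted
            break
        prev = E[w]
    return out

print(greedy_decode(encode([3, 5, 7, 2, END]), max_len=10))
```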
3.3 Model 2: Hierarchical LSTM

The hierarchical model draws on the intuition that just as the juxtaposition of words creates a joint meaning of a sentence, the juxtaposition of sentences also creates a joint meaning of a paragraph or a document.

Encoder. We first obtain representation vectors at the sentence level by putting one layer of LSTM (denoted as LSTM_encode^word) on top of a sentence's constituent words:

h_t^w(enc) = LSTM_encode^word(e_t^w, h_{t-1}^w(enc))    (5)

The vector output at the final time step is used to represent the entire sentence: e_s = h_{end}^w.

To build the representation e_D for the current document/paragraph D, another layer of LSTM (denoted as LSTM_encode^sentence) is placed on top of all sentences, computing representations sequentially at each timestep:

h_t^s(enc) = LSTM_encode^sentence(e_t^s, h_{t-1}^s(enc))    (6)

The representation computed at the final time step is used to represent the entire document: e_D = h_{end}^s.

Thus one LSTM operates at the token level, leading to the acquisition of sentence-level representations that are then used as inputs into a second LSTM that acquires document-level representations, in a hierarchical structure.

Decoder. As with encoding, the decoding algorithm operates on a hierarchical structure with two layers of LSTMs. LSTM outputs at the sentence level for time step t are obtained by:

h_t^s(dec) = LSTM_decode^sentence(e_t^s, h_{t-1}^s(dec))    (7)

The initial time step h_0^s(dec) = e_D, the output from the end of the encoding procedure. h_t^s(dec) is used as the original input into LSTM_decode^word for subsequently predicting tokens within sentence t+1. LSTM_decode^word predicts tokens at each position sequentially, the embedding of which is then combined with earlier hidden vectors for the next time-step prediction, until the end_s token is predicted. The procedure can be summarized as follows:

h_t^w(dec) = LSTM_decode^word(e_t^w, h_{t-1}^w(dec))    (8)

p(w|·) = softmax(e_w, h_{t-1}^w(dec))    (9)

During decoding, LSTM_decode^word generates each word token w sequentially and combines it with the earlier LSTM-output hidden vectors. The LSTM hidden vector computed at the final time step is used to represent the current sentence. This is passed to LSTM_decode^sentence, combined with h_t^s for the acquisition of h_{t+1}^s, and output to the next time step in sentence decoding.

For each timestep t, LSTM_decode^sentence has to first decide whether decoding should proceed or come to a full stop: we add an additional token end_D to the vocabulary, and decoding terminates when token end_D is predicted. Details are shown in Figure 2.

Figure 1: Standard Sequence to Sequence Model.
Figure 2: Hierarchical Sequence to Sequence Model.
Figure 3: Hierarchical Sequence to Sequence Model with Attention.
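A compact sketch of the hierarchical encoder of Equations (5) and (6) (illustrative, not the authors' code): a word-level recurrence turns each sentence into a vector, and a sentence-level recurrence turns the resulting sequence of sentence vectors into a single document vector e_D. A tanh recurrence again stands in for the LSTM cell, and the shapes and names are assumptions of the sketch.

```python
# Illustrative sketch of the hierarchical encoder (Eqs. 5-6): words -> sentence vectors -> e_D.
import numpy as np

rng = np.random.default_rng(2)
K, V = 8, 20
E = rng.normal(scale=0.08, size=(V, K))           # word embeddings
W_word = rng.normal(scale=0.08, size=(K, 2 * K))  # stand-in for LSTM_encode^word
W_sent = rng.normal(scale=0.08, size=(K, 2 * K))  # stand-in for LSTM_encode^sentence

def run(W, vectors):
    h = np.zeros(K)
    for v in vectors:                             # sequential recurrence over the inputs
        h = np.tanh(W @ np.concatenate([h, v]))
    return h                                      # final hidden state

def encode_document(doc):
    """doc is a list of sentences; each sentence is a list of token ids."""
    sentence_vecs = [run(W_word, [E[t] for t in sent]) for sent in doc]  # e_s per sentence
    return run(W_sent, sentence_vecs)             # e_D for the whole document

doc = [[1, 4, 2], [7, 3, 3, 9], [5, 6]]           # three toy "sentences"
print(encode_document(doc).shape)                 # (K,)
```

The decoder mirrors this structure in reverse: the sentence-level LSTM proposes a state per sentence, and the word-level LSTM expands each state into tokens until end_s, as in Equations (7)-(9).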
3.4 Model 3: Hierarchical LSTM with Attention

Attention models adopt a look-back strategy that links the current decoding stage with the input sentences, in an attempt to consider which part of the input is most responsible for the current decoding state. This attention version of the hierarchical model is inspired by similar work in image caption generation and machine translation (Xu et al., 2015; Bahdanau et al., 2014).

Let H = {h_1^s(enc), h_2^s(enc), ..., h_{N_D}^s(enc)} be the collection of sentence-level hidden vectors for each input sentence, output by LSTM_encode^sentence. Each element in H contains information about the input sequence, with a strong focus on the parts surrounding each specific sentence (time-step). During decoding, suppose that e_t^s denotes the sentence-level embedding at the current step and that h_{t-1}^s(dec) denotes the hidden vector output by LSTM_decode^sentence at the previous time step t-1. Attention models first link the current-step decoding information, i.e., h_{t-1}^s(dec), with each of the input sentences i ∈ [1, N_D], characterized by a strength indicator v_i:

v_i = U^T f(W_1 \cdot h_{t-1}^s(dec) + W_2 \cdot h_i^s(enc))    (10)

where W_1, W_2 ∈ R^{K×K} and U ∈ R^{K×1}. v_i is then normalized:

a_i = \frac{\exp(v_i)}{\sum_{i'} \exp(v_{i'})}    (11)

The attention vector is then created by averaging, with these weights, over all input sentences:

m_t = \sum_{i \in [1, N_D]} a_i \cdot h_i^s(enc)    (12)

The LSTM hidden vector for the current step is then obtained by combining e_t^s, m_t and h_{t-1}^s(dec):

[i_t; f_t; o_t; l_t] = [\sigma; \sigma; \sigma; \tanh] \big( W \cdot [h_{t-1}^s(dec); e_t^s; m_t] \big)    (13)

c_t = f_t \cdot c_{t-1} + i_t \cdot l_t    (14)

h_t = o_t \cdot c_t    (15)

where W ∈ R^{4K×3K}. h_t is then used for word prediction as in the vanilla version of the hierarchical model.
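The attention computation of Equations (10)-(12) reduces to a few lines. The sketch below (illustrative only, not the authors' code) uses tanh for the function f and follows the stated K×K and K×1 shapes for W_1, W_2 and U; everything else, including the random toy values, is an assumption of the sketch.

```python
# Illustrative sketch of sentence-level attention (Eqs. 10-12).
import numpy as np

rng = np.random.default_rng(3)
K, N = 8, 5                                        # hidden size, number of input sentences
W1 = rng.normal(scale=0.08, size=(K, K))
W2 = rng.normal(scale=0.08, size=(K, K))
U = rng.normal(scale=0.08, size=(K,))
H_enc = rng.normal(size=(N, K))                    # sentence-level encoder states h_i^s(enc)
h_dec_prev = rng.normal(size=(K,))                 # decoder state h_{t-1}^s(dec)

# Eq. (10): one strength indicator per input sentence (tanh assumed for f)
v = np.array([U @ np.tanh(W1 @ h_dec_prev + W2 @ h_i) for h_i in H_enc])
# Eq. (11): softmax normalization of the strengths
a = np.exp(v - v.max())
a /= a.sum()
# Eq. (12): attention vector as the weighted sum of encoder states
m_t = (a[:, None] * H_enc).sum(axis=0)
print(a, m_t.shape)
```

The resulting m_t is simply concatenated with e_t^s and h_{t-1}^s(dec) as the input to the wider gate matrix of Equation (13).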
3.5 Training and Testing

Parameters are estimated by maximizing the likelihood of outputs given inputs, as in standard sequence-to-sequence models. A softmax function is adopted for predicting each token within the output documents, and the error is backpropagated first through LSTM_decode^word to the sentences, then through LSTM_decode^sentence to the document representation e_D, and finally through LSTM_encode^sentence and LSTM_encode^word to the inputs. Stochastic gradient descent with minibatches is adopted.

For testing, we adopt a greedy strategy with no beam search. For a given document D, e_D is first obtained given the already learned LSTM_encode parameters and word embeddings. Then, in decoding, LSTM_decode^sentence computes an embedding at each sentence-level time step, which is first fed into the binary classifier that decides whether sentence decoding terminates, and then into LSTM_decode^word for word decoding.

4 Experiments

4.1 Dataset

We implement the proposed autoencoder on two datasets: a highly domain-specific dataset consisting of hotel reviews, and a general dataset extracted from Wikipedia.

Hotel Reviews. We use a subset of hotel reviews crawled from TripAdvisor. We consider only reviews ranging from 50 to 250 words; the model has problems dealing with extremely long sequences, as we will discuss later. We keep a vocabulary set consisting of the 25,000 most frequent words. A special "<unk>" token is used to denote all the remaining less frequent tokens. Reviews in which more than 2 percent of the words are unknown are discarded. Our training dataset is comprised of roughly 340,000 reviews; the testing set is comprised of 40,000 reviews. Dataset details are shown in Table 1.

Wikipedia. We extracted paragraphs from the Wikipedia corpus that meet the aforementioned length requirements. We keep a vocabulary of the 120,000 most frequent words. Paragraphs with more than 4 percent unknown words are discarded. The training dataset is comprised of roughly 500,000 paragraphs and testing contains roughly 50,000.

Table 1: Statistics for the datasets. W, S and D respectively represent the number of words, sentences, and documents/paragraphs; for example, "S per D" denotes the average number of sentences per document.

dataset        S per D   W per D   W per S
Hotel-Review   8.8       124.8     14.1
Wikipedia      8.4       132.9     14.8

4.2 Training Details and Implementation

Previous research has shown that deep LSTMs work better than shallow ones for sequence-to-sequence tasks (Vinyals et al., 2014; Sutskever et al., 2014). We adopt an LSTM structure with four layers for encoding and four layers for decoding, each of which is comprised of a different set of parameters. Each LSTM layer consists of 1,000 hidden neurons, and the dimensionality of word embeddings is set to 1,000. Other training details are given below, some of which follow Sutskever et al. (2014).

• Input documents are reversed.
• LSTM parameters and word embeddings are initialized from a uniform distribution between [-0.08, 0.08].
• Stochastic gradient descent is implemented without momentum, using a fixed learning rate of 0.1. We start halving the learning rate every half epoch after 5 epochs. We trained our models for a total of 7 epochs.
• Batch size is set to 32 (32 documents).
• The decoding algorithm allows generating at most 1.5 times the number of words in the inputs.
• Dropout rate is 0.2.
• Gradient clipping is adopted by scaling gradients when the norm exceeds a threshold of 5.

Our implementation on a single GPU (a Tesla K40m, with one Kepler GK110B and 2,880 CUDA cores) processes approximately 600-1,200 tokens per second.

4.3 Evaluations

We need to measure the closeness of the output (candidate) to the input (reference). We first adopt two standard evaluation metrics, ROUGE (Lin, 2004; Lin and Hovy, 2003) and BLEU (Papineni et al., 2002).

ROUGE is a recall-oriented measure widely used in the summarization literature. It measures the n-gram recall between the candidate text and the reference text(s). In this work, we only have one reference document (the input document), and the ROUGE score is therefore given by:

ROUGE_n = \frac{\sum_{gram_n \in input} count_{match}(gram_n)}{\sum_{gram_n \in input} count(gram_n)}    (16)

where count_match denotes the number of n-grams co-occurring in the input and output. We report ROUGE-1, ROUGE-2 and ROUGE-W (based on weighted longest common subsequence).

BLEU. Purely measuring recall will inappropriately reward long outputs. BLEU is designed to address this issue by emphasizing precision. The n-gram precision scores for our situation are given by:

precision_n = \frac{\sum_{gram_n \in output} count_{match}(gram_n)}{\sum_{gram_n \in output} count(gram_n)}    (17)

BLEU then combines the average logarithm of the precision scores with a penalty for excessive length. For details, see Papineni et al. (2002).
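A small sketch of the n-gram recall of Equation (16) and the n-gram precision of Equation (17) follows (illustrative, not the official ROUGE or BLEU scorers); clipping each output n-gram's credit at its count in the input is a standard choice and an assumption about the exact counting used here.

```python
# Illustrative n-gram recall (Eq. 16) and precision (Eq. 17) with clipped matching counts.
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_recall(input_toks, output_toks, n):
    """ROUGE_n-style: matched n-grams over the total n-grams of the input (reference)."""
    ref, cand = ngrams(input_toks, n), ngrams(output_toks, n)
    match = sum(min(c, cand[g]) for g, c in ref.items())
    return match / max(sum(ref.values()), 1)

def ngram_precision(input_toks, output_toks, n):
    """BLEU-style: matched n-grams over the total n-grams of the output (candidate)."""
    ref, cand = ngrams(input_toks, n), ngrams(output_toks, n)
    match = sum(min(c, ref[g]) for g, c in cand.items())
    return match / max(sum(cand.values()), 1)

inp = "the hotel beacon is the place we love to stay".split()
out = "the beacon hotel is the place we love".split()
print(ngram_recall(inp, out, 1), ngram_precision(inp, out, 2))
```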
Coherence Evaluation. Neither BLEU nor ROUGE attempts to evaluate true coherence, and there is no generally accepted and readily available coherence evaluation metric. (Wolf and Gibson (2005) and Lin et al. (2011) proposed metrics based on discourse relations, but these are hard to apply widely since identifying discourse relations is itself a difficult problem. Indeed, sophisticated coherence evaluation metrics are seldom adopted in real-world applications, and summarization researchers tend to use simple approximations like the number of overlapping tokens or topic distribution similarity, e.g., Yan et al. (2011b), Yan et al. (2011a), Celikyilmaz and Hakkani-Tür (2011).) Because of the difficulty of developing a universal coherence evaluation metric, we propose here only a tailored metric specific to our case. Based on the assumption that human-generated texts (i.e., the input documents in our tasks) are coherent (Barzilay and Lapata, 2008), we compare generated outputs with input documents in terms of how much of the original text order is preserved.

We develop a grid evaluation metric similar to the entity transition algorithms in Barzilay and Lee (2004) and Lapata and Barzilay (2005). The key idea of Barzilay and Lapata's models is to first identify the grammatical roles (i.e., object and subject) that entities play and then model the transition probability over entities and roles across sentences. We represent each sentence as a feature vector consisting of the verbs and nouns in the sentence. Next we align sentences from output documents to input sentences based on sentence-to-sentence F1 scores (precision and recall are computed similarly to ROUGE and BLEU, but at the sentence level) using these feature vectors. Note that multiple output sentences can be matched to one input sentence. Assume that output sentence s_i is aligned with input sentence s_{i'}, where i and i' denote the position index of an output sentence and its aligned input. The penalization score L is then given by:

L = \frac{2}{N_{output} \cdot (N_{output} - 1)} \sum_{i \in [1, N_{output}-1]} \sum_{j \in [i+1, N_{output}]} \big| (j - i) - (j' - i') \big|    (18)

Equation (18) can be interpreted as follows: (j - i) denotes the distance, in terms of position index, between two output sentences indexed by j and i, and (j' - i') denotes the distance between their mirrors in the input. As we wish to penalize the degree of permutation of the text order, we penalize the absolute difference between the two computed distances. This metric is also tied to overall precision and recall: an irrelevant output will be aligned to a random input and thus be heavily penalized. The deficiency of the proposed metric is that it takes only a semantic perspective on coherence, barely considering syntactic issues.
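The order-preservation penalty of Equation (18) is easy to compute once the sentence alignment is known. The sketch below (illustrative only, not the authors' code) takes the alignment as given, as a list mapping each output sentence position to the index of its aligned input sentence, and averages the pairwise distance mismatches; normalizing by the number of sentence pairs follows the "average degree of permutation" reading of the metric and is an assumption of the sketch.

```python
# Illustrative sketch of the coherence penalty L (Eq. 18), given a precomputed alignment.
def coherence_penalty(align):
    """align[j] = index of the input sentence aligned with output sentence j."""
    n = len(align)
    if n < 2:
        return 0.0
    total = sum(abs((j - i) - (align[j] - align[i]))     # |(j - i) - (j' - i')|
                for i in range(n - 1) for j in range(i + 1, n))
    return 2.0 * total / (n * (n - 1))                   # average over the n(n-1)/2 pairs

print(coherence_penalty([0, 1, 2, 3]))   # perfectly preserved order -> 0.0
print(coherence_penalty([0, 2, 1, 3]))   # two sentences swapped -> positive penalty
```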
Table 2: A few examples produced by the hierarchical LSTM alongside the inputs.

Input-Wiki: washington was unanimously elected president by the electors in both the 1788 – 1789 and 1792 elections . he oversaw the creation of a strong , well-financed national government that maintained neutrality in the french revolutionary wars , suppressed the whiskey rebellion , and won acceptance among americans of all types . washington established many forms in government still used today , such as the cabinet system and inaugural address . his retirement after two terms and the peaceful transition from his presidency to that of john adams established a tradition that continued up until franklin d . roosevelt was elected to a third term . washington has been widely hailed as the " father of his country " even during his lifetime .

Output-Wiki: washington was elected as president in 1792 and voters <unk> of these two elections until 1789 . he continued suppression <unk> whiskey rebellion of the french revolution war government , strong , national well are involved in the establishment of the fin advanced operations , won acceptance . as in the government , such as the establishment of various forms of inauguration speech washington , and are still in use . <unk> continued after the two terms of his quiet transition to retirement of <unk> <unk> of tradition to have been elected to the third paragraph . but , " the united nations of the father " and in washington in his life , has been widely praised .

Input-Wiki: apple inc . is an american multinational corporation headquartered in cupertino , california , that designs , develops , and sells consumer electronics , computer software , online services , and personal computers . its best-known hardware products are the mac line of computers , the ipod media player , the iphone smartphone , and the ipad tablet computer . its online services include icloud , the itunes store , and the app store . apple's consumer software includes the os x and ios operating systems , the itunes media browser , the safari web browser , and the ilife and iwork creativity and productivity suites .

Output-Wiki: apple is a us company in california , <unk> , to develop electronics , softwares , and pc , sells . hardware include the mac series of computers , ipod , iphone . its online services , including icloud , itunes store and in app store . softwares , including os x and ios operating system , itunes , web browser , <unk> , including a productivity suite .

Input-Wiki: paris is the capital and most populous city of france . situated on the seine river , in the north of the country , it is in the centre of the le-de-france region . the city of paris has a population of 2273305 inhabitants . this makes it the fifth largest city in the european union measured by the population within the city limits .

Output-Wiki: paris is the capital and most populated city in france . located in the <unk> , in the north of the country , it is the center of <unk> . paris , the city has a population of <num> inhabitants . this makes the eu 's population within the city limits of the fifth largest city in the measurement .

Input-Review: on every visit to nyc , the hotel beacon is the place we love to stay . so conveniently located to central park , lincoln center and great local restaurants . the rooms are lovely . beds so comfortable , a great little kitchen and new wizz bang coffee maker . the staff are so accommodating and just love walking across the street to the fairway supermarket with every imaginable goodies to eat .

Output-Review: every time in new york , lighthouse hotel is our favorite place to stay . very convenient , central park , lincoln center , and great restaurants . the room is wonderful , very comfortable bed , a kitchenette and a large explosion of coffee maker . the staff is so inclusive , just across the street to walk to the supermarket channel love with all kinds of what to eat .

4.4 Results

A summary of our experimental results is given in Table 3. We observe better performance on the hotel-review dataset than on the open-domain Wikipedia dataset, for the intuitive reason that hotel-review documents and sentences are written in a more fixed format and are easier to predict. The hierarchical model that considers sentence-level structure outperforms the standard sequence-to-sequence models, and attention at the sentence level introduces a further performance boost over the vanilla hierarchical model.

Table 3: Results for the three models on the two datasets. For the coherence score L, smaller values signify better performance.

Model                    Dataset        BLEU    ROUGE-1   ROUGE-2   Coherence (L)
Standard                 Hotel Review   0.241   0.571     0.302     1.92
Hierarchical             Hotel Review   0.267   0.590     0.330     1.71
Hierarchical+Attention   Hotel Review   0.285   0.624     0.355     1.57
Standard                 Wikipedia      0.178   0.502     0.228     2.75
Hierarchical             Wikipedia      0.202   0.529     0.250     2.30
Hierarchical+Attention   Wikipedia      0.220   0.544     0.291     2.04
With respect to the coherence evaluation, the original sentence order is mostly preserved: the hierarchical model with attention achieves L = 1.57 on the hotel-review dataset, meaning that the relative positions of two input sentences are permuted by an average degree of 1.57. Even for the Wikipedia dataset, where more poor-quality sentences are observed, the original text order can still be adequately maintained, with L = 2.04.

5 Discussion and Future Work

In this paper, we extended recent sequence-to-sequence LSTM models to the task of multi-sentence generation. We trained an autoencoder to see how well LSTM models can reconstruct input documents of many sentences. We find that the proposed hierarchical LSTM models can partially preserve the semantic and syntactic integrity of multi-text units and generate meaningful and grammatical sentences in coherent order. Our model performs better than standard sequence-to-sequence models, which do not consider the intrinsic hierarchical discourse structure of texts.

While our work on auto-encoding for larger texts is only a preliminary effort toward allowing neural models to deal with discourse, it nonetheless suggests that neural models are capable of encoding complex clues about how coherent texts are connected.

Performance on this autoencoder task could certainly also benefit from more sophisticated neural models. For example, one extension might align the sentence currently being generated with the original input sentence (similar to sequence-to-sequence translation in Bahdanau et al. (2014)), thereby transforming the original task into sentence-to-sentence generation. However, our long-term goal here is not to perfect this basic multi-text generation scenario of reconstructing input documents, but rather to extend it to more important applications.

That is, the autoencoder described in this work, where the input sequence X is identical to the output Y, is only the most basic instance of the family of document (paragraph)-to-document (paragraph) generation tasks. We hope the ideas proposed in this paper can play some role in enabling more sophisticated generation tasks like summarization, where the inputs are original documents and the outputs are summaries, or question answering, where the inputs are questions and the outputs are the actual wording of answers. Such tasks could extend this paradigm and could themselves benefit from task-specific adaptations. In summarization, the sentences to generate at each timestep might be pre-pointed to or pre-aligned with specific aspects, topics, or pieces of the texts to be summarized. Dialogue systems could incorporate information about the user or the time course of the dialogue. In any case, we look forward to more sophisticated applications of neural models to the important task of natural language generation.

6 Acknowledgements

The authors want to thank Gabor Angeli, Sam Bowman, Percy Liang and other members of the Stanford NLP group for insightful comments and suggestions. We also thank the three anonymous ACL reviewers for helpful comments. This work is supported by an Enlight Foundation Graduate Fellowship and a gift from Bloomberg L.P., which we gratefully acknowledge.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Regina Barzilay and Mirella Lapata. 2008. Modeling local coherence: An entity-based approach. Computational Linguistics, 34(1):1–34.
Regina Barzilay and Lillian Lee. 2004. Catching the drift: Probabilistic content models, with applications to generation and summarization. arXiv preprint cs/0405039.

Asli Celikyilmaz and Dilek Hakkani-Tür. 2011. Discovery of topically coherent sentences for extractive summarization. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Volume 1, pages 491–499. Association for Computational Linguistics.

Micha Elsner and Eugene Charniak. 2008. Coreference-inspired coherence modeling. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers, pages 41–44. Association for Computational Linguistics.

Vanessa Wei Feng and Graeme Hirst. 2012. Text-level discourse parsing with rich linguistic features. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers, Volume 1, pages 60–68. Association for Computational Linguistics.

Hugo Hernault, Helmut Prendinger, Mitsuru Ishizuka, et al. 2010. HILDA: A discourse parser using support vector machine classification. Dialogue & Discourse, 1(3).

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Yangfeng Ji and Jacob Eisenstein. 2014. Representation learning for text-level discourse parsing. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, Volume 1, pages 13–24.

Mirella Lapata and Regina Barzilay. 2005. Automatic evaluation of text coherence: Models and representations. In IJCAI, volume 5, pages 1085–1090.

Alex Lascarides and Nicholas Asher. 1991. Discourse relations and defeasible knowledge. In Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics, pages 55–62. Association for Computational Linguistics.

Huong LeThanh, Geetha Abeysinghe, and Christian Huyck. 2004. Generating discourse structures for written texts. In Proceedings of the 20th International Conference on Computational Linguistics, page 329. Association for Computational Linguistics.

Jiwei Li and Eduard Hovy. 2014. A model of coherence based on distributed sentence representation.

Jiwei Li, Rumeng Li, and Eduard Hovy. 2014. Recursive deep models for discourse parsing. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2061–2069.

Chin-Yew Lin and Eduard Hovy. 2003. Automatic evaluation of summaries using n-gram co-occurrence statistics. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1, pages 71–78. Association for Computational Linguistics.

Ziheng Lin, Hwee Tou Ng, and Min-Yen Kan. 2011. Automatically evaluating text coherence using discourse relations. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Volume 1, pages 997–1006. Association for Computational Linguistics.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, pages 74–81.

Thang Luong, Ilya Sutskever, Quoc V. Le, Oriol Vinyals, and Wojciech Zaremba. 2015. Addressing the rare word problem in neural machine translation. In ACL.

William C. Mann and Sandra A. Thompson. 1988. Rhetorical structure theory: Toward a functional theory of text organization. Text, 8(3):243–281.

Daniel Marcu. 2000. The rhetorical parsing of unrestricted texts: A surface-based approach. Computational Linguistics, 26(3):395–448.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318. Association for Computational Linguistics.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.
Oriol Vinyals, Lukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey Hinton. 2014. Grammar as a foreign language. arXiv preprint arXiv:1412.7449.

Florian Wolf and Edward Gibson. 2005. Representing discourse coherence: A corpus-based study. Computational Linguistics, 31(2):249–287.

Kelvin Xu, Jimmy Ba, Ryan Kiros, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. arXiv preprint arXiv:1502.03044.

Rui Yan, Liang Kong, Congrui Huang, Xiaojun Wan, Xiaoming Li, and Yan Zhang. 2011a. Timeline generation through evolutionary trans-temporal summarization. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 433–443. Association for Computational Linguistics.

Rui Yan, Xiaojun Wan, Jahna Otterbacher, Liang Kong, Xiaoming Li, and Yan Zhang. 2011b. Evolutionary timeline summarization: A balanced optimization framework via iterative substitution. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 745–754. ACM.

Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1106–1115, Beijing, China, July 26–31, 2015.

DOI: 10.3115/v1/p15-1107
Publisher site
See Article on Publisher Site

Abstract

Jiwei Li, Minh-Thang Luong and Dan Jurafsky Computer Science Department, Stanford University, Stanford, CA 94305, USA jiweil, lmthang, jurafsky@stanford.edu Abstract 2004; Elsner and Charniak, 2008; Li and Hovy, 2014, inter alia). However, applying these to text Natural language generation of coherent generation remains difficult. To understand how long texts like paragraphs or longer doc- discourse units are connected, one has to under- uments is a challenging problem for re- stand the communicative function of each unit, current networks models. In this paper, and the role it plays within the context that en- we explore an important step toward this capsulates it, recursively all the way up for the generation task: training an LSTM (Long- entire text. Identifying increasingly sophisticated short term memory) auto-encoder to pre- human-developed features may be insufficient for serve and reconstruct multi-sentence para- capturing these patterns. But developing neural- graphs. We introduce an LSTM model that based alternatives has also been difficult. Al- hierarchically builds an embedding for a though neural representations for sentences can paragraph from embeddings for sentences capture aspects of coherent sentence structure (Ji and words, then decodes this embedding and Eisenstein, 2014; Li et al., 2014; Li and Hovy, to reconstruct the original paragraph. We 2014), it’s not clear how they could help in gener- evaluate the reconstructed paragraph us- ating more broadly coherent text. ing standard metrics like ROUGE and En- Recent LSTM models (Hochreiter and Schmid- tity Grid, showing that neural models are huber, 1997) have shown powerful results on gen- able to encode texts in a way that preserve erating meaningful and grammatical sentences in syntactic, semantic, and discourse coher- sequence generation tasks like machine translation ence. While only a first step toward gener- (Sutskever et al., 2014; Bahdanau et al., 2014; Lu- ating coherent text units from neural mod- ong et al., 2015) or parsing (Vinyals et al., 2014). els, our work has the potential to signifi- This performance is at least partially attributable cantly impact natural language generation to the ability of these systems to capture local and summarization . compositionally: the way neighboring words are combined semantically and syntactically to form 1 Introduction meanings that they wish to express. Generating coherent text is a central task in natural Could these models be extended to deal with language processing. A wide variety of theories generation of larger structures like paragraphs or exist for representing relationships between text even entire documents? In standard sequence- units, such as Rhetorical Structure Theory (Mann to-sequence generation tasks, an input sequence and Thompson, 1988) or Discourse Representa- is mapped to a vector embedding that represents tion Theory (Lascarides and Asher, 1991), for ex- the sequence, and then to an output string of tracting these relations from text units (Marcu, words. 
Multi-text generation tasks like summa- 2000; LeThanh et al., 2004; Hernault et al., 2010; rization could work in a similar way: the sys- Feng and Hirst, 2012, inter alia), and for extract- tem reads a collection of input sentences, and ing other coherence properties characterizing the is then asked to generate meaningful texts with role each text unit plays with others in a discourse certain properties (such as—for summarization— (Barzilay and Lapata, 2008; Barzilay and Lee, being succinct and conclusive). Just as the local 1 semantic and syntactic compositionally of words Code for models described in this paper are available at www.stanford.edu/ jiweil/. can be captured by LSTM models, can the com- Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, pages 1106–1115, Beijing, China, July 26-31, 2015. 2015 Association for Computational Linguistics positionally of discourse releations of higher-level ing a softmax function: text units (e.g., clauses, sentences, paragraphs, and P (Y|X) documents) be captured in a similar way, with clues about how text units connect with each an- = p(y |x , x , ..., x , y , y , ..., y ) t 1 2 t 1 2 t−1 other stored in the neural compositional matrices? t∈[1,n ] In this paper we explore a first step toward this Y exp(f(h , e )) t−1 y task of neural natural language generation. We fo- exp(f(h , e 0)) t−1 y t∈[1,n ] cus on the component task of training a paragraph (4) (document)-to-paragraph (document) autoencoder f(h , e ) denotes the activation function be- t−1 y to reconstruct the input text sequence from a com- tween e and e , where h is the representa- h−1 y t−1 pressed vector representation from a deep learn- tion outputted from the LSTM at time t− 1. Note ing model. We develop hierarchical LSTM mod- that each sentence ends up with a special end-of- els that arranges tokens, sentences and paragraphs sentence symbol <end>. Commonly, the input in a hierarchical structure, with different levels of and output use two different LSTMs with differ- LSTMs capturing compositionality at the token- ent sets of convolutional parameters for capturing token and sentence-to-sentence levels. different compositional patterns. We offer in the following section to a brief de- In the decoding procedure, the algorithm termi- scription of sequence-to-sequence LSTM models. nates when an <end> token is predicted. At each The proposed hierarchical LSTM models are then timestep, either a greedy approach or beam search described in Section 3, followed by experimental can be adopted for word prediction. Greedy search results in Section 4, and then a brief conclusion. selects the token with the largest conditional prob- ability, the embedding of which is then combined 2 Long-Short Term Memory (LSTM) with preceding output for next step token predic- tion. For beam search, (Sutskever et al., 2014) dis- In this section we give a quick overview of LSTM covered that a beam size of 2 suffices to provide models. LSTM models (Hochreiter and Schmid- most of benefits of beam search. huber, 1997) are defined as follows: given a sequence of inputs X = {x , x , ..., x }, an 1 2 n 3 Paragraph Autoencoder LSTM associates each timestep with an input, memory and output gate, respectively denoted as In this section, we introduce our proposed hierar- chical LSTM model for the autoencoder. i , f and o . 
For notations, we disambiguate e and t t t h where e denote the vector for individual text 3.1 Notation unite (e.g., word or sentence) at time step t while Let D denote a paragraph or a document, which h denotes the vector computed by LSTM model is comprised of a sequence of N sentences, at time t by combining e and h . σ denotes the D t t−1 1 2 N D = {s , s , ..., s , end }. An additional sigmoid function. The vector representation h for D ” end ” token is appended to each document. each time-step t is given by: D Each sentence s is comprised of a sequence of 1 2 N tokens s = {w , w , ..., w } where N denotes i σ " t # " # " # the length of the sentence, each sentence end- f σ h t t−1 = W · (1) ing with an “ end ” token. The word w is as- o σ e t t sociated with a K -dimensional embedding e , l tanh 1 2 K e = {e , e , ..., e }. Let V denote vocabu- w w w lary size. Each sentence s is associated with a K- c = f · c + i · l (2) t t t−1 t t dimensional representation e . An autoencoder is a neural model where output h = o · c (3) t t units are directly connected with or identical to in- 4K×2K where W ∈ R In sequence-to-sequence put units. Typically, inputs are compressed into generation tasks, each input X is paired with a representation using neural models (encoding), a sequence of outputs to predict: Y = which is then used to reconstruct it back (decod- {y , y , ..., y }. An LSTM defines a distribution ing). For a paragraph autoencoder, both the input 1 2 n over outputs and sequentially predicts tokens us- X and output Y are the same document D. The 1107 autoencoder first compresses D into a vector rep- To build representation e for the current doc- resentation e and then reconstructs D based on ument/paragraph D, another layer of LSTM (de- sentence e . noted as LSTM ) is placed on top of all sen- encode tences, computing representations sequentially for For simplicity, we define LSTM(h , e ) to t−1 t be the LSTM operation on vectors h and e to each timestep: t−1 t achieve h as in Equ.1 and 2. For clarification, s sentence s s we first describe the following notations used in h (enc) = LSTM (e , h (enc)) (6) t encode t t−1 encoder and decoder: Representation e computed at the final time end w s • h and h denote hidden vectors from LSTM t t step is used to represent the entire document: models, the subscripts of which indicate e = h . end timestep t, the superscripts of which indi- Thus one LSTM operates at the token level, cate operations at word level (w) or sequence leading to the acquisition of sentence-level rep- level (s). h (enc) specifies encoding stage resentations that are then used as inputs into the and h (dec) specifies decoding stage. second LSTM that acquires document-level repre- sentations, in a hierarchical structure. w s • e and e denotes word-level and sentence- t t level embedding for word and sentence at po- Decoder As with encoding, the decoding algo- sition t in terms of its residing sentence or rithm operates on a hierarchical structure with two document. layers of LSTMs. LSTM outputs at sentence level for time step t are obtained by: 3.2 Model 1: Standard LSTM s sentence s s The whole input and output are treated as one h (dec) = LSTM (e , h (dec)) (7) t decode t t−1 sequence of tokens. Following Sutskever et al. (2014) and Bahdanau et al. (2014), we trained s The initial time step h (d) = e , the end-to-end an autoencoder that first maps input documents output from the encoding procedure. 
h (d) is used into vector representations from a LSTM word encode as the original input into LSTM for subse- decode and then reconstructs inputs by predicting to- quently predicting tokens within sentence t + 1. kens within the document sequentially from a word LSTM predicts tokens at each position se- decode LSTM . Two separate LSTMs are imple- decode quentially, the embedding of which is then com- mented for encoding and decoding with no sen- bined with earlier hidden vectors for the next time- tence structures considered. Illustration is shown step prediction until the end token is predicted. in Figure 1. The procedure can be summarized as follows: 3.3 Model 2: Hierarchical LSTM w sentence w w h (dec) = LSTM (e , h (dec)) (8) t decode t t−1 The hierarchical model draws on the intuition that just as the juxtaposition of words creates a joint p(w|·) = softmax(e , h (dec)) (9) t−1 meaning of a sentence, the juxtaposition of sen- tences also creates a joint meaning of a paragraph word During decoding, LSTM generates each decode or a document. word token w sequentially and combines it with earlier LSTM-outputted hidden vectors. The Encoder We first obtain representation vectors LSTM hidden vector computed at the final time at the sentence level by putting one layer of LSTM word step is used to represent the current sentence. (denoted as LSTM ) on top of its containing encode sentence This is passed to LSTM , combined words: decode with h for the acquisition of h , and outputted t+1 w word w w to the next time step in sentence decoding. h (enc) = LSTM (e , h (enc)) (5) t encode t t−1 sentence For each timestep t, LSTM has to first decode The vector output at the ending time-step is used decide whether decoding should proceed or come to a full stop: we add an additional token end to to represent the entire sentence as the vocabulary. Decoding terminates when token e = h end is predicted. Details are shown in Figure 2. s D end 1108 Figure 1: Standard Sequence to Sequence Model. Figure 2: Hierarchical Sequence to Sequence Model. Figure 3: Hierarchical Sequence to Sequence Model with Attention. 3.4 Model 3: Hierarchical LSTM with input is most responsible for the current decoding Attention state. This attention version of hierarchical model is inspired by similar work in image caption gen- Attention models adopt a look-back strategy by eration and machine translation (Xu et al., 2015; Bahdanau et al., 2014). linking the current decoding stage with input sen- s s s tences in an attempt to consider which part of the Let H = {h (e), h (e), ..., h (e)} be the 1 2 1109 dataset S per D W per D W per S collection of sentence-level hidden vectors for Hotel-Review 8.8 124.8 14.1 each sentence from the inputs, outputted from Sentence LSTM . Each element in H contains in- Wikipedia 8.4 132.9 14.8 encode formation about input sequences with a strong fo- Table 1: Statistics for the Datasets. W, S and D re- cus on the parts surrounding each specific sentence s spectively represent number of words, number of (time-step). During decoding, suppose that e de- sentences, and number of documents/paragraphs. notes the sentence-level embedding at current step s For example, “S per D” denotes average number and that h (dec) denotes the hidden vector out- t−1 of sentences per document. sentence putted from LSTM at previous time step decode t−1. 
Attention models would first link the current- step decoding information, i.e., h (dec) which t−1 For testing, we adopt a greedy strategy with sentence is outputted from LSTM with each of the dec no beam search. For a given document D, e input sentences i ∈ [1, N], characterized by a is first obtained given already learned LSTM encode strength indicator v : parameters and word embeddings. Then in decod- sentence T s s ing, LSTM computes embeddings at each decode v = U f(W · h (dec) + W · h (enc)) (10) i 1 2 t−1 i sentence-level time-step, which is first fed into the K×K K×1 binary classifier to decide whether sentence de- W , W ∈ R , U ∈ R . v is then normal- 1 2 i word coding terminates and then into LSTM for ized: decode exp(v ) word decoding. a = (11) exp(v ) i i The attention vector is then created by averaging 4 Experiments weights over all input sentences: 4.1 Dataset m = a h (enc) (12) t i We implement the proposed autoencoder on two i∈[1,N ] datasets, a highly domain specific dataset consist- LSTM hidden vectors for current step is then ing of hotel reviews and a general dataset extracted s s achieved by combining c , e and h (dec): from Wkipedia. t t−1 i σ " t # " # " # h (dec) Hotel Reviews We use a subset of hotel reviews t−1 f σ = W · e (13) crawled from TripAdvisor. We consider only re- o σ m views consisting sentences ranging from 50 to 250 l tanh words; the model has problems dealing with ex- tremely long sentences, as we will discuss later. c = f · c + i · l (14) t t t−1 t t We keep a vocabulary set consisting of the 25,000 h = o · c (15) t t most frequent words. A special “ <unk>” token 4K×3K is used to denote all the remaining less frequent where W ∈ R . h is then used for word tokens. Reviews that consist of more than 2 per- predicting as in the vanilla version of the hierar- cent of unknown words are discarded. Our train- chical model. ing dataset is comprised of roughly 340,000 re- 3.5 Training and Testing views; the testing set is comprised of 40,000 re- views. Dataset details are shown in Table 1. Parameters are estimated by maximizing likeli- hood of outputs given inputs, similar to standard sequence-to-sequence models. A softmax func- Wikipedia We extracted paragraphs from tion is adopted for predicting each token within Wikipedia corpus that meet the aforementioned output documents, the error of which is first back- length requirements. We keep a top frequent word propagated through LSTM to sentences, vocabulary list of 120,000 words. Paragraphs decode sentence then through LSTM to document repre- with larger than 4 percent of unknown words are decode sentence sentation e , and last through LSTM and discarded. The training dataset is comprised of encode word LSTM to inputs. Stochastic gradient de- roughly 500,000 paragraphs and testing contains encode scent with minibatches is adopted. roughly 50,000. 1110 4.2 Training Details and Implementation where count denotes the number of n-grams match co-occurring in the input and output. We report Previous research has shown that deep LSTMs ROUGE-1, 2 and W (based on weighted longest work better than shallow ones for sequence-to- common subsequence). sequence tasks (Vinyals et al., 2014; Sutskever et al., 2014). We adopt a LSTM structure with four BLEU Purely measuring recall will inappropri- layer for encoding and four layer for decoding, ately reward long outputs. BLEU is designed to each of which is comprised of a different set of pa- address such an issue by emphasizing precision. rameters. 
Each LSTM layer consists of 1,000 hid- n-gram precision scores for our situation are given den neurons and the dimensionality of word em- by: beddings is set to 1,000. Other training details are given below, some of which follow Sutskever et al. count (gram ) match gram ∈output n precision = P (2014). n count(gram ) gram ∈output n • Input documents are reversed. (17) • LSTM parameters and word embeddings are BLEU then combines the average logarithm of initialized from a uniform distribution be- precision scores with exceeded length penaliza- tween [-0.08, 0.08]. tion. For details, see Papineni et al. (2002). • Stochastic gradient decent is implemented without momentum using a fixed learning Coherence Evaluation Neither BLEU nor rate of 0.1. We stated halving the learning ROUGE attempts to evaluate true coherence. rate every half epoch after 5 epochs. We There is no generally accepted and readily avail- trained our models for a total of 7 epochs. able coherence evaluation metric. Because of • Batch size is set to 32 (32 documents). the difficulty of developing a universal coherence • Decoding algorithm allows generating at evaluation metric, we proposed here only a most 1.5 times the number of words in inputs. tailored metric specific to our case. Based on the • 0.2 dropout rate. assumption that human-generated texts (i.e., input documents in our tasks) are coherent (Barzilay • Gradient clipping is adopted by scaling gra- dients when the norm exceeded a threshold and Lapata, 2008), we compare generated outputs with input documents in terms of how much of 5. original text order is preserved. Our implementation on a single GPU processes a speed of approximately 600-1,200 tokens per sec- We develop a grid evaluation metric similar to ond. We trained our models for a total of 7 itera- the entity transition algorithms in (Barzilay and tions. Lee, 2004; Lapata and Barzilay, 2005). The key idea of Barzilay and Lapata’s models is to first 4.3 Evaluations identify grammatical roles (i.e., object and sub- ject) that entities play and then model the transi- We need to measure the closeness of the output tion probability over entities and roles across sen- (candidate) to the input (reference). We first adopt tences. We represent each sentence as a feature- two standard evaluation metrics, ROUGE (Lin, vector consisting of verbs and nouns in the sen- 2004; Lin and Hovy, 2003) and BLEU (Papineni tence. Next we align sentences from output doc- et al., 2002). uments to input sentences based on sentence-to- ROUGE is a recall-oriented measure widely sentence F1 scores (precision and recall are com- used in the summarization literature. It measures puted similarly to ROUGE and BLEU but at sen- the n-gram recall between the candidate text and tence level) using feature vectors. Note that multi- the reference text(s). In this work, we only have ple output sentences can be matched to one input one reference document (the input document) and Wolf and Gibson (2005) and Lin et al. (2011) proposed ROUGE score is therefore given by: metrics based on discourse relations, but these are hard to ap- ply widely since identifying discourse relations is a difficult count (gram ) match problem. 
Indeed sophisticated coherence evaluation metrics gram ∈input n ROUGE = P are seldom adopted in real-world applications, and summa- count(gram ) gram ∈input n n rization researchers tend to use simple approximations like (16) number of overlapped tokens or topic distribution similarity (e.g., (Yan et al., 2011b; Yan et al., 2011a; Celikyilmaz and Tesla K40m, 1 Kepler GK110B, 2880 Cuda cores. Hakkani-Tur ¨ , 2011)). 1111 Input-Wiki washington was unanimously elected President by the electors in both the 1788 – 1789 and 1792 elections . he oversaw the creation of a strong, well-financed national government that maintained neutrality in the french revolutionary wars , suppressed the whiskey rebellion , and won acceptance among Americans of all types . washington established many forms in govern- ment still used today , such as the cabinet system and inaugural address . his retirement after two terms and the peaceful transition from his presidency to that of john adams established a tradition that continued up until franklin d . roosevelt was elected to a third term . washington has been widely hailed as the ” father of his country ” even during his lifetime. Output-Wiki washington was elected as president in 1792 and voters <unk> of these two elections until 1789 . he continued suppression <unk> whiskey rebellion of the french revolution war gov- ernment , strong , national well are involved in the establishment of the fin advanced operations , won acceptance . as in the government , such as the establishment of various forms of inau- guration speech washington , and are still in use . <unk> continued after the two terms of his quiet transition to retirement of <unk> <unk> of tradition to have been elected to the third paragraph . but , ” the united nations of the father ” and in washington in his life , has been widely praised . Input-Wiki apple inc . is an american multinational corporation headquartered in cupertino , california , that designs , develops , and sells consumer electronics , computer software , online services , and personal com - puters . its bestknown hardware products are the mac line of computers , the ipod media player , the iphone smartphone , and the ipad tablet computer . its online services include icloud , the itunes store , and the app store . apple’s consumer software includes the os x and ios operating systems , the itunes media browser , the safari web browser , and the ilife and iwork creativity and productivity suites . Output-Wiki apple is a us company in california , <unk> , to develop electronics , softwares , and pc , sells . hardware include the mac series of computers , ipod , iphone . its online services , including icloud , itunes store and in app store . softwares , including os x and ios operating system , itunes , web browser , < unk> , including a productivity suite . Input-Wiki paris is the capital and most populous city of france . situated on the seine river , in the north of the country , it is in the centre of the le-de-france region . the city of paris has a population of 2273305 inhabitants . this makes it the fifth largest city in the european union measured by the population within the city limits . Output-Wiki paris is the capital and most populated city in france . located in the <unk> , in the north of the country , it is the center of <unk> . paris , the city has a population of <num> inhabitants . 
Input-Review: on every visit to nyc , the hotel beacon is the place we love to stay . so conveniently located to central park , lincoln center and great local restaurants . the rooms are lovely . beds so comfortable , a great little kitchen and new wizz bang coffee maker . the staff are so accommodating and just love walking across the street to the fairway supermarket with every imaginable goodies to eat .

Output-Review: every time in new york , lighthouse hotel is our favorite place to stay . very convenient , central park , lincoln center , and great restaurants . the room is wonderful , very comfortable bed , a kitchenette and a large explosion of coffee maker . the staff is so inclusive , just across the street to walk to the supermarket channel love with all kinds of what to eat .

Assume that output sentence s^{output}_i is aligned with input sentence s^{input}_{i'}, where i and i' denote the position indices of an output sentence and its aligned input sentence. The penalization score L is then given by:

L = \frac{1}{N_{output}(N_{output} - 1)} \sum_{i \in [1, N_{output}-1]} \sum_{j \in [i+1, N_{output}]} \left| (j - i) - (j' - i') \right|    (18)

Equation 18 can be interpreted as follows: (j - i) denotes the distance, in terms of position index, between two output sentences indexed by j and i, and (j' - i') denotes the distance between their aligned counterparts in the input. Since we wish to penalize the degree to which the text order is permuted, we penalize the absolute difference between these two distances. The metric is also sensitive to overall precision and recall: an irrelevant output sentence will be aligned to an essentially random input sentence and is therefore heavily penalized. A deficiency of the proposed metric is that it considers coherence only from a semantic perspective, barely addressing syntactic issues.
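The following sketch illustrates how the coherence score L of Equation 18 can be computed once output sentences have been aligned to input sentences. It is a simplified stand-in for the procedure described above, not the authors' code: it approximates the verb/noun feature vectors with plain bags of tokens, aligns each output sentence to its highest-F1 input sentence, and treats the leading 1/(N_output(N_output - 1)) normalization as our reading of the (partly garbled) equation.

```python
from collections import Counter

def sentence_f1(sent_a, sent_b):
    """F1 over bag-of-token overlap between two tokenized sentences.

    In the paper the feature vectors contain only the nouns and verbs of each
    sentence; plain tokens are used here to keep the sketch self-contained.
    """
    a, b = Counter(sent_a), Counter(sent_b)
    overlap = sum((a & b).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(a.values())
    recall = overlap / sum(b.values())
    return 2 * precision * recall / (precision + recall)

def coherence_penalty(output_sents, input_sents):
    """Permutation penalty L (cf. Eq. 18): smaller values are better."""
    # Align each output sentence to the best-matching input sentence.
    align = [max(range(len(input_sents)),
                 key=lambda k: sentence_f1(out, input_sents[k]))
             for out in output_sents]
    n = len(output_sents)
    if n < 2:
        return 0.0
    total = sum(abs((j - i) - (align[j] - align[i]))
                for i in range(n - 1)
                for j in range(i + 1, n))
    return total / (n * (n - 1))  # normalization constant is an assumption

# Toy usage: the last two output sentences are swapped relative to the input.
inp = [s.split() for s in ["a b c", "d e f", "g h i"]]
out = [s.split() for s in ["a b c", "g h i", "d e f"]]
print(coherence_penalty(out, inp))  # nonzero penalty due to the swap
```

In the paper's setting the feature vectors would be restricted to nouns and verbs obtained from a POS tagger, and ties or zero-overlap alignments would need a more careful treatment than the greedy choice above.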
4.4 Results

A summary of our experimental results is given in Table 3. We observe better performance on the hotel-review dataset than on the open-domain Wikipedia dataset, for the intuitive reason that hotel-review documents and sentences follow a more fixed format and are therefore easier to predict. The hierarchical model, which takes sentence-level structure into account, outperforms the standard sequence-to-sequence model, and attention at the sentence level yields a further boost over the vanilla hierarchical model.

Table 3: Results for the three models on the two datasets. For the coherence score L, smaller values indicate better performance.

Model                   Dataset        BLEU   ROUGE-1  ROUGE-2  Coherence (L)
Standard                Hotel Review   0.241  0.571    0.302    1.92
Hierarchical            Hotel Review   0.267  0.590    0.330    1.71
Hierarchical+Attention  Hotel Review   0.285  0.624    0.355    1.57
Standard                Wikipedia      0.178  0.502    0.228    2.75
Hierarchical            Wikipedia      0.202  0.529    0.250    2.30
Hierarchical+Attention  Wikipedia      0.220  0.544    0.291    2.04

With respect to the coherence evaluation, the original sentence order is mostly preserved: the hierarchical model with attention achieves L = 1.57 on the hotel-review dataset, meaning that the relative positions of two input sentences are permuted by an average degree of 1.57. Even on the Wikipedia dataset, where more poor-quality sentences are observed, the original text order is still adequately maintained, with L = 2.04.

5 Discussion and Future Work

In this paper, we extended recent sequence-to-sequence LSTM models to the task of multi-sentence generation. We trained an autoencoder to see how well LSTM models can reconstruct input documents of many sentences. We find that the proposed hierarchical LSTM models can partially preserve the semantic and syntactic integrity of multi-sentence text units and generate meaningful and grammatical sentences in coherent order. Our model performs better than standard sequence-to-sequence models, which do not consider the intrinsic hierarchical discourse structure of texts.

While our work on auto-encoding larger texts is only a preliminary effort toward allowing neural models to deal with discourse, it nonetheless suggests that neural models are capable of encoding complex clues about how coherent texts are connected.

Performance on this autoencoder task could certainly also benefit from more sophisticated neural models. For example, one extension might align the sentence currently being generated with the corresponding original input sentence (similar to sequence-to-sequence translation in Bahdanau et al. (2014)), thereby transforming the task into sentence-to-sentence generation. However, our long-term goal is not to perfect this basic multi-text generation scenario of reconstructing input documents, but rather to extend it to more important applications.

That is, the autoencoder described in this work, where the input sequence X is identical to the output Y, is only the most basic instance of the family of document (paragraph)-to-document (paragraph) generation tasks. We hope the ideas proposed in this paper can play some role in enabling more sophisticated generation tasks such as summarization, where the inputs are original documents and the outputs are summaries, or question answering, where the inputs are questions and the outputs are the actual wording of answers. Such tasks could extend this paradigm and would themselves benefit from task-specific adaptations. In summarization, the sentences to be generated at each timestep might be pre-pointed or pre-aligned to specific aspects, topics, or pieces of the text to be summarized. Dialogue systems could incorporate information about the user or the time course of the dialogue. In any case, we look forward to more sophisticated applications of neural models to the important task of natural language generation.

6 Acknowledgement

The authors want to thank Gabor Angeli, Sam Bowman, Percy Liang and the other members of the Stanford NLP group for insightful comments and suggestions. We also thank the three anonymous ACL reviewers for helpful comments. This work is supported by the Enlight Foundation Graduate Fellowship and a gift from Bloomberg L.P., which we gratefully acknowledge.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Regina Barzilay and Mirella Lapata. 2008. Modeling local coherence: An entity-based approach. Computational Linguistics, 34(1):1–34.
Regina Barzilay and Lillian Lee. 2004. Catching the drift: Probabilistic content models, with applications to generation and summarization. arXiv preprint cs/0405039.

Asli Celikyilmaz and Dilek Hakkani-Tür. 2011. Discovery of topically coherent sentences for extractive summarization. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, pages 491–499. Association for Computational Linguistics.

Micha Elsner and Eugene Charniak. 2008. Coreference-inspired coherence modeling. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers, pages 41–44. Association for Computational Linguistics.

Vanessa Wei Feng and Graeme Hirst. 2012. Text-level discourse parsing with rich linguistic features. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, pages 60–68. Association for Computational Linguistics.

Hugo Hernault, Helmut Prendinger, Mitsuru Ishizuka, et al. 2010. HILDA: A discourse parser using support vector machine classification. Dialogue & Discourse, 1(3).

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Yangfeng Ji and Jacob Eisenstein. 2014. Representation learning for text-level discourse parsing. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, volume 1, pages 13–24.

Mirella Lapata and Regina Barzilay. 2005. Automatic evaluation of text coherence: Models and representations. In IJCAI, volume 5, pages 1085–1090.

Alex Lascarides and Nicholas Asher. 1991. Discourse relations and defeasible knowledge. In Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics, pages 55–62. Association for Computational Linguistics.

Huong LeThanh, Geetha Abeysinghe, and Christian Huyck. 2004. Generating discourse structures for written texts. In Proceedings of the 20th International Conference on Computational Linguistics, page 329. Association for Computational Linguistics.

Jiwei Li and Eduard Hovy. 2014. A model of coherence based on distributed sentence representation.

Jiwei Li, Rumeng Li, and Eduard Hovy. 2014. Recursive deep models for discourse parsing. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2061–2069.

Chin-Yew Lin and Eduard Hovy. 2003. Automatic evaluation of summaries using n-gram co-occurrence statistics. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, pages 71–78. Association for Computational Linguistics.

Ziheng Lin, Hwee Tou Ng, and Min-Yen Kan. 2011. Automatically evaluating text coherence using discourse relations. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, pages 997–1006. Association for Computational Linguistics.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, pages 74–81.

Thang Luong, Ilya Sutskever, Quoc V. Le, Oriol Vinyals, and Wojciech Zaremba. 2015. Addressing the rare word problem in neural machine translation. ACL.

William C. Mann and Sandra A. Thompson. 1988. Rhetorical structure theory: Toward a functional theory of text organization. Text, 8(3):243–281.

Daniel Marcu. 2000. The rhetorical parsing of unrestricted texts: A surface-based approach. Computational Linguistics, 26(3):395–448.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318. Association for Computational Linguistics.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.
Oriol Vinyals, Lukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey Hinton. 2014. Grammar as a foreign language. arXiv preprint arXiv:1412.7449.

Florian Wolf and Edward Gibson. 2005. Representing discourse coherence: A corpus-based study. Computational Linguistics, 31(2):249–287.

Kelvin Xu, Jimmy Ba, Ryan Kiros, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. arXiv preprint arXiv:1502.03044.

Rui Yan, Liang Kong, Congrui Huang, Xiaojun Wan, Xiaoming Li, and Yan Zhang. 2011a. Timeline generation through evolutionary trans-temporal summarization. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 433–443. Association for Computational Linguistics.

Rui Yan, Xiaojun Wan, Jahna Otterbacher, Liang Kong, Xiaoming Li, and Yan Zhang. 2011b. Evolutionary timeline summarization: A balanced optimization framework via iterative substitution. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 745–754. ACM.
