
Multilingual Hierarchical Attention Networks for Document Classification

Nikolaos Pappas
Idiap Research Institute
Rue Marconi 19
CH-1920 Martigny, Switzerland
[email protected]

Andrei Popescu-Belis
HEIG-VD / HES-SO
Route de Cheseaux 1
CH-1401 Yverdon, [email protected]

Abstract

Hierarchical attention networks have recently achieved remarkable performance for document classification in a given language. However, when multilingual document collections are considered, training such models separately for each language entails linear parameter growth and lack of cross-language transfer. Learning a single multilingual model with fewer parameters is therefore a challenging but potentially beneficial objective. To this end, we propose multilingual hierarchical attention networks for learning document structures, with shared encoders and/or shared attention mechanisms across languages, using multi-task learning and an aligned semantic space as input. We evaluate the proposed models on multilingual document classification with disjoint label sets, on a large dataset which we provide, with 600k news documents in 8 languages, and 5k labels. The multilingual models outperform monolingual ones in low-resource as well as full-resource settings, and use fewer parameters, thus confirming their computational efficiency and the utility of cross-language transfer.

1 Introduction

Learning word sequence representations has become increasingly useful for a variety of NLP tasks such as document classification (Tang et al., 2015; Yang et al., 2016), neural machine translation (NMT) (Cho et al., 2014; Luong et al., 2015), question answering (Chen et al., 2015; Kumar et al., 2015) and summarization (Rush et al., 2015). However, when data are available in multiple languages, representation learning must address two main challenges. Firstly, the computational cost of training separate models for each language grows linearly with their number, or even quadratically in the case of multi-way multilingual NMT (Firat et al., 2016a). Secondly, the models should be capable of cross-language transfer, which is an important component in human language learning (Ringbom, 2007). For instance, Johnson et al. (2016) attempted to use a single sequence-to-sequence neural network model for NMT across multiple language pairs.

Figure 1: Vectors of documents labeled with ‘Europe’, ‘Culture’ and their Arabic counterparts. The multilingual hierarchical attention network separates topics better than monolingual ones.

Previous studies in document classification attempted to address these issues by employing multilingual word embeddings, which allow direct comparisons and groupings across languages (Klementiev et al., 2012; Hermann and Blunsom, 2014; Ferreira et al., 2016). However, they are only applicable when common label sets are available across languages, which is often not the case (e.g. Wikipedia or news). Moreover, despite recent advances in monolingual document modeling (Tang et al., 2015; Yang et al., 2016), multilingual models are still based on shallow approaches.

In this paper, we propose Multilingual Hierarchical Attention Networks to learn shared document structures across languages for document classification with disjoint label sets, as opposed to language-specific training of hierarchical attention networks (HANs) (Yang et al., 2016). Our networks have a hierarchical structure with word and sentence encoders, along with attention mechanisms. Each of these can either be shared across languages or kept language-specific. To enable cross-language transfer, the networks are trained with multi-task learning across languages using an aligned semantic space as input. Fig. 1 displays document vectors, projected with t-SNE (van der Maaten, 2009), for two topics and two languages, either learned by monolingual HANs (a) or by our multilingual HAN (b). The multilingual HAN achieves better separation between ‘Europe’ and ‘Culture’ topics in English as a result of the knowledge transfer from Arabic.

arXiv:1707.00896v4 [cs.CL] 15 Sep 2017

We evaluate our model against strong monolingual baselines, in low-resource and full-resource scenarios, on a large multilingual document collection with 600k documents, labeled with general (1.2k) and specific topics (4.4k), in 8 languages from Deutsche Welle’s news website.¹ Our multilingual models outperform monolingual ones in both scenarios, thus confirming the utility of cross-language transfer and the computational efficiency of the proposed architecture. To encourage further research in multilingual representation learning, our code and dataset are made available at https://github.com/idiap/mhan.

2 Related Work

Research on learning multilingual word representations is based on early work on word embeddings (Turian et al., 2010; Mikolov et al., 2013; Pennington et al., 2014). The goal is to learn an aligned word embedding space for multiple languages by leveraging bilingual dictionaries (Faruqui and Dyer, 2014; Ammar et al., 2016), parallel sentences (Gouws et al., 2015) or comparable documents such as Wikipedia pages (Yih et al., 2011; Al-Rfou et al., 2013). Bilingual embeddings were learned from word alignments using neural language models (Klementiev et al., 2012; Zou et al., 2013), including auto-encoders (Chandar et al., 2014). Despite progress at the word level, the document level remains comparatively less explored. The approaches proposed by Hermann and Blunsom (2014) or Ferreira et al. (2016) are based on shallow modeling and are applicable only to classification tasks with label sets shared across languages, which are costly to produce and are often unavailable. Here, we remove this constraint, and develop deeper multilingual document models with hierarchical structure based on prior art at the word level.

¹ Germany’s news broadcaster: http://dw.com.

Early work on neural document classification was based on shallow feed-forward networks, which required unsupervised pre-training (Le and Mikolov, 2014). Later studies focused on neural networks with hierarchical structure. Kim (2014) proposed a convolutional neural network (CNN) for sentence classification. Johnson and Zhang (2015) proposed a CNN for high-dimensional data classification, while Zhang et al. (2015) adopted a character-level CNN for text classification. Lai et al. (2015) proposed a recurrent CNN to capture sequential information, which outperformed simpler CNNs. Lin et al. (2015) and Tang et al. (2015) proposed hierarchical recurrent NNs and showed that they were superior to CNN-based models. Recently, Yang et al. (2016) proposed a hierarchical attention network (HAN) with bi-directional gated encoders which outperforms traditional and neural baselines. Using such networks in multilingual settings has two drawbacks: the computational complexity increases linearly with the number of languages, and knowledge is acquired separately for each language. We address these issues by proposing a new multilingual model based on HANs, which learns shared document structures and transfers knowledge across languages.

Early examples of attention mechanisms appeared in computer vision, e.g. for optical character recognition (Larochelle and Hinton, 2010), image tracking (Denil et al., 2012), or image classification (Mnih et al., 2014). For text classification, studies which aimed to learn the importance of sentences included those by Yessenalina et al. (2010); Pappas and Popescu-Belis (2014); Yang et al. (2016) and more recently those by Pappas and Popescu-Belis (2017); Ji and Smith (2017). For NMT, Bahdanau et al. (2015) proposed an attention-based encoder-decoder network, while Luong et al. (2015) proposed a local and ensemble attention model. Firat et al. (2016a) proposed a single encoder-decoder model with shared attention across language pairs for multi-way, multilingual NMT. Hermann et al. (2015) developed attention-based document readers for question answering. Chen et al. (2015) proposed a recurrent attention model over an external memory. Similarly, Kumar et al. (2015) introduced a dynamic memory network for question answering and other tasks. We propose here to share attention across languages, at one or more levels of hierarchical document models, which, to our knowledge, has not been attempted before.

3 Background: Hierarchical Attention Networks for Document Classification

We adopt a general hierarchical attention architecture for document representation, displayed in Figure 2, which is derived from the one proposed by Yang et al. (2016). Our architecture is general in the sense that it defines only the hierarchical structure, but accommodates different types of individual components, i.e. encoders and attention models. We consider a dataset $D = \{(x_i, y_i),\ i = 1, \ldots, N\}$ made of $N$ documents $x_i$ with labels $y_i \in \{0, 1\}^k$. Each document is represented by the sequence of $d$-dimensional embeddings of its words grouped into sentences, $x_i = \{w_{11}, w_{12}, \ldots, w_{KT}\}$, $T$ being the maximum number of words in a sentence, and $K$ the maximum number of sentences in a document.

The network takes as input a document $x_i$ and outputs a document vector $u_i$. In particular, it has two levels of abstraction, word vs. sentence. The word level is made of an encoder $g_w$ with parameters $H_w$ and an attention model $a_w$ with parameters $A_w$, while the sentence level similarly includes an encoder and an attention model ($g_s$, $H_s$ and $a_s$, $A_s$). The output $u_i$ is used by the classification layer to determine $y_i$.

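Figure 2: General architecture of hierarchical attention neural networks for modeling documents. [Diagram: at the word level, encoders and attention ($H_w$, $\alpha_w$) map the words $w_{11}, \ldots, w_{KT}$ to sentence vectors $s_1, \ldots, s_K$; at the sentence level, an encoder and attention ($H_s$, $\alpha_s$) map the sentence vectors to the document vector $u$.]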

3.1 Encoder Layers

At the word level, the function $g_w$ encodes the sequence of input words $\{w_{it} \mid t = 1, \ldots, T\}$ for each sentence $i$ of the document, noted as:

$$h_w^{(it)} = \{ g_w(w_{it}) \mid t = 1, \ldots, T \} \qquad (1)$$

At the sentence level, after combining the intermediate word vectors $\{h_w^{(it)} \mid t = 1, \ldots, T\}$ to a sentence vector $s_i$ (as explained in Section 3.2), the function $g_s$ encodes the sequence of sentence vectors $\{s_i \mid i = 1, \ldots, K\}$, noted as $h_s^{(i)}$.

The $g_w$ and $g_s$ functions can be any feed-forward or recurrent networks with parameters $H_w$ and $H_s$ respectively. We consider the following networks: a fully-connected one, noted as DENSE, a Gated Recurrent Unit network (Cho et al., 2014) noted as GRU,² and a bi-directional GRU which captures temporal information forward or backward in time, noted as biGRU. The latter is defined as a concatenation of the hidden states for each input vector obtained from the forward GRU, $\overrightarrow{g_w}$, and the backward GRU, $\overleftarrow{g_w}$:

$$h_w^{(it)} = \left[ \overrightarrow{g_w}(h_w^{(it)});\ \overleftarrow{g_w}(h_w^{(it)}) \right] \qquad (2)$$

The same concatenation is applied for the hidden-state representation of a sentence $h_s^{(i)}$.

3.2 Attention Layers

A typical way to obtain a representation for a given word sequence at each level is by taking the last hidden-state vector that is output by the encoder. However, it is hard to encode all the relevant input information needed in a fixed-length vector. This problem is addressed by introducing an attention mechanism at each level (noted $\alpha_w$ and $\alpha_s$) that estimates the importance of each hidden state vector to the representation of the sentence or document meaning respectively. The sentence vector $s_i \in \mathbb{R}^{d_w}$, where $d_w$ is the dimension of the word encoder, is thus obtained as follows:

$$s_i = \frac{1}{T} \sum_{t=1}^{T} \alpha_w^{(it)} h_w^{(it)} = \frac{1}{T} \sum_{t=1}^{T} \frac{\exp(v_{it}^\top u_w)}{\sum_j \exp(v_{ij}^\top u_w)}\, h_w^{(it)} \qquad (3)$$

where $v_{it} = f_w(h_w^{(it)})$ is a fully-connected neural network with $W_w$ parameters. Similarly, the document vector $u \in \mathbb{R}^{d_s}$, where $d_s$ is the dimension of the sentence encoder, is obtained as follows:

$$u = \frac{1}{K} \sum_{i=1}^{K} \alpha_s^{(i)} h_s^{(i)} = \frac{1}{K} \sum_{i=1}^{K} \frac{\exp(v_i^\top u_s)}{\sum_j \exp(v_j^\top u_s)}\, h_s^{(i)} \qquad (4)$$

where $v_i = f_s(h_s^{(i)})$ is a fully-connected neural network with $W_s$ parameters. The vectors $u_w$ and $u_s$ are parameters which encode the word context and sentence context respectively, and are learned jointly with the rest of the parameters. The total set of parameters for $a_w$ is $A_w = \{W_w, u_w\}$ and for $a_s$ is $A_s = \{W_s, u_s\}$.

² GRU is a simplified version of Long-Short Term Memory, LSTM (Hochreiter and Schmidhuber, 1997).
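The attention pooling used at both levels can be illustrated with a minimal pure-Python sketch. It assumes that the fully-connected network ($f_w$ or $f_s$) is passed in as a plain function, here called `project`, and that hidden states are lists of floats; the function name `attention_pool` and the toy inputs are hypothetical and are not taken from the released code.

```python
import math

def attention_pool(hidden_states, project, context):
    """Pool hidden states as in Eq. 3 and 4: softmax over v . u, then a
    weighted sum of the hidden states scaled by 1/T (or 1/K)."""
    # Score each hidden state h by the dot product of v = project(h) and u.
    scores = [sum(v_k * u_k for v_k, u_k in zip(project(h), context))
              for h in hidden_states]
    # Softmax over positions (the maximum is subtracted for stability).
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the hidden states, divided by the sequence length.
    length = len(hidden_states)
    dim = len(hidden_states[0])
    pooled = [sum(w * h[d] for w, h in zip(weights, hidden_states)) / length
              for d in range(dim)]
    return pooled, weights

# Toy usage: three 2-dimensional hidden states, identity projection
# (an assumption made only to keep the example small).
hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context = [1.0, 0.0]
pooled, weights = attention_pool(hidden, lambda h: h, context)
```

In this toy case the first and third hidden states align with the context vector and therefore receive equal weights that are larger than the weight of the second one.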

3.3 Classification Layers

The output of such a network is typically fed to a softmax layer for classification, with a loss based on the cross-entropy between gold and predicted labels (Tang et al., 2015) or on the negative log-likelihood of the correct labels (Yang et al., 2016). However, softmax overemphasizes the probability of the most likely label, which may not be ideal for multi-label classification because each document may have several likely labels that are independent of each other, as we verified empirically in our preliminary experiments. Hence, we replace the softmax with a sigmoid function, so that for each document $i$ represented by the vector $u_i$ we model the probability of the $k$ labels as follows:

$$\hat{y}_i = p(y \mid u_i) = \frac{1}{1 + e^{-(W_c u_i + b_c)}} \in [0, 1]^k \qquad (5)$$

where $W_c$ is a $d_s \times k$ matrix and $b_c$ is the bias term for the classification layer. The training loss based on cross-entropy is computed as follows:

$$L(\theta) = -\frac{1}{N} \sum_{i=1}^{N} H(y_i, \hat{y}_i) \qquad (6)$$

where $\theta$ is a notation for all the parameters of the model (i.e. $H_w$, $A_w$, $H_s$, $A_s$, $W_c$), and $H$ is the binary cross-entropy of the gold labels $y_i$ and predicted labels $\hat{y}_i$ for a document $i$. The above objective is differentiable and can be minimized with stochastic gradient descent (SGD) (Bottou, 1998) or variants such as Adam (Kingma and Ba, 2014), to maximize classification performance.
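The difference between softmax and sigmoid outputs for multi-label prediction can be seen in a small sketch. The logits below stand for the pre-activation values $W_c u_i + b_c$ and are invented for illustration; the threshold of 0.4 is the one reported for the full-resource scenario in Section 6.1, and the helper `predict_labels` is a hypothetical name.

```python
import math

def sigmoid(z):
    """Element-wise logistic function used in Eq. 5."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(logits):
    """Softmax over all labels; the probabilities sum to one."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_labels(probabilities, threshold):
    """Return the indices of labels whose probability reaches the threshold."""
    return [j for j, p in enumerate(probabilities) if p >= threshold]

# Toy logits: three equally likely labels and one unlikely label.
logits = [2.0, 2.0, 2.0, -1.0]
threshold = 0.4

softmax_labels = predict_labels(softmax(logits), threshold)
sigmoid_labels = predict_labels([sigmoid(z) for z in logits], threshold)
```

With softmax, the three likely labels share the probability mass and none reaches the threshold, whereas the sigmoid scores each label independently and keeps all three.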

4 Multilingual Hierarchical Attention Networks: MHANs

When multilingual data is available, the above network can be trained on each language separately, but in this case the number of parameters grows linearly with the number of languages. Moreover, this neither exploits common knowledge across languages nor transfers it from one language to another. We propose here a HAN with shared components across languages, which has slower (sublinear) parameter growth than monolingual ones and enables knowledge transfer across languages. We now consider $M$ languages noted $L = \{L_l \mid l = 1, \ldots, M\}$, and a multilingual set of topic-labeled documents $D_l = \{(x_i^{(l)}, y_i^{(l)}) \mid i = 1, \ldots, N_l,\ l = 1, \ldots, M\}$ defined as above (Section 3).

Figure 3: Multilingual hierarchical attention networks for modeling documents and classifying them over disjoint label sets. [Diagram with three panels: (a) Sharing Encoders, (b) Sharing Attentions, (c) Sharing Both. In each panel, words $w_{11}, \ldots, w_{KT}$ are mapped to sentence vectors $s_1, \ldots, s_K$ and then to document vectors $u_1, \ldots, u_M$, which feed language-specific classifiers $W_{C1}, \ldots, W_{CM}$.]

4.1 Sharing Components across Languages

To enable multilingual learning, we propose three distinct ways for sharing components between networks in a multi-task learning setting, depicted in Figure 3, namely: (a) sharing the parameters of word and sentence encoders, noted $\theta_{enc} = \{H_w, W_w^{(l)}, H_s, W_s^{(l)}, W_c^{(l)}\}$; (b) sharing the parameters of word and sentence attention models, noted $\theta_{att} = \{H_w^{(l)}, W_w, H_s^{(l)}, W_s, W_c^{(l)}\}$; and (c) sharing both previous sets of parameters, noted $\theta_{both} = \{H_w, W_w, H_s, W_s, W_c^{(l)}\}$. For instance, the document representation of a text for language $l$ based on a shared sentence-level attention would be computed based on Eq. 4 by using the same parameters $W_s$ and $u_s$ across languages.

Let $\theta_{mono} = \{H_w^{(l)}, W_w^{(l)}, H_s^{(l)}, W_s^{(l)}, W_c^{(l)}\}$ be the parameters of multiple independent monolingual models with DENSE encoders, then we have:

$$|\theta_{mono}| > |\theta_{enc}| > |\theta_{att}| > |\theta_{both}| \qquad (7)$$

where $|\cdot|$ is the number of parameters in a set. For GRU and biGRU encoders, the inequalities still hold, but swapping $|\theta_{enc}|$ and $|\theta_{att}|$. Excluding the classification layer, which is necessarily language-specific, the (a) and (b) networks have sublinear numbers of parameters and the (c) network has a constant number of parameters with respect to the number of languages. The word embeddings are not considered as parameters in our setup because they are fixed during training. For learned word embeddings, the argument still holds if we consider their parameters as part of the word-level encoder.

| Language | Documents | Avg. sentences (s̄) | Avg. words (w̄) | General labels (Y_g) | Specific labels (Y_s) |
|---|---|---|---|---|---|
| English | 112,816 | 17.9 | 516.2 | 327 | 1,058 |
| German | 132,709 | 22.3 | 424.1 | 367 | 809 |
| Spanish | 75,827 | 13.8 | 412.9 | 159 | 684 |
| Portuguese | 39,474 | 20.2 | 571.9 | 95 | 301 |
| Ukrainian | 35,423 | 17.6 | 342.9 | 28 | 260 |
| Russian | 108,076 | 16.4 | 330.1 | 102 | 814 |
| Arabic | 57,697 | 13.3 | 357.7 | 91 | 344 |
| Persian | 36,282 | 18.7 | 538.4 | 71 | 127 |
| All | 598,304 | 17.52 | 436.7 | 1,240 | 4,397 |

Table 1: Statistics of the Deutsche Welle corpus: s̄ and w̄ are the average numbers of sentences and words per document.
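The ordering in Eq. 7, and its reversal for recurrent encoders, can be checked with a rough parameter count. The sketch below rests on several assumptions that are not stated in the text: each DENSE layer has a weight matrix and a bias vector, each attention model is one dense layer plus a context vector, a GRU has three gates with one bias vector each, and the dimensions are the 40-dimensional embeddings and 100-dimensional encoders and attention layers of Section 6.1. The label counts are the general-label counts for English and Arabic from Table 1, and all function names are hypothetical.

```python
def dense_params(d_in, d_out):
    """Weights plus biases of one fully-connected layer (assumed layout)."""
    return d_in * d_out + d_out

def gru_params(d_in, d_hidden):
    """Three gates, each with input weights, recurrent weights and one bias
    vector (assumed layout; implementations differ in the number of biases)."""
    return 3 * (d_in * d_hidden + d_hidden * d_hidden + d_hidden)

def attention_params(d_hidden, d_att):
    """One dense layer f plus one context vector u, as in Eq. 3 and 4."""
    return dense_params(d_hidden, d_att) + d_att

def count_parameters(encoder_params, label_counts, d_emb=40, d_enc=100, d_att=100):
    """Total parameters of each sharing scheme for len(label_counts) languages."""
    m = len(label_counts)
    # Word-level and sentence-level encoders for one language.
    enc = encoder_params(d_emb, d_enc) + encoder_params(d_enc, d_enc)
    # Word-level and sentence-level attention for one language.
    att = 2 * attention_params(d_enc, d_att)
    # Classification layers are always language-specific.
    cls = sum(dense_params(d_enc, k) for k in label_counts)
    return {
        "mono": m * (enc + att) + cls,
        "enc": enc + m * att + cls,   # shared encoders
        "att": m * enc + att + cls,   # shared attention
        "both": enc + att + cls,      # shared encoders and attention
    }

label_counts = [327, 91]  # general labels for English and Arabic (Table 1)
dense_counts = count_parameters(dense_params, label_counts)
gru_counts = count_parameters(gru_params, label_counts)
```

Under these assumptions, a DENSE encoder is smaller than an attention model, so sharing attention saves more parameters than sharing encoders; a GRU encoder is larger than an attention model, which reverses the two middle terms.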

Depending on the label sets, several types of document classification problems can be solved with such architectures. First, label sets can be common or disjoint across languages. Second, considering labels as $k$-hot vectors, $k = 1$ corresponds to a multi-class task, while $k > 1$ is a multi-label task. We focus here on the multi-label problem with disjoint label sets. Moreover, we assume an aligned input space, i.e. with multilingual word embeddings that have aligned meanings across languages (Ammar et al., 2016). With non-aligned word embeddings, the multilingual transfer is harder due to the lack of parallel information, as we show in Section 6.2, Table 4.

4.2 Training over Disjoint Label Sets

For training, we replace the monolingual training objective (Eq. 6) with a joint multilingual objective that facilitates the sharing of components, i.e. a subset of parameters for each language $\theta_1, \ldots, \theta_M$, across different language networks:

$$L(\theta_1, \ldots, \theta_M) = -\frac{1}{Z} \sum_{i}^{N_e} \sum_{l}^{M} H(y_i^{(l)}, \hat{y}_i^{(l)}) \qquad (8)$$

where $Z = M \times N_e$ and $N_e$ is the epoch size.³

³ In the future, it may also be beneficial to add a $\gamma_l$ term for each language objective, which encodes prior knowledge about its importance.

The joint objective $L$ can be minimized with respect to the parameters $\theta_1, \ldots, \theta_M$ using SGD as before. However, when training on examples from different languages consecutively, it is difficult to learn a shared space that works well across languages. This is because updates for each language apply only to a subset of parameters and may bias the model away from other languages. To address this issue, we employ the training strategy proposed by Firat et al. (2016a), who sampled parallel sentences for multi-way machine translation from different language pairs in a cyclic fashion at each iteration.⁴ Here, we sample a document-label pair from each language at each iteration. For mini-batch SGD, the number of samples per language is equal to the batch size divided by $M$.
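The mixed-language batching described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the released training loop: the generator name `mixed_batches`, the toy corpora and the choice of sampling with replacement are all hypothetical.

```python
import random

def mixed_batches(datasets, batch_size, num_batches, seed=0):
    """Yield mini-batches holding batch_size / M examples from each of the
    M languages, so that every update sees all languages."""
    rng = random.Random(seed)
    languages = sorted(datasets)
    per_language = batch_size // len(languages)
    for _ in range(num_batches):
        batch = []
        for language in languages:
            # Sampling with replacement is an assumption of this sketch.
            for example in rng.choices(datasets[language], k=per_language):
                batch.append((language, example))
        yield batch

# Toy corpora: (document, label indices) pairs with disjoint label sets.
toy = {
    "en": [("doc_en_%d" % i, [i % 3]) for i in range(50)],
    "ar": [("doc_ar_%d" % i, [i % 2]) for i in range(20)],
}
batches = list(mixed_batches(toy, batch_size=16, num_batches=3))
```

With two languages and a batch size of 16, each batch contains 8 English and 8 Arabic examples, regardless of the relative sizes of the two corpora.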

5 A New Corpus for Multilingual Document Classification: DW

Multilingual document classification datasets are usually limited in size, have target categories aligned across languages, and assign documents to only one category. However, classification is often necessary in cases where the categories are not strictly aligned, and multiple categories may apply to each document. For instance, this is the case for online multilingual news agencies, which must keep track of news topics across languages.

Two datasets for multilingual document classification have been used in previous studies: Reuters RCV1/RCV2 (6,000 documents, 2 languages and 4 labels), introduced by Klementiev et al. (2012), and TED talk transcripts (12,078 documents, 12 languages and 15 labels), introduced by Hermann and Blunsom (2014). The former is tailored for evaluating word embeddings aligned across languages, rather than complex multilingual document models. The latter is twice as large and covers more languages, in a multi-label setting, but biases evaluation by including translations of talks in all languages.

Here, we present and use a much larger dataset collected from Deutsche Welle, Germany’s public international broadcaster, shown in Table 1. The DW dataset contains nearly 600,000 documents, in 8 languages, annotated by journalists with several topic labels. Documents are on average 2.6 times longer than in Yang et al.’s (2016) monolingual dataset (436 vs. 163 words). There are two types of labels, namely general topics ($Y_g$) and specific ones ($Y_s$), both described by one or more words. We consider (and count in Table 1) only those specific labels that appear at least 100 times, to avoid sparsity issues.

⁴ We verified this empirically in our preliminary experiments and found that mixing languages in a single batch performed better than keeping them in separate batches.

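In the tables below, the column "en" gives the monolingual English score, the columns "de→en" to "fa→en" give the scores on English when English and the auxiliary language are used for training, and the columns "en→de" to "en→fa" give the scores on the auxiliary language (for monolingual models, the score of the model trained on that language alone).

Y_general:

| Type | Model | en | de→en | es→en | pt→en | uk→en | ru→en | ar→en | fa→en | en→de | en→es | en→pt | en→uk | en→ru | en→ar | en→fa |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Mono | NN (Avg) | 50.7 | – | – | – | – | – | – | – | 53.1 | 70.0 | 57.2 | 80.9 | 59.3 | 64.4 | 66.6 |
| Mono | HNN (Avg) | 70.0 | – | – | – | – | – | – | – | 67.9 | 82.5 | 70.5 | 86.8 | 77.4 | 79.0 | 76.6 |
| Mono | HAN (Att) | 71.2 | – | – | – | – | – | – | – | 71.8 | 82.8 | 71.3 | 85.3 | 79.8 | 80.5 | 76.6 |
| Multi | MHAN-Enc | – | 71.0 | 69.9 | 69.2 | 70.8 | 71.5 | 70.0 | 71.3 | 69.7 | 82.9 | 69.7 | 86.8 | 80.3 | 79.0 | 76.0 |
| Multi | MHAN-Att | – | 74.0 | 74.2 | 74.1 | 72.9 | 73.9 | 73.8 | 73.3 | 72.5 | 82.5 | 70.8 | 87.7 | 80.5 | 82.1 | 76.3 |
| Multi | MHAN-Both | – | 72.8 | 71.2 | 70.5 | 65.6 | 71.1 | 68.9 | 69.2 | 70.4 | 82.8 | 71.6 | 87.5 | 80.8 | 79.1 | 77.1 |

Y_specific:

| Type | Model | en | de→en | es→en | pt→en | uk→en | ru→en | ar→en | fa→en | en→de | en→es | en→pt | en→uk | en→ru | en→ar | en→fa |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Mono | NN (Avg) | 24.4 | – | – | – | – | – | – | – | 21.8 | 22.1 | 24.3 | 33.0 | 26.0 | 24.1 | 32.1 |
| Mono | HNN (Avg) | 39.3 | – | – | – | – | – | – | – | 39.6 | 37.9 | 33.6 | 42.2 | 39.3 | 34.6 | 43.1 |
| Mono | HAN (Att) | 43.4 | – | – | – | – | – | – | – | 44.8 | 46.3 | 41.9 | 46.4 | 45.8 | 41.2 | 49.4 |
| Multi | MHAN-Enc | – | 45.4 | 45.9 | 44.3 | 41.1 | 42.1 | 44.9 | 41.0 | 43.9 | 46.2 | 39.3 | 47.4 | 45.0 | 37.9 | 48.6 |
| Multi | MHAN-Att | – | 46.3 | 46.0 | 45.9 | 45.6 | 46.4 | 46.4 | 46.1 | 46.5 | 46.7 | 43.3 | 47.9 | 45.8 | 41.3 | 48.0 |
| Multi | MHAN-Both | – | 45.7 | 45.6 | 41.5 | 41.2 | 45.6 | 44.6 | 43.0 | 45.9 | 46.4 | 40.3 | 46.3 | 46.1 | 40.7 | 50.3 |

Table 2: Full-resource classification performance (F1) on general (top) and specific (bottom) topic categories using bilingual training with English as target (left) and the auxiliary language as target (right).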

The number of labels varies greatly across the 8 languages. Moreover, we found for instance that only 25-30% of the labels could be manually aligned between English and German. The commonalities are mainly concentrated on the most frequent labels, reflecting a shared top-level division of the news domain, but the long tail exhibits significant independence across languages.

6 Evaluation

6.1 Settings

We evaluate our multilingual models on full-resource and low-resource scenarios of multilingual document classification on the Deutsche Welle corpus. Following the typical evaluation protocol in the field, the corpus is split per language into 80% for training, 10% for validation and 10% for testing. We evaluate both types of labels ($Y_g$, $Y_s$) on a full-resource scenario and only the general topic labels ($Y_g$) on a low-resource scenario. We report the micro-averaged F1 scores for each test set, as in previous work (e.g., Hermann and Blunsom, 2014).
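For reference, micro-averaged F1 pools true positives, false positives and false negatives over all documents and labels before computing precision and recall. The following sketch uses the standard definition; the function name `micro_f1` and the toy label sets are hypothetical and are not part of the evaluation code of this work.

```python
def micro_f1(gold, predicted):
    """Micro-averaged F1 over documents, where each document has a set of
    gold labels and a set of predicted labels."""
    tp = fp = fn = 0
    for gold_labels, predicted_labels in zip(gold, predicted):
        gold_labels, predicted_labels = set(gold_labels), set(predicted_labels)
        tp += len(gold_labels & predicted_labels)   # correctly predicted
        fp += len(predicted_labels - gold_labels)   # predicted but wrong
        fn += len(gold_labels - predicted_labels)   # missed gold labels
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy example with two documents.
gold = [{"europe", "culture"}, {"sports"}]
predicted = [{"europe"}, {"sports", "culture"}]
score = micro_f1(gold, predicted)
```

Here two labels are correct, one is spurious and one is missed, so precision and recall are both 2/3 and the micro-averaged F1 is 2/3.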

Model configuration. For all models, we use the aligned 40-dimensional multilingual embeddings pre-trained on the Leipzig corpus using multi-CCA from Ammar et al. (2016). The non-aligned embeddings used for comparison purposes are trained with the same method and data. We zero-pad documents up to a maximum of 30 words per sentence and 30 sentences per document. The hyper-parameters were selected on the validation sets. We used the following settings: 100-dimensional encoder and attention embeddings (at every level), ReLU activation function for all intermediate layers, batch size of 16, epoch size of 25k, and optimization with Adam until convergence.

All the hierarchical models have DENSE encoders in both scenarios (Tables 2, 4, and 5), and GRU and biGRU in the full-resource scenario for English+Arabic (Table 3). For the low-resource scenario, we define three levels of data availability: tiny from 0.1% to 0.5%, small from 1% to 5% and medium from 10% to 50% of the original training set. We report the average F1 scores on the test set for each level based on discrete increments of 0.1, 1 and 10 percentage points respectively. The decision threshold for the value of $p$ in Eq. 5 for the full-resource scenario is set to 0.4 for labels such that $|Y_s| < 400$ and 0.2 for $|Y_s| \geq 400$, and for the low-resource scenario it is 0.3 for all sets. For the ensemble in the low-resource setting, we train the three proposed multilingual models and choose the optimal one based on the validation data for each language respectively (see Fig. 4).

Baselines. We compare against the following monolingual neural networks, with shallow or hierarchical structures. These networks are based on the state of the art in the field, reviewed in Section 2, and thus represent strong baselines.

• NN: A neural network which feeds the average vector of the input words directly to a classification layer, as the common baseline for multilingual document classification (Klementiev et al., 2012).

• HNN: A hierarchical network with encoders and average pooling at every level, followed by a classification layer.

• HAN: A hierarchical network with encoders and attention, followed by a classification layer. This model is the one proposed by Yang et al. (2016) adapted to our task.

Our multilingual models with the three sharing configurations from Section 4.1 are noted as Enc, Att and Both. Their implementation amounts to, first, creating a HAN model for each language, second, sharing components across multiple languages as illustrated in Fig. 3, and, third, training them with the objective of Eq. 8.

6.2 Results

Full-resource scenario. Table 2 displays the results of full-resource document classification using DENSE encoders for general and specific labels. On the left side, the performance on the English sub-corpus is shown when English and an auxiliary sub-corpus are used for training, and on the right side, the performance on the auxiliary sub-corpus is shown when that sub-corpus and the English sub-corpus are used for training.

The multilingual model trained on pairs of languages outperforms on average all the examined monolingual models, namely a bag-of-words neural model and two hierarchical neural models which use average pooling and attention respectively. The best-performing bilingual model on average is the one with shared attention across languages, especially when tested on English. The consistent gain for English as target could be attributed to the alignment of the word embeddings to English and to the many English labels, which makes it easier to find multilingual labels from which to transfer knowledge. Interestingly, this reveals that the transfer of knowledge across languages in a full-resource setting is maximized with language-specific word and sentence encoders, but language-independent (i.e. shared) attention for both words and sentences.

However, when transferring from English to Portuguese (en→pt), Russian (en→ru) and Persian (en→fa) on general categories, it is more effective to have only language-independent components. We hypothesize that this is due to the underlying commonness between the label sets rather than to a relationship between languages, which is hard to identify on linguistic grounds.

We will now quantify the impact of three important model choices on the performance: encoder type, word embeddings, and number of languages used for training. In Table 3, we observe that when we replace the DENSE encoder layers with GRU or biGRU layers, the improvement from the multilingual training is still present. In particular, the multilingual models with shared attention are superior to alternatives, regardless of the employed encoders. For reference, using simply logistic regression with bag-of-words (counts) for classification leads to F1 scores of 75.8% in English and 81.9% in Arabic, using many more parameters than biGRU: 56.5M vs. 410k in English and 5.8M vs. 364k in Arabic.

| Direction | Encoders | Mono HAN | Multi Enc | Multi Att | Multi Both |
|---|---|---|---|---|---|
| ar→en | DENSE | 71.2 | 70.0 | 73.8 | 68.9 |
| ar→en | GRU | 77.0 | 74.8 | 77.5 | 75.4 |
| ar→en | biGRU | 77.7 | 77.1 | 77.5 | 76.7 |
| en→ar | DENSE | 80.5 | 79.0 | 82.1 | 79.1 |
| en→ar | GRU | 81.5 | 81.2 | 83.4 | 83.1 |
| en→ar | biGRU | 82.2 | 82.7 | 84.0 | 83.0 |

Table 3: Full-resource classification performance (F1) on general labels (Y_general) for English-Arabic with various encoders.

| Word embeddings | Number of languages | n_l (Y_general) | f_l (Y_general) | n_l (Y_specific) | f_l (Y_specific) |
|---|---|---|---|---|---|
| Aligned | 1 | 50K – | 77.41 – | 90K – | 44.90 – |
| Aligned | 2 | 40K ↓ | 78.30 ↑ | 80K ↓ | 45.72 ↑ |
| Aligned | 8 | 32K ↓ | 77.91 ↑ | 72K ↓ | 45.82 ↑ |
| Non-aligned | 8 | 32K ↓ | 71.23 ↓ | 72K ↓ | 33.41 ↓ |

Table 4: Average number of parameters per language (n_l), average F1 per language (f_l), and their variation (arrows) with the number of languages |L| and the word embeddings used for training.

In Table 4, when we train our multilingual model (MHAN-Att) on eight languages at the same time, the F1 score improves on average across languages, for both general and specific labels, while the number of parameters per language decreases, by 36% for Y_general and 20% for Y_specific. Lastly, when we train the same model with word embeddings that are not aligned across languages, the performance of the multilingual model drops significantly. An input space that is aligned across languages is thus crucial.
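The two reductions quoted above follow directly from the parameter counts in Table 4 (50K to 32K for general labels, 90K to 72K for specific labels), as the short check below shows. The helper name `relative_decrease` is hypothetical.

```python
def relative_decrease(before, after):
    """Relative decrease from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before

# Average parameters per language with 1 vs. 8 languages (Table 4).
general = relative_decrease(50_000, 32_000)
specific = relative_decrease(90_000, 72_000)
```

The values of `general` and `specific` correspond to the 36% and 20% reductions reported for Y_general and Y_specific.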

Low-resource scenario. We assess the ability of the multilingual attention networks to transfer knowledge across languages in a low-resource scenario, i.e. training on a fraction of the available data, as defined in Section 6.1 above. The results for seven languages when trained jointly with English are displayed in detail in Table 5 and summarized in Figure 4. In all cases, at least one of the multilingual models outperforms the monolingual one, which demonstrates the usefulness of multilingual training for low-resource document classification.

Figure 4: Low-resource document classification performance (F1) of our multilingual attention network ensemble (blue lines) vs. a monolingual attention network (purple dashed lines) on the DW corpus.

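| Pair | Size | Mono HAN | Multi Enc | Multi Att | Multi Both | ∆% |
|---|---|---|---|---|---|---|
| en→de | 0.1-0.5% | 29.9 | 41.0 | 37.0 | 39.4 | +37.2 |
| en→de | 1-5% | 51.3 | 51.7 | 49.7 | 52.6 | +2.6 |
| en→de | 10-50% | 63.5 | 63.0 | 63.8 | 63.8 | +0.5 |
| en→es | 0.1-0.5% | 39.5 | 38.7 | 33.3 | 41.5 | +4.9 |
| en→es | 1-5% | 45.6 | 45.5 | 50.8 | 50.1 | +11.6 |
| en→es | 10-50% | 74.2 | 75.7 | 74.2 | 75.2 | +2.0 |
| en→pt | 0.1-0.5% | 30.9 | 25.3 | 31.6 | 33.8 | +9.6 |
| en→pt | 1-5% | 44.6 | 44.3 | 37.5 | 47.3 | +6.0 |
| en→pt | 10-50% | 60.9 | 61.9 | 62.1 | 62.1 | +1.9 |
| en→uk | 0.1-0.5% | 60.4 | 62.4 | 59.8 | 60.9 | +3.1 |
| en→uk | 1-5% | 68.2 | 67.7 | 70.6 | 69.0 | +3.4 |
| en→uk | 10-50% | 76.4 | 76.2 | 76.3 | 76.7 | +0.3 |
| en→ru | 0.1-0.5% | 27.6 | 26.6 | 27.0 | 29.1 | +5.4 |
| en→ru | 1-5% | 39.3 | 38.2 | 39.6 | 40.2 | +2.2 |
| en→ru | 10-50% | 69.2 | 70.5 | 70.4 | 69.4 | +1.9 |
| en→ar | 0.1-0.5% | 35.4 | 35.5 | 39.5 | 36.6 | +11.7 |
| en→ar | 1-5% | 45.6 | 48.7 | 47.2 | 46.6 | +6.9 |
| en→ar | 10-50% | 48.9 | 52.2 | 46.8 | 47.8 | +6.8 |
| en→fa | 0.1-0.5% | 36.0 | 35.6 | 33.6 | 41.3 | +14.6 |
| en→fa | 1-5% | 55.0 | 55.6 | 51.9 | 55.5 | +1.0 |
| en→fa | 10-50% | 69.2 | 70.3 | 70.1 | 70.0 | +1.5 |

Table 5: Low-resource classification performance (F1) on general labels (Y_general) with various sizes of training data.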

Moreover, the improvements obtained from ourmultilingual models for lower levels of availabil-ity (tiny and small) are larger than in higher levels(medium). This is also clearly observed in Fig-ure 4 with our multilingual attention network en-semble, i.e. when we do model selection amongthe three multilingual variants on the developmentset. The best performing architecture in a major-ity of cases is the one which shares both the en-coders and the attention mechanisms across lan-guages. Moreover, this architecture also has thefewest number of parameters.

This promising finding for the low-resource sce-nario means that the classification performancecan greatly benefit from the multilingual training(sharing encoders and attention) without increas-ing the number of parameters beyond that of a sin-gle monolingual document model. Nevertheless,in a few cases, we observe that the other archi-tectures with increased complexity perform betterthan the “shared both” model. For instance, shar-ing encoders is superior to alternatives for Arabiclanguage, i.e. the knowledge transfer benefits from

shared word and sentence representations. Hence,to generalize to a large number of languages, wemay need to consider more dynamic models whichare able to choose for each language individuallywhich sharing scheme is the most appropriate fortransferring from another language. Lastly, we didnot generally observe a negative (or positive) cor-relation of the similarity between languages withthe performance in the low-resource scenario, al-though the largest improvements were observedon languages more related to English (German,Spanish, Portuguese) than others (Arabic).

Overall, the above experiments pinpointed the most suitable multilingual sharing scheme (Figure 3) for each setting independently of the encoder type, rather than the optimal combination of sharing scheme and encoder. Therefore, as shown in Table 3, increasing the sophistication of the encoders (from DENSE to GRU to biGRU) is expected to further improve accuracy.

6.3 Qualitative Analysis

We analyze the performance of the multilingual model over the full range of labels, to observe on which types of labels it performs better than the monolingual model, and provide some qualitative examples. Figure 5 shows the cumulative true positive (TP) difference between the monolingual and multilingual models on the Arabic, German, Portuguese and Russian test sets, ordered by label frequency. We can observe that the cumulative TP difference of the multilingual model consistently increases as the frequencies of the labels decrease. This shows that labels across the entire range of frequencies benefit from joint training with English and not only a subset, for example only the highly frequent labels.
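A cumulative TP difference of this kind could be computed roughly as below. This is a sketch under stated assumptions rather than the evaluation script used for Figure 5: it assumes each document carries a set of gold labels and a set of predicted labels per model, that labels are ordered from most to least frequent in the gold annotations, and that the difference is taken as multilingual minus monolingual. The function names and the toy data are hypothetical.

```python
from collections import Counter
from itertools import accumulate
from typing import List, Sequence, Set, Tuple


def true_positives(gold: Sequence[Set[str]], pred: Sequence[Set[str]]) -> Counter:
    """Count, per label, the documents where the label is both gold and predicted."""
    tp = Counter()
    for g, p in zip(gold, pred):
        tp.update(g & p)
    return tp


def cumulative_tp_difference(
    gold: Sequence[Set[str]],
    pred_mono: Sequence[Set[str]],
    pred_multi: Sequence[Set[str]],
) -> Tuple[List[str], List[int]]:
    """Return labels sorted by decreasing gold frequency and the running sum
    of (multilingual TP - monolingual TP) over that ordering."""
    freq = Counter(label for g in gold for label in g)
    tp_mono = true_positives(gold, pred_mono)
    tp_multi = true_positives(gold, pred_multi)
    # Most frequent labels first; ties broken alphabetically for determinism.
    labels = sorted(freq, key=lambda label: (-freq[label], label))
    diffs = [tp_multi[label] - tp_mono[label] for label in labels]
    return labels, list(accumulate(diffs))


# Toy data for illustration only.
gold = [{"europe", "culture"}, {"europe"}, {"sports"}]
mono = [{"europe"}, set(), {"sports"}]
multi = [{"europe", "culture"}, {"europe"}, set()]
labels, curve = cumulative_tp_difference(gold, mono, multi)
```

A curve that keeps rising toward the right end of the ordering indicates that infrequent labels also contribute gains, which is the pattern described above.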

For example, the top 5 labels on which the multilingual model performed better than the monolingual one for en→de were: russland (21), berlin (19), irak (14), wahlen (13) and nato (13), while for the opposite direction those were: germany (259), german (97), soccer (73), football (47) and merkel (25). These topics are likely better covered in the respective auxiliary language, which helps the multilingual model to better distinguish them in the target language as well. This is also observed in Figure 1, presented in the introduction, through an improved separation of topics when using the multilingual model rather than monolingual ones.

Figure 5: Cumulative true positive (TP) difference between monolingual and multilingual (ensemble) models for topic classification with specific labels, in the full-resource scenario. Panels: en → ru, en → ar, en → pt, en → de.

7 Conclusion

We proposed multilingual hierarchical attention networks for document classification and showed that they can benefit both full-resource and low-resource settings, while using fewer parameters than monolingual networks. In the former setting, the best option was to share only the attention mechanisms, while in the latter one, it was sharing the encoders along with the attention mechanisms. These results confirm the merits of language transfer, which is also an important component of human language learning (Odlin, 1989; Ringbom, 2007). Moreover, our study broadens the applicability of multilingual document classification, since our framework is not restricted to common label sets.

There are several future directions for this study. In their current form, our models cannot generalize to languages without any example, as attempted by Firat et al. (2016b) for neural machine translation. This could be achieved by a classification layer independent of the size of the label set as in zero-shot classification (Qiao et al., 2016; Nam et al., 2016). Moreover, although we explored three distinct architectures, other configurations could be examined to improve document modeling, for example by sharing the attention mechanism at the sentence-level only. Lastly, the learning objective could be further constrained with sentence-level parallel information, to embed multilingual vectors of similar topics more closely together in the learned space.
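To make the idea of a classification layer independent of the label-set size more concrete, one possible form is a bilinear score between a document vector and a label embedding. The sketch below is only an illustration of that general idea, not a method from this paper or from the cited works: the functions `score` and `rank_labels`, the matrix `W` and the toy vectors are all assumptions. The point is that the number of parameters in `W` depends on the two vector dimensions, not on how many labels exist.

```python
import math
from typing import Dict, List, Sequence


def score(doc: Sequence[float], W: Sequence[Sequence[float]], label: Sequence[float]) -> float:
    """Sigmoid of the bilinear form doc^T W label."""
    z = sum(doc[i] * W[i][j] * label[j] for i in range(len(doc)) for j in range(len(label)))
    return 1.0 / (1.0 + math.exp(-z))


def rank_labels(
    doc: Sequence[float],
    W: Sequence[Sequence[float]],
    label_vecs: Dict[str, Sequence[float]],
) -> List[str]:
    """Order label names by decreasing score; unseen labels only need an embedding."""
    return sorted(label_vecs, key=lambda name: score(doc, W, label_vecs[name]), reverse=True)


# Toy vectors for illustration only.
identity = [[1.0, 0.0], [0.0, 1.0]]
toy_labels = {"europe": [1.0, 0.0], "culture": [0.0, 1.0], "sports": [-1.0, 0.0]}
ranking = rank_labels([1.0, 0.0], identity, toy_labels)
```

Adding a new label to `toy_labels` requires no change to `W`, which is what would allow such a layer to be applied to label sets, or languages, without training examples.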

Acknowledgments

We are grateful for the support from the European Union through its Horizon 2020 program in the SUMMA project n. 688139, see http://www.summa-project.eu. We would also like to thank Sebastião Miranda at Priberam for gathering the news articles from Deutsche Welle and the anonymous reviewers for their helpful suggestions. The second author contributed to the paper while at the Idiap Research Institute.

References

Rami Al-Rfou, Bryan Perozzi, and Steven Skiena. 2013. Polyglot: Distributed word representations for multilingual NLP. In Proc. of the Seventeenth Conference on Computational Natural Language Learning, Sofia, Bulgaria.

Waleed Ammar, George Mulcaire, Yulia Tsvetkov, Guillaume Lample, Chris Dyer, and Noah A. Smith. 2016. Massively multilingual word embeddings. CoRR, abs/1602.01925.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proc. of the 5th International Conference on Learning Representations, San Diego, CA, USA.

Léon Bottou. 1998. On-line learning and stochastic approximations. In David Saad, editor, On-line Learning in Neural Networks, pages 9–42. Cambridge University Press.

Sarath Chandar, Stanislas Lauly, Hugo Larochelle, Mitesh Khapra, Balaraman Ravindran, Vikas C. Raykar, and Amrita Saha. 2014. An autoencoder approach to learning bilingual word representations. In Advances in Neural Information Processing Systems 27, pages 1853–1861.


Jianshu Chen, Ji He, Yelong Shen, Lin Xiao, Xiaodong He, Jianfeng Gao, Xinying Song, and Li Deng. 2015. End-to-end learning of LDA by mirror-descent back propagation over a deep architecture. In Advances in Neural Information Processing Systems 28, pages 1765–1773, Montreal, Canada.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proc. of the Conference on Empirical Methods in Natural Language Processing, pages 1724–1734, Doha, Qatar.

Misha Denil, Loris Bazzani, Hugo Larochelle, and Nando de Freitas. 2012. Learning where to attend with deep architectures for image tracking. Neural Computation, 24(8):2151–2184.

Manaal Faruqui and Chris Dyer. 2014. Improving vector space word representations using multilingual correlation. In Proc. of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 462–471, Gothenburg, Sweden.

Daniel C. Ferreira, André F. T. Martins, and Mariana S. C. Almeida. 2016. Jointly learning to embed and predict with multiple languages. In Proc. of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2019–2028, Berlin, Germany.

Orhan Firat, Kyunghyun Cho, and Yoshua Bengio. 2016a. Multi-way, multilingual neural machine translation with a shared attention mechanism. In Proc. of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 866–875, San Diego, CA, USA.

Orhan Firat, Baskaran Sankaran, Yaser Al-Onaizan, Fatos T. Yarman Vural, and Kyunghyun Cho. 2016b. Zero-resource translation with multi-lingual neural machine translation. In Proc. of the Conference on Empirical Methods in Natural Language Processing, pages 268–277, Austin, Texas.

Stephan Gouws, Yoshua Bengio, and Gregory S. Corrado. 2015. BilBOWA: Fast bilingual distributed representations without word alignments. In Proc. of the 32nd International Conference on Machine Learning, pages 748–756, Lille, France.

Karl Moritz Hermann and Phil Blunsom. 2014. Multilingual models for compositional distributed semantics. In Proc. of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 58–68, Baltimore, Maryland.

Karl Moritz Hermann, Tomáš Kočiský, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Proc. of the 28th International Conference on Neural Information Processing Systems, NIPS’15, pages 1693–1701, Montreal, Canada.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. In Neural Computation, volume 9 (8), pages 1735–1780. MIT Press.

Yangfeng Ji and Noah Smith. 2017. Neural discourse structure for text categorization. CoRR, abs/1702.01829.

Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda B. Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google’s multilingual neural machine translation system: Enabling zero-shot translation. CoRR, abs/1611.04558.

Rie Johnson and Tong Zhang. 2015. Effective use of word order for text categorization with convolutional neural networks. In Proc. of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 103–112, Denver, Colorado.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proc. of the Conference on Empirical Methods in Natural Language Processing, pages 1746–1751, Doha, Qatar.

Diederik P. Kingma and Jimmy Lei Ba. 2014. Adam: A method for stochastic optimization. In Proc. of the International Conference on Learning Representations, Banff, Canada.

Alexandre Klementiev, Ivan Titov, and Binod Bhattarai. 2012. Inducing crosslingual distributed representations of words. In Proc. of the International Conference on Computational Linguistics, Bombay, India.

Ankit Kumar, Ozan Irsoy, Jonathan Su, James Bradbury, Robert English, Brian Pierce, Peter Ondruska, Ishaan Gulrajani, and Richard Socher. 2015. Ask me anything: Dynamic memory networks for natural language processing. In Proc. of the 33rd International Conference on Machine Learning, pages 334–343, New York City, NY, USA.

Siwei Lai, Liheng Xu, Kang Liu, and Jun Zhao. 2015. Recurrent convolutional neural networks for text classification. In Proc. of the 29th AAAI Conference on Artificial Intelligence, pages 2267–2273, Austin, Texas.

Hugo Larochelle and Geoffrey Hinton. 2010. Learning to combine foveal glimpses with a third-order Boltzmann machine. In Proc. of the 23rd International Conference on Neural Information Processing Systems, pages 1243–1251, Vancouver, Canada.

Quoc V. Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In Proc. of the 31st International Conference on Machine Learning, pages 1188–1196, Beijing, China.

Rui Lin, Shujie Liu, Muyun Yang, Mu Li, Ming Zhou, and Sheng Li. 2015. Hierarchical recurrent neural network for document modeling. In Proc. of the Conference on Empirical Methods in Natural Language Processing, pages 899–907, Lisbon, Portugal.

Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proc. of the Conference on Empirical Methods in Natural Language Processing, pages 1412–1421, Lisbon, Portugal.

Laurens van der Maaten. 2009. Learning a parametric embedding by preserving local structure. In Proc. of the 12th International Conference on Artificial Intelligence and Statistics, pages 384–391, Clearwater Beach, FL, USA.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. In Proc. of the International Conference on Learning Representations, Scottsdale, AZ, USA.

Volodymyr Mnih, Nicolas Heess, Alex Graves, and Koray Kavukcuoglu. 2014. Recurrent models of visual attention. CoRR, abs/1406.6247.

Jinseok Nam, Eneldo Loza Mencía, and Johannes Fürnkranz. 2016. All-in text: Learning document, label, and word representations jointly. In Proc. of the 30th AAAI Conference on Artificial Intelligence, pages 1948–1954, Phoenix, AZ, USA.

Terence Odlin. 1989. Language transfer: Cross-linguistic influence in language learning. In Cambridge Applied Linguistics. Cambridge University Press.

Nikolaos Pappas and Andrei Popescu-Belis. 2014. Explaining the stars: Weighted multiple-instance learning for aspect-based sentiment analysis. In Proc. of the Conference on Empirical Methods in Natural Language Processing, pages 455–466, Doha, Qatar.

Nikolaos Pappas and Andrei Popescu-Belis. 2017. Explicit document modeling through weighted multiple-instance learning. Journal of Artificial Intelligence Research, pages 591–626.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proc. of the Conference on Empirical Methods in Natural Language Processing, pages 1532–1543, Doha, Qatar.

Ruizhi Qiao, Lingqiao Liu, Chunhua Shen, and Anton van den Hengel. 2016. Less is more: Zero-shot learning from online textual documents with noise suppression. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2249–2257, Las Vegas, NV, USA.

Hakan Ringbom. 2007. Cross-linguistic Similarity in Foreign Language Learning. Second language acquisition series, vol. 21. Multilingual Matters, Clevedon, UK.

Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In Proc. of the Conference on Empirical Methods in Natural Language Processing, pages 379–389, Lisbon, Portugal.

Duyu Tang, Bing Qin, and Ting Liu. 2015. Document modeling with gated recurrent neural network for sentiment classification. In Proc. of the Conference on Empirical Methods in Natural Language Processing, pages 1422–1432, Lisbon, Portugal.

Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: A simple and general method for semi-supervised learning. In Proc. of the 48th Annual Meeting of the Association for Computational Linguistics, pages 384–394, Uppsala, Sweden.

Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In Proc. of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1480–1489, San Diego, CA, USA.

Ainur Yessenalina, Yisong Yue, and Claire Cardie. 2010. Multi-level structured models for document-level sentiment classification. In Proc. of the Conference on Empirical Methods in Natural Language Processing, pages 1046–1056, Cambridge, MA.

Wen-tau Yih, Kristina Toutanova, John C. Platt, and Christopher Meek. 2011. Learning discriminative projections for text similarity measures. In Proc. of the 15th Conference on Computational Natural Language Learning, pages 247–256, Portland, OR, USA.

Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems 28, pages 649–657, Montreal, Canada.

Will Y. Zou, Richard Socher, Daniel Cer, and Christopher D. Manning. 2013. Bilingual word embeddings for phrase-based machine translation. In Proc. of the Conference on Empirical Methods in Natural Language Processing, pages 1393–1398, Seattle, WA, USA.