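Computing the similarity between sentences or documents is an important task in natural language processing. Finding similar sentences and documents has many applications, such as plagiarism detection, retrieval of related articles, and absorbing variation in how questions are phrased in question answering.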

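Word Mover’s Distance is one method for computing the distance between documents. It was proposed in 2015 and is notable for giving good results on short texts such as tweets. Concretely, it computes the distance between documents using distributed word representations (word embeddings) obtained with Word2vec, GloVe, and similar methods.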

The goal of this article is to try out Word Mover’s Distance. Specifically, we compute it with gensim, a Python library for working with word embeddings and finding similar documents.

This article does not cover the theory behind Word Mover’s Distance. The following article explains it very clearly, so please refer to it: yubessy.hatenablog.com

Let's get straight to the implementation.

Setup

Before implementing anything, install the required libraries. To compute Word Mover’s Distance we need gensim and pyemd. Install them as follows:

$ pip install gensim
$ pip install pyemd

Implementing Word Mover’s Distance

First, an overview of the steps. Computing Word Mover’s Distance requires word embeddings, so we start by preparing pre-trained embeddings. We then use those embeddings to compute Word Mover’s Distance.

Preparing the word embeddings

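First, download the pre-trained embeddings, GoogleNews-vectors-negative300.bin.gz. These word vectors have 300 dimensions and a vocabulary of 3 million entries, and were trained on the Google News corpus (about 100 billion words). The loading code below refers to the decompressed file, GoogleNews-vectors-negative300.bin.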

Loading pre-trained embeddings with gensim is very simple. Just pass the path to the embeddings file, as follows:

>>> from gensim.models.keyedvectors import KeyedVectors

>>> # Load Google's pre-trained Word2Vec model.
>>> model = KeyedVectors.load_word2vec_format('./GoogleNews-vectors-negative300.bin', binary=True)
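The full model holds 3 million 300-dimensional vectors, so loading it takes a while and uses several gigabytes of memory. As a minimal sketch for quick experiments, gensim's load_word2vec_format accepts a limit argument that reads only the first N vectors in the file. The helper name load_vectors and the figure of 500,000 are my own choices for illustration, not part of the setup described here.

```python
from pathlib import Path


def load_vectors(path, limit=None):
    """Load word2vec-format binary vectors, optionally only the first `limit` entries.

    `load_vectors` is a hypothetical helper name used for illustration.
    """
    path = Path(path)
    if not path.exists():
        # Fail early with a clear message before the slow load starts.
        raise FileNotFoundError(f"embedding file not found: {path}")
    # Imported here so the helper can be defined even before gensim is installed.
    from gensim.models import KeyedVectors
    return KeyedVectors.load_word2vec_format(str(path), binary=True, limit=limit)


# Example (assumes the decompressed file is in the current directory):
# model = load_vectors('./GoogleNews-vectors-negative300.bin', limit=500000)
```

Words outside the loaded subset are treated as unknown, so distances computed with a limited model can differ from those computed with the full one.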

Once the embeddings are loaded, check them by displaying some related words. Related words can be obtained by calling the most_similar method:

>>> model.most_similar(positive=['japanese'])
[('japan', 0.6607723236083984),
 ('chinese', 0.6502295732498169),
 ('Japanese', 0.6149079203605652),
 ('korean', 0.6051568984985352),
 ('german', 0.5999273061752319),
 ('american', 0.5906797647476196),
 ('asian', 0.5839767456054688),
 ('san', 0.5834757089614868),
 ('jap', 0.5764404535293579),
 ('swedish', 0.5720360279083252)]
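The scores returned by most_similar are cosine similarities between word vectors. The sketch below computes the same quantity in plain Python, just to show what the number means. The 3-dimensional vectors are invented for illustration (the real ones have 300 dimensions).

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors invented for illustration; they are not taken from the model.
vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 4.0, 0.0]  # same direction as vec_a, so similarity is close to 1
vec_c = [0.0, 0.0, 3.0]  # orthogonal to vec_a, so similarity is 0

print(cosine_similarity(vec_a, vec_b))
print(cosine_similarity(vec_a, vec_c))
```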

Computing Word Mover’s Distance

With the embeddings loaded, we can compute Word Mover’s Distance using the model's wmdistance method. Passing it two sentences returns the distance between them:

>>> sent1 = 'But other sources close to the sale said Vivendi was keeping the door open to further bids and hoped to see bidders interested in individual assets team up.'.split()
>>> sent2 = 'But other sources close to the sale said Vivendi was keeping the door open for further bids in the next day or two.'.split()
>>> distance = model.wmdistance(sent1, sent2)
>>> print(distance)
0.8738126733213625

The closer the distance is to 0, the more similar the two sentences are.
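To get a feel for what this number measures, here is a small pure-Python sketch of the relaxed lower bound on Word Mover’s Distance described in the 2015 paper. In it, every word simply moves all of its weight to the nearest word in the other sentence. This is not what gensim computes: wmdistance solves the full transport problem, so its value is at least as large as this bound. The 2-dimensional vectors are invented for illustration.

```python
import math
from collections import Counter

# Toy 2-dimensional embeddings, invented for illustration only.
TOY_VECTORS = {
    "man": (1.0, 0.0),
    "person": (0.9, 0.1),
    "guitar": (0.0, 1.0),
    "piano": (0.1, 0.9),
    "plays": (0.5, 0.5),
}


def nbow(tokens):
    """Normalized bag-of-words weights: word -> relative frequency."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def one_sided_cost(src, dst, vectors):
    """Move each source word's weight to its nearest word in dst (Euclidean distance)."""
    targets = set(dst)
    return sum(
        weight * min(math.dist(vectors[word], vectors[other]) for other in targets)
        for word, weight in nbow(src).items()
    )


def relaxed_wmd(doc1, doc2, vectors=TOY_VECTORS):
    """Relaxed lower bound on WMD: the larger of the two one-sided costs."""
    return max(one_sided_cost(doc1, doc2, vectors),
               one_sided_cost(doc2, doc1, vectors))


s1 = "man plays guitar".split()
s2 = "person plays piano".split()
print(relaxed_wmd(s1, s1))  # identical sentences give 0.0
print(relaxed_wmd(s1, s2))  # small but positive: the words differ yet lie close together
```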

One caveat: the sentences must be passed in already tokenized, as lists of words. The example above uses split as a quick, rough tokenizer.
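Note that split leaves punctuation attached to words, so the last token of sent1 above is 'up.' rather than 'up'. A slightly more careful tokenizer can be sketched with the standard library. The regular expression and the function name tokenize are my own choices for illustration. Case is preserved on purpose, because the vectors are case-sensitive (the most_similar output above lists japanese and Japanese as separate entries).

```python
import re

# Runs of letters/digits, optionally joined by internal hyphens or apostrophes.
# This pattern is an illustrative assumption, not a standard.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+(?:[-'][A-Za-z0-9]+)*")


def tokenize(text):
    """Split text into word tokens, dropping surrounding punctuation."""
    return TOKEN_PATTERN.findall(text)


print(tokenize("A man is doing pull-ups."))
# ['A', 'man', 'is', 'doing', 'pull-ups']
```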

Extracting similar sentences

I generated 760,000 sentence pairs from MSRvid.txt in SemEval 2012 and computed Word Mover’s Distance for each pair. Let's look at the pairs with the smallest distances, starting with the top 10.

Rank	Sentence 1	Sentence 2
1	A man is eating a food.	A man is eating food.
2	A woman is playing guitar.	A woman is playing a guitar.
3	A woman is slicing carrot.	A woman is slicing a carrot.
4	A man is slicing a potato.	A man is slicing potato.
5	A man is singing and playing a guitar.	A man is playing a guitar and singing.
6	A girl is playing a guitar.	A girl is playing guitar.
7	A woman is playing a flute.	A woman is playing flute.
8	A man is driving a car.	A man is driving a car.
9	A man is playing keyboard.	A man is playing a keyboard.
10	Someone is playing paino.	Someone is playing piano.
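A ranking like the one above can be produced roughly as follows: enumerate every unordered pair of sentences, compute a distance for each, and sort in ascending order. The actual script is not shown in this article, so the function names here are hypothetical. A simple token-overlap (Jaccard) distance stands in for the real one so that the sketch runs without the model. With the model loaded, you would pass something like lambda a, b: model.wmdistance(a.split(), b.split()) instead.

```python
from itertools import combinations


def jaccard_distance(sent1, sent2):
    """Stand-in distance: 1 - |intersection| / |union| of the token sets."""
    tokens1, tokens2 = set(sent1.split()), set(sent2.split())
    union = tokens1 | tokens2
    if not union:
        return 0.0  # two empty sentences are treated as identical
    return 1.0 - len(tokens1 & tokens2) / len(union)


def rank_pairs(sentences, distance_fn):
    """Return (distance, sentence1, sentence2) for all pairs, closest first."""
    scored = [(distance_fn(a, b), a, b) for a, b in combinations(sentences, 2)]
    return sorted(scored)


sentences = [
    "A man is playing a guitar.",
    "A man is playing guitar.",
    "A woman is slicing a carrot.",
]
for rank, (distance, a, b) in enumerate(rank_pairs(sentences, jaccard_distance), start=1):
    print(rank, round(distance, 3), a, b, sep="\t")
```

For n sentences this evaluates n(n-1)/2 pairs, so the run time grows quadratically with the number of sentences.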

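All of the top 10 pairs had a distance of 0. Looking at how the sentences in each pair differ, in most cases the only difference is whether the indefinite article "a" is present. Next, let's look at ranks 101 through 110.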

Rank	Distance	Sentence 1	Sentence 2
101	0.28339873148707156	The man is singing and playing the guitar.	A man is playing the guitar and singing.
102	0.28431112099748035	A man is dancing in the rain.	A man is standing in the rain.
103	0.28610509370804266	A man is doing pull-ups.	A man is doing push ups.
104	0.287278075708792	A woman is cutting potatoes.	A person is cutting a potato.
105	0.2874588758559057	A boy is playing with a toy.	A dog is playing with a toy.
106	0.2885302530155865	A man is praying.	A man is crying.
107	0.2899102354752659	A person is slicing some onions.	A person is slicing onions.
108	0.29093771121638334	A woman is slicing lemons.	A woman is slicing some onions.
109	0.29123062405240735	A girl is jumping onto a car.	A girl is coming up on a car.
110	0.2913863757608983	A woman is cutting potatoes.	A woman is cutting a tomatoe.

Around these ranks, the advantages and disadvantages of using word embeddings start to show.

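As an advantage, the pair at rank 104 appears to absorb the difference between potatoes and potato. Presumably the embeddings treat the two words as having nearly the same meaning even though their surface forms differ. Although not shown here, the method also seemed robust to spelling mistakes: a sentence in which driving was misspelled as drivong still got a small distance.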

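As a disadvantage, the pair at rank 103 contains the antonyms pull-ups and push ups, yet the distance is small. The likely cause is that word embeddings cannot distinguish similarity from antonymy. This is a drawback of any distance computation based on word embeddings, so it is something to watch out for.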
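Regarding the misspelling observation above, one related point worth checking is vocabulary coverage. As far as I know, gensim's wmdistance drops words that are not in the model's vocabulary before computing the distance, so a misspelled word may simply be ignored rather than matched. The sketch below uses a small made-up set as a stand-in for the vocabulary to show which tokens would survive. With a real model the membership test would be token in model.

```python
def split_by_vocab(tokens, vocab):
    """Separate tokens into those found in the vocabulary and those missing from it."""
    known = [t for t in tokens if t in vocab]
    unknown = [t for t in tokens if t not in vocab]
    return known, unknown


# Made-up vocabulary for illustration; the real one has 3 million entries.
TOY_VOCAB = {"A", "man", "is", "driving", "car"}

known, unknown = split_by_vocab("A man is drivong a car".split(), TOY_VOCAB)
print(known)    # ['A', 'man', 'is', 'car']
print(unknown)  # ['drivong', 'a']
```

Printing the unknown tokens for a few sentence pairs is a quick way to tell whether a small distance comes from the embeddings or from words being skipped.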