The difference between match and match_phrase in Elasticsearch

I never fully internalized the difference between match and match_phrase for full-text search in Elasticsearch, so here is a summary.

version: Elasticsearch 7.5

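match query

The match query gives you loose full-text matching (what is often casually called fuzzy search). The string passed to a match query is analyzed before it is used for searching.

For example, suppose the title field holds the text 「2021年春に発売される新着本の特集!!NEW!!」 (roughly, "Feature on new books going on sale in spring 2021!!NEW!!"), and consider whether searching for 「2021春」 ("2021 spring") finds it.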

GET sample-index/_search
{
  "query": {
    "match": {
      "title": "2021春"
    }
  }
}

The analyzed query is searched with OR

The index is configured to use kuromoji for morphological analysis, so 「2021春」 is split into the tokens 「2021」 and 「春」.

GET sample-index/_analyze
{
  "field": "title",
  "text": "2021春"
}

// Result (other attributes omitted)
{
  "tokens": [
    { "token": "2021", ... },
    { "token": "春", ... }
  ]
}

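When the analyzed query yields multiple terms, the default operator is OR, so searching for 「2021春」 checks whether the document matches "2021 OR 春".

In other words, 「2021年春に発売される新着本の特集!!NEW!!」 is found even though 「年」 sits between 2021 and 春.

Because the terms are ORed, a search for 「2021夏」 ("2021 summer") also hits this document, since 2021 alone is enough to match. If that is not what you want, change operator to AND.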

    "match": {
      "title": {
        "query": "2021春",
        "operator": "AND"
      }
    }
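The OR versus AND behavior can be illustrated with a small toy model. This is only a sketch in plain Python, not Elasticsearch code: the function name toy_match and the token list for the sample title are assumptions made for illustration, and the real kuromoji output may differ.

```python
# Toy model of how a match query combines analyzed terms.
# NOT Elasticsearch code; the token list below is an assumed tokenization.

def toy_match(query_tokens, doc_tokens, operator="OR"):
    """Return True if the document would match under the given operator."""
    doc_terms = set(doc_tokens)
    hits = [term in doc_terms for term in query_tokens]
    if operator.upper() == "AND":
        return all(hits)  # every query term must be present
    return any(hits)      # OR: a single present term is enough


# Assumed tokens for the sample title (illustrative only).
title_tokens = ["2021", "年", "春", "に", "発売", "さ", "れる",
                "新着", "本", "の", "特集", "new"]

print(toy_match(["2021", "春"], title_tokens))         # True
print(toy_match(["2021", "夏"], title_tokens))         # True: "2021" alone matches
print(toy_match(["2021", "夏"], title_tokens, "AND"))  # False: "夏" is missing
print(toy_match(["2021", "春"], title_tokens, "AND"))  # True
```

The model ignores scoring entirely; it only shows which documents are considered a match.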

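match_phrase query

As with match, the string passed to match_phrase is analyzed before it is used for searching.

However, match_phrase has stricter matching conditions than match.

All terms in the same order as the query, with nothing extra in between

All terms must appear in the document in the same order as in the query.

In addition, no extra tokens may appear between the terms.

Searching for 「2021春」 does not hit, because 「年」 sits between the two terms.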

GET sample-index/_search
{
  "query": {
    "match_phrase": {
      "title": "2021春"
    }
  }
}

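Requiring all terms in query order with nothing extra in between is strict enough that the result feels less like loose search and more like SQL's LIKE.

Unlike SQL's LIKE, though, the text goes through analysis, so symbols (such as ・ or !) are dropped and letter case is normalized before matching, which makes it somewhat more powerful.

In this example, 「特集new」 matches 「特集!!NEW」.

slop: allowing extra tokens in between

That said, when you want a somewhat more flexible phrase search, the slop option is available.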

    "match_phrase": {
      "title": {
        "query": "2021春",
        "slop": 1
      }
    }

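slop specifies how many extra token positions are allowed between the query terms. The unit is tokens, not characters.

Here there is one extra token, 「年」, between 2021 and 春. It happens to be one character long, but the unit is a token that analysis does not split any further, such as 「発売」. Setting slop to 1 or more therefore makes the search hit.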
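The effect of slop can be sketched with a simplified position-based model. This is an assumption-laden toy in plain Python, not the actual Lucene implementation: toy_phrase_match and the token list are hypothetical, and edge cases such as repeated query terms are not handled faithfully.

```python
from itertools import product

# Simplified model of phrase matching with slop.
# NOT the real Lucene algorithm; the tokens below are an assumed tokenization.

def toy_phrase_match(query_tokens, doc_tokens, slop=0):
    """Return True if the query terms occur close enough to form the phrase."""
    # Positions at which each query term occurs in the document.
    positions = [
        [i for i, tok in enumerate(doc_tokens) if tok == term]
        for term in query_tokens
    ]
    if any(not p for p in positions):
        return False  # some query term never occurs
    for combo in product(*positions):
        # Shift each position by the term's offset inside the query.
        # For an exact phrase, every shifted value is identical.
        shifted = [pos - offset for offset, pos in enumerate(combo)]
        if max(shifted) - min(shifted) <= slop:
            return True
    return False


title_tokens = ["2021", "年", "春", "に", "発売", "さ", "れる",
                "新着", "本", "の", "特集", "new"]

print(toy_phrase_match(["2021", "春"], title_tokens))          # False
print(toy_phrase_match(["2021", "春"], title_tokens, slop=1))  # True
print(toy_phrase_match(["特集", "new"], title_tokens))          # True
```

With slop=0 the two terms must be adjacent; with slop=1 the single intervening token 「年」 is tolerated.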

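multi_match query

To search across multiple fields, use multi_match for both match and match_phrase.

type

Whether multi_match acts as the multi-field version of match or of match_phrase is set with type.

The official documentation describes several types, but a solid understanding of at least best_fields and phrase seems enough for most search needs.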

type	Description
best_fields	(default) Finds documents which match any field, but uses the _score from the best field. See best_fields.
most_fields	Finds documents which match any field and combines the _score from each field. See most_fields.
cross_fields	Treats fields with the same analyzer as though they were one big field. Looks for each word in any field. See cross_fields.
phrase	Runs a match_phrase query on each field and uses the _score from the best field. See phrase and phrase_prefix.
phrase_prefix	Runs a match_phrase_prefix query on each field and uses the _score from the best field. See phrase and phrase_prefix.
bool_prefix	Creates a match_bool_prefix query on each field and combines the _score from each field. See bool_prefix.

best_fields

To use multi_match as the multi-field version of match, set type to best_fields.

The official documentation describes best_fields as follows; the scoring needs some attention.

Normally the best_fields type uses the score of the single best matching field, but if tie_breaker is specified, then it calculates the score as follows:

the score from the best matching field plus tie_breaker * _score for all other matching fields

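By default, the highest score among the fields is used as the overall score, just as the name best_fields suggests.

If you instead want the sum of the scores computed for each field to be the overall score, set tie_breaker to 1.0. (The default tie_breaker is 0.0.)

Writing the following as a formula with tie_breaker set to 1.0,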

the score from the best matching field plus tie_breaker * _score for all other matching fields

score =
  score of the best matching field +
  1.0 * score of other matching field 1 +
  1.0 * score of other matching field 2 +
  ...

which is simply the sum of the scores of all matching fields.
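The formula quoted from the documentation can be worked through numerically. The following is a minimal sketch in plain Python; best_fields_score is a hypothetical helper and the per-field scores are made-up numbers, not values returned by Elasticsearch.

```python
# Worked example of: best score + tie_breaker * (scores of the other fields).
# The helper name and the sample scores are hypothetical.

def best_fields_score(field_scores, tie_breaker=0.0):
    """Combine per-field scores the way the quoted formula describes."""
    if not field_scores:
        return 0.0
    best = max(field_scores)
    others = sum(field_scores) - best  # total of all non-best fields
    return best + tie_breaker * others


scores = [2.0, 1.0, 0.5]  # made-up scores for three matching fields

print(best_fields_score(scores))       # 2.0  (default: best field only)
print(best_fields_score(scores, 1.0))  # 3.5  (sum of all fields)
print(best_fields_score(scores, 0.5))  # 2.75 (somewhere in between)
```

A tie_breaker of 0.0 and 1.0 are the two extremes; values in between give the best field the most weight while still rewarding matches in other fields.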

phrase

As the official documentation states, setting type to phrase gives behavior equivalent to match_phrase.

The phrase and phrase_prefix types behave just like best_fields, but they use a match_phrase or match_phrase_prefix query instead of a match query.

The documentation says phrase behaves like best_fields, and that includes how the score is calculated. So, depending on your goal, set tie_breaker to 1.0 here as well.
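As a rough illustration, a multi_match request body with type phrase and tie_breaker 1.0 could be assembled like this. It is only a sketch: build_multi_match is a hypothetical helper, and the description field is an assumed second field that does not appear in the original example.

```python
import json

# Builds a multi_match request body as a plain dict.
# The helper and the "description" field are assumptions for illustration.

def build_multi_match(query, fields, match_type="best_fields", tie_breaker=0.0):
    """Return a search request body using the multi_match query."""
    return {
        "query": {
            "multi_match": {
                "query": query,
                "fields": list(fields),
                "type": match_type,          # "best_fields" or "phrase"
                "tie_breaker": tie_breaker,  # 1.0 sums the per-field scores
            }
        }
    }


body = build_multi_match(
    "2021春",
    ["title", "description"],
    match_type="phrase",
    tie_breaker=1.0,
)
print(json.dumps(body, ensure_ascii=False, indent=2))
```

The printed JSON can be pasted as the body of a GET sample-index/_search request; switching match_type back to best_fields gives the match-style variant with the same score summing.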

most_fields, cross_fields

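This section covers whether most_fields or cross_fields should be used instead of the combination of best_fields and tie_breaker.

As far as I could tell, the only way to get match_phrase behavior through multi_match is to set type to phrase; there does not seem to be a match_phrase counterpart of most_fields. Summing the per-field scores therefore appears to require setting tie_breaker.

To keep the score-summing approach consistent between the multi-field versions of match and match_phrase, I concluded that the combination of best_fields and tie_breaker is fine unless a requirement can only be met with most_fields or cross_fields.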