IR brief theory

Search engines are designed with two goals in mind:

  • to retrieve documents relevant to the user

  • and to reject the documents that are not relevant to the user

As a criterion for the correct selection of relevant documents, the recall metric can be used. Maximizing recall corresponds to minimizing the type II error, i.e. the number of false negatives. In our case, a false negative is an incorrect classification of a relevant document as non-relevant, i.e. not subject to selection.

$$Recall = \frac{TP}{TP+FN}$$

On the other hand, how precisely only relevant documents are selected can be checked using another metric - precision. Maximizing precision corresponds to minimizing the type I error, i.e. the number of false positives. In our case, a false positive means that a non-relevant document is selected when it shouldn't be.

$$Precision =\frac{TP}{TP+FP}$$

Ideally both precision and recall should be as high as possible.
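
As a quick sanity check, both metrics are straightforward to compute from the confusion-matrix counts (the numbers below are made up purely for illustration):

```python
def precision(tp: int, fp: int) -> float:
    # Share of selected documents that are actually relevant
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    # Share of relevant documents that were actually selected
    return tp / (tp + fn)

# Hypothetical run: 8 relevant docs retrieved, 2 non-relevant retrieved,
# 2 relevant docs missed
print(precision(tp=8, fp=2))  # 0.8
print(recall(tp=8, fn=2))     # 0.8
```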

The selection method itself may be one of the following types: lexical, semantic, and hybrid.

Lexical (full-text) selection is usually based on building a bag-of-words (BoW) index over the words found in the sentence (text), thus creating a sparse vector where each dimension corresponds to a dictionary entry and the value is the number of occurrences of that word in the sentence (text). As an example of this approach, let's consider the BM25 algorithm, which ranks the documents in the corpus according to their relevance to the prompt (query).
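
A minimal sketch of such a sparse BoW vector (the dictionary here is a made-up toy example):

```python
from collections import Counter

def bag_of_words(text: str, vocabulary: list[str]) -> list[int]:
    # Each dimension corresponds to one dictionary entry; the value is
    # the number of occurrences of that word in the text
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

vocab = ["kotlin", "java", "tutorial", "performance"]  # toy dictionary
print(bag_of_words("Learn Kotlin - Kotlin Free Tutorial", vocab))  # [2, 0, 1, 0]
```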

There are lots of improvements to the classic BM25 algorithm.

BM25

Given the query \(Q\), containing keywords \(q_1, ..., q_n\), the BM25 score of the document \(D\) is

$$score(D, Q) = \sum_{i=1}^{n} IDF(q_{i}) \frac{f(q_i, D)(k_1+1)}{f(q_i, D)+k_1(1-b+b\frac{\left\lvert D \right\rvert}{avgdl})}=$$

$$(k_1+1)\sum_{i=1}^{n} IDF(q_{i}) \frac{f(q_i, D)}{f(q_i, D)+k_1(1-b+b\frac{\left\lvert D \right\rvert}{avgdl})}$$

where \(k_1 \in [1.2,2]\) is an external saturation parameter (\(k_1+1\) is called boost), \(f(q_i, D)\) is the number of times that the keyword \(q_i\) occurs in the document \(D\), \(\vert D \vert\) is the length of the document in words, and \(avgdl\) is the average document length (in words) in the corpus from which the documents are drawn. \(IDF(q_i)\) is the inverse document frequency of the term \(q_i\). It is usually calculated as

$$IDF(q_i)=\ln\left(\frac{N-n(q_i)+0.5}{n(q_i)+0.5}+1\right)$$

where \(N\) is the total number of documents in the corpus and \(n(q_i)\) is the number of documents containing \(q_i\).
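
The formulas above can be sketched directly in code. This is a simplified illustration over a toy whitespace-tokenized corpus, not a production implementation (a real engine also applies analyzers, stemming, and so on):

```python
import math

def bm25_score(query_terms, doc, corpus, k1=1.2, b=0.75):
    # Score one tokenized document against a query with the BM25 formula above
    avgdl = sum(len(d) for d in corpus) / len(corpus)  # average doc length in words
    N = len(corpus)                                    # total number of documents
    score = 0.0
    for q in query_terms:
        n_q = sum(1 for d in corpus if q in d)         # documents containing q
        idf = math.log((N - n_q + 0.5) / (n_q + 0.5) + 1)
        f = doc.count(q)                               # term frequency in the doc
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avgdl))
    return score

# Toy corpus, tokenized by naive whitespace splitting
corpus = [d.lower().split() for d in [
    "Kotlin Programming Language",
    "Learn Kotlin Kotlin Free Tutorial",
    "Java vs Kotlin Performance",
]]
print(bm25_score(["kotlin"], corpus[1], corpus))
```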

Let's look at using BM25 within ElasticSearch as an example.

  1. Create an index and add several documents to it:

     POST _bulk
     { "index" : { "_index" : "kotlin_articles", "_id" : "1" } }
     { "text" : "Kotlin Programming Language" }
     { "index" : { "_index" : "kotlin_articles", "_id" : "2" } }
     { "text" : "Learn Kotlin - Kotlin Free Tutorial" }
     { "index" : { "_index" : "kotlin_articles", "_id" : "3" } }
     { "text" : "Java vs. Kotlin - Part1: Performance" }
     { "index" : { "_index" : "kotlin_articles", "_id" : "4" } }
     { "text" : "Java vs. Kotlin - Part2: Bytecode" }
     { "index" : { "_index" : "kotlin_articles", "_id" : "5" } }
     { "text" : "Anything Java can do Kotlin can do better" }
    

A simple search with the "Kotlin" query:

GET /kotlin_articles/_search
{
  "query": {
    "match": {
      "text": "Kotlin"
    }
  }
}

gives the following scores:

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 5,
      "relation": "eq"
    },
    "max_score": 0.120948985,
    "hits": [
      {
        "_index": "kotlin_articles",
        "_id": "2",
        "_score": 0.120948985,
        "_source": {
          "text": "Learn Kotlin - Kotlin Free Tutorial"
        }
      },
      {
        "_index": "kotlin_articles",
        "_id": "1",
        "_score": 0.10522306,
        "_source": {
          "text": "Kotlin Programming Language"
        }
      },
      {
        "_index": "kotlin_articles",
        "_id": "3",
        "_score": 0.08840232,
        "_source": {
          "text": "Java vs. Kotlin - Part1: Performance"
        }
      },
      {
        "_index": "kotlin_articles",
        "_id": "4",
        "_score": 0.08840232,
        "_source": {
          "text": "Java vs. Kotlin - Part2: Bytecode"
        }
      },
      {
        "_index": "kotlin_articles",
        "_id": "5",
        "_score": 0.07130444,
        "_source": {
          "text": "Anything Java can do Kotlin can do better"
        }
      }
    ]
  }
}

Now we can check whether the returned score coincides with the theoretical one. To do so, let's take the most relevant document and substitute its parameters into the ranking function.

$$IDF=\ln(1+\frac{N - n(q)+0.5}{n(q)+0.5})=\ln(1+\frac{5-5+0.5}{5+0.5})=0.087$$

$$score=(k_1+1)\,IDF\,\frac{f(q, D)}{f(q, D) + k_1(1-b+b\frac{fieldLen}{avgFieldLen})}=$$

$$(1.2+1)\cdot 0.087\cdot\frac{2}{2+1.2\cdot(1-0.75+0.75\cdot\frac{5}{5.2})}=$$

$$2.2\cdot 0.087\cdot\frac{2}{2+1.2\cdot(0.25+0.75\cdot 0.96)}=2.2\cdot 0.087\cdot\frac{2}{2+1.2\cdot 0.97}=$$

$$2.2\cdot 0.087\cdot\frac{2}{3.164}=0.087\cdot 1.39=0.1209$$
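
The same arithmetic can be reproduced in a few lines of Python (the constants are exactly the ones substituted above):

```python
import math

# Parameters for document 2: "Learn Kotlin - Kotlin Free Tutorial"
k1, b = 1.2, 0.75
N, n = 5, 5              # total docs with the field / docs containing "kotlin"
freq, dl, avgdl = 2, 5, 5.2

idf = math.log(1 + (N - n + 0.5) / (n + 0.5))
tf = freq / (freq + k1 * (1 - b + b * dl / avgdl))
score = (k1 + 1) * idf * tf
print(round(score, 6))   # 0.120949
```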

To cross-check this, let’s run the calculations using the Explain API:

GET /kotlin_articles/_explain/2
{
  "query": {
    "match": {
      "text": "Kotlin"
    }
  }
}
which returns:

{
  "_index": "kotlin_articles",
  "_id": "2",
  "matched": true,
  "explanation": {
    "value": 0.120948985,
    "description": "weight(text:kotlin in 1) [PerFieldSimilarity], result of:",
    "details": [
      {
        "value": 0.120948985,
        "description": "score(freq=2.0), computed as boost * idf * tf from:",
        "details": [
          {
            "value": 2.2,
            "description": "boost",
            "details": []
          },
          {
            "value": 0.087011375,
            "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
            "details": [
              {
                "value": 5,
                "description": "n, number of documents containing term",
                "details": []
              },
              {
                "value": 5,
                "description": "N, total number of documents with field",
                "details": []
              }
            ]
          },
          {
            "value": 0.63183475,
            "description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
            "details": [
              {
                "value": 2,
                "description": "freq, occurrences of term within document",
                "details": []
              },
              {
                "value": 1.2,
                "description": "k1, term saturation parameter",
                "details": []
              },
              {
                "value": 0.75,
                "description": "b, length normalization parameter",
                "details": []
              },
              {
                "value": 5,
                "description": "dl, length of field",
                "details": []
              },
              {
                "value": 5.2,
                "description": "avgdl, average length of field",
                "details": []
              }
            ]
          }
        ]
      }
    ]
  }
}

Note that BM25, like TF-IDF, doesn't have sound theoretical properties (in contrast to the Probabilistic Retrieval Model).

Read more

Understanding TF-IDF and BM-25 - KMW Technology (kmwllc.com)

Rank Product Statistic

$$RP(g) = \left(\prod_{i=1}^k r_{g,i}\right)^{\frac{1}{k}}$$

where \(r_{g,i}\) is the rank of element \(g\) in the \(i\)-th of \(k\) rankings, i.e. the statistic is the geometric mean of the ranks.
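
As a small illustration (hypothetical ranks of a single item across three rankings):

```python
import math

def rank_product(ranks: list[int]) -> float:
    # Geometric mean of one item's ranks across k independent rankings
    k = len(ranks)
    return math.prod(ranks) ** (1 / k)

# Hypothetical item ranked 1st, 2nd and 4th in three rankings
print(rank_product([1, 2, 4]))  # ≈ 2.0
```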

Help

Run ELK in docker:

docker network create elastic
docker pull docker.elastic.co/elasticsearch/elasticsearch:8.12.2
docker run --name elasticsearch --net elastic -p 9200:9200 -e "discovery.type=single-node" -e "xpack.security.enabled=false" -t docker.elastic.co/elasticsearch/elasticsearch:8.12.2

docker pull docker.elastic.co/kibana/kibana:8.12.2
docker run --name kibana --net elastic -p 5601:5601 docker.elastic.co/kibana/kibana:8.12.2

For a secure environment, re-create the enrollment token and password for Kibana:

docker exec -it elasticsearch /usr/share/elasticsearch/bin/elasticsearch-reset-password -u elastic
docker exec -it elasticsearch /usr/share/elasticsearch/bin/elasticsearch-create-enrollment-token -s kibana

or use Docker Compose: create a file named elasticsearch.yml

version: "3"

services: 
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.9.2
    ports:
      - 9200:9200
      - 9300:9300
    environment:
      discovery.type: 'single-node'
      xpack.security.enabled: 'false'

  kibana:
    image: docker.elastic.co/kibana/kibana:8.9.2
    depends_on:
      - elasticsearch
    ports:
      - 5601:5601
    environment:
      NEWSFEED_ENABLED: 'false'
      TELEMETRY_OPTIN: 'false'
      TELEMETRY_ENABLED: 'false'
      SERVER_MAXPAYLOADBYTES: 4194304
      KIBANA_AUTOCOMPLETETIMEOUT: 3000
      KIBANA_AUTOCOMPLETETERMINATEAFTER: 2500000

and

docker compose -p elasticsearch -f elasticsearch.yml up -d