Probabilistic Retrieval Models

For further discussion we'll make two important assumptions:

  • the usefulness of a relevant document depends on the number of documents the user has already seen: the more documents the user has seen, the less useful each additional one is.

  • the relevance of \(D_i\) to \(Q\) is independent of the other documents \(D_j\) in the collection. Therefore we can estimate relevance for each document separately.

Notation

  • let \(R \in \{r, \neg r\}\) be a binary random variable that indicates relevance

  • let \(r\) represent the event that document \(D\) is relevant

  • let \(\neg r\) represent the event that \(D\) is not relevant

We need to estimate the probability that a document \(D\) is relevant to a query \(Q\). In other words, we need to find:

  • \(P(R=r|D, Q) \) - the probability that \(D\) is relevant to \(Q\)

  • \(P(R=\neg r| D, Q)\) - the probability that \(D\) is not relevant to \(Q\)

Applying Bayes' theorem, we can express these probabilities as:

  • \(P(R=r|D,Q)=\frac{P(D, Q|R=r)P(R=r)}{P(D,Q)}\)

  • \(P(R=\neg r| D,Q)=\frac{P(D,Q|R=\neg r)P(R=\neg r)}{P(D,Q)}\)
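
To make the two formulas concrete, here is a minimal numeric sketch in Python. The function name `relevance_posterior` and all input numbers are made up for illustration and do not come from any real collection. The sketch also assumes that the shared denominator \(P(D,Q)\) is obtained with the law of total probability, \(P(D,Q)=P(D,Q|R=r)P(R=r)+P(D,Q|R=\neg r)P(R=\neg r)\), which follows from \(R\) being binary.

```python
def relevance_posterior(lik_rel, lik_nonrel, prior_rel):
    """Return (P(R=r | D, Q), P(R=not r | D, Q)) using Bayes' theorem.

    lik_rel    -- P(D, Q | R=r), assumed to be given
    lik_nonrel -- P(D, Q | R=not r), assumed to be given
    prior_rel  -- P(R=r); P(R=not r) is taken as 1 - prior_rel
    """
    prior_nonrel = 1.0 - prior_rel

    # Shared denominator P(D, Q), via the law of total probability.
    evidence = lik_rel * prior_rel + lik_nonrel * prior_nonrel
    if evidence == 0:
        raise ValueError("P(D, Q) is zero; the posteriors are undefined")

    p_rel = lik_rel * prior_rel / evidence
    p_nonrel = lik_nonrel * prior_nonrel / evidence
    return p_rel, p_nonrel


# Hypothetical inputs, chosen only to illustrate the computation.
p_rel, p_nonrel = relevance_posterior(lik_rel=0.02, lik_nonrel=0.001, prior_rel=0.1)
print(f"P(R=r | D,Q)     = {p_rel:.4f}")
print(f"P(R=not r | D,Q) = {p_nonrel:.4f}")
```

Because both posteriors share the denominator \(P(D,Q)\), they always sum to 1, and comparing them only requires comparing the two numerators.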