Textual Data

Information retrieval refers to querying unstructured data. Topics such as keyword queries, document relevance, and mental mapping fall under this.

TF-IDF ranking

  • Term Frequency (TF) - measures how often a term \(t\) appears in the document \(d\). Writing \(n(t, d)\) for the raw number of occurrences, it's typically defined as \(TF(t, d) = \log(n(t, d) + 1)\), where the logarithm acts as a damping factor.
  • Inverse Document Frequency (IDF) - based on the ratio of the total number of documents in the collection, \(N\), to the number of documents that \(t\) appears in, \(N_d\). It's typically defined as \(IDF(t) = \log(N/N_d)\), where the logarithm again acts as a damping factor.
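
As a minimal sketch of the two definitions above, the following Python computes a damped term frequency and an inverse document frequency over a toy corpus. The function names, the whitespace tokenization, the sample documents, and the choice to return 0 for a term that appears in no document are all assumptions made for illustration; they are not part of the definitions.

```python
import math
from collections import Counter


def term_frequency(term, doc_tokens):
    """Damped TF: log(1 + raw count of `term` in the document)."""
    raw_count = Counter(doc_tokens)[term]
    return math.log(1 + raw_count)


def inverse_document_frequency(term, corpus):
    """IDF: log(N / N_d), with N documents in total and N_d containing `term`."""
    n_docs = len(corpus)
    n_containing = sum(1 for doc in corpus if term in doc)
    if n_containing == 0:
        # Assumption: a term seen in no document contributes nothing,
        # which also avoids dividing by zero.
        return 0.0
    return math.log(n_docs / n_containing)


# Toy corpus (hypothetical), tokenized by splitting on whitespace.
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "dogs and cats make good pets".split(),
]

tf_the = term_frequency("the", corpus[0])            # "the" occurs twice
idf_cat = inverse_document_frequency("cat", corpus)  # in 2 of 3 documents
idf_mat = inverse_document_frequency("mat", corpus)  # in 1 of 3 documents
```

Here the rarer term "mat" gets a higher IDF than "cat", which is the intended effect: terms that occur in few documents are treated as more informative.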

Relevance Ranking

\[r(d, Q) = \sum_{t \in Q} TF(t, d) \cdot IDF(t)\]

Other definitions take proximity of words into account, and ignore stop words.
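
The relevance sum can be sketched in Python as follows. This is an illustrative sketch, not a reference implementation: it assumes TF is the log of one plus the raw count, IDF is the log of total documents over documents containing the term, and that query terms absent from the whole collection are skipped. The function name `relevance`, the toy corpus, and the query are hypothetical, and the sketch does not handle word proximity or stop words.

```python
import math


def relevance(doc_tokens, query_terms, corpus):
    """r(d, Q): sum of TF(t, d) * IDF(t) over the query terms t."""
    n_docs = len(corpus)
    score = 0.0
    for term in query_terms:
        n_containing = sum(1 for doc in corpus if term in doc)
        if n_containing == 0:
            # Assumption: terms found in no document are skipped.
            continue
        tf = math.log(1 + doc_tokens.count(term))
        idf = math.log(n_docs / n_containing)
        score += tf * idf
    return score


# Toy corpus and query (hypothetical), tokenized on whitespace.
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "dogs and cats make good pets".split(),
]
query = ["cat", "mat"]

# Document indices ordered from most to least relevant to the query.
scores = [relevance(doc, query, corpus) for doc in corpus]
ranked = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
```

The first document matches both query terms and ranks highest, the second matches only "cat", and the third matches neither and scores zero.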