Textual Data¶
Information retrieval refers to querying on unstructured data. Stuff like keyword queries, document relevance, and mental mapping come under this.
TF-IDF ranking¶
- Term Frequency (TF) - ranks the number of times a term \(t\) appears in the document \(d\). It's typically defined as \(\log(TF(t, d)+1)\) as a damping factor.
- Inverse Document Frequency (IDF) - IDF is defined as the ratio of the total number of documents in a collection that \(t\) appears in. It's typically defined as \(\log(N/N_d)\) as a damping factor.
Relevance Ranking¶
\(\(r(d, Q) = \sum_{t \in Q} TF(t, d) * IDF(t)\)\) Other definitions take proximity of words into account, and ignore stop words.