Python

Mini TF-IDF search engine over documentation

Indexes the Markdown files in a docs/ folder, computes a hand-written TF-IDF score with no third-party dependencies, and prints the four highest-ranked documents for a free-text query.
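
As a reading aid, the score computed by the code on this page can be written as follows. The notation is an assumption introduced here, not taken from the original: q is the query, d a document, tf(t, d) the number of occurrences of term t in d, |d| the number of tokens in d, df(t) the number of documents containing t, and N the total number of documents. Query terms that appear in no document are skipped.

```latex
\[
\operatorname{score}(q, d) \;=\; \sum_{\substack{t \in q \\ \mathrm{df}(t) > 0}}
  \frac{\mathrm{tf}(t, d)}{|d|} \, \ln\frac{N}{\mathrm{df}(t)}
\]
```

A term present in every document gets ln(N/N) = 0 and therefore contributes nothing to the ranking.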

Prerequisites

Python 3.9+ (standard library only)

Python
import math
import re
from collections import Counter
from pathlib import Path

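# Tokenize each Markdown file into lowercase words of 3+ characters.
docs = {p.name: re.findall(r"\w{3,}", p.read_text(encoding="utf-8").lower())
        for p in Path("docs").glob("*.md")}
N = len(docs)
# Document frequency: number of documents that contain each word.
df = Counter(mot for mots in docs.values() for mot in set(mots))

def score(requete, mots):
    """Sum the TF-IDF weights of the query terms for one document."""
    if not mots:  # empty document: avoid dividing by zero
        return 0.0
    tf = Counter(mots)
    return sum(tf[t] / len(mots) * math.log(N / df[t])
               for t in requete.lower().split() if df.get(t))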

requete = "configuration proxy timeout"
classement = sorted(((score(requete, mots), nom)
                     for nom, mots in docs.items()), reverse=True)

print(f'Search for "{requete}" across {N} documents')
print(f"{'score':>8}  document")
for s, nom in classement[:4]:
    print(f"{s:>8.4f}  {nom}")

Result

Search for "configuration proxy timeout" across 187 documents
   score  document
  0.0412  reseau-proxy-entreprise.md
  0.0287  troubleshooting-api.md
  0.0151  deploiement-prod.md
  0.0093  faq-integration.md
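
To try the scoring without a docs/ folder, the same idea can be sketched as a function over in-memory strings. This is a minimal sketch under stated assumptions, not the original snippet: the names `tokenize`, `rank`, and `sample` are hypothetical, the sample corpus is made up for illustration, and the query is tokenized with the same pattern as the documents (the snippet above splits the query on whitespace instead).

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase words of 3+ characters, same pattern as the snippet."""
    return re.findall(r"\w{3,}", text.lower())


def rank(query, texts, top=4):
    """Return up to `top` (score, name) pairs, best first."""
    docs = {name: tokenize(text) for name, text in texts.items()}
    n = len(docs)
    # Document frequency: number of documents containing each word.
    df = Counter(word for words in docs.values() for word in set(words))

    def score(words):
        if not words:  # empty document: nothing to score
            return 0.0
        tf = Counter(words)
        return sum(tf[t] / len(words) * math.log(n / df[t])
                   for t in tokenize(query) if t in df)

    return sorted(((score(words), name) for name, words in docs.items()),
                  reverse=True)[:top]


# Hypothetical sample corpus, for illustration only.
sample = {
    "proxy.md": "Proxy configuration and proxy timeout settings.",
    "deploy.md": "Deployment checklist for production servers.",
    "empty.md": "",
}
for s, name in rank("proxy timeout", sample):
    print(f"{s:>8.4f}  {name}")
```

Wrapping the logic in a function keeps N and the document frequencies local to one corpus, which makes the ranking easy to test on small inputs.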

Tags: TF-IDF, Search, NLP, Markdown
