Documents were stored inconsistently — no standard naming, no folder structure that made sense, duplicates everywhere. Finding a specific document meant searching manually through a mess. For an organization dealing with a lot of paperwork, that is a real time sink.
The part I spent the most time on was edge cases, not the happy path.
Documents are processed on ingestion: the filename, extension, content keywords, and metadata are parsed with regexes to determine a category, then the file is moved to the appropriate folder automatically.
```python
import re
from pathlib import Path


class DocumentCategorizer:
    RULES = [
        (r'invoice|receipt|payment', 'finance'),
        (r'contract|agreement|terms', 'legal'),
        (r'report|analysis|summary', 'reports'),
        (r'cv|resume|candidate', 'hr'),
    ]

    def categorize(self, filepath: Path) -> str:
        text = self.extract_text(filepath)
        filename = filepath.stem.lower()
        for pattern, category in self.RULES:
            # Separate filename and text so a pattern can't match
            # across the concatenation boundary
            if re.search(pattern, filename + ' ' + text):
                return category
        # Ambiguous — do not guess
        return 'review_queue'
```

Naming conflicts get a timestamp suffix. Duplicates are flagged for review, not silently overwritten. Ambiguous files go to a review queue. The system fails visibly, not quietly.
```python
import shutil
from datetime import datetime
from pathlib import Path


def safe_move(src: Path, dest_dir: Path) -> Path:
    dest = dest_dir / src.name
    # Handle naming conflict
    if dest.exists():
        if files_identical(src, dest):
            # Exact duplicate — flag, don't overwrite
            log_duplicate(src, dest)
            return dest
        else:
            # Different file, same name — add timestamp
            timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
            dest = dest_dir / f"{src.stem}_{timestamp}{src.suffix}"
    shutil.move(src, dest)
    log_moved(src, dest)
    return dest
```

Document storage went from organized-by-whoever-saved-it to consistently structured, and that consistency is now a foundation solid enough to build search and reporting on.
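`safe_move` relies on a `files_identical` helper whose implementation isn't shown. A minimal sketch of one reasonable version, using a cheap size check before comparing SHA-256 digests — the hashing approach is my assumption, not necessarily what the real helper does:

```python
import hashlib
from pathlib import Path


def files_identical(a: Path, b: Path, chunk_size: int = 65536) -> bool:
    # Cheap check first: different sizes can never be identical
    if a.stat().st_size != b.stat().st_size:
        return False
    # Same size: compare content digests, reading in chunks so large
    # files don't have to fit in memory
    digests = []
    for path in (a, b):
        h = hashlib.sha256()
        with path.open('rb') as f:
            for chunk in iter(lambda: f.read(chunk_size), b''):
                h.update(chunk)
        digests.append(h.digest())
    return digests[0] == digests[1]
```

The size short-circuit means most non-duplicates are rejected without reading either file.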
"The best automation fails loudly. Silent failures are worse than no automation — you think the problem is solved when it is not."