Document Management — Rule-Based Auto-Organization

Python · Automation · Regex · File Systems

What Was Broken

Documents were stored inconsistently — no standard naming, no folder structure that made sense, duplicates everywhere. Finding a specific document meant searching manually through a mess. For an organization dealing with a lot of paperwork, that is a real time sink.

How It Was Built

The part I spent most time on was edge cases — not the happy path.

Rule-based categorization engine

Documents processed on ingestion — filename, extension, content keywords, and metadata parsed using regex to determine category, then file moved to appropriate folder automatically.

categorizer.py
```python
import re
from pathlib import Path


class DocumentCategorizer:
    # First matching rule wins, so order patterns from most to least specific
    RULES = [
        (r'invoice|receipt|payment', 'finance'),
        (r'contract|agreement|terms', 'legal'),
        (r'report|analysis|summary', 'reports'),
        (r'cv|resume|candidate', 'hr'),
    ]

    def categorize(self, filepath: Path) -> str:
        # extract_text is implemented elsewhere; it reads the document body
        text = self.extract_text(filepath)
        filename = filepath.stem.lower()

        # Join with a space so a match cannot span the filename/text boundary
        haystack = f"{filename} {text}"
        for pattern, category in self.RULES:
            if re.search(pattern, haystack):
                return category

        # Ambiguous — do not guess
        return 'review_queue'
```
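extract_text is left abstract above (the real version reads the document body). With a stub that returns no content, the rule matching can be exercised on filenames alone. A hypothetical, self-contained rerun of the class:

```python
import re
from pathlib import Path


class DocumentCategorizer:
    RULES = [
        (r'invoice|receipt|payment', 'finance'),
        (r'contract|agreement|terms', 'legal'),
        (r'report|analysis|summary', 'reports'),
        (r'cv|resume|candidate', 'hr'),
    ]

    def extract_text(self, filepath: Path) -> str:
        # Stub for demonstration: ignores the document body entirely
        return ''

    def categorize(self, filepath: Path) -> str:
        text = self.extract_text(filepath)
        filename = filepath.stem.lower()
        haystack = f"{filename} {text}"
        for pattern, category in self.RULES:
            if re.search(pattern, haystack):
                return category
        return 'review_queue'


categorizer = DocumentCategorizer()
print(categorizer.categorize(Path('2024_Invoice_Acme.pdf')))  # finance
print(categorizer.categorize(Path('holiday_photos.zip')))     # review_queue
```

Anything that matches no pattern lands in the review queue rather than a guessed category.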

Explicit edge case handling

Naming conflicts get timestamp suffix. Duplicates are flagged for review, not silently overwritten. Ambiguous files go to a review queue. The system fails visibly, not quietly.

file_handler.py
```python
import shutil
from datetime import datetime
from pathlib import Path


def safe_move(src: Path, dest_dir: Path) -> Path:
    # files_identical, log_duplicate, and log_moved are module helpers
    dest = dest_dir / src.name

    # Handle naming conflict
    if dest.exists():
        if files_identical(src, dest):
            # Exact duplicate — flag, don't overwrite
            log_duplicate(src, dest)
            return dest
        else:
            # Different file, same name — add timestamp
            timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
            dest = dest_dir / f"{src.stem}_{timestamp}{src.suffix}"

    shutil.move(src, dest)
    log_moved(src, dest)
    return dest
```

What Changed

Document storage went from organized-by-whoever-saved-it to consistently structured. Organization is now consistent enough to build search and reporting on top of it.

• Organization: manual, inconsistent → 100% structured
• Silent data loss: possible → visible failures only
"The best automation fails loudly. Silent failures are worse than no automation — you think the problem is solved when it is not."

Common Questions

What happens to files that no rule matches?

They go to a review queue with a log entry explaining why categorization failed — what patterns were checked, what was found. This gives someone the information they need to either handle it manually or add a new rule. Nothing gets silently dropped.
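A minimal sketch of what such a review-queue log entry could look like; the field names and the JSON-lines format are illustrative assumptions, not taken from the original:

```python
import json
from datetime import datetime, timezone
from pathlib import Path

# Illustrative rule list mirroring the categorizer; the real entry would be
# built from the same RULES the engine just checked.
RULES = [
    (r'invoice|receipt|payment', 'finance'),
    (r'contract|agreement|terms', 'legal'),
    (r'report|analysis|summary', 'reports'),
    (r'cv|resume|candidate', 'hr'),
]


def review_queue_entry(filepath: Path) -> str:
    # One JSON line per unmatched file: enough context for a human to either
    # file it manually or add a new rule
    entry = {
        'file': str(filepath),
        'queued_at': datetime.now(timezone.utc).isoformat(),
        'reason': 'no rule matched',
        'patterns_checked': [pattern for pattern, _ in RULES],
    }
    return json.dumps(entry)
```

Logging the patterns that were checked is what makes the failure actionable: the reviewer sees immediately whether a new rule is needed or an existing one is too narrow.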
Would this scale to a much larger document volume?

The core logic scales fine — it is stateless per document. The bottleneck at scale would be the rule engine's complexity as categories grow. At a certain point you would want to replace the regex rules with an ML classifier trained on labeled documents. For the scale this was built for, regex is the right tool — simple, fast, explainable.