diff --git a/111_paperless_deployment.md b/111_paperless_deployment.md
new file mode 100644
index 0000000..f5d985f
--- /dev/null
+++ b/111_paperless_deployment.md
@@ -0,0 +1,230 @@
+# Paperless-ngx Deployment — CT 111
+
+## Overview
+
+Self-hosted document management system with multi-language OCR. Deployed on CT 111 via Docker Compose, accessible at `paperless.spendlik.sk`. All documents, media, and data stored on NAS.
+
+| Property | Value |
+|---|---|
+| **Container** | CT 111 |
+| **Hostname** | paperless |
+| **IP** | 192.168.1.111 |
+| **OS** | Debian 13 (privileged LXC, `nesting=1`) |
+| **URL** | https://paperless.spendlik.sk |
+| **Internal port** | 8000 |
+| **Compose file** | `/opt/paperless/docker-compose.yml` |
+| **NAS mount (host)** | `/mnt/pve/spendlik-nas/data/paperless` |
+| **NAS mount (CT)** | `/mnt/paperless` |
+
+---
+
+## LXC Configuration
+
+```
+# /etc/pve/lxc/111.conf
+arch: amd64
+cores: 2
+features: nesting=1
+hostname: paperless
+memory: 8192
+mp1: /mnt/pve/spendlik-nas/data/paperless,mp=/mnt/paperless,shared=1
+net0: name=eth0,bridge=vmbr0,gw=192.168.1.1,hwaddr=BC:24:11:A8:11:71,ip=192.168.1.111/24,type=veth
+ostype: debian
+rootfs: local-lvm:vm-111-disk-0,size=50G
+startup: order=5,up=30
+swap: 1024
+```
+
+> ⚠️ RAM is set to 8192MB — this was raised from 4GB to handle bulk OCR. Should be reduced to 2048MB once bulk imports are complete.
+
+---
+
+## NAS Directory Structure
+
+The entire `/mnt/pve/spendlik-nas/data/paperless` is bind-mounted into CT 111 at `/mnt/paperless`. Subdirectories:
+
+| Path (inside CT) | Purpose |
+|---|---|
+| `/mnt/paperless/consume` | Drop files here for automatic ingestion |
+| `/mnt/paperless/export` | Export destination |
+| `/mnt/paperless/media` | Processed documents (originals, archive, thumbnails) |
+| `/mnt/paperless/data` | Paperless application data (search index, classifier, etc.) |
+
+---
+
+## Docker Compose
+
+Located at `/opt/paperless/docker-compose.yml`:
+
+```yaml
+services:
+  broker:
+    image: redis:7
+    restart: unless-stopped
+    volumes:
+      - redis_data:/data
+
+  db:
+    image: postgres:16
+    restart: unless-stopped
+    volumes:
+      - pg_data:/var/lib/postgresql/data
+    environment:
+      POSTGRES_DB: paperless
+      POSTGRES_USER: paperless
+      POSTGRES_PASSWORD:
+
+  webserver:
+    image: ghcr.io/paperless-ngx/paperless-ngx:latest
+    restart: unless-stopped
+    user: root
+    depends_on:
+      - db
+      - broker
+    ports:
+      - "8000:8000"
+    volumes:
+      - /mnt/paperless/data:/usr/src/paperless/data
+      - /mnt/paperless/media:/usr/src/paperless/media
+      - /mnt/paperless/export:/usr/src/paperless/export
+      - /mnt/paperless/consume:/usr/src/paperless/consume
+    environment:
+      PAPERLESS_REDIS: redis://broker:6379
+      PAPERLESS_DBHOST: db
+      PAPERLESS_DBNAME: paperless
+      PAPERLESS_DBUSER: paperless
+      PAPERLESS_DBPASS:
+      PAPERLESS_URL: https://paperless.spendlik.sk
+      PAPERLESS_SECRET_KEY:
+      PAPERLESS_TIME_ZONE: Europe/Bratislava
+      PAPERLESS_OCR_LANGUAGE: slk+ces+rus+hun+deu+eng
+      PAPERLESS_OCR_LANGUAGES: slk ces rus hun deu eng
+
+volumes:
+  redis_data:
+  pg_data:
+```
+
+> ℹ️ `media` and `data` were originally Docker named volumes. They were migrated to NAS bind mounts after the container disk filled up during bulk OCR. See migration notes below.
+
+---
+
+## Docker Container Names
+
+| Name | Image | Purpose |
+|---|---|---|
+| `paperless-webserver-1` | `ghcr.io/paperless-ngx/paperless-ngx:latest` | Main app + Celery worker + consumer |
+| `paperless-db-1` | `postgres:16` | Database |
+| `paperless-broker-1` | `redis:7` | Task queue |
+
+---
+
+## nginx Reverse Proxy (CT 101)
+
+Config at `/etc/nginx/sites-available/paperless.spendlik.sk`:
+
+```nginx
+server {
+    server_name paperless.spendlik.sk;
+
+    location / {
+        proxy_pass http://192.168.1.111:8000;
+        proxy_set_header Host $host;
+        proxy_set_header X-Real-IP $remote_addr;
+        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
+        proxy_set_header X-Forwarded-Proto $scheme;
+    }
+
+    listen 443 ssl; # managed by Certbot
+    ssl_certificate /etc/letsencrypt/live/paperless.spendlik.sk/fullchain.pem;
+    ssl_certificate_key /etc/letsencrypt/live/paperless.spendlik.sk/privkey.pem;
+    include /etc/letsencrypt/options-ssl-nginx.conf;
+    ssl_dhparam /etc/letsencrypt/ssl-dhparams.pem;
+}
+
+server {
+    if ($host = paperless.spendlik.sk) {
+        return 301 https://$host$request_uri;
+    }
+    listen 80;
+    server_name paperless.spendlik.sk;
+    return 404;
+}
+```
+
+---
+
+## Consuming Documents
+
+### Automatic (inotify watcher)
+Drop files into `/mnt/paperless/consume` — the consumer detects new files automatically via inotify and queues them. The consumer runs inside the `paperless-webserver-1` container.
+
+### Manual trigger (for pre-existing files)
+The inotify watcher only detects **new** file additions, not files already present when the container starts. To process existing files:
+
+```bash
+cd /opt/paperless
+docker compose exec webserver python3 manage.py document_consumer --oneshot
+```
+
+> ⚠️ Flag is `--oneshot` (one word), not `--one-shot`.
+
+### If consumer process is not running
+Check with:
+```bash
+docker compose exec webserver ps aux | grep consumer
+```
+
+If missing, restart the webserver container:
+```bash
+docker compose restart webserver
+```
+
+Then watch logs to confirm consumer starts:
+```bash
+docker compose logs webserver --tail=50 -f
+```
+
+Look for: `Using inotify to watch directory for changes: /usr/src/paperless/consume`
+
+---
+
+## Supported File Types
+
+Paperless-ngx supports PDF and common image formats (JPG, PNG, etc.). `.djvu` files are **not supported** and will be skipped with a warning.
+
+---
+
+## OCR Notes
+
+- 6 languages configured: Slovak, Czech, Russian, Hungarian, German, English
+- Tesseract warnings about "lots of diacritics" and "too few characters" are normal for old scanned magazines — not errors
+- OCR is CPU-intensive; bulk imports require adequate RAM (8GB during bulk, can reduce to 2GB after)
+
+---
+
+## Troubleshooting
+
+| Symptom | Cause | Fix |
+|---|---|---|
+| Files in consume folder not processing | Consumer process died (OOM kill) | `docker compose restart webserver` |
+| HTTP 500 on web UI | Container disk full | Check disk: `df -h`; migrate volumes to NAS or resize disk |
+| `chown: Invalid argument` in logs | NAS mount doesn't allow ownership changes | Harmless — files still process correctly |
+| OOM kill of worker | Insufficient RAM during bulk OCR | `pct set 111 --memory 8192 --swap 1024` on Proxmox host |
+| Tasks show as failed in UI | OOM kill mid-processing | Re-trigger with `--oneshot`; failed tasks can be cleared from UI |
+
+---
+
+## Deployment History & Key Events
+
+1. **Initial deploy** — CT 111 created, Docker + Paperless stack deployed, NAS consume/export mounted via Proxmox bind mount
+2. **Disk fill** — Container root disk (20GB) filled during bulk OCR of 500+ magazines; resized to 50GB (`pct resize 111 rootfs +30G`)
+3. **OOM kills** — 2GB RAM insufficient for 6-language bulk OCR; raised to 4GB then 8GB
+4. **NAS migration** — `media` and `data` Docker named volumes migrated to NAS bind mounts (`/mnt/paperless/media` and `/mnt/paperless/data`) to avoid future disk issues. Migration done via `docker cp` without reprocessing.
+5. **Bulk import** — 500+ scanned Czech/Slovak modelling magazines (Letecký Modelář, 1950s–1960s) imported
+
+---
+
+## Planned: Gemini Post-Processing
+
+Future project to run nightly Gemini API post-processing on documents to improve OCR text, suggest tags, and improve titles. See `obsidian-vault/02 Projects/Gemini Post-Processing for Paperless.md`.
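
The manual trigger described under "Consuming Documents" could be wrapped in a small helper that first checks what is actually waiting in the consume folder. The following is only a minimal sketch and not part of the deployed stack. It reuses the consume path and compose directory documented above; the function names, the `SUPPORTED_EXTENSIONS` set (an illustrative subset, not the full list Paperless-ngx accepts), the top-level-only scan, and the `-T` option on `docker compose exec` (no pseudo-TTY, for non-interactive runs) are assumptions made for illustration.

```python
"""Sketch: list files waiting in the consume folder and build the --oneshot command."""
from pathlib import Path
import subprocess

# Assumption: illustrative subset only; Paperless-ngx accepts more image formats.
SUPPORTED_EXTENSIONS = {".pdf", ".jpg", ".jpeg", ".png"}

CONSUME_DIR = Path("/mnt/paperless/consume")  # consume path inside CT 111
COMPOSE_DIR = Path("/opt/paperless")          # directory holding docker-compose.yml


def split_pending(consume_dir: Path) -> tuple[list[Path], list[Path]]:
    """Return (supported, unsupported) regular files in the top level of consume_dir."""
    supported: list[Path] = []
    unsupported: list[Path] = []
    for path in sorted(consume_dir.iterdir()):
        if not path.is_file():
            continue  # subdirectories are ignored in this sketch
        if path.suffix.lower() in SUPPORTED_EXTENSIONS:
            supported.append(path)
        else:
            unsupported.append(path)  # e.g. .djvu files
    return supported, unsupported


def oneshot_command() -> list[str]:
    """Build the manual-trigger command; the flag is --oneshot, one word."""
    return [
        "docker", "compose", "exec", "-T", "webserver",
        "python3", "manage.py", "document_consumer", "--oneshot",
    ]


def main() -> None:
    """Report pending files and run the one-shot consumer if any are supported."""
    supported, unsupported = split_pending(CONSUME_DIR)
    for path in unsupported:
        print(f"not in the supported set of this sketch: {path.name}")
    if supported:
        print(f"{len(supported)} file(s) pending, running one-shot consumer")
        subprocess.run(oneshot_command(), cwd=COMPOSE_DIR, check=True)
    else:
        print("nothing to consume")
```

Calling `main()` from a shell on CT 111 would report the pending files and only invoke Docker when there is something to process. Keeping `split_pending` separate from the Docker call means the file check can be exercised on any directory without a running stack.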