# Paperless-ngx Deployment — CT 111
## Overview
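Self-hosted document management system with multi-language OCR, deployed on CT 111 via Docker Compose and reachable at `paperless.spendlik.sk`. Documents, media, and Paperless application data are stored on the NAS; the PostgreSQL and Redis data stay in Docker named volumes on the container disk.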
| Property | Value |
|---|---|
| **Container** | CT 111 |
| **Hostname** | paperless |
| **IP** | 192.168.1.111 |
| **OS** | Debian 13 (privileged LXC, `nesting=1`) |
| **URL** | https://paperless.spendlik.sk |
| **Internal port** | 8000 |
| **Compose file** | `/opt/paperless/docker-compose.yml` |
| **NAS mount (host)** | `/mnt/pve/spendlik-nas/data/paperless` |
| **NAS mount (CT)** | `/mnt/paperless` |
---
## LXC Configuration
```
# /etc/pve/lxc/111.conf
arch: amd64
cores: 2
features: nesting=1
hostname: paperless
memory: 8192
mp1: /mnt/pve/spendlik-nas/data/paperless,mp=/mnt/paperless,shared=1
net0: name=eth0,bridge=vmbr0,gw=192.168.1.1,hwaddr=BC:24:11:A8:11:71,ip=192.168.1.111/24,type=veth
ostype: debian
rootfs: local-lvm:vm-111-disk-0,size=50G
startup: order=5,up=30
swap: 1024
```
> ⚠️ RAM is set to 8192 MB, raised from 4 GB to handle bulk OCR. Reduce it to 2048 MB once bulk imports are complete.
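
Before lowering the allocation, it can help to confirm what the container currently has. A minimal sketch, assuming it runs on the Proxmox host and that `pct config 111` prints the same `key: value` lines as the config above; the helper name `lxc_memory_mb` is hypothetical:

```bash
# Hypothetical helper: print the "memory" value (in MB) from LXC config
# text on stdin, e.g. the output of `pct config 111`.
lxc_memory_mb() {
  awk -F': *' '$1 == "memory" { print $2 }'
}

# Usage on the Proxmox host (not executed here):
#   pct config 111 | lxc_memory_mb     # current allocation in MB
#   pct set 111 --memory 2048          # lower it once bulk imports are done
```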
---
## NAS Directory Structure
The host directory `/mnt/pve/spendlik-nas/data/paperless` is bind-mounted in full into CT 111 at `/mnt/paperless`. Subdirectories:
| Path (inside CT) | Purpose |
|---|---|
| `/mnt/paperless/consume` | Drop files here for automatic ingestion |
| `/mnt/paperless/export` | Export destination |
| `/mnt/paperless/media` | Processed documents (originals, archive, thumbnails) |
| `/mnt/paperless/data` | Paperless application data (search index, classifier, etc.) |
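
If the NAS mount is missing or incomplete, Docker would bind-mount empty directories instead of the real data, so a quick pre-flight check may be worthwhile. A minimal sketch, run inside CT 111; the function name `check_paperless_dirs` is hypothetical and the directory list is taken from the table above:

```bash
# Hypothetical helper: report which expected subdirectories are missing
# under a root path (default /mnt/paperless).
# Prints one "missing: <path>" line per absent directory and returns
# non-zero if any are absent.
check_paperless_dirs() {
  root="${1:-/mnt/paperless}"
  missing=0
  for sub in consume export media data; do
    if [ ! -d "$root/$sub" ]; then
      echo "missing: $root/$sub"
      missing=1
    fi
  done
  return "$missing"
}

# Usage (not executed here):
#   check_paperless_dirs /mnt/paperless && echo "NAS layout looks complete"
```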
---
## Docker Compose
Located at `/opt/paperless/docker-compose.yml`:
```yaml
services:
  broker:
    image: redis:7
    restart: unless-stopped
    volumes:
      - redis_data:/data
  db:
    image: postgres:16
    restart: unless-stopped
    volumes:
      - pg_data:/var/lib/postgresql/data
    environment:
      POSTGRES_DB: paperless
      POSTGRES_USER: paperless
      POSTGRES_PASSWORD: <see Vaultwarden>
  webserver:
    image: ghcr.io/paperless-ngx/paperless-ngx:latest
    restart: unless-stopped
    user: root
    depends_on:
      - db
      - broker
    ports:
      - "8000:8000"
    volumes:
      - /mnt/paperless/data:/usr/src/paperless/data
      - /mnt/paperless/media:/usr/src/paperless/media
      - /mnt/paperless/export:/usr/src/paperless/export
      - /mnt/paperless/consume:/usr/src/paperless/consume
    environment:
      PAPERLESS_REDIS: redis://broker:6379
      PAPERLESS_DBHOST: db
      PAPERLESS_DBNAME: paperless
      PAPERLESS_DBUSER: paperless
      PAPERLESS_DBPASS: <see Vaultwarden>
      PAPERLESS_URL: https://paperless.spendlik.sk
      PAPERLESS_SECRET_KEY: <see Vaultwarden>
      PAPERLESS_TIME_ZONE: Europe/Bratislava
      PAPERLESS_OCR_LANGUAGE: slk+ces+rus+hun+deu+eng
      PAPERLESS_OCR_LANGUAGES: slk ces rus hun deu eng
volumes:
  redis_data:
  pg_data:
```
> `media` and `data` were originally Docker named volumes. They were migrated to NAS bind mounts after the container disk filled up during bulk OCR. See migration notes below.
---
## Docker Container Names
| Name | Image | Purpose |
|---|---|---|
| `paperless-webserver-1` | `ghcr.io/paperless-ngx/paperless-ngx:latest` | Main app + Celery worker + consumer |
| `paperless-db-1` | `postgres:16` | Database |
| `paperless-broker-1` | `redis:7` | Task queue |
---
## nginx Reverse Proxy (CT 101)
Config at `/etc/nginx/sites-available/paperless.spendlik.sk`:
```nginx
server {
    server_name paperless.spendlik.sk;
    location / {
        proxy_pass http://192.168.1.111:8000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
    listen 443 ssl; # managed by Certbot
    ssl_certificate /etc/letsencrypt/live/paperless.spendlik.sk/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/paperless.spendlik.sk/privkey.pem;
    include /etc/letsencrypt/options-ssl-nginx.conf;
    ssl_dhparam /etc/letsencrypt/ssl-dhparams.pem;
}
server {
    if ($host = paperless.spendlik.sk) {
        return 301 https://$host$request_uri;
    }
    listen 80;
    server_name paperless.spendlik.sk;
    return 404;
}
```
---
## Consuming Documents
### Automatic (inotify watcher)
Drop files into `/mnt/paperless/consume` — the consumer detects new files automatically via inotify and queues them. The consumer runs inside the `paperless-webserver-1` container.
### Manual trigger (for pre-existing files)
The inotify watcher only detects **new** file additions, not files already present when the container starts. To process existing files:
```bash
cd /opt/paperless
docker compose exec webserver python3 manage.py document_consumer --oneshot
```
> ⚠️ Flag is `--oneshot` (one word), not `--one-shot`.
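
The manual trigger can be wrapped so it only runs when something is actually waiting. A minimal sketch, assuming it runs inside CT 111 with the compose project in `/opt/paperless`; the function names `pending_count` and `consume_pending` are hypothetical, and `-T` is added to `docker compose exec` so the call also works without a terminal:

```bash
# Hypothetical helper: count regular files under a consume directory.
pending_count() {
  find "$1" -type f | wc -l | tr -d ' '
}

# Hypothetical helper: run the one-shot consumer only if files are waiting.
# The directory defaults to the CT-side consume path from this document.
consume_pending() {
  dir="${1:-/mnt/paperless/consume}"
  n=$(pending_count "$dir")
  if [ "$n" -gt 0 ]; then
    echo "$n file(s) pending, running one-shot consumer"
    (cd /opt/paperless &&
      docker compose exec -T webserver python3 manage.py document_consumer --oneshot)
  else
    echo "nothing to consume"
  fi
}
```

Calling `consume_pending` with no argument checks `/mnt/paperless/consume`; passing another path is mainly useful for trying the helper against a scratch directory.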
### If consumer process is not running
Check with:
```bash
docker compose exec webserver ps aux | grep consumer
```
If missing, restart the webserver container:
```bash
docker compose restart webserver
```
Then watch logs to confirm consumer starts:
```bash
docker compose logs webserver --tail=50 -f
```
Look for: `Using inotify to watch directory for changes: /usr/src/paperless/consume`
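
The log check can also be scripted instead of watched by eye. A minimal sketch, assuming the startup message keeps the wording quoted above; the helper name `consumer_started` is hypothetical:

```bash
# Hypothetical helper: succeed if the log text on stdin contains the
# consumer's inotify startup message.
consumer_started() {
  grep -q 'Using inotify to watch directory for changes'
}

# Usage from /opt/paperless (not executed here):
#   docker compose logs webserver --tail=200 | consumer_started \
#     && echo "consumer is watching" \
#     || echo "consumer startup line not found"
```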
---
## Supported File Types
Paperless-ngx supports PDF and common image formats (JPG, PNG, etc.). `.djvu` files are **not supported** and will be skipped with a warning.
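
When copying a mixed folder into the consume directory, unsupported files can be filtered out beforehand. A minimal sketch; the helper name `is_consumable` is hypothetical, and the extension list (PDF, JPG/JPEG, PNG, TIFF) is an assumption to adjust as needed:

```bash
# Hypothetical helper: succeed for file names whose extension is on the
# assumed allow-list (case-insensitive); fail for anything else, e.g. .djvu.
is_consumable() {
  case "$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')" in
    *.pdf|*.jpg|*.jpeg|*.png|*.tif|*.tiff) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage (not executed here): copy only accepted files from the current folder.
#   for f in ./*; do
#     is_consumable "$f" && cp "$f" /mnt/paperless/consume/
#   done
```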
---
## OCR Notes
- 6 languages configured: Slovak, Czech, Russian, Hungarian, German, English
- Tesseract warnings about "lots of diacritics" and "too few characters" are normal for old scanned magazines — not errors
- OCR is CPU-intensive; bulk imports require adequate RAM (8GB during bulk, can reduce to 2GB after)
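
To confirm that every configured OCR language is actually available, the Tesseract language list can be compared with the `PAPERLESS_OCR_LANGUAGE` value. A minimal sketch, assuming `tesseract` is on the path inside the `webserver` container; the helper name `missing_ocr_langs` is hypothetical:

```bash
# Hypothetical helper: print each language code from the '+'-separated
# spec in $1 that does not appear as a full line in the text on stdin.
missing_ocr_langs() {
  installed=$(cat)
  for lang in $(printf '%s' "$1" | tr '+' ' '); do
    printf '%s\n' "$installed" | grep -qx "$lang" || echo "$lang"
  done
}

# Usage from /opt/paperless (not executed here); no output means all present:
#   docker compose exec -T webserver tesseract --list-langs \
#     | missing_ocr_langs 'slk+ces+rus+hun+deu+eng'
```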
---
## Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| Files in consume folder not processing | Consumer process died (OOM kill) | `docker compose restart webserver` |
| HTTP 500 on web UI | Container disk full | Check disk: `df -h`; migrate volumes to NAS or resize disk |
| `chown: Invalid argument` in logs | NAS mount doesn't allow ownership changes | Harmless — files still process correctly |
| OOM kill of worker | Insufficient RAM during bulk OCR | `pct set 111 --memory 8192 --swap 1024` on Proxmox host |
| Tasks show as failed in UI | OOM kill mid-processing | Re-trigger with `--oneshot`; failed tasks can be cleared from UI |
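
Since a full container disk already caused one outage, a small usage check can flag the problem before the web UI starts returning HTTP 500. A minimal sketch; the function names `disk_use_pct` and `warn_if_full` and the default threshold of 90 are hypothetical:

```bash
# Hypothetical helper: print the used-space percentage (integer) of the
# filesystem that holds the path in $1, using POSIX `df -P` output.
disk_use_pct() {
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Hypothetical helper: warn and return non-zero when usage reaches the
# threshold in $2 (default 90, an arbitrary example value).
warn_if_full() {
  pct_used=$(disk_use_pct "$1")
  if [ "$pct_used" -ge "${2:-90}" ]; then
    echo "WARNING: $1 is ${pct_used}% full"
    return 1
  fi
  echo "OK: $1 is ${pct_used}% full"
}

# Usage inside CT 111 (not executed here):
#   warn_if_full /                 # container root disk
#   warn_if_full /mnt/paperless    # NAS mount
```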
---
## Deployment History & Key Events
1. **Initial deploy** — CT 111 created, Docker + Paperless stack deployed, NAS consume/export mounted via Proxmox bind mount
2. **Disk fill** — Container root disk (20GB) filled during bulk OCR of 500+ magazines; resized to 50GB (`pct resize 111 rootfs +30G`)
3. **OOM kills** — 2GB RAM insufficient for 6-language bulk OCR; raised to 4GB then 8GB
4. **NAS migration** — `media` and `data` Docker named volumes migrated to NAS bind mounts (`/mnt/paperless/media` and `/mnt/paperless/data`) to avoid future disk issues. Migration done via `docker cp`, without reprocessing documents.
5. **Bulk import** — 500+ scanned Czech/Slovak modelling magazines (Letecký Modelář, 1950s–1960s) imported
---
## Planned: Gemini Post-Processing
Future project to run nightly Gemini API post-processing on documents to improve OCR text, suggest tags, and improve titles. See `obsidian-vault/02 Projects/Gemini Post-Processing for Paperless.md`.