
# Architecture & Design Decisions

This document explains the deeper design choices behind cg.cx - the trade-offs, threat models, and engineering rationale that shaped the system.

---

## Why XChaCha20-Poly1305 over AES-GCM?

We chose **XChaCha20-Poly1305** (via libsodium's `crypto_secretstream_xchacha20poly1305`) as the bulk encryption primitive for several reasons:

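1. **Negligible nonce-collision risk**: AES-GCM's security collapses catastrophically if a nonce is ever reused under the same key, and its 96-bit nonce makes random nonce generation risky at scale. XChaCha20 uses a 192-bit nonce, so randomly generated nonces collide with negligible probability even across billions of files. This is not nonce-misuse resistance in the formal sense (a reused nonce would still be fatal), but it removes an entire class of operator error.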
2. **No hardware dependency**: AES-GCM performance relies heavily on AES-NI. XChaCha20 performs well on all platforms - including older or virtualized CPUs where AES-NI may be unavailable or disabled.
3. **Streaming integrity**: libsodium's `secretstream` API provides built-in chunked authenticated encryption with `Message` and `Final` tags. This gives us streaming decryption with per-chunk integrity checks without inventing our own framing protocol.
4. **Simpler key management**: Because nonce collisions are not a practical concern, we can generate a fresh random key for every file without tracking nonce counters or key lifecycles.
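
To put the nonce-size argument in numbers, the sketch below applies the birthday bound p ≈ n² / 2^(b+1) to random nonces of b bits. The function name is hypothetical and the bound is an approximation for a single key; since cg.cx uses a fresh key per file, the real margin is larger still.

```rust
/// Approximate log2 of the probability that at least two of `2^log2_messages`
/// uniformly random nonces collide, using the birthday bound
/// p ≈ n² / 2^(bits + 1). Only meaningful while the result is well below 0.
fn log2_collision_probability(log2_messages: u32, nonce_bits: u32) -> i64 {
    2 * i64::from(log2_messages) - i64::from(nonce_bits) - 1
}
```

For 2^32 messages under one key, this gives roughly 2^-33 for a 96-bit nonce (AES-GCM) and 2^-129 for a 192-bit nonce (XChaCha20).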

AES is still present in the system - we use **AES-256-KW** (Key Wrap) to encrypt the per-file content keys (CEKs) with the master key. AES-KW was chosen because it is a standard, deterministic, and widely audited key-wrapping algorithm with built-in integrity.

---

## Why SQLite over PostgreSQL?

For a self-hosted, single-tenant service handling encrypted file metadata, **SQLite** is the correct default:

1. **Operational simplicity**: No separate database server to install, upgrade, or network-secure. A single `.sqlite` file is trivial to back up, replicate, or inspect.
2. **WAL mode performance**: With `PRAGMA journal_mode = WAL`, SQLite handles concurrent readers and a single writer efficiently - enough for a bot + web server pair.
3. **Schema simplicity**: The schema is small (10 tables, 7 migration files). The overhead of a client/server RDBMS is unjustified.
4. **Deployment footprint**: Ideal for running on a small VPS or even an embedded edge device without container orchestration.

If future requirements demand horizontal scaling or heavy analytics, the repository pattern in `cgcx-db` makes it straightforward to swap in PostgreSQL without touching the bot or server code.
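
A minimal sketch of what such a repository boundary can look like is shown here. Every name in it (`FileMeta`, `FileRepository`, `InMemoryRepo`, `register_upload`) is hypothetical; the actual `cgcx-db` types are not shown in this document. The point is only that callers depend on a trait, so a different backend can be substituted behind it.

```rust
use std::collections::HashMap;

/// Hypothetical metadata record; not the real `cgcx-db` schema.
#[derive(Clone, Debug, PartialEq)]
pub struct FileMeta {
    pub cxid: String,
    pub original_name: String,
    pub mime_type: String,
}

#[derive(Debug, PartialEq)]
pub enum RepoError {
    DuplicateId,
}

/// Hypothetical repository trait: callers depend on this, not on SQLite.
pub trait FileRepository {
    fn insert(&mut self, meta: FileMeta) -> Result<(), RepoError>;
    fn find(&self, cxid: &str) -> Option<FileMeta>;
}

/// In-memory stand-in for a backend. A SQLite or PostgreSQL backend
/// would implement the same trait.
#[derive(Default)]
pub struct InMemoryRepo {
    rows: HashMap<String, FileMeta>,
}

impl FileRepository for InMemoryRepo {
    fn insert(&mut self, meta: FileMeta) -> Result<(), RepoError> {
        if self.rows.contains_key(&meta.cxid) {
            return Err(RepoError::DuplicateId);
        }
        self.rows.insert(meta.cxid.clone(), meta);
        Ok(())
    }

    fn find(&self, cxid: &str) -> Option<FileMeta> {
        self.rows.get(cxid).cloned()
    }
}

/// Application code is written against the trait only.
pub fn register_upload(
    repo: &mut dyn FileRepository,
    meta: FileMeta,
) -> Result<(), RepoError> {
    repo.insert(meta)
}
```

Swapping the backend then means providing another `impl FileRepository`, while functions such as `register_upload` stay unchanged.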

---

## Why a Modular 10-Crate Workspace?

The crate graph was designed to enforce architectural boundaries at compile time:

```
cgcx-core
   ▲
   ├── cgcx-config
   ├── cgcx-crypto
   ├── cgcx-db
   ├── cgcx-storage
   ├── cgcx-content-typing
   │       ▲
   │       └── cgcx-file-pipeline
   ├── cgcx-moderation
   │
   └── binaries: cgcx-bot, cgcx-server
```

- **cgcx-core** sits at the root and contains only pure data types. It has no I/O dependencies, making it safe to import anywhere.
- **cgcx-crypto** depends only on `cgcx-core`. It is side-effect-free and easy to property-test.
- **cgcx-db** and **cgcx-storage** are I/O crates but know nothing about Telegram or HTTP.
- **cgcx-file-pipeline** composes crypto, storage, typing, and DB into the upload workflow.
- The **binaries** are thin shells that wire configuration to the library crates.

This structure turns layering violations into build errors: `cgcx-db` cannot invoke Telegram API code because it has no dependency on it, and HTTP handlers are expected to reach the filesystem only through the storage abstraction rather than touching it directly.

---

## Streaming Design for Large Files

Uploads from Telegram are bounded by Telegram's own file size limits (currently 2 GB for bots that use a local Bot API server; the hosted Bot API limits bot downloads to 20 MB), but we still treat streaming as a first-class concern:

### Upload Path

1. The bot downloads the file into a `Vec<u8>` in memory.
2. The file pipeline encrypts the data in 1 MiB chunks, writing ciphertext directly to a temp file on disk.
3. After the final chunk is written and flushed, the temp file is atomically renamed to its final destination.
4. Only metadata (original name, MIME type, wrapped key, BLAKE3 hash) hits the database.

Because of step 1, the plaintext is still held in memory in full, but the ciphertext never is: even a 1 GB upload does not require a second 1 GB contiguous allocation for ciphertext.
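
Steps 2 and 3 can be sketched with the standard library alone. This is an illustration under assumptions, not the pipeline's code: `store_chunked` and the `.tmp` naming are hypothetical, and `encrypt_chunk` is a placeholder for the real secretstream call (it receives the chunk and whether it is the final one).

```rust
use std::fs::{self, File};
use std::io::{self, BufWriter, Write};
use std::path::Path;

/// Chunk size from the upload path above (1 MiB).
const CHUNK_SIZE: usize = 1024 * 1024;

/// Encrypts `plaintext` chunk by chunk into a temp file next to
/// `final_path`, then renames the temp file into place.
/// Returns the number of bytes written.
/// Note: an empty input produces no chunks here; a real implementation
/// would still need to emit a final-tagged chunk.
fn store_chunked<F>(
    plaintext: &[u8],
    final_path: &Path,
    mut encrypt_chunk: F,
) -> io::Result<u64>
where
    F: FnMut(&[u8], bool) -> Vec<u8>,
{
    // Hypothetical temp-file naming: same path with a `.tmp` extension.
    let tmp_path = final_path.with_extension("tmp");
    let mut written = 0u64;
    {
        let mut out = BufWriter::new(File::create(&tmp_path)?);
        let total_chunks = plaintext.chunks(CHUNK_SIZE).count();
        for (i, chunk) in plaintext.chunks(CHUNK_SIZE).enumerate() {
            let is_final = i + 1 == total_chunks;
            // Only one chunk of ciphertext is in memory at a time.
            let ciphertext = encrypt_chunk(chunk, is_final);
            out.write_all(&ciphertext)?;
            written += ciphertext.len() as u64;
        }
        // Flush the buffer and ask the OS to persist the data before renaming.
        out.flush()?;
        out.get_ref().sync_all()?;
    }
    // Rename replaces the destination in one step when both paths are on
    // the same filesystem (atomic on POSIX systems).
    fs::rename(&tmp_path, final_path)?;
    Ok(written)
}
```

Keeping the temp file in the same directory as the destination matters: a rename across filesystems is not atomic and may fail outright.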

### Download Path

1. The Axum handler spawns a Tokio task that opens the encrypted file.
2. It reads the 24-byte secretstream header, unwraps the CEK, and initializes a `DecryptStream`.
3. A bounded MPSC channel (`capacity = 4`) decouples disk I/O from the HTTP response stream.
4. Ciphertext is read from disk in ~1 MiB chunks (1 MiB of plaintext plus the 17-byte per-chunk secretstream overhead), decrypted, and sent through the channel.
5. Axum's `Body::from_stream` forwards plaintext chunks to the client as they are produced.

If the client disconnects mid-stream, the response body (and with it the receiver half of the channel) is dropped. The next `send` in the decryption task then fails, and the task exits cleanly. No full-file buffering occurs on the server.
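
The backpressure and disconnect behavior can be illustrated with a standard-library analogue. The sketch below uses `std::sync::mpsc::sync_channel` and a thread as stand-ins for the Tokio MPSC channel and task; `spawn_decrypt_task` is a hypothetical name, and the "decryption" is a placeholder that fabricates chunks.

```rust
use std::sync::mpsc::{sync_channel, Receiver};
use std::thread::{self, JoinHandle};

/// Channel capacity from the download path above.
const CHANNEL_CAPACITY: usize = 4;

/// Spawns a producer that emits `total_chunks` placeholder chunks through
/// a bounded channel. Returns the receiving half and a handle that yields
/// how many chunks were handed to the channel successfully.
fn spawn_decrypt_task(
    total_chunks: usize,
    chunk_len: usize,
) -> (Receiver<Vec<u8>>, JoinHandle<usize>) {
    let (tx, rx) = sync_channel::<Vec<u8>>(CHANNEL_CAPACITY);
    let handle = thread::spawn(move || {
        let mut sent = 0;
        for i in 0..total_chunks {
            // Placeholder for: read a ciphertext chunk, decrypt, verify tag.
            let plaintext = vec![i as u8; chunk_len];
            // `send` blocks while the channel is full (backpressure) and
            // returns an error once the receiver has been dropped
            // (the client went away).
            if tx.send(plaintext).is_err() {
                break;
            }
            sent += 1;
        }
        sent
    });
    (rx, handle)
}
```

With a capacity of 4 and ~1 MiB chunks, at most a few megabytes of plaintext are in flight per download, regardless of file size.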

---

## Security Threat Model

### What We Protect Against

| Threat | Mitigation |
|--------|------------|
| **Server compromise (passive)** | All files are encrypted at rest with per-file keys. An attacker with disk access cannot read plaintext without the master key. |
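| **Database leak** | The database contains only wrapped keys, ciphertext and plaintext hashes, and metadata. It does not contain plaintext or unwrapped CEKs. A leaked plaintext hash can confirm whether a known file was uploaded, but it does not reveal unknown content. |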
| **Ciphertext tampering** | XChaCha20-Poly1305 authenticates every chunk. Tampered files fail decryption and the stream aborts. |
| **Brute-force password guessing** | Per-content passwords are hashed with bcrypt. Rate limiting on `/api/content/:cxid/verify-password` slows online attacks. |
| **Cookie forgery** | Password session cookies include a BLAKE3 MAC keyed by the master key. Forging a cookie requires knowledge of the master key. |
| **Replay / enumeration** | Content IDs are 12-character random strings with ~71 bits of entropy. They are not sequential. |
| **Malicious uploads** | Content typing flags executable, HTML, and script MIME types. The frontend refuses to inline dangerous files. |
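
The entropy figure for content IDs can be checked with a short calculation. The sketch assumes a 62-symbol alphanumeric alphabet (`a-z`, `A-Z`, `0-9`); the document does not state the alphabet, but that size is consistent with ~71 bits for 12 characters. The function name is hypothetical.

```rust
/// Entropy in bits of a uniformly random string of `length` symbols
/// drawn from an alphabet of `alphabet_size` symbols.
fn id_entropy_bits(alphabet_size: u32, length: u32) -> f64 {
    f64::from(length) * f64::from(alphabet_size).log2()
}
```

Under that assumption, `id_entropy_bits(62, 12)` is about 71.45 bits, so guessing a valid ID by enumeration is impractical as long as IDs come from a cryptographically secure random source.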

### What We Do Not Protect Against

| Threat | Rationale |
|--------|-----------|
| **Active server compromise (key extraction)** | If an attacker gains code execution and reads the master key from memory or env, they can decrypt all content. This is an inherent limitation of server-side encryption. |
| **Telegram MitM** | We trust Telegram's bot API transport (HTTPS) and file CDN. |
| **Client-side malware** | The user's browser or device may be compromised; we cannot protect plaintext after decryption. |
| **Denial of Service** | Large uploads and high request volumes can exhaust disk or bandwidth. Rate limiting and upload size caps mitigate but do not eliminate this risk. |

### Trust Boundaries

```
[User Device] --HTTPS--> [Telegram Cloud] --HTTPS--> [cg.cx Bot]
                                              |
[Browser] <--HTTPS--> [cg.cx Server] <--------┘
       |
   Decrypted plaintext rendered in browser
```

The **cg.cx server** is a trusted party for decryption and delivery. It is not a true "end-to-end" system in the Signal sense, because the server must unwrap keys to stream content to browsers that do not possess the master key. The architecture prioritizes **usable sharing** (anyone with a link can view) over **true E2EE** (which would require client-side JavaScript crypto and key distribution).

---

## Hashing for Deduplication and Blacklist

`cgcx-crypto` computes a **BLAKE3 hash over the ciphertext stream** (including the secretstream header) for tamper detection. This hash is stored per-file in `content_files.encrypted_hash`.

In addition, the file pipeline computes a **plaintext BLAKE3 hash** during ingestion:
1. A running hash of the plaintext chunks is computed alongside encryption.
2. The resulting `plaintext_hash` is stored in `content_files` and used for deduplication: when identical plaintext is uploaded, the existing encrypted file is reused and its `ref_count` is incremented.
3. A `hash_blacklist` table (migration `007_hash_blacklist.sql`) allows moderators to block re-uploads of known-banned content by its plaintext hash. The pipeline checks this blacklist before storing any new file and rejects blocked content with a `BlockedHash` error.
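
The ingestion decision described in the list above can be sketched with in-memory collections standing in for the `content_files` and `hash_blacklist` tables. Apart from `BlockedHash`, `plaintext_hash`, and `ref_count`, all names here are hypothetical, and the 32-byte hash is taken as an input rather than computed.

```rust
use std::collections::{HashMap, HashSet};

/// A 32-byte digest, the output size of BLAKE3.
type Hash = [u8; 32];

#[derive(Debug, PartialEq)]
enum IngestError {
    BlockedHash,
}

#[derive(Debug, PartialEq)]
enum IngestOutcome {
    /// First upload of this plaintext: a new encrypted file is stored.
    StoredNew,
    /// Identical plaintext already exists: reuse it and bump the count.
    Deduplicated { ref_count: u32 },
}

/// In-memory stand-in for the two tables involved.
#[derive(Default)]
struct Index {
    ref_counts: HashMap<Hash, u32>,
    blacklist: HashSet<Hash>,
}

impl Index {
    fn ingest(&mut self, plaintext_hash: Hash) -> Result<IngestOutcome, IngestError> {
        // The blacklist is consulted before anything is stored.
        if self.blacklist.contains(&plaintext_hash) {
            return Err(IngestError::BlockedHash);
        }
        let count = self.ref_counts.entry(plaintext_hash).or_insert(0);
        *count += 1;
        if *count == 1 {
            Ok(IngestOutcome::StoredNew)
        } else {
            Ok(IngestOutcome::Deduplicated { ref_count: *count })
        }
    }
}
```

In a database-backed version, the blacklist check and the `ref_count` update would need to run in one transaction so that concurrent uploads of the same file do not race.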

---

## Future Considerations

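- **Client-side decryption**: A future iteration could deliver the wrapped CEK to the browser and decrypt via WebAssembly / libsodium-js. Since browsers do not possess the master key, the CEK would have to be wrapped under a key the client can obtain rather than the server-held master key. This would remove the server from the trust boundary for delivery.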
- **S3-compatible backends**: `cgcx-storage` could be abstracted into a trait to support object storage.
- **PostgreSQL backend**: The repository trait pattern in `cgcx-db` is amenable to an async SQLx implementation.
- **Metrics and alerting**: Structured tracing is in place; a metrics exporter (Prometheus) could be added to `cgcx-server` without touching business logic.