Spec · v1.0 · open

Constitutional Data.

An open protocol for cryptographically verifiable consent and human-authorship attestation in AI training data.

The problem.

Every modern AI lab faces the same question, in court and in public: how do you prove your training data was acquired legally? Existing answers fall into three categories.

Scraped. Cheap, plentiful, but legally fraught. The New York Times, the Authors Guild, Sarah Silverman, Studio Ghibli — the list of plaintiffs grows monthly.
Synthetic. Risk-free legally, but the well-documented "model collapse" effect means training on synthetic outputs degrades the next generation of models.
Contracted. The Reddit, Stack Overflow, and Shutterstock deals brought billions to a few platforms, but skipped the individual humans whose words and images were sold.

None of these prove that every individual whose data was used actually agreed to it. Until now there has been no public standard for that proof.

The protocol.

Constitutional Data combines two cryptographic signatures, one structured message, one authorship score, and one open verifier.

User attestation. Each contributor's Ed25519 keypair signs the canonical message datahedge.v1 | scope | sha256(data) | nonce | timestamp. The signature, public key, and message are persisted with the receipt.
Authorship verification. A classifier estimates the probability that the user-authored portion of the data was written by a human, not by another AI. The score is published in clear with the receipt; downstream labs filter on it.
Service counter-signature. DATA HEDGE signs the user signature only after the upload has passed PII redaction and admin review. The service public key is pinned at /.well-known/datahedge-pubkey.json.
Open verifier. Each receipt is fetchable as JSON at /api/provenance/<dataset_id>. The math to validate it is 50 lines of pure Ed25519 in any language; no trust in DATA HEDGE is required.

What a valid Constitutional Data receipt asserts.

A contributor holding the signing key (proven by signature) gave explicit consent for this exact byte sequence to be used for the declared scope.
The user-authored portion was machine-classified as human-authored, and the classifier's confidence score is published in the receipt.
The upload was reviewed by DATA HEDGE admins; the service signature attests it cleared review.
If the user later revokes consent, the receipt is marked revoked. Subsequent training runs must filter out revoked receipts.
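
The revocation rule and the published authorship score translate into a simple filter that a lab could run before training. The sketch below is illustrative only: the field names revoked and human_score and the 0.9 threshold are assumptions, since the receipt schema is not reproduced here.

```python
def usable_for_training(receipts, min_human_score=0.9):
    """Keep receipts that are not revoked and meet the authorship threshold.

    ASSUMPTION: each receipt is a dict with hypothetical keys
    "revoked" (bool) and "human_score" (float between 0 and 1).
    """
    return [
        r for r in receipts
        if not r.get("revoked", False)
        and r.get("human_score", 0.0) >= min_human_score
    ]

receipts = [
    {"dataset_id": "a", "revoked": False, "human_score": 0.97},
    {"dataset_id": "b", "revoked": True, "human_score": 0.99},   # consent withdrawn
    {"dataset_id": "c", "revoked": False, "human_score": 0.42},  # likely AI-written
]
kept = [r["dataset_id"] for r in usable_for_training(receipts)]
print(kept)  # ['a']
```

Because consent can be withdrawn at any time, a pipeline along these lines would need to refetch receipts shortly before each training run and not rely on a cached copy.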

What it does not assert.

We don't claim the assistant side of a conversation is human; by definition it is the AI's output. We don't claim no PII existed before redaction, only that none remains after. We don't gate AI labs from using non-Constitutional data; that's their call. We guarantee only that data carrying a Constitutional receipt is verifiably consented and, on the user side, classified as human-authored with a published score.

Why open.

A standard that one company controls is a moat, not a standard. We publish the canonical-message format, the verification logic, and a reference implementation under MIT. Other marketplaces are welcome to issue Constitutional Data receipts; we'll honour them if their service public keys are published at their own .well-known path.

Status.

The spec is at v1.0. Receipts are issued automatically for every dataset approved on DATA HEDGE from this commit forward. Existing archived datasets are being back-signed on a rolling basis. See the latest /api/provenance output for a live example, or the source code for the reference implementation.

Stop scraping. Start signing. — DATA HEDGE