We didn't invent values.
We extracted them.
Human values are behavioral patterns — observable in text, extractable with deterministic computation, verifiable through accumulation. No LLM in the extraction stack. No hallucination in the training data.
Built from the full human spectrum — saints, monsters, and the complex majority in between. That middle ground is where the real signal lives.
Most ethics datasets train on the poles.
The far-positive corpus teaches models to recognize virtue performance, not virtue. The far-negative corpus teaches recognition of monsters, not moral drift. The most useful training signal lives in the complex middle.
"A value stated in comfort is weak signal. A value demonstrated at real cost — under threat, under pressure, against interest — is strong signal. The resistance score is the measurement of that cost."
No figure is pre-labeled positive or negative at ingestion time. Classification emerges entirely from the data. The same extraction code processes Lincoln and Nixon identically. The resistance scores and marker patterns determine the label.
Ingest. Extract. Score. Classify.
A four-stage deterministic pipeline. Same input, same thresholds, identical output every time. Reproducible means auditable. Auditable means trustworthy as training data.
UTF-8 source text is segmented into sentence-bounded passages (≤450 chars). Each passage is stored in documents.db with doc_type — the authenticity signal that flows through all downstream scoring.
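Sentence-bounded segmentation under the 450-character cap can be sketched as below. This is a minimal illustration, not the pipeline's actual code: the function name, the naive regex sentence splitter, and the hard-cut handling of oversized sentences are all assumptions.

```python
import re

MAX_PASSAGE_CHARS = 450  # limit stated in the spec

def segment(text: str) -> list[str]:
    """Greedily pack whole sentences into passages of at most 450 chars."""
    # Naive sentence split on ., !, ? followed by whitespace (illustrative only).
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    passages, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= MAX_PASSAGE_CHARS:
            current = candidate
        else:
            if current:
                passages.append(current)
            # a single sentence longer than the cap gets a hard cut here
            current = sentence[:MAX_PASSAGE_CHARS]
    if current:
        passages.append(current)
    return passages
```

Because segmentation is pure string work with fixed rules, the same source text always yields the same passages, which is what makes the downstream scoring reproducible.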
Each passage is scanned against a 15-value keyword vocabulary. One observation per matched value per passage. Watermark-gated: only new passages since last run are processed. Never repeats.
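The scan-plus-watermark step can be sketched as follows. The two-value vocabulary, the rowid-style watermark, and the function signature are illustrative assumptions; the real vocabulary covers 15 values.

```python
# Illustrative vocabulary: the real one spans 15 values (assumption: sets of
# lowercase keyword phrases per value).
VOCAB = {
    "honesty": {"truth", "honest", "lied"},
    "courage": {"stood firm", "refused", "feared"},
}

def extract(passages: list[tuple[int, str]], watermark: int) -> tuple[list[dict], int]:
    """One observation per matched value per passage, gated by a watermark."""
    observations = []
    for passage_id, text in passages:
        if passage_id <= watermark:
            continue  # watermark gate: already-processed passages are never rescanned
        lowered = text.lower()
        for value, keywords in VOCAB.items():
            if any(k in lowered for k in keywords):
                # at most one observation per (value, passage) pair
                observations.append({"passage_id": passage_id, "value": value})
    new_watermark = max([watermark] + [pid for pid, _ in passages])
    return observations, new_watermark
```

Running the extractor twice over the same corpus produces no duplicate observations: the second pass sees every passage ID at or below the watermark and skips it.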
Each observation is scored for the cost of holding that value in that passage. Formula: base + doc_type bonus + significance + text markers. Action-type passages score highest — documented deeds over words.
Each observation is classified P1 (held under resistance), P0 (failed/corrupted), APY (yielded under pressure), or AMBIGUOUS. Exported as JSONL: per-figure files + universal positive/negative training sets.
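The JSONL export step can be sketched like this. Field names, file naming, and the mapping of labels to the positive/negative sets (P1 positive; P0 and APY negative; AMBIGUOUS excluded) are assumptions for illustration.

```python
import json
from pathlib import Path

def export_jsonl(observations: list[dict], out_dir: str) -> None:
    """Write per-figure files plus pooled positive/negative training sets."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    buckets = {"positive": [], "negative": []}
    by_figure = {}
    for obs in observations:
        by_figure.setdefault(obs["figure"], []).append(obs)
        if obs["label"] == "P1":
            buckets["positive"].append(obs)
        elif obs["label"] in ("P0", "APY"):
            buckets["negative"].append(obs)   # AMBIGUOUS is excluded (assumption)
    for figure, rows in by_figure.items():
        with (out / f"{figure}.jsonl").open("w") as f:
            for row in rows:
                f.write(json.dumps(row) + "\n")
    for name, rows in buckets.items():
        with (out / f"universal_{name}.jsonl").open("w") as f:
            for row in rows:
                f.write(json.dumps(row) + "\n")
```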
Measuring the cost of holding a value.
Resistance scores the authenticity of a value signal — how much it cost to hold that value in that documented moment. Range: 0.0 → 1.0. Additive formula with four independent signals.
base = 0.25
sig_bonus = min(significance × 0.40, 0.30)
doc_type = action:0.40 / journal:0.35 / letter:0.30 / speech:0.10 / unknown:0.20
text_bonus = 0.20 (if adversity phrase pattern matches)
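The formula above translates directly into code. One detail is an assumption: the four terms can sum to 1.15, so the stated 0.0 → 1.0 range implies a clamp at 1.0.

```python
DOC_TYPE_BONUS = {
    "action": 0.40, "journal": 0.35, "letter": 0.30,
    "speech": 0.10, "unknown": 0.20,
}
ADVERSITY_TEXT_BONUS = 0.20

def resistance(significance: float, doc_type: str, has_adversity_phrase: bool) -> float:
    """Additive resistance score: base + significance + doc_type + text markers."""
    base = 0.25
    sig_bonus = min(significance * 0.40, 0.30)
    dt_bonus = DOC_TYPE_BONUS.get(doc_type, DOC_TYPE_BONUS["unknown"])
    text_bonus = ADVERSITY_TEXT_BONUS if has_adversity_phrase else 0.0
    # clamp to the documented 0.0-1.0 range (assumed, since the sum can reach 1.15)
    return min(1.0, base + sig_bonus + dt_bonus + text_bonus)
```

An unremarkable speech passage floors near 0.35; a highly significant documented action with an adversity phrase saturates at 1.0, which matches the principle that documented deeds under pressure score highest.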
Relational Integrity Coefficient
Every value observation is classified into one of four labels. Applied deterministically during export — the same observation, the same thresholds, the same label every time.
Hold markers:
- despite · even though
- stood firm · refused to give
- nevertheless · persevered
- maintained · stayed true

Failure markers:
- gave in · gave up · yielded
- i lied · i deceived · i caved
- backed down · compromised my
- i rationalized · i pretended

Pressure markers:
- under pressure · when pressed
- forced to · compelled to
- to avoid punishment
- or face consequences
1. APY pressure detected? YES + failure markers → APY (0.95) | YES + no failure → P1 (0.95, APY-resistance)
2. Failure markers present? → P0 (0.85)
3. resistance ≥ p1_threshold (0.55) AND hold markers? → P1 (0.90)
4. resistance ≥ p1_threshold alone? → P1 (0.75)
5. resistance < p0_threshold (0.35)? → P0 (0.55)
6. Otherwise → AMBIGUOUS (0.40)
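The six rules above form a deterministic cascade where the first match wins. In this sketch the marker tests arrive as booleans; the function name and signature are illustrative, while the thresholds and confidence values come straight from the rules.

```python
P1_THRESHOLD = 0.55
P0_THRESHOLD = 0.35

def classify(resistance: float, pressure: bool, failure: bool, hold: bool) -> tuple[str, float]:
    """Apply the six export rules in order; first match wins."""
    if pressure:
        # rule 1: pressure + failure -> APY; pressure resisted -> P1 (APY-resistance)
        return ("APY", 0.95) if failure else ("P1", 0.95)
    if failure:
        return ("P0", 0.85)                      # rule 2: failure markers alone
    if resistance >= P1_THRESHOLD and hold:
        return ("P1", 0.90)                      # rule 3: resistance + hold markers
    if resistance >= P1_THRESHOLD:
        return ("P1", 0.75)                      # rule 4: resistance alone
    if resistance < P0_THRESHOLD:
        return ("P0", 0.55)                      # rule 5: low resistance
    return ("AMBIGUOUS", 0.40)                   # rule 6: everything else
```

Because the cascade is ordered and the thresholds are fixed, the same observation always receives the same label, which is what makes the export auditable.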
Rules that cannot be violated.
These constraints are structural, not stylistic — they define what the pipeline is and what makes it trustworthy as a training data source.
value_observations is never updated or deleted. Only appended. The evidence record is immutable once written.
Where we are. Where we're going.
Foundation — Complete (Mar 2026)
Standalone pipeline operational. DocumentStore, ValueStore, resistance formula, value extractor, CLI ingest + export. 15 values, P1/P0/APY classification, JSONL output. Zero external dependencies.
Semantic Extraction
Replace keyword vocabulary with embedding-based clustering. Passage embeddings via sentence-transformers (BGE-base). FAISS/Qdrant ANN search against value cluster prototypes. Backward-compatible with Phase 0 baseline.
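The prototype-matching idea can be illustrated in miniature. This toy uses exact cosine similarity in place of FAISS/Qdrant ANN search, two-dimensional fabricated vectors in place of BGE-base embeddings, and an invented 0.7 threshold; only the shape of the approach is real.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def match_value(passage_vec: list[float],
                prototypes: dict[str, list[float]],
                threshold: float = 0.7):
    """Return the best-matching value cluster, or None if nothing clears the threshold."""
    best_value, best_sim = None, threshold
    for value, proto in prototypes.items():
        sim = cosine(passage_vec, proto)
        if sim >= best_sim:
            best_value, best_sim = value, sim
    return best_value
```

At corpus scale the linear scan is replaced by an ANN index, but the contract is the same: a passage either lands in a value cluster or yields no observation, keeping the Phase 0 one-observation-per-value behavior intact.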
API Layer
FastAPI service wrapping the pipeline. POST /figures/{name}/ingest, GET /figures/{name}/profile, GET /figures/universal, GET /export/ric. Batch processing support.
Web Dashboard
Figure browser with value radar charts, corpus upload UI, universal registry heatmap, training set builder with filters by figure/value/doc_type/label.
Corpus Scale + HuggingFace Export
Batch ingestion CLI, multi-file figure support, corpus statistics, dataset card generation, and datasets-library-compatible format for load_dataset() integration.
SRL Integration
Port AiMe's Self-Reflection Layer — claim_extractor, ric_gate, trait_compiler, inflection_engine. Enables AI model output evaluation against the Ethos value corpus.
"We extracted them from people who lived them — and people who didn't."