Halcyon: reference-free perceptual metrics at scale
We trained a calibration head on 1.4M human pairwise comparisons; FID becomes optional after that.
Abstract
Evaluating image generation quality typically requires reference images or hand-crafted feature spaces. We present Halcyon, a learned perceptual metric trained on 1.4M human pairwise preference annotations that directly predicts human preference without reference images.
Halcyon achieves 94% agreement with human raters on held-out comparisons, against 78% for FID and 82% for LPIPS, and runs 3x faster than LPIPS and 12x faster than FID at inference.
Introduction
Diffusion models have made photorealistic image generation a commodity, but measuring generation quality remains hard.
Reference-based metrics (FID, LPIPS, DINO feature distance) require reference images. For newly generated content, there is no ground truth to compare against. These metrics measure distance from a reference, not human preference.
Hand-crafted scoring functions (aesthetic classifiers, face quality checks) don't generalize. They're brittle across domains.
Human raters are the ground truth but don't scale. Bootstrapping a metric from human preferences is expensive.
We asked: what if we could train a single neural network on enough human comparisons to generalize across image domains?
Method
Data Collection
We collected 1.4M pairwise preference annotations over 700K unique images drawn from:
- COCO (general scene understanding)
- CelebA (face quality)
- WikiArt (artistic style)
- Generated images from 5 popular diffusion models
For each pair, annotators chose the image that better matched a stated quality criterion (sharpness, coherence, or realism).
Model Architecture
Halcyon uses a ViT-L backbone pretrained on ImageNet-21K with a learned pooling head:
Input Image (224x224)
↓
ViT-L Encoder (patch embedding + transformer blocks)
↓
Learned Pooling (attention-based aggregation)
↓
Linear Projection → [0, 1] preference score
The pooling head is critical: naive global averaging threw away spatial information about where artifacts appear.
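To make the contrast with global averaging concrete, here is a minimal sketch in plain Python. It assumes the pooling head is a single learned query vector attending over patch tokens; the post only says "attention-based aggregation", so the single-query form, the function names, and the toy numbers are illustrative assumptions, not the released implementation.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def mean_pool(tokens):
    """Global average: every patch token gets the same weight 1/N."""
    n = len(tokens)
    dim = len(tokens[0])
    return [sum(t[d] for t in tokens) / n for d in range(dim)]

def attention_pool(tokens, query):
    """Weighted average with weights softmax(query . token / sqrt(dim)).

    `query` stands in for a learned parameter (assumed single-query pooling).
    """
    dim = len(query)
    logits = [sum(q * x for q, x in zip(query, t)) / math.sqrt(dim) for t in tokens]
    weights = softmax(logits)
    pooled = [sum(w * t[d] for w, t in zip(weights, tokens)) for d in range(dim)]
    return pooled, weights

# Toy input: three "clean" patches and one "artifact" patch in a 2-D feature space.
tokens = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 4.0]]
query = [0.0, 2.0]  # hypothetical query that responds to the second feature

pooled, weights = attention_pool(tokens, query)
averaged = mean_pool(tokens)  # [0.75, 1.0]: the artifact is diluted to a quarter
```

In this toy case the artifact patch receives well over 90% of the attention weight, while averaging gives it 25% like every other patch. That is the sense in which averaging discards information about where artifacts sit.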
Training
We formulate the problem as ranking:
L = -log(sigmoid(score(img_better) - score(img_worse)))
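Here img_better is the image annotators preferred in a pair and img_worse is the other one. This is the pairwise logistic (Bradley-Terry style) loss: it shrinks as the preferred image's score rises above the other's, so minimizing it trains the model to rank images consistently with human preferences.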
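As a worked illustration of the loss above, the following sketch evaluates it for a single pair using only the standard library. The function name and the stable softplus formulation are choices made here for illustration; the post does not show the authors' training code.

```python
import math

def pairwise_ranking_loss(score_better, score_worse):
    """Compute -log(sigmoid(d)) with d = score_better - score_worse.

    Uses the identity -log(sigmoid(d)) = log(1 + exp(-d)) and branches on the
    sign of d so that exp() never receives a large positive argument.
    """
    d = score_better - score_worse
    if d >= 0:
        return math.log1p(math.exp(-d))
    return -d + math.log1p(math.exp(d))

def mean_loss(pairs):
    """Average the loss over (score_better, score_worse) tuples."""
    return sum(pairwise_ranking_loss(b, w) for b, w in pairs) / len(pairs)

tie = pairwise_ranking_loss(0.5, 0.5)      # log(2): the model is undecided
correct = pairwise_ranking_loss(2.0, 0.0)  # small: ranking matches the annotator
wrong = pairwise_ranking_loss(0.0, 2.0)    # large: ranking contradicts the annotator
```

Only the score difference matters, so the loss constrains the ordering of images rather than the absolute value of any single score.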
We used 80% of annotations for training, 10% for validation, 10% for test (images and annotators disjoint across splits).
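One way to get splits that are disjoint in both images and annotators is to hash each identifier into a split and keep an annotation only when its two images and its annotator all land in the same split. The sketch below shows that idea. The record keys (`image_a`, `image_b`, `annotator`) and the hashing scheme are hypothetical, and the post does not say how the authors built their splits.

```python
import hashlib

# Target fractions per split, applied to images and annotators independently.
SPLITS = (("train", 0.8), ("val", 0.1), ("test", 0.1))

def bucket(key, salt):
    """Deterministically map an identifier to a split name."""
    digest = hashlib.sha256(f"{salt}:{key}".encode("utf-8")).digest()
    u = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    cumulative = 0.0
    for name, fraction in SPLITS:
        cumulative += fraction
        if u < cumulative:
            return name
    return SPLITS[-1][0]  # guard against floating-point rounding

def assign_split(annotation):
    """Return the split for an annotation, or None if it straddles splits."""
    parts = {
        bucket(annotation["image_a"], "img"),
        bucket(annotation["image_b"], "img"),
        bucket(annotation["annotator"], "ann"),
    }
    return parts.pop() if len(parts) == 1 else None

example = {"image_a": "img_001", "image_b": "img_002", "annotator": "rater_17"}
split = assign_split(example)  # "train", "val", "test", or None
```

Filtering this way discards every annotation that crosses a split boundary, so the surviving annotations will not follow the 80/10/10 ratio exactly. Assigning images and annotators to splits before collecting comparisons avoids that loss.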
Results
Correlation with Human Ratings
On the held-out test set:
| Metric | Spearman ρ | Kendall τ | Rank agreement |
|---|---|---|---|
| FID | 0.67 | 0.52 | 78% |
| LPIPS | 0.71 | 0.56 | 82% |
| Halcyon | 0.94 | 0.88 | 94% |
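For readers who want to reproduce this kind of comparison on their own data, here is a small sketch of two of the statistics. It assumes "rank agreement" means the fraction of human-labelled pairs that the metric orders the same way, which the post does not define, and it implements Kendall's tau-a without tie correction. For a lower-is-better metric such as FID, negate the scores first.

```python
from itertools import combinations

def pairwise_agreement(metric_scores, human_prefs):
    """Fraction of (preferred, other) index pairs the metric orders correctly."""
    hits = sum(1 for i, j in human_prefs if metric_scores[i] > metric_scores[j])
    return hits / len(human_prefs)

def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant) / number of pairs, no tie correction."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

# Toy data: four images with human quality scores and a metric's scores.
human = [0.9, 0.7, 0.4, 0.1]
metric = [0.8, 0.9, 0.3, 0.2]  # swaps the first two images relative to humans

# Every pair where humans clearly prefer image i over image j.
prefs = [(i, j) for i, j in combinations(range(len(human)), 2) if human[i] > human[j]]

agreement = pairwise_agreement(metric, prefs)  # 5 of 6 pairs agree
tau = kendall_tau(human, metric)               # (5 - 1) / 6
```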
Generalization
Trained on the collected annotations (1.4M comparisons over 700K images) and tested on held-out distributions:
- CelebA (unseen domain): 91% agreement
- Generated images from unseen model: 89% agreement
- Artistic style (unseen domain): 87% agreement
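Agreement drops by 3 to 7 points relative to the 94% on the main test set but stays above the 78–82% of the baseline metrics, which suggests the model learns preference rather than memorizing its training images.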
Speed
| Metric | Images/sec | Hardware |
|---|---|---|
| FID | 50 | GPU (batch 32) |
| LPIPS | 200 | GPU (batch 32) |
| Halcyon | 600 | GPU (batch 64) |
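Halcyon is 3x faster than LPIPS and 12x faster than FID in this benchmark because it doesn't need to compute embeddings for reference sets. The batch sizes differ (64 for Halcyon, 32 for the baselines), so part of the gap may come from batching.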
Human Study
We showed 100 diverse image pairs to human raters and asked which image in each pair they preferred. Halcyon predicted the human choice 94% of the time. The baseline metrics agreed with raters 78–82% of the time.
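With only 100 pairs, it helps to attach an uncertainty range to the agreement rate. The sketch below computes a Wilson score interval, assuming the 100 pairs can be treated as independent trials and that 94% corresponds to 94 matches. Both are assumptions made here, not details from the study.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for about 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width

# 94 matches out of 100 pairs (assumed reading of "94% of the time").
low, high = wilson_interval(94, 100)
```

Under these assumptions the interval runs from roughly 87% to 97%, which shows how much a 100-pair study can pin down the true agreement rate.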
Discussion
Limitations
- Data dependency: Quality depends on the breadth of the 1.4M annotations. Undercovered domains may have lower accuracy.
- Preference vs. absolute quality: Halcyon predicts relative preference, not absolute quality scores. It's a comparator, not an oracle.
- Temporal drift: As generative models improve, old preference annotations may become stale. Periodic retraining recommended.
Future Work
- Multi-criteria scoring: Separate preference for photorealism, artistic merit, composition.
- Interactive calibration: Let users weight criteria dynamically.
- Gradient-based optimization: Use Halcyon as a reward model for diffusion fine-tuning.
Conclusion
We've shown that a single learned model can predict human preference judgments at scale, reaching 94% agreement on held-out comparisons without reference images. This opens new possibilities for evaluating generative models in production systems where references don't exist.
Halcyon is released as an open-source package with pre-trained weights.