Halcyon: Reference-Free Perceptual Metrics at Scale

Abstract

Evaluating image generation quality typically requires reference images or hand-crafted feature spaces. We present Halcyon, a learned perceptual metric trained on 1.4M human pairwise preference annotations that directly predicts human preference without reference images.

Our approach achieves 94% agreement with human raters on held-out comparison tasks, compared with 78% for FID and 82% for LPIPS, while running 3x faster than LPIPS at inference.

Introduction

Diffusion models have made photorealistic image generation a commodity. But measuring generation quality remains hard.

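Reference-based metrics (FID, LPIPS, DINO feature distance) require reference images: a reference set in the case of FID, a paired reference image in the case of LPIPS. For generated content, there is no ground truth to compare against. These metrics measure distance from a reference, not human preference.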

Hand-crafted scoring functions (aesthetic classifiers, face quality checks) don't generalize. They're brittle across domains.

Human raters are the ground truth but don't scale. Bootstrapping a metric from human preferences is expensive.

We asked: what if we could train a single neural network on enough human comparisons to generalize across image domains?

Method

Data Collection

We collected 1.4M pairwise preference annotations over 700K unique images drawn from:

COCO (general scene understanding)
CelebA (face quality)
WikiArt (artistic style)
Generated images from 5 popular diffusion models

For each pair, annotators chose which image better satisfied a stated quality criterion (sharpness, coherence, or realism).

Model Architecture

Halcyon uses a ViT-L backbone pretrained on ImageNet-21K with a learned pooling head:

Input Image (224x224)
  ↓
ViT-L Encoder (patch embedding + transformer blocks)
  ↓
Learned Pooling (attention-based aggregation)
  ↓
Linear Projection → [0, 1] preference score

The pooling head is critical: naive global average pooling discards spatial information about where artifacts appear.
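To illustrate why attention-based aggregation can preserve a localized artifact that averaging washes out, here is a minimal sketch in plain Python. It assumes a single learned query vector and scaled dot-product attention; the text above does not specify the exact form of the pooling head, so the function names, the query, and the toy tokens are all hypothetical.

```python
import math

def attention_pool(tokens, query):
    """Pool patch tokens into one vector using a single query (assumed design).

    tokens: list of N patch embeddings, each a list of D floats.
    query:  list of D floats (a learned parameter in a real model).
    Returns (pooled_vector, attention_weights).
    """
    dim = len(query)
    # Scaled dot-product score between the query and each patch token.
    logits = [sum(q * t for q, t in zip(query, tok)) / math.sqrt(dim)
              for tok in tokens]
    # Softmax over patches; subtract the max for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of tokens: high-attention patches dominate the result.
    pooled = [sum(w * tok[d] for w, tok in zip(weights, tokens))
              for d in range(dim)]
    return pooled, weights

def mean_pool(tokens):
    """Global average pooling: every patch contributes equally."""
    n = len(tokens)
    return [sum(tok[d] for tok in tokens) / n for d in range(len(tokens[0]))]

# Toy example: three "clean" patches and one "artifact" patch.
tokens = [[0.1, 0.0], [0.1, 0.0], [0.1, 0.0], [0.0, 4.0]]
query = [0.0, 2.0]  # hypothetical query that responds to the artifact direction
pooled, weights = attention_pool(tokens, query)
averaged = mean_pool(tokens)
```

In this toy case nearly all attention weight lands on the artifact patch, so its signal survives pooling, while the average dilutes it by the number of patches. A real head would operate on ViT-L token tensors in a tensor library rather than Python lists.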

Training

We formulate the problem as ranking:

L = -log(sigmoid(score(img_better) - score(img_worse)))

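Minimizing L pushes the score of the preferred image above the score of the other image, so the model learns to rank images consistently with human preferences. This is the standard pairwise logistic (Bradley-Terry style) ranking loss.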
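The ranking loss above can be sketched for a single annotated pair as follows. This is a scalar illustration rather than the training code: the function names are hypothetical, and a real implementation would operate on batches of tensors. The loss is evaluated in a numerically stable form, using the identity -log(sigmoid(m)) = log(1 + exp(-m)).

```python
import math

def pairwise_ranking_loss(score_better, score_worse):
    """-log(sigmoid(score_better - score_worse)), computed stably."""
    margin = score_better - score_worse
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch so exp never overflows.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

def preference_probability(score_a, score_b):
    """Modeled probability that image A is preferred over image B."""
    margin = score_a - score_b
    if margin >= 0:
        return 1.0 / (1.0 + math.exp(-margin))
    e = math.exp(margin)
    return e / (1.0 + e)
```

When the two scores are equal the loss is log 2 and the modeled preference probability is 0.5; the loss shrinks toward zero as the preferred image's score pulls ahead and grows roughly linearly when the ordering is wrong.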

We used 80% of annotations for training, 10% for validation, 10% for test (images and annotators disjoint across splits).
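The disjointness condition can be checked mechanically. The sketch below is one possible leakage check, not the procedure used for the split described above; it assumes each annotation is stored as a hypothetical (image_a, image_b, annotator_id) tuple.

```python
from itertools import combinations

def find_leakage(splits):
    """Report images or annotators that appear in more than one split.

    splits: dict mapping split name -> list of
            (image_a, image_b, annotator_id) tuples (assumed record layout).
    Returns a dict mapping (split_x, split_y) ->
            (shared_images, shared_annotators); empty if the splits are clean.
    """
    images = {name: {img for a, b, _ in anns for img in (a, b)}
              for name, anns in splits.items()}
    raters = {name: {r for _, _, r in anns}
              for name, anns in splits.items()}
    leaks = {}
    for x, y in combinations(splits, 2):
        shared_images = images[x] & images[y]
        shared_raters = raters[x] & raters[y]
        if shared_images or shared_raters:
            leaks[(x, y)] = (shared_images, shared_raters)
    return leaks

clean = {
    "train": [("img1", "img2", "rater_a"), ("img2", "img3", "rater_b")],
    "val": [("img4", "img5", "rater_c")],
    "test": [("img6", "img7", "rater_d")],
}
leaky = {
    "train": [("img1", "img2", "rater_a")],
    "val": [("img2", "img5", "rater_c")],   # img2 also appears in train
    "test": [("img6", "img7", "rater_a")],  # rater_a also appears in train
}
clean_report = find_leakage(clean)
leaky_report = find_leakage(leaky)
```

Checking both images and annotators matters because a shared annotator can leak individual taste into the test set even when no image is shared.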

Results

Correlation with Human Ratings

On the held-out test set:

Metric	Spearman ρ	Kendall τ	Rank agreement
FID	0.67	0.52	78%
LPIPS	0.71	0.56	82%
Halcyon	0.94	0.88	94%
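As a sketch of how the rank agreement column could be computed, the function below counts the fraction of human-annotated pairs on which a metric orders the two images the same way the annotator did. The exact protocol behind the table is not spelled out here, so the function name, the tie handling (ties count as disagreement), and the higher_is_better flag are assumptions made for illustration.

```python
def rank_agreement(scores, human_pairs, higher_is_better=True):
    """Fraction of pairs where the metric agrees with the human choice.

    scores:      dict mapping image id -> metric value.
    human_pairs: list of (preferred_id, other_id) tuples from annotators.
    higher_is_better: set False for distance-style metrics where a lower
                      value means better quality.
    Ties are counted as disagreement (an assumption of this sketch).
    """
    agree = 0
    for preferred, other in human_pairs:
        diff = scores[preferred] - scores[other]
        if not higher_is_better:
            diff = -diff
        if diff > 0:
            agree += 1
    return agree / len(human_pairs)

scores = {"a": 0.9, "b": 0.4, "c": 0.7, "d": 0.2}
human_pairs = [("a", "b"), ("c", "d"), ("b", "c"), ("a", "c")]
agreement = rank_agreement(scores, human_pairs)
```

Here the metric matches the annotators on three of the four pairs. The higher_is_better flag exists because distance-style baselines score better images lower, so their ordering has to be flipped before comparison.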

Generalization

Trained on the collected annotations (1.4M pairs over 700K images), tested on held-out distributions:

CelebA (unseen domain): 91% agreement
Generated images from unseen model: 89% agreement
Artistic style (unseen domain): 87% agreement

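The model generalizes. Agreement drops by 3 to 7 points from the 94% measured on the main test set, yet stays above what FID and LPIPS reach on that test set (78% and 82%). This suggests the model learns transferable preference signal rather than memorizing training images.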

Speed

Metric	Images/sec	Hardware
FID	50	GPU (batch 32)
LPIPS	200	GPU (batch 32)
Halcyon	600	GPU (batch 64)

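At 600 images/sec, Halcyon is 3x faster than LPIPS and 12x faster than FID, in part because it doesn't need to compute embeddings for reference sets. Halcyon was measured at batch size 64 and the baselines at batch size 32, so the comparison is not strictly like-for-like.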

Human Study

We showed 100 diverse image pairs to human raters. They rated each pair for quality preference. Halcyon predicted human choice 94% of the time. Baseline metrics performed at 78–82%.
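With only 100 pairs, the uncertainty around the 94% figure is worth quantifying. The sketch below computes a 95% Wilson score interval, assuming the result corresponds to 94 agreements out of 100 independent pairs; that reading is an assumption, since the study description does not give per-pair counts or the number of raters per pair.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials
                         + z * z / (4 * trials * trials)) / denom
    return center - half, center + half

# Assumed counts: 94 of 100 pairs predicted correctly.
low, high = wilson_interval(94, 100)
```

Under these assumptions the interval runs from roughly 88% to 97%, which still sits above the 78–82% range reported for the baselines.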

Discussion

Limitations

Data dependency: Quality depends on the breadth of the 1.4M annotations. Undercovered domains may have lower accuracy.
Preference vs. absolute quality: Halcyon predicts relative preference, not absolute quality scores. It's a comparator, not an oracle.
Temporal drift: As generative models improve, old preference annotations may become stale. Periodic retraining is recommended.

Future Work

Multi-criteria scoring: Separate preference for photorealism, artistic merit, composition.
Interactive calibration: Let users weight criteria dynamically.
Gradient-based optimization: Use Halcyon as a reward model for diffusion fine-tuning.

Conclusion

We've shown that a single learned model can match human preference judgments at scale without reference images. This opens new possibilities for evaluating generative models in production systems where references don't exist.

Halcyon is released as an open-source package with pre-trained weights.