Methodology v1.0
How CircaTest evaluates studies
Last updated 2026-04-06. The version number above and the changelog at the bottom of this page record material changes to the methodology over time.
Why we don't test devices
CircaTest is built and maintained by AI working from publicly available data. We do not have a sleep lab. We do not have polysomnography equipment. We do not wear devices for thirty days. Every accuracy claim on this site comes from a peer-reviewed validation study performed by researchers who do have those things.
This is a constraint, but it is also our editorial advantage. Most affiliate sites in the sleep tracker space use phrases like “in our testing” or “after wearing it for two weeks” to build credibility. Without independent verification, those statements are unfalsifiable. Ours are not. Every accuracy figure on CircaTest links to a study, every study links to a DOI or PubMed entry, and you can audit our reasoning end to end.
What goes into the corpus
A study is added to the CircaTest corpus if all of the following are true:
- It has been peer-reviewed and published in an academic journal, indexed by PubMed, Europe PMC, or another scholarly database with editorial oversight.
- It compares one or more consumer wearable sleep-tracking devices to a reference standard. The strongest reference standard is polysomnography (PSG); we also include studies that use clinical-grade actigraphy or simultaneous ECG when PSG is impractical for the question.
- It reports quantitative agreement metrics: epoch-by-epoch accuracy, Cohen's kappa, sensitivity, specificity, or stage-specific percent agreement. Vendor white papers that omit these metrics are excluded from the corpus, even when they provide useful context.
- It is recoverable: we can produce a permanent identifier (PMID, PMCID, or DOI) so the entry can be re-verified by anyone reading CircaTest later.
Pre-prints, unpublished theses, conference abstracts, and manufacturer-sponsored studies without independent replication are flagged on the study page itself. They may still be in the corpus if no independent equivalent exists, but the curatorial note explains the limitation.
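As a concrete illustration, the inclusion criteria work as a boolean gate followed by a flagging pass. The Python sketch below is a minimal rendering under assumed names — `Study` and its fields are hypothetical, not CircaTest's actual schema:

```python
from dataclasses import dataclass

ACCEPTED_REFERENCES = {"psg", "clinical_actigraphy", "simultaneous_ecg"}
QUANT_METRICS = {"epoch_accuracy", "cohens_kappa", "sensitivity",
                 "specificity", "stage_percent_agreement"}

@dataclass
class Study:                          # hypothetical record shape
    peer_reviewed: bool
    reference_standard: str           # e.g. "psg"
    metrics: set                      # agreement metrics the paper reports
    identifiers: set                  # PMID / PMCID / DOI strings
    is_preprint: bool = False
    manufacturer_sponsored: bool = False
    independently_replicated: bool = False

def is_eligible(s: Study) -> bool:
    """All four inclusion criteria must hold at once."""
    return (s.peer_reviewed
            and s.reference_standard in ACCEPTED_REFERENCES
            and bool(s.metrics & QUANT_METRICS)   # at least one quantitative agreement metric
            and bool(s.identifiers))              # recoverable: PMID, PMCID, or DOI

def curatorial_flags(s: Study) -> list[str]:
    """Limitations noted on the study page; flagged studies may still
    enter the corpus when no independent equivalent exists."""
    flags = []
    if s.is_preprint:
        flags.append("pre-print: not yet peer-reviewed")
    if s.manufacturer_sponsored and not s.independently_replicated:
        flags.append("manufacturer-sponsored without independent replication")
    return flags
```

The real pipeline carries more nuance than a four-way AND — the "no independent equivalent exists" exception above, for one — but the order and substance of the checks match the published criteria.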
How we weigh studies against each other
When two studies disagree about the same device, we apply the following ordering rules in sequence. The first rule that produces a clear winner decides the ranking; if a rule does not separate the two studies, we move on to the next one. A code sketch after the list illustrates the cascade. The general approach mirrors the standardised evaluation framework published by Menghini et al., 2021, which sets out step-by-step guidelines and open-source code for testing the performance of sleep-tracking technology against polysomnography.
- Reference standard. A study that compared the device against polysomnography (PSG) outranks one that compared against actigraphy or sleep diaries. PSG is the clinical gold standard for sleep stage classification.
- Sample size. A larger study sample outranks a smaller one. We do not impose a hard floor, but small samples are flagged in the curatorial note when their findings drive a major editorial claim. A 10-person single-night study and a 100-person multi-night study are not equivalent evidence.
- Population fit. A study performed in a population similar to the typical device buyer outranks one performed in a specialised population, when both are otherwise equivalent. Studies in clinical populations (insomnia, sleep apnea, shift workers, athletes) are presented alongside healthy-population studies, never substituted for them.
- Metric quality. Cohen's kappa outranks raw percent agreement. Kappa accounts for chance agreement; raw percent does not (a worked example appears in the sketch after this list). When a study reports both, we display kappa as the primary figure. When two studies use different metrics, we present both side by side and explain why they are not directly comparable.
- Recency tiebreaker. If two studies are otherwise equivalent and the device firmware or algorithm has materially changed between them, the more recent study prevails. We make this judgment cautiously: a 2018 study testing 2018 firmware on a 2018 device is still valid evidence about that hardware-software pair, even if the device has since shipped a new generation.
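To make the cascade concrete, here is a minimal Python sketch. The field names, scoring scales, and the `prefer` helper are illustrative assumptions, not CircaTest internals; what matters is the short-circuit ordering, where the first rule that separates two studies decides and later rules are never consulted:

```python
REFERENCE_RANK = {"psg": 3, "clinical_actigraphy": 2, "sleep_diary": 1}  # rule 1
METRIC_RANK = {"cohens_kappa": 2, "percent_agreement": 1}                # rule 4

def rank_key(study: dict) -> tuple:
    """Map a study to a tuple so that comparing two studies applies the
    rules in sequence: Python compares tuples element by element, which
    gives the 'first rule that produces a clear winner decides' behaviour."""
    return (
        REFERENCE_RANK.get(study["reference_standard"], 0),  # 1. reference standard
        study["sample_size"],                                # 2. sample size
        1 if study["population"] == "typical_buyer" else 0,  # 3. population fit
        METRIC_RANK.get(study["primary_metric"], 0),         # 4. metric quality
        study["year"],                                       # 5. recency tiebreaker
    )

def prefer(study_a: dict, study_b: dict) -> dict:
    """Return whichever study the hierarchy favours (first argument on a tie)."""
    return max(study_a, study_b, key=rank_key)

def cohens_kappa(observed: float, chance: float) -> float:
    """Kappa corrects raw agreement for what raters would match on by chance.
    Example: 90% raw agreement sounds strong, but if chance alone yields 80%,
    kappa = (0.90 - 0.80) / (1 - 0.80) = 0.50 -- only moderate agreement.
    This is why rule 4 ranks kappa above raw percent agreement."""
    return (observed - chance) / (1 - chance)
```

Note the simplifications: in the published hierarchy the recency rule fires only when the firmware or algorithm has materially changed between studies, and population fit is a judgment about similarity to the typical buyer rather than a binary — neither nuance survives a plain key function.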
When we say “no validation exists”
For some devices, particularly newer ones like the Samsung Galaxy Ring, we report that no independent peer-reviewed validation has been published. This is an honest null result. We will not substitute a manufacturer white paper, an influencer review, or a benchmark from an older generation of the same product line, because those things do not answer the same question.
When a peer-reviewed study eventually appears, it will be added to the corpus and the relevant device review on CircaTest will be updated automatically through the cross-link infrastructure documented on the per-device research pages.
What we display vs what we cache
CircaTest caches study metadata (title, authors, journal, year, identifiers) and either the full abstract or a fair-use excerpt of the abstract, depending on the license:
- Public domain (NLM PubMed entries explicitly marked as government works): full abstract displayed.
- CC-BY and CC0 (open-access journals like MDPI Sensors, JMIR, BMC, Sleep Advances): full abstract displayed with attribution.
- Publisher copyright (subscription journals): a short excerpt with a link to the full abstract on PubMed or the publisher's page. We never reproduce more than would qualify as fair-use scholarly excerpting.
Quantitative metrics (kappa, sensitivity, specificity, stage agreements) are facts, not creative work, and are displayed in full regardless of journal licensing.
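In effect, the display rules reduce to a small lookup from license class to abstract policy, with metrics handled outside the lookup because they are shown in full either way. A minimal sketch with hypothetical policy names, not the site's actual configuration:

```python
# License class -> how much of the abstract may be displayed.
ABSTRACT_POLICY = {
    "public_domain":       "full_abstract",                   # NLM government works
    "cc_by":               "full_abstract_with_attribution",  # e.g. MDPI Sensors, JMIR, BMC
    "cc0":                 "full_abstract_with_attribution",
    "publisher_copyright": "excerpt_with_link",               # fair-use excerpt, link out
}

def display_plan(license_class: str, metrics: dict) -> dict:
    return {
        # Unknown licenses fall back to the most conservative option.
        "abstract": ABSTRACT_POLICY.get(license_class, "excerpt_with_link"),
        # Quantitative agreement metrics are facts, not creative work:
        # displayed in full regardless of the journal's license.
        "metrics": metrics,
    }
```

The conservative default for unrecognised licenses is a design choice in the sketch, not a documented CircaTest rule.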
Equity and known biases
Wearable sleep trackers do not work equally well for everyone. PPG-based devices can underperform on darker skin tones because melanin absorbs the same light wavelengths the sensors rely on. Wrist-worn accelerometry can be confounded by high BMI and sleep position. Most of the validation studies in the corpus are performed in healthy young or middle-aged adults; their findings may not generalise to older adults, children, or clinical populations.
This is not a hypothetical concern. The 2024 Sleep Research Society state-of-the-science review (de Zambotti et al., 2024) explicitly flags that wearable performance varies by skin color and BMI and that uncritical adoption risks amplifying existing healthcare disparities. We surface this caveat on every device page where the underlying validation evidence does not adequately represent the population the device is sold to.
Changelog
- v1.0 (2026-04-06) — Initial public methodology released alongside the launch of the Living Meta-Analysis section. Sets the five-rule weighting hierarchy, the corpus-inclusion criteria, the abstract-license display rules, and the equity caveats.
Found a methodology error? An overlooked study? A claim on CircaTest that contradicts the rules above? Get in touch. Corrections are tracked publicly.