The Clinical Gold Standard
Before understanding how consumer sleep trackers work, it helps to know what they are trying to replicate. The clinical gold standard for sleep measurement is polysomnography (PSG). A PSG test uses EEG (brain wave monitoring), EOG (eye movement), EMG (muscle tone), ECG (heart rhythm), and respiratory sensors to classify every 30-second epoch of a night into one of five stages: Wake, N1 (light), N2 (light), N3 (deep), and REM.
No consumer device measures brain waves. Instead, they use proxy signals to estimate what the brain is probably doing. The accuracy of that estimation is what separates good trackers from bad ones.
The Three Sensors Consumer Trackers Use
1. Accelerometry (Motion)
Every sleep tracker contains an accelerometer that detects movement. The underlying principle, established in actigraphy research dating back decades: deep sleep involves very little movement, REM sleep involves muscle atonia with occasional twitches, and wakefulness involves more frequent movement.
Published research (Ancoli-Israel et al., 2003) shows that motion data alone can distinguish sleep from wake with approximately 90% accuracy. However, accelerometry cannot reliably distinguish between sleep stages. A person lying perfectly still while reading looks identical to deep sleep on an accelerometer.
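To make the limitation concrete, here is a minimal sketch of motion-only sleep/wake scoring. The threshold value, the function name, and the sample counts are assumptions chosen for illustration; real actigraphy algorithms are more elaborate and are tuned against PSG.

```python
from typing import List

EPOCH_SECONDS = 30   # same epoch length PSG scoring uses
WAKE_THRESHOLD = 5   # hypothetical movement count above which an epoch is called "wake"


def score_sleep_wake(movement_counts: List[int], threshold: int = WAKE_THRESHOLD) -> List[str]:
    """Label each epoch 'wake' or 'sleep' from movement counts alone."""
    return ["wake" if count > threshold else "sleep" for count in movement_counts]


# Hypothetical epochs: restless, then lying still while reading, then asleep.
counts = [12, 9, 0, 0, 1, 0, 0]
labels = score_sleep_wake(counts)
print(labels)
```

The still-but-reading epochs and the genuinely asleep epochs both come out as "sleep", which is exactly the ambiguity motion data cannot resolve on its own.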
2. Photoplethysmography (Heart Rate)
PPG sensors shine green LED light into the skin and measure reflected light. Blood absorbs green light, so changes in blood volume with each heartbeat create a pulsing signal. From this signal, trackers derive:
- Heart rate (beats per minute)
- Heart rate variability (HRV): the variation in time between consecutive heartbeats
- Respiratory rate (estimated from HRV patterns)
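As a rough illustration of the first two quantities, the sketch below assumes the PPG signal has already been reduced to a list of inter-beat intervals in milliseconds. RMSSD is one commonly used HRV metric; which metric any particular tracker uses is not stated here, and the function names and sample values are illustrative.

```python
import math
from typing import List


def mean_heart_rate_bpm(ibi_ms: List[float]) -> float:
    """Average heart rate from inter-beat intervals given in milliseconds."""
    mean_ibi = sum(ibi_ms) / len(ibi_ms)
    return 60_000.0 / mean_ibi


def rmssd_ms(ibi_ms: List[float]) -> float:
    """RMSSD: root mean square of successive differences between intervals."""
    diffs = [b - a for a, b in zip(ibi_ms, ibi_ms[1:])]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


# Hypothetical inter-beat intervals for five consecutive beats.
ibi = [1000.0, 1040.0, 980.0, 1020.0, 960.0]
print(f"HR: {mean_heart_rate_bpm(ibi):.1f} bpm, RMSSD: {rmssd_ms(ibi):.1f} ms")
```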
Heart rate and HRV change predictably across sleep stages. Published data shows deep sleep typically corresponds with the lowest heart rate and highest HRV, while REM sleep shows more variable heart rate, similar to wakefulness. By combining heart rate patterns with motion data, modern trackers achieve sleep staging agreement with PSG that varies widely by device, from around 64% (Whoop 4.0, Miller et al., 2020) up to 79% (Oura Ring Gen 3, Altini & Kinnunen, 2021; the authors are Oura Health employees, which the paper discloses). The general principle that fusing heart rate and motion data improves on motion alone has been examined directly against PSG and wrist actigraphy in multisensor consumer wearables (Roberts et al., 2020).
3. Temperature Sensing
Some devices (Oura Ring, Samsung Galaxy Ring) also measure skin temperature. Body temperature follows a circadian rhythm, dropping during sleep onset and rising before waking. Research (Kräuchi et al., 1999) has shown that distal skin temperature changes are closely linked to sleep propensity.
Temperature data can improve sleep onset detection accuracy and provides additional value for illness detection (baseline temperature shifts upward) and menstrual cycle tracking (temperature rises after ovulation, validated in Maijala et al., 2019).
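The "shift from baseline" idea can be sketched in a few lines. This assumes the device stores one average skin temperature per night; the function name, the choice of a plain mean as the baseline, and the sample values are all hypothetical.

```python
from statistics import mean
from typing import List


def temperature_deviation(baseline_nights_c: List[float], tonight_c: float) -> float:
    """Tonight's skin temperature minus the mean of previous nights, in deg C."""
    return tonight_c - mean(baseline_nights_c)


baseline = [34.9, 35.0, 35.1, 35.0]   # hypothetical nightly averages
deviation = temperature_deviation(baseline, 35.5)
print(f"{deviation:+.2f} C vs baseline")
```

A sustained positive deviation is the kind of signal the illness and cycle-tracking features described above would look for; a single night proves little.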
How Sleep Staging Algorithms Work
The simplified version of what a modern sleep tracker does every 30 seconds:
- Read motion data: is there movement?
- Read heart rate and HRV: what are the current values and trends?
- Read temperature (if available): what is the deviation from baseline?
- Feed all signals into a machine learning model trained on thousands of PSG-validated nights
- Output a classification: Wake, Light Sleep, Deep Sleep, or REM
The model is probabilistic. It outputs its best estimate based on indirect signals. When those signals are ambiguous (lying still but awake, or the transition between light sleep and REM early in the night), the classification is often wrong.
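The steps above can be sketched as a toy pipeline. The hand-written scoring rules below are a hypothetical stand-in for the trained machine learning model, not any vendor's algorithm; every threshold, weight, and name is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class EpochFeatures:
    movement: float                          # movement count in this 30-second epoch
    heart_rate: float                        # beats per minute
    hrv: float                               # e.g. RMSSD in ms
    temp_deviation: Optional[float] = None   # deg C vs baseline; unused by this toy rule


def stage_probabilities(f: EpochFeatures, resting_hr: float) -> Dict[str, float]:
    """Hypothetical hand-written scores, normalized so they sum to 1."""
    scores = {"Wake": 1.0, "Light": 1.0, "Deep": 1.0, "REM": 1.0}
    if f.movement > 5:
        scores["Wake"] += 3.0
    else:
        scores["Light"] += 1.0
    if f.heart_rate < resting_hr and f.hrv > 50:
        scores["Deep"] += 2.0
    if f.heart_rate >= resting_hr and f.movement <= 5:
        scores["REM"] += 1.0
        scores["Wake"] += 1.0   # still-but-awake resembles REM in these signals
    total = sum(scores.values())
    return {stage: s / total for stage, s in scores.items()}


def classify(f: EpochFeatures, resting_hr: float) -> str:
    """Pick the stage with the highest probability."""
    probs = stage_probabilities(f, resting_hr)
    return max(probs, key=probs.get)


deep_like = EpochFeatures(movement=0, heart_rate=50, hrv=70)
ambiguous = EpochFeatures(movement=0, heart_rate=60, hrv=30)
print(classify(deep_like, resting_hr=55))
print(stage_probabilities(ambiguous, resting_hr=55))
```

For the ambiguous epoch, Wake, Light, and REM receive equal probability, so whichever label is emitted is effectively a coin toss. That is the failure mode the paragraph above describes.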
Where Published Research Shows Trackers Get It Wrong
Overestimating Total Sleep Time
This is the most consistently documented error across consumer trackers. A meta-analysis of wristband Fitbit models by Haghayegh et al. (2019, JMIR) found that the devices systematically overestimate total sleep time by 10-40 minutes compared to PSG. The primary cause: trackers log quiet wakefulness (lying still in bed before sleep onset) as light sleep.
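Overestimation of this kind is usually reported as a mean difference between tracker and PSG total sleep time, with a standard deviation. A minimal sketch, assuming paired per-night totals in minutes; the function name and the sample nights are hypothetical, not data from any study.

```python
from statistics import mean, stdev
from typing import List, Tuple


def tst_bias_minutes(tracker_tst: List[float], psg_tst: List[float]) -> Tuple[float, float]:
    """Mean and standard deviation of (tracker - PSG) total sleep time, in minutes."""
    if len(tracker_tst) != len(psg_tst):
        raise ValueError("tracker and PSG nights must be paired")
    diffs = [t - p for t, p in zip(tracker_tst, psg_tst)]
    return mean(diffs), stdev(diffs)


# Hypothetical paired nights, in minutes.
tracker_nights = [430.0, 415.0, 450.0, 400.0]
psg_nights = [410.0, 400.0, 420.0, 395.0]
bias, spread = tst_bias_minutes(tracker_nights, psg_nights)
print(f"bias {bias:+.1f} min, SD {spread:.1f} min")
```

A positive bias means the tracker reports more sleep than PSG. A large standard deviation relative to the bias means individual nights can be off by far more than the average suggests.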
Misclassifying Light Sleep
Light sleep stages (N1 and N2) are the hardest to detect via proxy signals. Heart rate and movement patterns during light sleep overlap significantly with both wakefulness and REM. Most consumer trackers combine N1 and N2 into a single "light sleep" category to reduce misclassification, but errors remain common.
First Sleep Cycle REM Detection
The first REM period typically occurs 70-90 minutes after sleep onset and lasts only 5-15 minutes. Published validation studies consistently show that consumer trackers often miss this first REM period or misclassify it as light sleep. Later, longer REM periods are detected more reliably.
Sensor Placement Matters
Ring-based PPG has been validated against medical-grade ECG for nocturnal heart rate and HRV (Kinnunen et al., 2020): in 49 adults aged 15-72, nightly average heart rate agreed with r² = 0.996 (mean bias -0.63 bpm) and nightly average HRV with r² = 0.980 (mean bias -1.2 ms). That study did not directly compare wrist and finger placement, so it supports the signal quality of finger-based PPG rather than a measured advantage over the wrist. Signal quality may partially explain why the Oura Ring achieves 79% four-stage PSG agreement (Altini & Kinnunen, 2021) compared to wrist-based devices such as Whoop at 64% (Miller et al., 2020), although those two figures come from separate studies. The Samsung Galaxy Ring has no published validation data to compare.
Accuracy by Device Category
Published Accuracy (PSG Agreement)
Epoch-by-epoch agreement with polysomnography. Higher is closer to clinical measurement.
- Oura Ring Gen 3: 79% four-stage agreement (Altini & Kinnunen, 2021)
- Samsung Galaxy Ring: no published data
- Apple Watch S10: no published percentage; Cohen's kappa 0.53 for the Series 8 (Schyvens et al., 2025)
- Whoop 4.0: 64% four-stage agreement (Miller et al., 2020)
- Fitbit Charge 6: no published percentage; Cohen's kappa 0.41 for the Charge 5 (Schyvens et al., 2025)
- Withings ScanWatch 2: no published percentage; Cohen's kappa 0.22 for the original ScanWatch (Schyvens et al., 2025)
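Percent agreement and Cohen's kappa are different metrics and are not directly comparable: kappa discounts the agreement two scorers would reach by chance. The sketch below computes both from two epoch-by-epoch label sequences. The function name and the example hypnograms are made up for illustration.

```python
from collections import Counter
from typing import List, Tuple


def agreement_and_kappa(psg: List[str], tracker: List[str]) -> Tuple[float, float]:
    """Return (fraction of matching epochs, Cohen's kappa) for two label sequences."""
    if len(psg) != len(tracker) or not psg:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(psg)
    observed = sum(p == t for p, t in zip(psg, tracker)) / n
    psg_counts, tracker_counts = Counter(psg), Counter(tracker)
    # Chance agreement: probability both pick the same stage independently.
    expected = sum(psg_counts[s] * tracker_counts[s] for s in psg_counts) / (n * n)
    if expected == 1.0:
        return observed, float("nan")   # kappa is undefined when only one label occurs
    return observed, (observed - expected) / (1.0 - expected)


# Hypothetical eight-epoch excerpt.
psg_labels = ["Wake", "Light", "Light", "Deep", "Deep", "REM", "REM", "Light"]
tracker_labels = ["Light", "Light", "Light", "Deep", "Light", "REM", "Light", "Light"]
observed, kappa = agreement_and_kappa(psg_labels, tracker_labels)
print(f"agreement {observed:.1%}, kappa {kappa:.2f}")
```

In this excerpt the tracker leans heavily on "Light", which inflates raw agreement; kappa comes out noticeably lower because much of that agreement is expected by chance.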
Based on published validation studies and manufacturer data:
Smart Rings
Finger-based PPG with stronger vascular signal. Oura Ring Gen 3 reaches 79% four-stage PSG agreement (Altini & Kinnunen, 2021, Sensors). Samsung Galaxy Ring has no published PSG validation data.
Smartwatches
Wrist-based PPG with larger sensor arrays. Apple Watch achieved the best Cohen's kappa (0.53) of six devices tested in Schyvens et al. (2025), with stage-specific accuracy of Wake 52%, Light 83%, Deep 51%, REM 69%. The study tested the Series 8 rather than the Series 10, and no single overall percentage exists.
Fitness Bands
Wrist-based PPG, typically smaller sensors. Whoop 4.0 reaches 64% four-stage agreement (Miller et al., 2020). Fitbit Charge 5 scored kappa 0.41 in Schyvens et al. (2025); no Charge 6 specific validation study exists.
Under-Mattress Sensors
Ballistocardiography (detecting micro-movements from heartbeats through the mattress). Published accuracy: 60-70%. Devices: Withings Sleep Analyzer.
How to Interpret Sleep Tracker Data
Based on the published accuracy limitations:
Trust relative trends over absolute numbers. If a tracker shows deep sleep declining over two weeks, that trend is likely real even if the absolute minutes are slightly off. Relative changes over time are more reliable than single-night readings.
Do not make decisions based on one night. A single night showing 45 minutes of deep sleep versus 50 minutes is within the margin of error for every consumer device. Patterns over 7-14 nights are meaningful.
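One simple way to read trends rather than single nights is a trailing 7-night average. This is a sketch under assumptions: the window length follows the 7-14 night guidance above, and the function name and the deep sleep minutes are hypothetical.

```python
from typing import List


def rolling_mean(values: List[float], window: int = 7) -> List[float]:
    """Mean of each full trailing window; returns len(values) - window + 1 points."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


# Hypothetical deep sleep minutes over 14 nights.
deep_minutes = [50, 45, 52, 48, 47, 44, 46, 43, 41, 42, 40, 39, 41, 38]
weekly = rolling_mean(deep_minutes)
print(f"first week avg {weekly[0]:.1f} min, latest week avg {weekly[-1]:.1f} min")
```

Night-to-night swings of a few minutes disappear in the averages, while a drift of several minutes between the first and latest window is the kind of change worth paying attention to.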
Use trackers as feedback tools, not diagnostic instruments. Consistent fragmented sleep patterns flagged by a tracker are worth discussing with a healthcare provider. But no consumer tracker can diagnose sleep apnea, insomnia, or any sleep disorder. The Withings ScanWatch 2 has FDA 510(k) clearance (K201456) for ECG and SpO2 measurement, which Withings markets as enabling breathing disturbance detection, but this is not a diagnosis.
Consistency matters more than precision. 90 nights of slightly less accurate data provides more actionable insight than 5 nights of perfect data. The best tracker is the one that gets worn every night.