Sleep Tracker Accuracy: What the Published Research Actually Shows

The Gold Standard: Polysomnography

Polysomnography (PSG) is the clinical reference standard for sleep measurement. It uses EEG electrodes to measure brain waves, EOG for eye movement, and EMG for muscle activity. Trained technicians score sleep stages in 30-second intervals called epochs. This is the benchmark every consumer sleep tracker is compared against.

None of the wrist- and finger-worn wearables covered here includes EEG sensors. Instead, they use proxy signals, primarily heart rate (measured optically by photoplethysmography, PPG), motion (accelerometry), and skin temperature, to estimate what the brain is probably doing. This fundamental limitation is why no consumer device achieves perfect agreement with PSG.

What "79% Accuracy" Actually Means

When published research reports that a device achieves "79% agreement with polysomnography," it means that in a controlled study, the device classified the same sleep stage as the PSG lab in 79 out of every 100 thirty-second epochs.

That also means approximately 1 in 5 epochs was classified differently than the clinical measurement. For an 8-hour sleep period (960 epochs), roughly 200 of those epochs may be assigned to the wrong sleep stage.

This does not mean the device is wrong about your total sleep time by 21%. Total sleep time (TST) accuracy is a separate metric. Most devices estimate total sleep duration within 15-30 minutes of PSG, which is substantially more accurate than their stage-by-stage classification.
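The arithmetic above, and the difference between stage agreement and total sleep time, can be sketched in a few lines of Python. The hypnograms below are invented for illustration, and the helper names (epochs_in, epoch_agreement, total_sleep_minutes) and the single-letter stage labels are assumptions of this sketch, not taken from any cited study.

```python
EPOCH_SECONDS = 30  # PSG scoring interval described above


def epochs_in(hours):
    """Number of 30-second epochs in a recording of the given length."""
    return int(hours * 3600 / EPOCH_SECONDS)


def epoch_agreement(device, psg):
    """Fraction of epochs where both hypnograms carry the same stage label."""
    if len(device) != len(psg):
        raise ValueError("hypnograms must cover the same epochs")
    matches = sum(d == p for d, p in zip(device, psg))
    return matches / len(psg)


def total_sleep_minutes(hypnogram):
    """Minutes spent in any stage other than wake ('W')."""
    asleep = sum(stage != "W" for stage in hypnogram)
    return asleep * EPOCH_SECONDS / 60


# Invented 10-epoch example (W = wake, L = light, D = deep, R = REM).
# The stage labels disagree in 3 epochs, yet both hypnograms contain
# the same number of sleep epochs.
psg    = ["W", "L", "L", "D", "D", "D", "R", "R", "L", "W"]
device = ["W", "L", "L", "L", "D", "D", "L", "R", "R", "W"]

print(epochs_in(8))                  # 960 epochs in 8 hours
print(round(0.21 * epochs_in(8)))    # 202 mismatched epochs at 79% agreement
print(epoch_agreement(device, psg))  # 0.7
print(total_sleep_minutes(psg), total_sleep_minutes(device))  # 4.0 4.0
```

In this toy case the device agrees with PSG on only 70% of epochs but reports exactly the same total sleep time, which is why the two metrics have to be read separately.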

Published Accuracy Data Across Devices

Published Accuracy (PSG Agreement)

Device	Published epoch agreement	Source / other published metric
Oura Ring Gen 3	79%	Altini & Kinnunen, 2021
Samsung Galaxy Ring	No data	No published data
Apple Watch S10	No data	Kappa 0.53 (Schyvens 2025); see below
Whoop 4.0	64%	Miller et al., 2020
Fitbit Charge 6	No data	Kappa 0.41 (Charge 5, Schyvens 2025); see below
Withings ScanWatch 2	No data	Kappa 0.22 (Schyvens 2025); see below

Epoch-by-epoch agreement with polysomnography. Higher is closer to clinical measurement.

The following accuracy figures come from peer-reviewed validation studies. Reported values are not directly comparable across studies because methodologies differ. Where a device has a published four-stage PSG agreement figure, it is reported below. Where no such figure exists, Cohen's kappa is cited from Schyvens et al. (2025), which tested six wearables against PSG in the same protocol. Kappa is a stricter agreement measure than raw percent agreement because it corrects for the agreement expected by chance.

Oura Ring Gen 3: 79% four-stage PSG agreement (Altini & Kinnunen, 2021, Sensors; 440 nights, 106 individuals).
Whoop 4.0: 64% four-stage PSG agreement, kappa 0.47 (Miller et al., 2020, Journal of Sports Sciences).
Apple Watch: Cohen's kappa 0.53 (best of six devices), with stage-specific accuracy Wake 52%, Light 83%, Deep 51%, REM 69% (Schyvens et al., 2025, Sleep Advances). Apple's own white paper reports Deep 62%, REM 81%.
Fitbit Charge 5: Cohen's kappa 0.41 (Schyvens et al., 2025). An earlier meta-analysis (Haghayegh et al., 2019, JMIR) reported sleep/wake accuracy of 81-91% across older Fitbit models. No Charge 6-specific validation study exists.
Withings ScanWatch: Cohen's kappa 0.22, lowest of six devices tested (Schyvens et al., 2025). Uses 3-stage classification (combines deep + REM).
Samsung Galaxy Ring: No published PSG validation data exists.
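
Because several of the figures above are Cohen's kappa rather than percent agreement, a short Python sketch of the standard kappa formula may help show how the two differ. The function name cohens_kappa and the example hypnograms are assumptions made for illustration; this is not the analysis code of Schyvens et al. or of any other cited study.

```python
from collections import Counter


def cohens_kappa(device, psg):
    """Cohen's kappa: (observed - expected) / (1 - expected) agreement."""
    n = len(psg)
    if n == 0 or len(device) != n:
        raise ValueError("hypnograms must be non-empty and equally long")

    # Observed agreement: share of epochs with identical labels.
    observed = sum(d == p for d, p in zip(device, psg)) / n

    # Expected chance agreement from each rater's stage frequencies.
    device_counts = Counter(device)
    psg_counts = Counter(psg)
    expected = sum(device_counts[s] * psg_counts[s] for s in psg_counts) / (n * n)

    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 100%")
    return (observed - expected) / (1 - expected)


# Invented example: a "device" that labels every epoch as light sleep.
psg    = ["L"] * 6 + ["W"] * 2 + ["D"] * 2
device = ["L"] * 10

raw = sum(d == p for d, p in zip(device, psg)) / len(psg)
print(raw)                        # 0.6 raw agreement
print(cohens_kappa(device, psg))  # 0.0, no better than chance
```

A tracker that always guesses the most common stage still scores 60% raw agreement here, while its kappa is zero. That is the reason a kappa value and a percent-agreement value should not be placed on the same scale.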

Total sleep time deviation indicates how far, on average, the device's reported total sleep differs from PSG measurement.

Where Accuracy Drops

Published research consistently identifies several scenarios where consumer sleep trackers perform worse:

Light sleep vs. wake distinction. Most devices struggle to distinguish quiet wakefulness from light sleep when the user is lying still. Accelerometers detect motion, and still-but-awake periods produce no motion signal. Published data shows this is the most common error category across all consumer devices.

Sleep onset timing. Devices that rely heavily on motion data tend to mark sleep onset when the user stops moving, which may precede actual sleep onset by 10-20 minutes. Devices with temperature sensors can partially compensate by detecting the circadian temperature drop that accompanies sleep onset.

Fragmented sleep. Brief awakenings (under 5 minutes) are frequently missed by consumer trackers. PSG catches these because EEG directly measures brain state changes, while proxy signals may not shift enough during brief arousals.

REM vs. light sleep. Both stages involve relatively low movement and similar heart rate patterns. Stage classification algorithms must distinguish them primarily through heart rate variability patterns, which is inherently less precise than EEG-based classification.
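
The wake/light and REM/light confusions described above become visible when stage pairs are tallied epoch by epoch. The following Python sketch uses invented hypnograms and hypothetical helper names (confusion_counts, stage_sensitivity); it illustrates the idea and does not reproduce the scoring method of any study cited here.

```python
from collections import Counter


def confusion_counts(device, psg):
    """Count (psg_stage, device_stage) pairs across all epochs."""
    if len(device) != len(psg):
        raise ValueError("hypnograms must cover the same epochs")
    return Counter(zip(psg, device))


def stage_sensitivity(device, psg, stage):
    """Share of PSG epochs of `stage` that the device also labeled `stage`."""
    total = sum(p == stage for p in psg)
    if total == 0:
        raise ValueError(f"no PSG epochs scored as {stage!r}")
    hits = sum(p == stage and d == stage for d, p in zip(device, psg))
    return hits / total


# Invented example (W = wake, L = light, R = REM): two still-but-awake
# epochs and one REM epoch are labeled as light sleep by the device.
psg    = ["W", "W", "W", "W", "L", "L", "L", "L", "R", "R", "R", "R"]
device = ["W", "L", "L", "W", "L", "L", "L", "L", "R", "L", "R", "R"]

counts = confusion_counts(device, psg)
print(counts[("W", "L")])                   # 2 wake epochs scored as light
print(counts[("R", "L")])                   # 1 REM epoch scored as light
print(stage_sensitivity(device, psg, "W"))  # 0.5
print(stage_sensitivity(device, psg, "R"))  # 0.75
```

Per-stage sensitivity is the kind of figure quoted earlier for the Apple Watch (for example Wake 52%, Light 83%), and it shows where the errors concentrate rather than only how many there are.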

Peer-Reviewed vs. Manufacturer Validation

Not all accuracy data carries equal weight. There is a meaningful distinction between independent peer-reviewed studies and manufacturer-internal validation.

Independent peer-reviewed studies are conducted by researchers unaffiliated with the device manufacturer, submitted to academic journals, and reviewed by other experts before publication. The methodology, sample size, and statistical analysis are scrutinized by the scientific community. Examples: Altini & Kinnunen, 2021 (Sensors) for Oura Ring, Miller et al., 2020 (Journal of Sports Sciences) for Whoop 4.0, Haghayegh et al., 2019 (JMIR) for Fitbit devices, Schyvens et al., 2025 (Sleep Advances) comparing six wearables against PSG in the same protocol, and Chinoy et al., 2021, which compared seven consumer sleep-tracking devices against PSG in 34 adults.

Manufacturer-internal validation is testing conducted by the company that makes the device. The data may be rigorous, but it has not undergone independent peer review. The company controls the study design, participant selection, and which results to publish. Examples: Samsung's internal validation (2024) for the Galaxy Ring, Apple's internal data for watchOS 11 improvements.

Both types of data are useful, but independent validation provides higher confidence because the testing is free from commercial incentives.

Why No Device Exceeds 80%

The accuracy ceiling for consumer sleep trackers appears to be around 79-80% PSG agreement. This is not a failure of engineering but a fundamental constraint of the measurement approach.

Consumer devices measure peripheral signals (heart rate from the finger or wrist, body movement, skin temperature). The brain states that define sleep stages (N1, N2, N3, REM) are defined by EEG patterns that occur in the brain, not in the periphery. The correlation between peripheral signals and brain states is strong but imperfect.

Published research shows that even the best algorithm applied to the best sensor data from the best placement cannot fully reconstruct brain state from proxy signals. The 79% figure represents the current frontier of what peripheral measurement can achieve.

A 2025 meta-analysis pooling 24 validation studies of consumer wrist-worn trackers (n = 798 participants, 12 brands) found that these devices underestimate total sleep time by 16.85 minutes on average (mean difference -16.85 minutes, 95% CI -26.33 to -7.38) and underestimate sleep efficiency by 4.69 percentage points on average (Lee et al., 2025, Journal of Clinical Sleep Medicine).

A separate 2026 validation of the Muse-S EEG headband against level 1 PSG in 47 adults reported a Cohen's kappa of 0.76 (Lanthier et al., 2026). That is substantially above the kappa of 0.53 for the best wrist-worn device in the six-tracker comparison (Schyvens et al., 2025), which suggests that devices measuring brain activity directly (EEG) outperform peripheral-signal devices on sleep stage classification.

How to Interpret Your Data

Metric	Reliability
Total sleep time	High (15-30 min deviation)
Sleep/wake detection	Moderate (still-wake errors)
Deep sleep duration	Moderate (relative trends valid)
REM sleep duration	Moderate (relative trends valid)
Sleep onset time	Low-moderate (10-20 min error)
Brief awakenings	Low (frequently missed)

Trends over time are more reliable than any single night. Even at 79% accuracy, a single night's stage breakdown contains meaningful error. But when the same device measures you consistently over weeks and months, the trends in sleep duration, timing, and relative stage proportions are more informative.
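
One simple way to read trends instead of single nights is a rolling average over the nightly values a device exports. The Python sketch below is a minimal illustration under stated assumptions: the nightly deep-sleep minutes are invented, the 7-night window is an arbitrary choice, and rolling_mean is a hypothetical helper, not a feature of any tracker's app.

```python
from statistics import mean


def rolling_mean(values, window=7):
    """Mean of each run of `window` consecutive nights."""
    if window < 1:
        raise ValueError("window must be at least 1")
    return [
        mean(values[i - window + 1 : i + 1])
        for i in range(window - 1, len(values))
    ]


# Invented nightly deep-sleep estimates in minutes (10 nights).
deep_minutes = [50, 70, 45, 65, 55, 75, 60, 40, 80, 50]

smoothed = rolling_mean(deep_minutes, window=7)

nightly_spread = max(deep_minutes) - min(deep_minutes)
smoothed_spread = max(smoothed) - min(smoothed)

print(nightly_spread)             # 40 minutes between best and worst night
print(round(smoothed_spread, 1))  # 2.1 minutes across the 7-night averages
```

The single-night values swing by 40 minutes, while the 7-night averages stay within about 2 minutes of each other. Averaging does not remove a device's systematic bias, but it damps the night-to-night classification noise.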

Total sleep time is the most reliable metric. Devices estimate TST within 15-30 minutes of PSG on average. This is the metric where consumer trackers are closest to clinical measurement.

Relative comparisons within a device are valid. If your device consistently shows reduced deep sleep after late caffeine intake, that pattern is meaningful even if the absolute deep sleep minutes contain error.

Cross-device comparisons are unreliable. Different devices use different algorithms and different definitions of stage boundaries. A reported "45 minutes of deep sleep" on one device is not directly comparable to the same metric on another device.