
fnOS (飞牛) NAS: Analyzing a Reallocated_Sector_Ct (Spare Sector Reallocation) Alert on a Hard Drive

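This morning my NAS raised a Reallocated Sector Ct alert.

One spare sector has already been used. The four drives in this NAS were bought on Pinduoduo: enterprise surplus stock, sold as having 0 power-on hours.

So the drive has not been in service for long, yet it already has a bad sector. I need to analyze it first and then ask the seller what is going on.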

Analysis

SSH into the fnOS backend and use the serial number to find the matching disk in Linux:

sudo smartctl --scan | awk '{print $1}' | while read dev; do
    echo -n "$dev: "
    sudo smartctl -i "$dev" | grep "Serial Number" | awk '{print $3}'
done

The problem drive is ZC15GDG1, which maps to /dev/sdc.

Next, look at the full SMART information:

smartctl -a /dev/sdc
root@NAS:~# smartctl -a /dev/sdc
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.12.18-trim] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Enterprise Capacity 3.5 HDD
Device Model:     ST4000NM0265-2DC107
Serial Number:    ZC15GDG1
LU WWN Device Id: 5 000c50 0b0028f77
Add. Product Id:  DELL(tm)
Firmware Version: DB34
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database 7.3/5770
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Fri Jul  4 08:54:48 2025 CST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)	Offline data collection activity
					was completed without error.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 249)	Self-test routine in progress...
					90% of test remaining.
Total time to complete Offline
data collection: 		(   90) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 403) minutes.
Conveyance self-test routine
recommended polling time: 	 (   3) minutes.
SCT capabilities: 	       (0x50bd)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x010f   080   064   044    Pre-fail  Always       -       90376105
  3 Spin_Up_Time            0x0103   088   088   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       11
  5 Reallocated_Sector_Ct   0x0133   100   100   010    Pre-fail  Always       -       1
  7 Seek_Error_Rate         0x000f   082   060   045    Pre-fail  Always       -       176707062
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       791
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       5
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0 0 0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   057   047   040    Old_age   Always       -       43 (Min/Max 29/46)
191 G-Sense_Error_Rate      0x0032   099   099   000    Old_age   Always       -       3548
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       17
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       159
194 Temperature_Celsius     0x0022   043   053   000    Old_age   Always       -       43 (0 28 0 0 0)
195 Hardware_ECC_Recovered  0x001a   025   001   000    Old_age   Always       -       90376105
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       1
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       767h+41m+22.645s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       11700990479
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       12071513596

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Self-test routine in progress 90%       791         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Based on the full smartctl -a /dev/sdc report, here is a detailed analysis and a set of recommended actions for the Seagate ST4000NM0265 (serial number ZC15GDG1):


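1. Key SMART attribute analysis

ID   Name                     Value  Thresh  Raw       Analysis
5    Reallocated_Sector_Ct    100    10      1         One sector has been remapped. The normalized value (100) is still far above the threshold (10), so this is not a serious threat yet.
197  Current_Pending_Sector   100    0       0         No pending sectors, so there are no new suspected bad sectors.
198  Offline_Uncorrectable    100    0       0         No uncorrectable sectors; the hardware looks healthy.
199  UDMA_CRC_Error_Count     200    0       0         No cable or interface errors, which rules out a connection problem.
190  Airflow_Temperature_Cel  57     40      43°C      Slightly warm (43°C). Improving cooling is advisable, with 40°C or below as a comfortable target.
1    Raw_Read_Error_Rate      80     44      90376105  A large raw value is normal on Seagate drives and is nothing to worry about.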
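The raw values in the table can be pulled out of the attribute list instead of being read by eye. The following is a minimal sketch, assuming the standard ten-column `smartctl -A` layout shown in the report above; `smart_raw` is a hypothetical helper name, not part of smartmontools, and the sample rows are copied from the report so the sketch runs without a disk.

```shell
# smart_raw (hypothetical helper): print the RAW_VALUE column (field 10)
# of one named attribute from `smartctl -A` output read on stdin.
smart_raw() {
    awk -v name="$1" '$2 == name { print $10; exit }'
}

# Sample rows copied from the report above.
sample='  5 Reallocated_Sector_Ct   0x0133   100   100   010    Pre-fail  Always       -       1
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0'

# Print name=raw pairs for the attributes that matter for bad-sector analysis.
for attr in Reallocated_Sector_Ct Current_Pending_Sector Offline_Uncorrectable UDMA_CRC_Error_Count; do
    printf '%s=%s\n' "$attr" "$(printf '%s\n' "$sample" | smart_raw "$attr")"
done
```

On the real drive the same helper would be fed live data, for example `sudo smartctl -A /dev/sdc | smart_raw Reallocated_Sector_Ct`.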

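2. Current drive status summary

  • Health status: PASSED (SMART overall-health self-assessment).
  • Bad sectors: only 1 reallocated sector and no pending sectors.
  • Temperature: 43°C (cooling should be improved).
  • Power-on time: 791 hours (about one month), so the drive is fairly new.
  • Test in progress: the report shows an Extended offline (long) self-test that is 10% complete; wait for it to finish.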

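3. Recommended next steps

(1) Wait for the long test to finish

The drive is currently running an Extended offline (long) self-test with 90% remaining:

# Check test progress and, once it finishes, the result in the self-test log
sudo smartctl -l selftest /dev/sdc

# After the test completes, confirm the overall health assessment
sudo smartctl -H /dev/sdc
  • If the result is still PASSED and Reallocated_Sector_Ct has not increased, the drive can stay in service.
  • If the test fails or new bad sectors appear, back up the data and consider replacing the drive.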
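To get a rough idea of how long the wait will be, the remaining percentage can be combined with the drive's recommended polling time for the extended test (403 minutes in the report above). This is a minimal sketch under that assumption; `selftest_remaining` and `selftest_eta` are hypothetical helper names, and the sample text is a whitespace-normalized copy of the lines from the report.

```shell
# selftest_remaining (hypothetical helper): print the "% of test remaining"
# figure from smartctl output on stdin, or 0 when no self-test is running.
selftest_remaining() {
    awk '/% of test remaining/ { sub(/%.*/, "", $1); print $1; found = 1; exit }
         END { if (!found) print 0 }'
}

# selftest_eta (hypothetical helper): estimate minutes left from the total
# recommended polling time and the remaining percentage (integer arithmetic).
selftest_eta() {
    total_min="$1"; remaining_pct="$2"
    echo $(( total_min * remaining_pct / 100 ))
}

# Sample lines taken from the "General SMART Values" section of the report.
sample='Self-test execution status:      ( 249) Self-test routine in progress...
                                        90% of test remaining.'

pct=$(printf '%s\n' "$sample" | selftest_remaining)
echo "remaining: ${pct}%, roughly $(selftest_eta 403 "$pct") minutes left"
```

On the NAS itself the input would come from `sudo smartctl -c /dev/sdc`, which prints the self-test execution status. The estimate is only approximate, since a drive that is busy serving I/O may take longer than the recommended polling time.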

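(2) Monitor temperature

  • Check the drive's cooling environment (for example, whether the fans work properly and whether the drives have enough spacing).
  • The temperature can be monitored in real time with the following command: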
watch -n 5 'sudo smartctl -A /dev/sdc | grep Temperature_Celsius'
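For unattended checks, the temperature reading can be compared against a limit instead of being watched by hand. The following is a minimal sketch; `temp_check` is a hypothetical helper name and the limits used here (40 and 45) are illustrative assumptions, not vendor specifications.

```shell
# temp_check (hypothetical helper): read `smartctl -A` output on stdin and
# print "HOT <temp>" when attribute 194 exceeds the given limit (in °C),
# otherwise "OK <temp>". The raw temperature is field 10 of the row.
temp_check() {
    awk -v limit="$1" '$2 == "Temperature_Celsius" {
        status = ($10 + 0 > limit + 0) ? "HOT" : "OK"
        printf "%s %d\n", status, $10
    }'
}

# Sample row copied from the report above.
sample='194 Temperature_Celsius     0x0022   043   053   000    Old_age   Always       -       43 (0 28 0 0 0)'

printf '%s\n' "$sample" | temp_check 40
printf '%s\n' "$sample" | temp_check 45
```

With live data this would be `sudo smartctl -A /dev/sdc | temp_check 40`, and the output could feed a notification script.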

(3) Check SMART status regularly

Check the key attributes about once a week:

sudo smartctl -A /dev/sdc | grep -e "Reallocated_Sector_Ct" -e "Current_Pending_Sector" -e "Temperature_Celsius"

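(4) Back up important data (as a precaution)

Even though the current problem is not serious, back up the data anyway, especially since a reallocated sector can be an early sign of wear:

# Use dd or rsync to back up the data to another drive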
sudo rsync -avh /path/to/data /mnt/backup/

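4. Does the drive need to be replaced right away?

  • No immediate replacement needed: there is only 1 reallocated sector and no other errors, so the drive can remain in use for now.
  • Warning signs: act immediately if any of the following shows up later:
    • Reallocated_Sector_Ct keeps increasing.
    • Current_Pending_Sector > 0.
    • Frequent I/O errors or degraded performance.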
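The first two warning signs are easy to encode as a check against a recorded baseline. This is a minimal sketch, assuming the baseline is the reallocated-sector count noted today (1); `assess_disk` is a hypothetical helper name, and I/O errors or performance drops still need to be judged separately.

```shell
# assess_disk (hypothetical helper): compare current SMART counters with a
# recorded baseline and print a recommendation.
#   $1 = baseline Reallocated_Sector_Ct raw value
#   $2 = current Reallocated_Sector_Ct raw value
#   $3 = current Current_Pending_Sector raw value
assess_disk() {
    baseline="$1"; realloc="$2"; pending="$3"
    if [ "$pending" -gt 0 ] || [ "$realloc" -gt "$baseline" ]; then
        echo "ACT: back up and plan a replacement"
    else
        echo "OK: keep monitoring"
    fi
}

# Today's readings from the report: baseline 1, current 1, pending 0.
assess_disk 1 1 0
```

The same function reports `ACT` as soon as the reallocated count rises above the baseline or any pending sector appears, for example `assess_disk 1 3 0` or `assess_disk 1 1 2`.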

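5. Optimization suggestions

  • Improve cooling: make sure the drives are well ventilated and add a fan if necessary.
  • Avoid frequent spin-up and spin-down: keep Start_Stop_Count low (currently 11, which is normal).
  • Enable automatic SMART monitoring (example configuration):
# Append to /etc/crontab (runs every Sunday at 00:00); the system crontab needs a user field, so the job runs as root
echo "0 0 * * 0 root smartctl -H -A /dev/sdc | mail -s 'SMART Report sdc' your@email.com" | sudo tee -a /etc/crontab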

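Summary

  • Current state: the drive is in good shape, with only 1 historical bad sector and a slightly high temperature.
  • Recommended actions: wait for the long test to finish → improve cooling → monitor regularly → back up data.
  • What to watch: Reallocated_Sector_Ct and temperature changes.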