Subject: Re: [CentOS] Kernel:[Hardware Error]:



On Sat, Aug 12, 2017 at 05:51:33PM -0400, Steven Tardy wrote:
>
> > On Aug 12, 2017, at 3:50 PM, Fred Smith <[email protected]>
> > wrote:
> >
> > I had a series of kernel hardware error reports today while I was away
> > from my computer:
> >
> > Message from [email protected] at Aug 12 10:12:24 ...
> > kernel:[Hardware Error]: MC2 Error: VB Data ECC or parity error.
> >
> > Message from [email protected] at Aug 12 10:12:24 ...
> > kernel:[Hardware Error]: Error Status: Corrected error, no action required.
> >
> > Message from [email protected] at Aug 12 10:12:24 ...
> > kernel:[Hardware Error]: CPU:2 (15:2:0)
> > MC2_STATUS[-|CE|MiscV|-|-|-|-|CECC]: 0x98444000010c0176
> >
> > Message from [email protected] at Aug 12 10:12:24 ...
> > kernel:[Hardware Error]: cache level: L2, tx: DATA, mem-tx: EV
> >
> > never saw anything like that before.
> >
> > cpu is:
> >
> > $ cat /proc/cpuinfo
> > processor : 0
> > vendor_id : AuthenticAMD
> > cpu family : 21
> > model : 2
> > model name : AMD FX(tm)-6300 Six-Core Processor
> > stepping : 0
> > microcode : 0x600084f
> > cpu MHz : 1400.000
> > cache size : 2048 KB
> > physical id : 0
> > siblings : 6
> > core id : 0
> > cpu cores : 3
> > apicid : 16
> > initial apicid : 0
> > fpu : yes
> > fpu_exception : yes
> > cpuid level : 13
> > wp : yes
> > flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
> > cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt
> > pdpe1gb rdtscp lm constant_tsc art rep_good nopl nonstop_tsc extd_apicid
> > aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 popcnt aes
> > xsave avx f16c lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a
> > misalignsse 3dnowprefetch osvw ibs xop skinit wdt lwp fma4 tce nodeid_msr
> > tbm topoext perfctr_core perfctr_nb arat cpb hw_pstate npt lbrv svm_lock
> > nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter
> > pfthreshold bmi1
> > bogomips : 7023.90
> > TLB size : 1536 4K pages
> > clflush size : 64
> > cache_alignment : 64
> > address sizes : 48 bits physical, 48 bits virtual
> > power management: ts ttp tm 100mhzsteps hwpstate cpb eff_freq_ro
> >
> >
> > six core AMD, above is one of the cores.
> >
> > Any clues to figure out the errors, and/or mitigate?
> >
> > thanks!
> >
> > Fred
>
> MC == Machine Check; the "2" in "MC2" is the machine-check bank number.
> The important part of an MC report is the "status" code.
> One can use the Intel "64 and IA-32 Architectures Software Developer's
> Manual" to decode this (a 4000-page .pdf).
> I'm unsure, but it looks like AMD uses similar MC codes.
> Luckily Linux does the heavy lifting and decodes it to "cache hierarchy
> error, L2 data eviction".
> The next most important part is the "corrected" bit: the hardware reports
> that it already fixed this error.
>
> Now what does that really mean?
> *shrug*, could be
> firmware/drivers/overheating/poor-CPU-seating/DIMM-seating/faulty-motherboard/faulty-CPU/faulty-DIMM.
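
For anyone who wants to see how the status value breaks down, here is a
rough sketch in Python. It is not the kernel's decoder. The bit positions
are my assumption, taken from the generic machine-check status layout plus
the AMD-specific CECC/UECC bits, and the name decode_mc_status is just
something made up for illustration.

```python
# Assumed bit positions in the 64-bit MCi_STATUS register.
# cecc/uecc are AMD-specific; check the vendor manual for other CPUs.
STATUS_BITS = {
    "valid": 63, "overflow": 62, "uncorrected": 61, "enabled": 60,
    "miscv": 59, "addrv": 58, "pcc": 57, "cecc": 46, "uecc": 45,
}

# Sub-fields of a memory-hierarchy error code: 0000 0001 RRRR TTLL
MEM_TX = ["GEN", "RD", "WR", "DRD", "DWR", "IRD", "PRF", "EV", "SNP"]
TX = ["INSN", "DATA", "GEN", "RESV"]
LEVEL = ["L0", "L1", "L2", "GENERIC"]


def decode_mc_status(status):
    """Split a raw MCi_STATUS value into named flags and sub-fields."""
    out = {name: bool((status >> pos) & 1) for name, pos in STATUS_BITS.items()}
    code = status & 0xFFFF  # low 16 bits: architectural error code
    out["error_code"] = code
    if (code & 0xFF00) == 0x0100:  # memory/cache hierarchy error
        rrrr = (code >> 4) & 0xF
        out["mem_tx"] = MEM_TX[rrrr] if rrrr < len(MEM_TX) else "RESV"
        out["tx"] = TX[(code >> 2) & 0x3]
        out["cache_level"] = LEVEL[code & 0x3]
    return out


if __name__ == "__main__":
    # Status value copied from the log above.
    for key, value in decode_mc_status(0x98444000010C0176).items():
        print(f"{key}: {value}")
```

For the value in my log this gives uncorrected = False, cecc = True,
cache level L2, tx DATA and mem-tx EV, which agrees with what the kernel
printed.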

Well, overheating is possible. We don't live in the cleanest possible
house, AND we have cats, so in general I open up this box twice a year
and vacuum out the house dirt and cat fuzzies. I'm probably overdue for
this task.

This is the first one of these I've had. I hope it's the last, but a
little PM (preventive maintenance) is in order either way.
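
To compare temperatures before and after the cleaning, something like the
sketch below could be used. It assumes the kernel exposes sensors through
the usual hwmon sysfs files (temp*_input, in millidegrees Celsius) under
/sys/class/hwmon; read_hwmon_temps is a made-up helper name. Which chips
show up depends on the loaded drivers, and on some AMD CPUs the reported
value may be on a relative scale rather than a true temperature.

```python
from pathlib import Path


def read_hwmon_temps(base="/sys/class/hwmon"):
    """Return {"chip/sensor": degrees_C} for every readable temp*_input file."""
    temps = {}
    for chip in sorted(Path(base).glob("hwmon*")):
        name_file = chip / "name"
        # Fall back to the directory name if the driver gives no chip name.
        chip_name = name_file.read_text().strip() if name_file.exists() else chip.name
        for sensor in sorted(chip.glob("temp*_input")):
            try:
                millideg = int(sensor.read_text().strip())
            except (OSError, ValueError):
                continue  # sensor present but unreadable; skip it
            temps[f"{chip_name}/{sensor.name}"] = millideg / 1000.0
    return temps


if __name__ == "__main__":
    for label, deg in read_hwmon_temps().items():
        print(f"{label}: {deg:.1f} C")
```

Running it once under load before vacuuming and once after should show
whether the dust was making a difference.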

Thanks for the reply.

Fred
>
> Hope that doesn't confuse too much. (:
> _______________________________________________
> CentOS mailing list
> [email protected]
> https://lists.centos.org/mailman/listinfo/centos

--
---- Fred Smith -- [email protected] -----------------------------
The Lord detests the way of the wicked
but he loves those who pursue righteousness.
----------------------------- Proverbs 15:9 (niv) -----------------------------


