Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

win_perf_counters '% Processor Time' sent invalid data periodically (every ~2 minutes) #4453

Open
nognomar opened this issue Jul 23, 2018 · 23 comments
Labels
area/windows Related to windows plugins (win_eventlog, win_perf_counters, win_services) platform/windows

Comments

@nognomar
Copy link

nognomar commented Jul 23, 2018

Relevant telegraf.conf:

https://gist.github.com/nognomar/95c8c0c086106e2fb66e8f9d8940638f

System info:

[Include Telegraf version, operating system name, and other relevant details]
telegraf 1.7.2
Windows server 2012R2
Processor Intel-Xeon-Processor-E5-2620 (2 processors)

Every 2 (sometime 4) minutes telegraf detect 100% CPU loading. But I can't detect that by other methods (task manager, sysinternals process explorer, performance monitor)

I create simple tool in C# to monitor cpu activity, but it detects nothing:
https://gist.github.com/nognomar/05cbcb0ca8842c537e163557a4f96517

@glinton glinton added platform/windows area/windows Related to windows plugins (win_eventlog, win_perf_counters, win_services) labels Jul 23, 2018
@danielnelson
Copy link
Contributor

@vlastahajek Any idea what could cause this?

@vlastahajek
Copy link
Contributor

@nognomar, does this happen also on another machine, if you have one?

Could, please, also try the typeperf tool if it matches what Telegraf detects?
Simply run from following command line: typeperf "\Processor(_Total)\% Processor Time"

@vlastahajek
Copy link
Contributor

Also, it's worth to disable refreshing counters configuration by setting performance counters global parameter: CountersRefreshInterval="0s"

@vlastahajek
Copy link
Contributor

For easy comparison of the performance counters plugin output and the typeperf output do following two simple steps to get typeperf output to InfluxDb (and display in Chronograf or Grafana):

  1. redirect typeperf 's output to file: typeperf "\Processor(_Total)\% Processor Time" >c:\tmp\typeperf_cpu.log
  2. setup the Telegraf's tail plugin to read typeperf output:
[[inputs.tail]]
  files = ['c:\tmp\typeperf_cpu.log']
  from_beginning = true
  data_format = "grok"
  grok_patterns = ['"%{DATESTAMP:timestamp:ts-"01/02/2006 15:04:05.000"}","%{NUMBER:value:float}"']
  name_override = "typeperf_cpu"
  grok_timezone = "Local"

@danielnelson
Copy link
Contributor

Oh that's pretty handy, I'll just mention that using tail with grok will require a nightly builds, with the lastest release version it would need slight adjustmented to use the logparser input.

@vlastahajek
Copy link
Contributor

My bad, sorry. I was daunted by warning about deprecation of logparser. But no problem, I already have also logparser config ready:

[[inputs.logparser]]
  files = ['c:\tmp\typeperf_cpu.log']
  from_beginning = true
  [inputs.logparser.grok]
    patterns = ['"%{DATESTAMP:timestamp:ts-"01/02/2006 15:04:05.000"}","%{NUMBER:value:float}"']
    measurement = "typeperf_cpu"
    timeZone = "Local"

@nognomar
Copy link
Author

@vlastahajek, thanks for your helping.
I have six stands with the same configurations and telegraf detect 100% cpu spikes on all of them.
Telegraf/Typeperf Snapshot

I disable refreshing counters interval and it's helped me. Thanks!

@vlastahajek
Copy link
Contributor

@nognomar, thanks for validation. The typeperf waveform is too different from Telegraf. Isn't there different function applied in the query? E.g. mean vs raw value.

So, with the disabled refreshing, values detected by Telegraf now matches others (typeperf, perfmon)?

@nognomar
Copy link
Author

@vlastahajek, now the data is definitely similar =)
Snapshot

But look at the beginning of the graph. It's restart telegraf service. Maybe this "magic 100" something like initial value in plugin's counter's map?

@Daryes
Copy link

Daryes commented Jul 26, 2018

The initial 100% has always been here, afaik from start. I've had other monitoring utilities with the same behavior, related to the initial connection to the system performance counters and registering a ~100% system CPU spike.

On the main point, I'm having the same problem with telegraf v1.7.2 : constant CPU usage reported around 90% when the real system usage is more along 5%.
The config is the same as OP, on a 5 min interval.
System is Windows 2012 (not R2), 1 vCPU, virtualized under VMware.

Without a change to the config file, reverting telegraf.exe to the previously used version (v1.5.1) gives proper values.
I've just tried telegraf v16.4, with the same configuration, and the CPU usage values are correct. So a change between 1.6.x and 1.7.x is the source of the erroneous values.

@vlastahajek
Copy link
Contributor

@nognomar This should be spike caused by Telegraf initialization, but should be also detected by other measurement.

In general, you can validate results by showing also values for '% Idle Time`, and it should be counterpart for value of % Processor Time.

Interval 1s is very short interval for counter measurements. Many performance counters need two samples with at least 1s pause between samples. With refreshing on, you will see zero values at the time of refreshing, easily observable by Idle Time value going zero.

@vlastahajek
Copy link
Contributor

@Daryes Does OP mean Originaly Provided?
And 5mins is agent collection interval?

The main difference in the default configuration from version<1.7 is the default value for the CountersRefreshInterval configuration parameter, which causes counters to be periodically recreated. This is primarily useful when gathering values from counters with dynamic instances (process, disk) and UseWildcardsExpansion turned on.

On the other hand, there is is almost no change in values gathering.

If you have 5min interval in the default configuration, with refreshing turned on, it can happen that at time of measurement counters are refreshed and it is causing high CPU usage peak.

I'm doing thorough testing with various configurations on multiple hosts (Win Server 2008 R2, Windows Server 2012 R2, Windows 10) and except above described exception I don't see bad values so far.

@Daryes
Copy link

Daryes commented Jul 28, 2018

@vlastahajek
OP stands for Original Poster, sorry, forum habit. And yes, it's the agent collection interval.

In my case, as I said, telegraf 1.7 reports false high CPU usage values, as values reported by other monitoring tools, even procmon running at a very short collecting interval, are identical for all of them, and very different than telegraf.
For example, Telegraf will reports a 90% while all other will report a constant 5-10%, which is is correct for an idle system.

@Daryes
Copy link

Daryes commented Jul 28, 2018

Separated message, after digging into the plugin code.
I might have found an idea explaining the cause of the false values reported by telegraf v1.7 while juggling with the MSDN documentation.

The current GO implementation in telegraf call the Windows API functions from PDH.DLL. So I got a look on the MSDN docs about those, but there's no special information on PdhCollectQueryData and QueryDataWithTime.

On the other hand, on the .Net implementation of PerformanceCounter, I found this information in the NextValue function :

If the calculated value of a counter depends on two counter reads, the first read 
operation returns 0.0.
Resetting the performance counter properties to specify a different counter 
is equivalent to creating a new performance counter, and the first read operation 
using the new properties returns 0.0. 
The recommended delay time between calls to the NextValue method is one second, 
to allow the counter to perform the next incremental read.

That explains the comment in the plugins/inputs/win_perf_counters/pdh.go source file :

// The PdhCollectQueryData will return an error in the first call because it needs two values for
// displaying the correct data for the processor idle time. The second call will have a 0 return code.

And, unless I'm mistaken, the current implementation around this is in the win_perf_counters.go/Gather() function, doing a first call to CollectData and a sleep(1sec) in the lastRefreshed / CountersRefreshInterval test block before parsing with the collected data

Given those :

  • as the first initial value (set as 100%) reported by telegraf isn't in fact valid, it might come from a bogus data from the system API. The processor data being one of those requiring 2 values at initialization.
  • when the CountersRefreshInterval setting kicks in, it seems the API will end like recreating a new counter, thus filling the first collected sample with a bogus value, as expected.
  • one interesting test is the fact that when CountersRefreshInterval is set to a higher value than the agent interval, sample collected between each counter refresh are valid and reflect the real CPU usage.
  • the .Net implementation does additional work to handle the first call, but there's no explanation about how it's done.

So, what if the .Net implementation does a create+query counter at the first call, instead of a simple create like it could be expected ?
Doing the same might be a possible way to fix it in telegraf, by having 2 sample collected instead of one in the Gather() function.

It might also be the fact that the data is the result of a calculation and not a direct read by the performance counter : the first data being absent, and the second one being bogus because of the first one missing, with only starting at the third one being valid as calculated from an existing value (even if bogus).

An other possibility would be to catch the error code when a counter is created, and store it with the definition to flag those who require 2 data sample, and exclude them from the counter refresh block.


Also, a question :
CollectQueryData and CollectQueryDataWithTime are 2 separated system API function in PDH.
Yet, while the win_perf_counters.go/Gather() function correctly call one or the other when collecting the data, the lastRefreshed / CountersRefreshInterval test block always call CollectQueryData.
Dunno the way the system API works, but could a switch from one function to the other have any effect on the counter, and the value it'll return ?


This aside, given the way counters works, and that an automatic refresh isn't common in other monitoring utilities, it would be better to have the CountersRefreshInterval setting value set to off by default in the configuration file, even after a fix is found.
Up to the user to activate it.

@vlastahajek
Copy link
Contributor

Yes, the weird values reported after re-initialization are worth investigating. They are not invalid in total, cause in such case it would be zeroes, but they somehow distorted

If someone sets agent interval as a multiple of counter refresh interval she we get weird values. If counter refresh interval is off, values are correct:.
For illustation, check following snaphost, where the agent interval = 2m and the counter refresh interval = 1m in the time before ~17:10, ant counter refresh interval is off after ~17:10.
I've already created PR for disabling counter refresh interval.

General description of gathering data from performance counters is at Collecting Performance Data. Step 2 talks about necessity to sleep between collecting data.

In the Gather function on lines 248 - 268 there is (re)initialization high level code, where, in the end , there is first call to gather data , then sleep 1s and line 279 issues subsequent another call to collect data.

Version 1.7.2. does't call CollectQueryDataWithTime, but generally it is not needed to call this same function two times at the beginning.

@danielnelson
Copy link
Contributor

Perhaps we can use a similar pattern to that used in the cpu input. In this plugin the first Gather does not return any metrics, only collecting an initial values, the second Gather can then compute the deltas and begin producing data.

In win_perf_counters, we could only do the expand during the initial Gather, returning no metrics. This would hopefully allow more time for the counters to settle, and allow us to remove the 1s sleep, which I believe is causing confusion around metric timestamps #4487

Another idea we could try is to refresh the counters after we produce the metrics, instead of before. Maybe this would reduce any spiking due to refreshing counters?

@Daryes
Copy link

Daryes commented Aug 8, 2018

For what I've seen, there's already this mechanism in the telegraf win_per_counters plugin, when creating or refreshing a counter, a call to gather() is done, with the return value ignored (and followed by the dreaded sleep 1s you spoke about).
But there's a catch with Ms way : refreshing a counter is, in reality, creating a new one. So it'll requires at each refresh a blank call to gather() with a sleep right after.

Refreshing the counters after producing the metrics should work. I didn't initially proposed this because it'll require to be tested, and errors could still occur. That's why I suggested to keep CountersRefreshInterval OFF by default, even after a fix.
But given the situation and Ms (lack of) documentation, it might be the better way.

@vlastahajek
Copy link
Contributor

Both a) skipping sleep and letting gather finish without returning metrics and b) recreating counters after collecting metrics, will not work properly, if gather interval will be equal to or higher than counters refresh interval, which can be seen among users.

There must be a catch, i will look on it later.

@reimda
Copy link
Contributor

reimda commented May 26, 2022

Is anyone still experiencing this issue? There have been bugfixes to this plugin since this issue was filed so I wonder if it's still a problem with recent telegraf releases.

@reimda reimda added the waiting for response waiting for response from contributor label May 26, 2022
@Daryes
Copy link

Daryes commented May 30, 2022

I need to verify about this,
I'm working with CountersRefreshInterval=0s which in reality disable the counter refresh.
The only side effect is having telegraf blind to hardware change (like adding memory), thus requiring a restart of the service.

@telegraf-tiger telegraf-tiger bot removed the waiting for response waiting for response from contributor label May 30, 2022
@srebhan
Copy link
Contributor

srebhan commented Nov 1, 2023

@Daryes can you verify this issue?

@srebhan srebhan added the waiting for response waiting for response from contributor label Nov 1, 2023
@Daryes
Copy link

Daryes commented Nov 4, 2023

@srebhan
Just checked with the latest version (v1.28.3) on a Win Server 2019 , setting back CountersRefreshInterval from 0 to "5m" on an idling server give back again a cpu spike over 50% usage each 5 mins, while the real usage is less than 7%.

Might be because the code in gather() is not similar between the refresh counter block and the collecting block

Suggestion : dunno if the way the collecting works would support it, but instead of redefining another time the hostCounterSet.query code, have gather() call itself and dropping the metrics between L429 and L431,
Assuming m.lastRefreshed = time.Now() get a value precise enough to prevent an infinite loop (or add a +1s )

@telegraf-tiger telegraf-tiger bot removed the waiting for response waiting for response from contributor label Nov 4, 2023
@srebhan
Copy link
Contributor

srebhan commented Nov 6, 2023

@Daryes, I will try to into this in the near future... Would you be able to test a potential PR?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/windows Related to windows plugins (win_eventlog, win_perf_counters, win_services) platform/windows
Projects
None yet
Development

No branches or pull requests

7 participants