win_perf_counters '% Processor Time' sent invalid data periodically (every ~2 minutes) #4453
@vlastahajek Any idea what could cause this?
@nognomar, does this also happen on another machine, if you have one? Could you please also try the
Also, it's worth disabling the counters refresh by setting the performance counters plugin's global parameter:
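For reference, a minimal sketch of what that looks like in telegraf.conf. The `CountersRefreshInterval` parameter name comes from the win_perf_counters plugin's documentation; the object block below is illustrative, and whether `"0s"` fully disables refreshing should be verified against your Telegraf version:

```toml
[[inputs.win_perf_counters]]
  # Disable periodic re-matching of the configured counters against the
  # system's available counters (the default is "1m"). With "0s" the
  # counter list is built once at startup and never refreshed.
  CountersRefreshInterval = "0s"

  [[inputs.win_perf_counters.object]]
    ObjectName = "Processor"
    Instances = ["*"]
    Counters = ["% Processor Time", "% Idle Time"]
    Measurement = "win_cpu"
```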
For easy comparison of the performance counters plugin output with the typeperf output, do the following two simple steps to get typeperf output into InfluxDB (and display it in Chronograf or Grafana):
Oh, that's pretty handy. I'll just mention that using tail with grok will require a nightly build; with the latest release version it would need a slight adjustment to use the logparser input.
My bad, sorry. I was put off by the warning about the deprecation of logparser. But no problem, I also have a logparser config ready:
@vlastahajek, thanks for your help. I disabled the counters refresh interval and it helped. Thanks!
@nognomar, thanks for the validation. The typeperf waveform is quite different from Telegraf's. Isn't a different function applied in the query, e.g. mean vs. raw value? So, with refreshing disabled, do the values detected by Telegraf now match the others (typeperf, perfmon)?
@vlastahajek, now the data is definitely similar =) But look at the beginning of the graph: that's a restart of the Telegraf service. Maybe this "magic 100" is something like an initial value in the plugin's counter map?
The initial 100% has always been there, AFAIK from the start. I've seen other monitoring utilities with the same behavior, related to the initial connection to the system performance counters and registering a ~100% system CPU spike. On the main point, I'm having the same problem with telegraf v1.7.2: constant CPU usage reported around 90% when the real system usage is more like 5%. Without any change to the config file, reverting telegraf.exe to the previously used version (v1.5.1) gives proper values.
@nognomar This should be a spike caused by Telegraf initialization, but it should also be detected by other measurements. In general, you can validate results by also showing the values for `% Idle Time`, which should be the counterpart of `% Processor Time`. A 1s interval is very short for counter measurements: many performance counters need two samples with at least a 1s pause between them. With refreshing on, you will see zero values at refresh time, easily observable by the `% Idle Time` value going to zero.
@Daryes Does OP mean Originally Provided? The main difference in the default configuration from versions < 1.7 is the default value for the
On the other hand, there is almost no change in value gathering. If you have a 5 min interval in the default configuration with refreshing turned on, it can happen that the counters are refreshed at measurement time, causing a high CPU usage peak. I'm doing thorough testing with various configurations on multiple hosts (Windows Server 2008 R2, Windows Server 2012 R2, Windows 10) and, except for the exception described above, I don't see bad values so far.
@vlastahajek In my case, as I said, telegraf 1.7 reports falsely high CPU usage values: the values reported by other monitoring tools, even procmon running at a very short collection interval, are identical to each other and very different from Telegraf's.
A separate message, after digging into the plugin code. The current Go implementation in Telegraf calls the Windows API functions from PDH.DLL. So I took a look at the MSDN docs about those, but there's no special information on PdhCollectQueryData and PdhCollectQueryDataWithTime. On the other hand, in the .NET implementation of PerformanceCounter, I found this information on the NextValue method:
That explains the comment in the
And, unless I'm mistaken, the current implementation around this is in the
Given those:
So, what if the .NET implementation does a create+query of the counter on the first call, instead of a simple create as could be expected? It might also be that the data is the result of a calculation and not a direct read of the performance counter: the first data point is absent, the second one is bogus because the first one is missing, and only from the third one onward are values valid, as they are calculated from an existing value (even if bogus). Another possibility would be to catch the error code when a counter is created, and store it with the definition to flag those that require 2 data samples, excluding them from the counter refresh block. Also, a question: This aside, given the way counters work, and that an automatic refresh isn't common in other monitoring utilities, it would be better to have the
Yes, the weird values reported after re-initialization are worth investigating. They are not entirely invalid, because in that case they would be zeroes, but they are somehow distorted. If someone sets the agent interval as a multiple of the counter refresh interval, we get weird values; if the counter refresh interval is off, the values are correct. A general description of gathering data from performance counters is in Collecting Performance Data; step 2 talks about the necessity to sleep between collecting data. In the Gather function, on lines 248 - 268, there is the (re)initialization high-level code where, at the end, there is a first call to gather data, then a 1s sleep, and line 279 issues another call to collect data. Version 1.7.2 doesn't call PdhCollectQueryDataWithTime, but generally it is not necessary to call the same function twice at the beginning.
Perhaps we can use a similar pattern to the one used in the
Another idea we could try is to refresh the counters after we produce the metrics, instead of before. Maybe this would reduce any spiking due to refreshing counters?
From what I've seen, there's already this mechanism in the telegraf win_perf_counters plugin: when creating or refreshing a counter, a call to gather() is made with the return value ignored (and followed by the dreaded 1s sleep you spoke about). Refreshing the counters after producing the metrics should work. I didn't initially propose this because it will require testing, and errors could still occur. That's why I suggested keeping CountersRefreshInterval OFF by default, even after a fix.
Both a) skipping the sleep and letting gather finish without returning metrics, and b) recreating counters after collecting metrics, will not work properly if the gather interval is equal to or higher than the counters refresh interval, which can be seen among users. There must be a catch; I will look into it later.
Is anyone still experiencing this issue? There have been bugfixes to this plugin since this issue was filed, so I wonder if it's still a problem with recent Telegraf releases.
I need to verify this.
@Daryes can you verify this issue? |
@srebhan It might be because the code in gather() is not the same between the refresh-counter block and the collecting block. Suggestion: I don't know if the way the collecting works would support it, but instead of redefining the hostCounterSet.query code yet again, have gather() call itself and drop the metrics between L429 and L431.
@Daryes, I will try to look into this in the near future... Would you be able to test a potential PR?
Relevant telegraf.conf:
https://gist.github.com/nognomar/95c8c0c086106e2fb66e8f9d8940638f
System info:
telegraf 1.7.2
Windows Server 2012 R2
Processor: Intel Xeon E5-2620 (2 processors)
Every 2 (sometimes 4) minutes Telegraf detects 100% CPU load, but I can't detect that with other methods (Task Manager, Sysinternals Process Explorer, Performance Monitor).
I created a simple tool in C# to monitor CPU activity, but it detects nothing:
https://gist.github.com/nognomar/05cbcb0ca8842c537e163557a4f96517