Is this early game (ie lightly loaded) or late game (ie heavily loaded)?
Late game, but nothing crazy. ~2.5K pops, 2290, all AI combined probably has another 1K or so.

I'm familiar with CPU performance counters in general but not x86 so much or this particular app, so when it says 0.88 instructions per cycle and 1.610 CPUs used, does that mean a total of 0.88 IPC across all threads (ie average IPC of 0.546) or is it already an average across all threads? I'd guess the latter.
It would be average across all threads
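The relationship between the reported figures can be sketched with a few lines of Python. The raw counter values below are hypothetical, chosen only so that they reproduce the 0.88 IPC and 1.610 CPUs-used numbers discussed above:

```python
# Hypothetical counter readings over one sample window (values invented
# to match the figures quoted in this thread).
instrs = 12.67e9      # instructions retired during the window
cycles = 14.4e9       # core cycles consumed, summed over all threads
wall_s = 5.0          # length of the sample window in seconds
clock_hz = 1.79e9     # assumed effective core clock

ipc = instrs / cycles                     # already averaged across threads
cpus_used = cycles / (wall_s * clock_hz)  # cycle-weighted core occupancy

print(round(ipc, 2), round(cpus_used, 2))   # 0.88 1.61
```

The point being that the IPC figure is computed from totals, so it is inherently the cycle-weighted average over every executing thread, not something you should divide again by CPUs used.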

Personally I would consider an IPC of 0.88 for generic "business logic" (ie complex code and data structures) to be quite fine.
In terms of CPU utilization it's quite bad, but you're right in the sense that generic "business logic" applications aren't usually written very efficiently, so the problems we see in Stellaris aren't uncommon in the software industry in general. And just to avoid possible misunderstanding, it's not 0.88 out of a possible 1. It's 0.88 out of a theoretically possible 4.

It doesn't suggest to me that the main limit is random access memory latency for example, which was suggested previously. Putting it another way, other things being equal, increasing the CPU clock rate should give a decent performance benefit.
Yes, memory access is clearly not a major factor, and performance should scale more or less linearly with CPU clock (of course, there hasn't been much hardware progress in this area in recent years).
 
Late game, but nothing crazy. ~2.5K pops, 2290, all AI combined probably has another 1K or so.

Do you mean year 2490? Anyway, the data seems relevant.


In terms of CPU utilization it's quite bad, but you're right in the sense that generic "business logic" applications aren't usually written very efficiently, so the problems we see in Stellaris aren't uncommon in the software industry in general. And just to avoid possible misunderstanding, it's not 0.88 out of a possible 1. It's 0.88 out of a theoretically possible 4.

Instructions per cycle being very peaky is why we have SMT in the first place, of course, particularly for complex real-world applications.

Incidentally, I searched around and while it's hard to get full access to performance counters on the Mac you can get instruction counts and cycle counts with just a simple command line, eg:
top -d -i 5 -n 25 -s 5 -u -stats pid,command,cpu,time,csw,threads,mem,pstate,instrs,cycles -e

I did a brief test with a saved game (2310, not that many pops) and saw IPCs of slightly over 1.


Yes, memory access is clearly not a major factor, and performance should scale more or less linearly with CPU clock (of course, there hasn't been much hardware progress in this area in recent years).

Yep. Putting it another way, if you want to get the max performance out of the current game, you'd probably be better off with few cores and crazy over-clocks. Of course, that would all change if/when the game is properly optimised for multi-threading.
 
What do you mean by "CPU-readable state"?

Miswrote. I meant runnable. Two things:

a) I just stated that the devs claimed the interpretation part is done at startup, so there's no performance hit from reading/parsing/building trees in-game, which was not contradicted by what was said before. Now it is. All that's left in-game is running the result of the interpretation.

and b)

Bringing this back to performance: it's possible that this language is currently parsed into an efficient internal representation of all the various conditions, actions, scopes, lookups, and so on.

The devs claimed it is the case.

From a video I can't track down again; IIRC, and I'm fairly sure I do recall it correctly.


Of course, that would all change if/when the game is properly optimised for multi-threading.

About this, I posted results in the first thread AlphaAsh linked, checking the game's CPU usage first with normal settings and then with ticks_per_turn 20. The ticks_per_turn command increases simulation speed at the expense of graphical rendering, in case you're not familiar with it.

In the latter test, I get one CPU core nearing ~100% usage and the other 5 sitting at ~40%, while normal execution yields one core at ~80%, two others at ~25%, and the rest below 20%. That's ~50% total usage vs. ~25%. I don't know how this would translate in a profiling tool, as I didn't monitor that far, but the game's logic seems quite capable of spreading across threads.
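For what it's worth, the totals quoted above can be reproduced by averaging the per-core readings. The cores reported only as "below 20%" have to be guessed at; 15% is assumed here:

```python
# Per-core utilization figures from the post above; the "below 20%"
# cores in the normal run are assumed to sit around 15%.
tpt20  = [100, 40, 40, 40, 40, 40]
normal = [80, 25, 25, 15, 15, 15]   # last three values are assumptions

avg_tpt20  = sum(tpt20) / len(tpt20)    # 50.0, matching "~50% total"
avg_normal = sum(normal) / len(normal)  # ~29, in the ballpark of "~25%"
```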

There is this thread about Stellaris potentially rendering in the main thread: https://forum.paradoxplaza.com/foru...ellaris-a-quick-performance-analysis.1138327/. It goes into interesting detail about how the game works the cores, and points out the way rendering is handled as a massive perf hog.
 
My question is why the hell doesn't Victoria 2 have the same performance issues when that game's pop system is probably more intense

It does, but with population groups, and there are "mechanics" in place to merge population groups, like pop assimilation. The thing is that nowadays the issues are not that evident because:
1. the game is old and, hardware-wise, we've moved ahead a lot. If you played back in the day, your PC would crawl!!
2. late game or post game, you don't get many more pop groups; instead their size increases, which is just a number.

With Stellaris, Paradox did the 1 tile 1 pop approach and it was perfect for a start, but now with 2.2, they went quantum with pops instead of using a similar approach to Vicky 2. And pops in Stellaris have an additional level of properties, perks, and triggers beyond just nationality and political affiliation.

And I'm sad that no Vicky 3 is coming out....
 
About this, I posted results in the first thread AlphaAsh linked, checking the game's CPU usage first with normal settings and then with ticks_per_turn 20. The ticks_per_turn command increases simulation speed at the expense of graphical rendering, in case you're not familiar with it.

In the latter test, I get one CPU core nearing ~100% usage and the other 5 sitting at ~40%, while normal execution yields one core at ~80%, two others at ~25%, and the rest below 20%. That's ~50% total usage vs. ~25%. I don't know how this would translate in a profiling tool, as I didn't monitor that far, but the game's logic seems quite capable of spreading across threads.

There is this thread about Stellaris potentially rendering in the main thread: https://forum.paradoxplaza.com/foru...ellaris-a-quick-performance-analysis.1138327/. It goes into interesting detail about how the game works the cores, and points out the way rendering is handled as a massive perf hog.

I wonder if things have changed much since your tests.

Since I can't properly profile the application, I've tried to avoid getting into specifics of where the performance problems are, but when playing deeper into the game (when the slowdown is obvious) I have noticed that if I close the outliner and galaxy map then the speed increases notably. I've not verified it myself, but people who have tested mods that simplify the job/pop code have reported significant performance improvements. Which suggests to me that there's more than one major performance bottleneck, unless for some very weird reason the outliner / galaxy view is triggering the job/pop evaluation code (certainly the graphics would need access to job/pop data, but they shouldn't need to evaluate job weights etc. simply to render data).
 
Which suggests to me that there's more than one major performance bottleneck, unless for some very weird reason the outliner / galaxy view is triggering the job/pop evaluation code (certainly the graphics would need access to job/pop data, but they shouldn't need to evaluate job weights etc. simply to render data).

I wouldn't be surprised if there are more internal calls derailing the evaluation process(es).

I keep looking at the need to de-select a planet, then re-select it, to refresh the district types displayed (AlphaMod changes these a lot), because something in the grey matter keeps itching about whether there's a relation there.

(Maybe changes to job provision cause a massive cascade of immediate re-evaluations, and the outliner and galaxy view are somehow tied into that. **** knows why though.)

edit - Building progress bars and pop alerts may be why - the outliner might be polling planets continuously to keep that up-to-date, and closing it reduces the polling? *shrug*

(edit 2 - Hell, maybe that polling is causing yet more 'unscheduled' job re-evaluations.)
 
I wonder if things have changed much since your tests.

I did test this by the end of July so it is up to date.

Since I can't properly profile the application, I've tried to avoid getting into specifics of where the performance problems are, but when playing deeper into the game (when the slowdown is obvious) I have noticed that if I close the outliner and galaxy map then the speed increases notably. I've not verified it myself, but people who have tested mods that simplify the job/pop code have reported significant performance improvements. Which suggests to me that there's more than one major performance bottleneck, unless for some very weird reason the outliner / galaxy view is triggering the job/pop evaluation code (certainly the graphics would need access to job/pop data, but they shouldn't need to evaluate job weights etc. simply to render data).

Actually, I'm one of the people who has tested mods altering the job/pop code, in the context of other perf tests, including those core-usage ones, and yeah, "significant" is about right. :)
I also noticed those outliner and galaxy view issues you point out, and all this also suggests to me we are facing a collection of problems. Though, as @AlphaAsh just said before me, there may very well be cascading issues making all that hard to track from the outside. The problem might be less scattered than it seems, then...

Anyway, I feel the job/pop code and rendering in the main thread (if confirmed, but it seems likely to me) are the big offenders here. Another interesting thing I dug up is that when modding the game's graphics heavily, the game becomes considerably slower, but GPU usage doesn't climb above vanilla, about 20% max for me, bouncing between 0% and 20%, and lowering on average as the game slows more and more. So the CPU seems to be the one struggling with graphics, and that hints towards that main-thread rendering theory.
 
So the CPU seems to be the one struggling with graphics, and that hints towards that main-thread rendering theory.
The outliner shows planet-level summaries, and the galaxy map needs to calculate what's visible and what's not, and they want it 60 times per second. If the values are not cached, that might cause the lag on its own, I think.
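The kind of caching being suggested can be sketched with a dirty-flag scheme: recompute a planet's summary only when its state changes, not on every rendered frame. All names here are invented for illustration, not taken from the game:

```python
# Hypothetical dirty-flag cache for per-planet outliner summaries.
class PlanetSummaryCache:
    def __init__(self):
        self._cache = {}
        self._dirty = set()
        self.recomputes = 0          # counter for demonstration only

    def mark_dirty(self, planet_id):
        # Simulation code would call this when a pop/job/building changes.
        self._dirty.add(planet_id)

    def summary(self, planet_id, compute):
        # `compute` stands in for the expensive per-planet aggregation.
        if planet_id not in self._cache or planet_id in self._dirty:
            self._cache[planet_id] = compute(planet_id)
            self._dirty.discard(planet_id)
            self.recomputes += 1
        return self._cache[planet_id]

cache = PlanetSummaryCache()
for frame in range(60):              # one second of frames, nothing changed
    cache.summary(1, lambda pid: f"summary-{pid}")
cache.mark_dirty(1)                  # a pop changed jobs
cache.summary(1, lambda pid: f"summary-{pid}")
# 61 queries, but only 2 recomputations
```

With something like this, the 60-per-second UI refresh becomes 60 cheap dictionary lookups, and the expensive aggregation runs only when the simulation actually touches the planet.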
 
Oops. I misread your previous post. I couldn't find the post from July you're referencing though. When you did that test, was that on a lightly loaded system (ie new game) or a late game?

ticks_per_turn 20 core usage:
Unplayable late game lag

normal core usage:
Unplayable late game lag

Same save, same game date for both. Numbers taken with HWMonitor. The legend is truncated in the screenshots; the relevant column is the left one.
This and nearly all tests in this thread have been done on the same vanilla savegame at year 2450, 1000 stars default galaxy.


The outliner shows planet-level summaries, and the galaxy map needs to calculate what's visible and what's not, and they want it 60 times per second. If the values are not cached, that might cause the lag on its own, I think.

Yeah, I'm assuming that's not the case.
I mean, the devs wrote in one of the patch notes that when they introduced the time-before-arrival indication for fleet travel, it was recalculated daily just to be displayed, and they got rid of that in the next hotfix. They claimed to have optimized parts of the UI before that; if they had left in something like what you suggest, I don't know what the heck they optimized. :D
The UI certainly has its inefficiencies, but I don't think it goes this far.
 
My question is why the hell doesn't Victoria 2 have the same performance issues when that game's pop system is probably more intense
Unique pops in that game are much smaller in number early on. 10k North German Farmers in Danzig is a single pop with a modifier of 10k, if I remember correctly. Whereas in Stellaris, 10 human farmers are 10 different pops. Later on in Victoria 2, when everyone migrates around, it has a lot of unique pops and lags itself to death. Victoria 2 is also a much shorter game, so it's not nearly as noticeable.
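The contrast can be illustrated in a few lines (this is not actual game code, just a sketch of the two data models): grouping identical pops under a shared key with a size turns per-pop work into per-group work.

```python
from collections import Counter

# Stellaris-style: every pop is its own object to simulate.
pops = [("human", "farmer")] * 10 + [("human", "miner")] * 5

# Vicky 2-style: identical pops collapse into one group carrying a size,
# so the simulation iterates over groups, not individuals.
groups = Counter(pops)
# 15 pop objects become 2 groups:
# {('human', 'farmer'): 10, ('human', 'miner'): 5}
```

The catch, as noted above, is that Stellaris pops carry many more distinguishing attributes (traits, ethics, happiness, etc.), so far fewer of them would collapse into a shared group.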
 
ticks_per_turn 20 core usage:
Unplayable late game lag

normal core usage:
Unplayable late game lag

Same save, same game date for both. Numbers taken with HWMonitor. The legend is truncated in the screenshots; the relevant column is the left one.
This and nearly all tests in this thread have been done on the same vanilla savegame at year 2450, 1000 stars default galaxy.

Thanks! I get similar results on my MacBook Pro.

One interesting thing to do would be to do a similar test with one of those performance mods. Would we see a similar speed up? More parallelism? Less?
 
Is it possible all of this could be fixed by just forcing us to put pops in their respective jobs manually when a new pop or job becomes available instead of automatically doing so, and only doing a pop/job check like once a month?
 
My question is why the hell doesn't Victoria 2 have the same performance issues when that game's pop system is probably more intense
One of the factors is that Victoria 2 is less moddable, and some critical parts of the code are probably compiled and optimized.

Do you mean year 2490? Anyway, the data seems relevant.
I have the late-game start set at 2250. Though I don't think the year is much of a factor; it's more a question of the number of pops. I have tried games in an empty galaxy in the past where I would reach the same population count later, and in those games it would still be running fairly well in 2290.

I did a brief test with a saved game (2310, not that many pops) and saw IPCs of slightly over 1.
I haven't tested it thoroughly, but the few times I've gathered and looked at stats, it appeared that the lag and the IPC decline come together (presumably as a result of the same parts of the code being run more and more).

Miswrote. I meant runnable. Two things:

a) I just stated that the devs claimed the interpretation part is done at startup, so there's no performance hit from reading/parsing/building trees in-game, which was not contradicted by what was said before. Now it is. All that's left in-game is running the result of the interpretation.

and b)

The devs claimed it is the case.

From a video I can't track down again; IIRC, and I'm fairly sure I do recall it correctly.
You can't do the up-front interpretation. Interpretation depends on data that are not available at startup (the game state). You can only do the parsing. Both an interpreter and a compiler would do this step (it's pretty much the same step whether it's a compiler or an interpreter). So to avoid interpretation overhead you would have to compile (into machine code).
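The parse/evaluate split being described can be sketched in miniature. The condition format below is invented purely for illustration; the point is only that parsing is state-independent and can happen once at load time, while evaluation needs the live game state:

```python
# Startup: state-independent parsing turns script tokens into a closure.
def parse(tokens):
    field, op, value = tokens
    value = int(value)
    if op == ">":
        return lambda state: state[field] > value
    raise ValueError(f"unsupported operator: {op}")

# Done once, at load time; no further parsing cost.
condition = parse(["pops", ">", "50"])

# Per tick: only this cheap call remains, and it is the part that cannot
# run up front because it depends on the current game state.
result_big = condition({"pops": 80})     # True
result_small = condition({"pops": 20})   # False
```

Whether walking a parsed tree (or closure) like this per tick counts as "interpretation overhead" is exactly the terminology the two posts are disagreeing about.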

As you said, it was a video, and it's very difficult to describe everything accurately in precise terms while in front of the camera. And then, assuming you watched the video more or less normally, there might be a discrepancy between what was said and your interpretation of it, particularly if it's not your area of expertise. Like in those tests where they show multiple people the same thing and there is a difference in people's descriptions of what they've seen.

There is this thread about Stellaris potentially rendering in the main thread: https://forum.paradoxplaza.com/foru...ellaris-a-quick-performance-analysis.1138327/. It goes into interesting detail about how the game works the cores, and points out the way rendering is handled as a massive perf hog.
Yes, rendering is quite costly, but TPT 10 largely eliminates the rendering overhead. The game runs a few times faster, but the pop-dependent part of the slowdown is still there. The stats I posted earlier are from a TPT 10 run.

Since I can't properly profile the application, I've tried to avoid getting into specifics of where the performance problems are, but when playing deeper into the game (when the slowdown is obvious) I have noticed that if I close the outliner and galaxy map then the speed increases notably. I've not verified it myself but people who have tested mods that simplify the job/pop code have reported significant performance improvements. Which suggests to me that there's more than one major performance bottleneck, unless for some very weird reason the outlier / galaxy view is triggering the job/pop evaluation code (certainly the graphics would need access to job/pop data but they shouldn't need to evaluate jobs weights etc simply to render data).
Have you tried to run with TPT 1 (default) and TPT 10 and compare? It sounds like you were running with TPT 1 and observing deterioration due to rendering.
 
You can't do the up-front interpretation.
Yes you can. This just means that you don't work with a programming language, but with a sort of data definition/markup language.
Stellaris 'scripts' are not interpreted in the same vein as, say, Python or Lua scripts are. The game code has a lot of predefined 'slots', and the 'scripts' merely describe values to be put into those slots. With proper interpreters it is vice versa: the host program exposes its objects into a proper general-purpose interpreter. Granted, the Stellaris 'slots' allow for some variance, and this cannot be good for performance, but approaching this part of the performance issues from the same side as optimizing interpreters is clearly counterproductive. Performance won't benefit from exposing more; it would benefit from exposing less, and in a more rigid form.
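The 'slots' model described above can be sketched as follows. The field names are invented for illustration; the point is that the engine defines a fixed, typed structure and the parsed script merely supplies values for it:

```python
from dataclasses import dataclass

@dataclass
class JobDefinition:
    # Predefined 'slots' the engine knows how to use directly.
    name: str
    base_weight: float
    produces: dict

# A parsed script entry just fills the slots; at runtime there is no
# general-purpose interpreter, only field lookups on a rigid structure.
parsed_entry = {"name": "farmer", "base_weight": 1.0,
                "produces": {"food": 6}}
farmer = JobDefinition(**parsed_entry)
```

Under this model, making the structure more rigid (fewer variable slots, more fixed fields) is what helps performance, which is the "expose less" point above.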
 
Is it possible all of this could be fixed by just forcing us to put pops in their respective jobs manually when a new pop or job becomes available instead of automatically doing so, and only doing a pop/job check like once a month?

No. The player doesn't assign the AI's pops. The AI does. Through evaluation.

Just letting a player do it would provide no great performance benefit. And would piss off the usual micro-management hate brigade, which would then mean having optional automation, which wouldn't work optimally for most anyway...

Insert WHAT YEAR IS IT? meme here.