
ObscuraNox
Mar 30, 2018
Hello Paradox team,

I am a big fan of your games, but one thing that has bothered me from the beginning is the out-of-syncs, which more or less happen(ed) in all of them. I think you did a good job improving the situation (e.g. providing hot joins etc.), but I was wondering what kind of network model you are using in your games.

Is it possible that you chose an "event/action"-driven approach? So something like:
1) client/player generates an action
2) server checks/applies the action to its game state
3) server distributes the action to all clients

If this is NOT the case: I would be very interested in your approach.
If this is the case: Is there a good reason to choose this approach over a "state"-driven approach? I know the action-driven approach is very (very, very) common, but in my humble opinion it is also the main reason why the multiplayer experience is so frequently suboptimal.

What is a state-driven approach:
1) client/player generates an action
2) server checks/applies the action to its game state and marks the changed data as dirty
3) server distributes the new state of all updated data to all clients
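
To make the contrast concrete, here is a minimal server-side sketch of both variants. This is of course not Clausewitz code; every type and name is invented for illustration:

```cpp
#include <cstdint>
#include <functional>
#include <unordered_set>
#include <vector>

struct Action    { uint32_t unitId; int32_t dx, dy; };  // "move unit by (dx,dy)"
struct Unit      { int32_t x = 0, y = 0; };
struct GameState {
    std::vector<Unit> units;
    std::unordered_set<uint32_t> dirty;                 // ids changed this tick
};

// stand-in for whatever the real socket layer looks like
using Broadcast = std::function<void(const void*, size_t)>;

void handleAction(GameState& s, const Action& a, Broadcast send, bool stateDriven) {
    Unit& u = s.units.at(a.unitId);
    u.x += a.dx; u.y += a.dy;                           // apply to authoritative state
    if (!stateDriven) {
        send(&a, sizeof a);                             // action-driven: relay the input
    } else {
        s.dirty.insert(a.unitId);                       // state-driven: mark, then send values
        for (uint32_t id : s.dirty) {
            struct { uint32_t id; Unit unit; } msg{id, s.units[id]};
            send(&msg, sizeof msg);
        }
        s.dirty.clear();
    }
}
```

The only structural difference is the last step: relay the input itself, or relay the values that resulted from it.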

Comparison (what is the difference):
One can argue that an OoS can happen in both approaches. This is totally true, but let's assume the server distributes only actions like "move one up/down". In an action-driven system the client state will stay out of sync once a desync happens, because only CHANGES are sent. In a state-driven system a desynced client state is simply overwritten with the next update, which fixes the desync.
Another argument is latency/generated traffic. While this might hold true for fast-paced games like shooters, its general validity is arguable: sending "move forward by (10,0,0) at heartbeat 1234" is not less payload than sending "position (43521,123,543) at heartbeat 1234".
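
A tiny illustration of that point (invented structs, but the byte count is the point):

```cpp
#include <cstdint>

// "move forward by (10,0,0) at heartbeat 1234" vs "position (43521,123,543) at
// heartbeat 1234": one tick plus three coordinates either way.
struct MoveDelta { uint32_t tick; int32_t dx, dy, dz; }; // action-driven payload
struct PosState  { uint32_t tick; int32_t x, y, z; };    // state-driven payload
static_assert(sizeof(MoveDelta) == sizeof(PosState), "identical wire size");
```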

I know questions about design choices are not very popular from a dev perspective (i.e. they are basically a pain in the ass with most likely no net gain at all), but sometimes sharing information helps to raise awareness and increases the support of the community. So any information regarding this topic is very much appreciated.

If the information is already available somewhere, please give me a short heads-up. My research so far did not unearth any details on the topic.

best regards
 
Distributing state is far less convenient given the amount of data changed and how network capacity works.
 
Thank you for the fast reply. The amount of generated traffic is a common argument, as I already pointed out. Do you have any measurements that support this statement for your application? I frequently encounter people who just claim this without actually benchmarking it. I do not know how the network layer in the Clausewitz engine works, so I can only make assumptions here.
A perfect example is the UDP vs TCP debate. You will find plenty of sources stating that UDP is way superior for games, but if you check the argumentation you will most likely find only theorized assumptions. Because of this, many start with UDP and end up with their own library that imitates TCP behaviour. The fact is that TCP can and will outperform a reliable, order-keeping UDP-based solution in almost every case (for data see here: https://sourceforge.net/projects/networklibsbenc/ ). Don't get me wrong: UDP is a very good choice if one deals with small payloads and can live with packet loss. In every other case TCP might be the better choice, according to these measurements.
Regarding the network "model" the action/event choice seems natural but in fact sending an "fire event ID" is not so different from "event ID=active". It of course depends strongly on the way of implementation but even if one has an edge case like "event: change all values +1" vs "obj 1:value=1,obj 2:value=1, ... obj n:value=1" were the payload is definitely higher a simple zlib:deflate will tremendously reduce the needed bandwidth (due to the low level of information - basically its only the ids which carries information making compression very easy and zlib deflate is really light weight). And depending on the size of the payload it might even be that there is no impact at all (networkwise there is no difference in traffic for all packets which fits in one frame - i.e. it does not matter if a packet is 10 or 100 as long as it "fits" into one frame.

EDIT: the "UDP vs TCP" part was just meant as an example of "logic vs reality". I am aware that UDP has the very big advantage of "NAT punch-through".
 
It's really enough to take a look at the code to see that synchronizing state explicitly is unfeasible. It's not necessarily (although probably) about data volume either; the amount of effort that would go into writing synchronization code would be massive. The requirements this would place on the code would spill over into hitting performance as well, so it's not just about hiring a sufficiently large team of hamsters to spin the wheel.

I used to be an MMO engine programmer.
 
At first glance you are completely right. The complexity of "just fire an event" is much lower than that of synchronization code. On the other hand, if one has written "save game" logic, one already has most of the synchronization code needed. An event-driven system requires zero-deviation mechanics, which is not easy to achieve (and to sustain through every update). The question is how many working hours one needs in the end. How many hours were spent dealing with critical or high-priority tickets due to OoS problems? I am not claiming that OoS never happens with a synchronization approach, but those would intrinsically be only medium-severity or even cosmetic. Would this not save many costly hours of debugging and fixing?

Out of curiosity: what kind of MMO did you work on? How critical were OoS scenarios there, and how were they handled?

P.S.: actually, some of your games already feature a very simple sync mechanism to some extent: hot joining and resyncing after an OoS.
 
If we were doing the equivalent of reading an entire save game's worth of data each game tick, it would kill performance.

OOS doesn't exist in the same way in the kinds of state synchronization models used in FPS games. It's not an irreversible condition there, and the game is "out of sync" by default; the question is just how closely you can synchronize state while reducing bandwidth.

I worked in a now defunct middleware company (Pikkotekk AB).
 
Well, of course reading an entire save is not an option at all. It was more about the (additional) required code. We both know that some additional glue logic is required which basically "marks" changed states and transmits them. This can be done either by proper use of dirty flags (information push) or by a "memory monitoring" system (information pull) like Zoidcom utilized.
A state synchronization model is not per se "out of sync": using the very same "heartbeat" system required in an event-driven system can lead to fully synced states. The good thing is that for a state synchronization approach it is helpful, but not required, that this works perfectly.

I.e. borrowing the serialization code for saving the full game state and breaking it into smaller pieces to allow partial serialization on demand, adding an ID for every object (if not already present anyway), extending all state-changing method calls with a "submit changes for object x : subpart y", and preventing clients from creating network-relevant objects on their own are basically most of the changes required - roughly like the sketch below.
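
In (made-up) code, the per-object side of that could look roughly like this - again just a sketch under the assumptions above, not actual engine code:

```cpp
#include <cstdint>
#include <map>
#include <set>
#include <vector>

enum class Part : uint8_t { Position, Orders, Stats };      // the "subpart y"

struct DirtyTracker {
    std::map<uint32_t, std::set<Part>> pending;             // object id -> dirty subparts
    void submit(uint32_t objectId, Part part) { pending[objectId].insert(part); }
};

struct Unit {
    int32_t x = 0, y = 0;
    // the per-subpart serializer reuses what a save game would write, just smaller
    void serialize(Part part, std::vector<uint8_t>& out) const {
        if (part == Part::Position) {
            out.insert(out.end(), reinterpret_cast<const uint8_t*>(&x),
                                  reinterpret_cast<const uint8_t*>(&x) + sizeof x);
            out.insert(out.end(), reinterpret_cast<const uint8_t*>(&y),
                                  reinterpret_cast<const uint8_t*>(&y) + sizeof y);
        }
        // Orders, Stats ... would follow the same pattern
    }
};

// every state-changing call gets one extra line:
void moveUnit(Unit& u, uint32_t id, int32_t dx, int32_t dy, DirtyTracker& t) {
    u.x += dx; u.y += dy;
    t.submit(id, Part::Position);                           // "submit changes for x : y"
}
```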
No question, this is quite an investment - I totally agree about that. But in the long run, for future products, it may be an alternative to the currently used approach, with far less support hassle.
 
A game's save can easily weigh 20 MB. Add serialization/deserialization, maybe compression on top. Multiply that by several ticks per second and it looks bad for performance. And some people have bad, unupgradable internet connections.
"Well of course reading an entire save is not an option at all"
How would you decide which part of the save should be cut off? Battle history? OK, that could be sent one time. But the world is still big. Maybe you could cut off information about enemy countries... but what if you are allied with the entire world?
 
The thing is, game state structures are really complicated for this type of game. The code for maintaining them can already be nightmarish at times; I don't want to mix network sync issues in there as well.

Don't get me wrong, I do like challenging the way things are done. But regarding this problem I came to the conclusion long ago that applying an unsynched network architecture would constrain far too much how the game is made. It simply wouldn't be PDS grand strategy anymore, but necessarily something simpler. Being a good engineer is a lot about applying the right tools for the job rather than being fixated on the wrong ones.

As for present OOS issues, a custom static checker (e.g. based on Clang) could probably prevent 99% of them. Then again, that's also an investment and is not done, because reasons.
 
@Nashetovich
Very good question!

Short answer: the approaches I am aware of normally handle it the other way around - updates for changed data are added, instead of unchanged data being cut out. For example, if an event triggers a unit to move, the server sends out the new unit (movement) state.

Long answer: to the best of my knowledge, all approaches work on the basis that the server is aware of the client state. In some implementations the server logs which information/object states have been sent to the client. Other implementations just send the state of all objects once (like a complete save) and then only send out the changes.

This is similar to playing chess over the phone. You tell your partner once, at the beginning, which piece is placed where. After that you normally only tell your partner "queen to A3" (state-sync approach) - so only the changed state. In an event/action approach you would instead tell your partner "move the queen 3 squares north" - i.e. the ongoing action. Up to this point both approaches work fine and need roughly the same amount of information. But now assume your partner missed one move. In an action/event approach there is no way for them to tell where the queen should be by now. With state sync they might not know where the queen was before that move, but at least they are now aware of the correct position.

Does this answer your question?

A nice side effect of the state-driven approach is that you do not actually need to transmit the whole game state at once, but are able to transmit it "on demand" - basically similar to streaming. One would be able to join a session and all the information would come in over time, without the need to wait several minutes for everything to be transmitted at once. And as soon as a desync is detected, the corresponding object states are simply sent again, while the same situation normally means "game over" for an action/event approach.
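
A toy sketch of that "stream it like a download, re-queue on desync" idea (all names invented):

```cpp
#include <cstdint>
#include <deque>

// per joining client: ids of objects whose state has not reached them yet
struct JoiningClient {
    std::deque<uint32_t> notYetSent;
};

// each tick, forward a bounded slice of the world to the newcomer...
template <class SendObjectFn>
void streamStep(JoiningClient& c, SendObjectFn sendObject, size_t budget) {
    while (budget-- > 0 && !c.notYetSent.empty()) {
        sendObject(c.notYetSent.front());   // serialize + transmit that one object
        c.notYetSent.pop_front();
    }
}

// ...and if a checksum flags object N as desynced, simply queue it again
void onDesync(JoiningClient& c, uint32_t objectId) {
    c.notYetSent.push_back(objectId);
}
```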

Do not hesitate to ask further questions, e.g. if something is not clear. If wished, I can provide further short pseudo-code examples as well.

@Chaingun
I have the feeling that our understanding of the different approaches differs somehow. Maybe I missed something crucial, but the term "unsynched network architecture" seems unfitting for the approach I have in mind / tried to describe. Could it be that you are talking about a Quake networking model (i.e. fast-paced network models)? There, "desync" is an intrinsic part. But the way information is encoded does not automatically constrain you to a "(de)sync" approach. Of course it strongly depends on the scenario, but basically: add a timestamp indicating when to apply which update, make sure that all participants are on the same heartbeat, and you are in sync. This is very similar to the requirements of an event/action-driven approach. The main difference is the implication if a desync happens nonetheless.
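
For example (invented names, just to show the heartbeat idea): every update carries the tick at which it must be applied, and every client applies it at exactly that tick:

```cpp
#include <cstdint>
#include <map>

struct StateUpdate { uint32_t objectId; int32_t newX, newY; };

struct Client {
    uint32_t currentTick = 0;
    std::multimap<uint32_t, StateUpdate> scheduled;   // tick -> updates to apply

    void receive(uint32_t applyAtTick, const StateUpdate& u) {
        scheduled.emplace(applyAtTick, u);            // buffer until the agreed tick
    }
    void advanceTick() {
        ++currentTick;
        auto range = scheduled.equal_range(currentTick);
        for (auto it = range.first; it != range.second; ++it)
            applyToLocalState(it->second);            // all clients mutate at the same tick
        scheduled.erase(range.first, range.second);
    }
    void applyToLocalState(const StateUpdate&) { /* write into local game state */ }
};
```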

Can you please elaborate on why "it simply wouldn't be PDS grand strategy" / why you think that a sync approach would constrain the game architecture/mechanics?

You mentioned "Being a good engineer is a lot about applying the right tools for the job rather than being fixated about the wrong ones" and there we are on the same page! Right now (to the best of my knowledge-please correct me if I am wrong) most multiplayers are designed like singleplayers there every input is shared i.e. basically one runs n singleplayer games, put in the same thing at the same time and keeps fingers crossed that everything will be just fine. Does that sound like the right way? For me it sounds more like "wood is a nice construction material for a one family house so one can build a skyscraper with it as well"- of course this might be an option but is it always the best option to simply "upscale"?
 
In my first post I misunderstood you. What I've been saying after that applies to delta synchronization. You suggest scanning the entire game state (pls no) or having fine-grained retransmission of dirty objects dictated by application logic. While you might manage to hide this to some extent behind a nice code interface, I can't really see this happening without it leaking over and significantly constraining how new code is written. Synchronization would have to be very fine-grained indeed, since daily tick updates can best be described as random access on the entire game state. There are a LOT of writes every tick. You will be constraining the game mechanics if you have to go "we can't do this, it will incur too many writes!"

Moving very slightly in your direction might be feasible - for instance, not using an implied random seed and instead supplying it as a command parameter would allow the game to survive an OOS condition longer. There is still the downside that any OOS in the current architecture is basically undefined behavior and may crash the game (game programmers are "lazy" and don't write appropriate checks in command preconditions).
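
A rough sketch of what that could look like (not actual engine code; every name here is invented):

```cpp
#include <cstdint>
#include <random>

// The seed travels inside the command, so the result depends only on the
// command itself, not on each machine's implicit global RNG state.
struct AttackCommand { uint32_t attackerId, defenderId; uint64_t seed; };

int resolveAttack(const AttackCommand& cmd) {
    // mt19937_64 is fully specified by the standard, so identical seeds give
    // identical sequences on every platform (std::uniform_int_distribution
    // would not guarantee that, hence the plain modulo here)
    std::mt19937_64 rng(cmd.seed);
    return static_cast<int>(rng() % 6) + 1;   // a deterministic "die roll"
}
```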

The "commands" we have now are essentially very small delta updates, since user input happens to be a fairly minimal description of change in state. Supplant that by the actual changed values and now you're going to increase the delta size a lot. Commands do already put big constraints on how to write code, but are in the manageable zone given that gameplay code need not care much about their existence until it is time to do something in the user interface, with some exceptions wr.t. randomness and platform compatibility.

In short, gameplay programmers at Paradox can continue living happily in their singleplayer world for the most part, which is an objective of the current architecture.

Really, your idea would require a research project first to see if it would work (e.g. retrofitting it onto EU4); it would be too risky to plan it for a new game off the bat. I will not be the one to convince someone to spend money on doing this, though, since I am dubious about its outcome.

Sorry for multiple edits, I usually end up posting like this, then afterwards deciding I want to say something more.
 
I think now we are on the same page :) ! As most likely everyone is aware, the implementation strongly depends on the scenario. I think it is pretty safe to assume that in most cases a non-negligible part of the logic is written in C-ish dialects. Depending on the code style, either getters/setters, mutators, or something similar are used. So one way is to put a "markdirty(object_data_subset_id)" into each and every write access. To be honest, this can indeed be tedious depending on the code architecture, but it is a change with manageable invasiveness. The other change needed per object is to provide a serializer similar to this: http://codepad.org/aBLEFrYD . This is just one very basic example. Alternatively, if everything is done via events, one can hook into them and simply add a listener that knows which event affects which parts and sets the dirty marks accordingly - see the sketch below. And there are most likely many other ways to achieve this. But yes, you are right: it needs some investment. The question is which is more worthwhile: taking the "easy path" first and trying to fix all sources of OoS later, or investing some LOC up front, knowing that OoS are part of the "game" but do not harm you that much anymore.
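
To give a flavour of the two instrumentation options (everything here is invented example code; the codepad snippet was just one basic variant):

```cpp
#include <cstdint>
#include <functional>

// Option 1: mark inside the setter, so no call site can forget it.
class Province {
    int32_t development_ = 0;
    std::function<void(uint32_t)> markDirty_;    // reports object_data_subset_id
public:
    explicit Province(std::function<void(uint32_t)> markDirty)
        : markDirty_(std::move(markDirty)) {}
    void setDevelopment(int32_t v) {
        development_ = v;
        markDirty_(/*object_data_subset_id=*/1); // every write access reports itself
    }
};

// Option 2: if all changes already flow through events, a single listener can
// map each event type to the subsets it touches and set the same dirty marks.
void onEvent(uint32_t eventId, const std::function<void(uint32_t)>& markDirty) {
    switch (eventId) {
        case 42: markDirty(1); break;            // e.g. a "development changed" event
        default: break;
    }
}
```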

During my work as an embedded systems engineer I have learned that the robustness of systems (i.e. electronics, including computers) is not solely based on flawlessness but on tolerance against errors. Fun fact: computers at high altitude tend to perform worse than at sea level, due to the higher error rate induced by cosmic radiation. So in some sense it is just a matter of time until states across multiple machines become inconsistent.

Are you aware of any other major required changes or drawbacks besides the ones mentioned? Maybe I missed something.