Not so long ago, we started our adventure with Event Store in production. And I really mean it was an adventure.
Event Store was quite a new tool for all of us, and of course not everyone was convinced. We had a lot of discussions about the pros and cons of using it, but in the end it seemed to fit our scenarios perfectly, and it did until we moved to the production environment. After that we started to experience a lot of disconnections from the server. As a result, our catch-up subscriptions were not able to update the read model properly and our services stopped processing commands…
We spent a lot of time analyzing our logs, code, etc., and finally found that the issue was in the connection settings: it was related to the heartbeat interval and timeout.
It turns out that although the documentation recommends 750ms and 1050ms for LAN, and 5000ms and 1000ms for cloud (heartbeat interval and timeout respectively), for us this was completely different. In our case stable values turned out to be in the tens of seconds rather than milliseconds, and we finally resolved the problem by setting the heartbeat timeout to 20000ms and the interval to 40000ms.
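To make the effect of the timeout more concrete, here is a toy model in Python. It is not the Event Store client API, and the reply delays are invented for illustration rather than taken from our logs. The helper `count_disconnects` is a hypothetical name; it simply counts how many heartbeat replies would arrive too late for a given timeout, each of which would be treated as a dead connection.

```python
def count_disconnects(reply_delays_ms, heartbeat_timeout_ms):
    """Count heartbeat replies that arrive later than the timeout.

    Toy model only: each late reply is assumed to cause one disconnect.
    """
    return sum(1 for delay in reply_delays_ms if delay > heartbeat_timeout_ms)


# Hypothetical reply delays in milliseconds (made up, not measured).
observed_delays_ms = [40, 120, 900, 2500, 60, 7000, 15000, 300]

# Compare a documented-style timeout with the one that worked for us.
for timeout_ms in (1000, 20000):
    drops = count_disconnects(observed_delays_ms, timeout_ms)
    print(f"timeout={timeout_ms}ms -> {drops} disconnect(s)")
```

With delays like these, a timeout of around a second treats every occasional slow reply as a lost connection, while a timeout in the tens of seconds rides them out. The trade-off is that a genuinely dead connection also takes longer to detect.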