The problem was that the publisher thread got stuck waiting for the window
to re-open on a connection that had been closed without notifying the publisher.
Several changes were made to avoid this:
- Reading the monitoring information no longer acquires the lock on the PendingChanges object, so that monitoring can still be used to debug such problems.
- When a connection to a server goes down, the operation thread now never
tries to re-open the connection itself, but waits for the receiver thread to do it.
The operation thread waits in the post-op until the reconnection has finished or until the receiver thread has found that no replication server is available.
- Make the window mechanism more robust by introducing a loop around
the sendWindow.acquire() call so that the publisher thread is never
blocked indefinitely in this call in case of bugs or other problems
that could lead to this situation.
Also add a WindowProbe message that is sent to the replication server when the publisher notices that the window has been closed for a while, to check whether the window is really closed.
- Notify the publisher thread when the connection has been shut down.
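
The last two changes can be illustrated with a minimal sketch. It is not the
actual code of this change: it assumes sendWindow is a
java.util.concurrent.Semaphore, and the class WindowedPublisher and the methods
acquireSlot, sendWindowProbe, probeCount and shutdown are hypothetical names
chosen for illustration. The publisher waits for the window with a timeout,
sends a probe each time the timeout expires, and gives up as soon as it is
told that the connection has been shut down.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration only; names other than sendWindow are assumptions.
class WindowedPublisher {
    private final Semaphore sendWindow;            // one permit per free window slot
    private final AtomicInteger probes = new AtomicInteger();
    private volatile boolean shutdown = false;

    WindowedPublisher(int windowSize) {
        this.sendWindow = new Semaphore(windowSize);
    }

    // Returns true if a window slot was acquired, false if the connection
    // was shut down or the calling thread was interrupted.
    boolean acquireSlot(long timeoutMs) {
        try {
            while (!shutdown) {
                // Bounded wait instead of a plain acquire().
                if (sendWindow.tryAcquire(timeoutMs, TimeUnit.MILLISECONDS)) {
                    if (shutdown) {
                        // Woken up by shutdown(): hand the permit back.
                        sendWindow.release();
                        return false;
                    }
                    return true;
                }
                // Window has stayed closed for timeoutMs: ask the server.
                sendWindowProbe();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();    // preserve the interrupt status
        }
        return false;
    }

    // Stand-in for sending a WindowProbe message; here it only counts calls.
    protected void sendWindowProbe() {
        probes.incrementAndGet();
    }

    int probeCount() {
        return probes.get();
    }

    // Called when the connection goes down: wakes a publisher blocked in acquireSlot().
    void shutdown() {
        shutdown = true;
        sendWindow.release();
    }

    public static void main(String[] args) {
        WindowedPublisher publisher = new WindowedPublisher(1);
        System.out.println("slot acquired: " + publisher.acquireSlot(100));
        publisher.shutdown();
        System.out.println("after shutdown: " + publisher.acquireSlot(100));
    }
}
```

In this sketch the timeout both bounds the wait and triggers the probe, so a
lost window update or a missed shutdown notification delays the publisher
rather than blocking it forever.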