As announced earlier in #community-librem-one-staging:talk.puri.sm, a matrix synapse upgrade is in progress. Since this needs a newer version of postgresql and the database is huge, the postgresql upgrade is taking some time to complete. Maintenance like this is usually posted on https://status.librem.one/.
This update is now complete. matrix synapse was updated to 1.134.0 (from 1.118.0) and the database server was updated from Debian Buster to Debian Bullseye (PostgreSQL 11 to PostgreSQL 13). The normal postgres upgrade (which uses dump/restore internally) was taking forever, so I had to cancel it and do an in-place upgrade with the -m link option.
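For anyone repeating this, here is a rough sketch of what a link-mode upgrade looks like with Debian's pg_upgradecluster wrapper. The cluster name `main` is an assumption (it is the Debian default, not something confirmed in this thread), and the script only prints the command unless RUN=1 is set.

```shell
#!/bin/sh
# Sketch only: versions come from the post above, the cluster name is assumed.
OLD_VERSION=11
NEW_VERSION=13
CLUSTER=main   # assumption: Debian's default cluster name

# -m link is shorthand for pg_upgrade in --link mode (hard links, no data copy).
UPGRADE_CMD="pg_upgradecluster -m link -v $NEW_VERSION $OLD_VERSION $CLUSTER"

if [ "${RUN:-0}" = "1" ]; then
    pg_lsclusters          # list existing clusters before touching anything
    $UPGRADE_CMD
else
    echo "would run: $UPGRADE_CMD"
fi
```

Link mode is fast because data files are hard-linked instead of copied, but the old cluster should not be started again once the new one has been used, so a snapshot or backup beforehand is the only way back.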
Awesome! That’s good news.
Unfortunately something broke during the upgrade and users are not able to log in. So I’m going to restore the old version from backup.
I had tested the upgrade on two instances already (talk-staging.puri.sm and talk.puri.sm - both worked fine after the upgrade - though the database of chat.librem.one is much larger).
Restored matrix synapse 1.118.0 and postgres 11 on debian buster (the same versions as before the update).
I will now try to do just the postgresql/OS upgrade independently on 2nd August at 6am UTC.
PostgreSQL 11 to 13 update (Debian buster to bullseye) completed in 3 hours - most of that time went to taking a snapshot of the VM.
Update: it ended up having the same issues after a while - clients remained offline. So PostgreSQL upgrade was reverted, but OS upgrade was kept. So running PostgreSQL 11 on Debian Bullseye now.
I have cloned the instance and was able to reproduce the issue on the clone. This turned out to be a behavior change in how postgresql 13 handles statement timeouts[1]. The fix was to set 2m for statement timeout in the postgresql configuration. I will do the production update on 5th Aug at 6am UTC.
[1] PostgreSQL: Documentation: 17: 19.11. Client Connection Defaults
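A sketch of how such a timeout could be applied. PostgreSQL accepts time units such as ms, s, min, h and d, so two minutes is written '2min' here. Using ALTER SYSTEM instead of editing postgresql.conf by hand, and running psql through sudo as the postgres user, are assumptions rather than what was actually done on the server. The script only prints the statement unless RUN=1 is set.

```shell
#!/bin/sh
# Sketch only: the method (ALTER SYSTEM) is assumed, the value is from the post.
TIMEOUT='2min'   # two minutes, in PostgreSQL's unit syntax
SQL="ALTER SYSTEM SET statement_timeout = '$TIMEOUT';"

if [ "${RUN:-0}" = "1" ]; then
    sudo -u postgres psql -c "$SQL"
    # statement_timeout does not need a restart, a config reload is enough
    sudo -u postgres psql -c "SELECT pg_reload_conf();"
    # check the value seen by a new session
    sudo -u postgres psql -c "SHOW statement_timeout;"
else
    echo "would run: psql -c \"$SQL\""
fi
```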
Unfortunately, the fix that worked on the clone seems to be not enough on production. It is better than before - login works on element, but sync is still broken. I will try to troubleshoot a bit more, but plan to revert back to old version if I can’t figure this out today.
Sync as of right now is still rather slow, and seemingly unable to connect. There was an error about being unable to connect to an IP (I think 169.x.x.x), but I only saw it for a second so I am unsure.
We have still not been able to resolve this issue. A summary of what we have tried so far is shared with the synapse developers: "Upgrading PostgreSQL to 13 when PostgreSQL is run on a different server breaks synapse client sync" · Issue #18785 · element-hq/synapse · GitHub. Any help resolving this issue from others here who also run a synapse instance would be greatly appreciated.
Going back to PostgreSQL 11 is not really a solution, as synapse 1.120+ needs at least postgresql 13. If we go back, we will be stuck on an old version and will miss the upcoming security fix on Aug 11. See: Matrix.org - Pre-disclosure: Upcoming coordinated security fix for all Matrix server implementations
We made some progress - earlier even login was failing; now login works, but sync still fails. We are still trying to resolve this issue.
Thanks to @jonathon.hall - running vacuumdb --all --analyze-in-stages with statement_timeout = 1200000 (20m) seems to have fixed the issue.
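Some background on why this likely helped: for these versions pg_upgrade does not carry planner statistics over to the new cluster, so queries can be planned badly until the tables are analyzed again. A sketch of the command follows. Passing the timeout through PGOPTIONS for the vacuumdb session only is an assumption; it may equally have been set in the server configuration. The script only prints the command unless RUN=1 is set.

```shell
#!/bin/sh
# Sketch only: how the timeout was applied is assumed, the values are from the post.
TIMEOUT_MS=$((20 * 60 * 1000))   # 20 minutes in milliseconds = 1200000
VACUUM_CMD="vacuumdb --all --analyze-in-stages"

if [ "${RUN:-0}" = "1" ]; then
    # PGOPTIONS is read by libpq, so the setting applies to vacuumdb's sessions
    sudo -u postgres env PGOPTIONS="-c statement_timeout=$TIMEOUT_MS" $VACUUM_CMD
else
    echo "would run: PGOPTIONS='-c statement_timeout=$TIMEOUT_MS' $VACUUM_CMD"
fi
```

--analyze-in-stages runs ANALYZE in several passes with increasing statistics targets, so rough statistics are available quickly and the full ones follow later.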