User Details
- User Since
- May 11 2015, 8:31 AM (489 w, 3 h)
- Availability
- Available
- IRC Nick
- jynus
- LDAP User
- Jcrespo
- MediaWiki User
- JCrespo (WMF) [ Global Accounts ]
Today
It appears it was a hardware error on memory leading to an uncorrectable memory error, which ended up killing mysql:
Check the other hosts on puppet/icinga with notifications disabled; I think I saw others, but maybe those are being set up or decommissioned: db2185/6/7.
@ABran-WMF I see that, per T373579, its productionization has in theory finished and it is pooled as a candidate master, but it has notifications disabled. Is that expected (e.g. a hardware crash)?
@Scott_French wrote the patch:
Given the remaining time before switchover
Thu, Sep 19
Resumed ms backups on codfw.
I forgot to mention: I think orchestrator has a similar tool, but in the past I have found a tool like db-replication-tree useful for this kind of work (preparation) and for later tuning after the switchover:
ms backups on codfw are stopped. As usual, I am not asking for priority over my workmates, but if you could avoid leaving backup2007 for the end, I would appreciate it, so I can restart them and finish my week soon (I won't be around tomorrow).
This should be now done.
Not strictly needed, but maybe we could add a note so we don't pool it accidentally? We don't use notes very often, but they were designed for things like this (awareness of why a host was depooled for an extended time).
Not resolved: this is a blocker for the switchover, and we haven't yet fixed it for future runs. It is an outstanding issue and we need to do something about it, even if it is no longer happening.
Wed, Sep 18
Tue, Sep 17
I found an actual bug: this is failing:
Failed to run cookbooks.sre.switchdc.databases.finalize.FinalizeSection.clean_heartbeat: Failed to run 'DELETE FROM heartbeat WHERE server_id=180360463' on db1125.eqiad.wmnet
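For illustration only, a sketch of a more forgiving version of that cleanup step (assuming pymysql, hypothetical credentials, and that a missing row is one possible cause of the failure; this is not the cookbook's actual code):

```python
# Hypothetical sketch: delete the stale heartbeat row only if it exists,
# so that a missing row does not abort the finalize step.
import pymysql  # assumed client library

SERVER_ID = 180360463  # value taken from the error message above

conn = pymysql.connect(host="db1125.eqiad.wmnet", user="root", database="heartbeat")
try:
    with conn.cursor() as cur:
        cur.execute("SELECT COUNT(*) FROM heartbeat WHERE server_id = %s", (SERVER_ID,))
        (count,) = cur.fetchone()
        if count:
            cur.execute("DELETE FROM heartbeat WHERE server_id = %s", (SERVER_ID,))
            conn.commit()
        else:
            print(f"No heartbeat row for server_id={SERVER_ID}; nothing to delete")
finally:
    conn.close()
```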
One additional comment about the process, not necessarily the script: the post-maintenance script is confusing, as it will be run post-maintenance, but its parameters are given in the direction of the maintenance (while replication will be flowing in the previous direction).
I was able to see it fail, so the check works as expected (that's good):
**MASTER_TO db2230.codfw.wmnet MASTER STATUS is not stable, see the extended logs** Failed to run cookbooks.sre.switchdc.databases.prepare.PrepareSection.wait_master_to_position: MASTER_TO db2230.codfw.wmnet MASTER STATUS is not stable, see the extended logs
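Presumably the check compares the binlog coordinates over a short interval; the sketch below is only my guess at that logic (assuming pymysql, a hypothetical user, and an arbitrary pause), not the cookbook's implementation:

```python
# Hedged sketch of a "master status is stable" check: read the binlog
# coordinates twice with a short pause and require them to be identical.
import time
import pymysql  # assumed client library

def master_status(conn):
    with conn.cursor() as cur:
        cur.execute("SHOW MASTER STATUS")
        row = cur.fetchone()
        return row[:2]  # (binlog file, position)

conn = pymysql.connect(host="db2230.codfw.wmnet", user="repl_check")  # hypothetical user
first = master_status(conn)
time.sleep(2)  # arbitrary pause, for illustration only
second = master_status(conn)
conn.close()
if first != second:
    raise RuntimeError(f"MASTER STATUS is not stable: {first} != {second}")
```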
Minor usability point: given the 10 seconds of wait, I would add a print saying that this is happening. When there is a 1-second pause it is ok, but here I would explicitly print something informative, such as "waiting 10 seconds to make sure all pending events/transactions/writes are caught up", so the operator feels ok. :-D
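Something like this would do (a trivial sketch of the suggestion; the exact wording is just an example):

```python
# Tell the operator why the cookbook pauses instead of sleeping silently.
import time

print("Waiting 10 seconds to make sure all pending events/transactions/writes are caught up...")
time.sleep(10)
```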
Removing tags to avoid IRC spam until tests complete.
The other thing I saw after T371351#10153483 is that, on the next step, if I run the disabling of GTID twice, there is no error or warning.
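A minimal sketch of what such a warning could look like (assuming pymysql, MariaDB's Using_Gtid column and hypothetical credentials; this is not the cookbook's code):

```python
# Hedged sketch: warn when asked to disable GTID on a replica where it is
# already disabled, instead of silently succeeding a second time.
import pymysql  # assumed client library
from pymysql.cursors import DictCursor

def disable_gtid(host):
    conn = pymysql.connect(host=host, user="root", cursorclass=DictCursor)  # hypothetical credentials
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW SLAVE STATUS")
            status = cur.fetchone()
            if status and status.get("Using_Gtid", "No") == "No":
                print(f"WARNING: GTID is already disabled on {host}, nothing to do")
                return
            cur.execute("STOP SLAVE")
            cur.execute("CHANGE MASTER TO MASTER_USE_GTID=no")
            cur.execute("START SLAVE")
    finally:
        conn.close()
```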
I am going to create a dedicated task for production testing, to also avoid noise here and on IRC.
I executed:
However, it was removed before at c9fe19ccd39c89274f9f6f.
Thoughts?
I will take over the hosts for unrelated testing; I will reload data from backups anyway.
Mon, Sep 16
Thu, Sep 12
After the restart, the server looks much less I/O-stressed.
I've stopped codfw media backups.
[11:44] <jynus> I will restart now db1171:s7, and s8 on the same host in ~1h, when the dumps there finish
[11:44] <jynus> to apply the buffer pool change
Yeah, that would explain it. So root cause found. I still want to merge the patch to optimize memory assignment (future schema changes will happen there).
This method is used: https://wikitech.wikimedia.org/wiki/Primary_database_switchover but it only works when switching with working replication and direct parent-child relationships, so it cannot be applied to wikireplicas; it may not even be applicable at all, due to skipped/modified/additional transactions caused by filtering (which is why they are so hard to handle).
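As a rough illustration of the direct parent-child requirement, a hedged sketch of a pre-check one could run before following the documented procedure (host names and credentials are hypothetical; this is not part of that procedure):

```python
# Confirm the replica really hangs directly off the old primary, since the
# documented switchover method assumes a direct parent-child topology.
import pymysql  # assumed client library
from pymysql.cursors import DictCursor

OLD_PRIMARY = "db1100.eqiad.wmnet"   # hypothetical host
REPLICA = "db1101.eqiad.wmnet"       # hypothetical host

conn = pymysql.connect(host=REPLICA, user="root", cursorclass=DictCursor)
with conn.cursor() as cur:
    cur.execute("SHOW SLAVE STATUS")
    status = cur.fetchone()
conn.close()

if not status or status["Master_Host"] != OLD_PRIMARY:
    raise RuntimeError(
        f"{REPLICA} does not replicate directly from {OLD_PRIMARY}; "
        "the documented switchover method does not apply"
    )
print("Direct parent-child relationship confirmed")
```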
I wouldn't be responsible if I didn't tell you that GTID has been very error-prone for us, that it has been very unreliable, and that this is why I believe it is not used in production at the moment. GTID works well when it works well, and terribly when it doesn't. The only reason GTID is enabled in production is the InnoDB-safe replication tracking on crash.
Wed, Sep 11
I will want to stop ms backups at codfw for backup2007 before it happens. No big deal if I don't do it (just some backups will be marked as failed and probably retried later), but that way we avoid extra failures.
I will want to stop ms backups at codfw for backup2011 before it happens. No big deal if I don't do it (just some backups will be marked as failed and probably retried later), but that way we avoid extra failures.
Tue, Sep 10
Things are ok now; I may tune more later. I will ask about the deploy1002 issue separately.
Let's rename it to the cause, not the suggested solution.
deploy1002.eqiad.wmnet backups failed. I am unsure if your team handles that, but do you happen to know if that host no longer exists while the backups are still active? Can it be removed from the puppet cache/config?
Aside from making sure the config was loaded and distributed, I had to do some additional work:
This errored out, as it was running while the config was being updated:
586570 Incr 2,922 28.05 G Error 10-Sep-24 02:38 arclamp2001.codfw.wmnet-Monthly-1st-Tue-productionEqiad-arclamp-application-data
CC @Dzahn
Mon, Sep 9
Thanks, everyone. I think @MatthewVernon 's suggestion is fair, and something I should have done. I will update the code so that bounces get sent to root@. While I know mail is not reliable, I just found it weird that the same kind of message (as it is automated) got filtered only that one time.
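For illustration, a minimal sketch of that change (addresses and the transport are hypothetical; the key point is that bounces follow the envelope sender, not the From: header):

```python
# Send the automated report so that any bounce goes back to root@.
import smtplib
from email.message import EmailMessage

msg = EmailMessage()
msg["From"] = "backup-reports@example.wmnet"   # hypothetical address
msg["To"] = "ops@example.wmnet"                # hypothetical address
msg["Subject"] = "Backup report"
msg.set_content("report body goes here")

with smtplib.SMTP("localhost") as smtp:
    # The envelope sender (from_addr), not the From: header, is where
    # bounces are delivered, so point it at root@ as suggested.
    smtp.send_message(msg, from_addr="root@example.wmnet")
```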
Jul 9 2024
0 -> backup1004                0
1 -> backup1004                0
2 -> backup1004                0
3 -> backup1004 -> backup1005  1 (done)
4 -> backup1005 *              1
5 -> backup1005 *              1
6 -> backup1005 * backup1006   2
7 -> backup1005 * backup1006   2
8 -> backup1006                2
9 -> backup1006 -> backup1007  3 (done)
a -> backup1006 -> backup1007  3 (done)
b -> backup1006 -> backup1007  3 (done)
c -> backup1007 -> backup1011  4 (done)
d -> backup1007 -> backup1011  4 (done)
e -> backup1007 -> backup1011  4 (done)
f -> backup1007 -> backup1011  4 (done)
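For illustration only, the target mapping above expressed as code (the helper, the md5 choice and the example file name are assumptions for the sketch, not the actual tooling):

```python
# Route a media file to its new backup host based on the first hex digit
# of its hash, following the resharding table above.
import hashlib

NEW_HOST_BY_HEX_DIGIT = {
    "0": "backup1004", "1": "backup1004", "2": "backup1004",
    "3": "backup1005", "4": "backup1005", "5": "backup1005",
    "6": "backup1006", "7": "backup1006", "8": "backup1006",
    "9": "backup1007", "a": "backup1007", "b": "backup1007",
    "c": "backup1011", "d": "backup1011", "e": "backup1011", "f": "backup1011",
}

def target_host(title: str) -> str:
    """Return the backup host for a file, sharding on the first hex digit of its hash."""
    first_digit = hashlib.md5(title.encode("utf-8")).hexdigest()[0]
    return NEW_HOST_BY_HEX_DIGIT[first_digit]

print(target_host("Example.jpg"))  # hypothetical file name
```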
Jul 8 2024
Resharding completed; the only pending items are 2 running purge screens on ms-backup2001 and ms-backup2002 for purging leftovers. backup1011 & backup2011 will have to be complemented by backup1012 and backup2012 this quarter.
Jul 4 2024
Running it manually, it worked every time, so I am confused: it doesn't seem to be a script issue. Could it be a mailing subsystem issue?
Jul 3 2024
1 more week left to finish the resharding.
Backlog for when I come back.
5 million files left to recover!
It would be nice to productionize this, but I haven't had the time so far.
This has been worked around with the mini-loader method of restoring backups, so I would call it resolved.
I will skip the "Remove dump user", as I think that may be useful and we will decide how to leave it long term when the es1, es2 & es3 backups are generated (with or without the user).
Let's wait a little bit before deleting the files on the old dbprovs just in case (I will do it when I come back).
@Volans @ABran-WMF FYI
Jul 2 2024
es4 has already been archived on jobs 574899 and 574900; the two jobs for es5 are running now. When they finish, we will be able to close this ticket.
Jul 1 2024
No action will be needed for backup1010 in the end.
@Davenyi please note you missed the options asked for on the form, as seen above.
If I may @fnegri, the issue is that those hosts are special in a way, because they are pieces (data) of production (meaning here mediawiki) in the cloud realm, so it may not be easy to solve with the current architecture. If there was an implementation where absolutely all non-public data and configuration was deleted on the production side (e.g. a message protocol that cleans everything up and reconstructs it again on the cloud network), that would solve all concerns, but it would be way more complex and would require a lot of work. And only now is there the start of a proper inventory where each table and column will document its privacy and the concerns for global usage and editing.