Webhosting2040 outage

Incident Report for InterServer

Postmortem

What happened: Webhosting2040 is a standard cPanel webhosting server that does get backups, outlined in more detail at https://www.interserver.net/tips/kb/shared-hosting-and-backups/ - The basic backup setup on most plans includes weekly backups via JetBackup with a retention of 4 backups.

The system went read-only on Sunday the 9th around 9 PM EST, requiring manual intervention. Sunday is also the weekly backup day, and the system was in the middle of backing up accounts; that backup run did not finish.

On inspection the system showed a RAID error: data in cache not yet written, and a drive failure. This system runs RAID 10 with a battery-backed cache. The drive was swapped and the cached data written out, but when the system attempted to boot it reported an XFS error on the file system. At that point a file system check was started. XFS repair times vary, but with this amount of data a repair can take several hours. After about two hours it was clear that the best course was to start a parallel restore from backups, so a new server was built out and prepped for the restore.

Backups for webhosting are designed to run fast by being incremental: the first backup is slow, and subsequent runs back up only the changes. This keeps backup I/O low, but the downside is that restores can take a while, because all of the files must be copied back over, much like that slow first backup, before being restored. The restore process for all accounts was mostly complete after 24 hours.
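As a rough illustration of the trade-off described above, here is a minimal sketch of a modification-time-based incremental backup. This is a hypothetical example, not JetBackup's actual implementation; real tools also track deletions, ownership, and database dumps.

```python
import os
import shutil

def incremental_backup(source: str, dest: str, last_run: float) -> list[str]:
    """Copy only files modified since last_run (epoch seconds).

    A minimal sketch of an incremental backup. With last_run=0 every
    file is copied (the slow first run); later runs copy only changes.
    """
    copied = []
    for root, _dirs, files in os.walk(source):
        for name in files:
            src = os.path.join(root, name)
            if os.path.getmtime(src) > last_run:
                rel = os.path.relpath(src, source)
                dst = os.path.join(dest, rel)
                os.makedirs(os.path.dirname(dst), exist_ok=True)
                shutil.copy2(src, dst)  # copy2 preserves the mtime
                copied.append(rel)
    return copied
```

The asymmetry the postmortem describes falls out of this design: each nightly run only touches changed files, but a full restore has no shortcut and must copy the entire data set back, like that first slow backup.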

During the process the priority is getting the majority of users back online as quickly as possible. I apologize that Node.js was not initially available. It was not enabled during the initial setup before the restore process ran; it is an extra in CloudLinux and was missed. Node.js is minimally used, and it was not until a request came in for it that it was added. Part of the issue is that full server restores are so rare that they do not follow the normal server setup process, as the focus is on data restores and getting everything online first.

What may change in the future? This server, while not end of life, is no longer used for new builds. New builds are virtualized, have a backup of the ZFS file system, and allow snapshots and full server backups to remote systems. All new systems are set up this way, but migrations take a significant amount of time. There are 8 of these systems left that are not yet virtualized, though they are not end of life. XFS was dropped in favor of ext4 on the virtualized systems, and webhosting2040 now runs this setup. Ideally, after a future system problem we can roll back to a previous state with minimal loss of a few hours, and back up block by block to a remote file system. While block-level backup does not allow for easy single-user restores, it does allow for faster disaster recovery. JetBackup remains for individual account data.

Ideally I had hoped for a drive replacement, a restart, the cached data written back, a boot, and the system back online. The second-best option would have been the XFS repair finishing after a few hours of downtime with no corruption, followed by a boot with the system as it was. Unfortunately, a full server restore from the available backups was required. That was a mix of backups: backups from the 9th for the accounts whose backups had finished, daily backups for accounts on plans beyond the standard, and, for the rest, backups from the week prior.

Posted Nov 12, 2025 - 14:11 EST

Resolved

Account recoveries have been completed. While there is some additional work to be done, including starting new JetBackup backups, the incident will be closed. Please contact support with any more specific questions or needed updates.
Posted Nov 12, 2025 - 14:10 EST

Monitoring

Most accounts have been restored. Staff are going through sites to ensure PHP compatibility, as the system has been restored on the latest OS, which has a newer default PHP version.
Posted Nov 11, 2025 - 14:51 EST

Update

The backup data has been fully copied over for all accounts, allowing the remaining accounts to restore in parallel and faster. Exim will begin accepting the queued messages in 1 hour.

All accounts are expected to be restored over the next few hours.
Posted Nov 11, 2025 - 01:01 EST

Update

Restores continue to run. While many accounts have been restored, there are still a number in progress. The restore includes:

* copying the data over
* extracting some data, such as MySQL databases
* restoring the account
* setting the original IP

However we expect the restore to pick up more speed in the coming hours.
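The per-account steps above could be orchestrated roughly like this. This is a hypothetical sketch: the function names and the use of a thread pool are assumptions for illustration, not InterServer's actual tooling.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical step functions; in practice each would wrap real tooling
# (e.g. an rsync for the copy, a mysql import for databases).
def copy_data(account: str) -> None: ...
def extract_databases(account: str) -> None: ...
def restore_account(account: str) -> None: ...
def set_original_ip(account: str) -> None: ...

def restore_one(account: str) -> str:
    """Run the four restore steps for a single account, in order."""
    copy_data(account)
    extract_databases(account)
    restore_account(account)
    set_original_ip(account)
    return account

def restore_all(accounts: list[str], workers: int = 4) -> list[str]:
    """Restore accounts concurrently. Once the bulk data copy is done,
    more accounts can run at once, which is why restores speed up
    over time."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(restore_one, accounts))
```

Because the slow data copy happens once per account while the later steps are comparatively quick, throughput rises as more accounts clear the copy stage, matching the expectation that the restore picks up speed.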
Posted Nov 10, 2025 - 19:36 EST

Identified

The replacement server is online with the IPs moved, and account restores have begun.

To prevent mail loss, port 25 will be closed until most accounts are restored; sending servers will queue mail and retry delivery.
Posted Nov 10, 2025 - 11:51 EST

Update

A parallel restore from the JetBackup backups is being prepped on a new server build, as a restore from backups appears to be the quickest resolution.
Posted Nov 10, 2025 - 09:31 EST

Update

The webhosting2040 server lost a drive in its RAID 10 array, then locked up and went read-only when the drive failed. The reported drive has been replaced and a file system check is running. This is a lengthy check, and it continues to run, looking for any possible file corruption.
Posted Nov 10, 2025 - 05:17 EST

Investigating

We are investigating an outage on the webhosting2040 server that appears to be hardware RAID related.
Posted Nov 10, 2025 - 00:16 EST
This incident affected: Services (Webhosting).