Cannot SSH into GCP VM Instance Despite Resizing the (Full) Boot Disk
One of my Compute Engine VM instances on GCP with a boot disk of 10GB
suddenly stopped accepting SSH connections.
$ gcloud compute ssh my-instance
[email protected]: Permission denied (publickey).
Recommendation: To check for possible causes of SSH connectivity issues and get
recommendations, rerun the ssh command with the --troubleshoot option.
gcloud compute ssh my-instance --troubleshoot
Or, to investigate an IAP tunneling issue:
gcloud compute ssh my-instance --troubleshoot --tunnel-through-iap
ERROR: (gcloud.compute.ssh) [/usr/bin/ssh] exited with return code [255].
This was really weird. So I tried troubleshooting (as seen in the output above):
$ gcloud compute ssh my-instance --troubleshoot
...
VM status: 1 issue(s) found.
The VM may need additional disk space. Resize and then restart the VM, or run a startup script to free up space.
Disk: https://www.googleapis.com/compute/v1/projects/my-project/zones/us-central1-a/disks/my-instance
Help for resizing a boot disk: https://cloud.google.com/sdk/gcloud/reference/compute/disks/resize
Help for running a startup script: https://cloud.google.com/compute/docs/startupscript
Help for additional troubleshooting of full disks: https://cloud.google.com/compute/docs/troubleshooting/troubleshooting-disk-full-resize#filesystem
...
So it turned out that I could not SSH because my boot disk was full. The entire 10GB
of disk space was exhausted! What should I do?
Since the resizing link is right there in the console output above, I checked it out and ran the following command (you can also perform this operation from the GCP Console UI):
$ gcloud compute disks resize my-instance-disk --size=50GB
It worked just fine and the resizing happened. Yet I still could not log into the instance. Running the --troubleshoot
command again dumped the same error about the disk being full. So I decided to check out the troubleshooting link this time. It says:
After you resize a VM boot disk, most VMs resize the root file system and restart the VM. However, for some VM image types, you might have to resize the file system manually. If your VM does not support automatic root file system resizing, or if you’ve resized a data (non-boot) persistent disk, you must manually resize the file system and partitions.
So at this point I realised that even though the boot disk had been resized (it had additional capacity now), its filesystem and partition hadn’t been resized. According to the documentation, though, the filesystem and partition resizing should happen automatically. As it says – most VMs resize the root file system. What are “most VMs” though? From this documentation page, I think this is what they mean:
When you create a custom Linux image or custom Windows image, you need to manually increase the size of the boot and non-boot disks. If you’re using a public image, Compute Engine automatically resizes the boot disks.
And I was using a public image, so it was really perplexing that the resizing didn’t happen.
The troubleshooting page also talks about inspecting the serial port output to search for any log lines that contain resiz (a substring of resizing, resized, resize, etc.). I did go to the my-instance page in the GCP Console and clicked on Logs > Serial Port 1 (console), only to find no logs that talked about resizing. I restarted my VM just to be sure, and again nothing. It did have a bunch of exceptions and errors about the disk being full, though. This basically meant that whatever part of GCP’s code is supposed to do the auto-resizing wasn’t getting triggered once the disk had become full.
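By the way, you can also dump the serial port output from the CLI instead of clicking through the Console; a quick sketch using a standard gcloud command:
# Dump the serial console log and grep for resize-related lines.
$ gcloud compute instances get-serial-port-output my-instance | grep -i resiz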
What Did Not Work
Apart from resizing the boot disk (above), there were a couple of things (on the easier side) that I tried which didn’t seem to work at all.
First, I added a startup script to delete the files that I knew would have caused the disk to become full (a sketch of such a script follows this list). This basically involved:
- Opening the instance page on the GCP Console.
- Clicking Edit.
- Going to the Startup script section and adding a script to rm the files or folders causing disk bloat.
- Restarting the instance.
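For reference, here is roughly what that looks like when done via instance metadata instead of the Console UI (the Console field sets the same startup-script metadata key under the hood); the file path is a made-up example:
# Set a startup script that deletes the known disk hog on boot.
# /home/ubuntu/big-file.log is hypothetical; point it at your actual culprit.
$ gcloud compute instances add-metadata my-instance \
    --metadata startup-script='#! /bin/bash
rm -f /home/ubuntu/big-file.log'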
This did not work since I think the startup process never gets a chance to execute the script if the disk is full.
Second, it seemed like connecting to the interactive serial console would give me shell access through which I could delete the bloated files/folders. But in my case, although the serial console was enabled, my Linux user did not have a password set. Interestingly, while the connection to the serial console itself is authenticated with SSH keys, the login prompt you land on requires a traditional username/password login for Linux VMs. This might work for you if you have a password set for your Linux user, but it was not an option in my case.
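If your user does have a password, the flow would look something like this (a sketch using standard gcloud commands):
# Enable the interactive serial console on the instance, then connect to it.
$ gcloud compute instances add-metadata my-instance \
    --metadata serial-port-enable=TRUE
$ gcloud compute connect-to-serial-port my-instance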
Third, I decided to take a snapshot of the boot disk (of my-instance) and swap the instance’s boot disk with a new one created out of this snapshot. I thought maybe the auto-resizing of the partitions and file system might kick in with this process. The process for this was:
- Go to the instance page in the GCP Console.
- Turn off the instance.
- Click Edit and go to the Boot disk section. Hit Detach Boot Disk.
- Once detached, hit Configure Boot Disk and create and attach one from the snapshot taken.
- Click Save to propagate the changes.
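For the record, the same detach/swap dance can also be driven from the CLI; a rough sketch (the disk names here are assumptions):
# Stop the VM, detach its current boot disk, and attach the snapshot-based one.
$ gcloud compute instances stop my-instance
$ gcloud compute instances detach-disk my-instance --disk=my-instance-disk
$ gcloud compute instances attach-disk my-instance \
    --disk=disk-from-snapshot --boot
$ gcloud compute instances start my-instance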
This did not work either. It seemed like the extension of the partition or the removal of the bloated files would have to be done manually. So this is what I did to solve the problem.
What DID Work
The solution here is to:
1. Take a snapshot of my-instance.
2. Create a disk out of it.
3. Create a new instance.
4. Attach the disk (2) to the new instance (3).
5. Either resize the partitions and filesystems of the disk from the new instance (SSH) or delete the files that were causing disk bloat. Either works.
6. Detach the disk from the new instance.
7. Swap the boot disk of my-instance with the new disk.
8. Restart my-instance and enjoy!
Creating a snapshot from the GCP Console involves going to the Compute Engine Disks section, clicking on the relevant boot disk for the instance, and then hitting the Create Snapshot button.
Then creating a new instance is straightforward. While creating it, an additional new disk can be created off the snapshot taken in the previous step and attached.
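If you prefer the CLI, the first few steps look roughly like this (the snapshot, disk, and instance names as well as the zone are assumptions):
# Snapshot the full boot disk, create a disk from the snapshot, and attach it
# to a fresh rescue instance as a secondary (non-boot) disk.
$ gcloud compute disks snapshot my-instance-disk --zone=us-central1-a \
    --snapshot-names=my-instance-snap
$ gcloud compute disks create rescue-disk --source-snapshot=my-instance-snap \
    --zone=us-central1-a
$ gcloud compute instances create rescue-vm --zone=us-central1-a \
    --disk=name=rescue-disk
$ gcloud compute ssh rescue-vm --zone=us-central1-a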
Once the instance is created, we SSH into it and either remove the bloated files from the new disk or extend the partition. Let’s see how to do the former first.
# Without the additional disk attached (only boot disk)
$ sudo lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 50G 0 disk
|-sda1 8:1 0 49.9G 0 part /
|-sda14 8:14 0 3M 0 part
`-sda15 8:15 0 124M 0 part /boot/efi
# With the additional disk attached (boot disk + additional disk)
$ sudo lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 50G 0 disk
|-sda1 8:1 0 49.9G 0 part /
|-sda14 8:14 0 3M 0 part
`-sda15 8:15 0 124M 0 part /boot/efi
sdb 8:16 0 50G 0 disk
|-sdb1 8:17 0 9.9G 0 part
|-sdb14 8:30 0 4M 0 part
`-sdb15 8:31 0 106M 0 part
Using lsblk, we can see the different disks attached and their partitions. Now all I had to do was mount the /dev/sdb1 partition somewhere in the filesystem and then delete the culprit files:
$ sudo mkdir -p /mnt/disks/my-instance
$ sudo mount /dev/sdb1 /mnt/disks/my-instance/
$ sudo lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 50G 0 disk
|-sda1 8:1 0 49.9G 0 part /
|-sda14 8:14 0 3M 0 part
`-sda15 8:15 0 124M 0 part /boot/efi
sdb 8:16 0 50G 0 disk
|-sdb1 8:17 0 9.9G 0 part /mnt/disks/my-instance
|-sdb14 8:30 0 4M 0 part
`-sdb15 8:31 0 106M 0 part
$ rm /mnt/disks/my-instance/home/ubuntu/big-file.log
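If you are not sure what filled the disk in the first place, a du listing sorted by size can point at the biggest offenders before you delete anything:
# Show the 20 largest directories up to two levels deep on the mounted disk.
$ sudo du -xh /mnt/disks/my-instance --max-depth=2 | sort -rh | head -20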
Then I just detached the disk (from the instance’s edit page, you can simply go to the additional disks section and remove it) and swapped it with the boot disk of my-instance. To do the swapping, you have to perform the same steps as in the third option of the What Did Not Work section above. Basically, you can do this on the instance’s edit page under the Boot disk section.
Once I restarted my instance (swapping the boot disks requires the instance to be stopped anyway), I could see the following lines in my serial port logs:
Jul 3 09:01:33 my-instance kernel: [ 8.625686] EXT4-fs (sda1): resizing filesystem from 2593019 to 13078779 blocks
Jul 3 09:01:33 my-instance kernel: [ 8.750863] EXT4-fs (sda1): resized filesystem to 13078779
And I could finally SSH into my instance!
$ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/root 49G 8.9G 40G 19% /
...
Let’s also look at the other option: how the partition and file system on the new disk could have been resized (grown) instead of deleting existing files/folders. As a recap, after the disk was attached, this is what lsblk showed us:
sdb 8:16 0 50G 0 disk
|-sdb1 8:17 0 9.9G 0 part
|-sdb14 8:30 0 4M 0 part
`-sdb15 8:31 0 106M 0 part
As we can see, the disk size is 50G, but sdb1, which is the root partition of the disk (and of my-instance), is only 9.9G. So, using the parted command, we can first grow the partition:
$ sudo parted /dev/sdb
GNU Parted 3.4
Using /dev/sdb
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) resizepart
Warning: Not all of the space available to /dev/sdb appears to be used, you can fix the GPT to use all of the space (an extra
83886080 blocks) or continue with the current setting?
Fix/Ignore? Fix
Partition number? 1
End? [10.7GB]? 100%
(parted) quit
Information: You may need to update /etc/fstab.
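As an aside, if the image ships growpart (from the cloud-guest-utils package on Debian/Ubuntu), it can do the GPT fix and the partition grow in one non-interactive step; an alternative to the parted session above, not something I ran here:
# Grow partition 1 of /dev/sdb to fill the available space.
$ sudo growpart /dev/sdb 1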
Let’s take a look at lsblk now:
sdb 8:16 0 50G 0 disk
├─sdb1 8:17 0 49.9G 0 part
├─sdb14 8:30 0 4M 0 part
└─sdb15 8:31 0 106M 0 part
As we can see, sdb1 has grown from 9.9G to 49.9G. But this is not enough; we also have to resize the file system. If we quickly mount the partition and look at the file system size using df, we will notice the difference:
$ sudo mount /dev/sdb1 /mnt/disks/my-instance/
$ df -h
Filesystem Size Used Avail Use% Mounted on
...
/dev/sdb1 9.6G 9.5G 0 100% /mnt/disks/my-instance
As we can see, the file system size is much less than 49.9G, i.e., it is the same as the old partition size. So let’s resize it as well, using resize2fs:
$ sudo resize2fs /dev/sdb1
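Note that since the partition is mounted here, resize2fs performs an online grow. If you instead resize the filesystem while it is unmounted, resize2fs will likely ask you to run a filesystem check first; roughly:
# Offline variant (only if the partition is NOT mounted):
$ sudo e2fsck -f /dev/sdb1
$ sudo resize2fs /dev/sdb1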
Now let’s check the file system size:
$ df -h
Filesystem Size Used Avail Use% Mounted on
...
/dev/sdb1 49G 9.5G 39G 20% /mnt/disks/my-instance
...
Again, once the resizing is done, detach the disk and swap it with the boot disk of my-instance (same as before).
Hope that helps!