One of my Compute Engine VM instances on GCP with a boot disk of
10GB suddenly stopped accepting SSH connections.
$ gcloud compute ssh my-instance
[email protected]: Permission denied (publickey).
Recommendation: To check for possible causes of SSH connectivity issues and get
recommendations, rerun the ssh command with the --troubleshoot option.

gcloud compute ssh my-instance --troubleshoot

Or, to investigate an IAP tunneling issue:

gcloud compute ssh my-instance --troubleshoot --tunnel-through-iap

ERROR: (gcloud.compute.ssh) [/usr/bin/ssh] exited with return code .
This was really weird. So I tried the troubleshooting suggested in the output above:
$ gcloud compute ssh my-instance --troubleshoot
...
VM status: 1 issue(s) found.
The VM may need additional disk space. Resize and then restart the VM, or run a
startup script to free up space.
Disk: https://www.googleapis.com/compute/v1/projects/my-project/zones/us-central1-a/disks/my-instance
Help for resizing a boot disk: https://cloud.google.com/sdk/gcloud/reference/compute/disks/resize
Help for running a startup script: https://cloud.google.com/compute/docs/startupscript
Help for additional troubleshooting of full disks: https://cloud.google.com/compute/docs/troubleshooting/troubleshooting-disk-full-resize#filesystem
...
So it turned out that I could not SSH in because my boot disk was full. The entire
10GB of disk space was exhausted! What should I do?
Since the resizing link was right there in the console output above, I checked it out and ran the following command (you can also perform this operation from the GCP UI Console):
$ gcloud compute disks resize my-instance-disk --size=50GB
It worked just fine and the resizing happened. Yet I still could not log into the instance. Running the --troubleshoot command again dumped the same error about the disk being full. I decided to check out the troubleshooting link this time. It says:
After you resize a VM boot disk, most VMs resize the root file system and restart the VM. However, for some VM image types, you might have to resize the file system manually. If your VM does not support automatic root file system resizing, or if you’ve resized a data (non-boot) persistent disk, you must manually resize the file system and partitions.
So at this point I realised that even though the boot disk had been resized (it had additional capacity now), its filesystem and partition hadn’t been resized. According to the documentation, though, the filesystem and partition resizing should happen automatically; as it says, “most VMs resize the root file system”. What are “most VMs” though? From this documentation page, I think this is what they mean:
When you create a custom Linux image or custom Windows image, you need to manually increase the size of the boot and non-boot disks. If you’re using a public image, Compute Engine automatically resizes the boot disks.
And I am using a public image, so the resizing not happening was really perplexing.
The troubleshooting page also talks about inspecting the serial port output to search for any log lines containing resiz (a substring of resize, resizing, etc.). So I went to the my-instance page in the GCP Console and clicked on Logs > Serial Port 1 (console), only to find no logs that talked about resizing. I restarted my VM just to be sure, and again nothing. It did have a bunch of exceptions and errors about the disk being full though. This basically meant that whatever part of GCP’s code is supposed to do the auto-resizing wasn’t getting triggered once the disk had become full.
What Did Not Work
Apart from resizing the boot disk (above), there were a couple of things (on the easier side) that I tried which didn’t seem to work at all.
First, I added a startup script to delete the files that I knew had caused the disk to become full. This basically involved:
- Opening the instance page on the GCP Console.
- Going to the Startup script section and adding a script to rm the files or folders causing disk bloat.
- Restarting the instance.
This did not work; I think the startup process never gets a chance to execute the script if the disk is full.
Second, it seemed like connecting to the serial console would give me shell access through which I could delete the bloated files/folders. But in my case, although the serial console was enabled, my Linux user did not have a password set. Interestingly, while the serial console connection itself is authenticated with SSH keys, logging in at the serial prompt on a Linux VM still requires a traditional username/password login. This might work for you if you have a password set for your Linux user, but it was not an option in my case.
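If you do have a password set, this route is worth trying. Enabling the serial console and connecting to it looks roughly like this (the instance name is the one from this post):

```shell
# Allow serial console connections on the instance
gcloud compute instances add-metadata my-instance \
    --metadata serial-port-enable=TRUE

# Open an interactive serial console session; once connected you get a
# login prompt that expects the local username and password
gcloud compute connect-to-serial-port my-instance
```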
Third, I decided to take a snapshot of the boot disk (of my-instance) and swap the instance’s boot disk with a new one created out of this snapshot. I thought maybe the auto-resizing of the partition and file system might kick in with this process. The process for this was:
- Go to the instance page in the GCP Console.
- Turn off the instance.
- Hit Edit and go to the Boot disk section. Hit Detach Boot Disk.
- Once detached, hit Configure Boot Disk and create and attach one from the snapshot taken.
- Hit Save to propagate the changes.
This did not work either. It seemed like the extension of the partition or the removal of the bloated files would have to be done manually. So this is what I did to solve the problem.
What DID Work
The solution here is to:
1. Take a snapshot of my-instance’s boot disk.
2. Create a disk out of the snapshot.
3. Create a new instance.
4. Attach the disk (from step 2) to the new instance (from step 3).
5. From the new instance, either resize the partition and filesystem of the attached disk or delete the files that were causing disk bloat. Either works.
6. Detach the disk from the new instance.
7. Swap the boot disk of my-instance with the new disk.
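For reference, the first few steps can also be done with gcloud instead of the Console. This is only a sketch: the snapshot, disk, and rescue-instance names below are made up for illustration, and the zone is the one from earlier in this post.

```shell
# 1. Snapshot the stuck instance's boot disk (all names are illustrative)
gcloud compute disks snapshot my-instance \
    --snapshot-names=my-instance-snap --zone=us-central1-a

# 2. Create a disk from the snapshot
gcloud compute disks create my-instance-rescue-disk \
    --source-snapshot=my-instance-snap --size=50GB --zone=us-central1-a

# 3. Create a temporary rescue instance
gcloud compute instances create rescue-vm --zone=us-central1-a

# 4. Attach the new disk to the rescue instance as a secondary disk
gcloud compute instances attach-disk rescue-vm \
    --disk=my-instance-rescue-disk --zone=us-central1-a
```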
Taking the snapshot and creating a new instance are straightforward. While creating the instance, an additional disk can be created off the snapshot taken in the first step and attached.
Once the instance is created, we SSH into it and either remove the bloated files from the new disk or extend its partition. Let’s see how to do the former first.
# Without the additional disk attached (only boot disk)
$ sudo lsblk
NAME    MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda       8:0    0   50G  0 disk
|-sda1    8:1    0 49.9G  0 part /
|-sda14   8:14   0    3M  0 part
`-sda15   8:15   0  124M  0 part /boot/efi

# With the additional disk attached (boot disk + additional disk)
$ sudo lsblk
NAME    MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda       8:0    0   50G  0 disk
|-sda1    8:1    0 49.9G  0 part /
|-sda14   8:14   0    3M  0 part
`-sda15   8:15   0  124M  0 part /boot/efi
sdb       8:16   0   50G  0 disk
|-sdb1    8:17   0  9.9G  0 part
|-sdb14   8:30   0    4M  0 part
`-sdb15   8:31   0  106M  0 part
With lsblk we can see the different disks attached and their partitions. Now all I had to do was mount the /dev/sdb1 partition somewhere in the filesystem and then delete the culprit files:
$ sudo mkdir -p /mnt/disks/my-instance
$ sudo mount /dev/sdb1 /mnt/disks/my-instance/
$ sudo lsblk
NAME    MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda       8:0    0   50G  0 disk
|-sda1    8:1    0 49.9G  0 part /
|-sda14   8:14   0    3M  0 part
`-sda15   8:15   0  124M  0 part /boot/efi
sdb       8:16   0   50G  0 disk
|-sdb1    8:17   0  9.9G  0 part /mnt/disks/my-instance
|-sdb14   8:30   0    4M  0 part
`-sdb15   8:31   0  106M  0 part
$ rm /mnt/disks/my-instance/home/ubuntu/big-file.log
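If you are not sure which files filled the disk, the mounted partition can be ranked with du before deleting anything. Here is a self-contained demo of that pattern; all the paths and file names below are invented for illustration:

```shell
# Build a fake directory tree with one oversized file (demo paths only)
mkdir -p /tmp/bloat-demo/logs
head -c 5M /dev/zero > /tmp/bloat-demo/logs/big-file.log
head -c 1K /dev/zero > /tmp/bloat-demo/note.txt

# Rank entries by size, biggest first -- on the rescue instance you would
# point this (with sudo) at /mnt/disks/my-instance instead
du -ah /tmp/bloat-demo | sort -rh | head -3
```

The biggest offenders bubble to the top, so the rm targets become obvious.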
Then I just detached the disk (on the instance’s edit page, you can simply go to the additional disks section and remove it) and swapped it with the boot disk of my-instance. To do the swapping, you have to perform the same steps as in the third option of the What Did Not Work section above; basically, you do this on the instance’s edit page under the Boot disk section.
Once I restarted my instance (swapping boot disks requires the instance to be stopped anyway), I could see the following lines in my serial port logs:
Jul  3 09:01:33 my-instance kernel: [    8.625686] EXT4-fs (sda1): resizing filesystem from 2593019 to 13078779 blocks
Jul  3 09:01:33 my-instance kernel: [    8.750863] EXT4-fs (sda1): resized filesystem to 13078779
And I could finally SSH into my instance!
$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/root        49G  8.9G   40G  19% /
...
Let’s also look at the other option: how the partition and file system on the new disk could have been resized (grown) instead of deleting existing files/folders. As a recap, after the disk was attached, this is what lsblk showed us:
sdb       8:16   0   50G  0 disk
|-sdb1    8:17   0  9.9G  0 part
|-sdb14   8:30   0    4M  0 part
`-sdb15   8:31   0  106M  0 part
As we can see, the disk size is 50G but sdb1, which is the root partition of the disk, is only 9.9G. So using the parted command, we can first grow the partition:
$ sudo parted /dev/sdb
GNU Parted 3.4
Using /dev/sdb
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) resizepart
Warning: Not all of the space available to /dev/sdb appears to be used, you can
fix the GPT to use all of the space (an extra 83886080 blocks) or continue with
the current setting?
Fix/Ignore? Fix
Partition number? 1
End?  [10.7GB]? 100%
(parted) quit
Information: You may need to update /etc/fstab.
Let’s take a look at
sdb 8:16 0 50G 0 disk ├─sdb1 8:17 0 49.9G 0 part ├─sdb14 8:30 0 4M 0 part └─sdb15 8:31 0 106M 0 part
As we can see, sdb1 has grown from 9.9G to 49.9G. But this is not enough; we also have to resize the file system. If we quickly mount the partition and look at the file system size using df, we will notice the difference:
$ sudo mount /dev/sdb1 /mnt/disks/my-instance/
$ df -h
Filesystem      Size  Used Avail Use% Mounted on
...
/dev/sdb1       9.6G  9.5G     0 100% /mnt/disks/my-instance
As we can see, the file system size is much less than 49.9G, i.e., it is the same as the old partition size. So let’s resize it too, using resize2fs:
$ sudo resize2fs /dev/sdb1
Now let’s check the file system size:
$ df -h
Filesystem      Size  Used Avail Use% Mounted on
...
/dev/sdb1        49G  9.5G   39G  20% /mnt/disks/my-instance
...
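If you want to rehearse the grow-the-filesystem part without touching a real disk, resize2fs works on a file-backed image too. This sketch skips the partition table and grows a bare ext4 filesystem inside a regular file; every file name here is made up for the demo:

```shell
# Create a sparse 100 MiB "disk" and format it as ext4 (4K blocks for
# predictable geometry; -F is needed because this is a regular file)
truncate -s 100M /tmp/demo.img
mkfs.ext4 -q -F -b 4096 /tmp/demo.img

# "Resize the disk" to 200 MiB, check it, then grow the filesystem
truncate -s 200M /tmp/demo.img
e2fsck -fp /tmp/demo.img >/dev/null
resize2fs /tmp/demo.img

# The block count should now reflect 200 MiB / 4 KiB = 51200 blocks
dumpe2fs -h /tmp/demo.img 2>/dev/null | grep "^Block count"
```

The e2fsck step matters: resize2fs refuses to grow an unmounted filesystem that has not been checked recently.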
Again, once the resizing is done, detach the disk and swap it with the boot disk of my-instance (same as before).
Hope that helps!