EC2 Boot Time Benchmarking
lajosbacs 2021-08-17 08:22:54 +0000 UTC [ - ]
"launch GPU spot instance" to "I can actually use the GPU, at least by running nvidia-smi".
I find that this can take up to 10 minutes and for the more expensive instances, this can mean non-negligible amount of money.
Great article BTW
lucb1e 2021-08-17 09:42:33 +0000 UTC [ - ]
ardenpm 2021-08-17 08:59:15 +0000 UTC [ - ]
cinquemb 2021-08-17 09:14:16 +0000 UTC [ - ]
Perhaps I should have been using "Clear Linux 34640"
pojzon 2021-08-17 09:38:33 +0000 UTC [ - ]
TBH, a bigger issue for AWS than boot time is the lack of physical resources to meet demand; this is what most big players are struggling with.
lajosbacs 2021-08-17 10:24:03 +0000 UTC [ - ]
hardwaresofton 2021-08-17 10:54:32 +0000 UTC [ - ]
It's not a big edge, but knowing about downtime before others certainly counts as one.
sixhobbits 2021-08-17 11:17:12 +0000 UTC [ - ]
Not sure if the data is freely available, but I used to be part of the team at AWS that calculated these. See [0].
[0] https://www.gartner.com/en/cloud-decisions/benchmark-library
hardwaresofton 2021-08-17 11:31:59 +0000 UTC [ - ]
It doesn't seem like they'll give you up-to-the-hour information, and you do have to pay for "Gartner for Professionals", which I'm not familiar with, but it's definitely out there.
The key to this has to be supporting it cross-cloud, IMO. People have the AWS vs. GCP vs. Azure conversation so often, and I think most people know that GCP gives the best performance for the buck, but the reasons to pick the other two are numerous. It might be nice to have some numbers on just how much performance differs between them, since that's one of the more quantifiable things to weigh.
JoshTriplett 2021-08-17 16:31:27 +0000 UTC [ - ]
JoshTriplett 2021-08-17 16:59:26 +0000 UTC [ - ]
I've tried changing the various parameters under my control: assigning a specific network address rather than making AWS pick one from the VPC, disabling metadata, using a launch template, not using a launch template, using the smallest possible AMI (just a kernel and static binary), using Fast Snapshot Restore on the snapshot in the AMI, etc. None of those makes the slightest difference in instance start time; still ~8 seconds.
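The phases described above (RunInstances returning, then pending-to-running) can be timed with a short script. A minimal sketch, not the article's benchmarking tool, using boto3; the AMI ID and instance type are placeholders:

```python
# Sketch: time the RunInstances call and the pending->running transition.
import time

def phase_timer():
    """Return a closure that yields seconds elapsed since its creation."""
    start = time.monotonic()
    return lambda: time.monotonic() - start

def poll(check, interval=0.5, timeout=120):
    """Poll check() until it returns truthy; return elapsed seconds."""
    elapsed = phase_timer()
    while not check():
        if elapsed() > timeout:
            raise TimeoutError("gave up waiting")
        time.sleep(interval)
    return elapsed()

if __name__ == "__main__":
    import boto3  # requires AWS credentials configured
    ec2 = boto3.client("ec2")
    t = phase_timer()
    resp = ec2.run_instances(ImageId="ami-XXXXXXXX",  # placeholder
                             InstanceType="t3.micro",
                             MinCount=1, MaxCount=1)
    iid = resp["Instances"][0]["InstanceId"]
    print(f"RunInstances returned after {t():.2f}s")

    def is_running():
        d = ec2.describe_instances(InstanceIds=[iid])
        state = d["Reservations"][0]["Instances"][0]["State"]["Name"]
        return state == "running"

    print(f"pending -> running took {poll(is_running):.2f}s")
```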
zmmmmm 2021-08-17 08:25:23 +0000 UTC [ - ]
lucb1e 2021-08-17 09:40:26 +0000 UTC [ - ]
nnx 2021-08-17 11:05:09 +0000 UTC [ - ]
aeyes 2021-08-17 13:30:58 +0000 UTC [ - ]
nonameiguess 2021-08-17 18:36:45 +0000 UTC [ - ]
What else they did would probably require booting into Clear Linux myself and checking, but the "general" answer for getting a faster boot is: disable services you don't need so systemd isn't wasting time, build all the kernel modules you know you'll need directly into the kernel instead of loading them as modules, and skip the initramfs if you don't need it.
If you really want ultra-fast boot, you can turn Linux into a unikernel: compile the kernel with your application as init, run no other services at all, and boot via the EFI stub instead of a bootloader. I don't know whether EC2 provides any way of doing that, though. On your own machine, you can register the kernel with efibootmgr from a different booted system on the same firmware, set the boot order in the firmware setup, or give the EFI stub kernel the magic fallback name that firmware loads automatically when no boot order is configured in NVRAM.
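An illustrative kernel config fragment along these lines (option names are from mainline Kconfig on x86; `init=/app` is a hypothetical application path, and none of this is taken from Clear Linux). For reference, the fallback path that firmware loads when no boot entry is configured is \EFI\BOOT\BOOTX64.EFI on x86-64:

```
# Boot the kernel directly as an EFI application (EFI stub)
CONFIG_EFI_STUB=y
# Bake the command line in, with your application as PID 1 (hypothetical path)
CONFIG_CMDLINE_BOOL=y
CONFIG_CMDLINE="init=/app"
# CONFIG_BLK_DEV_INITRD is not set    (no initramfs)
# CONFIG_MODULES is not set           (all needed drivers built in)
```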
masklinn 2021-08-17 11:24:29 +0000 UTC [ - ]
watermelon0 2021-08-17 12:27:35 +0000 UTC [ - ]
nijave 2021-08-17 11:22:40 +0000 UTC [ - ]
masklinn 2021-08-17 11:26:00 +0000 UTC [ - ]
paulcarroty 2021-08-17 14:16:49 +0000 UTC [ - ]
nijave 2021-08-17 11:28:54 +0000 UTC [ - ]
spenczar5 2021-08-17 14:00:57 +0000 UTC [ - ]
Median is an odd metric to use. I think I might truly be more interested in the mean; maybe the geometric mean.
I am also very interested in worst-cases with this sort of thing, so the 80th and 95th and 100th percentiles would be helpful.
It would be interesting to see plots of distributions, as well: are boot times really unimodal? They seem like they could easily have multiple modes, which could really mess with all of these measures (but especially the median!).
Because the script is open source (thank you!), I may try this myself.
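The worry about multiple modes can be made concrete with a toy example: on an invented bimodal sample (a common fast mode plus a rarer slow mode), the median looks reassuring while the mean and tail percentiles tell a different story. All values here are made up for illustration:

```python
# Hypothetical bimodal boot-time sample, in seconds.
import statistics

fast = [6.0, 6.2, 6.1, 6.3, 5.9, 6.0, 6.2]  # common fast mode
slow = [19.5, 20.1, 21.0]                    # rarer slow mode
times = sorted(fast + slow)

def percentile(data, p):
    """Nearest-rank percentile of a sorted list."""
    k = max(0, min(len(data) - 1, round(p / 100 * len(data)) - 1))
    return data[k]

print("median:", statistics.median(times))          # -> 6.2
print("mean:  ", round(statistics.mean(times), 2))  # -> 10.33
print("p95:   ", percentile(times, 95))             # -> 21.0
```

The median sits squarely in the fast mode and never "sees" the slow one, which is exactly the concern raised above.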
cperciva 2021-08-17 15:19:05 +0000 UTC [ - ]
That said, I did my final measurements from inside EC2 (rather than from my laptop over wifi) so it probably wasn't necessary.
spenczar5 2021-08-17 15:23:32 +0000 UTC [ - ]
Anyway - probably not a big deal, probably mostly academic, although modal behavior would be really interesting.
cperciva 2021-08-17 15:42:56 +0000 UTC [ - ]
kobalsky 2021-08-17 21:24:31 +0000 UTC [ - ]
C is an interesting choice of language for this task. They even implemented a rustic XML parser in there.
josnyder 2021-08-17 16:35:13 +0000 UTC [ - ]
https://gist.github.com/hashbrowncipher/17a92c6afb9642503876...
nijave 2021-08-17 11:26:41 +0000 UTC [ - ]
Also, on boot, I /think/ EC2 pulls the AMI down from S3 to EBS, so theoretically smaller AMIs might be faster, but I'm not sure.
userbinator 2021-08-17 13:44:37 +0000 UTC [ - ]
nijave 2021-08-19 12:08:12 +0000 UTC [ - ]
JoshTriplett 2021-08-17 16:28:00 +0000 UTC [ - ]
And I've tested AMIs that are no bigger than a Linux kernel and a single static binary; times are still on par with what's listed here. Still takes 7-8 seconds before the first user-controlled instruction gets to run.
nijave 2021-08-19 12:10:17 +0000 UTC [ - ]
That's a bit disappointing but I guess AWS has spent a lot of time optimizing Lambda vs EC2
miyuru 2021-08-17 13:34:39 +0000 UTC [ - ]
I managed to create a 1GB AMI of Debian. It will be interesting to test this.
glotzerhotze 2021-08-17 08:45:20 +0000 UTC [ - ]
pojzon 2021-08-17 09:41:07 +0000 UTC [ - ]
Or are you talking about cloud-init which is run by AWS themselves before running userdata scripts ?
staticassertion 2021-08-17 08:16:54 +0000 UTC [ - ]
That's actually pretty solid. If it were 5-10x faster than that you could probably fit that into a lot of interesting use-cases for on-demand workloads. The bottleneck is the EC2 allocation itself, and I'd be interested in seeing what warm EC2 instances can do for you there.
That said, I think for the majority of use cases boot performance is not particularly important. If you want really fast 'boot' you might just want VMs and containers - that would cut out the 3-8 seconds of instance allocation time, as well as most of the rest of the work.
Curious to see a follow up on what's going on with FreeBSD - seems like it takes ages to get the network up.
nijave 2021-08-17 11:16:33 +0000 UTC [ - ]
That's basically lambda although you lose control of the kernel and some of the userspace (although you can use Docker containers and the HTTP interface on Lambda to get some flexibility back)
Under the hood, Lambda uses optimized Firecracker VMs for < 1 sec boot
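Firecracker can also be driven directly with a JSON config file. A minimal sketch with placeholder paths (the key names follow Firecracker's documented config schema, but treat the exact fields as an assumption to verify against the version you run):

```json
{
  "boot-source": {
    "kernel_image_path": "vmlinux",
    "boot_args": "console=ttyS0 reboot=k panic=1 pci=off"
  },
  "drives": [
    {
      "drive_id": "rootfs",
      "path_on_host": "rootfs.ext4",
      "is_root_device": true,
      "is_read_only": false
    }
  ],
  "machine-config": {
    "vcpu_count": 1,
    "mem_size_mib": 128
  }
}
```

Something like `firecracker --no-api --config-file vm.json` should then boot the microVM without touching the REST API.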
>I think for the majority of use cases boot performance is not particularly important
Anything with autoscaling. I think CI is probably a big use case (those environments get spun up and torn down pretty quickly), and otherwise you end up with hairy things like unprivileged Docker-in-Docker builds trying to run nested in a container.
staticassertion 2021-08-18 01:13:43 +0000 UTC [ - ]
And yeah, Firecracker is pretty sick, but it's also something you can just use yourself on ec2 metal instances, and then you get full control over the kernel and networking too, which is neat.
JoshTriplett 2021-08-17 16:32:42 +0000 UTC [ - ]
nuclearnice1 2021-08-17 17:09:51 +0000 UTC [ - ]
JoshTriplett 2021-08-17 18:48:22 +0000 UTC [ - ]
staticassertion 2021-08-18 01:25:52 +0000 UTC [ - ]
nijave 2021-08-19 16:24:12 +0000 UTC [ - ]
The kernel surface is part of their security model. There's some details here https://www.bschaatsbergen.com/behind-the-scenes-lambda
E.g. exposing the kernel would undo some intentional isolation
JoshTriplett 2021-08-18 03:06:05 +0000 UTC [ - ]
I'd love to see Lambda support custom kernels. That would require a similar approach: document the actual VM interface provided to the kernel, including the virtual hardware and the mechanism used to expose the Lambda API. I'd guess that they haven't yet due to a combination of 1) not enough demand for Lambda with custom kernels and 2) the freedom to modify the kernel/hardware interface arbitrarily because they control both sides of it.
staticassertion 2021-08-18 16:31:54 +0000 UTC [ - ]
cperciva 2021-08-17 15:22:25 +0000 UTC [ - ]
Our BIOS boot loader is very slow. I'll be writing about FreeBSD boot performance in a later post.
alberth 2021-08-17 18:30:05 +0000 UTC [ - ]
Do you know if there are any plans for FreeBSD to create a super-minimal server version that can be used as a Docker host, similar in size to Alpine Linux?
cperciva 2021-08-17 18:40:23 +0000 UTC [ - ]
sjnu 2021-08-17 10:19:50 +0000 UTC [ - ]
masklinn 2021-08-17 11:17:53 +0000 UTC [ - ]
> The first two values — the time taken for a RunInstances API call to successfully return, and the time taken after RunInstances returns before a DescribeInstances call says that the instance is "running" — are consistent across all the AMIs I tested, at roughly 1.5 and 6.9 seconds respectively
“Running to available” is what’s in the table, ranging from 1.23s to 70s or so.
rawoke083600 2021-08-17 16:04:26 +0000 UTC [ - ]
citrin_ru 2021-08-17 13:34:53 +0000 UTC [ - ]
FreeBSD rc executes all rc.d scripts sequentially, in one thread. OpenRC AFAIK can start daemons in parallel, but unfortunately the switch to OpenRC was abandoned: https://reviews.freebsd.org/D18578
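A toy model of why parallel startup helps: three simulated "daemons" that each take a fixed time to come up, started sequentially versus concurrently (pure illustration, nothing to do with rc or OpenRC internals):

```python
# Sequential vs. concurrent startup of simulated services.
import time
from concurrent.futures import ThreadPoolExecutor

def start_daemon(name, secs):
    """Pretend to start a service that takes `secs` to become ready."""
    time.sleep(secs)
    return name

services = [("netif", 0.1), ("sshd", 0.1), ("ntpd", 0.1)]

t0 = time.monotonic()
for name, secs in services:          # one at a time, like rc.d scripts
    start_daemon(name, secs)
sequential = time.monotonic() - t0

t0 = time.monotonic()
with ThreadPoolExecutor() as pool:   # all at once
    list(pool.map(lambda s: start_daemon(*s), services))
parallel = time.monotonic() - t0

print(f"sequential {sequential:.2f}s, parallel {parallel:.2f}s")
```

With independent services, the sequential wall-clock time is the sum of the startup times, while the parallel time approaches the maximum of them.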