Building cBiox in the cloud, on a budget

2023-06-16

As lead engineer for cBiox, and with cost savings in mind, I’ll be upfront: I believe working with one of the larger public cloud providers makes life easier and delivers results faster, with less engineering expertise. AWS has services that cover everything we need to process bioinformatics data in the cloud, and the integration between these services is seamless and efficient. But we’re not going for easy here, right? We’re going for cheap. And cheap means cutting some corners in the name of saving our valuable dollars.

So here’s my bioinformatics-in-the-cloud-on-a-budget cookbook. Let's cook!

cBiox in the Cloud, on a budget

We are building a biotech SaaS product that needs to perform bioinformatics and computational biology, hopefully at a large scale. We have a limited budget and do not want to spend time and resources building and maintaining our own hardware. We are interested in alternative cloud providers that offer competitive prices while still leaving us free to use the computing power of AWS, GCP, and Azure where it makes sense.
...
While the larger public cloud providers such as AWS offer a wide range of features and the ability to scale infinitely, they can also come with hidden costs and billing practices that may not be transparent.
Using alternative cloud providers may require more effort and expertise, but it can also result in cost savings. It is important to carefully consider our options and weigh the benefits and drawbacks of each provider in order to find the best solution for our needs and budget.

The minimum viable bioinformatics cloud

With that being said, it’s time to design our cloud! The minimum capabilities of a system supporting a bioinformatics team include:

  1. Interactive compute for experimentation, prototyping workflows, programming in Jupyter and RStudio and generating figures.

  2. Cloud storage that’s accessible to all team members and other services. Ideally this system supports cheap cold storage for infrequently accessed and backup data.

  3. Container registries. Batch workflows need access to a high-bandwidth container registry for custom private and public containers.

  4. Scalable batch compute that can be managed by a workflow manager, so we can easily scale compute 10-1000x with a single command-line argument or config change.

  5. GPUs, databases, and other add-ons, depending on how much work the team is doing.

So, why should we cut corners?

The why -

Some of the features offered by AWS matter less to a bioinformatics team. Therefore, for certain tasks, optimizing latency, uptime, and performance may not be worth the additional cost.

In particular, we do not need:

  • optimization of latency, uptime and performance

In research, a small difference in completion time or a short period of downtime may not significantly impact us. The day isn’t ruined if a workflow completes in 24 hours instead of 22; it’s still an overnight task. Similarly, an hour of cluster downtime for maintenance isn’t the end of the world. There are always papers to read or news to catch up on. Beyond some limit, improving these metrics isn’t worth the additional cost.

  • or multi-region and multi-availability-zone deployments

Additionally, we’re not building Netflix, or even a publicly available service with many concurrent users. All the compute can be in one region.

  • or infinite hot storage

In some cases, having unlimited hot storage or infinitely scalable compute may not lead to increased efficiency, and may actually result in increased overhead and diminishing returns.

Therefore, it is important to carefully consider our specific requirements and the cost savings we need, and to make informed decisions about which resources and features are necessary.

The How -

1. Interactive compute

Interactive compute can be handled in two ways: by providing a central compute server for all team members to use, or by allowing team members to provision their own compute servers.

On AWS:

This is done using EC2 instances, which can be always running or provisioned on demand. However, these instances can be expensive, with a $5/hour fee for dedicated instances.

How it can be done cheaply:

An alternative is to use the services of Hetzner, a German company that offers dedicated servers at a lower cost than AWS. These servers may not be as powerful as AWS EC2 instances and do not offer as many hardware configuration options, but they do offer a large amount of RAM and flash storage, as well as 20TB of data egress traffic.

Where we cut corners:

Hetzner servers are billed per month, while AWS EC2 instances are billed per second, which gives AWS more flexibility for short-lived workloads. Hetzner servers also have more scheduled maintenance downtime and fewer integrated services than AWS.
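
To make the billing trade-off concrete, here is a minimal sketch of a break-even calculation. The hourly price reuses the $5/hour figure quoted above; the monthly price is a hypothetical placeholder, not a quoted Hetzner price, so plug in real numbers before drawing conclusions.

```python
def break_even_hours(monthly_price: float, hourly_price: float) -> float:
    """Hours of use per month above which the flat monthly server is cheaper."""
    if hourly_price <= 0:
        raise ValueError("hourly_price must be positive")
    return monthly_price / hourly_price


def cheaper_option(hours_used: float, monthly_price: float, hourly_price: float) -> str:
    """Return which billing model costs less for the given monthly usage."""
    on_demand_cost = hours_used * hourly_price
    return "monthly" if monthly_price < on_demand_cost else "on-demand"


# Hypothetical monthly price for a dedicated server (assumption for illustration).
MONTHLY_PRICE = 200.0
# Hourly on-demand price, taken from the $5/hour figure in the text.
HOURLY_PRICE = 5.0

if __name__ == "__main__":
    print(break_even_hours(MONTHLY_PRICE, HOURLY_PRICE))
    print(cheaper_option(160, MONTHLY_PRICE, HOURLY_PRICE))
```

A server that the team keeps busy most of the working week lands well above the break-even point, which is why monthly billing hurts less than it first appears.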

2. Cloud storage

On AWS:

Cloud storage can be provided using AWS S3 buckets or Elastic File System (EFS), which is AWS's implementation of the Network File System (NFS). AWS also offers storage tiers and the intelligent tiering service, which allow for cheap archival storage.

How it can be done cheaply:

Alternative providers such as Hetzner, Backblaze B2, and Cloudflare R2 offer cloud storage at a lower cost than S3. Backblaze B2 and Cloudflare R2 can be accessed using the familiar S3 API and may also offer reduced or no data transfer fees.
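
As a rough sketch of what "using the familiar S3 API" means in practice, the snippet below builds the arguments for a boto3 S3 client pointed at an alternative provider. The endpoint URL patterns reflect my understanding of each provider's documentation and should be verified there; the helper names, the example region, and the account ID are all placeholders I made up for illustration.

```python
def s3_client_kwargs(provider: str, *, region: str = "", account_id: str = "") -> dict:
    """Build keyword arguments for boto3.client("s3", ...) for an S3-compatible provider."""
    if provider == "backblaze_b2":
        if not region:
            raise ValueError("Backblaze B2 needs a region, e.g. 'us-west-004'")
        return {
            "endpoint_url": f"https://s3.{region}.backblazeb2.com",
            "region_name": region,
        }
    if provider == "cloudflare_r2":
        if not account_id:
            raise ValueError("Cloudflare R2 needs an account ID")
        return {
            "endpoint_url": f"https://{account_id}.r2.cloudflarestorage.com",
            "region_name": "auto",
        }
    raise ValueError(f"unknown provider: {provider}")


def make_client(provider: str, access_key: str, secret_key: str, **kwargs):
    """Create an S3 client for the alternative provider (requires boto3)."""
    import boto3  # imported lazily so s3_client_kwargs works without boto3 installed

    return boto3.client(
        "s3",
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key,
        **s3_client_kwargs(provider, **kwargs),
    )
```

Because only the endpoint and credentials change, existing tools that already talk to S3 should mostly keep working, though not every S3 feature is guaranteed to be supported by every provider.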

Where we cut corners:

Hetzner offers Storage Boxes, which are available in predefined sizes and can offer low storage costs when fully utilized, but do not support APIs such as S3 and may not be as fast as using storage and compute within the AWS ecosystem.
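
The "when fully utilized" caveat matters because a fixed-size plan costs the same whether it is full or nearly empty. A minimal sketch of the effective price per stored terabyte, using hypothetical plan sizes and prices rather than actual Storage Box pricing:

```python
def effective_cost_per_tb(plan_price: float, plan_size_tb: float, used_tb: float) -> float:
    """Monthly cost per terabyte actually stored on a fixed-size plan."""
    if used_tb <= 0:
        raise ValueError("used_tb must be positive")
    if used_tb > plan_size_tb:
        raise ValueError("used_tb exceeds the plan size; a larger plan is needed")
    return plan_price / used_tb


if __name__ == "__main__":
    # Hypothetical 10 TB plan at 20 currency units per month (assumption).
    print(effective_cost_per_tb(20.0, 10.0, 10.0))  # plan fully used
    print(effective_cost_per_tb(20.0, 10.0, 2.5))   # plan a quarter used
```

If the team expects to fill only a fraction of a plan for a long time, a usage-billed provider may turn out cheaper despite a higher list price per terabyte.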

So, for true backups and archival storage, AWS Glacier is a good option at $1/TB/month.

3. Container Registries

How it’s done on AWS:

AWS offers Elastic Container Registry (ECR) for storing and managing container images, with options for both public and private repositories. We will be charged for storage costs and data egress when pulling containers from a different AWS region.

How it can be done cheaply:

An alternative to ECR is DockerHub, which offers paid plans that include image builds and a certain number of daily container pulls.

Where we cut corners:

Using DockerHub or a similar service may require additional management and may not offer the same level of integration or speed as using ECR within the AWS ecosystem. Alternatively, we could host our own registry using a service like Harbor.

4. Batch workflows

How it’s done on AWS:

AWS offers two options for deploying workflows: Batch and Elastic Kubernetes Service (EKS). Workflows can run on autoscaling EC2 or Fargate instances and can use S3 or EFS for data storage. The interoperability of AWS services makes it easy to create and manage complex workflows, but it is also very costly.

How it can be done cheaply:

To save money on AWS, we can use spot instances as much as possible and design our workflows to be resilient to spot instance reclaims by creating small, composable steps, parallelizing as much as possible, and using larger instances for shorter periods of time.
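
Here is a minimal sketch of what "resilient to spot instance reclaims" can look like: each small step leaves a completion marker, so a rerun after an interruption skips finished work and retries only the step that was cut short. The `SpotReclaimed` exception and the marker-file scheme are illustrative assumptions; a real workflow manager provides its own resume and retry mechanisms.

```python
import tempfile
from pathlib import Path
from typing import Callable


class SpotReclaimed(Exception):
    """Hypothetical stand-in for a spot instance being reclaimed mid-step."""


def run_pipeline(
    steps: list[tuple[str, Callable[[], None]]],
    state_dir: Path,
    max_attempts: int = 3,
) -> list[str]:
    """Run steps in order, skipping any that already left a completion marker.

    Returns the names of the steps executed during this call.
    """
    state_dir.mkdir(parents=True, exist_ok=True)
    executed = []
    for name, step in steps:
        marker = state_dir / f"{name}.done"
        if marker.exists():
            continue  # finished before an earlier interruption
        for attempt in range(1, max_attempts + 1):
            try:
                step()
            except SpotReclaimed:
                if attempt == max_attempts:
                    raise
                continue  # on a real cluster the retry would land on a new instance
            marker.touch()
            executed.append(name)
            break
    return executed


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(run_pipeline([("qc", lambda: None), ("align", lambda: None)], Path(tmp)))
```

The smaller each step is, the less work is lost when an instance disappears, which is the practical reason for keeping steps small and composable.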

Where we cut corners:

We set up a Kubernetes cluster on Hetzner Cloud and manage the infrastructure ourselves.

To set up this type of cluster, we use a lightweight distribution such as k3s and set up autoscaling with Hetzner. We then modify our workflow requirements to fit the resources available on Hetzner Cloud instances. This may require more time and expertise than using AWS managed services, but it lets us take advantage of the cheapest autoscaling instances available.
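
Fitting workflow requirements to the available instances can be checked up front instead of being discovered when a job fails to schedule. The sketch below picks the smallest instance type that satisfies a task's request. The catalogue is hypothetical and does not list actual Hetzner Cloud offerings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceType:
    name: str
    cpus: int
    memory_gb: int


# Hypothetical instance catalogue (assumption for illustration only).
CATALOGUE = [
    InstanceType("small", 4, 8),
    InstanceType("medium", 8, 16),
    InstanceType("large", 16, 32),
]


def pick_instance(cpus: int, memory_gb: int, catalogue=CATALOGUE) -> InstanceType:
    """Return the smallest instance type that satisfies the request."""
    fitting = [i for i in catalogue if i.cpus >= cpus and i.memory_gb >= memory_gb]
    if not fitting:
        largest = max(catalogue, key=lambda i: (i.memory_gb, i.cpus))
        raise ValueError(
            f"request exceeds the largest instance ({largest.name}); "
            "split the task or lower its requirements"
        )
    return min(fitting, key=lambda i: (i.cpus, i.memory_gb))


if __name__ == "__main__":
    print(pick_instance(6, 12).name)
```

A task that raises here is a candidate for being split into smaller steps before the workflow ever reaches the cluster.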

5. GPUs and accelerated computing

How it’s done on AWS:

On AWS, we can get an EC2 instance with a GPU and use it within a workflow.

Where we cut corners:

Hetzner doesn’t offer cheap GPUs yet, but other cloud providers do, like Genesis Cloud, Vast, and RunPod. The obvious downside is that our workloads end up split across yet another cloud provider, but that is not a big deal.

Generally

We can use spot instances whenever possible, which can offer discounts of 50% or more (AWS advertises up to 90% off on-demand prices).

On AWS, we can also set our maximum spot price to the on-demand price to minimize interruptions.
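
Spot discounts are only worth it if the time lost to interruptions does not eat the savings. A minimal sketch of that arithmetic, where the discount and the fraction of work that has to be redone are hypothetical inputs we would need to measure for our own workloads:

```python
def spot_cost(
    hours: float,
    on_demand_hourly: float,
    discount: float,
    rework_fraction: float = 0.0,
) -> float:
    """Estimated cost of a job on spot instances.

    discount is the fraction off the on-demand price (0.5 means 50% off).
    rework_fraction is the extra compute redone after interruptions
    (0.25 means 25% more billed hours).
    """
    if not 0 <= discount < 1:
        raise ValueError("discount must be in [0, 1)")
    if rework_fraction < 0:
        raise ValueError("rework_fraction must be non-negative")
    billed_hours = hours * (1 + rework_fraction)
    return billed_hours * on_demand_hourly * (1 - discount)


def spot_is_cheaper(
    hours: float,
    on_demand_hourly: float,
    discount: float,
    rework_fraction: float = 0.0,
) -> bool:
    """True if the spot estimate beats running the same job on demand."""
    return spot_cost(hours, on_demand_hourly, discount, rework_fraction) < hours * on_demand_hourly


if __name__ == "__main__":
    print(spot_cost(100, 2.0, 0.5, 0.25))
    print(spot_is_cheaper(100, 2.0, 0.5, 0.25))
```

With a 50% discount, spot stays cheaper until interruptions force us to redo as much work as the job itself, which is why small, restartable steps pay off.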

It is also worth considering credit and grant opportunities offered by the major cloud providers, such as the $100k in credits offered by AWS to startups.

To further reduce costs, we should ensure that we turn off resources when they are not needed, use cost exploration tools to track our expenses, and test our workflows at a small scale before deploying them on a larger cluster.

Additionally, we can use free or low-cost accelerated compute options like Google Colab or Paperspace.

Conclusion

Cloud computing has made significant advances in recent years, but there is still room for improvement, especially for biotech and academic labs that do not have access to a university cluster or are seeking to scale beyond their current capabilities.

While cloud computing can be an attractive option, it can also be expensive and there have been reports of researchers incurring significant costs using AWS.

In this post, I have discussed alternative services that biotech and academic labs, including us, can use for storage and compute, with the goal of helping reduce cloud costs by 75% or more.

While AWS offers excellent integration between its services, I hope to see more competition and innovation in this space in the future.

If you have experience with the services mentioned in this post, or have other ideas to share, please leave a message in my guestbook 💕