With services like Herkou, AWS, Azure and Google Cloud, creating and managing your own cloud architecture may be a dying art. Some might accuse me of showing my age, like professors of old who swore that their students must learn C and how to alloc & free.

Well I don’t think that those professors were entirely wrong. Of course you want to take advantage of all the abstractions available to you in the modern era but you’ll inevitbly have to face low level problems at some point in your career. Joel Spolsky has some great writing on this topic. While I think Joel goes a bit overboard (we’re surrounded by abstraction that rarely, if ever, leak) all of this reminds me of his article on leaky abstractions. Tldr: at some point something lower in the stack breaks (or bubbles up), requiring you to understand how it works so you can fix (or accomodate) it.

The most spectacular example I’ve seen of a leaky abstraction in production was a bug we had generating hashes of passwords. Developers in the PHP stack would call hash_password (when registering a user) or check_pasword($user_input, $stored_hash) when checking credentials. New accounts would get created, but the users could never log in. At the same time, the number of failed login attempts for previously registered users increased. Ultimately, the cause of the bug ended up being new servers we had installed which contained Intel chips that were generating incorrect SHA values. The abstraction leaked from the metal all the way up to application developers.

All that being said, you don’t know which part of the stack is going to end up breaking in your future but some part, deep down, will break. Given that, a good developer should have a passing familiarity with all levels of the stack.

Getting Started

In this series I’ll be setting up a full blown cloud, the likes of which you’d see deployed at Facebook, Google, Amazon or Microsoft. The cloud will be scaled down to tens of servers rather than tens of thousands of servers. The hardware we’re using will also be exclusively Raspberry Pis.

The princple we’ll be using while developing this cloud is that nothing is done until it is fully monitored. We’ll call this “monitoring driven development.”

Components

Before we can get started we need to understand what components we’ll need. When taken together, the set of components should be capable of supporting a diverse and rich array of applications that would be hosted in our cloud.

  1. Load balancer - so we can spread the incoming requests across multiple servers
  2. Application servers - handle requests from the outside world. Could call these “API Servers”
  3. Lambda tier - tier to run random jobs. Could call this “background” or “async” tier. Could be expanded to consume the responsibilities of the application servers if we’re going for a “serverless acrhitecture.”
  4. Database tier - persistence for online services
  5. Cache - fronts the DB
  6. Middleware / Message bus - facilitates one to many communication amongs components. The choices we make here can vastly simplify logging and monitoring.
  7. Blob storage - save and serve things like images and files
  8. Real time metrics database - so we can query exactly what is happening right now in our services
  9. Log collection - bring logs from all services on all servers to our real time db and data warehouse
  10. Data warehouse - for offline analysis of logs
  11. Service registry - allow services to discover one another as they come and go
  12. Chaos Monkey - simulate failures in order to test our fault tolerance (e.g., failover)
  13. Network separation - keep our internal/home/whatever network safe if the pi-cloud gets popped
  14. Real time abuse protection - prevent spam, child porn, terrorist propaganda, etc. from being uploaded by users of our applications
  15. ACL management - control what services and users have access to what
  16. Automated Deployment - committed some new code? It should get deployed without us having to think about it or even press a single button
  17. Misc machine pool - this is for services to be deployed to that can’t run in the application or lambda tier. I debated including this as the entire cloud of Pis could be seen as a “misc machine pool” that we just so happened to deploy a specific confiugration to. I think it warrants inclusion as most organizations have some random long tail of services that don’t fit into the broad buckets of: web/database/lambda. These are generally stateful services.
  18. Work queues - Application servers will likely need to enqueue work to be done at a later date (e.g., sending emails, scheduling account deletion jobs). This “work” will go to the lambda tier but we’ll need a queue to front it in many cases.

For our DB tier, we may get into shard maps. TBD.

We may also get into data governance as using user data appropriately (based on whatever consent we’ve collected from the user), and properly anonymizing and deleting data is a big deal in this era.

Relationships

To understand how all of the components connect, let’s draw them out.

Below is a high level diagram that shows how requests:

  1. Come in from the internet
  2. Hit our load balancer
  3. Get sent to an app server
  4. App servers talk with our database cache which, on miss, hits the DBs
  5. App servers may enqueue work to our work queues
  6. Lambda tier pulls work from the queues (or is pushed work from the queue)
  7. Application and lambda tiers talk to a service map to lookup and talk to specific long-tail/misc services they may need
  8. The DB holds a link (like a file handle) to content in the blob storage
  9. Blob storage content is served to users via CDN
High level diagram of the relationships between the cloud components
High level diagram of the relationships between the cloud components

We’ll somehow have to manage deployment of all of these services to the various Raspberry Pis. Services will describe their resource constraints and a machine will be picked for them to run on.

Deployment is handled by services declaring their resource needs, stacking constraints and number of instances to run. The deployment service will find machines to run the services on that match the constraints.
Deployment is handled by services declaring their resource needs, stacking constraints and number of instances to run. The deployment service will find machines to run the services on that match the constraints.

And lastly we have log collection. This may seem unimportant to some but having a central and streamlined way of processing log data will vastly simplify monitoring, alerting, application experimentation, abuse protection, online machine learning and more within our cloud.

All logs will be collected, or sent, to a central message bus. The bus will distribute the logs to any services that would like to subscribe to them.
All logs will be collected, or sent, to a central message bus. The bus will distribute the logs to any services that would like to subscribe to them.

Series Format

We’ve roughly laid out the vision for this series. Future posts will do a deep dive into each component, what software we pick (or write) to run that component, how we integrate it into the system and then finally how we monitor and test that component.

We’ll end up covering topics like:

  • error rates
  • bandwidth usage
  • cpu & memory usage
  • service failover
  • database replication
  • log collection
  • alerting
  • hardware failure remediation
  • when to shard
  • finding bottlenecks
  • when to horizontally vs vertically scale
  • service discovery
  • spam fighting
  • and more!