This is part of the Semicolon&Sons Code Diary - consisting of lessons learned on the job. You're in the AWS category.

Last Updated: 2025-06-17

Note that this article contains many ideas from the Manning book on AWS in Action. I highly recommend you buy it.

General points

be aware that AWS is constantly changing. Before implementing any features, check if it is supported.
create a billing alert to know if you are over budget
there are online calculators that can estimate cost of services upfront
the Amazon user ID is not the same as the canonical ID. For example, the user ID is a set of digits, like the 888888888888 in arn:aws:iam::888888888888, but the canonical ID is like "688888888888888888kwwsgjl67887asdf89bd41a49d22ef0c40ef7a4f03622d"
vertical scaling (each machine is more powerful, e.g. has more CPU or RAM) vs. horizontal scaling (more machines/nodes in a cluster). Scaling a database vertically is the easier option in terms of technical complexity, but it gets expensive: high-end hardware costs more than commodity hardware. Besides that, at some point you can no longer add faster hardware because nothing faster is available.

Resilience

High availability vs. fault tolerance: A highly available system can recover from a failure automatically with a short downtime. A fault-tolerant system, in contrast, keeps providing its services without interruption when a component fails.
CloudWatch can trigger recovery automatically.
A typical recovery when a server fails is moving to another virtual machine with the same ID, private IP address and data available on a network-attached EBS volume.
Some AWS services are highly available or even fault-tolerant by default. E.g. CloudFront (CDN) and Route 53 (DNS)
Some services use multiple availability zones within a region so they can recover from even an availability zone outage: S3 (object store) and DynamoDB (NoSQL database).
EC2 has autoscaling: you set the min/max/desired number of virtual machines (this can also be done on the basis of load). Re: load scaling, your policy might be "if CPU usage is less than 25%, remove 1 server. If greater than 75%, add a server". Or you might auto-scale along a schedule (e.g. a specific upcoming datetime when you have a TV ad running; or cyclical times of day or dates of year, e.g. every December)
By default metrics from EC2 to CloudWatch are updated every 5 minutes, but you can make it more frequent (e.g. every minute) if it matters to your use-case.
availability vs. durability. EBS is guaranteed to be available 99.999% of the time. This has nothing to do with losing data. EBS will not lose data 99.99% of the time (i.e. it is an order of magnitude more likely to lose data than to be unavailable)
Snapshots of EBS taken regularly are key. Consider creating an AMI.
Metric 1: Recovery time objective (RTO) is the time it takes for a system to recover from a failure; it’s the length of time until the system reaches a working state again, defined as the system service level, after an outage.
Metric 2: Recovery point objective (RPO) is the acceptable data-loss time caused by a failure. The amount of data loss is measured in time. If an outage happens at 10:00 a.m. and the system recovers with a data snapshot from 09:00 a.m., the time span of the data loss is one hour.
The most convenient way to make your system fault-tolerant is to build the architecture using fault-tolerant blocks. If all blocks are fault-tolerant, the whole system will be fault-tolerant as well. Fault-tolerant services include ELB, CloudWatch, S3, SQS, auto-scaling groups and DynamoDB. Unfortunately, one important service isn’t fault-tolerant by default: EC2 instances. Virtual machines aren’t fault-tolerant. This means an architecture that uses EC2 isn’t fault-tolerant by default. But you can get around this. If one of the EC2 instances crashes, the Elastic Load Balancer (ELB) stops routing requests to the crashed instance and sends them to instances that are still up. The auto-scaling group replaces the crashed EC2 instance within minutes, and the ELB begins to route requests to the new instance.
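
The load-based scaling policy mentioned above ("below 25% CPU remove a server, above 75% add one") can be sketched in a few lines. This is only an illustration of the decision logic, not how AWS implements it; the function name and its parameters are made up for the example.

```python
def next_desired_capacity(current, cpu_percent, minimum, maximum,
                          scale_in_below=25.0, scale_out_above=75.0):
    """Return the new desired instance count after one policy evaluation."""
    if cpu_percent > scale_out_above:
        target = current + 1   # high load: add a server
    elif cpu_percent < scale_in_below:
        target = current - 1   # mostly idle: remove a server
    else:
        target = current       # inside the comfortable band: do nothing
    # The group never leaves its configured min/max bounds.
    return max(minimum, min(maximum, target))


if __name__ == "__main__":
    for cpu in (10.0, 50.0, 90.0):
        print(cpu, next_desired_capacity(current=2, cpu_percent=cpu,
                                         minimum=1, maximum=4))
```

The clamp at the end is what the min/max settings of an auto-scaling group buy you: a runaway metric can never scale you to zero machines or to an unbounded bill.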

Choosing a region

Factors to consider:
- latency: how far the region is from your customers
- availability of desired AWS services: not everything is available in every region. Use online tools to determine this.
- legal issues: e.g. are you allowed to store data in country Y?
- where your other AWS infrastructure is based. E.g. with DynamoDB, no additional traffic charges apply if you access DynamoDB from EC2 instances in the same region.

Each region consists of multiple availability zones (AZs). You can think of an AZ as an isolated group of data centers, and a region as an area where multiple availability zones are located at a sufficient distance. The region us-east-1 consists of six availability zones (us-east-1a to us-east-1f), for example. The availability zone us-east-1a could be one data center, or many (this is not public info)

The AZs are connected through low-latency links, so requests between different availability zones aren’t as expensive as requests across the internet in terms of latency. The latency within an availability zone (such as from an EC2 instance to another EC2 instance in the same subnet) is lower compared to latency across AZs.

CloudFormation

CloudFormation creates an entire architecture (e.g. WordPress on two EC2 instances with an RDS MySQL database and an elastic load balancer) from a template (find them online or create your own)
a CloudFormation template contains parameters (e.g. to customize aspects like the DB name when the template is run), resources (AWS services), and outputs.
before starting, import your private key in EC2 > Key Pairs. That means any EC2 instances you create will be accessible via SSH.
There is an outputs section that shows stuff like URLs for visiting your newly deployed site
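
To make the parameters/resources/outputs structure concrete, here is a minimal sketch of a template built as a Python dict and printed as JSON. The logical names (Server, ServerPublicName), the description texts and the ami-PLACEHOLDER image ID are invented for illustration; a real template needs a valid AMI ID for your region.

```python
import json

template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Description": "Hypothetical single-VM stack",
    # Parameters: values you supply when the template is run.
    "Parameters": {
        "KeyName": {
            "Type": "AWS::EC2::KeyPair::KeyName",
            "Description": "Name of an existing EC2 key pair for SSH access",
        }
    },
    # Resources: the AWS services to create.
    "Resources": {
        "Server": {
            "Type": "AWS::EC2::Instance",
            "Properties": {
                "ImageId": "ami-PLACEHOLDER",  # assumption: replace with a real AMI
                "InstanceType": "t2.micro",
                "KeyName": {"Ref": "KeyName"},
            },
        }
    },
    # Outputs: values shown after deployment, e.g. where to reach the server.
    "Outputs": {
        "ServerPublicName": {
            "Description": "Public DNS name of the server",
            "Value": {"Fn::GetAtt": ["Server", "PublicDnsName"]},
        }
    },
}

print(json.dumps(template, indent=2))
```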

S3

Object store (i.e. manages data as objects with a GUID and metadata (size, owner, content type), as opposed to other storage architectures like file systems, which manage data as a file hierarchy, and block storage, which manages data as blocks within sectors and tracks.)
Glacier - similar but cheaper. Downside is it takes up to 5 hours to access data (instead of near-immediate access with standard S3). You might initially process raw data into a relational form on S3, then move it to Glacier for archiving purposes since it is unlikely (but not impossible) you will need it again. It also has expedited retrieval (more expensive but only 5 minutes)
movement to glacier can be controlled as a rule at the bucket level (e.g. after N days)
to give read only access to every item in a bucket you must use the wildcard - e.g. "Resource":["arn:aws:s3:::$BucketName/*"] in the policy
you can host a static website there by running aws s3 website s3://$BucketName --index-document helloworld.html and setting a CNAME record on your domain pointing to bucket's endpoint
S3 used to be eventually consistent for overwrites and deletes, which meant you might read stale data for a short period of time after changing an object. Since December 2020 S3 provides strong read-after-write consistency, so a read issued after a successful write returns the new data.
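
As a sketch of the read-only bucket policy mentioned above, the following builds the policy document in Python. The function name, the Sid and the example bucket name are assumptions; the important part is the /* wildcard in the resource, which makes the statement apply to every object in the bucket.

```python
import json


def public_read_policy(bucket_name):
    """Build a policy that lets anyone read every object in the bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "PublicRead",          # hypothetical statement id
            "Effect": "Allow",
            "Principal": "*",             # anyone
            "Action": ["s3:GetObject"],   # read only
            # The wildcard covers all keys inside the bucket.
            "Resource": [f"arn:aws:s3:::{bucket_name}/*"],
        }],
    }


if __name__ == "__main__":
    print(json.dumps(public_read_policy("my-example-bucket"), indent=2))
```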

For maximum performance, choose keys with even distribution of characters in their prefixes. In S3, keys are stored in alphabetical order in an index. The key name determines which partition the key is stored in. If your keys all begin with the same characters, this will limit the I/O performance of your S3 bucket. Thus names like "image1.png", "image2.png" will underperform names like "ffsa-image1.png", "abaw-image2.png" that have an MD5 hash of the original key as a prefix.
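
A small sketch of the hash-prefix idea: take the MD5 of the original key and put the first few hex characters in front. The helper name and the prefix length of four are arbitrary choices for the example.

```python
import hashlib


def prefixed_key(key, prefix_length=4):
    """Prepend the first hex characters of the key's MD5 hash to the key."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_length]}-{key}"


if __name__ == "__main__":
    for name in ("image1.png", "image2.png"):
        print(prefixed_key(name))
```

Because the prefix is derived from the key itself, you can always recompute the stored name from the original one without keeping a lookup table.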

Using a slash (/) in the key name acts like creating a folder for your object. If you create an object with the key folder/object.png, the folder will become visible as a folder if you’re browsing your bucket with a GUI like the Management Console, for example. But technically, the key of the object is still folder/object.png.

EC2

EC2 gives you virtual machines, i.e. multiple instances get run as guests on a host machine and are separated by software by a hypervisor (e.g. Xen). Increasingly the separation is assisted with hardware.
We use virtual appliances to speed up creation of preconfigured OS's. These are images of a virtual machine containing an OS and preconfigured software. They eliminate the cost of installing afresh every time.
An AMI is a special type of virtual appliance for use with the EC2 service. An AMI technically consists of a read-only filesystem including the OS, additional software, and configuration; it doesn’t include the kernel of the OS. The kernel is loaded from an Amazon Kernel Image (AKI).
When choosing between EC2 instance types, consider whether the program you are running will be single or multi-threaded. E.g. Redis is single-threaded and will not use all cores in a many-core instance, so don't waste your money!
Which OS image should you choose? Amazon Linux is a fine bet as it is maintained by and optimized for AWS.
How to read "t2.micro"? Class = "t" (small cheap virtual machines with low baseline performance but ability to burst), generation 2, size of micro.
People tend to overestimate the instance size they'll need so start small and work up.
System log of boot available through AWS console if issues with start up.
It is possible to assign extra network interfaces. There could be many reasons for this. One is allowing access via multiple different public IP addresses, thereby letting you host multiple different websites. You’ll need to configure the web server to deliver different websites depending on the IP address. Your virtual machine doesn’t know anything about its public IP address, but you can distinguish the requests based on the private IP address. These start with either 10, 172 or 192 and can be determined with $ ifconfig on your EC2 instance.
Reserved instances: you have to pay whether it is running or not. May be much cheaper because you can do long-term contracts.
Spot instances: basically unused capacity. Great for running long batch jobs (e.g. AI, encoding media files). Can be as low as 10% of the normal price! If the current spot price exceeds your bid, your VM will be terminated (not stopped) by AWS after two minutes.
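
The way of reading "t2.micro" described above can be turned into a tiny parser. This is a sketch: the attributes field for extra letters after the generation (as in c5d.xlarge) is my addition and not covered in the notes, and the regular expression will not recognise every name AWS has ever used.

```python
import re
from typing import NamedTuple


class InstanceType(NamedTuple):
    family: str       # e.g. "t" for small burstable machines
    generation: int   # e.g. 2
    attributes: str   # optional extra letters, e.g. "d" in "c5d" (assumption)
    size: str         # e.g. "micro"


_PATTERN = re.compile(r"^([a-z]+)(\d+)([a-z]*)\.([a-z0-9]+)$")


def parse_instance_type(name: str) -> InstanceType:
    """Split a name like 't2.micro' into family, generation and size."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognised instance type: {name!r}")
    family, generation, attributes, size = match.groups()
    return InstanceType(family, int(generation), attributes, size)


if __name__ == "__main__":
    print(parse_instance_type("t2.micro"))
```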

EC2 Instance Store

An instance store provides block-level storage directly attached to the machine hosting your VM. The instance store is part of an EC2 instance and available only if your instance is running; it won’t persist your data if you stop or terminate the instance.
You don’t pay separately for an instance store; instance store charges are included in the EC2 instance price.
Use it for caching, temporary processing, or applications that replicate data to several servers, as some databases do.

Elastic IPs

When you modify an EC2 instance (e.g. increase its size), it gets a random, differing IP address. To avoid this you can allocate a fixed IP with Elastic IPs service.
This is found in sub-menu sidebar of EC2.
Essentially you create one or more fixed IP addresses and associate them with particular instances of EC2.
Another advantage of elastic IPs is that you can provision another machine to replace an old one and once it's ready do a quick switcheroo.

Elastic Block Store

A block is a sequence of bytes, and the smallest addressable unit. The OS is the intermediary between the application that wants to access files and the underlying file system and block-level storage. The OS provides access to block-level storage via open, write, and read system calls.
If you are migrating older systems (e.g. MySQL) to AWS, these expect a classical block-level file system instead of an object store, so S3 is not an option.
EBS is a persistent (as opposed to temporary) block store and it has built-in replication
An EBS volume is separate from an EC2 instance and connected over the network. If you terminate your EC2 instance, the EBS volumes therefore remain.
WARNING: You can’t attach the same EBS volume to multiple virtual machines! This use-case requires a network filesystem.
Usually called "Volumes" in Cloud Formation nomenclature.
EBS volumes are charged based on the size of the volume, no matter how much data you store in the volume.
EBS offers an optimized, easy-to-use way to back up EBS volumes with EBS snapshots. A snapshot is a block-level incremental backup that is stored in S3.
Creating a snapshot of an attached, mounted volume is possible, but can cause problems with writes that aren’t flushed to disk. You should either detach the volume from your instance or stop the instance before creating the snapshot.

Manual setup of EBS on an EC2 instance

On an EC2 instance you can see the attached EBS volumes using sudo fdisk -l. Usually, EBS volumes can be found somewhere in the range of /dev/xvdf to /dev/xvdp. The root volume (/dev/xvda) is an exception: it's based on the AMI you choose when you launch the EC2 instance, and contains everything needed to boot the instance (your OS files).

The first time you use a newly created EBS volume, you must create a filesystem on the device with sudo mkfs -t ext4 /dev/xvdf
After the filesystem has been created, you can mount the device:
```
$ sudo mkdir /mnt/volume/
$ sudo mount /dev/xvdf /mnt/volume/
```
To see mounted volumes, use df -h.
To save a shared file, put it in the volume you mounted - e.g. sudo touch /mnt/volume/testfile

Elastic File System

network file system (NFS), thus making uploaded files (say from WordPress) available on many EC2 instances at once (as opposed to on just one instance, as per EBS limitations)
in a typical simple web app set-up, this would include PHP, HTML, CSS, PNG etc. files
the data on the EFS filesystem is replicated among multiple data centers and remains available even if a whole data center suffers from an outage, which is not true for EBS and instance stores. This means hardware issues are unlikely to cost you data, but human error (e.g. rm -rf /) is still possible, so you should back up from time to time anyway, for example by syncing the files to S3 or to an EBS volume that you snapshot.
Mount Targets are used to mount the EFS on your virtual machines. You should have at least two for redundancy. These will have different IP addresses.
EFS mount targets provide an endpoint for EC2 instances to mount an EFS in a subnet (VPC)
Charged in GB per month
The EC2 instance communicates with the mount target via a TCP/IP network connection via the NFSv4.1 protocol. A security group is used to control/allow traffic (often on port 2049)

Use-cases for EFS

you can apply the same mechanism to share files between a fleet of web servers (for example, the /var/www/html folder)
a highly available Jenkins server (such as /var/lib/jenkins).
Making sure certain admin users' home directories that contain tooling (/home/USERNAME) are available on every instance so these admins can work effectively. To solve this problem, create a filesystem and mount EFS on each EC2 instance under /home. The home directories are then shared across all your EC2 instances, and users will feel at home no matter which VM they log in to.

Elastic Load Balancer (ELB)

typical use case is to forward requests to one of your virtual machines. This is an example of synchronous decoupling (Simple Queue Service is an example of asynchronous decoupling)
Instead of exposing your EC2 instances (running web servers) to the outside world, you only expose the load balancer to the outside world. This is very helpful since your clients often store an IP address in their system that they cannot easily change. If you were routing external traffic directly to an EC2 instance, you'd have a problem whenever that instance is replaced. And you can't really rely on DNS either, since the TTL is not always obeyed by caches.
Performs health checks to ensure requests forwarded to healthy machines only
if scheme is "internet facing", it will be accessible via the internet thanks to config in a DNS record (possibly created with Cloud Formation)
connects to a "target group" that includes the various resources to be load balanced (e.g. two EC2 instances)
can have "listener rules" - choose a different target group based on the HTTP path or host (e.g. "if path starts with /api/* send to target group 2"). Otherwise requests are forwarded to the default target group defined in the listener.
AWS offers different types of load balancers through the Elastic Load Balancing (ELB) service. All load balancer types are fault-tolerant and scalable. They differ mainly in the protocols they support:
- Application Load Balancer (ALB)—HTTP, HTTPS
- Network Load Balancer (NLB)—TCP
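
The listener rules described above boil down to "first matching rule wins, otherwise use the default target group". A toy sketch of that selection logic follows; the Rule class and the target group names are made up, and real ALB rules support more condition types than shown here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Rule:
    target_group: str
    path_prefix: Optional[str] = None   # e.g. "/api/"
    host: Optional[str] = None          # e.g. "admin.example.com"

    def matches(self, host: str, path: str) -> bool:
        if self.path_prefix is not None and not path.startswith(self.path_prefix):
            return False
        if self.host is not None and host != self.host:
            return False
        return True


def choose_target_group(rules: List[Rule], default_target_group: str,
                        host: str, path: str) -> str:
    """Return the target group of the first matching rule, else the default."""
    for rule in rules:
        if rule.matches(host, path):
            return rule.target_group
    return default_target_group


if __name__ == "__main__":
    rules = [Rule("api-servers", path_prefix="/api/")]
    print(choose_target_group(rules, "web-servers", "example.com", "/api/users"))
    print(choose_target_group(rules, "web-servers", "example.com", "/index.html"))
```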

Simple Queue Service (SQS)

Enables asynchronous decoupling: you can communicate without both sides being available at the same time, as is required by synchronous systems.
Serves as a buffer helping when the rates of production and consumption of requests is not equal
SQS offers message queues that guarantee the delivery of messages at least once. The problem of repeated delivery of a message can be solved by making the message processing idempotent, meaning that no matter how often the message is processed, the result stays the same. In the example of a service that converts a webpage to PNG, this is true by design: if you process the message multiple times, the same image is uploaded to S3 multiple times, and if the image is already available on S3, it’s simply replaced. How do you achieve idempotency when working with third-party services (e.g. posting onto a Twitter feed)? One way would be to query Twitter within the same job before posting the tweet. Issue: Twitter is eventually consistent, so a very recent matching tweet might be missed and you'll end up posting the same thing twice or more. Ultimately you need to make a business choice: tolerate a missing status update, tolerate multiple status updates, or tolerate slowness (since you might give the third-party service enough time to become consistent before taking action)
SQS doesn’t guarantee the order of messages, so you may read messages in a different order than they were produced. If you need a stable message order, you’ll have difficulty finding a solution that scales like SQS. Our advice is to change the design of your system so you no longer need the stable order, or put the messages in order on the client side. Or look at SQS FIFO queues which guarantee order of messages and detect duplicates.
Typically the user request will be handled by a fast part of your web-app (e.g. a nodejs server) and queue something up (e.g. generation of a screenshot from a URL) and return something to the client early (e.g. URL to check periodically for the final PNG screenshot)
The consumer of the messages just polls the queue
If a message removed from the queue is not marked as processed (e.g. through explicit deletion in SQS) before the VisibilityTimeout, the message will be delivered back to the queue. This architecture prevents broken parts of the system from losing messages.
Another advantage: you can add as many workers as you like independent of producers.
SQS does not replace a message broker like ActiveMQ. It has no message priorities or message routing.
When designing an async process, it is important to keep track of it so you'll need some kind of identifier. The client can do a look-up at that ID. Before the work is done it will give either a status report (say in JSON) or some fallback (e.g. the unprocessed image within an async flow to turn an image to a sepia colored variant)
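
To see how the visibility timeout and idempotent processing fit together, here is an in-memory toy queue. It is not the SQS API: ToyQueue, process and the explicit now argument are all invented so that the behaviour can be replayed without waiting for real time to pass.

```python
import itertools


class ToyQueue:
    """In-memory stand-in for a queue with a visibility timeout."""

    def __init__(self, visibility_timeout):
        self.visibility_timeout = visibility_timeout
        self._messages = {}         # message id -> body
        self._invisible_until = {}  # message id -> time it becomes visible again
        self._ids = itertools.count(1)

    def send(self, body):
        message_id = next(self._ids)
        self._messages[message_id] = body
        self._invisible_until[message_id] = 0.0
        return message_id

    def receive(self, now):
        """Return (id, body) of one visible message and hide it, or None."""
        for message_id, body in self._messages.items():
            if self._invisible_until[message_id] <= now:
                self._invisible_until[message_id] = now + self.visibility_timeout
                return message_id, body
        return None

    def delete(self, message_id):
        """Mark a message as processed so it is never delivered again."""
        self._messages.pop(message_id, None)
        self._invisible_until.pop(message_id, None)


def process(body, store):
    """Idempotent work: writing the same key twice leaves the same result."""
    store[body["url"]] = f"screenshot-of-{body['url']}.png"


queue = ToyQueue(visibility_timeout=30)
store = {}
queue.send({"url": "example.com"})

# A first worker receives the message, does the work, then crashes before deleting it.
first = queue.receive(now=0)
process(first[1], store)

# While the message is invisible, other workers get nothing.
hidden = queue.receive(now=10)

# After the timeout the message is delivered again; processing it twice is harmless.
second = queue.receive(now=31)
process(second[1], store)
queue.delete(second[0])
```

The message only disappears once a worker deletes it explicitly, which is why a crashed worker cannot lose work, and why the processing step has to tolerate running more than once.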

Elastic Beanstalk

Features:

provides runtime for environments (e.g. Python, node, Ruby, docker)
updates OS etc. so you don't have to think about it
scales web application
monitors web application

Nevertheless, it still gives you virtual machine you can log in to for debugging.

Relational Database Service: RDS

supports all the big DB names (e.g. PostgreSQL, MySQL). AWS offers its own engine called Amazon Aurora, which is MySQL- and PostgreSQL-compatible. If your application supports MySQL or PostgreSQL, the migration to Amazon Aurora is easy.
Aurora is special in that it does not store data on a single EBS volume but on a cluster volume (i.e. it stores data on multiple disks, so it has no single point of failure)
RDS has backups (configure retention period up to 35 days), patch management, and high availability SQL databases
Performance impact of snapshot backup: Creating a snapshot requires all disk activity to be briefly frozen. Requests to the database may be delayed or even fail because of a timeout, so choose a time frame for the snapshot that has the least impact on applications and users (for example, late at night).
You’d need considerable time and know-how to build a comparable relational database environment based on virtual machines, so the book recommends using Amazon RDS for relational databases whenever possible to decrease operational costs.
Typically you would create a security group for your RDS instance (e.g. with MySQL, allow in/out traffic on port 3306) and give access only to EC2 machines that strictly need to talk to the DB.
RDS has a high availability (HA) option. This means there is a master and a standby instance that replicates data. If the master fails, the standby takes over using DNS resolution and without human intervention. You pay for both. The authors strongly recommend using high-availability deployment for all databases that handle production workloads. The master and standby are in different data centers ("availability zones"), therefore this feature is called Multi-AZ
I/O performance is important in certain DB loads. If you need to guarantee a high level of read or write throughput, you should use provisioned IOPS (SSD)
Option of read-only replication: A database suffering from too many read requests can be scaled horizontally by adding additional database instances for read replication. Changes to the database are asynchronously replicated to an additional read-only database instance. The read requests can be distributed between the master database and its read-replication databases to increase read throughput. These read-only dbs can be promoted to primary DB if needed.
Even though RDS is managed, you still need to monitor storage space, RAM, and CPU utilization to figure out how to scale.

DynamoDB

Scaling a traditional, relational database horizontally is difficult because transactional guarantees (atomicity, consistency, isolation, and durability, also known as ACID) require communication among all nodes of the database during a two-phase commit. A simplified two-phase commit with two nodes works like this:
1. A query is sent to the database cluster that wants to change data (INSERT, UPDATE, DELETE).
2. The database transaction coordinator sends a commit request to the two nodes.
3. Node 1 checks if the query could be executed. The decision is sent back to the coordinator. If the node decides yes, it must fulfill this promise. There is no way back.
4. Node 2 checks if the query could be executed. The decision is sent back to the coordinator.
5. The coordinator receives all decisions. If all nodes decide that the query could be executed, the coordinator instructs the nodes to finally commit.
6. Nodes 1 and 2 finally change the data. At this point, the nodes must fulfill the request. This step must not fail.

The problem is that the more nodes you add, the slower your database becomes, because more nodes must coordinate transactions between each other. The way to tackle this has been to use databases that don’t adhere to these guarantees. They’re called NoSQL databases.
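
The six steps of the simplified two-phase commit can be played through with a toy coordinator. This is a teaching sketch only: Node, two_phase_commit and the can_commit flag are invented names, everything runs in one process, and real implementations also have to deal with timeouts and crashed coordinators.

```python
class Node:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit   # whether this node will vote yes
        self.data = {}
        self._pending = None

    def prepare(self, change):
        """Phase 1: vote. A yes vote is a promise to commit later."""
        if self.can_commit:
            self._pending = change
            return True
        return False

    def commit(self):
        """Phase 2: apply the change that was promised in phase 1."""
        key, value = self._pending
        self.data[key] = value
        self._pending = None

    def abort(self):
        """Drop a promised change because another node voted no."""
        self._pending = None


def two_phase_commit(nodes, change):
    """Coordinator: commit on every node only if every node votes yes."""
    votes = [node.prepare(change) for node in nodes]
    if all(votes):
        for node in nodes:
            node.commit()
        return True
    for node in nodes:
        node.abort()
    return False


if __name__ == "__main__":
    healthy = [Node("node1"), Node("node2")]
    print(two_phase_commit(healthy, ("user:1", "michael")))
    mixed = [Node("node1"), Node("node2", can_commit=False)]
    print(two_phase_commit(mixed, ("user:1", "michael")))
```

Even in this toy version the coordinator has to talk to every node twice per write, which is the coordination overhead that grows as you add nodes.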
One use-case for NoSQL is therefore wherever horizontal scaling with a relational database becomes a pain in the ass or too slow.
There are four types of NoSQL databases (document, graph, columnar, and key-value store), each with its own uses and applications. DynamoDB is a key-value store with document support.
Big advantage is that no action needed for provisioning more storage (i.e. just like s3 keeps growing vs. standard mysql-type use-case where you have to say "I want 200 GB extra now")
each table has a name and organizes a collection of items. Each item is a collection of attributes, where each attribute is a key-value pair and the value may be scalar, multivalued (e.g. a string or binary set) or a JSON document (object, array). It has no enforced schema.
best practice is to prefix your table names with the name of your application.
here the UID is used as the partition key and TID (task id) is used as the sort key

["michael", 1] => {
  "uid": "michael",
  "tid": 1,
  "description": "prepare lunch"
}
["michael", 2] => {
  "uid": "michael",
  "tid": 2,
  "description": "prepare talk for conference"
}

note that while there is order in the sort key (the second key), there is no order in the partition key (the first key)
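
A rough in-memory model of a table with uid as partition key and tid as sort key shows what that ordering means in practice. ToyTable and its methods are invented for this sketch and only loosely mirror the real DynamoDB operations.

```python
from collections import defaultdict


class ToyTable:
    """Items are grouped by partition key (uid) and ordered by sort key (tid)."""

    def __init__(self):
        self._partitions = defaultdict(dict)   # uid -> {tid: item}

    def put_item(self, item):
        self._partitions[item["uid"]][item["tid"]] = item

    def get_item(self, uid, tid):
        """Look up a single item by its full primary key."""
        return self._partitions.get(uid, {}).get(tid)

    def query(self, uid):
        """Return all items of one partition, ordered by the sort key."""
        partition = self._partitions.get(uid, {})
        return [partition[tid] for tid in sorted(partition)]


if __name__ == "__main__":
    table = ToyTable()
    table.put_item({"uid": "michael", "tid": 2,
                    "description": "prepare talk for conference"})
    table.put_item({"uid": "michael", "tid": 1, "description": "prepare lunch"})
    for item in table.query("michael"):
        print(item["tid"], item["description"])
```

Within one partition you get items back in sort-key order, but there is no meaningful order across partitions, so "all tasks of michael" is a cheap query while "all tasks, ordered by user" is not.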
DynamoDB lets you retrieve changes to a table as soon as they’re made. A stream provides all write (create, update, delete) operations to your table items. The order is consistent within a partition key. Streams are used in place of polling the DB for changes or populating caches with changes made to a table.
Global secondary index. Imagine a table of users where each user has a country attribute. You then create a global secondary index where the country is the new partition key. Imagine this as a read-only DynamoDB table (a "projection") that is automatically maintained. Be careful: this is only eventually consistent. A global secondary index comes at a price: the index requires storage (the same cost as for the original table). You must provision additional write-capacity units for the index as well, because a write to your table will cause a write to the global secondary index as well.
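
The users-by-country example can be modelled as a second dictionary that is updated on every table write. The class and method names are invented, and unlike a real global secondary index (which is updated asynchronously and is therefore eventually consistent) this sketch updates the index immediately.

```python
from collections import defaultdict


class UserTable:
    """Toy table keyed by uid that maintains an index keyed by country."""

    def __init__(self):
        self.items = {}                          # uid -> full item
        self.country_index = defaultdict(dict)   # country -> {uid: projection}

    def put_item(self, item):
        previous = self.items.get(item["uid"])
        if previous is not None:
            # Remove the stale index entry before writing the new one.
            self.country_index[previous["country"]].pop(previous["uid"], None)
        self.items[item["uid"]] = item
        # Every table write also costs a write to the index.
        self.country_index[item["country"]][item["uid"]] = {
            "uid": item["uid"],
            "country": item["country"],
        }

    def query_by_country(self, country):
        """Read-only lookup on the index: uids of all users in a country."""
        return sorted(self.country_index.get(country, {}))


if __name__ == "__main__":
    users = UserTable()
    users.put_item({"uid": "michael", "country": "USA", "email": "m@example.com"})
    users.put_item({"uid": "emma", "country": "Germany", "email": "e@example.com"})
    print(users.query_by_country("Germany"))
```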

Security Group

Networking revision: 0.0.0.0/0 means any IP address, which you might allow for SSH access to your single-server web app if you administer it from home
an example security group rule would be SSH only (port 22, any IP address, TCP protocol)
Another typical one, if you run a web server, is that the only other ports you need to open to the outside world are port 80 for HTTP traffic and 443 for HTTPS traffic. Close down all the other ports!
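
Conceptually, an inbound rule is a (protocol, port, source range) triple, and traffic that matches no rule is dropped. A sketch using Python's ipaddress module follows; InboundRule, is_allowed and the example addresses are made up, and real security groups also support port ranges and other security groups as sources.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class InboundRule:
    protocol: str   # e.g. "tcp"
    port: int       # e.g. 22
    cidr: str       # e.g. "0.0.0.0/0" for any IPv4 address


def is_allowed(rules, protocol, port, source_ip):
    """Return True if at least one rule admits this inbound traffic."""
    address = ipaddress.ip_address(source_ip)
    for rule in rules:
        if (rule.protocol == protocol and rule.port == port
                and address in ipaddress.ip_network(rule.cidr)):
            return True
    return False   # no matching rule: the traffic is dropped


if __name__ == "__main__":
    web_server_rules = [
        InboundRule("tcp", 80, "0.0.0.0/0"),       # HTTP from anywhere
        InboundRule("tcp", 443, "0.0.0.0/0"),      # HTTPS from anywhere
        InboundRule("tcp", 22, "198.51.100.0/24"), # SSH from one office range
    ]
    print(is_allowed(web_server_rules, "tcp", 443, "203.0.113.7"))
    print(is_allowed(web_server_rules, "tcp", 3306, "203.0.113.7"))
```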

It is possible to control network traffic based on whether the source or destination belongs to a specific security group. For example, you can say that a MySQL database can only be accessed if the traffic comes from your web servers, or that only your proxy servers are allowed to access the web servers. Because of the elastic nature of the cloud, you’ll likely deal with a dynamic number of virtual machines, so rules based on security groups scale better than those based on IP addresses etc.

This wasn't mentioned in the book, but the examples gave me the impression that it is more important to limit inbound ports and IPs. Many examples did nothing with outbound.

Jump Box concept

only one virtual machine, the "bastion host", can be accessed via SSH from the internet
all others must be accessed via SSH from this host
means that if a smaller component, like a mail server, is compromised, then it does not have access to your entire system

To implement the concept of a bastion host, you must follow these two rules:

Allow SSH access to the bastion host from 0.0.0.0/0 or a specific source address.
Allow SSH access to all other virtual machines only if the traffic source is the bastion host.

It’s important that the bastion host does nothing but SSH, to reduce the chance of it becoming a security problem.

Use ssh -A to enable agent forwarding when you SSH into your jump box

Cloudtrail

Generates an event for every AWS API call (e.g., launch EC2 instance)

Cloudwatch

Consists of the following:
- metrics: watches various metrics (network usage, disk usage, number of function invocations)
- alarms: triggers alarms when metrics cross certain thresholds
- logs
- events: whenever something changes in your infrastructure, an event is generated in near real-time. For example, CloudTrail emits an event for every call to the AWS API. AWS emits an event to notify you of service degradations or downtimes.

Typical alarm: You might set up an alarm to trigger if the 10-minute average of the CPUUtilization metric is higher than 80% for 1 out of 1 data points, and if the 10-minute average of the SwapUsage metric is higher than 67108864 (64 MB) for 1 out of 1 datapoints.
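
The "average over a period, N out of M data points" wording can be sketched as a small function. The name alarm_state and its parameters are assumptions for the example; each inner list stands for the raw samples collected during one 10-minute period.

```python
from statistics import mean


def alarm_state(periods, threshold, datapoints_to_alarm=1, evaluation_periods=1):
    """Evaluate an alarm over the most recent periods (oldest first).

    Each period is a list of raw samples; its average is one data point.
    """
    recent = periods[-evaluation_periods:]
    breaching = sum(1 for samples in recent if mean(samples) > threshold)
    return "ALARM" if breaching >= datapoints_to_alarm else "OK"


if __name__ == "__main__":
    cpu_last_10_minutes = [[70, 95, 90]]   # hypothetical CPUUtilization samples
    print(alarm_state(cpu_last_10_minutes, threshold=80))
```

With 1 out of 1 data points a single bad period is enough to trigger; requiring, say, 2 out of 3 makes the alarm less jumpy at the cost of reacting later.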

From queueing theory, utilization over about 80% is problematic since wait time grows exponentially with the utilization of a resource. This applies to CPUs, hard disks, and cashiers at a help desk alike. It occurs basically because not all requests for the resource happen at convenient, even times, i.e. they are bursty. In other words, when you go from 0% utilization to 60%, wait time doubles. When you go to 80%, wait time has tripled. When you go to 90%, wait time is six times higher. And so on. So if your wait time is 100 ms during 0% utilization, you already have 300 ms wait time during 80% utilization, which is already slow for an e-commerce web site.

Amazon API Gateway

Offers a scalable and secure REST API that accepts HTTPS requests from your web application’s front-end or your mobile application.

Lambda

unlike EC2 it scales automatically and is highly available and fault tolerant by default
you can measure failures of a function call in Cloudwatch.
if you need to access other resources, assign your lambda a role (e.g. one that allows it to putLogEvents on CloudWatch). Temporary credentials are generated based on the IAM role and injected into each invocation via environment variables (such as AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_SESSION_TOKEN). Those environment variables are used by the AWS SDK to sign requests automatically.
you can pass ENV variables into the lambda and access them from the programming language environment
in lambda > monitoring you can see how often a function was executed
by default the lambda will log to CloudWatch stream /aws/lambda/{NAME_OF_LAMBDA} (after a few mins delay)
you can have them as scheduled events (e.g. every 5 minutes)
Be careful of how placing a lambda in a VPC can limit invocations. Each parallel invocation requires an IP address and you may run out.
To deploy a Lambda function, you need to upload the deployment package to S3. The deployment package is a zip file including your code as well as additional modules
Warning: Each invocation of your Lambda function needs to complete within the configured timeout, which can be at most 900 seconds (the limit was 300 seconds when the book was written).
Starting a new execution context requires AWS Lambda to download your code, initialize a runtime environment, and load your code. This process is called a cold-start. Depending on the size of your deployment package, the runtime environment, and your configuration, a cold-start could take from a few milliseconds to a few seconds. Therefore, applications with very strict requirements concerning response times are not good candidates for AWS Lambda.
Another limitation is the maximum amount of memory you can provision for a Lambda function: 3008 MB when the book was written (AWS has since raised the limit to 10,240 MB). If your Lambda function uses more memory than provisioned, its execution will be terminated.
Common use-case is to combine with API Gateway and form the backend for a web application.
Another use-case is the Internet of Things: a "thing" sends input data to a message broker, and rules call the lambda if certain conditions are met.
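
For orientation, a minimal Python handler that reads an environment variable and logs via print might look like this. The GREETING variable and the name field of the event are made up for the example; the (event, context) signature is what the Python runtime calls.

```python
import json
import os


def handler(event, context):
    """Entry point that the Lambda runtime invokes with the event payload."""
    # Environment variables configured on the function are plain os.environ entries.
    greeting = os.environ.get("GREETING", "Hello")
    name = event.get("name", "world")
    # Anything printed ends up in the function's CloudWatch log stream.
    print(json.dumps({"greeting": greeting, "name": name}))
    return {"message": f"{greeting}, {name}!"}


if __name__ == "__main__":
    # Local smoke test; in AWS the runtime supplies the real context object.
    print(handler({"name": "Ada"}, None))
```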

ElastiCache

offers managed in-memory database systems like Redis or Memcached.
Redis offers complex data types, transactions and authentication, whereas Memcached offers only simple data types and none of the rest (but is multithreaded). Both support sharding.
Example uses-cases:
- Free up relational DB from heavy read load by caching some frequently accessed data (e.g. about a level in a multiplayer game). Since relational DBs tend to be pricey, adding this caching layer can be cheaper.
- provided sorted lists that change often (e.g. rank of a player in multiplayer games)
Downside: you can lose the cached data due to restart or hardware failure (though Redis has optional failover support)
many people compress their data before placing it in the cache (usually zlib). This can cut memory and network usage costs by 25%.
Cache population methods:
- Cron job that updates it every minute.
- On-demand (when a relevant request is made, say for the leaderboard, and there is no cache entry present, e.g. because the TTL has made it expire)
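The on-demand method can be sketched as follows. This is an in-process stand-in, not ElastiCache client code: the TTLCache class, its get_or_load method and the injectable clock are all assumptions made for illustration. With Redis or Memcached the expiry would be handled by the cache server instead.

```python
import time


class TTLCache:
    """Tiny in-memory cache illustrating on-demand population with a TTL."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._store = {}  # key -> (expires_at, value)

    def get_or_load(self, key, loader):
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # cache hit, entry still fresh
        # Cache miss or expired entry: hit the slow source and repopulate.
        value = loader()
        self._store[key] = (now + self.ttl, value)
        return value
```

A request handler would call cache.get_or_load("leaderboard", query_database), where query_database is whatever slow lookup you want to shield.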
How to choose TTL? Consider the effects on ALL PARTIES INVOLVED (e.g. content producer, content consumer, your legal team) when deciding. Consider too whether you have any other options for cache invalidation.
How might you decide the cache key for an SQL query? Take the md5 of the whole thing: md5(SELECT id, nick FROM player ORDER BY score DESC LIMIT 10)
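In Python the md5 idea might look like the sketch below. The cache_key helper, the "sql:" prefix and the whitespace normalization are my own additions, not something the approach requires. Normalizing simply stops two differently indented copies of the same query from producing different keys.

```python
import hashlib


def cache_key(sql, prefix="sql:"):
    """Derive a fixed-length cache key from the text of an SQL query."""
    # Collapse runs of whitespace so formatting differences do not matter.
    normalized = " ".join(sql.split())
    return prefix + hashlib.md5(normalized.encode("utf-8")).hexdigest()


key = cache_key("SELECT id, nick FROM player ORDER BY score DESC LIMIT 10")
```

If the query takes parameters, they need to be part of the hashed text as well, otherwise different parameter values would share one cache entry.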
What is sharding? When the data does not fit on a single node, you add nodes and split the data across them to increase capacity. This typically uses a consistent hashing algorithm, which arranges keys into partitions on a ring distributed across the nodes.
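To make the ring idea concrete, here is a small sketch of consistent hashing with virtual nodes. It is an illustration of the general technique, not how ElastiCache or any particular client library implements it. The HashRing class, the replicas parameter and the use of md5 for ring positions are all assumptions.

```python
import bisect
import hashlib


class HashRing:
    """Consistent hashing: each node owns many points on a ring of hash values."""

    def __init__(self, nodes, replicas=100):
        ring = []
        for node in nodes:
            # Several virtual points per node spread the load more evenly.
            for i in range(replicas):
                ring.append((self._hash(f"{node}#{i}"), node))
        ring.sort()
        self._positions = [pos for pos, _ in ring]
        self._nodes = [node for _, node in ring]

    @staticmethod
    def _hash(text):
        return int(hashlib.md5(text.encode("utf-8")).hexdigest(), 16)

    def node_for(self, key):
        # Walk clockwise to the first point after the key's position,
        # wrapping around to the start of the ring if necessary.
        idx = bisect.bisect_right(self._positions, self._hash(key))
        return self._nodes[idx % len(self._nodes)]
```

The useful property is that adding a node only moves the keys that now land on the new node; every other key stays where it was, so most of the cache survives a resize.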
There is a cluster mode option. With cluster mode enabled, failover speed is much faster, as no DNS is involved. With cluster mode disabled, AWS provides a single primary endpoint and in the event of a failover, AWS does a DNS swap on that endpoint to one of the available replicas. It may take ~1–1.5min before the application is able to reach the cluster after a failure, whereas with cluster mode enabled, the election takes less than 30s.
CloudWatch can be used to monitor the usual suspects as well as evictions and replication lag (where applicable; it describes how many seconds behind the replica is)

IAM

a "user" is used to authenticate people accessing your AWS account
a "group" is many users
a "role" is used to authenticate AWS resources, for example an EC2 instance.
a "policy" is used to define the permissions for a user, group, or role.

Typical policy:

{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "1",
    "Effect": "Allow",
    "Action": "ec2:*",
    "Resource": "*"
  }]
}

This allows every action for the EC2 service, for all EC2 resources you have.

If you have multiple statements that apply to the same action, Deny overrides Allow. The following policy allows all EC2 actions except terminating EC2 instances:

{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "1",
    "Effect": "Allow",
    "Action": "ec2:*",
    "Resource": "*"
  }, 
  {
    "Sid": "2",
    "Effect": "Deny",
    "Action": "ec2:TerminateInstances", 
    "Resource": "*"
  }]
}
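The "Deny overrides Allow" rule can be sketched in a few lines of Python. This is a deliberately simplified model for building intuition, not AWS's real evaluation logic: it ignores conditions, principals, NotAction and everything else, and it matches wildcards case-sensitively with fnmatchcase. The is_allowed function is hypothetical.

```python
from fnmatch import fnmatchcase


def _as_list(value):
    return value if isinstance(value, list) else [value]


def is_allowed(policy, action, resource):
    """Simplified check: explicit Deny wins, otherwise any Allow, otherwise deny."""
    allowed = False
    for stmt in policy["Statement"]:
        action_hit = any(fnmatchcase(action, a) for a in _as_list(stmt["Action"]))
        resource_hit = any(fnmatchcase(resource, r) for r in _as_list(stmt["Resource"]))
        if not (action_hit and resource_hit):
            continue
        if stmt["Effect"] == "Deny":
            return False  # an explicit Deny overrides every Allow
        allowed = True
    return allowed  # no matching statement means implicit deny


policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Sid": "1", "Effect": "Allow", "Action": "ec2:*", "Resource": "*"},
        {"Sid": "2", "Effect": "Deny", "Action": "ec2:TerminateInstances", "Resource": "*"},
    ],
}
```

With this policy, is_allowed(policy, "ec2:StartInstances", "*") is true, the terminate action is refused by the Deny statement, and any non-EC2 action is refused because nothing allows it.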

So far "*" has meant every resource. But we can be more specific with an Amazon Resource Name (ARN):

arn:aws:ec2:us-east-1:878533e33333:instance/i-3dd4f812

How to read this:

arn - "Amazon Resource Name"
aws - partition
ec2 - service
us-east-1 - region
878... - account number
instance - resource type (if applicable)
i-3dd... - resource id
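The breakdown above can be turned into a small parser. This is a sketch for the common colon-separated layout only. The Arn tuple and parse_arn function are made-up names, and real ARNs have service-specific quirks (e.g. empty region or account fields) that this does not validate.

```python
from typing import NamedTuple, Optional


class Arn(NamedTuple):
    partition: str
    service: str
    region: str
    account: str
    resource_type: Optional[str]
    resource_id: str


def parse_arn(arn):
    """Split an ARN string into its named parts."""
    # Split at most 5 times: the resource part may itself contain colons.
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError(f"not an ARN: {arn!r}")
    _, partition, service, region, account, resource = parts
    # The resource type, when present, is separated by "/" or ":".
    for sep in ("/", ":"):
        if sep in resource:
            resource_type, resource_id = resource.split(sep, 1)
            return Arn(partition, service, region, account, resource_type, resource_id)
    return Arn(partition, service, region, account, None, resource)


parsed = parse_arn("arn:aws:ec2:us-east-1:878533e33333:instance/i-3dd4f812")
```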

WARNING: You should never copy a user's access keys to an EC2 instance; use IAM roles instead.

There are various use cases where an EC2 (or lambda etc.) instance needs to access or manage other AWS resources.

For example, an EC2 instance might need to:

Back up data to the object store S3.
Change the configuration of the private network environment in the cloud.

To be able to access the AWS API, an EC2 instance needs to authenticate itself. You could create an IAM user with access keys and store the access keys on an EC2 instance for authentication. But doing so is a hassle, especially if you want to rotate the access keys regularly. Instead of using an IAM user for authentication, you should use an IAM role whenever you need to authenticate AWS resources like EC2 instances. When using an IAM role, temporary access keys are injected into your EC2 instance and rotated automatically.

If an IAM role is attached to an EC2 instance, all policies attached to that role are evaluated to determine whether the request is allowed.

Security generally

Keep software up to date. One way, at least for OS-level packages, is to install security updates at the end of the boot process by including yum -y --security update in your user-data script. But this has the downside of making your system unpredictable compared to using fixed versions.
Give IAM users minimum privileges
Control traffic to and from resources (e.g. EC2 instances). Closing ports has other advantages too: you can protect yourself from human error. For example, you prevent a test system from accidentally sending email to customers by not opening outgoing SMTP connections on test systems.
Use private networks

VPC

They will be created within an address range. E.g. 172.31.0.0/16 means 16 bits fixed and (32 -16= 16) bits of space to play with (but 172.31 prefix is fixed). 172.31.38.0/24 means 24 bits fixed (and 8 to play with - we get this number because IPv4 has 32 bits in total)
You will need to attach an internet gateway to the VPC if you plan to connect to it via the internet
[Double check?] Within a VPC a security group rule that allows traffic from any address ('0.0.0.0/0') is OK since the machines within only have private IP addresses and access is only possible inside the VPC.
A VPC is always bound to a region. A subnet within a VPC is linked to an availability zone and a virtual machine is launched into a single subnet.
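The address-range arithmetic from the first point can be checked with Python's standard ipaddress module. This is just a worked example using the two ranges mentioned above. The variable names are arbitrary.

```python
import ipaddress

vpc = ipaddress.ip_network("172.31.0.0/16")
subnet = ipaddress.ip_network("172.31.38.0/24")

# Bits left "to play with" = 32 total IPv4 bits minus the fixed prefix length.
vpc_free_bits = 32 - vpc.prefixlen
subnet_free_bits = 32 - subnet.prefixlen

print(vpc_free_bits, vpc.num_addresses)        # 16 free bits, 2**16 addresses
print(subnet_free_bits, subnet.num_addresses)  # 8 free bits, 2**8 addresses

# The /24 sits entirely inside the /16, so it is a valid subnet of the VPC.
print(subnet.subnet_of(vpc))
print(ipaddress.ip_address("172.31.38.7") in subnet)
```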

How to debug networking issues due to security with VPC Flow Logs

Say your EC2 instance does not accept SSH traffic as you want it to, but you can’t spot any misconfiguration in your firewall rules. In this case, you should enable VPC Flow Logs to get access to aggregated log messages containing rejected connections.
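Once flow logs are flowing, finding the rejected SSH attempts is a matter of filtering records. The sketch below assumes the default flow log record format (space-separated fields ending in the action and log status). The sample records, addresses and interface ID are invented, and the helper names are hypothetical.

```python
# Field order of the default VPC Flow Logs record format.
FIELDS = [
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
]


def parse_flow_log(line):
    """Turn one space-separated flow log record into a dict of strings."""
    return dict(zip(FIELDS, line.split()))


def rejected_ssh(lines):
    """Return the records for connections to port 22 that were rejected."""
    records = (parse_flow_log(line) for line in lines)
    return [r for r in records if r["action"] == "REJECT" and r["dstport"] == "22"]


# Invented sample records for illustration only.
SAMPLE = [
    "2 123456789010 eni-0a1b2c3d 203.0.113.12 172.31.38.5 49761 22 6 1 40 1700000000 1700000060 REJECT OK",
    "2 123456789010 eni-0a1b2c3d 203.0.113.12 172.31.38.5 49762 443 6 10 5200 1700000000 1700000060 ACCEPT OK",
]

for record in rejected_ssh(SAMPLE):
    print(record["srcaddr"], "->", record["dstaddr"], "port", record["dstport"])
```

A REJECT on port 22 from your own address tells you that a security group or network ACL is blocking the traffic, which narrows down where to look.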

Options for deploying

Old school: SSH in and DIY. Does not scale. Near impossible to replicate.
Create a virtual machine and run a deployment script on startup with CloudFormation (i.e. "CloudFormation with custom scripts")
Elastic Beanstalk for deploying common web applications (e.g. Ruby, Python, Docker) from zip archives on S3
OpsWorks for complex layered applications (parts depend on each other).
OpsWorks uses Chef under the hood. Deploy with git usually.

CLI

you should probably give it its own IAM user (e.g. cli) with programmatic access and AdministratorAccess permissions.

Workflow (potentially break these out into sub-tips later)

tag everything. It allows you to later group and analyze by resource groups, separate by billing, do access control, or delete everything belonging to one project.
and when you are tagging, tag things consistently e.g. with "project:X", e.g. "project:oxbridgenotes", "project:semicolonandsons"
use as few regions as possible (otherwise massively difficult to organize and delete stuff)
use instances that can hibernate (otherwise the ephemeral state of a running server is lost)
when choosing an AMI, confirm that any functionality you need is available (e.g. snd-aloop could not be installed on Amazon's Ubuntu. I should have Googled first)

Resources

Manning book on AWS in Action: https://www.manning.com/books/amazon-web-services-in-action-second-edition
