Last Updated: 2024-09-19
Note that this article contains many ideas from the Manning book AWS in
Action.
I highly recommend you buy it.
General points
- be aware that AWS is constantly changing. Before implementing any feature, check that it is supported.
- create a billing alert to know if you are over budget
- there are online calculators that can estimate cost of services upfront
- the Amazon user ID is not the same as the canonical ID. For example, the user ID is a string
of digits, like the one in arn:aws:iam::888888888888, but the canonical ID looks like
"688888888888888888kwwsgjl67887asdf89bd41a49d22ef0c40ef7a4f03622d"
- vertical scaling (each machine is more powerful - e.g. has more CPU or RAM) vs.
horizontal scaling (more machines/nodes in a cluster). Scaling a database
vertically is the easier option in terms of technical complexity, but it gets
expensive. High-end hardware is more expensive than commodity hardware.
Besides that, at some point, you can no longer add faster hardware because
nothing faster is available.
Resilience
- High availability vs. fault tolerance: A highly available system can recover
from a failure automatically with a short downtime. A fault-tolerant system,
in contrast, provides its services without interruption
in case of a component failure.
- CloudWatch can trigger recovery automatically.
- A typical recovery when a server fails is moving to another virtual machine with the same ID,
private IP address and data available on a network-attached EBS volume.
- Some AWS services are highly available or even fault-tolerant by default. E.g.
CloudFront (CDN) and Route 53 (DNS)
- Some services use multiple availability zones within a region so they can
recover from even an availability zone outage: S3 (object store) and DynamoDB
(NoSQL database).
- EC2 has autoscaling - min/max/desired number of virtual machines (can also be
done on the basis of load). Re: load scaling - your policy might be "if CPU
usage is less than 25%, remove 1 server. If greater than 75%, add a server". Or
you might auto-scale along a schedule (e.g. a specific upcoming datetime when you
have a TV ad running; or cyclical times of day or dates of year - e.g. every December)
- By default metrics from EC2 to CloudWatch are updated every 5 minutes, but you
can make it more frequent (e.g. every minute) if it matters more to your
use-case.
- availability vs. durability. EBS is guaranteed to be available 99.999% of the
time. This has nothing to do with losing data. EBS will not lose data 99.99%
of the time (i.e. it is an order of magnitude more likely to lose data than be
unavailable)
- Snapshots of EBS taken regularly are key. Consider creating an AMI.
- Metric 1: Recovery time objective (RTO) is the time it takes for a system to recover
from a failure; it’s the length of time until the system reaches a working
state again, defined as the system service level, after an outage.
- Metric 2: Recovery point objective (RPO) is the acceptable data-loss time
caused by a failure. The amount of data loss is measured in time. If an outage
happens at 10:00 a.m. and the system recovers with a data snapshot from 09:00
a.m., the time span of the data loss is one hour.
- The most convenient way to make your system fault-tolerant is to build the
architecture using fault-tolerant blocks. If all blocks are fault-tolerant, the
whole system will be fault-tolerant as well. Fault tolerant systems include
ELB, CloudWatch, S3, SQS, autoscaling groups, DynamoDB
Unfortunately, one important service isn’t fault-tolerant by default: EC2
instances. Virtual machines aren’t fault-tolerant. This means an architecture
that uses EC2 isn’t fault-tolerant by default. But you can get around this. If one of the EC2 instances
crashes, the Elastic Load Balancer (ELB) stops routing requests to the crashed
instances and sends them to instances that are still up. The auto-scaling group
replaces the crashed EC2 instance within minutes, and the ELB begins to route
requests to the new instance.
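The load-based scaling policy mentioned earlier in this section can be sketched in a few lines. This only illustrates the decision rule and is not how Auto Scaling is implemented; the function name `desired_capacity` and its parameters are made up for the example, and the 25%/75% thresholds are the ones from the example policy.

```python
def desired_capacity(current: int, cpu_percent: float,
                     minimum: int, maximum: int,
                     low: float = 25.0, high: float = 75.0) -> int:
    """Return the new desired number of servers for one evaluation step."""
    if cpu_percent > high:
        target = current + 1      # scale out under heavy load
    elif cpu_percent < low:
        target = current - 1      # scale in when mostly idle
    else:
        target = current          # inside the comfort zone: do nothing
    # Never leave the configured min/max range.
    return max(minimum, min(maximum, target))


print(desired_capacity(current=2, cpu_percent=90.0, minimum=1, maximum=4))
```

Clamping to min/max is what keeps a runaway metric from scaling the group to zero or to an unaffordable size.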
Choosing a region
Factors to consider
- latency - how far to customers
- availability of desired AWS services - not everything is available in every region. Use online tools to determine this.
- legal issues - e.g. are you allowed to store data in country Y?
- where your other AWS infrastructure is based. E.g. with DynamoDB, no additional traffic charges apply if you access DynamoDB from EC2 instances in the same region.
Each region consists of multiple availability zones (AZs). You can think of an
AZ as an isolated group of data centers, and a region as an area where multiple
availability zones are located at a sufficient distance. The region us-east-1
consists of six availability zones (us-east-1a to us-east-1f), for example. The
availability zone us-east-1a could be one data center, or many (this is not
public info)
The AZs are connected through low-latency links, so requests between different
availability zones aren’t as expensive as requests across the internet in terms
of latency. The latency within an availability zone (such as from an EC2
instance to another EC2 instance in the same subnet) is lower compared to
latency across AZs.
CloudFormation
- CloudFormation creates an entire architecture (e.g. WordPress on two EC2 instances with an RDS MySQL database and an elastic load balancer) from a template (find them online/create your own)
- a CloudFormation template contains parameters (e.g. to customize aspects like the DB name when the template is run), resources (AWS services), and outputs.
- before starting, import your private key in EC2 > Key Pairs. That means any EC2 instances you create will be accessible via SSH.
- There is an outputs section that shows stuff like URLs for visiting your newly deployed site
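To make the parameters/resources/outputs structure concrete, here is a minimal sketch that builds a template as a Python dictionary and prints it as JSON. The logical names (`KeyName`, `Server`, `PublicName`) and the AMI placeholder are assumptions chosen for illustration; a real template needs a valid AMI ID for your region.

```python
import json

# Hypothetical template: logical names are placeholders, and the AMI ID
# below must be replaced with a real one before the template can be used.
template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Description": "Single EC2 instance (illustrative sketch)",
    "Parameters": {
        "KeyName": {
            "Type": "AWS::EC2::KeyPair::KeyName",
            "Description": "Key pair imported under EC2 > Key Pairs",
        }
    },
    "Resources": {
        "Server": {
            "Type": "AWS::EC2::Instance",
            "Properties": {
                "ImageId": "ami-REPLACE_ME",
                "InstanceType": "t2.micro",
                "KeyName": {"Ref": "KeyName"},  # value supplied at run time
            },
        }
    },
    "Outputs": {
        "PublicName": {
            "Description": "Public DNS name of the server",
            "Value": {"Fn::GetAtt": ["Server", "PublicDnsName"]},
        }
    },
}

template_json = json.dumps(template, indent=2)
print(template_json)
```

The parameter is referenced from the resource with `Ref`, and the output reads an attribute of the created resource with `Fn::GetAtt`, which is how values such as URLs end up in the outputs section.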
S3
- Object store (i.e. manages data as objects with a GUID and metadata (size, owner, content type), as opposed to other storage architectures like file systems which manages data as a file hierarchy, and block storage which manages data as blocks within sectors and tracks.)
- Glacier - similar but cheaper. Downside is it takes up to 5 hours to access data (instead of near immediate with standard S3). You might initially process raw data into a relational form on S3 then move it to Glacier for archiving purposes since it is unlikely (but not impossible) you will need it again. It also has expedited retrieval (more expensive but only 5 minutes)
- movement to glacier can be controlled as a rule at the bucket level (e.g. after N days)
- to give read only access to every item in a bucket you must use the wildcard in the policy - e.g. "Resource":["arn:aws:s3:::$BucketName/*"]
- you can host a static website there by running
$ aws s3 website s3://$BucketName --index-document helloworld.html
and setting a CNAME record on your domain pointing to the bucket's endpoint
- S3 used to be eventually consistent for overwrites and deletes, which meant you might read stale data for a short period of time after changing an object. Since December 2020 S3 provides strong read-after-write consistency, so this caveat mainly matters when reading older material or code written to work around it.
For maximum performance, choose keys with even distribution of characters in their prefixes. In S3, keys are stored in alphabetical order in an index. The key name determines which partition the key is stored in. If your keys all begin with the same characters, this will limit the I/O performance of your S3 bucket. Thus names like "image1.png", "image2.png" will underperform names like "ffsa-image1.png", "abaw-image2.png" that have an MD5 hash of the original key as a prefix.
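A minimal sketch of the hashed-prefix idea described above, using only the standard library. The helper name `hashed_key` and the four-character prefix length are assumptions for illustration. AWS has since raised per-prefix request limits, so check current guidance before adopting this.

```python
import hashlib


def hashed_key(original_key: str, prefix_length: int = 4) -> str:
    """Prefix a key with the first characters of the MD5 hash of the key."""
    digest = hashlib.md5(original_key.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_length]}-{original_key}"


for name in ("image1.png", "image2.png"):
    print(hashed_key(name))
```

Because the prefix is derived from the key itself, a reader can recompute the stored key from the original name without a lookup table.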
Using a slash (/) in the key name acts like creating a folder for your object. If you create an object with the key folder/object.png, the folder will become visible as a folder if you’re browsing your bucket with a GUI like the Management Console, for example. But technically, the key of the object is still folder/object.png.
EC2
- EC2 gives you virtual machines, i.e. multiple instances get run as guests on a host machine and are separated by software by a hypervisor (e.g. Xen). Increasingly the separation is assisted with hardware.
- We use virtual appliances to speed up creation of preconfigured OS's. These are images of a virtual machine containing an OS and preconfigured software. They eliminate the cost of installing afresh every time.
- An AMI is a special type of virtual appliance for use with the EC2 service. An AMI technically consists of a read-only filesystem including the OS, additional software, and configuration; it doesn’t include the kernel of the OS. The kernel is loaded from an Amazon Kernel Image (AKI).
- When choosing between EC2 instance types, consider whether the program you are running will be single or multi-threaded. E.g. Redis is single-threaded and will not use all cores in a many-core instance, so don't waste your money!
- Which OS image should you choose? Amazon Linux is a fine bet as it is maintained by and optimized for AWS.
- How to read "t2.micro"? Class = "t" (small cheap virtual machines with low baseline performance but ability to burst), generation 2, size of micro.
- People tend to overestimate the instance size they'll need so start small and work up.
- System log of boot available through AWS console if issues with start up.
- It is possible to assign extra network interfaces. There could be many reasons for this. One is allowing access via multiple different public IP addresses, thereby letting you host multiple different websites. You’ll need to configure the web server to deliver different websites depending on the IP address. Your virtual machine doesn’t know anything about its public IP address, but you can distinguish the requests based on the private IP address. These start with either 10, 172 or 192 and can be determined with
$ ifconfig
on your EC2 instance.
- Reserved instances: you have to pay whether it is running or not. May be much cheaper because you can do long-term contracts.
- Spot instances: basically unused capacity. Great for running long batch jobs (e.g. AI, encoding media files). Can be as low as 10% of the normal price! If the current spot price exceeds your bid, your VM will be terminated (not stopped) by AWS after two minutes.
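The naming scheme from the "t2.micro" bullet above can be expressed as a small parser. This is a sketch under the assumption that names follow the simple `<class letters><generation><optional extras>.<size>` shape; the function name and the returned field names are invented for the example, and unusual names that do not fit the pattern are rejected rather than guessed at.

```python
import re

# class letters, generation digits, optional extra letters, then ".size"
INSTANCE_TYPE = re.compile(r"^([a-z]+)(\d+)([a-z-]*)\.([a-z0-9-]+)$")


def parse_instance_type(name: str) -> dict:
    """Split an instance type name into class, generation, extras and size."""
    match = INSTANCE_TYPE.match(name)
    if match is None:
        raise ValueError(f"unrecognised instance type: {name!r}")
    family, generation, extras, size = match.groups()
    return {
        "class": family,
        "generation": int(generation),
        "extras": extras,
        "size": size,
    }


print(parse_instance_type("t2.micro"))
```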
EC2 Instance Store
- An instance store provides block-level storage directly attached to the machine hosting your VM. The instance store is part of an EC2 instance and available only if your instance is running; it won’t persist your data if you stop or terminate the instance.
- You don’t pay separately for an instance store; instance store charges are included in the EC2 instance price.
- Use it for caching, temporary processing, or applications that replicate data to several servers, as some databases do.
Elastic IPs
- When you modify an EC2 instance (e.g. increase its size), it gets a random, differing IP address. To avoid this you can allocate a fixed IP with Elastic IPs service.
- This is found in sub-menu sidebar of EC2.
- Essentially you create one or more fixed IP addresses and associate them with particular instances of EC2.
- Another advantage of elastic IPs is that you can provision another machine to replace an old one and once it's ready do a quick switcheroo.
Elastic Block Store
- A block is a sequence of bytes, and the smallest addressable unit. The OS is the intermediary between the application that wants to access files and the underlying file system and block-level storage. The OS provides access to block-level storage via open, write, and read system calls.
- If you are migrating older systems (e.g. MySQL) to AWS, these expect a classical block-level file system instead of an object store, so S3 is not an option.
- EBS is a persistent (as opposed to temporary) block store and it has built-in replication
- An EBS volume is separate from an EC2 instance and connected over the network. If you terminate your EC2 instance, the EBS volumes therefore remain.
- WARNING: You can’t attach the same EBS volume to multiple virtual machines! This use-case requires a network filesystem.
- Usually called "Volumes" in CloudFormation nomenclature.
- EBS volumes are charged based on the size of the volume, no matter how much data you store in the volume.
- EBS offers an optimized, easy-to-use way to back up EBS volumes with EBS snapshots. A snapshot is a block-level incremental backup that is stored in S3.
- Creating a snapshot of an attached, mounted volume is possible, but can cause problems with writes that aren’t flushed to disk. You should either detach the volume from your instance or stop the instance before creating the snapshot.
Manual setup of EBS on an EC2 instance
On an EC2 instance you can see the attached EBS volumes using sudo fdisk -l. Usually,
EBS volumes can be found somewhere in the range of /dev/xvdf to /dev/xvdp.
The root volume (/dev/xvda) is an exception: it's based on the AMI you choose
when you launch the EC2 instance, and contains everything needed to boot the
instance (your OS files).
The first time you use a newly created EBS volume, you must create a filesystem on that device:
$ sudo mkfs -t ext4 /dev/xvdf
After the filesystem has been created, you can mount the device:
$ sudo mkdir /mnt/volume/
$ sudo mount /dev/xvdf /mnt/volume/
To see mounted volumes, use df -h.
To save a shared file, put it in the volume you mounted - e.g. sudo touch /mnt/volume/testfile
Elastic File System
- network file system (NFS), thus making uploading of files (say from Wordpress) available on many EC2 instances at once (as opposed to on just one instance - as per EBS limitations)
- in a typical simple web app set-up, this would include PHP, HTML, CSS, PNG etc. files
- the data on the EFS filesystem is replicated among multiple data centers and
remains available even if a whole data center suffers from an outage, which is
not true for EBS and instance stores. This means hardware issues are unlikely to cost you data,
but human error (e.g. rm -rf /) is still possible, so you should back up to S3
or sync a snapshot to EBS from time to time anyway.
- Mount Targets are used to mount the EFS on your virtual machines. You should
have at least two for redundancy. These will have different IP addresses.
- EFS mount targets provide an endpoint for EC2 instances to mount an EFS in a
subnet (VPC)
- Charged in GB per month
- The EC2 instance communicates with the mount target via a TCP/IP network
connection via the NFSv4.1 protocol. A security group is used to control/allow
traffic (often on port 2049)
Use-cases for EFS
- you can apply the same mechanism to share files between a fleet of web servers (for example, the /var/www/html folder)
- a highly available Jenkins server (such as /var/lib/jenkins).
- Making sure certain admin users' home directories that contain tooling
(/home/USERNAME) are available on every instance so these admins can work
effectively. To solve this problem, create a filesystem and mount EFS on each
EC2 instance under /home. The home directories are then shared across all your
EC2 instances, and users will feel at home no matter which VM they log in to.
Elastic Load Balancer (ELB)
- typical use case is to forward requests to one of your virtual machines. This is an example of synchronous decoupling (Simple Queue Service is an example of asynchronous decoupling)
- Instead of exposing your EC2 instances (running web servers) to the outside
world, you only expose the load balancer to the outside world. This is very
helpful since your clients often get an IP address in their system that they
cannot easily change. If you were routing external traffic directly to an EC2
instance you'd have a problem. And you can't really rely on DNS either, since
the TTL is not always obeyed by caches.
- Performs health checks to ensure requests are forwarded to healthy machines only
- if scheme is "internet facing", it will be accessible via the internet thanks
to config in a DNS record (possibly created with CloudFormation)
- connects to a "target group" that includes the various resources to be load balanced (e.g. two EC2 instances)
- can have "listener rules" - choose a different target group based on the HTTP path or host (e.g. "if path starts with /api/* send to target group 2"). Otherwise requests are forwarded to the default target group defined in the listener.
- AWS offers different types of load balancers through the Elastic Load Balancing (ELB) service. All load balancer types are fault-tolerant and scalable. They differ mainly in the protocols they support:
- Application Load Balancer (ALB)—HTTP, HTTPS
- Network Load Balancer (NLB)—TCP
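The listener-rule behaviour described above (match on the path, otherwise fall back to the default target group) can be sketched as follows. This is a toy model of the routing decision only; the function name and the target group names are placeholders, and real rules also support host-based conditions and priorities.

```python
def choose_target_group(path: str, rules: list, default: str) -> str:
    """Return the target group of the first rule whose prefix matches."""
    for prefix, target_group in rules:
        if path.startswith(prefix):
            return target_group
    return default  # no rule matched: use the listener's default group


rules = [("/api/", "target-group-2")]
print(choose_target_group("/api/users", rules, "target-group-1"))
print(choose_target_group("/index.html", rules, "target-group-1"))
```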
Simple Queue Service (SQS)
- Enables asynchronous decoupling: you can communicate without both sides being available at the same time, as is required by synchronous systems.
- Serves as a buffer, helping when the rates of production and consumption of requests are not equal
- SQS offers message queues that guarantee the delivery of messages at least
once. The problem of repeated delivery of a message can be solved by making
the message processing idempotent. Idempotent means that no matter how often
the message is processed, the result stays the same. In the example of a
service that converts a webpage to PNG, this is true by design: if you process the message multiple times, the same
image will be uploaded to S3 multiple times. If the image is already
available on S3, it’s replaced. How do you achieve idempotency when working with third-party
services (e.g. posting onto a Twitter feed)? One way would be to query Twitter
within the same job before posting the tweet. Issue: Twitter is eventually
consistent. Therefore a very recent matching tweet might be missed and you'll
end up posting the same thing twice or more. Ultimately you need to make a business choice:
tolerate a missing status update, or tolerate multiple status updates...or
tolerate slowness (since you might give the third-party service enough time to become
consistent before taking action)
- SQS doesn’t guarantee the order of messages, so you may read messages in a
different order than they were produced. If you need a stable message order,
you’ll have difficulty finding a solution that scales like SQS. Our advice is to
change the design of your system so you no longer need the stable order, or put
the messages in order on the client side. Or look at SQS FIFO queues which
guarantee order of messages and detect duplicates.
- Typically the user request will be handled by a fast part of your web-app
(e.g. a nodejs server) and queue something up (e.g. generation of a screenshot
from a URL) and return something to the client early (e.g. URL to check
periodically for the final PNG screenshot)
- The consumer of the messages just polls the queue
- If a message removed from the queue is not marked as processed (e.g. through
explicit deletion in SQS) before the VisibilityTimeout expires, the message will
be delivered back to the queue. This
architecture prevents broken parts of the system from losing messages.
- Another advantage: you can add as many workers as you like independent of
producers.
- SQS does not replace a message broker like ActiveMQ. It has no message
priorities or message routing.
- When designing an async process, it is important to keep track of it so you'll
need some kind of identifier. The client can do a look-up at that ID. Before
the work is done it will give either a status report (say in JSON) or some
fallback (e.g. the unprocessed image within an async flow to turn an image to
a sepia colored variant)
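The interplay of at-least-once delivery, the visibility timeout and idempotent processing can be simulated without AWS. The `TinyQueue` class below is an in-memory stand-in invented for this sketch (it is not the SQS API), time is passed in explicitly as `now`, and `render` pretends to create the PNG by writing to a dictionary that plays the role of the S3 bucket.

```python
class TinyQueue:
    """In-memory stand-in for a queue with a visibility timeout (not SQS)."""

    def __init__(self, visibility_timeout: int):
        self.visibility_timeout = visibility_timeout
        self.messages = {}         # message id -> body
        self.invisible_until = {}  # message id -> time it becomes visible again

    def send(self, message_id: str, body: str) -> None:
        self.messages[message_id] = body
        self.invisible_until[message_id] = 0

    def receive(self, now: int):
        for message_id, body in self.messages.items():
            if self.invisible_until[message_id] <= now:
                # Hide the message from other consumers for a while.
                self.invisible_until[message_id] = now + self.visibility_timeout
                return message_id, body
        return None

    def delete(self, message_id: str) -> None:
        self.messages.pop(message_id, None)
        self.invisible_until.pop(message_id, None)


def render(url: str, store: dict) -> None:
    """Idempotent job: the same URL always overwrites the same key."""
    store[url + ".png"] = "image of " + url


queue = TinyQueue(visibility_timeout=30)
queue.send("m1", "https://example.com")
store = {}

message_id, url = queue.receive(now=0)
render(url, store)                       # worker crashes before deleting
hidden = queue.receive(now=10)           # still invisible: nothing returned
message_id, url = queue.receive(now=31)  # redelivered after the timeout
render(url, store)                       # processed a second time
queue.delete(message_id)                 # explicit delete marks it as done
print(store)
```

The message is processed twice, yet the store ends up with a single image, which is the property that makes duplicate delivery harmless.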
Elastic Beanstalk
Features:
- provides runtime for environments (e.g. Python, node, Ruby, docker)
- updates OS etc. so you don't have to think about it
- scales web application
- monitors web application
Nevertheless, it still gives you a virtual machine you can log in to for
debugging.
Relational Database Service: RDS
- supports all the big db names (e.g. PostgreSQL, MySQL). AWS offers its own engine called Amazon Aurora, which is MySQL- and PostgreSQL-compatible. If your application supports MySQL or PostgreSQL, the migration to Amazon Aurora is easy.
- Aurora is special in that it does not store data on a single EBS volume so much as
a cluster volume (i.e. it stores data on multiple disks so has no single point of failure)
- RDS has backups (configure retention period up to 35 days), patch management, and high availability SQL databases
- Performance impact of snapshot backup: Creating a snapshot requires all disk activity to be briefly frozen. Requests to the database may be delayed or even fail because of a time out, so we recommend that you choose a time frame for the snapshot that has the least impact on applications and users (for example, late at night). You’d need considerable time and know-how to build a comparable relational database environment based on virtual machines, so we recommend using Amazon RDS for relational databases whenever possible to decrease operational costs
- Typically you would create a security group for your RDS instance (e.g. with
MySQL allow in/out traffic on port 3306) and give access only to EC2 machines
that strictly need to talk to the DB.
- RDS has a high-availability (HA) option. This means there is a master and a
standby instance that replicates data. If the master fails, the standby takes over using DNS resolution and without human intervention. You pay for both. The authors strongly recommend
using high-availability deployment for all databases that handle production
workloads. The master and standby are in different data centers ("availability
zones"), therefore this feature is called Multi-AZ
- I/O performance is important in certain DB loads. If you need to guarantee a
high level of read or write throughput, you should use provisioned IOPS (SSD)
- Option of read-only replication: A database suffering from too many read
requests can be scaled horizontally by adding additional database instances
for read replication. Changes to the database are asynchronously replicated to an additional read-only database instance. The
read requests can be distributed between the master database and its
read-replication databases to increase read throughput. These read-only dbs
can be promoted to primary DB if needed.
- Even though RDS is managed, you still need to monitor storage space, RAM, and CPU utilization to figure out how to scale.
DynamoDB
- Scaling a traditional, relational database horizontally is difficult because
transactional guarantees (atomicity, consistency, isolation, and durability,
also known as ACID) require communication among all nodes of the database
during a two-phase commit. A simplified two-phase commit with two nodes works like this:
- A query is sent to the database cluster that wants to change data (INSERT, UPDATE, DELETE).
- The database transaction coordinator sends a commit request to the two nodes.
- Node 1 checks if the query could be executed. The decision is sent back to the coordinator. If the node decides yes, it must fulfill this promise. There is no way back.
- Node 2 checks if the query could be executed. The decision is sent back to the coordinator.
- The coordinator receives all decisions. If all nodes decide that the query could be executed, the coordinator instructs the nodes to finally commit.
- Nodes 1 and 2 finally change the data. At this point, the nodes must fulfill the request. This step must not fail.
The problem is that the more nodes you add, the slower your database becomes, because more nodes must coordinate transactions between each other. The way to tackle this has been to use databases that don’t adhere to these guarantees. They’re called NoSQL databases.
- Therefore one use-case is where horizontal scaling with relational databases becomes a pain in the ass or too slow.
- There are four types of NoSQL databases (document, graph, columnar, and key-value store), each with its own uses and applications. DynamoDB is a key-value store with document support.
- Big advantage is that no action needed for provisioning more storage (i.e. just like s3 keeps growing vs. standard mysql-type use-case where you have to say "I want 200 GB extra now")
- each table has a name and organizes a collection of items. Each item is a
collection of attributes, where each attribute is a key-value pair whose value may be
scalar, multivalued (e.g. string or binary set) or a JSON document (object,
array). It has no enforced schema.
- best practice is to prefix your table names with the names of your application.
- here the UID is used as the partition key and TID (task id) is used as the sort key
["michael", 1] => {
"uid": "michael",
"tid": 1,
"description": "prepare lunch"
}
["michael", 2] => {
"uid": "michael",
"tid": 2,
"description": "prepare talk for conference"
}
- note that while there is order in the sort key (the second key), there is no
order in the partition key (the first key)
- DynamoDB lets you retrieve changes to a table as soon as they’re made. A
stream provides all write (create, update, delete) operations to your table
items. The order is consistent within a partition key. Streams are used in
place of polling the DB for changes or populating caches with changes made to
a table.
- Global secondary index. Imagine a table of users where each user has a country attribute. You then create a global secondary index where the country is the new partition key. Imagine this as a read-only DynamoDB table (a "projection") that is automatically maintained. Be careful: this is only eventually consistent. A global secondary index comes at a price: the index requires storage (the same cost as for the original table). You must provision additional write-capacity units for the index as well, because a write to your table will cause a write to the global secondary index as well.
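The partition key and sort key described earlier in this section can be modelled with nested dictionaries. `TinyTable` is a toy invented for this sketch, not the DynamoDB API; it hard-codes `uid` as the partition key and `tid` as the sort key to match the example items.

```python
from collections import defaultdict


class TinyTable:
    """Toy model of a table with a partition key (uid) and sort key (tid)."""

    def __init__(self):
        self.partitions = defaultdict(dict)  # uid -> {tid: item}

    def put(self, item: dict) -> None:
        self.partitions[item["uid"]][item["tid"]] = item

    def query(self, uid: str) -> list:
        """Return all items of one partition, ordered by the sort key."""
        partition = self.partitions.get(uid, {})
        return [partition[tid] for tid in sorted(partition)]


table = TinyTable()
table.put({"uid": "michael", "tid": 2, "description": "prepare talk for conference"})
table.put({"uid": "michael", "tid": 1, "description": "prepare lunch"})
print([item["tid"] for item in table.query("michael")])
```

Items come back ordered within a partition regardless of insertion order, while nothing is implied about the order of the partitions themselves.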
Security Group
- Networking revision: 0.0.0.0/0 means any IP address, which you might allow for SSH access to your single-server web app if you administer it from home
- an example security group rule would be SSH only (port 22, any IP address, TCP protocol)
- Another typical one, if you run a web server, is that the only other ports you need to open to the outside world are port 80 for HTTP traffic and 443 for HTTPS traffic. Close down all the other ports!
It is possible to control network traffic based on whether the source or
destination belongs to a specific security group. For example, you can say that
a MySQL database can only be accessed if the traffic comes from your web
servers, or that only your proxy servers are allowed to access the web servers.
Because of the elastic nature of the cloud, you’ll likely deal with a dynamic
number of virtual machines, so rules based on security groups scale better than
those based on IP addresses etc.
This wasn't mentioned in the book, but the examples gave me the impression that
it is more important to limit inbound ports and IPs. Many examples did nothing with outbound.
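The inbound rules discussed in this section boil down to "allow only what matches a rule". Below is a minimal sketch of that check using the standard `ipaddress` module. The rule dictionaries and the function name are an invented representation, not the AWS data model, and the addresses come from documentation-reserved ranges.

```python
import ipaddress


def is_allowed(rules: list, source_ip: str, port: int, protocol: str = "tcp") -> bool:
    """Return True if any inbound rule permits this traffic."""
    address = ipaddress.ip_address(source_ip)
    for rule in rules:
        if (rule["protocol"] == protocol
                and rule["from_port"] <= port <= rule["to_port"]
                and address in ipaddress.ip_network(rule["cidr"])):
            return True
    return False  # anything not explicitly allowed is rejected


web_server_rules = [
    {"protocol": "tcp", "from_port": 22, "to_port": 22, "cidr": "203.0.113.0/24"},
    {"protocol": "tcp", "from_port": 80, "to_port": 80, "cidr": "0.0.0.0/0"},
    {"protocol": "tcp", "from_port": 443, "to_port": 443, "cidr": "0.0.0.0/0"},
]
print(is_allowed(web_server_rules, "198.51.100.7", 443))  # HTTPS from anywhere
print(is_allowed(web_server_rules, "198.51.100.7", 22))   # SSH from outside the range
```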
Jump Box concept
- only one virtual machine, the "bastion host", can be accessed via SSH from the internet
- all others must be accessed via SSH from this host
- means that if a smaller component, like a mail server, is compromised, then it does not have access to your entire system
To implement the concept of a bastion host, you must follow these two rules:
- Allow SSH access to the bastion host from 0.0.0.0/0 or a specific source address.
- Allow SSH access to all other virtual machines only if the traffic source is the bastion host.
It’s important that the bastion host does nothing but SSH, to reduce the chance of it becoming a security problem.
Use ssh -A to enable agent forwarding when you SSH into your jump box.
CloudTrail
- Generates an event for every AWS API call (e.g., launch EC2 instance)
CloudWatch
Consists of the following:
- metrics - (watches various metrics - network usage, disk usage, number of function invocations)
- alarms - creates alarms when metrics cross certain thresholds
- logs
- events - Whenever something changes in your infrastructure, an event is
generated in near real-time. For example, CloudTrail emits an event for every
call to the AWS API. AWS emits an event to notify you of service degradations
or downtimes.
Typical alarm: You might set up an alarm to trigger if the 10-minute average of
the CPUUtilization metric is higher than 80% for 1 out of 1 data points, and if
the 10-minute average of the SwapUsage metric is higher than 67108864 (64 MB)
for 1 out of 1 datapoints.
From queueing theory, utilization over about 80% is problematic since wait time
rises steeply and non-linearly with the utilization of a resource. This applies to CPUs, hard disks,
and cashiers at a help desk alike. It occurs basically because not all requests for the
resource happen at convenient, even times - i.e. they are bursty. In other
words, when you go from 0% utilization to 60%, wait time doubles. When you go to
80%, wait time has tripled. When you go to 90%, wait time is six times higher. And
so on. So if your wait time is 100 ms at 0% utilization, you already have a
300 ms wait time at 80% utilization, which is already slow for an e-commerce
web site.
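The figures quoted above are close to what a simple single-server queue predicts. The source does not name a model, so as an assumption this sketch uses the M/D/1 queue (random arrivals, fixed service time), where the time spent in the system relative to the bare service time is 1 + u / (2 * (1 - u)) for utilization u.

```python
def slowdown(utilization: float) -> float:
    """Time in system relative to service time for an M/D/1 queue (assumed model)."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return 1 + utilization / (2 * (1 - utilization))


for rho in (0.0, 0.6, 0.8, 0.9):
    print(f"{rho:.0%} utilization -> {slowdown(rho):.2f}x")
```

Under this assumption the factors come out at roughly 1.75, 3 and 5.5 for 60%, 80% and 90%, in line with the "doubles, triples, six times" rule of thumb; the curve keeps steepening as utilization approaches 100%.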
Amazon API Gateway
- Offers a scalable and secure REST API that accepts HTTPS requests from your
web application’s front-end or your mobile application.
Lambda
- unlike EC2 it scales automatically and is highly available and fault tolerant by default
- you can measure failures of a function call in CloudWatch.
- if you need to access other resources, assign your lambda a role (e.g. one that
allows it to putLogEvents on CloudWatch). Temporary credentials are generated
based on the IAM role and injected into each invocation via environment
variables (such as AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and
AWS_SESSION_TOKEN). Those environment variables are used by the AWS SDK to
sign requests automatically.
- you can pass ENV variables into the lambda and access them from the
programming language environment
- in lambda > monitoring you can see how often a function was executed
- by default the lambda will log to CloudWatch stream /aws/lambda/{NAME_OF_LAMBDA} (after a few mins delay)
- you can have them as scheduled events (e.g. every 5 minutes)
- Be careful of how placing a lambda in a VPC can limit invocations. Each parallel invocation requires an IP address and you may run out.
- To deploy a Lambda function, you need to upload the deployment package to S3. The deployment package is a zip file including your code as well as additional modules
- Warning: Each invocation of your Lambda function needs to complete within the configured timeout, which can be at most 900 seconds (15 minutes); the limit was 300 seconds when the book was written.
- Starting a new execution context requires AWS Lambda to download your code,
initialize a runtime environment, and load your code. This process is called
a cold-start. Depending on the size of your deployment package, the runtime
environment, and your configuration, a cold-start could take from a few
milliseconds to a few seconds. Therefore, applications with very strict
requirements concerning response times are not good candidates for AWS
Lambda.
- Another limitation is the maximum amount of memory you can provision for a
Lambda function: 10,240 MB (it was 3008 MB when the book was written). If your
Lambda function uses more memory, its execution will be terminated.
- Common use-case is to combine with API Gateway and form the backend for a web application.
- Another use-case is the Internet of Things: a message broker receives input data from a "thing", and rules call the lambda if certain conditions are met.
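For orientation, this is roughly what a Python function for Lambda looks like: a handler that receives the trigger payload as `event`, reads configuration from environment variables, and returns a response. The variable name `GREETING`, the event field `name` and the response shape (the kind typically used behind API Gateway) are assumptions for this sketch.

```python
import json
import os


def handler(event, context):
    """Entry point; Lambda passes the trigger payload as `event`."""
    greeting = os.environ.get("GREETING", "hello")  # hypothetical variable
    name = event.get("name", "world")
    message = f"{greeting} {name}"
    print(message)  # standard output ends up in the function's CloudWatch logs
    return {"statusCode": 200, "body": json.dumps({"message": message})}


# Local smoke test; in Lambda the service calls handler() for you.
result = handler({"name": "aws"}, None)
print(result)
```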
ElastiCache
- offers managed in-memory database systems like Redis or Memcached.
- Redis offers richer data types, transactions and authentication, whereas Memcached offers only simple data types and neither of the other two (but is multithreaded). Both support sharding.
Example uses-cases:
- Free up relational DB from heavy read load by caching some frequently accessed data (e.g. about a level in a multiplayer game). Since relational DBs tend to be pricey, adding this caching layer can be cheaper.
- provide sorted lists that change often (e.g. rank of a player in multiplayer games)
Downside: you can lose the cached data due to restart or hardware failure (though Redis has optional failover support)
Many people compress their data before placing it in the cache (usually zlib). This can cut memory and network usage costs by 25%.
Cache population methods:
- Cron job that updates it every minute.
- On-demand (when a relevant request is made, say for the leaderboard, and there is no cache entry present, e.g. because the TTL has made it expire)
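The on-demand method could look roughly like the sketch below. `TTLCache` is a tiny in-memory stand-in for Redis/Memcached, and `get_leaderboard` and `load_from_db` are hypothetical names, so treat this as an illustration of the pattern rather than client code for ElastiCache:

```python
import time


class TTLCache:
    """In-memory stand-in for a cache server (illustration only)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._store = {}  # key -> (expires_at, value)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # The TTL has passed: behave as if the entry was never there.
            del self._store[key]
            return None
        return value

    def set(self, key, value, ttl_seconds):
        self._store[key] = (self._clock() + ttl_seconds, value)


def get_leaderboard(cache, load_from_db, ttl_seconds=60):
    """Return the cached leaderboard, loading it on a cache miss."""
    value = cache.get("leaderboard")
    if value is None:
        value = load_from_db()  # the expensive query happens only on a miss
        cache.set("leaderboard", value, ttl_seconds)
    return value
```

With this approach the first request after expiry pays the cost of the database query, whereas the cron approach keeps every request fast at the price of refreshing data nobody may be asking for.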
How to choose TTL? Consider the effects on ALL PARTIES INVOLVED (e.g. content producer, content consumer, your legal team) when deciding. Consider too whether you have any other options for cache invalidation.
How might you decide the cache key for an SQL query? Take the md5 of the whole thing: md5(SELECT id, nick FROM player ORDER BY score DESC LIMIT 10)
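In Python, the md5 idea might be sketched like this. `cache_key` is a made-up helper name; md5 is used here only to get a short, fixed-length key, not for anything security-related:

```python
import hashlib


def cache_key(sql: str) -> str:
    """Derive a fixed-length cache key from the full text of a query."""
    return hashlib.md5(sql.encode("utf-8")).hexdigest()


query = "SELECT id, nick FROM player ORDER BY score DESC LIMIT 10"
key = cache_key(query)
print(key)  # 32 hexadecimal characters
```

Note that two queries differing only in whitespace or letter case produce different keys, so you may want to normalize the query text before hashing.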
What is sharding? If the data does not fit on a single node, you add nodes to increase capacity and split the data between them. This typically uses a consistent hashing algorithm, which arranges keys into partitions on a ring distributed across the nodes.
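To make the ring idea concrete, here is a generic consistent-hashing sketch. It is not the exact scheme any particular engine or client library uses; the `HashRing` class, the node names and the choice of 100 virtual points per node are all assumptions for illustration:

```python
import bisect
import hashlib


class HashRing:
    """Consistent hashing: each node owns many points on a ring of hash values."""

    def __init__(self, nodes, vnodes=100):
        ring = []
        for node in nodes:
            for i in range(vnodes):
                # Several virtual points per node spread the load more evenly.
                ring.append((self._hash(f"{node}#{i}"), node))
        ring.sort()
        self._positions = [position for position, _ in ring]
        self._owners = [node for _, node in ring]

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def node_for(self, key: str) -> str:
        """Walk clockwise from the key's position to the next node point."""
        index = bisect.bisect_right(self._positions, self._hash(key))
        return self._owners[index % len(self._owners)]  # wrap around the ring


ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("player:42"))
```

The point of the ring is that adding a node only moves the keys that now fall to the new node; everything else stays where it was, so most of the cache survives a resize.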
There is a cluster mode option. With cluster mode enabled, failover speed is much faster, as no DNS is involved. With cluster mode disabled, AWS provides a single primary endpoint and in the event of a failover, AWS does a DNS swap on that endpoint to one of the available replicas. It may take ~1–1.5min before the application is able to reach the cluster after a failure, whereas with cluster mode enabled, the election takes less than 30s.
CloudWatch can be used to monitor the usual suspects as well as evictions and replication lag (where applicable, it describes how many seconds behind the replication is)
IAM
- a "user" is used to authenticate people accessing your AWS account
- a "group" is many users
- a "role" is used to authenticate AWS resources, for example an EC2 instance.
- a "policy" is used to define the permissions for a user, group, or role.
Typical policy:
{
"Version": "2012-10-17",
"Statement": [{
"Sid": "1",
"Effect": "Allow",
"Action": "ec2:*",
"Resource": "*"
}]
}
This allows every action for the EC2 service, for all EC2 resources you have.
If you have multiple statements that apply to the same action, Deny overrides Allow. The following policy allows all EC2 actions except terminating EC2 instances:
{
"Version": "2012-10-17",
"Statement": [{
"Sid": "1",
"Effect": "Allow",
"Action": "ec2:*",
"Resource": "*"
},
{
"Sid": "2",
"Effect": "Deny",
"Action": "ec2:TerminateInstances",
"Resource": "*"
}]
}
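The "Deny overrides Allow" rule can be illustrated with a deliberately simplified evaluator. This is only a sketch: real IAM evaluation also considers conditions, principals, NotAction, permission boundaries and more, and `is_allowed` is a made-up function, not part of any AWS SDK:

```python
from fnmatch import fnmatchcase

# The policy from above, as a Python dict.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Sid": "1", "Effect": "Allow", "Action": "ec2:*", "Resource": "*"},
        {"Sid": "2", "Effect": "Deny", "Action": "ec2:TerminateInstances", "Resource": "*"},
    ],
}


def _as_list(value):
    # Action and Resource may be a single string or a list of strings.
    return [value] if isinstance(value, str) else list(value)


def _matches(patterns, value, ignore_case=False):
    for pattern in _as_list(patterns):
        candidate = value
        if ignore_case:
            pattern, candidate = pattern.lower(), value.lower()
        if fnmatchcase(candidate, pattern):  # "*" wildcard matching
            return True
    return False


def is_allowed(policy, action, resource):
    """Default deny; any matching Deny wins over any matching Allow."""
    allowed = False
    for statement in policy["Statement"]:
        if not _matches(statement["Action"], action, ignore_case=True):
            continue
        if not _matches(statement["Resource"], resource):
            continue
        if statement["Effect"] == "Deny":
            return False  # explicit deny: stop immediately
        allowed = True
    return allowed
```

Note the third case this makes visible: an action that no statement mentions (say, an S3 call) is also refused, because the starting point is always "deny".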
So far "*" has meant every resource. But we can be more specific with an Amazon
Resource Name (ARN):
arn:aws:ec2:us-east-1:878533e33333:instance/i-3dd4f812
How to read this:
- arn: "amazon resource name"
- aws: partition
- ec2: service
- us-east-1 - region
- 878... - account number
- instance - resource type (if applicable)
- i-3dd - resource id
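The same breakdown, as a small parsing sketch. `Arn` and `parse_arn` are hypothetical helpers written for this note; the second field (`aws`) is the partition. ARN formats vary between services, so treating the first ":" or "/" in the last field as the type/id separator is a simplifying assumption:

```python
import re
from typing import NamedTuple, Optional


class Arn(NamedTuple):
    partition: str
    service: str
    region: str
    account: str
    resource_type: Optional[str]
    resource_id: str


def parse_arn(arn: str) -> Arn:
    # Split into at most 6 fields; the resource part may itself contain ":".
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError(f"not an ARN: {arn!r}")
    _, partition, service, region, account, resource = parts
    # The resource is "type/id", "type:id" or just "id".
    pieces = re.split(r"[:/]", resource, maxsplit=1)
    if len(pieces) == 2:
        resource_type, resource_id = pieces
    else:
        resource_type, resource_id = None, resource
    return Arn(partition, service, region, account, resource_type, resource_id)


parsed = parse_arn("arn:aws:ec2:us-east-1:878533e33333:instance/i-3dd4f812")
print(parsed.service, parsed.region, parsed.resource_type, parsed.resource_id)
```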
WARNING: You should never copy a user's access keys to an EC2 instance; use IAM roles instead.
There are various use cases where an EC2 (or lambda etc.) instance needs to access or manage other AWS resources.
For example, an EC2 instance might need to:
- Back up data to the object store S3.
- Change the configuration of the private network environment in the cloud.
To be able to access the AWS API, an EC2 instance needs to authenticate itself. You could create an IAM user with access keys and store the access keys on an EC2 instance for authentication. But doing so is a hassle, especially if you want to rotate the access keys regularly. Instead of using an IAM user for authentication, you should use an IAM role whenever you need to authenticate AWS resources like EC2 instances. When using an IAM role, your access keys are injected into your EC2 instance automatically.
If an IAM role is attached to an EC2 instance, all policies attached to that role are evaluated to determine whether the request is allowed.
Security generally
- Make sure software is up to date. One way, at least for OS-level packages, is to
install security updates at the end of the boot process by including
yum -y --security update
in your user-data script. But this has the downside of
making your system unpredictable compared to using fixed versions.
- Give IAM users minimum privileges
- Control traffic to and from resources (e.g. EC2 instances). Closing ports has
other advantages too, e.g. it protects you from human error: you can prevent
accidentally sending email to customers from a test system by not opening
outgoing SMTP connections for test systems.
- Use private networks
VPC
- They will be created within an address range. E.g. 172.31.0.0/16 means 16 bits
fixed and (32 - 16 = 16) bits of space to play with (so the 172.31 prefix is fixed). 172.31.38.0/24 means 24
bits fixed (and 8 to play with - we get this number because IPv4 has 32 bits
in total)
- You will need to attach an internet gateway to the VPC if you plan to connect to it via the internet
- [Double check?] Within a VPC a security group rule that allows traffic from any address ('0.0.0.0/0') is OK since the machines within only have private IP addresses and access is only possible inside the VPC.
- A VPC is always bound to a region. A subnet within a VPC is linked to an availability zone and a virtual machine is launched into a single subnet.
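The address-range arithmetic from the first point above can be checked with Python's standard `ipaddress` module. This is just a sketch using the two example ranges from these notes:

```python
import ipaddress

vpc = ipaddress.ip_network("172.31.0.0/16")
subnet = ipaddress.ip_network("172.31.38.0/24")

# IPv4 has 32 bits; whatever the prefix does not fix is left to play with.
host_bits = vpc.max_prefixlen - vpc.prefixlen

print(host_bits)              # 16
print(vpc.num_addresses)      # 65536 (2 ** 16)
print(subnet.num_addresses)   # 256 (2 ** 8)
print(subnet.subnet_of(vpc))  # True: the /24 sits inside the /16
```

In practice AWS reserves five addresses in every subnet, so a /24 gives you 251 usable addresses rather than 256.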
How to debug networking issues due to security with VPC Flow Logs
Say your EC2 instance does not accept SSH traffic as you want it to, but you can’t spot any misconfiguration in your firewall rules. In this case, you should enable VPC Flow Logs to get access to aggregated log messages containing
rejected connections.
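Once the logs are flowing, hunting for the rejected SSH connections might look like the sketch below. It assumes the default (version 2) flow log record format of 14 space-separated fields; the sample records, account number and interface ID are invented for illustration, and `parse_record`/`rejected_ssh` are made-up helper names:

```python
# Field order of the default flow log format.
FIELDS = [
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
]


def parse_record(line: str) -> dict:
    values = line.split()
    if len(values) != len(FIELDS):
        raise ValueError(f"unexpected record format: {line!r}")
    return dict(zip(FIELDS, values))


def rejected_ssh(lines):
    """Yield records for rejected TCP (protocol 6) traffic to port 22."""
    for line in lines:
        record = parse_record(line)
        if (record["action"] == "REJECT"
                and record["dstport"] == "22"
                and record["protocol"] == "6"):
            yield record


# Invented sample data, not real log output.
sample = [
    "2 123456789012 eni-0a1b2c3d 203.0.113.12 172.31.16.139 49761 22 6 20 4249 1418530010 1418530070 REJECT OK",
    "2 123456789012 eni-0a1b2c3d 172.31.16.21 172.31.16.139 20641 443 6 20 4249 1418530010 1418530070 ACCEPT OK",
]

for record in rejected_ssh(sample):
    print(record["srcaddr"], "->", record["dstaddr"], record["action"])
```

Seeing which source address gets rejected usually tells you which rule is missing or too narrow.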
Options for deploying
- Old school: SSH in and DIY. Does not scale. Near impossible to replicate.
- Create a virtual machine and run a deployment script on startup with CloudFormation (i.e. "Cloud Formation with custom scripts")
- Elastic Beanstalk for deploying common web applications (e.g. Ruby, Python, Docker) from zip archives on S3
- OpsWorks for complex layered applications (parts depend on each other).
- It uses Chef under the hood. Deploy with git usually.
CLI
- you should probably give it its own IAM profile (e.g. cli) with programmatic access and AdministratorAccess permissions.
Workflow (potentially break these out into sub-tips later)
- tag everything. It allows you to later group and analyze by resource groups, separate by billing, do access control, or delete everything belonging to one project.
- and when you are tagging, tag things consistently e.g. with "project:X", e.g. "project:oxbridgenotes", "project:semicolonandsons"
- use as few regions as possible (otherwise massively difficult to organize and delete stuff)
- use instances that can hibernate (otherwise the ephemeral state of a running server is lost)
- when choosing an AMI, confirm that any functionality you need is available (e.g. snd-aloop could not be installed on Amazon's Ubuntu. I should have Googled first)
Resources