1. Kubernetes implementations on cloud and on-premise are very different.
2. Learn enough Linux internals for a solid understanding of how to operate Kubernetes in a production environment.
3. Install and operate Kubernetes using only community tools.
4. Deploy a community Kubernetes cluster on manually provisioned VMs from scratch.
5. Design and implement CI/CD pipelines for independent deployments.
6. Figure out governance strategies to independently develop, configure and operate each microservice in a Kubernetes cluster.
7. Configure Istio in a flexible manner to govern east-west traffic.
8. Run all K8s processes as Docker containers rather than binaries.
9. In the absence of open internet, stand up a Docker registry first and populate it with all necessary images.
10. Use Kubespray to set up Kubernetes on RHEL VMs
--> Uses Ansible playbooks for opinionated provisioning
--> Sets up Calico overlay networking
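A Kubespray deployment is driven by an Ansible inventory; a minimal sketch is below. Hostnames and IPs are placeholders, and the group names (`kube_control_plane`, `kube_node`) vary between Kubespray versions, so check the sample inventory shipped with your release.

```yaml
# inventory/mycluster/hosts.yaml -- hosts and IPs are illustrative
all:
  hosts:
    node1:
      ansible_host: 10.0.0.11
      ip: 10.0.0.11
    node2:
      ansible_host: 10.0.0.12
      ip: 10.0.0.12
    node3:
      ansible_host: 10.0.0.13
      ip: 10.0.0.13
  children:
    kube_control_plane:      # control-plane members
      hosts:
        node1:
        node2:
    kube_node:               # workload nodes
      hosts:
        node2:
        node3:
    etcd:                    # etcd members (odd number)
      hosts:
        node1:
        node2:
        node3:
    k8s_cluster:
      children:
        kube_control_plane:
        kube_node:
```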
11. Admission control policies to apply resource quotas (especially in lower environments)
12. Don't run single-master clusters (even in lower environments)
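Resource quotas are enforced by the built-in ResourceQuota admission controller; a sketch is below. The namespace and limits are illustrative. A LimitRange alongside it supplies defaults, so pods that omit explicit requests still count against the quota instead of being rejected.

```yaml
# Sketch: cap aggregate resource usage in a lower environment namespace
apiVersion: v1
kind: ResourceQuota
metadata:
  name: dev-quota
  namespace: dev
spec:
  hard:
    requests.cpu: "8"
    requests.memory: 16Gi
    limits.cpu: "16"
    limits.memory: 32Gi
    pods: "50"
---
# Defaults for containers that do not declare requests/limits
apiVersion: v1
kind: LimitRange
metadata:
  name: dev-defaults
  namespace: dev
spec:
  limits:
    - type: Container
      default:
        cpu: 500m
        memory: 512Mi
      defaultRequest:
        cpu: 100m
        memory: 128Mi
```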
13. Helm charts are your friend
14. Each microservice can be independently deployed, scaled and monitored for KPIs across multiple environments.
15. Custom autoscaling tied to Istio metrics for I/O-intensive workloads, e.g. Zuul gateways acting as BFFs.
16. Timeouts and retries, rate limiting, circuit breaking, bulkheading - Istio configuration in Helm charts.
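One way to wire this up (an assumed setup, not spelled out in the original notes) is a v2 HorizontalPodAutoscaler driven by Istio telemetry exposed through a Prometheus adapter. The deployment name and the metric name `istio_requests_per_second` are placeholders; the actual metric name depends on your adapter's rules.

```yaml
# Sketch: scale a Zuul gateway on Istio request rate rather than CPU
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: zuul-gateway
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: zuul-gateway
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: istio_requests_per_second   # assumed adapter-exposed metric
        target:
          type: AverageValue
          averageValue: "100"               # target req/s per pod
```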
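A minimal sketch of these resilience patterns in Istio resources (service names and thresholds are placeholders you would template through Helm values): timeouts and retries live on the VirtualService, while circuit breaking and bulkheading live on the DestinationRule. Rate limiting is omitted here since it needs an EnvoyFilter or an external rate-limit service.

```yaml
# Timeouts and retries for calls to the "orders" service
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: orders
spec:
  hosts:
    - orders
  http:
    - route:
        - destination:
            host: orders
      timeout: 3s
      retries:
        attempts: 3
        perTryTimeout: 1s
        retryOn: 5xx,connect-failure
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: orders
spec:
  host: orders
  trafficPolicy:
    connectionPool:              # bulkheading: cap concurrent usage
      tcp:
        maxConnections: 100
      http:
        http1MaxPendingRequests: 50
        maxRequestsPerConnection: 10
    outlierDetection:            # circuit breaking: eject failing hosts
      consecutive5xxErrors: 5
      interval: 10s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
```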
17. Storage orchestration on K8s without mature storage provisioning
--> Tried going with GlusterFS; however, the infra team was not well versed with dynamic provisioning and was used to mounting disks on nodes directly.
--> No standardized storage orchestration solution.
--> Issues conveying dynamic storage needs to the IT team.
--> Needed additional hardware to run the Gluster solution, and it was still backed by NFS
--> Ended up with NFS as the only standard solution available (though not recommended)
--> Both Gluster and NFS need special OS drivers on the nodes
--> Now exploring container-attached storage (CAS) options such as Rook and OpenEBS
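With NFS as the fallback and no dynamic provisioner, volumes end up statically provisioned; a sketch follows (server address, export path and sizes are placeholders). Setting `storageClassName: ""` on the claim makes it bind to the pre-created PV instead of waiting for dynamic provisioning.

```yaml
# Sketch: statically provisioned NFS volume plus a claim that binds to it
apiVersion: v1
kind: PersistentVolume
metadata:
  name: app-data
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  nfs:
    server: nfs.example.internal   # placeholder NFS server
    path: /exports/app-data        # placeholder export path
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""             # skip dynamic provisioning, bind statically
  resources:
    requests:
      storage: 10Gi
```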
18. Istio connection pooling issues
--> Istio internally runs on HTTP/2 and gRPC
--> TLS is handled by Istio - our services are on HTTP/1.1
--> Istio automatically upgrades browser connections to HTTP/2 during content negotiation
--> Upstream service calls were converted to HTTP/1.1
--> Conversion by Istio between HTTP/1.1 and HTTP/2 at the gateway started showing TCP packet loss
--> Too many browser requests went into a pending state
--> The application started glitching
--> Happened early on as soon as we enabled TLS in lower environments.
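One Istio knob relevant to this class of problem (an assumption on my part, the original notes do not say which fix was applied) is the per-destination `h2UpgradePolicy`, which stops the mesh from upgrading connections to an HTTP/1.1-only upstream. The host name below is a placeholder.

```yaml
# Sketch: keep traffic to an HTTP/1.1-only service on HTTP/1.1
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: legacy-service
spec:
  host: legacy-service
  trafficPolicy:
    connectionPool:
      http:
        h2UpgradePolicy: DO_NOT_UPGRADE   # never upgrade to HTTP/2
```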
19. Automatic TLS origination with Istio was also a problem
20. Certificate provisioning is manual and does not follow ACME standards
21. Multi-domain certificates also cause problems.
22. K8s adoption running ahead of traditional IT teams; IT/Ops teams have not upgraded at the same pace
--> Cluster state corruption due to unscheduled master node restarts
--> Etcd runs as a Docker container that doesn't automatically restart, which damaged the etcd cluster state
--> Solution: set a policy to auto-restart the Docker container
23. Disk issues - use separate partitions for images, logs and containers.
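The restart policy amounts to running the etcd container with Docker's `always` restart behaviour; a Compose-style sketch of the idea is below (illustrative only: Kubespray does not manage etcd via Compose, and the image tag is a placeholder).

```yaml
# Compose-style illustration of the fix: restart the container automatically
# after node or daemon restarts instead of leaving etcd down.
services:
  etcd:
    image: quay.io/coreos/etcd:v3.5.9   # placeholder image/tag
    restart: always                     # survive unscheduled restarts
```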
24. Be aware of K8s garbage collection runs
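The kubelet's image garbage collection and disk-pressure eviction are tunable through its configuration file; a sketch with illustrative thresholds follows.

```yaml
# Sketch: KubeletConfiguration fields governing image GC and disk eviction
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
imageGCHighThresholdPercent: 80   # start image GC above 80% disk usage
imageGCLowThresholdPercent: 70    # collect until usage falls below 70%
evictionHard:
  imagefs.available: "15%"        # evict pods when image disk runs low
  nodefs.available: "10%"
```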
25. Registry stability issues bring down cluster services
26. Average latency of services < 200 ms
27. Uptime of the entire platform: 99.9%
28. Expected peak load ~5000 tps
29. Will grow to ~15000 tps over next 3 years