Concepts

Controller Manager: host of controller instances, each controller manager runs as a separate process
Controller: each controller instance is processing the workload request and maintain system status (such as replicaset controller in k8s). A controller instance can have one to many workers running concurrently.
Controller worker: each controller worker get workload request from the queue, fulfill client request and maintain system status.

Project Scope

In Scope

Design & implement a general controller framework that support workload request for a large scale kubernetes cluster (up to 100K nodes) - solve the scalability issue of current Kubernetes system (v1.15)
Design & implement controller manager and controller instances deployment and scale out mechanism - facilitate controller system deployment and maintenance
Controller system implementation rolling out plan

Out of Scope

There is no plan to fix various K8s bugs and performance issues in most of the cases except scalability issue described above
In order to solve the controller scalability issue, this project might add some features into api server, but the scalability of api server and backend storage is not in the scope of this project. It will be handled in a separate project.

Design Principles

Allow multiple instances of controllers run in parallel and split workloads
Minimal or no increment to the api server informer watch network traffic
Balance of workloads as much as possible but it is not the highest priority unless reached processing limit
All controller managers are running on the same code base and can host different types of controllers allow default configure and support live configuration changes
Each controller managers can host many types of controllers
Controller managers are not responsible for controller workload distribution

Controller Manager

Proposal A - fully configurable, most flexible

Configuration

Each controller manager instance (process) has the following fields:
- Type: specifies the default launch configuration, including which controller it will start and how many workers the controllers will have
- Id: instance type
Implementation location: new data type in storage (config map is watched by all nodes. This design has health probe data and will be updated frequently. Not suitable for config map.)

Design Details

See ExploredSolutionsDetails-NotUsed.md

Proposal B - All controller runs in each controller manager, with fixed # of workers configured

Configuration

All controller runs in each controller manager, with fixed # of workers configured
When there is missing configuration for a controller, default to worker 1
Change of configration only affects newly started controller manager, existing controller managers won't be affected
Implementation location: config map

Design Details

See ExploredSolutionsDetails-NotUsed.md

Proposal C - All controller runs in each controller manager, with fixed # of workers in location file configuration (or hard code)

Configuration

All controller runs in each controller manager, with fixed # of workers in location file configuration
Implementation location: local configuration file

Proposal Comparison & Decision

Solution	Pros	Cons
Proposal A	Fully configurable, most flexible	Introduce new data type/key into storage. Implementation complexity is high.
Proposal B	Simple logic One less concept - controller manager type Some flexiblility in worker # configuration	It is confusing that how many # of controller workers are running in different controller instances. Some memory wasting in minimal used workload requests (queue is blocked and hence workers so not too much memory waste).
Proposal C	Simple logic One less concept - controller manager type	Some memory waste but not too much Cannot have more specific workers for increasing traffic unless start new controller manager instances

We will have use proposal C at phase I. Afterward, analyze realtime traffic and decide whether proposal A can make significant workload balance contribution.

Controller

Overview

Solution 1: supervised workload distribution
Solution 2: unsupervised (self organized) workload distribution
- Solution 2.a. Balanced workload distribution with increased overall network traffic
- Solution 2.b. Unbalanced workload distribution with similar network traffic
- Solution 2.c. Semi-balanced workload distribution with similar network traffic
Implementation Plan
Original decision is from supervised to unsupervised
Decide to first spend 2 to 3 months for POC unsupervised solution. If it is not feasible, with fall back to supervised solution

Unsupervised Workload Distribution

Solution A - Balanced workload distribution

Ideas

Any consistent hash function
Upon controller start, get full set of workloads that are not completed. (Need to check workload state support)
Internal records of all none completed workloads with minimized attributes: workload id, namespace/name (cache key), created timestamp, status, current assigned controller instance id
1. Map of workload id -> workload record
2. Map of controller instance id -> set of workloads
Upon update of instances, recaculate workload ownership; for affected workloads, reassign
1. Need algorithm to find delta

Solution Overview

Using consistent hash to distribute workload to multiple controller instances
- Consistent hash is used to make sure workloads are evenly distributed, also minimize the impact of controller instance updates on workload redistribution
Internal memory to cache all minimalized workloads
- Facilitate fast workload redistribution
Controllers are centrally registered

Design Details

See ExploredSolutionsDetails-NotUsed.md

Solution B - Unbalanced workload distribution

Ideas

Self maintained hash function that maps key to a limited problem space (int64)
Add filter by range feature into api server informer
Using tenant id (or tenant hash introduced in api server project) as the hash key for workload distribution
1. Workload for a tenant can only be handled by one controller instance
2. Natural division of workloads and related workloads, there will be no broadcasting of workloads to multiple controller instances
Add more controller instances for specifically for big tenants that needs half of the processing power of a controller instance
Upon controller instance changes, only pause and reset filter for affected controller instance
1. No need for central lock

Problems and Analysis

Distribute workload based on tenant is very unbalanced, also need more controller instances if there are small intervals (# of workloads between big tenants is small)
- Adjust hash function might be able to solve the problem but needs design for transition
Cannot handle big tenant once its workload reached the processing power of a single controller - scalability bottleneck
- Need follow up design to further break down big tenant to multiple controllers
With the scale out of workload, filter update might have a lot of already processed workloads need to be rechecked and take a long time to start processing new workloads
- Consider introduce additional field into workload for faster status check

Solution C - Semi-balanced workload distribution

Ideas

Self maintained hash function that maps key to a limited problem space (int64)
Add filter by range feature into api server informer
Using workload id as the hash key for workload distribution
Workload id is UUID, it shall have very good balanced distribution
Add hash value (name as hashKey) to objectMeta, this allows hash value to be generated together with workload id
Hash workload id to int64 and saved to hashKey. This will allow workload informer to filter by integer range with minimal computation
Filter dependent workloads by owner reference
Add owner hash key into OwnerReference
Upon controller instance changes, only pause and reset filter for affected controller instance
No need for central lock

Problems and Analysis

How to monitor workload balance?
- Controller instances report # of workload periodically
How to adjust workload balance automatically?
- Phase I: new controller instance always pick up range that has highest workload number (only split)
- Phase II: consolidate range that is too small and redistribute workload
With the scale out of workload, filter update might have a lot of already processed workloads need to be rechecked and take a long time to start processing new workloads
- In phase I, use the original controller restart mechanism, handle this problem later when it becomes an issue
- Consider introduce additional field into workload for faster status check

Solution Overview

Add filter by range feature into api server informer
Self defined hash function to map keys to uint64
As k8s framework does not handle uint64 well, use 0 - max(int64) as the range
Each controller instance will handle a range of keys
For better balance, the range can be adjusted in later iteration
Controller instances are centrally registered as a new data type
Config map is not a good candidate as it will be listened by all nodes
When a new controller instance join, it will take partial workloads from the controller behind it (evaluated by controller key), the controller behind needs to reset its filter
When a controller leave, the controller behind it will took over by resetting the filter

Controller Instance Registry in Storage

Controllers:
- VirtualMachine
  - instance-id: uuidv1
    hash-key: <keyv1>
    workload-num: <numberv1>
    is-locked: boolean
  - instance-id: uuidv2
    hash-key: <keyv2>
    workload-num: <numberv2>
    is-locked: boolean
  ...
  - instance-id: uuidvm
    hash-key: <keyvm>
    workload-num: <numbervm>
    is-locked: boolean
- BareMetal
  - instance-id: uuidb1
    hash-key: <keyb1>
    workload-num: <numberb1>
    is-locked: boolean
  - instance-id: uuidb2
    hash-key: <keyb2>
    workload-num: <numberb2>
    is-locked: boolean
  ...
  - instance-id: uuidbn
    hash-key: <keybn>
    workload-num: <numberbn>
    is-locked: boolean

Use etcd lease and timeout mechanism
When instance join, it will create a instance node with its key and attach a timeout; it also watch the Controllers/ node
All healthy instances will periodically renew the instance node with a timeout, also report # of workloads it listed
If a instance failed to renew, the instance node will be automatically deleted from the tree and other instances will be notified
Hash key is used to balance the workload distribution
Upon initialization, controller instance will generate its hash key
Questions
Why not use config map? - config map is watched by all nodes. Depend on the size of controller family, the data volume can be significant. (However, we can use config map as alternative at the beginning if cannot add new type into etcd quickly)

Change to existing api server

Add a new data type and informer: controller instance
Add filter by range feature into api server informer
Add a hash value field (int64) to workload data structure - ObjectMeta (same level as workload id)
Hash value is calculated from workload id (or other workload field that is likely to evenly distribute).
Hash value will be calculated when workload is created, no update. (It is not an identifier. Can share the same value with other workloads.)
Add an hash value field (int64) into workload owner reference data structure
Set owner reference hash value in controller_ref.go:NewControllerRef()

Controller Internal Workflow

Controller Instance States

Init - initialization
Locked - locked, waiting to be unlocked by controller registry
Wait - waiting for workload processing to complete
Active - processing all workloads

alt text

Controller Instance Initialization & Update

Lock means pause processing workload
When new controller instance join, there will be workload redistribution. It is ok for a controller instance to exam a processed workload. However, it may not be ok for a controller to process a workload that is still under processing. Therefore, we need a mechanism to make sure such thing won't happen.
No need to lock all controller instances, only need to lock the one that will get instances from others (because those workloads might be in processing)

alt text

Phase I controller hash key generation algorithm:
- When the first controller instance start, it will take all the workloads (hash key = max int64, lower bound = 0)
- When a new controller instance starts, it will find the first longest interval (by workload number or just key range before workload number is implemented), split it to half and take the first part:
  - Assume the longest interval is (a, b] (b is the hash key for controller instance i), the interval for the new instance will be (a, (a+b)/2], the new interval for instance i will be ((a+b)/2, b]

Controller Workload Periodical Adjustment

alt text

An internal list of all controller instances (with the same type such as replicaset)
A group of functions that watch controller instances and decide filter range for workload informers
- Register current controller
- Read and update controller instance records
- Adjust filter based on new controller instance records
- Rejoin controller list if found it was out of loop from last update
A status attribute for workload processing control (stop processing during filter adjustment)

Hash Key Management

A hash algorithm to map string to int64
- Use golang internal hash function for now (hash/fnv)
A script that can update hash key of controller instances - for workload processing rebalance
- Phase I need it for rebalance manually
- In phase II, there will be auto rebalance

Implementation Steps

Add filter by range feature into api server informer (830)
Add hash value field to workload data structure (830)
Add hash function that map workload uuid to 0 - int64
Add hash value field to workload data structure
Add owner reference hash value field to workload metadata OwnerReference struct
Controller framework
Add controller instance list (830)
Add controller status (830)
Interface definition for controller instance configuration: create, read, update - mock controller instance value (830)
Controller instances management
1. Add controller instance type into storage (830)
2. Add controller instance informer (830)
3. Add watch logic into controller framework (830)
4. Implement new instance registration (830)
  1. Pick the range of highest workload number and split in half (830)
5. Implement controller instance update (930)
  1. Workload distribution rebalance
ReplicaSet controller use controller framework
Apply controller framework to replica set controller (830)
Set owner reference hash value in controller_ref.go:NewControllerRef() (830)
Filter replicaset and pod by range (830)
Existing unit test pass (830)
Use controller manager to start replicaset controller instance (830)
Manual deployment (830)
New unit tests (930)
New end to end test (930)
Volume controller use controller framework (TBD)
Apply controller framework to volume controller
Filter volume workload by range
Query other related workload in realtime (instead of from cache)
Existing unit test pass
Use controller manager to start volume controller instance
Manual deployment
New unit tests
New end to end test
VM controller use controller framework (TBD)
Network controller use controller framework (TBD)
Service controller use controller framework (TBD)

Proposal Comparison & Decision

Solution	Pros	Cons
Balanced workload distribution	Workload distribution is mostly even, can fully ultilized each controller instance processing power	Network traffic will be increased, it will be a linear function of the # of controller instances, increased burden to core/api server Extra internal memory requirement for saving all minimized workloads. Each workload will need to provide a way to calculate completed standard - more code changes. Most complicated implementation logic.
Unbalanced workload distribution	No increment on traffic/api server burden No extra memory requirement	Workload distribution can be very uneven. Current design has scalability limit to biggest tenant workload size. Hash function maintenance. Need extra maintenance manual/script when a tenant grows too big to share controller with others.
Semi-balanced workload distribution	No increment on traffic/api server burden No scalability bottleneck Workload distribution is relatively even. No extra memory requirement Require least implementation.	Not fully balanced. Some code changes in invididual controller to minimize network traffic

We choese solution 3 - semi-balanced workload distribution for phase I. It can be converted to solution 2 if there is issue with realtime querying other data.