Five years have flown by. Besides watching the team grow, I have also seen all kinds of people come and go. By my rough estimate, employees who stay at Amazon for more than five years are around the 90th percentile globally, and people who stay on the same team for five years are even rarer. Once you hit the five-year mark, you earn the orange badge1. As one of the few orange-badge holders on my team, I inevitably have some inspiring (or painful) work stories to share, but the busy, fast pace of work means many of those thoughts slip away. Rereading what I wrote before (10 Things I Learned at Amazon), I was struck by how these changes happen all the time; parts of me have clearly changed over these five years, and I am glad the environment has not made me jaded.
Perhaps it is partly because I can revisit those posts, or because I help with interviews: many candidates have mentioned that they read my articles and were inspired by them (whether or not they ended up joining our team). These words and this feedback constantly remind me to keep my beginner's mind and to handle things in a more mature way. Some posts even drew responses from experts in the field, yet I never scheduled time to keep writing. So this post is my attempt, from a five-year vantage point, to summarize what I like about this team and where I think it can improve.
A disclaimer first: although AWS is a major business within the Amazon group, management styles can differ across organizations and roles. Amazon employs hundreds of thousands of people around the world, and running into all sorts of odd departments at work is nothing new, so AWS Support cannot be taken as representative of every team.
Also, this post reflects only my personal observations, does not represent any official position, and is not meant to carry any critical agenda. As an Individual Contributor on the team, I spend most of my time on the front line solving all kinds of engineering problems rather than discussing management or operations topics. Since I do not look at team operations from a manager's macro perspective, and the issues engineers care about differ somewhat from those managers care about, I am only sharing my experience as an engineer; this does not imply that either side is wrong.
After all, my vantage point is that of an engineer, not a manager, so some observations may be limited. I will try to stay as objective as possible, but please judge for yourself.
One thing I have observed that has never changed in the Chinese-language technical team is that it is full of excellent colleagues, many of them engineers with years of industry experience and rich backgrounds. Interestingly, current hiring policy also encourages fresh graduates to join the AWS Support technical team, regularly injecting new blood and making the team highly diverse. New colleagues often bring different experiences and ideas from other companies; some even switch from being customers using AWS products to being the ones solving problems. All these different backgrounds coming together and exchanging ideas build a strong professional network within the team.
Cloud Support Engineer is a global team, with engineers spread across the world. The more responsibility and tasks you take on, the more chances you naturally have to collaborate with colleagues worldwide and run projects. I once hosted technical talks and training across three time zones, which not only let me apply project-management thinking and practice teaching skills, but also raised my visibility and genuinely improved the team's technical capability, contributing back to the team.
Opening up my horizons and interacting with people around the world is my favorite part. Being in Dublin especially, surrounded by engineers of different languages and backgrounds, there are endless opportunities for exchange across countries, languages, and cultures; besides technical discussions, outside of work you can also chat about each other's backgrounds and learn each other's culture.
If you are on the team in Taiwan, though, time zones do limit some of that interaction. Talking with colleagues around the world is still just a chat message away; the downside is that you may need to arrange extra time (getting up early, replying at night, or joining late meetings).
What I think most distinguishes this job from typical software development is that you face customers' production environments. Cloud Support Engineers regularly deal with high-pressure situations: broken features, failing applications, system crashes, and so on. Customers sometimes push you relentlessly for updates. In these situations you need to stay calm and focused, and not let the customer's emotions steer the investigation in the wrong direction.
I still remember the first time I faced a customer outage; I had to ask a senior engineer to help guide me through the customer's technical problem. But after observing, reviewing, and repeatedly practicing troubleshooting methods, I gradually turned that pressure into empathy for the customer and learned to handle thorny situations calmly, clarifying and eliminating causes correctly. Some may see this as a downside (e.g., difficult customers), but once you have been through it and crossed that uncomfortable threshold, it becomes an unshakeable, portable soft skill; it depends on whether you are willing to treat these challenges as chances to grow.
This kind of experience and training spills over into daily life without my noticing: when a system fails or something unexpected happens, my first instinct is to clarify the problem and figure out how to solve it, rather than spiral into anxiety driven by emotion or circumstance.
Unlike software development, Cloud Support Engineer is a role that demands a great deal of communication: sharing investigation results with customers in plain language and providing actionable, understandable steps for them to adopt (whether by email, chat, or phone). Customers come in many roles, so besides explaining problems and solutions in terms developers understand, you will inevitably meet customer-side managers who want the situation clarified; at the same time, various customer-facing roles inside AWS also join in helping the customer, so as a technical engineer I had to learn to communicate with all of these different roles.
To some degree, you learn to stand in others' shoes and empathize. In front of a product development team, I need to understand what problem the customer hit, how to reproduce it, point out the product's current defect, and suggest how to fix it. Beyond steering the investigation, you need a solid grasp of how the product works at its core, so you can speak the development team's language and get the issue fixed effectively. In front of the customer, I need to understand their problem and pain points, and offer practical recommendations that help them meet their business goals through the product's features, sometimes even guiding them toward fixes with long-term benefit (because customers often have their own ideas and are eager to solve only the immediate, short-term problem).
Beyond Amazon's organizational culture of writing documents, support engineers spend a lot of time translating complex investigation reports into language customers or product teams can understand, and writing correspondence customers can actually follow.
Besides being a multinational team, AWS Support itself offers a platform for growth and visibility.
On the Chinese-language DevOps/Container technical team, what I especially like is that the colleagues around me are very supportive and help each other, and are proactive about their own career planning rather than confining themselves to resolving individual Support Cases day to day.
Even though how many Support Cases (tickets) you resolve each day matters, much more of your growth comes from other kinds of work that build different skills. Because AWS Support works closely with customers across industries, one clear example is customer-facing training opportunities: sharing usage advice and best practices for AWS products with enterprise customers of all sizes.
Beyond that, to help more customers solve technical problems, there are always internal projects and programs. Whether through videos, technical articles, or training, team members level up their skills in different ways, benefiting not only Chinese-speaking customers but, more often, customers worldwide, and gaining global visibility along the way. For example, the following are various contributions from me or my colleagues:
AWS Knowledge Center
Better yet, my impressive colleagues tend to right wrongs with Pull Requests: submitting the corresponding patches, or building tools so that even more customers benefit, for example:
As a stepping stone into an AWS career, Cloud Support Engineer is genuinely a job full of opportunities for breadth of learning, which is what I personally find most interesting about it, for example:
Everyone defines work-life balance differently (in my view it is a relative feeling), but compared with many IT Support or engineering jobs at other companies, AWS Cloud Support Engineer is, "relatively speaking", probably not an easy job.
With the team's current working model, providing customers 24x7x365 uninterrupted support means that public holidays and weekends can be working days, and team members work shift rotations whose hours only partially overlap.
I still consider this acceptable, mainly based on the following observations:
(1) In the past, the Taiwan-based Chinese-language team's shifts had to cover a full 16 hours, handing off to the North America time zone only around 11 p.m. (which meant some colleagues worked until 11 p.m.). Since late shifts are not a healthy working pattern for many people, management kept seeking solutions soon after the Taiwan team was formed. With the establishment of the European team, things improved: Taiwan was freed from the late-shift nightmare and working hours shifted earlier and earlier.
(2) Even though Cloud Support Engineers also have an on-call mechanism, the team's on-call mostly follows working hours; being a global team, on-call shifts outside your working hours are covered by other time zones.
Most of the time, new hires have good control over work-life balance. But the more you want to do, the more responsibility lands on you, and the balance is not always guaranteed. For example, I have seen senior engineers occasionally join meetings after 21:00 to accommodate the Americas time zones. Still, "relatively speaking", compared with stories of some Amazon development teams being paged awake in the middle of the night, it is to some degree an acceptable work-life balance.
Hard to believe, but behind 24-hour uninterrupted customer service there are still AWS development teams carrying round-the-clock on-call who do have to get up in the middle of the night. Especially once you are an AWS support engineer with some seniority, you will handle production outages often enough to have plenty of chances to wake up developers in North America.
Having been woken at night myself, I fully agree that overnight on-call is a very unhealthy working pattern. The design of the work schedule is not my favorite aspect of the team, but in this respect the team as a whole is improving, if slowly. As more people join, the scheduling should approach something reasonable.
Fragmentation of daily working hours is my least favorite part. Before discussing it, you need to understand how much the broader environment drives this problem. Customer behavior differs by market: Chinese-speaking customers are used to instant messaging and immediate responses, which often leads to a lot of unexpected behavior when they use technical support.
Even though the documentation and product pages clearly define the different severities2 3, customers habitually choose the shortest response time when opening a case (as short as they can select) rather than the actual severity of the impact (e.g., choosing the 15-minute response level because a project launches next week, not because the system is actually down):
Even knowing that customers often choose case severities incorrectly, AWS Support still gives customers the final say. To some extent, this does lead to misuse of engineering resources. It is like the well-known Aesop fable of the boy who cried wolf: with too many false alarms, the team cannot correctly distinguish the outages that truly affect production; worse, with engineers tied up by piles of non-outage issues, the team cannot balance severities well or promptly help with failures where environments are genuinely impaired.
The effect is that the Chinese-language technical team must handle a large volume of these short, rapid replies and often loses the time to focus on technical problems. I have seen many new hires flounder because of this, pulled into handling piles of quick-response cases, unable to invest enough time in any single investigation: juggling case A, yanked over to case B, or forced to interrupt ongoing work and meetings to respond to customers, fragmenting the workday. If what is indirectly sacrificed is the long-term quality of service for customers, I believe the market dynamics leave much room for improvement.
By contrast, Japanese customers habitually choose severities like "General guidance" or "Degraded/Impaired system" that genuinely reflect their situation, for guidance and post-incident investigation. This market dynamic let me benefit indirectly from my Japanese-team colleagues: I learned an enormous amount from the replies each of them gives customers. Their messages are consistently detailed and complete; before replying they run extensive tests, propose various options or PoCs for reference, and have often already discussed the case internally (the downside being that customers need sufficient patience).
This is not to rank which customers are better. Having personally supported Indian customers, which is challenging in its own way, I know this is not a problem of any single locale; every customer base has its own traits. Rather, the first-hand experience deepened my appreciation of cultural and market differences.
Meanwhile, the team has been trying many different approaches to ease these pain points for engineers and to teach customers how to use AWS technical resources better and more correctly, so their truly important problems get solved.
With the team growing fast across multiple time zones, processes are revised and iterated so quickly, to adapt to different problem scenarios, that it is sometimes hard to keep up. For example, the workflow recommended today as A may become version A-1 tomorrow and A-10 a few weeks later. Because processes iterate constantly with customer needs and problems, they cannot always be applied promptly in every region; some regions even run entirely independent systems with their own way of operating.
"Rules changed overnight" is probably the best description of some of the team's current processes; the team keeps trying to introduce new methods and workflows. Even I sometimes feel lost, and have to go back to the internal documentation, or back to first-principles discussions, to check whether I have misunderstood anything.
Chaotic as it sometimes feels, this iterative process never stops on the team, and you have to learn to adapt to a fast-paced, ever-changing culture.
Finally, "learning how to be a better customer" is absolutely the biggest takeaway of my years in a customer-facing support engineer role, and deserves to top the list. Since the daily job is essentially customer-service work, you inevitably see every kind of customer, learn the different types, and learn what good and bad look like.
As a support engineer, with accumulating experience helping customers, you gradually learn to think from the customer-service side and to empathize with how hard that role is.
When you realize a customer does not respect your expertise, the frustration can absolutely crush your technical confidence. Often, even when your advice is correct and successfully solves the problem, the customer may treat you as just an AWS back-office worker, and you will not necessarily get any acknowledgment.
In my first years I could not understand this, often feeling that my effort at work earned no feedback at all; but looking around, plenty of customers are thoroughly professional and great to collaborate with, which taught me to accept different kinds of customers with equanimity. Through this I also learned how to be a "good customer" and how to show respect in daily life to customer-facing workers of all industries: a restaurant waiter, a bank phone agent, or anyone else in a customer-facing job. On my team, members also freely share their experience and how to handle different customers appropriately, and we grow through those lessons.
Since taking on a customer-service role myself, I empathize more with customer-service work across industries and place more value on the effort front-line staff put in. And when service falls short of my expectations, I offer concrete suggestions and ideas rather than pure complaints. Interestingly, this usually gets me more satisfying outcomes (not faster refunds or more compensation), while helping the product and service improve.
This post explored the experience of working as a Cloud Support Engineer at Amazon Web Services (AWS) and shared the valuable skills and lessons I picked up over the past five years. The points above are entirely personal observations and represent no official position.
If you are interested in the AWS Cloud Support Engineer role, I hope the content above helps; you can also refer to the other articles in the series to learn more.
For details, you can refer to my team's career-sharing session on the AWS Cloud Support Engineer role, which also covers parts of the interview process and some tips to help you better grasp the abilities our team values:
Please note that this is purely my personal view. I am sharing it so that anyone interested in applying for AWS Cloud Support Engineer understands which essential skills to build, along with some abilities that are very important when solving technical problems for customers day to day. The purpose of writing this is not to help anyone cram interview questions, nor does it represent any official guide.
Even if you demonstrate perfectly memorized answers in the interview, once on the team you will still be scared to death facing real customer problems, because customer problems are usually ill-defined and yet the customer expects you to provide answers. If all you can do is recite, you still cannot actually solve anything for a customer.
Also, the core skills the team hires for may change over time, and an interview is not an ordinary exam: what matters is understanding what skills you can contribute to the team, assessed comprehensively across multiple dimensions. So what follows is what my experience on the team and with customer cases tells me are essential skills, for reference only.
If you are still not quite sure what an AWS Cloud Support Engineer does, I highly recommend the related articles in my AWS career series to help you gradually build an understanding of the role:
Many applicants treat this job as ordinary environment-setup IT helpdesk or a plain customer-service position, assuming that following a runbook solves most problems at work. In reality, the work AWS Support does differs quite a lot from a typical company's IT Support: even though you interact with customers in ticket form, the role still leans toward consulting, and being pulled alone into a customer meeting to debate an issue one-against-ten is a scenario you may well encounter. I suggest being mentally prepared before applying for this job.
I often observe that many candidates, even after years in the IT industry, have big gaps in basic knowledge (for example, I have heard someone say you can use ping to test a website's port 80 to see whether the site is down). This is especially visible among engineers who have focused only on development work (strictly speaking, many software engineering jobs are about "implementation": the product specifications they face are largely already defined in existing packaged libraries or public solution APIs ready to be applied, so there may be little occasion to think about such core, low-level questions).
Whether developer or operations staff, satisfying requirements from the "user" perspective at the level of high-level applications and implementation may not demand deep fundamentals, which leaves people helpless when something breaks or the problem scenario gets complex (and I thank these people for keeping me employed).
But an AWS Cloud Support Engineer is like a doctor: a doctor must, from the symptoms the patient describes and the available information, propose the correct diagnostic steps, use the correct diagnostic tools (e.g., stethoscope, X-ray), and finally prescribe the right medicine to relieve the patient's symptoms. Likewise, when troubleshooting, the engineer must know, from their own understanding of the background of the customer's problem, what information to collect for analysis and which tools are the right ones to use.
Sometimes the information the customer gives is even wrong, which is where a comprehensive understanding from the fundamentals up to the application layer pays off, providing the correct direction for investigation; otherwise you are just poking around blindly.
Networking basics include:
DNS is about as fundamental as fundamentals get, yet I have personally met many candidates who are not even familiar with the basic DNS protocol (and some customers working in IT are not that familiar either). The most common gaps are the concrete DNS query flow, the structure of the DNS protocol, and troubleshooting DNS problems.
When debugging or facing a system failure, they point the finger straight at the application or the service side, never considering that the real cause is an incorrect DNS configuration or some unexpected DNS behavior.
On networking fundamentals there are far too many free resources to list, including some excellent ones that concretely help you understand; the links below alone are probably more than you can finish, so I will not enumerate further:
Note that the depth of the questions still depends on the technical skills the hiring team values. A team focused on networking-related AWS services will go deeper into networking; for other specialty teams, whose focus is supporting particular AWS products, basic network-troubleshooting ability is enough to cover customer scenarios.
Since I am personally not familiar with Windows, to avoid misleading anyone I will only list the Linux resources I find very practical, plus the most basic of basic file-system chapters (honestly, carefully reading and practicing all of the Niao Ge (鳥哥) Linux tutorial chapters is probably enough for 60-80% of Linux operations scenarios):
Linux (file system/operation/administration knowledge)
If you have no experience in this area at all, then building a Web Service (HTTP) yourself, covering the knowledge required from networking fundamentals through basic operating-system administration, is essential.
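As a minimal way to practice this end to end (a sketch of mine, assuming a Unix-like machine with python3 and curl installed; the paths and port are arbitrary), you can stand up a bare HTTP server locally and then trace each layer of a request against it:

```shell
# Hypothetical practice lab: serve a static page, then inspect the request path.
mkdir -p /tmp/weblab
echo '<h1>hello weblab</h1>' > /tmp/weblab/index.html

# Start a throwaway HTTP server on port 8123 in the background.
( cd /tmp/weblab && python3 -m http.server 8123 >/dev/null 2>&1 & )
sleep 1

# Fetch the page; from here you can dig into each layer yourself:
# name resolution, the TCP handshake, the HTTP request/response headers.
curl -s http://127.0.0.1:8123/
```

Repeating the same request with curl -v, or alongside a packet capture, exposes the TCP and HTTP layers that the fundamentals above describe.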
Here are several traits I consider important when applying for the AWS Cloud Support Engineer role:
The main job of an AWS Cloud Support Engineer is helping customers, which often means breaking complex technical problems into understandable steps, so that customers, and even other teams (such as development teams), know clearly how to troubleshoot and what to fix. You need to convey information clearly and precisely, explaining both the problem and the solution.
Unlike typical software development work, AWS Cloud Support Engineers handle system-outage scenarios and inevitably absorb pressure from customer environments. For example, when an outage is hurting revenue, in a tense, high-pressure situation the customer just wants the problem fixed immediately (hurry, hurry, hurry), and even otherwise professional IT people can suddenly become quite irrational.
Imagine that you entered the incident only five minutes ago: you are live troubleshooting while clarifying the problem, checking each item, and guiding the customer through the correct verification steps (because sometimes the information the customer gives is wrong). Yet some customers will still feel you are wasting their time; the importance of staying calm and patient is especially pronounced in these situations. Things I have personally heard include:
The customer may be emotional, but you cannot help while carrying emotion yourself, or everyone just panics together (everyone hurrying at once). This job really did teach me many soft communication skills; in my experience, pulling a crowd of uninvolved people into the incident usually only muddies the water and does not help the investigation much.
Think about whether you have had similar experiences: how you communicated effectively with others, how you solved complex technical problems, and how you handled emergencies.
An AWS Cloud Support Engineer needs to identify and solve problems effectively and precisely. Basic logic and good analytical skills are required, along with the ability to quickly grasp the essence of a problem and synthesize information from multiple angles to pinpoint the crux.
Simply put: the troubleshooting logic must be sound, the analysis correct, the methods and tools right; know how to investigate when a problem occurs, why you use those tools, and why you check A rather than B. For example, when a website is unreachable, why use ping rather than another tool? What do ping's results mean? And once you have the results, what is the next direction of investigation?
The opposite is collecting a pile of useless information and guessing wildly, making the problem even more diffuse (the counterexample being the earlier claim of using ping to check whether port 80 is reachable).
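To make the counterexample concrete (a sketch of mine, not from the original; the helper name is hypothetical), ping only exercises ICMP and says nothing about whether TCP port 80 accepts connections; "is the website down" is a question for a TCP/HTTP-level tool:

```shell
# ping tests ICMP reachability only. Whether "the site is down" is a question
# about TCP port 80 and the HTTP service, so test that layer instead.

# Hypothetical helper: succeeds only if an HTTP server answers on host:port.
check_http_port() {
  curl -s -o /dev/null --max-time 5 "http://$1:$2/"
}

# Usage:
#   check_http_port example.com 80 && echo "port 80 answers HTTP"
# A raw TCP connect test (no HTTP) would be: nc -z -w 5 example.com 80
```

A host can answer ping while its web server is dead, and can block ICMP entirely while serving HTTP fine, which is exactly why the two checks are not interchangeable.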
AWS products are constantly launching; you basically can never finish learning them all, so this job forces you to keep up with customers' rapid pace through continuous learning and self-improvement.
Because this role flips you from the user side to the problem-solving side, you must also be able to research problems deeply within your specialty; only then can you give a concrete, clear direction for investigation when a customer throws out an unknown, ill-defined problem.
Each specialty has its own emphasis. For example, a Database-focused engineer and a Linux-focused engineer may define the required Linux knowledge differently: the Database engineer may only need to understand Linux basics, know some common commands, and grasp the file system, file permissions, and basic troubleshooting; the Linux engineer must deeply understand how Linux processes work, know how to use Linux tools to understand system performance, know how to read a kernel dump, how to troubleshoot, and so on.1
Each specialty has its required fundamentals, but the team's skill tree evolves with customer needs, and the problems we solve keep changing. Here are some learning resources I believe are very helpful for every specialty team:
The job of an AWS Cloud Support Engineer is to support AWS customers, so you need to be familiar with AWS services and products and able to help customers solve the problems they run into. You also need to know how to configure and manage AWS environments, and how to troubleshoot them. A deep understanding of AWS technology will serve you very well after joining.
For other technical specialties, refer to the corresponding skills mentioned in the job description; the skill tree each technical team values differs somewhat depending on the products it focuses on, and the product pages give a rough idea of the details:
I would not claim to know all of AWS Support very well, but for teams in the DevOps and container-technology space, I know the area well enough to share something.
The AWS services my team covers mostly include the following:
Currently, one person on my team may support close to 40 different AWS services; basically, any of the services above may come my way when a customer raises a problem.
Given that my team is also actively looking for suitable people, below are the technical experience and abilities my team values highly, with some learning resources attached for part of them:
Linux
Kubernetes / Docker
CI/CD
This post outlined the technical skills and traits an AWS Cloud Support Engineer needs, along with learning materials worth consulting. If you are considering joining an AWS Cloud Support Engineer team, I hope it helps you build a fuller picture.
It also serves as my own basic guide to the long-term learning path for the AWS Cloud Support Engineer technical role, and should help you think about how to correctly demonstrate these abilities with concrete examples when answering questions.
The AWS Load Balancer Controller documentation mentions several concrete caveats2; however, some scenarios are not necessarily captured in the docs (or I just happened to hit them). Below are a few interesting issues I personally noticed while helping users migrate AWS Load Balancer Controller to v2.
Some of you may have noticed that during the upgrade from the v1 controller to the v2 controller, the v2 controller creates a new ELB resource and deploys the related resources onto it to complete the migration; the ELB resource created by v1 is no longer used. So if anything depends on the old ELB address (e.g. fe584233-echoserver-echose-XXXX-XXXXXXX.ap-northeast-1.elb.amazonaws.com) or on associated services, remember to update the corresponding DNS records. The documentation does explicitly mention this behavior:
The AWS LoadBalancer resource created for your Ingress will be preserved. If migrating from <v1.1.3, a new AWS LoadBalancer resource will be created and the old AWS LoadBalancer will remain in the account. However, the old AWS LoadBalancer will not be used for the ingress resource.
For many users this may not be a big deal; but for users with many systems already depending on a single ELB resource, who installed without much up-front planning and just went for it, this upgrade is about as pleasant as going to the bathroom and still being constipated.
So customers still running v1.0.1 ingeniously proposed an upgrade path I had never even considered:
Since the documentation says <v1.1.3, can I first upgrade past that version (for example, to v1.1.9) and then upgrade to v2, keeping the original ELB resource? The upgrade path would be v1.0.1 -> v1.1.9 -> v2.
Logically there seems to be no flaw, but after reading the AWS Load Balancer Controller source code, sadly, the idea is just too good to be true.
Based on the observation that every ELB resource created by AWS Load Balancer Controller v2 carries the reserved k8s string in its name, I could predict that a migration from any old v1 controller would most likely be unable to reuse the old ELB resource.
Still, that is a bold assumption about the behavior above, and it deserves careful verification. Seeking the truth, I ran a simple reproduction in my environment, which directly demolished the proposed upgrade path.
First, I ran the v1.0.1 controller in my environment. After a bit of archaeology and kubectl convert (for the legacy API declarations), I got the old v1.0.1 controller successfully installed and running on Kubernetes 1.20, and deployed a simple sample application:
# Running controller v1.0.1
$ kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/aws-alb-ingress-controller/v1.0.1/docs/examples/rbac-role.yaml
$ wget https://raw.githubusercontent.com/kubernetes-sigs/aws-alb-ingress-controller/v1.0.1/docs/examples/alb-ingress-controller.yaml
$ kubectl apply -f alb-ingress-controller.yaml
$ kubectl logs -n kube-system $(kubectl get po -n kube-system | egrep -o "alb-ingress[a-zA-Z0-9-]+")
W0130 12:45:29.518393 1 client_config.go:552] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
-------------------------------------------------------------------------------
AWS ALB Ingress controller
Release: v1.0.1
Build: git-ebac62dd
Repository: https://github.com/kubernetes-sigs/aws-alb-ingress-controller.git
-------------------------------------------------------------------------------
# Running a sample application
$ kubectl describe ing -n echoserver echoserver
Address: fe584233-echoserver-echose-XXXX-XXXXXXX.ap-northeast-1.elb.amazonaws.com
...
Then I upgraded directly to v1.1.9. Even though the ALB Ingress Controller performs a refresh, it preserved the original deployment association for the existing Ingress object (fe584233-echoserver-echose-XXXX-XXXXXXX.ap-northeast-1.elb.amazonaws.com):
# Upgrade and deploy to v1.1.9
$ kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/aws-alb-ingress-controller/v1.1.9/docs/examples/rbac-role.yaml
# Use kubectl and update the image to "docker.io/amazon/aws-alb-ingress-controller:v1.1.9"
$ kubectl logs -n kube-system $(kubectl get po -n kube-system | egrep -o "alb-ingress[a-zA-Z0-9-]+")
W0130 13:05:04.770613 1 client_config.go:549] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
-------------------------------------------------------------------------------
AWS ALB Ingress controller
Release: v1.1.9
Build: 6c19d2fb
Repository: https://github.com/kubernetes-sigs/aws-alb-ingress-controller.git
-------------------------------------------------------------------------------
# ELB name doesn't change
$ kubectl describe ing -n echoserver echoserver
Address: fe584233-echoserver-echose-XXXX-XXXXXXX.ap-northeast-1.elb.amazonaws.com
Throughout the preparation for the v2 upgrade, the ELB resource persisted and the Ingress stayed associated with the old ELB; however, as soon as v2 was deployed, a new ELB resource was created immediately under a different name (k8s-echoserv-echoserv-XXXXXXXX-XXXXXXX), and the associated Ingress and Kubernetes Service both migrated to the new ELB resource:
# Update the controller to v2
# The old ALB Ingress controller has been uninstalled at this moment, and can see the ingress object is still preserved
$ kubectl describe ing -n echoserver echoserver
Address: fe584233-echoserver-echose-XXXX-XXXXXXX.ap-northeast-1.elb.amazonaws.com
....
$ helm install aws-load-balancer-controller eks/aws-load-balancer-controller \
-n kube-system \
--set clusterName=eks \
--set serviceAccount.create=false \
--set serviceAccount.name=aws-load-balancer-controller
# Once v2 controller has been installed, the controller will update the ELB name
$ kubectl describe ing -n echoserver echoserver
Address: k8s-echoserv-echoserv-XXXXXXXX-XXXXXXX.ap-northeast-1.elb.amazonaws.com
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfullyReconciled 11s ingress Successfully reconciled
At this point, the old ELB resource (fe584233-echoserver-echose-XXXX-XXXXXXX.ap-northeast-1.elb.amazonaws.com) still exists; the AWS Load Balancer Controller simply no longer manages or operates it, and no longer registers the Kubernetes Service into the related Target Group resources.
From the deployment above, we can observe that in the old days the v1 controller composed names from the namespace + ingress name. This is indeed how v1.0.1 calls it (Source: L299-L317), and v1.1.9 does the same (Source: L285-L304). A short NameLB subroutine captures all of our predecessors' string-handling artistry (Source: v1.0.1, v1.1.9):
func (gen *NameGenerator) NameLB(namespace string, ingressName string) string {
hasher := md5.New()
_, _ = hasher.Write([]byte(namespace + ingressName))
hash := hex.EncodeToString(hasher.Sum(nil))[:4]
r, _ := regexp.Compile("[[:^alnum:]]")
name := fmt.Sprintf("%s-%s-%s",
r.ReplaceAllString(gen.ALBNamePrefix, "-"),
r.ReplaceAllString(namespace, ""),
r.ReplaceAllString(ingressName, ""),
)
if len(name) > 26 {
name = name[:26]
}
name = name + "-" + hash
return name
}
In v2, beyond functional improvements, the ALB Ingress Controller indeed underwent several rounds of refactoring. The most visible change is the naming above: the v2 naming follows richer rules (Source: v2.4.4, L90-L124):
func (t *defaultModelBuildTask) buildLoadBalancerName(_ context.Context, scheme elbv2model.LoadBalancerScheme) (string, error) {
...
if len(explicitNames) == 1 {
name, _ := explicitNames.PopAny()
// The name of the loadbalancer can only have up to 32 characters
if len(name) > 32 {
return "", errors.New("load balancer name cannot be longer than 32 characters")
}
return name, nil
}
if len(explicitNames) > 1 {
return "", errors.Errorf("conflicting load balancer name: %v", explicitNames)
}
uuidHash := sha256.New()
_, _ = uuidHash.Write([]byte(t.clusterName))
_, _ = uuidHash.Write([]byte(t.ingGroup.ID.String()))
_, _ = uuidHash.Write([]byte(scheme))
uuid := hex.EncodeToString(uuidHash.Sum(nil))
if t.ingGroup.ID.IsExplicit() {
payload := invalidLoadBalancerNamePattern.ReplaceAllString(t.ingGroup.ID.Name, "")
return fmt.Sprintf("k8s-%.17s-%.10s", payload, uuid), nil
}
sanitizedNamespace := invalidLoadBalancerNamePattern.ReplaceAllString(t.ingGroup.ID.Namespace, "")
sanitizedName := invalidLoadBalancerNamePattern.ReplaceAllString(t.ingGroup.ID.Name, "")
return fmt.Sprintf("k8s-%.8s-%.8s-%.10s", sanitizedNamespace, sanitizedName, uuid), nil
}
Besides stricter validation of ELB names, v2 now hashes the cluster name, Ingress Group, and several associated identifiers into a UUID, finally composing, with the k8s- prefix, the regularized names that everyone sees and not everyone loves.
This section spent hundreds of words conveying that v2 really does change a lot. To summarize:
If migrating from the v1 controller to v2 touches a scenario you currently face or will face, the generation of a new ELB resource is very likely and should be expected.
When planning the migration, if you have not yet placed another access layer in front of the ELB resource, a common practice is to manage the access location through DNS records (CNAME) to absorb ELB resource changes and reduce the client-side impact of the upgrade behavior above; it is also advisable to plan a corresponding maintenance window and account for the DNS record update in your change log.
After all, many things are just temporary bumps in life: under this mechanism, there is nothing that flushing the DNS cache and a "please try again later" message cannot solve.
Kubernetes does not natively provision an Application Load Balancer (ALB) for a Kubernetes Service object with type=LoadBalancer. Therefore, if you would like to expose your container service with an Application Load Balancer (ALB) on EKS, it is recommended to integrate the AWS Load Balancer Controller (formerly the ALB Ingress Controller, initially created by CoreOS and Ticketmaster). This controller makes it possible to manage load balancers from Kubernetes deployments.
Below is an overview diagram describing the controller workflow:
Note: the AWS ALB Ingress Controller has been superseded and renamed to "AWS Load Balancer Controller", with several new features. For more detail, please refer to the GitHub project - kubernetes-sigs/aws-alb-ingress-controller
Using an Application Load Balancer as the example: when running, the AWS Load Balancer Controller is deployed as a Pod on your worker nodes and continuously monitors/watches your cluster state. Whenever there is a request to create an Ingress object, the AWS Load Balancer Controller manages and creates the Application Load Balancer resource for you. Here is part of an example v1.1.8 deployment manifest:
apiVersion: apps/v1
kind: Deployment
metadata:
labels:
app.kubernetes.io/name: alb-ingress-controller
name: alb-ingress-controller
namespace: kube-system
spec:
selector:
matchLabels:
app.kubernetes.io/name: alb-ingress-controller
template:
metadata:
labels:
app.kubernetes.io/name: alb-ingress-controller
spec:
containers:
- name: alb-ingress-controller
args:
# Setting the ingress-class flag below ensures that only ingress resources with the
# annotation kubernetes.io/ingress.class: "alb" are respected by the controller. You may
# choose any class you'd like for this controller to respect.
- --ingress-class=alb
# REQUIRED
# Name of your cluster. Used when naming resources created
# by the ALB Ingress Controller, providing distinction between
# clusters.
# - --cluster-name=devCluster
# AWS VPC ID this ingress controller will use to create AWS resources.
# If unspecified, it will be discovered from ec2metadata.
# - --aws-vpc-id=vpc-xxxxxx
# AWS region this ingress controller will operate in.
# If unspecified, it will be discovered from ec2metadata.
# List of regions: http://docs.aws.amazon.com/general/latest/gr/rande.html#vpc_region
# - --aws-region=us-west-1
image: docker.io/amazon/aws-alb-ingress-controller:v1.1.8
serviceAccountName: alb-ingress-controller
The deployment runs one replica of the ALB Ingress Controller (pod/alb-ingress-controller-xxxxxxxx-xxxxx) in kube-system:
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system pod/alb-ingress-controller-5fd8d5d894-8kf7z 1/1 Running 0 28s
NAMESPACE NAME READY UP-TO-DATE AVAILABLE AGE
kube-system deployment.apps/alb-ingress-controller 1/1 1 1 3m48s
Since v2, the controller has added many different custom resources and enhancements, but the core deployment still preserves much of what is mentioned in this post. Depending on your environment, the default and suggested installation steps may also involve configuring IRSA (IAM Roles for Service Accounts) to grant the AWS Load Balancer Controller Pods permission to operate AWS resources (e.g. ELB), so it is recommended to take a look at the official documentation to quickly understand how to install the controller:
In addition, the service can be exposed via an Ingress object. For example, deploying the simple 2048 application:
$ kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/aws-alb-ingress-controller/v1.1.8/docs/examples/2048/2048-namespace.yaml
$ kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/aws-alb-ingress-controller/v1.1.8/docs/examples/2048/2048-deployment.yaml
$ kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/aws-alb-ingress-controller/v1.1.8/docs/examples/2048/2048-service.yaml
$ kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/aws-alb-ingress-controller/v1.1.8/docs/examples/2048/2048-ingress.yaml
The file 2048-ingress.yaml declares the annotations and spec in a format the ALB Ingress Controller can recognize (before Kubernetes 1.18):
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
name: "2048-ingress"
namespace: "2048-game"
annotations:
kubernetes.io/ingress.class: alb
alb.ingress.kubernetes.io/scheme: internet-facing
labels:
app: 2048-ingress
spec:
rules:
- http:
paths:
- path: /*
backend:
serviceName: "service-2048"
servicePort: 80
Before the IngressClass resource and the ingressClassName field were added in Kubernetes 1.18, Ingress classes were specified with a kubernetes.io/ingress.class annotation on the Ingress. So with controller version v2.x, you should see the ingress specification defined as below:
$ kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/aws-load-balancer-controller/v2.4.1/docs/examples/2048/2048_full.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
namespace: game-2048
name: ingress-2048
annotations:
alb.ingress.kubernetes.io/scheme: internet-facing
alb.ingress.kubernetes.io/target-type: ip
spec:
ingressClassName: alb
rules:
- http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: service-2048
port:
number: 80
The Ingress object constructs the ELB listeners according to the rules and forwards connections to the corresponding backend (serviceName), which matches the service service-2048; any traffic matching the rule /* is routed to the group of selected Pods. In this case, the Pods are exposed on the worker nodes via type=NodePort:
Here is the definition of this Kubernetes service:
apiVersion: v1
kind: Service
metadata:
name: "service-2048"
namespace: "2048-game"
spec:
ports:
- port: 80
targetPort: 80
protocol: TCP
type: NodePort
selector:
app: "2048"
Zero-downtime deployment is always a big challenge for DevOps/operations teams running any kind of business. When you adopt the controller as the solution to expose your service, a couple of things need care due to the combined behavior of Kubernetes, the ALB, and the AWS Load Balancer Controller. Achieving zero downtime requires considering many perspectives, and new challenges pop up when you roll out a new deployment of your Pods behind the AWS Load Balancer Controller.
Let's use the 2048 game as an example to describe the scenario of rolling out a new version of your container application. In my environment, service/service-2048 uses NodePort to expose the service:
NAMESPACE NAME READY STATUS RESTARTS AGE
2048-game pod/2048-deployment-58fb66554b-2f748 1/1 Running 0 53s
2048-game pod/2048-deployment-58fb66554b-4hz5q 1/1 Running 0 53s
2048-game pod/2048-deployment-58fb66554b-jdfps 1/1 Running 0 53s
2048-game pod/2048-deployment-58fb66554b-rlpqm 1/1 Running 0 53s
2048-game pod/2048-deployment-58fb66554b-s492n 1/1 Running 0 53s
NAMESPACE NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
2048-game service/service-2048 NodePort 10.100.53.119 <none> 80:30337/TCP 52s
NAMESPACE NAME READY UP-TO-DATE AVAILABLE AGE
2048-game deployment.apps/2048-deployment 5/5 5 5 53s
And of course, once the controller is correctly set up and has provisioned the ELB resource, the full ELB domain name is recorded on the Ingress object:
$ kubectl get ingress -n 2048-game
NAME HOSTS ADDRESS PORTS AGE
2048-ingress * xxxxxxxx-2048game-xxxxxxxx-xxxx-xxxxxxxxx.ap-northeast-1.elb.amazonaws.com 80 11m
I can use the DNS name as the endpoint to visit my container service:
$ curl -s xxxxxxxx-2048game-xxxxxxxx-xxxx-xxxxxxxxx.ap-northeast-1.elb.amazonaws.com | head
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>2048</title>
<link href="style/main.css" rel="stylesheet" type="text/css">
<link rel="shortcut icon" href="favicon.ico">
...
This application could be any kind of critical service you run. As an administrator, SRE (Site Reliability Engineer), member of an operations team, or DevOps engineer, your goal and duty is to ensure the service runs properly, without issues or interruption (sometimes that means a good night's sleep). That is why the people who actually get their hands dirty maintaining day-to-day operations usually dislike adopting service changes: change generally means instability.
Whether you like change or not, new business requests still bring challenges: your developers say "Oh! we need to upgrade the application", "we are going to roll out a bug fix", "the new feature is going online". No one can guarantee one hundred percent that the service will run properly after any change, because systems have limitations and trade-offs, and any service downtime can make stakeholders (users, the operations team, or leadership) unhappy.
The question, however, is: can we address these problems better once we know the limitations and the behavior? Some people in Taiwan even put Kuai Kuai snacks on their workstations because they believe it keeps services happy; I am not particularly devoted to that method, so in the following section I will walk through more realistic logic and phenomena using the 2048 game as my sample service.
I am going to use a simple loop trick to continuously access my service via the endpoint xxxxxxxx-2048game-xxxxxxxx-xxxx-xxxxxxxxx.ap-northeast-1.elb.amazonaws.com to demonstrate a scenario: this is a popular web service that customers constantly need to access (e.g. a social-media platform, a bitcoin trading platform, or anything else where we have essentially zero tolerance for downtime because it impacts revenue), as below:
$ while true;do ./request-my-service.sh; sleep 0.1; done
HTTPCode=200_TotalTime=0.010038
HTTPCode=200_TotalTime=0.012131
HTTPCode=200_TotalTime=0.005366
HTTPCode=200_TotalTime=0.010119
HTTPCode=200_TotalTime=0.012066
HTTPCode=200_TotalTime=0.005451
HTTPCode=200_TotalTime=0.010006
HTTPCode=200_TotalTime=0.012084
HTTPCode=200_TotalTime=0.005598
HTTPCode=200_TotalTime=0.010086
HTTPCode=200_TotalTime=0.012162
HTTPCode=200_TotalTime=0.005278
HTTPCode=200_TotalTime=0.010326
HTTPCode=200_TotalTime=0.012193
HTTPCode=200_TotalTime=0.005347
...
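The contents of request-my-service.sh are not shown above; a minimal reconstruction of mine (a sketch; the function name and URL handling are my assumptions) that produces the same HTTPCode=..._TotalTime=... output shape would be:

```shell
# Hypothetical reconstruction of request-my-service.sh: hit the endpoint once
# and print the HTTP status code plus the total request time on one line.
request_my_service() {
  # $1 = endpoint URL; prints e.g. HTTPCode=200_TotalTime=0.010038
  curl -s -o /dev/null --max-time 10 \
    -w "HTTPCode=%{http_code}_TotalTime=%{time_total}\n" \
    "$1"
}

# Loop form equivalent to the one shown above:
#   while true; do request_my_service "http://<elb-dns-name>/"; sleep 0.1; done
```

curl's --write-out variables do the heavy lifting here: http_code is 000 when the connection fails outright, which also makes connection-level outages visible in the loop.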
Meanwhile, I am using the RollingUpdate strategy in my Kubernetes Deployment with maxUnavailable=25%, which means that when Kubernetes needs to update or patch something (like the image or environment variables), the number of unavailable Pods cannot exceed 25%, ensuring that at least 75% of the desired number of Pods stay up (only 1-2 Pods are replaced at a time when I have 5 replicas):
apiVersion: apps/v1
kind: Deployment
metadata:
name: 2048-deployment
namespace: 2048-game
spec:
...
selector:
matchLabels:
app: "2048"
...
strategy:
rollingUpdate:
maxSurge: 25%
maxUnavailable: 25%
type: RollingUpdate
When rolling out the new version of my container application (for example, updating the deployment to replace the container image with the new nginx image), there can be a window in which a few hits return HTTP status code 502:
If you configure the controller to use instance mode to register targets (Pods) in your ELB Target Group, it uses the worker nodes' instance IDs and exposes your service in that target group through the Kubernetes NodePort. In this case, traffic follows the Kubernetes networking design for a second hop according to the externalTrafficPolicy defined in the Kubernetes Service object (whether externalTrafficPolicy=Cluster or externalTrafficPolicy=Local).
Because the controller only registers worker nodes in the ELB target group, a scenario that does not involve worker-node replacement sees minimal or even no downtime (downtime should be rare if Kubernetes forwards the traffic perfectly); however, the real world does not operate that way, and a few seconds of downtime can still occur due to the workflow below:
This is the general workflow when a client reaches the service endpoint (ELB) and how the traffic flows:
Client ----> ELB ----> Worker Node (iptables) / In this step it might be forwarded to other Worker Node ----> Pod
So, in these cases, you can see downtime:
- An old Pod is already in Terminating state but has not responded yet, so the ELB does not get a response from that Pod.
If you strip away the encapsulation of the Kubernetes networking layer and simplify things with the AWS-supported CNI plugin (relying on the ELB to forward traffic directly to the Pods by using IP mode, with the annotation alb.ingress.kubernetes.io/target-type: ip on my Ingress object), the downtime during a Pod RollingUpdate becomes even more visible. That is because, beyond the issues mentioned in cases (1)/(2)/(3), the controller's behavior raises a separate topic that must be covered when the goal is zero-downtime deployment:
Here is an example using IP mode (alb.ingress.kubernetes.io/target-type: ip) as the registration type to route traffic directly to the Pod IP:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
namespace: game-2048
name: ingress-2048
annotations:
alb.ingress.kubernetes.io/scheme: internet-facing
alb.ingress.kubernetes.io/target-type: ip
spec:
ingressClassName: alb
rules:
- http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: service-2048
port:
number: 80
Following the issues (1) (2) (3) mentioned above, a similar problem can be observed during the rolling update (I replaced the image again, this time in IP mode). Potentially, you may notice 10-15 seconds or even longer of downtime when doing the same lab:
While Kubernetes rolls the Deployment, you can see in the target group that the AWS Load Balancer Controller starts draining the old targets (old Pods) at the same time. However, you can still observe HTTP 502/504 errors exceeding 3-10 seconds for a single request:
HTTPCode=200_TotalTime=0.005413
2048
HTTPCode=200_TotalTime=0.009980
502 Bad Gateway
HTTPCode=502_TotalTime=3.076954
2048
HTTPCode=200_TotalTime=0.005700
2048
HTTPCode=200_TotalTime=0.010019
502 Bad Gateway
HTTPCode=502_TotalTime=3.081601
2048
HTTPCode=200_TotalTime=0.005527
502 Bad Gateway
HTTPCode=502_TotalTime=3.070947
502 Bad Gateway
HTTPCode=502_TotalTime=3.187812
504 Gateway Time-out
HTTPCode=504_TotalTime=10.006324
Welcome to nginx!
HTTPCode=200_TotalTime=0.011838
Welcome to nginx!
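As a side note, if you are reproducing this lab, the probe output above can be tallied with a short helper. This is only a sketch of the kind of script I use; it assumes the log lines follow the `HTTPCode=<code>_TotalTime=<seconds>` format produced by `curl -w` as shown above, with response-body lines in between.

```python
from collections import Counter

def summarize(lines):
    """Tally status codes from probe lines like 'HTTPCode=502_TotalTime=3.076954'."""
    codes = Counter()
    total_time = 0.0
    for line in lines:
        if not line.startswith("HTTPCode="):
            continue  # skip response-body lines such as '2048'
        code_part, time_part = line.split("_TotalTime=")
        codes[code_part.removeprefix("HTTPCode=")] += 1
        total_time += float(time_part)
    return codes, total_time

sample = [
    "2048",
    "HTTPCode=200_TotalTime=0.005413",
    "502 Bad Gateway",
    "HTTPCode=502_TotalTime=3.076954",
    "HTTPCode=504_TotalTime=10.006324",
]
codes, elapsed = summarize(sample)
print(dict(codes))  # {'200': 1, '502': 1, '504': 1}
```

Counting the 5xx lines this way makes it easy to compare the downtime window between instance mode, IP mode, and the readiness-gate setup discussed later.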
Let’s use this scenario, as it is an edge problem we need to consider for most use cases. The issue brings out the core topic we want to address, and gives a good entry point to dive deep into the workflow between Kubernetes, the AWS Load Balancer Controller, and the ELB, which can lead to HTTP 502/503 (5xx) errors during deployment when Pods are terminated.
Before diving into it, we need to know that when a Pod is being replaced, the AWS Load Balancer Controller registers the new Pod in the target group and removes the old Pods. However, at the same time:
- New targets stay in the initial state until they pass the defined health check threshold (the ALB health check).
- Old targets stay in the draining state until they finish draining in-flight connections, or until the Deregistration delay defined in the target group is reached.
This results in the service being unavailable and returning HTTP 502.
To better understand this, I made the following diagrams; they may help you follow the workflow:
1) In the diagrams, I use IP addresses to mark the Pods so you can tell the new ones from the old. Here is the initial deployment.
2) At this stage, I update the container image and start rolling out new copies of the Pods. In the meantime, the controller makes the RegisterTarget API call to the ELB on behalf of Kubernetes.
3) Meanwhile, the DeregisterTarget API is called by the AWS Load Balancer Controller, and the new targets are in the initial state.
4) At this stage, anything could happen and cause a service outage. The DeregisterTarget API call might take some time to process, but Kubernetes has no mechanism to monitor the current state of the ELB target group; it only cares about rolling out the new version of the Pods and terminating the old ones.
In this case, if a Pod is terminated by Kubernetes while Target-1 or Target-2 is still in the ELB target group in the Active/Healthy state (it takes a few seconds to become Unhealthy, once the ELB HTTP health check threshold is reached), the ELB cannot forward front-end requests to the backend correctly.
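To put a rough number on this black-hole window: the ELB keeps routing to a terminated Pod until enough consecutive health checks fail. A back-of-the-envelope sketch (the interval and threshold values here are illustrative, not your target group's actual settings):

```python
def blackhole_window_seconds(check_interval, unhealthy_threshold):
    """Worst-case seconds a terminated Pod can stay Healthy in the target
    group: the ELB needs `unhealthy_threshold` consecutive failed checks,
    spaced `check_interval` seconds apart, before it stops routing to it."""
    return check_interval * unhealthy_threshold

# e.g. a 10-second interval with a threshold of 2 failed checks
print(blackhole_window_seconds(10, 2))  # 20 -> up to ~20s of HTTP 502s
```

Tightening the health check interval and threshold shrinks this window, but cannot eliminate it, which is why the readiness gate and preStop approaches below matter.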
5) The ELB receives the DeregisterTarget request, so the target group starts connection draining: it marks the old targets Target-1/Target-2 as draining, and no new connections are routed to them.
6) However, this brings another issue: if the new targets (Target-3 and Target-4) are still working on passing the ELB health check (they are currently in the initial state), there is no backend able to serve traffic at this moment, so the ELB can only return HTTP 5xx status codes.
7) Only once the new Pods are in the Running state and can answer the ELB’s health check requests over HTTP/HTTPS does the ELB finally mark the targets as Active/Healthy, and the service becomes available.
Since version v1.1.6, the AWS Load Balancer Controller (ALB Ingress Controller) has provided Pod readiness gates. This feature monitors the rolling deployment state and pauses the deployment on any unexpected issue (such as timeout errors from AWS APIs), which guarantees you always have healthy Pods in the target group, even when ELB API calls fail during a rolling update.
As the workflow above shows, if you would like to prevent the downtime, several workarounds are required to keep the Pod state consistent between the ALB, the ALB Ingress Controller, and Kubernetes.
In the past, the readiness gate was configured manually (version 1) in the Pod spec. Here is an example adding a readiness gate with conditionType: target-health.alb.ingress.k8s.aws/<ingress name>_<service name>_<service port> (this might change over time; for details, please refer to the documentation in the AWS Load Balancer Controller project on GitHub):
apiVersion: v1
kind: Service
metadata:
name: nginx-service
spec:
clusterIP: None
ports:
- port: 80
protocol: TCP
targetPort: 80
selector:
app: nginx
---
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
name: nginx-ingress
annotations:
kubernetes.io/ingress.class: alb
alb.ingress.kubernetes.io/target-type: ip
alb.ingress.kubernetes.io/scheme: internal
spec:
rules:
- http:
paths:
- backend:
serviceName: nginx-service
servicePort: 80
path: /*
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: nginx-deployment
spec:
selector:
matchLabels:
app: nginx
replicas: 2
template:
metadata:
labels:
app: nginx
spec:
readinessGates:
- conditionType: target-health.alb.ingress.k8s.aws/nginx-ingress_nginx-service_80
containers:
- name: nginx
image: nginx
ports:
- containerPort: 80
Now, if you are using a controller later than v2, the readiness gate configuration can be injected into the Pod spec automatically by adding the label elbv2.k8s.aws/pod-readiness-gate-inject: enabled to your Kubernetes namespace.
$ kubectl create namespace readiness
namespace/readiness created
$ kubectl label namespace readiness elbv2.k8s.aws/pod-readiness-gate-inject=enabled
namespace/readiness labeled
$ kubectl describe namespace readiness
Name: readiness
Labels: elbv2.k8s.aws/pod-readiness-gate-inject=enabled
Annotations: <none>
Status: Active
So defining the legacy readinessGates and conditionType fields is not required if you are using a controller later than v2.0. If you have a Pod spec with a legacy readiness gate configuration, make sure you label the namespace and create the Service/Ingress objects before applying the Pod/Deployment manifest. The controller will remove all legacy readiness gate configuration and add the new one during Pod creation.
For existing connections (as mentioned in workflow step 4), the case involves graceful shutdown/termination handling in Kubernetes, so it requires using the mechanism Kubernetes provides.
You can use the Pod lifecycle preStop hook to add a pause (for example, with the sleep command) before Pod termination. This trick ensures the ALB has some time to completely remove the old targets from the target group (it is recommended to make the pause longer than your Deregistration delay):
lifecycle:
preStop:
exec:
command: ["/bin/sh", "-c", "sleep 40"]
terminationGracePeriodSeconds: 70
Note: if a container has a preStop hook configured, it runs before the container enters the Terminated state. If the preStop hook needs longer to complete than the default grace period allows, you must increase terminationGracePeriodSeconds to suit it.
First apply the label to the namespace so the controller can automatically inject the readiness gate:
apiVersion: v1
kind: Namespace
metadata:
name: 2048-game
labels:
elbv2.k8s.aws/pod-readiness-gate-inject: enabled
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: "2048-deployment"
namespace: "2048-game"
spec:
selector:
matchLabels:
app: "2048"
replicas: 5
template:
metadata:
labels:
app: "2048"
spec:
# This is optional if you are using a controller after v2.x
readinessGates:
- conditionType: target-health.alb.ingress.k8s.aws/2048-ingress_service-2048_80
terminationGracePeriodSeconds: 70
containers:
- image: alexwhen/docker-2048
imagePullPolicy: Always
name: "2048"
ports:
- containerPort: 80
lifecycle:
preStop:
exec:
command: ["/bin/sh", "-c", "sleep 40"]
Here is an example from when I gave this practice a try. The deployment applies the feature, and you can see the status of the readiness gates:
$ kubectl get pods -n 2048-game -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
2048-deployment-99b6fb474-c97ht 1/1 Running 0 78s 192.168.14.209 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX.compute.internal <none> 1/1
2048-deployment-99b6fb474-dcxfs 1/1 Running 0 78s 192.168.31.47 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX.compute.internal <none> 1/1
2048-deployment-99b6fb474-kvhhh 1/1 Running 0 54s 192.168.29.6 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX.compute.internal <none> 1/1
2048-deployment-99b6fb474-vhjbg 1/1 Running 0 54s 192.168.18.161 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX.compute.internal <none> 1/1
2048-deployment-99b6fb474-xfd5q 1/1 Running 0 78s 192.168.16.183 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX.compute.internal <none> 1/1
Once the new version of the container image is rolled out, the deployment goes smoothly and prevents the downtime issue mentioned in the previous paragraphs:
In my scenario, Kubernetes needs to take at least a 40-second termination period for each Pod, so the old targets are moved out gradually instead of all being removed at once within a few seconds, until the target group contains only new targets.
Therefore, you probably also need to pay attention to the Deregistration delay defined in your ELB target group, which can be updated through the annotation:
alb.ingress.kubernetes.io/target-group-attributes: deregistration_delay.timeout_seconds=30
In this case, it is recommended to set it to less than 40 seconds so the ELB can drain your old targets before the Pod completely shuts down.
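The three timers in play here have to be ordered correctly. A tiny sanity check for the invariant this section builds up, using the values from the manifests above (a 30-second deregistration delay, a 40-second preStop sleep, and a 70-second grace period):

```python
def graceful_termination_ok(deregistration_delay, prestop_sleep, grace_period):
    """Invariant used in this article: the ELB must finish draining
    (deregistration_delay) before the preStop pause ends, and the preStop
    pause must end before Kubernetes force-kills the Pod (grace_period)."""
    return deregistration_delay < prestop_sleep < grace_period

print(graceful_termination_ok(30, 40, 70))  # True
print(graceful_termination_ok(60, 40, 70))  # False: draining outlives the preStop pause
```

If you change any one of the three values, it is worth re-checking the ordering, since a drain that outlives the preStop pause reintroduces the 502s described earlier.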
With this configuration, clients get normal responses from the old Pods/existing connections during the deployment:
HTTPCode=200_TotalTime=0.012028
2048
HTTPCode=200_TotalTime=0.005383
2048
HTTPCode=200_TotalTime=0.010174
2048
HTTPCode=200_TotalTime=0.012233
Welcome to nginx!
HTTPCode=200_TotalTime=0.007116
2048
HTTPCode=200_TotalTime=0.010090
2048
HTTPCode=200_TotalTime=0.012201
2048
HTTPCode=200_TotalTime=0.005532
2048
HTTPCode=200_TotalTime=0.010107
2048
HTTPCode=200_TotalTime=0.012163
Welcome to nginx!
HTTPCode=200_TotalTime=0.005452
Welcome to nginx!
HTTPCode=200_TotalTime=0.009950
2048
HTTPCode=200_TotalTime=0.012082
Welcome to nginx!
HTTPCode=200_TotalTime=0.005349
2048
HTTPCode=200_TotalTime=0.010142
2048
HTTPCode=200_TotalTime=0.012143
2048
HTTPCode=200_TotalTime=0.005507
...
HTTPCode=200_TotalTime=0.012149
Welcome to nginx!
HTTPCode=200_TotalTime=0.005364
Welcome to nginx!
HTTPCode=200_TotalTime=0.010021
Welcome to nginx!
HTTPCode=200_TotalTime=0.012092
Welcome to nginx!
HTTPCode=200_TotalTime=0.005463
Welcome to nginx!
HTTPCode=200_TotalTime=0.010136
Welcome to nginx!
This is the practice for doing a graceful RollingUpdate deployment with the AWS Load Balancer Controller. However, what type of application you are rolling out is another big topic that needs to be discussed, because some applications need to hold long-lived connections through the ELB, or need persistent data stored on the backend; all of these bring out other issues we would need to talk about.
In summary, along with the deployment strategy above, it is also recommended to design the client and backend applications to be stateless, and to implement retries and fault tolerance. These methods usually help reduce customer complaints and provide a better user experience for the most common use cases.
Due to the current design of Kubernetes, a state-inconsistency issue is involved when you expose a service with an Application Load Balancer. In this article, I covered the potential issues when doing a rolling update of a container service integrated with the AWS Load Balancer Controller (ALB Ingress Controller).
Even though the technology keeps evolving, I am still willing to help people better handle their deployment strategy. I spent a couple of hours drafting this content, trying to cover several major issues, mention the things you need to be aware of, break down the entire workflow, and share a few practical suggestions that can be achieved with the AWS Load Balancer Controller to reach the goal of zero-downtime deployment.
This article is based on my own experience (and, of course, many back-and-forth conversations with different customers using AWS). It might not be perfect, but I hope it is helpful to you. Of course, if you find any typo or have any suggestions, please feel free to leave a comment below.
Even when building your own payment-flow integration, applying these payment services on top of cloud platforms still takes some time to explore for users who are unfamiliar with cloud technologies.
To implement a credit card payment mechanism on AWS by integrating ECPay, and to simplify management and maintenance, the following introduces the implementation details on AWS using Serverless technology as the backdrop.
The Serverless computing concept proposes abandoning traditional server management. In the past, you needed to maintain and manage the underlying compute systems running your application; Serverless proposes a Platform-as-a-Service (PaaS) model of operation, providing a simple and easy-to-operate micro architecture so you do not need to deploy, configure, or manage servers. You only need to push your code to the platform using a Serverless solution, and the cloud platform provides all the server-side services required to run it.
Since AWS launched its Serverless services in 2014, Serverless has become a popular runtime architecture among IT deployment solutions; learning Serverless will help you ship different types of applications more easily and quickly, and put your ideas into practice.
In the past, to run a payment service I might have needed a virtual machine providing the business logic 24 hours a day, and unexpected situations could add a lot of extra work: a sudden surge of order requests or excessive load, for example, could affect the business. Writing the application is usually only the starting point; the system maintenance work afterwards is the bigger challenge.
One of the considerations in choosing a Serverless architecture is the long-term maintainability of deploying, configuring, and managing servers, which is especially critical for a key business function like payments.
Generally speaking, building a Serverless-based application on AWS involves several key services; for this payment system, I used the following AWS services:
AWS Lambda is a serverless compute service that lets you run code without provisioning or managing servers, creating workload-aware cluster scaling logic, maintaining event integrations, or managing runtimes. With Lambda, you can run virtually any type of application or backend service, all without administration.
Amazon API Gateway is a managed AWS service that makes it easy for developers to create, publish, maintain, monitor, and secure APIs at any scale. APIs act as the "front door" for applications to access data, business logic, or functionality from your backend services. With API Gateway, you can create applications such as RESTful APIs. API Gateway supports serverless workloads and web applications, and handles all the tasks involved in accepting and processing up to hundreds of thousands of concurrent API calls, including traffic management, CORS support, authorization and access control. API Gateway has no minimum fees or startup costs; you pay for the API calls you receive and the amount of data transferred.
To make it easier to integrate ECPay payments on AWS under a Serverless architecture, I used the AWS Serverless Application Model (SAM) as a key tool in the development workflow to build the serverless application services.
The Serverless Application Model (SAM) provides a set of simple, declarative methods; in many cases, you can build a serverless application through SAM without being deeply familiar with the configuration of each individual AWS service.
To help you quickly understand how the Serverless Application Model (SAM) works, the following short 10-minute video introduces its workflow:
The user flow above describes how users and the individual AWS services interact. Taking order creation as an example, we can use the SDK provided by ECPay to design the checkout operations on AWS: expose a consistent external API endpoint through API Gateway, implement the order-creation logic as a Python application, and deploy it to AWS Lambda through the CLI tool provided by the Serverless Application Model (SAM CLI).
Under this architecture, we only need to focus on designing the checkout flow and the user flow; everything else about operating the services can be delegated to AWS Serverless solutions to meet the business requirements.
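To make the order-creation Lambda behind API Gateway concrete, here is a minimal, hypothetical sketch of such a handler. The field names (`MerchantTradeNo`, `TotalAmount`, `ItemName`) mirror common ECPay order parameters, but they are illustrative; the actual signing and checkout-form generation would be done with the ECPay SDK, which is omitted here:

```python
import json

def lambda_handler(event, context):
    """Hypothetical create-order endpoint behind API Gateway.

    A real integration would pass these parameters to the ECPay SDK to
    sign them and render an auto-submitting checkout form; this sketch
    only echoes the assembled order back as JSON.
    """
    body = json.loads(event.get("body") or "{}")
    order = {
        "MerchantTradeNo": body.get("trade_no", "demo0001"),
        "TotalAmount": body.get("amount", 100),
        "ItemName": body.get("item", "test item"),
    }
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(order),
    }

resp = lambda_handler({"body": json.dumps({"amount": 250, "item": "book"})}, None)
print(resp["statusCode"])  # 200
```

With SAM, a handler like this would be declared as an `AWS::Serverless::Function` resource with an API event source and deployed through the SAM CLI.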
If you are interested in the concrete implementation, you can get more information through the links below:
From zero to hero: learn the AWS fundamentals, dive deep into Serverless services and architectures, and learn to use different AWS solutions to practice serverless technology and run your own cloud payment system.
This post gave an overview of the implementation flow for building a payment application on an AWS Serverless architecture, touched on the mechanisms for integrating ECPay on AWS, and shared a reference architecture. If you are interested in the full implementation details, you can use the link below to get the complete content:
If you found this content helpful, please give it a Like / leave a comment below to let me know.
Here are some common load balancing solutions that can be applied on Amazon EKS:
This is the easiest way to provision your Elastic Load Balancer resource: use the default Kubernetes Service deployment with type: LoadBalancer. In most cases, the in-tree controller can quickly spin up the load balancer for experimental purposes, or serve production workloads.
However, you need to be aware of the problems we mentioned in the previous posts 1 2, because this generally adds a hop to your load balancing behavior on AWS and increases the complexity of your traffic.
In addition, be aware that this method only applies to creating a Classic Load Balancer or a Network Load Balancer (by using an annotation 3).
If you are using the nginx Ingress controller on AWS, it will deploy a Network Load Balancer (NLB) to expose the NGINX Ingress controller behind a Service of type=LoadBalancer. Here is an example of deploying the Kubernetes Service for nginx Ingress controller 1.1.3:
apiVersion: v1
kind: Service
metadata:
annotations:
service.beta.kubernetes.io/aws-load-balancer-backend-protocol: tcp
service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled: "true"
service.beta.kubernetes.io/aws-load-balancer-type: nlb
labels:
app.kubernetes.io/component: controller
app.kubernetes.io/instance: ingress-nginx
app.kubernetes.io/name: ingress-nginx
app.kubernetes.io/part-of: ingress-nginx
app.kubernetes.io/version: 1.1.3
name: ingress-nginx-controller
namespace: ingress-nginx
spec:
externalTrafficPolicy: Local
ports:
- appProtocol: http
name: http
port: 80
protocol: TCP
targetPort: http
- appProtocol: https
name: https
port: 443
protocol: TCP
targetPort: https
selector:
app.kubernetes.io/component: controller
app.kubernetes.io/instance: ingress-nginx
app.kubernetes.io/name: ingress-nginx
type: LoadBalancer
Guess what: it still relies on the in-tree controller, so the problems we mentioned can persist. It can be hard to predict which Pods will receive the traffic; but the main issue is that an Ingress controller does not typically eliminate the need for an external load balancer, it simply adds an additional layer of routing and control behind the load balancer.
So why choose the nginx Ingress controller? The reason is probably the one mentioned in post 4 on the AWS Blog:
The AWS Load Balancer Controller is similar to the in-tree Kubernetes controller and uses native AWS APIs to provision and manage Elastic Load Balancers. The controller is an open-source project originally named ALB Ingress Controller, because at the initial stage it only provided the capability to manage Application Load Balancers; it was later officially renamed the AWS Load Balancer Controller 5, and it is maintained by the AWS product team and the open-source community.
Unlike the in-tree Kubernetes controller, which has to wait for upstream code updates and requires you to upgrade the Kubernetes control plane version whenever the controller has a bug or a new ELB feature needs support, the AWS Load Balancer Controller can be upgraded gracefully because it runs as a Kubernetes Deployment instead of relying on Kubernetes upstream source code integration.
The controller directly maintains your Elastic Load Balancer resources with up-to-date annotations. The nginx Ingress controller provisions and adds an extra load balancing layer with a Network Load Balancer, so the traffic generally passes through the controller itself (nginx-ingress); the AWS Load Balancer Controller, by contrast, does not act as a gateway. It directly controls the Elastic Load Balancer resource, and can register your Pods (using IP mode) so requests are forwarded directly to your backend application.
The AWS Load Balancer Controller has also supported the TargetGroupBinding 6 and IngressGroup 7 features since v2.2. IngressGroup lets you group multiple Ingress resources together, which allows multiple Service deployments to share the same Elastic Load Balancer resource.
Comparing the different load balancer controllers, generally speaking, the AWS Load Balancer Controller has better feature support and can adopt performance optimizations by configuring the Elastic Load Balancer attributes correctly. It is essential to enable IP mode when applying a Kubernetes Service deployment with the AWS Load Balancer Controller, to remove the unnecessary hop that can be caused by Kubernetes networking itself, which is generally not a perfect fit for AWS networking and Elastic Load Balancing features.
However, the disadvantage of the AWS Load Balancer Controller is that every feature must be supported by the Elastic Load Balancer itself, because the controller does not add extra functions to extend traffic control. Other controllers still have their own benefits and provide features the Elastic Load Balancer does not have; with the nginx Ingress controller, for example, you may be able to forward a service to external FastCGI targets, use regular expressions for path matching, etc.
By the end of this article, I hope the comparison and information better help you understand how to select a load balancer controller to run in Amazon EKS, and choose the right option for your environment.
Thanks for reading! If you have any feedback or opinions, please feel free to leave the comment below.
[AWS][EKS] Best practice load balancing - Let’s start with an example from Kubernetes document ↩
[AWS][EKS] Best practice load balancing - imbalanced problem ↩
in-tree controller - Network Load Balancer support on AWS ↩
Using a Network Load Balancer with the NGINX Ingress Controller on Amazon EKS ↩
AWS Load Balancer controller v2.2 - TargetGroupBinding ↩
AWS Load Balancer controller v2.2 - IngressGroup ↩
Following the example from the previous article: suppose you deployed a Kubernetes Service and noticed that the utilization of your backend application is not balanced; or, while using the AWS Load Balancer Controller, Traefik, or the nginx-ingress controller, you found that the Elastic Load Balancer did not separate the load correctly (when using instance mode to register your Pods as targets). That imbalanced traffic is the major topic this article would like to talk about: how to improve and optimize it.
Let’s say I am deploying 4 Pods in my Kubernetes cluster, using the default deployment below to expose my Kubernetes Service:
$ kubectl get pod -o wide
NAME READY STATUS RESTARTS AGE IP
nginx-deployment-594764c789-5s668 1/1 Running 0 30m 192.168.42.171
nginx-deployment-594764c789-9k949 1/1 Running 0 30m 192.168.39.194
nginx-deployment-594764c789-b292m 1/1 Running 0 33m 192.168.29.24
nginx-deployment-594764c789-s226c 1/1 Running 0 30m 192.168.15.158
The Kubernetes service:
apiVersion: v1
kind: Service
metadata:
name: nginx-svc
labels:
app: nginx
spec:
type: LoadBalancer
ports:
- port: 80
protocol: TCP
selector:
app: nginx
To better illustrate the problem described in this post, the application I deployed responds with the Pod IP address, so we know which Pod received each request:
After running a loop and making at least 79 HTTP requests in my test, I got the following responses, showing how the load was distributed:
- 192.168.42.171: 12 times
- 192.168.39.194: 33 times
- 192.168.29.24: 23 times
- 192.168.15.158: 10 times
According to this test, we can see the load is not very evenly distributed.
As mentioned in the previous post, whether you define externalTrafficPolicy=Cluster or externalTrafficPolicy=Local, the routing behavior relies on iptables (or ipvs) and can be unpredictable, because it performs a second layer of load balancing, which needlessly adds a hop inside the AWS VPC.
The Elastic Load Balancer in AWS already provides a straightforward solution to balance your load, and its algorithm tries to distribute requests across all backend servers as evenly as possible. Doing load balancing inside the Kubernetes network generally increases the complexity of your architecture and makes traffic hard to trace; or, even worse, causes the imbalance you can observe here.
This also makes the load balancing unpredictable. Although the traffic sent to the registered EC2 instances can be evenly distributed, that does not mean the load is separated evenly across the Pods as well. You never know which Pod will be routed to, because of this load balancing layer implemented by Kubernetes networking.
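A quick simulation illustrates why. Assume a hypothetical layout of 4 Pods spread unevenly across 2 registered instances, with a node-local random pick as the second hop (as with externalTrafficPolicy=Local); the names and numbers are made up for illustration:

```python
import random

random.seed(0)

# Hypothetical layout: 2 registered instances, 4 Pods spread unevenly.
pods_per_node = {"node-a": ["pod-1"], "node-b": ["pod-2", "pod-3", "pod-4"]}

def instance_mode(requests=10_000):
    """ELB splits requests evenly across instances; kube-proxy then picks a
    random Pod on the receiving node (externalTrafficPolicy=Local)."""
    hits = {}
    for _ in range(requests):
        node = random.choice(list(pods_per_node))  # ELB's even split
        pod = random.choice(pods_per_node[node])   # iptables random mode
        hits[pod] = hits.get(pod, 0) + 1
    return hits

def ip_mode(requests=10_000):
    """IP mode registers every Pod directly, so the ELB spreads the load evenly."""
    all_pods = [p for pods in pods_per_node.values() for p in pods]
    hits = {}
    for _ in range(requests):
        pod = random.choice(all_pods)
        hits[pod] = hits.get(pod, 0) + 1
    return hits

print(instance_mode())  # pod-1 takes roughly half of the traffic
print(ip_mode())        # each Pod takes roughly a quarter
```

Under instance mode, pod-1 serves about 50% of the requests on its own, the same kind of skew as the measurement above; registering Pods directly restores the even split.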
Whether you choose Traefik or nginx-ingress, if you still follow the default load balancing pattern offered by upstream Kubernetes code, you can expect the traffic to come out imbalanced.
The major problem is that the default behavior involves Kubernetes load balancing and adds a hop to the traffic. So you may start to wonder how to resolve this; there is no specific Kubernetes setting that removes the default load balancing, but it is still possible to skip it and forward the traffic to the Pods directly.
If you are running Pods on Amazon EKS with the default AWS VPC CNI plugin1, your Pods should each have a dedicated secondary private IP address reachable within your AWS VPC network; this also means those IP addresses can be registered with your Elastic Load Balancer as backend targets. The flow becomes:
Client -> NLB (forward request to IP target) -> Pod IPs (reach the Pods directly)
Both the Application Load Balancer (ALB) and the Network Load Balancer (NLB) provide a feature to register backend targets by IP address (NLB, ALB; note: the Classic Load Balancer does not offer this option). We can simply register these Pod IP addresses as backend targets instead of instances. As long as the Pod IP addresses are reachable, requests are forwarded to the backend Pods, skipping the Kubernetes load balancing behavior.
So how do you register Pod IP addresses with the Elastic Load Balancer? A seamless way is to deploy your Kubernetes Service and use the AWS Load Balancer Controller2 to enable this feature. Instead of the default Kubernetes controller, the AWS Load Balancer Controller helps you manage the load balancer resource, including all its feature functionality, and it supports both load balancer types, NLB and ALB. After installing the AWS Load Balancer Controller on your EKS cluster, you can enable the IP registration type for your Pods simply by adding annotations to the deployment manifests.
Here is a deployment sample that uses IP targets with Pods deployed on Amazon EC2 nodes. Your Kubernetes Service must be created with type LoadBalancer:
apiVersion: v1
kind: Service
metadata:
name: my-service
annotations:
service.beta.kubernetes.io/aws-load-balancer-type: "external"
service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: "ip"
...
spec:
type: LoadBalancer
...
To deploy an Application Load Balancer on Amazon EKS through the AWS Load Balancer Controller, you generally create an Ingress object in your deployment. The AWS Load Balancer Controller also provides a supported annotation that registers Pods as targets for the ALB, so traffic reaching the ALB is routed directly to the Pods behind your Service. Here is an example:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
namespace: game-2048
name: ingress-2048
annotations:
alb.ingress.kubernetes.io/scheme: internet-facing
alb.ingress.kubernetes.io/target-type: ip
...
The AWS EKS documentation also contains detailed guides on deploying these two load balancers, and shares examples of using IP targets to register your Pods. If you are interested in learning more, please check out the following documents for more detail:
Using IP mode completely removes the load balancing layer manipulated by Kubernetes, so requests are forwarded to the Pods without a second hop:
This time I used the same testing strategy as in the first problem description, and ran four Pods behind a Network Load Balancer using IP mode, as shown below:
$ kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx-deployment-75d48f6698-b5fm7 1/1 Running 0 35m 192.168.17.15 ip-192-168-5-38.ap-northeast-1.compute.internal <none> <none>
nginx-deployment-75d48f6698-l4gw5 1/1 Running 0 2m45s 192.168.27.143 ip-192-168-5-38.ap-northeast-1.compute.internal <none> <none>
nginx-deployment-75d48f6698-q2q57 1/1 Running 0 41m 192.168.22.126 ip-192-168-5-38.ap-northeast-1.compute.internal <none> <none>
nginx-deployment-75d48f6698-x5m25 1/1 Running 0 2m45s 192.168.14.48 ip-192-168-5-38.ap-northeast-1.compute.internal <none> <none>
After sending at least 50 requests, I could see the request distribution shown below:
- 192.168.17.15: 10 times
- 192.168.27.143: 12 times
- 192.168.22.126: 14 times
- 192.168.14.48: 13 times
Each target has roughly a 25% chance of being routed to by the Network Load Balancer. Because this skips the Kubernetes load balancing layer, it follows the ELB routing algorithm3 and separates the load evenly, as we expected.
In my testing, I ran a couple of Pods with the nginx image serving a simple web server in the backend. The scenario in this article generally assumes all targets use stateless HTTP connections. However, in some cases, the ELB might still route traffic unequally to your targets if:
Generally speaking, if the client or any configuration causes sticky sessions, it is still possible to get imbalanced traffic. For details, refer to the following article in the AWS Knowledge Center:
Overall, though, using IP mode to register our Pods genuinely resolves the problem we described, which stems from the design of Kubernetes Service networking.
Although the Elastic Load Balancer offers an option to register your targets by instance, that is generally suitable when you run a single service exposed on a port of a dedicated EC2 instance. With a Kubernetes workload running on your EC2 instances but exposed as a NodePort Service, multiple Pods can sit behind the service port offered on your instance because of service load balancing: when packets flood into the instance, the destination field can be rewritten to another Pod’s private IP address by the Linux ipvs or iptables rules.
If the workload relies on a Kubernetes Deployment, it is recommended to register targets in IP mode through annotations such as service.beta.kubernetes.io/aws-load-balancer-nlb-target-type for NLB and alb.ingress.kubernetes.io/target-type for ALB.
It is also important to make sure the Elastic Load Balancer does not stick your client sessions to a specific target4 5. The Elastic Load Balancer provides cookie-based session stickiness to bind a user’s session to a specific target, which can be achieved by configuring the load balancer attributes and is also supported by the AWS Load Balancer Controller as below; but to mitigate imbalanced traffic, it is recommended to avoid sticky sessions, as they can cause exactly this phenomenon.
# ALB
alb.ingress.kubernetes.io/target-group-attributes: stickiness.enabled=true,stickiness.lb_cookie.duration_seconds=60
# NLB
service.beta.kubernetes.io/aws-load-balancer-target-group-attributes: stickiness.enabled=true,stickiness.type=source_ip
The ELB also needs to strike a balance across your Availability Zones to ensure the service’s high availability; this helps your traffic be separated correctly across all backend targets.
This article explained practices for optimizing load balancing and mitigating the imbalanced traffic problem when deploying services with Kubernetes. It also gave an overview of the other scenarios in which the ELB might route traffic unequally to your backend targets.
In the next article, we will review a couple of Kubernetes load balancer controllers that can be deployed on Amazon EKS, and see which option is the best practice for your environment.
How Elastic Load Balancing works - Routing algorithm ↩
externalTrafficPolicy
On many occasions, I have seen Kubernetes administrators who are not very familiar with the Kubernetes network flow and who struggle when they need to diagnose networking issues, especially users of managed Kubernetes cluster services. I think this gap is normal: it reflects how well Kubernetes does its encapsulation, so you cannot easily troubleshoot any real-world failure unless you deeply understand its design.
Before walking through the details of load balancing, you need to understand the fundamentals of Kubernetes load balancing and the effect of this setting when defining your YAML files.
Kubernetes provides an external traffic policy: you can set the spec.externalTrafficPolicy field in your Kubernetes Service deployment to control the flow and decide how traffic from outside is routed. Kubernetes offers two options for this policy, Cluster and Local; let’s take a deep look at how each works:
By default, kube-proxy performs this layer of load balancing using iptables. Based on the Pods you are running, it creates rules in your iptables and uses random mode (--mode random) to perform probability-based load balancing. For example, if you have 3 Pods to distribute across, kube-proxy takes responsibility for adding the required iptables rules with defined probabilities, trying to balance the load:
I am not going to drill down into too much detail, as it would increase the complexity of this article; however, if you are interested in how this translation happens, you can review the iptables rules on your host to see what is going on.
# An example of iptables rules
-A KUBE-SVC-XXXXX -m comment --comment "default/app" -m statistic --mode random --probability 0.20000000019 -j KUBE-SEP-AAAAAA
-A KUBE-SVC-XXXXX -m comment --comment "default/app" -m statistic --mode random --probability 0.25000000000 -j KUBE-SEP-BBBBBB
-A KUBE-SVC-XXXXX -m comment --comment "default/app" -m statistic --mode random --probability 0.33332999982 -j KUBE-SEP-CCCCCC
-A KUBE-SVC-XXXXX -m comment --comment "default/app" -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-DDDDDD
-A KUBE-SVC-XXXXX -m comment --comment "default/app" -j KUBE-SEP-EEEEEE
As shown in Figure 1, with `externalTrafficPolicy=Cluster`, traffic can be routed to other Nodes if the Service's Pods are deployed on them. Relying on iptables rules, this policy achieves load balancing by redirecting traffic to other Nodes; in other words, traffic may jump off the Node that originally received it.
With `externalTrafficPolicy=Local`, traffic can only be redirected within the same Node; however, load balancing through iptables still happens. If multiple Pods run on a single Node, traffic can be routed to any one of them.
Let's look at an example from the official Kubernetes documentation1:
apiVersion: v1
kind: Service
metadata:
  name: nginx-svc
  labels:
    app: nginx
spec:
  type: LoadBalancer
  ports:
    - port: 80
      protocol: TCP
  selector:
    app: nginx
If you use AWS as the cloud provider and deploy this Service, it will generally create an Elastic Load Balancer (Classic Load Balancer) to load-balance the traffic. The Elastic Load Balancer is managed by the in-tree load balancer controller2, which is implemented in the Kubernetes source code, so you can provision an Elastic Load Balancer on AWS seamlessly.
Looks familiar, right? The example above is quite common in tutorials; it may be exactly the configuration running in your production environment.
But here is the problem: by default, Kubernetes adds another layer of load balancing, backed by `kube-proxy`. Say you have two worker nodes (`Node-1` and `Node-2`), each running Pods (`Pod-1` and `Pod-2` on `Node-1`; `Pod-3` on `Node-2`), with the default option (`externalTrafficPolicy=Cluster`). On AWS, the traffic flow generally looks like this:
The default Kubernetes Service exposes your application on a specific service port for external access (`NodePort`) and sets up the relevant iptables rules to perform NAT translation by rewriting the destination IP address.
With this design, the happy case is when `kube-proxy` does not redirect the request to another host, which can be outlined as:
client -> Load Balancer -> Node-1 (NodePort) -> iptables rules -> Pod-1 on Node-1
However, what if iptables forwards the traffic to another Node?
client -> Load Balancer -> Node-1 (NodePort) -> iptables rules -> Pod-3 on Node-2
In other words, if you deploy a Kubernetes Service like this, traffic can be routed along either of these two paths.
As you can see, either way the behavior does not provide a better route, because it inevitably increases the number of hops in the traffic flow.
What about `externalTrafficPolicy: Local`? Does it work better?
Following the example in the previous section, say you have two Pods (`Pod-1` and `Pod-2`) running on the same Node (`Node-1`). The traffic flow under this policy generally breaks down as follows:
client -> Load Balancer -> Node-1 (NodePort) -> iptables rules -> Node-1 (Target Pod-1)
client -> Load Balancer -> Node-1 (NodePort) -> iptables rules -> Node-1 (Target Pod-2)
When the load balancer sends the request to the backend (`Node-1`), the iptables rules forward it to `Pod-1` or `Pod-2` with a 50% chance each.
Meanwhile, the traffic first passes through the Elastic Load Balancer and is then routed again at the system level (iptables), which means the architecture performs load balancing twice.
Clearly, this does not offer the best path for traffic routing.
If `externalTrafficPolicy=Local` and you have multiple Nodes behind your Elastic Load Balancer, you will probably see some Nodes fail the health check, which is expected:
a Node that does not run any of the Service's backend Pods cannot pass it.
In general this doesn't hurt anything, because the ELB ensures traffic is only routed to healthy targets; in this case, however, the Elastic Load Balancer does not distribute the load evenly or offer high availability when we have multiple Pods. If a Node goes down, it impacts all Pods running on it.
So, does that make `externalTrafficPolicy=Cluster` the better option?
Imagine a long-running connection that jumps off the first Node, and that Node then runs into trouble (hardware failure, intermittent connectivity problems, etc.) and eventually goes down. Any existing connections forwarded through it from other Nodes will be impacted and unable to respond back to the origin correctly. In general, a Node going down can cause packet loss for any connection whose route was established through it:
(Assuming you have an established connection passing through the Node, here is an example of the route breaking when `Node-1` in the middle goes down.)
client -> Load Balancer -> Node-1 (NodePort) -> iptables rules -> Target Pod-2 on Node-2
If you review the flow in Figure 3, connections can be routed along different paths, which becomes hard to predict once you have deployed many Pods. It also increases the complexity of tracing the network flow when diagnosing a problem.
At large scale (e.g. 100, 500, or even 10,000 Pods), this can also cause system-level issues or packet loss: network latency increases because the kernel must evaluate many iptables rules whenever a new connection comes in, and you can hit kernel limits in the networking stack, since the Linux kernel has to track connections when working with iptables and installs the rules at the system level. One common issue as the scale grows is filling up the Linux kernel's connection tracking table (conntrack).
This article explained the behavior of load balancing in Kubernetes, along with an overview of the issues that can occur if you follow the default Kubernetes example to deploy your Elastic Load Balancer.
Now that we have a deeper understanding of Kubernetes load balancing, the next article will discuss the load-imbalance problem with the current architecture on Amazon EKS.
Kubernetes service - Type LoadBalancer ↩
Kubernetes source code - aws_loadbalancer.go ↩
NEX WORK was built by a group of passionate Taiwanese engineers at the NEX Foundation, with the goal of lowering the high barriers to overseas job hunting and increasing visibility, connecting overseas Taiwanese talent around the world, building a sustainable mechanism, and fostering a positive talent cycle.
NEX Foundation was founded in 2018 as a 501(c)(3) non-profit charitable startup approved by the US federal government. Based in Seattle and Taipei, NEX builds and operates online resource platforms to support the international career development of overseas talent. It further acts as a bridge connecting Taiwanese talent active around the world, driving initiatives such as company referrals, career counseling, media experiments, and community meetups, with the aim of building a sustainable global Taiwanese talent support network.
NEX grew out of its team members' own overseas stories: stepping out of a familiar hometown into a culturally foreign place, learning to survive in a highly competitive environment, and staying strong in the face of new challenges. For overseas Taiwanese who lack resources and local connections, the road ahead is often full of difficulties and insecurity.
Through NEX, we hope to use the power of mutual support to lift up the next dreamer, creating more opportunities and helping more dreams come true for those who need it.
The foundation was initially started by HW. Chen and a group of passionate friends working in the US, who began the preparatory work for the NEX Foundation. (Meet the team experts and volunteers contributing around the world.)
In December 2018, the foundation was approved by the US Internal Revenue Service (IRS) and officially became a federal-level non-profit educational charity.
NEX's primary mission is to build a trusted platform that connects Taiwanese talent globally, systematically integrates existing resources, and drives support programs (such as company referral integration, mentorship programs, and scholarships). It hopes to reverse the negative perception of Taiwan's "brain drain" and instead act as a positive force that unites everyone's strength and resources to help more dreams and achievements come true.
NEX Work continues to invest in many non-profit projects, including regular community events and sharing sessions, opening doors to overseas careers for more Taiwanese. You can follow the links below for more information:
NEX WORK is a non-profit online job referral platform, currently still in Beta, aiming to connect talent around the world and create the power of solidarity and mutual support.
Especially given the increasingly competitive job market, visa processes unfriendly to foreigners, and other communities helping each other grow (sometimes even skirting the edges of the law), we believe helping our own is both natural and imperative.1
As a driving force supporting the career development of Taiwanese talent, team members work on the NEX WORK project in their spare time, continuously collecting user feedback and making improvements.
Through NEX's referral system, Taiwanese can help fellow Taiwanese open the first door, or sprint the last mile, of an international career. NEX WORK quickly connects the supply and demand sides of job hunting, helping you find an insider at the company; buying time means buying opportunities.
As an initial experimental platform, NEX WORK has already gathered many Taiwanese at well-known companies worldwide who voluntarily offer referral channels (including myself).
Besides overseas Taiwanese at well-known tech companies who voluntarily offer referrals (e.g. Facebook (Meta), Amazon, Apple, Google, Dropbox, Cisco), the platform also covers well-known accounting firms, cryptocurrency exchanges, and more. If you are interested in offering referrals, see the information below on opening a channel through the registration system.
As a first step, you can visit the NEX WORK platform via the following link:
The interface is quite intuitive (if you find it hard to use, please share your feedback via the Feedback form on the right). Scroll down to see which companies are listed and how many referrers each one has.
To help you quickly understand the platform's features, the following provides more details for two usage scenarios:
You can register an account via the sign-up button in the upper-right corner (or click here to register), fill in your basic information, and complete email verification.
After registering, you can choose the company you want a referral for; the following uses Amazon as an example:
Click through to see the referrers offering referral channels and view more information.
To avoid sending your referral request into a void with an inactive referrer, you can also check several key indicators in the system, such as:
Click "Refer me" to fill in the required basic information and upload your resume:
After completing the application, click "View my referral records" to check your referral information:
Note: NEX WORK does not guarantee that a referrer will refer you. Referrers still decide whether to refer you based on your experience and the materials you provide; this is not blind, indiscriminate referring.
Please provide a truthful, detailed summary of your experience along with the necessary information; the person helping you will review your materials and decide whether to spend time on the referral.
If your resume lacks sufficient experience, or you did not provide the required information in the notes, a referrer may still decline your request in order to maintain referral quality and avoid being blacklisted by HR.
If you are unsure how to get started with your resume format, here are some example templates:
Note that the resume examples above generally apply to US companies (e.g. no profile photo or personal summary needed), but conventions vary by country; there are many resources online, so adjust to your own situation.
A referral usually cannot guarantee an interview or an offer (in my own experience helping with referrals, very few candidates end up hired).
However, company referral programs usually give you a much better chance of being seen by the recruiting team than mass-applying online, and may even shorten the time you wait for an initial response.
Referrals are opportunities that referrers usually offer proactively, and they often take extra time and effort (getting to know your background privately, following up with HR on progress).
Even after all that extra time, not getting hired is still very common given the many factors a hiring company weighs, so please remember to stay polite and thank every overseas Taiwanese who helps you.
You can use the sign-up button in the upper-right corner and click "Register as a referrer" (or click here to register as a referrer) to fill in your basic information and create an account:
In the "Company name" field, select or add the company you can currently refer for to complete registration:
You can further edit your profile so that people seeking referrals can get to know you better. Once the required information is filled in, candidates seeking referrals can view it on the homepage and submit the necessary materials through the channel you provide.
Whenever there is a new referral request, you can check the pending requests under "View referral applications":
Note: if you are a referrer, the NEX WORK engineering team has received user feedback about "referral records" versus "referral applications" — one is the record of referrals you seek, the other is the record of referrals you help with — so please don't mix them up.
I hope this article helps you better understand the NEX WORK platform. If you have any suggestions about NEX WORK, feel free to reach out via the Feedback form on the right or the contact methods below, so we can make NEX WORK better together:
At NEX Foundation we also believe that "today's passerby is tomorrow's guide." In the spirit of Give and Take, if you are willing to support us or join the global volunteer team and start a positive cycle2, you can help more Taiwanese reach the world and make the road home better. Learn more via the links below:
I also contribute at NEX Foundation to connecting Taiwanese talent with career opportunities on the international stage. You can submit a referral application with your CV through NEX WORK (registration required) to bring in more outstanding people like you, or contact me via my LinkedIn.
If you found this content helpful, please hit Like or leave a comment below to let me know.
Since I have no formal academic training, corrections and suggestions from any experts willing to share their insights are more than welcome.
After reading the Firecracker paper, my conclusion about the project is essentially a design that says: "only a child chooses; I want both."
The paper explains that, in designing AWS Firecracker, the compatibility and security trade-offs between hypervisor-based virtualization and Linux containers meant that picking either one alone could not satisfy the engineering goals of AWS's infrastructure. Firecracker therefore breaks out of that dilemma: it takes on the role of a VMM (Virtual Machine Monitor) and incorporates the advantages of various existing mechanisms to meet the needs of compute virtualization.
Implementors of serverless and container services can choose between hypervisor-based virtualization (and the potentially unacceptable overhead related to it), and Linux containers (and the related compatibility vs. security tradeoffs). We built Firecracker because we didn’t want to choose.
AWS has already adopted Firecracker in two public serverless services, AWS Lambda and AWS Fargate, supporting millions of users and trillions of requests per month. More details about Firecracker are described below.
(For the original paper and my own highlights, see1.)
Since Firecracker is an extension of operating-system virtualization technology, it touches many details of Linux virtualization. Before reading on, you should first understand some basic concepts and terminology:
A hypervisor can be viewed as software, a system, or firmware that manages virtual machines. Virtualization allows us to run multiple systems, possibly with different OS kernels, on a single computer, placing each in a virtual execution environment (a virtual machine). The hypervisor's job is to manage these virtual machines; the computer running one or more virtual machines is usually called the host, and the virtual machines are called guests.
In 1974, Gerald J. Popek and Robert P. Goldberg defined two types of hypervisors2, Type 1 and Type 2:
Installing a guest OS inside a virtual machine does not mean it can directly use all of the host OS's resources (e.g. disk writes, CPU time, I/O). Typically, the hypervisor "emulates" these devices so the guest OS believes it can use them, while the virtualization layer actually translates and schedules these operations for the host OS to handle.
Essentially the same as a hypervisor: as its name (Virtual Machine Monitor) suggests, a VMM is designed to create, monitor, and manage virtual machines, and to trap the I/O operations (disk writes, network traffic, etc.) performed inside them.
QEMU is an open-source VMM. Because QEMU is implemented purely in software and sits between the guest machine and the host machine, handling the guest's hardware requests and translating them to real hardware, it suffers from some performance issues.
KVM (Kernel-based Virtual Machine) is a virtualization technology supported by the Linux kernel. It turns the Linux kernel into a usable VMM, effectively converting the system into a Type 1 (bare-metal) hypervisor, so you can run multiple isolated virtual environments (VMs) on a Linux system. KVM has long been part of the mainline Linux kernel, and because it is a kernel feature, it can usually handle I/O with near-native performance.
crosvm is a Google open-source project (Chrome OS Virtual Machine Monitor) used for virtualization in Chrome OS. Built on the Linux KVM hypervisor, it is used in Android and Chrome OS-based systems. Unlike QEMU, it does not emulate real hardware devices directly; instead, it uses the Linux paravirtualized device standard (virtio) to emulate the devices in a VM. The Firecracker paper specifically mentions that the implementation started from crosvm as its base.
cgroup is a Linux kernel feature mainly used to limit the resources used by processes running in a container environment (e.g. CPU, memory, and disk I/O). cgroups are also heavily used by Linux container technologies such as Kubernetes and Docker.
Firecracker's design explicitly does not trust the guest OS's resource usage. Because the guest OS is controlled by the customer, its behavior cannot be assumed reasonable; Firecracker therefore also uses built-in Linux mechanisms such as cgroups to cap the total resources available to the VMM and each virtual machine.
seccomp is a Linux kernel feature used to restrict which system calls (syscalls) a process running in a container can make. Think of it as a whitelist of allowed Linux functions: while the process runs, only specific syscalls are permitted.
The same mechanism is also used in some container technologies, such as Docker's default seccomp profile.
Historically, AWS's mainstream serverless services gave customers another managed way to run applications: users no longer need to manage the underlying machines or handle security patching themselves.
The most representative AWS service is AWS Lambda. If you don't know what it is, AWS Lambda is a serverless compute service: you upload your code, choose a configuration, and run it, without worrying about hardware specs or maintenance.
It also scales dynamically with usage in large-scale scenarios. However, when AWS Lambda first launched, it used Linux containers to isolate different customers' execution environments (similar to Docker). That mechanism constrained customers' runtimes (they had to use instruction sets supported by the host OS's kernel version), and because the kernel was shared, it also carried some security risk.
When we first built AWS Lambda, we chose to use Linux containers to isolate functions, and virtualization to isolate between customer accounts.
The paper therefore lists six key considerations Firecracker evaluated when designing with virtualization:
Under these conditions, section 2.1 of the paper discusses and evaluates several existing virtualization technologies, including:
So, against this background of evaluating and comparing mainstream virtualization technologies, AWS Firecracker borrowed from many solutions and struck an appropriate balance among them. Moreover, many AWS teams run their operational infrastructure on Linux, which shaped this decision in Firecracker's design philosophy. More importantly, Firecracker reuses features the Linux kernel already supports, rather than reimplementing them, precisely because these features are battle-tested, high-quality, and mature (e.g. the scheduler, the TUN/TAP network interface), and because they let AWS teams keep using familiar Linux tools and operational workflows for debugging. For example, `ps` can list the microVMs running on a machine, and other standard Linux tools (`top`, `vmstat`, even `kill`) manage Firecracker as expected.
For this reason, Firecracker uses KVM as its primary virtualization foundation and implements a VMM (Virtual Machine Monitor) component to manage the KVM execution environment.
Our other philosophy in implementing Firecracker was to rely on components built into Linux rather than re-implementing our own, where the Linux components offer the right features, performance, and design
It performs hardware-level virtualization (HVM) and resource allocation, e.g. CPU, memory management, and paging.
Firecracker's implementation started from Google's crosvm and removed a large number of unnecessary devices, such as USB, GPU, and the 9p filesystem protocol (Plan 9 Filesystem Protocol). On that base, Firecracker added roughly 20k lines of code, mostly in Rust, modified roughly 30k lines, and open-sourced the result.
Firecracker also emulates only a limited set of I/O devices: network cards, disks, serial ports, and i8042 support (the PS/2 keyboard controller). QEMU, by comparison, is far more complex, supporting more than 40 different devices, including USB, video, and audio devices.
The more detailed architecture is as follows:
Firecracker uses virtio to emulate its network and disk devices, which accounts for roughly 1,400 lines of Rust code. Firecracker also exposes a REST API, so any HTTP client (e.g. `curl`) can interact with it directly.
In summary, Firecracker aims to provide the following mechanisms3 4:
Firecracker's device model includes quota limiting: it can cap disk IOPS (I/O per second) and network PPS (packets per second). Firecracker provides an API to configure the resources available to a microVM, including CPU, disk I/O, and network throughput.
Its rate limiting builds on virtio's own support; for a network device, the configuration (`rx_rate_limiter`) can look like this:
PATCH /network-interfaces/iface_1 HTTP/1.1
Host: localhost
Content-Type: application/json
Accept: application/json
{
  "iface_id": "iface_1",
  "rx_rate_limiter": {
    "bandwidth": {
      "size": 1048576,
      "refill_time": 1000
    },
    "ops": {
      "size": 2000,
      "refill_time": 1000
    }
  }
}
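As a small Python sketch of the same request (the helper name is mine, and the socket path is an assumption — Firecracker's API is typically served over a Unix socket), this builds the PATCH body shown above:

```python
import json

def rx_rate_limiter_patch(iface_id, bandwidth_bytes, ops, refill_ms=1000):
    """Build the PATCH body that caps a network interface's receive rate.

    Both limits are token buckets: `size` tokens, refilled every
    `refill_time` milliseconds.
    """
    return {
        "iface_id": iface_id,
        "rx_rate_limiter": {
            "bandwidth": {"size": bandwidth_bytes, "refill_time": refill_ms},
            "ops": {"size": ops, "refill_time": refill_ms},
        },
    }

body = rx_rate_limiter_patch("iface_1", 1048576, 2000)
print(json.dumps(body, indent=2))
```

The body would then be sent with something like `curl --unix-socket /tmp/firecracker.socket -X PATCH http://localhost/network-interfaces/iface_1 -d @body.json` (paths illustrative).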
To harden security, Firecracker deployments must mitigate potential vulnerabilities arising from the Linux kernel or from virtualization itself, such as Intel Meltdown, Spectre, and ZombieLoad. To address this concern in production, Firecracker follows several deployment practices:
The recommended production deployment practices for Firecracker are also listed in the following document:
Additionally, to prevent unexpected behavior while the Firecracker VMM is running (for example, a security vulnerability allowing malicious code injection), Firecracker adds another layer of sandboxing for extra isolation, called the Jailer in Firecracker's design.
That said, the concrete implementation described in the paper still uses techniques from Linux containers, including:
The chroot directory configured for the jailer sandbox contains only the compiled Firecracker binary, `/dev/net/tun`, cgroup control files, and the resources the microVM needs. By default, the seccomp-bpf profile whitelists 24 syscalls and 30 ioctls.
From my own research, if I understand correctly, Firecracker seems to have added quite a few more syscalls to the seccomp filter in recent versions:
After Firecracker was created, AWS gradually adopted it in AWS Lambda's underlying architecture. With AWS Firecracker, each Lambda execution node (Lambda worker) can run thousands of microVMs.
AWS Lambda's architecture, from top to bottom, looks like this:
(1) A user triggers a Lambda function through an event via the Frontend service (API Gateway or other sources); the Worker Manager assigns an available execution machine (Lambda Worker).
(2) Once triggered, the Frontend service hands off to the Worker Manager, which follows a sticky-routing scheduling algorithm to keep invocations pinned to a particular Lambda Worker as much as possible, and directs the invoke service to forward the request payload straight to the target Lambda Worker, reducing invocation latency and round-trips.
(3) Each Lambda Worker provides an abstraction called a slot, which holds the customer's pre-loaded Lambda function code; subsequent invocations reuse this execution environment (slot) whenever possible.
The key is how Firecracker is deployed within a Lambda Worker. Each Lambda Worker can be viewed as a bare-metal machine running the Firecracker VMM, which manages many microVMs (Lambda function slots); each microVM contains the customer's execution environment (sandbox) and application code, plus a shim control process that talks to the MicroManager over a TCP/IP socket.
(The MicroManager can be seen as the component where the Lambda data plane and control plane interact.)
MicroManager provides slot management and locking APIs to placement, and an event invoke API to the Frontend
The MicroManager also keeps a small pool of pre-booted microVMs to serve placement requests immediately. Even though Firecracker can boot in under 125ms, that may still not be fast enough for AWS Lambda customers' rapid scale-out needs and could partially block user invocations, so in practice a pre-warm mechanism like this exists.
When an application running in AWS Lambda performs a write (say, an application in the guest OS wants to write a file to disk), the operation is handed to the `virtio` driver, which places it in shared memory and buffers it in a ring buffer. Firecracker is then woken up to perform the I/O and actually write the data to the physical disk.6
The paper notes that starting in 2018, AWS migrated AWS Lambda customers from an EC2 container-based platform (one container per function) to Firecracker. During the migration there were no availability outages, latency regressions, or other metric-level problems.
That said, the migration exposed some small issues for internal AWS teams. For example, Symmetric MultiThreading (SMT) was disabled for the security reasons mentioned earlier (it had been enabled in the old deployment), which exposed a thread-related bug in applications using Apache HttpClient that had been present in older AWS SDK versions; it was resolved by patching the dependency.
Once internal AWS teams finished migrating, AWS began gradually moving external customers' infrastructure to a Firecracker-based foundation, with great success.
Also, with future security patches and system updates in mind, traditional package managers such as `rpm` and `yum` introduce too many variables and can cause software consistency issues, so the AWS team adopted an immutable infrastructure strategy: launch new EC2 instances from a new AMI (Amazon Machine Image, the boot image for EC2) and replace the old instances.
The paper presents several benchmark results, and the corresponding data was also presented at NSDI 20207.
The tests below use an EC2 `m5d.metal` instance type with an Intel Xeon Platinum 8175M processor (48 cores, hyper-threading disabled), 384GB RAM, and four 840GB NVMe disks.
The host OS in these tests is Ubuntu 18.04.2 with Linux kernel 4.15.0-1044-aws.
The tests compare several major virtualization technologies: Firecracker v0.20.0, pre-configured Firecracker, Intel Cloud Hypervisor, and QEMU v4.2.0.
Boot time is defined as the time from the VMM process being forked until the guest kernel starts the first `init` process.
The data shows that both Firecracker with pre-configured I/O ports and Intel Cloud Hypervisor boot faster than QEMU. Note, however, that these results exclude network device setup; once a network device is configured, Firecracker and Cloud Hypervisor each add about 20ms to boot time, while QEMU adds 35ms.
For memory overhead (Figure 7), QEMU itself needs 128MB, Cloud Hypervisor about 13MB, while Firecracker needs only about 3MB.
Notably, for file I/O performance (Figures 8 & 9), the study used `fio` to run the tests. Even though the hardware can sustain more than 340,000 read IOPS (1GB/s at 4kB), Firecracker and Cloud Hypervisor are limited to about 13,000 IOPS (52MB/s at 4kB) of throughput.
The study also used `iperf3` for network performance tests (against a virtual tap interface with a 1500-byte MTU). While the machine can reach 44Gb/s on a single flow and 46Gb/s with 10 parallel flows, Firecracker achieves only about 15Gb/s of throughput. QEMU scores close to Cloud Hypervisor, both with better network throughput; this is partly attributed to limitations of the `virtio` device implementation.
As the performance evaluation above notes, because of the virtio-based implementation, Firecracker cannot achieve near-bare-metal I/O throughput via direct PCI device access, so network and disk I/O performance is somewhat limited.
Still, the study concludes, echoing the six key considerations mentioned earlier, that AWS Firecracker genuinely achieves its engineering design goals, including:
Beyond providing virtualization solutions for some open-source projects (e.g. Kata Containers), AWS Firecracker is already used in AWS's own foundational products, including AWS Fargate and AWS Lambda.
This infrastructure improvement also brings customers major advantages: thanks to Firecracker's design, AWS Fargate's compute pricing was cut by up to 50% (AWS Fargate Price Reduction – Up to 50%).
Out of interest in this technology, I read the entire Firecracker paper and parts of the Firecracker project, summarized the above, and spent some time putting this walkthrough together. More information:
I hope this walkthrough helps you better understand AWS Firecracker.
If you found this content helpful, please hit Like or leave a comment below to let me know.
Alexandru Agache, Marc Brooker, Andreea Florescu, Alexandra Iordache, Anthony Liguori, Rolf Neugebauer, Phil Piwonka, Diana-Maria Popa. (2020). Firecracker: Lightweight virtualization for serverless applications ↩
Gerald J. Popek, Robert P. Goldberg. (1974). Formal requirements for virtualizable third generation architectures ↩
AWS re:Invent 2019: Firecracker open-source innovation (OPN402) ↩
AWS re:Invent 2020: Deep dive into AWS Lambda security: Function isolation ↩