[...], 1 (2021) 39–59
DOI:-2021-0003
Towards autoscaling of Apache Flink jobs
Balázs VARGA
ELTE Eötvös Loránd University
Budapest, Hungary
email: ******@
Márton BALASSI    Attila KISS

Budapest, Hungary    Komárno, Slovakia
email: ******@    email: ******@

[...] open-source distributed stream processing engine that is able to process a large amount of data in real time with low [...] Currently, provisioning the appropriate amount of cloud resources must be [...] exceed the capacity of the cluster, [...] paper, we describe an architecture that enables the automatic scaling of Flink jobs on Kubernetes based on custom metrics, and describe a [...] effects of state size and target parallelism on the duration of the scaling operation, which must be considered when designing an autoscaling policy, so that the Flink job respects a Service Level Agreement.
1 Introduction
Apache Flink [5, 18, 10] is an open-source distributed data stream processing [...]

Computing Classification System 1998: [...]
Mathematics Subject Classification 2010: 68M14
Key words and phrases: Apache Flink, autoscaling, data stream processing, big data, kubernetes, distributed computing
unbounded data streams using various APIs offering different levels of abstraction. [...], which is a directed graph of operators performing computations as nodes, and the streaming of data between them as edges.
[...] production jobs make use of stateful operators that can store internal state via various state backends, such as in-[...] checkpointing and savepointing mechanism to create consistent snapshots of the application state, which can be used to recover from failure or to restart the application with an existing state [4, 3].
These streaming jobs are typically long-running, their usage may span weeks [...] application must handle the changed demands while meeting the originally set service level agreement (SLA). This changing demand may be predictable ahead of time, in case some periodicity is known, or there are events that are known to influence the workload, but in other cases, [...] Statically provisioning resources and setting the job's parallelism at launch-time is unsuited for these long-[...] (under-provisioning), the application will not keep up with the increasing workload, [...] predicted maximum load, the system will run over-provisioned most of the time, not utilizing the resources efficiently, and incurring unnecessary cloud costs.
Flink jobs' [...] however to take a savepoint, then restart the job with a different parallelism [...], it is also possible at this point to provision (or unprovision) additional resources, new instances that [...].

[...] persistent storage beforehand, which can be done asynchronously, but restoring from this savepoint after the restart can take a considerable amount of time. Meanwhile, the incoming workload is not being processed, so the restarted [...] job, take into account the delays allowed by the SLA, and decide whether the trade-off [...] automatically, reacting to the changing load dynamically, and performing the actual scaling operation are of great value, and make the operations of long-running streaming applications feasible and e[...]
Container orchestrators such as Kubernetes [9] allow us to both automate the mechanics of the scaling process, and to implement the custom algorithms [...] scaling operations using Kubernetes' Horizontal Pod Autoscaler resource [20] and Google's open-source Flink operator [21].
In this paper, [...] simple scaling policy that we have implemented, that is based on operator idleness and changes of the input records' [...], we analyze the downtime caused by the scaling operation and how it is influenced by the size [...], that should be considered when designing an autoscaling policy to best meet a given SLA while minimizing overprovisioning.
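To illustrate why the downtime of a scaling operation matters for an SLA, the following is a minimal sketch under simplifying assumptions that are not taken from the paper: the input rate is constant, no records are processed during the downtime, and the restarted job processes records at a constant rate. The function and parameter names are hypothetical.

```python
import math


def catch_up_seconds(input_rate, capacity_after, downtime_s, backlog_before=0.0):
    """Seconds the restarted job needs to drain the backlog built up during downtime.

    Assumes constant rates (records per second). Returns math.inf when the
    restarted job cannot process records faster than they arrive.
    """
    if capacity_after <= input_rate:
        return math.inf
    # Records that accumulated while the job was stopped.
    backlog = backlog_before + input_rate * downtime_s
    # The backlog shrinks at the surplus rate (capacity minus input).
    return backlog / (capacity_after - input_rate)


def recovery_seconds(input_rate, capacity_after, downtime_s):
    """Total time from stopping the job until it has caught up with the input."""
    return downtime_s + catch_up_seconds(input_rate, capacity_after, downtime_s)


if __name__ == "__main__":
    # 1000 records/s input, 1500 records/s capacity after scaling, 60 s downtime.
    print(recovery_seconds(1000.0, 1500.0, 60.0))
```

Under this simple model, a longer downtime or a smaller capacity surplus after scaling both lengthen the period during which the job lags behind its input.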
2 Related work
Cloud computing is a relatively new field, but in recent years it has attracted considerable interest among researchers.
The automatic scaling of distributed streaming applications consists of the following phases [13]: a monitoring system provides measurements about the current state of the system; these metrics are analyzed and processed; and the result is applied to a policy to make a scaling decision (plan). Finally, the decision is executed, [...] focused on the analytic and planning phase.
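The phases described above (monitor, analyze, plan, execute) can be sketched as a single control step. This is an illustrative skeleton only; the metric names, the `Decision` type, and the simple "add one task when the backlog grows" rule are assumptions made for the example, not the policy of any cited system.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    target_parallelism: int


def analyze(metrics):
    # Positive value: records arrive faster than they are processed.
    return metrics["input_rate"] - metrics["processing_rate"]


def plan(backlog_growth, parallelism, max_parallelism=8):
    # Hypothetical rule: scale up by one while the backlog is growing.
    if backlog_growth > 0:
        return Decision(min(parallelism + 1, max_parallelism))
    return Decision(parallelism)


def control_step(monitor, analyze_fn, plan_fn, execute, current_parallelism):
    """Run one monitor -> analyze -> plan -> execute iteration."""
    metrics = monitor()                            # monitoring phase
    signal = analyze_fn(metrics)                   # analysis phase
    decision = plan_fn(signal, current_parallelism)  # planning phase
    if decision.target_parallelism != current_parallelism:
        execute(decision)                          # execution phase
    return decision.target_parallelism


if __name__ == "__main__":
    executed = []
    overloaded = lambda: {"input_rate": 1200.0, "processing_rate": 1000.0}
    print(control_step(overloaded, analyze, plan, executed.append, 2))
```

Keeping the four phases as separate callables makes it possible to swap the planning policy without touching monitoring or execution.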
The authors of [13] have reviewed a large body of research regarding au[...] five categories: (1) threshold-based rules, (2) reinforcement learning, (3) queuing theory, (4) control theory, and (5) time series analysis based approaches.
The DS2 controller [11] uses a lightweight instrumentation to monitor streaming applications at the operator level, specifically the proportion of time each [...] experiments on various queries of the Nexmark benchmarking suite to show that DS2 satisfies the SASO properties [1]: stability, accuracy, short settling time, [...] most 3 steps (scalings). The resulting configuration exhibits no backpressure, [...]
PASCAL [12] is a proactive autoscaling architecture for distributed stream[...] profiling and an autoscaling [...] profiling phase, a workload model and a performance model are [...] the autoscaler to predict the input rate and estimate future performance metrics, calculate a minimum configuration, and to trigger scaling if the current configuration is di[...]tions, these models are used to estimate the CPU usage of each operator in[...] scaling model can outperform reactive approaches and is able to successfully [...], we use a different metric from the CPU load, based on how much the job lags behind the in[...], but it might be interesting to explore whether a proactive model could be built on these metrics.
[...] [8] investigate cost-optimal autoscaling of applications that run in the cloud, on an IaaS (infrastructure as a service) [...] propose an approach that uses a stochastic model predictive control (MPC) technique [...] define a cost function that incorporates both cloud usage costs, as well as the expected value of the cost or penalty associated with the deviation from certain service level objectives (SLOs). These SLOs are based on metrics that describe the overall performance of the application.
In our work, we aim to describe the characteristics of scaling Flink jobs, to [...] architecture for making and executing the scaling decisions.
3 System architecture

[...] gives an overview of the components involved in running, monitoring and scaling the applications.

Flink applications can be executed in different [...] offers per-job, session [...], such as standalone, Yarn, Mesos, Docker and Kubernetes based solutions. There are various managed or fully hosted solutions available by different vendors [...]

[...], Kubernetes only provides the underlying resources, which the Flink application [...], where the Flink client knows about and interacts with the Kubernetes API server.
We have decided to use the standalone mode combined with Kubernetes' operator pattern [6] [...] open-source operator [21] by Google defines Flink clusters as custom resources, allowing native management through the Kubernetes API and seamless integration with other resources and [...]-specific knowledge and logic in its controller.
The desired state of the cluster is specified in a declarative manner, conforming to the format defined in the custom resource definition (CRD). The user submits this specification to the Kubernetes API server, which creates [...], installed as a deployment, starts to [...].
1. [...], and its sub-resources, such as JobManager or TaskManager deployments, ingresses, etc.
2. [...] Status fields of the resource through the API.
3. [...], based on the (potentially changed) observed specification, and the observed status.
4. [...], the desired component specifications are applied through the API.
This loop runs every few seconds for every FlinkCluster resource in the Kubernetes cluster.
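The reconciliation loop described above (observe the resource and its sub-resources, update the status, compute the desired state, apply it) can be sketched as follows. This is a simplified in-memory illustration, not the operator's actual code (the real operator talks to the Kubernetes API); the function names and the `taskManagerReplicas` key are hypothetical.

```python
def compute_desired(spec, observed):
    """Derive the desired component specifications from the observed spec.

    `observed` is accepted because the desired state may also depend on the
    observed status; this simplified sketch only uses the spec.
    """
    return {
        "jobmanager": {"replicas": 1},
        "taskmanager": {"replicas": spec["taskManagerReplicas"]},
    }


def reconcile(spec, observe, update_status, apply):
    """One pass of a reconciliation loop over a single cluster resource."""
    observed = observe()                        # step 1: observe current state
    update_status(observed)                     # step 2: record it in the status
    desired = compute_desired(spec, observed)   # step 3: compute desired state
    # step 4: apply only the components that differ from what was observed
    changes = {
        name: want for name, want in desired.items()
        if observed.get(name) != want
    }
    apply(changes)
    return changes


if __name__ == "__main__":
    observed = {"jobmanager": {"replicas": 1}, "taskmanager": {"replicas": 2}}
    status, applied = {}, {}
    print(reconcile({"taskManagerReplicas": 4}, lambda: observed,
                    status.update, applied.update))
```

Because each pass compares the desired state with what was actually observed, repeating the loop every few seconds is harmless: when nothing differs, nothing is applied.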

We have modified the operator to expose the scale subresource on the FlinkCluster [...] of the scaling, which corresponds to the number of TaskManager replicas and the job parallelism, as well as a selector, which can be used to identify the [...], this endpoint can be used to set the desired number of replicas in the FlinkCluster Spec.
The scaling process starts with this step: the desired replicas are set through [...], with intermediate [...] cluster's and the job's state, [...] scale subresource's replicas specification changes, the operator first requests [...], it computes the desired deployment (step 3 of the reconciliation loop) with the [...], it resubmits the job with the appropriate parallelism, starting from the latest savepoint.
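The sequence described above (replicas change, savepoint, updated deployment, job resubmission from the latest savepoint) can be modelled as a small state machine. The phase names and transition conditions below are hypothetical labels chosen for this sketch; they are not the operator's actual state names.

```python
from enum import Enum


class Phase(Enum):
    RUNNING = "running"            # job is processing records
    SAVEPOINTING = "savepointing"  # savepoint requested, waiting for completion
    UPDATING = "updating"          # deployment is being changed to new replicas
    RESUBMITTING = "resubmitting"  # job is restarting from the latest savepoint


def next_phase(phase, replicas_changed=False, savepoint_done=False,
               deployment_ready=False, job_running=False):
    """Return the next phase of a scaling operation given the observed events."""
    if phase is Phase.RUNNING and replicas_changed:
        return Phase.SAVEPOINTING
    if phase is Phase.SAVEPOINTING and savepoint_done:
        return Phase.UPDATING
    if phase is Phase.UPDATING and deployment_ready:
        return Phase.RESUBMITTING
    if phase is Phase.RESUBMITTING and job_running:
        return Phase.RUNNING
    return phase  # no relevant event observed: stay in the current phase


if __name__ == "__main__":
    phase = Phase.RUNNING
    for event in ("replicas_changed", "savepoint_done",
                  "deployment_ready", "job_running"):
        phase = next_phase(phase, **{event: True})
        print(event, "->", phase.name)
```

Evaluating one transition per reconciliation pass keeps each pass short, and the operation can resume from the recorded phase if the controller restarts.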


[...] Prometheus [2] to scrape the job's [...] system, including access to connector metrics (such as Kafka). We have used Flink's Prometheus reporter to expose the metrics to Prometheus.
To access Prometheus metrics through the Kubernetes metrics API, we have used an adapter [16]. It finds the desired time series metrics in Prometheus, connects them to the appropriate Kubernetes resources, and performs aggregations, exposing the results as queryable endpoints in the custom metrics [...].
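The association and aggregation step that the adapter performs can be illustrated with a small sketch. It assumes, for the purpose of the example only, that each time series sample carries `namespace` and `pod` labels and that samples are aggregated per pod; the function name and data layout are hypothetical and do not reproduce the adapter's configuration or code.

```python
from collections import defaultdict


def aggregate_per_pod(samples, aggregate=sum):
    """Group (labels, value) samples by Kubernetes pod and aggregate the values.

    Each sample is a pair of a label dictionary and a numeric value; the
    `namespace` and `pod` labels identify the resource the series belongs to.
    """
    grouped = defaultdict(list)
    for labels, value in samples:
        grouped[(labels["namespace"], labels["pod"])].append(value)
    return {key: aggregate(values) for key, values in grouped.items()}


if __name__ == "__main__":
    samples = [
        ({"namespace": "default", "pod": "tm-0", "subtask": "0"}, 10.0),
        ({"namespace": "default", "pod": "tm-0", "subtask": "1"}, 20.0),
        ({"namespace": "default", "pod": "tm-1", "subtask": "2"}, 5.0),
    ]
    print(aggregate_per_pod(samples))
```

Passing a different `aggregate` function (for example `max`) changes how several series belonging to the same pod are combined into one value.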

The Horizontal Pod Autoscaler (HPA) [20] is a built-in Ku