Re: Airflow - YARN as an executor?


Is it possible for the (hypothetical) Airflow SparkExecutor to handle
general execution of any operator (i.e., run non-Spark code)?

*Taylor Edmiston*
Blog <http://blog.tedmiston.com> | Stack Overflow CV
<https://stackoverflow.com/story/taylor> | LinkedIn
<https://www.linkedin.com/in/tedmiston/> | AngelList
<https://angel.co/taylor>


On Wed, Apr 25, 2018 at 11:22 AM, Ruslan Dautkhanov <dautkhanov@xxxxxxxxx>
wrote:

> I used "Executor" as an Airflow term; I did not mean a Spark executor.
> Spark would be one of the executors listed here:
> https://github.com/apache/incubator-airflow/tree/master/airflow/executors
> or in here
> https://github.com/apache/incubator-airflow/tree/master/airflow/contrib/executors
>
> Thanks.
>
>
>
> --
> Ruslan Dautkhanov
>
> On Wed, Apr 25, 2018 at 9:17 AM, Bolke de Bruin <bdbruin@xxxxxxxxx> wrote:
>
> > I'm a bit lost on the Spark executor, to be honest. To my knowledge the
> > Spark driver creates Spark executors, which run Spark code. In other words,
> > it can't arbitrarily run generic code. Or can it?
> >
> > B.
> >
> > Sent from my iPad
> >
> > > On 25 Apr 2018, at 17:11, Ruslan Dautkhanov <dautkhanov@xxxxxxxxx> wrote:
> > >
> > > > Now I wonder whether a PySpark executor for Airflow would be an easier
> > > > target. Spark runs on YARN, Mesos and now Kubernetes, so a PySpark
> > > > executor would let Airflow run on all of these schedulers.
> > > > My understanding is that we currently have only a Spark operator and
> > > > not a Spark executor.
> > >
> > > Thanks!
> > >
> > >
> > >
> > > --
> > > Ruslan Dautkhanov
> > >
> > > >> On Tue, Apr 24, 2018 at 3:20 PM, Ace Haidrey <acehaidrey@xxxxxxxxx> wrote:
> > >>
> > > >> Hey, I didn't know this, Bolke. I was under the same impression as
> > > >> Ruslan.
> > > >> Thanks for sharing.
> > >>
> > >> Sent from my iPhone
> > >>
> > > >>> On Apr 24, 2018, at 2:12 PM, Bolke de Bruin <bdbruin@xxxxxxxxx> wrote:
> > >>>
> > > >>> It actually can nowadays:
> > > >>> https://cdn.oreillystatic.com/en/assets/1/event/269/HDFS%20on%20Kubernetes_%20Tech%20deep%20dive%20on%20locality%20and%20security%20Presentation.pptx
> > >>>
> > > >>> We also have an on-premise setup with Ceph (s3a) and HDFS for when we
> > > >>> need the speed, and Kubernetes for our workloads. We are kicking out
> > > >>> YARN (and Hive etc., for that matter).
> > >>>
> > >>> Bolke
> > >>>
> > >>>
> > >>>
> > > >>> Sent from my iPad
> > >>>
> > > >>>> On 24 Apr 2018, at 22:50, Ruslan Dautkhanov <dautkhanov@xxxxxxxxx> wrote:
> > >>>>
> > > >>>> Kubernetes is a "monolithic" one-level scheduler that can't handle
> > > >>>> what YARN can, for example scheduling tasks local to the data.
> > > >>>> Hadoop has multiple levels of data locality (node-local, rack-local),
> > > >>>> so computation happens close to the data to minimize network data
> > > >>>> transfer, which is expensive.
> > > >>>> K8s wasn't designed to handle these scheduling scenarios, as far as I
> > > >>>> know.
> > >>>>
> > > >>>> For cloud deployments where we don't have a data locality problem
> > > >>>> (because S3 is used instead of storage local to the servers), k8s
> > > >>>> might be okay.
> > >>>>
> > > >>>> Here is a nice comparison [1] of k8s vs. two-level schedulers like
> > > >>>> YARN and Mesos, although I think it's off-topic.
> > >>>>
> > > >>>> We're mostly on-prem and we don't see Kubernetes taking over from
> > > >>>> YARN any time soon.
> > >>>>
> > >>>> Thanks.
> > >>>>
> > >>>>
> > >>>>
> > >>>> [1]
> > >>>>
> > >>>> https://aaltodoc.aalto.fi/bitstream/handle/123456789/
> > >> 27061/master_Ravula_Shashi_2017.pdf?sequence=1
> > >>>>
> > >>>> *2.3.2 Monolithic Schedulers *
> > >>>>
> > >>>>
> > >>>>
> > > >>>> Monolithic schedulers use a single, centralized scheduling algorithm
> > > >>>> for all jobs. All workload is run through the same scheduler and same
> > > >>>> scheduling logic. Swarm, Fleet, Borg and Kubernetes adopt monolithic
> > > >>>> schedulers. Kubernetes improvised on basic monolithic version of Borg
> > > >>>> and Swarm schedulers. This type of schedulers are not suitable for
> > > >>>> running heterogeneous modern workloads which include Spark jobs,
> > > >>>> containers, and other long running jobs, etc.
> > >>>>
> > >>>>
> > >>>>
> > >>>> *2.3.3 Two Level Schedulers *
> > >>>>
> > >>>>
> > >>>>
> > > >>>> Two-level schedulers address the drawbacks of a monolithic scheduler
> > > >>>> by separating concerns of resource allocation and task placement. An
> > > >>>> active resource manager offers compute resources to multiple
> > > >>>> parallel, independent “scheduler frameworks”. The Mesos cluster
> > > >>>> manager pioneered this approach, and YARN supports a limited version
> > > >>>> of it. In Mesos, resources are offered to application-level
> > > >>>> schedulers. This allows for custom, workload-specific scheduling
> > > >>>> policies. The drawback with this type of scheduling architecture is
> > > >>>> that the application level frameworks cannot see all the possible
> > > >>>> placement options anymore. Instead, they only see those options that
> > > >>>> correspond to resources offered (Mesos) or allocated (YARN) by the
> > > >>>> resource manager component. This makes priority preemption (higher
> > > >>>> priority tasks kick out lower priority ones) difficult.
> > >>>>
> > >>>>
> > >>>>
> > >>>>
> > >>>>
> > >>>> --
> > >>>> Ruslan Dautkhanov
> > >>>>
> > > >>>>> On Tue, Apr 24, 2018 at 2:22 PM, Bolke de Bruin <bdbruin@xxxxxxxxx> wrote:
> > >>>>>
> > > >>>>> Happy to have it as a contrib executor. However, I personally think
> > > >>>>> YARN is a dead end. It has a lot of catching up to do and all the
> > > >>>>> momentum is with Kubernetes.
> > >>>>>
> > >>>>> B.
> > >>>>>
> > > >>>>> Sent from my iPad
> > >>>>>
> > > >>>>>> On 24 Apr 2018, at 22:13, Ruslan Dautkhanov <dautkhanov@xxxxxxxxx> wrote:
> > >>>>>>
> > > >>>>>> With Hadoop 3's Docker on YARN support, I think YARN becomes
> > > >>>>>> somewhat of a competitor to Kubernetes.
> > >>>>>>
> > >>>>>> Great job on adding k8s support to Airflow.
> > >>>>>>
> > > >>>>>> In a very similar way, I could see Airflow integrating with YARN and
> > > >>>>>> using its infrastructure as an "executor". Has anyone explored the
> > > >>>>>> feasibility of this approach?
> > >>>>>>
> > >>>>>>
> > >>>>>> Thanks!
> > >>>>>> Ruslan Dautkhanov
> > >>>>>
> > >>
> >
>
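
The thesis excerpt quoted in this thread says that in a two-level design the application-level framework only sees the resources it was offered or allocated. A minimal toy sketch of that one point follows. Everything in it (`Node`, `place`, `monolithic_schedule`, `two_level_schedule`, `offer_size`, the three-node cluster) is hypothetical and invented for illustration; it is not how YARN, Mesos or Kubernetes actually schedule, and it is not part of Airflow.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A hypothetical worker node: free task slots plus the data blocks it stores."""
    name: str
    free_slots: int
    blocks: set = field(default_factory=set)


def place(task_block, visible_nodes):
    """Pick a node with a free slot, preferring one that stores the task's block."""
    candidates = [n for n in visible_nodes if n.free_slots > 0]
    if not candidates:
        return None  # nothing visible has capacity
    local = [n for n in candidates if task_block in n.blocks]
    chosen = (local or candidates)[0]  # fall back to a non-local node
    chosen.free_slots -= 1
    return chosen.name


def monolithic_schedule(task_block, cluster):
    """One central scheduler sees every node in the cluster."""
    return place(task_block, cluster)


def two_level_schedule(task_block, cluster, offer_size):
    """A resource manager offers only some nodes; the framework places within the offer."""
    offer = [n for n in cluster if n.free_slots > 0][:offer_size]
    return place(task_block, offer)


def make_cluster():
    """Three single-slot nodes, each holding one distinct data block."""
    return [
        Node("n1", 1, {"a"}),
        Node("n2", 1, {"b"}),
        Node("n3", 1, {"c"}),
    ]


if __name__ == "__main__":
    print(monolithic_schedule("c", make_cluster()))
    print(two_level_schedule("c", make_cluster(), offer_size=2))
    print(two_level_schedule("c", make_cluster(), offer_size=3))
```

In this toy, a task that reads block "c" lands on n3 when the placer sees the whole cluster, but on n1 when the offer contains only n1 and n2, because the node holding the data is simply not among the visible options. How often that happens in a real system depends on the offer or allocation policy, which this sketch does not model.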

