
Re: Mocking airflow (similar to moto for AWS)


Hi Jarek, I think you can unit test your DAGs without BDD. You will have
to patch some connections, but it's feasible. On my side I test the
topological order of my DAGs to be sure of the task ordering.

And I patch the xcom_push and xcom_pull methods to be sure that everything
passed between tasks will be OK.
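
For example, a minimal sketch of that xcom patching (the dag and task ids
here are hypothetical, and it assumes the operator pulls and pushes
explicitly via context["ti"]):

    from unittest import mock
    from airflow.models import DagBag

    def test_xcom_between_tasks():
        task = DagBag(dag_folder="dags/", include_examples=False) \
            .get_dag("my_dag").get_task("transform")

        # Fake TaskInstance: xcom_pull returns canned upstream data,
        # xcom_push just records what the task tried to publish.
        ti = mock.MagicMock()
        ti.xcom_pull.return_value = [{"id": 1}, {"id": 2}]

        task.execute(context={"ti": ti})

        # The contract between tasks: the result must be pushed onward.
        ti.xcom_push.assert_called_once()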

If your hooks are well tested, I think it's OK.

For instance I have this kind of code to test the topological order:
https://gist.github.com/Bl3f/acd3d4b251eb565c96168635d84d0513.
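
The shape of that test, roughly (a sketch, not the gist itself; the dag
and task ids are hypothetical):

    from airflow.models import DagBag

    def test_topological_order():
        dag = DagBag(dag_folder="dags/", include_examples=False) \
            .get_dag("my_dag")
        order = [t.task_id for t in dag.topological_sort()]
        # Downstream tasks must come after their upstream tasks.
        assert order.index("extract") < order.index("transform")
        assert order.index("transform") < order.index("load")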

Regards,
Christophe

On Fri, Oct 19, 2018 at 10:23 AM Jarek Potiuk <Jarek.Potiuk@xxxxxxxxxxx>
wrote:

> Thanks! I like the suggestion about testing hooks rather than whole DAGs -
> we will certainly use it in the future. And BDD is the approach I really
> like - thanks for the code examples! We might also use it in the near
> future. Super helpful!
>
> So far we mocked hooks in our unit tests only (for example here:
> https://github.com/PolideaInternal/incubator-airflow/blob/master/tests/contrib/operators/test_gcp_compute_operator.py#L241
> ) - that helps to test the logic of more complex operators.
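>
> The general shape of those tests (a sketch in the spirit of the linked
> file; exact class names and call signatures may differ):
>
>     from unittest import mock
>
>     from airflow.contrib.operators.gcp_compute_operator import (
>         GceInstanceStartOperator,
>     )
>
>     # Patch the hook where the operator imports it, so execute() never
>     # talks to real GCP.
>     @mock.patch("airflow.contrib.operators.gcp_compute_operator.GceHook")
>     def test_instance_start(mock_hook):
>         op = GceInstanceStartOperator(
>             project_id="my-project",
>             zone="europe-west1-b",
>             resource_id="my-instance",
>             task_id="gcp_compute_start",
>         )
>         op.execute(None)
>         mock_hook.return_value.start_instance.assert_called_once()
>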
> @Anthony - we also use a modified docker-based environment to run the tests
> (https://github.com/PolideaInternal/airflow-breeze/tree/integration-tests)
> including running full DAGs. And yeah, a missing import was just an
> exaggerated example :) we also use IDE/lints to catch those early :D.
>
> I still think there is a need to run whole DAGs on top of testing operators
> and hooks separately. This is to test a bit more complex interactions between
> the operators. In our case we use example dags for both documentation and
> running full e2e integration tests (for example here:
> https://github.com/PolideaInternal/incubator-airflow/blob/master/airflow/contrib/example_dags/example_gcp_compute.py
> ).
> Those are simple examples, but we will have a bit more complex interactions
> and it would be great to be able to run them more quickly. However, if we get the
> hook tests automated/unit-testable as well, maybe our current approach
> where we run them in the full dockerized environment will be good enough.
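>
> A sketch of how one of those example dags could be driven end to end from
> Python (backfill-style; it assumes an initialised metadata DB, and the
> exact dag.run() keyword arguments are from memory, so double-check them):
>
>     from datetime import datetime
>     from airflow.models import DagBag
>
>     dag = DagBag().get_dag("example_gcp_compute")
>     day = datetime(2018, 10, 1)
>     # Wipe any previous state, then run all tasks for that one day.
>     dag.clear(start_date=day, end_date=day)
>     dag.run(start_date=day, end_date=day, local=True)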
>
> J.
>
>
> On Thu, Oct 18, 2018 at 5:44 PM Anthony Brown
> <anthony.brown@xxxxxxxxxxxxxxx> wrote:
>
> > I have pylint set up in my IDE, which catches most silly errors like
> > missing imports.
> > I also use a docker image so I can start up airflow locally and manually
> > test any changes before trying to deploy them. I use a slightly modified
> > version of https://github.com/puckel/docker-airflow to control it. This
> > only works on connections I have access to from my machine.
> > Finally, I have a suite of tests based on
> > https://blog.usejournal.com/testing-in-airflow-part-1-dag-validation-tests-dag-definition-tests-and-unit-tests-2aa94970570c
> > which I can run to check that DAGs are valid, plus any unit tests I can
> > put in. The tests are run in a docker container which runs a local db
> > instance, so I have access to xcoms etc.
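> >
> > The core DAG validation test from that post boils down to a few lines
> > (a sketch):
> >
> >     from airflow.models import DagBag
> >
> >     def test_no_import_errors():
> >         dagbag = DagBag(include_examples=False)
> >         # Any DAG file that fails to import is recorded here.
> >         assert dagbag.import_errors == {}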
> >
> > As part of my deployment pipeline, I run pylint and the tests again
> > before deploying anywhere, to make sure nobody has forgotten to run them
> > locally.
> >
> > Gerard - I like the suggestion about using mocked hooks and BDD. I will
> > look into this further
> >
> > On Thu, 18 Oct 2018 at 15:12, Gerard Toonstra <gtoonstra@xxxxxxxxx>
> > wrote:
> >
> > > There was a discussion about a unit testing approach last year, 2017 I
> > > believe. If you dig through the mail archives, you can find it.
> > >
> > > My take is:
> > >
> > > - You should test "hooks" against some real system, which can be a
> > > docker container. Make sure the behavior is predictable when talking
> > > against that system. Hook tests are not part of general CI tests
> > > because of the complexity of the CI setup you'd have to make, so they
> > > are run on local boxes.
> > > - Maybe add additional "mock" hook tests, mocking out the connected
> > > systems.
> > > - When hooks are tested, operators can use 'mocked' hooks that no
> > > longer need access to actual systems. You can then set up an
> > > environment where you have predictable inputs and outputs and test how
> > > the operators act on them. I've used "behave" to do that with very
> > > simple record sets, but you can make these as complex as you want.
> > > - Then you know your hooks and operators work functionally. Testing
> > > whether your workflow works in general can be implemented by adding
> > > "check" operators (see the sketch after this list). The benefit here is
> > > that you don't test the workflow once, but you test for data
> > > consistency every time the dag runs. If you have complex workflows
> > > where the correct behavior of the flow is worrisome, then you may need
> > > to go deeper into it.
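> > >
> > > An illustration of the "check" operator idea (a sketch; the table and
> > > connection names are hypothetical):
> > >
> > >     from datetime import datetime
> > >
> > >     from airflow import DAG
> > >     from airflow.operators.check_operator import CheckOperator
> > >
> > >     dag = DAG("my_workflow", start_date=datetime(2018, 10, 1),
> > >               schedule_interval="@daily")
> > >
> > >     # Fails the run whenever the loaded table is empty, so data
> > >     # consistency is verified on every dag run, not just once.
> > >     check_rows = CheckOperator(
> > >         task_id="check_rows_loaded",
> > >         sql="SELECT COUNT(*) FROM my_table",
> > >         conn_id="my_db",
> > >         dag=dag,
> > >     )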
> > >
> > > The above doesn't depend on DAGs that need to be scheduled and the
> > > delays involved in that.
> > >
> > > All of the above is implemented in my repo
> > > https://github.com/gtoonstra/airflow-hovercraft, using "behave" as a
> > > BDD method of testing, so you can peruse that.
> > >
> > > Rgds,
> > >
> > > G>
> > >
> > >
> > > On Thu, Oct 18, 2018 at 2:43 PM Jarek Potiuk
> > > <Jarek.Potiuk@xxxxxxxxxxx> wrote:
> > >
> > > > I am also looking to have (I think) a similar workflow. Maybe
> > > > someone has done something similar and can give some hints on how to
> > > > do it the easiest way?
> > > >
> > > > Context:
> > > >
> > > > While developing operators I am using example test DAGs that talk to
> > > > GCP. So far our "integration tests" require copying the dag folder,
> > > > restarting the airflow servers, unpausing the dag and waiting for it
> > > > to start. That takes a lot of time, sometimes just to find out that
> > > > you missed one import.
> > > >
> > > > Ideal workflow:
> > > >
> > > > Ideally I'd love to have a "unit" test (i.e. possible to run via
> > > > nosetests or IDE integration/PyCharm) that:
> > > >
> > > >    - does not need the airflow scheduler/webserver started. I guess
> > > >    we need a DB, but possibly an in-memory, on-demand created
> > > >    database would be a good solution
> > > >    - loads the DAG from a specified file (not from the /dags
> > > >    directory)
> > > >    - builds the internal dependencies between the DAG tasks (as
> > > >    specified in the Dag)
> > > >    - runs the DAG immediately and fully (i.e. runs all the "execute"
> > > >    methods as needed and passes XCOM between tasks)
> > > >    - ideally produces log output on the console rather than in
> > > >    per-task files.
> > > >
> > > > I thought about using DagRun/DagBag but have not tried it yet, and I
> > > > am not sure if you need to have the whole environment set up (which
> > > > parts?). Any help appreciated :)
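> > > >
> > > > Roughly what I have in mind (a sketch only; it assumes AIRFLOW_HOME
> > > > points at a scratch directory whose sqlite metadata DB was created
> > > > with `airflow initdb`):
> > > >
> > > >     from datetime import datetime
> > > >     from airflow.models import DagBag, TaskInstance
> > > >
> > > >     def run_dag_from_file(path, dag_id):
> > > >         dag = DagBag(dag_folder=path, include_examples=False) \
> > > >             .get_dag(dag_id)
> > > >         execution_date = datetime(2018, 10, 1)
> > > >         # Run each task in dependency order, the way `airflow test`
> > > >         # runs a single task: no scheduler, logs on the console.
> > > >         for task in dag.topological_sort():
> > > >             ti = TaskInstance(task, execution_date)
> > > >             ti.run(ignore_task_deps=True, ignore_ti_state=True,
> > > >                    test_mode=True)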
> > > >
> > > > J.
> > > >
> > > > On Thu, Oct 18, 2018 at 1:08 AM bielllobera@xxxxxxxxx
> > > > <bielllobera@xxxxxxxxx> wrote:
> > > >
> > > > > I think it would be great to have a way to mock airflow for unit
> > > > > tests. The way I approached this was to create a context manager
> > > > > that creates a temporary directory, sets the AIRFLOW_HOME
> > > > > environment variable to this directory (only within the scope of
> > > > > the context manager) and then renders an airflow.cfg to that
> > > > > location. This creates an SQLite database just for the test, so you
> > > > > can add the variables and connections needed for the test without
> > > > > affecting the real Airflow installation.
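> > > > >
> > > > > In outline, the context manager looks something like this (a
> > > > > simplified sketch of the approach, not the exact code from the
> > > > > repo linked below):
> > > > >
> > > > >     import os
> > > > >     import tempfile
> > > > >     from contextlib import contextmanager
> > > > >     from importlib import reload
> > > > >
> > > > >     @contextmanager
> > > > >     def mock_airflow():
> > > > >         old_home = os.environ.get("AIRFLOW_HOME")
> > > > >         with tempfile.TemporaryDirectory() as tmp:
> > > > >             os.environ["AIRFLOW_HOME"] = tmp
> > > > >             # Redo the module-level initialization so airflow.cfg
> > > > >             # and the sqlite DB land in the temporary directory.
> > > > >             import airflow.configuration
> > > > >             import airflow.settings
> > > > >             reload(airflow.configuration)
> > > > >             reload(airflow.settings)
> > > > >             from airflow.utils.db import initdb
> > > > >             initdb()
> > > > >             try:
> > > > >                 yield tmp
> > > > >             finally:
> > > > >                 if old_home is not None:
> > > > >                     os.environ["AIRFLOW_HOME"] = old_home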
> > > > >
> > > > > The first thing I realized is that this didn't work if the imports
> > > > > were outside the context manager, since airflow.configuration and
> > > > > airflow.settings perform all the initialization when they are
> > > > > imported, so the AIRFLOW_HOME variable is already set to the real
> > > > > installation before getting inside the context manager.
> > > > >
> > > > > The workaround for this was to reload those modules, and this works
> > > > > for the tests I have written. However, when I tried to use it for
> > > > > something more complex (I have a plugin that I'm importing) I
> > > > > noticed that inside the operator in this plugin, AIRFLOW_HOME is
> > > > > still set to the real installation, not the temporary one for the
> > > > > test. I thought this must be related to the imports but I haven't
> > > > > been able to figure out a way to fix the issue. I tried patching
> > > > > some methods but I must have been missing something because the
> > > > > database initialization failed.
> > > > >
> > > > > Does anyone have an idea of the best way to mock/patch airflow so
> > > > > that EVERYTHING that is executed inside the context manager uses
> > > > > the temporary installation?
> > > > >
> > > > > PS: This is my current attempt, which works for the tests I defined
> > > > > but not for external plugins:
> > > > > https://github.com/biellls/airflow_testing
> > > > >
> > > > > For an example of how it works:
> > > > > https://github.com/biellls/airflow_testing/blob/master/tests/mock_airflow_test.py
> > > > >
> > > >
> > > >
> > > > --
> > > >
> > > > *Jarek Potiuk, Principal Software Engineer*
> > > > Mobile: +48 660 796 129
> > > >
> > >
> >
> >
> > --
> >
> > Anthony Brown
> > Data Engineer BI Team - John Lewis
> > Tel: 0787 215 7305
> >
>
>
> --
>
> *Jarek Potiuk, Principal Software Engineer*
> Mobile: +48 660 796 129
>