git.net

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Pinning dependencies for Apache Airflow


Hello Erik,

I understand your concern. It's a hard one to solve in general (i.e.
dependency-hell). It looks like in this case you treat Airflow as
'library', where for some other people it might be more like 'end product'.
If you look at the "pinning" philosophy - the "pin everything" is good for
end products, but not good for libraries. In the case you have Airflow is
treated as a bit of both. And it's perfectly valid case at that (with
custom python DAGs being central concept for Airflow).
However, I think it's not as bad as you think when it comes to exact
pinning.

I believe - a bit counter-intuitively - that tools like pip-tools/poetry
with exact pinning result in having your dependencies upgraded more often,
rather than less - especially in complex systems where dependency-hell
creeps-in. If you look at Airflow's setup.py now - It's a bit scary to make
any change to it. There is a chance it will blow at your face if you change
it. You never know why there is 0.3 < ver < 1.0 - and if you change it,
whether it will cause chain reaction of conflicts that will ruin your work
day.

On the contrary - if you change it to exact pinning in
.lock/requirements.txt file (poetry/pip-tools) and have much simpler (and
commented) exclusion/avoidance rules in your .in/.tml file, the whole setup
might be much easier to maintain and upgrade. Every time you prepare for
release (or even once in a while for master) one person might consciously
attempt to upgrade all dependencies to latest ones. It should be almost as
easy as letting poetry/pip-tools help with figuring out what are the latest
set of dependencies that will work without conflicts. It should be rather
straightforward (I've done it in the past for fairly complex systems). What
those tools enable is - doing single-shot upgrade of all dependencies.
After doing it you can make sure that all tests work fine (and fix any
problems that result from it). And then you test it thoroughly before you
make final release. You can do it in separate PR - with automated testing
in Travis which means that you are not disturbing work of others
(compilation/building + unit tests are guaranteed to work before you merge
it) while doing it. It's all conscious rather than accidental. Nice side
effect of that is that with every release you can actually "catch-up" with
latest stable versions of many libraries in one go. It's better than
waiting until someone deliberately upgrades to newer version (and the rest
remain terribly out-dated as is the case for Airflow now).

So a bit counterintuitively I think tools like pip-tools/poetry help you to
catch up faster in many cases. That is at least my experience so far.

Additionally, Airflow is an open system - if you have very specific needs
for requirements, you might actually - in the very same way with
pip-tools/poetry - upgrade all your dependencies in your local fork of
Airflow before someone else does it in master/release. Those tools kind of
democratise dependency management. It should be as easy as `pip-compile
--upgrade` or `poetry update` and you will get all the "non-conflicting"
latest dependencies in your local fork (and poetry especially seems to do
all the heavy lifting of figuring out which versions will work). You should
be able to test and publish it locally as your private package for local
installations. You can even mark the specific dependency you want to use
specific version and let pip-tools/poetry figure out exact versions of
other requirements. You can even make a PR with such upgrade eventually to
get it faster in master. You can even downgrade in case newer dependency
causes problems for you in similar way. Guided by the tools, it's much
faster than figuring the versions out by yourself.

As long as we have simple way of managing it and document how to
upgrade/downgrade dependencies in your own fork, and mention how to locally
release Airflow as a package, I think your case could be covered even
better than now. What do you think ?

J.

On Fri, Oct 5, 2018 at 2:34 PM EKC (Erik Cederstrand)
<EKC@xxxxxxxxxxxxx.invalid> wrote:

> For us, exact pinning of versions would be problematic. We have DAG code
> that shares direct and indirect dependencies with Airflow, e.g. lxml,
> requests, pyhive, future, thrift, tzlocal, psycopg2 and ldap3. If our DAG
> code for some reason needs a newer point release due to a bug that's fixed,
> then we can't cleanly build a virtual environment containing the fixed
> version. For us, it's already a problem that Airflow has quite strict (and
> sometimes old) requirements in setup.py.
>
> Erik
> ________________________________
> From: Jarek Potiuk <Jarek.Potiuk@xxxxxxxxxxx>
> Sent: Friday, October 5, 2018 2:01:15 PM
> To: dev@xxxxxxxxxxxxxxxxxxxxxxxxxxxx
> Subject: Re: Pinning dependencies for Apache Airflow
>
> I think one solution to release approach is to check as part of automated
> Travis build if all requirements are pinned with == (even the deep ones)
> and fail the build in case they are not for ALL versions (including
> dev). And of course we should document the approach of releases/upgrades
> etc. If we do it all the time for development versions (which seems quite
> doable), then transitively all the releases will also have pinned versions
> and they will never try to upgrade any of the dependencies. In poetry
> (similarly in pip-tools with .in file) it is done by having a .lock file
> that specifies exact versions of each package so it can be rather easy to
> manage (so it's worth trying it out I think  :D  - seems a bit more
> friendly than pip-tools).
>
> There is a drawback - of course - with manually updating the module that
> you want, but I really see that as an advantage rather than drawback
> especially for users. This way you maintain the property that it will
> always install and work the same way no matter if you installed it today or
> two months ago. I think the biggest drawback for maintainers is that you
> need some kind of monitoring of security vulnerabilities and cannot rely on
> automated security upgrades. With >= requirements those security updates
> might happen automatically without anyone noticing, but to be honest I
> don't think such upgrades are guaranteed even in current setup for all
> security issues for all libraries anyway.
>
> Finding the need to upgrade because of security issues can be quite
> automated. Even now I noticed Github started to inform owners about
> potential security vulnerabilities in used libraries for their project.
> Those notifications can be sent to devlist and turned into JIRA issues
> followed bvy  minor security-related releases (with only few library
> dependencies upgraded).
>
> I think it's even easier to automate it if you have pinned dependencies -
> because it's generally easy to find applicable vulnerabilities for specific
> versions of libraries by static analysers - when you have >=, you never
> know which version will be used until you actually perform the
> installation.
>
> There is one big advantage for maintainers for "pinned" case. Your users
> always have the same dependencies - so when issue is raised, you can
> reproduce it more easily. It's hard to know which version user has (as the
> user could install it month ago or yesterday) and even if you find out by
> asking the user, you might not be able to reproduce the set of requirements
> easily (simply because there are already newer versions of the libraries
> released and they are used automatically). You can ask the user to run pip
> --upgrade but that's dangerous and pretty lame ("check the latest version -
> maybe it fixes your problem ? ") and sometimes not possible (e.g. someone
> has pre-built docker image with dependencies from few months ago and cannot
> rebuild the image easily).
>
> J.
>
> On Fri, Oct 5, 2018 at 12:35 PM Ash Berlin-Taylor <ash@xxxxxxxxxx> wrote:
>
> > One thing to point out here.
> >
> > Right now if you `pip install apache-airflow=1.10.0` in a clean
> > environment it will fail.
> >
> > This is because we pin flask-login to 0.2.1 but flask-appbuilder is >=
> > 1.11.1, so that pulls in 1.12.0 which requires flask-login >= 0.3.
> >
> > So I do think there is maybe something to be said about pinning for
> > releases. The down side to that is that if there are updates to a module
> > that we want then we have to make a point release to let people get it
> >
> > Both methods have draw-backs
> >
> > -ash
> >
> > > On 4 Oct 2018, at 17:13, Arthur Wiedmer <arthur.wiedmer@xxxxxxxxx>
> > wrote:
> > >
> > > Hi Jarek,
> > >
> > > I will +1 the discussion Dan is referring to and George's advice.
> > >
> > > I just want to double check we are talking about pinning in
> > > requirements.txt only.
> > >
> > > This offers the ability to
> > > pip install -r requirements.txt
> > > pip install --no-deps airflow
> > > For a guaranteed install which works.
> > >
> > > Several different requirement files can be provided for specific use
> > cases,
> > > like a stable dev one for instance for people wanting to work on
> > operators
> > > and non-core functions.
> > >
> > > However, I think we should proactively test in CI against unpinned
> > > dependencies (though it might be a separate case in the matrix) , so
> that
> > > we get advance warning if possible that things will break.
> > > CI downtime is not a bad thing here, it actually caught a problem :)
> > >
> > > We should unpin as possible in setup.py to only maintain minimum
> required
> > > compatibility. The process of pinning in setup.py is extremely
> > detrimental
> > > when you have a large number of python libraries installed with
> different
> > > pinned versions.
> > >
> > > Best,
> > > Arthur
> > >
> > > On Thu, Oct 4, 2018 at 8:36 AM Dan Davydov
> <ddavydov@xxxxxxxxxxx.invalid
> > >
> > > wrote:
> > >
> > >> Relevant discussion about this:
> > >>
> > >>
> >
> https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fincubator-airflow%2Fpull%2F1809%23issuecomment-257502174&amp;data=01%7C01%7CEKC%40novozymes.com%7Cd31403917b084e3615c208d62aba4c24%7C43d5f49ee03a4d22a2285684196bb001%7C0&amp;sdata=MM%2FoNwkPYR8UtBUczXLfZD2lCp7Ig%2BI%2FL2rFszcoJi8%3D&amp;reserved=0
> > >>
> > >> On Thu, Oct 4, 2018 at 11:25 AM Jarek Potiuk <
> Jarek.Potiuk@xxxxxxxxxxx>
> > >> wrote:
> > >>
> > >>> TL;DR; A change is coming in the way how dependencies/requirements
> are
> > >>> specified for Apache Airflow - they will be fixed rather than
> flexible
> > >> (==
> > >>> rather than >=).
> > >>>
> > >>> This is follow up after Slack discussion we had with Ash and Kaxil -
> > >>> summarising what we propose we'll do.
> > >>>
> > >>> *Problem:*
> > >>> During last few weeks we experienced quite a few downtimes of
> TravisCI
> > >>> builds (for all PRs/branches including master) as some of the
> > transitive
> > >>> dependencies were automatically upgraded. This because in a number of
> > >>> dependencies we have  >= rather than == dependencies.
> > >>>
> > >>> Whenever there is a new release of such dependency, it might cause
> > chain
> > >>> reaction with upgrade of transitive dependencies which might get into
> > >>> conflict.
> > >>>
> > >>> An example was Flask-AppBuilder vs flask-login transitive dependency
> > with
> > >>> click. They started to conflict once AppBuilder has released version
> > >>> 1.12.0.
> > >>>
> > >>> *Diagnosis:*
> > >>> Transitive dependencies with "flexible" versions (where >= is used
> > >> instead
> > >>> of ==) is a reason for "dependency hell". We will sooner or later hit
> > >> other
> > >>> cases where not fixed dependencies cause similar problems with other
> > >>> transitive dependencies. We need to fix-pin them. This causes
> problems
> > >> for
> > >>> both - released versions (cause they stop to work!) and for
> development
> > >>> (cause they break master builds in TravisCI and prevent people from
> > >>> installing development environment from the scratch.
> > >>>
> > >>> *Solution:*
> > >>>
> > >>>   - Following the old-but-good post
> > >>>
> https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fnvie.com%2Fposts%2Fpin-your-packages%2F&amp;data=01%7C01%7CEKC%40novozymes.com%7Cd31403917b084e3615c208d62aba4c24%7C43d5f49ee03a4d22a2285684196bb001%7C0&amp;sdata=PVE3S4mgki7L%2BcAe104o2cf68wRXolvYXRFmAyiX8gA%3D&amp;reserved=0
> we are going to fix the
> > >>> pinned
> > >>>   dependencies to specific versions (so basically all dependencies
> are
> > >>>   "fixed").
> > >>>   - We will introduce mechanism to be able to upgrade dependencies
> with
> > >>>   pip-tools (
> https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fjazzband%2Fpip-tools&amp;data=01%7C01%7CEKC%40novozymes.com%7Cd31403917b084e3615c208d62aba4c24%7C43d5f49ee03a4d22a2285684196bb001%7C0&amp;sdata=Kt9CjWrolvpjp7MwIR2nn8EIf9CW9HW02U7GVGyOXMo%3D&amp;reserved=0).
> We might also
> > >> take a
> > >>>   look at pipenv:
> https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fpipenv.readthedocs.io%2Fen%2Flatest%2F&amp;data=01%7C01%7CEKC%40novozymes.com%7Cd31403917b084e3615c208d62aba4c24%7C43d5f49ee03a4d22a2285684196bb001%7C0&amp;sdata=1tiY6pgX3IbRYC5W0HKr0ER2qMZ3GKYrwmWg%2BUo0tqs%3D&amp;reserved=0
> > >>>   - People who would like to upgrade some dependencies for their PRs
> > >> will
> > >>>   still be able to do it - but such upgrades will be in their PR thus
> > >> they
> > >>>   will go through TravisCI tests and they will also have to be
> > specified
> > >>> with
> > >>>   pinned fixed versions (==). This should be part of review process
> to
> > >>> make
> > >>>   sure new/changed requirements are pinned.
> > >>>   - In release process there will be a point where an upgrade will be
> > >>>   attempted for all requirements (using pip-tools) so that we are not
> > >>> stuck
> > >>>   with older releases. This will be in controlled PR environment
> where
> > >>> there
> > >>>   will be time to fix all dependencies without impacting others and
> > >> likely
> > >>>   enough time to "vet" such changes (this can be done for alpha/beta
> > >>> releases
> > >>>   for example).
> > >>>   - As a side effect dependencies specification will become far
> simpler
> > >>>   and straightforward.
> > >>>
> > >>> Happy to hear community comments to the proposal. I am happy to take
> a
> > >> lead
> > >>> on that, open JIRA issue and implement if this is something community
> > is
> > >>> happy with.
> > >>>
> > >>> J.
> > >>>
> > >>> --
> > >>>
> > >>> *Jarek Potiuk, Principal Software Engineer*
> > >>> Mobile: +48 660 796 129
> > >>>
> > >>
> >
> >
>
> --
>
> *Jarek Potiuk, Principal Software Engineer*
> Mobile: +48 660 796 129
>


-- 

*Jarek Potiuk, Principal Software Engineer*
Mobile: +48 660 796 129