[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: conn_id breaking change; once more with feeling

The main reason for the conn_id prefix is to facilitate the use of
`default_args`. Because of this you can set all your connections at the top
of your script and from that point on you just instantiate tasks without
re-stating connections. It's common for people to define multiple
"operating contexts" (production/staging/dev) where default_args and
connections are defined conditionally based on some env vars and having
different names for different conn_ids is key in that case.

Also as you mentioned many operators (all data transfers) take 2 conn_ids
which would require prefixing anyways.

And yes, minor releases should never invalidate DAGs except for rare
exceptions (say a new feature that is still pre-release, or never worked
properly in previous release for some reason). Breaking changes should come
with an message aligned with the release. Pretty names doesn't
justify breaking stuff and forcing people to grep and mass replace things.
If someone wants a prettier name for an argument or anything else that's in
the "obviously-public API" (DAG objects, operators, setting deps, ...) they
should need to make the change backward compatible and start
logging.warning() about next major release deprecation.

I also think 2.0 should be as mild as possible on backward incompatible
changes or come with a compatibility layer (from airflow import LegacyDAG
as DAG). No one wants to go and mass update tons of scripts.

It should be the case too for the less-obviously public API (hooks, methods
not prefixed with `_` or `__`), though I think it may be reasonable in some
cases (say a method that really should have been defined as private) but
avoided as much as possible.

*Committers*, let's be vigilant about this. Breaking backward compatibility
is a big deal. An important part of code review is identifying backward
incompatible changes.


On Tue, May 29, 2018 at 6:19 PM Daniel (Daniel Lamblin) [BDP - Seoul] <
lamblin@xxxxxxxxxxx> wrote:

> The short of this email is: can we please name all the connection id named
> parameters to all hooks and operators as similarly as possible. EG just
> `conn_id`?
> So, when we started using Airflow I _had_ thought that minor versions
> would be compatible for a user's DAG, assuming no use of anything marked
> beta or deprecated, such that v1.7.1, 1.8.0, 1.8.1, 1.8.2 and 1.9.0 etc
> would all run dags from prior versions in that linage, each with possible
> stability and feature improvements and each with possibly more operators,
> hooks, executors (even) etc.
> This is (now) obviously not the case, and it's the group's choice about
> what should and what should not break on a release-by-release basis. I
> think a clear policy would be appropriate for full Apache status, but then
> I may have missed where the policy is defined.  Though, if defined as not
> giving stability to the user's dags for most version changes isn't in my
> opinion going to grow confidence for Airflow being something you can grow
> with.
> Not to be overly dramatic, but currently the tiny change that the
> `s3Hook(…)` takes `aws_conn_id='any_string'` vs
> `s3_conn_id='still_any_string'` means that basically I have to maintain a
> 1.8.2 setup in perpetuity, because no one (here) wants to briefly code
> freeze before during and after an update so that we can update all the uses
> and be ready to roll back the update if something else breaks (also some
> people access the not-actually-private `_a_key and _s_key` and would need
> to switch to still-not-private `_get_credentials()[…]`). Instead we'll just
> run a second airflow setup at 1.9.0 (in each and every staged environment)
> and move the 8k dags piecemeal whe[never] people get the spare time and
> inclination. I mean, we might. It looks like 1.10.0 changes some config
> around s3 logging one more time… so maybe it's better to skip it?
> I saw the couple of PRs where the project itself had to make the changes
> to usages of the named field. There was momentary and passing concern that
> users' dags would need to do the same. In the PRs, of the options discussed
> (shims, supporting the deprecated name as deprecated, doing a hard rename),
> it wasn't brought up if the rename to aws_conn_id was the best name. [Was
> this discussed elsewhere?]
> And so I wondered why is there this virtual Hungarian notation on all the
> `conn_id`s?
> A hook generally takes one `conn_id`, most operators take only one. In
> these cases couldn't the named parameter have been `conn_id`? When an
> operator needs a couple conn_ids, could it not have `src_conn_id` and
> `dest_conn_id` instead of locking in either s3 or aws, mysql or maria,
> dataflow or beam etc? Those are hypotheticals, I believe.
> Could I propose to plan to break absolutely all uses of
> `{named}_conn_id`s, before or by version 2.0, with an eye towards never
> again having to break a named parameter for the rest of 2.x's life? There's
> probably more named parameters that should be fixed per major release, and
> if you agree, some work should be done to identify them all, but this just
> happens to be the one that seems most likely to change often, and has so
> recently.
> Thanks,
> -Daniel