git.net

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [VOTE] Airflow 1.10.0rc3


Yeah I figured it out. Originally i was using a different implementation of UTCDateTime, but that was unmaintained. I switched, but this version changed or has a different contract. While it transforms on storing to UTC it does not so when it receives timezone aware fields from the db. Hence the issue.

I will prepare a PR that removes the dependency and implements our own extension of DateTime. Probably tomorrow.

Good catch! Just in time :-(.

B.

> On 5 Aug 2018, at 22:43, Ash Berlin-Taylor <ash_airflowlist@xxxxxxxxxxxxxx> wrote:
> 
> Entirely possible, though I wasn't even dealing with the scheduler - the issue I was addressing was entirely in the webserver for a pre-existing Task Instance.
> 
> Ah, I hadn't noticed/twigged we are using sqlalchemy-utc. It appears that isn't working right/ as expected. This line: https://github.com/spoqa/sqlalchemy-utc/blob/master/sqlalchemy_utc/sqltypes.py#L34 doens't look right for us - as you mentioned the TZ is set to something (rather than having no TZ value).
> 
> Some background on how Pq handles TZs. It always returns DTs in the TZ of the connection. I'm not sure if this is unique to postgres or if other DBs behave the same.
> 
> postgres=# select '2018-08-03 00:00:00+00:00'::timestamp with time zone;
>      timestamptz
> ------------------------
> 2018-08-03 01:00:00+01
> 
> postgres=# select '2018-08-03 02:00:00+02'::timestamp with time zone;
>      timestamptz
> ------------------------
> 2018-08-03 01:00:00+01
> 
> The server will always return TZs in the connection timezone.
> 
> postgres=# set timezone=utc;
> SET
> postgres=# select '2018-08-03 02:00:00+02'::timestamp with time zone;
>      timestamptz
> ------------------------
> 2018-08-03 00:00:00+00
> (1 row)
> 
> postgres=# select '2018-08-03 01:00:00+01'::timestamp with time zone;
>      timestamptz
> ------------------------
> 2018-08-03 00:00:00+00
> (1 row)
> 
> 
> 
> 
> -ash
> 
>> On 5 Aug 2018, at 21:28, Bolke de Bruin <bdbruin@xxxxxxxxx> wrote:
>> 
>> This is the issue:
>> 
>> [2018-08-05 22:08:21,952] {jobs.py:906} INFO - NEXT RUN DATE: 2018-08-03 00:00:00+00:00 tzinfo: <Timezone [UTC]>
>> [2018-08-05 22:08:22,007] {jobs.py:1425} INFO - Created <DagRun example_http_operator @ 2018-08-03 02:00:00+02:00: scheduled__2018-08-03T00:00:00+00:00, externally triggered: False>
>> 
>> [2018-08-05 22:08:24,651] {jobs.py:906} INFO - NEXT RUN DATE: 2018-08-04 02:00:00+02:00 tzinfo: psycopg2.tz.FixedOffsetTimezone(offset=120, name=None)
>> [2018-08-05 22:08:24,696] {jobs.py:1425} INFO - Created <DagRun example_http_operator @ 2018-08-04 02:00:00+02:00: scheduled__2018-08-04T02:00:00+02:00, externally triggered: False>
>> 
>> Notice at line 1+2: that the next run date is correctly in UTC but from the DB it gets a +2. At the next bit (3+4) we get a psycopg2.tz.FixedOffsetTimezone which should be set to UTC according to the specs of https://github.com/spoqa/sqlalchemy-utc <https://github.com/spoqa/sqlalchemy-utc> , but it isn’t. 
>> 
>> So changing your setting of the DB to UTC fixes the symptom but not the cause.
>> 
>> B.
>> 
>>> On 5 Aug 2018, at 22:03, Ash Berlin-Taylor <ash_airflowlist@xxxxxxxxxxxxxx> wrote:
>>> 
>>> Sorry for being terse before.
>>> 
>>> So the issue is that the ts loaded from the DB is not in UTC, it's in GB/+01 (the default of the DB server)
>>> 
>>> For me, on a currently running 1.9 (no TZ) db:
>>> 
>>> airflow=# select * from task_instance;
>>> get_op            | example_http_operator | 2018-07-23 00:00:00
>>> 
>>> This date time appears in the log url, and the path it looks at on S3 is 
>>> 
>>> .../example_http_operator/2018-07-23T00:00:00/1.log
>>> 
>>> If my postgres server has a default timezone of GB (which the one running on my laptop does), and I then apply the migration then it is converted to that local time.
>>> 
>>> airflow=# select * from task_instance;
>>> get_op            | example_http_operator | 2018-07-23 01:00:00+01
>>> 
>>> airflow=# set timezone=UTC;
>>> airflow=# select * from task_instance;
>>> get_op            | example_http_operator | 2018-07-23 00:00:00+00
>>> 
>>> 
>>> This is all okay so far. The migration has kept the column at the same moment in time.
>>> 
>>> The issue come when the UI tries to display logs for this old task: because the timezone of the connection is not UTC, PG returns a date with a +01 TZ. Thus after the migration this old task tries to look for a log file of
>>> 
>>> .../example_http_operator/2018-07-23T01:00:00/1.log
>>> 
>>> which doesn't exist - it's changed the time it has rendered from midnight (in v1.9) to 1am (in v1.10).
>>> 
>>> (This is with my change to log_filename_template from UPDATING.md in my other branch)
>>> 
>>> Setting the timezone to UTC per connection means the behaviour of Airflow doesn't change depending on how the server is configured.
>>> 
>>> -ash
>>> 
>>>> On 5 Aug 2018, at 20:58, Bolke de Bruin <bdbruin@xxxxxxxxx> wrote:
>>>> 
>>>> Digging in a bit further. 
>>>> 
>>>> {{{{ ti.dag_id }}}}/{{{{ ti.task_id }}}}/{{{{ ts }}}}/{{{{ try_number }}}}.log
>>>> 
>>>> is the format
>>>> 
>>>> ts = execution_date.isoformat and should be in UTC afaik.
>>>> 
>>>> something is weird tbh.
>>>> 
>>>> B.
>>>> 
>>>> 
>>>>> On 5 Aug 2018, at 21:32, Bolke de Bruin <bdbruin@xxxxxxxxx> wrote:
>>>>> 
>>>>> Ash,
>>>>> 
>>>>> Reading your proposed changes on your “set-timezone-to-utc” branch and below analysis, I am not sure what you are perceiving as an issue.
>>>>> 
>>>>> For conversion we assume everything is stored in UTC and in a naive format. Conversion then adds the timezone information. This results in the following
>>>>> 
>>>>> postgres timezone = “Europe/Amsterdam”
>>>>> 
>>>>> 
>>>>> airflow=# select * from task_instance;
>>>>> get_op            | example_http_operator | 2018-07-27 02:00:00+02
>>>>> 
>>>>> airflow=# set timezone=UTC;
>>>>> airflow=# select * from task_instance;
>>>>> get_op            | example_http_operator | 2018-07-27 00:00:00+00
>>>>> 
>>>>> If we don’t set the timezone in the connection postgres assumes server timezone (in my case “Europe/Amsterdam”). So every datetime Airflow receives will be in “Europe/Amsterdam” format. However as we defined the model to use UTCDateTime it will always convert the returned DateTime to UTC.
>>>>> 
>>>>> If we have configured Airflow to support something else as UTC as the default timezone or a DAG has a associated timezone we only convert to that timezone when calculating the next runtime (not for cron btw). Nowhere else and thus we are UTC everywhere.
>>>>> 
>>>>> What do you think is inconsistent?
>>>>> 
>>>>> Bolke
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>>> On 5 Aug 2018, at 18:13, Ash Berlin-Taylor <ash_airflowlist@xxxxxxxxxxxxxx> wrote:
>>>>>> 
>>>>>> Relating to 2): I'm not sure that the upgrade from timezoneless to timezone aware colums in the task instance is right, or at least it's not what I expected.
>>>>>> 
>>>>>> Before weren't all TZs from schedule dates etc in UTC? For the same task instance (these outputs from psql directly):
>>>>>> 
>>>>>> before: execution_date=2017-09-04 00:00:00
>>>>>> after: execution_date=2017-09-04 01:00:00+01
>>>>>> 
>>>>>> **Okay the migration is fine**. It appears that the migration has done the right thing, but my local DB I'm testing with has a Timezone of GB set, so Postgres converts it to that TZ on returning an object.
>>>>>> 
>>>>>> 3) Do we need to set the TZ of the connection to UTC in SQLAlchemy to have consistent behaviour? Is this possible some how? I don't know SQLAlchemy that well.
>>>>>> 
>>>>>> 
>>>>>> -ash
>>>>>> 
>>>>>>> On 5 Aug 2018, at 16:01, Ash Berlin-Taylor <ash_airflowlist@xxxxxxxxxxxxxx> wrote:
>>>>>>> 
>>>>>>> 1.) Missing UPDATING note about change of task_log_reader to now always being "task" (was "s3.task" before.). Logging config is much simpler now though. This may be particular to my logging config, but given how much of a pain it was to set up S3 logging in 1.9 I have shared my config with some people in the Gitter chat so It's not just me.
>>>>>>> 
>>>>>>> 2) The path that log-files are written to in S3 has changed (again - this happened from 1.8 to 1.9). I'd like to avoid having to move all of my log files again to continue viewing them. The change is that the path now (in 1.10) has a timezone in it, and the date is in local time, before it was UTC:
>>>>>>> 
>>>>>>> before: 2018-07-23T00:00:00/1.log
>>>>>>> after: 2018-07-23T01:00:00+01:00/1.log
>>>>>>> 
>>>>>>> We can possibly get away with an updating note about this to set a custom log_filename_template. Testing this now.
>>>>>>> 
>>>>>>>> On 5 Aug 2018, at 15:00, Ash Berlin-Taylor <ash@xxxxxxxxxxxxxx> wrote:
>>>>>>>> 
>>>>>>>> -1(binding) from me.
>>>>>>>> 
>>>>>>>> Installed with:
>>>>>>>> 
>>>>>>>> AIRFLOW_GPL_UNIDECODE=yes pip install 'https://dist.apache.org/repos/dist/dev/incubator/airflow/1.10.0rc3/apache-airflow-1.10.0rc3+incubating-bin.tar.gz#egg=apache-airflow[emr <https://dist.apache.org/repos/dist/dev/incubator/airflow/1.10.0rc3/apache-airflow-1.10.0rc3+incubating-bin.tar.gz#egg=apache-airflow[emr>, s3, crypto]>=1.10'
>>>>>>>> 
>>>>>>>> Install went fine.
>>>>>>>> 
>>>>>>>> Our DAGs that use SparkSubmitOperator are now failing as there is now a hard dependency on the Kubernetes client libs, but the `emr` group doesn't mention this.
>>>>>>>> 
>>>>>>>> Introduced in https://github.com/apache/incubator-airflow/pull/3112 <https://github.com/apache/incubator-airflow/pull/3112>
>>>>>>>> 
>>>>>>>> I see two options for this - either conditionally enable k8s:// support if the import works, or (less preferred) add kube-client to the emr deps (which I like less)
>>>>>>>> 
>>>>>>>> Sorry - this is the first time I've been able to test it.
>>>>>>>> 
>>>>>>>> I will install this dep manually and continue testing.
>>>>>>>> 
>>>>>>>> -ash
>>>>>>>> 
>>>>>>>> (Normally no time at home due to new baby, but I got a standing desk, and a carrier meaning she can sleep on me and I can use my laptop. Win!)
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On 4 Aug 2018, at 22:32, Bolke de Bruin <bdbruin@xxxxxxxxx <mailto:bdbruin@xxxxxxxxx>> wrote:
>>>>>>>>> 
>>>>>>>>> Bump. 
>>>>>>>>> 
>>>>>>>>> Committers please cast your vote. 
>>>>>>>>> 
>>>>>>>>> B.
>>>>>>>>> 
>>>>>>>>> Sent from my iPhone
>>>>>>>>> 
>>>>>>>>>> On 3 Aug 2018, at 13:23, Driesprong, Fokko <fokko@xxxxxxxxxxxxxx <mailto:fokko@xxxxxxxxxxxxxx>> wrote:
>>>>>>>>>> 
>>>>>>>>>> +1 Binding
>>>>>>>>>> 
>>>>>>>>>> Installed it using: SLUGIFY_USES_TEXT_UNIDECODE=yes pip install
>>>>>>>>>> https://dist.apache.org/repos/dist/dev/incubator/airflow/1.10.0rc3/apache-airflow-1.10.0rc3+incubating-bin.tar.gz <https://dist.apache.org/repos/dist/dev/incubator/airflow/1.10.0rc3/apache-airflow-1.10.0rc3+incubating-bin.tar.gz>
>>>>>>>>>> 
>>>>>>>>>> Cheers, Fokko
>>>>>>>>>> 
>>>>>>>>>> 2018-08-03 9:47 GMT+02:00 Bolke de Bruin <bdbruin@xxxxxxxxx>:
>>>>>>>>>> 
>>>>>>>>>>> Hey all,
>>>>>>>>>>> 
>>>>>>>>>>> I have cut Airflow 1.10.0 RC3. This email is calling a vote on the release,
>>>>>>>>>>> which will last for 72 hours. Consider this my (binding) +1.
>>>>>>>>>>> 
>>>>>>>>>>> Airflow 1.10.0 RC 3 is available at:
>>>>>>>>>>> 
>>>>>>>>>>> https://dist.apache.org/repos/dist/dev/incubator/airflow/1.10.0rc3/ <
>>>>>>>>>>> https://dist.apache.org/repos/dist/dev/incubator/airflow/1.10.0rc3/>
>>>>>>>>>>> 
>>>>>>>>>>> apache-airflow-1.10.0rc3+incubating-source.tar.gz is a source release that
>>>>>>>>>>> comes with INSTALL instructions.
>>>>>>>>>>> apache-airflow-1.10.0rc3+incubating-bin.tar.gz is the binary Python
>>>>>>>>>>> "sdist"
>>>>>>>>>>> release.
>>>>>>>>>>> 
>>>>>>>>>>> Public keys are available at:
>>>>>>>>>>> 
>>>>>>>>>>> https://dist.apache.org/repos/dist/release/incubator/airflow/ <
>>>>>>>>>>> https://dist.apache.org/repos/dist/release/incubator/airflow/>
>>>>>>>>>>> 
>>>>>>>>>>> The amount of JIRAs fixed is over 700. Please have a look at the
>>>>>>>>>>> changelog.
>>>>>>>>>>> Since RC2 the following has been fixed:
>>>>>>>>>>> 
>>>>>>>>>>> * [AIRFLOW-2817] Force explicit choice on GPL dependency
>>>>>>>>>>> * [AIRFLOW-2716] Replace async and await py3.7 keywords
>>>>>>>>>>> * [AIRFLOW-2810] Fix typo in Xcom model timestamp
>>>>>>>>>>> 
>>>>>>>>>>> Please note that the version number excludes the `rcX` string as well
>>>>>>>>>>> as the "+incubating" string, so it's now simply 1.10.0. This will allow us
>>>>>>>>>>> to rename the artifact without modifying the artifact checksums when we
>>>>>>>>>>> actually release.
>>>>>>>>>>> 
>>>>>>>>>>> WARNING: Due to licensing requirements you will need to set
>>>>>>>>>>> SLUGIFY_USES_TEXT_UNIDECODE=yes in your environment when
>>>>>>>>>>> installing or upgrading. We will try to remove this requirement for the
>>>>>>>>>>> next release.
>>>>>>>>>>> 
>>>>>>>>>>> Cheers,
>>>>>>>>>>> Bolke
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> 
>