
Re: Beam leaving temporary datasets in BigQuery


Hi Andrew,
This was fixed in https://github.com/apache/beam/pull/5360 and will be available in 2.5.
Even with 2.4, the temporary datasets have a TTL of 24 hours and self-destruct after that.
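If you'd rather clear out the datasets that have already been left behind instead of waiting for the TTL, something along these lines with the google-cloud-bigquery Java client should do it. This is only a rough sketch: I'm assuming the leftover datasets share a "temp_dataset_" prefix, so please check the actual names in your project before deleting anything.

    import com.google.cloud.bigquery.BigQuery;
    import com.google.cloud.bigquery.BigQueryOptions;
    import com.google.cloud.bigquery.Dataset;

    public class CleanupTempDatasets {
      public static void main(String[] args) {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

        // Assumed prefix for the leftover temporary query datasets -- verify
        // against what you see in the BigQuery UI before running this.
        String prefix = "temp_dataset_";

        for (Dataset dataset : bigquery.listDatasets().iterateAll()) {
          String name = dataset.getDatasetId().getDataset();
          if (name.startsWith(prefix)) {
            // deleteContents() also removes the temporary tables inside.
            bigquery.delete(
                dataset.getDatasetId(), BigQuery.DatasetDeleteOption.deleteContents());
            System.out.println("Deleted " + name);
          }
        }
      }
    }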

On Thu, May 31, 2018 at 2:44 AM Andrew Jones <andrew+beam@xxxxxxxxxxxxxxxx> wrote:
Hi,

We've recently enabled two Beam batch jobs in production, running daily, and have noticed a whole load of datasets being left behind in BigQuery (see attached). These jobs both read and write from BigQuery, and we're using Beam 2.4.0. The jobs are running as templates (with `withTemplateCompatibility()` when reading).
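For reference, the read side of the jobs looks roughly like this (simplified sketch only; the class name, query and the rest of the pipeline are placeholders):

    import com.google.api.services.bigquery.model.TableRow;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.values.PCollection;

    public class DailyBatchJob {
      public static void main(String[] args) {
        PipelineOptions options =
            PipelineOptionsFactory.fromArgs(args).withValidation().create();
        Pipeline pipeline = Pipeline.create(options);

        // Query-based read; withTemplateCompatibility() is set because the job
        // is staged and launched as a Dataflow template.
        PCollection<TableRow> rows =
            pipeline.apply(
                "ReadFromBigQuery",
                BigQueryIO.readTableRows()
                    .fromQuery("SELECT ...")  // placeholder query
                    .usingStandardSql()
                    .withTemplateCompatibility());

        // ... transforms and a BigQueryIO.writeTableRows() sink go here ...

        pipeline.run();
      }
    }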

A similar issue has been reported at https://github.com/GoogleCloudPlatform/DataflowJavaSDK/issues/609.

The code to remove datasets does seem to be there, but I'm not seeing the logs in my job, so presumably it's not being called? https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryQuerySource.java#L151

Nothing else obvious in the logs.

Any ideas or suggestions on how to track this issue down?

Thanks,
Andrew