Processing of files: best practices for airflow


Hi all,

I have a question regarding the processing of individual files:

We collect flat files from different sources in CSV, raw, and
unstructured formats. These files are stored in a
"{process}/YYYY/MM/DD/" hierarchy, and we've built a GCSToGCSTransform
operator that runs a download/transform/upload loop on each file in
the directory.

This works OK, but I get the impression that the DAG is getting a bit
messy as a result, and because the logic is contained in each DAG, I
see very little potential for code reuse.

We have had some suggestions, which mention writing libraries and
callable script files so that the functionality can be leveraged
across multiple DAGs. I can also imagine that some people write Docker
containers for this and run them in the cloud, telling each container
where to get the files and where to put the results.
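
To make the library idea concrete, here is a minimal sketch of what
such a shared module might look like. It is only an illustration, not
our actual operator: the names dated_prefix and transform_files are
hypothetical, and the download, transform, and upload steps are passed
in as plain callables so the loop itself has no dependency on GCS or
on Airflow.

```python
from datetime import date
from typing import Callable, Iterable, List


def dated_prefix(process: str, day: date) -> str:
    """Build a "{process}/YYYY/MM/DD/" prefix for the given day."""
    return f"{process}/{day:%Y/%m/%d}/"


def transform_files(
    names: Iterable[str],
    download: Callable[[str], bytes],
    transform: Callable[[bytes], bytes],
    upload: Callable[[str, bytes], None],
) -> List[str]:
    """Run a download/transform/upload loop over each file name.

    All three steps are injected (hypothetical interface), so the same
    loop can be reused by any DAG and tested without cloud access.
    Returns the names that were processed, in order.
    """
    processed = []
    for name in names:
        data = download(name)      # fetch the raw file contents
        result = transform(data)   # apply the per-process transformation
        upload(name, result)       # store the result under the same name
        processed.append(name)
    return processed
```

Each DAG would then only supply its own transform function, for
example by wrapping transform_files in the python_callable of a
PythonOperator, while the loop lives in one importable place.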

So I'm wondering whether anyone has found effective ways to deal with
this, and what is considered best practice for Airflow?

Rgds,

Gerard