-
Notifications
You must be signed in to change notification settings - Fork 51
Update the log cleaner DAG, and add a Python-based alternative log cleaner DAG #142
Conversation
Setting base log folder and operator import were using deprecated Airflow functions Signed-off-by: Olga Bulat <[email protected]>
Signed-off-by: Olga Bulat <[email protected]>
Signed-off-by: Olga Bulat <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an impressive job considering all the parts of the Airflow log settings. It took me a while to test and when reviewing the documentation it seems that there is not much available that explains it. I'll go with the python version of the DAG, it's like one of the major selling points to use Airflow. From the "Features" section of its landing page:
Pure Python
No more command-line or XML black-magic! Use standard Python features to create your workflows, including date time formats for scheduling and loops to dynamically generate tasks. This allows you to maintain full flexibility when building your workflows.
Tests are working for me, I couldn't reproduce the error of the gh action locally 😕 I was able to run the python_airflow_log_cleanup
DAG from the Airflow webserver and clean some logs but changing the fixed params of the workflow. More about this below.
DEFAULT_MAX_LOG_AGE_IN_DAYS, | ||
False, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These params are fixed here:
DEFAULT_MAX_LOG_AGE_IN_DAYS
is 7ENABLE_DELETE
is alwaysFalse
Running it from the Airflow webserver (the first time) didn't delete any log because of these values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry the fixed params and no documentation caused you problems when reviewing, and thank you for persisting regardless :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh no, excuse me if it was misunderstood, I was referring here to the lack of Airflow documentation related to the logging configuration. I was searching for a description or explanation of the dag_processor_manager.log
file to help with that but I found nothing from its site.
Signed-off-by: Olga Bulat <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you so much! Significant improvements. Let's go with the python version of the script and remove the other.
Signed-off-by: Olga Bulat <[email protected]>
Signed-off-by: Olga Bulat <[email protected]>
# Conflicts: # openverse_catalog/dags/commoncrawl_scripts/airflow_log_cleanup.py
Signed-off-by: Olga Bulat <[email protected]>
Signed-off-by: Olga Bulat <[email protected]>
Signed-off-by: Olga Bulat <[email protected]>
Signed-off-by: Olga Bulat <[email protected]>
Signed-off-by: Olga Bulat <[email protected]>
Signed-off-by: Olga Bulat <[email protected]>
Signed-off-by: Olga Bulat <[email protected]>
Signed-off-by: Olga Bulat <[email protected]>
Instead of removing logs from gitignore, set the specific /logs/ directory to be 'unignored' Signed-off-by: Olga Bulat <[email protected]>
Signed-off-by: Olga Bulat <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, the Python-based cleaner seems much more maintainable!
* Log cleanup DAG * Update the log cleaner DAG, and add a Python-based alternative log cleaner DAG (#142) * Set DAG id; remove deprecated calls Setting base log folder and operator import were using deprecated Airflow functions Signed-off-by: Olga Bulat <[email protected]> * Add option to override default max age and enable delete Signed-off-by: Olga Bulat <[email protected]> * Move log_cleanup DAG files to openverse_catalog/dags Signed-off-by: Olga Bulat <[email protected]> * Fix import and set schedule to weekly Signed-off-by: Olga Bulat <[email protected]> * Remove bash log cleanup workflow Signed-off-by: Olga Bulat <[email protected]> * Remove outdated comments Signed-off-by: Olga Bulat <[email protected]> * Add sample log files as test resources Signed-off-by: Olga Bulat <[email protected]> * Fix .gitignore exceptions Instead of removing logs from gitignore, set the specific /logs/ directory to be 'unignored' Signed-off-by: Olga Bulat <[email protected]> Co-authored-by: Zack Krida <[email protected]> * Update .github/workflows/push_pull_request_test.yml Co-authored-by: Olga Bulat <[email protected]>
Related to
Related to WordPress/openverse#1751 by @mathemancer
Description
This PR continues the work in #139. It replaces some deprecated functions, fixes style violations in the original DAG. It is a little difficult to examine the results of this DAG because it deletes its own log as well :) (if you set the
maxLogAgeInDays
to -1 in the params)It also adds an alternative Python-based version of the same DAG that is easier to test.
I would appreciate any ideas of which version is preferable and why :)
Technical details
The
python_airflow_log_cleanup_workflow.py
DAG loops through the airflow task log folders, and deletes a folder if it was modified earlier thanmaxLogAgeInDays
days ago.The DAG does not delete
dag_processor_manager
logs because in the Docker container, there is only one log file for it, and it is not yet clear if all the logs are written to one file (in which case, we would have to delete an old portion of this file) or if the next day's log is written to a new file.Tests
To test that the DAGs do delete the logs using the Airflow Web UI, you can run one of the DAGs using {"maxLogInDays": -1} as a DAG run parameter
The tests for the Python version of the DAG create local log files with different modification dates. This makes it possible to test whether the DAG actually does delete the log files based on the log modified date.
Checklist
Update index.md
).main
ormaster
).Developer Certificate of Origin
Developer Certificate of Origin