Skip to content
This repository has been archived by the owner on Aug 4, 2023. It is now read-only.

Update the log cleaner DAG, and add a Python-based alternative log cleaner DAG #142

Merged
merged 21 commits into from
Aug 20, 2021

Conversation

obulat
Copy link
Contributor

@obulat obulat commented Aug 18, 2021

Related to

Related to WordPress/openverse#1751 by @mathemancer

Description

This PR continues the work in #139. It replaces some deprecated functions, fixes style violations in the original DAG. It is a little difficult to examine the results of this DAG because it deletes its own log as well :) (if you set the maxLogAgeInDays to -1 in the params)
It also adds an alternative Python-based version of the same DAG that is easier to test.
I would appreciate any ideas of which version is preferable and why :)

Technical details

The python_airflow_log_cleanup_workflow.py DAG loops through the airflow task log folders, and deletes a folder if it was modified earlier than maxLogAgeInDays days ago.

The DAG does not delete dag_processor_manager logs because in the Docker container, there is only one log file for it, and it is not yet clear if all the logs are written to one file (in which case, we would have to delete an old portion of this file) or if the next day's log is written to a new file.

Tests

To test that the DAGs do delete the logs using the Airflow Web UI, you can run one of the DAGs using {"maxLogInDays": -1} as a DAG run parameter
The tests for the Python version of the DAG create local log files with different modification dates. This makes it possible to test whether the DAG actually does delete the log files based on the log modified date.

Checklist

  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the default branch of the repository (main or master).
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.

Developer Certificate of Origin

Developer Certificate of Origin
Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

Setting base log folder and operator import were using deprecated Airflow functions

Signed-off-by: Olga Bulat <[email protected]>
Signed-off-by: Olga Bulat <[email protected]>
@obulat obulat requested a review from a team as a code owner August 18, 2021 15:02
@obulat obulat requested review from krysal and dhruvkb and removed request for a team August 18, 2021 15:02
@dhruvkb dhruvkb added this to Needs review in Openverse Aug 18, 2021
@obulat obulat changed the title Style edits Update the log cleaner DAG, and add a Python-based alternative log cleaner DAG Aug 18, 2021
Signed-off-by: Olga Bulat <[email protected]>
Copy link
Member

@krysal krysal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an impressive job considering all the parts of the Airflow log settings. It took me a while to test and when reviewing the documentation it seems that there is not much available that explains it. I'll go with the python version of the DAG, it's like one of the major selling points to use Airflow. From the "Features" section of its landing page:

Pure Python
No more command-line or XML black-magic! Use standard Python features to create your workflows, including date time formats for scheduling and loops to dynamically generate tasks. This allows you to maintain full flexibility when building your workflows.

Tests are working for me, I couldn't reproduce the error of the gh action locally 😕 I was able to run the python_airflow_log_cleanup DAG from the Airflow webserver and clean some logs but changing the fixed params of the workflow. More about this below.

Comment on lines 106 to 107
DEFAULT_MAX_LOG_AGE_IN_DAYS,
False,
Copy link
Member

@krysal krysal Aug 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These params are fixed here:

  • DEFAULT_MAX_LOG_AGE_IN_DAYS is 7
  • ENABLE_DELETE is always False

Running it from the Airflow webserver (the first time) didn't delete any log because of these values.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry the fixed params and no documentation caused you problems when reviewing, and thank you for persisting regardless :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh no, excuse me if it was misunderstood, I was referring here to the lack of Airflow documentation related to the logging configuration. I was searching for a description or explanation of the dag_processor_manager.log file to help with that but I found nothing from its site.

Copy link
Member

@zackkrida zackkrida left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much! Significant improvements. Let's go with the python version of the script and remove the other.

Openverse automation moved this from Needs review to Reviewer approved Aug 19, 2021
Signed-off-by: Olga Bulat <[email protected]>
Signed-off-by: Olga Bulat <[email protected]>
Signed-off-by: Olga Bulat <[email protected]>
Instead of removing logs from gitignore, set the specific /logs/ directory to be 'unignored'

Signed-off-by: Olga Bulat <[email protected]>
Signed-off-by: Olga Bulat <[email protected]>
Copy link
Member

@dhruvkb dhruvkb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, the Python-based cleaner seems much more maintainable!

@obulat obulat merged commit d791de6 into log-cleanup-dag Aug 20, 2021
@obulat obulat deleted the style_edits branch August 20, 2021 10:03
Openverse automation moved this from Reviewer approved to Done! Aug 20, 2021
obulat added a commit that referenced this pull request Aug 21, 2021
* Log cleanup DAG

* Update the log cleaner DAG, and add a Python-based alternative log cleaner DAG (#142)

* Set DAG id; remove deprecated calls

Setting base log folder and operator import were using deprecated Airflow functions

Signed-off-by: Olga Bulat <[email protected]>

* Add option to override default max age and enable delete

Signed-off-by: Olga Bulat <[email protected]>

* Move log_cleanup DAG files to openverse_catalog/dags

Signed-off-by: Olga Bulat <[email protected]>

* Fix import and set schedule to weekly

Signed-off-by: Olga Bulat <[email protected]>

* Remove bash log cleanup workflow

Signed-off-by: Olga Bulat <[email protected]>

* Remove outdated comments

Signed-off-by: Olga Bulat <[email protected]>

* Add sample log files as test resources

Signed-off-by: Olga Bulat <[email protected]>

* Fix .gitignore exceptions

Instead of removing logs from gitignore, set the specific /logs/ directory to be 'unignored'

Signed-off-by: Olga Bulat <[email protected]>

Co-authored-by: Zack Krida <[email protected]>

* Update .github/workflows/push_pull_request_test.yml

Co-authored-by: Olga Bulat <[email protected]>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
No open projects
Openverse
  
Done!
Development

Successfully merging this pull request may close these issues.

None yet

4 participants