-
Notifications
You must be signed in to change notification settings - Fork 51
[API integration] Add StockSnap #114
Conversation
Signed-off-by: Olga Bulat <[email protected]>
# Conflicts: # src/cc_catalog_airflow/dags/common/__init__.py
Signed-off-by: Olga Bulat <[email protected]>
Co-authored-by: Zack Krida <[email protected]>
Co-authored-by: Zack Krida <[email protected]>
Signed-off-by: Olga Bulat <[email protected]>
Co-authored-by: Zack Krida <[email protected]>
Signed-off-by: Olga Bulat <[email protected]>
Co-authored-by: Krystle Salazar <[email protected]>
Signed-off-by: Olga Bulat <[email protected]> Signed-off-by: Olga Bulat <[email protected]>
It is ready for review now, thanks for the patience! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 This is looking very good, but it could break down if the pages are redesigned.
We should mention that it uses page scraping in the top-level docstrings so that it's easier to debug later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the first API workflow that also uses web scraping, so I guess we need to discuss the scraping process:
- Is it absolutely necessary? Scraping is expensive (data-wise, time-wise), and we should avoid it if at all possible. I know that the API does not provide all the data that we normally need. Could we use the CommonCrawl data for StockSnap instead? I would say that we should use scraping until we can add CommonCrawl pipeline for StockSnap, because the images from it are quite valuable and we really want to add it to the catalog
- What can we do to ensure that we are scraping politely? We do have a delayed requester, but is a delay of 1 second enough?
What do you think about these issues with web scraping, @krysal , @dhruvkb , @zackkrida ? Are there any other issues that I haven't mentioned?
I will look at this more deeply later today, but if someone could list the fields not available in the API, that require scraping, I could reach out to the StockSnap developer about adding those fields. He's quite approachable so it would be worth asking. |
So, the missing fields (thanks @obulat) are:
The image titles are actually built from the first two tags, so for example an image tagged: becomes "Father Daughter Photo" With regard to the creators, which url do we want to show? Their stocksnap author page, or their 'custom' link? In this case, it's https://stocksnap.io/author/mattmoloney versus https://mjmolo.com/ |
Custom link would be ideal for the link that credits the creator, but the StockSnap author page would be fine as well. |
Also replace `_get_license()` function with constant.
@obulat I had the same concerns with the scraping process, though I was mainly thinking about point 1 you mention, the resources and time-consuming extra tasks, and the fragility of it breaking if the structure of the pages changes, as @dhruvkb said. Point 2 is certainly something to take into account if we'd move forward on this. The good news is the StockSnap developer is kindly adding the required fields right after @zackkrida contact (thank you, Zack!) I already added the Finally, thank you everyone for your reviews, this is on hold for now, waiting for the API changes, hopefully this will be easier :) |
Don't make additional requests per image as info is now available in the API.
Removed the scraping, now this is a normal API workflow completed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great, thank you @krysal. Glad to see the scraping is no longer needed.
One question: would StockSnap be a good candidate for popularity data? We have page_views
, downloads
, and likes
for each image. Maybe it makes sense to explore this in a different PR. @obulat might have some ideas since she has the most experience with the popularity data.
Fixes WordPress/openverse#1740.
The StockSnap API is missing some of our required fields, so the image data is completed from its landing page, making an additional request for each image.
For example, the image URL is build from its CDN
https://cdn.stocksnap.io/img-thumbs/960w/{img_id|img_slug}.jpg
and that URL is also used for thethumbnail_url
. There is a small version that we could use too:https://cdn.stocksnap.io/img-thumbs/280h/{img_id|img_slug}.jpg
.