Skip to content
This repository has been archived by the owner on Aug 4, 2023. It is now read-only.

[API integration] Add StockSnap #114

Merged
merged 32 commits into from
Aug 17, 2021
Merged

[API integration] Add StockSnap #114

merged 32 commits into from
Aug 17, 2021

Conversation

krysal
Copy link
Member

@krysal krysal commented Jun 28, 2021

Fixes WordPress/openverse#1740.

The StockSnap API is missing some of our required fields, so the image data is completed from its landing page, making an additional request for each image.

For example, the image URL is build from its CDN https://cdn.stocksnap.io/img-thumbs/960w/{img_id|img_slug}.jpg and that URL is also used for the thumbnail_url. There is a small version that we could use too: https://cdn.stocksnap.io/img-thumbs/280h/{img_id|img_slug}.jpg.

@krysal krysal changed the base branch from main to template June 28, 2021 17:29
@dhruvkb dhruvkb added this to Needs review in Openverse Jun 28, 2021
Base automatically changed from template to licenses July 9, 2021 16:18
@krysal krysal changed the base branch from licenses to main July 16, 2021 20:09
@krysal krysal moved this from Needs review to In progress in Openverse Jul 27, 2021
@zackkrida zackkrida marked this pull request as ready for review July 29, 2021 16:06
@zackkrida zackkrida requested a review from a team as a code owner July 29, 2021 16:06
@zackkrida zackkrida requested review from obulat and dhruvkb July 29, 2021 16:06
@zackkrida
Copy link
Member

@krysal do you need anything here, just review from @dhruvkb and @obulat? Or are there any blockers?

@krysal
Copy link
Member Author

krysal commented Jul 31, 2021

It is ready for review now, thanks for the patience!

Copy link
Member

@dhruvkb dhruvkb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 This is looking very good, but it could break down if the pages are redesigned.

We should mention that it uses page scraping in the top-level docstrings so that it's easier to debug later.

Copy link
Contributor

@obulat obulat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the first API workflow that also uses web scraping, so I guess we need to discuss the scraping process:

  1. Is it absolutely necessary? Scraping is expensive (data-wise, time-wise), and we should avoid it if at all possible. I know that the API does not provide all the data that we normally need. Could we use the CommonCrawl data for StockSnap instead? I would say that we should use scraping until we can add CommonCrawl pipeline for StockSnap, because the images from it are quite valuable and we really want to add it to the catalog
  2. What can we do to ensure that we are scraping politely? We do have a delayed requester, but is a delay of 1 second enough?

What do you think about these issues with web scraping, @krysal , @dhruvkb , @zackkrida ? Are there any other issues that I haven't mentioned?

@zackkrida
Copy link
Member

I will look at this more deeply later today, but if someone could list the fields not available in the API, that require scraping, I could reach out to the StockSnap developer about adding those fields.

He's quite approachable so it would be worth asking.

@zackkrida
Copy link
Member

So, the missing fields (thanks @obulat) are:

  • image title,
  • creator
  • creator url

The image titles are actually built from the first two tags, so for example an image tagged:

CleanShot 2021-08-04 at 09 56 29@2x

becomes "Father Daughter Photo"

CleanShot 2021-08-04 at 09 56 42@2x

With regard to the creators, which url do we want to show? Their stocksnap author page, or their 'custom' link?

CleanShot 2021-08-04 at 09 58 40@2x

In this case, it's https://stocksnap.io/author/mattmoloney versus https://mjmolo.com/

@obulat
Copy link
Contributor

obulat commented Aug 4, 2021

Custom link would be ideal for the link that credits the creator, but the StockSnap author page would be fine as well.

@krysal
Copy link
Member Author

krysal commented Aug 5, 2021

@obulat I had the same concerns with the scraping process, though I was mainly thinking about point 1 you mention, the resources and time-consuming extra tasks, and the fragility of it breaking if the structure of the pages changes, as @dhruvkb said. Point 2 is certainly something to take into account if we'd move forward on this.

The good news is the StockSnap developer is kindly adding the required fields right after @zackkrida contact (thank you, Zack!) I already added the title from the API, so once we can get the creator and creator_url from the API too we would skip the scraping totally 🤞 I agree with Olga in taking preference for the custom link for creator_url over the StockSnap author page if it were up to us in the API (in the current process would involve an extra request).

Finally, thank you everyone for your reviews, this is on hold for now, waiting for the API changes, hopefully this will be easier :)

@krysal
Copy link
Member Author

krysal commented Aug 11, 2021

Removed the scraping, now this is a normal API workflow completed.

@krysal krysal requested a review from obulat August 11, 2021 23:15
Copy link
Member

@zackkrida zackkrida left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great, thank you @krysal. Glad to see the scraping is no longer needed.

One question: would StockSnap be a good candidate for popularity data? We have page_views, downloads, and likes for each image. Maybe it makes sense to explore this in a different PR. @obulat might have some ideas since she has the most experience with the popularity data.

Openverse automation moved this from In progress to Reviewer approved Aug 11, 2021
@krysal krysal merged commit b12ba81 into main Aug 17, 2021
@krysal krysal deleted the stocksnap branch August 17, 2021 15:09
Openverse automation moved this from Reviewer approved to Done! Aug 17, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
No open projects
Openverse
  
Done!
Development

Successfully merging this pull request may close these issues.

[API Integration] https://stocksnap.io/
4 participants