[API integration] Add StockSnap #114

krysal · 2021-06-28T17:25:25Z

The StockSnap API is missing some of our required fields, so the image data is completed by scraping each image's landing page, which costs one additional request per image.

For example, the image URL is built from the CDN pattern https://cdn.stocksnap.io/img-thumbs/960w/{img_id|img_slug}.jpg, and that URL is also used for the thumbnail_url. There is also a smaller version that we could use: https://cdn.stocksnap.io/img-thumbs/280h/{img_id|img_slug}.jpg.
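
A minimal sketch of the URL construction described above. The function name `build_image_urls` and its single `identifier` argument are assumptions for illustration only; the pattern above writes the placeholder as `{img_id|img_slug}`, and this sketch does not try to guess how the ID and slug are combined.

```python
CDN_BASE = "https://cdn.stocksnap.io/img-thumbs"


def build_image_urls(identifier: str) -> dict:
    """Return the 960w image URL and the smaller 280h variant.

    `identifier` stands in for the `{img_id|img_slug}` placeholder; its exact
    composition is an assumption of this sketch.
    """
    image_url = f"{CDN_BASE}/960w/{identifier}.jpg"
    return {
        "image_url": image_url,
        # The 960w URL is reused as the thumbnail, as described above.
        "thumbnail_url": image_url,
        # Smaller variant that could serve as the thumbnail instead.
        "small_url": f"{CDN_BASE}/280h/{identifier}.jpg",
    }
```

Switching the thumbnail to the 280h variant would then be a one-line change in the returned dictionary.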

zackkrida · 2021-07-29T16:06:47Z

@krysal do you need anything here besides review from @dhruvkb and @obulat? Or are there any blockers?

krysal · 2021-07-31T02:52:16Z

It is ready for review now, thanks for the patience!

dhruvkb

👍 This is looking very good, but it could break down if the pages are redesigned.

We should mention that it uses page scraping in the top-level docstrings so that it's easier to debug later.

obulat

This is the first API workflow that also uses web scraping, so I guess we need to discuss the scraping process:

1. Is it absolutely necessary? Scraping is expensive (data-wise and time-wise), and we should avoid it if at all possible. I know that the API does not provide all the data that we normally need. Could we use the CommonCrawl data for StockSnap instead? I would say that we should use scraping until we can add a CommonCrawl pipeline for StockSnap, because its images are quite valuable and we really want to add them to the catalog.
2. What can we do to ensure that we are scraping politely? We do have a delayed requester, but is a delay of 1 second enough?

What do you think about these issues with web scraping, @krysal, @dhruvkb, @zackkrida? Are there any other issues that I haven't mentioned?
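
To make the politeness question concrete, here is a rough sketch of what a fixed minimum delay between requests amounts to. It is not the project's delayed requester: the class name `PoliteRequester`, the injected `fetch` callable, and the `clock`/`sleep` parameters are all hypothetical, chosen so the behaviour can be checked without real network calls.

```python
import time


class PoliteRequester:
    """Enforce a minimum delay between consecutive calls to `fetch`.

    Hypothetical helper for illustration; `fetch` is any callable that takes
    a URL and returns a response.
    """

    def __init__(self, fetch, delay=1.0, clock=time.monotonic, sleep=time.sleep):
        self._fetch = fetch
        self._delay = delay
        self._clock = clock
        self._sleep = sleep
        self._last_call = None  # time of the previous request, if any

    def get(self, url):
        if self._last_call is not None:
            elapsed = self._clock() - self._last_call
            if elapsed < self._delay:
                # Wait only for the remainder of the delay window.
                self._sleep(self._delay - elapsed)
        self._last_call = self._clock()
        return self._fetch(url)
```

With one extra request per image, the total added wall-clock time is at least `delay` multiplied by the number of images, which is the cost side of the trade-off raised in point 1.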

zackkrida · 2021-08-04T13:10:22Z

I will look at this more deeply later today, but if someone could list the fields that are not available in the API and therefore require scraping, I could reach out to the StockSnap developer about adding those fields.

He's quite approachable so it would be worth asking.

zackkrida · 2021-08-04T13:59:39Z

So, the missing fields (thanks @obulat) are:

- image title
- creator
- creator url

The image titles are actually built from the first two tags, so for example an image tagged:

becomes "Father Daughter Photo"
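
A small sketch of that title rule, under two assumptions inferred from the single example above: the first two tags are title-cased, and the word "Photo" is appended. The function name `build_title`, the `suffix` parameter, and the sample tag values in the usage note are hypothetical.

```python
def build_title(tags, suffix="Photo"):
    """Join the first two tags, title-cased, and append `suffix`.

    Returns None when there are no usable tags. The "Photo" suffix is an
    assumption based on the one example in this thread.
    """
    words = [tag.strip().title() for tag in tags[:2] if tag.strip()]
    if not words:
        return None
    return " ".join(words + [suffix])
```

For instance, `build_title(["father", "daughter", "family"])` returns `"Father Daughter Photo"`.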

With regard to the creators, which URL do we want to show: their StockSnap author page, or their 'custom' link?

In this case, it's https://stocksnap.io/author/mattmoloney versus https://mjmolo.com/

obulat · 2021-08-04T14:07:24Z

Custom link would be ideal for the link that credits the creator, but the StockSnap author page would be fine as well.
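
That preference could be expressed roughly as below. This is only a sketch: the function name `pick_creator_url` and its arguments are hypothetical, and the author page pattern is inferred from the single example URL quoted above.

```python
AUTHOR_PAGE = "https://stocksnap.io/author/{username}"


def pick_creator_url(username, custom_url=None):
    """Prefer the creator's custom link; fall back to the author page."""
    if custom_url and custom_url.strip():
        return custom_url.strip()
    if username:
        # Pattern assumed from the example author URL in this thread.
        return AUTHOR_PAGE.format(username=username)
    return None
```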

krysal · 2021-08-05T00:06:24Z

@obulat I had the same concerns about the scraping process, though I was mainly thinking about point 1 that you mention (the resources and time consumed by the extra tasks) and the risk of it breaking if the structure of the pages changes, as @dhruvkb said. Point 2 is certainly something to take into account if we move forward with this.

The good news is that the StockSnap developer is kindly adding the required fields after @zackkrida contacted him (thank you, Zack!). I already added the title from the API, so once we can get the creator and creator_url from the API too, we can skip the scraping entirely 🤞 I agree with Olga in preferring the custom link over the StockSnap author page for creator_url, if it were up to us what the API provides (in the current process it would involve an extra request).

Finally, thank you everyone for your reviews. This is on hold for now, waiting for the API changes; hopefully it will be easier then :)

krysal · 2021-08-11T23:15:42Z

Removed the scraping; this is now a complete, normal API workflow.

zackkrida

This looks great, thank you @krysal. Glad to see the scraping is no longer needed.

One question: would StockSnap be a good candidate for popularity data? We have page_views, downloads, and likes for each image. Maybe it makes sense to explore this in a different PR. @obulat might have some ideas since she has the most experience with the popularity data.

obulat and others added 15 commits June 8, 2021 16:45

Create a Provider API script template

7ca3fea

Signed-off-by: Olga Bulat <[email protected]>

Merge branch 'main' into template

b343d84

# Conflicts: # src/cc_catalog_airflow/dags/common/__init__.py

Shorten lines

7181cd3

Signed-off-by: Olga Bulat <[email protected]>

Update src/cc_catalog_airflow/templates/template_provider.py_template

5e45660

Co-authored-by: Zack Krida <[email protected]>

Better wording for script date parameter

1331d46

Co-authored-by: Zack Krida <[email protected]>

Replace relative path with absolute to fix file not found errors

5efe43a

Signed-off-by: Olga Bulat <[email protected]>

Make image the default media type

4a03768

Co-authored-by: Zack Krida <[email protected]>

Merge branch 'template' of github.com:WordPress/openverse-catalog int…

41714d2

…o template

Make the script output clearer

88a777c

Signed-off-by: Olga Bulat <[email protected]>

Fix typo in provider template script

13e607b

Co-authored-by: Krystle Salazar <[email protected]>

Merge branch 'main' into template

67b9b30

Merge remote-tracking branch 'origin/template' into template

ab8d3cc

Improve DAG creation template

264306c

Signed-off-by: Olga Bulat <[email protected]>

Create base provider files for stocksnap

14452b2

Add StockSnap to dags/util/loader/provider_details.py

a42476a

krysal changed the base branch from main to template June 28, 2021 17:29

dhruvkb added this to Needs review in Openverse Jun 28, 2021

krysal force-pushed the stocksnap branch from 656ac95 to 2232d40 Compare June 28, 2021 19:29

Program stocksnap script with minimum required fields

b1cc1fe

krysal force-pushed the stocksnap branch from 2232d40 to b1cc1fe Compare June 28, 2021 19:37

krysal added 5 commits July 2, 2021 00:33

Complete image's title, creator and creator_url

7a14f6a

Fix filling of tags field

9519967

Add instruction to write tsv file with image data

127fa29

Add samples files of an image and a api response for tests

93415d9

Refactor to make only one extra request per image

ab8fa90

krysal force-pushed the stocksnap branch from 7cbe29c to ab8fa90 Compare July 2, 2021 17:08

Base automatically changed from template to licenses July 9, 2021 16:18

krysal changed the base branch from licenses to main July 16, 2021 20:09

Merge branch 'main' into stocksnap solving conflicts

eaaba61

krysal moved this from Needs review to In progress in Openverse Jul 27, 2021

zackkrida marked this pull request as ready for review July 29, 2021 16:06

zackkrida requested a review from a team as a code owner July 29, 2021 16:06

zackkrida requested review from obulat and dhruvkb July 29, 2021 16:06

krysal added 2 commits July 30, 2021 11:58

Pass license_info instead of license_ and license_version

bb31eb9

Add stocksnap tests

924b2b1

krysal force-pushed the stocksnap branch from 1d3f2cb to 924b2b1 Compare July 31, 2021 02:19

dhruvkb approved these changes Aug 4, 2021

obulat reviewed Aug 4, 2021

krysal added 2 commits August 4, 2021 18:34

Get image title from API response instead of the scraped page

ed77e0e

Also replace `_get_license()` function with constant.

Update stocksnap tests and example full_item.json

6724b14

krysal added 6 commits August 11, 2021 14:17

Merge branch 'main' into stocksnap

2d09991

Get foreign_landing_url from StockSnap API

8dca7f9

Don't make additional requests per image as info is now available in the API.

Make image's title from tags/keywords

a7d562a

Format with black & flake8

925272e

Get creator data from StockSnap API

bfc9d0d

Update StockSnap tests and example files

5bf8122

krysal requested a review from obulat August 11, 2021 23:15

zackkrida approved these changes Aug 11, 2021

Openverse automation moved this from In progress to Reviewer approved Aug 11, 2021

krysal merged commit b12ba81 into main Aug 17, 2021

krysal deleted the stocksnap branch August 17, 2021 15:09

Openverse automation moved this from Reviewer approved to Done! Aug 17, 2021
