Notes on an Open Scraping Database

Exposing the web's data texture

I’m no developer, but I’ve been using spreadsheets to query, extract and fiddle with the web for over a decade now.

And today I’m knee-deep in an R&D project exploring the link between the web and spreadsheets. It’s basically my dream client gig.

Along the way, I found this wonderful project from Geoffrey Litt - Wildcard:

Wildcard is a browser extension that empowers anyone to modify websites to meet their own specific needs, using a familiar spreadsheet view.
@geoffreylitt https://www.geoffreylitt.com/wildcard/

This is kind of an “outside in” use of spreadsheets - overlaying a familiar spreadsheet UX on top of a website.

I’m working on a kind of “inside out” use of spreadsheets - extracting data from websites into a spreadsheet environment.

Both of these projects rely on parsing and extracting some kind of structured data from arbitrary websites. This... is a very complex task.

Some gotchas (roughly ordered from simple to hard):

Websites change their code all the time, so extracting data consistently requires ongoing maintenance.
There’s no single standard (that I know of) for writing a scraper. Some use CSS selectors, some use XPath, some use regex, and so on.
Some websites are actively hostile to scrapers and obfuscate their code.
JavaScript-heavy websites load data in various ways, often forcing you to render the full page, JavaScript and all, to get what you want.
Then there are edge cases: sites that rate-limit scrapers, return 403 status codes, and so on.
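
To make the first two gotchas concrete, here is a minimal sketch that pulls the same price out of a listing in two different ways, once with an XPath expression and once with a regex. Everything in it is hypothetical: the markup, the class names and the function names are made up for illustration, and the snippet is far tidier than a real page. It uses only Python's standard library, which has no CSS selector support, so that style is left out.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed listing markup (real pages are rarely this tidy).
SNIPPET = """<div class="listing">
  <h1 class="title">Cozy loft</h1>
  <span class="price">$120</span>
</div>"""

# The same listing after an imagined redesign renames the price class.
REDESIGNED = SNIPPET.replace('class="price"', 'class="amount"')


def price_via_xpath(markup):
    """Find the price with the limited XPath subset ElementTree supports."""
    root = ET.fromstring(markup)
    node = root.find(".//span[@class='price']")
    return node.text if node is not None else None


def price_via_regex(markup):
    """Find the price with a regex tied to the exact tag and attribute."""
    match = re.search(r'<span class="price">([^<]+)</span>', markup)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(price_via_xpath(SNIPPET), price_via_regex(SNIPPET))
    print(price_via_xpath(REDESIGNED), price_via_regex(REDESIGNED))
```

Both functions find the price in the original snippet, and both come back empty once the class name changes, which is the maintenance problem in miniature.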

What if... there was an open database for scraping? I’m imagining something like a “wikiscrape”, where a community builds and maintains a library of URL + schema pairs for scraping.

So for example I could look up a website like Airbnb and quickly grab the CSS selector or XPath for extracting the price field from a listing.
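
As a rough sketch of what a “URL + schema” entry might look like, here is one possible shape: a registry keyed by domain, where each field maps to a selector type and an expression. This is only an assumption about the design, not an existing project. The `REGISTRY` layout, the `lookup` function and the `example.com` selectors are all invented for illustration.

```python
from urllib.parse import urlparse

# Hypothetical community-maintained registry: domain -> field -> selector.
# The selectors below are placeholders, not real ones for any site.
REGISTRY = {
    "example.com": {
        "price": {"type": "css", "expr": "span.price"},
        "title": {"type": "xpath", "expr": "//h1[@class='title']"},
    },
}


def lookup(url, field):
    """Return the selector entry for a field on the URL's site, or None."""
    host = urlparse(url).hostname or ""
    # Treat www.example.com and example.com as the same site.
    if host.startswith("www."):
        host = host[4:]
    return REGISTRY.get(host, {}).get(field)


if __name__ == "__main__":
    print(lookup("https://www.example.com/rooms/1", "price"))
    print(lookup("https://unknown.test/", "price"))
```

Keying by domain keeps the lookup simple. A real version would probably need URL patterns too, since a listing page and a search page on the same site would need different selectors.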

Of course this wouldn’t resolve the harder scraping problems but might be “good enough” for many general hobby/prosumer use cases.

Does such a thing exist?

If you’re working on something like this, get in touch!