March 29, 2021

Notes on an Open Scraping Database

Exposing the web's data texture

I’m no developer, but I’ve been using spreadsheets to query, extract and fiddle with the web for over a decade now.

And today I’m knee-deep in an R&D project exploring the link between the web and spreadsheets. It’s basically my dream client gig.

Along the way, I found this wonderful project from Geoffrey Litt - Wildcard:

Wildcard is a browser extension that empowers anyone to modify websites to meet their own specific needs, using a familiar spreadsheet view.

This is kind of an “outside in” use of spreadsheets - overlaying a familiar spreadsheet UX on top of a website.

I’m working on a kind of “inside out” use of spreadsheets - extracting data from websites into a spreadsheet environment.

Both of these projects rely on parsing and extracting some kind of structured data from arbitrary websites. This… is a very complex task.

Some gotchas (roughly ordered from simple to hard):

  • Websites change their code all the time, so extracting data consistently requires maintenance.
  • There’s no single standard (that I know of) for writing a scraper. Some use CSS selectors, some use XPath, some use regex, etc.
  • Some websites are actively hostile to scrapers and obfuscate their code.
  • JavaScript-heavy websites load data in various ways, often forcing you to render the full page with JavaScript to get what you want.
  • Edge cases like sites that rate-limit scrapers, return 403 status codes, etc.
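
To make the first two gotchas concrete, here is a minimal sketch in Python that pulls the same field out of a page in two different ways, then shows both approaches breaking after a small markup change. The page snippet, the class names and the function names are all made up for illustration. It also assumes the markup is well-formed enough for the standard library's XML parser, which real-world HTML often is not.

```python
import re
import xml.etree.ElementTree as ET

# A tiny, well-formed stand-in for a listing page (invented for this example).
PAGE = '<div><h1 class="title">Cosy loft</h1><span class="price">$120</span></div>'

# The same page after a hypothetical redesign renames one CSS class.
CHANGED = PAGE.replace('class="price"', 'class="amount"')


def price_via_xpath(html: str) -> str:
    """Extract the price with the XPath subset that ElementTree supports."""
    root = ET.fromstring(html)
    node = root.find(".//span[@class='price']")
    return node.text if node is not None and node.text else ""


def price_via_regex(html: str) -> str:
    """Extract the price with a regular expression tied to the exact markup."""
    match = re.search(r'<span class="price">([^<]+)</span>', html)
    return match.group(1) if match else ""


if __name__ == "__main__":
    # Two scraper styles, one answer on the original page.
    print(price_via_xpath(PAGE), price_via_regex(PAGE))
    # After the class rename, both come back empty.
    print(repr(price_via_xpath(CHANGED)), repr(price_via_regex(CHANGED)))
```

Neither approach is "the" right one. The point is only that two scrapers for the same field can look nothing alike, and that both need fixing the moment the site changes.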

What if… there was an open database for scraping? I’m imagining something like wikiscrape, where a community builds and maintains a library of URL + schema for scraping.

So, for example, I could look up a website like Airbnb and quickly grab the selector or XPath for extracting the price field from a listing.
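
Here is a rough sketch, in Python, of what one entry in such a database and a lookup against it might look like. Everything in it is hypothetical: the `FieldRule` and `SiteEntry` shapes, the `lookup` function, the `example.com` domain and the selectors are placeholders I made up to show the "URL + schema" idea, not real selectors for any real site.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple
from urllib.parse import urlparse


@dataclass(frozen=True)
class FieldRule:
    """One community-maintained extraction rule (hypothetical schema)."""
    field: str          # e.g. "price"
    selector_type: str  # "css", "xpath" or "regex"
    selector: str       # the selector expression itself


@dataclass(frozen=True)
class SiteEntry:
    """A URL pattern plus the rules that apply to pages matching it."""
    domain: str
    path_pattern: str   # regex matched against the URL path
    rules: Tuple[FieldRule, ...]


# Placeholder data: neither the domain nor the selectors are real.
REGISTRY = [
    SiteEntry(
        domain="example.com",
        path_pattern=r"^/listings/\d+$",
        rules=(
            FieldRule("price", "css", "span.listing-price"),
            FieldRule("title", "xpath", "//h1[@class='listing-title']"),
        ),
    ),
]


def lookup(url: str, field: str) -> Optional[FieldRule]:
    """Return the rule for `field` on the page at `url`, or None if unknown."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    if host.startswith("www."):
        host = host[4:]
    for entry in REGISTRY:
        if entry.domain == host and re.match(entry.path_pattern, parsed.path):
            for rule in entry.rules:
                if rule.field == field:
                    return rule
    return None


if __name__ == "__main__":
    print(lookup("https://www.example.com/listings/123", "price"))
```

Storing the selector type alongside the selector is one way to live with the "no single standard" problem: the database doesn't have to pick a winner, and each tool can use the rule types it understands.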

Of course this wouldn’t resolve the harder scraping problems but might be “good enough” for many general hobby/prosumer use cases.

Does such a thing exist?

If you’re working on something like this, get in touch!

This post was written by Tom Critchlow - blogger and independent consultant. Subscribe to join my occasional newsletter: