March 29, 2021

Notes on an Open Scraping Database

Exposing the web's data texture

I’m no developer, but I’ve been using spreadsheets to query, extract and fiddle with the web for over a decade now.

And today I’m knee-deep in an R&D project exploring the link between the web and spreadsheets. It’s basically my dream client gig.

Along the way, I found this wonderful project from Geoffrey Litt - Wildcard:

Wildcard is a browser extension that empowers anyone to modify websites to meet their own specific needs, using a familiar spreadsheet view.


This is kind of an “outside in” use of spreadsheets - overlaying a familiar spreadsheet UX on top of a website.

I’m working on a kind of “inside out” use of spreadsheets - extracting data from websites into a spreadsheet environment.

Both of these projects rely on parsing and extracting some kind of structured data from arbitrary websites. This… is a very complex task.

Some gotchas (roughly ordered from simple to hard):

What if… there were an open database for scraping? I’m imagining something like wikiscrape, where a community builds and maintains a library of URL + schema pairs for scraping.

So, for example, I could look up a website like Airbnb and quickly grab the selector or XPath for extracting the price field from a listing.
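To make the idea a bit more concrete, here's a rough sketch of what a lookup against that kind of library might look like. Everything in it is an assumption on my part: the `REGISTRY` layout, the `FieldRule` and `lookup` names, and especially the selectors, which are invented for illustration and are not Airbnb's real markup.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import Optional
from urllib.parse import urlparse


@dataclass(frozen=True)
class FieldRule:
    """How to pull one field out of a page."""
    selector: str       # a CSS selector or an XPath expression
    kind: str = "css"   # "css" or "xpath"


# Hypothetical community-maintained entries, keyed by (host, path pattern).
# The selectors below are made up for illustration only.
REGISTRY = {
    ("airbnb.com", "/rooms/*"): {
        "price": FieldRule("span[data-example='price']"),
        "title": FieldRule("//h1", kind="xpath"),
    },
}


def lookup(url: str, field: str) -> Optional[FieldRule]:
    """Return the rule for `field` on the page at `url`, or None if unknown."""
    parts = urlparse(url)
    host = parts.hostname or ""
    if host.startswith("www."):
        host = host[4:]  # treat www.example.com and example.com alike
    for (entry_host, path_pattern), schema in REGISTRY.items():
        if host == entry_host and fnmatchcase(parts.path, path_pattern):
            return schema.get(field)
    return None


rule = lookup("https://www.airbnb.com/rooms/12345", "price")
print(rule)
```

The interesting design question is the key: matching on host plus a path pattern lets one entry cover every listing page, while a different page type on the same site can carry its own schema.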

Of course, this wouldn’t resolve the harder scraping problems, but it might be “good enough” for many general hobby/prosumer use cases.

Does such a thing exist?

If you’re working on something like this, get in touch!


This blog is written by Tom Critchlow, an independent strategy consultant living and working in Brooklyn, NY. If you like what you read, please leave a comment below or sign up for my newsletter.