Notes on an Open Scraping Database
Exposing the web's data texture
I’m no developer, but I’ve been using spreadsheets to query, extract and fiddle with the web for over a decade now.
And today I’m knee-deep in an R&D project exploring the link between the web and spreadsheets. It’s basically my dream client gig.
Along the way, I found this wonderful project from Geoffrey Litt - Wildcard:
Wildcard is a browser extension that empowers anyone to modify websites to meet their own specific needs, using a familiar spreadsheet view.
Check out the demo video.
This is kind of an “outside-in” use of spreadsheets - overlaying a familiar spreadsheet UX on top of a website.
I’m working on a kind of “inside-out” use of spreadsheets - extracting data from websites into a spreadsheet environment.
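To make the “inside-out” direction concrete, here’s a minimal sketch in Python: fetch a page, pull a couple of fields out, and write them to a CSV that any spreadsheet can open. The URL and selectors are placeholders I made up, not any real site’s markup:

```python
import csv

import requests
from bs4 import BeautifulSoup

URL = "https://example.com/listings"  # hypothetical listings page

response = requests.get(URL, timeout=10)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")

rows = []
for listing in soup.select("div.listing"):  # assumed container markup
    title = listing.select_one("h2.title")
    price = listing.select_one("span.price")
    rows.append([
        title.get_text(strip=True) if title else "",
        price.get_text(strip=True) if price else "",
    ])

# A CSV opens directly in Excel or Google Sheets.
with open("listings.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["title", "price"])
    writer.writerows(rows)
```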
Both of these projects rely on parsing and extracting some kind of structured data from arbitrary websites. This… is a very complex task.
Some gotchas (roughly ordered from simple to hard):
- Websites change their code all the time, so extracting data consistently requires ongoing maintenance.
- There’s no single standard (that I know of) for writing a scraper: some use CSS selectors, some use XPath, some use regex, etc. (see the sketch after this list).
- Some websites are actively hostile to scrapers and obfuscate their code.
- JavaScript-heavy websites load data in various ways, often forcing you to render the full JavaScript page to get what you want.
- Edge cases, like sites that rate-limit scrapers, return 403 status codes, etc.
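To make the “no single standard” point concrete, here’s the same price field extracted three different ways from a toy HTML snippet. This is just an illustrative sketch in Python; the snippet and selectors are invented:

```python
import re

from bs4 import BeautifulSoup  # CSS-selector style
from lxml import html          # XPath style

snippet = '<div class="listing"><span class="price">$120</span></div>'

# 1. CSS selector (BeautifulSoup)
css_price = BeautifulSoup(snippet, "html.parser").select_one("span.price").get_text()

# 2. XPath (lxml)
xpath_price = html.fromstring(snippet).xpath("//span[@class='price']/text()")[0]

# 3. Regex (brittle, but common in the wild)
regex_price = re.search(r'class="price">([^<]+)<', snippet).group(1)

assert css_price == xpath_price == regex_price == "$120"
```

Three tools, three query languages, one tiny extraction - and none of them interchangeable.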
What if… there were an open database for scraping? I’m imagining something like a “wikiscrape”, where a community builds and maintains a library of URL + schema pairs for scraping.
So, for example, I could look up a website like Airbnb and quickly grab the CSS selector or XPath for extracting the price field from a listing.
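To sketch what a “wikiscrape” lookup could feel like, here’s a toy version in Python. Everything below is invented for illustration - the schema shape, the field names, the selectors - and is not Airbnb’s real markup:

```python
from bs4 import BeautifulSoup

# Hypothetical community-maintained entries: domain -> field -> CSS selector.
WIKISCRAPE = {
    "airbnb.com": {
        "price": "span._price",  # made-up selector
        "title": "h1._title",    # made-up selector
    },
}

def extract(domain: str, field: str, page_html: str) -> str | None:
    """Look up the community's selector for a field and apply it to a page."""
    selector = WIKISCRAPE.get(domain, {}).get(field)
    if selector is None:
        return None
    node = BeautifulSoup(page_html, "html.parser").select_one(selector)
    return node.get_text(strip=True) if node else None

# e.g. extract("airbnb.com", "price", page_html) might return "$120 / night"
```

The appeal is that the community maintains the selectors, and my code only needs to know a domain and a field name.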
Of course this wouldn’t resolve the harder scraping problems, but it might be “good enough” for many general hobby/prosumer use cases.
Does such a thing exist?
If you’re working on something like this, get in touch!