Enrich your audit with custom content extraction

Scrape and search specific content while you crawl.

Paul Shapiro

Search Wilderness

Sitebulb is one sweet crawler. It has all the data I could possibly want, and it lays it out in such a fantastic manner - extremely conducive for site audits and finding critical issues very quickly. Plus, it's fast, low on memory usage, and feature-rich to boot. Thanks for making my life easier Sitebulb!

Custom data in your website audit

Content extraction allows you to customise the output you get from Sitebulb's crawler, and include specific datapoints that you need to help you make better decisions. You can use it to help determine popular content by scraping comment counts, assign author names to URLs from a blog, or count the number of products in each category on an ecommerce store.

Content search enables you to identify pages which contain specific words or phrases, which is a crucial feature for rebranding/migration exercises, and extremely useful for topic modelling and finding internal linking opportunities.

Sitebulb's extraction setup is a cinch for beginners to get the hang of, whilst also offering extremely advanced configuration options for more experienced users.

Taking the pain out of content extraction

If you're familiar with custom extraction on other crawler tools, you'll know how frustrating it can be to set it up correctly. Sitebulb is designed to remove these pain points, with an intuitive system that does the heavy lifting for you:

  1. Load a test URL into the visual selector window (also works with JavaScript frameworks).
  2. Point and click the element you want to scrape.
  3. Give the datapoint a meaningful name.
  4. Check the 'Test' tab to ensure it is working as you expect.

This means it works on any website you throw at it. It means you don’t need a degree in advanced Regex to figure out what selector to choose. And it means you don’t need to crawl the website 37 times in order to test your selectors.

Point-and-click interface

See a piece of data you'd like to extract, then just point-and-click, and Sitebulb will choose the correct CSS selector for you.

On other crawler tools, finding the correct CSS selector can be a bit of a mission, involving digging around in Google Chrome DevTools and hoping you have got the right one.

Test without crawling

The handy 'Test' tab means you can check the result of your extraction setup without running a crawl, which saves a significant amount of time.

The test feature allows you to tweak different extraction options to ensure you have picked the right selector and data type, or load in different URLs to ensure that the extraction works across different pages.

Advanced extraction options

Extraction is not restricted to simple scraping: you can also perform operations such as counting elements or checking whether an element exists. You can also apply URL pattern matching to each extraction rule, so that a rule only runs on the pages it is relevant to and computer resources are not wasted.
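
To make the idea of count and exists operations scoped by a URL pattern more concrete, here is a minimal Python sketch. It is not how Sitebulb works internally; the `extract` function, the `ClassCounter` helper, the sample HTML and the example.com URLs are all hypothetical, and a simple class-name match stands in for a full CSS selector.

```python
from fnmatch import fnmatch
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Counts elements whose class attribute contains a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # A valueless class attribute arrives as None, hence the `or ""`.
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def extract(url, html, url_pattern, class_name):
    """Return count/exists for class_name, or None if the URL is out of scope."""
    if not fnmatch(url, url_pattern):
        return None  # rule does not apply, so the page is never parsed
    counter = ClassCounter(class_name)
    counter.feed(html)
    return {"count": counter.count, "exists": counter.count > 0}


# Hypothetical category page markup.
SAMPLE_HTML = """
<ul>
  <li class="product">Tent</li>
  <li class="product featured">Stove</li>
  <li class="banner">Sale</li>
</ul>
"""

PATTERN = "https://example.com/category/*"
category = extract("https://example.com/category/camping", SAMPLE_HTML, PATTERN, "product")
blog = extract("https://example.com/blog/post", SAMPLE_HTML, PATTERN, "product")

print(category)  # {'count': 2, 'exists': True}
print(blog)      # None
```

The URL check comes first on purpose: pages outside the pattern are skipped before any parsing happens, which is the resource saving described above.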

Scraping data with regex is typically an advanced use-case, and Sitebulb includes advanced functions to suit. You can specify a CSS selector in which to perform the regex pattern matching - rather than simply across the entire HTML document - and additionally pull from specific regex groups.
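
The difference between matching across the whole document and matching inside a chosen element is easiest to see with a small example. The Python sketch below is an illustration only, not Sitebulb's implementation: `ScopedText` and `regex_in_scope` are hypothetical helpers, the markup is made up, and a class-name lookup stands in for a real CSS selector.

```python
import re
from html.parser import HTMLParser


class ScopedText(HTMLParser):
    """Collects the text inside the first element carrying a given class.

    Simplification: assumes no unclosed void tags (such as <br>) inside
    the target element.
    """

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.depth = 0      # greater than 0 while inside the target element
        self.done = False   # True once the target element has closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth:
            self.depth += 1
        elif self.class_name in (dict(attrs).get("class") or "").split():
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


def regex_in_scope(html, class_name, pattern, group=1):
    """Run `pattern` only on the scoped element's text; return one group."""
    parser = ScopedText(class_name)
    parser.feed(html)
    match = re.search(pattern, "".join(parser.parts))
    return match.group(group) if match else None


# Hypothetical product page with two prices in the markup.
SAMPLE_HTML = """
<div class="related">Was 99.00 USD</div>
<div class="price">Now 79.50 USD</div>
"""
PRICE = r"(\d+\.\d+)\s+(\w+)"

whole_document = re.search(PRICE, SAMPLE_HTML).group(1)
amount = regex_in_scope(SAMPLE_HTML, "price", PRICE, group=1)
currency = regex_in_scope(SAMPLE_HTML, "price", PRICE, group=2)

print(whole_document)  # 99.00 (first match anywhere in the HTML)
print(amount)          # 79.50 (match restricted to the price element)
print(currency)        # USD
```

Scoping the pattern avoids picking up the first price-like string on the page, and choosing a group lets a single pattern return either the amount or the currency.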

Flexible content search

Content search is a simple concept: provide a word or phrase for the crawler to check for in the HTML of every page. The setup in Sitebulb does not overcomplicate this basic process.

As always, however, Sitebulb offers that little bit more for those that want it, allowing you to combine words and phrases and even set exclusion patterns.
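
As a rough illustration of combining phrases with exclusion patterns, the Python sketch below finds pages that still mention an old brand name but not the new one, as you might during a rebrand. The `page_matches` function, its `all_of` / `any_of` / `none_of` parameters and the sample pages are assumptions made for this example; they do not describe Sitebulb's actual settings.

```python
def page_matches(text, all_of=(), any_of=(), none_of=()):
    """Case-insensitive search combining required, optional and excluded terms."""
    haystack = text.lower()

    def has(term):
        return term.lower() in haystack

    return (
        all(has(term) for term in all_of)
        and (not any_of or any(has(term) for term in any_of))
        and not any(has(term) for term in none_of)
    )


# Hypothetical page text keyed by URL path.
PAGES = {
    "/about": "Welcome to Acme Widgets, the home of quality widgets.",
    "/press": "Acme Widgets is now Apex Tools.",
    "/contact": "Contact Apex Tools today.",
}

# Pages that mention the old brand and never mention the new one.
stale = [
    url
    for url, text in PAGES.items()
    if page_matches(text, all_of=["Acme Widgets"], none_of=["Apex Tools"])
]
print(stale)  # ['/about']
```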

Offer more than just SEO

Make you and your team indispensable to your clients by providing meaningful business data to satisfy numerous different departments.

Explore some of the things you can do with content extraction and search:

Identify 'out of stock' products

Make the merchandising team love you by providing a regular list of out-of-stock products.

Scrape competitor data

Collect vital competitor data like prices, product codes and stock availability.

Identify topic clusters

Search for multiple keywords at once to find content overlap and cannibalisation, or opportunities for internal linking.
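
Once you have search results for several keywords, spotting overlap is a matter of grouping the matches by URL. The Python sketch below shows one way to do that, assuming you already have each page's text to hand; `keyword_overlap`, the keywords and the sample pages are hypothetical.

```python
from collections import defaultdict


def keyword_overlap(pages, keywords):
    """Map each URL to the keywords found in its text (case-insensitive)."""
    found = defaultdict(list)
    for url, text in pages.items():
        lowered = text.lower()
        for keyword in keywords:
            if keyword.lower() in lowered:
                found[url].append(keyword)
    return dict(found)


# Hypothetical page text keyed by URL path.
PAGES = {
    "/guides/tents": "How to choose a camping tent and a sleeping bag.",
    "/guides/sleeping-bags": "Sleeping bag ratings explained.",
    "/guides/stoves": "Camping stove fuel types.",
}
KEYWORDS = ["camping tent", "sleeping bag"]

overlap = keyword_overlap(PAGES, KEYWORDS)
# Pages matching more than one keyword are candidates for overlap or linking.
shared = sorted(url for url, hits in overlap.items() if len(hits) > 1)
print(shared)  # ['/guides/tents']
```

Pages that match two or more keywords are worth a closer look: they may be competing with a dedicated page on the same topic, or they may be a natural place to add an internal link to it.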

Free 14-day trial.
Full, unrestricted access.
No credit card required.

Try Sitebulb for Free