Web Scraping & Data Extraction Using The SEO Spider Tool
This tutorial walks you through how to use the Screaming Frog SEO Spider’s custom extraction feature to scrape data from websites.
The custom extraction feature allows you to scrape any data from the HTML of a web page using CSSPath, XPath and regex. The extraction is performed on the static HTML returned from URLs crawled by the SEO Spider, which return a 200 ‘OK’ response. You can switch to JavaScript rendering mode to extract data from the rendered HTML.
To get started, you’ll need to download and install the SEO Spider software and have a licence to access the custom extraction feature necessary for scraping.
When you have the SEO Spider open, the next steps to start extracting data are as follows –
1) Click ‘Configuration > Custom > Extraction’
This menu can be found in the top level menu of the SEO Spider.

This will open up the custom extraction configuration which allows you to configure up to 100 separate ‘extractors’.

2) Select CSS Path, XPath or Regex for Scraping
The Screaming Frog SEO Spider tool provides three methods for scraping data from websites:
- XPath – XPath is a query language for selecting nodes from an XML like document, such as HTML. This option allows you to scrape data by using XPath selectors, including attributes.
- CSS Path – In CSS, selectors are patterns used to select elements and are often the quickest out of the three methods available. This option allows you to scrape data by using CSS Path selectors. An optional attribute field is also available.
- Regex – A regular expression is a special string of text used for matching patterns in data. This is best for advanced uses, such as scraping HTML comments or inline JavaScript.
CSS Path or XPath are recommended for most common scenarios, and although both have their advantages, you can simply pick the option which you’re most comfortable using.
When using XPath or CSS Path to collect HTML, you can choose exactly what to extract using the drop down filters –
- Extract HTML Element – The selected element and all of its inner HTML content.
- Extract Inner HTML – The inner HTML content of the selected element. If the selected element contains other HTML elements, they will be included.
- Extract Text – The text content of the selected element and the text content of any sub elements.
- Function Value – The result of the supplied function, eg count(//h1) to find the number of h1 tags on a page.
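To make the difference between these four options concrete, here is a minimal Python sketch using only the standard library. It is an illustration of what each option returns, not how the SEO Spider works internally, and the sample markup is hypothetical. Python's ElementTree needs well-formed XML, so the sample page is deliberately tidy.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page (ElementTree requires valid XML).
page = ET.fromstring(
    "<html><body><h1>Web <em>Scraping</em> Guide</h1></body></html>"
)
h1 = page.find(".//h1")

# Extract HTML Element: the element itself plus everything inside it.
html_element = ET.tostring(h1, encoding="unicode")

# Extract Inner HTML: the content only, with child tags included.
inner_html = (h1.text or "") + "".join(
    ET.tostring(child, encoding="unicode") for child in h1
)

# Extract Text: text of the element and of any sub elements.
text = "".join(h1.itertext())

# Function Value: e.g. count(//h1), emulated here with len().
h1_count = len(page.findall(".//h1"))

print(html_element)
print(inner_html)
print(text)
print(h1_count)
```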
3) Input Your Syntax
Next up, you’ll need to input your syntax into the relevant extractor fields. A quick and easy way to find the relevant CSS Path or XPath of the data you wish to scrape is to open up the web page in Chrome, ‘inspect element’ on the HTML line you wish to collect, then right click and copy the relevant selector path provided.
For example, you may wish to start scraping the ‘authors’ of blog posts, and the number of comments each has received. Let’s take the Screaming Frog website as the example.
Open up any blog post in Chrome, right click and ‘inspect element’ on the author’s name, which is located on every post. This will open up the ‘elements’ HTML window. Simply right click again on the relevant HTML line (with the author’s name), copy the relevant CSS Path or XPath and paste it into the respective extractor field in the SEO Spider. If you use Firefox, then you can do the same there too.

You can rename the ‘extractors’, which correspond to the column names in the SEO Spider. In this example, I’ve used CSS Path.

The ticks next to each extractor confirm the syntax used is valid. A red cross next to an extractor means its syntax is invalid and needs adjusting.
When you’re happy, simply press the ‘OK’ button at the bottom. If you’d like to see more examples, then skip to the bottom of this guide.
Please note: This is not the most robust method for building CSS Selectors and XPath expressions. The expressions it produces can be very specific to the exact position of the element in the code, and that position can change for two reasons. First, the inspected view is the rendered version of the page (the DOM), whereas by default the SEO Spider looks at the HTML source. Second, the SEO Spider cleans up invalid mark-up when it processes a page.
The expressions can also differ between browsers, e.g. for the above ‘author’ example the following CSS Selectors are given:
Chrome: body > div.main-blog.clearfix > div > div.main-blog--posts > div.main-blog--posts_single--inside_author.clearfix.drop > div.main-blog--posts_single--inside_author-details.col-13-16 > div.author-details--social > a
Firefox: .author-details--social > a:nth-child(1)
The expressions given by Firefox are generally more robust than those provided by Chrome. Even so, this should not be used as a complete replacement for understanding the various extraction options and being able to build these manually by examining the HTML source.
The w3schools guide on CSS Selectors and their XPath introduction are good resources for understanding the basics of these expressions.
4) Crawl The Website
Next, input the website address into the URL field at the top and click ‘start’ to crawl the website, and commence scraping.

5) View Scraped Data Under The Custom Extraction Tab
Scraped data starts appearing in real time during the crawl, under the ‘Custom Extraction’ tab, as well as the ‘internal’ tab allowing you to export everything collected all together into Excel.
In the example outlined above, we can see the author names and number of comments next to each blog post, which have been scraped.

When the progress bar reaches ‘100%’, the crawl has finished and you can choose to ‘export’ the data using the ‘export’ buttons.
If you already have a list of URLs you wish to extract data from, rather than crawl a website to collect the data, then you can upload them using list mode.
That’s it! Hopefully the above guide helps illustrate how to use the SEO Spider software for web scraping.
The possibilities are endless. This feature can be used to collect anything from plain text to Google Analytics IDs, schema, social meta tags (such as Open Graph Tags & Twitter Cards), mobile annotations and hreflang values, as well as product prices, discount rates, stock availability etc. I’ve covered some more examples below, which are split by the method of extraction.
XPath Examples
SEOs love XPath, so I have put together a very quick list of elements you may wish to extract using XPath. The SEO Spider uses the XPath implementation from Java 11, which supports XPath version 1.0.
Jump to a specific XPath extraction example:
Headings
Hreflang
Structured Data
Social Meta Tags (Open Graph Tags & Twitter Cards)
Mobile Annotations
Email Addresses
iframes
AMP URLs
Meta News Keywords
Meta Viewport Tag
Extract Links In The Body Only
Extract Links Containing Anchor Text
Extract Links to a Specific Domain
Extract Content From Specific Divs
Extract Multiple Matched Elements
Headings
By default, the SEO Spider only collects h1s and h2s, but if you’d like to collect h3s, the XPath is –
//h3
The data extracted –

However, you may wish to collect just the first h3, particularly if there are many per page. The XPath is –
/descendant::h3[1]
To collect the first 10 h3’s on a page, the XPath would be –
/descendant::h3[position() >= 0 and position() <= 10]
To count the number of h3 tags on a page the expression needed is –
count(//h3)
In this case ‘Extract Inner HTML’ in the far right dropdown of the Custom Extraction Window must be changed to ‘Function Value’ for this expression to work correctly.
The length of an extracted string can also be calculated with XPath using the ‘Function Value’ option. In XPath 1.0, string-length() applied to several matching nodes measures only the first one, so to calculate the length of the first h3 on the page the following expression is needed –
string-length(//h3)
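As a rough cross-check of what each of these heading expressions returns, the sketch below emulates them in Python. Python's standard-library ElementTree supports only a small subset of XPath (no descendant axis and no functions such as count() or string-length()), so list slicing and len() stand in for them here. The sample page is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page with three h3 headings at different depths.
page = ET.fromstring(
    "<html><body>"
    "<h3>Pricing</h3><div><h3>Features</h3></div><h3>FAQ</h3>"
    "</body></html>"
)

# //h3 : every h3, in document order.
all_h3 = [h.text for h in page.iter("h3")]

# /descendant::h3[1] : only the first h3 on the page.
first_h3 = all_h3[0]

# First 10 h3s on the page (fewer if the page has fewer).
first_ten = all_h3[:10]

# count(//h3) : number of h3 tags.
h3_count = len(all_h3)

# string-length(//h3) : XPath 1.0 measures the first match only.
first_h3_length = len(first_h3)

print(all_h3, first_h3, h3_count, first_h3_length)
```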
Hreflang
The following XPath, combined with Extract HTML Element, will collect the contents of all hreflang elements –
//*[@hreflang]
The above will collect the entire HTML element, with the link and hreflang value. The results –

If you want just the hreflang values (like ‘en-GB’), you can specify the attribute using @hreflang.
//*[@hreflang]/@hreflang
The data extracted –

Hreflang analysis functionality is now built into the SEO Spider as standard. For more details, please see Hreflang Extraction and Hreflang Tab.
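The difference between selecting the whole element and selecting only its attribute can be sketched in Python as follows. This is an illustration only, with hypothetical URLs. ElementTree does understand the [@hreflang] predicate, but it cannot select attributes with /@hreflang, so the values are read with .get() instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical <head> with two hreflang annotations and one unrelated link.
head = ET.fromstring(
    "<head>"
    '<link rel="alternate" hreflang="en-GB" href="https://example.com/uk/"/>'
    '<link rel="alternate" hreflang="de-DE" href="https://example.com/de/"/>'
    '<link rel="stylesheet" href="/style.css"/>'
    "</head>"
)

# //*[@hreflang] : every element that carries an hreflang attribute.
elements = head.findall(".//*[@hreflang]")

# //*[@hreflang]/@hreflang : just the attribute values.
hreflang_values = [el.get("hreflang") for el in elements]

print(len(elements), hreflang_values)
```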
Structured Data
You may wish to collect the types of various Schema on a page, so the set-up might be –
//*[@itemtype]/@itemtype
The data extracted –

For ‘itemprop’ rules, you can use a similar XPath –
//*[@itemprop]/@itemprop
Don’t forget, the SEO Spider can extract and validate structured data without requiring custom extraction.
Social Meta Tags (Open Graph Tags & Twitter Cards)
You may wish to extract social meta tags, such as Facebook Open Graph tags, account details, or Twitter Cards. Example XPath expressions are:
//meta[starts-with(@property, 'og:title')]/@content
//meta[starts-with(@property, 'og:description')]/@content
//meta[starts-with(@property, 'og:type')]/@content
//meta[starts-with(@property, 'og:site_name')]/@content
//meta[starts-with(@property, 'og:image')]/@content
//meta[starts-with(@property, 'og:url')]/@content
//meta[starts-with(@property, 'fb:page_id')]/@content
//meta[starts-with(@property, 'fb:admins')]/@content
//meta[starts-with(@property, 'twitter:title')]/@content
//meta[starts-with(@property, 'twitter:description')]/@content
//meta[starts-with(@property, 'twitter:account_id')]/@content
//meta[starts-with(@property, 'twitter:card')]/@content
//meta[starts-with(@property, 'twitter:image:src')]/@content
//meta[starts-with(@property, 'twitter:creator')]/@content
etc.
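The starts-with() pattern used in these expressions can be sketched in Python like this. The helper name meta_content and the sample tags are hypothetical, and str.startswith() stands in for the XPath function, which ElementTree does not support.

```python
import xml.etree.ElementTree as ET

# Hypothetical <head> with two Open Graph tags and one unrelated meta tag.
head = ET.fromstring(
    "<head>"
    '<meta property="og:title" content="Web Scraping Guide"/>'
    '<meta property="og:type" content="article"/>'
    '<meta name="viewport" content="width=device-width"/>'
    "</head>"
)


def meta_content(root, prefix):
    """Return @content of every <meta> whose @property starts with prefix."""
    return [
        m.get("content")
        for m in root.iter("meta")
        if (m.get("property") or "").startswith(prefix)
    ]


og_title = meta_content(head, "og:title")  # one specific tag
og_all = meta_content(head, "og:")         # every Open Graph tag

print(og_title, og_all)
```

Because the match is a prefix match, a shorter prefix such as 'og:' collects every Open Graph tag in one extractor, while the full property name targets a single tag.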
The data extracted –

Mobile Annotations
If you wanted to pull mobile annotations from a website, you might use an XPath such as –
//link[contains(@media, '640') and @href]/@href
Which for the Huffington Post would extract –

Email Addresses
Perhaps you want to collect email addresses from your website or websites. The XPath might be something like:
//a[starts-with(@href, 'mailto')]
From our website, this would return the two email addresses we have in the footer on every page –

iframes
//iframe/@src
The data extracted –

To extract only iframes where a YouTube video is embedded, the XPath would be:
//iframe[contains(@src ,'www.youtube.com/embed/')]
To extract iframes while excluding a particular iframe URL, such as Google Tag Manager URLs, the XPath would be:
//iframe[not(contains(@src, 'https://www.googletagmanager.com/'))]/@src
To extract just the URL of the first iframe found on a page, the XPath would be:
(//iframe/@src)[1]
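The filtering logic behind these iframe expressions can be sketched in Python as below. The sample URLs are hypothetical, and because ElementTree has no contains() or not(), plain Python string tests stand in for them. For simplicity the sketch works on src values throughout, whereas the YouTube XPath above returns the iframe elements themselves.

```python
import xml.etree.ElementTree as ET

# Hypothetical page body with three iframes.
body = ET.fromstring(
    "<body>"
    '<iframe src="https://www.googletagmanager.com/ns.html?id=GTM-XXXX"/>'
    '<iframe src="https://www.youtube.com/embed/abc123"/>'
    '<iframe src="https://example.com/widget"/>'
    "</body>"
)

# //iframe/@src : the src of every iframe that has one.
sources = [f.get("src") for f in body.iter("iframe") if f.get("src")]

# contains(@src, 'www.youtube.com/embed/')
youtube = [s for s in sources if "www.youtube.com/embed/" in s]

# not(contains(@src, 'https://www.googletagmanager.com/'))
not_gtm = [s for s in sources if "https://www.googletagmanager.com/" not in s]

# (//iframe/@src)[1] : the first iframe URL on the page, if any.
first = sources[0] if sources else None

print(youtube, not_gtm, first)
```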
AMP URLs
//head/link[@rel='amphtml']/@href
The data extracted –

Meta News Keywords
//meta[@name='news_keywords']/@content
The data extracted –

Meta Viewport Tag
//meta[@name='viewport']/@content
The data extracted –

Extract Links In The Body Only
The following XPath will only extract links from the body of a blog post on https://www.screamingfrog.co.uk/annual-screaming-frog-macmillan-morning-bake-off/, where the blog content is contained within the class ‘main-blog--posts_single--inside’.
This will get the anchor text with ‘Extract Inner HTML’:
//div[@class="main-blog--posts_single--inside"]//a
This will get the URL with ‘Extract Inner HTML’:
//div[@class="main-blog--posts_single--inside"]//a/@href
This will get the full link code with ‘Extract HTML Element’:
//div[@class="main-blog--posts_single--inside"]//a
Extract Links Containing Anchor Text
To extract all links with ‘SEO Spider’ in the anchor text:
//a[contains(.,'SEO Spider')]/@href
This matching is case sensitive, so if ‘SEO Spider’ is sometimes ‘seo spider’, you’ll have to do the following:
//a[contains(translate(., 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'),'seo spider')]/@href
Which will lower case all found anchor text, allowing you to compare it against a lower case ‘seo spider’.
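To see why the translate() trick is needed, here is a small Python sketch of the same comparison. The function name links_with_text and the sample links are hypothetical. One difference worth noting: translate() in the XPath above only lower-cases A to Z, whereas Python's str.lower() also handles accented and other non-ASCII letters.

```python
import xml.etree.ElementTree as ET

# Hypothetical page body with three links.
body = ET.fromstring(
    "<body>"
    '<a href="/seo-spider/">SEO Spider</a>'
    '<a href="/guide/">the <b>seo spider</b> guide</a>'
    '<a href="/log-file-analyser/">Log File Analyser</a>'
    "</body>"
)


def links_with_text(root, needle, ignore_case=False):
    """Return the href of every <a> whose anchor text contains needle."""
    hrefs = []
    for a in root.iter("a"):
        # Like '.' in the XPath: text of the link and of any child elements.
        text = "".join(a.itertext())
        if ignore_case:
            match = needle.lower() in text.lower()
        else:
            match = needle in text
        if match and a.get("href") is not None:
            hrefs.append(a.get("href"))
    return hrefs


case_sensitive = links_with_text(body, "SEO Spider")
case_insensitive = links_with_text(body, "seo spider", ignore_case=True)

print(case_sensitive)
print(case_insensitive)
```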
Extract Links to a Specific Domain
To extract all the links from a page referencing ‘screamingfrog.co.uk’ you can use:
//a[contains(@href,'screamingfrog.co.uk')]
Using the ‘Extract HTML Element’ or ‘Extract Text’ will allow you to extract with the full link code or just the anchor text respectively.
If you only want to extract the linked URL you can use:
//a[contains(@href,'screamingfrog.co.uk')]/@href
Extract Content From Specific Divs
The following XPath will extract content from specific divs or spans, using their class name. You’ll need to replace ‘example’ with your own class name.
//div[@class="example"]
//span[@class="example"]
Extract Multiple Matched Elements
A pipe can be used between expressions in a single extractor to keep related elements next to each other in an export.
The following expression matches blog titles and the number of comments they have on blog archive pages:
//div[contains(@class ,'main-blog--posts_single-inner--text--inner')]//h3|//a[@class="comments-link"]

Regex Examples
Jump to a specific Regex extraction example:
Google Analytics ID
Structured Data
Email Addresses
Google Analytics and Tag Manager IDs
To extract the Google Analytics ID from a page the expression needed would be –
["'](UA-.*?)["']
For Google Tag Manager (GTM) it would be –
["'](GTM-.*?)["']
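If you’d like to test these two expressions outside the SEO Spider, a quick Python sketch might look like the following. The tracking IDs and script snippets are made up for illustration. Python's re module and Java's regex engine behave the same way for patterns this simple, but that equivalence is an assumption for anything more complex.

```python
import re

# Hypothetical inline scripts containing a GA ID and a GTM container ID.
html = """<script>ga('create', 'UA-1234567-1', 'auto');</script>
<script>(function(w,d,s,l,i){})(window,document,'script','dataLayer','GTM-ABC123');</script>"""

# A quote, then the ID (captured), then the closing quote.
ga_ids = re.findall(r"""["'](UA-.*?)["']""", html)
gtm_ids = re.findall(r"""["'](GTM-.*?)["']""", html)

print(ga_ids)
print(gtm_ids)
```

Only the part in brackets (the capture group) is returned, which is why the surrounding quotes do not appear in the output.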

The data extracted is –

Structured Data
If the structured data is implemented in the JSON-LD format, regular expressions rather than XPath or CSS Selectors must be used:
"product": "(.*?)"
"ratingValue": "(.*?)"
"reviewCount": "(.*?)"
To extract everything in the JSON-LD script tag, you could use –
<script type=\"application\/ld\+json\">(.*?)</script>
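As a rough way to try these JSON-LD expressions, the Python sketch below runs them against a made-up product snippet. The re.DOTALL flag is my own addition so that the sketch copes with JSON spread over several lines; it is not a statement about how the SEO Spider is configured. Parsing the captured block with json.loads is likewise just one option for working with the result afterwards.

```python
import json
import re

# Hypothetical JSON-LD block for a product.
html = (
    '<script type="application/ld+json">'
    '{"@type": "Product", "name": "Widget", '
    '"aggregateRating": {"ratingValue": "4.5", "reviewCount": "12"}}'
    "</script>"
)

# Individual values: the capture group is the text between the quotes.
rating = re.findall(r'"ratingValue": "(.*?)"', html)
reviews = re.findall(r'"reviewCount": "(.*?)"', html)

# Everything inside the JSON-LD script tag.
block = re.search(
    r'<script type=\"application\/ld\+json\">(.*?)</script>', html, re.DOTALL
)
data = json.loads(block.group(1))

print(rating, reviews, data["name"])
```

Note that these patterns expect the value to be a quoted string with a single space after the colon, so a site that outputs numbers without quotes or uses different spacing would need an adjusted expression.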
Email Addresses
The following will return any alphanumeric string that contains an @ in the middle:
[a-zA-Z0-9-_.]+@[a-zA-Z0-9-.]+
The following expression will bring back fewer false positives, as it requires at least a single period in the second half of the string:
[a-zA-Z0-9-_.]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+
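The difference between the two email expressions is easiest to see side by side. The Python sketch below runs both against a made-up sentence that contains one real address and one string that merely has an @ in it.

```python
import re

# Hypothetical text: one genuine email address and one false positive.
text = "Email support@example.com for help. Build v2@rc1 shipped today."

# Any run of allowed characters with an @ in the middle.
loose = re.findall(r"[a-zA-Z0-9-_.]+@[a-zA-Z0-9-.]+", text)

# Same, but the part after the @ must contain at least one period.
strict = re.findall(r"[a-zA-Z0-9-_.]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+", text)

print(loose)
print(strict)
```

The looser expression picks up ‘v2@rc1’ as well as the real address, while the stricter one returns only ‘support@example.com’.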
That’s it for now, but I’ll add to this list over time with more examples, for each method of extraction.
As always, you can pop us through any questions or queries to our support.