Search Engine Optimization (SEO) is really an umbrella term which encompasses a wide range of activities including front end design and coding (UX and performance), backend optimization (crawl optimization and performance), content strategy and optimization, digital PR and promotion.
You can’t be an expert in everything, but one thing I think is important for all SEOs and digital marketing managers to understand is how Google actually crawls and indexes the web. This is important because it informs how we build and structure websites, and how we can make Google’s life easier which is ultimately good for us…
To ensure fast and comprehensive crawling and indexing of your site you should try to incorporate main navigation and content elements in plain HTML as much as possible. It is also important to think about site structure and internal crawl paths so that it is easy for Google to find all your pages. XML sitemaps help but do not guarantee indexation or rankings.
How Google Crawls your website
You’ve almost certainly heard of Googlebot before and probably know that it regularly visits your website to index your content. But this is really an oversimplification of what is actually happening. The following chart outlines a more accurate view of this process.
There are really two phases to the crawl and indexation process…
In the first phase what we commonly know as Googlebot visits your site for the first time. Usually starting on the homepage, Googlebot will parse the static HTML, the same that is served to your web browser. You can see what Google sees at this stage by right clicking on your home page and choosing “View Source”.
So any content that is visible in the static HTML may be indexed, and any links to other pages that Googlebot can see in the static HTML will be added to a queue for Googlebot to crawl and index.
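To make the first phase concrete, here is a minimal sketch of pulling links out of raw HTML without running any JavaScript. It is only an illustration of the idea, not how Googlebot is implemented; the sample markup, the example.com URL and the names LinkCollector and links_in_static_html are all made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags found in raw HTML."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))


def links_in_static_html(html, base_url):
    """Return the links visible in the HTML source, with no rendering."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical page source: one plain HTML link, one link built by script.
SAMPLE_HTML = """
<nav><a href="/products">Products</a></nav>
<div id="menu"></div>
<script>
  var a = document.createElement("a");
  a.href = "/about";
  document.getElementById("menu").appendChild(a);
</script>
"""

crawl_queue = links_in_static_html(SAMPLE_HTML, "https://www.example.com/")
print(crawl_queue)  # ['https://www.example.com/products']
```

The /about link only exists once the script has run, so a parser that reads the static HTML never sees it.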
If your entire website is served as static HTML then Googlebot will quickly be able to find and index all your content, but these days it is rarely as simple as that for a couple of main reasons:
- And even if your site is served as static HTML it almost certainly uses HTML + CSS for styling and Google is actually very interested in how your website is laid out because this also gives them some good signals about the quality of your site.
So, for these reasons Google has a second phase of crawl and indexation which is referred to in the diagram above as the renderer…
Crawl and indexation Summary
That’s not to say that there are not ways to mitigate these challenges…
8 Things We Can Do to Improve Indexation
The easier it is for Google to find and index the content on your site the better visibility you will have in search results. This is especially true if you have a very large and complex site, like Trade Me for example. So what can we do to help Google?
1. Aim for a wide and flat structure
That statement is a little bit of an oversimplification, but it stands as a good rule of thumb. The fewer links Google needs to follow to find an inner page, the more likely it is to be crawled regularly.
Your circumstances may vary, but we always recommend trying to keep your site structure or category scheme as shallow as possible. So rather than a few big categories with many layers of sub-category, it is in general better to have more top level categories with minimal sub-categories.
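One way to put a number on "shallow" is click depth: the fewest links you need to follow from the homepage to reach a page. The sketch below works it out with a breadth-first search over a link graph. Both site structures are hypothetical and click_depths is our own helper name, not part of any SEO tool.

```python
from collections import deque


def click_depths(links, start):
    """Breadth-first search: fewest clicks from `start` to each reachable page."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


# Hypothetical deep structure: a few big categories, many sub-category layers.
deep_site = {
    "/": ["/clothing"],
    "/clothing": ["/clothing/mens"],
    "/clothing/mens": ["/clothing/mens/shirts"],
    "/clothing/mens/shirts": ["/product-123"],
}

# Hypothetical flat structure: more top level categories, no sub-categories.
flat_site = {
    "/": ["/mens-shirts", "/womens-shirts"],
    "/mens-shirts": ["/product-123"],
}

print(click_depths(deep_site, "/")["/product-123"])  # 4
print(click_depths(flat_site, "/")["/product-123"])  # 2
```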
2. Check pagination
The first rule of thumb doesn’t just apply to your category structure, it also applies to things like pagination. For example, if you have 1000 products in a category and only display 20 products per page then you’ll have 50 paginated category pages and depending on how you structure your pagination this may mean some products are buried 10 or more clicks deep in your site. In such cases you can either increase the number of products per page displayed, or consider splitting the products into multiple categories. The right approach always “depends” but the rule of thumb is good to keep in mind.
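The arithmetic is easy to check for yourself. This rough sketch assumes a simplified pagination layout where each page links to at most links_ahead later pages; that parameter and both function names are our own assumptions, and real pagination templates vary.

```python
import math


def paginated_pages(total_products, per_page):
    """Number of paginated category pages needed to list every product."""
    return math.ceil(total_products / per_page)


def clicks_to_last_page(pages, links_ahead):
    """Clicks from page 1 to the last page when each page links at most
    `links_ahead` pages forward (an assumed pagination layout)."""
    return math.ceil((pages - 1) / links_ahead)


pages = paginated_pages(1000, 20)
print(pages)                          # 50
print(clicks_to_last_page(pages, 5))  # 10
print(clicks_to_last_page(pages, 1))  # 49 (a "next" link only)

# Showing 100 products per page instead of 20:
bigger_pages = paginated_pages(1000, 100)
print(bigger_pages)                          # 10
print(clicks_to_last_page(bigger_pages, 5))  # 2
```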
3. Try to avoid duplicate content
It’s important to know that there is no ‘penalty’ for duplicate content. Duplicate content is a fact of life on the web, and from a user point of view it often goes unnoticed. For instance, an ecommerce site might offer some faceted navigation which leads to many URL variations of the same category page. Or perhaps you list all your colorways as unique products, but the descriptions and product names and titles are all the same. Google has seen it all and is generally very good at deduping search results.
The problem with duplicate content though is that Google still needs to crawl the page to know that it is duplicate…and if you have a lot of duplicate content that is a lot of wasted crawling for pages that are not likely to be indexed. If you have these kinds of issues on a large site especially, it pays to figure out how to prevent Google from crawling pages unnecessarily. It makes their life easier and will ensure a faster and more comprehensive crawl of the pages you most care about.
4. Be wary of spider traps
A ‘spider trap’ is an endless loop of URL variations that causes unnecessary crawling. We mentioned faceted navigation above and this is one of the common causes of ‘spider traps’. We’ve seen cases on ecommerce sites where categories could be filtered by an almost infinite number of facets related to size, color, style, gender etc.
The ability for consumers to filter is great, but when those filters can be applied (and crawled) in a very large number of possible combinations, and each combination results in a unique URL…well then you have trouble. Google can get caught crawling all those unique URLs which is a huge waste of their and your resources. Again, the easier we make Google’s life, the better for us.
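To see how quickly facets multiply, here is a back-of-the-envelope sketch. The facet names and option counts are invented for illustration, and it assumes every filter combination gets its own crawlable URL on a single category page.

```python
import math


def single_select_combinations(facet_sizes):
    """URL variations when each facet is either unset or set to one value."""
    return math.prod(size + 1 for size in facet_sizes.values())


def multi_select_combinations(facet_sizes):
    """URL variations when any subset of values can be ticked in each facet."""
    return 2 ** sum(facet_sizes.values())


# Hypothetical facets for one category page: facet name -> number of options.
facets = {"size": 10, "color": 12, "style": 8, "gender": 3}

print(single_select_combinations(facets))  # 5148
print(multi_select_combinations(facets))   # 8589934592
```

Both counts include the unfiltered page, and neither counts the extra variations you get if the same filters can appear in a different order in the URL.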
5. Have an XML sitemap
An XML sitemap is a must-have. It doesn’t guarantee that your pages will all be indexed, but it does help ensure Google knows about them and will crawl them. A couple of things to note about sitemaps:
- The Change Frequency and Priority fields in the XML spec are taken by Google as hints only. There is no point setting a regular change frequency for a page that isn’t changed regularly; it won’t help. In our experience Google will set their own crawl frequency based on how often they see the page changing.
- There are also special XML sitemap formats for images, videos and news (for news sites, not your company news feed) and these should be utilized as appropriate. We like image sitemaps for ecommerce sites in particular because of the large number of product images (image search for ecommerce will be a future blog post!)
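For reference, a basic sitemap is simple to generate. The sketch below builds one with Python's standard library, using the urlset, url, loc and lastmod elements from the sitemaps.org protocol. The URLs and dates are placeholders, build_sitemap is our own helper name, and we have left out the Change Frequency and Priority fields on purpose.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Build a minimal XML sitemap from (url, last_modified) pairs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


# Placeholder pages; a real sitemap would be generated from your CMS or database.
sitemap_xml = build_sitemap([
    ("https://www.example.com/", "2024-01-15"),
    ("https://www.example.com/products", "2024-01-10"),
])
print(sitemap_xml)
```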
6. Don’t block search engines from essential resources
The thing is that Google absolutely needs to see any resource that is used to present or display the content on your site. Google doesn’t only index the content on your page; they also render the page as a browser does when a real person visits the site. They do this so that they can see how the content is all laid out, and so understand what is important, how many ads there are on the page and what the user experience is like.
All those things are taken into consideration when Google ranks pages, so if they can’t access critical resources to properly render your pages then they can’t fully assess your pages and your rankings will reflect that.
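A quick way to sanity check this is to test your robots.txt rules against the CSS and JavaScript files your pages load. The sketch below uses Python's built-in urllib.robotparser as a stand-in; it does simple prefix matching and is not Google's own parser, so treat the result as a rough check only. The robots.txt content and the resource URLs are made up.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks a script directory.
ROBOTS_TXT = """\
User-agent: *
Disallow: /assets/js/
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Hypothetical resources that a product page needs in order to render.
resources = [
    "https://www.example.com/assets/css/site.css",
    "https://www.example.com/assets/js/app.js",
]

blocked = [url for url in resources if not parser.can_fetch("Googlebot", url)]
print(blocked)  # ['https://www.example.com/assets/js/app.js']
```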
7. Content that is hidden behind an interaction will not be indexed
8. Try to incorporate primary navigation in plain HTML
We’ve already covered how Google crawls and indexes so this point should be self evident. Incorporating your main navigation in plain HTML will make it faster and easier for Google to discover all the pages on your site without having to fully render all your pages.
Just having an XML sitemap doesn’t guarantee indexation
Crawling, indexing and ranking are different things. An XML sitemap will help ensure Google knows about a page and will crawl it, but it does not guarantee it is indexed or ranked.
Even if a page is included in your XML sitemap Google may not index or rank it for any number of possible reasons, for example…
- The page is a duplicate of another page (see above) so there is no point indexing it.
- Despite being included in the XML sitemap, if Google can’t find a link to the page on the website itself, then they will consider that the page isn’t very important.
- If a page is included in the sitemap, but is regularly inaccessible (404, or 5xx error) then it is unlikely it will be indexed.
- There are many other possible reasons, but the point is that XML sitemaps are just a pointer, they don’t guarantee indexation.
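Two of those reasons are easy to check for yourself if you have crawl data for your own site. This sketch assumes you already have the list of sitemap URLs, the set of URLs found through internal links, and the HTTP status each URL returned; the function name and the sample data are hypothetical.

```python
def sitemap_warnings(sitemap_urls, linked_urls, statuses):
    """Flag sitemap URLs that have no internal link or return an error status."""
    warnings = {}
    for url in sitemap_urls:
        problems = []
        if url not in linked_urls:
            problems.append("no internal link found")
        status = statuses.get(url)
        if status is not None and status >= 400:
            problems.append(f"returns HTTP {status}")
        if problems:
            warnings[url] = problems
    return warnings


# Hypothetical crawl data.
sitemap_urls = ["/", "/products", "/old-campaign", "/broken"]
linked_urls = {"/", "/products", "/broken"}
statuses = {"/": 200, "/products": 200, "/old-campaign": 200, "/broken": 404}

for url, problems in sitemap_warnings(sitemap_urls, linked_urls, statuses).items():
    print(url, problems)
```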
What is Crawl Budget?
Crawl budget can be thought of as the maximum number of pages that Google will crawl on your site per day. It isn’t a fixed number, and will vary from day to day, but it is generally fairly stable.
Crawl budget varies from site to site and is determined by a variety of factors including (but not limited to):
- How many pages are on your site. This is where things like faceted navigation and ‘spider traps’ can have a big impact, i.e. ‘spider traps’ can lead to lots of unnecessary crawling which keeps Google from crawling all the important pages.
- How often your site is updated.
- How important your site is. Remember that Google uses inbound links amongst other things as a proxy for “authority”. If you have lots of high quality links then your site will be considered more important and so Google will try harder to crawl it thoroughly.
- The overall health of your site. This assessment will take into consideration:
  - Your site speed. Google will slow the crawler down on sites that are slow to respond. They don’t want to overwhelm your web hosting so they adjust automatically to response times.
  - The number of errors they find, e.g. lots of 4xx or 5xx errors will lead to less crawling.
Crawl budget is usually only a conversation we have with very large sites that are struggling to get all their content indexed. Usually the issue is one or more of the above factors.
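If you want a rough feel for your own crawl activity, your server logs are the place to look. The sketch below assumes the log lines have already been parsed into (date, user agent, status) records; that record format, the sample data and the function name are our own assumptions, and matching on the word "Googlebot" is a simplification.

```python
from collections import Counter


def googlebot_crawl_summary(records):
    """Per day: (pages requested by Googlebot, requests that hit a 4xx/5xx)."""
    per_day = Counter()
    errors = Counter()
    for date, user_agent, status in records:
        if "Googlebot" not in user_agent:
            continue
        per_day[date] += 1
        if status >= 400:
            errors[date] += 1
    return {day: (per_day[day], errors[day]) for day in sorted(per_day)}


# Hypothetical, already-parsed log records.
records = [
    ("2024-01-15", "Mozilla/5.0 (compatible; Googlebot/2.1)", 200),
    ("2024-01-15", "Mozilla/5.0 (compatible; Googlebot/2.1)", 404),
    ("2024-01-15", "Mozilla/5.0 (Windows NT 10.0)", 200),
    ("2024-01-16", "Mozilla/5.0 (compatible; Googlebot/2.1)", 200),
]

print(googlebot_crawl_summary(records))
# {'2024-01-15': (2, 1), '2024-01-16': (1, 0)}
```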
Ranking is a whole other topic
Getting crawled and indexed is one thing. Ranking is a different topic altogether, but obviously you can only rank if you are first crawled and indexed. So this is the first step in ensuring the best possible visibility for your site in the search results.