How Does Google Crawl and Index Pages?


Search Engine Optimization (SEO) is really an umbrella term which encompasses a wide range of activities: front-end design and coding (UX and performance), back-end optimization (crawl optimization and performance), content strategy and optimization, and digital PR and promotion.

You can’t be an expert in everything, but one thing I think is important for all SEOs and digital marketing managers to understand is how Google actually crawls and indexes the web. This is important because it informs how we build and structure websites, and how we can make Google’s life easier which is ultimately good for us…

TL;DR

Google has a two-phase crawling process. In Phase 1 they parse the static HTML on a page without fully rendering it, then follow any links they find and do the same on those pages. In Phase 2 they fully render the page, including any JavaScript, and at this stage they may find more content and links to index.

So Google can and does index content rendered with JavaScript, but this is resource intensive and slows down the process of crawling and indexing your site. This is especially true for sites that are built as monolithic JavaScript web apps, and for very large websites that rely on a lot of JavaScript.

To ensure fast and comprehensive crawling and indexing of your site you should try to incorporate main navigation and content elements in plain HTML as much as possible. It is also important to think about site structure and internal crawl paths so that it is easy for Google to find all your pages. XML sitemaps help but do not guarantee indexation or rankings.

How Google Crawls Your Website

You’ve almost certainly heard of Googlebot before and probably know that it regularly visits your website to index your content. But this is really an oversimplification of what is actually happening. The following chart outlines a more accurate view of this process.

[Diagram: Google’s two-phase crawl and index process]

There are really two phases to the crawl and indexation process…

Phase 1

In the first phase what we commonly know as Googlebot visits your site for the first time. Usually starting on the homepage, Googlebot will parse the static HTML, the same HTML that is served to your web browser. You can see what Google sees at this stage by right-clicking on your home page and choosing “View Source”.

So any content that is visible in the static HTML may be indexed, and any links to other pages that Googlebot can see in the static HTML will be added to a queue for Googlebot to crawl and index.
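If you want to see roughly what this first pass looks like for one of your own pages, the sketch below fetches the raw HTML and lists every link present in it, without executing any JavaScript. It is a minimal sketch only: it assumes Node 18+ (for the global fetch) and the cheerio package for HTML parsing, and the URL is a placeholder.

```typescript
// Rough sketch of what Phase 1 "sees": the raw HTML response only,
// with no JavaScript executed. Assumes Node 18+ (global fetch) and
// the `cheerio` package for HTML parsing.
import * as cheerio from "cheerio";

async function listStaticLinks(url: string): Promise<string[]> {
  const res = await fetch(url);        // the raw HTML, exactly as served
  const html = await res.text();
  const $ = cheerio.load(html);

  // Collect every <a href> present in the static markup. Anything injected
  // later by client-side JavaScript will NOT appear here.
  const links = new Set<string>();
  $("a[href]").each((_, el) => {
    const href = $(el).attr("href");
    if (href) links.add(new URL(href, url).toString());
  });
  return [...links];
}

listStaticLinks("https://www.example.com/").then((links) => {
  console.log(`Links visible without rendering: ${links.length}`);
  links.forEach((link) => console.log(link));
});
```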

If your entire website is served as static HTML then Googlebot will quickly be able to find and index all your content, but these days it is rarely as simple as that, for a couple of main reasons:

  1. Many websites these days use one or more of the popular JavaScript frameworks (Vue, React, Angular etc.) to add more sophisticated, app-like user experiences. Sometimes the JavaScript is just for navigation or interactive components, but often websites today are built entirely as JavaScript web apps. This complicates things for Google because if that JavaScript is run client side (more on this later) then Googlebot will not see any of your content or links in this first phase, because Googlebot doesn’t execute JavaScript.
  2. And even if your site is served as static HTML it almost certainly uses CSS for styling, and Google is actually very interested in how your website is laid out because this also gives them some good signals about the quality of your site.

So, for these reasons Google has a second phase of crawl and indexation which is referred to in the diagram above as the renderer…

Phase 2

Once Google has found one or more URLs on your website these will be added to a second queue to be fully rendered. This means that Google will load all assets and execute all JavaScript on the page, just as your browser would. To do this Google uses a recent version of Chrome.

This is good news because it means that Google can in fact crawl and index content rendered with JavaScript. And they don’t just execute JavaScript, they also render the page with all assets including CSS. For this reason it is very important that you do NOT block Google from crawling any assets such as CSS or JavaScript. Doing so will prevent Google from properly rendering your site, which will have a negative impact on your rankings…Google won’t rank a site well if they don’t know what it looks like for users.
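You can’t run Google’s renderer yourself, but you can approximate the gap between the two phases with headless Chrome. The sketch below assumes Node 18+ and the puppeteer package; it renders a page and compares the number of links in the raw HTML with the number after JavaScript has run. The link-counting regex is deliberately crude and purely illustrative.

```typescript
// Approximate the Phase 2 view by rendering the page in headless Chrome.
// Assumes the `puppeteer` package and Node 18+ for the global fetch.
import puppeteer from "puppeteer";

async function compareStaticVsRendered(url: string): Promise<void> {
  const rawHtml = await (await fetch(url)).text();      // Phase 1 view

  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: "networkidle0" });  // let JS and assets load
  const renderedHtml = await page.content();            // Phase 2 view
  await browser.close();

  // Crude link count, just to illustrate the difference between the two views.
  const countLinks = (html: string) => (html.match(/<a\s[^>]*href=/gi) ?? []).length;
  console.log(`Links in static HTML:  ${countLinks(rawHtml)}`);
  console.log(`Links after rendering: ${countLinks(renderedHtml)}`);
  // A large gap means most of your links only exist after JavaScript runs,
  // i.e. they sit in the slower render queue before Google can follow them.
}

compareStaticVsRendered("https://www.example.com/");
```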

There is an important consequence, however, if your site relies on client-side JavaScript…it slows Google down, big time!

Imagine you have a very large site which is served as a client-side JavaScript web app…Googlebot visits the home page, doesn’t see anything, but that URL goes to the render queue; the page is rendered and Google sees content and additional links to inner pages. Googlebot visits those pages, but doesn’t see anything…the URLs are eventually rendered and more content and links are found…those links go back to Googlebot, and so on. Rendering 30,000,000,000,000+ pages on the web is very resource intensive, and Google can’t do it as regularly as they can crawl static HTML with Googlebot.

Crawl and Indexation Summary

So in a nutshell, you can think of Google’s crawl and index process in two phases. Phase 1 is a quick scan of static HTML. The more content and links Google finds in this phase the faster your site will be indexed. Phase 2 is slower and more resource intensive, but it is during this phase that Google can render all JavaScript and will see all content and links served by the JavaScript.

Google’s ability to render modern JavaScript is good news for developers, but it is important to be aware that it will make it slower for Google to crawl and index your entire site. It will also make it much slower for Google to discover new content, as well as to recrawl old content as it is updated.

That’s not to say that there are not ways to mitigate these challenges…

8 Things We Can Do to Improve Indexation

The easier it is for Google to find and index the content on your site the better visibility you will have in search results. This is especially true if you have a very large and complex site, like Trade Me for example. So what can we do to help Google:

1. Aim for a wide and flat structure

That statement is a little bit of an oversimplification, but it stands as a good rule of thumb. The fewer links Google needs to follow to find an inner page, the more likely it is to be crawled regularly.

Your circumstances may vary, but we always recommend trying to keep your site structure or category scheme as shallow as possible. So rather than a few big categories with many layers of sub-category, it is in general better to have more top level categories with minimal sub-categories.
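As a rough illustration of why “wide and flat” helps: the number of pages reachable within a given number of clicks is roughly the number of links per page raised to the power of the click depth, so a modest increase in links per level sharply reduces the depth Google has to crawl. The numbers in this back-of-the-envelope sketch are made up purely to show the shape of the trade-off.

```typescript
// Back-of-the-envelope: how many pages sit within `clicks` clicks of the
// homepage if every page links out to `linksPerPage` other pages?
// Illustrative numbers only.
const reachable = (linksPerPage: number, clicks: number): number =>
  Math.pow(linksPerPage, clicks);

console.log(reachable(10, 3));  //     1,000 pages within 3 clicks at 10 links per page
console.log(reachable(100, 3)); // 1,000,000 pages within 3 clicks at 100 links per page
```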

2. Check pagination

The first rule of thumb doesn’t just apply to your category structure, it also applies to things like pagination. For example, if you have 1000 products in a category and only display 20 products per page then you’ll have 50 paginated category pages, and depending on how you structure your pagination this may mean some products are buried 10 or more clicks deep in your site. In such cases you can either increase the number of products displayed per page, or consider splitting the products into multiple categories. The right approach always “depends”, but the rule of thumb is good to keep in mind.
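Here is that example worked through as a quick sketch. The pagination model (next/previous links only versus a block of 10 numbered page links) is an assumption for illustration; your own templates will differ.

```typescript
// Illustrative only: how deep do products sit when a category is split into
// sequential pages, depending on how the pagination links are structured?
const products = 1000;
const perPage = 20;
const pages = Math.ceil(products / perPage); // 50 paginated category pages

// With next/previous links only, page N is N-1 clicks from page 1,
// so products on the last page are roughly 49 clicks deep.
const worstCaseNextOnly = pages - 1;

// With numbered pagination exposing, say, 10 page links at a time,
// the crawler can jump ahead about 10 pages per click.
const worstCaseNumbered = Math.ceil((pages - 1) / 10);

console.log({ pages, worstCaseNextOnly, worstCaseNumbered });
// { pages: 50, worstCaseNextOnly: 49, worstCaseNumbered: 5 }
```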

3. Try to avoid duplicate content

It’s important to know that there is no ‘penalty’ for duplicate content. Duplicate content is a fact of life on the web, and from a user point of view it often goes unnoticed. For instance, an ecommerce site might offer some faceted navigation which leads to many URL variations of the same category page. Or perhaps you list all your colorways as unique products, but the descriptions and product names and titles are all the same. Google has seen it all and is generally very good at deduping search results.

The problem with duplicate content though is that Google still needs to crawl the page to know that it is duplicate…and if you have a lot of duplicate content that is a lot of wasted crawling for pages that are not likely to be indexed. If you have these kinds of issues, on a large site especially, it pays to figure out how to prevent Google from crawling pages unnecessarily. It makes their life easier and will ensure a faster and more comprehensive crawl of the pages you most care about.

4. Be wary of spider traps

A ‘spider trap’ is an endless loop of URL variations that causes unnecessary crawling. We mentioned faceted navigation above and this is one of the common causes of ‘spider traps’. We’ve seen cases on ecommerce sites where categories could be filtered by an almost infinite number of facets related to size, color, style, gender etc.

The ability for consumers to filter is great, but when those filters can be applied (and crawled) in a very large number of possible combinations, and each combination results in a unique URL…well then you have trouble. Google can get caught crawling all those unique URLs, which is a huge waste of their and your resources. Again, the easier we make Google’s life, the better for us.
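A quick back-of-the-envelope calculation shows how fast this blows up. If every combination of facet values gets its own crawlable URL, the URL count is the product of (options + 1) for each facet, the “+1” covering the facet not being applied at all. The facet counts below are invented purely for illustration.

```typescript
// Rough sketch of faceted-navigation URL explosion for a single category.
// Facet option counts are made up for illustration.
const facets = { size: 8, colour: 12, style: 6, gender: 3, brand: 40 };

// Each facet can be unset or set to one of its options, and every
// combination produces a distinct crawlable URL.
const crawlableUrls = Object.values(facets)
  .reduce((total, options) => total * (options + 1), 1);

console.log(crawlableUrls.toLocaleString());
// 9 * 13 * 7 * 4 * 41 = 134,316 URL variations for one category
```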

5. Have an XML sitemap

An XML sitemap is mandatory. It doesn’t guarantee that your pages will all be indexed, but it does ensure they’ll at least be crawled. A couple of things to note about sitemaps (there’s a minimal generation sketch after these notes):

  • The Change Frequency and Priority fields in the XML spec are taken by Google as hints only. There is no point setting a regular change frequency for a page that isn’t changed regularly; it won’t help. In our experience Google will set their own crawl frequency based on how often they see the page changing.
  • There are also special XML sitemap formats for images, videos and news (for news sites, not your company news feed) and these should be utilized as appropriate. We like image sitemaps for ecommerce sites in particular because of the large number of product images (image search for ecommerce will be a future blog post!)
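As a reference point, here is a minimal sketch of the standard sitemap format, generated in TypeScript. The URLs and dates are placeholders, and changefreq/priority are left out since Google treats them as hints at best.

```typescript
// Minimal sitemap generator sketch; URLs and lastmod dates are placeholders.
interface SitemapEntry {
  loc: string;
  lastmod?: string; // ISO 8601 date, e.g. "2024-01-31"
}

function buildSitemap(entries: SitemapEntry[]): string {
  const urls = entries
    .map((e) =>
      [
        "  <url>",
        `    <loc>${e.loc}</loc>`,
        e.lastmod ? `    <lastmod>${e.lastmod}</lastmod>` : null,
        "  </url>",
      ].filter(Boolean).join("\n")
    )
    .join("\n");

  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    urls,
    "</urlset>",
  ].join("\n");
}

console.log(buildSitemap([
  { loc: "https://www.example.com/", lastmod: "2024-01-31" },
  { loc: "https://www.example.com/category/widgets" },
]));
```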

6. Don’t block search engines from essential resources

Another easy mistake to make is blocking Google from crawling essential non-HTML resources like JavaScript. We have seen many cases where well intentioned developers have blocked Google from crawling JavaScript resources in the mistaken belief that ‘Google doesn’t need to see that stuff’.

The thing is that Google absolutely needs to see any resource that is used to present or display the content on your site. Google doesn’t only index the content on your page; they also render the page as a browser does when a real person visits the site. They do this so they can see how the content is laid out, understand what is important, and gauge how many ads there are on the page and what the user experience is like.

All those things are taken into consideration when Google ranks pages, so if they can’t access critical resources to properly render your pages then they can’t fully assess your pages and your rankings will reflect that.
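If you want a quick sanity check, the sketch below fetches a site’s robots.txt and flags Disallow rules that look like they might cover CSS or JavaScript paths. It is a crude heuristic (it ignores user-agent groups and Allow rules), not a proper robots.txt parser, and it assumes Node 18+ for the global fetch; Google Search Console’s URL Inspection tool remains the authoritative check.

```typescript
// Quick heuristic check (not a full robots.txt parser): flag Disallow rules
// that look like they would block CSS or JavaScript assets.
async function findAssetBlockingRules(origin: string): Promise<string[]> {
  const res = await fetch(new URL("/robots.txt", origin));
  if (!res.ok) return [];
  const lines = (await res.text()).split("\n");

  return lines
    .map((line) => line.trim())
    .filter((line) => /^disallow:/i.test(line))
    // Paths that commonly hold rendering resources; adjust for your own site.
    .filter((line) => /\.(js|css)\b|\/(js|css|assets|static|scripts|styles)\b/i.test(line));
}

findAssetBlockingRules("https://www.example.com").then((rules) => {
  if (rules.length) {
    console.warn("Rules that may block rendering resources:", rules);
  } else {
    console.log("No obvious CSS/JS blocking rules found.");
  }
});
```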

7. Content that is hidden behind an interaction will not be indexed

A popular UI effect we see today is the “Load More” link at the bottom of blog pages or ecommerce category pages. Clicking the link loads more posts or products without a full page reload. It is a good user experience, but it is important to know that even though Google can render content in JavaScript they will not interact with your page, so Google will not click that “Load More” link…and if they don’t click it they won’t see the additional posts or product links, which means they may not be crawled.

We always recommend implementing a graceful degradation for browsers that do not have JavaScript enabled. In the case of a “Load More” button this would be more traditional pagination to provide a crawl path into the archive of older posts or additional products in the category. This graceful degradation of features for non-JavaScript-enabled browsers also suits Google, who will be able to easily crawl those additional pages and find links to all the additional posts or products without having to render the page.
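One way to build that graceful degradation is progressive enhancement: the server renders a real, crawlable pagination link, and a small script upgrades it into an in-place “Load More” button for JavaScript-enabled browsers. The markup, element IDs and URLs in this sketch are assumptions for illustration, not a prescribed implementation.

```typescript
// Progressive-enhancement sketch (markup, IDs and URLs are illustrative).
// The server renders a real pagination link, e.g.
//   <a id="load-more" href="/category/widgets?page=2">Load more</a>
// so crawlers and non-JavaScript users get a plain crawl path; this script
// then upgrades that link into an in-place "Load More" button.
document.querySelector<HTMLAnchorElement>("#load-more")?.addEventListener("click", async (event) => {
  event.preventDefault();                                 // JS users stay on the page
  const link = event.currentTarget as HTMLAnchorElement;  // capture before awaiting
  const productList = document.querySelector("#product-list");
  if (!productList) return;

  // Fetch the next paginated page and splice its products into the current list.
  const res = await fetch(link.href);
  const doc = new DOMParser().parseFromString(await res.text(), "text/html");
  const nextProducts = doc.querySelector("#product-list");
  const nextLink = doc.querySelector<HTMLAnchorElement>("#load-more");

  if (nextProducts) productList.insertAdjacentHTML("beforeend", nextProducts.innerHTML);
  if (nextLink) link.href = nextLink.href;                // point at the following page
  else link.remove();                                     // nothing more to load
});
```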

8. Try to incorporate primary navigation in plain HTML

We’ve already covered how Google crawls and indexes so this point should be self evident. Incorporating your main navigation in plain HTML will make it faster and easier for Google to discover all the pages on your site without having to fully render all your pages.

Just having an XML sitemap doesn’t guarantee indexation

Crawling, indexing and ranking are different things. An XML sitemap will help ensure Google knows about a page and will crawl it, but it does not guarantee it is indexed or ranked.

Even if a page is included in your XML sitemap Google may not index or rank it for any number of possible reasons, for example…

  • The page is a duplicate of another page (see above) so there is no point indexing it.
  • Despite being included in the XML sitemap, if Google can’t find a link to the page on the website itself, then they will consider that the page isn’t very important.
  • If a page is included in the sitemap, but is regularly inaccessible (404, or 5xx error) then it is unlikely it will be indexed.
  • There are many other possible reasons, but the point is that XML sitemaps are just a pointer, they don’t guarantee indexation.

What is Crawl Budget?

Crawl budget can be thought of as the maximum number of pages that Google will crawl on your site per day. It isn’t a fixed number, and will vary from day to day, but it is generally fairly stable.

Crawl budget varies from site to site and is determined by a variety of factors including (but not limited to):

  • How many pages there are on your site. This is where things like faceted navigation and ‘spider traps’ can have a big impact, i.e. ‘spider traps’ can lead to lots of unnecessary crawling which keeps Google from crawling all the important pages.
  • How often your site is updated.
  • How important your site is. Remember that Google uses inbound links amongst other things as a proxy for “authority”. If you have lots of high quality links then your site will be considered more important and so Google will try harder to crawl it thoroughly.
  • The overall health of your site. This assessment will take into consideration:
    • Your site speed. Google will slow the crawler down on sites that are slow to respond. They don’t want to overwhelm your web hosting so they adjust automatically to response times.
    • The number of errors they find, e.g. lots of 4xx or 5xx errors will lead to less crawling.

Crawl budget is usually only a conversation we have with very large sites that are struggling to get all their content indexed. Usually the issue is one or more of the above factors.

Ranking is a whole other topic

Getting crawled and indexed is one thing. Ranking is a different topic altogether, but obviously you can only rank if you are first crawled and indexed. So this is the first step in ensuring the best possible visibility for your site in the search results.

About the author

Charles

Charles has been working on the web for 20 years, building and promoting websites in New Zealand, Australia, the US and China.

By Charles

Get in touch

Paul - (027) 513 6134
Charles - (021) 807 829

