Google index

Whether Google indexes your content is determined by algorithms that take user demand and quality checks into account. You can influence the indexing process through how you manage the discovery of your content, which hinges on the URL of the page.

Without your pages’ URLs, Google’s systems cannot crawl, index, and ultimately present your information in Search. This article explains how to get into the Google index by helping you decide how to manage Google’s discovery of your content, which is the first step in the indexing process.

The basics of a search engine

Let’s start by looking at the absolute basics of what a search engine does. A search engine is an incredible piece of technology, but its workings come down to three main parts: crawling, indexing, and ranking. Crawling is about spidering the web and finding content, indexing is about reading pages and storing them in a database, and ranking is about determining which pages to show for a specific user query.

Crawling

A search engine needs to discover content before it can add it to the big index. That process is called crawling: it literally uses robots to crawl the web in search of new and updated content. These crawlers follow links and read sitemaps to find content that might be useful for users. After finding that content, the process of indexing begins. By improving your crawlability, you determine whether your site works with these robots or against them.

Indexing

Indexing is about understanding the content and filing it in the proper place. After finding the content, Google has to read and understand it before it can be put in the right buckets. For this, it first parses the page or, in other words, translates it into a structure its systems can work with. After that’s done, it renders the page, just like a regular browser does, to discover the content and what it looks like. When that’s done, it uses the signals and information on that page to file it in the proper location inside Google’s index, a.k.a. the big filing cabinet.

Ranking

Lastly, a search engine has to rank the results for a user query and present them in a proper way in the SERPs. The ranking process consists of understanding the question the user is asking and retrieving the content best suited to answer it. Ranking algorithms heavily influence this process, and they have loads of variables to go on.

After finding the most relevant results, a search engine serves these to the user in a way that makes sense. This might be a regular spot in a SERP or something rich like a knowledge panel, or something local if the topic is locally oriented.

What is indexing at Google?

Indexing is the process of organizing data in a structured way so that information can be found quickly when asked for. Search engines crawl millions of pages, extract the data, and put that data in a big bin called the index. Without a proper, highly optimized index, search engines’ algorithms would have no way to quickly extract the relevant content.

The process of indexing has a couple of steps. After discovering a piece of content during the crawling process, a parser is going to look at it and determine what it is. The parser recognizes structural elements like titles, links, headings, and more. It also identifies the text and tries to connect words to topics and entities. During parsing, it might encounter errors that make it hard for the parser to fully understand the page.

If the page parses well, the system will use a browser to render it and get a more accurate picture of the content, the design, and the user experience. All these factors determine how a search engine sees and values your site, and all of this influences your performance in search.

After reading the page, the contents (text, images, videos, et cetera) will be analyzed and classified in the index. The data is sorted and weighted to determine relevancy. For that, Google uses an inverted index, which maps each word to the documents in which it appears, making content easier to retrieve during the ranking process.
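As a rough sketch of the idea (a toy example in Python, nowhere near Google’s actual systems), an inverted index maps each word to the pages that contain it:

  # Build a toy inverted index: word -> set of pages containing that word.
  from collections import defaultdict

  pages = {
      "/coffee-grinders": "the best coffee grinders reviewed",
      "/coffee-beans": "where to buy fresh coffee beans",
  }

  index = defaultdict(set)
  for url, text in pages.items():
      for word in text.lower().split():
          index[word].add(url)

  # Looking up a word returns every page that contains it without
  # scanning each document: {"/coffee-grinders", "/coffee-beans"}
  print(index["coffee"])

The payoff comes at query time: instead of scanning every page for a word, the engine jumps straight to the list of pages known to contain it.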

How to get indexed by Google

Found that your website or web page isn’t indexed in Google? Try this:

  • Go to Google Search Console.
  • Navigate to the URL inspection tool.
  • Paste the URL you’d like Google to index into the search bar.
  • Wait for Google to check the URL.
  • Click the “Request indexing” button.

This process is good practice when you publish a new post or page. You’re effectively telling Google that you’ve added something new to your site and that they should take a look at it.

However, requesting indexing is unlikely to solve underlying problems preventing Google from indexing old pages. If that’s the case, follow the checklist below to diagnose and fix the problem. Here are some quick tactics:

1. Remove crawl blocks in your robots.txt file

Is Google not indexing your entire website? It could be due to a crawl block in something called a robots.txt file.

To check for this issue, go to yourdomain.com/robots.txt. Look for either of these two snippets of code:
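  User-agent: Googlebot
  Disallow: /

  User-agent: *
  Disallow: /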

Both of these tell Googlebot that it isn’t allowed to crawl any pages on your site. To fix the issue, remove them. It’s that simple.

A crawl block in robots.txt could also be the culprit if Google isn’t indexing a single web page. To check if this is the case, paste the URL into the URL inspection tool in Google Search Console. Click on the Coverage block to reveal more details, then look for the “Crawl allowed? No: blocked by robots.txt” error.

This indicates that the page is blocked in robots.txt. If that’s the case, recheck your robots.txt file for any “disallow” rules relating to the page or related subsection. Remove where necessary.
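For reference, a robots.txt that lets every crawler reach everything looks like this (an empty Disallow rule blocks nothing):

  User-agent: *
  Disallow: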

2. Remove rogue noindex tags

Google won’t index pages if you tell them not to. This is useful for keeping some web pages private. There are two ways to do it:

Method 1: meta tag

Pages with either of these meta tags in their <head> section won’t be indexed by Google:
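  <meta name="robots" content="noindex">
  <meta name="googlebot" content="noindex">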

The first is the meta robots tag, which tells all search engines whether they can or can’t index the page; the second targets Googlebot specifically.

Note: The key part is the “noindex” value. If you see that, then the page is set to noindex.

To find all pages with a noindex meta tag on your site, run a crawl with Ahrefs’ Site Audit. Go to the Indexability report. Look for “Noindex page” warnings. Click through to see all affected pages. Remove the noindex meta tag from any pages where it doesn’t belong.

Method 2: X-Robots-Tag

Crawlers also respect the X-Robots-Tag HTTP response header. You can set it using a server-side scripting language like PHP, in your .htaccess file, or by changing your server configuration.
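For example, on an Apache server you could add something like this to your .htaccess file (a sketch that assumes the mod_headers module is enabled) to keep all PDF files out of the index:

  <Files ~ "\.pdf$">
    Header set X-Robots-Tag "noindex"
  </Files>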

The URL inspection tool in Search Console tells you whether Google is blocked from crawling a page because of this header. Just enter your URL, then look for the “Indexing allowed? No: ‘noindex’ detected in ‘X-Robots-Tag’ HTTP header” message.

3. Include the page in your sitemap

A sitemap tells Google which pages on your site are important, and which aren’t. It may also give some guidance on how often they should be re-crawled.

Google should be able to find pages on your website regardless of whether they’re in your sitemap, but it’s still good practice to include them. After all, there’s no point making Google’s life difficult.
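If you need a reference point, a minimal XML sitemap with a single URL looks like this (the domain and date are placeholders):

  <?xml version="1.0" encoding="UTF-8"?>
  <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url>
      <loc>https://example.com/new-page/</loc>
      <lastmod>2024-01-15</lastmod>
    </url>
  </urlset>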

To check if a page is in your sitemap, use the URL inspection tool in Search Console. If you see the “URL is not on Google” error and “Sitemap: N/A,” then it isn’t in your sitemap or indexed.

4. Remove rogue canonical tags

A canonical tag tells Google which is the preferred version of a page. It looks something like this:

<link rel="canonical" href="/page.html">

Most pages either have no canonical tag, or what’s called a self-referencing canonical tag. That tells Google the page itself is the preferred and probably the only version. In other words, you want this page to be indexed.

But if your page has a rogue canonical tag, it could be telling Google about a preferred version of the page that doesn’t exist, in which case your page won’t get indexed.

To check for a canonical, use Google’s URL inspection tool. You’ll see an “Alternate page with canonical tag” warning if the canonical points to another page.

If this shouldn’t be there, and you want to index the page, remove the canonical tag.

Note: Canonical tags aren’t always bad. Most pages with these tags will have them for a reason. If you see that your page has a canonical set, then check the canonical page. If this is indeed the preferred version of the page, and there’s no need to index the page in question as well, then the canonical tag should stay.

5. Check that the page isn’t orphaned

Orphan pages are those without internal links pointing to them. Because Google discovers new content by crawling the web, they’re unable to discover orphan pages through that process. Website visitors won’t be able to find them either.

To check for orphan pages, crawl your site with Ahrefs’ Site Audit. Next, check the Links report for “Orphan page (has no incoming internal links)” errors.

This shows all pages that are both indexable and present in your sitemap, yet have no internal links pointing to them.

6. Fix nofollow internal links

Nofollow links are links with a rel="nofollow" attribute. They prevent the transfer of PageRank to the destination URL, and Google also doesn’t crawl nofollow links.
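For reference, here’s a normal internal link next to a nofollowed one (the URL is a placeholder):

  <a href="https://example.com/page/">A followed link</a>
  <a href="https://example.com/page/" rel="nofollow">A nofollowed link</a>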

Here’s what Google says about the matter:

Essentially, using nofollow causes us to drop the target links from our overall graph of the web. However, the target pages may still appear in our index if other sites link to them without using nofollow, or if the URLs are submitted to Google in a Sitemap.

In short, you should make sure that all internal links to indexable pages are followed. To do this, use Ahrefs’ Site Audit tool to crawl your site. Check the Links report for indexable pages with “Page has nofollow incoming internal links only” errors.

Remove the nofollow tag from these internal links, assuming that you want Google to index the page. If not, either delete the page or noindex it.

7. Add “powerful” internal links

Google discovers new content by crawling your website. If you neglect to link internally to the page in question, Google may not be able to find it.

One easy solution to this problem is to add some internal links to the page. You can do that from any other web page that Google can crawl and index.

However, if you want Google to index the page as fast as possible, it makes sense to do so from one of your more “powerful” pages. Why? Because Google is likely to recrawl such pages faster than less important pages.

To do this, head over to Ahrefs’ Site Explorer, enter your domain, then visit the Best by links report.

This shows all the pages on your website sorted by URL Rating (UR). In other words, it shows the most authoritative pages first. Skim this list and look for relevant pages from which to add internal links to the page in question.

8. Make sure the page is valuable and unique

Google is unlikely to index low-quality pages because they hold no value for its users. If you’ve ruled out technical issues as the reason for the lack of indexing, then a lack of value could be the culprit.

For that reason, it’s worth reviewing the page with fresh eyes and asking yourself: Is this page genuinely valuable? Would a user find value in this page if they clicked on it from the search results?

If the answer is no to either of those questions, then you need to improve your content. You can find more potentially low-quality pages that aren’t indexed using Ahrefs’ Site Audit tool and URL Profiler; start from Page Explorer in Ahrefs’ Site Audit and filter accordingly.

9. Remove low-quality pages (to optimize “crawl budget”)

Having too many low-quality pages on your website serves only to waste crawl budget. Here’s what Google says on the matter:

“Wasting server resources on [low-value-add pages] will drain crawl activity from pages that do actually have value, which may cause a significant delay in discovering great content on a site.”

Think of it like a teacher grading essays, one of which is yours. If they have ten essays to grade, they’ll get to yours quite quickly. If they have a hundred, it’ll take them a bit longer. If they have thousands, their workload is too high, and they may never get around to grading your essay.

Google does state that “crawl budget is not something most publishers have to worry about,” and that “if a site has fewer than a few thousand URLs, most of the time it will be crawled efficiently.”

Still, removing low-quality pages from your website is never a bad thing. It can only have a positive effect on the crawl budget. You can use our content audit template to find potentially low-quality and irrelevant pages that can be deleted.

10. Build high-quality backlinks

Backlinks tell Google that a web page is important. After all, if someone is linking to it, then it must hold some value. These are pages that Google wants to index.

For full transparency, Google doesn’t only index web pages with backlinks. There are plenty (billions) of indexed pages with no backlinks. However, because Google sees pages with high-quality links as more important, they’re likely to crawl—and re-crawl—such pages faster than those without. That leads to faster indexing.

How to influence indexing in Google?

Roll out the red carpet for Google, so to speak, if you want them to index your site properly. Do everything you can to make your site easy to crawl: take away technical barriers and improve the discoverability of your URLs.

Keep your robots.txt clean and don’t block pages that you don’t need to block. Update your XML sitemap, and check for pages you’ve (accidentally?) noindexed with robots meta tags. Improve your internal linking structure. Have a ton of underperforming pages? It might be a good idea to do something about those low-quality pages. Also, regularly check Search Console to see if Google has found errors on your site. Beyond that, there are more things you can do to optimize your crawl budget.

Conclusion

A search engine needs to do three things before it presents your content to visitors: crawling, indexing, and ranking. If Google doesn’t index your website, then you’re pretty much invisible. You won’t show up for any search queries, and you won’t get any organic traffic.

Indexing is an important part of what a search engine does. Without indexing, the pages Googlebot crawls have no place to live, and the ranking systems lack the input they need to do their work. If Google can’t index your site, your site can’t appear in the search results.