Sitemap auditing involves syntax, crawlability, and indexation checks for the URLs and tags in your sitemap files.
A sitemap file contains the URLs to index with further information regarding the last modification date, priority of the URL, images, videos on the URL, and other language alternates of the URL, along with the change frequency.
A sitemap index file can reference millions of URLs, even though a single sitemap file can contain at most 50,000 URLs.
Auditing these URLs for better indexation and crawling might take time.
But with the help of Python and SEO automation, it is possible to audit millions of URLs within the sitemaps.
What Do You Need To Perform A Sitemap Audit With Python?
To understand the Python Sitemap Audit process, you’ll need:
A fundamental understanding of technical SEO and sitemap XML files.
The ability to work with Python libraries such as Pandas, Advertools, LXML, and Requests, along with XPath selectors.
Which URLs Should Be In The Sitemap?
A healthy XML sitemap file should meet the following criteria:
All URLs should have a 200 Status Code.
All URLs should be self-canonical.
URLs should be open to being indexed and crawled.
URLs shouldn’t be duplicated.
URLs shouldn’t be soft 404s.
The sitemap should have a proper XML syntax.
The URLs in the sitemap should have canonical tags that align with their Open Graph and Twitter Card URLs.
The sitemap should have no more than 50,000 URLs and a file size under 50 MB.
What Are The Benefits Of A Healthy XML Sitemap File?
Smaller sitemaps are better than larger sitemaps for faster indexation. This is particularly important in News SEO, as smaller sitemaps help increase the overall count of valid indexed URLs.
Differentiate frequently updated and static content URLs from each other to provide a better crawling distribution among the URLs.
Using the “lastmod” date in an honest way that aligns with the actual publication or update date helps a search engine to trust the date of the latest publication.
The sitemap audit below follows these criteria to improve indexing, crawling, and search engine communication with Python.
An Important Note…
When it comes to a sitemap's nature and audit, Google and Microsoft Bing don't use "changefreq" (the change frequency of the URLs) or "priority" (the prominence of a URL). In fact, they have called these tags a "bag of noise."
However, Yandex and Baidu use all these tags to understand the website’s characteristics.
A 16-Step Sitemap Audit For SEO With Python
A sitemap audit can involve content categorization, site-tree, or topicality and content characteristics.
However, a sitemap audit for better indexing and crawlability mainly involves technical SEO rather than content characteristics.
In this step-by-step sitemap audit process, we’ll use Python to tackle the technical aspects of sitemap auditing millions of URLs.
Image created by the author, February 2022
1. Import The Python Libraries For Your Sitemap Audit
The following code block imports the necessary Python libraries for the sitemap XML file audit.
import advertools as adv
import pandas as pd
from lxml import etree
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))
Here’s what you need to know about this code block:
Advertools is necessary for taking the URLs from the sitemap file and requesting their content or response status codes.
“Pandas” is necessary for aggregating and manipulating the data.
Plotly is necessary for the visualization of the sitemap audit output.
LXML is necessary for the syntax audit of the sitemap XML file.
IPython is optional to expand the output cells of Jupyter Notebook to 100% width.
2. Take All Of The URLs From The Sitemap
Millions of URLs can be taken into a Pandas data frame with Advertools, as shown below.
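The advertools `sitemap_to_df` function downloads a sitemap, or a whole sitemap index recursively, into a data frame with one row per URL. Since that call issues network requests, the sketch below parses a tiny inline sitemap with the standard library to illustrate the resulting structure; the example URLs are hypothetical.

```python
import xml.etree.ElementTree as ET
import pandas as pd

# With advertools, one line fetches a live sitemap or sitemap index:
# import advertools as adv
# sitemap_df = adv.sitemap_to_df("https://www.complaintsboard.com/sitemap.xml")

# Offline illustration of the same one-row-per-URL structure:
sitemap_xml = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc><lastmod>2022-01-01</lastmod></url>
  <url><loc>https://example.com/page</loc><lastmod>2022-02-01</lastmod></url>
</urlset>"""

ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(sitemap_xml)
sitemap_df = pd.DataFrame(
    [{"loc": u.find("sm:loc", ns).text, "lastmod": u.find("sm:lastmod", ns).text}
     for u in root.findall("sm:url", ns)])
print(sitemap_df["loc"].tolist())
```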
With the help of “urllib” or the “advertools” as above, you can easily parse the URLs within the sitemap into a data frame.
Creating a URL tree with urllib or Advertools is easy.
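As a minimal sketch with the standard library's `urllib.parse` (advertools' `url_to_df` produces a richer version of the same breakdown), using a couple of hypothetical URLs:

```python
from urllib.parse import urlparse
import pandas as pd

urls = ["https://example.com/books/fiction/page-1",
        "https://example.com/profile/user-1"]

rows = []
for url in urls:
    parsed = urlparse(url)
    row = {"url": url, "scheme": parsed.scheme,
           "netloc": parsed.netloc, "path": parsed.path}
    # Each "/" segment of the path becomes a directory column.
    for n, segment in enumerate(parsed.path.strip("/").split("/"), start=1):
        row[f"dir_{n}"] = segment
    rows.append(row)

urldf = pd.DataFrame(rows)
print(urldf[["scheme", "netloc", "dir_1"]])
```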
Checking the URL breakdowns helps to understand the overall information tree of a website.
The data frame above contains the "scheme," "netloc," "path," and every "/" breakdown within the URLs as "dir" columns representing the directories.
Auditing the URL structure of the website is important for two objectives: checking whether all URLs use "HTTPS," and understanding the content network of the website.
Content analysis with sitemap files is not directly a topic of indexing and crawling, so we will only touch on it at the end of the article.
Check the next section to see the SSL Usage on Sitemap URLs.
5. Check The HTTPS Usage On The URLs Within Sitemap
Use the following code block to check the HTTPS usage ratio for the URLs within the sitemap.
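The check is a one-line `value_counts` on the "scheme" column of the parsed-URL data frame; a small stand-in frame is used here so the snippet runs on its own.

```python
import pandas as pd

# Stand-in for the parsed-URL data frame built in the earlier step.
urldf = pd.DataFrame({"scheme": ["https", "https", "https", "http"]})

# Any "http" rows signal protocol inconsistencies across the property.
scheme_counts = urldf["scheme"].value_counts()
print(scheme_counts)
```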
The code block above uses a simple data filtration for the “scheme” column which contains the URLs’ HTTPS Protocol information.
Using "value_counts," we see that all URLs use HTTPS.
Checking the HTTP URLs from the Sitemaps can help to find bigger URL Property consistency errors.
6. Check The Robots.txt Disallow Commands For Crawlability
The structure of URLs within the sitemap is beneficial for seeing whether any URLs are "submitted but disallowed."
To see whether the website has a robots.txt file, use the code block below.
import requests
r = requests.get("https://www.complaintsboard.com/robots.txt")
r.status_code
200
Simply, we send a GET request to the robots.txt URL.
If the response status code is 200, it means there is a robots.txt file for the user-agent-based crawling control.
After checking the “robots.txt” existence, we can use the “adv.robotstxt_test” method for bulk robots.txt audit for crawlability of the URLs in the sitemap.
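The `adv.robotstxt_test(robotstxt_url, user_agents, urls)` call fetches the robots.txt file and returns a data frame with a "can_fetch" column per URL. Since it needs network access, the sketch below reproduces the same check offline with the standard library's `urllib.robotparser`, using hypothetical rules and URLs.

```python
from urllib.robotparser import RobotFileParser

# With advertools (network request, returns a "can_fetch" column):
# import advertools as adv
# robots_df = adv.robotstxt_test("https://www.complaintsboard.com/robots.txt",
#                                user_agents=["*"], urls=sitemap_df["loc"])

# Offline sketch of the same crawlability test:
rp = RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /profile/"])  # hypothetical rules

urls = ["https://example.com/profile/user-1", "https://example.com/books/"]
disallowed = [u for u in urls if not rp.can_fetch("*", u)]
print(disallowed)
```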
We have used “set_option” to expand all of the values within the “url_path” section.
A URL that is disallowed but submitted via a sitemap appears as "Submitted URL blocked by robots.txt" in Google Search Console Coverage reports.
We see that a “profile” page has been disallowed and submitted.
Later, the same control can be done for further examinations such as “disallowed but internally linked”.
But, to do that, we need to crawl at least 3 million URLs from ComplaintsBoard.com, and it can be an entirely new guide.
Some website URLs do not have a proper “directory hierarchy”, which can make the analysis of the URLs, in terms of content network characteristics, harder.
Complaintsboard.com doesn’t use a proper URL structure and taxonomy, so analyzing the website structure is not easy for an SEO or Search Engine.
But the most used words within the URLs or the content update frequency can signal which topic the company actually weighs on.
Since we focus on “technical aspects” in this tutorial, you can read the Sitemap Content Audit here.
7. Check The Status Code Of The Sitemap URLs With Python
Every URL within the sitemap has to have a 200 Status Code.
A crawl has to be performed to check the status codes of the URLs within the sitemap.
But, since it’s costly when you have millions of URLs to audit, we can simply use a new crawling method from Advertools.
Without taking the response body, we can crawl just the response headers of the URLs within the sitemap.
This decreases the crawl time while still allowing an audit of possible robots, indexing, and canonical signals from the response headers.
To perform a response header crawl, use the “adv.crawl_headers” method.
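`adv.crawl_headers(url_list, output_file)` writes one JSON line per URL with its status code and response headers. The real crawl needs network access, so the sketch below runs the status-code check on a small stand-in of that output; the column names are assumptions about the crawl output.

```python
import pandas as pd

# import advertools as adv
# adv.crawl_headers(sitemap_df["loc"], "sitemap_headers.jl")  # network crawl
# df_headers = pd.read_json("sitemap_headers.jl", lines=True)

# Stand-in for the header crawl output:
df_headers = pd.DataFrame({
    "url": ["https://example.com/", "https://example.com/gone"],
    "status": [200, 404]})

# Every sitemap URL should return 200; anything else is an audit finding.
non_200 = df_headers[df_headers["status"] != 200]
print(non_200["url"].tolist())
```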
Two checks should be performed on the crawl output: one to check whether the response header canonical hint is equal to the URL itself, and another to see whether the status code is 200.
Since we have 404 URLs within the sitemap, their canonical value will be “NaN”.
It shows there are specific URLs with canonicalization inconsistencies.
We have 29 outliers for Technical SEO. Every wrong signal given to the search engine for indexation or ranking will cause the dilution of the ranking signals.
To see these URLs, use the code block below.
Screenshot from Pandas, February 2022.
The Canonical Values from the Response Headers can be seen above.
Even a single “/” in the URL can cause canonicalization conflict as appears here for the homepage.
ComplaintsBoard.com Screenshot for checking the Response Header Canonical Value and the Actual URL of the web page.
You can check the canonical conflict here.
If you check log files, you will see that search engines crawl the URLs from the "Link" response headers.
Thus, in technical SEO, these headers should be given weight.
9. Check The Indexing And Crawling Commands From Response Headers
There are 14 different X-Robots-Tag specifications for the Google search engine crawler.
The latest one is "indexifembedded," which allows the content of a page to be indexed when it is embedded in another page, even if the page itself carries a "noindex" directive.
The Indexing and Crawling directives can be in the form of a response header or the HTML meta tag.
This section focuses on the response header version of indexing and crawling directives.
The first step is checking whether the X-Robots-Tag property and values exist within the HTTP Header or not.
The second step is auditing whether they align with the HTML meta tag properties and values, if those exist.
Use the command below to check the "X-Robots-Tag" from the response headers.
def robots_tag_checker(dataframe: pd.DataFrame):
    # Iterating a DataFrame yields its column names.
    for column in dataframe:
        if "robots" in column:
            return column
    return "There is no robots tag"

robots_tag_checker(df_headers)
OUTPUT>>>
'There is no robots tag'
We have created a custom function to check for an "X-Robots-Tag" column within the response header crawl output.
It appears that our test subject website doesn’t use the X-Robots-Tag.
If there were an X-Robots-Tag, the code block below should be used.
Check whether there is a “noindex” directive from the response headers, and filter the URLs with this indexation conflict.
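A sketch of that filter on a stand-in data frame; the `resp_headers_x-robots-tag` column name is an assumption about how the header crawl output would label it.

```python
import pandas as pd

# Stand-in: how the header crawl output might look if the site used X-Robots-Tag.
df_headers = pd.DataFrame({
    "url": ["https://example.com/a", "https://example.com/b"],
    "resp_headers_x-robots-tag": ["noindex, nofollow", None]})

# Keep only the URLs whose response headers carry a "noindex" directive.
noindex_df = df_headers[
    df_headers["resp_headers_x-robots-tag"].str.contains("noindex", na=False)]
print(noindex_df["url"].tolist())
```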
In the Google Search Console Coverage Report, those appear as “Submitted marked as noindex”.
Contradicting indexing and canonicalization hints and signals might make a search engine ignore all of them, and make the search algorithms trust the user-declared signals less.
10. Check The Self Canonicalization Of Sitemap URLs
Every URL in the sitemap XML files should give a self-canonicalization hint.
Sitemaps should only include the canonical versions of the URLs.
The Python code block in this section is to understand whether the sitemap URLs have self-canonicalization values or not.
To check the canonicalization from the HTML Documents’ “<head>” section, crawl the websites by taking their response body.
Use the code block below.
user_agent = "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/W.X.Y.Z Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
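With that user agent, the full-body crawl is a single advertools call; it is commented out below because it issues network requests and can run for hours on millions of URLs. The `custom_settings` dictionary passes Scrapy-style settings through advertools.

```python
# import advertools as adv

user_agent = ("Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) "
              "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/W.X.Y.Z Mobile "
              "Safari/537.36 (compatible; Googlebot/2.1; "
              "+http://www.google.com/bot.html)")
custom_settings = {"USER_AGENT": user_agent}

# Full-body crawl of the sitemap URLs (network request):
# adv.crawl(sitemap_df["loc"], "sitemap_crawl_complaintsboard.jl",
#           follow_links=False, custom_settings=custom_settings)
```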
The difference between "crawl_headers" and "crawl" is that "crawl" takes the entire response body, while "crawl_headers" fetches only the response headers.
You can compare the output file sizes of a response header crawl and an entire response body crawl from the crawl logs.
Python Crawl Output Size Comparison.
Going from a 6 GB output to a 387 MB output is quite economical.
If a search engine only needs certain response headers and the status code, serving that information in the headers would make its crawl hits more economical.
How To Deal With Large DataFrames For Reading And Aggregating Data?
This section requires dealing with the large data frames.
A computer can’t read a Pandas DataFrame from a CSV or JL file if the file size is larger than the computer’s RAM.
Thus, the “chunking” method is used.
When a website sitemap XML File contains millions of URLs, the total crawl output will be larger than tens of gigabytes.
An iteration across sitemap crawl output data frame rows is necessary.
For chunking, use the code block below.
df_iterator = pd.read_json(
    'sitemap_crawl_complaintsboard.jl',
    chunksize=10000,
    lines=True)

for i, df_chunk in enumerate(df_iterator):
    output_df = pd.DataFrame(data={
        "url": df_chunk["url"],
        "canonical": df_chunk["canonical"],
        "self_canonicalised": df_chunk["url"] == df_chunk["canonical"]})
    mode = "w" if i == 0 else "a"
    header = i == 0
    output_df.to_csv(
        "canonical_check.csv",
        index=False,
        header=header,
        mode=mode)

df = pd.read_csv("canonical_check.csv")
df[(df["url"] != df["canonical"]) & (df["self_canonicalised"] == False) & (df["canonical"].notna())]
You can see the result below.
Python SEO Canonicalization Audit.
We see that the paginated URLs from the "book" subfolder give canonical hints to the first page, which is an incorrect practice according to Google's guidelines.
11. Check The Sitemap Sizes Within Sitemap Index Files
Every sitemap file should be less than 50 MB. Use the Python code block below to check the sitemap file sizes.
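A sketch of the size check on a stand-in data frame; the `sitemap_size_mb` column (reported per downloaded sitemap file by recent advertools versions) and the file names and sizes are assumptions for illustration.

```python
import pandas as pd

# Stand-in for the sitemap data frame with assumed per-file sizes.
sitemap_df = pd.DataFrame({
    "loc": ["https://example.com/a", "https://example.com/b"],
    "sitemap": ["https://example.com/sitemap1.xml",
                "https://example.com/sitemap2.xml"],
    "sitemap_size_mb": [12.4, 61.8]})

# One row per sitemap file; flag any file over the 50 MB limit.
size_df = sitemap_df.drop_duplicates("sitemap")[["sitemap", "sitemap_size_mb"]]
oversized = size_df[size_df["sitemap_size_mb"] > 50]
print(oversized["sitemap"].tolist())
```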
15. Check The Open Graph URL And Canonical URL Matching
It is not a secret that search engines also use the Open Graph and RSS Feed URLs from the source code for further canonicalization and exploration.
The Open Graph URLs should be the same as the canonical URL submission.
From time to time, even in Google Discover, Google chooses to use the image from the Open Graph.
To check the Open Graph URL and Canonical URL consistency, use the code block below.
for i, df_chunk in enumerate(df_iterator):
    if "og:url" in df_chunk.columns:
        output_df = pd.DataFrame(data={
            "canonical": df_chunk["canonical"],
            "og:url": df_chunk["og:url"],
            "open_graph_canonical_consistency": df_chunk["canonical"] == df_chunk["og:url"]})
        mode = "w" if i == 0 else "a"
        header = i == 0
        output_df.to_csv(
            "open_graph_canonical_consistency.csv",
            index=False,
            header=header,
            mode=mode)
    else:
        print("There is no Open Graph URL Property")
OUTPUT>>>
There is no Open Graph URL Property
If the website has an Open Graph URL property, the code will output a CSV file for checking whether the canonical URL and the Open Graph URL are the same.
But for this website, we don’t have an Open Graph URL.
Thus, I have used another website for the audit.
if "og:url" in df_meta_check.columns:
    output_df = pd.DataFrame(data={
        "canonical": df_meta_check["canonical"],
        "og:url": df_meta_check["og:url"],
        "open_graph_canonical_consistency": df_meta_check["canonical"] == df_meta_check["og:url"]})
    output_df.to_csv(
        "df_og_url_canonical_audit.csv",
        index=False)
else:
    print("There is no Open Graph URL Property")

df = pd.read_csv("df_og_url_canonical_audit.csv")
df
You can see the result below.
Python SEO Open Graph URL Audit.
We see that all canonical URLs and the Open Graph URLs are the same.
16. Check The Duplicate URLs Within Sitemap Submissions
A sitemap index file shouldn’t have duplicated URLs across different sitemap files or within the same sitemap XML file.
Duplicated URLs within the sitemap files can make a search engine crawl the sitemap files less, since a certain percentage of each file is bloated with unnecessary submissions.
In certain situations, it can even appear as a spam attempt to manipulate the crawling schemes of the search engine crawlers.
Use the code block below to check for duplicate URLs within the sitemap submissions.
sitemap_df["loc"].duplicated().value_counts()
You can see that 49,574 URLs from the sitemap are duplicated.
Python SEO Duplicated URL Audit from the Sitemap XML Files
To see which sitemaps have more duplicated URLs, use the code block below.
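A sketch of that per-sitemap breakdown, grouping the duplicated rows by the sitemap file they came from; the stand-in data uses hypothetical URLs, and the "sitemap" column (the source file of each URL) matches what `sitemap_to_df` returns.

```python
import pandas as pd

# Stand-in: one URL submitted by two different sitemap files.
sitemap_df = pd.DataFrame({
    "loc": ["https://example.com/a", "https://example.com/a",
            "https://example.com/b"],
    "sitemap": ["https://example.com/sitemap1.xml",
                "https://example.com/sitemap2.xml",
                "https://example.com/sitemap1.xml"]})

# Count each sitemap file's duplicated URL submissions.
dup_counts = (sitemap_df[sitemap_df["loc"].duplicated()]
              .groupby("sitemap")["loc"]
              .count()
              .sort_values(ascending=False))
print(dup_counts)
```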