A few years ago, a client came to me frantic: their entire website had disappeared from Google.
Traffic dropped from thousands of monthly visitors to nearly zero overnight. After some
investigation, I found the culprit—a single line in their robots.txt file. During a hosting
migration, someone had temporarily added “Disallow: /” to prevent search engines from crawling the
incomplete migration site. When the migration finished, nobody removed that line. For six weeks,
Google obediently stopped crawling the site, and rankings plummeted.

That’s an extreme example, but it illustrates how profoundly sitemaps and robots.txt affect your
site’s relationship with search engines. These aren’t glamorous topics—there’s no viral potential in
discussing XML sitemap protocol—but getting them wrong can be catastrophic, and getting them right
creates a solid foundation for all other SEO efforts.

I’ve configured sitemaps and robots.txt for more websites than I can easily count. Most of them were
WordPress sites, which have specific considerations worth addressing. This guide covers not just
what these files do but how to configure them properly, what mistakes to avoid, and how to verify
everything is working correctly.

Understanding What These Files Actually Do

Before diving into configuration specifics, let’s establish exactly what sitemaps and robots.txt
accomplish. They serve related but distinct purposes, and understanding this distinction prevents
common misconceptions.

XML Sitemaps: Your Site’s Directory

An XML sitemap is essentially a directory of your website’s pages that you’re inviting search engines
to crawl and potentially index. Think of it as a table of contents submitted directly to
Google—“here are all the pages I have, and here’s when each one was last updated.”

Sitemaps help search engines discover pages they might not find through normal crawling. If you have
a page that’s not linked from anywhere else on your site (an orphan page), a sitemap ensures Google
knows it exists. If you’ve published new content, a sitemap with accurate last-modified dates
signals that fresh content is available.

What sitemaps explicitly do not do is guarantee indexing. I can’t emphasize this enough because I
encounter this misconception constantly. Being in a sitemap doesn’t mean Google will index a page—it
just means you’re suggesting Google look at it. Google makes its own decisions about what deserves
indexing based on content quality, competing pages, and many other factors.

Sitemaps also don’t improve rankings directly. A page doesn’t rank better because it’s in a sitemap.
The sitemap facilitates discovery and crawling; content quality and relevance determine ranking.

Robots.txt: Access Control Instructions

Robots.txt is a simple text file that provides instructions to search engine crawlers about which
parts of your site they should and shouldn’t access. It lives at the root of your domain
(example.com/robots.txt) and contains directives that crawlers read before crawling your site.

The primary use case is preventing search engines from wasting resources crawling pages that aren’t
valuable to index—administrative areas, search results pages, staging environments, and similar
content that shouldn’t appear in search results.

Here’s the critical concept that trips up many people: robots.txt controls crawling, not indexing.
The “Disallow” directive tells crawlers not to request a URL, but it doesn’t prevent that URL from
appearing in search results. If other sites link to a page you’ve disallowed in robots.txt, Google
may still index it—they just won’t see its content, so it appears as a cryptic listing with no
description.

For preventing indexing, you need different tools: the noindex meta tag or the X-Robots-Tag header.
Robots.txt prevents crawling; noindex prevents indexing. Confusing these concepts causes significant
problems.
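
To make the distinction concrete, here is what each of those looks like (how you add the header
depends on your server or SEO plugin):

  In the page’s HTML head:
  <meta name="robots" content="noindex">

  Or as an HTTP response header, useful for non-HTML files such as PDFs:
  X-Robots-Tag: noindex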

How They Work Together

Sitemaps and robots.txt complement each other. Your sitemap lists pages you want search engines to
find and potentially index. Your robots.txt steers crawlers away from pages you don’t want them to
access. The second file also typically includes a reference to the first, telling crawlers where to
find your sitemap.

When a search engine encounters your site, it first checks robots.txt to understand access rules. It
then uses your sitemap (if referenced in robots.txt or submitted through Search Console) as a
directory of pages to consider crawling, respecting the access rules defined in robots.txt.

XML Sitemap Configuration: What to Include and Exclude

Effective sitemap configuration requires thoughtful decisions about what belongs in your sitemap and
what doesn’t. The goal isn’t to list every URL on your site—it’s to list the URLs you actually want
search engines to discover and potentially index.

Pages That Belong in Your Sitemap

Your sitemap should include pages that meet these criteria: they’re meant for public consumption, they
provide value to search users, and they’re not blocked from indexing through other means.

This typically includes all published blog posts and articles (the core content you want ranking),
important static pages like your homepage, about page, contact page, and service pages, category and
archive pages that aggregate your content meaningfully, and product pages if you run an e-commerce
site.

For each URL, sitemaps can include optional metadata. The lastmod field indicates when the page was
last modified—keep this accurate, as inaccurate dates can reduce Google’s trust in your sitemap
data. The changefreq field suggests how often content changes (daily, weekly, monthly), though
Google largely ignores this. The priority field indicates relative importance of pages on your site,
though Google reportedly ignores this too.

I focus primarily on ensuring URLs are correct and lastmod dates are accurate. The other fields
provide marginal value at best.
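
For reference, a complete sitemap entry is only a few lines of XML; the URL and date below are
placeholders:

  <?xml version="1.0" encoding="UTF-8"?>
  <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url>
      <loc>https://example.com/sample-post/</loc>
      <lastmod>2024-01-15</lastmod>
    </url>
  </urlset>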

Pages That Don’t Belong in Your Sitemap

Equally important is knowing what to exclude. Your sitemap shouldn’t include pages you’ve marked as
noindex—including them creates conflicting signals (sitemap says “index this” while noindex says
“don’t”). SEO plugins typically handle this automatically, but custom implementations need explicit
exclusion logic.

Exclude administrative and login pages, which aren’t meant for search users. Exclude search results
pages, which create infinite URL possibilities from query combinations. Exclude paginated archives
beyond the first page (page 2, page 3, etc. of category listings)—while Google can handle these,
they’re not priority crawling targets. Exclude URLs that redirect elsewhere; sitemaps should contain
only directly-accessible canonical URLs.

Tag archive pages often warrant exclusion or noindexing entirely. Tags frequently create thin content
pages—a tag used on three posts creates an archive page with essentially the same content as those
posts but less context. Unless your tagging system creates genuinely valuable aggregate pages,
consider excluding them from sitemaps and noindexing them entirely.

Author archive pages depend on your site structure. Multi-author sites where each author has a
substantial portfolio might benefit from indexed author pages. Single-author sites gain nothing from
author archives—they duplicate the main blog archive.

WordPress Sitemap Options

WordPress 5.5 introduced native sitemap functionality accessible at /wp-sitemap.xml. This basic
implementation works but offers limited control. Most SEO professionals prefer plugin-generated
sitemaps for their additional features and configuration options.

Yoast SEO creates a sitemap index at /sitemap_index.xml with separate sitemaps for different content
types: post-sitemap.xml for posts, page-sitemap.xml for pages, category-sitemap.xml for categories,
and so on. This separation makes reviewing sitemap contents more manageable and stays within size
limits for large sites.

Configuring Yoast sitemaps involves navigating to SEO > Settings > Site Features, where you can
enable or disable sitemaps and configure which content types are included. Under each content type’s
settings, you can exclude specific post types from sitemaps and configure whether taxonomies
(categories, tags) appear.

Rank Math offers similar functionality with additional options. The sitemap settings allow granular
control over which post types, taxonomies, and even individual pages appear in sitemaps.

When using any SEO plugin, visit your sitemap URLs directly after configuration to verify the output
matches your expectations. Common issues include plugins defaulting to including content types you
wanted excluded or configuration not taking effect due to caching.

Robots.txt Configuration: Controlling Crawler Access

Robots.txt configuration requires understanding the directive syntax and making thoughtful decisions
about what to block. The stakes are high—a misconfigured robots.txt can prevent your entire site
from being crawled.

Understanding Robots.txt Syntax

Robots.txt uses a straightforward syntax. Each section starts with a User-agent line specifying which
crawler the rules apply to, followed by Allow and Disallow directives specifying what that crawler
can access.

“User-agent: *” means “these rules apply to all crawlers.” You can also specify particular crawlers
like “User-agent: Googlebot” or “User-agent: Bingbot” for crawler-specific rules, but most sites
don’t need this granularity.

“Disallow: /folder/” tells crawlers not to access URLs starting with /folder/. “Allow:
/folder/exception.html” creates an exception within a broader disallow rule. These directives use
pattern matching, where * serves as a wildcard and $ indicates end of URL.
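
A few illustrative rules show how the matching works (the paths here are invented for the example):

  User-agent: *
  # * matches any sequence of characters, so this blocks internal search URLs on any path
  Disallow: /*?s=
  # $ anchors the rule to the end of the URL, so this blocks URLs ending in .pdf
  Disallow: /*.pdf$
  # Allow carves an exception out of a broader Disallow
  Disallow: /downloads/
  Allow: /downloads/catalog.html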

The Sitemap directive specifies your sitemap location and should use the full absolute URL. All major
search engines support it, making robots.txt an effective place to declare sitemap locations.

Recommended WordPress Robots.txt Configuration

For most WordPress sites, a sensible robots.txt needs only a few directives. User-agent: * covers all crawlers.
Disallow: /wp-admin/ blocks the administrative area that search users shouldn’t access. Allow:
/wp-admin/admin-ajax.php creates an exception for AJAX functionality that some themes and plugins
require for proper page rendering.

Optionally, you might add Disallow: /wp-includes/ though this is less critical. You should add
Disallow: /?s= and Disallow: /search/ to prevent crawling of internal search results (though noindex
on search results pages is also necessary). And always include Sitemap:
https://yoursite.com/sitemap_index.xml with your actual sitemap URL.
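
Assembled into an actual file, that configuration is only a handful of lines (yoursite.com and the
sitemap path are placeholders for your own values):

  User-agent: *
  Disallow: /wp-admin/
  Allow: /wp-admin/admin-ajax.php
  Disallow: /?s=
  Disallow: /search/

  Sitemap: https://yoursite.com/sitemap_index.xml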

What you should not block includes wp-content, specifically your themes, images, JavaScript, and CSS.
Google needs access to these resources to render pages properly. Blocking them causes “blocked
resources” warnings in Search Console and can affect how Google understands your pages.

What Not to Block (Common Mistakes)

The worst robots.txt mistake is blocking everything with Disallow: /. This single line instructs all
crawlers to stay away from your entire site. It happens more often than you’d expect, usually from
development settings that weren’t changed after launch or from copy-paste errors.

Blocking CSS and JavaScript is another common error. Years ago, convention was to block these files.
Now, Google explicitly recommends making them accessible so it can properly render pages. If your
robots.txt blocks /wp-content/themes/ or similar resource directories, remove those blocks.

Blocking uploads with /wp-content/uploads/ prevents Google from accessing your images, which hurts
image search visibility and can affect core content rendering. You want your images accessible.

Overly aggressive query string blocking using Disallow: /*? blocks all URLs with query strings,
including legitimate ones like pagination parameters that some themes use. Be specific about which
query patterns to block rather than catching everything.
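
For example, rather than the blanket rule, block only the parameters you know add no value (the
parameter names below are illustrative; substitute the ones your site actually generates):

  # Too broad: blocks every URL that has a query string
  # Disallow: /*?

  # Better: target specific low-value parameters
  Disallow: /*?orderby=
  Disallow: /*?sessionid=
  Disallow: /*?replytocom=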

The Disallow vs. Noindex Critical Distinction

I’ve seen this confusion cause serious problems multiple times. Site owners want to prevent a page
from appearing in search results, so they add a Disallow rule in robots.txt. The page still appears
in search results—sometimes with a snippet reading “A description for this result is not available”—because
Google indexed it based on external links without crawling its content.

To prevent indexing, use the noindex meta tag or X-Robots-Tag header. To prevent crawling while still
allowing potential indexing through external signals, use robots.txt Disallow. In most cases, if you
don’t want a page indexed, you want noindex, not robots.txt blocking.

There’s also a common interaction problem: if you block a page via robots.txt and try to use a
noindex meta tag on it, Google can’t see the noindex directive because you’ve told it not to crawl
the page. This creates a situation where the page might get indexed anyway from external links, and
your noindex instruction never gets read.

WordPress-Specific Considerations

WordPress has particular behaviors around sitemaps and robots.txt that warrant attention.
Understanding these saves troubleshooting time later.

Virtual vs. Physical Robots.txt

WordPress generates a virtual robots.txt by default—it doesn’t exist as an actual file but is
generated dynamically when requested. This works fine and allows plugins to modify it
programmatically.
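
If you’re curious how plugins do that, WordPress builds the virtual file through its robots_txt
filter, which a plugin or a small theme snippet can hook into. A minimal sketch (the extra Disallow
rule and the sitemap path are just examples):

  <?php
  // Append rules to WordPress's virtual robots.txt output.
  // $public reflects the "Discourage search engines" setting (the blog_public option).
  add_filter( 'robots_txt', function ( $output, $public ) {
      if ( $public ) {
          $output .= "Disallow: /?s=\n";
          $output .= 'Sitemap: ' . home_url( '/sitemap_index.xml' ) . "\n";
      }
      return $output;
  }, 10, 2 );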

If you create a physical robots.txt file in your webroot, it takes precedence over WordPress’s
virtual version. This creates potential for conflicts: you might configure robots.txt through an SEO
plugin but have a physical file overriding those settings. If your robots.txt settings aren’t taking
effect, check for a physical file that may be overriding your plugin configuration.

The “Search Engine Visibility” Setting

WordPress includes a setting under Settings > Reading called “Discourage search engines from indexing
this site.” When checked, WordPress adds a noindex robots meta tag to every page (older versions also
added a Disallow rule to the virtual robots.txt; current releases rely on the meta tag alone).
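
A quick way to confirm the setting’s effect is to view a page’s HTML source: with the box checked,
WordPress outputs a robots meta tag along these lines (the exact content attribute varies by version):

  <meta name='robots' content='noindex, nofollow' />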

This setting is intended for development and staging sites. The problem is that it sometimes gets
left checked after launch, or gets accidentally checked during troubleshooting. It’s a common cause
of “my site disappeared from Google” problems.

Always verify this setting is unchecked for production sites. After any development work or plugin
updates, confirm it hasn’t changed. Some SEO plugins add their own visibility controls—ensure you
understand which settings control what.

Plugin Conflicts and Caching Issues

Multiple plugins affecting sitemaps or robots.txt can create conflicts. If you’re using Yoast SEO for
sitemaps but also have an old XML sitemap plugin installed, you might have competing sitemaps. If
you’ve manually modified robots.txt through a plugin and then installed another SEO plugin that
generates its own robots.txt, you might have conflicting rules.

Caching can delay robots.txt changes from taking effect. If you update robots.txt through WordPress
but it’s being cached by server-side caching, CDN, or hosting provider caching, the old version
might be served. After making changes, verify by loading your robots.txt URL in a browser (bypass
cache with Ctrl+F5 or private browsing) or use Search Console’s robots.txt tester, which fetches a
fresh copy.

Verification and Troubleshooting

Configuration is half the work—verification ensures your configuration is actually applied and
working correctly. I run through a standard verification checklist after any sitemap or robots.txt
changes.

The Verification Checklist

Start by directly accessing your robots.txt file by entering yoursite.com/robots.txt in a browser.
Verify it loads correctly (status 200, not 404 or error). Review the contents—are they what you
expect? Check for any unexpected Disallow rules, especially Disallow: / that would block everything.
Verify your Sitemap directive is present and points to the correct URL.

Next, access your sitemap by visiting the URL specified in your Sitemap directive. If using a sitemap
index, click through to individual sitemaps. Verify they load without errors (valid XML, not 404).
Review included URLs—are your important pages listed? Check for URLs that shouldn’t be included
(noindexed pages, redirecting URLs, thin archive pages).

Use these Search Console tools for additional verification. The URL Inspection tool lets you check if
specific pages are blocked by robots.txt. The Index Coverage report shows which pages Google has
crawled, indexed, or excluded with reasons. The Sitemaps report confirms your sitemap is functional
and how many URLs Google has discovered from it.

Common Problems and Fixes

If your sitemap returns 404, several causes are possible. Your sitemap plugin might not be properly
activated or configured. Your permalink settings might need resaving (WordPress > Settings >
Permalinks, just click save without changes). Server caching might be serving a stale 404
response—clear caches and try again.

If robots.txt contains unexpected rules, check for physical robots.txt files that might be taking
precedence. Check multiple plugins that might be modifying robots.txt. Review recent changes—did a
plugin update change settings?

If Search Console reports “Submitted URL seems to be a Soft 404” for sitemap URLs, the listed pages
return a 200 status but appear to Google to have no meaningful content. Check whether the pages
actually have substantive content. Consider
whether thin archive pages should be excluded from sitemaps. Ensure your site isn’t returning
“empty” pages due to JavaScript rendering issues.

If Search Console reports “Blocked by robots.txt” for pages you want indexed, your robots.txt is too
restrictive. Review your Disallow rules. Use the Search Console robots.txt tester to understand
exactly which rule is causing the block. Modify robots.txt to allow access to intended pages.

The Development-to-Production Transition

Many problems originate from development settings that persist into production. Before launching any
site or after any major update, explicitly check: WordPress “Discourage search engines” setting is
unchecked. Robots.txt doesn’t contain development-era blocks (Disallow: / or similar). Sitemap URLs
are correct for the production domain (not staging domain references). Password protection or IP
restrictions that might have been used during development are removed.

I’ve seen professionally built sites launch with robots.txt blocking everything because the
developer’s standard practice was to block crawling during development and nobody included “remove
robots.txt blocking” in the launch checklist.

Sitemap and Robots.txt for Multi-Site or Subdomain Setups

More complex site architectures require additional consideration. Each scenario has distinct
requirements.

WordPress Multisite

WordPress Multisite installations share a single WordPress install across multiple sites. Each site
in the network should have its own sitemap accessible at its own domain or subdomain. Robots.txt
behavior depends on your configuration—subdomain installations have separate robots.txt per
subdomain, while subdirectory installations share a robots.txt at the root domain.

For subdirectory multisites, carefully configure robots.txt to handle all sites appropriately. Every
rule in that single file applies to the entire network, and a rule like Disallow: /wp-admin/ only
matches the main site’s admin path, not a subsite’s /subsite/wp-admin/. Consider whether uniform
rules make sense or whether you need rules for each subsite’s paths.

Subdomain vs. Subdirectory Content

If your main site is at example.com and you have blog.example.com or shop.example.com, each subdomain
has its own robots.txt and should have its own sitemap. Search engines treat subdomains as separate
websites, so they need independent configuration.

Subdirectory structures (example.com/blog/, example.com/shop/) share robots.txt with the main domain.
Your sitemap can be a single consolidated sitemap or separate sitemaps for different sections,
referenced from a sitemap index.

Multiple Sitemaps

Large sites often use multiple sitemaps organized by content type or section. This is perfectly fine
and often preferable—it keeps individual sitemaps smaller and makes reviewing contents easier.

When using multiple sitemaps, create a sitemap index that references all child sitemaps. Submit the
index to Search Console rather than individual sitemaps. Include the index URL in your robots.txt
Sitemap directive. Ensure all child sitemaps are accessible and valid.
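
The index itself is a short XML file that simply points at each child sitemap; the domain and file
names below are placeholders:

  <?xml version="1.0" encoding="UTF-8"?>
  <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <sitemap>
      <loc>https://example.com/post-sitemap.xml</loc>
      <lastmod>2024-01-15</lastmod>
    </sitemap>
    <sitemap>
      <loc>https://example.com/page-sitemap.xml</loc>
    </sitemap>
  </sitemapindex>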

Advanced Considerations

Beyond basic configuration, several advanced topics affect how sitemaps and robots.txt work in
practice.

Crawl Budget and Large Sites

For sites with millions of pages, crawl budget—the amount of crawling Google will do on your
site—becomes a significant consideration. Robots.txt helps preserve crawl budget by blocking truly
unnecessary pages from being crawled.

For most small-to-medium sites (under 10,000 pages), crawl budget isn’t a practical concern. Google
will crawl your entire site regularly. But if you’re running a large e-commerce site or content
repository, thoughtful robots.txt blocking of low-value URL patterns (faceted navigation, session
IDs, sorting parameters) preserves crawl resources for important pages.

Handling Parameters and Query Strings

URLs with query strings (?sort=price, ?page=2, ?utm_source=twitter) create potential duplicates or
low-value pages. Some of these deserve crawling; others don’t.

Rather than blocking all query strings via robots.txt (which can cause problems), use Google Search
Console’s URL Parameters tool to communicate how different parameters affect page content. You can
also use canonical tags to consolidate parameter variations to primary URLs while still allowing
crawling.

For tracking parameters (?utm_source, ?fbclid), most sites benefit from allowing crawling but using
canonical tags pointing to the non-parameterized version. This lets you track referral traffic while
preventing duplicate indexing.
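
So a campaign URL such as /pricing/?utm_source=twitter would carry a canonical tag pointing back at
the clean URL (the page path is just an example):

  <link rel="canonical" href="https://example.com/pricing/">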

JavaScript-Rendered Content and Sitemaps

If your site relies heavily on JavaScript for content rendering, sitemaps become more important for
discovery since Google might not find pages through traditional link crawling if those links are
JavaScript-generated.

Ensure your sitemap includes all JavaScript-rendered pages. Consider whether dynamic content that
changes based on user interaction (infinite scroll, tabs, filters) should generate distinct URLs
that belong in sitemaps.

Mobile and AMP Considerations

If you have separate mobile URLs (m.yoursite.com) or AMP pages, these should have their own sitemap
entries or separate sitemaps. Ensure robots.txt for mobile subdomains doesn’t block content that
should be crawled.

For responsive sites using the same URLs for mobile and desktop (the recommended approach), no
special sitemap or robots.txt consideration is needed—Google crawls the same URLs and renders them
for both mobile and desktop.

Monitoring and Maintenance

Sitemaps and robots.txt require ongoing attention, not just initial configuration. Search engines
evolve, your site changes, and problems can develop silently.

Regular Verification Schedule

Monthly, check Search Console’s Index Coverage report for any crawling or indexing issues. Verify
your sitemap status shows recent successful fetches. Review any new errors or warnings related to
crawl access.

Quarterly, directly review your robots.txt and sitemap contents. Verify they still reflect your
current site structure and indexing intentions. Check for new content types or sections that might
need sitemap or robots.txt adjustments.

After major changes like CMS updates, hosting migrations, plugin updates, or site redesigns,
immediately verify sitemaps and robots.txt are functioning correctly. These events commonly cause
configuration issues.

Search Console Alerts

Set up Search Console email notifications to alert you about significant crawling or indexing issues.
Google will notify you about problems like: inability to access your robots.txt, sitemap errors or
warnings, significant indexing changes, and crawl anomalies.

These alerts catch problems quickly before they cause extended traffic loss.

Conclusion

Sitemaps and robots.txt aren’t exciting SEO topics, but they’re foundational. Proper configuration
ensures search engines can discover your content and respects your intentions about what should and
shouldn’t be crawled. Mistakes in either file can devastate your search visibility, while correct
configuration creates a solid technical foundation for all other SEO efforts.

The key concepts worth remembering: sitemaps list pages you want discovered but don’t guarantee
indexing. Robots.txt controls crawling but doesn’t prevent indexing—use noindex for that. WordPress
has specific considerations around virtual files, settings, and plugins that merit attention.
Verification is essential after any configuration change.

Take time to review your current sitemap and robots.txt configuration. Verify everything is correct
using Search Console and direct access. Establish a maintenance routine that catches problems before
they compound. These unglamorous files deserve more attention than they typically receive—getting
them right is essential for SEO success.