Sitemaps and robots.txt are foundational to how search engines discover and crawl your site. A properly configured sitemap helps search engines find all your important pages, while robots.txt guides which pages to crawl. Misconfiguration of either can block indexing or waste crawl budget. This guide explains what both files do, how to configure them for WordPress, and common mistakes to avoid.

đź“‹ Key Takeaways
  • Sitemaps list pages you want indexed; robots.txt controls crawler access
  • XML sitemaps should include only indexable pages
  • Robots.txt Disallow does not prevent indexing—only noindex does
  • Always include sitemap reference in robots.txt

I. Understanding XML Sitemaps

A. What Sitemaps Do

  • Page discovery: Sitemaps tell search engines which pages exist on your site.
  • Crawl hints: A last modified date (<lastmod>) signals recently updated content; see the example entry after this list.
  • Priority signals: Optional <priority> values suggest relative page importance, though Google has said it ignores them.
  • Media inclusion: Image and video sitemaps can help with media indexing.
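
For reference, each sitemap entry is simply a URL plus optional metadata. A minimal sketch with a placeholder URL and date:

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>https://yoursite.com/sample-post/</loc>
        <lastmod>2024-01-15</lastmod>
        <priority>0.8</priority>
      </url>
    </urlset>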

B. What Sitemaps Do NOT Do

  • Guarantee indexing: Being in a sitemap does not guarantee Google will index the page.
  • Override noindex: Pages marked noindex will not be indexed even if in the sitemap.
  • Improve rankings: Sitemaps help discovery, not ranking. Content quality determines rankings.

II. WordPress Sitemap Configuration

A. Built-in WordPress Sitemaps

  • Core sitemaps: WordPress 5.5+ includes native sitemap functionality at /wp-sitemap.xml.
  • Limitations: Core sitemaps are basic. SEO plugins provide more control.
  • Plugin preference: Most SEO plugins disable core sitemaps and replace them with their own; the snippet below shows how that switch works.
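
If you want the core sitemap switched off yourself (or want to see what SEO plugins do behind the scenes), WordPress provides a filter for exactly this. A minimal sketch using the core wp_sitemaps_enabled filter, placed in a small custom plugin or your theme's functions.php:

    <?php
    // Disable the native /wp-sitemap.xml introduced in WordPress 5.5.
    add_filter( 'wp_sitemaps_enabled', '__return_false' );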

B. SEO Plugin Sitemaps

  • Yoast SEO: Creates a sitemap index at /sitemap_index.xml with separate child sitemaps for posts, pages, and taxonomies (structure sketched below).
  • Rank Math: Similar structure with additional configuration options.
  • Configuration: Control which post types and taxonomies are included in plugin settings.
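
A sitemap index is just a list of child sitemaps. The file Yoast serves at /sitemap_index.xml follows roughly this structure; the child sitemap names and URLs below are placeholders and vary by plugin and configuration:

    <?xml version="1.0" encoding="UTF-8"?>
    <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <sitemap>
        <loc>https://yoursite.com/post-sitemap.xml</loc>
        <lastmod>2024-01-15T08:00:00+00:00</lastmod>
      </sitemap>
      <sitemap>
        <loc>https://yoursite.com/page-sitemap.xml</loc>
      </sitemap>
    </sitemapindex>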

C. What to Include

  • Indexable posts: All posts you want search engines to find and index.
  • Indexable pages: Main pages including homepage, about, contact, and service pages.
  • Category archives: If you want category pages indexed (often yes for publishers).
  • Author archives: Only worthwhile if the site has multiple authors producing distinct content; a single-author archive duplicates the main blog listing and is often better noindexed.

D. What to Exclude

  • Noindexed pages: Pages marked noindex should not appear in the sitemap; most SEO plugins exclude them automatically (see the filter sketch after this list for broader exclusions).
  • Admin/login pages: Internal pages not meant for search users.
  • Tag archives: Often thin content—consider excluding or noindexing.
  • Date archives: Monthly/yearly archives usually do not warrant indexing.
  • Search results: Internal search result pages should never be in sitemaps.
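
Most of these exclusions are plugin settings, but Yoast also allows code-level control. A minimal sketch, assuming Yoast's wpseo_sitemap_exclude_post_type filter and a hypothetical 'portfolio' post type:

    <?php
    // Exclude the hypothetical 'portfolio' post type from Yoast's XML sitemap.
    // Place in a small custom plugin or your theme's functions.php.
    add_filter( 'wpseo_sitemap_exclude_post_type', function ( $excluded, $post_type ) {
        return ( 'portfolio' === $post_type ) ? true : $excluded;
    }, 10, 2 );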

III. Understanding Robots.txt

A. What Robots.txt Does

  • Crawler directive: Tells search engine crawlers which URLs they should or should not request.
  • Crawl budget management: Prevents wasting crawl resources on unimportant pages.
  • Sitemap location: Provides location of XML sitemaps for crawler discovery.

B. Critical Limitation

Robots.txt Disallow does not prevent indexing—this is a common and serious misconception.

  • Disallow vs noindex: Disallow prevents crawling but does not prevent indexing. If other pages link to a blocked page, it can still appear in search results.
  • For blocking indexing: Use a noindex meta tag or an X-Robots-Tag header, not robots.txt (examples after this list).
  • Sensitive content: Never rely on robots.txt to hide sensitive information. It is publicly readable.
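
To keep a page out of the index, the directive has to be delivered with the page itself so crawlers can see it. For HTML pages, a robots meta tag in the <head> is the standard form (most SEO plugins set this for you):

    <meta name="robots" content="noindex">

For non-HTML files such as PDFs, the equivalent HTTP response header does the same job:

    X-Robots-Tag: noindex

Either way, the URL must remain crawlable: if robots.txt blocks it, Google never fetches the page and never sees the noindex directive.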

IV. WordPress Robots.txt Configuration

A. Default WordPress Robots.txt

If no physical robots.txt file exists in the site root, WordPress serves a virtual one that allows all crawlers apart from a few WordPress-specific paths.
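
On a typical WordPress 5.5+ install the virtual file looks something like this; the Sitemap line appears when core sitemaps are enabled, and an SEO plugin or a physical robots.txt file replaces this output entirely:

    User-agent: *
    Disallow: /wp-admin/
    Allow: /wp-admin/admin-ajax.php

    Sitemap: https://yoursite.com/wp-sitemap.xml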

B. Recommended Configuration

  • Allow general access: Do not block legitimate crawlers from content pages.
  • Block wp-admin: Use Disallow: /wp-admin/ but Allow: /wp-admin/admin-ajax.php.
  • Block wp-includes: Generally unnecessary; blocking it can stop Google from fetching core scripts and styles it needs for rendering, so leave it crawlable.
  • Add sitemap reference: Include Sitemap: https://yoursite.com/sitemap_index.xml (full example below).
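
Putting those rules together, a reasonable starting point for a typical WordPress site is close to the default, with the sitemap reference pointed at whatever your SEO plugin actually generates:

    User-agent: *
    Disallow: /wp-admin/
    Allow: /wp-admin/admin-ajax.php

    Sitemap: https://yoursite.com/sitemap_index.xml

The Sitemap line is read independently of the User-agent groups, so it can sit anywhere in the file.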

C. Common Blocks

  • Search results: Disallow: /?s= and Disallow: /search/
  • Query strings: Disallow: /*?* to block most query string variations (use carefully).
  • Feed URLs: Disallow: /feed/ if you do not want feeds crawled (see how these fit in below).
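
If you adopt any of these optional blocks, they go in the same User-agent: * group shown in the earlier example, for instance:

    # Appended to the "User-agent: *" group above
    Disallow: /?s=
    Disallow: /search/
    Disallow: /feed/

Test wildcard rules such as Disallow: /*?* before deploying them, since they also block legitimate parameterized URLs.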

V. Submitting to Search Console

A. Sitemap Submission

  • Navigate to Sitemaps: In Search Console, go to Sitemaps section.
  • Submit URL: Enter your sitemap URL (usually sitemap_index.xml or sitemap.xml).
  • Monitor status: Check back to verify successful reading and note any errors.

B. Robots.txt Testing

  • robots.txt report: Google retired the standalone robots.txt Tester; use Search Console's robots.txt report to confirm the file can be fetched, and the URL Inspection tool to check whether a specific URL is blocked.
  • Verify rules: Test that intended blocks work and content pages are accessible.

VI. Common Problems and Solutions

A. Sitemap Problems

  • Sitemap returns 404: The SEO plugin may not have sitemaps enabled, or permalinks need re-saving (Settings → Permalinks) to flush rewrite rules.
  • URLs redirect: All sitemap URLs should resolve directly, not through redirects.
  • Noindex pages included: Review plugin settings to exclude noindexed content types.
  • Stale last-modified dates: Ensure dates update when content actually changes.

B. Robots.txt Problems

  • Blocking entire site: Disallow: / blocks everything. Check for this mistake.
  • Blocking CSS/JS: Do not block CSS and JavaScript files—Google needs them for rendering.
  • Conflicting rules: For Google, the most specific (longest) matching rule wins; when an Allow and a Disallow match equally, the less restrictive Allow applies, so rule order does not decide the outcome.
  • Development settings in production: Ensure "Discourage search engines from indexing this site" (Settings → Reading) is unchecked after launch; while enabled, it tells search engines not to index the site.

VII. Verification Checklist

  • Access robots.txt: Visit yoursite.com/robots.txt and verify it loads correctly.
  • Check for blocks: Ensure no Disallow: / that blocks everything.
  • Verify sitemap reference: Sitemap URL should be listed in robots.txt.
  • Access sitemap: Visit your sitemap URL and verify it loads without errors.
  • Check sitemap contents: Review included URLs to ensure important pages are listed.
  • Submit to Search Console: Add sitemap and monitor for errors.
  • Test crawler access: Use the URL Inspection tool or a robots.txt checker to verify content pages are accessible to crawlers.

VIII. Conclusion

Properly configured sitemaps and robots.txt support healthy search engine crawling. Sitemaps should include all indexable pages and exclude noindexed content. Robots.txt should allow access to important content while blocking only truly unnecessary pages. Remember the critical distinction: robots.txt controls crawling, not indexing. For blocking indexing, use noindex directives instead. Submit your sitemap to Search Console and regularly verify both files remain correctly configured, especially after site updates or plugin changes that might affect them.

What sitemap or robots.txt issue have you encountered? Share your troubleshooting experience!