Experiencing Persistent Sitemap Generation Errors with Large Sites: Why Does My Tool Keep Crashing?

Asked 3 weeks ago

Hi folks, hoping for some expert advice here! Our 'Free XML Sitemap Generator' tool has been doing great, helping users quickly create sitemaps. However, we've hit a wall with larger websites.

When users try to generate sitemaps for sites with over 50,000 URLs through our website crawling process, the tool often crashes or times out. It's really frustrating because smaller sites work perfectly, but these larger crawls just don't complete, leaving users without a proper XML sitemap.

Here's a typical error log snippet we're seeing:

[2023-10-27 14:35:01] ERROR: SitemapGenerationService - Process timed out after 300 seconds for domain example.com
[2023-10-27 14:35:01] ERROR: MemoryLimitExceededException - PHP Fatal error: Allowed memory size of 256MB exhausted (tried to allocate 12345678 bytes) in /var/www/html/generator.php on line 145

We're looking for strategies to handle these large-scale sitemap generation requests more robustly. What are the best practices for optimizing the website crawling and sitemap creation process to avoid these timeouts and memory issues for huge sites?

Any insights or architectural suggestions would be hugely appreciated! Waiting for an expert reply. Thanks in advance!

2 Answers

Nala Diallo

Answered 3 weeks ago

Hi Lucia Sanchez,

The issues you're seeing, process timeouts and memory exhaustion, are very common when scaling any web crawling or data processing tool, and sitemap generation for large sites is especially resource-intensive. Your log shows both limits being hit: the 300-second timeout and the 256MB PHP memory limit, which is a real bottleneck if you hold 50,000+ URLs and their metadata in memory at once. You can start by raising memory_limit and max_execution_time in your php.ini, but that only moves the ceiling and is a short-term fix. For truly scalable sitemap generation, you need a more robust architectural approach.

To handle large-scale requests more effectively, consider implementing an asynchronous processing model. Instead of running the whole crawl synchronously inside one request, push crawl jobs onto a queue (backed by Redis, or by a message broker like RabbitMQ). Your main application then only dispatches jobs, and separate worker processes or microservices handle the actual crawling and sitemap construction. This lets you break the task into smaller, manageable chunks and process them in parallel.

For storing discovered URLs and their metadata, avoid keeping everything in memory; persist data incrementally to a database (MySQL, PostgreSQL, or even a NoSQL solution) as you crawl. This also opens up the possibility of distributed crawling, where multiple workers crawl different segments of a large site simultaneously.

Finally, optimizing your crawling logic to avoid unnecessary data fetching and processing can significantly reduce resource consumption.
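To make the "persist as you crawl" idea concrete, here is a minimal sketch. It is written in Python rather than PHP purely for illustration, and it uses SQLite as a stand-in for whichever database you choose. Every name in it (`open_frontier`, `enqueue`, `next_batch`, `mark_crawled`, `crawl`, `write_sitemaps`, the `urls` table, and the `fetch_links` callback) is hypothetical and not taken from your tool. The 50,000 figure is the per-file URL limit in the sitemaps.org protocol, so a site above that size needs several sitemap files in any case.

```python
import sqlite3
from pathlib import Path
from xml.sax.saxutils import escape

MAX_URLS_PER_FILE = 50_000  # per-file URL limit from the sitemaps.org protocol

SITEMAP_HEADER = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
)


def open_frontier(path=":memory:"):
    """Open (or create) the table that holds every discovered URL."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS urls ("
        " url TEXT PRIMARY KEY,"
        " crawled INTEGER NOT NULL DEFAULT 0)"
    )
    return db


def enqueue(db, urls):
    """Store newly discovered URLs; the primary key drops duplicates."""
    with db:  # commits on success
        db.executemany(
            "INSERT OR IGNORE INTO urls (url) VALUES (?)", ((u,) for u in urls)
        )


def next_batch(db, size=100):
    """Return up to `size` URLs that have not been crawled yet."""
    rows = db.execute(
        "SELECT url FROM urls WHERE crawled = 0 ORDER BY rowid LIMIT ?", (size,)
    )
    return [row[0] for row in rows]


def mark_crawled(db, urls):
    with db:
        db.executemany(
            "UPDATE urls SET crawled = 1 WHERE url = ?", ((u,) for u in urls)
        )


def crawl(db, fetch_links, batch_size=100):
    """Drain the frontier batch by batch.

    `fetch_links(url)` is an assumed callback that downloads one page and
    returns the same-site links found on it. Only one batch is in memory.
    """
    while True:
        batch = next_batch(db, batch_size)
        if not batch:
            break
        for url in batch:
            enqueue(db, fetch_links(url))
        mark_crawled(db, batch)


def write_sitemaps(db, out_dir, per_file=MAX_URLS_PER_FILE):
    """Stream crawled URLs into numbered sitemap files; return their names."""
    names, out, count = [], None, 0
    query = "SELECT url FROM urls WHERE crawled = 1 ORDER BY rowid"
    for (url,) in db.execute(query):
        if out is None or count == per_file:
            if out is not None:
                out.write("</urlset>\n")
                out.close()
            names.append(f"sitemap-{len(names) + 1}.xml")
            out = open(Path(out_dir) / names[-1], "w", encoding="utf-8")
            out.write(SITEMAP_HEADER)
            count = 0
        out.write(f"  <url><loc>{escape(url)}</loc></url>\n")
        count += 1
    if out is not None:
        out.write("</urlset>\n")
        out.close()
    return names
```

Because the frontier lives in a table rather than in an array, memory use stays roughly constant regardless of site size, and a worker that times out can resume from the uncrawled rows instead of starting over. With a server database, the same table could also serve as the shared work list for several workers, though that would need row locking or a proper queue, which this sketch leaves out. When more than one sitemap file is produced, the protocol also expects a sitemap index file listing them; that step is omitted here.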

What kind of infrastructure are you currently running this on?

Lucia Sanchez

Answered 3 weeks ago

So for handling those large crawls, would a serverless setup, with something like AWS Lambda doing the processing, be a more effective (or at least more modern) approach than managing separate worker processes and a message broker?
