Experiencing Persistent Sitemap Generation Errors with Large Sites: Why Does My Tool Keep Crashing?

Asked by Lucia Sanchez (Author), 4 days ago

Hi folks, hoping for some expert advice here! Our 'Free XML Sitemap Generator' tool has been doing great, helping users quickly create sitemaps. However, we've hit a wall with larger websites.

When users try to generate sitemaps for sites with over 50,000 URLs through our website crawling process, the tool often crashes or times out. It's really frustrating because smaller sites work perfectly, but these larger crawls just don't complete, leaving users without a proper XML sitemap.

Here's a typical error log snippet we're seeing:

[2023-10-27 14:35:01] ERROR: SitemapGenerationService - Process timed out after 300 seconds for domain example.com
[2023-10-27 14:35:01] ERROR: MemoryLimitExceededException - PHP Fatal error: Allowed memory size of 256MB exhausted (tried to allocate 12345678 bytes) in /var/www/html/generator.php on line 145

We're looking for strategies to handle these large-scale sitemap generation requests more robustly. What are the best practices for optimizing the website crawling and sitemap creation process to avoid these timeouts and memory issues for huge sites?

Any insights or architectural suggestions would be hugely appreciated! Waiting for an expert reply. Thanks in advance!

2 Answers

Nala Diallo
Answered 3 days ago

Hi Lucia Sanchez,

The issues you're seeing (process timeouts and memory exhaustion) are very common challenges when scaling any web crawling or data processing tool, especially for something as resource-intensive as sitemap generation on large sites. The 256MB PHP memory limit is definitely a bottleneck for processing 50,000+ URLs in memory. While you can initially try increasing PHP's memory_limit and max_execution_time in your php.ini, these are short-term fixes. For truly scalable sitemap generation, you need a more robust architectural approach.
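Whatever the php.ini limits are, the output step does not have to hold the whole URL set in memory. Here is a minimal sketch of streaming URLs straight to disk, written in Python for brevity even though your tool is PHP; the same structure should carry over. It assumes the sitemaps.org protocol limit of 50,000 URLs per sitemap file and rolls over to a new file plus a sitemap index when that limit is reached. The function name, the file naming scheme, and the `base_url` parameter are all made up for illustration.

```python
import os
import tempfile
from xml.sax.saxutils import escape

MAX_URLS_PER_FILE = 50_000  # sitemaps.org protocol limit per sitemap file
NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
XML_HEADER = '<?xml version="1.0" encoding="UTF-8"?>\n'


def write_sitemaps(urls, out_dir, base_url, max_per_file=MAX_URLS_PER_FILE):
    """Stream URLs into numbered sitemap files plus an index file.

    `urls` can be any iterator (for example a database cursor), so only one
    URL is held in memory at a time. Returns the sitemap file names written.
    """
    names = []
    out = None
    count = 0
    for url in urls:
        # Start a new file at the beginning and whenever the current one is full.
        if out is None or count == max_per_file:
            if out is not None:
                out.write("</urlset>\n")
                out.close()
            name = f"sitemap-{len(names) + 1}.xml"
            names.append(name)
            out = open(os.path.join(out_dir, name), "w", encoding="utf-8")
            out.write(XML_HEADER)
            out.write(f'<urlset xmlns="{NS}">\n')
            count = 0
        # escape() handles &, < and > so query strings stay valid XML.
        out.write(f"  <url><loc>{escape(url)}</loc></url>\n")
        count += 1
    if out is not None:
        out.write("</urlset>\n")
        out.close()

    # The index lists every sitemap file so crawlers can find them all.
    index_path = os.path.join(out_dir, "sitemap-index.xml")
    with open(index_path, "w", encoding="utf-8") as idx:
        idx.write(XML_HEADER)
        idx.write(f'<sitemapindex xmlns="{NS}">\n')
        for name in names:
            idx.write(f"  <sitemap><loc>{escape(base_url + name)}</loc></sitemap>\n")
        idx.write("</sitemapindex>\n")
    return names


if __name__ == "__main__":
    # Hypothetical input: a generator, so the 120,000 URLs are never all in memory.
    demo_urls = (f"https://example.com/page/{i}" for i in range(120_000))
    with tempfile.TemporaryDirectory() as tmp:
        print(write_sitemaps(demo_urls, tmp, "https://example.com/"))
```

With 120,000 URLs this writes three sitemap files and one index. Memory use stays roughly constant because each URL is written and then discarded.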

To handle large-scale requests more effectively, consider an asynchronous processing model. Instead of crawling inside the web request, push crawl jobs onto a queue (Redis lists or streams work, as does a dedicated broker like RabbitMQ) and return to the user immediately. Separate worker processes or microservices then pull jobs from the queue and do the actual crawling and sitemap construction, which lets you break the task into smaller, manageable chunks and process them in parallel. For discovered URLs and their metadata, avoid keeping everything in memory; persist them incrementally to a database (MySQL, PostgreSQL, or even a NoSQL store) as you crawl. This also opens up distributed crawling, where multiple workers crawl different segments of a large site simultaneously. Finally, trimming the crawl logic itself, for example by skipping non-HTML resources and never re-fetching a URL you have already seen, can significantly reduce resource consumption.
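To make the queue-plus-workers idea concrete, here is a rough, self-contained sketch, again in Python rather than PHP. Everything in it is a stand-in: the in-process `queue.Queue` plays the role of Redis or RabbitMQ, threads play the role of separate worker processes, SQLite plays the role of MySQL or PostgreSQL, and `FAKE_SITE` / `fetch_links` are hypothetical placeholders for the real HTTP fetch and link extraction.

```python
import queue
import sqlite3
import threading

# Hypothetical stand-in for a real site: each URL maps to the links found on it.
FAKE_SITE = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/b", "https://example.com/c"],
    "https://example.com/b": ["https://example.com/"],
    "https://example.com/c": [],
}


def fetch_links(url):
    """Placeholder for an HTTP request plus link extraction."""
    return FAKE_SITE.get(url, [])


class UrlStore:
    """Keeps discovered URLs in SQLite instead of an in-memory collection."""

    def __init__(self, path):
        # One shared connection guarded by a lock keeps the sketch simple.
        self._db = sqlite3.connect(path, check_same_thread=False)
        self._lock = threading.Lock()
        self._db.execute("CREATE TABLE IF NOT EXISTS urls (url TEXT PRIMARY KEY)")

    def add_if_new(self, url):
        """Insert the URL; return True only the first time it is seen."""
        with self._lock:
            cur = self._db.execute(
                "INSERT OR IGNORE INTO urls (url) VALUES (?)", (url,)
            )
            self._db.commit()
            return cur.rowcount == 1

    def all_urls(self):
        # Returns a list for the demo; a real generator would stream the rows.
        with self._lock:
            rows = self._db.execute("SELECT url FROM urls ORDER BY url")
            return [row[0] for row in rows]


def worker(jobs, store):
    """Pull URLs off the queue until a None sentinel arrives."""
    while True:
        url = jobs.get()
        try:
            if url is None:
                return
            for link in fetch_links(url):
                # Only enqueue links that were not already recorded.
                if store.add_if_new(link):
                    jobs.put(link)
        finally:
            jobs.task_done()


def crawl(start_url, db_path=":memory:", num_workers=4):
    store = UrlStore(db_path)
    jobs = queue.Queue()
    store.add_if_new(start_url)
    jobs.put(start_url)
    threads = [
        threading.Thread(target=worker, args=(jobs, store))
        for _ in range(num_workers)
    ]
    for t in threads:
        t.start()
    jobs.join()  # returns once every queued URL has been processed
    for _ in threads:
        jobs.put(None)  # one shutdown sentinel per worker
    for t in threads:
        t.join()
    return store.all_urls()


if __name__ == "__main__":
    for found in crawl("https://example.com/"):
        print(found)
```

The part that matters for your memory errors is that the database is the deduplication set, so the workers themselves hold only the page they are currently processing. Swapping the in-process queue for a real broker mainly changes how `jobs.get()` and `jobs.put()` are implemented.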

What kind of infrastructure are you currently running this on?

Lucia Sanchez
Answered 2 days ago

So for handling those large crawls, would a serverless setup, with something like AWS Lambda doing the processing, be a more effective (or simply more modern) approach than managing separate worker processes and a message broker?
