To get through as many pages as quickly as possible, I installed the program on 17 computers in the lab. On each computer I ran 5 instances of the program simultaneously, for 85 instances in total. This meant I had to split the file containing 4.36 million URIs from the Open Directory Project into 85 chunks and give each chunk to one program instance to download and analyze.
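The splitting step can be sketched roughly as follows. This is only an illustration of the idea, not the script that was actually used: the function name `split_into_chunks` and the stand-in URI list are made up for the example, and the real input was a file of 4.36 million lines.

```python
def split_into_chunks(items, n_chunks):
    """Split items into n_chunks contiguous parts whose sizes differ by at most one."""
    base, extra = divmod(len(items), n_chunks)
    chunks = []
    start = 0
    for i in range(n_chunks):
        # The first `extra` chunks take one additional item each.
        size = base + (1 if i < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


MACHINES = 17
INSTANCES_PER_MACHINE = 5
N_CHUNKS = MACHINES * INSTANCES_PER_MACHINE  # 85 program instances

# Hypothetical stand-in for the real URI file (assumption for illustration).
uris = [f"http://example.org/page{i}" for i in range(1000)]
chunks = split_into_chunks(uris, N_CHUNKS)
```

Each element of `chunks` would then be written to its own file and handed to one program instance. Keeping the chunks contiguous and nearly equal in size means no instance gets noticeably more work than another, at least as far as the number of URIs goes.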
All the results were stored in a MySQL database running on an 18th computer, with which all the other machines communicated.
I was running short of time. I had about a week to run the program, which wasn't enough to download and analyze all the URIs I had. So I just aimed for as many as I could.
The result of about 100 hours of continuous running between the 10th and 16th of April was 1.27 million URIs downloaded and analyzed (29% of the 4.36 million; see figure 2). That is not even half of what was initially planned, but still a considerable amount to compute some statistics from.
Figure 2. How the initial selection of URIs was trimmed down.
Although I had already trimmed down the initial selection of 4.7 million URIs from the Open Directory Project by removing duplicates and links to non-HTML pages, there were still more URIs to be removed.
First of all, the program was unable to download a few hundred thousand URIs. While the program was running, the internet connection was lost for a few hours, so I guess that accounts for most of these failed downloads.
I also left aside the data about pages that had an HTTP status code other than 200 OK (mostly 404 Not Found error pages) and pages that were simply empty (0 bytes in size).
I also removed a large number of pages where the program was unable to find even a single HTML element. If you look again at figure 2, you'll see that there were a great many pages like that (about a quarter of all the pages left). This doesn't look normal, and indeed it isn't: I checked those pages to see if they really were empty, but they weren't. They were just ordinary pages, but because of some bug in the program, no HTML elements were found in them. I have yet to identify the exact bug, so the only thing I can do with the gathered results is to exclude those pages from all the statistical analyses that follow.
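Taken together, the exclusion rules described above amount to a simple filter over the stored results. The sketch below is only my illustration of that logic: the `PageResult` record and its field names (`status`, `size`, `element_count`) are assumptions made for the example and do not reflect the actual database schema.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class PageResult:
    # Hypothetical record layout; the real MySQL columns are not known here.
    uri: str
    status: Optional[int]  # None when the download failed entirely
    size: int              # body size in bytes
    element_count: int     # number of HTML elements the program found


def exclusion_reason(page):
    """Return why a page is excluded from the statistics, or None to keep it."""
    if page.status is None:
        return "not downloaded"
    if page.status != 200:
        return "status other than 200 OK"
    if page.size == 0:
        return "empty (0 bytes)"
    if page.element_count == 0:
        return "no HTML elements found"
    return None


# Made-up sample records, one per case.
pages = [
    PageResult("http://example.org/a", 200, 5120, 87),
    PageResult("http://example.org/b", 404, 312, 9),
    PageResult("http://example.org/c", 200, 0, 0),
    PageResult("http://example.org/d", 200, 4096, 0),
    PageResult("http://example.org/e", None, 0, 0),
]

kept = [p for p in pages if exclusion_reason(p) is None]
reasons = Counter(exclusion_reason(p) for p in pages if exclusion_reason(p))
```

Checking the conditions in this order means each page is counted under exactly one reason, which is what a breakdown like the one in figure 2 needs.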
So... if you still believe that research with so many shortcomings is worth looking at, you can continue on to the results.
Written on 12 June 2006.