Chris Babcock <cbabcock@[EMAIL PROTECTED]> writes:
>> >The good news is that I'm starting to see some instances of "Item
>> >already checked" flashing by. This would mean that I probably don't
>> >have to wait another 16 hours or more for final results and that
>> >there may not be 3,000 broken links by the time it's finished
>> >checking the site...
>>
>> What you mean, I think, is that you're double-counting some of the
>> "repeated" broken links where the same link exists in many places and
>> of course each one is broken? Some things that are important.... how
>> HUGE the site is, how many links there are!!! And is 3% large or
>> small? I think that's about what I would have thought it was. I
>> think that's neither large nor small but about what such a huge site,
>> originally built over ten years ago, has as a legacy.
>There turned out to be a nasty bit of recursion in the site - items
>that are in the "DipPouch" folder on the site are physically located in
>the root folder... as is the link (in the filesystem sense) that
>redirects the traffic there. It's a clever thing to do in a couple
>ways. It was just inconvenient for this project. In the end, I had to
>'break' that link in order to successfully crawl the site with the
>spider. Otherwise I would have gotten ever deeper levels of URLs that
>look like:
>"diplom.org/DipPouch/DipPouch/DipPouch/DipPouch/DipPouch/DipPouch/..."
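For anyone who hits the same loop and cannot break the filesystem link, here is a minimal sketch of a guard a crawler could apply before queueing a URL. It is not the spider Chris used; the function name `looks_recursive` and the `max_repeats` threshold are assumptions made for illustration. It simply rejects URLs in which the same path segment repeats back to back too many times.

```python
from urllib.parse import urlsplit


def looks_recursive(url, max_repeats=2):
    """Return True if any path segment repeats consecutively
    more than max_repeats times (hypothetical threshold)."""
    # Split the path into non-empty segments, e.g. ["DipPouch", "DipPouch", ...]
    segments = [s for s in urlsplit(url).path.split("/") if s]
    run = 1  # length of the current run of identical consecutive segments
    for prev, cur in zip(segments, segments[1:]):
        run = run + 1 if cur == prev else 1
        if run > max_repeats:
            return True
    return False
```

A crawler would call this on each discovered link and skip those for which it returns True. The threshold is a judgment call: a site with legitimately repeated folder names would need a higher value.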
I see. And yes, knowing how the directories are formed, I see why that
happened. This partly has to do with the fact that the Pouch is the site
and the Szine.
>The final result is that there are 37776 links to 5892 unique targets
>(including images). There are 4824 good links and 945 bad links; the
>number for 'bad links' unfortunately includes those links that needed
>to be temporarily disabled. I'll be contacting the maintainers
>individually with specifics on their sections as soon as I can generate
>reports.
Ohh, now that's not so good. That's more like 18%, which is getting high,
depending on how many of them are the "temporarily disabled" links.
I look forward to my report for the postal section, I know about some of
the bad links and they just need to be deleted.
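As a quick check on the percentage, here is a small sketch that works only from the figures quoted above (5892 unique targets, 4824 good, 945 bad). The helper name `bad_link_share` is an assumption for illustration, and which denominator is the "right" one is a judgment call, so both are shown. The thread does not explain the gap between the good-plus-bad total and the unique-target count.

```python
def bad_link_share(bad, total):
    """Percentage of links that are bad, given a chosen total."""
    return 100.0 * bad / total


unique_targets = 5892  # figure quoted in the thread
good = 4824            # figure quoted in the thread
bad = 945              # figure quoted in the thread

# Share of bad links against all unique targets
print(round(bad_link_share(bad, unique_targets), 1))  # 16.0

# Share of bad links against only the targets classified good or bad
print(round(bad_link_share(bad, good + bad), 1))  # 16.4

# Unique targets not accounted for as good or bad
print(unique_targets - (good + bad))  # 123
```

Either way the share comes out near 16%, and it would drop further once the temporarily disabled links are taken out of the bad count.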
>So I've got a tool that can help find the broken links (with some human
>intervention), but the statistics are more obviously useless than is
>normally the case (and the recursion makes it difficult to assess the
>size of the site too).
>Chris
I see.
Jim-Bob