Just to give a status update: I've captured all 365 pages of HTML comprising threads 1-8 on Websleuths. It turned out to be very easy and didn't require much code. For example, the following imports all 52 pages of thread #1 in its entirety.
(* Import each of the 52 pages of thread #1 as raw text (one string per page) *)
thread1 =
  Table[
   Import[
    "http://www.websleuths.com/forums/showthread.php?t=89204&page=" <>
     ToString[i], "TEXT"],
   {i, 1, 52}];
The resulting pages include all of the HTML code from Websleuths, so now I have to write some code to pull out the human-readable portions. Currently each post looks like this:
<!-- message -->
<div id="post_message_4765162">
<div style="margin:20px; margin-top:5px; ">
<div class="smallfont" style="margin-bottom:2px">Quote:</div>
<table cellpadding="6" cellspacing="0" border="0" width="100%">
<tr>
<td class="alt2" style="border:1px inset">
<div>
Originally Posted by <strong>AndresEscobar</strong>
<a href="showthread.php?p=4764095#post4764095" \
rel="nofollow"><img class="inlineimg" \
src="http://www.websleuths.com/forums/images/buttons/viewpost.gif" \
border="0" alt="View Post" /></a>
</div>
<div style="font-style:italic">If my argument isn;t abundently \
clear, I suppose my lack of eloquence is to blame. Maybe it's \
because I'm doing work and multi tasking, but that's fine. I'd like \
us to hold information to some standard. It doesn't have to \
be admissible in court. It just has to plausible and have some \
corroboration. But, you guys win, I'll back off.</div>
</td>
</tr>
</table>
</div><br />
<br />
I'm not trying to "win," man, I'm trying to reason with \
you. My take on this is that everyone pretty much gets how to weigh \
the bits of information under discussion now. We've grasped the \
difference between corroborated evidence and uncorroborated evidence. \
(In that sense, <i>you</i> have "won"--you have made that \
point quite eloquently and I think we all we thank you for it.) And \
now, we can commence with imagining what might have happened. It'll \
be more interesting this way, Andres, really. <img \
src="http://www.websleuths.com/forums/images/smilies/smile.gif" \
border="0" alt="" title="Smilie" class="inlineimg" />
</div>
<!-- / message -->
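A first pass at pulling the post bodies out could be as simple as the sketch below. It assumes every post is wrapped in the same pair of marker comments as the one above, which I haven't verified on every page. The names extractMessages and sample are just made up for illustration.

```mathematica
(* Sketch: return the raw HTML of every post found in one page of text. *)
(* Assumes each post sits between the two marker comments shown above.  *)
extractMessages[html_String] :=
  StringCases[html,
   "<!-- message -->" ~~ Shortest[body___] ~~ "<!-- / message -->" :>
    StringTrim[body]]

(* Tiny made-up page with two posts, just to exercise the function *)
sample =
  "<!-- message -->\n<div>first post</div>\n<!-- / message -->\n" <>
   "<!-- message -->\n<div>second post</div>\n<!-- / message -->";

extractMessages[sample]
```

Running Flatten[extractMessages /@ thread1] should then give one string per post for the whole thread. Each string would still be raw HTML, quote block and all, so the tags still have to be dealt with.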
I'm going to try to preserve some of the formatting, the quotes, and even the thank yous, and perhaps translate the HTML tags into BBCode tags. I have some other things going on this week, so this may take several days to get done. I'll post some samples as I progress.
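To give an idea of the tag translation, here is a rough sketch rather than the finished converter. The handful of tags it maps is my own guess at a starting set, htmlToBBCode is a made-up name, and it simply throws away every tag it doesn't recognize, so quote boxes, smilies and thank yous are not preserved yet.

```mathematica
(* Sketch: map a few simple HTML tags to BBCode, then strip what is left. *)
(* The tag list is an illustrative assumption, not a complete mapping.    *)
htmlToBBCode[post_String] :=
  StringTrim[
   StringReplace[
    StringReplace[post, {
      "<i>" -> "[i]", "</i>" -> "[/i]",
      "<b>" -> "[b]", "</b>" -> "[/b]",
      "<strong>" -> "[b]", "</strong>" -> "[/b]",
      "<br />" -> "\n"}],
    (* Any remaining tag (div, table, img, a, ...) is dropped entirely *)
    RegularExpression["<[^>]*>"] -> ""]]

htmlToBBCode["(In that sense, <i>you</i> have \"won\")"]
(* "(In that sense, [i]you[/i] have \"won\")" *)
```

Handling the quote blocks properly would mean recognizing the "Originally Posted by" table before the generic stripping step, which this sketch does not attempt.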