Outcasts

Forum for outcast sleuths.


    Websleuths Snarf

    dangrsmind

    Posts : 676
    Join date : 2010-02-16
    Location : San Francisco

    Websleuths Snarf

    Post by dangrsmind on Tue Feb 16, 2010 11:42 am

    Just to give a status update: I've captured all 365 pages of HTML comprising threads 1-8 on Websleuths. It turned out to be very easy and didn't require much code. For example, the following code imports all 52 pages of thread #1.

    (* Fetch pages 1-52 of thread #1 as raw text, one string per page *)
    thread1 =
    Table[Import[
    "http://www.websleuths.com/forums/showthread.php?t=89204&page=" <>
    ToString[i], "TEXT"], {i, 1, 52}];
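For anyone without Mathematica, the same page-by-page download can be sketched in Python. This is only an illustration, not the code used above: the function names `page_urls` and `fetch_pages` are made up for the example, and the `latin-1` decoding is an assumption about the forum's character encoding.

```python
from urllib.request import urlopen

# Thread URL taken from the Mathematica snippet above; the page number is appended.
BASE_URL = "http://www.websleuths.com/forums/showthread.php?t=89204&page="


def page_urls(first=1, last=52):
    """Build the URL of every page of the thread, in order."""
    return [BASE_URL + str(i) for i in range(first, last + 1)]


def fetch_pages(urls, fetch=None):
    """Download each URL and return the raw HTML as a list of strings.

    `fetch` is a hypothetical hook so the function can be exercised
    without touching the network.
    """
    if fetch is None:
        # Assumption: the pages decode acceptably as latin-1.
        def fetch(url):
            return urlopen(url).read().decode("latin-1")
    return [fetch(url) for url in urls]
```

Calling `fetch_pages(page_urls())` would request all 52 pages one after another, so a polite pause between requests may be worth adding.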

    The resulting pages include all of the HTML code from Websleuths, so now I have to write some code to pull out the human-readable portions. Currently we have stuff like this:

    <!-- message -->
    <div id="post_message_4765162">

    <div style="margin:20px; margin-top:5px; ">
    <div class="smallfont" style="margin-bottom:2px">Quote:</div>
    <table cellpadding="6" cellspacing="0" border="0" width="100%">
    <tr>
    <td class="alt2" style="border:1px inset">

    <div>
    Originally Posted by <strong>AndresEscobar</strong>
    <a href="showthread.php?p=4764095#post4764095" \
    rel="nofollow"><img class="inlineimg" \
    src="http://www.websleuths.com/forums/images/buttons/viewpost.gif" \
    border="0" alt="View Post" /></a>
    </div>
    <div style="font-style:italic">If my argument isn;t abundently \
    clear, I suppose my lack of eloquence is to blame. Maybe it's \
    because I'm doing work and multi tasking, but that's fine. I'd like \
    us to hold information to some standard. It doesn't have to \
    be admissible in court. It just has to plausible and have some \
    corroboration. But, you guys win, I'll back off.</div>

    </td>
    </tr>
    </table>
    </div><br />
    <br />
    I'm not trying to &quot;win,&quot; man, I'm trying to reason with \
    you. My take on this is that everyone pretty much gets how to weigh \
    the bits of information under discussion now. We've grasped the \
    difference between corroborated evidence and uncorroborated evidence. \
    (In that sense, <i>you</i> have &quot;won&quot;--you have made that \
    point quite eloquently and I think we all we thank you for it.) And \
    now, we can commence with imagining what might have happened. It'll \
    be more interesting this way, Andres, really. <img \
    src="http://www.websleuths.com/forums/images/smilies/smile.gif" \
    border="0" alt="" title="Smilie" class="inlineimg" />
    </div>
    <!-- / message -->

    I'm going to try to preserve some of the formatting, quotes, and even the thank yous, and perhaps translate the HTML tags into BBCode tags. I have some other things going on this week, so this may take several days to get done. I'll post some samples as I progress.
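A rough Python sketch of that plan, assuming every post body sits between the `<!-- message -->` and `<!-- / message -->` comments as in the sample above. The tag mapping is an illustrative subset chosen for the example, and quote boxes are simply flattened into plain text here rather than preserved.

```python
import html
import re

# Post bodies are delimited by these comments in the sample above.
MESSAGE_RE = re.compile(r"<!-- message -->(.*?)<!-- / message -->", re.DOTALL)

# HTML tags with a direct BBCode counterpart (illustrative subset, an assumption).
TAG_MAP = {"i": "i", "b": "b", "strong": "b", "u": "u"}

TAG_RE = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>")


def extract_messages(page_html):
    """Return the raw HTML of every post body found on one page."""
    return [m.strip() for m in MESSAGE_RE.findall(page_html)]


def to_bbcode(fragment):
    """Translate a few inline tags to BBCode and drop all other tags."""
    def swap(match):
        closing, name = match.group(1), match.group(2).lower()
        if name in TAG_MAP:
            return "[" + closing + TAG_MAP[name] + "]"
        if name == "br":
            return "\n"
        return ""  # any other tag (div, table, img, ...) is removed

    text = TAG_RE.sub(swap, fragment)
    text = html.unescape(text)                # &quot; -> " and so on
    text = re.sub(r"\n\s*\n+", "\n\n", text)  # collapse runs of blank lines
    return text.strip()
```

Tags are stripped before entities are unescaped, so an escaped `&lt;` in a post is not mistaken for markup.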
    tapu

    Posts : 228
    Join date : 2010-02-16
    Age : 58
    Location : Sunny Maine

    Re: Websleuths Snarf

    Post by tapu on Tue Feb 16, 2010 11:59 am

    U DA BOM!


    Heroine
    Admin

    Posts : 337
    Join date : 2010-02-16
    Location : VA

    Yay for DM!

    Post by Heroine on Tue Feb 16, 2010 1:15 pm

    Thanks DM! I was wondering how on earth we would ever retrieve all that info! BUT Mr smartypants has done it!
    dangrsmind

    Posts : 676
    Join date : 2010-02-16
    Location : San Francisco

    Re: Websleuths Snarf

    Post by dangrsmind on Wed Feb 17, 2010 10:25 am

    No real progress to report since yesterday...I need to find some more time to work on this.

    Currently I have my own local copy of all 8 threads and I can read them on my hard drive. I'm still planning to parse out the interesting bits and capture any images stored on WS for posting here.
    dangrsmind

    Posts : 676
    Join date : 2010-02-16
    Location : San Francisco

    Re: Websleuths Snarf

    Post by dangrsmind on Sat Feb 20, 2010 2:53 am

    Further progress... I've figured out how to capture all 8 threads and cleanse them of unnecessary HTML to get just nice clean text, and it takes just two (very crafty) lines of code. (I am sure only the true geeks can appreciate this.)

    I can separate each posting from the surrounding text fairly easily. I still need to handle some exceptions, but mostly I now just have to parse the postings themselves. This is a bit tricky since they have several semi-structured elements, but not too bad. Once that is done we shall see what we can learn...

    Some of the things I can capture:

    name of poster
    time and date of posting
    posting contents including URLs
    quotation and thank you references
    edit times

    Example:

    # 228
    10-13-2009, 11:19 AM


    dangrsmind
    Are you pondering what I'm pondering?

    Join Date: Oct 2009
    Posts: 2,018




    Quote:

    Originally Posted by Kano
    Rellik 781 is backwards for killer 187.

    The spelling just a street gang rip off.. They dont use letters \
    associated with enemys, so bloods spell everything with k's and not c's
    In this case i would say the "k" more for killer, since its big in \
    the horror core scene.

    Sicktaniks post withe "whipe that dead @#%$ off ya" Also leans into \
    the horror core comunity. Heard it more used with the ones who use \
    the ICP face paint though.

    Eh, cant say i ever heard that one. Yes I figured that out and \
    posted the decoded reversal for 781. It is interesting that 187 is \
    the California radio code for murder and the murderer was from \
    California, however I understand that people outside of California \
    use this reference.

    In Hebrew gematria the number 781 corresponds to the word for "a \
    deposit; dung; dung-hill". The word "KILLER" encodes to the number \
    290 which means, among other things, "trouble" or an "evil spirit".

    Remember they call this music the "wicked sh*t"?

    But hey maybe it's just a coincidence...


    Last edited by dangrsmind; 10-13-2009 at 11:38 AM . Reason: \
    clarity



    dangrsmind
    View Public Profile
    Find all posts by dangrsmind
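To show how the fields listed above might be pulled out of a cleaned posting like this example, here is a hedged Python sketch. The regular expressions are guesses based on this one sample only. The function name `parse_post` and the rule "poster name is the first non-blank line after the timestamp" are assumptions, not the actual parsing code.

```python
import re
from datetime import datetime

POST_NO_RE = re.compile(r"^#\s*(\d+)[ \t]*$", re.MULTILINE)
STAMP_RE = re.compile(
    r"^(\d{2}-\d{2}-\d{4}), (\d{1,2}:\d{2} [AP]M)[ \t]*$", re.MULTILINE)
QUOTED_RE = re.compile(r"Originally Posted by (.+)")
EDITED_RE = re.compile(
    r"Last edited by (.+?); (\d{2}-\d{2}-\d{4}) at (\d{1,2}:\d{2} [AP]M)")


def parse_stamp(date_part, time_part):
    """Parse dates in the month-day-year, 12-hour form seen in the sample."""
    return datetime.strptime(date_part + " " + time_part, "%m-%d-%Y %I:%M %p")


def parse_post(text):
    """Extract post number, time, poster, quoted posters and edit time."""
    number = POST_NO_RE.search(text)
    stamp = STAMP_RE.search(text)
    edited = EDITED_RE.search(text)
    poster = None
    if stamp:
        # Assumption: the poster's name is the first non-blank line
        # after the timestamp line.
        for line in text[stamp.end():].splitlines():
            if line.strip():
                poster = line.strip()
                break
    return {
        "number": int(number.group(1)) if number else None,
        "posted": parse_stamp(*stamp.groups()) if stamp else None,
        "poster": poster,
        "quoted": [name.strip() for name in QUOTED_RE.findall(text)],
        "edited": parse_stamp(edited.group(2), edited.group(3)) if edited else None,
    }
```

Postings that lack a quote or an edit line simply come back with an empty list or `None` for those fields.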
    wadahoot

    Posts : 97
    Join date : 2010-02-16
    Age : 59
    Location : Indiana

    Re: Websleuths Snarf

    Post by wadahoot on Sat Feb 20, 2010 3:14 am

    I don't know what all that means ... but WOW! That's great! HAHAHAHAAAA!

    Did you/can you capture thread 9 since (part of) it is back up?
    dangrsmind

    Posts : 676
    Join date : 2010-02-16
    Location : San Francisco

    Re: Websleuths Snarf

    Post by dangrsmind on Sat Feb 20, 2010 11:16 am

    wadahoot wrote:I don't know what all that means ... but WOW! That's great! HAHAHAHAAAA!

    Did you/can you capture thread 9 since (part of) it is back up?

    I can capture anything that is published on the Web...
    dangrsmind

    Posts : 676
    Join date : 2010-02-16
    Location : San Francisco

    Re: Websleuths Snarf

    Post by dangrsmind on Sat Feb 20, 2010 12:16 pm

    wadahoot wrote:I don't know what all that means ... but WOW! That's great! HAHAHAHAAAA!

    Did you/can you capture thread 9 since (part of) it is back up?

    What it means is, I can begin computing some interesting things like time distributions of postings, thank yous, and so on as well as applying text analysis to the postings.
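As one small example of a time distribution, postings can be counted by hour of day. This Python sketch assumes timestamps in the `10-13-2009, 11:19 AM` form shown in the earlier sample; the function name `posts_per_hour` is made up for illustration.

```python
from collections import Counter
from datetime import datetime


def posts_per_hour(timestamps):
    """Count postings by hour of day (0-23)."""
    hours = Counter()
    for stamp in timestamps:
        # Assumed format: month-day-year, then 12-hour time with AM/PM.
        when = datetime.strptime(stamp, "%m-%d-%Y, %I:%M %p")
        hours[when.hour] += 1
    return hours
```

The same counting works per weekday or per calendar day by swapping `when.hour` for `when.weekday()` or `when.date()`.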
    dangrsmind

    Posts : 676
    Join date : 2010-02-16
    Location : San Francisco

    Re: Websleuths Snarf

    Post by dangrsmind on Sun Feb 21, 2010 2:48 am

    Ok, I have a first cut at separating the postings. This allows me to do some calculations of stylometric values on each posting producing things like this...

    [Image not captured: plot of the lower case to upper case character ratio for each posting]

    Here we have an analysis of 8000+ postings showing relative lower and upper case usage across all posters.
    wadahoot

    Posts : 97
    Join date : 2010-02-16
    Age : 59
    Location : Indiana

    Re: Websleuths Snarf

    Post by wadahoot on Sun Feb 21, 2010 3:30 am

    uh, still don't know what it means, but the graph is pretty
    dangrsmind

    Posts : 676
    Join date : 2010-02-16
    Location : San Francisco

    Re: Websleuths Snarf

    Post by dangrsmind on Sun Feb 21, 2010 4:22 am

    wadahoot wrote:uh, still don't know what it means, but the graph is pretty

    I analyzed over 8000 postings from Websleuths.

    For each posting, I counted the number of upper and lower case characters in the posting. I then computed the ratio:

    #lower
    ---------
    #upper

    The plot shows these values with the postings ordered in time from left to right on the X-axis, and the ratio on the Y-axis.
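The ratio itself is simple to compute. Here is a minimal Python sketch; the function name `case_ratio` and the choice to return `None` when a posting has no upper case letters are assumptions of this example, not necessarily how the plotted values were produced.

```python
def case_ratio(text):
    """Return (#lower case letters) / (#upper case letters) for one posting.

    Returns None when the posting has no upper case letters, since the
    ratio is undefined in that case.
    """
    lower = sum(1 for ch in text if ch.islower())
    upper = sum(1 for ch in text if ch.isupper())
    if upper == 0:
        return None
    return lower / upper
```

For instance, `case_ratio("Hello World")` has 8 lower case and 2 upper case letters, giving 4.0.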

    Notable in the image is that early in the thread's history the postings contained more lower case characters relative to upper case than they did later. This suggests that a different group of posters was dominant early in the thread's history and that these posters used fewer upper case letters on average. Later a different group became dominant, and this group uses a higher percentage of upper case letters. The early posters peaked around the 2000th posting, and their postings contained on average 17 lower case letters for each upper case letter. The later posters come to dominate the scene after the 3000th posting, and they average just 10 lower case letters per upper case letter.

    The measure is a bit crude: it doesn't distinguish between texts where letters are not capitalized correctly and texts consisting of longer sentences, which would naturally tend to have a higher ratio of lower case to upper case characters.

    I can compute many of these sorts of measures...

    Tapu, Tapu, paging Tapu to the white courtesy telephone.
    dangrsmind

    Posts : 676
    Join date : 2010-02-16
    Location : San Francisco

    Re: Websleuths Snarf

    Post by dangrsmind on Sun Feb 21, 2010 1:58 pm

    Here's another interesting one...

    This chart shows "other-oriented" vs. "self-oriented" word use in the same 8500-posting data set. The Y-axis shows this measure: higher scores indicate other-oriented texts, and scores closer to zero indicate self-oriented texts.

    [Image not captured: chart of other-oriented vs. self-oriented word use for each posting]
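The post does not say how the orientation score is defined, so the following Python sketch is only one plausible reading: the share of personal pronouns that refer to someone other than the writer. The word lists, the function name `other_orientation`, and the formula are all assumptions made for illustration.

```python
import re

# Hypothetical word lists; the actual lists behind the chart are not given.
SELF_WORDS = {"i", "me", "my", "mine", "myself"}
OTHER_WORDS = {"you", "your", "yours", "he", "she", "they", "them",
               "their", "we", "us", "our"}


def other_orientation(text):
    """Return other / (self + other) pronoun counts, or None if none occur.

    Scores near 0 mean mostly self-referential wording; scores near 1
    mean mostly other-referential wording.
    """
    words = re.findall(r"[a-z]+", text.lower())
    self_count = sum(1 for w in words if w in SELF_WORDS)
    other_count = sum(1 for w in words if w in OTHER_WORDS)
    total = self_count + other_count
    if total == 0:
        return None
    return other_count / total
```

Splitting on letters only means a contraction such as "I'm" still contributes its "i" to the self count.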
    dangrsmind

    Posts : 676
    Join date : 2010-02-16
    Location : San Francisco

    Re: Websleuths Snarf

    Post by dangrsmind on Sun Feb 21, 2010 2:01 pm

    Q: Are you going to post the entire contents of the Websleuth threads here?

    A: No.

    Websleuths claims copyright to these threads, so I cannot repost them in their entirety. Quotation for academic discussion is fair use, though, and the entire set of postings from threads 1-8 is now searchable. So if there is some specific posting or information someone wants, I can find it easily and post an excerpt with a link reference.
