random.bml

Jul. 14th, 2006 02:55 pm
[personal profile] nibot
In a (usually futile) effort to break out of the usual LJ-clique and find new and interesting content*, I have a toolbar button that takes me to a "random livejournal" (http://www.livejournal.com/random.bml).

It just took me to [livejournal.com profile] frinkle_twinkle, whose current (from September) entry is all about how it seems she gets a disproportionate number of visitors via the random journal selector: http://frinkle-twinkle.livejournal.com/

Is the random-livejournal selector broken, or is this just observational bias? (Anyone could notice some random visitors, put up a note about them, and then suddenly have lots of people saying that they, too, got there via the random journal button, and isn't that funny.) I suppose the way to test this, aside from inspecting the source code to random.bml, would be to make a new journal with a post that says "Isn't it funny, so many people seem to be getting here via the random button" post, and see how many "me too" comments accrue.

* actually, just to waste time.

Date: 2006-07-14 08:35 pm (UTC)
From: [identity profile] nibot.livejournal.com
Thanks for participating in the experiment. I suggest you "backdate" your honey-pot entry into the future, so that it will appear as the most recent entry for some time to come.

Date: 2006-07-14 08:37 pm (UTC)
ext_3729: All six issues-to-date of GUD Magazine. (Default)
From: [identity profile] kaolinfire.livejournal.com
I wavered on that, but yes, I probably should.

Date: 2006-07-14 10:32 pm (UTC)
From: [identity profile] codetoad.livejournal.com
How do you randomly select a column from a table (in SQL) efficiently and randomly? One way is to grab the entire set of columns (tons of data, in this case). So, here's how we could select 5 rows in a random order:

query = "SELECT name from curPages order by random() limit 5"

So this is pairing each row with a number generated by random() when the query is run. To ensure that the columns are actually selected randomly and not merely in some order maintained by the BTree, it seems that PostgreSQL does a full sequential scan -- looking at all of the records -- to do the random() sort:

wiki=# explain SELECT name from curPages order by random() limit 5;
                              QUERY PLAN                               
-----------------------------------------------------------------------
 Limit  (cost=1.29..1.30 rows=5 width=218)
   ->  Sort  (cost=1.29..1.32 rows=10 width=218)
         Sort Key: random()
         ->  Seq Scan on curpages  (cost=0.00..1.12 rows=10 width=218)
(4 rows)


So with PostgreSQL we do a sequential scan. If we did a full sequential scan of the 10,669,375 livejournals every time an enterprising young reader clicked that button that would be seriously bad news. Even limited to LJ's user clusters you're still looking at around a million users to scan. So, a good solution seems to be to cache some of that randomly selected data someplace -- say around 5,000 random journals -- and keep it around for a little while (randomly select from it and so forth).

Another solution is to pair every row with a random value when you insert it into the database. It's poor design, but it would allow you to use that value as an index to select the page. For the amount the random feature is truly used and the low value in having true randomness this is probably a bad idea.

MySQL, which LJ uses, doesn't give very much information in their "EXPLAIN" to know what's going on in this query, but it's probably inefficient too.

So the other real no-brainer solution that works in LJ's case is to select a random number between 1 and the maximum user id and then select that record. This would work because user id's are sequential. It seems it's what LJ does, kinda:

    my $dbr = LJ::get_db_reader();
    my $max = $dbr->selectrow_array("SELECT statval FROM stats ".
                                    "WHERE statcat='userinfo' AND statkey='randomcount'");
    if ($max) {
        my $rand = int(rand($max))+1;
        my $user = $dbr->selectrow_array("SELECT u.user FROM randomuserset r, useridmap u ".
                                         "WHERE r.userid=u.userid AND r.rid=$rand");
        return BML::redirect("$LJ::SITEROOT/users/$user/") if $user;
    }
    
    # something went wrong, redirect them back to top of site... *shrug*
    return BML::redirect("$LJ::SITEROOT/");


LJ builds a table called randomuserset daily which consists of 5000 journals which are marked as 'public yo' and also have been updated in the past day interval (
AND uu.timeupdate > DATE_SUB(NOW(), INTERVAL 1 DAY)
). The small set (5000) combined with the day long interval as well as the condition that the user has to update is likely to favor certain users. LJ tells me there have been around 200,000 updates in the past day, so let's assume there's 100,000 updates a day by unique users. If you update every day that's a 5% chance of being in that random pool, which actually isn't that bad at all.

Date: 2006-07-14 11:19 pm (UTC)
From: [identity profile] evan.livejournal.com
Nice analysis! Oftentimes in the past these sorts of "stuck" behaviors is when a daemon or cronjob starts failing, so my initial guess (upon reading the post and vaguely remembering how this stuff worked) is that the random user set building stopped working for some reason.

Date: 2006-07-14 11:27 pm (UTC)
From: [identity profile] codetoad.livejournal.com
Yeah, that might be it. A lot of journals I'm seeing in random.bml haven't been updated in the past few days..

Date: 2006-07-14 11:41 pm (UTC)
From: [identity profile] nibot.livejournal.com
Many of them, in fact, have been deleted.
(deleted comment)

Date: 2006-07-19 08:30 pm (UTC)
From: [identity profile] nibot.livejournal.com
It's definitely stuck. Many of the journals that show up via random_bml have comments on their entries about how the reader got there via random_bml.

most recently http://shebbybaby.livejournal.com/60884.html
(deleted comment)

Date: 2006-07-15 01:58 am (UTC)
From: [identity profile] evan.livejournal.com
It's sorta sad, but I can often diagnose these sorts of LJ failures by just knowing that they're generally sloppy about the sorts of monitoring/regression testing necessary to keep everything going simultaneously.
(deleted comment)

Date: 2006-07-14 11:51 pm (UTC)
From: [identity profile] codetoad.livejournal.com
Obviously ORDER BY rand() is going to be expensive if run over all the users but what we might look at here is user.timeupdate (or something). Grab the 200,000 users who've updated in the past day and pick 5000 from that randomly. That can't be that expensive to do on a per day basis? (I just ran a SELECT .. ORDER BY random() limit 5000; statement on a 200,000 row table and it returned in about a second)

March 2020

S M T W T F S
1234567
891011121314
15 161718192021
22232425262728
293031    

Style Credit

Page generated Sep. 13th, 2025 09:22 pm
Powered by Dreamwidth Studios

Expand Cut Tags

No cut tags

Most Popular Tags