Thread: Scalable Sessions

  1. #1
    SitePoint Member
    Join Date
    Mar 2004
    Location
    Indiana
    Posts
    23

    Scalable Sessions

    We are about to rerelease a site that gets around 90,000 user registrations a day. On any given day we have around 250,000 unique visits.

    Now, the new code for the site lays off the database a bit by storing more in the session. It equates to about 40k per logged-in user, or 1-2k per anonymous user, kept for up to a week of inactive time. This is obviously a significant amount of disk space to be using. In addition to the disk space, the session data is shared by 4 web servers via NFS.

    My question is: how can I support this much session data and avoid the sessions being a bottleneck?

    I have looked at SQL-based session storage (too slow, too much overhead), file-based sessions (how well does this scale?) and msession (don't know enough about it). Any recommendations?
    Last edited by bmatheny; Jul 15, 2004 at 11:16.

  2. #2
    SitePoint Guru
    Join Date
    Nov 2002
    Posts
    841
    40k of data per session sounds like a lot to me, although it would depend on your application. You may be trying to install a 10x10 rug in a 9x9 room: you're never going to push the bump out. Your database and your sessions are both shared data sources. Moving information from one to the other may result in no net scalability gain.

    You might look into something like Jason Sweat's DataCache, or the query caching capability of ADOdb. By paying attention to your common queries, you can cache them on each web server, eliminate the network trip, and ease resource usage on your shared data stores.

    It would seem that your anonymous pages would be low hanging fruit, ripe for caching. You could either cache the assembled page, or the queries used to assemble the page.
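
    A minimal sketch of the per-server query caching idea, assuming a local cache directory and a caller-supplied callback that runs the real query. The name cached_query, the directory, and the TTL are made up for illustration; this is not the API of DataCache or ADOdb.

```php
<?php
// Hypothetical helper: cache a query result in a local file for $ttl seconds.
function cached_query(string $cacheDir, string $sql, int $ttl, callable $runQuery)
{
    $file = $cacheDir . '/' . md5($sql) . '.cache';

    // Serve from the local file while it is still fresh.
    if (is_file($file) && (time() - filemtime($file)) < $ttl) {
        return unserialize(file_get_contents($file));
    }

    // Otherwise hit the shared database once and store the result locally.
    $result = $runQuery($sql);
    file_put_contents($file, serialize($result), LOCK_EX);
    return $result;
}
```

    Each web server keeps its own copy, so a repeated query inside the TTL never leaves the box. The trade-off is that results can be stale for up to $ttl seconds.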

  3. #3
    SitePoint Evangelist Daijoubu's Avatar
    Join Date
    Oct 2002
    Location
    Canada QC
    Posts
    454
    How about using MySQL's HEAP/MEMORY table type?
    If you're using PHP's built-in session handler, you can make it run in memory too.
    With speed & scalability in mind...
    If you find my reply helpful, feel free to give me a point

  4. #4
    SitePoint Enthusiast
    Join Date
    Feb 2004
    Location
    France
    Posts
    58
    Well there's always brute force: build a dual or 4-way Xeon system with as much RAM as you can, put MySQL on it and activate the query cache. A big, well-tuned MySQL server can take quite a beating.

    If many pages only read from the database and session, you should probably look into replication instead (it's cheap and it works pretty nicely for load balancing).

  5. #5
    Non-Member
    Join Date
    Jan 2004
    Location
    Planet Earth
    Posts
    1,764
    Quote:
    Well there's always brute force
    If you are unable (at this time) to refactor the application, then this may make a difference. That said, if your SESSIONs are 40k each, then to me there is already a serious problem.

    SESSIONs should be seen much like COOKIEs, so if you need large-scale storage, they ain't suitable, as you're finding out now.

  6. #6
    SitePoint Member
    Join Date
    Mar 2004
    Location
    Indiana
    Posts
    23
    Quote Originally Posted by Widow Maker
    If you are unable (at this time) to refactor the application, then this may make a difference. That said, if your SESSIONs are 40k each, then to me there is already a serious problem.

    SESSIONs should be seen much like COOKIEs, so if you need large-scale storage, they ain't suitable, as you're finding out now.
    The problem is, there is a ton of data that needs to be stored for the duration of a user's session. And storing it (the ephemeral data) in the database is really inappropriate, since it never needs to be queried on and changes from login to login. We have around 3500 concurrent connections right now, and I don't see even a really big x86 box being able to handle that kind of SQL traffic.

    Edit: I was able to drop the session size to 16k. I think with some more refactoring I can drop it down to maybe 8-10k. Is this still too large for a session?

  7. #7
    Non-Member
    Join Date
    Jan 2004
    Location
    Planet Earth
    Posts
    1,764
    Well, yes, in my view. If it's just for the duration of a user's visit, I'd think it would probably be better to store it in a flat file anyway, as your SESSIONs are files, aren't they?

    You just wouldn't have the overhead (if any, I'm not sure) of PHP-based SESSIONs, that's all, as I see it anyway.

    Good news that you've brought the size way down, though. If you're unsure or unwilling to store the user data in a flat file, maybe leave things as they are (after refactoring) for a while to see what the performance gain is, if any?

  8. #8
    SitePoint Evangelist
    Join Date
    May 2004
    Location
    New Jersey, USA
    Posts
    567
    Quote Originally Posted by bmatheny
    The problem is, there is a ton of data that needs to be stored for the duration of a user's session. And storing it (the ephemeral data) in the database is really inappropriate, since it never needs to be queried on and changes from login to login.
    Given that it changes from login to login, does it change from session to session for a given login?

    That is, will all this data be "the same" each time I log in, or do your sessions expire after a short interval?

    We'll revisit this.

    Quote:
    We have around 3500 concurrent connections right now, I don't see even a really big x86 box being able to handle that kind of sql traffic.
    That depends on how much you're doing with the DB on each connection.

    Quote:
    Edit: I was able to drop the session size to 16k. I think with some more refactoring I can drop it down to maybe 8-10k. Is this still too large for a session?
    My first suggestion was going to be a bet that you could cut the session size in half.

    My next suggestion will be that you can do it again.

    First, though, let's look at policy: you've got 250k unique visits/day, and 90k registrations/day. I'd say a registration definitely qualifies as a unique visit, so I'm reading that more than a third of all your activity is registration. (Also, of course, you've apparently registered 12 million users in the last six months. Is this true?)

    If you've got that many registrations, compared to your visits, then maybe you're doing something wrong: maybe you can change the balance of registrations vs. visits by some policy-level change -- let people read the articles without registering, or whatever. This wouldn't be a code change, necessarily, so much as a site policy/structure/requirements change. But if it converts some of your logins into anonymous users, you'll see a big win in session allocation, since their data requirements shrink by 10x.

    Next, let's talk about data flow. When a user signs up, why are you blowing up a 40k (now 20k) balloon full of stuff in their session? Is there some way that you could break the session data into "parts", and have the individual "parts" expire after a few minutes if not used? (For instance, if a user was browsing a thread on sitepoint, you would keep all the 'thread browsing data' but eventually the 'blog reading data' and the 'article reading data' would get dropped out of the session, unless the user clicked back to those areas.)

    The idea here is to reduce the average session size -- there will surely exist some spastic pathological users who bounce from place to place keeping all the modules open, but most people do things one step at a time.
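
    A rough sketch of the "parts that expire" idea, assuming each part is stored under a hypothetical 'parts' key together with the time it was last used. The function names and the array layout are illustrative only.

```php
<?php
// Hypothetical layout: $session['parts'][$name] = ['data' => ..., 'used' => unix time].

// Store a part's data and record when it was last used.
function touch_part(array &$session, string $name, $data, int $now): void
{
    $session['parts'][$name] = ['data' => $data, 'used' => $now];
}

// Drop every part that has not been used within $maxIdle seconds.
function expire_parts(array &$session, int $maxIdle, int $now): void
{
    foreach ($session['parts'] ?? [] as $name => $part) {
        if ($now - $part['used'] > $maxIdle) {
            unset($session['parts'][$name]);
        }
    }
}
```

    Calling something like expire_parts($_SESSION, 300, time()) at the start of each request would keep only the areas the user touched in the last five minutes, so the average session shrinks even though the worst case does not.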

    Another data item: what's in all this session data? I'll bet you can reduce the session data set down by half or more again (target: 10k).

    Another data item: what's the session duration? Right now you've got 250k/day visits. Assume that 30% are "every day" visitors, and you get 250k + 6 * 175k = 1300k "unique" visits/week.

    If all those are anonymous, then figure 2k * 1300k = 2,600,000k = 2.6gb of session data. If all those were registered (40k each), then you would have 52gb of session data.

    If you get the registered user size down to 10k, you get down to 13gb of data, a much nicer number.
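
    The estimate can be reproduced with a few lines of PHP. This is only a sketch of the arithmetic in this post; the function names are invented, and it uses decimal units (1 GB = 1,000,000 KB) to match the figures quoted here.

```php
<?php
// Estimate a week of distinct sessions from daily visits and the share of
// visitors who come back every day (and so do not add a new session).
function weekly_sessions(int $dailyVisits, float $repeatShare): int
{
    $newPerDay = (int) round($dailyVisits * (1 - $repeatShare));
    return $dailyVisits + 6 * $newPerDay;
}

// Total storage in GB (decimal) for a given per-session size in KB.
function storage_gb(int $sessions, int $kbPerSession): float
{
    return $sessions * $kbPerSession / 1000000;
}

$sessions = weekly_sessions(250000, 0.30); // 250k + 6 * 175k
printf("%d sessions per week\n", $sessions);
foreach ([2, 10, 40] as $kb) {
    printf("%2d KB each: %.1f GB\n", $kb, storage_gb($sessions, $kb));
}
```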

    If you shorten the session timeout, and implement some sort of "garbage collection" in your session directories, then you can cut the number down even more.

    Let's talk about architecture: Right now you've got 4 servers handling the load for your site, with NFS file I/O. How do you handle "sessions" with the servers? Can any request go to any server, or do you have the requests going back to the same servers after the initial connection?

    Implementing a 'preferred' server scheme isn't too difficult, and it will radically improve your network behavior because you can turn up the caching on your NFS mounts.

    Since we're heading in this direction, let me say that sessions *ARE*, IMO, the right thing for what you seem to be doing -- provided that you comb through the data and get rid of some of the bloat.

    You should definitely look into writing your own session handler, though.

    Instead of using SQL, you'll want to keep using files -- files are faster than DB. But maybe you can implement your own caching algorithm by storing the files on a ramdisk or in /tmp (mounted on swap), and then writing a second copy to the nfs filesystem. This works particularly well if you know that subsequent session requests will come back to the same host.
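
    One way the "local copy plus second copy on NFS" idea might look, as a sketch only. The directory arguments and function names are assumptions, and the session id is assumed to have been validated already.

```php
<?php
// Write session data to a fast local directory first, then keep a second
// copy on the shared mount. Assumes $id is already validated (no slashes).
function write_session(string $localDir, string $sharedDir, string $id, string $data): bool
{
    $name = 'sess_' . $id;
    $okLocal  = file_put_contents("$localDir/$name", $data, LOCK_EX) !== false;
    $okShared = file_put_contents("$sharedDir/$name", $data, LOCK_EX) !== false;
    return $okLocal && $okShared;
}

// Read the local copy when present; fall back to the shared copy,
// for example after the user has been moved to another web server.
function read_session(string $localDir, string $sharedDir, string $id): string
{
    $name = 'sess_' . $id;
    foreach ([$localDir, $sharedDir] as $dir) {
        if (is_file("$dir/$name")) {
            return (string) file_get_contents("$dir/$name");
        }
    }
    return ''; // unknown session: start empty
}
```

    Functions like these could back the read and write callbacks given to session_set_save_handler(). Reads stay local as long as requests return to the same host, while the shared copy covers a host going down.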

    Also, consider compressing the files before you write them, thereby shrinking your NFS traffic (but sadly not your operation count). If you can get those session files down to 2-3kb (compressed), you cut your session I/O traffic to roughly a third.
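
    A small sketch of compressing session data before it is written, assuming the zlib extension is available. The function names are hypothetical, and the compression level of 6 is an arbitrary middle setting.

```php
<?php
// Serialize and compress session data before it goes to the shared store.
function pack_session(array $data): string
{
    return gzcompress(serialize($data), 6);
}

// Reverse of pack_session(); returns an empty array if the blob is unreadable.
function unpack_session(string $blob): array
{
    $raw = gzuncompress($blob);
    return $raw === false ? [] : unserialize($raw);
}
```

    How much this saves depends entirely on the data. Repetitive structures such as role lists tend to compress well, and the CPU cost is paid on every request.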

    Consider dedicating one of your servers to being a server -- that is, to doing all the disk IO and database work. If you have control of the box, you could crank up the number of biod's (or whatever) and see some really good numbers.


    Hope this helps,

    =Austin

  9. #9
    Non-Member
    Join Date
    Jan 2004
    Location
    Planet Earth
    Posts
    1,764
    Quote:
    exist some spastic pathological users

  10. #10
    SitePoint Zealot
    Join Date
    Dec 2003
    Location
    with my kids
    Posts
    116
    You're really getting 32.8 million new users a year?

  11. #11
    SitePoint Member
    Join Date
    Mar 2004
    Location
    Indiana
    Posts
    23
    Thank you so much for the well thought out reply. It has significantly eased my mind, as I realized that the pointy-haired boss who gave me the registration stats was simply a bold-faced liar. We have 2.1 million users, and average about 250k registrations a month, which is about 8500 people a day. Not nearly so bad! So, now with this new information, I'll try to respond to your questions.

    Quote Originally Posted by Austin_Hastings
    Given that it changes from login to login, does it change from session to session for a given login?

    That is, will all this data be "the same" each time I log in, or do your sessions expire after a short interval?
    It does change from session to session. When you log back in you may have new roles, a different subscription expiration date, new number of forum posts, new private messages, a new password, etc. What we do right now is the following: say a user goes to their preferences and turns text message alerts from on to off. That gets updated in their session as well as in the database. But from then on while they're browsing the site, all the places that need to know whether the user accepts text messages or not can just grab it from an object in the session instead of querying the database. This happens for LOTS of user attributes.

    Quote Originally Posted by Austin_Hastings
    My next suggestion will be that you can do it again.
    I'm at 12k now, working on getting that down to 10k.

    This is new info I have. We have 45k different users logging into the site every day, and about 2.5 times that number visit and never register or log in. So around 160k unique visits/day.

    Quote Originally Posted by Austin_Hastings
    But if it converts some of your logins into anonymous users, you'll see a big win in session allocation, since their data requirements shrink by 10x.
    This would be a disaster for us. We rely on new people coming, registering in order to use the service (you really do have to register to use it) and then signing up for a subscription to the service (a cellular content provider).

    Quote Originally Posted by Austin_Hastings
    Another data item: what's the session duration?
    The average length of visit on the site is 24 minutes. The session timeout is set for, well it really doesn't matter since PHP doesn't do GC on sessions if you use the split sessions feature.

    Quote Originally Posted by Austin_Hastings
    If you shorten the session timeout, and implement some sort of "garbage collection" in your session directories, then you can cut the number down even more.
    This can be done.

    Quote Originally Posted by Austin_Hastings
    Let's talk about architecture: Right now you've got 4 servers handling the load for your site, with NFS file I/O. How do you handle "sessions" with the servers? Can any request go to any server, or do you have the requests going back to the same servers after the initial connection?
    We have a load balancer that balances between 4 web servers. The load balancer uses the sticky IP scheme because we have SSL traffic. Each web server has two NFS mounts, both to the same machine: one for the source code and one for the sessions. The NFS cache timeout for the source code is 15 minutes; there is no caching for the sessions. There are 3 SQL servers: one is write-only, one is a read-only slave, and one is read-write and holds different data than the other two machines.

    Quote Originally Posted by Austin_Hastings
    Instead of using SQL, you'll want to keep using files -- files are faster than DB. But maybe you can implement your own caching algorithm by storing the files on a ramdisk or in /tmp (mounted on swap), and then writing a second copy to the nfs filesystem. This works particularly well if you know that subsequent session requests will come back to the same host.
    This is why I was considering using msession, since we can have a beefy box with 16 GB of RAM keep all the sessions in memory, and the web servers can retrieve session data over the msession port. I'm curious whether msession is less expensive than NFS; my guess is yes.

    I'm going to work on getting the session size down to 8-10k. Doing the math, that shouldn't be too bad in terms of storage requirements. I will have to write a reasonable GC mechanism, since right now our session.save_path is 4;/sessions. Thanks again for the insight.
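
    Since PHP's file handler does not collect garbage on its own when session.save_path uses the nested "N;/path" form, a cron-driven cleanup is needed. The following is one possible sketch; the function name is made up, and the cutoff is passed in rather than read from any PHP setting.

```php
<?php
// Delete session files older than $maxAge seconds anywhere under $root.
// Returns the number of files removed.
function gc_sessions(string $root, int $maxAge, int $now): int
{
    $files = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($root, FilesystemIterator::SKIP_DOTS)
    );

    // Collect stale paths first so nothing is deleted mid-iteration.
    $stale = [];
    foreach ($files as $file) {
        // PHP's file handler names its files sess_<id>; leave anything else alone.
        if ($file->isFile()
            && strpos($file->getFilename(), 'sess_') === 0
            && $now - $file->getMTime() > $maxAge) {
            $stale[] = $file->getPathname();
        }
    }

    $removed = 0;
    foreach ($stale as $path) {
        if (@unlink($path)) {
            $removed++;
        }
    }
    return $removed;
}
```

    Run from cron on the machine that owns the disk, for example as gc_sessions('/sessions', 7 * 86400, time()), this would avoid walking the tree over NFS.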

  12. #12
    SitePoint Enthusiast
    Join Date
    Feb 2003
    Location
    vancouver
    Posts
    29
    If you do end up trying msession I'd be very curious as to your impressions, as I've just started looking into the possibility of using it in the near future over an nfs type solution as you've described.

  13. #13
    SitePoint Evangelist
    Join Date
    May 2004
    Location
    New Jersey, USA
    Posts
    567
    Quote Originally Posted by bmatheny
    It does change from session to session. When you log back in you may have new roles, a different subscription expiration date, new number of forum posts, new private messages, a new password, etc. What we do right now is the following: say a user goes to their preferences and turns text message alerts from on to off. That gets updated in their session as well as in the database. But from then on while they're browsing the site, all the places that need to know whether the user accepts text messages or not can just grab it from an object in the session instead of querying the database. This happens for LOTS of user attributes.
    So my thought is that you can defer all that data, so:
    PHP Code:
    if (!isset($_SESSION['accept private messages?']))
    {
        // Lazy load: fetch the setting and store it in $_SESSION on first use only.
        get_private_message_info();
    }

    This way, if you never need to know that, it never gets into the session data. (Realize that we're trading smaller session data size for less efficient login/data access.)


    Quote:
    I'm at 12k now, working on getting that down to 10k.
    If it was that easy, aim for 4k.

    Quote:
    [not logging in] ... would be a disaster for us. We rely on new people coming, registering in order to use the service (you really do have to register to use it) and then signing up for a subscription to the service (a cellular content provider).
    Would it really be a disaster? Or could you verify their account some other way? (I know exactly nothing about what you're doing, or cellular phone [WAP?] access.) Do they have a fixed IP address on their phone that you could memorize, and so eliminate all the login stuff? Could you store an encrypted cookie that eliminated the need for a session if they were doing 30% to 60% of your site's functions? (Obviously, if they don't have the cookie then they go through the login process again...)
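
    A sketch of a cookie that identifies the user without a server-side session. Note that this version signs the value with an HMAC rather than encrypting it, so the contents are readable but cannot be altered without the secret. The token format, function names, and secret handling are all assumptions, and the user id is assumed not to contain the '|' separator.

```php
<?php
// Build a cookie value "userId|expires|signature".
// The signature covers both the user id and the expiry time.
function make_login_token(string $userId, int $expires, string $secret): string
{
    $payload = $userId . '|' . $expires;
    return $payload . '|' . hash_hmac('sha256', $payload, $secret);
}

// Return the user id if the token is intact and unexpired, otherwise null.
function check_login_token(string $token, string $secret, int $now): ?string
{
    $parts = explode('|', $token);
    if (count($parts) !== 3) {
        return null;
    }
    [$userId, $expires, $sig] = $parts;
    $expected = hash_hmac('sha256', $userId . '|' . $expires, $secret);
    if (!hash_equals($expected, $sig) || (int) $expires < $now) {
        return null;
    }
    return $userId;
}
```

    Pages that only need to know who the user is could check the token and skip session_start() entirely, falling back to the normal login when the check returns null.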

    Quote:
    The average length of visit on the site is 24 minutes. The session timeout is set for, well it really doesn't matter since PHP doesn't do GC on sessions if you use the split sessions feature.
    So maybe you need two "session" files: more permanent stuff, and the number-of-posts type info that becomes obsolete as soon as they disconnect.

    If you can guarantee that same-session requests come back to the same server, then you could just write the session data on a local disk/ramdisk. (Then have your garbage collector "move" the long-term session info back to the 'central' server when it's obvious that the session is ended.)

    The processing would be:

    web sees new login
    web copies nfs session data locally to /files/php_sessions
    ...
    web GC cron sees expired session in /files/php_sessions
    web GC cron copies data to nfs server
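
    The GC half of the steps above could be sketched roughly as follows, assuming local session files live in one flat directory and are named sess_<id>. The function name and the idle threshold are illustrative.

```php
<?php
// Cron-style pass: move session files that have been idle for more than
// $maxIdle seconds from the local directory to the shared (NFS) directory.
function flush_expired(string $localDir, string $sharedDir, int $maxIdle, int $now): int
{
    $moved = 0;
    foreach (glob($localDir . '/sess_*') ?: [] as $path) {
        if ($now - filemtime($path) <= $maxIdle) {
            continue; // still active on this web server
        }
        // Copy first, and delete the local file only after the copy succeeded.
        if (copy($path, $sharedDir . '/' . basename($path)) && unlink($path)) {
            $moved++;
        }
    }
    return $moved;
}
```

    If $maxIdle is comfortably longer than the session timeout, the window in which a user logs back in while their file is being moved stays small.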

    There's the possibility of a login/logout error, but how likely is that? Especially if your session timeout and your GC timeout are set up correctly.

    Quote:
    This is why I was considering using msession. Since we can have a beefy box with 16GB/ram keep all the sessions in memory and the web servers can retrieve session data over the msession port. I'm curious if msession is less expensive than nfs, my guess is yes.
    Well, if they have to talk to another machine the time isn't going to be great, especially if your nfs directories are structured well (easily cached).

    OTOH, I've never even heard of msession. Where can I find a pointer?

    Quote:
    I'm going to work on getting the session size down to 8-10k.
    4k, dude. 10k was too easy. Bleed a little.

    =Austin

  14. #14
    SitePoint Enthusiast
    Join Date
    Feb 2003
    Location
    vancouver
    Posts
    29
    Quote Originally Posted by Austin_Hastings
    OTOH, I've never even heard of msession. Where can I find a pointer?
    Here you go:

    http://www.php.net/msession

    "msession is an interface to a high speed session daemon which can run either locally or remotely. It is designed to provide consistent session management for a PHP web farm."

  15. #15
    SitePoint Enthusiast
    Join Date
    Jun 2004
    Location
    New Jersey
    Posts
    69
    You mind if I ask what website this is? If you really get this many signups a day/hits a day, I'd be interested to see what it is you're developing...

  16. #16
    SitePoint Evangelist
    Join Date
    May 2004
    Location
    New Jersey, USA
    Posts
    567
    Quote Originally Posted by Unknown_Relic
    Yikes. That didn't take long for MEGO. I'll read it again when I've got more brain cells.

    =Austin

  17. #17
    SitePoint Member
    Join Date
    Mar 2004
    Location
    Indiana
    Posts
    23
    Quote Originally Posted by evolve
    You mind if I ask what website this is? If you really get this many signups a day/hits a day, I'd be interested to see what it is you're developing...
    Replied privately.

    I'm also a bit curious. What do sites like Amazon, Slashdot, CNN, Yahoo, etc. do? I have never found a good article regarding web technology at these companies except for Yahoo Finance, and that's because Jeremy works there. Pointers?

  18. #18
    SitePoint Addict been's Avatar
    Join Date
    May 2002
    Location
    Gent, Belgium
    Posts
    284
    I don't think these companies will disclose the nitty-gritty details of their infrastructures, but there are some clues to be found:

    Google infrastructure:
    A video: http://www.uwtv.org/programs/displayevent.asp?rid=1680
    A pdf: http://www.computer.org/micro/mi2003/m2022.pdf

    One year of PHP at Yahoo (summer 2003):
    http://public.yahoo.com/~radwin/talk...-oscon2003.htm
    Per
    Everything works on a PowerPoint slide

  19. #19
    SitePoint Addict
    Join Date
    Apr 2002
    Posts
    330
    Quote Originally Posted by bmatheny
    The problem is, there is a ton of data that needs to be stored for the duration of a user's session. And storing it (the ephemeral data) in the database is really inappropriate, since it never needs to be queried on and changes from login to login. We have around 3500 concurrent connections right now, and I don't see even a really big x86 box being able to handle that kind of SQL traffic.
    Maybe if you can answer the following questions, it will be possible to give more adequate advice.

    1. Are those concurrent connections to the database or to the Web server?

    2. Which Web server do you use?

    3. Do you use the same Web server for serving PHP scripts and static content (images, css, etc..)?

    4. Do you actually use sessions to provide personalized pages for non-logged users or only pages served to logged users need the data stored in sessions?
    Manuel Lemos

    Metastorage - Data object relational mapping layer generator
    PHP Classes - Free ready to use OOP components in PHP

  20. #20
    SitePoint Member
    Join Date
    Mar 2004
    Location
    Indiana
    Posts
    23
    Quote Originally Posted by mlemos
    1. Are those concurrent connections to the database or to the Web server?
    We do about 3k concurrent web connections and about 3500 concurrent sql connections.

    Quote Originally Posted by mlemos
    2. Which Web server do you use?
    Apache 1.3.31

    Quote Originally Posted by mlemos
    3. Do you use the same Web server for serving PHP scripts and static content (images, css, etc..)?
    The NFS server serves 90% of static content.

    Quote Originally Posted by mlemos
    4. Do you actually use sessions to provide personalized pages for non-logged users or only pages served to logged users need the data stored in sessions?
    Sessions for logged in users. The session for a non-logged in user is only like 100 bytes or so, and is only used to store permissions: 'All Users', 'Anonymous User' and I think that's it.

  21. #21
    SitePoint Addict
    Join Date
    Apr 2002
    Posts
    330
    Quote Originally Posted by bmatheny
    We do about 3k concurrent web connections and about 3500 concurrent sql connections.
    Does that mean that the extra 500 connections are not established by PHP scripts running from the Web server?

    When you say concurrent Web connections, do you mean 3000 forked Apache processes?

    Quote Originally Posted by bmatheny
    Apache 1.3.31


    Quote:
    The NFS server serves 90% of static content.
    I am not sure if you answered what I asked. When you say the NFS server serves 90% of static content, do you mean you have a separate Web server running on the NFS server machine serving static content, or do you mean that static content is stored in files available to the Web server via NFS?


    Quote Originally Posted by bmatheny
    Sessions for logged in users. The session for a non-logged in user is only like 100 bytes or so, and is only used to store permissions: 'All Users', 'Anonymous User' and I think that's it.
    I am not sure whether you are saying that it is really necessary. I would advise not starting sessions for non-logged users unless your site cannot live without them. Let's not start using resources until they are really necessary.
    Manuel Lemos

    Metastorage - Data object relational mapping layer generator
    PHP Classes - Free ready to use OOP components in PHP

  22. #22
    SitePoint Enthusiast
    Join Date
    Feb 2004
    Location
    France
    Posts
    58
    Quote Originally Posted by bmatheny
    We have around 3500 concurrent connections right now, I don't see even a really big x86 box being able to handle that kind of sql traffic.
    Well I have heard of x86 boxes running MySQL with about 3800 queries/sec

    One solution is to have every user connected to the same web server (this can be done at the load-balancer level; there are other solutions*). Once you do this your problem is solved: you do not need to share session data, and you can store it locally on the web server (flat file, SQLite, etc.). Your system becomes highly scalable and you can store almost as much stuff as you want for each session (you can just add more HTTP servers as you grow).

    * Solutions to assign a server to a visitor :
    - assign the server based on the IP of the user (for example, users with an IP ending in 1 get server 1, etc.)
    - have each user use a custom subdomain ( mylogin.domain.com ) and then map each subdomain to a web server
    - etc.
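
    The IP-based option in the list above might be sketched like this. It is an illustration of the mapping only: in practice the load balancer would do this, the function name is invented, and only IPv4 addresses are handled.

```php
<?php
// Pick a server index (0-based) from the last octet of an IPv4 address.
function server_for_ip(string $ip, int $serverCount): int
{
    $long = ip2long($ip);
    if ($long === false || $serverCount < 1) {
        return 0; // fall back to the first server for anything unparseable
    }
    return ($long & 0xFF) % $serverCount;
}
```

    The same visitor always lands on the same server as long as their address and the server count stay the same. Users behind a large proxy all share one server, which can skew the load.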

  23. #23
    SitePoint Member
    Join Date
    Mar 2004
    Location
    Indiana
    Posts
    23
    Quote Originally Posted by mlemos
    Does that mean that the extra 500 connections are not established by PHP scripts running from the Web server?

    When you say concurrent Web connections, do you mean 3000 forked Apache processes?
    Yes, there are external connections to the sql servers. And yes, but across 4 machines. So around 700 forked processes per machine.

    Quote Originally Posted by mlemos
    I am not sure if you answered what I asked. When you say the NFS server serves 90% of static content, do you mean you have a separate Web server running on the NFS server machine serving static content, or do you mean that static content is stored in files available to the Web server via NFS?
    Yes, we have a separate web server running on the NFS machine, serving static content.

    Quote Originally Posted by mlemos
    I am not sure whether you are saying that it is really necessary. I would advise not starting sessions for non-logged users unless your site cannot live without them. Let's not start using resources until they are really necessary.
    We have to for anonymous users, since the session contains permissions data and possibly shopping cart information.

  24. #24
    SitePoint Member
    Join Date
    Mar 2004
    Location
    Indiana
    Posts
    23
    Quote Originally Posted by Betcour
    One solution is to have every user connected to the same web server (this can be done at the load-balancer level; there are other solutions*). Once you do this your problem is solved: you do not need to share session data...
    We already have every user who connects go to the same web server (the load balancer does this). However, the reason we share session data is to deal with a web server going down. If a web server goes down, we would like the user not to lose their session data.

  25. #25
    SitePoint Evangelist
    Join Date
    May 2004
    Location
    New Jersey, USA
    Posts
    567
    Quote Originally Posted by bmatheny
    We already have every user who connects go to the same web server (the load balancer does this). However, the reason we share session data is to deal with a web server going down. If a web server goes down, we would like the user not to lose their session data.
    Perfect: You code your own session handler to store the local session data in memory someplace, and then write through to a write-only cache area.

    Also, you can explicitly close your http out (wrapping up transmission to the user) and then finish up with writing out the 'permanent' cache.

    =Austin
