meyerweb.com

Regular Expression Help

Some time ago, Simon Willison pointed out a very cool bookmarklet that helps solve the “I have one password for all my public sites” problem.  This is where someone picks a password they can remember, and then uses that as the password for their accounts on Amazon, eBay, Hotmail, Netflix, et cetera.  This is one of those things that security experts tell you never to do, and yet just about everyone does, because given the plethora of accounts most of us maintain, there’s no way we could keep track of which password goes with which account unless it was all written down somewhere… and that’s something the security experts insist that you never, ever do.

So the bookmarklet takes your ‘master password’, crosses it with the domain of the site, and generates an MD5-based result.  So let’s assume meyerweb had accounts.  You would fire off the bookmarklet, which would ask you to type in your master password.  So let’s say your master password is ‘passwd’; this is combined with www.meyerweb.com and the resulting password is 68573552.  On the other hand, if you just use meyerweb.com, the result is 92938a6e.
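As a rough illustration of the idea (not the bookmarklet's actual code), here is a minimal Node.js sketch. The function name `md5SitePassword`, the `master + ':' + host` concatenation, and the 8-character hex prefix are all assumptions made for the example; the real bookmarklet may combine and truncate things differently, so the outputs will not necessarily match the passwords quoted above.

```javascript
// Node.js sketch; the bookmarklet itself runs in the browser with its own MD5 code.
const nodeCrypto = require('crypto');

// ASSUMPTION: the "master:host" format and the 8-character hex prefix are
// guesses for illustration, not the bookmarklet's confirmed scheme.
function md5SitePassword(master, host) {
  const digest = nodeCrypto
    .createHash('md5')
    .update(master + ':' + host)
    .digest('hex');
  return digest.slice(0, 8);
}

const withWww = md5SitePassword('passwd', 'www.meyerweb.com');
const withoutWww = md5SitePassword('passwd', 'meyerweb.com');

// The two hostnames hash to different passwords, which is the problem at hand.
console.log(withWww, withoutWww);
```

Whatever the exact scheme, any change to the hostname string changes the hash input, so the generated password changes with it.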

Now, while those aren’t the most secure possible passwords, they’re a lot more secure than ‘passwd’.  So I’d like to make use of this bookmarklet.  Fine, great.  The problem is what you just saw: the generated password changes if the full host and domain name bit changes.  This could be a problem if, say, amazon.com suddenly starts routing all logins to a server named login.amazon.com… or vice versa.  So I’d like to adapt the bookmarklet so it grabs just the domain and TLD (I probably got those terms wrong; I usually do) of a URL.  Problem is, I can’t write regular expressions for squat.  I don’t even understand how the regexp in the existing bookmarklet works.

So, a little help, please?  Given the form http://www.domain.tld/blah/foo/wow.xyz, I want the regexp to return just domain.tld.  Just leave a solution in the comments, and you’ll earn the respect and adulation of your peers.  At least those of them who read the comments.

44 Responses»

    • #1
    • Comment
    • Wed 27 Oct 2004
    • 1315
    ugly virgin wrote in to say...

    what about something like: /\.?([^.\/]+\.[^.\/]+)/

    • #2
    • Comment
    • Wed 27 Oct 2004
    • 1318
    Lucas Carlson wrote in to say...

    Try this (or some permutation thereof):

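    function getHost(url) {
    // Escape the dot in "www." so it only matches a literal dot
    var match = /http(s)?:\/\/(www\.)?([^\/]+)\/.*/i.exec(url);
    // exec() returns null when the URL does not match
    return match ? match[3] : null;
    }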

    • #3
    • Comment
    • Wed 27 Oct 2004
    • 1319
    Jeff Carnahan wrote in to say...

    That isn’t going to be sufficient. Remember you’re starting from the URL, not the hostname. If you think deeper about this problem, you realize that the solution must work for various types of URLs. Say URLs that use http or https, URLs with an explicit host defined, and URLs whose hostnames have more than three parts. Example:

    https://some.really.long.host.com:555/some/path/to/a/page.html

    I assume the goal is to identify only “host.com” in this case, and to do so without using any additional functions (only a single regex).

    • #4
    • Comment
    • Wed 27 Oct 2004
    • 1320
    Stuart Ballard wrote in to say...

    I’d make sure that if the TLD is two characters, you get the third level domain as well. Not that I particularly want to write that regular expression, but I think it would kind of suck if you generated the same password for everything in .co.uk…

    • #5
    • Comment
    • Wed 27 Oct 2004
    • 1320
    Jeff Carnahan wrote in to say...

    Gah, by explicit host, I mean port.. :-)

    • #6
    • Comment
    • Wed 27 Oct 2004
    • 1327
    Marilyn Langfeld wrote in to say...

    For those of us who need a little more, I can suggest Web Confidential, which is a simple program for all your passwords. It doesn’t generate them, just lists them, but I’m pleased with it so far. I’m no longer afraid to have unique passwords for each site, etc. Works on Mac, PC and Palm and is put out by the same folks who make URL Manager Pro.

    I also want to thank you for all your work. Really fantastic!

    • #7
    • Comment
    • Wed 27 Oct 2004
    • 1327
    Seb wrote in to say...

    I think this works with just about every possibility:

    [a-zA-Z]+://(.+\.)*([^\.\/]+)\.([^\.\/]+)/?.?

    Replace with:

    \2.\3

    (This is tested using BBEdit’s grep; I don’t know if it’s any different for other uses/languages…)

    • #8
    • Comment
    • Wed 27 Oct 2004
    • 1328
    Uri Bernstein wrote in to say...

    To further complicate things, comment #4 correctly applies to some two-letter TLDs, such as .uk .il, and .jp (I think), but not to others (such as .it and .ch) – which do not have a “three level” structure (i.e. there is no co.it or co.ch, but rather things go directly under .it and .ch).

    See mozilla bug 252342 for a similar problem.

    • #9
    • Comment
    • Wed 27 Oct 2004
    • 1329
    Seb wrote in to say...

    Oops! Didn’t take into account the .co.uk factor (which is stupid really, as I live in the UK…)

    • #10
    • Comment
    • Wed 27 Oct 2004
    • 1336
    Uri Bernstein wrote in to say...

    In reply to comment #7:
    I think that (.+\.)* is problematic as it might match a lot more than you expect. For example, if the URL is
    http://somedomain.com/other.domain.com
    The regexp will split it:
    http://(somedomain.)(com/other.)(domain).(com)
    So I think you want to replace that first element by: ([a-z0-9_]+\.)*
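To make the difference concrete, here is a small JavaScript sketch comparing the pattern from comment #7 with the restricted first group suggested above. The helper name `lastTwoLabels` is invented for the example, and the patterns are transcribed into JavaScript literal syntax with the slashes escaped.

```javascript
const url = 'http://somedomain.com/other.domain.com';

// Pattern from comment #7: (.+\.)* can swallow slashes and run into the path.
const greedy = /[a-zA-Z]+:\/\/(.+\.)*([^.\/]+)\.([^.\/]+)\/?.?/;

// Same pattern with the first group restricted as suggested in comment #10.
const restricted = /[a-zA-Z]+:\/\/([a-z0-9_]+\.)*([^.\/]+)\.([^.\/]+)\/?.?/;

// Hypothetical helper: joins capture groups 2 and 3, like the "\2.\3" replacement.
function lastTwoLabels(pattern, input) {
  const m = pattern.exec(input);
  return m ? m[2] + '.' + m[3] : null;
}

console.log(lastTwoLabels(greedy, url));     // domain.com (taken from the path)
console.log(lastTwoLabels(restricted, url)); // somedomain.com
```

Note that the restricted class as written has no hyphen and is lowercase only, so hostnames with hyphens or capitals would probably need `[a-z0-9_-]` and the `i` flag.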

    • #11
    • Comment
    • Wed 27 Oct 2004
    • 1342
    Mark wrote in to say...


    preg_match("/[^\.\/]+\.[^\.\/]+$/", $host, $matches);
    echo "domain name is: {$matches[0]}\n";

    • #12
    • Comment
    • Wed 27 Oct 2004
    • 1348
    Frode Danielsen wrote in to say...

    I’ve tested this one through PHP’s preg_match, not through JavaScript’s RegExp, so I’m not sure it’ll work in the bookmarklet. Tested on these URIs:
    http://www.amazon.com/
    http://www.amazon.co.uk/
    http://www.amazon.co.uk:80/
    http://www.blah.amazon.com/
    http://www.blah.amazon.co.uk/
    http://www.blah.amazon.co.uk:80/

    And the RegExp is:
    http[s]?://(?:(?:[^\.]+)?\.)*([^\.]{3,}\.(?:[\w]{2,7})(?:\.[\w]{2,3})?)(?::[\d]+)?

    Which makes certain assumptions. I’m not sure if those assumptions would rule out a few legal URIs, but I couldn’t think of any such URIs at the moment of hacking it together.
    (?:(?:[^\.]+)?\.)* – takes care of any amount of http://www.xxx.yyy.
    ([^\.]{3,}\.(?:[\w]{2,7})(?:\.[\w]{2,3})?) – saves the domain (which must be 3 letters or more, ‘amazon’ in my examples) and then at least one ‘.com’/’.co’/what-have-you, which may be up to 7 letters long (I think that’s the largest allowed TLD). And finally it allows one optional ‘.uk’ etc. which may only be 2-3 letters. I don’t think there exists any ‘.x.y’ which have a longer ‘y’.
    (?::[\d]+)? – takes care of any specified ports.

    Phew :)

    • #13
    • Comment
    • Wed 27 Oct 2004
    • 1352
    Frode Danielsen wrote in to say...

    Damnit, ok – to make Eric’s own example work, I’ll require the URI to end with a / after the TLD. So:
    http[s]?://(?:(?:[^\.]+)?\.)*([^\.]{3,}\.(?:[\w]{2,7})(?:\.[\w]{2,3})?)(?::[\d]+)?/

    My RegExp works on Eric’s own example at least. :)

    • #14
    • Comment
    • Wed 27 Oct 2004
    • 1402
    Paul Roub wrote in to say...

    Not to wade into the murky waters of the domain extraction itself, at least part of the problem (and the bookmarklet) could be simplified by using document.location.hostname instead of document.location.href — at least we no longer have to worry about pulling out protocol strings and paths.

    e.g., for this page, document.location.href gives “http://www.meyerweb.com/eric/thoughts/2004/10/27/regular-expression-help/”, whereas document.location.hostname gives “www.meyerweb.com”

    I’d also point out (for those trying this as an exercise at home), the bookmarklet doesn’t *have* to be written around a single domain regexp. Loops, etc. could be used.
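The same separation can be tried outside a page with the standard `URL` class, which is a modern stand-in and not something the 2004 bookmarklet would have used. A minimal sketch, using only URLs that already appear in this thread:

```javascript
// The URL class parses the string the same way the browser parses
// document.location, so hostname comes back without protocol or path.
const pageUrl = new URL('http://www.meyerweb.com/eric/thoughts/2004/10/27/regular-expression-help/');
console.log(pageUrl.hostname); // www.meyerweb.com

// With an explicit port, host keeps the port and hostname drops it.
const withPort = new URL('https://some.really.long.host.com:555/some/path/to/a/page.html');
console.log(withPort.host);     // some.really.long.host.com:555
console.log(withPort.hostname); // some.really.long.host.com
```

Starting from `hostname` means the remaining work is only deciding how many leading labels to drop.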

    • #15
    • Comment
    • Wed 27 Oct 2004
    • 1411
    Ethan wrote in to say...

    Any chance we could use document.location.host as a starting point?

    function getDomain() {
    var domain = window.location.host;
    var match = /[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,3})/i.exec(domain);
    alert(match[match.length - 2] + match[match.length - 1]);
    }

    My regex-fu sucks (really, REALLY sucks), but the above returns “.meyerweb.com” if I run it on this page.

    • #16
    • Comment
    • Wed 27 Oct 2004
    • 1412
    Ethan wrote in to say...

    Oops. Paul hadn’t yet posted when I started drafting my response — sorry ’bout that.

    • #17
    • Comment
    • Wed 27 Oct 2004
    • 1413
    Sergio Lopes wrote in to say...

    Hi! You can’t look only at “something.xxx”. Here, in Brazil, for example, we have domain names like this: name.lastname.nom.br (used for personal sites). So, two people can register two absolutely different domains like eric.meyer.nom.br and john.meyer.nom.br. Both are valid domains and both have the same “something.xxx” (note that meyer.nom.br doesn’t exist and is not valid).
    So, if you are considering a bookmarklet for global use, don’t do this. Maybe other countries use things like this too.

    • #18
    • Comment
    • Wed 27 Oct 2004
    • 1424
    Jacob Kaplan-Moss wrote in to say...

    Actually, I think a regex is overkill here (“… now you’ve got two problems”).

    The following function, using document.location.hostname and array/string manipulation should do what you want:


    function getDomainName() {
    var host = document.location.hostname;
    var parts = host.split('.');
    if (parts[parts.length - 1].length == 2) {
    return parts.slice(parts.length - 3).join('.');
    } else {
    return parts.slice(parts.length - 2).join('.');
    }
    }

    • #19
    • Comment
    • Wed 27 Oct 2004
    • 1425
    Dan Cech wrote in to say...

    You could take the hostname and split it on the .s, then start from the end and take everything up to the first item > 3 characters, like this:

    var split = location.hostname.split('.');
    var output = '';
    while (item = split.pop()) {
    if (output != '') output = '.'+output;
    output = item+output;
    if (item.length > 3) break;
    }

    • #20
    • Comment
    • Wed 27 Oct 2004
    • 1440
    Uri Bernstein wrote in to say...

    Given that “host” is the hostname as per comment 14, the following will do the trick assuming you are dealing with a two-part domain name (such as foo.com, not foo.co.uk):

    function getdomain(host) {
    var match = /[^.]*\.[^.]*$/i.exec(host);
    // exec() returns an array (or null); element 0 is the matched text
    return match ? match[0] : null;
    }

    If you want to assume that all two-letter TLDs are used with three-part domain names (like .uk and .il are), you can use:

    function getdomain(host) {
    var match = /[^.]*\.[^.]*(\...)?$/.exec(host);
    return match[0];
    }

    • #21
    • Comment
    • Wed 27 Oct 2004
    • 1448
    Jonathan Blake wrote in to say...

    I’m not one to resist an interesting problem, but being a security-geek wannabe, let me first recommend a much more secure solution that provides single password access to all of your other passwords originally designed by everyone’s favorite security expert, Bruce Schneier: Password Safe.

    Now that that’s off my chest, the only way to do this for all possible domains would be to encode a great deal of knowledge about the domain name system into the regex. For the reason mentioned by Uri Bernstein in #8, you’ll need to know the rules for each possible ccTLD.

    The safer route would be to restrict to the top few TLDs that you know the rules for and throw an error for all others that you weren’t expecting. The following regex tested against the window.location.hostname should work.

    .*?([^.]+\.)(com|org|edu|co\.uk)

    Add TLDs to taste.

    I admit I haven’t written many JavaScript regexen, but the following should work:

    var domainBits = /.*?([^.]+\.)(com|org|edu|co\.uk)/.exec(window.location.hostname);
    if( domainBits ) {
      var domain = domainBits[ 1 ] + domainBits[ 2 ];
      /* perform MD5 stuff, etc. */
    }
    else {
      alert( "Top level domain (TLD) unrecognized. Stopping." );
    }
    

    Paste the following into your web browser to test (only tested on Firefox and IE6 so far):

    javascript:var domainBits = /.*?([^.]+\.)(com|org|edu|co\.uk)/.exec(window.location.hostname);if( domainBits ) { var domain = domainBits[ 1 ] + domainBits[ 2 ]; alert( domain ); }else { alert( "Top level domain (TLD) unrecognized. Stopping." ); }

    • #22
    • Comment
    • Wed 27 Oct 2004
    • 1448
    Jesse Ruderman wrote in to say...

    Nothing stops a site from redefining prompt() and stealing your master password when you type it. Do you trust every site you log into to not do that?

    • #23
    • Comment
    • Wed 27 Oct 2004
    • 1510
    Jonathan Blake wrote in to say...

    Let me revise the expression that I suggested in my previous comment:

    ^.*?([^.]+\.)(com|org|edu|co\.uk)$

    Some security geek I am! I didn’t anchor the regex to ensure that the entire hostname was tested. This would create problems for URIs like www.com.com.
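A short JavaScript sketch of why the anchors matter, using the two versions of the expression from this thread. The helper name `domainFrom` and the hostname `www.example.it` are made up for the illustration.

```javascript
// Version from comment #21 (unanchored) and the revision from comment #23.
const unanchored = /.*?([^.]+\.)(com|org|edu|co\.uk)/;
const anchored = /^.*?([^.]+\.)(com|org|edu|co\.uk)$/;

// Hypothetical helper: joins the two capture groups, as the bookmarklet code does.
function domainFrom(pattern, hostname) {
  const bits = pattern.exec(hostname);
  return bits ? bits[1] + bits[2] : null;
}

// Without anchors the first "com" wins, which is the wrong one.
console.log(domainFrom(unanchored, 'www.com.com'));    // www.com
console.log(domainFrom(anchored, 'www.com.com'));      // com.com
console.log(domainFrom(anchored, 'www.amazon.co.uk')); // amazon.co.uk
// A TLD outside the list yields null, so the caller can stop with an error.
console.log(domainFrom(anchored, 'www.example.it'));   // null
```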

    • #24
    • Comment
    • Wed 27 Oct 2004
    • 1540
    Uri Bernstein wrote in to say...

    Jonathan – do you really need that “^.*?” at the beginning?
    Me being a (former) perl geek, I just can’t see four characters wasted like that for (apparently) nothing :-)

    • #25
    • Comment
    • Wed 27 Oct 2004
    • 1550
    Jonathan Blake wrote in to say...

    Uri, in the interest of saving defenseless bits everywhere, thank you for pointing that out. I just wasn’t originally sure if the exec() method matched the entire string or not—I was too lazy to test. But it appears that v3.0 of the regex is in order:

    ([^.]+\.)(com|org|edu|co\.uk)$

    Please believe me when I say that I detest extra typing just as much as the next guy. I am not a bit-killer! :)

    • #26
    • Comment
    • Wed 27 Oct 2004
    • 1652
    Jehiah wrote in to say...

    I like Dan’s method in comment #19 but would suggest a slightly different approach. If our real goal is to drop off the first one or two elements from the domain, then that is exactly what we should do.
    In the code below I have modified it to help with some really long domains. And this still works when browsing to the IP address.


    var split = location.hostname.split('.');
    var output = '';
    var dropoff = (split.length > 4) ? 2 : 1; // the longer the url, the more to drop off
    while (item = split.pop()) {
    if (split.length >= dropoff) // drop off the lowest elements
    {if (output != '')
    output = '.'+output;
    output = item + output;}
    }
    alert(output);

    • #27
    • Comment
    • Wed 27 Oct 2004
    • 1654
    Charles wrote in to say...

    One problem is that some sites have specific password rules, so that an arbitrary password may not be accepted.

    • #28
    • Comment
    • Wed 27 Oct 2004
    • 1704
    Helge Grimm wrote in to say...

    I would do it this way (no regexp needed – it’s basic js):

    function getDomain() {
    var domArray = document.domain.split('.');
    return domArray[domArray.length-2]+"."+domArray[domArray.length-1];
    }

    As Bookmarklet:
    javascript:domArray = document.domain.split('.');alert(domArray[domArray.length-2]+"."+domArray[domArray.length-1]);

    • #29
    • Comment
    • Wed 27 Oct 2004
    • 1710
    lamer wrote in to say...

    will your regexps work with 127.0.0.1 or localhost? :)

    • #30
    • Comment
    • Thu 28 Oct 2004
    • 0001
    Angus Turnbull wrote in to say...

    Enlighten thyself, grasshopper :). The simplest and best coverage of JavaScript regular expressions I’ve found is this Evolt article.

    Having said that, different countries are REALLY going to be the major problem here. Some countries, like NZ, Australia and the UK, allow second level domain names:

    foo.co.nz
    bar.co.nz

    are completely different sites, whereas in some other countries with first-level domains:

    foo.abc.ru
    bar.abc.ru

    are both subdomains of the main abc.ru site, a fully qualified domain in its own right (unlike co.nz).

    So really, you’d need a database in the bookmarklet of all 2 and 3 letter country codes. Otherwise, the script won’t be able to tell which domains can be trimmed down, and which can’t. Alternatively, just list a common bunch of second-level subdomains like (co|com|org|net|gen|mil) and don’t sweat the occasional shared password…
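The "short list of second-level labels" alternative could look roughly like the sketch below. It uses split() instead of a regex, as several comments suggest. The function name `trimHostname` is hypothetical, the list is deliberately incomplete, and the rule (three labels only when a two-letter TLD follows a listed label) is an assumption that will misfire for some countries, including the nom.br case from comment #17.

```javascript
// ASSUMPTION: a tiny, incomplete list of second-level labels, taken from the
// suggestion above. A real tool would need per-country rules.
const SECOND_LEVEL_LABELS = new Set(['co', 'com', 'org', 'net', 'gen', 'mil']);

// Hypothetical helper: trims a hostname down to the part worth hashing.
function trimHostname(hostname) {
  const host = hostname.toLowerCase();
  // Leave IPv4 addresses alone; their "labels" are not domains.
  if (/^\d{1,3}(\.\d{1,3}){3}$/.test(host)) return host;

  const parts = host.split('.');
  if (parts.length <= 2) return host; // e.g. localhost or meyerweb.com

  const tld = parts[parts.length - 1];
  const second = parts[parts.length - 2];
  // Keep three labels only for two-letter TLDs with a listed second level.
  const keep = tld.length === 2 && SECOND_LEVEL_LABELS.has(second) ? 3 : 2;
  return parts.slice(-keep).join('.');
}

console.log(trimHostname('www.amazon.co.uk')); // amazon.co.uk
console.log(trimHostname('foo.abc.ru'));       // abc.ru
console.log(trimHostname('login.amazon.com')); // amazon.com
console.log(trimHostname('127.0.0.1'));        // 127.0.0.1
```

The occasional shared password mentioned above is the price of this approach: any country whose structure is not covered by the list falls back to the last two labels.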

    • #31
    • Comment
    • Thu 28 Oct 2004
    • 0133
    Thomas Cutter wrote in to say...

    I could really use some help with regular expressions if somebody out there could email me that would be great. I have a project for work that I need to get done asap and I am pretty clueless when it comes to regular expressions.

    • #32
    • Comment
    • Thu 28 Oct 2004
    • 0454
    Martijn Ras wrote in to say...

    Heya Eric,

    It’s really not that hard, here’s my regular expression:

    [^.]*\.\([^:/]*\).*

    You’re not interested in everything up to the first dot in the hostname “[^.]*\.”, what comes next is of interest up to either the start of the port (:) or the path (/), anything after that is of no interest either.

    Here’s two examples:
    echo "http://www.domain.tld/blah/foo/wow.xyz" | sed 's/[^.]*\.\([^:/]*\).*/\1/' => domain.tld

    echo "http://www.very.long.domain.co.uk/blah/foo/wow.xyz" | sed 's/[^.]*\.\([^:/]*\).*/\1/' => very.long.domain.co.uk

    Mazzel,

    Martijn

    • #33
    • Comment
    • Thu 28 Oct 2004
    • 0548
    Michael Ward wrote in to say...

    Mozilla Firefox RC1 has a master password setting for the automatic inputting of usernames/passwords for web sites.

    It works by, once per browser session, prompting for the master password when a website with saved login details is accessed.

    Works very well IMHO.

    • #34
    • Comment
    • Thu 28 Oct 2004
    • 0739
    Aristotle wrote in to say...

    $ perl -MRegexp::Common=URI -le'print $RE{URI}{HTTP}{-scheme => "https?"}{-keep}'
    ((https?)://((?:(?:(?:(?:(?:[a-zA-Z0-9][-a-zA-Z0-9]*)?[a-zA-Z0-9])[.])*(?:[a-zA-Z][-a-zA-Z0-9]*[a-zA-Z0-9]|[a-zA-Z])[.]?)|(?:[0-9]+[.][0-9]+[.][0-9]+[.][0-9]+)))(?::((?:[0-9]*)))?(/(((?:(?:(?:(?:[a-zA-Z0-9\-_.!~*'():@&=+$,]+|(?:%[a-fA-F0-9][a-fA-F0-9]))*)(?:;(?:(?:[a-zA-Z0-9\-_.!~*'():@&=+$,]+|(?:%[a-fA-F0-9][a-fA-F0-9]))*))*)(?:/(?:(?:(?:[a-zA-Z0-9\-_.!~*'():@&=+$,]+|(?:%[a-fA-F0-9][a-fA-F0-9]))*)(?:;(?:(?:[a-zA-Z0-9\-_.!~*'():@&=+$,]+|(?:%[a-fA-F0-9][a-fA-F0-9]))*))*))*))(?:[?]((?:(?:[;/?:@&=+$,a-zA-Z0-9\-_.!~*'()]+|(?:%[a-fA-F0-9][a-fA-F0-9]))*)))?))?)

    The third capture group will contain the hostname. The pattern fully respects every aspect of the HTTP URI RFC; I don’t know if the JavaScript pattern dialect supports all the features found in Perl’s, though.

    Unfortunately, due to aforementioned problems with domain structures like in .co.uk and the like, you can’t really tell how much you can chop off the front of the FQDN…

    • #35
    • Comment
    • Thu 28 Oct 2004
    • 0752
    Aristotle wrote in to say...

    Oh, and btw: how about not using the URL, but simply letting you pick a local password yourself? So for Amazon, you’d enter your master password and “amazon” to get a unique site password. Sure you’ll have to remember two things, but the local password can be reasonably trivial and even be written down without a real problem.

    Having a bit of code that proposes such a site password by looking at the URL would be beneficial, then. You can still override it if it guesses badly, and twiddle the heuristics if there is a pattern to bad guesses, without ever risking loss of access to any place.

    • #36
    • Comment
    • Thu 28 Oct 2004
    • 1138
    jgraham wrote in to say...

    From the point of view of actually using this, the important comment is Jesse’s comment 22 – if you use this utility as a bookmarklet, any site can easily steal your master password and thus your login to all other sites that you use. That seems like an awful lot of risk. (There are other, practical, issues, like sites that require the same login but are at different domains, but they’re somewhat irrelevant in the face of the huge gaping security hole that this bookmarklet opens up.) The version with a hardcoded master password variable is a little better (you need to be sure that you trust people with physical access to your machine and that your machine in general is secure).

    • #37
    • Comment
    • Thu 28 Oct 2004
    • 2125
    Ondrej Valek wrote in to say...

    Is that worth trying? From my point of view and everyday experience, every site requires a different type of password. One asks for digits-only, another for 5-to-10 characters alphabet-only and so on. It’s impossible to create them automagically. I prefer having one stupid password which I use everywhere with just minor predictable changes, which I never use for ‘mission critical’ sites. And I never use those web passwords for digital signatures or real-life needs, like bank accounts or whatever.

    • #38
    • Comment
    • Sat 30 Oct 2004
    • 1818
    Nathan Herald wrote in to say...

    This would work great:


    [^.]*\.\([^:/]*\).*

    When can we see this bookmarklet in action Eric?
    I would like to see it put to the test…

    • #39
    • Comment
    • Sun 31 Oct 2004
    • 0454
    Nathan Herald wrote in to say...

    I have made a favelet some of you might want to check out. It does not read the website automagically, you enter it manually. But it works.

    http://www.myobie.com/password/

    Hope this helps someone, I am thinking about using it for a while.

    • #40
    • Comment
    • Sun 31 Oct 2004
    • 2318
    Peter Chandler wrote in to say...

    What’s the point? How would the hash be any more secure than just the master password?

    If you’re going to do this, you’d be better off using a more obscure hash than MD5.

    • #41
    • Comment
    • Wed 3 Nov 2004
    • 1124
    Martijn Ras wrote in to say...

    Heya Nathan,

    Thanks for noticing my comment # 32.

    Mazzel,

    Martijn.

    • #42
    • Comment
    • Fri 5 Nov 2004
    • 0916
    jlcooke wrote in to say...

    A few things:
    – If this “master password” to “site password” thing is going to work, you need more than just a master password hashed with a website domain name.
    + sitepswd = hash(secret + pswd + domain), where secret is randomly generated and stored in a cookie.
    This is no good for “roaming” users who don’t have a dedicated computer, but, meh.
    – Your site password needs to use more than just hex, try the full [0-9][a-z][A-Z](~!@#$%^&*()_+-=) you’ll make the passwords harder to guess that way.
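A hedged Node.js sketch of the two suggestions above: hash a stored random secret together with the master password and domain, then map the digest onto a wider character set. The function name `derivePassword`, the use of SHA-256, the 12-character length, and the plain concatenation order are assumptions for the example. The secret is passed in as an argument here instead of being read from a cookie.

```javascript
const nodeCrypto = require('crypto');

// Letters, digits, and the punctuation listed in the comment above.
const ALPHABET =
  'abcdefghijklmnopqrstuvwxyz' +
  'ABCDEFGHIJKLMNOPQRSTUVWXYZ' +
  '0123456789' +
  '~!@#$%^&*()_+-=';

// ASSUMPTION: SHA-256 and a 12-character result are illustrative choices.
// length must not exceed 32, the number of bytes in a SHA-256 digest.
function derivePassword(secret, master, domain, length = 12) {
  const digest = nodeCrypto
    .createHash('sha256')
    .update(secret + master + domain)
    .digest(); // Buffer of 32 bytes
  let out = '';
  for (let i = 0; i < length; i++) {
    // Modulo mapping is slightly biased; acceptable for a sketch.
    out += ALPHABET[digest[i] % ALPHABET.length];
  }
  return out;
}

// The random secret would be generated once and stored, e.g. in a cookie.
const storedSecret = nodeCrypto.randomBytes(16).toString('hex');
console.log(derivePassword(storedSecret, 'passwd', 'meyerweb.com'));
```

As the comment notes, the stored secret ties the scheme to one machine, and a site with strict password rules may still reject some of these characters.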

    Here’s a few tools that might be handy:
    Generate a password using mouse movements, timing and SHA-256:
    http://www.certainkey.com/demos/mkpasswd
    Check password strength:
    http://www.certainkey.com/demos/password

    • #43
    • Comment
    • Mon 15 Nov 2004
    • 1201
    Gabriel Radic wrote in to say...

    Eric, you’re using a Mac, right? Then love your Keychain.

    • #44
    • Comment
    • Wed 16 Feb 2005
    • 1546
    DRIVE wrote in to say...


    <?php
    $hostname = ($_SERVER['SERVER_NAME']);
    preg_match("/[^\.\/]+\.[^\.\/]+$/", $hostname, $matches);
    echo "domain name is: {$matches[0]}\n";
    ?>
