Regular Expression Help
Published 20 years, 4 weeks past
Some time ago, Simon Willison pointed out a very cool bookmarklet that helps solve the “I have one password for all my public sites” problem. This is where someone picks a password they can remember, and then uses that as the password for their accounts on Amazon, eBay, Hotmail, Netflix, et cetera. This is one of those things that security experts tell you never to do, and yet just about everyone does, because given the plethora of accounts most of us maintain, there’s no way we could keep track of which password goes with which account unless it was all written down somewhere… and that’s something the security experts insist that you never, ever do.
So the bookmarklet takes your ‘master password’, crosses it with the domain of the site, and generates an MD5-based result. So let’s assume meyerweb had accounts. You would fire off the bookmarklet, which would ask you to type in your master password. So let’s say your master password is ‘passwd’; this is combined with www.meyerweb.com and the resulting password is 68573552. On the other hand, if you just use meyerweb.com, the result is 92938a6e.
Now, while those aren’t the most secure possible passwords, they’re a lot more secure than ‘passwd’. So I’d like to make use of this bookmarklet. Fine, great. The problem is what you just saw: the generated password changes if the full host and domain name bit changes. This could be a problem if, say, amazon.com suddenly starts routing all logins to a server named login.amazon.com… or vice versa. So I’d like to adapt the bookmarklet so it grabs just the domain and TLD (I probably got those terms wrong; I usually do) of a URL. Problem is, I can’t write regular expressions for squat. I don’t even understand how the regexp in the existing bookmarklet works.
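To make that host-sensitivity concrete, here is a minimal sketch. It is not the bookmarklet’s actual code: it assumes Node.js and its built-in crypto module, and it assumes the master password and host are simply joined with a colon and the first eight hex digits of the MD5 digest are kept. The real bookmarklet may combine them differently, so this will not necessarily reproduce the values quoted above. The name sitePassword is made up for the example.

```javascript
const crypto = require('crypto');

// Hypothetical helper: derive a per-site password from a master password and a host.
// The "master:host" format and the 8-character cut are assumptions for illustration.
function sitePassword(master, host) {
  const digest = crypto.createHash('md5').update(master + ':' + host).digest('hex');
  return digest.slice(0, 8);
}

const withWww = sitePassword('passwd', 'www.meyerweb.com');
const bare = sitePassword('passwd', 'meyerweb.com');

// Same master password, two spellings of the host, two unrelated results.
console.log(withWww, bare);
```

Because the whole host string feeds the hash, any change in the host (an added or dropped www, a new login subdomain) produces an unrelated password, which is why trimming the host down to a stable domain.tld matters.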
So, a little help, please? Given the form http://www.domain.tld/blah/foo/wow.xyz, I want the regexp to return just domain.tld. Just leave a solution in the comments, and you’ll earn the respect and adulation of your peers. At least those of them who read the comments.
Comments (44)
what about something like: /\.?([^.\/]+\.[^.\/]+)/
Try this (or some permutation thereof):
function getHost(url) {
var match = /http(s)?:\/\/(www.)?([^\/]+)\/.*/i.exec(url)
return match[3]
}
That isn’t going to be sufficient. Remember you’re starting from the URL, not the hostname. If you think deeper about this problem, you realize that the solution must work for various types of URLs. Say URLs that use http or https, URLs with an explicit host defined, and URLs with longer than three TLDs. Example:
https://some.really.long.host.com:555/some/path/to/a/page.html
I assume the goal is to identify only “host.com” in this case, and to do so without using any additional functions (only a single regex).
I’d make sure that if the TLD is two characters, you get the third level domain as well. Not that I particularly want to write that regular expression, but I think it would kind of suck if you generated the same password for everything in .co.uk…
Gah, by explicit host, I mean port.. :-)
For those of us who need a little more, I can suggest Web Confidential, which is a simple program for all your passwords. It doesn’t generate them, just lists them, but I’m pleased with it so far. I’m no longer afraid to have unique passwords for each site, etc. Works on Mac, PC and Palm and is put out by the same folks who make URL Manager Pro.
I also want to thank you for all your work. Really fantastic!
I think this works with just about every possibility:
[a-zA-Z]+://(.+\.)*([^\.\/]+)\.([^\.\/]+)/?.?
Replace with:
\2.\3
(This is tested using BBEdit’s grep; I don’t know if it’s any different for other uses/languages…)
To further complicate things, comment #4 correctly applies to some two-letter TLDs, such as .uk .il, and .jp (I think), but not to others (such as .it and .ch) – which do not have a “three level” structure (i.e. there is no co.it or co.ch, but rather things go directly under .it and .ch).
See mozilla bug 252342 for a similar problem.
Oops! Didn’t take into account the .co.uk factor (which is stupid really, as I live in the UK…)
In reply to comment #7:
I think that (.+\.)* is problematic as it might match a lot more than you expect. e.g. if the URL is
http://somedomain.com/other.domain.com
The regexp will split it:
http://(somedomain.)(com/other.)(domain).(com)
So I think you want to replace that first element by: ([a-z0-9_]+\.)*
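The over-matching is easy to reproduce. The sketch below is a JavaScript translation of the BBEdit pattern from comment #7 (slashes escaped for a regex literal), so small dialect differences are possible; greedy, restricted and lastTwo are names made up for this example.

```javascript
// Original first group: (.+\.)* can swallow slashes and run on into the path.
const greedy = /[a-zA-Z]+:\/\/(.+\.)*([^.\/]+)\.([^.\/]+)\/?.?/;
// Suggested replacement: only letters, digits and underscores before each dot.
const restricted = /[a-zA-Z]+:\/\/([a-z0-9_]+\.)*([^.\/]+)\.([^.\/]+)\/?.?/;

const url = 'http://somedomain.com/other.domain.com';

// Mirrors the "\2.\3" replacement: join the second and third capture groups.
function lastTwo(pattern, input) {
  const m = pattern.exec(input);
  return m ? m[2] + '.' + m[3] : null;
}

console.log(lastTwo(greedy, url));     // domain.com (taken from the path)
console.log(lastTwo(restricted, url)); // somedomain.com
```

One caveat with the restricted class as written: it does not include the hyphen, so hosts with hyphenated subdomains would probably need - added to it.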
preg_match("/[^\.\/]+\.[^\.\/]+$/", $host, $matches);
echo "domain name is: {$matches[0]}\n";
I’ve tested this one through PHP’s preg_match, not through JavaScript’s RegExp, so I’m not sure it’ll work in the bookmarklet. Tested on these URIs:
http://www.amazon.com/
http://www.amazon.co.uk/
http://www.amazon.co.uk:80/
http://www.blah.amazon.com/
http://www.blah.amazon.co.uk/
http://www.blah.amazon.co.uk:80/
And the RegExp is:
http[s]?://(?:(?:[^\.]+)?\.)*([^\.]{3,}\.(?:[\w]{2,7})(?:\.[\w]{2,3})?)(?::[\d]+)?
Which makes certain assumptions. I’m not sure if those assumptions would rule out a few legal URIs, but I couldn’t think of any such URIs at the moment of hacking it together.
(?:(?:[^\.]+)?\.)* – takes care of any amount of http://www.xxx.yyy.
([^\.]{3,}\.(?:[\w]{2,7})(?:\.[\w]{2,3})?) – saves the domain (which must be 3 letters or more, ‘amazon’ in my examples) and then at least one ‘.com’/’.co’/what-have-you, which may be up to 7 letters long (I think that’s the largest allowed TLD). And finally it allows one optional ‘.uk’ etc. which may only be 2-3 letters. I don’t think there exists any ‘.x.y’ which have a longer ‘y’.
(?::[\d]+)? – takes care of any specified ports.
Phew :)
Damnit, ok – to make Eric’s own example work, I’ll require the URI to end with a / after the TLD. So:
http[s]?://(?:(?:[^\.]+)?\.)*([^\.]{3,}\.(?:[\w]{2,7})(?:\.[\w]{2,3})?)(?::[\d]+)?/
My RegExp works on Eric’s own example at least. :)
Not to wade into the murky waters of the domain extraction itself, at least part of the problem (and the bookmarklet) could be simplified by using document.location.hostname instead of document.location.href — at least we no longer have to worry about pulling out protocol strings and paths.
e.g., for this page, document.location.href gives “http://www.meyerweb.com/eric/thoughts/2004/10/27/regular-expression-help/”, whereas document.location.hostname gives “www.meyerweb.com”
I’d also point out (for those trying this as an exercise at home), the bookmarklet doesn’t *have* to be written around a single domain regexp. Loops, etc. could be used.
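Outside a bookmarklet, where there is no document.location, the standard URL constructor gives the same split. This is a minimal sketch, assuming an environment that provides the WHATWG URL class (modern browsers and Node.js); hostnameOf is a name made up here.

```javascript
// Let the URL parser strip the scheme, port and path instead of a regexp.
function hostnameOf(href) {
  return new URL(href).hostname;
}

console.log(hostnameOf('http://www.meyerweb.com/eric/thoughts/2004/10/27/regular-expression-help/'));
// www.meyerweb.com
console.log(hostnameOf('https://some.really.long.host.com:555/some/path/to/a/page.html'));
// some.really.long.host.com
```

That leaves only the harder question of how many labels to keep from the right-hand end of the hostname.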
Any chance we could use document.location.host as a starting point?
function getDomain() {
var domain = window.location.host;
var match = /[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,3})/i.exec(domain);
alert(match[match.length - 2] + match[match.length - 1]);
}
My regex-fu sucks (really, REALLY sucks), but the above returns “.meyerweb.com” if I run it on this page.
Oops. Paul hadn’t yet posted when I started drafting my response — sorry ’bout that.
Hi! You can’t look only to “something.xxx”. Here, in Brazil, for example, we have domain names like this: name.lastname.nom.br (used for personal sites). So, two people can register two absolutely different domains like eric.meyer.nom.br and john.meyer.nom.br. Both are valid domains and both have the same “something.xxx” (note that meyer.nom.br doesn’t exist and is not valid).
So, if you are considering a bookmarklet to global use, don’t do this. Maybe other countries use things like this too.
Actually, I think a regex is overkill here (“… now you’ve got two problems”).
The following function, using document.location.hostname and array/string manipulation, should do what you want:
function getDomainName() {
var host = document.location.hostname;
var parts = host.split('.');
if (parts[parts.length - 1].length == 2) {
// two-letter TLD: keep the last three parts, e.g. amazon.co.uk
return parts.slice(-3).join('.');
} else {
// otherwise keep the last two parts, e.g. amazon.com
return parts.slice(-2).join('.');
}
}
You could take the hostname and split it on the .s, then start from the end and take everything up to the first item > 3 characters, like this:
var split = location.hostname.split('.');
var output = '';
while (item = split.pop()) {
if (output != '') output = '.'+output;
output = item+output;
if (item.length > 3) break;
}
Given that “host” is the hostname as per comment 14, the following will do the trick assuming you are dealing with a two-part domain name (such as foo.com, not foo.co.uk):
function getdomain(host) {
var match = /[^.]*\.[^.]*$/i.exec(host);
return match;
}
If you want to assume that all two-letter TLDs are used with three-part domain names (like .uk and .il are), you can use:
function getdomain(host) {
var match = /[^.]*\.[^.]*(\...)?$/.exec(host);
return match[0];
}
I’m not one to resist an interesting problem, but being a security-geek wannabe, let me first recommend a much more secure solution that provides single password access to all of your other passwords originally designed by everyone’s favorite security expert, Bruce Schneier: Password Safe.
Now that that’s off my chest, the only way to do this for all possible domains would be to encode a great deal of knowledge about the domain name system into the regex. For the reason mentioned by Uri Bernstein in #7, you’ll need to know the rules for each possible ccTLD.
The safer route would be to restrict to the top few TLDs that you know the rules for and throw an error for all others that you weren’t expecting. The following regex tested against the window.location.hostname should work.
.*?([^.]+\.)(com|org|edu|co\.uk)
Add TLDs to taste.
I admit I haven’t written many JavaScript regexen, but the following should work:
Paste the following into your web browser to test (only tested on FireFox and IE6 so far):
javascript:var domainBits = /.*?([^.]+\.)(com|org|edu|co\.uk)/.exec(window.location.hostname);if( domainBits ) { var domain = domainBits[ 1 ] + domainBits[ 2 ]; alert( domain ); }else { alert( "Top level domain (TLD) unrecognized. Stopping." ); }
Nothing stops a site from redefining prompt() and stealing your master password when you type it. Do you trust every site you log into to not do that?
Let me revise the expression that I suggested in my previous comment:
^.*?([^.]+\.)(com|org|edu|co\.uk)$
Some security geek I am! I didn’t anchor the regex to ensure that the entire hostname was tested. This would create problems for URIs like www.com.com.
Jonathan – do you really need that “^.*?” at the beginning?
Me being a (former) perl geek, I just can’t see four characters wasted like that for (apparently) nothing :-)
Uri, in the interest of saving defenseless bits everywhere, thank you for pointing that out. I just wasn’t originally sure if the exec() method matched the entire string or not—I was too lazy to test. But it appears that v3.0 of the regex is in order:
([^.]+\.)(com|org|edu|co\.uk)$
Please believe me when I say that I detest extra typing just as much as the next guy. I am not a bit-killer! :)
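The effect of the anchor is easiest to see side by side. This sketch runs the first and third versions of the expression against plain hostname strings; the names unanchored, anchored and domainOf are made up for the example.

```javascript
// v1: no anchor, so the first alternative found anywhere in the string wins.
const unanchored = /.*?([^.]+\.)(com|org|edu|co\.uk)/;
// v3: the recognized TLD must sit at the very end of the hostname.
const anchored = /([^.]+\.)(com|org|edu|co\.uk)$/;

// Returns "domain.tld", or null when the TLD is not in the list.
function domainOf(pattern, hostname) {
  const bits = pattern.exec(hostname);
  return bits ? bits[1] + bits[2] : null;
}

console.log(domainOf(unanchored, 'www.com.com'));    // www.com
console.log(domainOf(anchored, 'www.com.com'));      // com.com
console.log(domainOf(anchored, 'www.amazon.co.uk')); // amazon.co.uk
console.log(domainOf(anchored, 'www.example.de'));   // null
```

Returning null for an unlisted TLD keeps the “throw an error for all others” behavior of the bookmarklet version.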
I like dan’s method in comment #19 but would suggest a slightly different approach. If our real goal is to drop off the first one or two elements from the domain, then that is exactly what we should do.
In the code below I have modified it to help with some really long domains. And this still works when browsing to the IP address.
var split = location.hostname.split('.');
var output = '';
var dropoff = (split.length > 4) ? 2 : 1; // the longer the url, the more to drop off
while (item = split.pop()) {
if (split.length >= dropoff) // drop off the lowest elements
{if (output != '')
output = '.'+output;
output = item + output;}
}
alert(output);
One problem is that some sites have specific password rules, so that an arbitrary password may not be accepted.
I would do it this way (no regexp needed – it’s basic js):
function getDomain() {
var domArray = document.domain.split('.');
return domArray[domArray.length-2]+"."+domArray[domArray.length-1];
}
As Bookmarklet:
javascript:domArray = document.domain.split('.');alert(domArray[domArray.length-2]+"."+domArray[domArray.length-1]);
will your regexps work with 127.0.0.1 or localhost? :)
Enlighten thyself, grasshopper :). The simplest and best coverage of JavaScript regular expressions I’ve found is this Evolt article.
Having said that, different countries are REALLY going to be the major problem here. Some countries, like NZ, Australia and the UK, allow second level domain names:
foo.co.nz
bar.co.nz
are completely different sites, whereas in some other countries with first-level domains:
foo.abc.ru
bar.abc.ru
are both subdomains of the main abc.ru site, a fully qualified domain in its own right (unlike co.nz).
So really, you’d need a database in the bookmarklet of all 2 and 3 letter country codes. Otherwise, the script won’t be able to tell which domains can be trimmed down, and which can’t. Alternatively, just list a common bunch of second-level subdomains like (co|com|org|net|gen|mil) and don’t sweat the occasional shared password…
I could really use some help with regular expressions if somebody out there could email me that would be great. I have a project for work that I need to get done asap and I am pretty clueless when it comes to regular expressions.
Heya Eric,
It’s really not that hard, here’s my regular expression:
[^.]*\.\([^:/]*\).*
You’re not interested in everything up to the first dot in the hostname “[^.]*\.”, what comes next is of interest up to either the start of the port (:) or the path (/), anything after that is of no interest either.
Here’s two examples:
echo "http://www.domain.tld/blah/foo/wow.xyz" | sed 's/[^.]*\.\([^:/]*\).*/\1/' => domain.tld
echo "http://www.very.long.domain.co.uk/blah/foo/wow.xyz" | sed 's/[^.]*\.\([^:/]*\).*/\1/' => very.long.domain.co.uk
Mazzel,
Martijn
Mozilla Firefox RC1 has a master password setting for the automatic inputting of usernames/passwords for web sites.
It works by, once per browser session, prompting for the master password when a website with saved login details is accessed.
Works very well IMHO.
$ perl -MRegexp::Common=URI -le'print $RE{URI}{HTTP}{-scheme => "https?"}{-keep}'
((https?)://((?:(?:(?:(?:(?:[a-zA-Z0-9][-a-zA-Z0-9]*)?[a-zA-Z0-9])[.])*(?:[a-zA-Z][-a-zA-Z0-9]*[a-zA-Z0-9]|[a-zA-Z])[.]?)|(?:[0-9]+[.][0-9]+[.][0-9]+[.][0-9]+)))(?::((?:[0-9]*)))?(/(((?:(?:(?:(?:[a-zA-Z0-9\-_.!~*'():@&=+$,]+|(?:%[a-fA-F0-9][a-fA-F0-9]))*)(?:;(?:(?:[a-zA-Z0-9\-_.!~*'():@&=+$,]+|(?:%[a-fA-F0-9][a-fA-F0-9]))*))*)(?:/(?:(?:(?:[a-zA-Z0-9\-_.!~*'():@&=+$,]+|(?:%[a-fA-F0-9][a-fA-F0-9]))*)(?:;(?:(?:[a-zA-Z0-9\-_.!~*'():@&=+$,]+|(?:%[a-fA-F0-9][a-fA-F0-9]))*))*))*))(?:[?]((?:(?:[;/?:@&=+$,a-zA-Z0-9\-_.!~*'()]+|(?:%[a-fA-F0-9][a-fA-F0-9]))*)))?))?)
The third capture group will contain the hostname. The pattern fully respects every aspect of the HTTP URI RFC; I don’t know if the JavaScript pattern dialect supports all the features found in Perl’s, though.
Unfortunately, due to aforementioned problems with domain structures like in .co.uk and the like, you can’t really tell how much you can chop off the front of the FQDN…
Oh, and btw: how about not using the URL, but simply letting you pick a local password yourself? So for Amazon, you’d enter your master password and “amazon” to get a unique site password. Sure you’ll have to remember two things, but the local password can be reasonably trivial and even be written down without a real problem.
Having a bit of code that proposes such a site password by looking at the URL would be beneficial, then. You can still override it if it guesses badly, and twiddle the heuristics if there is a pattern to bad guesses, without ever risking loss of access to any place.
From the point of view of actually using this, the important comment is Jesse’s comment 22 – if you use this utility as a bookmarklet, any site can easily steal your master password and thus your login to all other sites that you use. That seems like an awful lot of risk. (There are other, practical, issues, like sites that require the same login but are at different domains, but they’re somewhat irrelevant in the face of the huge gaping security hole that this bookmarklet opens up.) The version with a hardcoded master password variable is a little better (you need to be sure that you trust people with physical access to your machine and that your machine in general is secure).
Is that worth trying? From my point of view and everyday experience, every site requires a different type of password. One asks for digits-only, another for 5-to-10 characters alphabet-only and so on. It’s impossible to create them automagically. I prefer having one stupid password which I use everywhere with just minor predictable changes, which I never use for ‘mission critical’ sites. And I never use those web passwords for digital signatures or real-life needs, like bank accounts or whatever.
This would work great:
When can we see this bookmarklet in action Eric?
I would like to see it put to the test…
I have made a favelet some of you might want to check out. It does not read the website automagically, you enter it manually. But it works.
http://www.myobie.com/password/
Hope this helps someone, I am thinking about using it for a while.
What’s the point? How would the hash be any more secure than just the master password?
If you’re going to do this, you’d be better off using a more obscure hash than MD5.
Heya Nathan,
Thanks for noticing my comment # 32.
Mazzel,
Martijn.
A few things:
– If this “master password” to “site password” thing is going to work, you need more than just a master password hashed with a website domain name.
+ sitepswd = hash(secret + pswd + domain), where secret is randomly generated and stored in a cookie.
This is no good for “roaming” users who don’t have a dedicated computer, but, meh.
– Your site password needs to use more than just hex. Try the full [0-9][a-z][A-Z](~!@#$%^&*()_+-=); you’ll make the passwords harder to guess that way.
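The two points above can be sketched together. This is only an illustration under stated assumptions: it uses Node.js and its built-in crypto module, picks SHA-256 as the hash (the comment only says “hash”), generates the secret on the spot instead of reading it from a cookie, and maps digest bytes onto the wider alphabet with a simple modulo. The names ALPHABET and sitePassword and the default length of 12 are made up for the example.

```javascript
const crypto = require('crypto');

// Digits, lower case, upper case and the punctuation listed above.
const ALPHABET =
  '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ~!@#$%^&*()_+-=';

// Hypothetical helper: sitepswd = hash(secret + pswd + domain), re-encoded with ALPHABET.
function sitePassword(secret, pswd, domain, length = 12) {
  const digest = crypto.createHash('sha256').update(secret + pswd + domain).digest();
  let out = '';
  for (let i = 0; i < length; i++) {
    // Modulo mapping is slightly biased; acceptable for a sketch only.
    out += ALPHABET[digest[i] % ALPHABET.length];
  }
  return out;
}

// The secret would be generated once and stored; here it is created on the spot.
const secret = crypto.randomBytes(16).toString('hex');
console.log(sitePassword(secret, 'passwd', 'meyerweb.com'));
```

Since a SHA-256 digest has 32 bytes, this version only supports lengths up to 32. As noted below, the stored secret ties the scheme to one machine.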
Here’s a few tools that might be handy:
Generate a password using mouse movements, timing and SHA-256:
http://www.certainkey.com/demos/mkpasswd
Check password strength:
http://www.certainkey.com/demos/password
Eric, you’re using a Mac, right? Then love your Keychain.
<?php
$hostname = ($_SERVER['SERVER_NAME']);
preg_match("/[^\.\/]+\.[^\.\/]+$/", $hostname, $matches);
echo "domain name is: {$matches[0]}\n";
?>