Front-end Solutions for Parsing URLs

Free 2016-08-27 #JS #Parsing URLs with the a tag #Parsing URLs with regex #javascript URL class #How to parse a URL string

Busy every day, but can't recall what I've been busy with

Preface

July 11th - August 27th, a month and a half passed incredibly fast when measured in weeks.

Busy every day, but can't recall what I've been busy with. During this period:

  • Haven't picked up any new books (occasionally flipped through Node and JS books while writing blogs, as I tend to forget some content)

  • Read 1.5 prose essays

  • Charged Kindle twice, but never used it

  • Studied Japanese up to Lesson 7

  • Monday FEX Weekly (once published on Tuesday), Friday Qiwu Weekly (this week's mock Taobao Maker Festival panorama from Altimeter Labs was quite nice)

What books have you been reading recently? Any recommendations?

Japanese books. Don't have large blocks of time to read. Will have time once I'm more proficient with daily work.

Is that really the case? Why don't I have large blocks of time?

I. Problem Overview

Please use a standard URL method to extract the host, then validate the host. Don't compare character by character.

This is the scenario: the current page's query string carries a url parameter, and we need to parse out the hostname from it.

location.hostname can retrieve the current page's hostname, but what about any arbitrary URL string? Is there a non-regex solution?

I used to think there wasn't. As far as I remembered, JS offered nothing like new URL(), so I assumed regex parsing was the only option.

When I brought the question up in casual conversation, three experienced developers immediately offered three different solutions.

II. Solutions

A standard URL format is: scheme://domain:port/path?query_string#fragment_id. Simple regex capture can parse this easily.
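
For well-formed input, the "simple regex capture" mentioned above could look like the sketch below. The pattern and the name parseSimpleUrl are assumptions made for illustration, not taken from any library, and the sketch deliberately ignores userinfo, IPv6 literals, and the oddities shown next.

```javascript
// Hypothetical helper: splits a well-formed URL of the form
// scheme://domain:port/path?query_string#fragment_id into its parts.
function parseSimpleUrl(url) {
    var re = /^([a-z][a-z0-9+.-]*):\/\/([^\/:?#]+)(?::(\d+))?([^?#]*)(\?[^#]*)?(#.*)?$/i;
    var m = re.exec(url);
    if (!m) {
        return null; // the input is not in the simple format
    }
    return {
        scheme: m[1],
        domain: m[2],
        port: m[3] || '',
        path: m[4] || '',
        query: m[5] || '',
        fragment: m[6] || ''
    };
}

var parts = parseSimpleUrl('http://www.example.com:8080/news.php?id=10#footer');
console.log(parts.domain); // www.example.com
console.log(parts.port);   // 8080
```

A pattern this simple treats everything between :// and the first /, :, ? or # as the domain, which is exactly what the crafted URLs below exploit.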

There are also some strange (carefully constructed) URLs:

http://www.example.com/public/page/2015/index.html?url=http://12.23.34.45/hack.html?http://www.example.com//check.htm

http://www.example.com/public/page/2015/index.html?url=http://www.example.com@12.23.34.45/hack.html

P.S. The url parameter is left unencoded here for readability. These values are clearly crafted with malicious intent.
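
To see why character-by-character checks fail on such input, consider two naive checks that only look for the trusted domain in the raw string. Both looksTrusted and startsWithTrusted are hypothetical helpers written for this illustration; the crafted values pass at least one of them even though they point at 12.23.34.45.

```javascript
// Hypothetical naive check: is the trusted domain anywhere in the string?
function looksTrusted(url) {
    return url.indexOf('www.example.com') !== -1;
}

// Hypothetical stricter check: does the string start with the trusted origin?
function startsWithTrusted(url) {
    return url.indexOf('http://www.example.com') === 0;
}

// The url parameter values from the two crafted examples
var crafted = [
    'http://12.23.34.45/hack.html?http://www.example.com//check.htm',
    'http://www.example.com@12.23.34.45/hack.html'
];

crafted.forEach(function (url) {
    console.log(looksTrusted(url), startsWithTrusted(url), url);
});
```

The first value hides the trusted domain in the query string, and the second places it in the userinfo part before the @, so a check on the parsed hostname is needed instead.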

There are even some non-standard URLs that are difficult to handle with ordinary regex, such as:

// How to interpret this?
http://www.example.com/what??key=val?&&#123http://@www.abc.com?query=2#45
// What about this?
http://www.example.com:8899@www.abc.com/what??key=val?&&#123http://?query=2#45
// This one?
http://www.example.com:$88;9,9@www.abc.com$/what??key=val?&&#123http://?query=2#45
//...

1. Anchor Tag Automatic URL Parsing

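// Create an anchor element and let the browser parse the URL
var a = document.createElement('a');
a.href = 'http://www.example.com/news.php?id=10#footer';

// Use a div as a baseline to filter out properties shared by all elements
var div = document.createElement('div');
for (var key in a) {
    if (!(key in div)) {
        console.log(`${key} = ${a[key]}`);
    }
}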

The output looks like this:

// Indicates where the resource points to: current window, new tab, etc.
target = 
// Notifies UA to download the pointed resource
download = 
// Asynchronously POSTs to specified address when clicked, used for ad tracking
ping = 
// Indicates relationship between pointed resource and current resource: backup, bookmark, etc.
rel = 
// Language of the pointed resource
hreflang = 
// MIME type of the pointed resource
type = 
// Referrer header policy, used to protect user privacy
referrerpolicy = 
// Text content of the link, same as textContent
text = 
// Deprecated. Supports custom shapes, passes a series of coordinates
coords = 
// Deprecated. Character encoding of the pointed resource
charset = 
// Deprecated. Jumps to a tag with specified name
name = 
// Deprecated. Reverse relationship, antonym of rel
rev = 
// Deprecated. Used to specify custom shape hotspots
shape = 
// URL of the pointed resource, or the #fragment_id part of the URL
href = http://www.example.com/news.php?id=10#footer
origin = http://www.example.com
protocol = http:
username = 
password = 
host = www.example.com
hostname = www.example.com
port = 
pathname = /news.php
search = ?id=10
hash = #footer

These are properties unique to the a tag. Among them is the hostname we want. In other words, the a tag automatically completes URL parsing. For front-end development, this was once the cheapest way to parse URLs:

var getHostname = function(url) {
    var a = document.createElement('a');
    a.href = url;
    return a.hostname;
};

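Very reliable. The string is handled by the browser's own URL parser, so crafted URLs like the ones above are interpreted exactly as the browser itself would interpret them. One caveat: a string without a scheme is treated as a relative URL and resolved against the current page, so the result is the current page's hostname.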

2. JS URL API

var url = new URL('http://www.example.com:$88;9,9@www.abc.com$/what??key=val?&&#123http://?query=2#45');
for (var key in url) {
    console.log(`${key} = ${url[key]}`);
}

Chrome outputs:

searchParams = %3Fkey=val%3F
href = http://www.example.com:$88%3B9,9@www.abc.com%24/what?%3Fkey=val%3F#123http://?query=2#45
origin = http://www.abc.com%24
protocol = http:
username = www.example.com
password = $88%3B9,9
host = www.abc.com%24
hostname = www.abc.com%24
port = 
pathname = /what
search = ?%3Fkey=val%3F
hash = #123http://?query=2#45

Because this URL is so malformed, user agents differ in how they handle the details. Firefox gives slightly different results:

href = http://www%2Eexample%2Ecom:$88%3B9,9@www.abc.com$/what??key=val?&&#123http://?query=2#45
origin = http://www.abc.com$
protocol = http:
username = www%2Eexample%2Ecom
password = $88%3B9,9
host = www.abc.com$
hostname = www.abc.com$
port = 
pathname = /what
search = ??key=val?&&
searchParams = %3Fkey=val%3F
hash = #123http://?query=2#45

Browsers quietly provide a URL class. It is not part of ECMAScript (ES5 or ES6/7); it is a Web API defined by the WHATWG URL Standard, and at the time of writing it was still marked experimental. Compatibility is as follows:

Android4.0  webkitURL
Android4.4  URL
Safari6.0   webkitURL
Chrome32    URL
FF19        URL
IE10        URL

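On mobile it can generally be used with confidence. Note that the table shows when the global object first appeared, not when the new URL() constructor became usable. Internet Explorer, for instance, never supported the constructor; its URL object only offers createObjectURL() and revokeObjectURL(). For more compatibility information, see URL - Web APIs | MDN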

var getHostname = function(url) {
    return new URL(url).hostname;
};
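
Applied to the scenario from the problem overview, a sketch might read the url parameter from the page address, parse it, and compare the hostname against an allow-list. The names getTargetHostname, isAllowedTarget, and ALLOWED_HOSTS are assumptions made for illustration. Since new URL() throws a TypeError for strings it cannot parse (including relative URLs without a base), the sketch catches that and treats the target as not allowed.

```javascript
// Assumed allow-list, for illustration only
var ALLOWED_HOSTS = ['www.example.com'];

// Returns the hostname of the `url` query parameter in pageUrl, or '' if
// the parameter is missing or cannot be parsed as an absolute URL.
function getTargetHostname(pageUrl) {
    try {
        var target = new URL(pageUrl).searchParams.get('url');
        return target ? new URL(target).hostname : '';
    } catch (e) {
        return ''; // new URL() throws TypeError on unparsable input
    }
}

function isAllowedTarget(pageUrl) {
    return ALLOWED_HOSTS.indexOf(getTargetHostname(pageUrl)) !== -1;
}

var page = 'http://www.example.com/public/page/2015/index.html?url=';
var good = page + encodeURIComponent('http://www.example.com/check.htm');
var bad = page + encodeURIComponent('http://www.example.com@12.23.34.45/hack.html');

console.log(isAllowedTarget(good)); // true
console.log(isAllowedTarget(bad));  // false
```

In a browser, pageUrl would typically be location.href. Comparing the parsed hostname for exact equality, rather than searching the raw string, is what keeps the userinfo trick from working.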

3. Regex Parsing

var parseUrl = function(url) {
    var urlParseRE = /^\s*(((([^:\/#\?]+:)?(?:(\/\/)((?:(([^:@\/#\?]+)(?:\:([^:@\/#\?]+))?)@)?(([^:\/#\?\]\[]+|\[[^\/\]@#?]+\])(?:\:([0-9]+))?))?)?)?((\/?(?:[^\/\?#]+\/+)*)([^\?#]*)))?(\?[^#]+)?)(#.*)?/;

    var matches = urlParseRE.exec(url || "") || [];

    return {
        href:         matches[0] || "",
        hrefNoHash:   matches[1] || "",
        hrefNoSearch: matches[2] || "",
        domain:       matches[3] || "",
        protocol:     matches[4] || "",
        doubleSlash:  matches[5] || "",
        authority:    matches[6] || "",
        username:     matches[8] || "",
        password:     matches[9] || "",
        host:         matches[10] || "",
        hostname:     matches[11] || "",
        port:         matches[12] || "",
        pathname:     matches[13] || "",
        directory:    matches[14] || "",
        filename:     matches[15] || "",
        search:       matches[16] || "",
        hash:         matches[17] || ""
    };
};

The complexity is intimidating. Let's test whether it's robust enough:

var url = parseUrl('http://www.example.com:$88;9,9@www.abc.com$/what??key=val?&&#123http://?query=2#45');
for (var key in url) {
    console.log(`${key} = ${url[key]}`);
}

Output:

href = http://www.example.com:$88;9,9@www.abc.com$/what??key=val?&&#123http://?query=2#45
hrefNoHash = http://www.example.com:$88;9,9@www.abc.com$/what??key=val?&&
hrefNoSearch = http://www.example.com:$88;9,9@www.abc.com$/what
domain = http://www.example.com:$88;9,9@www.abc.com$
protocol = http:
doubleSlash = //
authority = www.example.com:$88;9,9@www.abc.com$
username = www.example.com
password = $88;9,9
host = www.abc.com$
hostname = www.abc.com$
port = 
pathname = /what
directory = /
filename = what
search = ??key=val?&&
hash = #123http://?query=2#45

Consistent with FF, apart from the percent-encoding FF applies to the username and password, so the results look trustworthy. Let's try to interpret this invincible regex:

/^                      #href
\s*
(                       #hrefNoHash
  (                     #hrefNoSearch
    (                   #domain
      ([^:\/#\?]+:)?    #protocol
      (?:
        (\/\/)          #doubleSlash
        (               #authority
          (?:
            (           #$7 is skipped when getting results, should also use non-capturing parentheses (?:
              ([^:@\/#\?]+)     #username
              (?:
                \:
                ([^:@\/#\?]+)   #password
              )?
            )
            @
          )?
          (                     #host
            ([^:\/#\?\]\[]+|\[[^\/\]@#?]+\])    #hostname
            (?:
              \:
              ([0-9]+)  #port
            )?
          )
        )?
      )?
    )?
    (                   #pathname
      (\/?(?:[^\/\?#]+\/+)*)    #directory
      ([^\?#]*)         #filename
    )
  )?
  (\?[^#]+)?            #search
)
(#.*)?                  #hash
/

According to the analysis above, the 9th left parenthesis should use non-capturing parentheses (?:, so we don't need to skip $7 when getting values:

var getHostname = function(url) {
    // The 9th opening parenthesis is now non-capturing (?:, so hostname moves from $11 to $10
    var urlParseRE = /^\s*(((([^:\/#\?]+:)?(?:(\/\/)((?:(?:([^:@\/#\?]+)(?:\:([^:@\/#\?]+))?)@)?(([^:\/#\?\]\[]+|\[[^\/\]@#?]+\])(?:\:([0-9]+))?))?)?)?((\/?(?:[^\/\?#]+\/+)*)([^\?#]*)))?(\?[^#]+)?)(#.*)?/;

    var matches = urlParseRE.exec(url || "") || [];

    return matches[10] || "";
};
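
Counting parentheses like this is error-prone. In engines that support ES2018 named capture groups, a trimmed-down variant can label the parts it needs instead. The pattern below is a simplified sketch that keeps only the protocol and authority portion of the regex above; it is not the original expression, and getHostnameNamed is a hypothetical name.

```javascript
// Hypothetical variant using named capture groups (ES2018).
// Only the protocol and authority are matched; path, search and hash are ignored.
function getHostnameNamed(url) {
    var re = /^\s*(?<protocol>[^:\/#?]+:)?(?:\/\/(?:(?:(?<username>[^:@\/#?]+)(?::(?<password>[^:@\/#?]+))?@)?(?<hostname>[^:\/#?\]\[]+|\[[^\/\]@#?]+\])(?::(?<port>[0-9]+))?)?)?/;
    var m = re.exec(url || '');
    // Every part of the pattern is optional, so a missing hostname shows up as undefined
    return (m && m.groups.hostname) || '';
}

console.log(getHostnameNamed('http://www.example.com:8080/news.php?id=10#footer'));
console.log(getHostnameNamed('http://www.example.com:$88;9,9@www.abc.com$/what??key=val?&&#123http://?query=2#45'));
```

With named groups, adding or removing a pair of parentheses no longer shifts the index of every later group.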

My eyes glaze over just reading it. Goodbye, regex.

III. Solution Analysis

Anchor Tag

Developer 1 has a wealth of front-end experience. A little-known trick solved the problem instantly.

Compatibility is fine (this technique is from many years ago). Pure front-end solution, simple and effective. The a tag is surprisingly powerful.

For more little-known tricks, see Front-end Unknown Aspects -- Front-end Cold Knowledge Collection. I discovered another veteran developer yesterday and am following them now.

URL Class

Developer 2 has a broad view and a solid grasp of the details.

They even know about a non-standard URL class. I use the console every day and never noticed it. If you don't pay attention, you gain less experience, just like in Super Mario.

Invincible Regex

Developer 3 is experienced in problem-solving and has accumulated many resources.

This regex scared me to tears. orz

IV. Summary

Leaving home early and coming back late, yet not gaining any experience. What am I busy with?

Time is fragmented. No clear tasks for the current period. Look up, another half hour has passed. In a blink, it's time for the weekly meeting again... A month and a half has passed, the experience bar hasn't moved at all.

If this continues, I'll end up an ordinary coder (three years on the job, but only one year of real experience).
