What’s in a domain name? NIST has an answer

By William Jackson

Everyone knows how frustrating — and embarrassing — it can be to mistype a URL into your browser. (Remember the snickering you used to hear if you went to “whitehouse.com” instead of “whitehouse.gov”? The .com address is now a political news site, by the way.) The Internet Corp. for Assigned Names and Numbers (ICANN) plans to launch a new round of proposals later this year for generic top-level Internet domains and is looking for a way to help avoid confusion and fraud as the number of domains increases.

To help this effort, Paul Black, a computer scientist at the National Institute of Standards and Technology (NIST), has come up with an algorithm that measures the visual similarity between domain names. The tool scores how closely a proposed domain resembles an existing one. For instance, a domain such as “.c0m” (with a zero) scores 88 percent similarity to “.com” and probably would not be approved.

Generic top-level domains are the strings of letters and numbers that appear after the rightmost “.” or dot in a URL's host name, just before the first “/” or slash of the path. According to ICANN, 21 generic top-level domains are now approved for use, ranging from .aero (reserved for members of the air transport industry) to .travel (reserved for the travel industry) and including the more familiar .com, .edu, .gov and .mil.
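To make that definition concrete, here is a minimal Python sketch that pulls the top-level domain out of a URL. The function name `top_level_domain` is hypothetical, and the sketch does not check whether the result is a generic domain or some other kind, such as a country code.

```python
from urllib.parse import urlsplit


def top_level_domain(url):
    """Return the text after the rightmost dot of the URL's host name."""
    # urlsplit only recognizes a host when the URL contains "//",
    # so prepend it for bare addresses such as "whitehouse.gov/about".
    if "//" not in url:
        url = "//" + url
    host = urlsplit(url).hostname or ""
    host = host.rstrip(".")  # tolerate a trailing dot after the host name
    _, dot, tld = host.rpartition(".")
    # A host with no dot at all (e.g. "localhost") has no top-level domain.
    return tld if dot else ""


if __name__ == "__main__":
    for address in ("http://www.whitehouse.gov/about", "whitehouse.com/"):
        print(address, "->", top_level_domain(address))
```

Because the standard library's `urlsplit` separates the host name from the path and any port number, dots that appear after the slash are never mistaken for part of the domain.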

According to NIST, Black’s algorithm rates the degree of similarity between pairs of alphanumeric characters, such as the numeral “1” and the lowercase letter “l,” which in some fonts are dead ringers and would receive the highest score. Other pairs, such as “h” and “n,” are less alike and get lower scores. The algorithm also takes into consideration combinations of letters, such as “cl,” which can look like “d.” Putting everything together, the algorithm then computes the “cost” of transforming one string into another based on visual similarity and expresses it as a percentage score.
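The general idea of a similarity-weighted transformation cost can be sketched in a few lines of Python. This is not Black's algorithm: it is an ordinary edit-distance calculation in which look-alike substitutions are cheaper than unrelated ones, and every cost in the tables, along with the way the percentage is normalized, is an assumption chosen for illustration. It should not be expected to reproduce NIST's scores.

```python
# Hypothetical substitution costs: 0 means identical, 1 means unrelated.
# These values are illustrative assumptions, not figures from NIST.
PAIR_COST = {
    frozenset("1l"): 0.1,  # near-identical in some fonts
    frozenset("0o"): 0.2,
    frozenset("hn"): 0.6,  # only somewhat alike
}

# Hypothetical costs for letter combinations that resemble another string.
GROUP_COST = {
    ("cl", "d"): 0.3,
    ("rn", "m"): 0.3,
}


def sub_cost(a, b):
    """Cost of replacing character a with character b."""
    if a == b:
        return 0.0
    return PAIR_COST.get(frozenset((a, b)), 1.0)


def visual_distance(s, t):
    """Weighted edit distance between s and t (lower means more alike)."""
    s, t = s.lower(), t.lower()
    m, n = len(s), len(t)
    # d[i][j] is the cheapest way to turn s[:i] into t[:j].
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = float(i)  # delete i characters
    for j in range(1, n + 1):
        d[0][j] = float(j)  # insert j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best = min(
                d[i - 1][j] + 1.0,  # delete s[i-1]
                d[i][j - 1] + 1.0,  # insert t[j-1]
                d[i - 1][j - 1] + sub_cost(s[i - 1], t[j - 1]),
            )
            # Allow a whole look-alike group to be swapped in either direction.
            for (seq, look), cost in GROUP_COST.items():
                if s[:i].endswith(seq) and t[:j].endswith(look):
                    best = min(best, d[i - len(seq)][j - len(look)] + cost)
                if s[:i].endswith(look) and t[:j].endswith(seq):
                    best = min(best, d[i - len(look)][j - len(seq)] + cost)
            d[i][j] = best
    return d[m][n]


def similarity_percent(s, t):
    """Express the distance as a percentage, 100 meaning identical."""
    longest = max(len(s), len(t))
    if longest == 0:
        return 100.0
    return 100.0 * (1.0 - visual_distance(s, t) / longest)


if __name__ == "__main__":
    for candidate in ("c0m", "cxm", "corn"):
        print(candidate, round(similarity_percent(candidate, "com"), 1))
```

With these made-up weights, “c0m” rates closer to “com” than “cxm” does, because swapping a zero for the letter “o” costs far less than swapping in an unrelated character.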

NIST says ICANN is considering future enhancements to the algorithm, including checks for confusing similarities between domains in other alphabets or scripts such as Cyrillic.






