A regex, which is short for regular expression, is a sequence of characters that defines a specific search pattern. When included in code or search algorithms, regular expressions can be used to find certain patterns of characters within a string, or to find and replace a character or sequence of characters within a string. They are also frequently used to validate input. In this article we'll see how to use a regex to match a URL.
/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/
Let's break down the components of this regular expression used to match and validate URLs.
🌻 Regex Components
â—¾ Anchors
The URL matching regex starts with the ^
(caret) symbol and ends with the $
(dollar) symbol. These anchors define that the entire string must match the pattern between them, ensuring that the regex matches the complete URL and not just a part of it.
The caret anchor indicates that the string to be examined must include the characters following it. It is important to note that the regular expression is case-sensitive.
The dollar sign anchor indicates that the string to be examined includes the characters preceding it.
â—¾ Quantifiers
Quantifiers used to quantify how many times a part of your regular expression should be repeated.
*
(zero or more occurrences)+
(one or more occurrences)?
(zero or one occurrence){}
(specifying a specific range of occurrences).
â—¾ Grouping Constructs
The parentheses grouping constructs ()
in regular expressions are used to group commands together to determine the order of processing.
â—¾ Bracket Expressions
Bracket Expressions []
match any one of a set of characters specified within square brackets. For example, [abc] matches any single character that is either a, b, or c.
🌻 Matching a URL
/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/
🟡 Protocol
This part of our regex https?:\/\/
matches the protocol of a URL, such as "http://" or "https://". The ?
makes the 's' character optional, allowing URLs with both HTTP and HTTPS protocols to match.
🟡 Domain Name
This section [\da-z.-]+
matches the domain name in the URL. It allows for alphanumeric characters (including digits), hyphens, and dots.
www.example
🟡 Top-Level Domain
This section [a-z.]{2,6}
matches the top-level domain part of the URL. It typically includes domain extensions like .com, .net, .org, etc. The {2,6} specifies that it can consist of 2 to 6 characters.
.com
.org
🟡 Path and File Name
This segment [/\w .-]*
matches the path and file name portion of the URL(represents the specific location or resource on the web server). It allows forward slashes, word characters, spaces, dots, and hyphens.
/page1.html
_________________________________________
🌻 Regex101 a popular online tool that can help you testing, debugging, and learning regular expressions (regex).
Conclusion
Regular expressions offer a powerful way to search, manipulate, and validate text data in programming. With practice, regex can become an indispensable part of your programming toolkit.