URL Encoded Attacks
Attacks using the common web browser
Web Browser Attacks
A popular misconception is that web hacking and defacement is difficult, often requiring detailed technical knowledge and specialist tools. Unfortunately, one of the best tools in a hacker�s arsenal is the common web browser. Using Microsoft�s Internet Explorer or Netscape�s Communicator, it is possible to identify and exploit many common vulnerability�s in both the remote web server�s hosting software and the site content, through simple URL editing. Over the last few years, the numbers of vulnerabilities and security flaws directly exploitable through this type of attack have increased phenomenally, primarily due to application developers failing to adequately check and decode the received client data.
A large proportion of these attacks could be prevented by understanding the methods for encoding data currently supported by popular Internet protocols (such as HTTP) and hosting applications (such as Microsoft�s Internet Information Server). In particular, an understanding of URL encoding techniques is required. In many resources, the usage of various terms like Unicode, web encoding, percent-encoding, escape-encoding and UTF encoding are used interchangeably. This document aims to enlighten developers and security administrators on the issues associated with URL encoded attacks. It is also important to note that many of the encoding methods and security implications are applicable to any application accepting data from a client system.
Uniform Resource Indicators (URI) are a compact string of characters for identifying an abstract or physical resource, typically a web based Uniform Resource Locator (URL). Certain rules and standards have been established to ensure a constructed URI can be correctly interpreted by an application (for more information, read �Uniform Resource Identifiers (URI): Generic Syntax�, http://www.ietf.org/rfc/rfc2396.txt).
Traditional web applications transfer data between client and server using the HTTP or HTTPS protocols. There are essentially two methods in which a server receives input from a client; data can be passed in the HTTP headers (submitted through the cookie field, or the post data field) or it can be included in the query portion of the requested URL. When data is included in a URL, it must be specially encoded to conform to proper URL syntax.
The standard (rfc2396) defines the following classes of characters:
When dealing with IPv6, it is advised that to use a literal IPv6 address in a URL, the literal address should be enclosed in "[" and "]" characters. If this is the case, it is recommended that the characters �[� and �]� are moved from the �unwise� list to the reserved list (for more information, read �Format for Literal IPv6 Addresses in URL's� http://www.ietf.org/rfc/rfc2732.txt).
Escaped-encoding, or sometimes referred to as percent-encoding, is the accepted method of representing characters within a URI that may need special syntax handling to be correctly interpreted. This is achieved by encoding the character to be interpreted with a sequence of three characters. This triplet sequence consists of the percentage character �%� followed by the two hexadecimal digits representing the octet code of the original character. For example, the US-ASCII character set represents a space with octet code 32, or hexadecimal 20. Thus its URL-encoded representation is %20.
Applications may automatically escape reserved and unreserved characters, or automatically un-escape an escape-encoded sequence within a URI, if there is potential for it to be incorrectly interpreted by the remote application. This conversion may be due to the position of the character or escape-encoded sequence within the URI. For example, "%7e" is sometimes used instead of "~" in an http URL path, but the two are equivalent for an http URL.
Because the percent "%" character always has the reserved purpose of being the escape indicator, it must be escaped as "%25" in order to be used as data within a URI. The RFC for URI encoding recommends that care should be taken not to escape or un-escape the same string more than once, since un-escaping an already un-escaped string might lead to misinterpreting a percent data character as another escaped character, or vice versa in the case of escaping an already escaped string.
Unreserved characters can be escaped without changing the semantics of the URI, but this should not be done unless the URI is being used in a context that does not allow the un-escaped character to appear.
The standard (rfc2396) defines the following groupings of characters that must be escaped to be included within a URI.
Unicode was developed in a direct response to problems associated with multiple language implementations of the ASCII character set. In the past, due to the limited size of the standard ASCII character reference table, different languages could use the same reference number for different characters, or the same character may have been represented by multiple reference numbers. As expected, this led to various problems in the display and interpretation of data, as well as hundreds of different methods of encoding country specific characters. These problems were further compounded by the necessity to reference an expanded array of commonly used punctuation and technical symbols.
Unfortunately, the extended referencing system is not completely compatible with many old (albeit common) protocols and applications, and this has led to the development of a few UCS transformation formats (UTF) with varying characteristics. One of the most commonly utilised formats, UTF-8, has the characteristic of preserving the full US-ASCII range. It is compatible with file systems, parsers and other software relying on US-ASCII values, but it is transparent to other values.
In UTF-8, characters are encoded using sequences of 1 to 6 octets. The only octet of a "sequence" of one has the higher-order bit set to 0, the remaining 7 bits being used to encode the character value. In a sequence of n octets, n>1, the initial octet has the n higher-order bits set to 1, followed by a bit set to 0. The remaining bit(s) of that octet contain bits from the value of the character to be encoded. The following octet(s) all have the higher-order bit set to 1 and the following bit set to 0, leaving 6 bits in each to contain bits from the character to be encoded.
The table below summarizes the format of these different octet types. The letter x indicates bits available for encoding bits of the UCS-4 character value.
UCS-4 range (hex.) UTF-8 octet sequence (binary)
0000 0000-0000 007F 0xxxxxxx
The UTF-8 translation has the following characteristics:
At the application level, earlier versions of HTML allowed the entire range of the ISO-8859-1 (ISO Latin-1) character set; the HTML 4.0 specification expanded to permit any character in the Unicode character set.
This encoding scheme may not seem overly clear, therefore consider the character �.� (dot) with the UCS-4 hexadecimal value of 0000 002E (which is 2E in US-ASCII). In UTF-8 encoding, this value can be represented in 6 different ways:
C0 AE (11000000 10101110)
E0 80 AE (11100000 10000000 10101110)
F0 80 80 AE (11110000 10000000 10000000 10101110)
F8 80 80 80 AE (11111000 10000000 10000000 10000000 10101110)
FC 80 80 80 80 AE (11111100 10000000 10000000 10000000 10000000 10101110)
Thus, the character may be represented with two bytes (C0 AE) by utilising the second UTF-8 level, three bytes (E0 80 AE) by utilising the third UTF-8 level, and so on to 6 bytes as indicated above.
Abuse of Encoding Schemes
A popular method of manipulating a web application for malicious ends is to extend the functionality of the URL in an HTTP or HTTPS request beyond that originally envisaged by the developer. Using a mix of escaped-encoding and Unicode character representation, it is often possible for an attacker to craft requests that may be interpreted by either the server or client environments as a valid application request. Even though certain characters do not need to be escape-encoded, any 8-bit code (i.e., decimal 0-255 or hexadecimal 00-FF) may be encoded. ASCII control characters such as the NULL character (decimal code 0) can be escape-encoded, as can all HTML entities and any restricted characters used by the operating system or database. In some cases, the encoding of URL information may be designed to purposefully disguise the nature of the attack.
Various guidelines and RFC's carefully explain the method of decoding escape encoded characters and hint at the dangers associated with decoding multiple times and at multiple layers of an application. However, many applications still incorrectly parse escape-encoded data multiple times.
The significance of this form of attack is directly related to the order of decoding the escape-encoded URI, and when appropriate security checks are made on the validity of the URI data. For example, a commercial web server may originally decode all escape-encoded characters; part of the security verification may include the monitoring of �\..\� path recursion for sanity checking and to ensure that directory-path information does not expand beyond a defined limit. However, by escape-encoding this information multiple times, this security check may be circumvented on the initial decoding pass. If this information is then passed onto another application component, it may go through additional decoding, and result in an action not originally envisaged by the application developer.
The multiple escape-encoding of characters or sequences such as �\� or �..\� is particularly relevant in previously successful attacks against applications hosted on Microsoft Windows operating systems. Consider the character �\� as the escape-encoded sequence �%5c�. It is possible to further encode this sequence by escape-encoding each character individually ('%' = %25, '5' = %35, 'c' = %63), and combining them together in multiple ways or multiple times. For example:
Thus, the sequence �..\� may be represented by �..%255c�, �..%%35c� or other permutation. After the first decoding, the sequence �..%255c� is converted to �..%5c�, and only in the second decoding pass is the sequence is finally converted to �..\�.
Describing how a Unicode attack functions, and why the resultant character string may be successful, is a difficult task due to the extreme variety and resulting complexity of the of Unicode-encoding. Three issues are prevalent; Character Mapping, Character Encoding, and how an application supports character mapping and encoding.
In most circumstances, Unicode attacks have been successful due to poor security validating of the UTF-8 encoded character or string, and the interpretation of illegal octet sequences. Consider the following:
In the majority of attacks, Unicode data will be escape-encoded for inclusion within the requested URL. Depending upon the application receiving the encoded request, a successful attack may be made using valid or invalid URL encoding.
An application that supports %u encoding gains the ability to represent the full range of Unicode character strings, beyond those normally available through escape-encoded UTF-8. At the present time, %u encoding is not a recognised standard. However, Microsoft�s IIS Web server is one such application that supports %u encoding.
The %u encoding schema takes the form �%u0061� for UTF-8 character �a�, where the value after %u is the full Unicode value of the character. As previously discussed, the Unicode language code for UTF-8 is 00. Thus, for comparison, the character �Δ� under Basic Greek (03) would be represented as %u0394, and the character �♂� under Miscellaneous Symbols (26) would be represented by %u2642.
Attacks using this method of encoding character strings have been successful in the past largely due to perimeter defence systems (e.g. content filtering) and intrusion detection systems (IDS) not being aware of the encoding system, and therefore not decoding it.
Obfuscating an IP Address
Most Internet users are familiar with navigating to sites and services using a fully qualified domain name, such as www.iss.net. For an application to communicate over the Internet (and most internal networks), this address must to be resolved to an IP address, such as 126.96.36.199 for www.iss.net. This resolution of IP address to host name is achieved through domain name servers.
An attacker may wish to use the IP address as part of a URI to obfuscate the host and possibly bypass content filtering systems, or hide the destination from the end user. Although many IT professionals are familiar with the classic dotted-decimal representation of IP addresses (000.000.000.000), most are not familiar with other possible representations. Using these other IP representations within an URI, it may be possible obscure the host destination from many automated defence systems.
Other representations of an IP address
Depending on the application interpreting an IP address, there may be a variety of ways to encode the address other than the classic dotted-decimal format. Alternative formats include:
These alternative formats are best explained using an example. Consider the URI http://www.iss.net/, which resolves to 188.8.131.52. This can be interpreted as:
In some cases, it may be possible to mix formats (e.g. http://0321.0x86.161.0043).
A dot-less IP calculator can be found at http://www.tcp-ip.nu/cgi-bin/tcp-ip/calc.cgi.
Further representations of the dot-less �Dword� IP address can be achieved by adding multiples of 4294967296. For example, the following addresses all resolve to 184.108.40.206:
IP version 6 (IPv6) is a new version of the Internet Protocol designed as a successor to IP version 4 (IPv4) (for information on IPv4 visit http://www.ietf.org/rfc791, and http://www.ietf.org/rfc/rfc1883.txt for IPv6). The most interesting change lies in the increase in the IP address size from 32 bits to 128 bits, and the associated changes in representing this addressing. There are three conventional forms for representing IPv6 addresses as text strings:
This formatting of IPv6, and support for IPv4 addresses, enables an IP address to be further obscured to a casual observer and many automated detection systems that do not correctly identify and process IPv6 formatted requests. Examples of the IPv6 formatting options are included in the following table. It is worth noting that, when using an IPv6 address in a URL, the literal address should be enclosed in "[" and "]" characters (for more information, read �Format for Literal IPv6 Addresses in URL's� http://www.ietf.org/rfc/rfc2732.txt).
A Defensive Strategy
It is evident that the use of the character encoding schemes previously discussed can offer an attacker an almost infinite number of ways to encode an attack. Detecting an attack using common signature matching techniques can range from being tedious, through to almost impossible. Thus, much of the responsibility for defending against such encoded attacks lies with the application developers themselves. Many past successful attacks and application vulnerabilities could have been averted by the following security practices:
|Copyright � Gunter Ollmann
The Author asserts the moral right to be identified as the author of this work