Well-designed applications insulate the users from business logic. Some applications however do not validate user input and allow malicious users to make direct database calls to the database. This attack, called direct SQL injection, is surprisingly simple.
Imagine a web application that has some functionality that allows you to change your password. Most do. You login and navigate to the account options page, select change password, enter your old password and specify the new password; twice for security of course. To the user it's a transparent process but behind the scenes some magic is taking place. When the user enters his old password and two new passwords in the web form, his browser is creating an http request to the web application and sending the data. This should be done over SSL to protect the data in transit.
That typical request actually may look like this (A GET request is used here for simplicity. In practice this should be done using a POST):
http://www.victim.com/changepwd?pwd=Catch22&newpwd=Smokin99&newconfirmpwd=Smokin99&uid=testuser
The application that receives this request takes the four sets of parameters supplied as input:
Pwd=Catch22 Newpwd=Smokin99 Newconfirmpwd=Smokin99 Uid=testuser
It checks to make sure the two new passwords match out of courtesy to the user, discards the duplicate data and builds a database query that will check the original password and replace it with the new one entered. That database query may look like this:
UPDATE usertable SET pwd='$INPUT[pwd]' WHERE uid='$INPUT[uid]';
All works just fine until the attacker comes along and figures out he can add another database function to the request that actually gets processed and executed. Here he adds a function that simply replaces the password of any accounts named admin with his chosen password. For instance:
http://www.victim.com/changepwd?pwd=Catch22&newpwd=Smokin99&newconfirmpwd=Smokin99&uid=testuser '+or+uid+like'%25admin%25';--%00
The consequences are devastating. The attacker has been able to reset the administrative password to one he chose, locking out the legitimate systems owners and allowing him unlimited access. A badly designed web application means hackers are able to retrieve and place data in authoritative systems of record at will.
The example above uses a technique of appending an additional database query to the legitimate data supplied. Direct SQL Injection can be use to:
change SQL values
concatenate SQL statements
add function calls and stored-procedures to a statement
typecast and concatenate retrieved data
Some examples are shown below to demonstrate these techniques.
UPDATE usertable SET pwd='$INPUT[pwd]' WHERE uid='$INPUT[uid]';
Malicious HTTP request
http://www.none.to/script?pwd=ngomo&uid=1'+or+uid+like'%25admin%25';--%00
SELECT id,name FROM products WHERE id LIKE '%$INPUT[prod]%';
Malicious HTTP request
				http://www.none.to/script?0';insert+into+pg_shadow+usename+values+('hoschi')
				SELECT id,name FROM products WHERE id LIKE '%$INPUT[prod]%';
Malicious HTTP request
http://www.none.to/script?0';EXEC+master..xp_cmdshell(cmd.exe+/c)
Preventing SQL injection is a challenging task especially for large distributed web systems consisting of several applications. Filtering SQL commands directly prior to their execution reduces the risk of erronous filtering, and shared components should be developed to preform this function.
If your input validation strategy is as we recommend, that is to say only accept expected input then the problem is significantly reduced. However this approach is unlikely to stop all SQL injection attacks and can be difficult to implement if the input filtering algorithm has to decide whether the data is destined to become part of a query or not, and if it has to know which database such a query might be run against. For example, a user who enters the last name "O'Neil" into a form includes the special meta-character ('). This input must be allowed, since it is a legitimate part of a name, but it may need to be escaped if it becomes part of a database query. Different databases may require that the character be escaped differently, however, so it would also be important to know for which database the data must be sanitized. Fortunately, there is usually a very good solution to this problem.
The best way to protect a system against SQL injection attacks is to construct all queries with prepared statements and/or parameterized stored procedures. A prepared statement, or parameterized stored procedure, encapsulates variables and should escape special characters within them automatically and in a manner suited to the target database.
Common database API's offer developers two different means of writing a SQL query. For example, in JDBC, the standard Java API for relational database queries, one can write a query either using a PreparedStatement or as a simple String. The preferred method from both a performance and a security standpoint should be to use PreparedStatements. With a PreparedStatement, the general query is written using a ? as a placeholder for a parameter value. Parameter values are substituted as a second step. The substitution should be done by the JDBC driver such that the value can only be interpreted as the value for the parameter intended and any special characters within it should be automatically escaped by the driver for the database it targets. Different databases escape characters in different ways, so allowing the JDBC driver to handle this function also makes the system more portable.
If the following query (repeated from the example above) is made using a JDBC PreparedStatement, the value $INPUT[uid] would only be interpreted as a value for uid. This would be true regardless of any quotation marks or other special characters used in the input string.
UPDATE usertable SET pwd='$INPUT[pwd]' WHERE uid='$INPUT[uid]';
Common database interface layers in other languages offer similar protections. The Perl DBI module, for example, allows for prepared statements to be made in a way very similar to the JDBC PreparedStatement. Developers should test the behavior of prepared statements in their system early in the development cycle.
Use of prepared statements is not a panecea and proper input data validation is still strongly recommended. Defense in depth implies that both techniques should be used if possible. Also, some application infrastructures may not offer an analogue to the PreparedStatement. In these cases, the following two rules should be followed in the input validation step, if possible.
SQL queries should be built from data values and never other SQL queries or parts thereof.
If you must use an "explicitly bad" strategy then the application should filter special characters used in SQL statements. These include "+", "," "'" (single quote) and "=".
Nearly every programming language allows the use of so called "system-commands", and many applications make use of this type of functionality. System-interfaces in programming and scripting languages pass input (commands) to the underlying operating system. The operating system executes the given input and returns its output to stdout along with various return-codes to the application such as successful, not successful etc.
System commands can be a very convenient feature, which with little effort can be integrated into a web-application. Common usage for these commands in web applications are filehandling (remove,copy), sending emails and calling operating system tools to modify the applications input and output in various ways (filters).
Depending on the scripting or programming language and the operating-system it is possible to:
Alter system commands
Alter parameters passed to system commands
Execute additional commands and OS command line tools.
Execute additional commands within executed command
Some common problems to avoid are:
PHP
require()
include()
eval()
preg_replace() (with /e modifier)
exec()
passthru()
`` (backticks)
system()
popen()
Shell Scripts
often problematic and dependent on the shell
Perl
open()
sysopen()
glob()
system()
'' (backticks)
eval()
Java(Servlets, JSP's)
System.* (especially System.Runtime)
C & C++
system()
exec**()
strcpy
strcat
sprintf
vsprintf
gets
strlen
scanf
fscanf
sscanf
vscanf
vsscanf
vfscanf
realpath
getopt
getpass
streadd
strecpy
strtrns
Preventing direct OS commands is a challenging task especially for large distributed web systems consisting of several applications. Architecturally if all requests come in to a central location and leave from a central location then the problem is easier to solve with a common component. Validation is most effective when placed nearest to the intended entrance and exit points of a system, allowing more accurate assessment of the threats at every point.
If your input validation strategy is as we recommend, that is to say only accept expected input then the problem is significantly reduced. We cannot stress that this is the correct strategy enough!
Many web applications utilize the file system of the web server in a presentation tier to temporarily and/or permanently save information. This may include page assets like image files, static HTML or applications like CGI's. The WWW-ROOT directory is typically the virtual root directory within a web server, which is accessible to a HTTP Client. Web Applications may store data inside and/or outside WWW-ROOT in designated locations.
If the application does NOT properly check and handle meta-characters used to describe paths for example "../" it is possible that the application is vulnerable to a "Path Trasversal" attack. The attacker can construct a malicious request to return data about physical file locations such as /etc/passwd. This is often referred to as a "file disclosure" vulnerability. Attackers may also use this properties to create specially crafted URL's to Path traversal attacks are typically used in conjunction with other attacks like direct OS commands or direct SQL injection.
Scripting languages such as PHP, Perl, SSIs and several "template-based-systems" who automatically execute code located in required, included or evaluated files.
Traversing back to system directories which contain binaries makes it possible to execute system commands OUTSIDE designated paths instead of opening, including or evaluating file.
Where possible make use of path normalization functions provided by your development language. Also remove offending path strings such as "../" as well as their unicode variants from system input. Use of "chrooted" servers can also mitigate this issue.
Preventing path traversal and path disclosure is a challenging task especially for large distributed web systems consisting of several applications. Architecturally if all requests come in to a central location and leave from a central location then the problem is easier to solve with a common component.
If your input validation strategy is as we recommend, that is to say only accept expected input then the problem is significantly reduced. We can not stress that this is the correct strategy enough!
While web applications may be developed in a variety of programming languages, these applications often pass data to underlying lower level C-functions for further processing and functionality.
If a given string, lets say "AAA\0BBB" is accepted as a valid string by a web application (or specifically the programming language), it may be shortened to "AAA" by the underlying C-functions. This occurs because C/C++ perceives the null byte (\0) as the termination of a string. Applications which do not perform adequate input validation can be fooled by inserting null bytes in "critical" parameters. This is normally done by URL Encoding the null bytes (%00). In special cases it is possible to use Unicode characters.
The attack can be used to :
Disclose physical paths, files and OS-information
Truncate strings
Paths
Files
Commands
Command parameters
Bypass validity checks, looking for substrings in parameters
Cut off strings passed to SQL Queries
The most popular affected scripting and programming languages are:
Perl (highly)
Java (File, RandomAccessFile and similar Java-Classes)
PHP (depending on its configuration)
Just when you figured out and understood the most common attacks, canonicalization steps them all up a few gears!
Canonicalization deals with the way in which systems convert data from one form to another. Canonical means the simplest or most standard form of something. Canonicalization is the process of converting something from one representation to the simplest form. Web applications have to deal with lots of canonicalization issues from URL encoding to IP address translation. When security decisions are made based on canonical forms of data, it is therefore essential that the application is able to deal with canonicalization issues accurately.
As an example, one may look at the Unicode character set. Unicode is the internal format of the Java language. Unicode Encoding is a method for storing characters with multiple bytes. Wherever input data is allowed, data can be entered using Unicode to disguise malicious code and permit a variety of attacks. RFC 2279 references many ways that text can be encoded.
Unicode was developed to allow a Universal Character Set (UCS) that encompasses most of the world's writing systems. Multi-octet characters, however, are not compatible with many current applications and protocols, and this has led to the development of a few UCS transformation formats (UTF) with varying characteristics. UTF-8 has the characteristic of preserving the full US-ASCII range. It is compatible with file systems, parsers and other software relying on US-ASCII values, but it is transparent to other values.
In a Unicode Encoding attack, there are several unique issues at work. The variety of issues increases the complexity. The first issue involves Character Mapping while the second issue involves Character Encoding. An additional issue is related to whether the application supports Character Mapping and how that application encodes and decodes that mapping.
Table�11.1.�
| UCS-4 Range | UTF-8 encoding | 
|---|---|
| 0x00000000-0x0000007F | 0xxxxxxx | 
| 0x00000080 - 0x000007FF | 110xxxxx 10xxxxxx | 
| 0x00000800-0x0000FFFF | 1110xxxx 10xxxxxx 10xxxxxx | 
| 0x00010000-0x001FFFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | 
| 0x00200000-0x03FFFFFF | 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx | 
| 0x04000000-0x7FFFFFFF | 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx | 
It is thus possible to form illegal UTF-8 encodings, in two senses:
A UTF-8 sequence for a given symbol may be longer than necessary for representing the symbol.
A UTF-8 sequence may contain octets that are in incorrect format (i.e. do not comply with the above 6 formats).
The importance of UTF-8 representation stems from the fact that web-servers/applications perform several steps on their input of this format. The order of the steps is sometimes critical to the security of the application. Basically, the steps are "URL decoding" potentially followed by "UTF-8 decoding", and intermingled with them are various security checks, which are also processing steps. If, for example, one of the security checks is searching for "..", and it is carried out before UTF-8 decoding takes place, it is possible to inject ".." in their overlong UTF-8 format. Even if the security checks recognize some of the non-canonical format for dots, it may still be that not all formats are known to it. Examples: Consider the ASCII character "." (dot). Its canonical representation is a dot (ASCII 2E). Yet if we think of it as a character in the second UTF-8 range (2 bytes), we get an overlong representation of it, as C0 AE. Likewise, there are more overlong representations: E0 80 AE, F0 80 80 AE, F8 80 80 80 AE and FC 80 80 80 80 AE.
Consider the representation C0 AE of a certain symbol (see [1]). Like UTF-8 encoding requires, the second octet has "10" as its two most significant bits. Now, it is possible to define 3 variants for it, by enumerating the rest possible 2 bit combinations ("00", "01" and "11"). Some UTF-8 decoders would treat these variants as identical to the original symbol (they simply use the least significant 6 bits, disregarding the most significant 2 bits). Thus, the 3 variants are C0 2E, C0 5E and C0 FE.
To further "complicate" things, each representation can be sent over HTTP in several ways: In the raw. That is, without URL encoding at all. This usually results in sending non ASCII octets in the path, query or body, which violates the HTTP standards. Nevertheless, most HTTP servers do get along just fine with non ASCII characters.
Valid URL encoding. Each non ASCII character (more precisely, all characters that require URL encoding - a superset of non ASCII characters) is URL-encoded. This results in sending, say, %C0%AE.
Invalid URL encoding. This is a variant of [2], wherein some hexadecimal digits are replaced with non-hexadecimal digits, yet the result is still interpreted as identical to the original, under some decoding algorithms. For example, %C0 is interpreted as character number ('C'-'A'+10)*16+('0'-'0') = 192. Applying the same algorithm to %M0 yields ('M'-'A'+10)*16+('0'-'0') = 448, which, when forced into a single byte, yields (8 least significant bits) 192, just like the original. So, if the algorithm is willing to accept non hexadecimal digits (such as 'M'), then it is possible to have variants for %C0 such as %M0 and %BG.
It should be kept in mind that these techniques are not directly related to Unicode, and they can be used in non-Unicode attacks as well.
http://host/cgi-bin/bad.cgi?foo=../../bin/ls%20-al|
URL Encoding of the example attack:
http://host/cgi-bin/bad.cgi?foo=..%2F../bin/ls%20-al|
Unicode encoding of the example attack:
http://host/cgi-bin/bad.cgi?foo=..%c0%af../bin/ls%20-al| http://host/cgi-bin/bad.cgi?foo=..%c1%9c../bin/ls%20-al| http://host/cgi-bin/bad.cgi?foo=..%c1%pc../bin/ls%20-al| http://host/cgi-bin/bad.cgi?foo=..%c0%9v../bin/ls%20-al| http://host/cgi-bin/bad.cgi?foo=..%c0%qf../bin/ls%20-al| http://host/cgi-bin/bad.cgi?foo=..%c1%8s../bin/ls%20-al| http://host/cgi-bin/bad.cgi?foo=..%c1%1c../bin/ls%20-al| http://host/cgi-bin/bad.cgi?foo=..%c1%9c../bin/ls%20-al| http://host/cgi-bin/bad.cgi?foo=..%c1%af../bin/ls%20-al| http://host/cgi-bin/bad.cgi?foo=..%e0%80%af../bin/ls%20-al| http://host/cgi-bin/bad.cgi?foo=..%f0%80%80%af../bin/ls%20-al| http://host/cgi-bin/bad.cgi?foo=..%f8%80%80%80%af../bin/ls%20-al| http://host/cgi-bin/bad.cgi?foo=..%fc%80%80%80%80%af../bin/ls%20-al|
A suitable canonical form should be chosen and all user input canonicalized into that form before any authorization decisions are performed. Security checks should be carried out after UTF-8 decoding is completed. Moreover, it is recommended to check that the UTF-8 encoding is a valid canonical encoding for the symbol it represents.
Traditional web applications transfer data between client and server using the HTTP or HTTPS protocols. There are basically two ways in which a server receives input from a client; data can be passed in the HTTP headers or it can be included in the query portion of the requested URL. Both of these methods correspond to form input types (either GET or POST). Because of this, URL manipulation and form manipulation are simply two sides of the same issue. When data is included in a URL, it must be specially encoded to conform to proper URL syntax.
The RFC 1738 specification defining Uniform Resource Locators (URLs) and the RFC 2396 specification for Uniform Resource Identifiers (URIs) both restrict the characters allowed in a URL or URI to a subset of the US-ASCII character set. According to the RFC 1738 specification, "only alphanumerics, the special characters "$-_.+!*'(),", and reserved characters used for their reserved purposes may be used unencoded within a URL." The data used by a web application, on the other hand, is not restricted in any way and in fact may be represented by any existing character set or even binary data. Earlier versions of HTML allowed the entire range of the ISO-8859-1 (ISO Latin-1) character set; the HTML 4.0 specification expanded to permit any character in the Unicode character set.
URL-encoding a character is done by taking the character's 8-bit hexadecimal code and prefixing it with a percent sign ("%"). For example, the US-ASCII character set represents a space with decimal code 32, or hexadecimal 20. Thus its URL-encoded representation is %20.
Even though certain characters do not need to be URL-encoded, any 8-bit code (i.e., decimal 0-255 or hexadecimal 00-FF) may be encoded. ASCII control characters such as the NULL character (decimal code 0) can be URL-encoded, as can all HTML entities and any meta characters used by the operating system or database. Because URL-encoding allows virtually any data to be passed to the server, proper precautions must be taken by a web application when accepting data. URL-encoding can be used as a mechanism for disguising many types of malicious code.
Excerpt from script.php:
echo $HTTP_GET_VARS["mydata"];
HTTP request:
http://www.myserver.c0m/script.php?mydata=%3cscript%20src=%22http%3a%2f%2fwww.yourserver.c0m%2fbadscript.js%22%3e%3c%2fscript%3e
Generated HTML:
<script src="http://www.yourserver.com/badscript.js"></script>
Original database query in search.asp:
				sql = "SELECT lname, fname, phone FROM usertable WHERE lname='" & Request.QueryString("lname") & "';" 
				HTTP request:
http://www.myserver.c0m/search.asp?lname=smith%27%3bupdate%20usertable%20set%20passwd%3d%27hAx0r%27%3b--%00
Executed database query:
SELECT lname, fname, phone FROM usertable WHERE lname='smith';update usertable set passwd='hAx0r'
A suitable canonical form should be chosen and all user input canonicalized into that form before any authorization decisions are performed. Security checks should be carried out after decoding is completed. It is usually the web server itself that decodes the URL and hence this problem may only occur on the web server itself.