Drupal Functions for Sanitizing User Input

Ways to Protect Your Websites

This article was published in the print magazine Drupal Watchdog, Volume 2 Issue 2, on pages 18-19, by Tag1 Publishing. The magazine was distributed at DrupalCon Munich. The article was also published on their website. It was featured in TheWeeklyDrop, Issue 63.

Any website can be vulnerable to a variety of security problems, regardless of its underlying web technologies, and Drupal sites are no exception. The most common type of attack involves a visitor injecting ill-intentioned code that is presumed to be regular text. For instance, an attacker might submit a comment to a blog post, but instead of providing only innocuous text, he includes malicious JavaScript code, hoping that it will be rendered by the web browser of anyone later viewing that page. Another attack vector, known as SQL injection, works by submitting SQL code through a form field. If not handled properly, that code ends up as part of a database query and executes an unauthorized statement, such as truncating or dropping tables within the database, setting passwords to known values, or stealing user sessions.

That latter type of foul play is well addressed by Drupal's database API layer, which, if used properly and consistently in one's custom code, can negate the risk of an SQL injection breaching one's defenses. Consequently, Drupal developers and administrators will more likely encounter the former type of attack.
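
As a minimal sketch of what using the database layer "properly" can look like in Drupal 7: the helper below passes the user-supplied value to db_query() as a named placeholder instead of concatenating it into the SQL string. The module name "mymodule" and the function name are hypothetical, chosen only for illustration, and the code is meant to run inside a bootstrapped Drupal site.

```php
<?php

/**
 * Hypothetical helper: loads comments written under a given author name.
 *
 * The module and function names are illustrative assumptions.
 */
function mymodule_comments_by_author($author_name) {
  // Curly braces let Drupal apply any table prefix. The :name placeholder is
  // bound by the database layer, so $author_name is never spliced into the
  // SQL text itself.
  $result = db_query(
    'SELECT cid, subject FROM {comment} WHERE name = :name',
    array(':name' => $author_name)
  );
  return $result->fetchAll();
}
```

By contrast, assembling the query through string concatenation (for example, appending $author_name directly after "WHERE name = ") is the pattern that leaves the door open to SQL injection.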

Broadly speaking, there are two schools of thought regarding how best to avoid falling victim to any online miscreant attempting to force his code to be displayed in your pages' contents or URLs. It might seem that the safest defense is to never allow unvetted content into the website's database. This process of sanitizing all text beforehand could be thought of as "pre-filtering". Drupal generally takes the opposite approach ("post-filtering") — namely, allowing all submitted content into the database, but always sanitizing it on output.

This approach may seem counterintuitive, but it confers a number of benefits: Firstly, post-filtering preserves the evidence of any user who inserted dangerous content, even if done unintentionally. This is necessary for determining which user should be banned or, depending upon his intentions, be given training in safe content authoring. Secondly, overaggressive pre-filtering can lose valuable content — whether plain text or HTML markup — that probably cannot be restored later, at which point the original source may no longer be available. Thirdly, pre-filtering is useless in any situation where a module outputs data read from files that are presumed safe, and an attacker gains FTP access to the server and thus is able to modify them.

Policing in Plain Clothes

Drupal offers several built-in API functions to help your website avoid falling prey to this type of attack, by filtering out potentially dangerous components. For processing plain text (such as what a user might type into a single-line form field), the workhorse function is check_plain(), which sanitizes its input by converting quotation marks, ampersands, and angle brackets into their corresponding HTML entities. As a result, these characters will be displayed on the web page as typed, and not interpreted by the browser as HTML markup. For example, '<script src="evil.js"></script>' is neutered by check_plain() into the innocuous text "&lt;script src=&quot;evil.js&quot;&gt;&lt;/script&gt;". The documentation notes that it "also validates strings as UTF-8 to prevent cross site scripting attacks on Internet Explorer 6".
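
To make the transformation concrete, here is a standalone sketch that can be run outside Drupal. The function name sketch_check_plain() is hypothetical. It relies on PHP's native htmlspecialchars(), which is what check_plain() wraps in Drupal 7, so treat it as an approximation rather than a replacement for the real function.

```php
<?php

/**
 * Hypothetical stand-in for check_plain(), for experimenting outside Drupal.
 */
function sketch_check_plain($text) {
  // ENT_QUOTES escapes both double and single quotation marks.
  return htmlspecialchars((string) $text, ENT_QUOTES, 'UTF-8');
}

$comment = '<script src="evil.js"></script>';
echo sketch_check_plain($comment), "\n";
// Prints: &lt;script src=&quot;evil.js&quot;&gt;&lt;/script&gt;
```

Inside a Drupal module, you would call check_plain($comment) directly instead of defining a helper like this one.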

Incidentally, check_plain() is also helpful when you are not filtering content provided by outside users. For instance, in your own module code, you can and should use check_plain() when creating the titles of blocks and pages.

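In addition to calling check_plain() directly, you can utilize it indirectly by calling Drupal functions that pass some of their input through check_plain(). They include t(), for translating text from one language to another, which sanitizes any values substituted through "@" and "%" placeholders (but not "!" placeholders); and l(), for converting a URL into an HTML anchor element, which sanitizes the link text unless it is told that the text is already HTML.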
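
The following sketch, assuming Drupal 7, shows both functions handling a user-supplied name. The helper mymodule_greeting() is hypothetical. Note that t() escapes only the values substituted through "@" and "%" placeholders, and l() escapes the link text only when its 'html' option is not set to TRUE.

```php
<?php

/**
 * Hypothetical helper: builds a greeting with a link to the user's profile.
 */
function mymodule_greeting($username, $uid) {
  // The value bound to @name is passed through check_plain() by t().
  $greeting = t('Welcome back, @name.', array('@name' => $username));
  // l() escapes the link text, because the 'html' option is not set here.
  $link = l($username, 'user/' . (int) $uid);
  return $greeting . ' ' . $link;
}
```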

Enemies in Rich Raiment

The typical Internet user may tolerate only being able to submit plain text into single-line entry fields, but she will be displeased if a comment submission form is similarly limited to plain text, thereby preventing her from styling her entry by italicizing and bolding words, adding hyperlinks, highlighting words with background colors, etc. These forms of "rich text" can improve the quality and visual appeal of web pages, but they also pose the risk of a devious user trying to add malevolent code.

The function check_markup() is a powerful remedy, because it sanitizes text formatted in HTML and in lightweight markup languages such as BBCode, Markdown, and Textile. The transformation is performed according to whatever input format is specified in the function call (or the default format, if none is specified). It supports two optional parameters, which allow you to specify the human language of the text, and whether to cache the filtered output (in the cache_filter table).
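
A minimal sketch of such a call, assuming Drupal 7 and a text format whose machine name is "filtered_html" (the name used by the standard installation profile; your site may differ). The helper mymodule_render_body() is hypothetical.

```php
<?php

/**
 * Hypothetical helper: filters a stored comment body for display.
 */
function mymodule_render_body($raw_body, $format_id = 'filtered_html') {
  // Arguments, in order: the text, the machine name of the format, the
  // language code (empty here), and whether to cache the filtered result.
  return check_markup($raw_body, $format_id, '', TRUE);
}
```

The raw body stays untouched in the database; only the returned string is sent to the browser.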

More Security Goodness Baked In

You may want to output some rich text in cases where it would be overkill to use check_markup() and the input filtering system. For instance, your main forum page might be enhanced with links to external information sources, e.g., websites summarizing the lightweight markup language your visitors are using in their forum posts. Or, a client might request that their mission statement be included on every page on their website, in a smaller font, with some keywords highlighted; that text should be filtered before output, especially if a non-technical staff member may later edit it. Fortunately, filter_xss_admin() will remove any potentially dangerous tags, while leaving untouched the sort of harmless tags used for styling text. filter_xss() provides more granular control, because it accepts a list of the tags that are allowed to remain.
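
The two scenarios above might be handled along these lines in Drupal 7. Both helper names are hypothetical, and the list of allowed tags passed to filter_xss() is only an illustrative choice.

```php
<?php

/**
 * Hypothetical helper: filters a staff-edited mission statement.
 */
function mymodule_mission_statement($raw_text) {
  // Permissive filter: keeps ordinary body markup, removes scripts and styles.
  return filter_xss_admin($raw_text);
}

/**
 * Hypothetical helper: filters the introductory text of a forum page.
 */
function mymodule_forum_intro($raw_text) {
  // Stricter filter: only the listed tags survive; other tags are stripped.
  return filter_xss($raw_text, array('a', 'em', 'strong', 'small'));
}
```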

For any user-submitted URL, check_url() is more powerful than check_plain(), because it performs additional filtering for cross-site scripting (XSS) vulnerabilities. It will remove potentially harmful protocols from URLs, e.g., "javascript:". If a value being placed into a URL, such as a query string parameter, might contain characters such as "?" or "#", then it would be wise to filter that value through the native PHP function urlencode().
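
The idea behind protocol filtering can be illustrated with a simplified, standalone sketch. This is not Drupal's implementation: the function name and the list of allowed protocols are assumptions made for the example, and in a real module you would simply call check_url(). The last lines show urlencode() applied to a single query string value.

```php
<?php

/**
 * Hypothetical, simplified sketch of stripping disallowed URL protocols.
 *
 * The default list of allowed protocols is an assumption for illustration.
 */
function sketch_strip_dangerous_protocols($url, array $allowed = array('http', 'https', 'mailto')) {
  do {
    $before = $url;
    $colon = strpos($url, ':');
    if ($colon > 0) {
      $scheme = substr($url, 0, $colon);
      // A "/", "?" or "#" before the colon means this is a relative URL,
      // so the colon does not mark a protocol.
      if (preg_match('![/?#]!', $scheme)) {
        break;
      }
      if (!in_array(strtolower($scheme), $allowed, TRUE)) {
        // Drop the disallowed protocol, then check the remainder again.
        $url = substr($url, $colon + 1);
      }
    }
  } while ($before !== $url);
  return $url;
}

echo sketch_strip_dangerous_protocols('javascript:alert(1)'), "\n";
// Prints: alert(1)

// Encode a user-supplied value before placing it in a query string.
$keys = 'drupal? #security';
echo 'http://example.com/search?keys=' . urlencode($keys), "\n";
// Prints: http://example.com/search?keys=drupal%3F+%23security
```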

The code you write for any custom modules should be made secure from all the well-known attack techniques. Use of the aforementioned functions is arguably the most effective way to mitigate the risks of any forced-output type of attack.

Copyright © 2012 Michael J. Ross. All rights reserved.