Three Steps to Perform on Input Data to Make Your Software More Secure

Why is checking input data, as the first line of defense, crucial in software security? How is this defense achieved in practice?

Published in CodeX · Jun 18, 2021 · 4 min read

Photo by Alan Levine on Flickr (September 1, 2018).

The inputs that a piece of software consumes are one of the main entry points for anyone who wants to attack a computer system [1]. These input data are part of the system's attack surface, and attackers can use them whenever the code that handles them has an exploitable vulnerability.

In the OWASP Top 10 security risks for web applications, injection ranked first in 2017 and is likely to remain one of the top risks in 2021 [2]. Other attacks that exploit input data include buffer overflows, web parameter tampering, and cross-site scripting, and they are not restricted to web applications.

In general, input data are not limited to values collected from forms. They also include files read by the software, configuration data, values retrieved from databases, environment variables of the operating system, etc. In fact, input data cover anything that comes from outside the executable code and whose value is known only at run time, not at compile time.

Developers can apply three kinds of operations to input data to write more secure code. Each one reduces the number of vulnerabilities that can arise when input data are handled.

1. Validation

The validation step consists of checking that several characteristics of the input data are satisfied [3]. Aspects like the size of the input data, its domain, or its format can be checked. The validation has to be done before data are used in the application. For example, an application asking for your year of birth as input could check that:

It is an integer value with four digits,
And it is a strictly positive value in the [1900–2021] interval.

Concretely, validation can be written from scratch with basic constructs such as if-else statements. Regular expressions are also a great tool for checking the format of data. When one is available, it is better to use a dedicated library that offers standard validators and a way to define custom ones.
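
As a minimal sketch of the year-of-birth check described above, written from scratch with a regular expression and an if statement: the function name parse_birth_year and the constants are hypothetical, and the interval is simply the one given in the example.

```python
import re

# Bounds taken from the article's example interval (assumption: inclusive).
MIN_YEAR = 1900
MAX_YEAR = 2021

# Exactly four ASCII digits; fullmatch() below rejects anything around them.
YEAR_PATTERN = re.compile(r"[0-9]{4}")


def parse_birth_year(raw: str) -> int:
    """Return the year as an int, or raise ValueError if the input is invalid."""
    # Format check: signs, spaces, letters, and trailing newlines are rejected.
    if YEAR_PATTERN.fullmatch(raw) is None:
        raise ValueError("year must be exactly four digits")
    year = int(raw)
    # Domain check: the value must lie in the accepted interval.
    if not MIN_YEAR <= year <= MAX_YEAR:
        raise ValueError(f"year must be between {MIN_YEAR} and {MAX_YEAR}")
    return year
```

The function either returns a trusted integer or raises, so the rest of the application never sees an unchecked value.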

To be able to perform data validation, the expected properties of every input must first be defined: the type and the domain, the size, the format and the encoding for text values, the structure for binary values, the file format, etc.

For example, consider an online shopping cart application that does not validate its quantity fields. If a quantity can be negative, the corresponding line total is negative as well, so a customer may be able to lower the amount due or even turn an order into a credit.
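
The shopping cart scenario could be guarded along the following lines. This is only a sketch: CartLine, order_total, and the MAX_QUANTITY limit are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Iterable

MAX_QUANTITY = 100  # hypothetical per-line limit


@dataclass(frozen=True)
class CartLine:
    unit_price: Decimal
    quantity: int

    def __post_init__(self) -> None:
        # bool is a subclass of int, so it is rejected explicitly.
        if isinstance(self.quantity, bool) or not isinstance(self.quantity, int):
            raise ValueError("quantity must be an integer")
        # Domain check: a zero or negative quantity never reaches the total.
        if not 1 <= self.quantity <= MAX_QUANTITY:
            raise ValueError(f"quantity must be between 1 and {MAX_QUANTITY}")


def order_total(lines: Iterable[CartLine]) -> Decimal:
    """Sum the line totals; every line was validated when it was created."""
    return sum((line.unit_price * line.quantity for line in lines), Decimal("0"))
```

Validating in the constructor means an invalid line cannot exist at all, so the code that computes totals does not need to repeat the check.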

2. Canonicalization

The canonicalization step consists of transforming the values of input data into a canonical form [4]. This operation is necessary for values that can be represented in alternative ways, such as absolute and relative file paths, for example. The canonicalization has to be done before any other operation on the input data. For example, an application asking for file paths can first transform them all into absolute ones.

Concretely, the canonicalization step is done by choosing a unique representation and always transforming all the input data into the chosen one. It can again be performed either “by hand” or with the help of specific libraries.

Canonicalization therefore removes the need for the remaining code to handle several cases depending on the representation. It should be applied to paths and URLs, Unicode strings, XML and HTML documents, etc.
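
For Unicode strings, one possible canonical form is NFC, available through Python's standard unicodedata module. The sketch below assumes NFC is the representation chosen by the application; the helper name canonical_text is hypothetical.

```python
import unicodedata


def canonical_text(raw: str) -> str:
    """Convert text to the single representation chosen here: Unicode NFC."""
    return unicodedata.normalize("NFC", raw)


# The same visible word, encoded in two different ways.
composed = "caf\u00e9"      # 'é' as one code point
decomposed = "cafe\u0301"   # 'e' followed by a combining acute accent

# Before canonicalization the strings differ; afterwards they compare equal.
assert composed != decomposed
assert canonical_text(composed) == canonical_text(decomposed)
```

Once every incoming string goes through canonical_text, comparisons and lookups in the remaining code only have to deal with one representation.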

For example, consider a web application that lets its clients download files whose name is given as a parameter. By supplying a relative path (one containing ".." segments, for instance), an attacker may be able to download a file they are not supposed to access.
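
A download handler could canonicalize the requested name before using it, roughly as sketched here with Python's pathlib. The function name resolve_download and the choice of PermissionError are assumptions made for the example.

```python
from pathlib import Path


def resolve_download(base_dir: str, requested_name: str) -> Path:
    """Return the absolute path of the requested file inside base_dir.

    Raises PermissionError if the canonical path escapes base_dir.
    """
    base = Path(base_dir).resolve()
    # Canonicalize first: resolve() collapses ".." segments and symlinks.
    candidate = (base / requested_name).resolve()
    try:
        # Then check the canonical form, not the raw string.
        candidate.relative_to(base)
    except ValueError:
        raise PermissionError("requested file is outside the download directory") from None
    return candidate
```

The check only establishes that the path stays inside the directory; whether the file exists and is a regular file still has to be verified before serving it.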

3. Sanitization

Finally, the sanitization step consists of transforming the values of input data by removing some of their parts or writing them differently [5]. This step is important for values that are then interpreted in the program. Sanitization must be performed before using the input data. For example, a blog application may need to remove JavaScript code from the input texts.

Concretely, the sanitization step is typically done by applying several filters to the values of the input data, by escaping special characters, or by using prepared statements when working with a DBMS.
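
Two of these techniques can be sketched with Python's standard library: escaping special characters with html.escape, and a parameterized query with sqlite3. Note that the first function escapes markup rather than removing it, and that the users table and both function names are hypothetical.

```python
import html
import sqlite3
from typing import Optional


def render_comment(raw: str) -> str:
    """Escape HTML special characters so the text is displayed, not interpreted."""
    return "<p>" + html.escape(raw, quote=True) + "</p>"


def find_user_id(conn: sqlite3.Connection, name: str) -> Optional[int]:
    """Look up a user by name with a parameterized query.

    The "?" placeholder keeps the value separate from the SQL text,
    so quotes inside `name` cannot change the structure of the query.
    """
    row = conn.execute("SELECT id FROM users WHERE name = ?", (name,)).fetchone()
    return row[0] if row else None
```

With these helpers, a comment containing a script tag is rendered as visible text, and a name such as alice' OR '1'='1 is treated as an ordinary (non-matching) string.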

Data sanitization is an important step, mainly to avoid attacks such as code injection or cross-site scripting. It can also make applications more user-friendly, by imposing fewer restrictions on users and cleaning their inputs in the program.

The three steps presented in this article form the first line of defense in software security, because input data are often what attackers use to get into a system. To check the robustness of a program, it is important to test it thoroughly against many inputs and, in particular, malicious ones. One technique worth looking at is input fuzzing [6].
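
To give an idea of what input fuzzing looks like, here is a deliberately naive sketch: it feeds random strings to a function and records anything other than a clean rejection. The target parse_quantity, its 1 to 100 range, and the fuzz helper are all hypothetical; real fuzzers are far more sophisticated.

```python
import random
import string


def parse_quantity(raw: str) -> int:
    """Hypothetical function under test: accept integers from 1 to 100."""
    value = int(raw)  # raises ValueError on malformed text
    if not 1 <= value <= 100:
        raise ValueError("quantity out of range")
    return value


def fuzz(target, runs: int = 2000, seed: int = 0) -> list:
    """Call target with random strings and collect unexpected behaviors."""
    rng = random.Random(seed)  # seeded so that a failing run can be replayed
    alphabet = string.printable + "\u0663\u00e9\x00"
    failures = []
    for _ in range(runs):
        raw = "".join(rng.choice(alphabet) for _ in range(rng.randint(0, 12)))
        try:
            result = target(raw)
        except ValueError:
            continue  # a clean rejection is the expected outcome
        except Exception as exc:
            # Any other exception is a crash that an attacker could trigger.
            failures.append((raw, repr(exc)))
            continue
        # Accepted values must still satisfy the documented range.
        if not 1 <= result <= 100:
            failures.append((raw, f"accepted out-of-range value {result}"))
    return failures
```

Calling fuzz(parse_quantity) returns the list of problematic inputs, each paired with a description of what went wrong, which can then be turned into regression tests.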

To conclude, as is always the case in computer security, it is impossible to remove all the vulnerabilities and reduce the risks to zero. We can only try to do our best to decrease the risks as much as possible. We also have to pay attention to residual risks when checking input. For example, regular expressions for data validation can be vulnerable to ReDoS attacks.
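
One common way to limit the ReDoS risk is to bound the input length before matching and to avoid nested quantifiers in the pattern. The sketch below assumes a hypothetical "title" field made of words separated by single spaces; MAX_LEN and is_valid_title are illustrative names.

```python
import re

MAX_LEN = 64  # hypothetical upper bound on the accepted input size

# A pattern with nested quantifiers, such as r"(\w+\s?)+", can backtrack
# exponentially on inputs that almost match. The pattern below is stricter
# (single spaces only) and has a mandatory separator between repetitions,
# which avoids that exponential case.
TITLE_PATTERN = re.compile(r"\w+(?: \w+)*")


def is_valid_title(raw: str) -> bool:
    """Check a title: bounded length first, then the format."""
    # Size check first, so the regex engine never sees a huge input.
    if len(raw) > MAX_LEN:
        return False
    return TITLE_PATTERN.fullmatch(raw) is not None
```

The length check is cheap and runs before the regular expression, so even a poorly chosen pattern would only ever be applied to short strings.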

References

[1] The European Union Agency for Cybersecurity (2016). The Dangers of Trusting User Input, Cybersecurity info notes.
[2] OWASP (2017). Top 10 Web Application Security Risks, OWASP Top Ten.
[3] OWASP (2018). C5: Validate All Inputs, OWASP Top Ten Proactive Controls.
[4] William L. Fithen (2013). Ensure that Input Is Properly Canonicalized, Carnegie Mellon University.
[5] Kevin Smith (2019). Sanitize Your Inputs?, Kevin Smith — In it for the long haul.
[6] OWASP (2021). Fuzzing.