How to investigate a web attack

In this post I showcase samples of useful search queries when investigating a web attack.

Time for school

What is a web attack?

A web attack targets vulnerabilities in websites to gain unauthorized access, obtain confidential information, introduce malicious content, or alter the website’s content.

The Open Web Application Security Project, or OWASP, is a popular project dedicated to web application security. They report on the most critical risks concerning web application security, with their OWASP Top 10 report focusing on the top 10 security risks.

In this post we are going to see how we can identify some of the most common and popular web attacks:

SQL Injection (SQLi)

SQLi is an attack where a web application directly includes unsanitized data provided by the user in SQL queries.

Cross-site scripting (XSS)

XSS is a type of injection based web security vulnerability that enables malicious code to be run.

Command Injection

Command Injection attacks happen when the data received from a user is not sanitized and is directly transmitted to the operating system shell.

Insecure Direct Object Reference (IDOR)

IDOR attacks targets the lack (or misconfiguration) of an authorization mechanism. It essentially enables an attacker to gain access to an object that belongs to another.

Among the highest web application vulnerability security risks published in the 2021 OWASP, IDOR, or “Broken Access Control”, takes 1s place 🥇.

Local File Inclusion (LFI) & Remote File Inclusion (RFI)

File inclusion is a security vulnerability that occurs when a file is included without sanitizing the data obtained from a user.

On LFI, the file that is intended to be included is on the same web server that the web application is hosted on.

On RFI, the file that is intended to be included is hosted on a different server.

What is pandas?

pandas is a software library written for Python, that focuses on data manipulation and analysis.

In particular, it offers data structures and operations for manipulating numerical tables and time series.

This library allows us to take a large data set (for instance, access logs from a web server) and investigate it using filters as search queries.

What we need

Python
pandas library (pip install pandas)
Apache access log

Tutorial

This tutorial will follow 3 steps:

Step 1. Parse server logs
Step 2. Investigate logs
Step 3. Report results

Step 1. Parse server logs

We first need to retrieve our server’s log and identify its format.

Check here the possible formats you can encounter when analyzing Apache logs.

In this case, my server’s logs are being stored with a Combined Log Format. Like so:

127.0.0.1 - Olga [10/Dec/2019:13:55:36 -0100] "GET /server-status HTTP/1.1" 200 2326 "http://localhost/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36"

Where:

127.0.0.1 → IP address of the client that made the request;
⁠-⁠ → The hyphen defining the second field in the log file is the identity of the client. This field is often returned as a hyphen and Apache’s HTTP server documentation recommends that this particular field not be relied upon except in the case of a controlled internal network.
Olga → userid of the person requesting the resource;⁠⁠
[10/Dec/2019:13:55:36 -0700] → date, time and time zone of the request;
"GET /server-status HTTP/1.1" → request type and resource being requested;
200 → server HTTP response status code;
2326 → size of the object returned to the client.
"http://localhost/" → This is the HTTP referrer, which represents the address from which the request for the resource originated.
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36" → This is the User Agent, which identifies information about the browser that the client is using to access the resource.

Taking the log format into account, we want to parse the logs in order to have them organize like the following table:

ip	clientid	userid	datetime	timezone	request	url	response	size	referrer	useragent
127.0.0.1	-⁠	Olga	10/Dec/2019 13:55:36	-0100	GET	/server-status HTTP/1.1”	200	2326	“http://localhost/“	“Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36”

Let’s see how we can do this 👇

Open python console and follow the steps.

1. Read logs into a dataframe

import pandas as pd
df = pd.read_csv(r"your-path-to-file\apache access logs.txt", sep='\"', engine='python', header=None, skipinitialspace = True)
df      # to see what the dataframe contains

2. Remove leading and trailing whitespaces

for column in df:
    df[column] = df[column].str.strip()

3. Reformat first column

# create new dataframe from spliting the original dataframe
logs = df[0].str.split(' ', expand=True)

# remove '[' and ']'
logs[3] = logs[3].str.replace('[', '', regex=False)
logs[4] = logs[4].str.replace(']', '', regex=False)

# rename columns
logs.rename(columns = {0:'ip', 1:'clientid', 2:'userid', 3:'datetime', 4:'timezone'}, inplace=True)

# change date to date object
logs['datetime'] = pd.to_datetime(logs['datetime'], format='%d/%b/%Y:%H:%M:%S')

4. Reformat request column

logs[['request', 'url']] = df[1].str.split(' ', n=1, expand=True)

5. Divide http response from size

logs[['response', 'size']] = df[2].str.split(' ', n=1, expand=True)

6. Add referrer and user-agent columns

logs['referrer'] = df[3]
logs['useragent'] = df[4]

7. Delete original dataframe

del df

Now we have a dataframe named ⁠logs⁠ with clean values.

Step 2. Investigate results

Get the most relevant results

First, if the attack came from the internet, disregard internal IPs:

maskIP = ~logs.ip.str.contains("10.|127.0.0.1", na=False)

Also, we can disregard .gif and .ico⁠ resources:

maskICO = ~logs.request.str.contains(".gif|.ico", na=False)

Look for SQLi payloads

Look for common SQL terms and the symbols.

You can check some frequently used SQL Injection payloads here.

SQL Mask:

maskSQL = logs.url.str.contains("%27|SELECT|UNION|SLEEP|AND|OR|CHR|INSERT|WHERE|EXEC", case=False, na=False) 

Note: the symbol “'” commonly seen in SQLi payloads can appear in percent encoding, as ⁠%27⁠.

Apply masks:

logs[maskIP & maskICO & maskSQL].sort_values(by="datetime")

If this returns results, because these are specific words that belong to SQL, we can determine that we are face to face with a SQL Injection attack.

Look for XSS payloads

Look for keywords which are commonly used in XSS payloads, such as “alert” and “script” .

You can examine some frequently used payloads here.

XSS Mask:

maskXSS = logs.url.str.contains("script|prompt|console.log|alert|confirm|document.write|String.fromCharCode", case=False, na=False)

Apply masks:

logs[maskIP & maskXSS].sort_values(by="datetime").tail(60)

If this returns results, because these are specific words that belong to typical XSS attacks, we can determine that we are face to face with a XSS attack.

Look for Command Injection payloads

Look for keywords related to the terminal language, such as:

Linux	Windows
`whoami`	`whoami`
`ls`	`ver`
`cp`	`ipconfig`
`cat`	`netstat`
`type`	`tasklist`

Command Injection mask:

maskCMD = logs.useragent.str.contains("dir|ls|cp|cat|type|etc|echo|whoami|pwd", case=False, na=False) | logs.url.str.contains("dir|ls|cp|cat|type|etc|echo|whoami|pwd", case=False, na=False)

Apply masks:

logs[maskIP & maskCMD].sort_values(by="datetime").tail(60)

Look for an IDOR attack

IDOR attacks are more difficult to detect than other attacks because they do not have certain payloads, such as SQL Injection and XSS attacks usually have.

One approach is for us to look for the IP addresses that connected most to the webserver

Get the most relevant IP addresses

Search for the IP addresses that connected most to the webserver (remember our maskIP to ignore internal IPs):

logs[maskIP].groupby(["ip"]).count().sort_values(by="datetime").tail(60)

Take the top 5 source IPs and check what they did. For each do:

logs[logs.ip == "<YOUR_IP>"].sort_values(by="datetime").head(60)

Get the most relevant requests by the most relevant IP addresses

With this mask built, now we can see the most relevant IPs and associated relevant requests:

logs[maskIP & maskICO].groupby(["ip", "url"]).count().sort_values(by="datetime").tail(60)

Look for an RFI/LFI attack

Some things we can look for when investigating an LFI attack:

Target files. Attackers will exploit LFI vulnerabilities by manipulating the file location parameter, in an effort to display the contents of target files on a UNIX / Linux based system, such as ⁠/etc/passwd.
Directory traversal. Because attackers do not known what directory the web application is in, they will try to reach the “root” directory using “../”.
Null byte injection. This bypasses application filtering within web applications by adding URL encoded “Null bytes” such as %00.
LFI Wrapper rot13 and base64 - php://filter . This wrapper allows an attacker include local files and encodes the output (base64 or rot13). Therefore, any base64 output will need to be decoded to reveal the contents.

LFI Mask:

maskLFI = logs.url.str.contains("etc/|passwd|%00|%2500|../|base64|rot13|proc/", case=False, na=False)

RFI Mask:

maskRFI = logs.url.str.contains("http:|base64|php:expect:", case=False, na=False)

Apply both masks:

logs[maskIP & (maskLFI | maskRFI)].sort_values(by="datetime").tail(60)

Step 3. Report

When we find an evidence of malicious activity, we should include it in our report.

To see the full results in the python console we can simply use the ⁠print⁠ function:

print(*logs[YOUR-MASK].sort_values(by="datetime").to_csv().split("\n"), sep="\n")

With pandas, we can directly output a search result to a csv file.

logs[YOUR-MASK].sort_values(by="datetime").to_csv("report.csv", mode='w')

Tip: use ⁠mode="a+"⁠ to append multiple results together in the same file.

Finally, what should be included in our report?

The attacker’s IP address
The date the attack started and ended
If the attack was successful. If so:
- The type of the attack
- If the attack was performed by an automated tool (check time of requests and UserAgent)

And voilà!

Happy Hunting 🕵️‍♀️