Webscraping with Powershell - an Introduction

by Tomaso Vasella
on February 11, 2021
time to read: 21 minutes

Keypoints

How to use Powershell for Webscraping

  • Powershell offers potent functionalities for executing HTTP requests
  • There are significant differences between the Powershell versions
  • With a little effort, the request parameters can be customized to a large extent
  • With current web browsers the useful XPath expressions can be easily generated

In the broadest sense, the term web scraping refers to the more or less automated extraction of information from web pages. This requires essentially two steps: First, the data must be retrieved from the web server as specifically as possible, and second, this data must be programmatically interpreted (parsed) so that the desired information can then be extracted for further processing.

Although modern websites increasingly offer APIs and many web applications use APIs for data access, web pages are usually formatted for human consumption. Technically, content data, such as the text of an article, is mixed with control data, metadata, formatting information, images, and other data. While this data is necessary for the functionality and look and feel of a website, it is rather impeding from a data extraction perspective. Therefore, suitable tools are needed to retrieve the desired information from this data medley.

There are many reasons for such data extractions. For example, you may want to create an RSS feed based on the content of a web page, or collect publicly available information as part of a Red Team engagement. Although web scraping sounds fairly simple, in practice it can prove surprisingly tedious. Powershell may not be the first choice for this task, but given its prevalence, it’s worth knowing its capabilities. Additionally, it is sometimes necessary to operate in restricted environments where alternatives such as Python or Perl are not available.

HTTP Requests with Powershell

The following table gives an overview of the most common methods to perform HTTP requests using Powershell.

Method | Technology | PS Version | Remarks
------ | ---------- | ---------- | -------
Invoke-WebRequest | Powershell cmdlet | 3 or higher | Performs extensive parsing, except when using Powershell version 7, since in that version the HTML DOM of Internet Explorer is no longer available.
System.Net.WebClient | .NET | or higher | According to Microsoft: We don’t recommend that you use the WebClient class for new development. Instead, use the System.Net.Http.HttpClient class.
System.Net.HttpWebRequest | .NET | 2 or higher | According to Microsoft: We don’t recommend that you use the HttpWebRequest class for new development. Instead, use the System.Net.Http.HttpClient class.
System.Net.Http.HttpClient | .NET | 3 or higher |

Besides the described methods several other possibilities exist, e.g. an Internet Explorer COM Automation object can be instantiated via Powershell or other COM objects such as Msxml2.XMLHTTP can be used.

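$ie = New-Object -ComObject 'InternetExplorer.Application'
$ie.Navigate('https://www.example.com')
# Navigate() returns immediately, so wait until the page has finished loading
while ($ie.Busy) { Start-Sleep -Milliseconds 100 }
$ie.Document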

Invoke-Webrequest

This cmdlet is probably the most common way to use Powershell for retrieving data from a website:

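# Simple GET request
$res = Invoke-WebRequest -Uri 'http://www.example.com'
# Save the response body to a file (nothing is returned unless -PassThru is added)
Invoke-WebRequest -Uri 'http://www.example.com' -OutFile index.html
# POST request with a request body
$res = Invoke-WebRequest -Uri 'http://www.example.com' -Method POST -Body "postdata"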

In contrast to the other methods described in this article, Invoke-Webrequest automatically sets the User-Agent header. Its value can be changed with the parameter -UserAgent. The parameter -Headers exists as well, but it does not allow customizing certain headers such as User-Agent. Full control over the headers can only be achieved with the .NET classes described below.
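
As a small illustration of these two parameters, the following sketch collects the request arguments in a hashtable and passes them on via splatting. The function name Invoke-CustomRequest, the Accept-Language header and the choice of the predefined Chrome string from the PSUserAgent class are assumptions made for this example.

```powershell
# Predefined user agent strings are exposed by the PSUserAgent class
$userAgent = [Microsoft.PowerShell.Commands.PSUserAgent]::Chrome

# Request arguments collected in a hashtable (header value chosen for illustration)
$requestParams = @{
    Uri       = 'https://www.example.com'
    UserAgent = $userAgent
    Headers   = @{ 'Accept-Language' = 'en-US' }
}

# Hypothetical helper: splatting (@Params) expands the hashtable into named parameters
function Invoke-CustomRequest
{
    param([hashtable]$Params)
    Invoke-WebRequest @Params
}

# Example call (requires network access):
# $res = Invoke-CustomRequest -Params $requestParams
```

Keeping the arguments in one hashtable makes it easy to reuse the same user agent and headers for a series of requests.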

If no error occurs, an object of the type Microsoft.PowerShell.Commands.HtmlWebResponseObject is returned (BasicHtmlWebResponseObject under Powershell 7). Depending on the version of Powershell, the properties of this object differ: In version 5, the content supplied by the web server is parsed and is available with the property .ParsedHtml. This property is missing under Powershell 7.

The automatic parsing can be very useful for quick web scraping tasks. For example, all links of a web page can be extracted with one simple command:

(Invoke-WebRequest -Uri 'https://www.example.com').Links.Href

The entire body of the downloaded page is stored in the .Content property of the result object. If the headers are also desired, they can be retrieved from the .RawContent property.

If an error occurs, no object is returned. However, the error message contained in the exception can be retrieved:

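try
{
    $res = Invoke-WebRequest -Uri 'http://getstatuscode.com/401'
}
catch [System.Net.WebException]
{
    # Evaluate the exception only if one was actually thrown
    $ex = $_.Exception
    $msg = $ex.Message
    # value__ yields the numeric HTTP status code
    $status = $ex.Response.StatusCode.value__
}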

In case of errors where the web server does return content, for example a custom 401 error page, this content is also accessible via the exception as follows (access to the response headers is also shown):

try
{
    $res = Invoke-WebRequest -Uri 'http://getstatuscode.com/401'
}
catch
{
    # The response attached to the exception holds both body and headers
    $response = $_.Exception.Response
    $rs = $response.GetResponseStream()
    $reader = New-Object System.IO.StreamReader($rs)
    $content = $reader.ReadToEnd()

    foreach($header in $response.Headers.AllKeys)
    {
        write-host $('{0}: {1}' -f $header, $response.Headers[$header])
    }
}

When sending multiple requests, e.g. automated in a loop, it sometimes becomes noticeable that Invoke-Webrequest can be very slow. This is due to the extensive parsing of the response that is enabled by default. Additionally, a progress bar is often displayed for each request.

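The following parameters and settings can help to speed things up:

  • The parameter -UseBasicParsing disables the extensive parsing of the response
  • Setting $ProgressPreference = 'SilentlyContinue' suppresses the progress bar

By default, Invoke-Webrequest automatically follows up to 5 redirects. This can be influenced with the parameter -MaximumRedirection.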
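
The usual remedies for the slowdown described above are the -UseBasicParsing switch and the $ProgressPreference variable. A minimal sketch of how they might be combined follows; the function name Get-PageQuickly and the redirect limit of 2 are arbitrary choices for illustration.

```powershell
function Get-PageQuickly
{
    param([string]$Uri)

    # Suppress the progress bar; the assignment is local to this function
    $ProgressPreference = 'SilentlyContinue'

    # -UseBasicParsing skips the Internet Explorer based DOM parsing,
    # -MaximumRedirection limits the number of redirects that are followed
    Invoke-WebRequest -Uri $Uri -UseBasicParsing -MaximumRedirection 2
}

# Example call (requires network access):
# $res = Get-PageQuickly -Uri 'https://www.example.com'
```

Setting the preference variable inside the function leaves the progress bar behavior of the calling session unchanged.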

Proxy Support

Invoke-Webrequest uses the proxy defined in the Windows settings by default. This can be overridden with the -Proxy parameter, which takes a URI as argument. If the proxy requires authentication, the credentials can be specified with the -ProxyCredential parameter, which requires an argument of type PSCredential. For example:

$secPw = ConvertTo-SecureString '************' -AsPlainText -Force
$creds = New-Object System.Management.Automation.PSCredential -ArgumentList 'username', $secPw
$res = Invoke-WebRequest -Uri 'https://www.example.com' -Proxy 'http://127.0.0.1:8080' -ProxyCredential $creds

Alternatively, the parameter -ProxyUseDefaultCredentials can be specified which results in using the credentials of the current user.

Sessions and Cookies

Invoke-Webrequest is able to use cookies and thus also supports cookie based sessions by specifying the -SessionVariable parameter. The corresponding cookies / session can then be used in subsequent requests with the -WebSession parameter.

# -SessionVariable takes the name of the variable without the leading $
Invoke-WebRequest -SessionVariable 'Session' -Uri 'https://www.google.com'
Invoke-WebRequest -WebSession $Session -Uri 'https://www.google.com'

The cookies are available with the property .Cookies and it is possible to define custom cookies. In the following example, the session object is created in advance, which can be useful to set custom cookies before executing the first request:

$cookie = New-Object System.Net.Cookie
$cookie.Name = "specialCookie"
$cookie.Value = "value"
# The domain has to match the host of the request for the cookie to be sent
$cookie.Domain = "www.example.com"

$session = New-Object Microsoft.PowerShell.Commands.WebRequestSession
$session.Cookies.Add($cookie)
Invoke-WebRequest -WebSession $session -Uri 'https://www.example.com'

System.Net.WebClient

Accessing websites is also quite easy with this method, as the following examples show.

$wc = New-Object System.Net.WebClient
$res = $wc.DownloadString('https://www.example.com')

In contrast to using Invoke-Webrequest, this method does not set headers automatically. It is possible to specify them though, for example to set the user agent which is usually advisable, since web pages sometimes behave differently without it.

$wc.Headers.Add('User-Agent', 'Mozilla/5.0 (Windows NT; Windows NT 10.0; en-US)')

Should an error occur, any content supplied by the web server is not made available in the result object. To access it, a procedure similar to the one described above can be used. However, this does not work with Powershell 1, since try / catch constructs were only introduced with version 2.

try
{
        $res = $wc.DownloadString('http://getstatuscode.com/401')
}
catch [System.Net.WebException] 
{
        $rs = $_.Exception.Response.GetResponseStream()
        $reader = New-Object System.IO.StreamReader($rs)
        $content = $reader.ReadToEnd()
}

Proxy Support

By default, this method also uses the proxy defined in the Windows settings, which can be retrieved with [System.Net.WebProxy]::GetDefaultProxy(). It is possible to explicitly define a proxy:

$proxy = New-Object System.Net.WebProxy('http://127.0.0.1:8080')
$proxyCreds = New-Object Net.NetworkCredential('username', '************')

$wc = New-Object System.Net.WebClient
$wc.Proxy = $proxy
$wc.Proxy.Credentials = $proxyCreds

Alternatively, the default credentials of the current user can be used:

$wc.UseDefaultCredentials = $true
$wc.Proxy.Credentials = $wc.Credentials

Sessions and Cookies

System.Net.WebClient does not provide convenient methods for cookies. The corresponding response headers must be read and processed manually, which can quickly become tedious.

$cookie = $wc.ResponseHeaders["Set-Cookie"] 
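
One way to avoid parsing the header by hand is to pass it to a System.Net.CookieContainer, whose SetCookies() method interprets Set-Cookie style strings. The sketch below uses a hard-coded header value instead of a live response; the cookie name and value are made up for illustration.

```powershell
$uri = [Uri]'https://www.example.com/'

# Stand-in for $wc.ResponseHeaders['Set-Cookie'] (made-up value)
$setCookie = 'sessionid=abc123; Path=/; HttpOnly'

# Let the cookie container parse the header in the context of the request URI
$cookieJar = New-Object System.Net.CookieContainer
$cookieJar.SetCookies($uri, $setCookie)

# Read the parsed cookie back by its name
$parsedCookie = $cookieJar.GetCookies($uri)['sessionid']
$parsedCookie.Value

# Build the Cookie request header for a follow-up request
$cookieHeader = $cookieJar.GetCookieHeader($uri)
# $wc.Headers.Add('Cookie', $cookieHeader)
```

The container takes attributes such as Path into account when deciding which cookies belong to a given URI, which is exactly the bookkeeping that is tedious to do manually.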

System.Net.HttpWebRequest

This method is also relatively simple to implement and follows the same principles as those already mentioned.

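$wr = [System.Net.HttpWebRequest]::Create('http://www.example.com')
try
{
    $res = $wr.GetResponse()
}
catch [System.Net.WebException]
{
    # On HTTP errors the response (if any) is attached to the exception
    $ex = $_.Exception
    $res = $ex.Response
}

# $res stays empty if the server could not be reached at all
if ($null -ne $res)
{
    $rs = $res.GetResponseStream()
    $rsReader = New-Object System.IO.StreamReader $rs
    $data = $rsReader.ReadToEnd()
}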

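POST requests are also easily possible:

# Encode the request body explicitly instead of casting characters to bytes
$postData = [System.Text.Encoding]::UTF8.GetBytes('postdatastring')
$wr = [System.Net.HttpWebRequest]::Create('http://ptsv2.com/t/g1eiu-1612179889/post')
$wr.Method = 'POST'
$requestStream = $wr.GetRequestStream()
$requestStream.Write($postData, 0, $postData.Length)
# Close the request stream before asking for the response
$requestStream.Close()

try
{
    $res = $wr.GetResponse()
}
catch [System.Net.WebException]
{
    # On HTTP errors the response (if any) is attached to the exception
    $ex = $_.Exception
    $res = $ex.Response
}

$rs = $res.GetResponseStream()
$rsReader = New-Object System.IO.StreamReader $rs
$data = $rsReader.ReadToEnd()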

Should an error occur, any content sent by the server is not available in the response object either, but can be read from the exception thrown. Headers are not set automatically, but they can be specified easily. Restricted headers such as User-Agent have to be set through the corresponding property of the request object:

$wr.UserAgent = 'Mozilla/5.0 (Windows NT; Windows NT 10.0; en-US)'

Proxy Support

This method also uses the proxy defined in the Windows settings by default, which can be easily overridden analogous to the example above:

$proxy = New-Object System.Net.WebProxy('http://127.0.0.1:8080')
$wr.proxy = $proxy

Sessions and Cookies

Cookies can be used by defining a cookie container. This is done before sending the request, which is submitted with GetResponse().

$cookieJar = New-Object System.Net.CookieContainer
$wr.CookieContainer = $cookieJar
$res = $wr.GetResponse()

System.Net.Http.HttpClient

With HttpClient the .NET Framework offers yet another class for web requests which can be used via Add-Type. The snippet shown below also returns the content supplied by the server in the event of an error (e.g. 404), if available.

Add-Type -AssemblyName System.Net.Http
$httpClient = New-Object System.Net.Http.HttpClient
try
{
    $task = $httpClient.GetAsync('http://www.example.com')
    $task.Wait()
    $res = $task.Result

    # HTTP errors such as 404 do not throw; check the status of the response instead
    if (-not $res.IsSuccessStatusCode)
    {
        write-host $('Error: Status {0}, reason {1}.' -f [int]$res.StatusCode, $res.ReasonPhrase)
    }
    return $res.Content.ReadAsStringAsync().Result

}
catch [Exception]
{
    write-host ('Error: {0}' -f $_)
} finally {
    if($null -ne $res)
    {
        $res.Dispose()
    }
}

As can be seen in this example, GetAsync() is executed asynchronously, i.e. it does not block and thus can be parallelized. However, if you want to process the supplied content immediately, you have to wait for the result as shown above.
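
To illustrate the parallel use mentioned above, the following sketch starts several requests before waiting for any of them. The function name Get-StatusCodesParallel and its output format are assumptions made for this example; error handling is deliberately minimal.

```powershell
Add-Type -AssemblyName System.Net.Http

function Get-StatusCodesParallel
{
    param($Client, [string[]]$Urls)

    # Start all requests first; each GetAsync() call returns a task immediately
    $tasks = @(foreach ($url in $Urls) { $Client.GetAsync($url) })

    try
    {
        # Block until every request has finished
        [System.Threading.Tasks.Task]::WaitAll([System.Threading.Tasks.Task[]]$tasks)
    }
    catch
    {
        # At least one request failed; the individual tasks are inspected below
    }

    for ($i = 0; $i -lt $tasks.Count; $i++)
    {
        $task = $tasks[$i]
        if ($task.Status -eq 'RanToCompletion')
        {
            '{0}: {1}' -f $Urls[$i], [int]$task.Result.StatusCode
            $task.Result.Dispose()
        }
        else
        {
            '{0}: request failed' -f $Urls[$i]
        }
    }
}

# Example call (requires network access):
# $client = New-Object System.Net.Http.HttpClient
# Get-StatusCodesParallel -Client $client -Urls 'http://www.example.com', 'http://www.example.org'
```

Because all requests are in flight at the same time, the total runtime should be closer to that of the slowest response than to the sum of all responses.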

Sending data is also quite easy to accomplish using PostAsync():

$data = New-Object 'system.collections.generic.dictionary[string,string]'
$data.Add('param1', 'value1')
$data.Add('param2', 'value2')
$postData = new-object System.Net.Http.FormUrlEncodedContent($data)
$task = $httpClient.PostAsync('http://www.example.com', $PostData)

Again, no headers are set automatically, but they can be easily set:

$userAgent = 'Mozilla/5.0 (Windows NT; Windows NT 10.0; en-US)'
$httpClient.DefaultRequestHeaders.Add('User-Agent', $userAgent)

Proxy Support

This method also uses the proxy defined in the Windows settings by default. To set it manually, it has proven very useful to use a custom HttpClientHandler and apply the desired settings there:

Add-Type -AssemblyName System.Net.Http
$httpHandler = New-Object System.Net.Http.HttpClientHandler
$cookieJar = New-Object System.Net.CookieContainer
$httpHandler.CookieContainer = $cookieJar
$proxy = New-Object System.Net.WebProxy('http://127.0.0.1:8080')
$httpHandler.Proxy = $proxy
$httpClient = new-object System.Net.Http.HttpClient($httpHandler)

Sessions and Cookies

The above example also shows how cookies and thus cookie-based sessions can be used by assigning a cookie container to the HTTP handler.

TLS and Certificate Checking

If communication is requested with a web server that uses invalid or untrusted certificates, it can be useful to disable the corresponding certificate check. With the command Invoke-Webrequest this is easily possible – starting from Powershell version 6 – with the help of the parameter -SkipCertificateCheck. Furthermore, the parameter -SslProtocol can be used to specify the SSL/TLS versions to be used.

With lower versions of Powershell or with the other methods described above this requires a more elaborate approach. In case of SSL/TLS certificate errors, a callback is called internally. Therefore, one can override the certificate check using a custom callback, which is possible by adding some C# code using Add-Type:

Add-Type @"
using System;
using System.Net;
using System.Net.Security;
using System.Security.Cryptography.X509Certificates;
public class IgnoreCertValidErr
{
    public static void Ignore()
    {
        ServicePointManager.ServerCertificateValidationCallback = 
            delegate
            (
                Object sender, 
                X509Certificate certificate, 
                X509Chain chain, 
                SslPolicyErrors sslPolicyErrors
            )
            {
                return true;
            };
    }
}
"@

[IgnoreCertValidErr]::Ignore()

It is also possible to specify the desired SSL/TLS protocol versions:

[System.Net.ServicePointManager]::SecurityProtocol = 'tls12, tls11'

The enumeration (enum) SecurityProtocolType contains all possible values.
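
The available values can be listed directly from the enumeration, and individual values can be combined instead of using the string form. This is a short sketch; it assumes a .NET version that already defines Tls11 and Tls12 (4.5 or later).

```powershell
# List the names of all protocol versions known to the installed .NET version
[Enum]::GetNames([System.Net.SecurityProtocolType])

# Combine individual values with a bitwise OR; same effect as the string 'tls12, tls11'
$protocols = [System.Net.SecurityProtocolType]::Tls12 -bor [System.Net.SecurityProtocolType]::Tls11
[System.Net.ServicePointManager]::SecurityProtocol = $protocols
```

Using the enumeration members instead of a string has the advantage that a misspelled protocol name is reported as an error right where the member is referenced.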

This method can be used to disable certificate checking for all of the described web request techniques.

Data Analysis and Extraction

Invoke-Webrequest already does extensive parsing of the retrieved content and makes it available as properties of the result object. This can be very helpful in some cases, because useful results can be achieved with just a few lines of code. However, the methods that rely on the ParsedHtml property only work with Powershell versions 3 – 5. Below are a few examples.

List all links of a web page:

(Invoke-WebRequest -Uri 'https://www.admin.ch').Links.Href | Sort-Object | Get-Unique

List all image links of a web page:

(Invoke-WebRequest -Uri 'https://www.admin.ch').Images | Select-Object src

Extract information about forms and input fields:

(Invoke-WebRequest 'http://www.google.com').Forms
(Invoke-WebRequest 'http://www.google.com').InputFields

Extract elements by their class name:

(Invoke-WebRequest -Uri 'https://www.meteoschweiz.admin.ch/home.html?tab=report').ParsedHtml.getElementsByClassName('textFCK') | %{Write-Host $_.innertext}

Other useful methods are getElementById, getElementsByName and getElementsByTagName.

Simple processes such as filling and submitting a form can be automated:

$url = 'https://www.google.com'

$html = Invoke-WebRequest -Uri $url -SessionVariable 'Session' -UserAgent 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:85.0) Gecko/20100101 Firefox/85.0'
$html.Forms[0].Fields['q'] = 'powershell'
$action = $html.Forms[0].Action
$method = $html.Forms[0].Method
$res = Invoke-WebRequest -Uri "$url$action" -Method $method -Body $html.Forms[0].Fields -UserAgent 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:85.0) Gecko/20100101 Firefox/85.0' -WebSession $Session

Information can also be extracted directly from the retrieved content using built-in tools of Powershell. However, this is only advisable for the very simplest cases, because processing HTML and XML with pattern matching such as regular expressions is almost always bound to fail as soon as the markup deviates slightly from what the pattern expects.

$wc = New-Object System.Net.WebClient

($wc.DownloadString('http://www.myip.ch/') | Select-String -Pattern "\d{1,3}(\.\d{1,3}){3}" -AllMatches).Matches.Value

$wc = New-Object System.Net.WebClient
$res = $wc.DownloadString('https://wisdomquotes.com/dalai-lama-quotes-tenzin-gyatso/')
([regex]'<blockquote><p>(.*?)</p></blockquote>').Matches($res) | ForEach-Object { $_.Groups[1].Value }
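
To try the pattern-based approach without a network connection, the same expression can be applied to a literal string. The HTML fragment below is invented for this sketch.

```powershell
# Invented HTML fragment standing in for a downloaded page
$html = @'
<html><body>
<blockquote><p>First quote.</p></blockquote>
<blockquote class="x"><p>Second quote.</p></blockquote>
<blockquote><p>Third quote.</p></blockquote>
</body></html>
'@

# Same expression as above: it only matches this exact sequence of tags
$quotes = ([regex]'<blockquote><p>(.*?)</p></blockquote>').Matches($html) |
    ForEach-Object { $_.Groups[1].Value }

$quotes
```

Only the first and the third quote are returned. The second one is skipped because its blockquote tag carries an attribute, which shows how easily such patterns break on small variations in the markup.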

PowerHTML Module

A more convenient method is available with the Powershell module PowerHTML. This module is a Powershell wrapper around the HtmlAgilityPack, which provides a complete HTML parser. This offers powerful capabilities, such as the use of the XPath syntax. It is also quite useful in cases where the Internet Explorer HTML DOM is not available, for example when using Invoke-Webrequest under Powershell 7.

When installing Powershell modules from the Powershell Gallery, it is advisable to check the dependencies before installation to understand which additional and possibly unwanted modules will also be installed. Fortunately, PowerHTML does not require installation of any additional modules.

Installation of the module could be achieved as follows:

function loadPowerHtml
{
    if (-not (Get-Module -ErrorAction Ignore -ListAvailable PowerHTML))
    {
        Write-Host "Installing PowerHTML module"
        Install-Module PowerHTML -Scope CurrentUser -ErrorAction Stop
    }

    Import-Module -ErrorAction Stop PowerHTML
}

The following example shows how to access the paragraph elements of www.example.com:

$wc = New-Object System.Net.WebClient
$res = $wc.DownloadString('http://www.example.com')
$html = ConvertFrom-Html -Content $res
$p = $html.SelectNodes('/html/body/div/p')

All <p> elements are now contained in $p and can be processed accordingly:

NodeType Name AttributeCount ChildNodeCount ContentLength InnerText
-------- ---- -------------- -------------- ------------- ---------
Element  p    0              1              156           This domain is for...
Element  p    0              1              70            More information...

Output the text contained in an element:

$p[0].innerText

Find a specific element based on its text content:

($p | Where-Object { $_.innerText -match 'domain' }).innerText

Extraction of a table by its class name:

$wc = New-Object System.Net.WebClient
$res = $wc.DownloadString('https://www.imdb.com/title/tt0057012/')
$html = ConvertFrom-Html -Content $res
$table = $html.SelectNodes('//table') | Where-Object { $_.HasClass('cast_list') }

$data = @()
$cnt = 0
foreach ($row in $table.SelectNodes('tr'))
{
    $cnt += 1
    # skip header row
    if ($cnt -eq 1) { continue }

    # actor name and character name, with line breaks and whitespace normalized
    $a = $row.SelectSingleNode('td[2]').innerText.Trim() -replace "`n|`r|\s+", " "
    $c = $row.SelectSingleNode('td[4]').innerText.Trim() -replace "`n|`r|\s+", " "

    # use a separate variable instead of overwriting the loop variable $row
    $entry = New-Object -TypeName psobject
    $entry | Add-Member -MemberType NoteProperty -Name Actor -Value $a
    $entry | Add-Member -MemberType NoteProperty -Name Character -Value $c

    $data += $entry
}
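
Once the rows are available as objects, the usual Powershell cmdlets can be used for further processing. The following sketch uses placeholder rows with the same two properties as the objects built in the loop; the values are not real data from the page.

```powershell
# Placeholder rows shaped like the objects built in the loop above
$rows = @(
    [pscustomobject]@{ Actor = 'Actor A'; Character = 'Character A' }
    [pscustomobject]@{ Actor = 'Actor B'; Character = 'Character B' }
)

# Filter the rows like any other Powershell objects
$match = $rows | Where-Object { $_.Actor -like '*B' } | Select-Object -ExpandProperty Character

# Convert to CSV text (Export-Csv would write the same data to a file)
$csv = $rows | ConvertTo-Csv -NoTypeInformation
```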

When analyzing content with highly nested elements, it can quickly become difficult to find the correct XPath expression. The Inspector of browsers such as Firefox, Chrome or Edge offers help here: In the Inspector, an element can be selected and then the Copy XPath function is available in the context menu.
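
XPath expressions can also be tried out without any additional module by using the built-in Select-Xml cmdlet, provided the markup is well-formed XML. This is not how PowerHTML works internally and most real web pages are not valid XML, so the sketch below, with its invented fragment, is only meant for experimenting with the syntax.

```powershell
# Invented, well-formed fragment; real-world HTML is often not valid XML
$xhtml = @'
<html><body><div>
<p class="intro">First paragraph</p>
<p>Second paragraph</p>
</div></body></html>
'@

# Absolute path, similar to what the Copy XPath function of a browser produces
$all = Select-Xml -Content $xhtml -XPath '/html/body/div/p' |
    ForEach-Object { $_.Node.InnerText }

# Attribute predicate to pick a single element regardless of its position
$intro = (Select-Xml -Content $xhtml -XPath '//p[@class="intro"]').Node.InnerText
```

Expressions that select by attribute tend to survive layout changes of a page better than long absolute paths.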

Conclusion

A confusing number of options exist to retrieve content from websites using Powershell. In many cases, it is advisable to use Invoke-Webrequest and make use of its automatic parsing capabilities. If complete control or advanced capabilities are required, System.Net.Http.HttpClient is currently the best option.

For the subsequent extraction of data, the simplest cases only require a few on-board Powershell tools. More comprehensive possibilities are available through additional Powershell modules.

About the Author

Tomaso Vasella

Tomaso Vasella holds a Master's degree in Organic Chemistry from ETH Zürich. He has been working in the cybersecurity field since 1999 and has worked as a consultant, engineer, auditor and business developer. (ORCID 0000-0002-0216-1268)
