Security Testing
Tomaso Vasella
How to use Powershell for Webscraping
Although modern websites increasingly offer APIs and many web applications use APIs for data access, web pages are usually formatted for human consumption. Technically, content data, such as the text of an article, is mixed with control data, metadata, formatting information, images, and other data. While this data is necessary for the functionality and look and feel of a website, it is rather a hindrance from a data extraction perspective. Therefore, suitable tools are needed to retrieve the desired information from this data medley.
There are many reasons for such data extractions. For example, it could be desired to create an RSS feed based on the content of a web page, or you may want to collect publicly available information as part of a Red Team engagement. Although web scraping sounds fairly simple, in practice it can prove to be surprisingly tedious. Powershell may not necessarily be the first choice for this task, but given its prevalence, it’s worth knowing its capabilities. Additionally, it is sometimes necessary to operate in restricted environments where alternatives such as Python or Perl are not available.
The following table gives an overview of the most common methods to perform HTTP requests using Powershell.
| Method | Technology | PS Version | Remarks |
|---|---|---|---|
| Invoke-WebRequest | Powershell cmdlet | 3 or higher | Performs extensive parsing, except when using Powershell version 7, since in that version the HTML DOM of Internet Explorer is no longer available. |
| System.Net.WebClient | .NET | 2 or higher | According to Microsoft: We don’t recommend that you use the WebClient class for new development. Instead, use the System.Net.Http.HttpClient class. |
| System.Net.HttpWebRequest | .NET | 2 or higher | According to Microsoft: We don’t recommend that you use the HttpWebRequest class for new development. Instead, use the System.Net.Http.HttpClient class. |
| System.Net.Http.HttpClient | .NET | 3 or higher | |
Besides the methods described above, several other possibilities exist; for example, an Internet Explorer COM automation object can be instantiated via Powershell, or other COM objects such as Msxml2.XMLHTTP can be used.
$ie = New-Object -ComObject "InternetExplorer.Application"
$ie.navigate('https://www.example.com')
$ie.document
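As an illustration, a request using the Msxml2.XMLHTTP COM object mentioned above could look like the following sketch (synchronous mode):

$xhr = New-Object -ComObject 'Msxml2.XMLHTTP'
# open(method, url, async): $false performs the request synchronously
$xhr.open('GET', 'https://www.example.com', $false)
$xhr.send()
# the response body is available as text
$xhr.responseText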
The Invoke-WebRequest cmdlet is probably the most common way to use Powershell for retrieving data from a website:
$res = Invoke-WebRequest -Uri 'http://www.example.com'
$res = Invoke-WebRequest -Uri 'http://www.example.com' -OutFile index.html
$res = Invoke-WebRequest -Uri 'http://www.example.com' -Method POST -Body "postdata"
In contrast to the other methods described in this article, Invoke-WebRequest automatically sets the User-Agent header. This can be influenced using the parameter -UserAgent. In addition, the parameter -Headers exists; however, it does not allow customizing certain headers such as User-Agent. Full control over the headers can only be achieved with the .NET classes described below.
If no error occurs, an object of the type Microsoft.PowerShell.Commands.HtmlWebResponseObject is returned. Depending on the version of Powershell, the properties of this object differ: in version 5, the content supplied by the web server is parsed and available via the .ParsedHtml property. This property is missing under Powershell 7.
The automatic parsing can be very useful for quick web scraping tasks. For example, all links of a web page can be extracted with one simple command:
(Invoke-WebRequest -Uri 'https://www.example.com').Links.Href
The entire body of the downloaded page is stored in the .Content property of the result object. If the headers are also desired, they can be retrieved from the .RawContent property.
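Both properties can be inspected directly on the result object, for example:

$res = Invoke-WebRequest -Uri 'https://www.example.com'
$res.Content      # response body only
$res.RawContent   # status line, headers and body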
If an error occurs, no object is returned. However, the error message contained in the exception can be retrieved:
try {
    $res = Invoke-WebRequest -Uri 'http://getstatuscode.com/401'
} catch [System.Net.WebException] {
    $ex = $_.Exception
}
$msg = $ex.Message
$status = $ex.Response.StatusCode.value__
In case of errors where the web server does return content, for example a custom 401 error page, this content is also accessible via the exception as follows (access to the response headers is also shown):
try {
    $res = Invoke-WebRequest -Uri 'http://getstatuscode.com/401'
} catch {
    # the response object is available on the exception
    $resp = $_.Exception.Response
    $reader = New-Object System.IO.StreamReader($resp.GetResponseStream())
    $content = $reader.ReadToEnd()
    foreach ($header in $resp.Headers.AllKeys) {
        Write-Host $('{0}: {1}' -f $header, $resp.Headers[$header])
    }
}
When sending multiple requests, e.g. automated in a loop, it sometimes becomes noticeable that Invoke-WebRequest can be very slow. This is due to the extensive parsing of the response that is enabled by default. Additionally, a progress bar is often displayed for each request. The following parameters and settings can help to speed things up:
The parameter -UseBasicParsing does a little less in-depth processing of the server response, but the Content, RawContent, Links, Images and Headers properties are still populated. Setting $ProgressPreference = 'SilentlyContinue' suppresses the progress bar (see also this Microsoft article). By default, Invoke-WebRequest automatically follows up to 5 redirects; this can be influenced with the parameter -MaximumRedirection.
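Combined, this could look like the following sketch (the redirect limit of 2 is an arbitrary value for illustration):

$ProgressPreference = 'SilentlyContinue'
$res = Invoke-WebRequest -Uri 'https://www.example.com' -UseBasicParsing -MaximumRedirection 2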
Invoke-WebRequest uses the proxy defined in the Windows settings by default. This can be overridden with the -Proxy parameter, which takes a URI as argument. If the proxy requires authentication, the credentials can be specified with the -ProxyCredential parameter, which requires an argument of type PSCredential. This could look something like this:
$secPw = ConvertTo-SecureString '************' -AsPlainText -Force
$creds = New-Object System.Management.Automation.PSCredential -ArgumentList 'username', $secPw
$res = Invoke-WebRequest -Uri 'https://www.example.com' -Proxy 'http://127.0.0.1:8080' -ProxyCredential $creds
Alternatively, the parameter -ProxyUseDefaultCredentials can be specified, which results in using the credentials of the current user.
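For example (proxy address taken from the example above):

$res = Invoke-WebRequest -Uri 'https://www.example.com' -Proxy 'http://127.0.0.1:8080' -ProxyUseDefaultCredentials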
Invoke-WebRequest is able to use cookies and thus also supports cookie-based sessions by specifying the -SessionVariable parameter. The corresponding cookies / session can then be used in subsequent requests with the -WebSession parameter.
Invoke-WebRequest -SessionVariable Session -Uri 'https://www.google.com'
Invoke-WebRequest -WebSession $Session -Uri 'https://www.google.com'
The cookies are available via the .Cookies property of the session object and it is possible to define custom cookies. In the following example, the session object is created in advance, which can be useful to set custom cookies before executing the first request:
$cookie = New-Object System.Net.Cookie
$cookie.Name = "specialCookie"
$cookie.Value = "value"
$cookie.Domain = "domain"
$session = New-Object Microsoft.PowerShell.Commands.WebRequestSession
$session.Cookies.Add($cookie)
Invoke-WebRequest -WebSession $session -Uri 'https://www.example.com'
Accessing websites is also quite easy with System.Net.WebClient, as the following examples show.
$wc = New-Object System.Net.WebClient
$res = $wc.DownloadString('https://www.example.com')
In contrast to using Invoke-WebRequest, this method does not set headers automatically. It is possible to specify them though, for example to set the user agent, which is usually advisable since web pages sometimes behave differently without it.
$wc.Headers.Add('User-Agent', 'Mozilla/5.0 (Windows NT; Windows NT 10.0; en-US)')
Should an error occur, any content supplied by the web server is not made available in the result object. To access it, a similar procedure as described above can be used. However, this does not work with Powershell versions older than 2, since try-catch constructs are not available there.
try {
    $res = $wc.DownloadString('http://getstatuscode.com/401')
} catch [System.Net.WebException] {
    $rs = $_.Exception.Response.GetResponseStream()
    $reader = New-Object System.IO.StreamReader($rs)
    $content = $reader.ReadToEnd()
}
By default, this method also uses the proxy defined in the Windows settings, which can be retrieved with [System.Net.WebProxy]::GetDefaultProxy(). It is also possible to explicitly define a proxy:
$proxy = New-Object System.Net.WebProxy('http://127.0.0.1:8080')
$proxyCreds = New-Object Net.NetworkCredential('username', '************')
$wc = New-Object System.Net.WebClient
$wc.Proxy = $proxy
$wc.Proxy.Credentials = $proxyCreds
Alternatively, the default credentials of the current user can be used:
$wc.UseDefaultCredentials = $true
$wc.Proxy.Credentials = $wc.Credentials
System.Net.WebClient does not provide convenient methods for handling cookies. The corresponding response headers must be read and processed manually, which can easily get tedious.
$cookie = $wc.ResponseHeaders["Set-Cookie"]
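Conversely, sending a cookie with a WebClient request also has to be done via the request headers. A minimal sketch (cookie name and value are placeholders):

$wc = New-Object System.Net.WebClient
# cookie name and value chosen for illustration
$wc.Headers.Add('Cookie', 'sessionid=example')
$res = $wc.DownloadString('https://www.example.com')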
System.Net.HttpWebRequest is also relatively simple to use and follows the same principles as the methods already described.
$wr = [System.Net.HttpWebRequest]::Create('http://www.example.com')
try {
    $res = $wr.GetResponse()
} catch [System.Net.WebException] {
    $ex = $_.Exception
}
$rs = $res.GetResponseStream()
$rsReader = New-Object System.IO.StreamReader $rs
$data = $rsReader.ReadToEnd()
POST requests are also easily possible:
$data = [byte[]][char[]]'postdatastring'
$wr = [System.Net.HttpWebRequest]::Create('http://ptsv2.com/t/g1eiu-1612179889/post')
$wr.Method = 'POST'
$requestStream = $wr.GetRequestStream()
$requestStream.Write($data, 0, $data.Length)
# close the request stream before requesting the response
$requestStream.Close()
try {
    $res = $wr.GetResponse()
} catch [System.Net.WebException] {
    $ex = $_.Exception
}
$rs = $res.GetResponseStream()
$rsReader = New-Object System.IO.StreamReader $rs
$data = $rsReader.ReadToEnd()
Should an error occur, any possible content sent by the server is also not available in the response object, but can be read from the exception thrown. Headers are not set automatically, but they can be specified easily:
# User-Agent is a restricted header and is set via the corresponding property
$wr.UserAgent = 'Mozilla/5.0 (Windows NT; Windows NT 10.0; en-US)'
This method also uses the proxy defined in the Windows settings by default, which can easily be overridden analogously to the example above:
$proxy = New-Object System.Net.WebProxy('http://127.0.0.1:8080')
$wr.Proxy = $proxy
Cookies can be used by defining a cookie container. This can also be done before sending the request, which is submitted with GetResponse().
$cookieJar = New-Object System.Net.CookieContainer
$wr.CookieContainer = $cookieJar
$res = $wr.GetResponse()
With HttpClient, the .NET Framework offers yet another class for web requests, which can be used via Add-Type. The snippet shown below also returns the content supplied by the server in the event of an error (e.g. 404), if available.
Add-Type -AssemblyName System.Net.Http
$httpClient = New-Object System.Net.Http.HttpClient
try {
    $task = $httpClient.GetAsync('http://www.example.com')
    $task.Wait()
    if ($task.IsFaulted) {
        Write-Host $('Error: Status {0}, reason {1}.' -f [int]$task.Status, $task.Exception.Message)
    }
    $res = $task.Result
    return $res.Content.ReadAsStringAsync().Result
} catch [Exception] {
    Write-Host ('Error: {0}' -f $_)
} finally {
    if ($null -ne $res) { $res.Dispose() }
}
As can be seen in this example, GetAsync() is executed asynchronously, i.e. it does not block and can therefore be parallelized. However, if the supplied content is to be processed immediately, it is necessary to wait for the result as shown above.
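A minimal sketch of starting several requests in parallel and only then waiting for all results (URLs chosen for illustration, $httpClient as created above):

$urls = 'http://www.example.com', 'http://www.example.org'
# start all requests without blocking
$tasks = foreach ($url in $urls) { $httpClient.GetAsync($url) }
# wait until every request has completed
[System.Threading.Tasks.Task]::WaitAll([System.Threading.Tasks.Task[]]$tasks)
foreach ($task in $tasks) { $task.Result.StatusCode }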
Sending data is also quite easy to accomplish using PostAsync():
$data = New-Object 'System.Collections.Generic.Dictionary[string,string]'
$data.Add('param1', 'value1')
$data.Add('param2', 'value2')
$postData = New-Object System.Net.Http.FormUrlEncodedContent($data)
$task = $httpClient.PostAsync('http://www.example.com', $postData)
Again, no headers are set automatically, but they can be easily set:
$httpClient.DefaultRequestHeaders.add('User-Agent', $userAgent)
This method also uses the proxy defined in the Windows settings by default. To set it manually, it has proven useful to use a custom HttpClientHandler and apply the desired settings there:
Add-Type -AssemblyName System.Net.Http
$httpHandler = New-Object System.Net.Http.HttpClientHandler
$cookieJar = New-Object System.Net.CookieContainer
$httpHandler.CookieContainer = $cookieJar
$proxy = New-Object System.Net.WebProxy('http://127.0.0.1:8080')
$httpHandler.Proxy = $proxy
$httpClient = New-Object System.Net.Http.HttpClient($httpHandler)
The above example also shows how cookies and thus cookie-based sessions can be used by assigning a cookie container to the HTTP handler.
When communicating with a web server that uses invalid or untrusted certificates, it can be useful to disable the corresponding certificate check. With Invoke-WebRequest this is easily possible, starting from Powershell version 6, with the help of the parameter -SkipCertificateCheck. Furthermore, the parameter -SslProtocol can be used to specify the SSL/TLS versions to be used.
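Under Powershell 6 or newer this could look like the following (the target URL is a placeholder):

$res = Invoke-WebRequest -Uri 'https://www.example.com' -SkipCertificateCheck -SslProtocol Tls12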
With lower versions of Powershell or with the other methods described above, this requires a more elaborate approach. In case of SSL/TLS certificate errors, a callback is invoked internally. The certificate check can therefore be overridden with a custom callback, which is possible by adding some C# code using Add-Type:
Add-Type @"
using System;
using System.Net;
using System.Net.Security;
using System.Security.Cryptography.X509Certificates;

public class IgnoreCertValidErr
{
    public static void Ignore()
    {
        ServicePointManager.ServerCertificateValidationCallback =
            delegate (Object sender, X509Certificate certificate, X509Chain chain, SslPolicyErrors sslPolicyErrors)
            {
                return true;
            };
    }
}
"@
[IgnoreCertValidErr]::Ignore()
It is also possible to specify the desired SSL/TLS protocol versions:
[System.Net.ServicePointManager]::SecurityProtocol = 'tls12, tls11'
The enumeration (enum) SecurityProtocolType contains all possible values.
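The available values can be listed as follows:

[System.Enum]::GetNames([System.Net.SecurityProtocolType])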
The callback approach described above can be used to disable certificate checking for all of the described web request techniques.
Invoke-WebRequest already does extensive parsing of the retrieved content and makes it available as properties of the result object. This can be very helpful in some cases, because useful results can be achieved with just a few lines of code. However, the methods that rely on the ParsedHtml property only work with Powershell versions 3 to 5. Below are a few examples.
List all links of a web page:
(Invoke-WebRequest -Uri 'https://www.admin.ch').Links.Href | Sort-Object | Get-Unique
List all image links of a web page:
(Invoke-WebRequest -Uri 'https://www.admin.ch').Images | Select-Object src
Extract information about forms and input fields:
(Invoke-WebRequest 'http://www.google.com').Forms
(Invoke-WebRequest 'http://www.google.com').InputFields
Extract elements by their class name:
(Invoke-WebRequest -Uri 'https://www.meteoschweiz.admin.ch/home.html?tab=report').ParsedHtml.getElementsByClassName('textFCK') | %{Write-Host $_.innertext}
Other useful methods are getElementById, getElementsByName and getElementsByTagName.
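For example, all elements of a certain tag could be extracted like this (again requires the IE HTML DOM, i.e. Powershell 3 to 5):

(Invoke-WebRequest -Uri 'https://www.example.com').ParsedHtml.getElementsByTagName('h1') | ForEach-Object { $_.innerText }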
Simple processes such as filling and submitting a form can be automated:
$url = 'https://www.google.com'
$html = Invoke-WebRequest -Uri $url -SessionVariable 'Session' -UserAgent 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:85.0) Gecko/20100101 Firefox/85.0'
$html.Forms[0].Fields['q'] = 'powershell'
$action = $html.Forms[0].Action
$method = $html.Forms[0].Method
$res = Invoke-WebRequest -Uri "$url$action" -Method $method -Body $html.Forms[0].Fields -UserAgent 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:85.0) Gecko/20100101 Firefox/85.0' -WebSession $Session
Information can also be extracted directly from the retrieved content using built-in tools of Powershell. However, this is only advisable for the very simplest cases because processing HTML and XML using pattern recognition is almost always bound to fail.
$wc = New-Object System.Net.WebClient
($wc.DownloadString('http://www.myip.ch/') | Select-String -Pattern "\d{1,3}(\.\d{1,3}){3}" -AllMatches).Matches.Value

$wc = New-Object System.Net.WebClient
$res = $wc.DownloadString('https://wisdomquotes.com/dalai-lama-quotes-tenzin-gyatso/')
([regex]'<blockquote><p>(.*?)</p></blockquote>').Matches($res) | ForEach-Object { $_.Groups[1].Value }
A more convenient method is available with the Powershell module PowerHTML. This module is a Powershell wrapper around the HtmlAgilityPack, which provides a complete HTML parser. This offers powerful capabilities, such as the use of XPath syntax. It is also quite useful in cases where the Internet Explorer HTML DOM is not available, for example with Invoke-WebRequest under Powershell 7.
When installing Powershell modules from the Powershell Gallery, it is advisable to check the dependencies before installation to understand which additional and possibly unwanted modules will also be installed. Fortunately, PowerHTML does not require installation of any additional modules.
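One way to perform such a check before installing is to query the module metadata in the gallery, for example:

(Find-Module -Name PowerHTML).Dependencies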
Installation of the module could be achieved as follows:
function loadPowerHtml {
    if (-not (Get-Module -ErrorAction Ignore -ListAvailable PowerHTML)) {
        Write-Host "Installing PowerHTML module"
        Install-Module PowerHTML -Scope CurrentUser -ErrorAction Stop
    }
    Import-Module -ErrorAction Stop PowerHTML
}
The following example shows how to access the paragraph elements of www.example.com:
$wc = New-Object System.Net.WebClient
$res = $wc.DownloadString('http://www.example.com')
$html = ConvertFrom-Html -Content $res
$p = $html.SelectNodes('/html/body/div/p')
All <p> elements are now contained in $p and can be processed accordingly:
NodeType Name AttributeCount ChildNodeCount ContentLength InnerText
-------- ---- -------------- -------------- ------------- ---------
Element  p    0              1              156           This domain is for...
Element  p    0              1              70            More information...
Output the text contained in an element:
$p[0].innerText
Find a specific element based on its text content:
($p | Where-Object { $_.innerText -match 'domain' }).innerText
Extraction of a table by its class name:
$wc = New-Object System.Net.WebClient
$res = $wc.DownloadString('https://www.imdb.com/title/tt0057012/')
$html = ConvertFrom-Html -Content $res
$table = $html.SelectNodes('//table') | Where-Object { $_.HasClass('cast_list') }
$cnt = 0
foreach ($row in $table.SelectNodes('tr')) {
    $cnt += 1
    # skip header row
    if ($cnt -eq 1) { continue }
    $a = $row.SelectSingleNode('td[2]').innerText.Trim() -replace "`n|`r|\s+", " "
    $c = $row.SelectSingleNode('td[4]').innerText.Trim() -replace "`n|`r|\s+", " "
    $entry = New-Object -TypeName psobject
    $entry | Add-Member -MemberType NoteProperty -Name Actor -Value $a
    $entry | Add-Member -MemberType NoteProperty -Name Character -Value $c
    [array]$data += $entry
}
When analyzing content with highly nested elements, it can quickly become difficult to find the correct XPath expression. The Inspector of browsers such as Firefox, Chrome or Edge offers help here: an element can be selected in the Inspector and the Copy XPath function is then available in the context menu.
A confusing number of options exist for retrieving content from websites using Powershell. In many cases, it is advisable to use Invoke-WebRequest and make use of its automatic parsing capabilities. If complete control or advanced capabilities are required, System.Net.Http.HttpClient is currently the best option.
For the subsequent extraction of data, the simplest cases only require a few on-board Powershell tools. More comprehensive possibilities are available through additional Powershell modules.