Discussion:
structured html without invoke-webrequest
Marcel Müller
2016-06-07 16:21:07 UTC
Hi.

Objective:
a) Get an HTML page and store its content in a variable
b) Navigate within the object by elements

Is:
$var = invoke-webrequest www.somepage.html
Use $var.innertext for parsing

My question is: Is there a way to get the HTML content (so far I have
tried "wget.exe -O filename.html" followed by "$var = get-content
filename.html"), store it in a variable, and still navigate it by its
elements? I tried this today and only ended up with a big chunk of
unstructured text in my variable, which isn't easy to work with (it's
not an object, it's plain text, of course).

Is there a way to achieve what I want? There might be something in .NET
that could be useful, maybe?
--
Marcel
Jeff Zeitlin
2016-06-07 23:10:19 UTC
Post by Marcel Müller
Hi.
a) Get an HTML page and store its content in a variable
b) Navigate within the object by elements
$var = invoke-webrequest www.somepage.html
Use $var.innertext for parsing
Actually, a quick google of "parse HTML powershell" led to
https://superwidgets.wordpress.com/tag/read-html-via-powershell-powershell/
as the first hit; Invoke-WebRequest _is_ what you want, rather than
Get-Content. Invoke-WebRequest gives you a good object, rather than just
text. Rather than using $var.innertext, you want to use the methods of
$var.ParsedHTML. The URI is a decent introduction; I'm sure that other
results from that google will offer more depth.

Another possibility is to treat the HTML as XML, or cast it to XML, and
then use the XML cmdlets and methods, which would also let you use
XPath. The constraint there is that the page must be well-formed XML,
which in practice means XHTML (or HTML 5 written in its XML syntax);
ordinary HTML with unclosed tags or unquoted attributes won't meet the
constraints of XML.
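
As a rough sketch of the XML route, assuming the page really is
well-formed XHTML (the markup and file names below are made up for
illustration), the cast and an XPath query might look like this:

```powershell
# Hypothetical, well-formed XHTML standing in for a downloaded page.
$xhtml = @'
<html>
  <body>
    <a href="package-20160203.zip">old build</a>
    <a href="package-20160405.zip">current build</a>
  </body>
</html>
'@

# The [xml] cast throws if the markup is not well-formed XML.
$doc = [xml]$xhtml

# XPath query for the href attribute of every <a> element.
$hrefs = @($doc.SelectNodes('//a/@href') | ForEach-Object { $_.Value })
$hrefs
```

Real XHTML pages usually declare the XHTML namespace, in which case the
XPath needs a namespace manager (or a namespace-agnostic expression), so
treat this as a starting point only.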
Post by Marcel Müller
My question is: Is there a way to get the HTML content (so far I have
tried "wget.exe -O filename.html" followed by "$var = get-content
filename.html"), store it in a variable, and still navigate it by its
elements? I tried this today and only ended up with a big chunk of
unstructured text in my variable, which isn't easy to work with (it's
not an object, it's plain text, of course).
Is there a way to achieve what I want? There might be something in .NET
that could be useful, maybe?
If all else fails, there _is_ XML support in .NET, with the caveats
above. And since PowerShell is built on .NET and offers fairly easy
access to .NET technologies, you could always "roll your own" HTML
processing to meet your needs.
Marcel Müller
2016-06-15 19:18:42 UTC
Post by Jeff Zeitlin
Post by Marcel Müller
a) Get an HTML page and store its content in a variable
b) Navigate within the object by elements
$var = invoke-webrequest www.somepage.html
Use $var.innertext for parsing
Actually, a quick google of "parse HTML powershell" led to
https://superwidgets.wordpress.com/tag/read-html-via-powershell-powershell/
as the first hit; Invoke-WebRequest _is_ what you want, rather than
Get-Content. Invoke-WebRequest gives you a good object, rather than just
text. Rather than using $var.innertext, you want to use the methods of
$var.ParsedHTML.
[...]

I had already seen that and other sites from searching the web.
Unfortunately it misses the point of my original question, which was to
work *without* invoke-webrequest. *With* invoke-webrequest I have it
working.
Post by Jeff Zeitlin
Another possibility is to treat the HTML as XML
[...]

The website(s) I need to work with are not fully conformant, so treating
them this way will most likely not work.
Post by Jeff Zeitlin
Post by Marcel Müller
My question is: Is there a way to get the HTML content (so far I have
tried "wget.exe -O filename.html" followed by "$var = get-content
filename.html"), store it in a variable, and still navigate it by its
elements?
[...]
Post by Jeff Zeitlin
If all else fails, there _is_ XML support in .NET, with the caveats
above. And since PowerShell is built on .NET and offers fairly easy
access to .NET technologies, you could always "roll your own" HTML
processing to meet your needs.
Since I don't have in-depth experience with .NET, I at least managed to
use System.Net.WebClient to fetch the page without an external program,
and then used PowerShell built-ins to retrieve the information I need.
It doesn't fully meet my objective, but it lets me circumvent
invoke-webrequest.
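
Roughly, that workaround looks like the sketch below. The function
names and the sample markup are made up for illustration, and the
regular expression is plain text matching, not real HTML parsing:

```powershell
# Download a page as plain text; System.Net.WebClient does not involve
# Internet Explorer.
function Get-PageText {
    param([Parameter(Mandatory)][string]$Uri)
    $client = New-Object System.Net.WebClient
    try { $client.DownloadString($Uri) } finally { $client.Dispose() }
}

# Pull the href values out of anchor tags with a regular expression.
# Unusual markup (unquoted attributes, for example) will be missed.
function Get-HrefValues {
    param([Parameter(Mandatory)][string]$Html)
    $pattern = '<a\b[^>]*?\bhref\s*=\s*["'']([^"'']+)["'']'
    [regex]::Matches($Html, $pattern, 'IgnoreCase') |
        ForEach-Object { $_.Groups[1].Value }
}

# Offline demonstration with a made-up fragment (note the unclosed <br>).
$sample = '<p>Downloads<br><a class="dl" href="files/package-20160405.zip">get it</a></p>'
$links = @(Get-HrefValues -Html $sample)
$links
```

Against a live page the two would be chained, e.g.
Get-HrefValues -Html (Get-PageText -Uri $url), with $url being whatever
address is in use.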
--
Marcel
Jeff Zeitlin
2016-06-15 23:18:00 UTC
Post by Marcel Müller
I had already seen that and other sites from searching the web.
Unfortunately it misses the point of my original question, which was to
work *without* invoke-webrequest. *With* invoke-webrequest I have it
working.
Ah, I missed that (I generally "tune out" the subject line; long
experience with bad subject lines). But it raises the question: Why do
you wish to avoid using Invoke-WebRequest? Using it gets you the
structure that you want for using the HTML-centric methods of
$var.ParsedHTML; any other way of getting the page is going to amount to
"suck it in as raw HTML and do it the bash-style text-processing way",
or spending a lot of time writing myriad lines of code to build the
structure for those methods yourself, or writing your own methods,
scripts, functions, or cmdlets/advanced functions to parse the HTML
yourself, which largely defeats the purpose of working in PowerShell in
the first place. And in terms of the communication phase of things,
Invoke-WebRequest really can't do things differently from wget; the
protocols are well-defined and not really subject to interpretation.
Marcel Müller
2016-06-17 05:58:58 UTC
The short answer is: an uninitialized Internet Explorer (that is, one
that has never been started and has no settings whatsoever configured).

The longer version is: The script mostly runs on freshly installed
machines. Some software needs to be installed, and I have a website on
which the current version of that software is available via a link.
Unfortunately the link changes with each release date (e.g. from
package-20160203 to package-20160405), so I cannot use a hardcoded
download link to fetch the file directly (which might fail in the
future anyway if the naming convention changed).
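
To illustrate the link-selection part: once the hrefs are available as
strings, picking the newest package is a small filter-and-sort. The
paths and the .zip extension below are made up; only the
package-YYYYMMDD pattern comes from the actual site.

```powershell
# Made-up link list following the package-YYYYMMDD naming pattern.
$links = @(
    'downloads/package-20160203.zip',
    'downloads/readme.html',
    'downloads/package-20160405.zip'
)

# Keep only links with an eight-digit date stamp and sort by that stamp.
# yyyyMMdd strings sort chronologically as plain text.
$latest = $links |
    Where-Object { $_ -match 'package-(\d{8})' } |
    Sort-Object { [regex]::Match($_, 'package-(\d{8})').Groups[1].Value } |
    Select-Object -Last 1
$latest
```

If the naming convention changes, the filter simply returns nothing, so
the script can at least detect that case and stop.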

So the goal was (or is) to make the script not trip over an
uninitialized Internet Explorer, in which case invoke-webrequest fails
with an error saying that it can't do anything (I don't have the exact
wording at hand).
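
For reference, as far as I recall that error itself points at the
-UseBasicParsing switch, which skips the Internet Explorer engine. The
result then has no usable ParsedHtml, but it still exposes a Links
collection, which may be enough for finding a download link. A sketch,
with a hypothetical helper name and a placeholder URL:

```powershell
function Get-PageLinks {
    param([Parameter(Mandatory)][string]$Uri)
    # -UseBasicParsing avoids the IE-based DOM parser, so ParsedHtml is
    # not populated on the response.
    $response = Invoke-WebRequest -Uri $Uri -UseBasicParsing
    # Each entry in Links carries an href property (among others).
    $response.Links | ForEach-Object { $_.href }
}

# Usage (placeholder URL, not executed here):
# Get-PageLinks -Uri 'http://www.example.com/downloads.html' |
#     Where-Object { $_ -match 'package-\d{8}' }
```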

I thought it might be easier to find something in .NET that would give
the same or an equivalent result as the nice object invoke-webrequest
returns, rather than finding some other (and likely more complex) way
to get the underlying Internet Explorer initialized.

I have no problem working with some .split()'s and extracting the
information I need, but as you said, it's "nicer" to work with objects
in Powershell. And since I already have a line that parses the object
returned by invoke-webrequest (if it works), I would gladly keep using
it instead of adding more code to parse a blob of text.

My gut feeling was that there might be a nice and simple way to
instantiate a .NET object, hand over the URL, and get an object to work
with.
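
One candidate for that, as a Windows-only sketch, is the HTMLFile COM
object (strictly speaking COM rather than .NET), which can build a DOM
from a string that was downloaded by other means. The function name is
hypothetical, and whether this sidesteps the first-run problem on a
freshly installed machine is an assumption that would need testing:

```powershell
# Windows-only: build a DOM from an HTML string via the HTMLFile COM object.
function ConvertTo-HtmlDocument {
    param([Parameter(Mandatory)][string]$Html)
    $doc = New-Object -ComObject 'HTMLFile'
    try {
        # Available on many systems, typically those with Office installed.
        $doc.IHTMLDocument2_write($Html)
    } catch {
        # Fallback commonly used where the method above is missing.
        $doc.write([System.Text.Encoding]::Unicode.GetBytes($Html))
    }
    $doc
}

# Usage (page text obtained elsewhere, e.g. via System.Net.WebClient):
# $doc = ConvertTo-HtmlDocument -Html $pageText
# $doc.getElementsByTagName('a') | ForEach-Object { $_.innerText }
```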
--
Marcel