New read_html_live() reads HTML into a real, live HTML browser, meaning
that you can scrape HTML generated by JavaScript. It returns a LiveHTML
object which you can also use to simulate user interactions with the page,
like clicking, typing, and scrolling (#245).
html_table() discards rows without cells (@epiben, #360).
Published by hadley about 2 years ago
.Rd.
Published by hadley almost 3 years ago
Fixes for CRAN
html_table() converts empty tables to empty tibbles (@epiben, #327).
Published by hadley about 3 years ago
html_table() correctly handles tables with cells that contain blank values
for rowspan and/or colspan, so that e.g. <td rowspan=""> is parsed as
<td rowspan=1> (@epiben, #323).
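As a rough illustration of the blank-rowspan behaviour, the sketch below parses a small table built with minimal_html(). The table contents are made up for the example, and it assumes an rvest version that includes this fix.

```r
library(rvest)

# Hypothetical table: the first body cell carries a blank rowspan attribute
html <- minimal_html('
  <table>
    <tr><th>x</th><th>y</th></tr>
    <tr><td rowspan="">1</td><td>a</td></tr>
    <tr><td>2</td><td>b</td></tr>
  </table>
')

# The blank rowspan is treated like rowspan=1, so each row keeps two cells
tbl <- html_table(html_element(html, "table"))
print(tbl)
```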
Fix broken example
Published by hadley over 3 years ago
(to be released as rvest 1.0.0)
New html_text2() provides a more natural rendering of HTML nodes into text,
converting <br> into "\n", and removing non-significant whitespace (#175).
By default, it also converts &nbsp; into regular spaces, which you can
suppress with preserve_nbsp = TRUE (#284).
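A minimal sketch of the difference between html_text() and html_text2(), using made-up HTML snippets built with minimal_html(); the variable names are only for illustration.

```r
library(rvest)

# Hypothetical paragraph with a line break and runs of spaces
html <- minimal_html("<p>First line.<br>Second   line   with   extra   spaces.</p>")
p <- html_element(html, "p")

raw_text <- html_text(p)    # raw text content of the node
nice_text <- html_text2(p)  # <br> becomes "\n" and runs of spaces collapse

# Hypothetical paragraph containing a non-breaking space
nbsp <- html_element(minimal_html("<p>a&nbsp;b</p>"), "p")
default_nbsp <- html_text2(nbsp)                     # converted to a regular space
kept_nbsp <- html_text2(nbsp, preserve_nbsp = TRUE)  # non-breaking space is kept

print(nice_text)
```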
html_table() has been re-written from scratch to more closely mimic the
algorithm that browsers use for parsing tables. This should mean that there
are far fewer tables for which it fails to produce some output (#63, #204,
#215). The fill argument has been deprecated since it is no longer needed.
html_table() now returns a tibble rather than a data frame to be compatible
with the rest of the tidyverse (#199). Its performance has been considerably
improved (#237). It also gains a na.strings argument to control what values
are converted to NA (#107), and a convert argument to control whether to
run the conversion (#311).
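The na.strings and convert arguments can be sketched as follows. The table and the "N/A" marker are invented for the example; only the argument names come from the notes above.

```r
library(rvest)

# Hypothetical table where "N/A" marks a missing score
html <- minimal_html('
  <table>
    <tr><th>id</th><th>score</th></tr>
    <tr><td>1</td><td>10</td></tr>
    <tr><td>2</td><td>N/A</td></tr>
  </table>
')
table_node <- html_element(html, "table")

# Treat both "NA" and "N/A" as missing while converting column types
converted <- html_table(table_node, na.strings = c("NA", "N/A"))

# Skip type conversion entirely; every column stays character
unconverted <- html_table(table_node, convert = FALSE)

print(converted)
print(unconverted)
```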
New html_form_submit() allows you to submit a form directly, without
needing to create a session (#300).
rvest is now licensed as MIT (#287).
Since this is the 1.0.0 release, I included a large number of API changes to make rvest more compatible with current tidyverse conventions. Older functions have been deprecated, so existing code will continue to work (albeit with a few new warnings).
rvest now imports xml2 rather than depending on it. This is cleaner because
it avoids attaching all the xml2 functions that you're less likely to use.
To reduce the chance of breakage, rvest re-exports the xml2 functions
read_html() and url_absolute(), but your code may now need an explicit
library(xml2).
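In practice this might look like the sketch below: read_html() is available through rvest, while other xml2 functions need an explicit namespace or library(xml2). The HTML string is made up, and xml2::xml_attr() is used only as an example of a function that rvest does not re-export.

```r
library(rvest)

# read_html() is re-exported by rvest, so this works without library(xml2)
doc <- read_html("<p id='intro'>Hello</p>")
p <- html_element(doc, "p")

# Other xml2 functions are no longer attached by library(rvest);
# call them with an explicit namespace, or attach xml2 yourself
id <- xml2::xml_attr(p, "id")
print(id)
```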
html_form() now returns an object with class rvest_form (instead of form).
Fields within a form now have class rvest_field, instead of a variety of
classes that were lacking the rvest_ prefix. All functions for working with
forms have a common html_form_ prefix: set_values() became html_form_set().
submit_form() was renamed to session_submit() because it returns a session.
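A small offline sketch of the renamed form helpers. The form markup and the example.com base URL are assumptions made for illustration; nothing is submitted.

```r
library(rvest)

# Hypothetical search form
html <- minimal_html('
  <form action="/search" method="get">
    <input type="text" name="q">
    <input type="submit" name="go" value="Search">
  </form>
')

# html_form() returns a list of forms when given a document; take the first.
# base_url is supplied because this document was not read from a URL.
form <- html_form(html, base_url = "https://example.com")[[1]]

# html_form_set() replaces the old set_values()
form <- html_form_set(form, q = "rvest")

print(class(form))
print(form$fields$q$value)
```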
html_node() and html_nodes() have been superseded in favor of html_element()
and html_elements() since they (almost) always return elements, not nodes
(#298).
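For code being migrated, the newer names can be used roughly like this; the list markup is invented for the example.

```r
library(rvest)

# Hypothetical list
html <- minimal_html("<ul><li>apple</li><li>banana</li></ul>")

# html_elements() (previously html_nodes()) returns every match
items <- html_elements(html, "li")
item_text <- html_text2(items)

# html_element() (previously html_node()) returns the first match
first_text <- html_text2(html_element(html, "li"))

print(item_text)
print(first_text)
```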
html_session() is now session() and returns an object of class rvest_session
(instead of session). All functions that work with session objects now have
a common session_ prefix.
The long-deprecated html(), html_tag(), and xml() functions have been removed.
minimal_html() (which doesn't appear to be used by any other package)
has had its arguments flipped to make it more intuitive.
guess_encoding() has been renamed to html_encoding_guess() to avoid
a clash with stringr::guess_encoding() (#209). repair_encoding() has
been deprecated because it doesn't appear to work.
pluck() is no longer exported to avoid a clash with purrr::pluck();
if you need it, use purrr::map_chr() and friends instead (#209).
xml_tag(), xml_node(), and xml_nodes() have been formally deprecated
in favor of their html_ equivalents.
The "harvesting the web" vignette has been rewritten to focus more on the
basics of rvest, eliminating the screenshots to keep the installed package
as svelte as possible. It's also been renamed to vignette("rvest") since
it's the vignette that you should read first.
The SelectorGadget vignette is now a web-only article,
https://rvest.tidyverse.org/articles/articles/selectorgadget.html,
so we can be more generous with screenshots since they're no longer bundled
with every install of the package. Together with the rewrite of the other
vignette, this means that rvest is now ~90 Kb instead of ~1.1 Mb.
All uses of IMDB have been eliminated since the site explicitly prohibits
scraping (#195).
session_submit() errors if form doesn't have a url (#288).
New session_forward() function to complement session_back().
It now allows you to pick the submission button by position (#156).
The ... argument is deprecated; please use config instead.
html_form_set() can now accept character vectors allowing you to select
multiple checkboxes in a set or select multiple values from a multi-<select>
(#127, with help from @juba). It also uses dynamic dots so that you can use
!!! if you have a list of values (#189).
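The sketch below shows one way the character-vector and !!! forms of html_form_set() might be used. The form markup, field names, and base URL are all hypothetical, and nothing is submitted.

```r
library(rvest)

# Hypothetical preferences form with a multi-<select> and a text field
html <- minimal_html('
  <form action="/prefs" method="post">
    <select name="colour" multiple>
      <option value="red">Red</option>
      <option value="green">Green</option>
      <option value="blue">Blue</option>
    </select>
    <input type="text" name="user">
  </form>
')
form <- html_form(html, base_url = "https://example.com")[[1]]

# A character vector selects several values at once
form <- html_form_set(form, colour = c("red", "blue"))

# Dynamic dots: splice a list of values with !!!
values <- list(user = "hadley")
form <- html_form_set(form, !!!values)

print(form$fields$colour$value)
print(form$fields$user$value)
```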
Published by hadley about 4 years ago
Published by hadley almost 5 years ago
Published by hadley over 5 years ago
read_xml.response() method (#242).
Published by hadley over 5 years ago
Fix R CMD check failure
submit_request() now checks for empty form-field-types to select the
correct submit fields (@rentrop, #159).
Published by hadley over 8 years ago
follow_link() and back() to correctly manage session history.
html_node() will now return a "missing node".
Published by hadley almost 9 years ago
<options> that don't have value attribute (#85).
html() in favor of read_html() (@jimhester,
Published by hadley about 9 years ago
html_tag() -> html_name()
html() -> read_html()
html_node() now throws an error if there are no matches, and a warning
xml_structure() has been moved to xml2. New html_structure() (also in
submit_form() now works with forms that use GET (#66).
submit_request() (and hence submit_form()) is now case-insensitive,
<input type=SUBMIT> as well as <input type="submit">.
submit_request() (and hence submit_form()) recognizes forms with
<input type="image"> as a valid form submission button per
Published by hadley almost 10 years ago
html() and xml() pass ... on to httr::GET() so you can more
xml(), then work with using xml_node(), xml_attr(), xml_attrs(), xml_text()
and xml_tag() (#24).
xml_structure(): new function that displays the structure (i.e. tag
follow_link() now accepts css and xpath selectors. (#38, #41, #42)
html() does a better job of dealing with encodings (passing the
XML::parseHTML()) instead of trying to do it itself
html_attr() returns default value when input is NULL (#49)
html_node() method for session.
html_nodes() now returns an empty list if no elements are found (#31).
submit_form() converts relative paths to absolute URLs (#52).
Published by hadley almost 10 years ago
Initial release