1 package WWW::Mechanize;
5 WWW::Mechanize - Handy web browsing in a Perl object
13 our $VERSION = '1.34';
17 C<WWW::Mechanize>, or Mech for short, helps you automate interaction with
18 a website. It supports performing a sequence of page fetches including
19 following links and submitting forms. Each fetched page is parsed and
20 its links and forms are extracted. A link or a form can be selected, form
21 fields can be filled and the next page can be fetched. Mech also stores
22 a history of the URLs you've visited, which can be queried and revisited.
25 my $mech = WWW::Mechanize->new();
29 $mech->follow_link( n => 3 );
30 $mech->follow_link( text_regex => qr/download this/i );
31 $mech->follow_link( url => 'http://host.com/index.html' );
37 password => 'lost-and-alone',
42 form_name => 'search',
43 fields => { query => 'pot of gold', },
44 button => 'Search Now'
Mech is well suited for use in testing web applications. If you use
one of the Test::* modules, like L<Test::HTML::Lint>, you can check the
fetched content and use that as input to a test call.
53 like( $mech->content(), qr/$expected/, "Got expected content" );
55 Each page fetch stores its URL in a history stack which you can
60 If you want finer control over your page fetching, you can use
61 these methods. C<follow_link> and C<submit_form> are just high
62 level wrappers around them.
64 $mech->find_link( n => $number );
65 $mech->form_number( $number );
66 $mech->form_name( $name );
67 $mech->field( $name, $value );
68 $mech->set_fields( %field_values );
69 $mech->set_visible( @criteria );
70 $mech->click( $button );
72 L<WWW::Mechanize> is a proper subclass of L<LWP::UserAgent> and
73 you can also use any of L<LWP::UserAgent>'s methods.
75 $mech->add_header($name => $value);
77 Please note that Mech does NOT support JavaScript. Please check the
78 FAQ in WWW::Mechanize::FAQ for more.
80 =head1 IMPORTANT LINKS
84 =item * L<http://code.google.com/p/www-mechanize/issues/list>
86 The queue for bugs & enhancements in WWW::Mechanize and
87 Test::WWW::Mechanize. Please note that the queue at L<http://rt.cpan.org>
88 is no longer maintained.
90 =item * L<http://search.cpan.org/dist/WWW-Mechanize/>
92 The CPAN documentation page for Mechanize.
94 =item * L<http://search.cpan.org/dist/WWW-Mechanize/lib/WWW/Mechanize/FAQ.pod>
Frequently asked questions. Make sure you read this FIRST.
105 use HTTP::Request 1.30;
106 use LWP::UserAgent 2.003;
108 use HTML::TokeParser;
110 use base 'LWP::UserAgent';
114 $HAS_ZLIB = eval 'use Compress::Zlib (); 1;';
117 =head1 CONSTRUCTOR AND STARTUP
121 Creates and returns a new WWW::Mechanize object, hereafter referred to as
124 my $mech = WWW::Mechanize->new()
126 The constructor for WWW::Mechanize overrides two of the parms to the
127 LWP::UserAgent constructor:
129 agent => 'WWW-Mechanize/#.##'
130 cookie_jar => {} # an empty, memory-only HTTP::Cookies object
132 You can override these overrides by passing parms to the constructor,
135 my $mech = WWW::Mechanize->new( agent => 'wonderbot 1.01' );
137 If you want none of the overhead of a cookie jar, or don't want your
138 bot accepting cookies, you have to explicitly disallow it, like so:
140 my $mech = WWW::Mechanize->new( cookie_jar => undef );
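As a hedged illustration of how the two kinds of parms mix, the sketch
below passes Mech-specific and L<LWP::UserAgent> parms in a single call.
The agent string and the timeout value are arbitrary; C<timeout> is
assumed here only because it is a standard L<LWP::UserAgent> parm, not
one of Mech's.

```perl
use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new(
    agent      => 'wonderbot 1.01',    # replaces Mech's default agent string
    cookie_jar => undef,               # explicitly refuse a cookie jar
    autocheck  => 1,                   # Mech-specific parm
    timeout    => 30,                  # passed through to LWP::UserAgent
);

# agent(), timeout() and cookie_jar() are inherited LWP::UserAgent accessors.
print 'agent:   ', $mech->agent, "\n";
print 'timeout: ', $mech->timeout, "\n";
print 'cookies: ', ( defined $mech->cookie_jar ? 'enabled' : 'disabled' ), "\n";
```

No request is made here, so the sketch runs without network access.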
142 Here are the parms that WWW::Mechanize recognizes. These do not include
143 parms that L<LWP::UserAgent> recognizes.
147 =item * C<< autocheck => [0|1] >>
149 Checks each request made to see if it was successful. This saves you
150 the trouble of manually checking yourself. Any errors found are errors,
151 not warnings. Default is off.
153 =item * C<< onwarn => \&func >>
155 Reference to a C<warn>-compatible function, such as C<< L<Carp>::carp >>,
156 that is called when a warning needs to be shown.
158 If this is set to C<undef>, no warnings will ever be shown. However,
159 it's probably better to use the C<quiet> method to control that behavior.
161 If this value is not passed, Mech uses C<Carp::carp> if L<Carp> is
162 installed, or C<CORE::warn> if not.
164 =item * C<< onerror => \&func >>
166 Reference to a C<die>-compatible function, such as C<< L<Carp>::croak >>,
167 that is called when there's a fatal error.
169 If this is set to C<undef>, no errors will ever be shown.
171 If this value is not passed, Mech uses C<Carp::croak> if L<Carp> is
172 installed, or C<CORE::die> if not.
174 =item * C<< quiet => [0|1] >>
176 Don't complain on warnings. Setting C<< quiet => 1 >> is the same as
177 calling C<< $mech->quiet(1) >>. Default is off.
179 =item * C<< stack_depth => $value >>
Sets the depth of the page stack that keeps track of all the downloaded
pages. The default is an arbitrarily large depth (8675309), so the stack is
effectively unlimited. If the stack is eating up your memory,
193 agent => "WWW-Mechanize/$VERSION",
199 onwarn => \&WWW::Mechanize::_warn,
200 onerror => \&WWW::Mechanize::_die,
202 stack_depth => 8675309, # Arbitrarily humongous stack
206 my %passed_parms = @_;
208 # Keep the mech-specific parms before creating the object.
209 while ( my($key,$value) = each %passed_parms ) {
210 if ( exists $mech_parms{$key} ) {
211 $mech_parms{$key} = $value;
214 $parent_parms{$key} = $value;
218 my $self = $class->SUPER::new( %parent_parms );
221 # Use the mech parms now that we have a mech object.
222 for my $parm ( keys %mech_parms ) {
223 $self->{$parm} = $mech_parms{$parm};
225 $self->{page_stack} = [];
228 # libwww-perl 5.800 (and before, I assume) has a problem where
229 # $ua->{proxy} can be undef and clone() doesn't handle it.
230 $self->{proxy} = {} unless defined $self->{proxy};
231 push( @{$self->requests_redirectable}, 'POST' );
238 =head2 $mech->agent_alias( $alias )
Sets the user agent string to the expanded version from a table of
actual user agent strings.
241 I<$alias> can be one of the following:
247 =item * Windows Mozilla
253 =item * Linux Mozilla
255 =item * Linux Konqueror
259 then it will be replaced with a more interesting one. For instance,
261 $mech->agent_alias( 'Windows IE 6' );
263 sets your User-Agent to
265 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
267 The list of valid aliases can be returned from C<known_agent_aliases()>. The current list is:
273 =item * Windows Mozilla
279 =item * Linux Mozilla
281 =item * Linux Konqueror
288 'Windows IE 6' => 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',
289 'Windows Mozilla' => 'Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.4b) Gecko/20030516 Mozilla Firebird/0.6',
290 'Mac Safari' => 'Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en-us) AppleWebKit/85 (KHTML, like Gecko) Safari/85',
291 'Mac Mozilla' => 'Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en-US; rv:1.4a) Gecko/20030401',
292 'Linux Mozilla' => 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.4) Gecko/20030624',
293 'Linux Konqueror' => 'Mozilla/5.0 (compatible; Konqueror/3; Linux)',
300 if ( defined $known_agents{$alias} ) {
301 return $self->agent( $known_agents{$alias} );
304 $self->warn( qq{Unknown agent alias "$alias"} );
305 return $self->agent();
309 =head2 known_agent_aliases()
311 Returns a list of all the agent aliases that Mech knows about.
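A minimal sketch of using the two methods together: it checks that an
alias is known before applying it, so no warning is emitted on a version
of Mech whose table differs. The alias chosen is just one entry from the
list above.

```perl
use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new();

# Apply the alias only if this version of Mech knows about it.
my $wanted = 'Linux Mozilla';
if ( grep { $_ eq $wanted } $mech->known_agent_aliases() ) {
    $mech->agent_alias($wanted);
}

# The exact string printed depends on the installed version's table.
print $mech->agent, "\n";
```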
315 sub known_agent_aliases {
316 return sort keys %known_agents;
319 =head1 PAGE-FETCHING METHODS
321 =head2 $mech->get( $uri )
323 Given a URL/URI, fetches it. Returns an L<HTTP::Response> object.
324 I<$uri> can be a well-formed URL string, a L<URI> object, or a
325 L<WWW::Mechanize::Link> object.
327 The results are stored internally in the agent object, but you don't
328 know that. Just use the accessors listed below. Poking at the
329 internals is deprecated and subject to change in the future.
331 C<get()> is a well-behaved overloaded version of the method in
332 L<LWP::UserAgent>. This lets you do things like
334 $mech->get( $uri, ':content_file' => $tempfile );
336 and you can rest assured that the parms will get filtered down
339 B<NOTE:> Because C<:content_file> causes the page contents to be
340 stored in a file instead of the response object, some Mech functions
341 that expect it to be there won't work as expected. Use with caution.
349 $uri = $uri->url if ref($uri) eq 'WWW::Mechanize::Link';
352 ? URI->new_abs( $uri, $self->base )
355 # It appears we are returning a super-class method,
356 # but it in turn calls the request() method here in Mechanize
357 return $self->SUPER::get( $uri->as_string, @_ );
360 =head2 $mech->put( $uri, content => $content )
362 PUTs I<$content> to $uri. Returns an L<HTTP::Response> object.
363 I<$uri> can be a well-formed URI string, a L<URI> object, or a
364 L<WWW::Mechanize::Link> object.
372 $uri = $uri->url if ref($uri) eq 'WWW::Mechanize::Link';
375 ? URI->new_abs( $uri, $self->base )
378 # It appears we are returning a super-class method,
379 # but it in turn calls the request() method here in Mechanize
380 return $self->_SUPER_put( $uri->as_string, @_ );
384 # Added until LWP::UserAgent has it.
386 require HTTP::Request::Common;
387 my($self, @parameters) = @_;
388 my @suff = $self->_process_colonic_headers(\@parameters,1);
389 return $self->request( HTTP::Request::Common::PUT( @parameters ), @suff );
392 =head2 $mech->reload()
394 Acts like the reload button in a browser: repeats the current
395 request. The history (as per the L<back> method) is not altered.
397 Returns the L<HTTP::Response> object from the reload, or C<undef>
398 if there's no current request.
405 return unless my $req = $self->{req};
407 return $self->_update_page( $req, $self->_make_request( $req, @_ ) );
412 The equivalent of hitting the "back" button in a browser. Returns to
413 the previous page. Won't go back past the first page. (Really, what
414 would it do if it could?)
420 $self->_pop_page_stack;
423 =head1 STATUS METHODS
425 =head2 $mech->success()
427 Returns a boolean telling whether the last request was successful.
428 If there hasn't been an operation yet, returns false.
430 This is a convenience function that wraps C<< $mech->res->is_success >>.
437 return $self->res && $self->res->is_success;
443 Returns the current URI as a L<URI> object. This object stringifies
446 =head2 $mech->response() / $mech->res()
448 Return the current response as an L<HTTP::Response> object.
450 Synonym for C<< $mech->response() >>
452 =head2 $mech->status()
454 Returns the HTTP status code of the response.
458 Returns the content type of the response.
462 Returns the base URI for the current response
464 =head2 $mech->forms()
466 When called in a list context, returns a list of the forms found in
467 the last fetched page. In a scalar context, returns a reference to
468 an array with those forms. The forms returned are all L<HTML::Form>
471 =head2 $mech->current_form()
473 Returns the current form as an L<HTML::Form> object.
475 =head2 $mech->links()
477 When called in a list context, returns a list of the links found in the
478 last fetched page. In a scalar context it returns a reference to an array
479 with those links. Each link is a L<WWW::Mechanize::Link> object.
481 =head2 $mech->is_html()
483 Returns true/false on whether our content is HTML, according to the
490 return $self->response->request->uri;
493 sub res { my $self = shift; return $self->{res}; }
494 sub response { my $self = shift; return $self->{res}; }
495 sub status { my $self = shift; return $self->{status}; }
496 sub ct { my $self = shift; return $self->{ct}; }
497 sub base { my $self = shift; return $self->{base}; }
498 sub current_form { my $self = shift; return $self->{form}; }
499 sub is_html { my $self = shift; return defined $self->{ct} && ($self->{ct} eq 'text/html'); }
501 =head2 $mech->title()
503 Returns the contents of the C<< <TITLE> >> tag, as parsed by
504 L<HTML::HeadParser>. Returns undef if the content is not HTML.
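The status accessors can be tried without a network connection. This
sketch makes two assumptions that are not part of Mech itself: the page
is a hypothetical sample written to a temporary file, and it is fetched
through a C<file:> URL built with L<URI::file>.

```perl
use strict;
use warnings;
use File::Temp ();
use URI::file;
use WWW::Mechanize;

# Hypothetical sample page, saved locally so the sketch needs no network.
my $tmp = File::Temp->new( SUFFIX => '.html' );
print {$tmp} <<'HTML';
<html><head><title>Sample page</title></head>
<body><p>Hello</p></body></html>
HTML
close $tmp or die "close failed: $!";

my $mech = WWW::Mechanize->new();
$mech->get( URI::file->new_abs( $tmp->filename )->as_string );

if ( $mech->success ) {
    print 'status: ', $mech->status, "\n";
    print 'type:   ', $mech->ct, "\n";
    print 'title:  ', $mech->title, "\n" if $mech->is_html;
}
```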
510 return unless $self->is_html;
512 require HTML::HeadParser;
513 my $p = HTML::HeadParser->new;
514 $p->parse($self->content);
515 return $p->header('Title');
518 =head1 CONTENT-HANDLING METHODS
520 =head2 $mech->content(...)
522 Returns the content that the mech uses internally for the last page
523 fetched. Ordinarily this is the same as $mech->response()->content(),
524 but this may differ for HTML documents if L</update_html> is
525 overloaded (in which case the value passed to the base-class
526 implementation of same will be returned), and/or extra named arguments
527 are passed to I<content()>:
531 =item I<< $mech->content( format => 'text' ) >>
533 Returns a text-only version of the page, with all HTML markup
534 stripped. This feature requires I<HTML::TreeBuilder> to be installed,
535 or a fatal error will be thrown.
537 =item I<< $mech->content( base_href => [$base_href|undef] ) >>
539 Returns the HTML document, modified to contain a
540 C<< <base href="$base_href"> >> mark-up in the header.
541 I<$base_href> is C<< $mech->base() >> if not specified. This is
542 handy to pass the HTML to e.g. L<HTML::Display>.
Passing arguments to C<content()> if the current document is not
HTML has no effect now (i.e. the return value is the same as
C<< $self->response()->content() >>). This may change in the future,
but will likely be backwards-compatible when it does.
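A hedged sketch of the C<base_href> argument. The sample page, the
temporary file, the C<file:> URL and the C<http://example.com/> base are
all assumptions made so the example runs offline. Note that the page
must contain a plain C<< <head> >> tag for the base element to be added.

```perl
use strict;
use warnings;
use File::Temp ();
use URI::file;
use WWW::Mechanize;

# Hypothetical sample page, saved locally so the sketch needs no network.
my $tmp = File::Temp->new( SUFFIX => '.html' );
print {$tmp} <<'HTML';
<html><head><title>Sample page</title></head>
<body><a href="news.html">News</a></body></html>
HTML
close $tmp or die "close failed: $!";

my $mech = WWW::Mechanize->new();
$mech->get( URI::file->new_abs( $tmp->filename )->as_string );

# Same document, with a <base> element placed right after <head>.
my $html = $mech->content( base_href => 'http://example.com/' );
print $html;
```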
555 my $content = $self->{content};
557 if ( $self->is_html ) {
559 if ( exists $parms{base_href} ) {
560 my $arg = (delete $parms{base_href}) || $self->base;
561 $content=~s/<head>/<head>\n<base href="$arg">/i;
563 if ( my $arg = delete $parms{format} ) {
564 if ($arg eq 'text') {
565 require HTML::TreeBuilder;
566 my $tree = HTML::TreeBuilder->new();
567 $tree->parse($content);
569 $tree->elementify(); # just for safety
570 $content = $tree->as_text();
574 $self->die( qq{Unknown "format" parameter "$arg"} );
577 for my $cmd ( sort keys %parms ) {
578 $self->die( qq{Unknown named argument "$cmd"} );
589 Lists all the links on the current page. Each link is a
590 WWW::Mechanize::Link object. In list context, returns a list of all
591 links. In scalar context, returns an array reference of all links.
598 $self->_extract_links() unless $self->{_extracted_links};
600 return @{$self->{links}} if wantarray;
601 return $self->{links};
604 =head2 $mech->follow_link(...)
606 Follows a specified link on the page. You specify the match to be
607 found using the same parms that C<L<find_link()>> uses.
613 =item * 3rd link called "download"
615 $mech->follow_link( text => 'download', n => 3 );
617 =item * first link where the URL has "download" in it, regardless of case:
619 $mech->follow_link( url_regex => qr/download/i );
623 $mech->follow_link( url_regex => qr/(?i:download)/ );
625 =item * 3rd link on the page
627 $mech->follow_link( n => 3 );
631 Returns the result of the GET method (an HTTP::Response object) if
632 a link was found. If the page has no links, or the specified link
633 couldn't be found, returns undef.
639 my %parms = ( n=>1, @_ );
641 if ( $parms{n} eq 'all' ) {
643 $self->warn( q{follow_link(n=>"all") is not valid} );
646 my $link = $self->find_link(%parms);
647 return $self->get( $link->url ) if $link;
651 =head2 $mech->find_link( ... )
653 Finds a link in the currently fetched page. It returns a
654 L<WWW::Mechanize::Link> object which describes the link. (You'll
655 probably be most interested in the C<url()> property.) If it fails
656 to find a link it returns undef.
658 You can take the URL part and pass it to the C<get()> method. If
659 that's your plan, you might as well use the C<follow_link()> method
660 directly, since it does the C<get()> for you automatically.
Note that C<< <FRAME SRC="..."> >> tags are parsed out of the HTML
and treated as links so this method works with them.
665 You can select which link to find by passing in one or more of these
670 =item * C<< text => 'string', >> and C<< text_regex => qr/regex/, >>
672 C<text> matches the text of the link against I<string>, which must be an
673 exact match. To select a link with text that is exactly "download", use
675 $mech->find_link( text => 'download' );
677 C<text_regex> matches the text of the link against I<regex>. To select a
678 link with text that has "download" anywhere in it, regardless of case, use
680 $mech->find_link( text_regex => qr/download/i );
Note that the text extracted from the page's links is trimmed. For
example, C<< <a> foo </a> >> is stored as 'foo', and searching for
leading or trailing spaces will fail.
686 =item * C<< url => 'string', >> and C<< url_regex => qr/regex/, >>
688 Matches the URL of the link against I<string> or I<regex>, as appropriate.
689 The URL may be a relative URL, like F<foo/bar.html>, depending on how
690 it's coded on the page.
692 =item * C<< url_abs => string >> and C<< url_abs_regex => regex >>
694 Matches the absolute URL of the link against I<string> or I<regex>,
695 as appropriate. The URL will be an absolute URL, even if it's relative
698 =item * C<< name => string >> and C<< name_regex => regex >>
700 Matches the name of the link against I<string> or I<regex>, as appropriate.
702 =item * C<< id => string >> and C<< id_regex => regex >>
704 Matches the attribute 'id' of the link against I<string> or
705 I<regex>, as appropriate.
707 =item * C<< class => string >> and C<< class_regex => regex >>
709 Matches the attribute 'class' of the link against I<string> or
710 I<regex>, as appropriate.
712 =item * C<< tag => string >> and C<< tag_regex => regex >>
714 Matches the tag that the link came from against I<string> or I<regex>,
715 as appropriate. The C<tag_regex> is probably most useful to check for
716 more than one tag, as in:
718 $mech->find_link( tag_regex => qr/^(a|frame)$/ );
720 The tags and attributes looked at are defined below, at
721 L<< $mech->find_link() : link format >>.
725 If C<n> is not specified, it defaults to 1. Therefore, if you don't
726 specify any parms, this method defaults to finding the first link on the
729 Note that you can specify multiple text or URL parameters, which
730 will be ANDed together. For example, to find the first link with
731 text of "News" and with "cnn.com" in the URL, use:
733 $mech->find_link( text => 'News', url_regex => qr/cnn\.com/ );
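The ANDing of criteria and the trimming of link text can be seen in a
small offline sketch. The sample page and its URLs are hypothetical, and
loading it from a temporary file through a C<file:> URL is an assumption
made to avoid network access.

```perl
use strict;
use warnings;
use File::Temp ();
use URI::file;
use WWW::Mechanize;

# Hypothetical sample page, saved locally so the sketch needs no network.
my $tmp = File::Temp->new( SUFFIX => '.html' );
print {$tmp} <<'HTML';
<html><head><title>Links</title></head><body>
<a href="http://www.cnn.com/sport">Sport</a>
<a href="http://example.com/news">News</a>
<a href="http://www.cnn.com/news"> News </a>
</body></html>
HTML
close $tmp or die "close failed: $!";

my $mech = WWW::Mechanize->new();
$mech->get( URI::file->new_abs( $tmp->filename )->as_string );

# Both criteria must hold, so the example.com link is skipped.
my $link = $mech->find_link( text => 'News', url_regex => qr/cnn\.com/ );
print 'first match: ', $link->url, "\n" if $link;

# Link text is trimmed, so " News " also matches the exact text 'News'.
my @news = $mech->find_all_links( text => 'News' );
print 'links called News: ', scalar(@news), "\n";
```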
735 The return value is a reference to an array containing a
736 L<WWW::Mechanize::Link> object for every link in C<< $self->content >>.
738 The links come from the following:
742 =item C<< <A HREF=...> >>
744 =item C<< <AREA HREF=...> >>
746 =item C<< <FRAME SRC=...> >>
748 =item C<< <IFRAME SRC=...> >>
750 =item C<< <META CONTENT=...> >>
758 my %parms = ( n=>1, @_ );
760 my $wantall = ( $parms{n} eq 'all' );
762 $self->_clean_keys( \%parms, qr/^(n|(text|url|url_abs|name|tag|id|class)(_regex)?)$/ );
764 my @links = $self->links or return;
768 for my $link ( @links ) {
769 if ( _match_any_link_parms($link,\%parms) ) {
771 push( @matches, $link );
775 return $link if $nmatches >= $parms{n};
781 return @matches if wantarray;
# Used by find_link() to check for matches
789 # The logic is such that ALL parm criteria that are given must match
790 sub _match_any_link_parms {
794 # No conditions, anything matches
795 return 1 unless keys %$p;
797 return if defined $p->{url} && !($link->url eq $p->{url} );
798 return if defined $p->{url_regex} && !($link->url =~ $p->{url_regex} );
799 return if defined $p->{url_abs} && !($link->url_abs eq $p->{url_abs} );
800 return if defined $p->{url_abs_regex} && !($link->url_abs =~ $p->{url_abs_regex} );
801 return if defined $p->{text} && !(defined($link->text) && $link->text eq $p->{text} );
802 return if defined $p->{text_regex} && !(defined($link->text) && $link->text =~ $p->{text_regex} );
803 return if defined $p->{name} && !(defined($link->name) && $link->name eq $p->{name} );
804 return if defined $p->{name_regex} && !(defined($link->name) && $link->name =~ $p->{name_regex} );
805 return if defined $p->{tag} && !($link->tag && $link->tag eq $p->{tag} );
806 return if defined $p->{tag_regex} && !($link->tag && $link->tag =~ $p->{tag_regex} );
808 return if defined $p->{id} && !($link->attrs->{id} && $link->attrs->{id} eq $p->{id} );
809 return if defined $p->{id_regex} && !($link->attrs->{id} && $link->attrs->{id} =~ $p->{id_regex} );
810 return if defined $p->{class} && !($link->attrs->{class} && $link->attrs->{class} eq $p->{class} );
811 return if defined $p->{class_regex} && !($link->attrs->{class} && $link->attrs->{class} =~ $p->{class_regex} );
813 # Success: everything that was defined passed.
818 # Cleans the %parms parameter for the find_link and find_image methods.
822 my $rx_keyname = shift;
824 for my $key ( keys %$parms ) {
825 my $val = $parms->{$key};
826 if ( $key !~ qr/$rx_keyname/ ) {
827 $self->warn( qq{Unknown link-finding parameter "$key"} );
828 delete $parms->{$key};
832 my $key_regex = ( $key =~ /_regex$/ );
833 my $val_regex = ( ref($val) eq 'Regexp' );
837 $self->warn( qq{$val passed as $key is not a regex} );
838 delete $parms->{$key};
844 $self->warn( qq{$val passed as '$key' is a regex} );
845 delete $parms->{$key};
848 if ( $val =~ /^\s|\s$/ ) {
849 $self->warn( qq{'$val' is space-padded and cannot succeed} );
850 delete $parms->{$key};
858 =head2 $mech->find_all_links( ... )
860 Returns all the links on the current page that match the criteria. The
861 method for specifying link criteria is the same as in C<L<find_link()>>.
862 Each of the links returned is a L<WWW::Mechanize::Link> object.
864 In list context, C<find_all_links()> returns a list of the links.
865 Otherwise, it returns a reference to the list of links.
867 C<find_all_links()> with no parameters returns all links in the
874 return $self->find_link( @_, n=>'all' );
877 =head2 $mech->find_all_inputs( ... criteria ... )
C<find_all_inputs()> returns an array of all the input controls in the
current form whose properties match all of the criteria passed in.
The controls returned are all descended from HTML::Form::Input.
883 If no criteria are passed, all inputs will be returned.
If there is no current page, there is no form on the current
page, or there are no input controls in the current form
then the return will be an empty array.
889 You may use a regex or a literal string:
891 # get all textarea controls whose names begin with "customer"
892 my @customer_text_inputs =
893 $mech->find_all_inputs( {
895 name_regex => qr/^customer/,
899 # get all text or textarea controls called "customer"
900 my @customer_text_inputs =
901 $mech->find_all_inputs( {
902 type_regex => qr/^(text|textarea)$/,
909 sub find_all_inputs {
913 my $form = $self->current_form() or return;
916 foreach my $input ( $form->inputs ) { # check every pattern for a match on the current hash
918 foreach my $criterion ( sort keys %criteria ) { # Sort so we're deterministic
919 my $field = $criterion;
920 my $is_regex = ( $field =~ s/(?:_regex)$// );
921 my $what = $input->{$field};
922 $matched = defined($what) && (
924 ? ( $what =~ $criteria{$criterion} )
925 : ( $what eq $criteria{$criterion} )
929 push @found, $input if $matched;
934 =head2 $mech->find_all_submits( ... criteria ... )
936 C<find_all_submits()> does the same thing as C<find_all_inputs()>
937 except that it only returns controls that are submit controls,
938 ignoring other types of input controls like text and checkboxes.
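A hedged, offline sketch of both methods. The criteria are passed as a
flat list of key/value pairs, which is how C<find_all_submits()> itself
calls C<find_all_inputs()>. The form, its field names, and the use of a
temporary file with a C<file:> URL are assumptions for illustration.

```perl
use strict;
use warnings;
use File::Temp ();
use URI::file;
use WWW::Mechanize;

# Hypothetical sample page, saved locally so the sketch needs no network.
my $tmp = File::Temp->new( SUFFIX => '.html' );
print {$tmp} <<'HTML';
<html><head><title>Form</title></head><body>
<form action="/search" name="search">
<input type="text" name="customer_name">
<textarea name="customer_notes"></textarea>
<input type="text" name="query">
<input type="submit" name="go" value="Search Now">
</form>
</body></html>
HTML
close $tmp or die "close failed: $!";

my $mech = WWW::Mechanize->new();
$mech->get( URI::file->new_abs( $tmp->filename )->as_string );

# Every control whose name begins with "customer".
my @customer = $mech->find_all_inputs( name_regex => qr/^customer/ );

# A literal and a regex together: both criteria must match.
my @textareas = $mech->find_all_inputs(
    type       => 'textarea',
    name_regex => qr/^customer/,
);

# Submit controls only.
my @submits = $mech->find_all_submits();

print join( ', ', map { $_->name } @customer ), "\n";
print join( ', ', map { $_->name } @textareas ), "\n";
print join( ', ', map { $_->name } @submits ), "\n";
```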
942 sub find_all_submits {
945 return $self->find_all_inputs( @_, type_regex => qr/^(submit|image)$/ );
953 Lists all the images on the current page. Each image is a
954 WWW::Mechanize::Image object. In list context, returns a list of all
955 images. In scalar context, returns an array reference of all images.
962 $self->_extract_images() unless $self->{_extracted_images};
964 return @{$self->{images}} if wantarray;
965 return $self->{images};
968 =head2 $mech->find_image()
970 Finds an image in the current page. It returns a
971 L<WWW::Mechanize::Image> object which describes the image. If it fails
972 to find an image it returns undef.
974 You can select which image to find by passing in one or more of these
979 =item * C<< alt => 'string' >> and C<< alt_regex => qr/regex/, >>
C<alt> matches the ALT attribute of the image against I<string>, which must be an
exact match. To select an image with an ALT attribute that is exactly "download", use
984 $mech->find_image( alt => 'download' );
986 C<alt_regex> matches the ALT attribute of the image against a regular
987 expression. To select an image with an ALT attribute that has "download"
988 anywhere in it, regardless of case, use
990 $mech->find_image( alt_regex => qr/download/i );
992 =item * C<< url => 'string', >> and C<< url_regex => qr/regex/, >>
994 Matches the URL of the image against I<string> or I<regex>, as appropriate.
995 The URL may be a relative URL, like F<foo/bar.html>, depending on how
996 it's coded on the page.
998 =item * C<< url_abs => string >> and C<< url_abs_regex => regex >>
1000 Matches the absolute URL of the image against I<string> or I<regex>,
1001 as appropriate. The URL will be an absolute URL, even if it's relative
1004 =item * C<< tag => string >> and C<< tag_regex => regex >>
1006 Matches the tag that the image came from against I<string> or I<regex>,
1007 as appropriate. The C<tag_regex> is probably most useful to check for
1008 more than one tag, as in:
1010 $mech->find_image( tag_regex => qr/^(img|input)$/ );
1012 The tags supported are C<< <img> >> and C<< <input> >>.
1016 If C<n> is not specified, it defaults to 1. Therefore, if you don't
1017 specify any parms, this method defaults to finding the first image on the
1020 Note that you can specify multiple ALT or URL parameters, which
1021 will be ANDed together. For example, to find the first image with
1022 ALT text of "News" and with "cnn.com" in the URL, use:
    $mech->find_image( alt => 'News', url_regex => qr/cnn\.com/ );
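An offline sketch of image matching, assuming a hypothetical sample page
stored in a temporary file and fetched through a C<file:> URL. The image
URLs and ALT texts are invented for the example.

```perl
use strict;
use warnings;
use File::Temp ();
use URI::file;
use WWW::Mechanize;

# Hypothetical sample page, saved locally so the sketch needs no network.
my $tmp = File::Temp->new( SUFFIX => '.html' );
print {$tmp} <<'HTML';
<html><head><title>Images</title></head><body>
<img src="http://example.com/news.png" alt="News">
<img src="http://www.cnn.com/news.png" alt="News">
</body></html>
HTML
close $tmp or die "close failed: $!";

my $mech = WWW::Mechanize->new();
$mech->get( URI::file->new_abs( $tmp->filename )->as_string );

# Both the ALT text and the URL pattern must match.
my $image = $mech->find_image( alt => 'News', url_regex => qr/cnn\.com/ );
print 'matched: ', $image->url, "\n" if $image;

# All images that came from <img> tags.
my @imgs = $mech->find_all_images( tag => 'img' );
print 'img tags: ', scalar(@imgs), "\n";
```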
1026 The return value is a reference to an array containing a
1027 L<WWW::Mechanize::Image> object for every image in C<< $self->content >>.
1033 my %parms = ( n=>1, @_ );
1035 my $wantall = ( $parms{n} eq 'all' );
1037 $self->_clean_keys( \%parms, qr/^(n|(alt|url|url_abs|tag)(_regex)?)$/ );
1039 my @images = $self->images or return;
1043 for my $image ( @images ) {
1044 if ( _match_any_image_parms($image,\%parms) ) {
1046 push( @matches, $image );
1050 return $image if $nmatches >= $parms{n};
1056 return @matches if wantarray;
# Used by find_image() to check for matches
1064 # The logic is such that ALL parm criteria that are given must match
1065 sub _match_any_image_parms {
1069 # No conditions, anything matches
1070 return 1 unless keys %$p;
1072 return if defined $p->{url} && !($image->url eq $p->{url} );
1073 return if defined $p->{url_regex} && !($image->url =~ $p->{url_regex} );
1074 return if defined $p->{url_abs} && !($image->url_abs eq $p->{url_abs} );
1075 return if defined $p->{url_abs_regex} && !($image->url_abs =~ $p->{url_abs_regex} );
1076 return if defined $p->{alt} && !(defined($image->alt) && $image->alt eq $p->{alt} );
1077 return if defined $p->{alt_regex} && !(defined($image->alt) && $image->alt =~ $p->{alt_regex} );
1078 return if defined $p->{tag} && !($image->tag && $image->tag eq $p->{tag} );
1079 return if defined $p->{tag_regex} && !($image->tag && $image->tag =~ $p->{tag_regex} );
1081 # Success: everything that was defined passed.
1086 =head2 $mech->find_all_images( ... )
1088 Returns all the images on the current page that match the criteria. The
1089 method for specifying image criteria is the same as in C<L<find_image()>>.
1090 Each of the images returned is a L<WWW::Mechanize::Image> object.
1092 In list context, C<find_all_images()> returns a list of the images.
1093 Otherwise, it returns a reference to the list of images.
1095 C<find_all_images()> with no parameters returns all images in the page.
1099 sub find_all_images {
1101 return $self->find_image( @_, n=>'all' );
1108 Lists all the forms on the current page. Each form is an L<HTML::Form>
1109 object. In list context, returns a list of all forms. In scalar
1110 context, returns an array reference of all forms.
1116 return @{$self->{forms}} if wantarray;
1117 return $self->{forms};
1121 =head2 $mech->form_number($number)
1123 Selects the I<number>th form on the page as the target for subsequent
1124 calls to C<L<field()>> and C<L<click()>>. Also returns the form that was
1127 If it is found, the form is returned as an L<HTML::Form> object and set internally
1128 for later use with Mech's form methods such as C<L<field()>> and C<L<click()>>.
1130 Emits a warning and returns undef if no form is found.
1132 The first form is number 1, not zero.
1137 my ($self, $form) = @_;
1138 # XXX Should we die if no $form is defined? Same question for form_name()
1140 if ($self->{forms}->[$form-1]) {
1141 $self->{form} = $self->{forms}->[$form-1];
1142 return $self->{form};
1145 $self->warn( "There is no form numbered $form" );
1150 =head2 $mech->form_name( $name )
1152 Selects a form by name. If there is more than one form on the page
1153 with that name, then the first one is used, and a warning is
1156 If it is found, the form is returned as an L<HTML::Form> object and set internally
1157 for later use with Mech's form methods such as C<L<field()>> and C<L<click()>>.
1159 Returns undef if no form is found.
1161 Note that this functionality requires libwww-perl 5.69 or higher.
1166 my ($self, $form) = @_;
1169 my @matches = grep {defined($temp = $_->attr('name')) and ($temp eq $form) } $self->forms;
1170 if ( my $nmatches = @matches ) {
1171 $self->warn( "There are $nmatches forms named $form. The first one was used." )
1173 return $self->{form} = $matches[0];
$self->warn( qq{There is no form named "$form"} );
1181 =head2 $mech->form_with_fields( @fields )
Selects a form by passing in a list of field names it must contain. If there
is more than one form on the page that matches, then the first one is used,
and a warning is generated.
If it is found, the form is returned as an L<HTML::Form> object and set internally
for later use with Mech's form methods such as C<L<field()>> and C<L<click()>>.
1190 Returns undef if no form is found.
1192 Note that this functionality requires libwww-perl 5.69 or higher.
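The three ways of selecting a form can be compared side by side. This is
a sketch under stated assumptions: the two forms and their names are
hypothetical, and the page is loaded from a temporary file through a
C<file:> URL so that no network is needed.

```perl
use strict;
use warnings;
use File::Temp ();
use URI::file;
use WWW::Mechanize;

# Hypothetical sample page, saved locally so the sketch needs no network.
my $tmp = File::Temp->new( SUFFIX => '.html' );
print {$tmp} <<'HTML';
<html><head><title>Forms</title></head><body>
<form action="/login" name="login">
<input type="text" name="user">
<input type="password" name="password">
</form>
<form action="/search" name="search">
<input type="text" name="query">
</form>
</body></html>
HTML
close $tmp or die "close failed: $!";

my $mech = WWW::Mechanize->new();
$mech->get( URI::file->new_abs( $tmp->filename )->as_string );

# By position: forms are numbered from 1.
my $by_number = $mech->form_number(2);

# By the form's name attribute.
my $by_name = $mech->form_name('login');

# By a field the form must contain.
my $by_field = $mech->form_with_fields('query');

print $by_number->attr('name'), "\n";
print $by_name->attr('name'),   "\n";
print $by_field->attr('name'),  "\n";
```

Each call also makes the returned form the current form, so after the
last call C<< $mech->current_form() >> is the search form.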
1196 sub form_with_fields {
1197 my ($self, @fields) = @_;
1198 die 'no fields provided' unless scalar @fields;
1201 FORMS: for my $form (@{ $self->forms }) {
1202 my @fields_in_form = $form->param();
1203 for my $field (@fields) {
1204 next FORMS unless grep { $_ eq $field } @fields_in_form;
1206 push @matches, $form;
1209 if ( my $nmatches = @matches ) {
1210 $self->warn( "There are $nmatches forms with the named fields. The first one was used." )
1212 return $self->{form} = $matches[0];
1215 $self->warn( qq{There is no form with the requested fields} );
1221 =head2 $mech->field( $name, $value, $number )
1223 =head2 $mech->field( $name, \@values, $number )
1225 Given the name of a field, set its value to the value specified. This
1226 applies to the current form (as set by the L<form_name()> or L<form_number()> method or defaulting
1227 to the first form on the page).
1229 The optional I<$number> parameter is used to distinguish between two fields
1230 with the same name. The fields are numbered from 1.
1235 my ($self, $name, $value, $number) = @_;
1238 my $form = $self->{form};
1240 $form->find_input($name, undef, $number)->value($value);
1243 if ( ref($value) eq 'ARRAY' ) {
1244 $form->param($name, $value);
1247 $form->value($name => $value);
1252 =head2 $mech->select($name, $value)
1254 =head2 $mech->select($name, \@values)
Given the name of a C<select> field, set its value to the value
specified. If the field is not E<lt>select multipleE<gt> and the
C<$value> is an array reference, only the B<first> value will be set. [Note:
the documentation previously claimed that only the last value would
be set, but this was incorrect.] Passing C<$value> as a hash reference with
an C<n> key selects an item by number (e.g. C<< {n => 3} >> or
C<< {n => [2,4]} >>). The numbering starts at 1. This applies to the
current form.
Returns 1 on successfully setting the value. On failure, returns
undef and calls C<< $self->warn() >> with an error message.
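A hedged sketch of C<select()> with a literal value and with the C<n>
key, alongside C<field()> with its optional number. The form, its field
names and option values are hypothetical, and the page is read from a
temporary file through a C<file:> URL to keep the example offline.

```perl
use strict;
use warnings;
use File::Temp ();
use URI::file;
use WWW::Mechanize;

# Hypothetical sample page, saved locally so the sketch needs no network.
my $tmp = File::Temp->new( SUFFIX => '.html' );
print {$tmp} <<'HTML';
<html><head><title>Select</title></head><body>
<form action="/order" name="order">
<input type="text" name="tag">
<input type="text" name="tag">
<select name="color">
<option value="red">Red</option>
<option value="green">Green</option>
<option value="blue">Blue</option>
</select>
</form>
</body></html>
HTML
close $tmp or die "close failed: $!";

my $mech = WWW::Mechanize->new();
$mech->get( URI::file->new_abs( $tmp->filename )->as_string );
my $form = $mech->form_number(1);

# Set the second of the two fields named "tag" (numbering starts at 1).
$mech->field( 'tag', 'second value', 2 );

# Select by value, then by position: the third option is "blue".
$mech->select( 'color', 'green' );
my $after_value = $form->value('color');
$mech->select( 'color', { n => 3 } );
my $after_n = $form->value('color');

print "by value: $after_value\n";
print "by n:     $after_n\n";
```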
1270 my ($self, $name, $value) = @_;
1272 my $form = $self->{form};
1274 my $input = $form->find_input($name);
1276 $self->warn( qq{Input "$name" not found} );
1280 if ($input->type ne 'option') {
1281 $self->warn( qq{Input "$name" is not type "select"} );
1285 # For $mech->select($name, {n => 3}) or $mech->select($name, {n => [2,4]}),
1286 # transform the 'n' number(s) into value(s) and put it in $value.
1287 if (ref($value) eq 'HASH') {
1288 for (keys %$value) {
1289 $self->warn(qq{Unknown select value parameter "$_"})
1293 if (defined($value->{n})) {
1294 my @inputs = $form->find_input($name, 'option');
1296 # distinguish between multiple and non-multiple selects
1297 # (see INPUTS section of `perldoc HTML::Form`)
1299 @values = $inputs[0]->possible_values();
1302 foreach my $input (@inputs) {
1303 my @possible = $input->possible_values();
1304 push @values, pop @possible;
1308 my $n = $value->{n};
1309 if (ref($n) eq 'ARRAY') {
1313 $self->warn(qq{"n" value "$_" is not a positive integer});
1316 push @$value, $values[$_ - 1]; # might be undef
1319 elsif (!ref($n) && $n =~ /^\d+$/) {
1320 $value = $values[$n - 1]; # might be undef
1323 $self->warn('"n" value is not a positive integer or an array ref');
1328 $self->warn('Hash value is invalid');
1333 if (ref($value) eq 'ARRAY') {
1334 $form->param($name, $value);
1338 $form->value($name => $value);
1342 =head2 $mech->set_fields( $name => $value ... )
1344 This method sets multiple fields of the current form. It takes a list
1345 of field name and value pairs. If there is more than one field with
the same name, the first one found is set. If you want to select which
of the duplicate fields to set, pass an anonymous array containing the
field value and its number as its two elements.
1350 # set the second foo field
1351 $mech->set_fields( $name => [ 'foo', 2 ] ) ;
1353 The fields are numbered from 1.
1355 This applies to the current form.
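To make the C<[ $value, $number ]> convention concrete, here is a small sketch that applies name/value pairs to a toy form. The form model (a hash mapping each field name to a list of same-named inputs) and the helper C<set_fields_sketch> are assumptions for illustration, not how Mech stores forms.

```perl
use strict;
use warnings;

# Hypothetical form model: each field name maps to its same-named inputs.
my %form = (
    email => [''],
    phone => [ '', '' ],    # two inputs both named "phone"
);

# Apply name/value pairs. A plain scalar sets the first input of that name;
# [ $value, $number ] sets the $number-th one (numbered from 1).
sub set_fields_sketch {
    my ( $form, %fields ) = @_;
    for my $name ( sort keys %fields ) {
        my $value = $fields{$name};
        my ( $v, $number ) = ref $value eq 'ARRAY' ? @$value : ( $value, 1 );
        die qq{No field "$name" number $number\n}
            unless $form->{$name}
            && $number >= 1
            && $number <= @{ $form->{$name} };
        $form->{$name}[ $number - 1 ] = $v;
    }
    return;
}

set_fields_sketch( \%form, email => 'me@example.com', phone => [ '555-0100', 2 ] );
print "email:   $form{email}[0]\n";
print "phone 2: $form{phone}[1]\n";
```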
1363 my $form = $self->current_form or $self->die( 'No form defined' );
1365 while ( my ( $field, $value ) = each %fields ) {
1366 if ( ref $value eq 'ARRAY' ) {
1367 $form->find_input( $field, undef,
1368 $value->[1])->value($value->[0] );
1371 $form->value($field => $value);
1376 =head2 $mech->set_visible( @criteria )
1378 This method sets fields of the current form without having to know
1379 their names. So if you have a login screen that wants a username and
1380 password, you do not have to fetch the form and inspect the source (or
1381 use the F<mech-dump> utility, installed with WWW::Mechanize) to see
1382 what the field names are; you can just say
1384 $mech->set_visible( $username, $password ) ;
1386 and the first and second fields will be set accordingly. The method
1387 is called set_I<visible> because it acts only on visible fields;
1388 hidden form inputs are not considered. The order of the fields is
1389 the order in which they appear in the HTML source which is nearly
1390 always the order anyone viewing the page would think they are in,
1391 but some creative work with tables could change that; caveat user.
1393 Each element in C<@criteria> is either a field value or a field
1394 specifier. A field value is a scalar. A field specifier allows
1395 you to specify the I<type> of input field you want to set and is
1396 denoted with an arrayref containing two elements. So you could
1397 specify the first radio button with
1399 $mech->set_visible( [ radio => 'KCRW' ] ) ;
1401 Field values and specifiers can be intermixed, hence
1403 $mech->set_visible( 'fred', 'secret', [ option => 'Checking' ] ) ;
1405 would set the first two fields to "fred" and "secret", and the I<next>
1406 C<OPTION> menu field to "Checking".
1408 The possible field specifier types are: "text", "password", "hidden",
1409 "textarea", "file", "image", "submit", "radio", "checkbox" and "option".
1411 C<set_visible> returns the number of values set.
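The way criteria consume inputs from left to right is easier to see in a toy model. This sketch assumes inputs are plain hashes with C<type> and C<value> keys; C<set_visible_sketch> is a hypothetical helper that imitates the behaviour described above, not the real method.

```perl
use strict;
use warnings;

# Hypothetical inputs in document order.
my @inputs = (
    { type => 'hidden',   value => 'token123' },
    { type => 'text',     value => '' },
    { type => 'password', value => '' },
    { type => 'checkbox', value => '' },
    { type => 'option',   value => '' },
);

sub set_visible_sketch {
    my ( $inputs, @criteria ) = @_;
    my @remaining = @$inputs;    # consumed left to right, never rewound
    my $num_set   = 0;

    CRITERION: for my $criterion (@criteria) {
        # An arrayref is a [ type => value ] specifier; anything else is a bare value.
        my ( $type, $value ) =
            ref $criterion eq 'ARRAY' ? @$criterion : ( undef, $criterion );

        while ( my $input = shift @remaining ) {
            next if $input->{type} eq 'hidden';                  # hidden inputs never count
            next if defined $type && $input->{type} ne $type;    # skipped inputs are gone for good
            $input->{value} = $value;
            $num_set++;
            next CRITERION;
        }
    }
    return $num_set;
}

my $count = set_visible_sketch( \@inputs, 'fred', 'secret', [ option => 'Checking' ] );
print "$count values set\n";
```

Note that the checkbox is skipped while looking for the C<option> input and cannot be set by a later criterion, which is why the order of criteria matters.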
1418 my $form = $self->current_form;
1419 my @inputs = $form->inputs;
1422 for my $value ( @_ ) {
# Handle type/value pairs given as an arrayref
1424 if ( ref $value eq 'ARRAY' ) {
1425 my ( $type, $value ) = @$value;
1426 while ( my $input = shift @inputs ) {
1427 next if $input->type eq 'hidden';
1428 if ( $input->type eq $type ) {
1429 $input->value( $value );
1435 # by default, it's a value
1437 while ( my $input = shift @inputs ) {
1438 next if $input->type eq 'hidden';
1439 $input->value( $value );
1449 =head2 $mech->tick( $name, $value [, $set] )
1451 "Ticks" the first checkbox that has both the name and value associated
with it on the current form. Warns if there is no named checkbox for
that value. Passing in a false value as the optional third argument
1454 will cause the checkbox to be unticked.
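The search for a checkbox by name and value can be reproduced directly against L<HTML::Form>, which is what Mech uses underneath. This is a sketch only: the HTML, the URL and the helper name C<tick_sketch> are invented, and it returns 0 on failure instead of calling C<< $self->warn() >>.

```perl
use strict;
use warnings;
use HTML::Form;

# A made-up form with two checkboxes sharing one name.
my $html = <<'HTML';
<form action="http://example.com/prefs" method="post">
  <input type="checkbox" name="topping" value="cheese">
  <input type="checkbox" name="topping" value="olives">
</form>
HTML

my ($form) = HTML::Form->parse( $html, 'http://example.com/' );

# Tick (or untick) the first checkbox with the given name and value.
sub tick_sketch {
    my ( $form, $name, $value ) = @_;
    my $set = @_ > 3 ? $_[3] : 1;    # default to ticking when no flag is passed

    my $index = 1;
    while ( my $input = $form->find_input( $name, 'checkbox', $index ) ) {
        for my $val ( $input->possible_values ) {
            next unless defined $val;    # undef is the "unticked" state
            if ( $val eq $value ) {
                $input->value( $set ? $value : undef );
                return 1;
            }
        }
        $index++;    # move on to the next checkbox of that name
    }
    return 0;        # nothing matched
}

tick_sketch( $form, 'topping', 'olives' );
my @ticked = $form->param('topping');
print "ticked: @ticked\n";

tick_sketch( $form, 'topping', 'olives', 0 );
@ticked = $form->param('topping');
print "ticked after untick: ", scalar(@ticked), "\n";
```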
1462 my $set = @_ ? shift : 1; # default to 1 if not passed
# loop through all the inputs
1466 while ( my $input = $self->current_form->find_input( $name, 'checkbox', $index ) ) {
1467 # Can't guarantee that the first element will be undef and the second
1468 # element will be the right name
1469 foreach my $val ($input->possible_values()) {
1470 next unless defined $val;
1471 if ($val eq $value) {
1472 $input->value($set ? $value : undef);
1477 # move onto the next input
# got this far? Didn't find anything
1482 $self->warn( qq{No checkbox "$name" for value "$value" in form} );
1485 =head2 $mech->untick($name, $value)
1487 Causes the checkbox to be unticked. Shorthand for
1488 C<tick($name,$value,undef)>
1493 shift->tick(shift,shift,undef);
1496 =head2 $mech->value( $name, $number )
1498 Given the name of a field, return its value. This applies to the current
The optional I<$number> parameter is used to distinguish between two fields
1502 with the same name. The fields are numbered from 1.
1504 If the field is of type file (file upload field), the value is always
1505 cleared to prevent remote sites from downloading your local files.
1506 To upload a file, specify its file name explicitly.
1513 my $number = shift || 1;
1515 my $form = $self->{form};
1516 if ( $number > 1 ) {
1517 return $form->find_input( $name, undef, $number )->value();
1520 return $form->value( $name );
1524 =head2 $mech->click( $button [, $x, $y] )
1526 Has the effect of clicking a button on the current form. The first
1527 argument is the name of the button to be clicked. The second and
1528 third arguments (optional) allow you to specify the (x,y) coordinates
1531 If there is only one button on the form, C<< $mech->click() >> with
1532 no arguments simply clicks that one button.
1534 Returns an L<HTTP::Response> object.
1539 my ($self, $button, $x, $y) = @_;
1540 for ($x, $y) { $_ = 1 unless defined; }
1541 my $request = $self->{form}->click($button, $x, $y);
1542 return $self->request( $request );
1545 =head2 $mech->click_button( ... )
1547 Has the effect of clicking a button on the current form by specifying
1548 its name, value, or index. Its arguments are a list of key/value
pairs. Exactly one of name, number, input or value must be specified in
1554 =item * name => name
1556 Clicks the button named I<name> in the current form.
1560 Clicks the I<n>th button in the current form. Numbering starts at 1.
1562 =item * value => value
1564 Clicks the button with the value I<value> in the current form.
1566 =item * input => $inputobject
1568 Clicks on the button referenced by $inputobject, an instance of
1569 L<HTML::Form::SubmitInput> obtained e.g. from
1571 $mech->current_form()->find_input( undef, 'submit' )
1573 $inputobject must belong to the current form.
1579 These arguments (optional) allow you to specify the (x,y) coordinates
1590 for ( keys %args ) {
1591 if ( !/^(number|name|value|input|x|y)$/ ) {
1592 $self->warn( qq{Unknown click_button parameter "$_"} );
1596 for ($args{x}, $args{y}) {
1597 $_ = 1 unless defined;
1600 my $form = $self->{form};
1602 if ( $args{name} ) {
1603 $request = $form->click( $args{name}, $args{x}, $args{y} );
1605 elsif ( $args{number} ) {
1606 my $input = $form->find_input( undef, 'submit', $args{number} );
1607 $request = $input->click( $form, $args{x}, $args{y} );
1609 elsif ( $args{input} ) {
1610 $request = $args{input}->click( $form, $args{x}, $args{y} );
1612 elsif ( $args{value} ) {
1614 while ( my $input = $form->find_input(undef, 'submit', $i) ) {
1615 if ( $args{value} && ($args{value} eq $input->value) ) {
1616 $request = $input->click( $form, $args{x}, $args{y} );
1623 return $self->request( $request );
1626 =head2 $mech->submit()
1628 Submits the page, without specifying a button to click. Actually,
1629 no button is clicked at all.
1631 Returns an L<HTTP::Response> object.
1633 This used to be a synonym for C<< $mech->click( 'submit' ) >>, but is no
1641 my $request = $self->{form}->make_request;
1642 return $self->request( $request );
1645 =head2 $mech->submit_form( ... )
1647 This method lets you select a form from the previously fetched page,
1648 fill in its fields, and submit it. It combines the form_number/form_name,
1649 set_fields and click methods into one higher level call. Its arguments
1650 are a list of key/value pairs, all of which are optional.
1654 =item * fields => \%fields
1656 Specifies the fields to be filled in the current form.
1658 =item * with_fields => \%fields
1660 Probably all you need for the common case. It combines a smart form selector
1661 and data setting in one operation. It selects the first form that contains all
1662 fields mentioned in C<\%fields>. This is nice because you don't need to know
1663 the name or number of the form to do this.
1665 (calls C<L<form_with_fields>> and C<L<set_fields()>>).
1667 If you choose this, the form_number, form_name and fields options will be ignored.
1669 =item * form_number => n
Selects the I<n>th form (calls C<L<form_number()>>). If this parameter is not
1672 specified, the currently-selected form is used.
1674 =item * form_name => name
1676 Selects the form named I<name> (calls C<L<form_name()>>)
1678 =item * button => button
1680 Clicks on button I<button> (calls C<L<click()>>)
1682 =item * x => x, y => y
1684 Sets the x or y values for C<L<click()>>
1688 If no form is selected, the first form found is used.
1690 If I<button> is not passed, then the C<L<submit()>> method is used instead.
1692 Returns an L<HTTP::Response> object.
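Since every argument is optional, it helps to see which selector wins and which submission path is taken. The helper below, C<plan_submit>, is hypothetical and only reports the decision; it is a sketch of the precedence described above, not the method itself.

```perl
use strict;
use warnings;

# Report which form selector and which submission path a given
# set of submit_form-style arguments would lead to.
sub plan_submit {
    my (%args) = @_;

    my $selector =
          $args{with_fields} ? 'with_fields'
        : $args{form_number} ? 'form_number'
        : $args{form_name}   ? 'form_name'
        :                      'current form';    # previously selected, or the first form

    my $action = $args{button} ? 'click' : 'submit';

    return ( $selector, $action );
}

my @plans = (
    [ with_fields => { user => 'fred' }, form_name => 'login' ],
    [ form_number => 2, button => 'Search Now' ],
    [ fields => { q => 'pot of gold' } ],
);

for my $plan (@plans) {
    my ( $selector, $action ) = plan_submit(@$plan);
    print "selector: $selector, action: $action\n";
}
```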
1697 my( $self, %args ) = @_ ;
1699 for ( keys %args ) {
1700 if ( !/^(form_(number|name|fields)|(with_)?fields|button|x|y)$/ ) {
1701 # XXX Why not die here?
1702 $self->warn( qq{Unknown submit_form parameter "$_"} );
1707 for (qw/with_fields fields/) {
1709 if ( ref $args{$_} eq 'HASH' ) {
1710 $fields = $args{$_};
1713 die "$_ arg to submit_form must be a hashref";
1719 if ($args{'with_fields'}) {
1720 $fields || die q{must submit some 'fields' with with_fields};
1721 $self->form_with_fields(keys %{$fields}) or die;
1723 elsif ( my $form_number = $args{'form_number'} ) {
1724 $self->form_number( $form_number ) or die;
1726 elsif ( my $form_name = $args{'form_name'} ) {
1727 $self->form_name( $form_name ) or die;
1730 # No form selector was used.
1731 # Maybe a form was set separately, or we'll default to the first form.
1734 $self->set_fields( %{$fields} ) if $fields;
1737 if ( $args{button} ) {
1738 $response = $self->click( $args{button}, $args{x} || 0, $args{y} || 0 );
1741 $response = $self->submit();
1747 =head1 MISCELLANEOUS METHODS
1749 =head2 $mech->add_header( name => $value [, name => $value... ] )
1751 Sets HTTP headers for the agent to add or remove from the HTTP request.
1753 $mech->add_header( Encoding => 'text/klingon' );
1755 If a I<value> is C<undef>, then that header will be removed from any
1756 future requests. For example, to never send a Referer header:
1758 $mech->add_header( Referer => undef );
1760 If you want to delete a header, use C<delete_header>.
1762 Returns the number of name/value pairs added.
1764 B<NOTE>: This method was very different in WWW::Mechanize before 1.00.
1765 Back then, the headers were stored in a package hash, not as a member of
1766 the object instance. Calling C<add_header()> would modify the headers
1767 for every WWW::Mechanize object, even after your object no longer existed.
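The set-or-remove rule can be shown with plain hashes standing in for the agent's header list and an outgoing request. Both hashes and the helper C<apply_headers> are assumptions made for this sketch; the real code works on an L<HTTP::Request> object.

```perl
use strict;
use warnings;

# Hypothetical per-agent headers: a defined value is sent, undef means "strip it".
my %agent_headers = (
    'X-Requested-By' => 'mech-sketch',
    'Referer'        => undef,
);

# Hypothetical outgoing request headers before the agent's headers are applied.
my %request = (
    'Referer'         => 'http://example.com/previous',
    'Accept-Encoding' => 'gzip',
);

sub apply_headers {
    my ( $request, $headers ) = @_;
    for my $name ( sort keys %$headers ) {
        if ( defined $headers->{$name} ) {
            $request->{$name} = $headers->{$name};    # add or overwrite
        }
        else {
            delete $request->{$name};                 # undef removes the header
        }
    }
    return $request;
}

apply_headers( \%request, \%agent_headers );
print "$_: $request{$_}\n" for sort keys %request;
```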
1780 $self->{headers}{$key} = $value;
1786 =head2 $mech->delete_header( name [, name ... ] )
1788 Removes HTTP headers from the agent's list of special headers. For
1789 instance, you might need to do something like:
1791 # Don't send a Referer for this URL
1792 $mech->add_header( Referer => undef );
1797 # Back to the default behavior
1798 $mech->delete_header( 'Referer' );
1808 delete $self->{headers}{$key};
1815 =head2 $mech->quiet(true/false)
1817 Allows you to suppress warnings to the screen.
1819 $mech->quiet(0); # turns on warnings (the default)
1820 $mech->quiet(1); # turns off warnings
1821 $mech->quiet(); # returns the current quietness status
1828 $self->{quiet} = $_[0] if @_;
1830 return $self->{quiet};
1833 =head2 $mech->stack_depth( $max_depth )
1835 Get or set the page stack depth. Use this if you're doing a lot of page
1836 scraping and running out of memory.
1838 A value of 0 means "no history at all." By default, the max stack depth
1839 is humongously large, effectively keeping all history.
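A bounded history behaves like a queue that drops its oldest entry once the limit is exceeded. A minimal sketch, using strings in place of saved page objects and a hypothetical helper C<push_page>:

```perl
use strict;
use warnings;

# Push a page onto the history, keeping at most $max_depth entries.
sub push_page {
    my ( $stack, $page, $max_depth ) = @_;
    return unless $max_depth;                      # depth 0: keep no history at all
    push @$stack, $page;
    shift @$stack while @$stack > $max_depth;      # discard the oldest entries
    return;
}

my @history;
push_page( \@history, "page$_", 3 ) for 1 .. 5;
print "@history\n";    # page3 page4 page5
```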
1845 $self->{stack_depth} = shift if @_;
1846 return $self->{stack_depth};
1849 =head2 $mech->save_content( $filename )
1851 Dumps the contents of C<< $mech->content >> into I<$filename>.
1852 I<$filename> will be overwritten. Dies if there are any errors.
1858 my $filename = shift;
1860 open( my $fh, '>', $filename ) or $self->die( "Unable to create $filename: $!" );
1861 print {$fh} $self->content or $self->die( "Unable to write to $filename: $!" );
1862 close $fh or $self->die( "Unable to close $filename: $!" );
1867 =head2 $mech->dump_links( [[$fh], $absolute] )
1869 Prints a dump of the links on the current page to I<$fh>. If I<$fh>
1870 is not specified or is undef, it dumps to STDOUT.
1872 If I<$absolute> is true, links displayed are absolute, not relative.
1878 my $fh = shift || \*STDOUT;
1879 my $absolute = shift;
1881 for my $link ( $self->links ) {
1882 my $url = $absolute ? $link->url_abs : $link->url;
1883 $url = '' if not defined $url;
1884 print {$fh} $url, "\n";
1889 =head2 $mech->dump_images( [[$fh], $absolute] )
1891 Prints a dump of the images on the current page to I<$fh>. If I<$fh>
1892 is not specified or is undef, it dumps to STDOUT.
1894 If I<$absolute> is true, links displayed are absolute, not relative.
1900 my $fh = shift || \*STDOUT;
1901 my $absolute = shift;
1903 for my $image ( $self->images ) {
1904 my $url = $absolute ? $image->url_abs : $image->url;
1905 $url = '' if not defined $url;
1906 print {$fh} $url, "\n";
1911 =head2 $mech->dump_forms( [$fh] )
1913 Prints a dump of the forms on the current page to I<$fh>. If I<$fh>
1914 is not specified or is undef, it dumps to STDOUT.
1920 my $fh = shift || \*STDOUT;
1922 for my $form ( $self->forms ) {
1923 print {$fh} $form->dump, "\n";
1928 =head2 $mech->dump_all( [[$fh], $absolute] )
1930 Prints a dump of all links, images and forms on the current page to
1931 I<$fh>. If I<$fh> is not specified or is undef, it dumps to STDOUT.
1933 If I<$absolute> is true, links displayed are absolute, not relative.
1939 my $fh = shift || \*STDOUT;
1940 my $absolute = shift;
1942 $self->dump_links( $fh, $absolute );
1943 $self->dump_images( $fh, $absolute );
1944 $self->dump_forms( $fh, $absolute );
1950 =head1 OVERRIDDEN LWP::UserAgent METHODS
1952 =head2 $mech->clone()
1954 Clone the mech object. We override here to be sure the cookie jar
1961 my $clone = $self->SUPER::clone();
1962 $clone->{cookie_jar} = $self->cookie_jar;
1967 =head2 $mech->redirect_ok()
An overridden version of C<redirect_ok()> in L<LWP::UserAgent>.
1970 This method is used to determine whether a redirection in the request
1977 my $prospective_request = shift;
1978 my $response = shift;
1980 my $ok = $self->SUPER::redirect_ok( $prospective_request, $response );
1982 $self->{redirected_uri} = $prospective_request->uri;
1989 =head2 $mech->request( $request [, $arg [, $size]])
Overridden version of C<request()> in L<LWP::UserAgent>. Performs
1992 the actual request. Normally, if you're using WWW::Mechanize, it's
1993 because you don't want to deal with this level of stuff anyway.
1995 Note that C<$request> will be modified.
1997 Returns an L<HTTP::Response> object.
2003 my $request = shift;
2005 $request = $self->_modify_request( $request );
2007 if ( $request->method eq 'GET' || $request->method eq 'POST' ) {
2008 $self->_push_page_stack();
2011 $self->_update_page($request, $self->_make_request( $request, @_ ));
# XXX This should definitely return something.
2016 =head2 $mech->update_html( $html )
2018 Allows you to replace the HTML that the mech has found. Updates the
2019 forms and links parse-trees that the mech uses internally.
2021 Say you have a page that you know has malformed output, and you want to
2022 update it so the links come out correctly:
2024 my $html = $mech->content;
2025 $html =~ s[</option>.{0,3}</td>][</option></select></td>]isg;
2026 $mech->update_html( $html );
2028 This method is also used internally by the mech itself to update its
2029 own HTML content when loading a page. This means that if you would
2030 like to I<systematically> perform the above HTML substitution, you
would override I<update_html> in a subclass thusly:
2034 use base 'WWW::Mechanize';
2037 my ($self, $html) = @_;
2038 $html =~ s[</option>.{0,3}</td>][</option></select></td>]isg;
2039 $self->WWW::Mechanize::update_html( $html );
2042 If you do this, then the mech will use the tidied-up HTML instead of
2043 the original both when parsing for its own needs, and for returning to
2044 you through L</content>.
Overriding this method is also the recommended way of implementing
2047 extra validation steps (e.g. link checkers) for every HTML page
2048 received. L</warn> and L</die> would then come in handy to signal
2058 $self->{ct} = 'text/html';
2059 $self->{content} = $html;
2061 $self->{forms} = [ HTML::Form->parse($html, $self->base) ];
2062 for my $form (@{ $self->{forms} }) {
2063 for my $input ($form->inputs) {
2064 if ($input->type eq 'file') {
2065 $input->value( undef );
2069 $self->{form} = $self->{forms}->[0];
2070 $self->{_extracted_links} = 0;
2071 $self->{_extracted_images} = 0;
2076 =head2 $mech->credentials( $username, $password )
2078 Provide credentials to be used for HTTP Basic authentication for all sites and
2079 realms until further notice.
2081 The four argument form described in L<LWP::UserAgent> is still supported.
2090 no warnings 'redefine'; ## no critic
2094 and *LWP::UserAgent::get_basic_credentials = $saved_method;
2095 return $self->SUPER::credentials(@_);
2099 or $self->die( 'Invalid # of args for overridden credentials()' );
2101 my ($username, $password) = @_;
2102 $saved_method ||= \&LWP::UserAgent::get_basic_credentials;
2103 *LWP::UserAgent::get_basic_credentials
2104 = sub { return $username, $password };
2109 =head1 INTERNAL-ONLY METHODS
2111 These methods are only used internally. You probably don't need to
2114 =head2 $mech->_update_page($request, $response)
2116 Updates all internal variables in $mech as if $request was just
2117 performed, and returns $response. The page stack is B<not> altered by
this method; it is up to the caller (e.g. L</request>) to do that.
2123 my ($self, $request, $res) = @_;
2125 $self->{req} = $request;
2126 $self->{redirected_uri} = $request->uri->as_string;
2128 $self->{res} = $res;
2130 $self->{status} = $res->code;
2131 $self->{base} = $res->base;
2132 $self->{ct} = $res->content_type || '';
2134 if ( $res->is_success ) {
2135 $self->{uri} = $self->{redirected_uri};
2136 $self->{last_uri} = $self->{uri};
2139 if ( $res->is_error ) {
2140 if ( $self->{autocheck} ) {
2141 $self->die( 'Error ', $request->method, 'ing ', $request->uri, ': ', $res->message );
2147 # Try to decode the content. Undef will be returned if there's nothing to decompress.
2148 # See docs in HTTP::Message for details. Do we need to expose the options there?
2149 # use charset => 'none' because while we want LWP to handle Content-Encoding for
2150 # the auto-gzipping with Compress::Zlib we don't want it messing with charset
2151 my $content = $res->decoded_content( charset => 'none' );
2152 $content = $res->content if (not defined $content);
2154 $content .= _taintedness();
2156 if ($self->is_html) {
2157 $self->update_html($content);
2160 $self->{content} = $content;
2168 # This is lifted wholesale from Test::Taint
2170 return $_taintbrush if defined $_taintbrush;
2172 # Somehow we need to get some taintedness into our $_taintbrush.
2173 # Let's try the easy way first. Either of these should be
2174 # tainted, unless somebody has untainted them, so this
2175 # will almost always work on the first try.
2176 # (Unless, of course, taint checking has been turned off!)
2177 $_taintbrush = substr("$0$^X", 0, 0);
2178 return $_taintbrush if _is_tainted( $_taintbrush );
2180 # Let's try again. Maybe somebody cleaned those.
2181 $_taintbrush = substr(join("", @ARGV, %ENV), 0, 0);
2182 return $_taintbrush if _is_tainted( $_taintbrush );
2184 # If those don't work, go try to open some file from some unsafe
2185 # source and get data from them. That data is tainted.
2186 # (Yes, even reading from /dev/null works!)
2187 for my $filename ( qw(/dev/null / . ..), values %INC, $0, $^X ) {
2188 if ( open my $fh, '<', $filename ) {
2190 if ( defined sysread $fh, $data, 1 ) {
2191 $_taintbrush = substr( $data, 0, 0 );
2192 last if _is_tainted( $_taintbrush );
2198 die "Our taintbrush should have zero length!" if length $_taintbrush;
2200 return $_taintbrush;
2204 no warnings qw(void uninitialized);
2206 return !eval { join('', shift), kill 0; 1 };
2210 =head2 $mech->_modify_request( $req )
2212 Modifies a L<HTTP::Request> before the request is sent out,
2213 for both GET and POST requests.
We add a C<Referer> header, as well as a header noting that we can accept
gzip-encoded content, if L<Compress::Zlib> is installed.
2220 sub _modify_request {
2224 # add correct Accept-Encoding header to restore compliance with
2225 # http://www.freesoft.org/CIE/RFC/2068/158.htm
2226 # http://use.perl.org/~rhesa/journal/25952
2227 if (not $req->header( 'Accept-Encoding' ) ) {
2228 # "identity" means "please! unencoded content only!"
2229 $req->header( 'Accept-Encoding', $HAS_ZLIB ? 'gzip' : 'identity' );
2232 my $last = $self->{last_uri};
2234 $last = $last->as_string if ref($last);
2235 $req->header( Referer => $last );
2237 while ( my($key,$value) = each %{$self->{headers}} ) {
2238 if ( defined $value ) {
2239 $req->header( $key => $value );
2242 $req->remove_header( $key );
2250 =head2 $mech->_make_request()
2252 Convenience method to make it easier for subclasses like
2253 L<WWW::Mechanize::Cached> to intercept the request.
2259 $self->SUPER::request(@_);
2262 =head2 $mech->_reset_page()
Resets the internal fields that track parsed page data.
2271 $self->{_extracted_links} = 0;
2272 $self->{_extracted_images} = 0;
2273 $self->{links} = [];
2274 $self->{images} = [];
2275 $self->{forms} = [];
2276 delete $self->{form};
2281 =head2 $mech->_extract_links()
2283 Extracts links from the content of a webpage, and populates the C<{links}>
2284 property with L<WWW::Mechanize::Link> objects.
2296 sub _extract_links {
2300 $self->{links} = [];
2301 if ( defined $self->{content} ) {
2302 my $parser = HTML::TokeParser->new(\$self->{content});
2303 while ( my $token = $parser->get_tag( keys %link_tags ) ) {
2304 my $link = $self->_link_from_token( $token, $parser );
2305 push( @{$self->{links}}, $link ) if $link;
2309 $self->{_extracted_links} = 1;
2320 sub _extract_images {
2323 $self->{images} = [];
2325 if ( defined $self->{content} ) {
2326 my $parser = HTML::TokeParser->new(\$self->{content});
2327 while ( my $token = $parser->get_tag( keys %image_tags ) ) {
2328 my $image = $self->_image_from_token( $token, $parser );
2329 push( @{$self->{images}}, $image ) if $image;
2333 $self->{_extracted_images} = 1;
2338 sub _image_from_token {
2343 my $tag = $token->[0];
2344 my $attrs = $token->[1];
2346 if ( $tag eq 'input' ) {
2347 my $type = $attrs->{type} or return;
2348 return unless $type eq 'image';
2351 require WWW::Mechanize::Image;
2353 WWW::Mechanize::Image->new({
2355 base => $self->base,
2356 url => $attrs->{src},
2357 name => $attrs->{name},
2358 height => $attrs->{height},
2359 width => $attrs->{width},
2360 alt => $attrs->{alt},
2364 sub _link_from_token {
2369 my $tag = $token->[0];
2370 my $attrs = $token->[1];
2371 my $url = $attrs->{$link_tags{$tag}};
2375 if ( $tag eq 'a' ) {
2376 $text = $parser->get_trimmed_text("/$tag");
2377 $text = '' unless defined $text;
2379 my $onClick = $attrs->{onclick};
2380 if ( $onClick && ($onClick =~ /^window\.open\(\s*'([^']+)'/) ) {
# Of the tags we extract from, only 'AREA' has an alt attribute
2386 # The rest should have a 'name' attribute.
2387 # ... but we don't do anything with that bit of wisdom now.
2389 $name = $attrs->{name};
2391 if ( $tag eq 'meta' ) {
2392 my $equiv = $attrs->{'http-equiv'};
2393 my $content = $attrs->{'content'};
2394 return unless $equiv && (lc $equiv eq 'refresh') && defined $content;
2396 if ( $content =~ /^\d+\s*;\s*url\s*=\s*(\S+)/i ) {
2398 $url =~ s/^"(.+)"$/$1/ or $url =~ s/^'(.+)'$/$1/;
2405 return unless defined $url; # probably just a name link or <AREA NOHREF...>
2407 require WWW::Mechanize::Link;
2409 WWW::Mechanize::Link->new({
2414 base => $self->base,
2417 } # _link_from_token
2419 =head2 $mech->_push_page_stack() / $mech->_pop_page_stack()
2421 The agent keeps a stack of visited pages, which it can pop when it needs
2422 to go BACK and so on.
2424 The current page needs to be pushed onto the stack before we get a new
2425 page, and the stack needs to be popped when BACK occurs.
Neither of these takes any arguments; they just operate on the $mech
2432 sub _push_page_stack {
2435 # Don't push anything if it's a virgin object
2436 if ( $self->{res} && $self->stack_depth ) {
2437 my $save_stack = $self->{page_stack};
2438 $self->{page_stack} = [];
2440 my $clone = $self->clone;
2441 push( @{$save_stack}, $clone );
2443 while ( @{$save_stack} > $self->stack_depth ) {
2444 shift @{$save_stack};
2446 $self->{page_stack} = $save_stack;
2452 sub _pop_page_stack {
2455 if ( $self->{page_stack} && @{$self->{page_stack}} ) {
2456 my $popped = pop @{$self->{page_stack}};
2458 # eliminate everything in self
2459 foreach my $key ( keys %{$self} ) {
2460 delete $self->{ $key } unless $key eq 'page_stack';
2463 # make self just like the popped object
2464 foreach my $key ( keys %{$popped} ) {
2465 $self->{ $key } = $popped->{ $key } unless $key eq 'page_stack';
2472 =head2 warn( @messages )
2474 Centralized warning method, for diagnostics and non-fatal problems.
2475 Defaults to calling C<CORE::warn>, but may be overridden by setting
2476 C<onwarn> in the constructor.
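The dispatch is simple: no handler means nothing happens, and quiet mode suppresses the call. The sketch below uses a plain hash in place of a mech object and a hypothetical C<warn_sketch> function, with a handler that collects messages instead of printing them.

```perl
use strict;
use warnings;

my @log;

# Hypothetical stand-in for a mech object: just the two relevant settings.
my %mech = (
    onwarn => sub { push @log, join( '', @_ ); return },
    quiet  => 0,
);

sub warn_sketch {
    my ( $self, @messages ) = @_;
    return unless my $handler = $self->{onwarn};    # no handler: nothing to do
    return if $self->{quiet};                       # quiet mode suppresses warnings
    return $handler->(@messages);
}

warn_sketch( \%mech, 'Input "', 'color', '" not found' );

$mech{quiet} = 1;
warn_sketch( \%mech, 'this one is suppressed' );

print scalar(@log), " warning(s) logged: $log[0]\n";
```

Collecting warnings this way can be handy in tests, where you may want to assert on them rather than let them reach the screen.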
2483 return unless my $handler = $self->{onwarn};
2485 return if $self->quiet;
2487 return $handler->(@_);
2490 =head2 die( @messages )
2492 Centralized error method. Defaults to calling C<CORE::die>, but
2493 may be overridden by setting C<onerror> in the constructor.
2500 return unless my $handler = $self->{onerror};
2502 return $handler->(@_);
2506 # NOT an object method!
2509 return &Carp::carp; ## no critic
2512 # NOT an object method!
2515 return &Carp::croak; ## no critic
2522 =head1 REQUESTS & BUGS
2524 The bug queue for WWW::Mechanize and Test::WWW::Mechanize is at
2525 L<http://code.google.com/p/www-mechanize/issues/list>. Please do
2526 not add any tickets to the old queue at L<http://rt.cpan.org/>.
2528 =head1 WWW::MECHANIZE'S SUBVERSION REPOSITORY
2530 Mech and Test::WWW::Mechanize are both hosted at Google Code:
2531 http://code.google.com/p/www-mechanize/. The Subversion repository
2532 is at http://www-mechanize.googlecode.com/svn/wm/.
2534 =head1 OTHER DOCUMENTATION
2536 =head2 I<Spidering Hacks>, by Kevin Hemenway and Tara Calishain
2538 I<Spidering Hacks> from O'Reilly
2539 (L<http://www.oreilly.com/catalog/spiderhks/>) is a great book for anyone
2540 wanting to know more about screen-scraping and spidering.
2542 There are six hacks that use Mech or a Mech derivative:
2546 =item #21 WWW::Mechanize 101
2548 =item #22 Scraping with WWW::Mechanize
2550 =item #36 Downloading Images from Webshots
2552 =item #44 Archiving Yahoo! Groups Messages with WWW::Yahoo::Groups
2554 =item #64 Super Author Searching
2556 =item #73 Scraping TV Listings
2560 The book was also positively reviewed on Slashdot:
2561 L<http://books.slashdot.org/article.pl?sid=03/12/11/2126256>
2563 =head1 ONLINE RESOURCES AND SUPPORT
2567 =item * WWW::Mechanize mailing list
2569 The Mech mailing list is at
2570 L<http://groups.google.com/group/www-mechanize-users> and is specific
2571 to Mechanize, unlike the LWP mailing list below. Although it is a
2572 users list, all development discussion takes place here, too.
2574 =item * LWP mailing list
2576 The LWP mailing list is at
2577 L<http://lists.perl.org/showlist.cgi?name=libwww>, and is more
2578 user-oriented and well-populated than the WWW::Mechanize list.
2582 L<http://perlmonks.org> is an excellent community of support, and
2583 many questions about Mech have already been answered there.
2585 =item * L<WWW::Mechanize::Examples>
2587 A random array of examples submitted by users, included with the
2588 Mechanize distribution.
2592 =head1 ARTICLES ABOUT WWW::MECHANIZE
2596 =item * L<http://www-128.ibm.com/developerworks/linux/library/wa-perlsecure.html>
2598 IBM article "Secure Web site access with Perl"
2600 =item * L<http://www.oreilly.com/catalog/googlehks2/chapter/hack84.pdf>
2602 Leland Johnson's hack #84 in I<Google Hacks, 2nd Edition> is
2603 an example of a production script that uses WWW::Mechanize and
2604 HTML::TableContentParser. It takes in keywords and returns the estimated
2605 price of these keywords on Google's AdWords program.
2607 =item * L<http://www.perl.com/pub/a/2004/06/04/recorder.html>
2609 Linda Julien writes about using HTTP::Recorder to create WWW::Mechanize
2612 =item * L<http://www.developer.com/lang/other/article.php/3454041>
2614 Jason Gilmore's article on using WWW::Mechanize for scraping sales
2615 information from Amazon and eBay.
2617 =item * L<http://www.perl.com/pub/a/2003/01/22/mechanize.html>
2619 Chris Ball's article about using WWW::Mechanize for scraping TV
2622 =item * L<http://www.stonehenge.com/merlyn/LinuxMag/col47.html>
2624 Randal Schwartz's article on scraping Yahoo News for images. It's
2625 already out of date: He manually walks the list of links hunting
2626 for matches, which wouldn't have been necessary if the C<find_link()>
2627 method existed at press time.
2629 =item * L<http://www.perladvent.org/2002/16th/>
2631 WWW::Mechanize on the Perl Advent Calendar, by Mark Fowler.
2633 =item * L<http://www.linux-magazin.de/Artikel/ausgabe/2004/03/perl/perl.html>
2635 Michael Schilli's article on Mech and L<WWW::Mechanize::Shell> for the
2636 German magazine I<Linux Magazin>.
2640 =head2 Other modules that use Mechanize
2642 Here are modules that use or subclass Mechanize. Let me know of any others:
2646 =item * L<Finance::Bank::LloydsTSB>
2648 =item * L<HTTP::Recorder>
2650 Acts as a proxy for web interaction, and then generates WWW::Mechanize scripts.
2652 =item * L<Win32::IE::Mechanize>
2654 Just like Mech, but using Microsoft Internet Explorer to do the work.
2656 =item * L<WWW::Bugzilla>
2658 =item * L<WWW::CheckSite>
2660 =item * L<WWW::Google::Groups>
2662 =item * L<WWW::Hotmail>
2664 =item * L<WWW::Mechanize::Cached>
2666 =item * L<WWW::Mechanize::FormFiller>
2668 =item * L<WWW::Mechanize::Shell>
2670 =item * L<WWW::Mechanize::Sleepy>
2672 =item * L<WWW::Mechanize::SpamCop>
2674 =item * L<WWW::Mechanize::Timed>
2676 =item * L<WWW::SourceForge>
2678 =item * L<WWW::Yahoo::Groups>
2682 =head1 ACKNOWLEDGEMENTS
2684 Thanks to the numerous people who have helped out on WWW::Mechanize in
2685 one way or another, including
2686 Kirrily Robert for the original C<WWW::Automate>,
2714 Dominique Quatravaux,
2724 and the late great Iain Truskett.
2728 Copyright (c) 2005-2007 Andy Lester. All rights reserved. This program is
2729 free software; you can redistribute it and/or modify it under the same
2730 terms as Perl itself.