Python tool that mimics cURL, but performs a login and handles any Cross-Site Request Forgery (CSRF) tokens. Useful for scraping HTML normally only accessible when logged in.
GPL-3.0 License
csrfmiddlewaretoken
)id
of form

usage: curl-auth-csrf.py [-h] [-a USER_AGENT_STR] -i LOGIN_URL [-f FORM_ID]
                         [-p PASSWORD_FIELD_NAME] [-d DATA] [-u SUCCESS_URL]
                         [-t SUCCESS_TEXT] [-j LOGOUT_URL] [-o FILE]
                         [--version]
                         url_after_login [url_after_login ...]

Python tool that mimics curl, but performs a login and handles any Cross-Site
Request Forgery (CSRF) tokens. Useful for scraping HTML normally only
accessible when logged in.

positional arguments:
  url_after_login

optional arguments:
  -a USER_AGENT_STR, --user-agent-str USER_AGENT_STR
                        User-Agent string to use
  -i LOGIN_URL, --login-url LOGIN_URL
                        URL that contains the login form
  -f FORM_ID, --form-id FORM_ID
                        HTML id attribute of login form
  -p PASSWORD_FIELD_NAME, --password-field-name PASSWORD_FIELD_NAME
                        name of input field containing password
  -d DATA, --data DATA  adds the specified data to the form submission
                        (usually just the username)
  -u SUCCESS_URL, --success-url SUCCESS_URL
                        URL substring constituting successful login
  -t SUCCESS_TEXT, --success-text SUCCESS_TEXT
                        HTML snippet constituting successful login
  -j LOGOUT_URL, --logout-url LOGOUT_URL
                        URL to be visited to perform the logout
  -o FILE, --output FILE
                        write output to <file> instead of stdout
  --version             show program's version number and exit
  -h, --help            show this help message and exit

If actual password is not passed in via stdin, the user will be prompted.
The script expects the password to be passed in via stdin, to avoid the plain-text password showing up in shell history. A simple way to do this is as follows:
echo ThisIsMyPassword | ./curl-auth-csrf.py -i http://foobar.com/login -d username=bob http://foobar.com/secure_page
(Trailing newlines in the password are ignored.)
However, this defeats the purpose, as the password still shows up in the shell history. (Exception: In Bash, start the line with an initial space, which will prevent the line from showing up in the history. Refer to Bash documentation on HISTCONTROL and HISTIGNORE.)
A better way to handle this is with a CLI password management tool, such as pass. This is the recommended approach. For example, assuming that your password is managed by pass and already encrypted under the handle foobar.com:
pass foobar.com | ./curl-auth-csrf.py -i http://foobar.com/login -d username=bob http://foobar.com/secure_page
If nothing is passed in via stdin, then the user will be prompted for the password (interactively):
./curl-auth-csrf.py -i http://foobar.com/login -d username=bob http://foobar.com/secure_page
Password:
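The stdin behavior described above (use piped input if present, ignore trailing newlines, otherwise prompt) can be sketched roughly as follows. This is a minimal illustration, not the script's actual code; the function name read_password and its parameters are hypothetical.

```python
import getpass
import sys


def read_password(stream=None, prompt="Password: "):
    """Return the password from piped input, or prompt if interactive.

    Hypothetical helper for illustration; the real script may differ.
    """
    stream = sys.stdin if stream is None else stream
    if stream.isatty():
        # Nothing was piped in, so ask interactively without echoing.
        return getpass.getpass(prompt)
    # Piped input: drop trailing newlines, such as the one echo appends.
    return stream.read().rstrip("\r\n")
```

With this approach, `echo ThisIsMyPassword | ...` and `pass foobar.com | ...` both yield the bare password, while running the command without a pipe falls through to the interactive prompt.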
If your username is [email protected] for pbs.org, here is how you might normally scrape the zip code from your user profile:
curl -sL https://account.pbs.org/accounts/profile/ | grep Zip
However, since doing so requires being logged in, here's one way to do it using curl-auth-csrf:
pass pbs.org | ./curl-auth-csrf.py -i https://account.pbs.org/accounts/login/ -d [email protected] -u https://account.pbs.org/accounts/profile/ -j https://account.pbs.org/accounts/logout/ https://account.pbs.org/accounts/profile/ | grep Zip
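Notes:
The login form is at https://account.pbs.org/accounts/login/ (-i).
The username is submitted in the form field named email (-d).
A post-login URL containing https://account.pbs.org/accounts/profile/ counts as a successful login (-u).
Logout is performed by visiting https://account.pbs.org/accounts/logout/ (-j).
The page fetched while logged in is https://account.pbs.org/accounts/profile/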
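As a rough illustration of what the -u and -t options check, the sketch below treats the success URL as a substring of the post-login URL and the success text as a snippet of the returned HTML, following the wording of the help text. The function name login_succeeded is hypothetical, and requiring both checks to pass when both are given is an assumption about how the script combines them.

```python
def login_succeeded(final_url, html, success_url=None, success_text=None):
    """Decide whether a login worked, based on optional success markers.

    Hypothetical helper: final_url is the URL reached after submitting the
    login form, html is the body returned from that URL.
    """
    if success_url is not None and success_url not in final_url:
        # -u: the post-login URL must contain this substring.
        return False
    if success_text is not None and success_text not in html:
        # -t: the post-login HTML must contain this snippet.
        return False
    # Assumption: with no markers given, the login is not second-guessed.
    return True
```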
Another example, with a logout page and multiple pages fetched while logged in:
pass thefastpark.com | ./curl-auth-csrf.py -i https://www.thefastpark.com/ -d [email protected] -u https://www.thefastpark.com/myrewards/history/ -j https://www.thefastpark.com/myrewards/logout/ https://www.thefastpark.com/myrewards/history/ https://www.thefastpark.com/myrewards/redeempoints/ | egrep -i '(Total Points|points available)'
This script only handles standard logins involving a single form submission with a username, password, and hidden fields for CSRF. It will not handle the following scenarios:
If all you need is basic HTTP authentication, this script is overkill. cURL and Wget can do that out of the box.
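To make the "standard login" case concrete, here is a minimal sketch of how a single login form with hidden CSRF fields might be turned into a submission payload, using only the standard library's html.parser. It is not the script's implementation: the class and function names are hypothetical, and detecting the login form by the presence of a password input (unless a form id is given) is an assumption.

```python
from html.parser import HTMLParser


class LoginFormParser(HTMLParser):
    """Record each form's hidden inputs and password field name."""

    def __init__(self):
        super().__init__()
        self.forms = []       # one dict per <form> encountered
        self._current = None  # form currently being parsed, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {"id": attrs.get("id"),
                             "action": attrs.get("action"),
                             "hidden": {}, "password_field": None}
            self.forms.append(self._current)
        elif tag == "input" and self._current is not None:
            name = attrs.get("name")
            kind = (attrs.get("type") or "text").lower()
            if name is None:
                return
            if kind == "hidden":
                # Hidden fields carry the CSRF token, whatever its name.
                self._current["hidden"][name] = attrs.get("value") or ""
            elif kind == "password":
                self._current["password_field"] = name

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None


def build_login_payload(html, password, extra_data, form_id=None):
    """Return (form action, payload dict) for the login form in html."""
    parser = LoginFormParser()
    parser.feed(html)
    for form in parser.forms:
        if form_id is not None and form["id"] != form_id:
            continue
        if form["password_field"] is None:
            continue  # not a login form under this sketch's assumption
        payload = dict(form["hidden"])   # CSRF token(s) and other hidden fields
        payload.update(extra_data)       # e.g. {"username": "bob"} from -d
        payload[form["password_field"]] = password
        return form["action"], payload
    raise ValueError("no login form with a password field was found")
```

The payload would then be POSTed to the form's action URL within a session that preserves cookies. Logins that do not fit this single-form shape are outside what the sketch covers.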
If you get a UnicodeEncodeError exception, try setting PYTHONIOENCODING=UTF-8 in your terminal. See this post.
Please don't abuse this tool. Only use it with accounts that rightfully belong to you. If you use this tool with someone else's login, you are solely responsible and may face legal consequences.
This script isn't perfect. See the Limitations section above; also, there may be defects. Beware that some Internet services won't take kindly to logins that don't come from a normal browser. By using this tool, you accept full responsibility for anything that might happen.
If you're having trouble finding the right parameters, you can change the default debugging level from "WARNING" to "DEBUG" at the top of the Python script. See discussion at #2.
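For orientation, switching a Python script's logging level from WARNING to DEBUG typically looks like the sketch below. The constant name LOG_LEVEL, the logger name, and the sample message are hypothetical; check the top of the actual script for the setting it uses.

```python
import logging

# Hypothetical setting; the original value would be logging.WARNING.
LOG_LEVEL = logging.DEBUG

logging.basicConfig(format="%(levelname)s: %(message)s")
log = logging.getLogger("curl-auth-csrf")
log.setLevel(LOG_LEVEL)

# At DEBUG level, messages like this one become visible on stderr.
log.debug("hidden form fields found: %s", ["csrfmiddlewaretoken"])
```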