A command line tool (and Python library) for archiving Twitter JSON
MIT License
v2.8.1 includes a small update to the twarc search --help message that links to Twitter's Building Queries for Search Tweets to help users figure out what's possible.
https://developer.twitter.com/en/docs/twitter-api/tweets/search/integrate/build-a-query
v2.8.0 adds some new controls for shaping the data that is returned from the Twitter API. The default behavior is for twarc to retrieve the fullest representation of a tweet by requesting all tweet, user, media, place and poll fields as well as all available expansions. This is generally good practice with twarc because it means that downstream processing of the collected data can rely on having all this data at its disposal. However, there may be cases where you want to customize the data that comes back. This is not recommended practice, but it could be useful in some contexts.
The following options allow you to fine tune the types of data that are requested when using the following sub-commands: search, searches, tweet, sample, hydrate, users, mentions, timeline, timelines, conversation, conversations, and stream. The options include:
--expansions TEXT    Comma separated list of expansions to retrieve. Default is all available.
--tweet-fields TEXT  Comma separated list of tweet fields to retrieve. Default is all available.
--user-fields TEXT   Comma separated list of user fields to retrieve. Default is all available.
--media-fields TEXT  Comma separated list of media fields to retrieve. Default is all available.
--place-fields TEXT  Comma separated list of place fields to retrieve. Default is all available.
--poll-fields TEXT   Comma separated list of poll fields to retrieve. Default is all available.
These correspond to the API Fields and Expansions.
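As a rough illustration of how these options combine on the command line, the sketch below assembles a twarc2 search invocation that requests only a few fields. The helper build_search_command is hypothetical and not part of twarc, and the query, output file and field names (id, text, created_at, author_id, username) are illustrative values assumed from the Twitter API v2 field documentation.

```python
import shlex

# Option names come from the help text above; everything else is illustrative.
FIELD_OPTIONS = (
    "expansions",
    "tweet-fields",
    "user-fields",
    "media-fields",
    "place-fields",
    "poll-fields",
)


def build_search_command(query, outfile, **fields):
    """Return an argv list for `twarc2 search` limited to the given fields.

    Hypothetical helper: keyword names map to options, e.g.
    tweet_fields=[...] becomes --tweet-fields with a comma separated value.
    """
    argv = ["twarc2", "search"]
    for name, values in fields.items():
        option = name.replace("_", "-")
        if option not in FIELD_OPTIONS:
            raise ValueError(f"unknown field option: --{option}")
        argv += [f"--{option}", ",".join(values)]
    return argv + [query, outfile]


command = build_search_command(
    "twarc",
    "tweets.jsonl",
    tweet_fields=["id", "text", "created_at"],
    expansions=["author_id"],
    user_fields=["username"],
)
print(shlex.join(command))
```

The resulting list can be passed to subprocess.run, or the printed line pasted into a shell. Because fewer fields are requested, downstream tools that expect the full tweet representation may not work on the output.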
There is also --minimal-fields, which requests just a minimal subset of data, and --no-context-annotations, which omits context annotations and so allows more tweets to be fetched at one time (500 instead of 100). These also apply to the sub-commands: search, searches, tweet, sample, hydrate, users, mentions, timeline, timelines, conversation, conversations, stream.
--minimal-fields     By default twarc gets all available data. This option requests the minimal retrievable amount of data - only IDs and object references are retrieved. Setting this makes --max-results 500 the default. NOTE: This argument is mutually exclusive with arguments: [--counts-only, --poll-fields, --media-fields, --expansions, --no-context-annotations, --place-fields, --user-fields, --tweet-fields].
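The mutual exclusivity rule in the note above can be checked before launching a long collection job. This is a minimal sketch of such a check, not twarc's own validation code; the function name check_minimal_fields is hypothetical, and the set of conflicting flags is copied from the help text.

```python
# Flags the help text lists as mutually exclusive with --minimal-fields.
EXCLUSIVE_WITH_MINIMAL = {
    "--counts-only",
    "--poll-fields",
    "--media-fields",
    "--expansions",
    "--no-context-annotations",
    "--place-fields",
    "--user-fields",
    "--tweet-fields",
}


def check_minimal_fields(flags):
    """Raise ValueError if --minimal-fields is combined with a conflicting flag.

    `flags` is an iterable of option names only (no values).
    """
    flags = set(flags)
    if "--minimal-fields" not in flags:
        return
    conflicts = sorted(flags & EXCLUSIVE_WITH_MINIMAL)
    if conflicts:
        raise ValueError(
            "--minimal-fields cannot be combined with: " + ", ".join(conflicts)
        )


# Passes silently: --max-results is not in the exclusive list.
check_minimal_fields(["--minimal-fields", "--max-results"])
```

Calling check_minimal_fields(["--minimal-fields", "--user-fields"]) raises ValueError, mirroring the error a user would expect from the command line.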
v2.7.0 adds a new places command to search for places and their identifiers, which can be used in search and stream queries. Although it is still part of the v1.1 API, the 1.1/geo/search.json endpoint makes these place identifiers available when searching by name, geo coordinates, or IP address.
Usage: twarc2 places [OPTIONS] VALUE [OUTFILE]
Search for places by place name, geo coordinates or ip address.
Options:
--type [name|geo|ip] How to search for places (defaults to name)
--granularity [neighborhood|city|admin|country]
What type of places to search for (defaults to neighborhood)
--max-results INTEGER Maximum results to return
--json Output raw JSON response
--help Show this message and exit.
There is a corresponding twarc.client2.Twarc2.geo() method which you can use to do the lookup yourself from Python.
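For scripted use of the command line form, the usage text above can be turned into a small argument builder. This is only a sketch: build_places_command is a hypothetical helper, the place name is an example value, and the allowed choices are copied from the help output. It does not call Twarc2.geo(), whose parameters are not described here.

```python
# Choices copied from the `twarc2 places --help` output above.
PLACE_TYPES = ("name", "geo", "ip")
GRANULARITIES = ("neighborhood", "city", "admin", "country")


def build_places_command(value, search_type="name", granularity="neighborhood",
                         max_results=None, raw_json=False):
    """Return an argv list for `twarc2 places` (hypothetical helper)."""
    if search_type not in PLACE_TYPES:
        raise ValueError(f"--type must be one of {PLACE_TYPES}")
    if granularity not in GRANULARITIES:
        raise ValueError(f"--granularity must be one of {GRANULARITIES}")
    argv = ["twarc2", "places", "--type", search_type,
            "--granularity", granularity]
    if max_results is not None:
        argv += ["--max-results", str(max_results)]
    if raw_json:
        argv.append("--json")
    argv.append(value)
    return argv


# Example: look up city-level places by name, keeping at most five results.
places_argv = build_places_command("Montreal", granularity="city", max_results=5)
print(" ".join(places_argv))
```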
Published by SamHames about 3 years ago
Published by edsu about 3 years ago
This release includes new functionality for working with Twitter's new Batch Compliance API, which allows you to upload large datasets of Tweet or user IDs and retrieve their compliance status, so you can determine what data requires action to bring your datasets into compliance.
Usage: twarc2 compliance-job [OPTIONS] COMMAND [ARGS]...
Create, retrieve and list batch compliance jobs for Tweets and Users.
Options:
--help Show this message and exit.
Commands:
create Create a new compliance job and upload tweet IDs.
download Download the compliance job with the specified ID.
get Returns status and download information about the job ID.
list Returns a list of compliance jobs by job type and status.
Published by edsu about 3 years ago
Published by edsu about 3 years ago
This release ensures that the timeline, timelines, conversation and conversations commands default to a --start-time of 2006-03-21 (the first day of tweets) when instructed to use the /tweets/search/all endpoint behind the scenes. For example when doing:
twarc2 timeline --use-search jack
or:
twarc2 conversation --archive 21
Previously it was defaulting to the last 30 days (which is an unfortunate default set by the /tweets/search/all endpoint). Many thanks to Darren Halpin and @SamHames for identifying and fixing the issue!
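The defaulting rule described in this release can be summarized in a few lines. The following is a sketch of the behavior only, not twarc's actual implementation; the function effective_start_time and its parameters are hypothetical names.

```python
from datetime import datetime, timezone

# The first day of tweets, as stated in the release note.
FIRST_TWEET_DAY = datetime(2006, 3, 21, tzinfo=timezone.utc)


def effective_start_time(start_time=None, uses_full_archive=False):
    """Return the start time to send to the API.

    If the caller gave no start time and the full-archive search endpoint
    is being used, fall back to the first day of tweets instead of the
    endpoint's own 30-day default. Otherwise pass the value through.
    """
    if start_time is None and uses_full_archive:
        return FIRST_TWEET_DAY
    return start_time


default_start = effective_start_time(uses_full_archive=True)
print(default_start.date().isoformat())
```

An explicit start time always wins, so users who do want a narrower window are unaffected.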
Published by edsu about 3 years ago
This release includes support for requesting the new alt_text field for media from Twitter's v2 API:
https://twittercommunity.com/t/media-alt-text-field-now-available-in-twitter-api-v2/157939
Published by edsu about 3 years ago
This release includes a new dehydrate command for turning tweets into tweet id datasets.
twarc2 dehydrate tweets.jsonl > ids.txt
It also includes improvements to progress bar and user lookup behavior.
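Conceptually, dehydrating means reducing each collected tweet to its ID. The sketch below shows the idea in plain Python and is not the code behind twarc2 dehydrate. It assumes each input line is either an API response object with a "data" key holding one tweet or a list of tweets, or a single tweet object with a top-level "id"; the function name dehydrate_lines is hypothetical.

```python
import json


def dehydrate_lines(lines):
    """Yield tweet IDs from an iterable of JSON lines.

    Assumption: a line is either a response page ({"data": [...]} or
    {"data": {...}}) or a single tweet object with an "id" field.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        data = record.get("data", record)
        tweets = data if isinstance(data, list) else [data]
        for tweet in tweets:
            if "id" in tweet:
                yield tweet["id"]


sample = [
    '{"data": [{"id": "1", "text": "a"}, {"id": "2", "text": "b"}]}',
    '{"id": "3", "text": "c"}',
    "",
]
ids = list(dehydrate_lines(sample))
print("\n".join(ids))
```

To process a file, pass an open file handle in place of the sample list, since iterating over a file yields its lines.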
Published by edsu about 3 years ago
A bugfix release so that start-time is not inferred when searching with --archive and also using --until-id.
Published by edsu about 3 years ago
This release includes:
Published by edsu over 3 years ago
This is another attempt at handling exceptions during streaming in a more straightforward way without using decorators2.catch_request_exceptions #505.
Published by edsu over 3 years ago
This bugfix release adds some additional handling of exceptions for accessing streaming endpoints when running twarc2 stream and twarc2 sample. See #505 for the details.
Published by edsu over 3 years ago
This bugfix release reuses twarc.decorators2.catch_request_exceptions in the
context of streaming responses (twarc2 sample and twarc2 stream) that use the response.iter_lines
method. Hopefully this will address #505 but it will require testing by people who continue seeing
the error in the wild.
Published by edsu over 3 years ago
v2.3.7 includes new functionality that adds progress bars for twarc2 commands like search, hydrate, timeline and more. These visual indications of how much data has been collected and how much there is to go are extremely useful in data collection jobs. Progress bars display by default when you instruct twarc to write output to a file (#490).
Additionally there is new code to catch 503 Twitter API errors that have recently been occurring much more regularly (#499). Apparently a big reason for these errors was the load caused by requesting 500 tweets from the search/all endpoint while also asking for context annotations. Twitter recently announced they were no longer making context annotations available for requests asking for more than 100 tweets. Since it's one of twarc's design principles to maximize the representation of tweets, the search command has been adjusted to default to 100 now instead of 500, at least for the time being (#504).
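Catching a transient 503 and trying again is commonly done with a retry loop and exponential backoff. The sketch below illustrates that general pattern under stated assumptions and is not the code twarc uses: ServiceUnavailable stands in for however an HTTP 503 is surfaced, and fetch_with_retry, flaky_fetch and the delay values are hypothetical.

```python
import time


class ServiceUnavailable(Exception):
    """Stand-in for an HTTP 503 response from the API (hypothetical)."""


def fetch_with_retry(fetch, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch() and retry on ServiceUnavailable with exponential backoff.

    The delay doubles after each failed attempt; the last failure is
    re-raised so the caller can decide what to do.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except ServiceUnavailable:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))


# Demonstration with a fake request that fails twice before succeeding.
attempts = []


def flaky_fetch():
    attempts.append(1)
    if len(attempts) < 3:
        raise ServiceUnavailable()
    return {"data": []}


delays = []  # record the requested delays instead of actually sleeping
result = fetch_with_retry(flaky_fetch, sleep=delays.append)
print(result, delays)
```

Injecting the sleep function keeps the example fast to run and makes the backoff schedule easy to inspect.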
Published by edsu over 3 years ago
A bugfix release for Twarc2.stream so that it reconnects after being instructed to disconnect, and then continues to fetch data from the stream.
Published by edsu over 3 years ago
Disable running stream unit test under GitHub Actions since it returns a 400 error.