sq data wrangler
MIT License
Bot releases are hidden (Show)
Small bugfix release.
zip
was determined to be of kind int
, because values 0-1000 were all parseable as integers. But then the 1001st value was BX123
, which obviously is not an integer. sq
will now see the non-integer value, and alter the ingest DB schema to a compatible kind, e.g. text
. This flexibility is powerful, but it does come at the cost of slower ingest speed. But that's a topic for another release.Published by neilotoole 7 months ago
Published by neilotoole 8 months ago
This release features significant improvements to sq diff
.
Previously sq diff --data
diffed every row, which could get crazy with a large table. Now the command stops after N differences, where N is controlled by the --stop
(-n
) flag, or the new config option diff.stop
. The default stop-after value is 3
; set to 0
to show all differences.
# Stop on first difference
$ sq diff @prod.actor @staging.actor --data --stop 1
# Stop after 5 differences, using the -n shorthand flag
$ sq diff @prod.actor @staging.actor --data -n5
[#353]: The performance of sq diff
has been significantly improved. There's still more to do.
Previously, sq diff --data
compared the rendered (text) representation of each value. This could lead to inaccurate results, for example with two timestamp values in different time zones, but the text rendering omitted the time zone. Now, sq diff --data
compares the raw values, not the rendered text. Note in particular with time values that both time and location components are compared.
sq
can now handle a SQLite DB on stdin
. This is useful for testing, or for working with SQLite DBs in a pipeline.
$ cat sakila.db | sq '.actor | .first_name, .last_name'
It's also surprisingly handy in daily life, because there are sneaky SQLite DBs all around us. Let's see how many text messages I've sent and received over the years:
$ cat ~/Library/Messages/chat.db | sq '.message | count'
count
215439
I'm sure that number makes me an amateur with these millenials 👴🏻.
Note that you'll need to enable macOS Full Disk Access to read the
chat.db
file.
sq
now allows you to use true
and false
literals in queries. Which, in hindsight, does seem like a bit of an oversight 😳. (Although previously you could usually get away with using 1
and 0
).
$ sq '.people | where(.is_alive == false)'
name is_alive
Kubla Khan false
$ sq '.people | where(.is_alive == true)'
name is_alive
Kaiser Soze true
☢️ Previously, sq diff
only exited non-zero on an error. Now, sq diff
exits 0
when no differences, exits 1
if differences are found, and exits 2
on any error. This aligns with the behavior of GNU diff:
Exit status is 0 if inputs are the same, 1 if different, 2 if trouble.
Minor fiddling with the color scheme for some command output.
Published by neilotoole 8 months ago
Published by neilotoole 9 months ago
Minor bug fix release. See the earlier v0.47.0
release for recent headline features.
bash
only worked for top-level commands, not for subcommands, flags, args, etc. This bug was due to an unnoticed behavior change in an imported library 🤦♂️. It's now fixed, and tests have been added.shell-completion.group-filter
.Published by neilotoole 9 months ago
Published by neilotoole 9 months ago
This is a significant release, focused on improving i/o, responsiveness, and performance. The headline features are caching of ingested data for document sources such as CSV or Excel, and download caching for remote document sources. There are a lot of under-the-hood changes, so please open an issue if you encounter any weirdness.
progress
and progress.delay
. You can also use the --no-progress
flag to disable the progress bar.
stderr
and is always zapped from the terminal when command output begins. It won't corrupt the output.sq
command. It is now a one-time cost; subsequent use of the document source utilizes the cache DB. Until, that is, the source document changes: then the ingest cache DB is invalidated and ingested again. This is a significantly improved experience for large document sources.sq cache enable
and sq cache disable
control cache usage. You can also instead use the new ingest.cache
config option.sq cache clear
clears the cache.sq cache location
prints the cache location on disk.sq cache stat
shows stats about the cache.sq cache tree
shows a tree view of the cache.https://sq.io/testdata/actor.csv
) has been completely overhauled. Previously, sq
would re-download the remote file on every command. Now, the remote file is downloaded and cached locally. Subsequent sq
invocations check for staleness of the cached download, and re-download if necessary.http.request.timeout
is the timeout for the initial response from the server, and http.response.timeout
is the timeout for reading the entire response body. We separate these two timeouts because it's possible that the server responds quickly, but then for a large file, the download takes too long.https.insecure-skip-verify
controls whether HTTPS connections verify the server's certificate. This is useful for remote files served with a self-signed certificate.download.cache
controls whether remote files are cached locally.download.refresh.ok-on-err
controls whether sq
should continue with a stale cached download if an error occurred while trying to refresh the download. This is a sort of "Airplane Mode" for remote document sources: sq
continues with the cached download when the network is unavailable.cache.lock.timeout
controls the time that sq
will wait for a lock on the cache DB. The cache lock is introduced for when you have multiple sq
commands running concurrently, and you want to avoid them stepping on each other.config.lock.timeout
controls the timeout for acquiring the (newly-introduced) lock on sq
's config file. This helps prevent issues with multiple sq
processes mutating the config concurrently.sq
's own logs previously outputted in JSON format. Now there's a new log.format
config option that permits setting the log format to json
or text
. The text
format is more human-friendly, and is now the default.conn.open-timeout
.Published by neilotoole 11 months ago
sq
sometimes failed to read from stdin if piped input was slowto arrive. This is now fixed.Published by neilotoole 11 months ago
#338: While sq
has had group_by
for some time, somehow the having
mechanism was never implemented. That's fixed.
$ sq '.payment | .customer_id, sum(.amount) | group_by(.customer_id) | having(sum(.amount) > 200)'
customer_id sum(.amount)
526 221.55
148 216.54
#340: The group_by
function now has a synonym gb
, and order_by
now has synonym ob
. These synonyms are experimental 🧪. The motivation is to reduce typing, especially the underscore (_
) in both function names, but it's not clear that the loss of clarity is worth it. Maybe synonyms group
and order
might be better? Feedback welcome.
# Previously
$ sq '.payment | .customer_id, sum(.amount) | group_by(.customer_id) | order_by(.customer_id)'
# Now
$ sq '.payment | .customer_id, sum(.amount) | gb(.customer_id) | ob(.customer_id)'
#340: sq inspect
: added flag shorthand -C
for --catalogs
and -S
for --schemata
. These were the only inspect
flags without shorthand.
Published by neilotoole 11 months ago
#335: Previously, sq
didn't handle decimal values correctly. It basically shoved a decimal value into a float
or string
and hoped for the best. As is known, floats are imprecise, and so we saw unwanted behavior, e.g.
db_type_test.go:194:
Error Trace: D:/a/sq/sq/drivers/sqlite3/db_type_test.go:194
Error: Not equal:
expected: "77.77"
actual : "77.77000000000001"
Now, sq
uses a dedicated Decimal
type end-to-end. No precision is lost, and at the output end, the value is rendered with the correct precision.
There is a proposal to add decimal support to the Go database/sql
package. If that happens, sq
will happily switch to that mechanism.
NUMERIC(10,5)
value might have been rendered as 100.00
, but will now accurately render 100.00000
.Published by neilotoole 11 months ago
New SLQ function rownum()
that returns the one-indexed row number of the current record.
$ sq '.actor | rownum(), .actor_id, .first_name | order_by(.first_name)'
rownum() actor_id first_name
1 71 ADAM
2 132 ADAM
3 165 AL
sq inspect
has two new flags:
--schemata
: list the source's schemas
$ sq inspect @sakila/pg12 --schemata -y
- schema: information_schema
catalog: sakila
owner: sakila
- schema: public
catalog: sakila
owner: sakila
active: true
--catalogs
: list the source's catalogs
$ sq inspect @sakila/pg12 --catalogs
CATALOG
postgres
sakila
sq version
now honors option format.datetime
when outputting build timestamp.The --exec
and --query
flags for sq sql
were removed in the preceding release (v0.43.1).
That was probably a bit hasty, especially because it's possible those flags could be reintroduced when the query vs exec situation is figured out. So, those two flags are now restored, in that their use won't cause an error, but they've been hidden from command help, and remain no-op.
Published by neilotoole 11 months ago
sq inspect
now includes the source's catalog (in JSON and YAML output formats).sq inspect --overview
.--exec
and --query
flags from sq sql
command.Published by neilotoole 11 months ago
Published by neilotoole about 1 year ago
Published by neilotoole about 1 year ago
Published by neilotoole about 1 year ago
sq version
was missing a newline in its output.Published by neilotoole about 1 year ago
This release is heavily focused on improvements to Microsoft Excel support.
The underlying Excel library has been changed from tealeg/xlsx
to qax-os/excelize
, largely because tealeg/xlsx
is no longer actively maintained.
Thus, both the XLSX output writer and the XLSX driver have been rewritten. There should be some performance improvements, but it's also possible that the rewrite introduced bugs. If you discover anything strange, please open an issue.
#99: The CSV and XLSX
drivers can now handle duplicate header column names in the ingest data.
For example, given a CSV file:
actor_id,first_name,actor_id
1,PENELOPE,1
2,NICK,2
The columns will be renamed to:
actor_id,first_name,actor_id_1
The renaming behavior is controlled by a new option ingest.column.rename
This new option is effectively the ingest counterpart of the existing output option
result.column.rename
.
#191: The XLSX driver now detects header rows, like
the CSV driver already does. Thus, you now typically don't need to specify
the --ingest.header
flag for Excel files. However, the option remains available
in case sq
can't figure it out for a particular file.
The Excel writer has three new config options for controlling date/time output.
Note that these format strings are distinct from format.datetime
and friends, because Excel has its own format string mechanism.
format.excel.datetime
: Controls datetime format, e.g. 2023-08-03 16:07:01
.format.excel.date
: Controls date-only format, e.g. 2023-08-03
.format.excel.time
: Controls time-only format, e.g. 4:07 pm
.The ingest kind detectors (e.g. for CSV
or XLSX
)
now detect more date & time formats as kind.Datetime
, kind.Date
, and kind.Time
.
If an error occurs when the output format is text
, a stack trace is printed
to stderr
when the command is executed with --verbose
(-v
).
There's a new option error.format
that controls error output format independent
of the main format
option . The error.format
value must be one of text
or json
.
☢️ The default Excel date format has changed. Previously
the format was 11/9/89
, and now it is 1989-11-09
. The same applies
to datetimes, e.g. 11/9/1989 00:00:00
becomes 1989-11-09 00:00
.
This change is made to reduce ambiguity and confusion.
sq
uses a library
to interact with Excel files, and it seems that the library chooses a particular format
by default (11/9/89
). There are several paths we could take here:
11/9/89
.11/9/89
.We pick the third option. The first option (locale-dependent)
is excluded because, as a general rule, we want sq
to produce the same
output regardless of locale/system settings. We exclude the second option
because month/day confuses most of the world. Thus, we're left with picking a
default, and 1989-11-09
is the format used in
RFC3339 and friends.
Whether this is the correct (standard?) approach is still unclear, and
feedback is welcome. However, the user can make use of the new config options
(format.excel.datetime
etc.)
to customize the format as they see fit.
The XLSX writer now outputs header rows in bold text.
☢️ The XLSX writer now outputs blob (bytes
) cell data as a base64-encoded string,
instead of raw bytes.
Published by neilotoole over 1 year ago
This release features a complete overhaul of the join
mechanism.
#277: A table selector can now have an alias. This in and of itself is not
particularly useful, but it's a building block for multiple joins.
$ sq '@sakila | .actor:a | .a.first_name'
New option result.column.rename
that exposes a template used to rename
result set column names before display. The primary use case is to de-duplicate
columns names on a SELECT * FROM tbl1 JOIN tbl2
, where tbl1
and tbl2
have clashing column names (docs).
#157: Previously only join
(INNER JOIN
) was available: now the rest of
the join types such as left_outer_join
, cross_join
, etc. are
implemented (docs).
☢️ #12: The table join mechanism has been
completely overhauled. Now there's support for multiple joins. See docs.
# Previously, only a single join was possible
$ sq '.actor, .film_actor | join(.actor_id)'
# Now, an arbitrary number of joins
$ sq '.actor | join(.film_actor, .actor_id) | join(.film, .film_id)'
☢️ The alias for --jsonl
(JSON Lines) has been changed to -J
.
Published by neilotoole over 1 year ago
sq version
output on Windows.