siuba | Python Ecosystem Directory


siuba - Experimental Symbolic autocompletion

Published by machow about 4 years ago

Thanks to @tmastny for the PR (#248)!

import siuba.experimental.completer


siuba - Fix lhs ops, support kwargs in sql count

Published by machow over 4 years ago

Fix lhs ops (#235)
Support kwargs in SQL count (#234)

siuba - Small fix for summarize, w/ Series results

Published by machow over 4 years ago

See issue #138. This release ensures summarize...

validates that results are scalar or length 1.
uses a Series result's underlying array, to avoid issues around Series indexes in DataFrame construction.
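
The index issue can be reproduced with plain pandas. The following is a minimal sketch, not siuba's actual implementation: when a DataFrame is built from Series, pandas aligns them on their index labels, so length-1 results with different labels end up on different rows. Taking each result's underlying array avoids the alignment. The helper name `to_scalar_column` and the values are hypothetical.

```python
import pandas as pd

# Two length-1 results whose index labels do not match.
avg_hp = pd.Series([146.7], index=[0])
max_mpg = pd.Series([33.9], index=[5])

# pandas aligns on index labels: two rows, each half-filled with NaN.
aligned = pd.DataFrame({"avg_hp": avg_hp, "max_mpg": max_mpg})


def to_scalar_column(result):
    """Hypothetical helper: validate a summarize result and drop its index."""
    if isinstance(result, pd.Series):
        if len(result) != 1:
            raise ValueError("summarize results must be scalar or length 1")
        return result.array  # underlying array, no index attached
    return [result]


# Built from index-free arrays, the results share a single row.
fixed = pd.DataFrame({
    "avg_hp": to_scalar_column(avg_hp),
    "max_mpg": to_scalar_column(max_mpg),
})
```

Here `aligned` has two rows with missing values, while `fixed` has the single row a summarize is expected to return.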

siuba - Small update for docs: Call.map_replace and cars data

Published by machow over 4 years ago

This is a small release, designed to support the new siuba documentation.

Features

added Call.map_replace method, which is like map_subcall but replaces subcalls with the result
added to siuba.data: cars, cars_sql

siuba - top_n, floor_date, custom sql joins, and full method spec

Published by machow over 4 years ago

Fixes

filter now preserves column order, rather than moving grouping columns to left (#205)
symbolic representations now correctly align on keywords (#222)

Features

sql supports custom join conditions via sql_on (#202)
siuba.series.spec now includes all Series methods, even unsupported ones (#209)
the spec also now is derived from the file siuba/series/spec.yml (#211)
siu Symbolic is no longer falsey (#210)
added new verb top_n (#222)
added vector functions ceil_date and floor_date to siuba.experimental.datetime (#222)

QA

re-enabled testing of example jupyter notebooks (#206)

siuba - Add fct_lump prop argument, fix fast grouped summarize

Published by machow over 4 years ago

Fixes

added more fast grouped method tests, and fixed fast summarize (#197)

Features

support prop argument in fct_lump (#195)
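
As a rough picture of what lumping by proportion means, here is a pandas-only sketch. It assumes that levels making up less than `prop` of the values are collapsed into a single "Other" level; the exact threshold and tie-handling rules of siuba's fct_lump may differ, and the data is made up for illustration.

```python
import pandas as pd

x = pd.Series(["a"] * 5 + ["b"] * 3 + ["c", "d"])
prop = 0.2  # assumed meaning: keep levels covering at least 20% of the values

# Share of observations per level.
shares = x.value_counts(normalize=True)
keep = shares[shares >= prop].index

# Levels below the threshold are collapsed into a single "Other" level.
lumped = x.where(x.isin(keep), "Other")
```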

siuba - fix if_else, remove psycopg2 dependency

Published by machow over 4 years ago

Fixes

if_else doesn't try to coerce to new type at end (#179)
removed psycopg2 dependency (causes install to fail if user does not have postgres) (#189)

siuba - Fix nest function to support pandas v1.0.0

Published by machow over 4 years ago

Fixes nest raising the error "TypeError: copy() takes no keyword arguments". Nest now uses a more principled approach to splitting a grouped DataFrame, and creating a list of sub frames! (see #182)

Also fixed doc build, by not trying to run notebooks starting with draft-. (#186)
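
The idea of splitting a grouped DataFrame into a list of sub frames can be pictured with plain pandas. This is only a sketch of the general approach, assuming a small made-up table; it is not the code nest uses internally.

```python
import pandas as pd

df = pd.DataFrame({
    "cyl": [4, 6, 4, 8, 6],
    "mpg": [22.8, 21.0, 24.4, 18.7, 21.4],
})

grouped = df.groupby("cyl")

# Iterating over a groupby yields (key, sub frame) pairs, one per group.
keys = []
sub_frames = []
for key, sub_df in grouped:
    keys.append(key)
    # Drop the grouping column, since the key already records it.
    sub_frames.append(sub_df.drop(columns="cyl"))
```

Each entry of `sub_frames` holds the rows for the matching entry of `keys`, which is the shape of data a nested column needs.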

siuba - Support for user defined functions (UDFs)

Published by machow over 4 years ago

New Feature: support user defined functions (#146)

Support for user defined functions (UDFs). Note that these require annotating the return type. For more on the theory behind these see ADR-003.

from siuba.siu import symbolic_dispatch
from pandas.core.groupby import SeriesGroupBy, GroupBy
from pandas import Series

@symbolic_dispatch(cls = Series)
def cummean(x):
    """Return a same-length array, containing the cumulative mean."""
    return x.expanding().mean()


@cummean.register(SeriesGroupBy)
def _cummean_grouped(x) -> SeriesGroupBy:
    grouper = x.grouper
    n_entries = x.obj.notna().groupby(grouper).cumsum()

    res = x.cumsum() / n_entries

    return res.groupby(grouper)

from siuba import _, mutate
from siuba.data import mtcars

# a pandas DataFrameGroupBy object
g_cyl = mtcars.groupby("cyl")

mutate(g_cyl, cumul_mean = cummean(_.mpg))

Support for many methods in vector.py, using UDFs (#158)

Bug Fixes

Fix regression where .str wasn't being removed when processing siu expressions for SQL (#159)
Grouped filter now preserves order
Verbs now tested to preserve original index (https://github.com/machow/siuba/commit/d938ab323e080832af8274f330c6562cf9b447b0)

Tests

Add many more versions of python and pandas to travis CI test matrix (#161)

siuba - Opt-in speedy support for grouped pandas

Published by machow almost 5 years ago

Features

Implementation of fast mutate, filter, and summarize using CallTreeLocal (#134). For even just a couple thousand groups, the fast methods are close to optimal hand-written pandas, and the slow versions are almost 1000x slower :o.
fixed current grouped pandas mutate to preserve row order (#139)
laid down tests of all supported series methods, currently skipping SQL backends (but ready to go!)
put up some very basic documentation (#145)
wrote an ADR on the rationale for fast groupby (#135)
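
To illustrate why the two paths differ so much in speed, here is a pandas-only sketch. It is not siuba's implementation: it only contrasts calling a Python function once per group with calling a single vectorized grouped method, on made-up data.

```python
import pandas as pd

df = pd.DataFrame({
    "cyl": [4, 4, 6, 6, 8, 8],
    "hp": [93, 62, 110, 105, 175, 245],
})
g = df.groupby("cyl")

# Slow path: a Python function is called once per group.
slow = g["hp"].transform(lambda x: x - x.mean())

# Fast path: one vectorized grouped method, then ordinary arithmetic.
fast = df["hp"] - g["hp"].transform("mean")
```

Both produce the same values, but the per-group function call cost grows with the number of groups, while the second form stays vectorized.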

Note that CallTreeLocal has new options, allowing it to look up based on chained attributes (e.g. look for an entry named "dt.year"), and to override custom function calls.

I still need to finish support for user defined operations and some light siu refactoring.

Breaking changes

Removed the rm_attr argument from CallTreeLocal, since converting subattrs like dt.year will consume dt anyway (can't imagine a situation where we'd want to keep it, and couldn't do that in the translator function)

Demo

from siuba.experimental.pd_groups import fast_mutate, fast_filter, fast_summarize
from siuba import *
from siuba.data import mtcars

g_cars = mtcars.groupby(['cyl', 'gear'])

fast_mutate(g_cars, _.hp - _.hp.mean())

siuba -

Published by machow about 5 years ago

User Facing Changes

Arrange correctly resets index. https://github.com/machow/siuba/pull/106
Filter correctly resets index. https://github.com/machow/siuba/pull/130
Switch to MIT license
SqlTranslator is now mutable, so it's easier to tweak them on the fly (e.g. in a notebook). b9c81d77975a87b0f860dfb594f26c2d82e1052c
pandas join feats / fixes
- the implementation of semi_join was duplicating rows as standard joins do.
  this resulted in some very large results.... https://github.com/machow/siuba/pull/115
- implement anti_join
- fix failures when the on arg received a mapping
- fix full_join failing for pandas, since pandas calls it an 'outer' join
implement sql transmute. https://github.com/machow/siuba/pull/108
clean up, document, add docstring tests for vector funcs and forcats. https://github.com/machow/siuba/pull/120
add full tests for forcats (thanks @bakera81!). https://github.com/machow/siuba/pull/121
clean up and document separate(). https://github.com/machow/siuba/pull/119
allow raw SQL. https://github.com/machow/siuba/pull/123
generalize spread and gather. https://github.com/machow/siuba/pull/129
mutate generates fewer subqueries now. https://github.com/machow/siuba/pull/127
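
The pandas join behaviors listed above can be seen with plain pandas. The sketch below uses made-up tables and is not siuba's code; it shows why a standard join duplicates rows where a semi join should not, and that a full join corresponds to `how="outer"`.

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3], "x": ["a", "b", "c"]})
right = pd.DataFrame({"id": [1, 1, 1, 4], "y": [10, 20, 30, 40]})

# A standard inner join repeats a left row once per match on the right.
inner = left.merge(right, on="id")

# Semi join: keep left rows that have a match, without duplicating them.
semi = left[left["id"].isin(right["id"])]

# Anti join: keep left rows that have no match.
anti = left[~left["id"].isin(right["id"])]

# What dplyr calls a full join is how="outer" in pandas.
full = left.merge(right, on="id", how="outer")
```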

As an experimental feature, I shortened the stacktraces for SQL translator errors! https://github.com/machow/siuba/pull/125

Chores

clean up pytest. https://github.com/machow/siuba/pull/116

siuba - count without args returns total rows

Published by machow about 5 years ago

Small fix, supporting count without args. This is a very common case against SQL dbs, since it lets you know how many rows a table has.

e.g.

tbl_something >> count()
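
On a SQL backend, a count with no columns amounts to a single row-count query. The sketch below uses sqlite3 from the standard library with a made-up table; the SQL that siuba actually generates may look different.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tbl_something (id INTEGER, label TEXT)")
con.executemany(
    "INSERT INTO tbl_something VALUES (?, ?)",
    [(1, "a"), (2, "b"), (3, "c")],
)

# With no columns to count by, the result is one row holding the table size.
(n_rows,) = con.execute("SELECT COUNT(*) AS n FROM tbl_something").fetchone()
```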

siuba - add gather, improve case when support

Published by machow about 5 years ago

allow case_when values to be Calls (e.g. case_when({True: _.a + 1})) https://github.com/machow/siuba/pull/103
add gather function
add developer docs

siuba - Support connections to redshift

Published by machow about 5 years ago

Allow custom specification of table columns (#78) - this was preventing people from working with SQL when SqlAlchemy couldn't reflect column names from a database.
Add some intro docs

siuba - document core 1 table verbs, and add fct_rev

Published by machow over 5 years ago

siuba - SQL support and unit tests

Published by machow over 5 years ago

This release implements extensive testing for postgres and sqlite. It also sets up (but skips) pandas unit tests.

It follows from this PR: https://github.com/machow/siuba/pull/36

SQL Improvements:

fix distinct when following an arrange (#65)
move collect and show_query into dply.verbs (#23)
mutate now doesn't put redefined columns at end of table (#42)
full join gets all values in joining columns, even if unique to only 1 table (#57)
complete implementation of joins, including anti_ and semi_join (#58, #55, #53)
allow mutate to accept only scalar values, e.g. mutate(a = 1). (#39)
handle arbitrary siu expressions inside arrange (#30)
track arrange expressions, in order to allow cumulative functions (#30)
clearer messaging that summarize can't refer to just defined column (#49)
summarize, and mutate better identify when referencing a variable requires a CTE (#46, #45)
allow keyword args to group_by (#52)
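
The CTE item above reflects a general SQL rule: a column defined in a SELECT list usually cannot be referenced by another expression in that same list, so the first definition has to move into a subquery or CTE. A minimal sketch with sqlite3 and a made-up table follows; the query is written by hand for illustration and is not siuba's generated SQL.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (a INTEGER)")
con.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,)])

# Roughly what mutate(b = _.a + 1, c = _.b * 2) needs:
# b is defined in a CTE so that the outer SELECT can refer to it.
rows = con.execute(
    """
    WITH step1 AS (SELECT a, a + 1 AS b FROM t)
    SELECT a, b, b * 2 AS c FROM step1
    ORDER BY a
    """
).fetchall()
```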