siuba

Python library for using dplyr-like syntax with pandas and SQL

MIT License

Downloads: 24.6K
Stars: 1.1K
Committers: 10

siuba - Experimental Symbolic autocompletion

Published by machow about 4 years ago

Thanks to @tmastny for the PR (#248)!

import siuba.experimental.completer

siuba - Fix lhs ops, support kwargs in sql count

Published by machow over 4 years ago

  • Fix lhs ops (#235)
  • Support kwargs in SQL count (#234)

siuba - Small fix for summarize, w/ Series results

Published by machow over 4 years ago

See issue #138. This release ensures summarize...

  • validates results are scalar or length 1.
  • uses a Series result's underlying array, to avoid issues around Series indexes in DataFrame construction.

siuba - Small update for docs: Call.map_replace and cars data

Published by machow over 4 years ago

This is a small release, designed to support the new siuba documentation.

Features

  • added Call.map_replace method, which is like map_subcall but replaces subcalls with the result
  • added to siuba.data: cars, cars_sql

siuba - top_n, floor_date, custom sql joins, and full method spec

Published by machow over 4 years ago

Fixes

  • filter now preserves column order, rather than moving grouping columns to left (#205)
  • symbolic representations now correctly align on keywords (#222)

Features

  • sql supports custom join conditions via sql_on (#202)
  • siuba.series.spec now includes all Series methods, even unsupported ones (#209)
  • the spec also now is derived from the file siuba/series/spec.yml (#211)
  • siu Symbolic is no longer falsey (#210)
  • added new verb top_n (#222)
  • added vector functions ceil_date and floor_date to siuba.experimental.datetime (#222)

QA

  • re-enabled testing of example jupyter notebooks (#206)
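
For readers unfamiliar with date flooring, the idea behind the floor_date function listed under Features can be sketched with the standard library alone. This is not siuba's implementation, which works on pandas objects; the helper names floor_to_month and floor_to_year are hypothetical and exist only for this illustration.

```python
from datetime import datetime


def floor_to_month(ts):
    """Round a timestamp down to the first instant of its month (hypothetical helper)."""
    return ts.replace(day=1, hour=0, minute=0, second=0, microsecond=0)


def floor_to_year(ts):
    """Round a timestamp down to the first instant of its year (hypothetical helper)."""
    return ts.replace(month=1, day=1, hour=0, minute=0, second=0, microsecond=0)


stamp = datetime(2020, 2, 17, 13, 45)
floored_month = floor_to_month(stamp)
floored_year = floor_to_year(stamp)

print(floored_month)  # 2020-02-01 00:00:00
print(floored_year)   # 2020-01-01 00:00:00
```

ceil_date is the counterpart that rounds up to a unit boundary; it is not sketched here.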
siuba - Add fct_lump prop argument, fix fast grouped summarize

Published by machow over 4 years ago

Fixes

  • added more fast grouped method tests, and fixed fast summarize (#197)

Features

  • support prop argument in fct_lump (#195)
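
As a rough illustration of what lumping by proportion means, the sketch below replaces every level whose share of the data falls below a threshold with "Other". It uses plain Python lists rather than pandas categoricals, and the function name lump_by_prop is hypothetical. How fct_lump treats a level sitting exactly on the threshold is not stated in these notes, so the strict comparison here is an assumption.

```python
from collections import Counter


def lump_by_prop(values, prop, other="Other"):
    """Replace levels whose share of all values is below `prop` with `other`.

    Hypothetical helper for illustration; not siuba's fct_lump.
    """
    counts = Counter(values)
    total = len(values)
    # Keep only the levels that make up at least `prop` of the data.
    keep = {level for level, n in counts.items() if n / total >= prop}
    return [v if v in keep else other for v in values]


# Shares: "8" is 50%, "6" is 30%, "4" and "5" are 10% each.
cyl = ["8"] * 5 + ["6"] * 3 + ["4"] + ["5"]
lumped = lump_by_prop(cyl, prop=0.2)

print(Counter(lumped))
```

With prop=0.2 the two 10% levels are merged, so the result holds three levels instead of four.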
siuba - fix if_else, remove psycopg2 dependency

Published by machow over 4 years ago

Fixes

  • if_else doesn't try to coerce to new type at end (#179)
  • removed psycopg2 dependency (causes install to fail if user does not have postgres) (#189)

siuba - Fix nest function to support pandas v1.0.0

Published by machow over 4 years ago

Fixes nest raising the error "TypeError: copy() takes no keyword arguments". Nest now uses a more principled approach to splitting a grouped DataFrame, and creating a list of sub frames! (see #182)

Also fixed doc build, by not trying to run notebooks starting with draft-. (#186)

siuba - Support for user defined functions (UDFs)

Published by machow over 4 years ago

New Feature: support user defined functions (#146)

  • Support for user defined functions (UDFs). Note that these require annotating the return type. For more on the theory behind these see ADR-003.
from siuba.siu import symbolic_dispatch
from pandas.core.groupby import SeriesGroupBy, GroupBy
from pandas import Series

@symbolic_dispatch(cls = Series)
def cummean(x):
    """Return a same-length array, containing the cumulative mean."""
    return x.expanding().mean()


@cummean.register(SeriesGroupBy)
def _cummean_grouped(x) -> SeriesGroupBy:
    grouper = x.grouper
    n_entries = x.obj.notna().groupby(grouper).cumsum()

    res = x.cumsum() / n_entries

    return res.groupby(grouper)

from siuba import _, mutate
from siuba.data import mtcars

# a pandas DataFrameGroupBy object
g_cyl = mtcars.groupby("cyl")

# compute the cumulative mean of mpg within each cyl group
mutate(g_cyl, cumul_mean = cummean(_.mpg))

  • Support for many methods in vector.py, using UDFs (#158)
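
The register pattern in the UDF example resembles Python's functools.singledispatch, where one function name has separate implementations chosen by the type of the first argument. The sketch below shows that mechanism with the standard library only. Plain lists and dicts stand in for Series and SeriesGroupBy, which is an assumption made purely for illustration; it does not use siuba's symbolic_dispatch and says nothing about its internals.

```python
from functools import singledispatch
from itertools import accumulate


@singledispatch
def cummean(x):
    # Fallback for types with no registered implementation.
    raise TypeError(f"cummean is not implemented for {type(x).__name__}")


@cummean.register(list)
def _cummean_list(x):
    """Cumulative mean of a flat list of numbers."""
    running_totals = accumulate(x)
    return [total / i for i, total in enumerate(running_totals, start=1)]


@cummean.register(dict)
def _cummean_grouped(x):
    """Cumulative mean computed separately within each group."""
    return {key: cummean(values) for key, values in x.items()}


flat = cummean([1, 2, 6])
grouped = cummean({"a": [2, 4], "b": [10]})

print(flat)     # [1.0, 1.5, 3.0]
print(grouped)  # {'a': [2.0, 3.0], 'b': [10.0]}
```

The grouped version reuses the flat one per group, which mirrors how the grouped UDF above restarts the cumulative mean within each group.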

Bug Fixes

Tests

  • Add many more versions of python and pandas to travis CI test matrix (#161)

siuba - Opt-in speedy support for grouped pandas

Published by machow almost 5 years ago

Features

  • Implementation of fast mutate, filter, and summarize using CallTreeLocal (#134). For even just a couple thousand groups, the fast methods are close to optimal hand-written pandas, and the slow versions are almost 1000x slower :o.
  • fixed current grouped pandas mutate to preserve row order (#139)
  • laid down tests of all supported series methods, currently skipping SQL backends (but ready to go!)
  • put up some very basic documentation (#145)
  • wrote an ADR on the rationale for fast groupby (#135)

Note that CallTreeLocal has new options, allowing it to look up entries by chained attributes (e.g. an entry named "dt.year") and to override custom function calls.

I still need to finish support for user defined operations and some light siu refactoring.

Breaking changes

  • Removed the rm_attr argument from CallTreeLocal, since converting subattrs like dt.year will consume dt anyway (can't imagine a situation where we'd want to keep it, and couldn't do that in the translator function)

Demo

from siuba.experimental.pd_groups import fast_mutate, fast_filter, fast_summarize
from siuba import *
from siuba.data import mtcars

g_cars = mtcars.groupby(['cyl', 'gear'])

fast_mutate(g_cars, _.hp - _.hp.mean())

siuba -

Published by machow about 5 years ago

User Facing Changes

As an experimental feature, I shortened the stacktraces for SQL translator errors! https://github.com/machow/siuba/pull/125

Chores

siuba - count without args returns total rows

Published by machow about 5 years ago

Small fix, supporting count without args. This is a very common case against SQL dbs, since it tells you how many rows a table has.

e.g.

tbl_something >> count()
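
In SQL terms, a count with no arguments amounts to counting every row, while a count over a column counts rows per distinct value. The sketch below shows both queries against an in-memory SQLite table using only the standard library. The table contents and the column alias n are assumptions for illustration, and the SQL that siuba actually generates may differ.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tbl_something (id INTEGER, label TEXT)")
con.executemany(
    "INSERT INTO tbl_something VALUES (?, ?)",
    [(1, "a"), (2, "b"), (3, "b")],
)

# No grouping columns: a single row holding the table's total row count.
(n_rows,) = con.execute("SELECT COUNT(*) AS n FROM tbl_something").fetchone()

# One grouping column: one row per distinct label.
by_label = con.execute(
    "SELECT label, COUNT(*) AS n FROM tbl_something GROUP BY label ORDER BY label"
).fetchall()

con.close()

print(n_rows)    # 3
print(by_label)  # [('a', 1), ('b', 2)]
```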
siuba - add gather, improve case when support

Published by machow about 5 years ago

siuba - Support connections to redshift

Published by machow about 5 years ago

  • Allow custom specification of table columns (#78) - this was preventing people from working with SQL when SQLAlchemy couldn't reflect column names from a database.
  • Add some intro docs

siuba - document core 1 table verbs, and add fct_rev

Published by machow over 5 years ago

siuba - SQL support and unit tests

Published by machow over 5 years ago

This release implements extensive testing for postgres and sqlite. It also sets up (but skips) pandas unit tests.

It follows from this PR: https://github.com/machow/siuba/pull/36

SQL Improvements:

  • fix distinct when following an arrange (#65)
  • move collect and show_query into dply.verbs (#23)
  • mutate now doesn't put redefined columns at end of table (#42)
  • full join gets all values in joining columns, even if unique to only 1 table (#57)
  • complete implementation of joins, including anti_ and semi_join (#58, #55, #53)
  • allow mutate calls that contain only scalar values, e.g. mutate(a = 1). (#39)
  • handle arbitrary siu expressions inside arrange (#30)
  • track arrange expressions, in order to allow cumulative functions (#30)
  • clearer messaging that summarize can't refer to just defined column (#49)
  • summarize and mutate better identify when referencing a variable requires a CTE (#46, #45)
  • allow keyword args to group_by (#52)
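
For readers less familiar with the semi and anti joins mentioned in the list above, their semantics can be sketched in plain SQL with EXISTS and NOT EXISTS. The example runs against in-memory SQLite using the standard library; the table names lhs and rhs and their contents are made up for illustration, and the queries are not necessarily what siuba emits.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE lhs (id INTEGER, x TEXT);
    CREATE TABLE rhs (id INTEGER, y TEXT);
    INSERT INTO lhs VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO rhs VALUES (2, 'p'), (2, 'q'), (4, 'r');
    """
)

# Semi join: keep lhs rows that have at least one match in rhs.
semi = con.execute(
    "SELECT id, x FROM lhs "
    "WHERE EXISTS (SELECT 1 FROM rhs WHERE rhs.id = lhs.id) "
    "ORDER BY id"
).fetchall()

# Anti join: keep lhs rows that have no match in rhs.
anti = con.execute(
    "SELECT id, x FROM lhs "
    "WHERE NOT EXISTS (SELECT 1 FROM rhs WHERE rhs.id = lhs.id) "
    "ORDER BY id"
).fetchall()

con.close()

print(semi)  # [(2, 'b')]
print(anti)  # [(1, 'a'), (3, 'c')]
```

Unlike an inner join, the semi join returns the row with id 2 once even though rhs matches it twice, and neither result includes columns from rhs.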
Package Rankings
Top 28.86% on Conda-forge.org
Top 2.83% on Pypi.org