A filter plugin for Embulk to change timestamp format
MIT License
Column type to cast to: string, timestamp, long (unixtimestamp), or double (unixtimestamp) (string, default is string)
Default timestamp formats for the input: ["%Y-%m-%d %H:%M:%S.%N %z"]
Default timezone for the input: UTC
Default timestamp format for the output: %Y-%m-%d %H:%M:%S.%N %z
Default timezone for the output: UTC
Time unit for the input unixtimestamp: sec (for second), ms (for millisecond), us (for microsecond), ns (for nanosecond) (string, default is second)
Time unit for the output unixtimestamp: sec (for second), ms (for millisecond), us (for microsecond), ns (for nanosecond) (string, default is second)
(boolean, default is false)
timestamp_parser: set auto_java to try to convert the Ruby format to a Java format and use the faster Java timestamp parser (string, default is auto)

Say example.jsonl is as follows (this is a typical format when exporting a BigQuery table):
{"timestamp":"2015-07-12 15:00:00 UTC","nested":{"timestamp":"2015-07-12 15:00:00 UTC"}}
{"timestamp":"2015-07-12 15:00:00.1 UTC","nested":{"timestamp":"2015-07-12 15:00:00.1 UTC"}}
in:
  type: file
  path_prefix: example/example.jsonl
  parser:
    type: jsonl # not json parser
    columns:
    - {name: timestamp, type: string}
    - {name: nested, type: json}
filters:
  - type: timestamp_format
    default_from_timestamp_format: ["%Y-%m-%d %H:%M:%S.%N %z", "%Y-%m-%d %H:%M:%S %z"]
    default_to_timezone: "Asia/Tokyo"
    default_to_timestamp_format: "%Y-%m-%d %H:%M:%S.%N"
    columns:
      - {name: timestamp, type: long, to_unit: ms}
      - {name: $.nested.timestamp}
out:
  type: stdout
The output will be:
{"timestamp":1436713200000,"nested":{"timestamp":"2015-07-13 00:00:00.0"}}
{"timestamp":1436713200100,"nested":{"timestamp":"2015-07-13 00:00:00.1"}}
See ./example for more examples.
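To make the conversion in the example above easier to follow, here is a minimal Python sketch of the same steps: try each input format in turn, convert to a unixtimestamp in the requested unit, and render the time in the target timezone. This is not the plugin's code. The function names, the use of Python's %f in place of %N, the literal "UTC" in place of %z, and the fixed +09:00 offset standing in for Asia/Tokyo are all assumptions of this sketch.

```python
from datetime import datetime, timedelta, timezone

# Python counterparts of the two Ruby-style input formats in the example.
# Assumption: %f stands in for %N, and the literal "UTC" stands in for %z.
INPUT_FORMATS = ["%Y-%m-%d %H:%M:%S.%f UTC", "%Y-%m-%d %H:%M:%S UTC"]

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Units per second for each unit name used by the unit options.
UNITS_PER_SECOND = {"sec": 1, "ms": 10**3, "us": 10**6, "ns": 10**9}

# Assumption: a fixed +09:00 offset stands in for "Asia/Tokyo" (no DST in Japan).
TOKYO = timezone(timedelta(hours=9))


def parse(text):
    """Try each input format in order and return a UTC-aware datetime."""
    for fmt in INPUT_FORMATS:
        try:
            return datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)
        except ValueError:
            continue
    raise ValueError(f"no format matched: {text!r}")


def to_unixtimestamp(dt, unit):
    """Convert to an integer unixtimestamp in the given unit (integer math only)."""
    micros = (dt - EPOCH) // timedelta(microseconds=1)
    return micros * UNITS_PER_SECOND[unit] // 10**6


def to_local_string(dt, tz=TOKYO):
    """Render the time in the target timezone with a fractional part."""
    return dt.astimezone(tz).strftime("%Y-%m-%d %H:%M:%S.%f")


if __name__ == "__main__":
    dt = parse("2015-07-12 15:00:00.1 UTC")
    print(to_unixtimestamp(dt, "ms"))
    print(to_local_string(dt))
```

For the second sample line, `to_unixtimestamp(dt, "ms")` gives 1436713200100, which matches the output above. Python's %f always prints six fractional digits, so the string form differs from the plugin's output in the fractional part.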
For a type: json column, you can specify a JSONPath as the column's name:
name: $.payload.key1
name: "$.payload.array[0]"
name: "$.payload.array[*]"
name: $['payload']['key1.key2']
The following JSONPath operators are not supported:
['name','name']
[1,2]
[1:2]
[?(<expression>)]
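The supported subset can be pictured with a small resolver. The sketch below is illustrative only and is not how the plugin evaluates paths: `resolve` and `_STEP` are hypothetical names, and the resolver handles just the four step forms shown above (`.name`, `['name']`, `[index]`, `[*]`), rejecting everything else.

```python
import re

# One path step: .name, ['quoted name'], [index], or [*].
_STEP = re.compile(r"\.([A-Za-z_]\w*)|\['([^']*)'\]|\[(\d+)\]|\[(\*)\]")


def resolve(path, doc):
    """Return every value in doc matched by a simple JSONPath."""
    if not path.startswith("$"):
        raise ValueError("path must start with $")
    pos, nodes = 1, [doc]
    while pos < len(path):
        m = _STEP.match(path, pos)
        if m is None:
            # Multiple names, multiple indexes, slices and filters end up here.
            raise ValueError(f"unsupported JSONPath step at {path[pos:]!r}")
        name, quoted, index, star = m.groups()
        if star:
            # [*] expands every element of each list node.
            nodes = [v for n in nodes if isinstance(n, list) for v in n]
        elif index is not None:
            i = int(index)
            nodes = [n[i] for n in nodes if isinstance(n, list) and i < len(n)]
        else:
            key = name if name is not None else quoted
            nodes = [n[key] for n in nodes if isinstance(n, dict) and key in n]
        pos = m.end()
    return nodes


if __name__ == "__main__":
    doc = {"payload": {"key1": "a", "array": ["x", "y"], "key1.key2": "b"}}
    print(resolve("$.payload.array[*]", doc))
    print(resolve("$['payload']['key1.key2']", doc))
```

The bracket-and-quote form is what makes a key containing a dot, such as `key1.key2`, addressable.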
NEWS: (2017/07/10) embulk 0.8.27 is released with a fast Timestamp jruby parser. This issue should be resolved, so Java Timestamp parser support will be dropped in future releases.
Embulk's timestamp parser originally used a JRuby implementation, which is slow. To improve performance, this plugin also supports Java's Joda-Time DateTimeFormat patterns, as in:
in:
  type: file
  path_prefix: example/example.jsonl
  parser:
    type: jsonl
    columns:
    - {name: timestamp, type: string}
    - {name: nested, type: json}
filters:
  - type: timestamp_format
    default_from_timestamp_format: ["yyyy-MM-dd HH:mm:ss.SSS z", "yyyy-MM-dd HH:mm:ss z", "yyyy-MM-dd HH:mm:ss"]
    default_to_timezone: "Asia/Taipei"
    default_to_timestamp_format: "yyyy-MM-dd HH:mm:ss.SSS Z"
    columns:
      - {name: timestamp, type: long, to_unit: ms}
      - {name: $.nested.timestamp}
out:
  type: stdout
If a format string contains %, the JRuby parser/formatter is used. Otherwise, the Java parser/formatter is used.
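The selection rule is simple enough to state as code. The following is a sketch of the rule as described here, not the plugin's source; `choose_parser` is a hypothetical name, and deciding per individual format string is an assumption.

```python
def choose_parser(format_string):
    """Pick the parser family for one format string.

    Ruby-style formats use % directives; Joda-Time patterns do not.
    """
    return "jruby" if "%" in format_string else "java"


if __name__ == "__main__":
    for fmt in ["%Y-%m-%d %H:%M:%S.%N %z", "yyyy-MM-dd HH:mm:ss.SSS z"]:
        print(fmt, "->", choose_parser(fmt))
```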
Automatic Conversion of Ruby Timestamp Format to Java Timestamp Format (experimental)
If you configure timestamp_parser: auto_java, this plugin tries to convert the Ruby format into a Java format automatically so that the faster Java timestamp parser can be used.
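To give a feel for what such a conversion involves, here is a rough Python sketch that rewrites a few Ruby directives into Joda-Time style pattern letters. The directive table is hypothetical and covers only the directives that appear in this README; the plugin's real table and its fallback behaviour are not documented here. In particular, mapping %z to Z and returning None for anything unrecognized are assumptions of this sketch.

```python
# Hypothetical directive table for illustration; not the plugin's actual table.
RUBY_TO_JAVA = {
    "%Y": "yyyy",
    "%m": "MM",
    "%d": "dd",
    "%H": "HH",
    "%M": "mm",
    "%S": "ss",
    "%N": "nnnnnnnnn",  # this plugin's own nanosecond pattern
    "%z": "Z",          # assumed; the Java examples in this README use both z and Z
}


def ruby_to_java_format(ruby_format):
    """Convert a Ruby strftime format to a Java-style pattern.

    Returns None when the format uses something this sketch does not know,
    which a caller could treat as "keep using the Ruby parser".
    """
    out = []
    i = 0
    while i < len(ruby_format):
        ch = ruby_format[i]
        if ch == "%":
            directive = ruby_format[i:i + 2]
            if directive not in RUBY_TO_JAVA:
                return None
            out.append(RUBY_TO_JAVA[directive])
            i += 2
        elif ch.isalpha() or ch == "'":
            # Literal letters would need quoting in a Java pattern; give up instead.
            return None
        else:
            out.append(ch)
            i += 1
    return "".join(out)


if __name__ == "__main__":
    print(ruby_to_java_format("%Y-%m-%d %H:%M:%S.%N %z"))
```

Giving up on anything unfamiliar keeps the sketch conservative: a wrong conversion would silently misparse timestamps, while a refused conversion only costs speed.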
COMPARISON:
Benchmark test sets are available at ./bench. In my environment (MacBook Pro), for 1,000,000 timestamps:
The JRuby parser is slow, but the JRuby formatter is not so slow.
The JRuby parser has microsecond resolution. The Java (Joda-Time) parser has millisecond resolution.
Nanosecond resolution is partially supported by this plugin itself. For the Java parser, use nnnnnnnnn in the parser format, as in:
yyyy-MM-dd HH:mm:ss.nnnnnnnnn z
This plugin locates the nanosecond part of the text with the regular expression \.(\d+).
For the formatter, you can use nnnnnnnnn for nanoseconds and nnnnnn for microseconds, as in:
yyyy-MM-dd HH:mm:ss.nnnnnnnnn z
yyyy-MM-dd HH:mm:ss.nnnnnn z
FYI: Java 8's DateTimeFormatter supports nanosecond resolution, but we cannot use it because Embulk supports Java 7.
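The fractional-second handling described above can be sketched in a few lines of Python. Only the regular expression \.(\d+) comes from this README. The function names, the right-padding to nine digits, and the truncation used for formatting are assumptions about how such a step might work, not the plugin's implementation.

```python
import re

# The regular expression quoted in this README for locating the fraction.
FRACTION = re.compile(r"\.(\d+)")


def extract_nanos(text):
    """Return the fractional second of a timestamp text as nanoseconds."""
    m = FRACTION.search(text)
    if m is None:
        return 0
    # Keep at most nine digits and right-pad, so ".1" means 100000000 ns.
    return int(m.group(1)[:9].ljust(9, "0"))


def format_fraction(nanos, width):
    """Render nanoseconds with the given number of digits (9 = nano, 6 = micro)."""
    return f"{nanos:09d}"[:width]


if __name__ == "__main__":
    nanos = extract_nanos("2015-07-12 15:00:00.1 UTC")
    print(nanos)
    print(format_fraction(nanos, 9), format_fraction(nanos, 6))
```

Here `format_fraction(nanos, 9)` corresponds to the nnnnnnnnn pattern and `format_fraction(nanos, 6)` to nnnnnn.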
Run example:
$ ./gradlew classpath
$ embulk preview -I lib example/example.yml
Run test:
$ ./gradlew test
Run checkstyle:
$ ./gradlew check
Release gem:
$ ./gradlew gemPush