Jupyter tips

As I do data-related projects only on a semi-regular basis, I often find myself regrouping from the context switch. There's not much one can do about the inherent complexity of the subject, but maybe we can lessen the incidental complexity. One of the things I do is amend my Jupyter workflow with slight touches that, I think, improve overall ergonomics. Gathering the following tips will help my future self make this context switch a bit easier - maybe you'll find some value in them as well.

Stable test-set split

One thing I always find myself needing is a future-stable test-set split keyed on a record ID, and I haven't found an option in scikit-learn's train_test_split that offers this. Its random_state makes a split reproducible only as long as the dataset itself doesn't change. Let me know if I missed it, or if there's an equivalent in some other package. In lieu of that, I often use the following split_train_test_by_id, which puts a record in the test set when the last byte of its ID's MD5 hash falls in the [0, test_ratio*256) range.

from hashlib import md5

def _is_in_test_set(id_, test_ratio):
  id_hash = md5(id_.encode('utf-8')).digest()
  return id_hash[-1] < test_ratio * 256

def split_train_test_by_id(data, test_ratio, id_column):
  in_test_set = data[id_column].apply(
    lambda id_: _is_in_test_set(id_, test_ratio))
  return data.loc[~in_test_set], data.loc[in_test_set]

df_train, df_test = split_train_test_by_id(df, test_ratio=0.2, id_column='an_id_column')
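
To convince myself that the split really is future-stable, here is a minimal, stdlib-only sketch of the same last-byte idea. The `user-N` IDs and the `split_ids` helper are made up for illustration, and the `str()` call is my own addition so that integer IDs work too; the pandas version above only handles string IDs.

```python
from hashlib import md5


def is_in_test_set(id_, test_ratio):
    """Return True when the last MD5 byte of the ID is below test_ratio * 256."""
    # str() is an addition here so that non-string IDs (e.g. ints) can be hashed.
    id_hash = md5(str(id_).encode("utf-8")).digest()
    return id_hash[-1] < test_ratio * 256


def split_ids(ids, test_ratio):
    """Split an iterable of IDs into (train, test) lists (hypothetical helper)."""
    train, test = [], []
    for id_ in ids:
        (test if is_in_test_set(id_, test_ratio) else train).append(id_)
    return train, test


# Made-up IDs: an initial batch, then the same batch plus 500 newer records.
old_ids = [f"user-{i}" for i in range(1000)]
new_ids = old_ids + [f"user-{i}" for i in range(1000, 1500)]

_, old_test = split_ids(old_ids, test_ratio=0.2)
new_train, new_test = split_ids(new_ids, test_ratio=0.2)

# Records that were in the test set before the data grew are still there,
# and none of them has leaked into the training set.
assert set(old_test) <= set(new_test)
assert not set(old_test) & set(new_train)
```

Membership depends only on the ID itself, so appending, removing, or reordering rows never moves an existing record across the split. The trade-off is that the test share is only approximately test_ratio, and with a single byte it can be tuned in steps of 1/256 at best.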


Custom keyboard shortcuts

One can rebind some of the built-in commands either A) through the web interface or B) via the "keys" object in ~/.jupyter/nbconfig/notebook.json. New, custom shortcuts should go in ~/.jupyter/custom/, though I haven't yet found the right incantation (more on that below). So for now I run a %%javascript cell like the following in each notebook.

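Jupyter.keyboard_manager.edit_shortcuts.add_shortcut('ctrl-alt-k', {
  help: 'Kill line',
  help_index: 'zz',
  handler: function(env) {
    var cm = env.notebook.get_selected_cell().code_mirror
    // Assumption: run CodeMirror's built-in killLine command,
    // which deletes from the cursor to the end of the line.
    cm.execCommand('killLine')
    return false
  }
})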

As mentioned, rather than copy-pasting the custom shortcuts above between notebooks, a better solution would be to have them set automatically on startup.

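$([IPython.events]).on('app_initialized.NotebookApp', function(){
  // WIP, both in actually getting it working and
  // having the command duplicate a line robustly.
  // CodeMirror key names are capitalized, with modifiers ordered Shift-Cmd-Ctrl-Alt.
  CodeMirror.keyMap.macDefault["Shift-Cmd-D"] = function(cm){
    var current_cursor = cm.doc.getCursor();
    var line_content = cm.doc.getLine(current_cursor.line);
    // Assumption: insert the copy above the current line, so the cursor ends up on the lower copy.
    cm.doc.replaceRange(line_content + "\n", {line: current_cursor.line, ch: 0});
    cm.doc.setCursor(current_cursor.line + 1, current_cursor.ch);
  };
});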

Built-in commands reference

I haven't found an official, exhaustive list of the provided commands from which one can build further, custom ones. These are some that I scraped from the web.


Better plotting defaults

These are arguably better plotting defaults. In particular, ConciseDateConverter keeps date axes from becoming overcrowded and unreadable.

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.rcParams["figure.figsize"] = (16, 9)

import datetime
import numpy as np
import matplotlib.dates as mdates
import matplotlib.units as munits
converter = mdates.ConciseDateConverter()
munits.registry[np.datetime64] = converter
munits.registry[datetime.date] = converter
munits.registry[datetime.datetime] = converter
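
As a quick sanity check that the registration takes effect, here is a small sketch that plots a made-up daily series and inspects the formatter the x-axis ends up with. The data and the Agg backend are my assumptions, there only so the snippet also runs outside a notebook; inside Jupyter you would keep %matplotlib inline instead.

```python
import datetime

import matplotlib
matplotlib.use("Agg")  # headless backend; an assumption for running outside Jupyter
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import matplotlib.units as munits

# Same registration as above, limited to the stdlib date types.
converter = mdates.ConciseDateConverter()
munits.registry[datetime.date] = converter
munits.registry[datetime.datetime] = converter

# Made-up daily series covering roughly three months.
start = datetime.datetime(2021, 1, 1)
days = [start + datetime.timedelta(days=i) for i in range(90)]
values = [i % 7 for i in range(90)]

fig, ax = plt.subplots(figsize=(16, 9))
ax.plot(days, values)
fig.canvas.draw()  # force the tick labels to be computed

# The axis should now be using the concise formatter rather than the default one.
formatter = ax.xaxis.get_major_formatter()
tick_labels = [label.get_text() for label in ax.get_xticklabels()]
print(type(formatter).__name__, tick_labels)
```

If I'm not mistaken, Matplotlib 3.4 and later also expose a date.converter rcParam that accepts 'concise', which would replace the three registry lines with a single plt.rcParams assignment.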