Jupyter tips

As I do data-related projects only on a semi-regular basis, I often find myself regrouping from the context switch. There’s not much one can do about the inherent complexity of the subject, but maybe we can lessen the incidental one.

One of the things I often do is amend my Jupyter workflow with slight ergonomic improvements. Gathering the following tips will help my future self make this context switch a bit easier; maybe you’ll find them helpful as well.

Stable test-set split

One thing I often need is a test-set split that stays stable over time: a record that lands in the test set today should still be there after the data set grows. I haven’t found an option in sklearn.model_selection.train_test_split that would offer this. Let me know if I missed it, or if there’s an equivalent in some other package. In lieu of that I often use the following split_train_test_by_id, which puts a record in the test set when the last byte of its ID’s MD5 hash falls in the [0, test_ratio*256) range.

from hashlib import md5

def _is_in_test_set(id_, test_ratio):
  id_hash = md5(id_.encode('utf-8')).digest()
  return id_hash[-1] < test_ratio * 256

def split_train_test_by_id(data, test_ratio, id_column):
  in_test_set = data[id_column].apply(
    lambda id_: _is_in_test_set(id_, test_ratio))
  return data.loc[~in_test_set], data.loc[in_test_set]

df_train, df_test = split_train_test_by_id(df, test_ratio=0.2, id_column='an_id_column')
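
To convince myself that the split really is stable, here is a small sketch that uses only the standard library. The IDs are made up for illustration, and the str() call is my own addition (an assumption, so that non-string IDs can be hashed as well); the membership rule itself is the same as above.

```python
from hashlib import md5

def is_in_test_set(id_, test_ratio):
    # Same rule as _is_in_test_set above; str() is an added assumption so that
    # non-string IDs (e.g. integers) can be hashed too.
    id_hash = md5(str(id_).encode("utf-8")).digest()
    return id_hash[-1] < test_ratio * 256

# Hypothetical IDs: 1000 original records, then 500 more arrive later.
old_ids = [f"user-{i}" for i in range(1000)]
all_ids = old_ids + [f"user-{i}" for i in range(1000, 1500)]

old_test = {id_ for id_ in old_ids if is_in_test_set(id_, 0.2)}
new_test = {id_ for id_ in all_ids if is_in_test_set(id_, 0.2)}

# No original record changed sides after the data grew.
unchanged = (new_test & set(old_ids)) == old_test
test_share = len(new_test) / len(all_ids)
print(unchanged, round(test_share, 3))
```

Since only one byte of the hash is used, the effective ratio moves in steps of 1/256: with test_ratio=0.2 the threshold is 51.2, so byte values 0 to 51 qualify and the expected test share is 52/256, roughly 0.203.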

Shortcuts

One can rebind some built-in commands either A) through the web interface or B) via the "keys" object in ~/.jupyter/nbconfig/notebook.json. Any new, custom shortcuts should go into ~/.jupyter/custom/ (custom.js in the classic Notebook). Otherwise, a %%javascript cell like the following has to be executed in each notebook.

%%javascript
Jupyter.keyboard_manager.edit_shortcuts.add_shortcut('ctrl-alt-k', {
  help: 'Kill line',
  help_index: 'zz',
  handler: function(env) {
    var cm = env.notebook.get_selected_cell().code_mirror
    cm.execCommand('goLineStart')
    cm.execCommand('killLine')
    cm.execCommand('delCharAfter')
    return false
  }
})

As mentioned, rather than copy-pasting the custom shortcuts above between notebooks, a better solution is to have them set automatically on startup.

$([IPython.events]).on('app_initialized.NotebookApp', function(){
  // WIP: have the command duplicate a line robustly.
  // CodeMirror key names are case-sensitive, with modifiers ordered Shift-Cmd-Ctrl-Alt.
  CodeMirror.keyMap.macDefault["Shift-Cmd-D"] = function(cm){
    var current_cursor = cm.doc.getCursor();
    var line_content = cm.doc.getLine(current_cursor.line);
    CodeMirror.commands.goLineEnd(cm);
    CodeMirror.commands.newlineAndIndent(cm);
    cm.doc.replaceSelection(line_content);
    cm.doc.setCursor(current_cursor.line + 1, current_cursor.ch);
  };
});

Better plotting defaults

These are arguably better plotting defaults. In particular, registering ConciseDateConverter keeps date axes from becoming overcrowded and unreadable.

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.rcParams["figure.figsize"] = (16, 9)
sns.set()

import datetime
import numpy as np
import matplotlib.dates as mdates
import matplotlib.units as munits
converter = mdates.ConciseDateConverter()
munits.registry[np.datetime64] = converter
munits.registry[datetime.date] = converter
munits.registry[datetime.datetime] = converter
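
A minimal sketch of the converter in action, assuming a made-up 90-day series. Once the converter is registered, any axis that receives datetime values should pick up the concise formatter on its own. The Agg backend line is only there so the sketch also runs outside a notebook; drop it when using %matplotlib inline.

```python
import datetime

import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import matplotlib.units as munits

# Register the concise converter for the standard-library date types.
converter = mdates.ConciseDateConverter()
munits.registry[datetime.date] = converter
munits.registry[datetime.datetime] = converter

# Hypothetical data: 90 consecutive days with a weekly sawtooth pattern.
start = datetime.datetime(2021, 1, 1)
days = [start + datetime.timedelta(days=i) for i in range(90)]
values = [i % 7 for i in range(90)]

fig, ax = plt.subplots(figsize=(16, 9))
ax.plot(days, values)
fig.canvas.draw()

# The x-axis now uses the formatter and locator supplied by the converter.
formatter = ax.xaxis.get_major_formatter()
locator = ax.xaxis.get_major_locator()
print(type(formatter).__name__, type(locator).__name__)
```

Registering the converter globally means every later plot in the notebook is affected, which is exactly what I want for defaults; for a one-off plot, setting the formatter on a single axis would be the less invasive choice.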