Jupyter tips
As I do data-related projects only on a semi-regular basis, I often find myself regrouping from the context switch. There’s not much one can do about the inherent complexity of the subject, but maybe we can lessen the incidental one.
One of the things I often do is amend my Jupyter workflow with slight ergonomic improvements. Gathering the following tips will help my future self make this context switch a bit easier; maybe you’ll find them helpful as well.
Stable test-set split
One thing I often need is a future-stable test-set split keyed on a record ID, and I haven't found an option in scikit-learn's train_test_split that offers this. Let me know if I missed it, or if there's an equivalent in some other package. In lieu of that I often use the following split_train_test_by_id, which splits the data based on whether the last byte of each record's ID hash falls in the [0, test_ratio*256) range.
from hashlib import md5

def _is_in_test_set(id_, test_ratio):
    # A record is a test record if the last byte of its ID hash is below test_ratio * 256.
    id_hash = md5(str(id_).encode('utf-8')).digest()
    return id_hash[-1] < test_ratio * 256

def split_train_test_by_id(data, test_ratio, id_column):
    in_test_set = data[id_column].apply(
        lambda id_: _is_in_test_set(id_, test_ratio))
    # Returns (train, test).
    return data.loc[~in_test_set], data.loc[in_test_set]

df_train, df_test = split_train_test_by_id(df, test_ratio=0.2, id_column='an_id_column')
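To make the stability claim concrete, here is a minimal, pandas-free sketch of the same last-hash-byte rule. The IDs (user-0, user-1, ...) and the helper name is_in_test_set are made up for illustration. The point is only that a record's bucket depends on its ID alone, so records appended later should never move an existing record across the split.

```python
from hashlib import md5


def is_in_test_set(id_, test_ratio):
    """Return True if the last MD5 byte of the ID falls in [0, test_ratio * 256)."""
    id_hash = md5(str(id_).encode("utf-8")).digest()
    return id_hash[-1] < test_ratio * 256


# Hypothetical IDs, invented for this sketch.
old_ids = [f"user-{i}" for i in range(10_000)]
new_ids = [f"user-{i}" for i in range(10_000, 12_000)]
old_id_set = set(old_ids)

# Split the original dataset.
test_before = {id_ for id_ in old_ids if is_in_test_set(id_, 0.2)}

# Re-run the split after the dataset has grown.
test_after = {id_ for id_ in old_ids + new_ids if is_in_test_set(id_, 0.2)}

# Every old record keeps its assignment; only new IDs can join the test set.
stable = test_before == (test_after & old_id_set)

# Share of the original records that landed in the test set.
test_fraction = len(test_before) / len(old_ids)
print(stable, round(test_fraction, 3))
```

Since only one byte is compared, the ratio has a granularity of 1/256: with test_ratio=0.2 the threshold is 51.2, so byte values 0 through 51 qualify and the expected test share is 52/256, roughly 20.3%, rather than exactly 20%.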
Shortcuts
One can rebind some built-in commands either A) through the web interface or B) via the "keys" object in ~/.jupyter/nbconfig/notebook.json. Any new, custom shortcuts should go to ~/.jupyter/custom/. Alternatively, a %%javascript cell like the following one must be executed in each notebook.
%%javascript
Jupyter.keyboard_manager.edit_shortcuts.add_shortcut('ctrl-alt-k', {
    help: 'Kill line',
    help_index: 'zz',
    handler: function(env) {
        var cm = env.notebook.get_selected_cell().code_mirror;
        cm.execCommand('goLineStart');
        cm.execCommand('killLine');
        cm.execCommand('delCharAfter');
        return false;
    }
});
As mentioned, rather than copy-pasting the custom shortcuts above between notebooks, a better solution is to have them set automatically on startup.
$([IPython.events]).on('app_initialized.NotebookApp', function() {
    // WIP: have the command duplicate a line robustly.
    // CodeMirror key names are case-sensitive and expect modifiers
    // in the order Shift-Cmd-Ctrl-Alt.
    CodeMirror.keyMap.macDefault["Shift-Cmd-D"] = function(cm) {
        var current_cursor = cm.doc.getCursor();
        var line_content = cm.doc.getLine(current_cursor.line);
        CodeMirror.commands.goLineEnd(cm);
        CodeMirror.commands.newlineAndIndent(cm);
        cm.doc.replaceSelection(line_content);
        cm.doc.setCursor(current_cursor.line + 1, current_cursor.ch);
    };
});
Better plotting defaults
These are arguably better plotting defaults, especially the use of ConciseDateConverter, which keeps date axes from becoming overcrowded and unreadable.
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.rcParams["figure.figsize"] = (16, 9)
sns.set()

import datetime
import matplotlib.dates as mdates
import matplotlib.units as munits
import numpy as np

# Use the concise converter for every date-like type matplotlib may encounter.
converter = mdates.ConciseDateConverter()
munits.registry[np.datetime64] = converter
munits.registry[datetime.date] = converter
munits.registry[datetime.datetime] = converter
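To check that the registration takes effect, one can plot a small date series and inspect which formatter the x-axis ends up with. The following is a minimal sketch: the timestamps and values are made up, and the Agg backend is selected only so that it also runs outside Jupyter (inside a notebook, %matplotlib inline would be used instead).

```python
import datetime

import matplotlib

matplotlib.use("Agg")  # headless backend; an assumption so the sketch runs anywhere

import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import matplotlib.units as munits

# Register the concise converter for the standard-library date types.
converter = mdates.ConciseDateConverter()
munits.registry[datetime.date] = converter
munits.registry[datetime.datetime] = converter

# Made-up data: one point every six hours over thirty days.
start = datetime.datetime(2021, 1, 1)
timestamps = [start + datetime.timedelta(hours=6 * i) for i in range(120)]
values = [i % 7 for i in range(120)]

fig, ax = plt.subplots()
ax.plot(timestamps, values)
fig.canvas.draw()  # force tick labels to be computed

# The axis should now be formatted by ConciseDateFormatter.
formatter = ax.xaxis.get_major_formatter()
labels = [tick.get_text() for tick in ax.get_xticklabels()]
print(type(formatter).__name__)
print(labels)
plt.close(fig)
```

The printed tick labels should be noticeably shorter than the default full-date labels, since the concise formatter moves the shared part of the date (such as the year and month) into a single offset label at the end of the axis.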