Git commands

 Command №1: git diff

git diff is the command to use when you need to compare two commits, or a commit against the working tree. If you’re not familiar with the concept, the working tree is the directory on your system that holds the checked-out files of your repository.

To analyze the state of a repo, git diff is often used together with git status and git log.


In general, git diff is used to get the difference between two “things.” There are six typical uses:

  1. Show changes within a local repo, that is, anything that has been modified in the repo’s working tree.
  2. Show the difference between the local and remote repos. If you made changes on your local device and other changes landed on the remote, git diff (run against a fetched remote branch) can help you identify exactly what changed.
  3. Show the differences between any two commits of the repo.
  4. Show how specific files differ between two commits; the output includes the line numbers of each change.
  5. Show the difference between two local or remote branches.
  6. Show the difference between two tags of the repo. Tags are often used to refer to working versions of the repo. For example, you can use git diff to identify the differences between version 1.0.0 and version 1.1.0 of your application.
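
The six cases above correspond roughly to the invocations below. This is a minimal sketch, written in Python to match the other code in these notes; the branch names, tag names, commit references, and file path are hypothetical placeholders, and `run_diff` assumes that `git` is on the PATH and that it is called from inside a repository.

```python
import subprocess

# Hypothetical refs and paths, used only for illustration.
DIFF_EXAMPLES = {
    "local changes": ["git", "diff"],
    "local vs. remote branch": ["git", "diff", "main", "origin/main"],
    "two commits": ["git", "diff", "HEAD~2", "HEAD"],
    "one file across two commits": ["git", "diff", "HEAD~2", "HEAD", "--", "README.md"],
    "two branches": ["git", "diff", "main", "feature"],
    "two tags": ["git", "diff", "v1.0.0", "v1.1.0"],
}


def run_diff(case: str) -> str:
    """Run one of the example commands and return the diff text it prints."""
    result = subprocess.run(
        DIFF_EXAMPLES[case], capture_output=True, text=True, check=True
    )
    return result.stdout
```

Calling `run_diff("two tags")` only works in a repository that actually has both tags; the dictionary itself is just a lookup table of command lines.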

Command №2: git filter-branch

This command is used to rewrite your repo’s history. It does that by applying custom filters to each revision of the repo. The filters can modify each tree or the information about each commit; anything the filters leave untouched, including the original commit times and merge information, is preserved.


The general syntax for this command is git filter-branch <filters> branch_name. Seven of the available filters are described below; each rewrites a different aspect of the branch’s history.

  1. subdirectory-filter: This filter keeps only the history that touches a specific subdirectory of the branch and makes that subdirectory the new project root.
  2. env-filter: This filter is often used to rewrite the environment information of a commit, for example the author’s name, email, or the time of the commit.
  3. tree-filter: This filter is very powerful: it checks out every commit of the branch and runs your command in the checked-out tree, which means it can change, remove, add, or move files.
  4. index-filter: Similar to tree-filter, but it works on the index instead of checking out the entire tree. Hence it is much faster, especially for large repos.
  5. parent-filter: This filter rewrites the parent list of a commit.
  6. msg-filter: If you only want to change the commit messages, this filter is the way to go.
  7. tag-name-filter: If you want to rewrite the names of the tags that point at rewritten commits, use this filter.

Command №3: git bisect

In my opinion, this is probably one of the most important Git commands. Bugs can kill your application, and sometimes debugging a repo is not an easy task. git bisect can be used to find the commit that introduced a bug.

The idea behind git bisect is to perform a binary search over the commit history to find a particular regression. A regression is a bug that breaks something that used to work, often as a side effect of a seemingly unrelated change in the code.

You give git bisect a known good commit and a known bad commit. It then repeatedly checks out a commit roughly halfway between the two and asks you whether it is good or bad, that is, whether the regression is present in that commit or not. Each answer halves the remaining range until only the broken commit is left.
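
The search that git bisect automates can be sketched in a few lines of Python. This is an illustration of the idea only, not how Git implements it; the function name, the commit labels, and the `check` callback are hypothetical, and the sketch assumes a linear history in which every commit after the first bad one is also bad.

```python
from typing import Callable, List, Sequence


def first_bad_commit(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Binary search for the first bad commit.

    Assumes commits[0] is known good and commits[-1] is known bad.
    """
    good, bad = 0, len(commits) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(commits[mid]):
            bad = mid      # the regression is at mid or earlier
        else:
            good = mid     # the regression was introduced after mid
    return commits[bad]


history = ["c0", "c1", "c2", "c3", "c4", "c5", "c6", "c7"]
tested: List[str] = []


def check(commit: str) -> bool:
    """Stand-in for a manual test; here the bug appears in c5."""
    tested.append(commit)
    return history.index(commit) >= 5


culprit = first_bad_commit(history, check)
print(culprit, tested)
```

With eight commits, only three of them have to be tested, which is why bisecting stays practical even over long histories.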

Command №4: git grep

Trying to find something in your repo? Want to search your tracked files, or even other branches, for a specific string? git grep is here to help you do this with ease. git grep is basically used to search for a pattern in the tracked files of a working tree.

You can use git grep to search for either exact words or regular expressions in the repo. There are various options you can use with this command. Assume we are looking for the word doc in the repo; we can use one of these options:

  1. Show the line number of each match: git grep -n doc
  2. List only the names of the files that contain a match: git grep -l doc
  3. Search using a regex pattern: git grep "f[^\s]\w"
  4. Count the matching lines in each file: git grep -c doc

git grep can also be used to search for multiple words using and/or relations. Moreover, it can search in a specific commit, branch, or find all the occurrences between two commits or tags in the repo.

Command №5: git blame

The git blame command shows, for each line of a file, the commit and author that last modified it. It can be used to track bugs and find the commits that produced an error. On a higher level, git blame is used to inspect specific points in repo history and obtain information on who last committed and what they really changed.

git blame displays the last author that modified a line; you can even restrict it to exact line numbers (with the -L option) and get the commits that affected those lines and who made them.

Some people confuse git blame with git log. Although they may sound similar, if you just need to list the commits, what they changed, and when they were made, git blame is a cumbersome way to get that; git log is the better option. However, if you only want to display the metadata of the person who last committed to particular lines, then git blame should be your command of choice.

Python sort() and sorted()

The sort() function is actually an instance method of the list data type, so the usage is list.sort().

The sorting performed by sort() happens in place and returns None, so type(nums.sort()) is NoneType.

sorted() accepts a list or any other iterable as its argument and returns a new list.

A lambda in Python is just an anonymous function that takes any number of arguments but consists of only one expression.

# df is assumed here to be a list of dicts (records), not a pandas DataFrame
sorted_activities = sorted(df, key=lambda x: x['col1'], reverse=True)

# To sort by two keys, return them as a tuple
sorted_activities = sorted(df, key=lambda x: (x['col1'], x['col2']), reverse=True)

from operator import itemgetter
sorted_activities = sorted(activities, key=itemgetter('day', 'activity'), reverse=True)
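
The points above can be checked with a small self-contained example. The `activities` records below are made-up sample data; the rest uses only the built-in `list.sort()`, `sorted()`, and `operator.itemgetter`.

```python
from operator import itemgetter

nums = [3, 1, 2]
result = nums.sort()       # sorts the list in place
print(result, nums)        # result is None; nums has been reordered

letters = sorted("cab")    # any iterable works; a new list is returned
print(letters)

# Made-up records for illustration.
activities = [
    {"day": "Mon", "activity": "run"},
    {"day": "Tue", "activity": "swim"},
    {"day": "Mon", "activity": "yoga"},
]

# Two equivalent ways to sort by (day, activity) in descending order.
by_lambda = sorted(activities, key=lambda x: (x["day"], x["activity"]), reverse=True)
by_itemgetter = sorted(activities, key=itemgetter("day", "activity"), reverse=True)
print([a["activity"] for a in by_itemgetter])
```

itemgetter with several keys returns a tuple, so it behaves like the tuple-returning lambda while being a little shorter to write.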

Recommendation ranking

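This article mainly discusses analysis techniques for recommender systems: how to stop tuning parameters mindlessly, and how to use simple, unsophisticated, thoroughly low-tech methods to move quickly and grow your metrics.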



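A: This is a common point of confusion. When metrics are stuck at a bottleneck, it is easy to put blind faith in brute-force solutions. In fact, rank is not a tool for lifting metrics: however many features you add, they essentially only help rank reproduce the system’s distribution more faithfully. Strategy is what delivers growth and drives the system in a healthy direction (said tongue in cheek: the porn, gambling, and drugs direction). To find out whether the metric bottleneck really comes from rank lacking features, you need to analyze it from several angles; a common approach is to examine conditional COPC (click over predicted click).

Analysis method: split the rank score into several intervals and compute the actual CTR in each interval. Going one step further, you can split into multiple buckets, for example into two groups A and B according to some feature, and compute the actual CTR of each group separately.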

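  1. The rank score increases monotonically and CTR increases monotonically, but very slowly: for example, the CTR of the 0.9-1.0 interval is only slightly higher than the CTR of the 0.3-0.4 interval. This is the case where your model really is missing features, especially features for active users. For active users, rank usually relies on heavy behavioral profiles as features for the sake of the metrics, which easily amplifies their historical click records and tends to rank items higher for them. If you overestimate such a user’s tendency to click, the scores come out very high while in reality the user does not click much.
  2. The rank score increases monotonically and CTR also increases monotonically, but the CTR of groups A and B differs too much: for example, let A and B be morning and afternoon. If the CTR within the same score interval differs too much between these two periods, the model does not capture the time dimension well enough and needs further improvement.
     This is easy to prove. Suppose the model distributions are q(y|A), q(y|B) and the true online distributions are p(y|A), p(y|B). A and B falling in the same score interval is mathematically equivalent to q(y|A) = q(y|B). If you assume the model fits the true distribution correctly, that is q(y|A) = p(y|A) and q(y|B) = p(y|B), it follows that p(y|A) = p(y|B). In reality the CTRs of A and B are not equal, p(y|A) != p(y|B), so the assumption that the model correctly fits both conditional distributions does not hold.
  3. The rank score increases monotonically, CTR increases monotonically, and the CTR ratios under groupings along various dimensions are also roughly stable: congratulations, at this point there is almost nothing left to do on rank. What remains is to optimize recall, to steer rank toward new product ideas through strategy, and to drive the system in a healthy direction at a higher level (again tongue in cheek: the porn, gambling, and drugs direction).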
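
The bucketed analysis described above could be sketched as follows. The function name, the record layout `(score, group, clicked)`, and the toy data are assumptions made for illustration; COPC is computed here as observed clicks divided by the sum of predicted scores in the same cell.

```python
from collections import defaultdict


def bucketed_ctr(records, n_buckets=10):
    """Aggregate (score, group, clicked) records per (score bucket, group).

    Scores are assumed to lie in [0, 1].
    """
    totals = defaultdict(lambda: {"impressions": 0, "clicks": 0, "predicted": 0.0})
    for score, group, clicked in records:
        bucket = min(int(score * n_buckets), n_buckets - 1)
        cell = totals[(bucket, group)]
        cell["impressions"] += 1
        cell["clicks"] += int(clicked)
        cell["predicted"] += score

    table = {}
    for key, cell in totals.items():
        table[key] = {
            "ctr": cell["clicks"] / cell["impressions"],
            # click over predicted click; NaN if the scores sum to zero
            "copc": cell["clicks"] / cell["predicted"] if cell["predicted"] else float("nan"),
            "impressions": cell["impressions"],
        }
    return table


# Toy data: same score interval, but morning users click twice as often.
records = [
    (0.95, "am", 1), (0.95, "am", 1), (0.95, "am", 0), (0.95, "am", 0),
    (0.95, "pm", 1), (0.95, "pm", 0), (0.95, "pm", 0), (0.95, "pm", 0),
]
table = bucketed_ctr(records)
for key in sorted(table):
    print(key, table[key])
```

In this toy data both groups sit in the same score bucket but their actual CTRs differ by a factor of two, which is the pattern of case 2: the model is not capturing the dimension used for grouping.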


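A: Mathematically, CTR falls as PV increases, but in general it stays roughly constant. If CTR drops too sharply when PV grows, you should be suspicious of the audience you are delivering to. Suppose you have two populations, A and B. Population A is active and likes to click, so its CTR is high; population B is not very active and rarely clicks, so its CTR is lower. Under random delivery, CTR is generally fairly stable. In practice, however, a poorly built rank tends to push large amounts of similar content, or even very hot content, to an active population like A. After being hit with a flood of homogeneous content in a short time, population A’s interest saturates and the clicks it contributes decline. From then on the system’s gains come only sporadically from population B, so PV keeps growing while the number of clicks does not.
If your boss wants CTR protected by any means necessary, you can borrow the idea of pacing from advertising: once an item has reached a certain peak number of high-quality users within a short period, it should step back and cool down, so that rank does not keep pushing it into top positions where it consumes a great deal of attention and impression slots. This treats the symptom rather than the root cause, though; after all, for the sake of stickiness, the things people like to click still have to be recommended more.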
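
One way the pacing idea could look in code is sketched below. Everything here is an assumption made for illustration: the class name, the thresholds, and the simplification of counting raw impressions per item rather than reach among high-quality users.

```python
from collections import defaultdict, deque


class ItemPacer:
    """Cools an item down once it has been shown too often within a short window."""

    def __init__(self, max_impressions: int, window_seconds: float, cooldown_seconds: float):
        self.max_impressions = max_impressions
        self.window_seconds = window_seconds
        self.cooldown_seconds = cooldown_seconds
        self._recent = defaultdict(deque)   # item -> timestamps of recent impressions
        self._cooling_until = {}            # item -> time at which it may be shown again

    def allow(self, item: str, now: float) -> bool:
        """Return True if the item is not currently cooling down."""
        return now >= self._cooling_until.get(item, float("-inf"))

    def record_impression(self, item: str, now: float) -> None:
        """Record one impression and start a cooldown if the cap is reached."""
        recent = self._recent[item]
        recent.append(now)
        # Drop impressions that have fallen out of the sliding window.
        while now - recent[0] >= self.window_seconds:
            recent.popleft()
        if len(recent) >= self.max_impressions:
            self._cooling_until[item] = now + self.cooldown_seconds
            recent.clear()


pacer = ItemPacer(max_impressions=3, window_seconds=60, cooldown_seconds=300)
for t in (0, 1, 2):
    pacer.record_impression("hot_item", t)

blocked = pacer.allow("hot_item", 3)       # still cooling down
released = pacer.allow("hot_item", 302)    # cooldown has elapsed
other = pacer.allow("other_item", 3)       # unaffected item
print(blocked, released, other)
```

A candidate filter like this would sit in front of rank, removing cooling items from the candidate set before scoring.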


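A: There are some unorthodox approaches I can offer. I call them unorthodox because there really is no general solution; everyone has their own secret recipe. Besides, most strategies in production come about reactively: fixing assorted bad cases, urgently lifting a single business metric, or a product manager asking on a whim for a change to the presentation. Proactively searching for incremental strategies is itself a rather experience-driven craft, in the manner of an old practitioner of traditional Chinese medicine.
Here the term inside min is the clipped IPS score. If the offline evaluation looks passable, then after going online you rely on argmax p(s|A) to generate your strategy environment.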

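Q: How do you measure the popularity of items with few impressions?

A: In general, many items have only a few dozen impressions, while popular items may have tens of thousands or even millions. For an item with very few impressions, just a handful of clicks can produce a very high CTR, even far above the CTR of popular items. Statistically, this problem is usually handled with the Wilson CTR correction, but in practice Wilson CTR is very unreliable: an item with few impressions most likely got them through precisely targeted delivery, which violates the random-delivery assumption behind Wilson CTR. To be precise, we should compute a relative CTR per audience in order to remove the bias of the delivery audience.
Method: suppose item A is delivered to N people, with 200 impressions and 10 clicks. Among the same N people, popular item B received 100000 impressions and 900 clicks. The CTRs of A and B within the same population are therefore (10 / 200, 900 / 100000). We generally regard popular items as independent of personalization, such as breaking news, promoted goods, or pornographic and violent content that everybody likes to click. The CTR of a popular item depends little on the delivery audience, since nearly everyone clicks it, so it can serve as the CTR baseline. Subtracting this baseline shows which items users prefer to click compared with mass-market content, and this relative CTR is used as the measure of item popularity.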
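
The numbers from the example can be worked through in a short sketch. The function names are hypothetical, and "subtracting the baseline" is interpreted here as a plain difference of the two CTRs; the Wilson lower bound is included only for comparison, using the standard formula with z = 1.96.

```python
import math


def wilson_lower_bound(clicks: int, impressions: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson score interval for a click-through rate."""
    if impressions == 0:
        return 0.0
    p = clicks / impressions
    denom = 1 + z * z / impressions
    center = p + z * z / (2 * impressions)
    margin = z * math.sqrt(p * (1 - p) / impressions + z * z / (4 * impressions ** 2))
    return (center - margin) / denom


def relative_ctr(item_clicks: int, item_impr: int, base_clicks: int, base_impr: int) -> float:
    """CTR of an item minus the CTR of a popular baseline item in the same audience."""
    return item_clicks / item_impr - base_clicks / base_impr


ctr_a = 10 / 200            # item A within its audience
ctr_b = 900 / 100000        # popular item B within the same audience
rel = relative_ctr(10, 200, 900, 100000)
print(ctr_a, ctr_b, rel)
print(wilson_lower_bound(10, 200))
```

Item A's CTR of 0.05 against a baseline of 0.009 gives a relative CTR of about 0.041, so within this audience A is clearly preferred over the mass-market item.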

import xgboost as xgb
from sklearn.metrics import accuracy_score, f1_score

# X_train, X_test, target_train and target_test are assumed to be defined already

# Create XGB Classifier object
xgb_clf = xgb.XGBClassifier(objective = "multi:softmax")
# Fit model
xgb_model = xgb_clf.fit(X_train, target_train)
# Predictions
y_train_preds = xgb_model.predict(X_train)
y_test_preds = xgb_model.predict(X_test)
# Print F1 scores and Accuracy
print("Training F1 Micro Average: ", f1_score(target_train, y_train_preds, average = "micro"))
print("Test F1 Micro Average: ", f1_score(target_test, y_test_preds, average = "micro"))
print("Test Accuracy: ", accuracy_score(target_test, y_test_preds))

from sklearn.model_selection import RandomizedSearchCV
import xgboost as xgb

# Create XGB Classifier object
# Note: tree_method "gpu_exact" exists only in older XGBoost releases;
# later releases use "gpu_hist" for GPU training.
# verbosity is an integer level from 0 (silent) to 3 (debug).
xgb_clf = xgb.XGBClassifier(tree_method = "gpu_exact", predictor = "gpu_predictor", verbosity = 1,
                            eval_metric = ["merror", "map", "auc"], objective = "multi:softmax")
# Create parameter grid
parameters = {"learning_rate": [0.1, 0.01, 0.001],
              "gamma" : [0.01, 0.1, 0.3, 0.5, 1, 1.5, 2],
              "max_depth": [2, 4, 7, 10],
              "colsample_bytree": [0.3, 0.6, 0.8, 1.0],
              "subsample": [0.2, 0.4, 0.5, 0.6, 0.7],
              "reg_alpha": [0, 0.5, 1],
              "reg_lambda": [1, 1.5, 2, 3, 4.5],
              "min_child_weight": [1, 3, 5, 7],
              "n_estimators": [100, 250, 500, 1000]}
# Create RandomizedSearchCV Object
xgb_rscv = RandomizedSearchCV(xgb_clf, param_distributions = parameters, scoring = "f1_micro",
                              cv = 7, verbose = 3, random_state = 40)
# Fit the model
model_xgboost = xgb_rscv.fit(X_train, target_train)

Now, let’s take a look at each hyperparameter individually.

  • learning_rate: to start, let’s clarify that this learning rate is not the same as in gradient descent. In gradient boosting, the learning rate scales down the contribution of each additional tree to the model. In their paper XGBoost: A Scalable Tree Boosting System, Tianqi Chen and Carlos Guestrin refer to this regularization technique as shrinkage, and it is an additional method to prevent overfitting. The lower the learning rate, the more robust the model tends to be against overfitting, although it then usually needs more trees to reach the same fit.
  • gamma: often described as a Lagrangian multiplier, its purpose is complexity control. It is a pseudo-regularization term for the loss function: it is the minimum amount by which a split must reduce the loss for that split to happen.
  • max_depth: refers to the depth of a tree. It sets the maximum number of nodes that can exist between the root and the farthest leaf. Remember that deeper trees are prone to overfitting.
  • colsample_bytree: represents the fraction of the columns (features) to be considered for each tree built, so the sampling occurs once for every tree constructed. The paper XGBoost: A Scalable Tree Boosting System by Tianqi Chen and Carlos Guestrin describes column subsampling as another of the main techniques to prevent overfitting and to improve computational speed.
  • subsample: represents the fraction of the rows (observations) to be considered when building each tree. In the same paper, Tianqi Chen and Carlos Guestrin report that, according to user feedback, column subsampling prevents overfitting more effectively than row subsampling.
  • reg_alpha: L1 regularization term. L1 regularization encourages sparsity (meaning pulling weights to 0). It can be more useful when the objective is logistic regression since you might need help with feature selection.
  • reg_lambda: L2 regularization term. L2 encourages smaller weights; this approach can be more useful in tree models, where zeroing features might not make much sense.
  • min_child_weight: similar to gamma, as it performs regularization at the splitting step. It is the minimum sum of instance weights (Hessians) required in a child for a new node to be created. The Hessian is the second derivative of the loss.
  • n_estimators: the number of trees to fit.
  • booster: allows you to choose which booster to use: gbtree, gblinear, or dart. We’ve been using gbtree, but dart and gblinear also have their own additional hyperparameters to explore.
  • scale_pos_weight: balances between negative and positive weights, and should definitely be used in cases where the data present high class imbalance.
  • importance_type: refers to the feature importance type reported by the feature_importances_ attribute. gain calculates the relative contribution of a feature to all the trees in a model (the higher the relative gain, the more relevant the feature). cover calculates the relative number of observations related to a feature when used to decide the leaf node. weight measures the relative number of times a feature is used to split the data across all the trees in a model.
  • base_score: global bias. This parameter is useful when dealing with high class imbalance.
  • max_delta_step: caps the maximum delta step allowed for each leaf’s output, which bounds how far a single update can move the weights. Also useful when dealing with unbalanced classes.
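
To make the roles of gamma, reg_lambda, and min_child_weight at the splitting step more concrete, here is a minimal sketch of the split gain formula from the XGBoost paper. It is a simplified illustration rather than the library's implementation: it ignores reg_alpha, and the function names and the gradient and Hessian sums in the example are made up.

```python
def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Loss reduction of a candidate split, given gradient (g) and Hessian (h) sums."""
    def score(g, h):
        return g * g / (h + reg_lambda)

    parent = score(g_left + g_right, h_left + h_right)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right) - parent) - gamma


def split_allowed(g_left, h_left, g_right, h_right,
                  reg_lambda=1.0, gamma=0.0, min_child_weight=1.0):
    """A split needs enough Hessian weight in both children and a positive gain."""
    if min(h_left, h_right) < min_child_weight:
        return False
    return split_gain(g_left, h_left, g_right, h_right, reg_lambda, gamma) > 0


# Made-up sums for one candidate split.
print(split_gain(-4.0, 3.0, 4.0, 3.0))                           # gain with gamma = 0
print(split_allowed(-4.0, 3.0, 4.0, 3.0, gamma=5.0))             # gamma exceeds the raw gain
print(split_allowed(-4.0, 3.0, 4.0, 3.0, min_child_weight=4.0))  # children are too light
```

Raising gamma subtracts directly from the gain, raising reg_lambda shrinks every score term, and raising min_child_weight vetoes splits whose children carry too little Hessian weight, which is why all three act as regularizers.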
from sklearn.model_selection import RandomizedSearchCV

# Create XGB Classifier object
# verbosity is an integer level from 0 (silent) to 3 (debug)
xgb_clf = xgb.XGBClassifier(tree_method = "exact", predictor = "cpu_predictor", verbosity = 1,
                            objective = "multi:softmax")
# Create parameter grid
parameters = {"learning_rate": [0.1, 0.01, 0.001],
              "gamma" : [0.01, 0.1, 0.3, 0.5, 1, 1.5, 2],
              "max_depth": [2, 4, 7, 10],
              "colsample_bytree": [0.3, 0.6, 0.8, 1.0],
              "subsample": [0.2, 0.4, 0.5, 0.6, 0.7],
              "reg_alpha": [0, 0.5, 1],
              "reg_lambda": [1, 1.5, 2, 3, 4.5],
              "min_child_weight": [1, 3, 5, 7],
              "n_estimators": [100, 250, 500, 1000]}
# Create RandomizedSearchCV Object
xgb_rscv = RandomizedSearchCV(xgb_clf, param_distributions = parameters, scoring = "f1_micro",
                              cv = 10, verbose = 3, random_state = 40)
# Fit the model
model_xgboost = xgb_rscv.fit(X_train, target_train)
# Hyperparameters of the best estimator found by the search
best_params = model_xgboost.best_estimator_.get_params()
print("Learning Rate: ", best_params["learning_rate"])
print("Gamma: ", best_params["gamma"])
print("Max Depth: ", best_params["max_depth"])
print("Subsample: ", best_params["subsample"])
print("Max Features at Split: ", best_params["colsample_bytree"])
print("Alpha: ", best_params["reg_alpha"])
print("Lambda: ", best_params["reg_lambda"])
print("Minimum Sum of the Instance Weight Hessian to Make a Child: ",
      best_params["min_child_weight"])
print("Number of Trees: ", best_params["n_estimators"])