【Competition】Kaggle - G-Research Crypto Forecasting 【Ongoing】

Preliminary Analysis

Requirements Analysis

Summary: the task is to predict the future trend of 14 cryptocurrencies. The hosts wrap this in a fairly elaborate Target, but the core is trend prediction: more precisely, an approximation of each cryptoasset's rate of price change over a future window. The evaluation metric is a weighted Pearson correlation coefficient.

Dataset Structure

train.csv - The training set

  1. timestamp - A timestamp for the minute covered by the row.
  2. Asset_ID - An ID code for the cryptoasset.
  3. Count - The number of trades that took place this minute.
  4. Open - The USD price at the beginning of the minute.
  5. High - The highest USD price during the minute.
  6. Low - The lowest USD price during the minute.
  7. Close - The USD price at the end of the minute.
  8. Volume - The number of cryptoasset units traded during the minute.
  9. VWAP - The volume-weighted average price for the minute.
  10. Target - 15 minute residualized returns. See the 'Prediction and Evaluation' section of this notebook for details of how the target is calculated.
  11. Weight - The weight each asset receives in the evaluation metric, defined by the competition hosts.
  12. Asset_Name - Human readable asset name.

example_test.csv - An example of the data that will be delivered by the time series API.

example_sample_submission.csv - An example of the data that will be delivered by the time series API. The data is just copied from train.csv.

asset_details.csv - Provides the real name of the cryptoasset for each Asset_ID and the weight each cryptoasset receives in the metric.

supplemental_train.csv - After the submission period is over, this file's data will be replaced with cryptoasset prices from the submission period. In the Evaluation phase, the train, train supplement, and test set will be contiguous in time, apart from any missing data. The current copy, which is just filled with approximately the right amount of data from train.csv, is provided as a placeholder.

  • 📌 There are 14 coins in the dataset
  • 📌 There are 4 years in the [full] dataset
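Since Weight and Asset_Name live in asset_details.csv rather than in train.csv, a typical first step is to join them onto the training rows by Asset_ID. A minimal sketch using toy stand-in frames (column names follow the dataset; the values, including the weights, are made up for illustration):

```python
import pandas as pd

# Toy stand-ins for train.csv and asset_details.csv (same columns, fake values)
train = pd.DataFrame({
    'timestamp': [1514764860, 1514764860],
    'Asset_ID': [1, 0],
    'Close': [13850.2, 2376.6],
})
asset_details = pd.DataFrame({
    'Asset_ID': [0, 1],
    'Asset_Name': ['Binance Coin', 'Bitcoin'],
    'Weight': [4.30, 6.78],  # illustrative numbers, not the official weights
})

# Attach the metric weight and the readable name to every training row
train = train.merge(asset_details, on='Asset_ID', how='left')
print(train[['Asset_ID', 'Asset_Name', 'Weight']])
```

A left join keeps every training row even if an Asset_ID were missing from asset_details.csv, which would then surface as NaN weights rather than silently dropped rows.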

Baseline

Competition Target

  • Supervised learning; the feature to predict is "Target"

    • Target: Residual log-returns for the asset over a 15 minute horizon.

    Target

  • Evaluation metric: the weighted Pearson correlation coefficient of the Target
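The "residualized" part means the raw 15-minute log return has the market-wide component subtracted out. A simplified sketch of that construction, based on the hosts' tutorial notebook (the 16/1 shift offsets and the 3750-minute rolling beta window are taken from that tutorial and should be treated as assumptions here):

```python
import numpy as np
import pandas as pd

def log_return_15m(close: pd.Series) -> pd.Series:
    """Forward 15-minute log return: log(P(t+16) / P(t+1))."""
    return np.log(close.shift(-16) / close.shift(-1))

def residualize(r: pd.Series, m: pd.Series, window: int = 3750) -> pd.Series:
    """Remove the market component: Target = R - beta * M,
    with beta = <M*R> / <M^2> estimated over a rolling window."""
    beta = (m * r).rolling(window).mean() / (m ** 2).rolling(window).mean()
    return r - beta * m

# Constant exponential growth gives a flat 15.0 log return per window
close = pd.Series(np.exp(np.arange(40, dtype=float)))
print(log_return_15m(close).iloc[0])  # 15.0
```

Note the sanity check built into the definition: if an asset's return series equals the market series exactly, beta is 1 and the residual collapses to zero, i.e. the Target only rewards movement beyond the market.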
import numpy as np

def weighted_correlation(a, b, weights):
    """
    a: predicted values
    b: actual values
    weights: asset weights
    """
    w = np.ravel(weights)
    a = np.ravel(a)
    b = np.ravel(b)

    sum_w = np.sum(w)
    mean_a = np.sum(a * w) / sum_w
    mean_b = np.sum(b * w) / sum_w
    var_a = np.sum(w * np.square(a - mean_a)) / sum_w
    var_b = np.sum(w * np.square(b - mean_b)) / sum_w

    cov = np.sum(a * b * w) / sum_w - mean_a * mean_b
    corr = cov / np.sqrt(var_a * var_b)

    return corr
  • Digging further, the score comes down to two points:
    1. Weights: since this is a weighted Pearson correlation, Targets with larger weights contribute more to the score
    2. Correlation: what is compared is the relationship between rates of change, i.e. the slope within the time series, which amounts to whether an asset is expected to rise or fall over the coming window.
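One quick sanity check of the metric: with equal weights it should reduce to the ordinary Pearson correlation. A self-contained demo (the function body is repeated here so the snippet runs on its own):

```python
import numpy as np

def weighted_correlation(a, b, weights):
    w, a, b = np.ravel(weights), np.ravel(a), np.ravel(b)
    sum_w = np.sum(w)
    mean_a, mean_b = np.sum(a * w) / sum_w, np.sum(b * w) / sum_w
    var_a = np.sum(w * (a - mean_a) ** 2) / sum_w
    var_b = np.sum(w * (b - mean_b) ** 2) / sum_w
    cov = np.sum(a * b * w) / sum_w - mean_a * mean_b
    return cov / np.sqrt(var_a * var_b)

rng = np.random.default_rng(0)
a = rng.normal(size=1000)            # "predictions"
b = a + rng.normal(scale=0.5, size=1000)  # noisy "actuals"

# Equal weights: should match np.corrcoef exactly
equal = weighted_correlation(a, b, np.ones_like(a))
plain = np.corrcoef(a, b)[0, 1]
print(equal, plain)
```

The match holds because correlation is scale-invariant, so the ddof convention difference between the population-style variances above and `np.corrcoef` cancels out.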

Data Observation

  1. Start with a rough overview
# Get a rough view of the data distribution and check for inf values
train.describe().T.style.bar(subset=['mean'], color='#606ff2')\
    .background_gradient(subset=['std'], cmap='BrBG')\
    .background_gradient(subset=['min'], cmap='BrBG')\
    .background_gradient(subset=['50%'], cmap='BrBG')

EDA

# Volume looks anomalous, especially its max value
# The min of Volume is also anomalous: trading volume should never be negative, so those rows need to be dropped
# Looking at the top 100 rows by Volume: Dogecoin had a short burst of activity that produced the Volume outliers
train.sort_values(by='Volume', ascending=False).head(100)[['Asset_ID', 'Asset_Name']].value_counts(normalize=True)

image-20211209150340417
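The negative-Volume rows flagged above can be removed with a simple boolean filter. A sketch on toy data (column names follow the dataset; the values are made up):

```python
import pandas as pd

train = pd.DataFrame({
    'Asset_ID': [1, 1, 3],
    'Volume': [120.5, -4.2, 981.3],  # -4.2 is physically impossible
})

# Keep only rows with non-negative trading volume
train = train[train['Volume'] >= 0].reset_index(drop=True)
print(len(train))  # 2
```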

def EDA1(data):
    """
    Input
        data: DataFrame to inspect
    """
    stats = []
    for col in data.columns:
        stats.append((
            col,
            data[col].nunique(),
            str((data[col].isnull().sum() * 100 / data.shape[0]).round(3)) + '%',
            str((((data[col] == np.inf).sum() + (data[col] == -np.inf).sum()) * 100 / data.shape[0]).round(3)) + '%',
            str((data[col].value_counts(normalize=True, dropna=False).values[0] * 100).round(3)) + '%',
            data[col].dtype,  # use the passed-in DataFrame, not the global train
        ))
    stats_df = pd.DataFrame(stats, columns=['Feature', 'Unique', 'Percentage of missing values',
                                            'Percentage of infty values',
                                            'Percentage of values in the biggest category', 'type'])
    return stats_df.sort_values('Percentage of missing values', ascending=False)

EDA1(train)

image-20211210182359658
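To see concretely what EDA1 reports, here is a condensed re-implementation of the same per-column summary (numeric percentages instead of strings, for easier assertions) run on a small toy frame:

```python
import numpy as np
import pandas as pd

def eda_summary(data):
    rows = []
    for col in data.columns:
        rows.append({
            'Feature': col,
            'Unique': data[col].nunique(),
            # share of NaN values
            'Missing %': round(data[col].isnull().mean() * 100, 3),
            # share of +/-inf values (numeric columns only)
            'Inf %': round(np.isinf(data[col]).mean() * 100, 3)
                     if np.issubdtype(data[col].dtype, np.number) else 0.0,
            # share taken by the single most frequent value (NaN included)
            'Top category %': round(
                data[col].value_counts(normalize=True, dropna=False).iloc[0] * 100, 3),
        })
    return pd.DataFrame(rows)

df = pd.DataFrame({
    'Asset_ID': [1, 1, 3, 3],
    'VWAP': [10.0, np.inf, 11.0, np.nan],
})
print(eda_summary(df))
```

On this toy frame the VWAP column shows 25% missing and 25% inf, exactly the two data-quality signals the original EDA1 is designed to surface.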