Multi-Factor Stock Models in Practice
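Multi-factor models are one of the core tools of quantitative equity investing.
The idea is to identify several factors that help explain stock returns (such as value, growth, momentum, and quality)
and to combine them to screen stocks, forecast returns, and build portfolios.
This article walks through the core workflow of a multi-factor model:
data acquisition, factor construction, factor validation, factor combination, and portfolio construction with a backtest.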
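1. Environment Setup
First install the core dependencies, which cover data handling, statistics, visualization, and machine learning:
# Install dependencies (the leading "!" is notebook syntax; drop it in a shell)
# xgboost is only needed for section 7.2
!pip install tushare pandas numpy scipy statsmodels matplotlib scikit-learn xgboost --upgrade
Import the core libraries:
import tushare as ts
import pandas as pd
import numpy as np
import scipy.stats as stats
import statsmodels.api as sm
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
import warnings
# Silences all warnings to keep notebook output short; remove this line while debugging
warnings.filterwarnings('ignore')
# Set the Tushare token (register at https://tushare.pro/ to obtain one)
ts.set_token('your Tushare token')
pro = ts.pro_api()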
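2. Data Acquisition
2.1 Fetching the Base Data
Fetch the stock universe, daily prices, and financial indicators.
The example uses the CSI 300 constituents over 2020-2023:
# Get the CSI 300 constituents
# Tushare documents index_weight as monthly data, so query a date window and keep the
# latest snapshot; a single calendar date (such as a public holiday) may return nothing.
# Caveat: applying one snapshot to the whole 2020-2023 sample introduces survivorship bias.
def get_hs300_stocks(start_date='20230101', end_date='20230131'):
    hs300 = pro.index_weight(index_code='000300.SH', start_date=start_date, end_date=end_date)
    latest = hs300['trade_date'].max()
    return hs300.loc[hs300['trade_date'] == latest, 'con_code'].tolist()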
# Fetch daily price data for each stock
def get_stock_data(stock_codes, start_date='20200101', end_date='20231231'):
    df_list = []
    for code in stock_codes:
        try:
            # Daily bars
            df = pro.daily(ts_code=code, start_date=start_date, end_date=end_date)
            # Adjustment factors (to handle dividends and splits)
            adj = pro.adj_factor(ts_code=code, start_date=start_date, end_date=end_date)
            df = df.merge(adj[['trade_date', 'adj_factor']], on='trade_date', how='left')
            # Adjusted close
            df['close_adj'] = df['close'] * df['adj_factor']
            df_list.append(df)
        except Exception as e:
            print(f"Failed to fetch data for {code}: {e}")
            continue
    # Combine all stocks
    data = pd.concat(df_list, ignore_index=True)
    # Convert the date column to datetime
    data['trade_date'] = pd.to_datetime(data['trade_date'])
    # Sort by stock and date
    data = data.sort_values(['ts_code', 'trade_date']).reset_index(drop=True)
    return data
# Fetch quarterly financial indicators
def get_finance_data(stock_codes, start_date='20200101', end_date='20231231'):
    df_list = []
    for code in stock_codes:
        try:
            # Financial indicators (balance sheet, income statement, and cash flow data can be added as needed)
            fina = pro.fina_indicator(ts_code=code, start_date=start_date, end_date=end_date)
            df_list.append(fina)
        except Exception as e:
            print(f"Failed to fetch financial data for {code}: {e}")
            continue
    fina_data = pd.concat(df_list, ignore_index=True)
    # end_date is the reporting period end, not the publication date
    fina_data['end_date'] = pd.to_datetime(fina_data['end_date'])
    return fina_data
# Run the data fetch
hs300_codes = get_hs300_stocks()
price_data = get_stock_data(hs300_codes)
fina_data = get_finance_data(hs300_codes)
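The loops above issue one API request per stock, and data vendors commonly throttle request rates. A minimal sketch of a retry wrapper with exponential backoff follows. The names `fetch_with_retry` and `flaky`, and the parameters `retries`, `base_delay`, and `sleep`, are hypothetical and not part of Tushare; the `flaky` function stands in for a real API call.

```python
import time

def fetch_with_retry(fetch, *args, retries=3, base_delay=0.5, sleep=time.sleep, **kwargs):
    """Call fetch(*args, **kwargs), retrying on any exception with exponential backoff.

    Hypothetical helper: in the article, `fetch` would be a call such as pro.daily.
    """
    for attempt in range(retries):
        try:
            return fetch(*args, **kwargs)
        except Exception:
            if attempt == retries - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # wait 0.5s, 1s, 2s, ...

# Stand-in for an API call that fails twice before succeeding
calls = {"n": 0}

def flaky(code):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("rate limited")
    return f"data for {code}"

delays = []  # record the waits instead of actually sleeping
result = fetch_with_retry(flaky, "600000.SH", sleep=delays.append)
print(result, delays)
```

Inside the fetch loop, a call such as `pro.daily(...)` could then be wrapped as `fetch_with_retry(pro.daily, ts_code=code, ...)`, which keeps the existing `try`/`except` as a last resort.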
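2.2 Data Preprocessing
Handle missing values and outliers, then align prices and financials in time:
# 1. Drop rows with a missing price or volume
price_data = price_data.dropna(subset=['close_adj', 'vol'])
# 2. Remove outliers beyond 3 standard deviations
# Caveat: the mean and std are pooled across all stocks, so this filter mostly removes
# persistently high-priced names rather than bad ticks, and the dropped rows leave gaps
# in the row-based return windows used later
def remove_outliers(df, col):
    mean = df[col].mean()
    std = df[col].std()
    return df[(df[col] >= mean - 3*std) & (df[col] <= mean + 3*std)]
price_data = remove_outliers(price_data, 'close_adj')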
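# 3. Align financial data with prices point-in-time
# Matching on the reporting quarter would expose figures before they are published
# (look-ahead bias), so each trading day gets the latest report announced on or before it
fina_data['ann_date'] = pd.to_datetime(fina_data['ann_date'])
fina_data = (
    fina_data.dropna(subset=['ann_date'])
    .sort_values(['ann_date', 'end_date'])
    .drop_duplicates(['ts_code', 'ann_date', 'end_date'], keep='last')
)
# merge_asof requires both sides to be sorted by their date keys
merge_data = pd.merge_asof(
    price_data.sort_values('trade_date'),
    fina_data.drop(columns=['end_date']),
    left_on='trade_date',
    right_on='ann_date',
    by='ts_code',
    direction='backward'
)
# The as-of join already carries each report forward until the next announcement,
# so no frame-wide forward fill is needed (an ungrouped ffill would also leak
# values from one stock into the next)
merge_data = merge_data.sort_values(['ts_code', 'trade_date']).reset_index(drop=True)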
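When daily prices are joined with quarterly reports, what matters is the date on which each report became public. The sketch below illustrates a point-in-time join with `pandas.merge_asof` on toy data; all values are made up, and `ann_date` is assumed to be the announcement date of each report.

```python
import pandas as pd

# Toy daily prices for one stock (hypothetical values)
prices = pd.DataFrame({
    "ts_code": ["000001.SZ"] * 4,
    "trade_date": pd.to_datetime(["2023-04-10", "2023-04-24", "2023-04-25", "2023-08-30"]),
    "close": [12.0, 12.5, 12.4, 11.8],
})

# Toy reports: end_date is the reporting period, ann_date the publication date
reports = pd.DataFrame({
    "ts_code": ["000001.SZ"] * 3,
    "end_date": pd.to_datetime(["2022-12-31", "2023-03-31", "2023-06-30"]),
    "ann_date": pd.to_datetime(["2023-03-09", "2023-04-25", "2023-08-24"]),
    "roe": [10.0, 3.0, 6.0],
})

# For each trading day, attach the latest report announced on or before that day
pit = pd.merge_asof(
    prices.sort_values("trade_date"),
    reports.sort_values("ann_date"),
    left_on="trade_date",
    right_on="ann_date",
    by="ts_code",
    direction="backward",
)
print(pit[["trade_date", "end_date", "ann_date", "roe"]])
```

On 2023-04-10 and 2023-04-24 the Q1 report has not yet been announced, so the join still returns the annual report. Passing `allow_exact_matches=False` is a more conservative choice when reports are released after the market close.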
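3. Factor Construction
Four classic factor families are used: value, growth, momentum, and quality.
3.1 Computing Individual Factors
# 1. Value factors: P/E = price / earnings per share
# eps is the reported (cumulative, non-TTM) figure; non-positive values are set to NaN
# so that the ratio stays finite and its ranking stays meaningful
merge_data['pe'] = (merge_data['close'] / merge_data['eps']).where(merge_data['eps'] > 0)
# P/B = price / book value per share (assuming the fina_indicator field name 'bps')
merge_data['pb'] = (merge_data['close'] / merge_data['bps']).where(merge_data['bps'] > 0)
# 2. Growth factors: year-over-year revenue growth
merge_data['revenue_growth'] = merge_data['tr_yoy']
# Year-over-year net profit growth (assuming the fina_indicator field name 'netprofit_yoy')
merge_data['profit_growth'] = merge_data['netprofit_yoy']
# 3. Momentum factor: trailing 20-trading-day return, up to and including the current close
def calc_momentum(df):
    # Rolling return per stock; rows must be sorted by ts_code and trade_date
    df['momentum_20d'] = df.groupby('ts_code')['close_adj'].pct_change(20)
    return df
merge_data = calc_momentum(merge_data)
# 4. Quality factors: return on equity (ROE) is already available as the 'roe' column
# Debt-to-assets ratio (an inverse factor: lower is better)
merge_data['debt_ratio'] = merge_data['debt_to_assets']
# Drop rows where any factor is missing
factor_cols = ['pe', 'pb', 'revenue_growth', 'profit_growth', 'momentum_20d', 'roe', 'debt_ratio']
merge_data = merge_data.dropna(subset=factor_cols)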
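3.2 Factor Standardization
Factors are measured on different scales, so each one is z-scored (subtract the mean, divide by the standard deviation):
# Standardize within each daily cross-section (avoids mixing levels from different dates)
def standardize_factor(df, factor_cols):
    # Process one trading day at a time
    for date in df['trade_date'].unique():
        mask = df['trade_date'] == date
        scaler = StandardScaler()
        df.loc[mask, factor_cols] = scaler.fit_transform(df.loc[mask, factor_cols])
    return df
merge_data = standardize_factor(merge_data, factor_cols)
# Flip inverse factors (P/E, P/B, debt ratio: lower is better) so that higher always means better
merge_data['pe'] = -merge_data['pe']
merge_data['pb'] = -merge_data['pb']
merge_data['debt_ratio'] = -merge_data['debt_ratio']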
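A single extreme value can dominate a z-score, so factor values are often clipped within each cross-section before standardizing. The sketch below uses median absolute deviation (MAD) winsorization, a technique swapped in for illustration rather than the 3-standard-deviation filter used earlier. The helper names `winsorize_mad` and `zscore` and the toy `pb` values are assumptions.

```python
import numpy as np
import pandas as pd

def winsorize_mad(x, n=3.0):
    """Clip a cross-section to median +/- n * 1.4826 * MAD."""
    med = x.median()
    mad = (x - med).abs().median()
    band = n * 1.4826 * mad  # 1.4826 scales MAD to a std under normality
    return x.clip(med - band, med + band)

def zscore(x):
    """Population z-score (ddof=0), the same convention StandardScaler uses."""
    return (x - x.mean()) / x.std(ddof=0)

# Toy panel: two dates, five stocks each, one extreme value on the first date
toy = pd.DataFrame({
    "trade_date": ["2023-01-03"] * 5 + ["2023-01-04"] * 5,
    "pb": [1.0, 2.0, 3.0, 4.0, 100.0, 2.0, 4.0, 6.0, 8.0, 10.0],
})

# Both steps run per trading day via groupby/transform
toy["pb_w"] = toy.groupby("trade_date")["pb"].transform(winsorize_mad)
toy["pb_z"] = toy.groupby("trade_date")["pb_w"].transform(zscore)
print(toy)
```

The `groupby(...).transform(...)` form is also a vectorized alternative to looping over dates, which matters once the panel covers several years of daily data.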
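4. Factor Validation
Factor validity is the heart of a multi-factor model: each factor must be shown to relate to future stock returns.
Two common methods are the IC test and the group (quantile) backtest.
4.1 IC Test (Information Coefficient)
The IC used here is the rank correlation between factor values and subsequent returns. The larger its absolute value, the stronger the factor's predictive power (a mean |IC| above roughly 0.05 is a common rule of thumb for a usable factor).
# Forward 5-trading-day return per stock (the prediction target)
# The shift happens inside each group, so values never cross from one stock to the next
merge_data['future_return'] = merge_data.groupby('ts_code')['close_adj'].transform(
    lambda s: s.shift(-5) / s - 1
)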
# Compute the daily IC
def calc_ic(df, factor_col):
    ic_list = []
    for date in df['trade_date'].unique():
        daily_data = df.loc[df['trade_date'] == date, [factor_col, 'future_return']].dropna()
        if len(daily_data) < 10:  # e.g. the last 5 days have no forward return yet
            continue
        # Rank correlation (Spearman)
        ic = stats.spearmanr(daily_data[factor_col], daily_data['future_return'])[0]
        ic_list.append({
            'trade_date': date,
            'factor': factor_col,
            'ic': ic
        })
    return pd.DataFrame(ic_list)
# Compute the IC of every factor
ic_results = []
for factor in factor_cols:
    ic_df = calc_ic(merge_data, factor)
    ic_results.append(ic_df)
ic_all = pd.concat(ic_results)
# Mean IC and a t-test for significance
# Caveat: daily ICs on overlapping 5-day returns are autocorrelated, which overstates the t-statistic
ic_summary = ic_all.groupby('factor')['ic'].agg(['mean', 'std', 'count'])
ic_summary['t_stat'] = ic_summary['mean'] / (ic_summary['std'] / np.sqrt(ic_summary['count']))
# Two-sided p-value, computed element-wise so each factor uses its own degrees of freedom
ic_summary['p_value'] = 2 * stats.t.sf(ic_summary['t_stat'].abs(), df=ic_summary['count'] - 1)
print("Factor IC test results:")
print(ic_summary)
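To see what the IC statistics look like when the true relationship is known, the sketch below builds a synthetic panel in which the forward return contains a small factor signal plus noise. The signal strength 0.1, the panel size, and the helper `rank_ic` are illustrative assumptions. The Spearman correlation is computed as the Pearson correlation of ranks, which is its definition.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_days, n_stocks = 60, 200

# Synthetic panel: forward return = small factor signal + noise (illustrative only)
panel = pd.DataFrame({
    "trade_date": pd.bdate_range("2023-01-02", periods=n_days).repeat(n_stocks),
    "factor": rng.standard_normal(n_days * n_stocks),
})
panel["future_return"] = 0.1 * panel["factor"] + rng.standard_normal(len(panel))

def rank_ic(day):
    # Spearman correlation equals the Pearson correlation of the ranks
    return day["factor"].rank().corr(day["future_return"].rank())

# One IC per trading day
ic = panel.groupby("trade_date")[["factor", "future_return"]].apply(rank_ic)

ic_mean = ic.mean()
icir = ic_mean / ic.std()      # IC information ratio: mean IC per unit of IC volatility
hit_rate = (ic > 0).mean()     # share of days with a positive IC
print(f"mean IC={ic_mean:.3f}, ICIR={icir:.2f}, hit rate={hit_rate:.0%}")
```

Alongside the mean and the t-statistic, the ICIR and the hit rate give a sense of how stable a factor is from day to day, not only how strong it is on average.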
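4.2 Group Backtest
Sort stocks into 5 groups by factor value and compare the group returns to check that the factor is monotonic:
def group_backtest(df, factor_col, n_groups=5, holding_days=5):
    # Percentile rank within each trading day, mapped to groups 1..n_groups
    # (ceil keeps the top-ranked stock, whose rank is 1.0, inside the last group)
    pct = df.groupby('trade_date')[factor_col].rank(pct=True)
    df['group'] = np.ceil(pct * n_groups).astype(int)
    # Mean forward return of each group on each day
    group_return = df.groupby(['trade_date', 'group'])['future_return'].mean().reset_index()
    group_return = group_return.pivot(index='trade_date', columns='group', values='future_return')
    # future_return spans holding_days, so keep every holding_days-th date;
    # compounding overlapping windows would count each day's move several times
    group_return = group_return.sort_index().iloc[::holding_days].fillna(0)
    # Cumulative return
    group_cum_return = (1 + group_return).cumprod()
    return group_cum_return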
# Group backtest using the momentum factor as an example
momentum_group = group_backtest(merge_data, 'momentum_20d')
# Plot cumulative returns by group
plt.figure(figsize=(12, 6))
for group in momentum_group.columns:
    plt.plot(momentum_group.index, momentum_group[group], label=f'Group {group}')
plt.title('Momentum Factor: Cumulative Return by Group')
plt.xlabel('Date')
plt.ylabel('Cumulative return')
plt.legend()
plt.grid(True)
plt.show()
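Besides the plot, the group test can be summarized by the mean return of each group and the spread between the top and bottom groups. The sketch below assigns quintiles with `pd.qcut` on synthetic data; the coefficients and panel size are illustrative assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n_days, n_stocks = 40, 100

toy = pd.DataFrame({
    "trade_date": pd.bdate_range("2023-01-02", periods=n_days).repeat(n_stocks),
    "factor": rng.standard_normal(n_days * n_stocks),
})
# Forward return with a built-in positive link to the factor (illustrative only)
toy["future_return"] = 0.01 * toy["factor"] + 0.02 * rng.standard_normal(len(toy))

# Quintiles 1..5 within each date; ranking first avoids duplicate bin edges on ties
toy["group"] = toy.groupby("trade_date")["factor"].transform(
    lambda x: pd.qcut(x.rank(method="first"), 5, labels=False) + 1
)

group_mean = toy.groupby("group")["future_return"].mean()
long_short = group_mean.loc[5] - group_mean.loc[1]   # top minus bottom group
is_monotonic = group_mean.is_monotonic_increasing
print(group_mean.round(4))
print(f"long-short spread={long_short:.4f}, monotonic={is_monotonic}")
```

A factor that only separates the extreme groups while the middle groups are unordered would show a positive spread but fail the monotonicity check, which is why both are worth reporting.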
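5. Combining Factors
Combine the individual factors into one score using weights. Common schemes are equal weighting, regression weighting, and machine-learning weighting.
5.1 Equal Weighting (Basic)
# Keep the effective factors (|mean IC| > 0.05 and p-value < 0.05)
valid_factors = ic_summary[
    (abs(ic_summary['mean']) > 0.05) & (ic_summary['p_value'] < 0.05)
].index.tolist()
# Equal-weighted multi-factor score
merge_data['multi_factor_score'] = merge_data[valid_factors].mean(axis=1)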
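5.2 Regression Weighting (Advanced)
Regress forward returns on the factors within each cross-section and use the coefficients as factor weights:
# Cross-sectional regression on each trading day
weight_list = []
for date in merge_data['trade_date'].unique():
    daily_data = merge_data.loc[merge_data['trade_date'] == date].dropna(subset=['future_return'])
    if len(daily_data) < 10:  # skip days without enough forward returns
        continue
    X = daily_data[valid_factors]
    y = daily_data['future_return']
    # Linear regression
    model = LinearRegression()
    model.fit(X, y)
    # Store the coefficients as '<factor>_w' so they do not collide with the factor columns
    weights = {f'{f}_w': c for f, c in zip(valid_factors, model.coef_)}
    weights['trade_date'] = date
    weight_list.append(weights)
all_dates = np.sort(merge_data['trade_date'].unique())
weight_df = pd.DataFrame(weight_list).set_index('trade_date').reindex(all_dates)
# Coefficients for date t depend on returns through t+5, so they are only known 5 trading
# days later: average the past 60 estimates and lag by 5 rows to avoid look-ahead bias
# (the 60-day window is an illustrative choice)
weight_df = weight_df.rolling(60, min_periods=20).mean().shift(5)
# Merge the weights into the main data
merge_data = merge_data.merge(
    weight_df.rename_axis('trade_date').reset_index(),
    on='trade_date',
    how='left'
)
# Weighted multi-factor score: sum of factor value x factor weight
merge_data['weighted_factor_score'] = sum(
    merge_data[f] * merge_data[f'{f}_w'] for f in valid_factors
)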
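Regression weights estimated on date t use returns that are only realized several days later, so they cannot be applied on date t itself. The sketch below shows one way to turn daily cross-sectional coefficients into usable weights by smoothing and lagging. The factor names, the true coefficients, and the 10-day window are illustrative assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
horizon = 5                        # forward-return horizon in trading days
dates = pd.bdate_range("2023-01-02", periods=30)
factors = ["value", "momentum"]    # hypothetical factor names

rows = []
for d in dates:
    X = rng.standard_normal((150, 2))
    y = X @ np.array([0.02, 0.01]) + 0.05 * rng.standard_normal(150)
    # Cross-sectional OLS with intercept; beta[1:] are the factor coefficients for date d
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    rows.append(dict(zip(factors, beta[1:]), trade_date=d))

raw_w = pd.DataFrame(rows).set_index("trade_date")

# Smooth over the past 10 estimates, then lag by the horizon so that the weights
# applied on date t only use returns that were fully realized by t
usable_w = raw_w.rolling(10, min_periods=5).mean().shift(horizon)
print(usable_w.dropna().head())
```

The first usable row appears only after `min_periods + horizon - 1` dates; scores on earlier dates stay undefined, which is the honest result when no weights could have been known yet.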
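6. Portfolio Construction and Backtest
6.1 Stock Selection Rule
At each month end, select the stocks in the top 20% by multi-factor score, weight them equally, and hold them through the following month:
# Month label for each trading day
merge_data['month'] = merge_data['trade_date'].dt.to_period('M')
# Monthly stock selection
def select_stocks(df, score_col, top_pct=0.2):
    stock_selection = []
    for month in df['month'].unique():
        month_data = df[df['month'] == month]
        # Use only the cross-section on the last trading day of the month
        snapshot = month_data[month_data['trade_date'] == month_data['trade_date'].max()]
        # Score threshold for the top 20%
        threshold = snapshot[score_col].quantile(1 - top_pct)
        # Keep the stocks at or above the threshold
        top_stocks = snapshot[snapshot[score_col] >= threshold]['ts_code'].unique()
        stock_selection.append({
            'month': month,
            'stocks': list(top_stocks)
        })
    return pd.DataFrame(stock_selection)
# Run the selection
stock_selection = select_stocks(merge_data, 'weighted_factor_score')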
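# Compute portfolio returns
def calc_portfolio_return(df, selection, score_col):
    # Month-end adjusted close per stock, then each stock's return over each calendar month
    # (score_col is unused and kept only so existing calls still work)
    month_end_close = (
        df.sort_values('trade_date')
        .groupby(['month', 'ts_code'])['close_adj'].last()
        .unstack('ts_code')
        .sort_index()
    )
    monthly_stock_return = month_end_close.pct_change(fill_method=None)
    portfolio_returns = []
    for idx, row in selection.iterrows():
        stocks = row['stocks']
        # The holding period is the month after selection
        next_month = pd.Period(row['month']) + 1
        if next_month not in monthly_stock_return.index or not stocks:
            continue
        # Equal-weighted mean of the selected stocks' returns over the holding month
        portfolio_return = monthly_stock_return.loc[next_month, stocks].mean()
        portfolio_returns.append({
            'month': next_month,
            'return': portfolio_return
        })
    return pd.DataFrame(portfolio_returns)
# Monthly portfolio returns
portfolio_returns = calc_portfolio_return(merge_data, stock_selection, 'weighted_factor_score')
portfolio_returns = portfolio_returns.dropna(subset=['return'])
portfolio_returns['month'] = pd.PeriodIndex(portfolio_returns['month'], freq='M').to_timestamp()
portfolio_returns = portfolio_returns.sort_values('month')
# Cumulative return
portfolio_returns['cum_return'] = (1 + portfolio_returns['return']).cumprod()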
# Plot the portfolio's cumulative return
plt.figure(figsize=(12, 6))
plt.plot(portfolio_returns['month'], portfolio_returns['cum_return'], label='Multi-factor portfolio')
# Compare with the CSI 300 index
# Note: the index curve starts in January 2020, while the portfolio starts with its first holding month
hs300_index = pro.index_daily(ts_code='000300.SH', start_date='20200101', end_date='20231231')
hs300_index['trade_date'] = pd.to_datetime(hs300_index['trade_date'])
hs300_index = hs300_index.sort_values('trade_date')
# pct_chg is quoted in percent
hs300_index['cum_return'] = (1 + hs300_index['pct_chg']/100).cumprod()
plt.plot(hs300_index['trade_date'], hs300_index['cum_return'], label='CSI 300')
plt.title('Multi-Factor Portfolio vs CSI 300: Cumulative Return')
plt.xlabel('Date')
plt.ylabel('Cumulative return')
plt.legend()
plt.grid(True)
plt.show()
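The month-end rebalancing rule can be checked on a small synthetic panel before running it on real data. The sketch below uses wide (date by stock) frames; the tickers, prices, and scores are random placeholders, and the vectorized layout is an alternative to the loop-based functions above rather than a reproduction of them.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
dates = pd.bdate_range("2023-01-02", "2023-04-28")
codes = [f"S{i:02d}" for i in range(10)]   # hypothetical tickers

# Toy adjusted closes (random walk) and toy scores, indexed by date x stock
close = pd.DataFrame(
    100 * np.exp(np.cumsum(0.01 * rng.standard_normal((len(dates), 10)), axis=0)),
    index=dates, columns=codes,
)
score = pd.DataFrame(rng.standard_normal(close.shape), index=dates, columns=codes)

# Month-end snapshots: the last trading day of each month
month = close.index.to_period("M")
me_close = close.groupby(month).last()
me_score = score.groupby(month).last()

# Stocks whose month-end score is in the top 20% of that cross-section
picks = me_score.ge(me_score.quantile(0.8, axis=1), axis=0)

# Hold each month's picks over the following month, equally weighted
next_ret = me_close.pct_change()
held = picks.shift(1, fill_value=False)
port_ret = next_ret.where(held).mean(axis=1)
print(port_ret.dropna())
```

Shifting the selection by one month before applying it to returns is what keeps the signal date and the holding period separate; the first month has no return because nothing was selected before it.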
6.2 Performance Evaluation
Compute the annualized return, Sharpe ratio, and maximum drawdown:
# Annualized return
annual_return = (portfolio_returns['cum_return'].iloc[-1] ** (12/len(portfolio_returns))) - 1
# Sharpe ratio (risk-free rate assumed to be 3% per year)
risk_free_rate = 0.03
monthly_return = portfolio_returns['return']
sharpe_ratio = (monthly_return.mean() - risk_free_rate/12) / monthly_return.std() * np.sqrt(12)
# Maximum drawdown
def max_drawdown(returns):
    cum_returns = (1 + returns).cumprod()
    running_max = cum_returns.cummax()
    drawdown = (cum_returns / running_max) - 1
    return drawdown.min()
max_dd = max_drawdown(portfolio_returns['return'])
# Print the performance metrics
print(f"Annualized return: {annual_return:.2%}")
print(f"Sharpe ratio: {sharpe_ratio:.2f}")
print(f"Maximum drawdown: {max_dd:.2%}")
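The three metrics can be bundled into one function and checked against a series small enough to verify by hand. The sketch below is one possible packaging; the function name `performance_summary` and its parameters are assumptions. Its drawdown starts from an initial net asset value of 1, so a loss in the very first period is counted.

```python
import numpy as np
import pandas as pd

def performance_summary(periodic_returns, risk_free_annual=0.03, periods_per_year=12):
    """Annualized return, Sharpe ratio, and maximum drawdown from periodic returns."""
    r = pd.Series(periodic_returns, dtype=float).dropna()
    nav = (1 + r).cumprod()
    annual_return = nav.iloc[-1] ** (periods_per_year / len(r)) - 1
    excess = r - risk_free_annual / periods_per_year
    sharpe = excess.mean() / r.std() * np.sqrt(periods_per_year)
    # Prepend the starting NAV of 1 so a loss in the first period counts as drawdown
    nav0 = pd.concat([pd.Series([1.0]), nav], ignore_index=True)
    max_dd = (nav0 / nav0.cummax() - 1).min()
    return {"annual_return": annual_return, "sharpe": sharpe, "max_drawdown": max_dd}

# Hand-checkable toy series: +10%, -50%, +100%
# NAV path: 1.1 -> 0.55 -> 1.1, so the deepest drawdown is 0.55 / 1.1 - 1 = -50%
stats_toy = performance_summary([0.10, -0.50, 1.00], risk_free_annual=0.0)
print(stats_toy)
```

With three monthly observations the annualization exponent is 12/3 = 4, so the annualized return is 1.1 to the fourth power minus 1, about 46.4%, which shows how aggressively short samples are extrapolated.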
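7. Model Optimization and Extensions
7.1 Factor Orthogonalization
Remove multicollinearity among the factors:
# Orthogonalize the valid factors with PCA
# Note: principal components are mixtures of all input factors, so column i is the i-th
# component rather than a cleaned version of the i-th factor, and component signs and
# ordering can differ from one date to the next
from sklearn.decomposition import PCA
# Orthogonalize within each daily cross-section
def orthogonalize_factors(df, factor_cols):
    orthogonal_factors = []
    for date in df['trade_date'].unique():
        day = df[df['trade_date'] == date]
        if len(day) < len(factor_cols):  # PCA needs at least as many stocks as components
            continue
        # PCA rotation onto uncorrelated components
        pca = PCA(n_components=len(factor_cols))
        ortho_data = pca.fit_transform(day[factor_cols])
        ortho_df = pd.DataFrame(ortho_data, columns=[f'pc_{i + 1}' for i in range(len(factor_cols))])
        ortho_df['trade_date'] = date
        ortho_df['ts_code'] = day['ts_code'].values
        orthogonal_factors.append(ortho_df)
    return pd.concat(orthogonal_factors)
ortho_factors = orthogonalize_factors(merge_data, valid_factors)
merge_data = merge_data.merge(ortho_factors, on=['trade_date', 'ts_code'], how='left')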
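When each orthogonalized column should still correspond to one named factor, a common alternative to PCA is sequential regression: keep the first factor as the base and replace each later factor with its residual after regressing on the earlier ones. The sketch below shows this on one synthetic cross-section; the helper `orthogonalize_sequential` and the factor names are assumptions.

```python
import numpy as np
import pandas as pd

def orthogonalize_sequential(X):
    """Residualize each column against all columns to its left (Gram-Schmidt via OLS).

    The first column is kept (demeaned); later columns keep only the part the earlier
    ones cannot explain, so every output column still maps to one input factor.
    """
    X = X - X.mean()
    out = pd.DataFrame(index=X.index, columns=X.columns, dtype=float)
    for i, col in enumerate(X.columns):
        if i == 0:
            out[col] = X[col]
            continue
        basis = out.iloc[:, :i].to_numpy()
        coef, *_ = np.linalg.lstsq(basis, X[col].to_numpy(), rcond=None)
        out[col] = X[col].to_numpy() - basis @ coef
    return out

# One synthetic cross-section of 300 stocks with two correlated factors
rng = np.random.default_rng(4)
value = rng.standard_normal(300)
toy = pd.DataFrame({
    "value": value,
    "quality": 0.6 * value + 0.8 * rng.standard_normal(300),  # correlated by construction
})

ortho = orthogonalize_sequential(toy)
print(toy.corr().round(3), ortho.corr().round(3), sep="\n")
```

The result depends on column order, since earlier factors keep their full information and later ones give up whatever overlaps. In a panel, the function would be applied within each trading day, for example via `groupby('trade_date')`.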
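7.2 Machine Learning Enhancement (XGBoost)
Replace the linear regression with XGBoost so that the weighting can capture nonlinear relationships:
from xgboost import XGBRegressor
# Train one XGBoost model per daily cross-section
xgb_weights = []
for date in merge_data['trade_date'].unique():
    daily_data = merge_data.loc[merge_data['trade_date'] == date].dropna(subset=['future_return'])
    if len(daily_data) < 10:  # skip when the sample is too small
        continue
    X = daily_data[valid_factors]
    y = daily_data['future_return']
    model = XGBRegressor(n_estimators=50, max_depth=3, learning_rate=0.1)
    model.fit(X, y)
    # feature_importances_ are non-negative and carry no direction, which is acceptable
    # here only because every factor was sign-aligned so that higher means better
    weights = {f'{f}_xgb_w': w for f, w in zip(valid_factors, model.feature_importances_)}
    weights['trade_date'] = date
    xgb_weights.append(weights)
all_dates = np.sort(merge_data['trade_date'].unique())
xgb_weight_df = pd.DataFrame(xgb_weights).set_index('trade_date').reindex(all_dates)
# Importances for date t use returns through t+5, so smooth them over 60 days and lag
# by 5 rows to avoid look-ahead bias (the 60-day window is an illustrative choice)
xgb_weight_df = xgb_weight_df.rolling(60, min_periods=20).mean().shift(5)
merge_data = merge_data.merge(
    xgb_weight_df.rename_axis('trade_date').reset_index(), on='trade_date', how='left'
)
# XGBoost-weighted factor score: sum of factor value x importance
merge_data['xgb_factor_score'] = sum(
    merge_data[f] * merge_data[f'{f}_xgb_w'] for f in valid_factors
)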
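8. Summary
This article implemented the core workflow of a multi-factor stock model end to end:
from data acquisition and preprocessing, through factor construction and validation, to factor combination and portfolio backtesting.
Key takeaways:
Factor validity comes first:
the IC test and the group backtest are the main validation tools, and ineffective factors should be dropped;
Standardization and orthogonalization:
these remove scale differences and multicollinearity, which makes the model more stable;
Weighting:
equal weights suit a first version, while regression or machine-learning weights may improve returns;
Risk control:
monitor maximum drawdown and consider constraints such as industry neutrality so that no single factor exposure dominates.
Possible extensions in practice:
add industry and style factors, introduce stop-loss and take-profit rules,
tune the rebalancing frequency, or use higher-frequency data to make the factors more timely.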