MergeKit wrapper
Running MergeKit methods¶
The toolkit implements MergeKit methods via a StructuralControl wrapper. Methods are initialized via either a config_dict or a config_path (to a YAML file). Since merging produces a new model, lazy_init=True must be set when creating a SteeringPipeline (rather than passing model_name_or_path). This notebook outlines how to construct some of MergeKit's methods in our toolkit; for a more complete list of implementations enabled by MergeKit, please see the example configs and the documentation.
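For reference, a minimal sketch of the config_path route (illustrative only; the YAML mirrors the linear merge config used later in this notebook, and the filename linear_merge.yaml is hypothetical):
# illustrative: write the merge config to a YAML file and pass it via config_path
linear_yaml = """\
merge_method: linear
dtype: float16
models:
  - model: pankajmathur/orca_mini_v3_13b
    parameters:
      weight: 0.5
  - model: WizardLMTeam/WizardLM-13B-V1.2
    parameters:
      weight: 0.5
"""
with open("linear_merge.yaml", "w") as f:
    f.write(linear_yaml)

linear_merge = MergeKit(
    config_path="linear_merge.yaml",  # instead of config_dict
    out_path="./mergekit_models/orca-wizard-blend-linear",
    trust_remote_code=True,
)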
Note: The toolkit depends on MergeKit 0.0.5.1 (due to the more restrictive licenses of more recent versions), so not all recent MergeKit methods are available in our toolkit. Also note that merging operations can be resource- and storage-intensive.
from aisteer360.algorithms.core.steering_pipeline import SteeringPipeline
from aisteer360.algorithms.structural_control.wrappers.mergekit import MergeKit
prompt = "Who was the fifth president of the United States?"
The following authentication steps may be necessary to access any gated models (even after being granted access by Hugging Face). Uncomment the following if you need to log in to the Hugging Face Hub using your token stored in the .env file:
# !pip install python-dotenv
# from dotenv import load_dotenv
# import os
# load_dotenv()
# token = os.getenv("HUGGINGFACE_TOKEN")
# from huggingface_hub import login
# login(token=token)
Linear merge¶
Linear merge combines multiple models by averaging their weights (see the original paper for details). To run this method via MergeKit, specify the source models (to average) and their associated scalar weights. Note that the weights are not required to sum to one; they are normalized internally.
The config below creates a float16 model by weighted-averaging corresponding tensors from two 13B models, Orca Mini v3 and Wizard 13B v1.2, each with weight=0.5. The final parameters are proportional to the models[].parameters.weight values (i.e., a normalized blend), so equal weights produce an equal blend of the two sources.
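As a rough sketch of the arithmetic (not MergeKit's actual implementation), each merged tensor is the weight-normalized sum of the corresponding source tensors:
import torch

def linear_merge_tensor(tensors: list[torch.Tensor], weights: list[float]) -> torch.Tensor:
    """Illustrative weighted average of corresponding tensors; weights are normalized."""
    return sum(w * t for w, t in zip(weights, tensors)) / sum(weights)

# with weights 0.5 and 0.5 this reduces to a simple mean of the two tensors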
linear_merge_config = {
"merge_method": "linear",
"dtype": "float16",
"models": [
{"model": "pankajmathur/orca_mini_v3_13b", "parameters": {"weight": 0.5}},
{"model": "WizardLMTeam/WizardLM-13B-V1.2", "parameters": {"weight": 0.5}},
],
}
linear_merge = MergeKit(
config_dict=linear_merge_config,
out_path="./mergekit_models/orca-wizard-blend-linear",
trust_remote_code=True
)
# create steering pipeline
linear_merge_pipeline = SteeringPipeline(
lazy_init=True, # required when calling MergeKit methods
controls=[linear_merge],
device="cuda"
)
linear_merge_pipeline.steer()
# inference
tokenizer = linear_merge_pipeline.tokenizer
inputs = tokenizer(prompt, return_tensors="pt", padding=True, padding_side="left")
gen_params = {
"max_new_tokens": 500
}
steered_response = linear_merge_pipeline.generate_text(
inputs.input_ids,
**gen_params
)
print("Response (linear merge):\n", *steered_response) # mergekit returns a list of responses; unpack
Fetching 7 files: 100%|███████████████████████████| 7/7 [00:31<00:00, 4.47s/it]
Fetching 11 files: 100%|████████████████████████| 11/11 [00:30<00:00, 2.79s/it]
Warmup loader cache: 100%|████████████████████████| 2/2 [01:12<00:00, 36.25s/it]
Executing graph: 100%|██████████████████████| 1817/1817 [02:36<00:00, 11.61it/s]
Loading checkpoint shards: 100%|██████████████████| 6/6 [00:41<00:00, 6.87s/it]
Response (linear merge): James Monroe was the fifth president of the United States.
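Since the merged model is written to out_path, it can also be reloaded directly with transformers, independent of the steering pipeline (run this before the cleanup cell below). A minimal sketch, assuming a standard Hugging Face checkpoint is written to that directory:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# assumes MergeKit wrote a standard Hugging Face checkpoint to out_path
merged_dir = "./mergekit_models/orca-wizard-blend-linear"
merged_tokenizer = AutoTokenizer.from_pretrained(merged_dir)
merged_model = AutoModelForCausalLM.from_pretrained(merged_dir, torch_dtype=torch.float16, device_map="auto")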
# optional cleanup
import shutil
shutil.rmtree("./mergekit_models/orca-wizard-blend-linear")
SLERP merge¶
SLERP (spherical linear interpolation) merge combines model weights by interpolating along the surface of a high-dimensional hypersphere, with the goal of yielding a merged model that better preserves scale and the behaviors of the source models.
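As a rough per-tensor sketch (treating each tensor as a flattened vector; not MergeKit's exact implementation), spherical interpolation at ratio t between tensors a and b looks like:
import torch

def slerp(t: float, a: torch.Tensor, b: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Illustrative spherical linear interpolation between two tensors."""
    cos_omega = torch.dot(a.flatten() / (a.norm() + eps), b.flatten() / (b.norm() + eps))
    omega = torch.arccos(torch.clamp(cos_omega, -1.0, 1.0))  # angle between the two tensors
    if omega.abs() < eps:  # nearly parallel: fall back to linear interpolation
        return (1 - t) * a + t * b
    so = torch.sin(omega)
    return (torch.sin((1 - t) * omega) / so) * a + (torch.sin(t * omega) / so) * b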
The setup below uses Orca Mini v3 as the base_model and merges it with Wizard 13B v1.2 over slices[0].sources spanning layer_range=[0, 40]. Instead of a straight average, it uses spherical linear interpolation controlled by parameters.t schedules: attention blocks (filter="self_attn") follow a layerwise t pattern [0, 0.5, 0.3, 0.7, 1], MLP blocks (filter="mlp") use [1, 0.5, 0.7, 0.3, 0], and everything else defaults to t=0.5.
The resulting model is a float16 hybrid where attention and MLP mix ratios vary across depth.
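The list-valued t entries act as anchor points that are spread across the layer range. A rough illustration of how a five-point schedule could map onto 40 layers (illustrative only; the exact interpolation is handled internally by MergeKit):
import numpy as np

# illustrative: interpolate a five-point t schedule across 40 layers
anchors = [0, 0.5, 0.3, 0.7, 1]  # the self_attn schedule used in the config below
layer_positions = np.linspace(0, 1, 40)
anchor_positions = np.linspace(0, 1, len(anchors))
t_per_layer = np.interp(layer_positions, anchor_positions, anchors)
print(t_per_layer.round(2))  # interpolation ratio applied to self_attn tensors at each layer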
slerp_merge_config = {
"merge_method": "slerp",
"dtype": "float16",
"base_model": "pankajmathur/orca_mini_v3_13b",
"slices": [
{
"sources": [
{"model": "pankajmathur/orca_mini_v3_13b", "layer_range": [0, 40]},
{"model": "WizardLMTeam/WizardLM-13B-V1.2", "layer_range": [0, 40]},
]
}
],
"parameters": {
"t": [
{"filter": "self_attn", "value": [0, 0.5, 0.3, 0.7, 1]},
{"filter": "mlp", "value": [1, 0.5, 0.7, 0.3, 0]},
{"value": 0.5},
]
},
}
slerp_merge = MergeKit(
config_dict=slerp_merge_config,
out_path="./mergekit_models/orca-wizard-blend-slerp",
trust_remote_code=True
)
slerp_merge_pipeline = SteeringPipeline(
lazy_init=True,
controls=[slerp_merge],
device="cuda"
)
slerp_merge_pipeline.steer()
tokenizer = slerp_merge_pipeline.tokenizer
inputs = tokenizer(prompt, return_tensors="pt", padding=True, padding_side="left")
gen_params = {
"max_new_tokens": 500
}
steered_response = slerp_merge_pipeline.generate_text(
inputs.input_ids,
**gen_params
)
print("Response (SLERP merge):\n", *steered_response)
Warmup loader cache: 100%|█████████████████████| 2/2 [00:00<00:00, 30174.85it/s]
Executing graph: 100%|██████████████████████| 1817/1817 [03:05<00:00, 9.78it/s]
Loading checkpoint shards: 100%|██████████████████| 6/6 [00:49<00:00, 8.21s/it]
WARNING:accelerate.big_modeling:Some parameters are on the meta device because they were offloaded to the cpu.
Response (SLERP merge): The fifth president of the United States was James Monroe. He was in office from 1817 to 1825.
# optional cleanup
import shutil
shutil.rmtree("./mergekit_models/orca-wizard-blend-slerp")
TIES merge¶
The TIES method merges models by first trimming each model's parameter changes to the most significant ones (removing redundant parameters), then electing a sign for each parameter via a majority vote to resolve sign conflicts, and finally merging only the aligned parameters to create a unified multi-task model.
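A rough sketch of these steps for a single parameter tensor (illustrative only; MergeKit's handling of density, weight, and normalization is more involved):
import torch

def ties_merge_tensor(base: torch.Tensor, tuned: list[torch.Tensor],
                      densities: list[float], weights: list[float]) -> torch.Tensor:
    """Illustrative TIES merge of one tensor: trim, elect sign, disjoint merge."""
    deltas = []
    for t, d, w in zip(tuned, densities, weights):
        delta = (t - base) * w                       # task vector, scaled by its weight
        k = max(1, int(d * delta.numel()))           # keep the top-d fraction by magnitude
        threshold = delta.abs().flatten().kthvalue(delta.numel() - k + 1).values
        deltas.append(torch.where(delta.abs() >= threshold, delta, torch.zeros_like(delta)))
    stacked = torch.stack(deltas)
    elected_sign = torch.sign(stacked.sum(dim=0))    # sign with the larger total mass wins (approximation)
    agree = torch.sign(stacked) == elected_sign      # keep only entries matching the elected sign
    merged_delta = (stacked * agree).sum(dim=0) / agree.sum(dim=0).clamp(min=1)
    return base + merged_delta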
The setup below produces a sparse, float16 hybrid on top of Llama-2-13B using TIES selection rather than full blending. Global parameters enable normalize=True (scale alignment) and int8_mask=True (efficient sparsity masking). Per-model controls set what fraction of parameters to keep (density) and how strongly to scale them (weight), optionally varying by layer or module:
- Orca Mini v3: density=[1, 0.7, 0.1] (keep most early, little late), weight=1.0.
- Platypus2: density=0.5, weight=[0, 0.3, 0.7, 1] (growing influence with depth).
- WizardLM: density=0.33, weight=[{"filter": "mlp", "value": 0.5}, {"value": 0}] (only MLP modules contribute, at 0.5; everything else is ignored).
The result is a model that retains the strongest weights from each source, with layer- and module-aware sparsity and scaling.
Note: TIES merging can be computationally intensive to run.
ties_merge_config = {
"merge_method": "ties",
"dtype": "float16",
"base_model": "TheBloke/Llama-2-13B-fp16",
"parameters": {
"normalize": True,
"int8_mask": True,
},
"models": [
{
"model": "pankajmathur/orca_mini_v3_13b",
"parameters": {
"density": [1, 0.7, 0.1],
"weight": 1.0,
},
},
{
"model": "garage-bAInd/Platypus2-13B",
"parameters": {
"density": 0.5,
"weight": [0, 0.3, 0.7, 1],
},
},
{
"model": "WizardLMTeam/WizardLM-13B-V1.2",
"parameters": {
"density": 0.33,
"weight": [
{"filter": "mlp", "value": 0.5},
{"value": 0},
],
},
},
],
}
ties_merge = MergeKit(
config_dict=ties_merge_config,
out_path="./mergekit_models/llama-orca-platypus-wizard-blend-ties",
trust_remote_code=True
)
ties_merge_pipeline = SteeringPipeline(
lazy_init=True,
controls=[ties_merge],
device="cuda"
)
ties_merge_pipeline.steer()
tokenizer = ties_merge_pipeline.tokenizer
inputs = tokenizer(prompt, return_tensors="pt", padding=True, padding_side="left")
gen_params = {
"max_new_tokens": 500
}
steered_response = ties_merge_pipeline.generate_text(
inputs.input_ids,
**gen_params
)
print("Response (TIES merge):\n", *steered_response)
Fetching 11 files: 100%|████████████████████████| 11/11 [00:26<00:00, 2.37s/it]
Fetching 10 files: 100%|████████████████████████| 10/10 [02:20<00:00, 14.04s/it]
Warmup loader cache: 100%|████████████████████████| 4/4 [02:46<00:00, 41.68s/it]
Executing graph: 100%|██████████████████████| 2543/2543 [41:24<00:00, 1.02it/s]
Loading checkpoint shards: 100%|██████████████████| 6/6 [00:41<00:00, 6.98s/it]
Response (TIES merge): James Monroe was the fifth president of the United States. He served from 1817 to 1825.
# optional cleanup
import shutil
shutil.rmtree("./mergekit_models/llama-orca-platypus-wizard-blend-ties")