Probably a big step forward. There seems to be a clear recommended next procedure:
This looks like a much stronger proof-of-concept than the previous stage.
The important part is not only “Bryła wins in 24/27 configurations.” The more important part is that you found a concrete failure mode — low structured-input diversity — and then rebuilt the test around a controlled grid. That makes the result more credible.
I would still be careful with the claim, but the direction is good.
My direct answer would be:
Do not scale the synthetic setup much further yet.
The next high-value step is a small natural-data test with DOMAIN, shuffled-structure, and random-structure controls.
1. How I would read the current result
From what you describe, the result is now approximately:
| Stage | What it shows | What it does not yet show |
|---|---|---|
| Earlier technical QA result | Bryła can improve a tiny matched setup | Maybe fragile / domain-specific |
| Field ablation | compact fields are better than default-heavy FULL | not yet general |
| Clean PPL / masked loss | tags must be treated as context, not target | PPL alone is still incomplete |
| Current 24/27 grid | Bryła can transmit useful structured signal in a controlled setup | not yet proven on messy natural Polish data |
So the current claim I would make is:
In a controlled synthetic setting, Bryła appears to be a real conditioning signal rather than just noise. The next question is whether the same advantage survives natural data, real parser errors, and stronger controls.
That is already a good research position.
2. The next decisive experiment
I would run a very small natural-data test, not a larger synthetic one.
Use four or five conditions:
RAW
DOMAIN + RAW
BRYLA + RAW
SHUFFLED-BRYLA + RAW
RANDOM-BRYLA + RAW
The most important comparisons:
| Comparison | Meaning |
|---|---|
| BRYLA > RAW | Bryła still helps outside the synthetic setup |
| BRYLA > DOMAIN | Bryła adds more than a simple domain label |
| BRYLA > SHUFFLED-BRYLA | field-value alignment matters |
| BRYLA > RANDOM-BRYLA | result is not just prefix-format regularization |
| DOMAIN ≈ BRYLA | current Bryła may mostly encode domain/topic |
| SHUFFLED-BRYLA ≈ BRYLA | structure labels may not be semantically used |
| RANDOM-BRYLA helps | possible regularization / format artifact |
The core next question is:
Does Bryła beat DOMAIN-only and shuffled-structure controls on natural data?
If yes, the claim becomes much stronger.
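To make the control ladder above concrete, here is a minimal sketch of how the five conditions could be built from one example. It assumes Bryła strings use a `[FIELD:value]` tag syntax; the `[DOMAIN:...]` tag, the `value_pool` argument, and all function names are hypothetical, and the exact definitions of "shuffled" and "random" are my assumptions, not something fixed by your setup.

```python
import random
import re

# Assumed tag syntax: [FIELD:value]
TAG_RE = re.compile(r"\[([A-Z_]+):([^\]]+)\]")

def parse_tags(bryla):
    """Split '[TYPE:fact] [POL:neutral]' into [('TYPE', 'fact'), ('POL', 'neutral')]."""
    return TAG_RE.findall(bryla)

def render_tags(pairs):
    return " ".join(f"[{field}:{value}]" for field, value in pairs)

def shuffled_bryla(bryla, rng):
    """Keep the field names and the same values, but break their alignment.

    A rotation by a non-zero offset moves every value to a different field slot.
    """
    pairs = parse_tags(bryla)
    values = [value for _, value in pairs]
    if len(values) > 1:
        k = rng.randrange(1, len(values))
        values = values[k:] + values[:k]
    return render_tags([(field, value) for (field, _), value in zip(pairs, values)])

def random_bryla(bryla, value_pool, rng):
    """Same fields and format, values drawn independently of the example.

    value_pool maps each field to values seen anywhere in the corpus (assumption).
    """
    return render_tags([(field, rng.choice(value_pool[field]))
                        for field, _ in parse_tags(bryla)])

def build_conditions(text, domain, bryla, value_pool, seed=0):
    rng = random.Random(seed)
    return {
        "RAW": text,
        "DOMAIN": f"[DOMAIN:{domain}] {text}",  # hypothetical domain tag
        "BRYLA": f"{bryla} {text}",
        "SHUFFLED-BRYLA": f"{shuffled_bryla(bryla, rng)} {text}",
        "RANDOM-BRYLA": f"{random_bryla(bryla, value_pool, rng)} {text}",
    }
```

If a Bryła string contains repeated values, the rotation can leave some fields looking unchanged, so it may be worth logging how often the shuffled prefix differs from the real one.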
3. Natural-data mini-benchmark
I would start small:
6 domains × 50 examples = 300 examples
Suggested domains:
| Domain | Why useful |
|---|---|
| technical / welding / materials | original strongest area |
| geography / places | tests templatic factual data |
| biographies | tests people, dates, roles, events |
| science explanations | tests definitions and causal relations |
| daily-life / practical QA | tests intent, urgency, user-facing pragmatics |
| sports / events | tests event structure and temporal facts |
Report results by domain, not only aggregate.
Example table:
| Domain | RAW | DOMAIN | BRYLA | SHUFFLED | Best | Note |
|---|---|---|---|---|---|---|
| technical |  |  |  |  |  |  |
| geography |  |  |  |  |  |  |
| biography |  |  |  |  |  |  |
| science |  |  |  |  |  |  |
| daily life |  |  |  |  |  |  |
| sports |  |  |  |  |  |  |
This matters because Bryła may help one domain and hurt another. That would still be useful information.
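A per-domain breakdown like this is easy to compute once each evaluation produces `(domain, condition, score)` rows. The sketch below is only one way to do it; the row layout and the function name are assumptions, and it treats lower scores (for example clean loss) as better by default.

```python
from collections import defaultdict
from statistics import mean

def per_domain_summary(rows, lower_is_better=True):
    """rows: iterable of (domain, condition, score), e.g. one row per seed.

    Returns {domain: {condition: mean_score, ..., "best": condition_name}}.
    """
    buckets = defaultdict(lambda: defaultdict(list))
    for domain, condition, score in rows:
        buckets[domain][condition].append(score)

    pick = min if lower_is_better else max
    summary = {}
    for domain, by_condition in buckets.items():
        means = {cond: mean(scores) for cond, scores in by_condition.items()}
        summary[domain] = {**means, "best": pick(means, key=means.get)}
    return summary
```

Reporting the per-domain "best" column next to the aggregate makes it visible when an overall win hides a loss in one domain.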
4. Replace synthetic parser noise with real parser error
The 0/10/20% parser-noise grid is useful, but the next step should be real parser failure.
A good progression would be:
synthetic parser noise
→ real parser errors
→ real domain shift
→ real QA / generation metric
Synthetic noise tells you whether the model is robust to artificial corruption. Real parser errors tell you whether the full system works.
I would report:
% parsed
% partial
% OTHER
field default rate
field entropy
field/domain correlation
Example parser dashboard:
| Domain | Parsed % | Partial % | OTHER % | Main failure mode |
|---|---|---|---|---|
| technical |  |  |  |  |
| geography |  |  |  |  |
| biography |  |  |  |  |
| science |  |  |  |  |
| daily life |  |  |  |  |
| sports |  |  |  |  |
The parser is now part of the research object, not just preprocessing.
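The coverage columns of that dashboard could be filled with something like the sketch below. It assumes the parser returns a field dictionary per example, or nothing when it falls back to `[OTHER]`; the `EXPECTED_FIELDS` tuple and the three-way parsed/partial/other rule are hypothetical and should be replaced by whatever your parser actually defines.

```python
from collections import Counter, defaultdict

# Hypothetical field inventory; replace with the real Bryła schema.
EXPECTED_FIELDS = ("TYPE", "POL", "SCOPE", "INTENT", "CORE")

def parse_status(fields):
    """Classify one parser output as 'parsed', 'partial' or 'other'."""
    if not fields:                       # None or {} stands for an [OTHER] fallback
        return "other"
    if all(name in fields for name in EXPECTED_FIELDS):
        return "parsed"
    return "partial"

def parser_dashboard(records):
    """records: iterable of (domain, fields_or_None). Returns percentages per domain."""
    counts = defaultdict(Counter)
    for domain, fields in records:
        counts[domain][parse_status(fields)] += 1

    dashboard = {}
    for domain, by_status in counts.items():
        total = sum(by_status.values())
        dashboard[domain] = {
            status: 100.0 * by_status[status] / total
            for status in ("parsed", "partial", "other")
        }
    return dashboard
```

The "main failure mode" column still needs manual inspection; this only gives the coverage numbers.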
5. Keep structure-diversity metrics permanently
The best methodological insight in the new result may be the structured-diversity issue.
I would report these in every experiment:
unique raw texts
unique Bryła strings
Bryła/raw diversity ratio
field entropy
default-field ratio
parser OTHER%
average input tokens
Example:
| Metric | RAW | BRYLA |
|---|---|---|
| unique text strings |  |  |
| unique structured strings | n/a |  |
| average source tokens |  |  |
| field entropy | n/a |  |
| default-field ratio | n/a |  |
| parser OTHER% | n/a |  |
This helps distinguish:
Bryła is useful
from:
Bryła collapsed many examples into the same structure
or:
Bryła mostly encoded domain/template identity
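As a rough illustration of the diversity metrics listed above, the sketch below computes unique counts, the Bryła/raw diversity ratio, per-field Shannon entropy, and the default-field ratio. It assumes each example's Bryła is available as a field-to-value dictionary and that a `defaults` mapping says which value counts as the default for each field; both are assumptions about your pipeline.

```python
import math
from collections import Counter

def shannon_entropy(values):
    """Entropy in bits of the empirical distribution over values."""
    counts = Counter(values)
    total = sum(counts.values())
    return sum((n / total) * math.log2(total / n) for n in counts.values())

def diversity_report(raw_texts, bryla_fields, defaults):
    """raw_texts: list of str; bryla_fields: list of dicts aligned with raw_texts."""
    unique_raw = len(set(raw_texts))
    unique_bryla = len({tuple(sorted(fields.items())) for fields in bryla_fields})

    field_names = sorted({name for fields in bryla_fields for name in fields})
    entropy = {
        name: shannon_entropy([fields.get(name) for fields in bryla_fields])
        for name in field_names
    }

    total_slots = sum(len(fields) for fields in bryla_fields)
    default_slots = sum(
        1 for fields in bryla_fields
        for name, value in fields.items() if defaults.get(name) == value
    )
    return {
        "unique_raw": unique_raw,
        "unique_bryla": unique_bryla,
        "diversity_ratio": unique_bryla / unique_raw if unique_raw else 0.0,
        "field_entropy_bits": entropy,
        "default_field_ratio": default_slots / total_slots if total_slots else 0.0,
    }
```

A diversity ratio far below 1 together with near-zero field entropy is the collapse case: many different texts mapped to the same structure.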
6. Keep clean PPL / masked loss
For prefix-style experiments, full-sequence PPL can be misleading because the model may get rewarded for predicting easy deterministic tags.
So I would keep:
val_ppl_clean = only natural Polish target text
val_ppl_tags = only Bryła tags
val_ppl_std = full sequence, diagnostic only
Primary metric:
val_ppl_clean
The Hugging Face docs on fixed-length-model perplexity are useful here because they emphasize that PPL depends on the exact likelihood/evaluation setup.
For decoder-only prefix conditioning, I would use masked loss:
input:
[BRYLA PREFIX] [SEP_BRYLA] [POLISH TEXT]
labels:
[-100 ... -100] [-100] [POLISH TEXT LABELS]
That matches the conceptual setup:
Bryła = context
Polish text = target
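A minimal sketch of that masking, plus the three perplexities computed from per-token losses. `-100` is the default `ignore_index` of PyTorch's `CrossEntropyLoss`, which is why it works as the "do not score this position" label. The function names are mine, the separator is counted as part of the tags, and the sketch assumes you already have one negative log-likelihood (in nats) per token position, aligned with the labels.

```python
import math

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def build_masked_labels(prefix_ids, sep_id, target_ids):
    """Prefix and separator are context only; only the Polish text is a target."""
    input_ids = list(prefix_ids) + [sep_id] + list(target_ids)
    labels = [IGNORE_INDEX] * (len(prefix_ids) + 1) + list(target_ids)
    return input_ids, labels

def _ppl(nlls):
    return math.exp(sum(nlls) / len(nlls)) if nlls else float("nan")

def split_perplexities(token_nlls, labels):
    """token_nlls[i] is the NLL of token i; labels come from build_masked_labels."""
    clean = [nll for nll, lab in zip(token_nlls, labels) if lab != IGNORE_INDEX]
    tags = [nll for nll, lab in zip(token_nlls, labels) if lab == IGNORE_INDEX]
    return {
        "val_ppl_clean": _ppl(clean),          # primary metric
        "val_ppl_tags": _ppl(tags),            # prefix + separator only
        "val_ppl_std": _ppl(list(token_nlls)), # full sequence, diagnostic only
    }
```

If the tags are nearly deterministic, `val_ppl_tags` will sit close to 1 and pull `val_ppl_std` down, which is exactly the artifact the clean metric avoids.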
7. Try cooldown
The most interesting next experiment after the control ladder is cooldown.
This is close to the idea in MeCo: train with metadata, then cool down on raw text so the model can function without metadata at inference time.
For Bryła:
Phase 1:
train on BRYLA + text
Phase 2:
short cooldown on RAW-only text
Eval:
RAW-only
Controls:
RAW baseline
DOMAIN + text -> RAW cooldown
BRYLA + text -> RAW cooldown
RANDOM-BRYLA + text -> RAW cooldown
Interpretation:
| Result | Meaning |
|---|---|
| Bryła cooldown > RAW | Bryła may work as a training scaffold |
| Bryła cooldown ≈ RAW | no retained scaffold effect |
| DOMAIN cooldown ≈ Bryła cooldown | domain metadata may explain much of the gain |
| random-prefix cooldown helps | possible curriculum/regularization effect |
| Bryła requires Bryła at inference | useful, but deployment depends on parser |
If cooldown works, the story becomes stronger:
Bryła is not only an inference-time representation.
It may be a training scaffold for small models.
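The two-phase schedule can be expressed as a single formatting function that drops the prefix once training enters the cooldown window. This is only a sketch: the function name is hypothetical and the default `cooldown_fraction=0.1` is a placeholder to tune, not a recommended value.

```python
def training_text(step, total_steps, prefix, text, cooldown_fraction=0.1):
    """Return the training string for this step.

    Phase 1 (before cooldown): prefix + text.
    Phase 2 (last cooldown_fraction of steps): raw text only.
    An empty prefix gives the RAW baseline in both phases.
    """
    cooldown_steps = round(total_steps * cooldown_fraction)
    cooldown_start = total_steps - cooldown_steps
    if not prefix or step >= cooldown_start:
        return text
    return f"{prefix} {text}"
```

The DOMAIN and RANDOM-BRYLA control arms use the same function with a different `prefix`, so all arms see the same number of raw-only steps before the RAW-only evaluation.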
8. Test serialization format
Current Bryła looks like a compact symbolic representation. That may be best for tiny models, but it should be tested.
Structured-representation work suggests that code-like formats may be less model-friendly than natural-language descriptions in some settings.
I would test:
BRYLA-symbolic
BRYLA-verbalized
BRYLA-hybrid
BRYLA-no-defaults
Example:
Symbolic:
[TYPE:fact] [POL:neutral] [SCOPE:general] [INTENT:inform] [CORE:yes]
Verbalized:
This is a neutral factual statement with general scope. The intent is to inform. The main content is central.
Hybrid:
[type: factual statement] [polarity: neutral] [scope: general] [intent: inform] [core: yes]
Also test field order, because sequence order can matter a lot for structured inputs.
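To keep the format comparison fair, all variants should be rendered from the same field dictionary. The sketch below reproduces the symbolic and hybrid examples above and adds a no-defaults variant and a field-order generator. The `LONG_NAMES`, `LONG_VALUES`, and `DEFAULTS` tables are hypothetical placeholders; the verbalized variant is left out because it needs a hand-written template per schema.

```python
from itertools import islice, permutations

FIELDS = {"TYPE": "fact", "POL": "neutral", "SCOPE": "general",
          "INTENT": "inform", "CORE": "yes"}

# Hypothetical lookup tables for the hybrid and no-defaults variants.
LONG_NAMES = {"TYPE": "type", "POL": "polarity", "SCOPE": "scope",
              "INTENT": "intent", "CORE": "core"}
LONG_VALUES = {("TYPE", "fact"): "factual statement"}
DEFAULTS = {"POL": "neutral", "SCOPE": "general"}

def symbolic(fields, order=None):
    order = list(order) if order else list(fields)
    return " ".join(f"[{name}:{fields[name]}]" for name in order)

def hybrid(fields, order=None):
    order = list(order) if order else list(fields)
    return " ".join(
        f"[{LONG_NAMES[name]}: {LONG_VALUES.get((name, fields[name]), fields[name])}]"
        for name in order
    )

def no_defaults(fields, defaults, order=None):
    """Symbolic format with every field that equals its default dropped."""
    kept = {name: value for name, value in fields.items() if defaults.get(name) != value}
    return symbolic(kept, [name for name in (order or fields) if name in kept])

def field_orders(fields, limit=5):
    """First `limit` permutations of the field names, for an order ablation."""
    return [list(p) for p in islice(permutations(fields), limit)]
```

For example, `symbolic(FIELDS, order=field_orders(FIELDS)[1])` renders the same content with the last two fields swapped, which isolates order from content.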
9. Polish datasets and resources
For natural Polish QA / MRC testing, I would look at these.
I would not mix all of these into one training soup immediately.
Better:
small clean natural benchmark
+ controlled ablations
+ separate larger-data experiments later
10. Suggested next reporting table
A compact table like this would be very clear:
| Setup | Data | Control type | Clean PPL | Task metric | Tokens | Wins/seeds | Comment |
|---|---|---|---|---|---|---|---|
| RAW | natural | baseline |  |  |  |  |  |
| DOMAIN | natural | simple metadata |  |  |  |  |  |
| BRYLA | natural | real structure |  |  |  |  |  |
| SHUFFLED | natural | broken alignment |  |  |  |  |  |
| RANDOM | natural | format control |  |  |  |  |  |
And by domain:
| Domain | BRYLA > RAW? | BRYLA > DOMAIN? | BRYLA > SHUFFLED? | Parser OTHER% | Note |
|---|---|---|---|---|---|
| technical |  |  |  |  |  |
| geography |  |  |  |  |  |
| biography |  |  |  |  |  |
| science |  |  |  |  |  |
| daily life |  |  |  |  |  |
| sports |  |  |  |  |  |
11. What would make the claim much stronger
The result would become much harder to dismiss if the next stage shows:
BRYLA > RAW
BRYLA > DOMAIN
BRYLA > SHUFFLED-BRYLA
BRYLA > RANDOM-BRYLA
on small natural Polish data, with:
clean target-only loss
parser coverage reported
field entropy reported
token cost reported
domain-level breakdown
That would support the claim:
Bryła adds useful structure beyond domain conditioning and prefix-format effects.
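One simple way to check the four inequalities above is to count, per control, how many seeds BRYLA beats it on clean loss. The sketch below assumes losses are stored per condition with seeds in the same order; the `min_win_rate` threshold is a hypothetical knob, and with only a few seeds a win count is weak evidence, so I would report the raw counts rather than a yes/no.

```python
CONTROLS = ("RAW", "DOMAIN", "SHUFFLED-BRYLA", "RANDOM-BRYLA")

def ladder_wins(clean_loss):
    """clean_loss: {condition: [loss per seed]}, seeds aligned across conditions.

    Returns {control: (wins, n_seeds)} where a win means BRYLA has lower loss.
    """
    bryla = clean_loss["BRYLA"]
    report = {}
    for control in CONTROLS:
        wins = sum(b < c for b, c in zip(bryla, clean_loss[control]))
        report[control] = (wins, len(bryla))
    return report

def ladder_holds(report, min_win_rate=1.0):
    """True if BRYLA wins at least min_win_rate of seeds against every control."""
    return all(wins / n >= min_win_rate for wins, n in report.values() if n)
```

The `(wins, n_seeds)` pairs map directly onto a "Wins/seeds" column in the reporting table.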
12. What would weaken the claim
These would not kill the project, but they would change the interpretation:
| Observation | Interpretation |
|---|---|
| DOMAIN ≈ BRYLA | Bryła may mostly encode domain/topic |
| SHUFFLED ≈ BRYLA | field-value alignment may not matter |
| RANDOM helps | prefix format may act as regularization |
| gains vanish on natural data | synthetic setup may be too clean |
| gains only appear in full PPL | tag-prediction artifact |
| parser outputs mostly [OTHER] | structure is not reaching the model |
| Bryła works only in one domain | still useful, but domain-specific |
Short version
This is good progress.
The next step is not “make it bigger.”
The next step is:
small natural data
+ DOMAIN control
+ shuffled-structure control
+ random-prefix control
+ clean PPL
+ parser diagnostics
If Bryła still wins there, the result becomes much stronger.