This story comes from a software craftsperson of some standing who has never been an AI pundit or promoter.
An AI assistant seems to be helpful when it is combined with other good habits and managed respectfully.
I used the Gilded Rose kata, which may have colored my results. It is possible that the AI assistant has been trained on various solutions to Gilded Rose, and merely memorized the results. I can’t tell if this is true or not, so I can only report what I know.
Start Well
For my example, I cloned the Python version of the Gilded Rose kata, created a virtual environment and git repository (via uv), and installed the requirements.
I fired up my editor, in this case PyCharm.
Following the usual Clean Start protocol, I began by running all the tests in the project.
I clicked the “rerun” option so that the full test suite would run automatically whenever any file changed. I find that this makes continuous testing a trivial undertaking, and I’m a big believer both in making my work effortless and in testing continuously as I code.
Famously, the kata comes with only one test, which is written to fail. The test ran and failed as expected. Good.
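From memory, that starter test looks roughly like this (details may differ from the current repo); it asserts that the item is named "fixme", which of course it never is:

class GildedRoseTest(unittest.TestCase):
    def test_foo(self):
        items = [Item("foo", 0, 0)]
        gilded_rose = GildedRose(items)
        gilded_rose.update_quality()
        self.assertEqual("fixme", items[0].name)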
I used the gnext script, described in Save Your Game, to set the intention of my next action.
I have since put an improved version on GitHub, so feel free to clone it and use these commands in your daily work (without warranty and at your own risk).
I made the test pass, and used gdone to commit my changes.
Add Tests
The whole point of the Gilded Rose is that it is a bit of ugly legacy code without tests.
The scenario of “I have to work on this code, but there are no tests” is quite familiar to most of us.
gnext "test: add unit tests"
In the AI window, I set the scope to be the gilded_rose.py source file, and asked the AI to write unit tests for the update_quality() method.
I have used test generators before, and have been uniformly disappointed. They tend to be run-on, ill-structured tests with meaningless names, but they do provide coverage for code I want to refactor. This is my first time doing this with an AI.
In a radical departure from expectations, it did a pretty good job. Not great, but pretty good.
The tests were of the form:
def test_normal_item_quality_does_not_go_below_zero(self):
    items = [Item("Elixir of the Mongoose", 5, 0)]
    gilded_rose = GildedRose(items)
    gilded_rose.update_quality()
    self.assertEqual(items[0].sell_in, 4)
    self.assertEqual(items[0].quality, 0)
The test naming isn’t too bad. It’s roughly as good as I sometimes see from students in my TDD training.
The tests follow the Arrange, Act, Assert pattern. They have exactly one action per test, as I prefer. The AI is also willing to use multiple assertions when the action has more than one observable effect, which is also to my taste (YMMV).
On the other hand, I don’t like how it subscripts the list to reference the item.
Some people will argue that the micro-efficiency is poor, since it subscripts the array multiple times and array indexing isn’t free. I recognize that, but it is not my objection.
I worry that the array indexing distracts the reader and complicates the test just a little.
Since the tests work, I’ve satisfied one level of the Code Virtues and now duplication and simplicity become more prominent concerns.
I asked the AI to try again but give the item in each test a name instead of referencing it through the items array. I decided to push my luck and not give a very concise and clear set of instructions.
My prompt is imprecise and curt: “rewrite these tests but reference the item directly instead of using items[0]”
This time it created a named object in each test and passed that object to the GildedRose constructor inside an inline list literal.
def test_aged_brie_increases_quality(self):
    aged_brie = Item("Aged Brie", 2, 0)
    gilded_rose = GildedRose([aged_brie])
    gilded_rose.update_quality()
    self.assertEqual(aged_brie.sell_in, 1)
    self.assertEqual(aged_brie.quality, 1)
I used the handy “copy” button from the AI’s code window and pasted it all, replacing the content of the test_gilded_rose.py file entirely in one easily revertable step.
The tests ran, and they all passed. So far so good.
gdone
I suddenly realized that I hadn’t run the tests with coverage, which I think is a reasonable check if one is going to rely upon tests for refactoring.
So, okay, we run with coverage. I have line coverage of the entire source. That’s nice.
But what about mutation testing? In Python, I installed mutmut and configured it quickly. It, too, found no problems with the tests: there were no “surviving mutations” of the code.
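(If mutation testing is new to you: a tool like mutmut makes one small change to the production code at a time, such as turning a < into a <= or a 50 into a 51, and re-runs the tests; any mutant the tests fail to catch “survives.” The snippet below is only an illustration of the kind of mutant involved, not actual mutmut output.)

# Original guard in gilded_rose.py
if item.quality < 50:
    item.quality += 1

# A typical mutant: a healthy test suite should fail when this runs instead
if item.quality <= 50:
    item.quality += 1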
The “Production” Source
At this point, I decided not to follow the standard Gilded Rose script, for fear that the AI assistant was parroting all the prior answers it had memorized.
I decided to attack negation in the source.
gnext "refactor: avoid negation in conditions"
My prompt: “refactor update_quality to avoid using negation in if statements.”
This is a thorny thing to ask, given the nature of Gilded Rose.
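For reference, the original update_quality opens with negations nested several layers deep. From memory (so the details may differ slightly from the repo), it starts roughly like this:

for item in self.items:
    if item.name != "Aged Brie" and item.name != "Backstage passes to a TAFKAL80ETC concert":
        if item.quality > 0:
            if item.name != "Sulfuras, Hand of Ragnaros":
                item.quality = item.quality - 1
    else:
        if item.quality < 50:
            item.quality = item.quality + 1

Untangling those negated name checks without changing behavior is exactly the kind of thing that trips people up.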
The AI provided some refactored code, and I copied it all and pasted it over the top of the entire gilded_rose.py contents.
Immediately the tests failed. The refactoring was not valid.
That’s okay: I use gnext/gdone to save my state, and I’m making changes one at a time so that the undo is easy.
I used undo, and got my original code back: all tests are green.
My prompt is even less specific: “try again, but be sure all the tests pass”
I paste, and the tests pass. Surprising! I guess I should have stated that the tests need to pass to begin with. Lesson learned.
The code is structured very differently now. Methods have been extracted, and a guard clause added for “Sulfuras”:
def update_quality(self):
    for item in self.items:
        if item.name == "Sulfuras, Hand of Ragnaros":
            # "Sulfuras" never changes in quality or sell_in
            continue
        # Update `sell_in` for all items except "Sulfuras"
        item.sell_in -= 1
        if item.name == "Aged Brie":
            self._update_aged_brie(item)
        elif item.name == "Backstage passes to a TAFKAL80ETC concert":
            self._update_backstage_passes(item)
        else:
            self._update_normal_item(item)
This is more improvement than I’d asked for or hoped for. I’m torn on whether I like that big change or not, but overall I think I do.
All tests pass. At this point, I’ve achieved the intention of this commit, so one more time with gdone.
Cross-Class Refactor
For fun, I decided to even further break with tradition and modify the Item class.
gnext "refactor: items know their type"
This might not be something I would normally do, but I’m kicking the tires. My prompt: “refactor update_quality so that all the name comparisons are part of the item class.”
I’m intentionally vague, but it seems to understand what I’m asking just the same.
An excerpt:
# Helper properties for special item checks
@property
def is_aged_brie(self):
    return self.name == "Aged Brie"

@property
def is_backstage_pass(self):
    return self.name == "Backstage passes to a TAFKAL80ETC concert"
Not only did it add name checks to the Item class, it implemented them as properties rather than normal member functions. I didn’t ask for that, but it’s not a strange or troubling choice.
The refactored update_quality doesn’t look odd either, so I paste the new code and watch as the tests pass.
class GildedRose:
    def __init__(self, items):
        self.items = items

    def update_quality(self):
        for item in self.items:
            if item.is_sulfuras:
                # Sulfuras has no quality or sell_in changes
                continue
            # Decrease sell_in for all non-Sulfuras items
            item.sell_in -= 1
            if item.is_aged_brie:
                self._update_aged_brie(item)
            elif item.is_backstage_pass:
                self._update_backstage_pass(item)
            elif item.is_normal_item:
                self._update_normal_item(item)
I can do without the superfluous comments, but that’s a separate intention. With the code valid (passing tests) and the current intention met, I use gdone again.
It modified both classes, the caller and the callee, in cooperation. This is some good semantic editing, even if I know later I will be renaming the functions to be less string-specific and more type-specific. It’s a good step on the path.
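To be concrete about the direction I mean by that rename (this is my own speculation, not something the assistant produced), it might head somewhere like:

# Speculative rename; these property names are mine, not from the session
@property
def is_legendary(self):            # instead of is_sulfuras
    return self.name == "Sulfuras, Hand of Ragnaros"

@property
def appreciates_with_age(self):    # instead of is_aged_brie
    return self.name == "Aged Brie"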
Reducing Conditional Complexity
The methods for updating various items are riddled with silliness like this:
if item.quality < 50:
    item.quality += 1
So:
gnext "refactor: eliminate needless comparisons to 50"
It seems to me that taking that much space to deal with “increment up to 50 only” is pointless. Can the assistant work out what I mean from a simple prompt? Let’s try “in _update_backstage_pass, refactor the code to use max instead of checking the value is less than 50 and then incrementing.”
Will it understand?
def _update_backstage_pass(self, item):
    if item.sell_in < 0:
        item.quality = 0  # After the concert, quality drops to 0
    elif item.sell_in < 5:
        # Increase by 3 if sell_in is 5 or less using `min` to keep it at most 50
        item.quality = min(50, item.quality + 3)
    elif item.sell_in < 10:
        # Increase by 2 if sell_in is 10 or less using `min` to keep it at most 50
        item.quality = min(50, item.quality + 2)
    else:
        # Increase by 1 otherwise using `min`
        item.quality = min(50, item.quality + 1)
Not only did it understand, it recognized that I meant min rather than the max I actually asked for: min(50, item.quality + 1) enforces the ceiling, while max(50, item.quality + 1) would push every value up to at least 50. Confusing the two is a signature error of mine; I have done it several times and then had to go back and fix it. This time, my AI “programming pair” didn’t let me do something that backward and silly.
I paste the code, the tests still all run green. All code has line coverage.
gdone
Ideating
My new partner and I pursued a logical improvement. We hit a few minor snags along the way, but I thought it was illuminating.
Here is a bit of ugly conditional code:
def _update_backstage_pass(self, item):
    if item.sell_in < 0:
        item.quality = 0  # After the concert, quality drops to 0
    else:
        change_amount = 1  # Default increase
        if item.sell_in < 5:
            change_amount = 3  # Increase by 3 when 5 days or fewer are left
        elif item.sell_in < 10:
            change_amount = 2  # Increase by 2 when 6-10 days are left
        # Update quality and ensure it doesn't exceed 50
        item.quality = min(50, item.quality + change_amount)
By way of improvement, it suggested this:
change_amount = (
    3 if item.sell_in < 5 else
    2 if item.sell_in < 10 else
    1
)
It is relatively terse, relatively clear, has pretty formatting, and works.
On the other hand, I’m not sure that cascading ternaries will be anyone’s favorite choice for expressing this idea.
I’m willing to be convinced otherwise.
It is, however, an implementation I’d not considered before, since I have a bit of a prejudice against nested ternary expressions. The AI tool doesn’t share my prejudices, and may suggest things I wouldn’t have tried. That’s not a bad thing.
I asked about ways to represent a lookup table instead. It suggested this form:
lookup_table = {
    range(0, 51): "a",
    range(51, 101): "b",
    range(101, float("inf")): "c"
}

for key_range, result in lookup_table.items():
    if value in key_range:
        return result
return "unknown"  # In case value doesn't fit any range
This isn’t bad. It is quirky, though.
Using a range object as a dictionary key is unusual; in earlier versions of Python it wasn’t even legal, because the range object wasn’t hashable and couldn’t be a key. The weirdness of this even confused my AI assistant: it seemed unsure which version of Python we were using, and went back and forth between recommending this structure and declaring it illegal.
Since we are not using the dictionary as a dictionary anyway, why use a dictionary? It seems to be a lot of mechanism for a simple thing. I would think a list of tuples would be more natural, and we could even have made this a tuple of tuples to keep it immutable, for what little difference that would make.
I switched to using the more obvious list of tuples, and didn’t have any issues with the tests, the compiler, or the assistant.
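Something along these lines, reconstructed from memory rather than copied from the session, with names that are my own:

# Ordered thresholds: first match wins, default is an increase of 1
BACKSTAGE_INCREASES = (
    (5, 3),    # fewer than 5 days left: quality rises by 3
    (10, 2),   # fewer than 10 days left: quality rises by 2
)

def _backstage_change_amount(self, item):
    for threshold, amount in BACKSTAGE_INCREASES:
        if item.sell_in < threshold:
            return amount
    return 1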
In the next move, it deleted my version and replaced it with the nested if statements again. I can see that I will need to watch this AI thing very closely if I want to maintain my own updates.
Impression
Using an AI can help with proper code craft, including coding, testing, and refactoring.
Because it is trained on programming languages and their libraries, it may suggest solutions I’d not considered, though I did find it lacking in knowledge of some Python libraries such as itertools and operator.
I’m surprised how my intentionally poor, loose prompts still got good behaviors most of the time.
I was also surprised at how often I needed to re-center it to use the existing version of the code and not something it had imagined or suggested earlier in the session.
I was pleased with how it generated tests for existing code.
When I’m dealing with legacy code, I will probably extract methods and have an AI provide tests.
In code other than this kata, I’ve had it generate invalid tests, even some that didn’t compile. When it works, it works.
These tools will be helpful to people who work with discipline and proper craft. They may also help people who don’t, but the risk is greater that such people will end up with troublesome code they don’t know how to fix.
With microcommits and tests, I think I’ve limited the damage it can do.
I will remind you at this juncture that my AI assistant may have been trained on Gilded Rose and may have been merely regurgitating memorized code samples.
It seems to be doing my bidding with semantic editing, but it could be a convenient parlor trick in this case.
I will have to try the same techniques and behaviors in code bases the AI has never seen before.
Don’t turn off your own intelligence, intuition, creativity, and code sense, but consider using AI as a partner. Don’t trust it too much, but don’t ignore it.
AI gets better fast; it currently is the worst it will ever be. Where it falls short today, it may soon be more competent.
Consider using Industrial Logic’s competence to upskill your teams.