Skip to main content
Steven Cramer

Designing an Architecture for StreetSmart

During a 2025 climate-tech fellowship through Climatebase, I led the technical side of StreetSmart, a digital platform that turns street images into usable insight for cities. Built in close collaboration with urban planner Reena Mahajan and a multidisciplinary team of designers and engineers, the platform makes everyday street experience legible, identifying what makes a street work, or not, for the people it's meant to serve.

A person uploads a picture of a street and our app shows parts of the street design that make it good or bad to walk around/live in. Send me an email if you want to join the private beta!

In what mostly amounted to a three-month sprint from product conception to demo day, I led a small global team of software engineers to design and build our AI-powered MVP in collaboration with urban planners. It was successful by most measures: we formalized urban planners' notions of a liveable street, made a working prototype, ran two pilots with partners in India, and are now attracting a fresh batch of volunteers for ongoing development.

I'm continuing on as an advisor, and wanted to write down three architectural decisions that shaped the codebase while they're still fresh:

  1. Use tech that the team knows
  2. Maintain a $0/yr annual budget
  3. Don't blindly trust an LLM

Context: What we built

From the outset, our MVP was to build an app where people could take a picture of a street and the app would call out aspects of the street's design that make it good and bad for liveability/walkability. We chose to build this as more of a small, loveable, correct app rather than as the shell of an enterprise application with a lot of dead ends.

We worked with a team of urban planners and architects to articulate concretely what aspects of the street's design influence liveability/walkability. This team still has many quibbles with what our app recommends, but on the whole think it gets people to notice and think more carefully about their living environment.

The app is designed to work on desktop and mobile. Mobile because we assumed that's where most users would use it, desktop because we needed to test and build it on our computers.

Decision 1: Use tech that the team knows

As a volunteer group, we were limited to whatever people could be inspired to contribute. Which, between working other jobs/school/job hunting, was limited.

We settled on a doacracy mindset, where people who do work have a strong influence on how the work is done. In practice, I'd scope out a component — what it could do and how it might deploy — and a contributor would be responsible for vetting those assumptions and delivering something the project would find useful. I'd then integrate their work into the main app.

We decided that we'd make a web app. The team had a lot of AI-focused data people on it and most of their experience (and most AI tooling) were in Python, so we opted to use that as our backend rather than a more trendy Node.js/Typescript.

Overall, this structure worked reasonably well and left us with these architectural choices:

In retrospect this led to some unfortunate outcomes:

Decision 2: Maintain a $0/yr annual budget

The AI startup ecosystem is full of free trial offers. But free trials have two relevant periods: the free period, and then the paid period.

I had an intuition that the transition to the paid period of too many products would introduce uncomfortable questions for the project. I had no doubts that the founder was bought in to the success of the project, but I knew she had real financial constraints. Could she indefinitely invest $1/mo in the project? Probably. $100/mo? Probably not without a clear path to revenue.

Should we have signed up for these trials, we would have also had a lot more management overhead: Whose credit card would be on the trial? Does it matter whether the payer is in Europe, India, or the US? Who would keep track of how much we're spending or on track to spend? How did we avoid unbounded spending like people sometimes experience with AWS? Who would be in charge of winding down the service at the end of the free trial, and who would decide that when? Lots of questions, so few great answers.

I also quietly expected that the end of major trial periods would prompt a reckoning for the project: if it wasn't making tangible progress, our product would be unceremoniously shut down and future teams would be forced to start with scratch (or, rather, a bunch of low-quality code and some sketchy instructions about how to run it... which is pretty close to scratch). Conversely, if we were on free-forever plans, the project could lie dormant for a while without it prompting these sorts of shut-down decisions. It was also advantageous for the team, because I suspected some of us might want to use it as a portfolio piece, and it's more engaging to have a working product than a bunch of screenshots.

If we could keep it to $0/yr, lots of problems went away.

How we did it

Indeed, we managed to keep our infrastructure costs very low! Our only ongoing cost is $12/year for the internet domain, which I think is a pretty unavoidable expense. It's also one that makes sense as long as the project lives as a twinkle in the founder's eye, and can survive through several technical architecture changes.

We found acceptable free tiers of products:

Sticking to the free tier gave us a few constraints:

Decision 3: Don't blindly trust the LLM

There was a stage of development where some of the tech team thought we could make the app work with just an LLM prompt with the response piped back to the user. We never tried this because of three concerns:

As a result, our final analysis has three layers. In the order that we built them (and that they often execute):

First model: panoptic mask2former

This is the same sort of computer vision model that is designed to power a self-driving car: it labels different parts of the street while also separating different objects from each other (so, for example, you get 2 different pole objects for different poles).

We started visualizing this output in a view like this: the yellow dot is the computed center of each object, and then we have a bunch of metadata about what the object is. For the sake of moving quickly, we chose to simplify each object into one point rather than try to cope with the full polygon data that the model computes.

The mask2former piece was the first piece we wrote. We let the team upload their own images and visualize the output in this UI, which was both very motivating for the team and the basis for great conversations.

Next: Gemini LLM

Working with urban planners, we wrote 100+ individual specific questions for an LLM to answer about the picture.

You can see some of the raw output of the LLM at the bottom. That first question (is_outdoor_streetscape) is the result of the prompt "determine if the main subject of the photo is an outdoor streetscape in an urban or suburban setting". In the table you will also see some summaries of different objects. For example, the section we see about the road ("This is a car-oriented...") is the result of us instructing the LLM "In less than 30 words, describe the street type and how it influences pedestrian movement. If car-oriented and applicable, mention a missing crosswalk".

We added some of these rules to our initial output-visualizer, which continued to create great conversations:

Three: Apply rules to stick it together

We apply a set of rules and heuristics to stitch the model outputs into an interactive visualization:

(You can see the interactive version of this example here)

While the Urban team had clear ideas for a more complex version, what we shipped amounted to show the Gemini summary for objects that mask2former recognizes, sorted with the most prominent objects first. Then when the user clicks "dig deeper" use a complex set of rules to figure out which piece of human-written urban educational copy to show.

For this street, StreetSmart calls out a lot of design features that make this street unpleasant to walk around. This is consistent with my memory of this intersection: it was not a pleasant place to walk or be, although it was a fine place to bike or bus through. I guess that's consistent with how the City of Cleveland describes these five miles of this renovated street: a complete street focused on bicycles, bus rapid transit, and car users.

Evaluating this structure

So even though our Gemini prompt contains phrases about the prominent bus stop in the middle of the road like "Despite bus stop shelters, the street lacks visible seating, water fountains, or public toilets for pause" for absence_of_any_street_furniture_utilities, we don't show that because we hadn't reliably found a way to tie that insight to a mask2former element in the picture.

Our current structure forces us to anticipate what will be in the picture and comment on it. When something's not in the picture (like the absence of street furniture), we don't comment on it.

But it also prevents us from making incorrect observations: For example, consider the bike lane on this street. In reality, Euclid Avenue has a bike lane painted on the road, but those lines are pretty faint. Our models tell us different things:

If we were simply looking at the Gemini output, we might confidently comment on the lack of the bike lane (which is untrue). However, because mask2former sees the bike lane as both very small and of a low confidence, we decide not to talk about it. There is assuredly more nuance we can derive from this kind of mismatch.

In conclusion

If I were starting again, I'd probably make similar core choices: boring tech, $0 infrastructure, strong guardrails on the LLM. What I'd do differently is be clearer to the team that a doacracy means real agency but also real ownership, and make sure everyone was on board with that tradeoff. I'd also reconsider whether tech that was boring to me was also approachable for the average volunteer developer, although I don't know that I should actually have written different code.

A mantra I hear about startups is that what gets you to one step probably won't get you to the next. If StreetSmart's next iteration does something functionally similar, the team will probably want to revisit several architectural decisions. Collecting reliable photo locations probably means shipping native iOS and Android apps. Exposing so much of the app's logic as client-side JS is fine for an MVP but irresponsible for a real tool. And the current backend has no systematic backups or rollback plan to the extent that a few strategic clicks could delete everything: that's a problem.

We made some choices. They got the project to the next step. The coming team will have to decide what gets them to theirs.