Sometimes the unplanned happens. When it does, getting things running again is important, but sometimes not as important as making sure it doesn't happen again. That's where John Frahm, High Availability Guru, can teach us all a thing or two.

Find the Terminal Talk podcast on iTunes, or go to the mp3 directly at terminaltalk.net

The transcript below was produced by machines and may not be accurate:

Frank:                    Welcome to Terminal Talk, a Podcast on Mainframe and Mainframe related topics. I’m Frank.

Jeff:                       I’m Jeff.

John Frahm:          And I’m John.

Frank:                    John is our guest this week. And we're going to do more Mainframe-related topics this time, and not just Mainframe, because John runs the High Availability Center of Competency. And so he has a lot of experience around keeping systems up, and keeping datacenters up, more than just systems. We thought it would be really good to have him come and talk about that.

So when you kind of do these things, you must have a starting point. Do you start with problems that they're having, or do you just say, you know, let's look at everything you do?

John Frahm:          Yes Frank great question. So my team is contacted unfortunately a lot of times after a client has had a failure, an outage of some sort. So we’re you know connected at the hip to the client care centers, mainly Z but also POWER and Storage. And we are called in after clients have had a failure and it’s been a business impacting failure.

We'd like to do proactive work, but unfortunately we're called in many of the times, 95% of the time, for reactive work.

Frank:                    Yes and that’s a really important point because people don’t think they have a problem. Nobody ever thinks they have a problem. It isn’t until something pops up that they go maybe we didn’t do things the way we should have.

John Frahm:          Sure, and what happens, and what we find, is there are some common things we've found. Clients rely on the reliability of the product. And when they fail they cause business impact because a client is not configured for high availability. They haven't built their systems with data sharing, Sysplexes, etcetera.

Frank:                    So that’s a good probably starting point because people are relying maybe a little bit too much on the promise that these things never go down or rarely go down. And all the redundancy built in, right, the key is not just hey we need machines that don’t go down or software that doesn’t break. We need to do more than that, right? What, again where do you start that? I mean do you start with okay here’s the failure and do you start by whose fault it is or…

John Frahm:          Sure, well we don't like to place blame, but certainly we have a collaborative workshop methodology where we ask the client's best and brightest to get in the room with IBM's best and brightest. So we bring a core set of High Availability Center of Competency folks with us, that's my team. And then we bring subject matter experts from the lab, whether they be Lab Services, development, test, test environments, whatever.

We get everybody in a room, we start to look at the infrastructure that supports the key business applications. So if it’s a bank online banking perhaps, we start to pull that apart.

Now we do, once in a while, look at the root cause of the components, but you know we allow the client care office to do that. So if a book failed or power supplies failed, certainly there's a root cause of that. What we want to get to is the root cause of the business impacting failure, the true root cause, so why did this happen?

And a lot of times we see that it happens because of process failures. Somebody in the shop did something, changed something, changed a script, right, did a change. And now as a result of change we have an issue.

But what we like to point out to clients is we need to build an infrastructure to support the key business application and have it run anywhere, right. So if you lose a component you're still able to run. I mean it sounds basic, but clients think they've got these infrastructures, and when we go look we find single points of failure, we find non-exploitation of HA functionality, etcetera.

Frank:                    Is it primarily because software has been misconfigured? Or is it just we never thought that that was going to be a problem?

John Frahm:          I think it's we never thought that that was going to be a problem, they think they're good to go. And we find out that due to the business pressures they don't have a lot of time to do maintenance. They've reduced people resources, they don't have the same number of folks supporting the environments as they used to. Thus maintenance isn't getting on, things are getting pushed out, especially if they don't have the ability to plan maintenance out, so these things go by the wayside, right. They don't have the ability to put our HIPERs and our fixes on, and they're going to suffer the consequences of that. So we see a lot of that as well.

Frank:                    Is there more and more of a clamor for automation in this? Or is it just we need our people to be smarter about what they do?

John Frahm:          That’s another really good question, right, so we do push automation. We do tell our clients, look automate as much as you can so that the systems can detect these failures and take an action and that eliminates people getting involved. But in the same respect, much like a pilot of an airplane, we want to make sure that our pilots, our operators are trained, our technical staff are trained should automation fail. What would we do? What will we take out of the plex? What scripts do we need to run? And how do we recover the business?

It's just like a pilot goes into a simulator and he tests failing conditions, we'd want our clients and our operators to do the same thing, be trained to do the same thing.
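John's point here, automate detection and recovery as far as you can, but keep operators trained and rehearsed for the day automation fails, can be pictured with a toy watch loop. This is a minimal sketch, purely illustrative and not IBM automation tooling; the member names and the check_member, restart_member, and page_operator hooks are hypothetical stand-ins for whatever monitoring and recovery hooks a shop actually has.

```python
# Illustrative only: a toy failure-detection and recovery loop, not IBM
# automation tooling. Member names and hooks are hypothetical stand-ins.

MEMBERS = ["SYSA", "SYSB", "SYSC"]   # hypothetical sysplex members
MAX_RETRIES = 2                      # automated attempts before escalating

def check_member(name: str) -> bool:
    """Pretend health probe; a real shop would query its monitoring stack."""
    return name != "SYSB"            # simulate SYSB being down

def restart_member(name: str) -> bool:
    """Pretend scripted recovery action (restart, failover, whatever applies)."""
    print(f"running scripted recovery for {name}")
    return False                     # simulate automation failing to recover

def page_operator(name: str) -> None:
    """Escalate to a trained human when automation cannot recover."""
    print(f"PAGE: automation could not recover {name}, operator action needed")

def watch_once() -> None:
    for member in MEMBERS:
        if check_member(member):
            continue                 # healthy, nothing to do
        for attempt in range(1, MAX_RETRIES + 1):
            print(f"{member} down, automated recovery attempt {attempt}")
            if restart_member(member) and check_member(member):
                break                # automation fixed it
        else:
            page_operator(member)    # the trained-pilot path: a human takes over

if __name__ == "__main__":
    watch_once()                     # a real loop would repeat on a timer
```

The escalation branch is the piece John keeps coming back to: automation handles what it can, and a trained, rehearsed operator covers what it cannot.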

Jeff:                       In the field of, like, security and penetration testing I hear a lot about red team, blue team type testing, where you know one team becomes the adversary, you know, within the company, and says I'm going to try to break something. Obviously you don't want somebody bringing down a Sysplex and saying hurray I won, but is there any sort of opportunity for a proactive, you know, type of testing or work that can be done to, you know, make it so you don't get woken up at night?

John Frahm:          We do advocate that as well, right, so we want, we ask customers to show us your test environments and do those test environments mimic production? Are they production like such that you can do this testing? Right, you can pull a fiber cable, have it simulate a failure, have somebody practice a recovery procedure, make sure that automation does indeed work the way it’s scripted to work. So, exactly, we advocate that as well.

Again, we don't see a lot of that, we see a lot of test environments that really don't mimic production, so they really can't do the load testing and things that they really should be able to do to protect their production environments.

Frank:                    How many of the outages you guys kind of pursue are cascaded failures as opposed to, oh that's you know…

Jeff:                       That’s a good band name. I’m just writing all these down over here.

Frank:                    Cascaded failures…

Jeff:                       Yes.

Frank:                    Is the name of the band. How many of them are more like that as opposed to, oh, Jeff just hit the switch when he wasn't supposed to?

John Frahm:          Well it's hard to tell. Like, so we see many different types of failures. We see a lot of OEM product failures, non-IBM product failures, incompatibility with new levels. So the customer does change, right, we see a lot of issues as a result of change. So perhaps the customer puts a new level of z/OS up and then they find there's an OEM product that just doesn't play nice with the new level, or there were some fixes needed to support that, right. So we see that type of failure.

We do see, as I mentioned earlier, process failure. Where somebody makes a finger check, or somebody goes in outside of change control and decides, you know, I'm going to do this, it's not going to affect anything, and it does. Or we see the non-separation of production and test environments. So somebody's in a test environment and they do something in a test environment that affects production. We see that more often than we'd like.

So we usually have a recommendation that says separate production and test, it's simple right, it's basic. But these are the common things we see.

Frank:                    So could you describe one of the clients, you know obviously without names, but one of the client scenarios where that happened? Because I think a lot of people are kind of trying to understand, how can test, if it's in a separate LPAR, cause a failure to my production system?

John Frahm:          Yes, I'm trying to remember. Off the top of my head I can't remember any particulars recently, but you know, certainly, you know, touching a database that touches both.

Frank:                    Okay.

John Frahm:          And you know they’ll have some situation like that. You know it’s a lot of things you don’t think would happen, happen. And we also find clients saying wow I didn’t know that would impact my business.

So what we try to do is, when we do our workshops, we do what we call a high level CFIA, Component Failure Impact Analysis. And that's going through the whole infrastructure. And a lot of times it includes the distributed frontend all the way to Z, the DB2 backend, whatever database they're running. And we run a transaction flow through that, frontend to backend, and we say what if this component fails, what happens? Does it impact my business? Does it have failover? Is it automated? Is it manual? So we do that high level CFIA.
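To make the high-level CFIA idea concrete: walk the transaction path component by component and record, for each one, whether it has failover, whether recovery is automated or manual, and what the business would see if it failed. The sketch below is only an illustration with an invented online-banking path; it is not the HACoC methodology or any IBM tooling.

```python
# Illustrative sketch of a high-level Component Failure Impact Analysis (CFIA):
# walk the transaction path and record, for each component, what a failure
# would mean for the business. The path and answers below are invented.
from dataclasses import dataclass

@dataclass
class Component:
    name: str
    has_failover: bool
    recovery: str         # "automated", "manual", or "none"
    business_impact: str  # what the business sees if this component fails

# A hypothetical online-banking transaction path, front end to back end.
PATH = [
    Component("Web front end (x86 farm)", True,  "automated", "none if one node fails"),
    Component("Middleware / queueing",    True,  "automated", "brief delay while messages requeue"),
    Component("CICS region",              True,  "manual",    "degraded until an operator acts"),
    Component("DB2 data sharing group",   True,  "automated", "none if one member fails"),
    Component("Authentication appliance", False, "none",      "logins fail, single point of failure"),
]

def cfia_report(path: list) -> None:
    """Print one line per component and flag anything that needs review."""
    for c in path:
        flag = "" if c.has_failover and c.recovery == "automated" else "  <-- review"
        print(f"{c.name:30} failover={str(c.has_failover):5} "
              f"recovery={c.recovery:9} impact: {c.business_impact}{flag}")

if __name__ == "__main__":
    cfia_report(PATH)
```

Running it flags the authentication appliance, which is essentially the Irish bank story John tells next.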

I'll tell you a story, a very interesting story. We were working with a bank in Ireland and we did a Z engagement. And they said, they called us back and they said, we've recently had a failure in our HP Tandem point-of-sale environment, can you come and study that? So we said sure, we can bring subject matter experts, we can use our methodology and study that.

And what it turned out to be was that they had three authentication boxes. So in the summer time one of their support people took one of those authentication boxes down to do some change activity on it. Never re-enabled it, so they're running with two. The load capacity for two is fine. Come December, when everybody's shopping and trying to buy goods and use their ATM cards, the capacity, the performance was horrible and they couldn't figure it out.

Well this had happened in July when they did the change, right, no change control right.

Frank:                    Right

John Frahm:          So you know that’s just a simple example of what can happen to you. And they couldn’t find it. They had no monitoring of that component to say hey, two are green, one is red, we need to re-enable it. So simple thing you know such as taking a component out makes it very difficult.

Jeff:                       When you’re called in to assess a problem that happened, do you typically need the system to be like down and in its broken state? Or what we’ll say non-optimal, sub-optimal state.

Frank:                    Sub-optimal state.

Jeff:                       To do your work. Or has kind of first failure data capture kind of gotten better to the point where you can rely on logs and stuff like that.

John Frahm:          Another good question, right, so we want to make sure we come in after all that work is done.

Jeff:                       Right.

John Frahm:          We want to make sure that client care and the customer and everybody does that. You know, we ask the customer, did you do root cause analysis? Of course they're talking about, well IBM, you would need to tell us about why this broke, right. Maybe it's not always IBM's fault, maybe it's another component or something they did, right. So we ask them to do that root cause analysis. But all that work is done.

And we want to make sure all of that is done for a reason, because we're going to get in a room and try to collaborate with a client, and sometimes they're angry, especially right after they've had a failure.

Jeff:                       So you want to be able to close that door.

John Frahm:          Yes, yes. So we want everybody to be calm, we want the client care office to explain to the client why this happened, if it was a deficiency in an IBM product, whether it be microcode, hardware, whatever. And then we want to go in when everyone is calm and has a more level head, so we can have a good discussion about what do we need to do to configure this environment for high availability to meet our business goals. In the world that we're in, (unintelligible) high availability is very important.

Frank:                    How many times in this kind of scenario that you're talking about have you found additional stuff? So, oh you came in to solve problem A, and yes, we see problem A and we figured out how to fix that, but in the meantime we've raised issues on B, C and D.

John Frahm:          Yes we do, we find that a lot. So we come in, we look at a particular environment, and where we really find those types of issues are around service management. So we find people as single points of failure.

I can remember doing an outage analysis at one client and I’m going through the timeline of the outage and I said wow you have three hours of recovery here where you’re trying to run a script. I said what happened there? Why so long? And they said well the guy who wrote the script no longer worked there so we never tested the script. It worked that one time and then we put so many changes in our environment we went to run the script again to do recovery and it didn’t work and he wasn’t here, he was long gone, right.

So those are some of the things we find: people as single points of failure, lack of process maturity, things like that. Testing of environments, testing of the recovery procedures. Big one we find all the time.

So we're in there looking at these environments from a technology perspective and we say, you know, if you needed to recover these components, have you tested that? Again going back to the flight simulator scenario.

Frank:                    Have you generated, you know you said something that really kind of jumps out at me. Have you talked to customers about coming up with management processes for the scripts? You know I think when we develop code in a modern environment we have a repository of not only the code but a lot of the kind of documentation around the code so everything is kind of managed together and that’s great for applications. Is that kind of thing being done for the scripts that keep the business afloat?

John Frahm:          We do advocate that, right, as part of our service management. So again we focus day one on technology. Sometimes that bleeds into day two of a two day workshop. We try to keep this workshop to two days because we're tying up a bunch of, you know, client support folks for two full days. But that is very valuable, right, everybody in a room together. Again they've got IBM experts there with them, they've got all their people there with them that they normally don't get a chance to talk about this type of topic with, right.

So day one is usually technology, sometimes it bleeds into day two. But we focus day two, or half of day two, on service management, those types of things. Do you have these scripted, tested, documented recovery procedures for recovering any component or any part of your business, including applications? You know, whatever it may be, have you documented that and tested it?

And a lot of times we'll find, as in the example I gave with the script and the person leaving the company, that they had it and nobody's done anything with it or tested it until you need it, right, so.

Jeff:                       Do you ever have findings from the work you do end up in like a health check?

John Frahm:          Yes good question again. We actually advocate follow-on health checks. So being part of lab services we have our lab services team go in and do health checks after. So you know we’ll do a high level architectural view of the environment. And then we’ll say look we really think you need a deeper dive into DB2 or CICS or MQ or whatever and do a health check, right.

So we’re looking at settings and configurations and turning knobs and dials right, that’s important as well. Sometimes the client wants that done upfront and sometimes when you say health check they think that’s what a high availability assessment is.

Jeff:                       Oh that’s why I brought you here.

John Frahm:          Yes, yes exactly. So we have to kind of make sure they understand what’s the difference between the high availability assessment and a health check, right. Health check is a little deeper dive turning knobs and dials.

Frank:                    Right, you're looking more at the process, the things that cause the outage, not the failure, right?

John Frahm:          Right, and of course looking to make sure they're architected for high availability. What does this architecture look like? I call it the beast, let's put the beast on the board.

And that’s another thing, sometimes they have diagrams of the beast and sometimes we’ll build the diagram and they’re like don’t erase that I need a picture of it. Because we don’t have that, right, or we have an old picture of this environment. So you know that’s really what we’re looking at.

Frank:                    And how often do you change that picture of the beast along the way? I mean it would seem to me that every time I’ve seen a client who’s showed me hey this is our infrastructure, there’s always somebody in the room that goes well that’s not exactly it. It was that way six months ago but now it’s not.

John Frahm:          Yes it changes often Frank, yes absolutely. We put it up and you’re right, and another reason for the collaborative workshop. There are folks in the room that will say no.

I'll give you another example of why I think the collaborative piece of the workshop is an important piece to focus on as well. I was with a client and they said, well, we're going to do some breakout sessions for some of the service management stuff. So we went to another room and we went with the change team, so there was a change leader there, and as a matter of fact the manager was there and everything, you know, making sure that they don't say the wrong things.

So you know they had documented processes you know do you have metrics around successful change control and all of that and they’re like yes, yes change is great and everything. So then we wrap up and say okay it sounds like you’ve had a pretty good change process here. We go back into the big room where everybody is and one of the senior managers of the operators says hey did you guys talk about change? And we’re like yes we did and they said everything was good.

And he says good, it’s not good, it’s horrible. They do changes and they don’t notify my office team and we don’t know about them and all of a sudden we see CPU running away or we don’t know if something has happened. So the truth comes out right.

Just as you pointed out, right, we’re in a room, we put the beast up, we put the architecture up. Somebody says no it doesn’t look like that, right, it looks like this. So yes we see that quite often.

Frank:                    Yes I remember doing a workshop and I too am big on the collaborative and we couldn’t do that for this particular client. And we ended up having to interview three different groups and each one said the same thing. Yes we have standards if those other two groups would just follow our standards.

John Frahm:          Exactly.

Jeff:                       We’ve kind of been focusing on systems going down and some data loss but I think problems can kind of surface in other ways. And you kind of mentioned what you were just talking about like runaway CPU and to that extent, cost. Is that something you’re ever just kind of called in on is just like yes everything’s still working but like things just aren’t behaving exactly right. Or is it usually like a catastrophic type of thing?

John Frahm:          No we get called in for performance issues as well, right.

Jeff:                       Okay.

John Frahm:          So we all know that performance you know is a characteristic of high availability, right. So certainly performance issues.

I’ll tell you another quick story. I was working with…

Jeff:                       I like these stories.

John Frahm:          I’m sorry I have a lot.

John Frahm:          I was working with a wireless carrier, this wasn't Z, this was a mid-range platform, and they said they were running into performance issues. And I said, well what was happening at the time? They said, well the business was offering new functionality for folks' cellphones. I don't know if it was apps, or perhaps it was, oh now you can get ESPN as an app and check your scores and all that. Well they didn't tell IT. They roll this out and IT's like, whoa, CPU, calls coming in, what's going on. And they were wanting to upgrade machines and put CPU cards in to bring this thing back, right.

So again disconnect between IT and the business. So another key thing we find out is there are those disconnects. And IT and the business need to be connected especially as I described earlier in this always on world.

Frank:                    Right.

John Frahm:          It’s very important right, IT is the business.

Frank:                    Yes and that’s probably something that is not a pervasive thought in most of the companies that you go see.

Jeff:                       They’ve got some big computers over there, they can run this, right?

Jeff:                       Turn it on, let it go.

Frank:                    Especially now because it’s in the Cloud.

Jeff:                       It’s in the Cloud.

Frank:                    Everything works.

Jeff:                       You can’t see, we’re all waving our hands.

Frank:                    Yes, it’s in the Cloud.

Jeff:                       So you mentioned mid-range, did you mean mid-range Z or?

John Frahm:          Power.

Jeff:                       Oh really, okay.

John Frahm:          Power AIX, we did Power i, you know, and we do all non-IBM products as well. We've been asked by customers, as I mentioned, HP Tandem, but we've been asked to look at x86. You know, a lot of the frontend is running on non-IBM products, so we try to find that expertise, bring those folks with us, and talk about the high availability functionality of all those products, whether it be x86, Power, Z, whatever, you know.

Frank:                    So you must have some insight into the fact that it seems like a lot of the outages, maybe not the problems, but the outages are based on people more than systems. Would you say that's like an 80/20 kind of thing, or is it 50/50? Or is it just that the stories you're telling us are more the fact that people made what in hindsight is an obvious mistake, but wouldn't be, you know, to the people when it's happening?

John Frahm:          Yes, I'd say it's 80/20, right. So for the 20% that are non-people-or-process, and it's technology, there were opportunities to minimize or eliminate those outages. And it could be, for example, an abend, a z/OS abend, for which a PTF was available and for whatever reason they didn't put it on. So we see some of that as well.

And then we see, of course, through change control, people making errors, not testing their changes in a test environment that, again, mimics production, so they understand what it's going to really do in production when they roll it out. So that's the 20%, or rather the 80%. And then the 20% is, like I said, components, things like that. So we see a lot more process errors than we do actual hardware or software failures.

Frank:                    Sure, sure, one of our old episodes…

Jeff:                       Classic Terminal Talk.

Frank:                    Yes, one of our Classic Terminal Talk episodes, let's go with that. One of our speakers talked about the fact that when he started he basically was an apprentice in the datacenter, and they don't do that kind of thing anymore. How much of that do you see as a problem, a loss, for lack of a better term, of the data center (unintelligible) of a company?

John Frahm:          Frank, we see a lot of that. We see a lot of resource constraints that the client has with the skills that are needed. And if you remember, in IBM years ago, computer operators would become the system programmers. Right, so they would start as computer operators and there was a roadmap to become a systems programmer. They'd gain all that knowledge operating the system and then move on. We don't see that much anymore. We see a lot of folks retiring, we see a lot of reduction in staff, and people are single points of failure.

So, and clients ask us about that. They say, what is IBM doing to, you know, help us find the people with the skills, or help develop those skills in the younger folks that are coming up?

Jeff:                       And I'd be remiss if I didn't, there's an article on The Register that came out earlier this week, it was shared to us with the hashtag from, I'm going to butcher this last name, I'm sorry. But she shared this article, it says "Sysadmin sank IBM mainframe by going one VM too deep." The sub-headline is "Tried to blame it on a bug, but logs don't lie."

And it basically goes through somebody playing with VM and CMS, saying oh, I wonder if I can create a VM within this VM, and setting a control character for it. Oh, I wonder if I can make one within that. It says oh look, I've made five VMs, I'm just going to shut them down now, and types CP SHUTDOWN and took all of their 72 developers with them. There you go.

Frank:                    CP SHUTDOWN at the wrong level.

Jeff:                       Exactly, with no control character in front of it.

John Frahm:          There you go, great example of what can happen innocently right?

Jeff:                       But you need one of those under your belt so you can never do it again.

Frank:                    It is true that you learn by bad experiences, right, and bad experiences can affect the business.

John Frahm:          Yes and speaking of logs we had one client where they wanted to root cause but their culture was a culture of blame.

Jeff:                       Yes.

John Frahm:          And their folks made sure there were no logs available to determine who did what. And we finally had to tell their execs, we did an executive presentation like we always do after we’re done you know doing a high availability assessment that we are unable to determine the root cause because the logs don’t exist. Somebody had made sure they had cleaned them up, right, unfortunately.

And we were quite honest with them because we said there’s a culture here of blaming people and you really have to move away from that, right. You have to understand what happened, why it happened without placing blame and then what do you do to protect it from happening again.

I can remember one of the people in the workshop. I was up talking about root cause analysis, problem management. And this individual said at the break, I want to show you something. And he showed me a letter, it was written in Spanish, and he said I'll translate for you. Basically it said if you ever do anything like this again you will be terminated from the company. And what he had done was, he was walking across the raised floor, there was a box with a red light and somebody said hey, you need to go fix that, and he said I don't fix that, I don't work on those products. He said you need to fix it. Well he tried to fix it and took the system down, and then he was blamed for it.

And his response to me was if I ever saw a red light again I would run away from it. So it’s interesting, right?

Frank:                    Yes, right.

John Frahm:          The culture that they were developing, the culture of blame, you don’t really want to do that when you’re trying to get the real root cause and put preventive actions in place to try to prevent these. You want to educate people, you want to put procedures in place so that these things don’t happen.

Frank:                    That’s really important in an environment where an outage costs a lot of money. Right, because nobody wants to be blamed for the fact that we lost a million dollars because we were down for 15 minutes.

John Frahm:          Sure.

Frank:                    It’s kind of counter intuitive right to start creating that culture. Do you guys give recommendations on how to do that or is it just here’s a problem that you own Mr. Executive, you need to fix it.

John Frahm:          Yes, and we do talk about that, right. We talk about the importance of really getting to the root cause, and the root cause really should be some kind of process change when you think about it. Right, so if I have a z/OS abend, for example, for which a PTF was available to fix it and I don't put it on, somebody would say oh, that goes in the z/OS bucket, it was a z/OS failure. No, it's a process failure.

Because if you had the PTF, what's your process for putting fixes on? Maybe you need to look at that, maybe you don't put them on frequently enough, maybe you need to increase the frequency of looking at them and installing them, right, and that requires change windows and a lot of other things. But it's really a process failure. So let's not blame the person who didn't put it on, let's look at the process that says, you know, we have too much time between when we're putting maintenance on to get all the fixes on to protect our system.

Jeff:                       I know my mechanic has a sign in their office that says something to the effect of our rate for fixing your car is this, if you’re going to tell us what to do or if you want to watch it’s this much and if you already tried to fix it it’s this much which is like you know five times as much. Do people trying to fix problems often compound problems?

John Frahm:          I think so, right, and you know what we see too right is the hero, what I call the hero syndrome.

Frank:                    Yes.

John Frahm:          Right, so you know, I know they're going to call me, and there were five of us, as I described earlier, now there's only me, and I'll come and I'll save the system, right, and I'll fix it and I know what to do. So it's more of, I get complacent or I become reactive, I go into react mode and that's how I operate. We see a lot of clients doing that, they're stuck in react mode. And they don't get out of react mode, they're not proactive, and people say hey, I'll just come and save the day. And then I feel proud because I've been able to do that, and you should, right? Because you're a technically savvy person, you understand the systems and the environment. But you know we want to prevent those things from happening proactively, right, and not be in react mode.

Frank:                    Yes in a lot of those cases, do people unintentionally destroy things like logs in an effort to oh my gosh, I’ve got to save the world right now. And whatever it is, I just want to get stuff up and running. Is there a lot of that?

John Frahm:          You know I did describe that one issue where the logs had disappeared but we don’t see that very often. We see people wanting to understand what happened and getting the data that’s needed in the data pulls that are needed to understand what happened and how to prevent it from happening again. Regardless of the fact that you know they know they can be the hero and save the system you know they still want to get that data.

But I guess the important point is that if you don’t need a standalone dump for example or if you don’t need to dump that data because you know what’s going on or you made a mistake. Right, if I had, when I was an availability manager I had system programs come to me and say don’t standalone dump just the IPL that was me. I was doing something and I made a mistake, right, that’s the way you want it. Okay I’ll deal with you later. What are you doing on the system, right, during the day.

Frank:                    Right.

John Frahm:          But thanks for telling me. Why do we want to waste time on a standalone dump, which is, you know, another 20 minutes or whatever it takes these days, I don't know. But so important. And that's a culture you've developed, right, a culture. We're not blaming, we're all here together, let's fix it, let's understand what happened, let's understand what you were doing at the time when the system went down and how we can prevent that from happening again.

Frank:                    You’ve had a bunch of different jobs, right. When we first met, we won’t say how many years ago, you were really more of a hardware person. How did you go from being a hardware guy up to doing this kind of work process?

John Frahm:          So I was, it seems like, and if people could see me they'd see I have a lack of hair on my head, I've been doing (unintelligible) management pretty much my whole career. The only time I haven't is when I was at the HACoC. So you know, being an engineer and graduating from engineering school, I came and I wanted to fix things, right, that was something I knew I was good at even before, working in labs at college, etcetera.

I came here, I was a field engineer, and I went out and fixed, you know, the large, large systems, 3033s, you know, these things, talk about beasts. I mean these things were beasts, right. And back then you had attached to you, and you know, you were deep in the machine, you know, tri-leads, etcetera. So I've always been a techy type of guy.

I went from that to doing software support and then I went and trained to be a z/OS system programmer. So I did both hardware and software. So I would go places, clients, and fix whatever problems they had whether they were hardware or software.

And then from there I became an availability manager. So I managed availability for key IBM systems, manufacturing and engineering systems. So when those systems went down I was standing in front of some IBM execs saying why were no machines coming off the floor, and how is it never going to happen again, and how many millions of dollars are we in the hole because we didn't, you know, manufacture any machines because the manufacturing system was down?

So, you know, I gravitated to that, and then doing that availability manager job is where I learned all my ITIL process disciplines. And I read them and trained on them, but then did them. So a lot of people become ITIL trained and they understand the importance of incident, problem, change, and release management, and managing by them, but it's different when you actually go and do it. So when you run recovery and you do incident management, and then you do problem management where you're identifying the real root cause, all the secondary contributing problems that elongated the outage, and eliminating those, you start to really live through it.

And that's what I would take to the consulting part of it when I joined the High Availability Center of Competency. And then I left and went to client care, Z client care, so I was back in that mode, and then I came back as the manager. So that's kind of my progression of dealing with problems and client problems and availability issues for a long time.

Frank:                    So I think we're probably at the bottom of the hour here, and first I want you to know that you are, I believe, the first manager we've had on the show. We don't usually have managers because we have all those bad words for them.

Jeff:                       I think you’re right.

Frank:                    Yes so you should feel honored.

John Frahm:          I do feel honored, I feel honored anyway even if I wasn’t a manager to sit down with you guys and have a good talk about availability and what we see out there in the real world with our clients.

Now I'll say this, for folks who are interested, who want, you know, at IBM, who want to come out with us and come on an engagement and be a subject matter expert, it's a good learning experience for people, especially young people. We've brought some of the young guys from test to come work with us and it's a big eye opener for them, what the clients are really doing and the issues that they're having.

Frank:                    For a ride along.

John Frahm:          Exactly.

Frank:                    Sounds like fun. They probably won’t let us go because too much sarcasm.

Jeff:                       We’ll put that information in the show notes, yes that sounds like fun, I might do that.

Frank:                    Awesome.

John Frahm:          Sounds good.

Frank:                    Awesome, well thank you very much John, appreciate you coming.

John Frahm:          You’re welcome, it’s my pleasure, thanks. Thanks for having me.

Frank:                    Old man (Charlie), run us out.

You’ve been listening to Terminal Talk with Frank and Jeff. For questions or comments or if you have a topic you’d like to see covered on a future episode, direct all correspondence to contact@terminaltalk.net. That’s contact@terminaltalk.net. Until the next time, I’m Charlie Lawrence signing off.

 

 

END