Unix Failures Stop FAS E-mail

Harvard Arts and Sciences Computer Services (HASCS) performed emergency system repairs yesterday when a piece of equipment in the Unix system—related to one which had just been replaced after failures last weekend—failed once more. The resulting outage lasted from around 8:15 p.m. to 11:30 p.m.

Coordinator of Residential Computing Kevin S. Davis ’98 said yesterday that HASCS did not know what was at the root of the repeated failures. But he said that Hewlett Packard (HP), the vendor of Harvard’s servers, was called in within the first half-hour of the failures, as HASCS believes the problem most likely lies with the equipment provided by the vendor.

“The cause of this type of massive failure was clearly a low-level problem related to hardware or software that could only be addressed by HP,” Davis wrote in an e-mail yesterday.

Davis said that the problem had “gained visibility on the executive level” at Hewlett Packard and that the company is investigating the cause of the problems. The company should have a statement by next Wednesday, he said.

“Our top focus is to proceed with a high degree of caution,” Davis said. “Our number one priority is the security of the data on the system.”

Last weekend, emergency repairs caused e-mail services to be unavailable from 6 p.m. last Friday to 4:30 a.m. Sunday.

Beyond the lack of access to e-mail for the weekend, most users were not affected permanently. But those whose FAS accounts begin with the letter ‘m’ irrevocably lost all e-mails sent and received in the previous day.

HASCS sent out an apology and explanation to ‘m’-lettered users Monday, which confirmed that any e-mails sent or received from 12:30 a.m. last Thursday to 6 p.m. last Friday were lost and could not be recovered.

Kim M. McCarthy, whose Harvard e-mail address is mccarth@fas, said the loss was extremely inconvenient: she lost the schedule of when she was supposed to see different subjects as a research assistant in a psychology lab.

“I had a bunch of people scheduled for Sunday, and I lost all of them, which was a major pain,” McCarthy said.

Unrelated additional downtime may be necessary in coming weeks to implement previously scheduled upgrades. Davis said the downtimes will be brief and will be advertised beforehand in the message of the day, which comes up when users log in to secure telnet or Webmail.

Notification did not go out before e-mail access was stopped last weekend because HASCS did not know the repairs would end up taking so long.

Problems began Sunday, Aug. 4, when the core-misc1, a storage area, caused the FAS Unix systems to go down for several hours late at night. This is the same area where trouble occurred again yesterday.

HASCS looked into the problem, and deemed it small enough in scope that they could wait until the end of Harvard Summer School to make the necessary repairs.

But early last Friday, the Unix system suffered an additional outage, again caused by the core-misc1. HASCS decided that the second outage was serious enough to warrant immediate repairs.

These repairs should have lasted from 6 p.m. to 9 p.m. Friday night.

But just as they began work, part of the hardware crashed, which was followed by a failure in another part of the Unix system—the part which contained, among other files, the data and home directories for accounts beginning with the letter ‘m.’

This second emergency was completely unrelated to the core-misc1 failure that had prompted HASCS to start repairs, according to Davis.

“It’s as if you were trying to extinguish a fire in your kitchen, when all of a sudden there was a fire in some other part of your house,” he said.

At that point, HASCS “ceased all work at once,” Davis said. “This is just not something that is supposed to happen.”

—Staff writer Eugenia B. Schraa can be reached at schraa@fas.harvard.edu.
