Re: "no such file" from one node only
- From: "BIO" <bio2935c@xxxxxxxxxxx>
- Date: 16 Jan 2007 20:09:36 -0800
All: (sorry about my sluggish reply, but my connection to Google groups
kept timing out on posting; so now that I'm at home it works ok again!)
First - Hein:
I spoke too soon about the directory error. As it turns out it seems
that every file on every disk in every directory on the entire system
(well, those few that I sampled anyway) is showing this symptom. I'm
not sure what that means (incorrect dismount/mount maybe??), but I DO
know that it's NOT causing us any kind of problem whatsoever for any of
those other files. So I am reluctant to try anything to fix a possibly
more global problem.
So I'm now thinking that recreating the directory won't eliminate the
%DUMP-E-JUNKINDIR error, and I should just delete the two files in
question, and move on.
As for your subsequent suggestions - too late for me to experiment now;
don't have infinite time to do the very best job possible.
Second - Norm:
What do you mean by "is not two separate files"? There are (or should
be) two files (with two different file_id's), but one of them is not
accessible from one node (only).
I did try the set volume/rebuild=force (on both nodes) and it had no
effect at all.
Third - Alan:
Yes there are some disk errors (3) but none of those are recent.
The device info does not look unusual to me:
Disk $1$DKD306: (A.....), device type ........ HSZ70, is online, mounted,
    file-oriented device, shareable, available to cluster, error logging is
    enabled.

    Error count                    3    Operations completed          517479707
    Owner process                 ""    Owner UIC                      [SYSTEM]
    Owner process ID        00000000    Dev Prot            S:RWPL,O:RWPL,G:R,W
    Reference count             1743    Default buffer size                 512
    Total blocks            88824180    Sectors per track                   169
    Total cylinders            26280    Tracks per cylinder                  20
    Allocation class               1

    Volume label            "DKD306"    Relative volume number                0
    Cluster size                  86    Transaction count                  1745
    Free blocks             17694844    Maximum files allowed            510483
    Extend quantity                5    Mount count                           2
    Mount status              System    Cache name        "_$1$DKD306:XQPCACHE"
    Extent cache size             64    Maximum blocks in extent cache  1769484
    File ID cache size            64    Blocks currently in extent cache      0
    Quota cache size               0    Maximum buffers in FCP cache       2990
    Volume owner UIC        [SYSTEM]    Vol Prot    S:RWCD,O:RWCD,G:RWCD,W:RWCD

  Volume Status:  subject to mount verification, write-back caching enabled.
  Volume is also mounted on G.....
and $ anal/disk dkd306 shows nothing
Analyze/Disk_Structure for _$1$DKD306: started on 16-JAN-2007 16:04:15.86
%ANALDISK-I-OPENQUOTA, error opening QUOTA.SYS
-SYSTEM-W-NOSUCHFILE, no such file
$
Resolution?
I finally just plain deleted the file (on the node that could see it)
and the directory listings looked "clean" from both nodes.
I then manually created a new file with the same name (from node G, the
one that appeared ok). But AAARRGGHH, it, too, was visible from G but
not from A!! Exactly the same symptom as before. So I created a file
with a different name -> no problems on either node. This suggests that
there is some issue with the directory itself at the specific location
of this filename. Then I did the creation from node A. Lo and behold,
it was visible from both nodes! Deleted it again. So I am now able to
create, see and delete files with this name from either node. Somehow
doing the creation/deletion from node A managed to make the directory
repair itself. I'm happy that it works, but still mightily puzzled.
Ingemar
Hein RMS van den Heuvel wrote:
AEF wrote:
norm.raphael@xxxxxxxxx wrote::
"BIO" <bio2935c@xxxxxxxxxxx> wrote on 01/16/2007 02:01:21 PM:
The directory/file_id does in fact show the same id's on both nodes.
I find this puzzling. Since the problem system can read the correct FID
from the directory, then doesn't that mean the problem lies with the
file header? But the directory still has some corruption in it. So
maybe both are corrupted?
I think the file is fine, but the corrupted directory, at some point in
time, returned an invalid id, even though it seems valid now. As
suggested, maybe 1 bad read loaded bad data into a cache, and some got
refreshed since. It's pointless to speculate what further oddities may
happen after a corruption.
It's unlikely that two not-so-related errors happened at the same time.
So the beginning of the directory was corrupted.
Next step would be to try the target block itself.
Best I can think of, you need a roundabout approach.
On a 'good' node:
$SEARCH/NUM/FORM=NON suspect.dir suspect.dat
$! Look at Record Number from search: recnum
$DUMP/RECORD=(COUNT=1,START='recnum') suspect.dir
$! Look mostly at the RFA in the dump header, and a little at the data
$! Convert the first hex field of the RFA to decimal: vbn
On 'bad' node:
$DUMP/DIREC/BLOC=(COUNT=1,START='vbn') suspect.dir
You might want to dump more blocks, but this is the one to focus on.
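
The hex-to-decimal step in the procedure above is easy to get wrong by hand. A minimal sketch in Python of that conversion follows. It assumes the DUMP/RECORD header prints the RFA as three hexadecimal words, RFA(low VBN word, high VBN word, byte offset). That layout is an assumption to check against your own output, and the function name and sample line are hypothetical.

```python
import re


def rfa_to_vbn(header_line: str) -> int:
    """Return the decimal VBN from an RFA(xxxx,xxxx,xxxx) triple.

    Assumption: field 1 is the low-order word of the VBN, field 2 the
    high-order word, and field 3 the byte offset within the block.
    """
    match = re.search(
        r"RFA\(([0-9A-Fa-f]+),([0-9A-Fa-f]+),([0-9A-Fa-f]+)\)", header_line
    )
    if match is None:
        raise ValueError("no RFA(...) triple found in header line")
    low_word = int(match.group(1), 16)
    high_word = int(match.group(2), 16)
    # For small files the high word is 0, so the VBN is just field 1.
    return (high_word << 16) | low_word


# Hypothetical header line, made up for illustration only.
print(rfa_to_vbn("Record number 7 (00000007), 24 (0018) bytes, RFA(001A,0000,0034)"))  # 26
```

The result is the value to use as 'vbn' in the DUMP/DIREC/BLOC command on the 'bad' node.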
I've heard about rare cases of shadowing going awry, where the system
thought it had successfully created two identical copies of the data,
one on each disk, but one stayed as old data (prior version of similar
data, or old file 'shining through' for new allocation). After that, it
becomes a guessing game whether a read will come from the bad disk
or the good disk. A subsequent update can fix or break it for all!
Cheers,
Hein.