java.io.FileNotFoundException: File does not exist: hdfs://xxx

1. Background

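Our company is preparing to migrate from CDH back to a community (Apache) Hadoop cluster. When we launched a Flink job on the new cluster, it failed before it even started running: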

Caused by: org.apache.flink.yarn.YarnClusterDescriptor$YarnDeploymentException: The YARN application unexpectedly switched to state FAILED during deployment.
Diagnostics from YARN: Application application_1611303948765_0059 failed 2 times in previous 10000 milliseconds due to AM Container for appattempt_1611303948765_0059_000002 exited with  exitCode: -1000
Failing this attempt.Diagnostics: [2021-01-27 10:02:57.833]File does not exist: hdfs://4399cluster/user/hadoop/.flink/application_1611303948765_0059/flink-dist_2.11-1.12-SNAPSHOT.jar
java.io.FileNotFoundException: File does not exist: hdfs://4399cluster/user/hadoop/.flink/application_1611303948765_0059/flink-dist_2.11-1.12.0.jar
  at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1729)
  at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1722)
  at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
  at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1737)
  at org.apache.hadoop.yarn.util.FSDownload.verifyAndCopy(FSDownload.java:271)
  at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:68)
  at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:415)
  at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:412)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:422)
  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1845)
  at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:412)
  at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.doDownloadCall(ContainerLocalizer.java:247)
  at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:240)
  at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:228)
  at java.util.concurrent.FutureTask.run(FutureTask.java:266)

2. Locating and fixing the problem

I first tried changing the dependencies in the project's pom, but none of that helped. Then I came across this article: https://www.itread01.com/p/1330385.html

It turned out we had hit the same error: in both cases the AM reported a file-not-found failure.

Diagnostics from YARN: Application application_1611303948765_0059 failed 2 times in previous 10000 milliseconds due to AM Container for appattempt_1611303948765_0059_000002 exited with  exitCode: -1000
Failing this attempt.Diagnostics: [2021-01-27 10:02:57.833]File does not exist: hdfs://4399cluster/user/hadoop/.flink/application_1611303948765_0059/flink-dist_2.11-1.12-SNAPSHOT.jar

Here is the key passage from that article:

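What the hell kind of error is this!! Going through someone else's working program bit by bit, I found it came down to this line: Configuration conf = new Configuration(); conf.set("fs.default.name", "hdfs://uat84:49100"); What does that mean? If you run locally, that is, without pulling in the mapred-site, yarn-site and core-site configuration files, then do not set this either. You are running the M/R program locally (fs.default.name defaults to file:///, meaning the local file system), yet this line tells Hadoop to fetch the required jars from HDFS, so of course you get the error above. When running locally, just remove that line and it works.

Conversely, if you submit to a cluster and pull in mapred-site and yarn-site but not core-site, and you do not set fs.default.name either, then the namenode address is unknown and job.jar cannot be submitted to the Hadoop cluster, so you get the following error:

[ 2014-05-13 16:35:03,625] INFO [main] (Job.java:1358) org.apache.hadoop.mapreduce.Job - Job job_1397132528617_2814 failed with state FAILED due to: Application application_1397132528617_2814 failed 2 times due to AM Container for appattempt_1397132528617_2814_000002 exited with  exitCode: -1000 due to: File file:/tmp/hadoop-yarn/staging/hadoop/.staging/job_1397132528617_2814/job.jar does not exist .Failing this attempt.. Failing the application.

Pretty neat! So all we need to do is tell Hadoop our namenode address. Pulling in core-site or setting fs.default.name amounts to the same thing.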

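So I checked whether core-site under hadoop/conf on the new cluster was missing the namenode address. It was configured correctly. Then I remembered that my code seemed to bundle a Hadoop configuration file from the previous cluster. Sure enough, there was an hdfs-site.xml in the project's resources directory. After deleting that file, the problem went away.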
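A stray file like this is easy to miss because Hadoop clients read their *-site.xml files from the classpath, so anything under the project's resources directory ends up in the job jar and can be picked up silently. The sketch below is one way to list such files. It uses only the JDK, and the class name HadoopConfigScanner and its locate helper are hypothetical names chosen for this illustration, not part of Hadoop or Flink.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.util.Collections;
import java.util.List;

public class HadoopConfigScanner {
    // File names that Hadoop clients conventionally look up on the classpath.
    static final String[] CONFIG_NAMES = {
        "core-site.xml", "hdfs-site.xml", "yarn-site.xml", "mapred-site.xml"
    };

    /** Returns every classpath location that provides the given resource name. */
    static List<URL> locate(ClassLoader loader, String resourceName) {
        try {
            return Collections.list(loader.getResources(resourceName));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        ClassLoader loader = HadoopConfigScanner.class.getClassLoader();
        for (String name : CONFIG_NAMES) {
            List<URL> hits = locate(loader, name);
            if (hits.isEmpty()) {
                System.out.println(name + ": not on classpath");
            }
            for (URL url : hits) {
                // A URL pointing inside the job jar or target/classes is a
                // bundled copy, not the cluster's own configuration.
                System.out.println(name + ": " + url);
            }
        }
    }
}
```

Run it with the same classpath as the job. Any hit that resolves to the project itself rather than to the cluster's configuration directory is a candidate for removal.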

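So when you run into this kind of error, check whether your Hadoop configuration files conflict with the configuration files bundled in your project!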
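That check can be done by hand, but a small script makes the conflicts explicit. The following is a minimal sketch, assuming both files use the usual Hadoop layout of property elements with name and value children. It parses the two documents with the JDK's XML parser and reports keys whose values differ. The class name SiteXmlDiff and the sample values new-cluster and old-cluster are hypothetical. Keys worth looking at first are probably dfs.nameservices and fs.defaultFS (the current name of the deprecated fs.default.name).

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class SiteXmlDiff {
    /** Parses a Hadoop-style *-site.xml stream into name -> value pairs. */
    static Map<String, String> readProperties(InputStream in) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);
            NodeList props = doc.getElementsByTagName("property");
            Map<String, String> result = new LinkedHashMap<>();
            for (int i = 0; i < props.getLength(); i++) {
                Element prop = (Element) props.item(i);
                String name = childText(prop, "name");
                if (name != null) {
                    result.put(name, childText(prop, "value"));
                }
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse configuration", e);
        }
    }

    /** Convenience overload for XML held in a string. */
    static Map<String, String> readProperties(String xml) {
        return readProperties(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    private static String childText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
    }

    /** Returns the keys present in both configs whose values differ. */
    static Map<String, String> conflicts(Map<String, String> cluster, Map<String, String> bundled) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : bundled.entrySet()) {
            if (cluster.containsKey(e.getKey())
                    && !Objects.equals(cluster.get(e.getKey()), e.getValue())) {
                out.put(e.getKey(), "cluster=" + cluster.get(e.getKey()) + ", bundled=" + e.getValue());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical sample documents; real values come from the cluster's
        // config directory and from the project's resources directory.
        String cluster = "<configuration><property><name>dfs.nameservices</name>"
                + "<value>new-cluster</value></property></configuration>";
        String bundled = "<configuration><property><name>dfs.nameservices</name>"
                + "<value>old-cluster</value></property></configuration>";
        conflicts(readProperties(cluster), readProperties(bundled))
                .forEach((key, detail) -> System.out.println(key + ": " + detail));
    }
}
```

For real files, pass the result of java.nio.file.Files.newInputStream for each path to readProperties instead of the inline strings.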
